Introducing the fastest, easiest way to move to microservices with simple, scalable feature flags.

Available as an open-source project under an OSI-approved open-source license.

1. Introduction & Use Case

Pioneer is a feature management service built to handle an organization’s migration from a monolithic architecture to a microservices architecture.

a monolith before and after services are split out

As an application grows and demands on the system increase, an organization may find that it needs to begin scaling the application. A monolithic architecture tends to produce tightly coupled code that is difficult to scale. Conversely, a microservices architecture is more loosely coupled, and each service can be scaled and deployed independently. A small team can also organize a microservices architecture around business capabilities.

When transitioning from a monolith to a microservices-oriented architecture, an organization may initially want to expose the new service to a small number of users while it collects analytics and user feedback and analyzes how the service performs under various loads. By using feature flags with Pioneer to handle this transition, any change can be rolled back in real-time simply by toggling the flag off; no redeployment is necessary.

1.1 Hypothetical

To better understand this use case, consider a hypothetical company, Harvest Delivery. This scenario illustrates what the catalysts for a conversion from a monolith to microservices might be for a small team, the challenges such an organization will face before undertaking the project, and some of the options available in this space.

Harvest Delivery is a regional shopping service that allows users to order groceries online. A locally-contracted shopper purchases the requested groceries and delivers them to the doorstep of the user. Harvest Delivery’s web application receives a variable amount of user traffic, with peak traffic occurring in the days leading up to major holidays.

the landing page of harvest delivery

The architecture of Harvest Delivery’s web application is currently a monolith. As the organization grows, the monolithic architecture begins to cause a strain on the engineering team. One such strain is that the team has become reluctant to add new features to the monolithic codebase. Currently, different components of the code are highly dependent on one another and changing one component means an engineer also has to update multiple other components, increasing their workload. This issue also makes the codebase difficult to maintain, as fixing one bug has the tendency to introduce new problems.

Another issue the team has encountered is the inability to scale each business component of their architecture (such as payment processing or the catalog) independently. The team has found that during periods of high traffic, users experience long delays during the payment stage. Harvest Delivery would like to scale the payment processing component independently from the rest of the codebase, but this isn’t possible with the current monolith. If the code responsible for payments were abstracted into a microservice, not only would this enable independent scaling, it would also cultivate a team solely dedicated to the payment service. The payment team could carry out their development and deployment pipeline independently of the rest of the monolith, resulting in faster improvements to payment processing.

The final issue the Harvest Delivery team is encountering with their monolith pertains to availability. Currently, if a bug triggers an outage, then the entire application becomes unavailable. The team would like to ensure that even if one component of the application is unavailable, for example creating a new user account, users not using the affected component can still perform their desired actions.

In response to these issues, the CTO has decided that they should migrate the application code towards a microservices architecture. The CTO has instructed the team to develop a system architecture organized around Harvest Delivery’s business concerns, with the shopping catalog, payment processing, and shopper communication all abstracted into individual microservices.

Modifying the system architecture of Harvest Delivery is a significant undertaking; additionally, it’s important to avoid any impact on user experience. The holidays are coming up and Harvest Delivery does not want to lose any customers to system outages.

Harvest Delivery plans to collect analytics and perform load testing on the new services to ensure that they can handle the load of holiday shopping sprees. They also plan to solicit user feedback from a percentage of users before rolling these new services out to the entire customer base.

2. Potential Solutions

The Harvest Delivery team needs a strategy that achieves two primary goals: migrating to a microservices architecture and avoiding system outages during the migration. This section discusses the available options for achieving these goals.

2.1 Canary Deployment

Harvest Delivery could consider a canary deployment, also known as side-by-side deployment, which involves creating a clone of the application’s production environment. A load balancer is used to initially send all traffic to one version while new functionality is built in the other version (the canary). When the new service is deployed, some percentage of users are directed to the canary deployment. If no issues arise, the service can be gradually rolled out to more users until the new version is used by everyone. If there are problems, deployments can be rolled back, and the majority of users will not be impacted [1].

a load balancer sits in front of two deployments: monolith and microservice extracted monolith

While this is a potential solution for Harvest Delivery, there are significant drawbacks to a canary deployment. Canary deployments add a layer of complexity: the engineering team would need to manage multiple production environments, monitor an additional system, and migrate users [1]. A canary deployment would also require Harvest Delivery to maintain additional infrastructure.

Canary deployments operate at the deployment level; therefore, if an issue arises the most recent deployment will need to be rolled back entirely. Unexpected deployment rollbacks can result in additional downtime, degrading user experience for those users who were previously routed to the canary deployment. Additionally, the engineers responsible for incident management must be able to fully respond to incidents immediately to prevent problems from having a significant impact. Such incidents are likely to result in lost revenue and a damaged reputation for the company.

Another drawback related to canary deployments is that engineers still lack the granular control to develop features in parallel and roll them out to users independently of one another. A user is either routed to the canary deployment or the original deployment. Therefore, only one percentage rollout can dictate the routing of users. For example, imagine that one feature or service within the canary is ready for a 70% rollout, but another is only ready to be rolled out to 5% of users. This discrepancy limits us to directing only 5% of users to the canary deployment. If there is a significant problem with a feature and we need to make sure that no users are exposed to that feature, we must either roll back a deployment or route 0% of users to the canary deployment until the problem has been fixed. Again, this magnifies the impact of each issue that the engineering team encounters during the transition from a monolith to a microservices architecture.

The above limitations are problematic for the CTO of Harvest Delivery, who wants to migrate to several microservices over a period of time. The team needs the ability to roll out or roll back each microservice independently. Therefore, canary deployment isn’t suitable for Harvest Delivery, given their requirement for granular control over each individual microservice and its rollout status.

2.2 Feature Flags

The requirement for granular control over individual microservices has led Harvest Delivery to consider feature flags. Feature flags allow one feature to be rolled out or rolled back completely independently of another, without a hotfix or redeployment [5]. Feature flags work by incorporating conditional branching logic into the application code and evaluating the boolean status of a feature flag. For example, if Harvest Delivery wishes to test a new payment processing microservice, they add a conditional statement to the monolith at the point where the current monolithic payment processing code is invoked. Evaluating the flag returns a boolean value indicating whether the flag is toggled “on” (true) or “off” (false). If the feature flag is toggled on, a call to the microservice is executed and the subsequent response is handled. If the feature flag is off, the original monolithic code is executed.

a toggle changes the control flow of the monolith

In addition to enabling independent control over the rollout status of each microservice, feature flags also eliminate the need for frequent redeployment as seen with a canary deployment approach. This is because while a canary deployment lives in the infrastructure networking layer, feature flags live within the application and are evaluated in real-time [5]. When rollout-related incidents occur and feature flags are employed, a microservice can be toggled off immediately and the original monolith code executed, rather than waiting for a potentially time-consuming redeployment. This makes incident management easier and gives the engineering team the time required to track down a bug and develop a robust solution to the cause of the incident. Moreover, disruption to the user experience is minimized, preventing an outage from resulting in lost revenue.

The ability to “switch off” new features in response to an issue means feature flags substantially mitigate the risk of engineering teams releasing immature functionality [3]. This low-risk experimentation allows small teams to maximize developer efficiency and release new functionality with confidence, knowing that they have the option of quickly reversing course without affecting the rest of the application.

Feature flags also enable microservices to be rolled out to a certain percentage of users. This occurs by using a unique identifier for each individual user, such as an IP address or user ID. The identifier is used by a hashing algorithm in the feature flag logic, which determines if that individual user falls within the current percentage rollout strategy (i.e. if a feature flag is being rolled out to 10% of users, does this individual user fall within that 10%?). Features that are toggled “off” will never be served, regardless of rollout percentage. If a feature flag is toggled “on”, the hashing algorithm of the Software Development Kit (SDK) will determine whether the user’s unique identifier falls within the rollout percentage. If so, the flag will evaluate as `true`, and the feature will be served to the user.

In conclusion, feature flags meet the requirement of the Harvest Delivery team to have control over the rollout status of each microservice independently. If issues arise from a new microservice, then feature flags enable the microservice to be “switched off” in real-time, without the need for redeployment. This allows Harvest Delivery to minimize any user disruption during the architectural changes, which reduces revenue losses and reputation impact. Moreover, Harvest Delivery can rollout microservices to a specified percentage of users, enabling analytics on the new service to be collected.

2.3 Feature Flags as a Solution

Harvest Delivery has decided to move forward with using feature flags to assist in their migration from a monolith to a microservices architecture. This section outlines how Harvest Delivery could integrate feature flags into their existing system.

2.3.1 Developing a Feature Flag Service In-house

Harvest Delivery could choose to develop its own feature flag service. One naive approach would be to maintain a simple feature flag rule set within their application code as a configuration file. However, this would require re-deployment any time the team wished to toggle a feature on or off, or update the rollout percentage. Ultimately, this would negate the benefit of using feature flags.
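For illustration, such a naive in-repo configuration might be nothing more than a static file checked into the codebase (the flag names below are hypothetical):

```json
{
  "payment_processor": { "enabled": true,  "rolloutPercentage": 10 },
  "new_catalog":       { "enabled": false, "rolloutPercentage": 0 }
}
```

Because the application reads this file at deploy time, changing `enabled` from `true` to `false` requires committing the change and redeploying, which is exactly the delay feature flags are meant to eliminate.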

A more robust solution would be for the Harvest Delivery engineering team to build a true feature flag management system that would allow flags to be updated and dynamically evaluated without re-deploying any code. However, this would require engineering hours to build, test, and maintain the service. Harvest Delivery is a small team without excess engineering resources to devote to such a project. Additionally, the CTO prefers to move quickly into the transition to microservices, rather than waiting for an additional tool to be developed. A third-party solution that will work out-of-the-box is a better fit for the small, fast-moving team.

2.3.2 Existing Third-Party Solutions

Harvest Delivery requires an easy-to-use solution for their fast-moving team. The existing third-party solutions are evaluated below.

  • LaunchDarkly

    LaunchDarkly is an enterprise-level feature management service: a feature-rich, fully hosted platform. However, because it is proprietary software, it comes at a monetary cost, and its plethora of features goes beyond Harvest Delivery’s needs.

  • CloudBees

    CloudBees is another hosted service with proprietary software. It provides end-to-end DevOps features, which may be a great fit for larger teams with a focus on DevOps. Unfortunately, Harvest Delivery is a small team and is not concerned with the wide variety of DevOps tools CloudBees provides.

  • FeatureHub

    FeatureHub is an open-source, self-hosted service. Because it is open-source, it can be customized to fit the unique needs of the team. FeatureHub focuses on feature flags for both client-side and server-side features. It offers an array of different flag types, along with complex logic allowing for organizations, users, and a variety of feature sets; the robust features and permissions management make it less accessible for a small team like Harvest Delivery.

2.3.3 Thoughts on Existing Third-Party Solutions

Enterprise solutions offer a large array of financially costly DevOps services that may strain a small, regional organization like Harvest Delivery. Existing open-source solutions are complicated by their focus on client-side features evaluated in the browser and by a litany of complex features, including user permissions and service accounts. A small business like Harvest Delivery needs neither the extra expense of an enterprise service nor the unnecessary complexity that other offerings entail.

3. Introducing Pioneer

The limitations of the third-party solutions have led the Harvest Delivery team to pursue a new feature flag management system that fits their requirements perfectly: Pioneer.

3.1 What Is Pioneer?

Pioneer is an open-source, self-hosted feature flag management service that aids in the transition from a monolithic to microservices architecture. Feature flags can be used to roll out new services to all users, or an assigned percentage of users, and can easily be toggled on/off with a single click. These feature flags, and any updates made to them, are propagated to the client application in real-time, in an asynchronous and fault-tolerant manner.

3.2 Revisiting the Problem

Pioneer is lightweight software that will allow Harvest Delivery to test and migrate their new microservices in a production environment under increasing load, without making any additional changes to their infrastructure, such as cloning the production environment.

Using Pioneer to aid in the transition from a monolith to microservices architecture will reduce risk by allowing for immediate rollback without requiring a re-deployment or any additional downtime for the application. Services can be developed in parallel and rolled out at an independent rate because toggling a flag on or off, or changing its rollout percentage, does not impact any other functionality in the application. This will allow the small engineering team at Harvest Delivery to experiment in an agile manner with confidence.

Pioneer is specifically built to support the evaluation of flags by returning a boolean value based on either their toggle status or their associated rollout percentage. As every flag will ultimately return a boolean value, they are an excellent fit for the use case in which requests should be routed in one direction or the other, either to a new external microservice or an existing feature internal to the monolith.

3.3 Using Pioneer

Pioneer has a simple feature set, which minimizes the set-up costs associated with configuring a new tool. This enables Pioneer users to quickly start a migration towards a microservices architecture. Below, we outline how Pioneer can be used to get started with this migration.

Users that wish to use Pioneer out-of-the-box can simply clone the Pioneer GitHub repository and start up the application with a single command, `docker-compose up`. This functionality is possible because we provide a `docker-compose.yml` and `.env` file that will configure and launch Pioneer in a Docker network.
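A compose file wiring these pieces together might look roughly like the following. This is an illustrative sketch only; the service names, build paths, and ports are assumptions rather than the contents of the repository's actual file:

```yaml
version: "3"
services:
  compass:
    build: ./compass        # UI + API, backed by Postgres
    ports: ["3000:3000"]
    depends_on: [nats]
  scout:
    build: ./scout          # SSE daemon serving SDK clients
    ports: ["3001:3001"]
    depends_on: [nats]
  nats:
    image: nats:latest
    command: ["-js"]        # enable JetStream persistence
```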

all components in a docker network with cute whales

The Pioneer application is composed of several components: Compass, Scout, and NATS JetStream. Compass is the primary application and offers a graphical user interface (GUI) built on React, as well as an API and Postgres database on the backend. Compass communicates directly with a NATS JetStream server, which relays messages to the Scout daemon. Scout communicates with all connected SDK clients in a unidirectional manner to provide up-to-date feature flag data.

Feature flag data lives in the Compass application. Users can create, read, update, or delete flags via the Compass UI or the Compass API. Compass also provides the user with an SDK key, which is required for SDK client authorization (discussed in section 4.4).

the pioneer ui shows how it is easy to view and filter all flags

Feature flags are evaluated in the user’s codebase via our server-side SDKs. Pioneer currently offers SDKs written in Node.js, Ruby, and Golang. Once an SDK has been installed, the user must provide the aforementioned SDK key to successfully receive data. On application startup, authorized SDKs will connect to the Scout daemon as a server-sent events (SSE) client. All SDK clients connected to Scout will receive the feature flag data, and any subsequent changes to it, in real-time.

A Pioneer user utilizes the feature flag data by incorporating conditional branching logic into their application code and evaluating the boolean status of a feature flag. For example, if a user wishes to migrate to a new microservice for payment processing, they could create a new flag on the Compass GUI called `payment_processor` and toggle the flag to “on”. Within their application code, they would add an if/else statement that evaluates the status of the `payment_processor` flag and executes the appropriate code. When the `payment_processor` flag is toggled on, evaluating the flag returns `true` and the code that calls the new payment processing microservice is executed. If the Pioneer user decides to stop using the microservice, the `payment_processor` flag can be toggled “off” on the Compass GUI. The next time `payment_processor` is evaluated by the SDK, it will return `false` and the monolith code will be executed.

if else conditional code block
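A conditional block like the one pictured might look as follows. The `pioneer` client and its `isEnabled` method are illustrative assumptions, not the SDK's actual interface, and the client is stubbed here so the example is self-contained:

```javascript
// Stub standing in for an initialized Pioneer SDK client; the real
// client would hold flag data received from Scout over SSE.
const pioneer = {
  flags: { payment_processor: true },
  isEnabled(flagName, userId) {
    return this.flags[flagName] === true;
  }
};

function processPayment(order, userId) {
  if (pioneer.isEnabled("payment_processor", userId)) {
    // Flag on: route the request to the new payment microservice.
    return { handler: "microservice", order };
  } else {
    // Flag off: fall back to the original monolith code path.
    return { handler: "monolith", order };
  }
}
```

Toggling the flag off in Compass would cause the next evaluation to take the `else` branch, with no redeployment of this code.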

Pioneer offers an accessible, real-time feature flag service integrated with a user’s codebase and getting started takes just minutes. Because Pioneer is completely open-source, organizations with specific needs that fall outside of the default configuration can add their own customizations to make Pioneer fit their specific requirements.

3.4 Where Pioneer Fits

When evaluating third-party feature flagging solutions, it is important to measure them against a consistent set of criteria. We believe that when comparing existing options, there are four pivotal criteria to consider.

  • The first is flexibility. A flexible solution will allow users to have greater control over the product itself. Open-source solutions are more flexible because they allow an organization to modify the software to suit their specific needs. Proprietary software does not provide this same flexibility.
  • The next is accessibility. Solutions with high accessibility are easy to set up and easy to use. Less-accessible products may require user account setups, complex permissions, and dense documentation that a user must parse through to achieve the appropriate configuration.
  • Affordability is also an important criterion. Small and mid-sized teams may not have the financial resources needed to comfortably afford third-party solutions with a significant monetary cost. Low-cost or no-cost options are more affordable, but may lack features and support. Higher-cost options mean paying for the services, but may also come with a more robust feature set and highly available support services.
  • The last criterion to consider is simplicity. Simpler solutions will have a less robust feature set, but instead focus on a core set of features tailored to a specific use case. Because proprietary software does not allow users to customize the software themselves, it may offer a larger feature set to fit a wider variety of use cases. However, this can result in bloated products that have many features the user does not require. For this reason, we consider simplicity to be a benefit, not a drawback.
chart showing pioneer is flexible, affordable, and accessible, but doesn't offer a rich featureset

With these criteria in mind, we can see that Pioneer is the only option that is flexible, accessible, and affordable. It does not provide a robust feature set, but this simplicity makes it the best choice for organizations that want a solution they can use to manage their feature flags straight away with no extra work.

4. Technical Deep Dive

Pioneer consists of three main components: Compass, NATS JetStream, and Scout, all of which are run within a Docker network. Additionally, there is an SDK embedded in the user’s application.

diagram showing pioneer's architecture; there are 4 components: Compass, NATS, Scout, and SDKs

Below we will discuss each component’s role in the Pioneer architecture. For more information regarding the engineering decisions that led to this architecture, refer to section 5.

4.1 NATS

NATS JetStream is an open-source message streaming service [9]. Pioneer utilizes a NATS JetStream server to facilitate communication between the Compass and Scout servers due to its ability to provide asynchronous, fault-tolerant messaging with guaranteed delivery. The benefit of Scout and Compass communicating via NATS JetStream, rather than directly, is that if Scout or Compass goes down, NATS JetStream will store the most recently transmitted message until the server comes back online and acknowledges receipt of the message. This mitigates the problems posed by an unreliable network.

NATS JetStream groups messages into topics referred to as streams. The messages sent within Pioneer are only concerned with the request for, or transmission of, data; therefore, Pioneer uses a single NATS stream called DATA. In addition to the stream name, messages published to NATS JetStream have a subject, which enables clients connected to NATS to receive only specific messages within a stream. Messages sent to NATS JetStream are assigned a title with the syntax STREAM.subject (e.g. DATA.FullRuleSet). Clients connected to NATS JetStream can both publish messages to and receive messages from a NATS stream, and are referred to as publishers and subscribers respectively.
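The way subjects let subscribers filter a stream can be illustrated with a small matcher. This is a simplified sketch of NATS-style subject matching, where subjects are dot-delimited tokens, `*` matches exactly one token, and `>` matches the remainder:

```javascript
// Simplified NATS-style subject matching.
// "*" matches exactly one token; ">" matches all remaining tokens.
function subjectMatches(pattern, subject) {
  const p = pattern.split(".");
  const s = subject.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") return true;              // ">" swallows the rest
    if (i >= s.length) return false;            // subject ran out of tokens
    if (p[i] !== "*" && p[i] !== s[i]) return false;
  }
  return p.length === s.length;                 // no leftover subject tokens
}
```

Under this model, a subscriber on `DATA.FullRuleSet` receives only full-ruleset messages, while a hypothetical subscriber on `DATA.>` would see every message in the DATA stream.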

In Pioneer, messages are published to NATS JetStream in response to two events: when an SDK client connects to Scout, and when a change occurs to the feature flag dataset, either via the Compass GUI or API.

4.1.1 Connection of an SDK Client

When a Pioneer SDK client is initialized, it sends an HTTP request to the Scout daemon to establish a server-sent events (SSE) connection. First, the Scout daemon authorizes the SDK client via an SDK key.

Following SDK authorization, Scout will publish a message to NATS JetStream to request the latest feature flag dataset from Compass. It does so by publishing a message with the title DATA.FullRuleSetRequest. NATS JetStream receives this message and, if Compass is connected to NATS, the message is delivered to Compass.

Compass receives the message from NATS because it is configured as a subscriber of messages sent with that specific title. Once Compass receives the message, it sends an Ack back to NATS to acknowledge message receipt. If Compass is not currently connected to NATS, the message will be stored by JetStream until Compass connects to NATS and is able to receive the message.

Once Compass has received the DATA.FullRuleSetRequest message, it will retrieve the latest feature flag data from Postgres. Following data retrieval, Compass will publish a NATS message with the title DATA.FullRuleSet; the body of the message contains the flag data in JSON format.

Scout subscribes to messages with the DATA.FullRuleSet title, and once Scout receives (and acknowledges) the NATS message, the flag data is parsed from the message body and is sent to all connected SDK clients, via SSE.

The SSE connection between the SDK client and Scout will disconnect after 30 seconds of inactivity (see section 4.3 for details), forcing the SDK to reconnect to Scout. The connection process described above occurs in the same manner during both initial SDK connections and reconnections.

4.1.2 Transmitting Updated Feature Flag Data

When a Pioneer user creates, deletes, or updates a feature flag, the flag dataset needs to be updated in the SDK. The transmission of updated data occurs via NATS JetStream. Any changes to the feature flag dataset result in Compass retrieving the latest data from Postgres and publishing a DATA.FullRuleSet message, with the flag data in the message body. Scout is a subscriber for messages with this title, as described in section 4.1.1. Upon receipt of a DATA.FullRuleSet message, Scout will parse the flag data from the body and transmit the data to all connected SDK clients via SSE.

4.2 Compass

Compass is Pioneer’s primary application for managing feature flags. The front-end of the application is built on React and allows users to view, create, update, and delete feature flags. Each flag has a title, an optional description, an assigned rollout percentage, and may be toggled on or off with a single click. Users may also use the Compass UI to view event logs regarding a flag’s history and to retrieve a valid SDK key for use in their own application.

web application frontpage; a toggle button is clicked

The Compass API is built with Node.js and Express. The provided RESTful API allows users to perform CRUD operations on the feature flag data, as well as retrieve the event logs for all flags from the Postgres database.

4.3 Scout

Scout is a daemon that acts as the interface between Compass and the SDK embedded in a client’s application. A persistent HTTP connection is formed between Scout and the SDK, through which Scout sends feature flag data as server-sent events (SSE).

SDK clients connect to Scout by sending an HTTP request to the /features endpoint. Scout first verifies that the connecting SDK has provided a valid SDK key in the Authorization header of the request; this prevents malicious agents from gaining access to feature flag data, which may contain confidential information. If the SDK key is valid, Scout will open the SSE connection, publish a message requesting the latest feature flag data from Compass (see section 4.1), and transmit updated feature flag information to the SDK.
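A minimal sketch of the key check and SSE framing follows. The header scheme is an assumption on our part, though the `data:` line plus blank line is the wire format the SSE specification defines; in Scout these helpers would sit inside an HTTP request handler:

```javascript
// Compare the (lowercased, as Node presents it) Authorization header
// against the SDK key issued by Compass. Assumed "Bearer" scheme.
function isAuthorized(headers, sdkKey) {
  return headers["authorization"] === `Bearer ${sdkKey}`;
}

// Frame a flag payload as a server-sent event: a "data:" line
// terminated by a blank line, per the SSE specification.
function formatSSE(flags) {
  return `data: ${JSON.stringify(flags)}\n\n`;
}
```

In an Express handler, Scout would reject unauthorized requests with a 401, otherwise set `Content-Type: text/event-stream` and write `formatSSE(...)` to the response each time new flag data arrives.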


4.4 SDKs

Pioneer currently offers server-side SDKs in three languages: Node.js, Ruby, and Golang. The user should install the appropriate SDK in their application code. After doing so, the SDK can attempt to connect to Scout as an SSE client by providing Scout’s server address and a valid SDK key. The SDK will then automatically receive feature flag data updates each time there has been an update to a feature flag. The SDK stores the current feature flag data in memory and uses it to evaluate flags.

The Pioneer user’s application code should use the provided SDK interface to evaluate feature flags. This evaluation will likely be part of a conditional expression. For example, if a flag is toggled on, we may wish to direct a request to a new microservice. If a flag is off, we should not direct any requests to that feature.

Pioneer also allows flags to be evaluated within the context of a unique identifier when using rollout percentages. The provided identifier should be something unique to each of the application’s end-users, such as a UUID or an IP address. If the flag is toggled on, Pioneer’s rollout algorithm will evaluate the provided context against the current rollout percentage to determine whether or not the feature should be served to the application’s end-user.

After integrating the SDK into their codebase and connecting with Scout, Pioneer’s users can be confident that changes to the data will be propagated down to their application in real-time behind the scenes. SSE connections may drop, but the SDK will automatically reconnect. SDKs are also integrated with Google Analytics, allowing Pioneer users to more easily monitor end-user activity based on feature toggles.

5. Engineering Decisions and Tradeoffs

5.1 Hosted vs. Self-hosted

One of the first questions we needed to answer surrounded the delivery of Pioneer. We considered whether it would be best for our team to set up private cloud infrastructure on which we would host Pioneer, offering it as a service, or whether we should build it so that it could be hosted entirely by the user.

We decided to provide Pioneer as a self-hosted application rather than hosting it ourselves for several reasons.

Firstly, it means that the user can deploy Pioneer on their infrastructure of choice, whether that’s on an AWS VPC, a DigitalOcean Droplet, or their own on-prem server.

Allowing user organizations to self-host Pioneer reduces the security concerns that an organization may have with a multi-tenancy architecture hosted by an external organization. In that situation, users may not have full knowledge of the security measures taken to protect their data. Because Pioneer is self-hosted, users will maintain full control of their data and can implement whatever security measures they feel are necessary.

Distributing Pioneer as an open-source and self-hosted application means that users of Pioneer can fully adapt the application to their own unique needs, increasing flexibility. If the out-of-the-box configuration isn’t matching a user’s requirements, they have the freedom to change whatever isn’t working for them or to add whatever components might suit them better.

In addition, self-hosting is affordable because Pioneer is intended for companies with a modest feature flag ruleset, obviating the need for large amounts of storage and compute resources.

5.2 Inter-application Communication

Our messaging service of choice, NATS JetStream, allows for decoupled messaging. Naively, we could have enabled the Scout daemon to communicate directly with the Compass API. However, Compass would then have the additional responsibility of tracking all listening Scout instances and ensuring message delivery. Using a third-party messaging tool to handle message delivery allows for a better separation of concerns, and allows Compass to only worry about publishing the correct message. We chose not to pursue Kafka because its complexity and larger infrastructure were unnecessary for our use case. NATS has a smaller infrastructure and provides all of the features that Pioneer requires.

NATS streaming allows for one-to-many communication. This means that as an organization scales, they could choose to also horizontally scale the number of Scout daemons sending updates to SDK clients. Any Scout daemon subscribed to the NATS stream would receive feature flag updates as usual. Alternatively, a logging service could also subscribe to the NATS stream and preserve messages for later analysis.
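The one-to-many fan-out pattern can be sketched with a toy in-process stand-in for a NATS subject; the subscriber names here (two Scout daemons and a logging service) are purely illustrative:

```python
class Stream:
    """Toy stand-in for a NATS subject: every subscriber receives every message."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        for handler in self.subscribers:
            handler(message)

received = {"scout_a": [], "scout_b": [], "audit_log": []}

stream = Stream()
stream.subscribe(lambda msg: received["scout_a"].append(msg))    # Scout daemon #1
stream.subscribe(lambda msg: received["scout_b"].append(msg))    # Scout daemon #2
stream.subscribe(lambda msg: received["audit_log"].append(msg))  # logging service

stream.publish({"flags": [{"title": "new-checkout", "active": True}]})

# Every subscriber receives the same ruleset update.
assert all(len(msgs) == 1 for msgs in received.values())
```

Adding another Scout daemon (or a logging service) is just one more subscription; the publisher, like Compass, never needs to know who is listening.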

We intentionally chose NATS JetStream over Core NATS because JetStream provides guaranteed (at-least-once) message delivery, whereas Core NATS offers only at-most-once delivery.

Take, for example, a situation in which the Scout daemon temporarily goes down and is unable to receive communications from the NATS server:

If a flag is toggled, or some other change is made that results in an updated ruleset being disseminated, the new ruleset will be sent to the NATS server. With JetStream, the message containing the updated ruleset will be queued.

When the Scout daemon comes back up and communication with the NATS server is reestablished, the message will then be delivered, and Scout can then pass the updated ruleset down to connected SDKs.

This alleviates concerns of missed messages due to network partitions resulting in stale feature flag data.
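This recovery behavior can be modeled with a minimal sketch, assuming a single consumer and an in-memory queue standing in for JetStream’s persistent stream:

```python
from collections import deque

class DurableStream:
    """Toy model of JetStream-style at-least-once delivery: messages published
    while the consumer is down are queued and delivered on reconnect."""
    def __init__(self):
        self.pending = deque()
        self.consumer = None

    def publish(self, ruleset):
        if self.consumer is None:
            self.pending.append(ruleset)  # consumer offline: queue the message
        else:
            self.consumer(ruleset)

    def attach(self, consumer):
        self.consumer = consumer
        while self.pending:               # replay everything that was missed
            consumer(self.pending.popleft())

delivered = []
stream = DurableStream()
stream.publish({"version": 2, "flags": []})  # Scout is down: message is queued
stream.attach(delivered.append)              # Scout reconnects and catches up
assert delivered == [{"version": 2, "flags": []}]
```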

5.3 Providing Feature Flag Data to User Applications via SSE

A fundamental decision in our architecture was determining the best way to send feature flag updates from Pioneer to the SDK installed in the user’s application. The options we considered included API polling, webhooks, WebSockets, and streaming.

While API polling seemed to be the simplest approach, it would require SDK clients to periodically poll the Scout daemon for updates rather than receiving flag data as soon as it changes. This would eliminate the real-time benefits of streaming and could also generate unnecessary network traffic.

Webhooks were another alternative we considered. Webhooks are more suitable than API polling due to their event-driven nature: while API polling is triggered by time, irrespective of whether or not a data change has occurred, with webhooks an HTTP request would only be sent in response to an event. However, this approach would require an additional HTTP endpoint to be exposed on the client application, requiring user configuration. Ultimately, we found it preferable to minimize interference with the client application.

Data could also be sent from Pioneer to SDKs via WebSockets. WebSockets, however, are primarily designed for bi-directional communication. Although the SDK client initially sends an HTTP request to Scout to establish a connection, all subsequent messages flow from Scout to the SDK; only unidirectional communication is required, so the bi-directional capabilities of WebSockets are unnecessary.

Ultimately, we decided to use Server-Sent Events (SSE) to enable efficient Scout-to-SDK streaming of feature flag data. SSE is an excellent fit for real-time data, as the single, long-lived connection provides low-latency delivery. Upon receiving a new SSE event, the SDK parses the newly provided data and uses it to evaluate feature flags.

One additional concern we discussed was how to authorize SSE clients in order to protect the data within the feature flag ruleset. We decided to provide an SDK key to users via the Compass UI. Users must provide this SDK key when integrating an SDK into their application code to connect to Scout and receive feature flag data. If no valid key is provided, Scout will reject the SDK’s request to connect as an SSE client.

A potential disadvantage of using SSE is that idle connections are closed after roughly 30 seconds. This is the default behavior of the EventSource API, which manages the connection. We could have configured a longer timeout period; however, the intent of the timeout is to prevent stale and phantom connections, which we do want to avoid. The EventSource API handles these timeout-based closures automatically, reconnecting after a short interval.

A separate concern was how to handle unsuccessful connection requests from an SDK to Scout. We wanted the SDK to retry the connection without swamping the daemon with an endless stream of failed requests. We addressed this by adding a reconnection attempt limit to all three of our SDKs: once the limit is reached, an error message is logged and the SSE connection is closed.
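The SDK-key check described above might look something like the following sketch; the header name and key format are illustrative assumptions, not Scout’s actual interface:

```python
def authorize_sse_request(request_headers, valid_sdk_keys):
    """Reject SSE connection attempts that do not carry a valid SDK key.
    The `Authorization` header name is an assumption for illustration."""
    key = request_headers.get("Authorization", "")
    return key in valid_sdk_keys

# Hypothetical key issued via the Compass UI.
valid_keys = {"sdk-key-123"}
assert authorize_sse_request({"Authorization": "sdk-key-123"}, valid_keys) is True
assert authorize_sse_request({}, valid_keys) is False  # no key: rejected
```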
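The reconnection attempt limit could be sketched as a bounded retry loop; the function and parameter names here are hypothetical rather than taken from the actual SDKs:

```python
import time

def connect_with_retry(attempt_connection, max_attempts=5, base_delay=0.01):
    """Retry a failed SSE connection up to max_attempts times, with a short
    delay between attempts, before giving up with an error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_connection()
        except ConnectionError:
            if attempt == max_attempts:
                raise RuntimeError("SSE reconnection attempt limit reached")
            time.sleep(base_delay)

# Simulate a connection that fails twice and then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("connection refused")
    return "connected"

assert connect_with_retry(flaky) == "connected"
```

Capping the attempts keeps a misbehaving client from sending an infinite stream of connection requests to Scout.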

5.4 Redis Cache

Initially, we considered if Pioneer would require a Redis cache to offload read requests from the Compass Postgres database when Scout requests flag data through NATS JetStream. The proposed cache would request data from the Compass API to initially populate, and listen for subsequent feature flag updates.

After further analysis, we determined that adding a cache to our application was not appropriate for our use case and would unnecessarily increase the complexity of Pioneer’s architecture. Because Pioneer’s intended use case involves small- or medium-sized organizations, the number of read operations on Compass’ Postgres database should be manageable without a cache.

Furthermore, due to the open-source nature of Pioneer, user organizations have the freedom to add their own cache if required.

5.5 Sending Feature Flag Updates - Piecemeal vs Whole

A decision needed to be made regarding the content of messages that were distributed throughout the system in response to a change to the feature flag data. One option was to send information pertaining only to the modified flag (piecemeal). The other option was to transmit the entire up-to-date set of feature flag data (whole).

The decision was made to implement Pioneer such that the entirety of the up-to-date feature flag data would be sent, regardless of the operation performed. This method of distribution offers several advantages. The first is that by transmitting the full data set, we can ensure that every SDK has the most up-to-date information available at all times.

The alternative solution of sending individual feature updates had a few drawbacks. Potentially, an SDK could miss an update from Scout due to network issues. This would result in an SDK evaluating flags using outdated feature flag data. The discrepancy would persist until the next update related to that particular flag was made, resulting in conflicts between the data sets of individual SDK clients. By sending the full feature flag data set, we can significantly reduce the possibility of SDKs serving outdated feature flag data.

An additional benefit to sending the entirety of the feature flag data is that it allows the code on both ends of the communication to be simple and elegant. The SDK merely has to save the newly received data as a whole. This approach avoids introducing additional surface area for bugs by excluding the need for complex logic required to parse feature flag data, update specific elements, and handle every type of CRUD operation that might occur.
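The “save the ruleset as a whole” approach might be sketched like this (the flag shape and names are illustrative):

```python
class FlagStore:
    """SDK-side storage: each update replaces the ruleset wholesale, so there
    is no per-operation merge logic to get wrong."""
    def __init__(self):
        self.ruleset = {"flags": []}

    def on_update(self, ruleset):
        self.ruleset = ruleset  # save the newly received data as a whole

    def is_active(self, title):
        return any(f["title"] == title and f["active"]
                   for f in self.ruleset["flags"])

store = FlagStore()
store.on_update({"flags": [{"title": "new-checkout", "active": True}]})
assert store.is_active("new-checkout")

# Toggling the flag off simply arrives as another full ruleset.
store.on_update({"flags": [{"title": "new-checkout", "active": False}]})
assert not store.is_active("new-checkout")
```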

The obvious tradeoff of sending the entire feature flag data set is the increased message size and its impact on network bandwidth. Pioneer is designed to be used by relatively small teams to migrate to microservices from a monolith. We reasoned that the standard use case would not likely exceed 20-30 distinct flags at a time.

To test Pioneer’s capacity, we tested a data set composed of 100 distinct flags. Even at this inflated data set size, the total transmission from Scout to each connected SDK client was almost exactly 20KB. At the expected rate of 10 requests/second, the resulting 200KB/second should fall well within the limits of any modern network. Because Pioneer is open-source software, an organization that does need to transmit very large amounts of feature flag data can add logic to compress the data before it is sent and decompress it when it is received by the SDK.
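The bandwidth estimate can be checked with quick arithmetic, using the ~20KB payload measured for the 100-flag data set:

```python
# Back-of-the-envelope throughput check for the measured ~20KB payload.
payload_kb = 20

for rate in (10, 100):  # connection requests per second
    throughput_kb = payload_kb * rate
    print(f"{rate} req/s -> {throughput_kb}KB/s ({throughput_kb / 1024:.2f}MB/s)")
```

At the expected 10 requests/second this is 200KB/second; even a tenfold increase to 100 requests/second is only around 2MB/second.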

Therefore, we concluded that the tradeoff of increased transmission size for sending full feature flag datasets was acceptable given the benefits in ensuring that SDKs evaluate up-to-date data.

5.6 Load Testing

One area we wanted to consider carefully was understanding and testing the limits on the number of SDK clients that can connect to Scout simultaneously and be served feature flag data efficiently. Though our intended use case implies that a high number of SDKs is unlikely to be connected to Scout, we still wanted to explore how the system performs under increasing levels of load.

With this use case in mind, we reasoned that a rate of 10 new connections to Scout every second would cover most usage scenarios. Beyond that, testing a higher number of connections would only reaffirm the robustness of the Scout daemon for our intended use case.

Our goal for these tests was to simulate the process of an SDK client establishing an SSE connection with Scout and subsequently receiving the feature flag data. For these tests, we used the relatively large 100-flag data set previously mentioned in section 5.5. Recall that the data itself along with HTTP headers resulted in the transmission of nearly 20KB of data.

For testing purposes, we chose to isolate the process of connecting and transmitting an initial set of feature flag data. In order to achieve this, we temporarily modified the Scout daemon to close SSE connections after the initial flag data had been sent to the SDK. If each SSE client connection remained open indefinitely, the Scout daemon would certainly perform differently. However, because SSE connections are likely to be dropped and added fairly regularly as the client application spins up new instances and terminates others, we felt it was reasonable to test Scout without leaving every connection open in perpetuity.

Our testing was performed with Artillery.io, using a configuration file that ran several different phases for extended periods of time. We incrementally increased the load on Scout by a factor of 10 per phase, beginning at 1 request per second and peaking at 1,000 requests per second before ramping back down.
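A multi-phase Artillery configuration along these lines would produce such a ramp; the target address and endpoint path below are assumptions for illustration, not Pioneer’s actual values:

```yaml
config:
  target: "http://localhost:3001"  # hypothetical Scout address
  phases:
    - duration: 60
      arrivalRate: 1      # warm-up
    - duration: 60
      arrivalRate: 10     # expected load
    - duration: 60
      arrivalRate: 100
    - duration: 60
      arrivalRate: 1000   # peak
    - duration: 60
      arrivalRate: 10     # ramp back down
scenarios:
  - flow:
      - get:
          url: "/features"  # hypothetical flag-data endpoint
```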

The results of our tests demonstrated a few things. First, the Scout daemon can easily handle the expected case of 10 requests per second. Second, under a usage load of 100 requests per second, Scout’s median response time increased by about 450 milliseconds, but the system could still serve all of the data payloads successfully. Lastly, we observed a degradation of performance under a load of 1000 requests per second.

Ultimately, our tests were successful in demonstrating that Pioneer’s system is more than capable of handling the anticipated load. If an organization using Pioneer were approaching a load of 1000 connection requests per second, they may consider implementing an additional instance of Scout to share the load to prevent performance degradation.

6. Future Work

6.1 Accommodate Multiple Applications

Currently, an instance of Pioneer supports a single application. More specifically, Pioneer broadcasts the entire set of flag data to all connected SSE clients in an application-agnostic manner. If an organization would like to use Pioneer with additional applications that require different feature flag data, they will need to spin up an additional instance of Pioneer to communicate with that application. This is a natural consequence of Pioneer’s simplicity and ease of use. However, in the future, we may consider adding support for multiple sets of flag data handled by a single instance of Pioneer.

6.2 Additional Rollout Strategies

Offering additional rollout strategies that allow the organization to target particular users would allow for more granular control over the initial users of a new feature. Some special users that we may choose to accommodate in the future are users internal to the organization, a predetermined group of beta-testers, or particular segments of the market.
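One form such targeting could take is a per-flag set of eligible user groups; everything here (field names, group labels) is hypothetical:

```python
def is_enabled_for(flag, user):
    """Hypothetical targeting rule: a flag may be limited to named user
    groups (e.g. internal staff or beta testers) before a wider rollout."""
    groups = flag.get("target_groups")
    if groups is None:
        return flag["active"]  # no targeting: fall back to the toggle alone
    return flag["active"] and user.get("group") in groups

flag = {"title": "new-checkout", "active": True,
        "target_groups": {"internal", "beta"}}
assert is_enabled_for(flag, {"id": 1, "group": "beta"}) is True
assert is_enabled_for(flag, {"id": 2, "group": "public"}) is False
```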

6.3 Flag Expiration

Because Pioneer is meant to be used to roll out new services, the conditional logic related to a feature flag for a service likely shouldn’t live in the codebase indefinitely. Flag expiration would allow engineers to set an expiration date on a flag after which the flag will throw an exception or log a warning message if it is evaluated in the codebase. The motivation behind flag expiration is to avoid technical debt. When a feature flag is no longer necessary, the flag and the application logic that evaluates the flag should both be removed from the codebase.
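A sketch of how an SDK-side expiration check might behave, assuming a hypothetical `expires_at` field on each flag:

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("pioneer.sdk")

def evaluate_flag(flag, now=None):
    """Sketch of the proposed expiration check: evaluating a flag past its
    expiration date logs a warning (or could raise an exception) so that
    stale flags get noticed and removed from the codebase."""
    now = now or datetime.now(timezone.utc)
    expires_at = flag.get("expires_at")
    if expires_at and now > expires_at:
        logger.warning("Feature flag %r has expired; remove it from the codebase.",
                       flag["title"])
    return flag["active"]

expired = {"title": "legacy-checkout", "active": True,
           "expires_at": datetime(2020, 1, 1, tzinfo=timezone.utc)}
assert evaluate_flag(expired) is True  # still evaluates, but logs a warning
```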

Another benefit to offering flag expiration is that it provides a simple and clear-cut rollout window in which to collect analytics and user feedback on a new feature. Engineers could determine the appropriate duration to test a new feature and set the flag expiration accordingly. Pioneer would handle expiring the feature flag at the assigned time, and the organization’s engineers could review the data collected later to decide how to proceed with the new feature.

7. Presentation

8. Meet the Team

Pioneer was built by a small team of dedicated individuals.

We are currently looking for new opportunities, so please reach out if this project interests you; we would love to chat more about it!

Jimmy Zheng

Laura Davies

Kyle Ledoux

Elizabeth Tackett