One of our goals here at mabl is to make test automation easier for everyone in the development organization, including manual testers, QA engineers, developers, and product owners. In late 2022, we added performance testing to our platform’s existing cross-browser and API testing capabilities. We conducted customer interviews, refined requirements, developed designs, and implemented and tested the early access release of the product. As more teams adopt low-code API performance testing with mabl, we wanted to share a few lessons we learned while developing this exciting feature.

From Functional Testing to Load Testing 

Prior to this effort, we offered functional API testing. Here is the high-level architecture of that solution:

API Functional Testing Architecture

When our scheduling system determines that an API test should be executed, it publishes a message to a Pub/Sub topic containing all of the information required to execute the test. We then invoke a Cloud Function that executes the test, writes the output to Cloud Storage, and calls back to our API with results. This architecture has been working reliably for functional API tests since it was introduced. For load testing, our initial plan was to simply extend that architecture to support parallel executions.
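To make the flow concrete, here is a minimal sketch of what publishing such a trigger message could look like with the google-cloud-pubsub client. The project, topic name, and message fields below are illustrative placeholders rather than our actual schema.

```python
# Minimal sketch of the scheduling side of this flow using the
# google-cloud-pubsub client. The project, topic, and message fields
# are illustrative placeholders, not mabl's actual schema.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "api-test-executions")

# Everything the Cloud Function needs to run the test travels in the message.
message = {
    "test_id": "api-test-123",
    "environment_id": "env-456",
    "steps": [{"method": "GET", "url": "https://api.example.com/health"}],
    "output_bucket": "example-test-results",
}

future = publisher.publish(
    topic_path,
    data=json.dumps(message).encode("utf-8"),
    test_id=message["test_id"],  # attributes allow filtering without decoding the payload
)
print(f"Published message {future.result()}")
```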

Adapting the Functional API Test Architecture to Load Testing 

We needed to make a few changes to adapt the functional API testing architecture for load testing. The most significant, from an architecture perspective, was a new component called the Orchestrator, which would be responsible for managing the Cloud Function instances associated with a load test. The Orchestrator would start the required number of Cloud Functions via Pub/Sub trigger messages and replace any instance that crashed or that needed to run longer than the maximum Cloud Function runtime limit.
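Conceptually, the Orchestrator's core job is a reconciliation loop: compare the number of healthy load generators against the number the test needs, and publish new trigger messages to make up the difference. The sketch below illustrates that idea; the heartbeat mechanism, helper objects, and function names are hypothetical rather than our actual implementation.

```python
# Conceptual sketch of the Orchestrator's reconciliation loop, assuming
# workers report heartbeats to a shared store. The helper objects and
# function names are hypothetical; the real implementation differs.
import time

MAX_RUNTIME_SECONDS = 540      # Cloud Functions (1st gen) capped executions at 9 minutes
HEARTBEAT_TIMEOUT_SECONDS = 30

def reconcile(load_test, worker_store, trigger_topic):
    """Keep the desired number of load generators running for a load test."""
    now = time.time()
    workers = worker_store.list_workers(load_test.id)

    # Workers that stopped heartbeating likely crashed; workers approaching
    # the runtime cap need to be replaced before the platform terminates them.
    healthy = [
        w for w in workers
        if now - w.last_heartbeat < HEARTBEAT_TIMEOUT_SECONDS
        and now - w.started_at < MAX_RUNTIME_SECONDS - HEARTBEAT_TIMEOUT_SECONDS
    ]

    missing = load_test.desired_workers - len(healthy)
    for _ in range(max(missing, 0)):
        # Each trigger message starts one Cloud Function load generator.
        trigger_topic.publish_start_message(load_test)
```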

Our Initial Load Testing Architecture 

Taking the above changes into account, we began implementing and testing the modified functional testing architecture:

Initial load testing architecture

Verifying Load Test Performance

Once we had implemented the initial load testing architecture, we needed to verify that it was performing adequately. The first problem we had when trying to verify the performance was finding a suitable target for the load. There are several sites designed to assist in writing functional API tests, such as postman-echo.com or httpbin.org, but these sites are not intended to be used for load testing.

We quickly built and deployed a load testing target service based on nginx and GKE. By leveraging nginx's support for Lua scripting, we could pass parameters on the request query string to have the server return specified error codes with certain probabilities or induce randomized latency.
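The real target is implemented as Lua scripts running inside nginx, but the behavior is easy to illustrate. The sketch below is a rough Python stand-in showing how query-string parameters could drive error injection and latency; the parameter names are made up for illustration.

```python
# Rough Python stand-in for the nginx/Lua load target, showing the behavior
# the query-string parameters controlled. Parameter names (error_rate,
# error_code, max_delay_ms) are made up for illustration.
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class LoadTargetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        error_rate = float(params.get("error_rate", ["0"])[0])
        error_code = int(params.get("error_code", ["500"])[0])
        max_delay_ms = int(params.get("max_delay_ms", ["0"])[0])

        # Induce randomized latency up to max_delay_ms.
        if max_delay_ms > 0:
            time.sleep(random.uniform(0, max_delay_ms) / 1000.0)

        # Return the requested error code with the requested probability.
        status = error_code if random.random() < error_rate else 200
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), LoadTargetHandler).serve_forever()
```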

Challenge 1: Minimizing Latency and Variance

As we began to ramp up our testing to larger numbers of load generators, we noticed that the metrics we were measuring were not in line with our expectations. Latencies were higher than expected, throughput was correspondingly lower, and several metrics showed higher-than-expected variance. Below is an example of one of these early test runs:

Request latencies were fluctuating by hundreds of milliseconds and averaging over 500ms per request.

We spent the next couple of weeks investigating ways to improve performance. We tested different variables by changing the client libraries, changing which cloud services were used to generate the load, and eliminating as many network hops as possible. First, we tried generating load using the same client code but running in different execution environments: local non-virtualized hardware, plain GCE VMs, and GKE.
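Below is a rough sketch of the kind of measurement loop such a comparison relies on: the same client code issuing requests and recording per-request latency in each environment. The target URL, request count, and use of Python's requests library are illustrative assumptions, not our actual harness.

```python
# Sketch of a simple measurement loop run unchanged across environments
# (local hardware, GCE VMs, GKE) to compare latency. The target URL and
# request count are placeholders.
import statistics
import time
import requests

TARGET = "https://load-target.example.com/?max_delay_ms=0"

def measure(requests_to_send: int = 500) -> None:
    latencies_ms = []
    for _ in range(requests_to_send):
        start = time.perf_counter()
        requests.get(TARGET, timeout=10)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    print(f"mean : {statistics.mean(latencies_ms):.1f} ms")
    print(f"p95  : {sorted(latencies_ms)[int(0.95 * len(latencies_ms))]:.1f} ms")
    print(f"stdev: {statistics.stdev(latencies_ms):.1f} ms")

if __name__ == "__main__":
    measure()
```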

We didn't observe performance issues when running in these alternate environments, which led us to suspect that the Cloud Functions execution environment was a contributing factor.  To confirm, we then generated load in Cloud Functions using a different client library written in a different language, and we observed similar performance anomalies. Although we were not able to determine the actual source of those performance anomalies, we suspect that several factors such as shared infrastructure and the underlying instance type contributed to this behavior.

Next, we had to decide what to do with these findings. Although the performance was not what we expected, changing a key component of the architecture was a significant risk to the timeline.  We researched alternatives like Cloud Run and GKE, and we estimated that migrating to GKE would not be too much effort since we were already using it for the other components.  We ultimately decided that achieving high performance was so critical to the success of the product that it was worth the extra time investment.

Migrating from Cloud Functions to GKE was a fairly easy process: we defined a new node pool for load generators and packaged the code in an image so that it could be deployed as a Kubernetes StatefulSet. The entire process took roughly one week.
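One convenience of a StatefulSet for this kind of workload is that each pod gets a stable, ordinal-suffixed hostname, which can double as a load generator index. The sketch below shows that pattern; it is an assumption for illustration rather than a description of how our generators actually identify themselves.

```python
# Each StatefulSet pod gets a stable hostname ending in its ordinal
# ("load-generator-0", "load-generator-1", ...), which can serve as a
# worker index. This pattern is an illustrative assumption, not
# necessarily how mabl's load generators identify themselves.
import socket

def worker_index() -> int:
    """Derive this load generator's index from the StatefulSet pod hostname."""
    hostname = socket.gethostname()          # e.g. "load-generator-7"
    return int(hostname.rsplit("-", 1)[-1])

if __name__ == "__main__":
    print(f"This pod is load generator #{worker_index()}")
```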

Running the load generators on GKE yielded much lower latency and variance:

Load test with load generators executing on GKE

After making these changes, we were finally seeing consistent performance for the majority of the test duration. However, there was one anomaly that was still bothering us.

Challenge 2: Achieving Consistent Throughput 

After resolving the main performance problems with the load test infrastructure, we noticed one other strange anomaly that would consistently appear near the beginning of the test:

Brief throughput drops near the beginning of load tests indicated a problem…somewhere

These throughput drops typically lasted less than a minute and occurred shortly after the start of the test. Using the same debugging techniques we had used previously, we identified one setting whose value affected the drops: connection reuse.

When sending many HTTP requests, reusing the same TCP connection can significantly improve performance by avoiding the TCP handshake overhead each time. However, we had originally wanted to avoid connection reuse in order to more accurately simulate many unique users accessing the target system simultaneously. Unfortunately, we found that the high number of unique connections was causing performance issues when combined with Cloud NAT. Our suspicion was that Cloud NAT had to horizontally scale up after a few minutes of sustained load with many unique connections, and that this scaling operation resulted in a temporary performance degradation.
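The difference is easy to see in code. The sketch below uses Python's requests library as a stand-in for our actual client: without reuse, every request opens and tears down its own TCP connection and consumes a fresh Cloud NAT source port; with a shared session, requests to the same host reuse a pooled connection.

```python
# Illustration of the connection-reuse change, using Python's requests
# library as a stand-in for the actual load generator client. The target
# URL is a placeholder.
import requests

TARGET = "https://load-target.example.com/"

def without_reuse(n: int) -> None:
    # A fresh connection per request: closer to "many unique users",
    # but each request consumes a new NAT source port and pays the
    # TCP handshake overhead.
    for _ in range(n):
        with requests.Session() as session:
            session.get(TARGET, timeout=10)

def with_reuse(n: int) -> None:
    # One pooled connection per worker: far fewer NAT ports and no
    # per-request handshake.
    with requests.Session() as session:
        for _ in range(n):
            session.get(TARGET, timeout=10)
```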

After enabling TCP connection reuse and tuning the Cloud NAT configuration to better handle large numbers of connections, we were able to execute a 3,000-user test with smooth, consistent performance:

Smooth ramp-up and consistent performance with 3,000 load generators

Our Final Load Testing Architecture 

After making the changes described above, we ended up with a simpler architecture relying on fewer services and achieving better performance at lower cost:

Final load testing architecture with GKE-based load generators

At mabl we use an agile, iterative development methodology. Early in the development of the API performance testing product, we favored solutions that would get the functionality into customers' hands sooner so we could obtain feedback and validate the core features. Later on, we revisited some of those earlier decisions and made a few changes that we thought were necessary for the product to reach its full potential. The end result was a product that provided value to our Friends of mabl customer community and fostered a sense of pride across the mabl team.

See mabl's API Testing Capabilities in Action 

Join Google Cloud (Apigee), GitLab, mabl, and 66degrees on Thursday, October 19th to learn how to up-level your API strategy through:

  • API management best practices
  • Integrating APIs into your CI/CD pipelines
  • API testing

Register for free.