How We Found and Fixed a Throughput Ceiling

June 12, 2026 · 6 min read

We have shipped a lot of features into Fusillade over the past few releases: cookie jars, file uploads, retries, redirect handling, gRPC fixes, and more. When you add that much in a short time, it is easy for a quiet performance change to slip in without anyone noticing. So we sat down and ran a full round of regression benchmarks against k6. One of the numbers looked wrong, and chasing it led us to a real bottleneck. Here is the whole story.

The symptom

We ran the same simple GET test at several worker counts, from 50 workers all the way up to 2000. We expected throughput to climb as we added load, or at least to hold steady. Instead it sat at almost exactly the same number every time, no matter how many workers we used.

Workers	Throughput (before)	p95 latency
50	99,400 RPS	0.25 ms
200	98,300 RPS	0.25 ms
500	95,800 RPS	0.25 ms
1000	99,100 RPS	0.25 ms
2000	92,300 RPS	0.26 ms

A flat line like this is a strong clue. The latency stayed tiny and constant, so the server was not the problem. Adding more workers did nothing, so we were not short on work to do. Something inside Fusillade was serializing the work at a fixed rate.

The hunt

Fusillade runs each virtual user as a lightweight green thread. Many green threads share a small pool of real operating system threads, which we call carrier threads. This is what lets Fusillade run tens of thousands of workers on very little memory.

The catch is how blocking work interacts with that model. The default HTTP path uses a blocking client. When a green thread makes a blocking request, it does not just pause that one green thread. It parks the entire carrier thread until the response comes back. While that carrier is parked, none of the other green threads sitting on it can run.

That means the real number of requests in flight at any moment is equal to the number of carrier threads, not the number of workers. If you have 16 carrier threads, you get 16 requests in flight, whether you asked for 50 workers or 2000. The extra workers just wait in line.
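To make the effect concrete, here is a small Python model of it. This is a sketch, not Fusillade's code: an ordinary thread pool stands in for the carrier threads, each submitted task stands in for a worker, and `time.sleep` stands in for a blocking HTTP call. The function name and the 10 ms delay are made up for the illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_in_flight(workers: int, carriers: int, latency_s: float = 0.01) -> int:
    """Run `workers` blocking tasks on `carriers` OS threads; return peak concurrency."""
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def blocking_request() -> None:
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(latency_s)  # stand-in for a blocking HTTP call
        with lock:
            in_flight -= 1

    # The pool plays the role of the carrier threads (an assumption of this model).
    # Leaving the `with` block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=carriers) as pool:
        for _ in range(workers):
            pool.submit(blocking_request)
    return peak


if __name__ == "__main__":
    for workers in (8, 50, 200):
        print(workers, "workers ->", peak_in_flight(workers, carriers=4), "in flight")
```

However many tasks are queued, the peak never exceeds the pool size, which is the same shape as the flat line in the benchmark.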

When we checked the formula that sized the carrier pool, there it was. It scaled the pool very slowly with worker count and floored it at the CPU count. On a 16-core machine that worked out to 16 carriers for every worker count we tested. Sixteen requests in flight, divided by the time each request took, gave almost exactly the flat number we kept seeing.
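The arithmetic here is Little's Law: throughput equals requests in flight divided by the time each request takes. The sketch below applies it to the numbers in this post. The function names are ours, and the mean time per request is not something we measured directly; it is backed out from the 16 carriers and the roughly 99,000 RPS plateau.

```python
def implied_latency_s(in_flight: int, throughput_rps: float) -> float:
    """Mean time per request implied by Little's Law (latency = in_flight / throughput)."""
    return in_flight / throughput_rps


def throughput_ceiling(workers: int, carriers: int, mean_latency_s: float) -> float:
    """Upper bound on RPS when every request blocks a whole carrier thread."""
    in_flight = min(workers, carriers)  # extra workers just wait in line
    return in_flight / mean_latency_s


if __name__ == "__main__":
    # 16 carriers at about 99,000 RPS implies roughly 0.16 ms per request,
    # which sits below the 0.25 ms p95 in the table.
    latency = implied_latency_s(16, 99_000)
    print(f"implied mean time per request: {latency * 1000:.3f} ms")
    for workers in (50, 200, 500, 1000, 2000):
        ceiling = throughput_ceiling(workers, 16, latency)
        print(f"{workers:>5} workers -> ceiling {ceiling:,.0f} RPS")
```

The ceiling comes out the same for every worker count above 16, because `min(workers, carriers)` never changes.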

The proof

We did not want to guess, so we made the carrier count adjustable and swept it while holding everything else fixed at 500 workers. If carrier count was the ceiling, throughput should climb as we raised it.

Carrier threads	Throughput	p95 latency
16 (old default)	99,038 RPS	0.25 ms
32	113,907 RPS	0.61 ms
64	126,322 RPS	1.36 ms
128	128,050 RPS	2.53 ms
256	113,874 RPS	4.67 ms

That settled it. Throughput rose as we added carriers, peaked around 128, then fell off as too many threads started fighting over the CPU. A blocking thread spends most of its time waiting on the socket, so it pays to run more of them than you have cores, up to a point.
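If you want to read a sweep like this programmatically, a few lines of Python are enough. The sketch below uses the figures from the table above; the helper names are ours. It picks the carrier count with the highest throughput and reports each row relative to the old default of 16, so the p95 cost of extra carriers is visible next to the throughput gain.

```python
# (carrier threads, throughput in RPS, p95 latency in ms), from the sweep at 500 workers.
SWEEP = [
    (16, 99_038, 0.25),
    (32, 113_907, 0.61),
    (64, 126_322, 1.36),
    (128, 128_050, 2.53),
    (256, 113_874, 4.67),
]


def best_row(sweep):
    """Return the row with the highest throughput."""
    return max(sweep, key=lambda row: row[1])


def relative_to_baseline(sweep):
    """Map carriers -> (throughput gain as a fraction, p95 multiple) versus the first row."""
    _, base_rps, base_p95 = sweep[0]
    return {
        carriers: (rps / base_rps - 1.0, p95 / base_p95)
        for carriers, rps, p95 in sweep
    }


if __name__ == "__main__":
    print("peak at", best_row(SWEEP)[0], "carriers")
    for carriers, (gain, p95_x) in relative_to_baseline(SWEEP).items():
        print(f"{carriers:>4} carriers: {gain:+.1%} throughput, {p95_x:.1f}x p95")
```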

The fix

The fix was to size the carrier pool toward a healthy multiple of the CPU count instead of the old slow formula, with a sensible cap so we never run so many threads that they get in each other's way. We also kept the manual override in place so you can tune it for unusual hardware.
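In code, the shape of that rule looks something like the sketch below. Treat it as an illustration only: the function name, the multiplier of 8, and the cap of 128 are hypothetical values picked because the sweep on our 16-core box peaked near 128 carriers. They are not the constants Fusillade ships with.

```python
from typing import Optional


def carrier_pool_size(
    cpu_count: int,
    multiplier: int = 8,   # hypothetical: a "healthy multiple" of the CPU count
    cap: int = 128,        # hypothetical: upper bound to avoid CPU contention
    override: Optional[int] = None,
) -> int:
    """Size the carrier pool from the CPU count, honoring a manual override."""
    if override is not None:
        # A manual setting wins over the computed default.
        return max(1, override)
    return max(1, min(cpu_count * multiplier, cap))


if __name__ == "__main__":
    for cpus in (4, 16, 64):
        print(cpus, "CPUs ->", carrier_pool_size(cpus), "carriers")
    print("override ->", carrier_pool_size(16, override=32), "carriers")
```

The design choice is the one the sweep pointed to: blocking threads spend most of their time waiting, so the pool should be larger than the core count, but bounded so it does not slide down the far side of the curve.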

Then we re-ran the full sweep with the new default and compared against k6.

Workers	Before (RPS)	After (RPS)	k6 (RPS)
50	99,400	120,797	121,900
200	98,300	129,924	113,200
500	95,800	130,866	105,200
1000	99,100	127,764	87,000
2000	92,300	131,043	78,600

The plateau is gone. Throughput now sits in the 121,000 to 131,000 RPS range across the board instead of being stuck between roughly 92,000 and 99,000. Fusillade is within 1% of k6 at 50 workers and pulls ahead at every level above that. The gap grows as the load grows, because k6 slows down as you add virtual users while Fusillade holds steady.
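To put a number on that gap, the sketch below divides the "After" column by the k6 column for each worker count. The data comes straight from the table above; the helper name is ours.

```python
# (workers, Fusillade after the fix, k6), throughput in RPS, from the table above.
RESULTS = [
    (50, 120_797, 121_900),
    (200, 129_924, 113_200),
    (500, 130_866, 105_200),
    (1000, 127_764, 87_000),
    (2000, 131_043, 78_600),
]


def relative_throughput(results):
    """Map worker count -> Fusillade throughput as a multiple of k6 throughput."""
    return {workers: after / k6 for workers, after, k6 in results}


if __name__ == "__main__":
    for workers, ratio in relative_throughput(RESULTS).items():
        print(f"{workers:>5} workers: Fusillade at {ratio:.2f}x k6")
```

The ratio starts just under 1.0 at 50 workers and rises at every step after that.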

Memory went up a little, from about 38 MB to between 48 and 161 MB depending on load, because we are running more real threads now. That is still far below k6, which used between 700 MB and 1 GB in the same runs. The low memory story that Fusillade is known for is fully intact.

The takeaway

This is the kind of bug that hides easily. Nothing was broken, no test failed, and every feature worked. The tool was simply leaving performance on the table because of one number in one formula. We only caught it because we made a habit of running regression benchmarks after a busy run of feature work.

If you are running an older build, upgrade to v1.6.2 to get the higher throughput automatically. No script changes are needed. And if you want to tune the carrier pool for your own hardware, the FUSILLADE_MAY_WORKERS environment variable is there for you.
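One way to tune it is to repeat the sweep from earlier on your own hardware. The helper below is a rough sketch of that: it runs a command once per carrier count with FUSILLADE_MAY_WORKERS set in the environment and collects whatever the command prints. It assumes the variable takes a plain integer and is read at process start. The command itself is a placeholder that only echoes the variable back, so swap in your own load test invocation.

```python
import os
import subprocess
import sys
from typing import Dict, Iterable, List


def sweep_carriers(
    command: List[str],
    carrier_counts: Iterable[int],
    var: str = "FUSILLADE_MAY_WORKERS",
) -> Dict[int, str]:
    """Run `command` once per carrier count with `var` set; return stdout per count."""
    results: Dict[int, str] = {}
    for count in carrier_counts:
        env = dict(os.environ)   # copy, so the parent environment is untouched
        env[var] = str(count)    # assumption: the variable takes an integer count
        done = subprocess.run(
            command, env=env, capture_output=True, text=True, check=True
        )
        results[count] = done.stdout.strip()
    return results


if __name__ == "__main__":
    # Placeholder command: prints the variable it was given.
    echo = [sys.executable, "-c",
            "import os; print(os.environ['FUSILLADE_MAY_WORKERS'])"]
    print(sweep_carriers(echo, [16, 32, 64, 128, 256]))
```

Hold the worker count and target fixed while you sweep, as we did, so the carrier count is the only thing that changes between runs.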
