Big Tech → Startup → Clean Slate
Choosing your tech stack is one of the exciting bits of building a startup.
When you work in Big Tech, you’re often forced to commit to an existing codebase and toolset. The tight integration can offer comfort, but it can also feel stifling. If you work at Meta (Facebook) for any length of time, you’re likely to end up writing Hack (Meta’s PHP dialect) while your friends are off building with Next.js or Rust.
But when you finally break free from the cozy confines of a larger organization, you are also free from Mark Zuckerberg’s language choice in 2004.
Your tech stack isn’t permanent
Now I’m not here to tell you that you should only use Rust, or never use Node.js, or that you should definitely use Next.js, or trade HTMX for the hottest frontend trend…the list goes on and on.
Instead, my advice is much simpler. Don’t spend forever rabbit-holing on the perfect tech stack - pick something defensible, try it, and tweak it as you go. These decisions rarely last forever and are easier to change than you might think.
There’s a reason why people say “Premature optimization is the root of all evil.” Build until the cracks start to show, and then get creative about solutions. After all, your startup doesn’t need to handle infinite scale any time soon – it just needs to handle enough traffic to get your first customer on board.
To put this into perspective, here’s the story of how Statsig’s tech stack for event handling evolved…and why.
Shortcomings of our initial architecture
Before we started Statsig, we’d chat weekly about system architecture, cloud providers, programming languages, and of course…coffee machines. (Okay, we mostly talked about which coffee machines to buy, and then went to play a game or two of DOTA.)
One of the few things I remember aligning on was using the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka) for our event ingestion and processing, which was all the rage at the time. But you won’t see this stack in our codebase today. Come to think of it, I can’t point to any other surface area in our product that has hit more scaling bottlenecks, or that we’ve changed more over the past 3.5 years, than our event processing stack.
Week one and reality strikes
During our first week, we set out to build event processing in our Node.js monolith server, using Kafka for event streaming and Cassandra as our storage layer, and then running our Spark pipelines on top. Except we never even made it that far – we abandoned Kafka and Cassandra within the same week.
Instead, we started to batch events into local files and then write them as CSVs to an Azure data lake. We also decided to use Databricks rather than raw Spark. We thought this was all we needed. In fact, our entire product strategy was captured in a whiteboard sketch: our illustration of the “wind of value” pushing developers to rain down events on the Statsig data pond 🤦🏻♂️…
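For flavor, here’s a minimal TypeScript sketch of what that week-one flow could look like, assuming the `@azure/storage-file-datalake` SDK. The connection string, filesystem name, file paths, and the `logEvent`/`flushBatch` helpers are all illustrative, not our actual implementation.

```typescript
import { appendFileSync, readFileSync, renameSync } from "fs";
import { DataLakeServiceClient } from "@azure/storage-file-datalake";

// Placeholder connection string and filesystem name.
const serviceClient = DataLakeServiceClient.fromConnectionString(
  process.env.AZURE_DATALAKE_CONNECTION_STRING!
);
const fileSystem = serviceClient.getFileSystemClient("events");

// Append each incoming event to a local CSV batch file.
export function logEvent(userID: string, eventName: string, value: string) {
  appendFileSync("current_batch.csv", `${userID},${eventName},${value}\n`);
}

// Periodically rotate the local batch and upload it to the data lake,
// where downstream Databricks jobs can pick it up.
export async function flushBatch() {
  const batchPath = `batch_${Date.now()}.csv`;
  renameSync("current_batch.csv", batchPath);

  const body = readFileSync(batchPath);
  const fileClient = fileSystem.getFileClient(batchPath);
  await fileClient.create();
  await fileClient.append(body, 0, body.length);
  await fileClient.flush(body.length);
}
```

Crude, but it got events from the app into storage that batch jobs could read.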
One year in and we moved cloud providers
We quickly discovered that our ideal design wouldn’t work for streaming: we were doing batch processing and missing a few percent of total events. So we switched to Azure Event Hubs, which improved reliability and enabled streaming into Databricks.
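Here’s a rough sketch of what the producer side of that change could look like from the Node app, using the `@azure/event-hubs` SDK; the connection string and hub name are placeholders. On the other end, a Databricks streaming job would consume from the event hub instead of waiting for batch files to land.

```typescript
import { EventHubProducerClient } from "@azure/event-hubs";

// Placeholder connection string and event hub name.
const producer = new EventHubProducerClient(
  process.env.EVENT_HUB_CONNECTION_STRING!,
  "statsig-events"
);

// Send events to Event Hubs in batches, rolling over when a batch fills up.
export async function publishEvents(events: object[]) {
  let batch = await producer.createBatch();
  for (const event of events) {
    if (!batch.tryAdd({ body: event })) {
      // Batch is full: send it, start a new one, and retry this event.
      await producer.sendBatch(batch);
      batch = await producer.createBatch();
      batch.tryAdd({ body: event });
    }
  }
  if (batch.count > 0) {
    await producer.sendBatch(batch);
  }
}
```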
We later learned that using Databricks rather than our own Spark was the right call, but it wasn’t scaling for incredibly large datasets. Around January 2023, some of our larger daily jobs started…taking more than a day to run.
Fortunately, we tried out BigQuery and were shocked to find our 28-hour runs took just 4 hours - and they cost less! So we started the long slog of migrating all of our pipelines to GCP. But this change came with its own set of challenges.
Our workaround for data ingestion
At this point, we already had our Node app ingesting events and logging them to Event Hubs. But we faced a new problem: the tight coupling between business logic and the persistence layer made things very fragile.
Changes or bugs in the processing code could lead to a worst-case scenario: event loss. So we introduced a lightweight event ingestion service (written in Rust) to blindly ingest logs and store them before processing them separately in the Node app.
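Our service is written in Rust, but the core idea is language-agnostic: accept the payload, persist the raw bytes durably, acknowledge, and do nothing else. Here’s a TypeScript sketch of that shape; the endpoint path and local log file are illustrative only (the real service stores events durably before the Node app processes them).

```typescript
import { createServer } from "http";
import { appendFile } from "fs/promises";

// Blind-ingest endpoint: persist the raw payload and acknowledge.
// No parsing or business logic runs here, so bugs in the processing
// code can't cause events to be dropped at the front door.
const server = createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/log_event") {
    res.statusCode = 404;
    res.end();
    return;
  }

  const chunks: Buffer[] = [];
  req.on("data", (chunk: Buffer) => chunks.push(chunk));
  req.on("end", async () => {
    try {
      // Sketch only: append to a local file. A real service would write
      // to durable storage before returning.
      await appendFile(
        "ingested_events.log",
        Buffer.concat(chunks).toString("utf8") + "\n"
      );
      res.statusCode = 202;
      res.end();
    } catch {
      res.statusCode = 500;
      res.end();
    }
  });
});

server.listen(8080);
```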
In addition, we were still reading from Azure Event Hubs after moving to GCP, so we were incurring significant costs simply from egressing events between cloud providers. Using GCP Pub/Sub for the full event stream would have been more expensive for our workloads, but we needed a first-party Google Cloud solution to cut the egress cost.
Fortunately, we discovered a workaround: we wrote events to Google Cloud Storage and used Pub/Sub to pass only a pointer to the file, so our processing logic could read the file and dump it into BigQuery.
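In other words, a claim-check pattern: the Pub/Sub message carries only a pointer, and the heavy payload lives in Cloud Storage. Here’s a TypeScript sketch of both sides, with placeholder bucket, topic, subscription, dataset, and table names; our actual pipeline differs in the details.

```typescript
import { Storage } from "@google-cloud/storage";
import { PubSub } from "@google-cloud/pubsub";
import { BigQuery } from "@google-cloud/bigquery";

const storage = new Storage();
const pubsub = new PubSub();
const bigquery = new BigQuery();

// Producer side: upload the batch to Cloud Storage, then publish only
// the file pointer. The message stays tiny no matter how big the batch is.
export async function publishBatch(csv: string) {
  const objectName = `events/batch_${Date.now()}.csv`;
  await storage.bucket("statsig-event-batches").file(objectName).save(csv);
  await pubsub.topic("event-batch-pointers").publishMessage({
    json: { bucket: "statsig-event-batches", object: objectName },
  });
}

// Consumer side: read the pointer and kick off a BigQuery load job
// straight from the Cloud Storage object.
export function startConsumer() {
  pubsub
    .subscription("event-batch-pointers-sub")
    .on("message", async (message) => {
      const { bucket, object } = JSON.parse(message.data.toString());
      await bigquery
        .dataset("events")
        .table("raw_events")
        .load(storage.bucket(bucket).file(object), { sourceFormat: "CSV" });
      message.ack();
    });
}
```

The bulk data never leaves GCP, and the per-message Pub/Sub cost stays small regardless of batch size.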
Here’s a screenshot of the relevant bits from our high-level system architecture diagram:
We’re still not done scaling
And there you have it…nearly 3.5 years of evolution captured in this short blog.
From handling 0 events per day to processing over 250,000,000,000 per day! We’re certainly not done changing our architecture to handle our increasing scale – but now we know how to improve it, piece by piece.
Let me know in the comments below which decisions or changes you’d like to hear more about, and we can cover them in more detail in a future post.
Happy scaling :)
Tore, Founding Engineer at Statsig