Hannes Gustafsson

From Jupyter Notebooks to Infrastructure: Building a Data Platform in 3 Weeks

When I joined Circulate, “the data platform” was me.

Jupyter notebooks. Excel sheets. Corrections over email. A Postgres database of products nobody trusted and therefore nobody used.

Every number required explanation. Every correction created a new version of the truth. Every report was a small project.

The constraint: 3 weeks. One engineer. Significant AWS credits but no budget for managed platforms like Dagster Cloud or Fivetran.


The Gap I’d Felt for Years

At Burt Intelligence, where I worked previously, we were disciplined about modeling: facts_, dims_, clear ETL boundaries, a proper warehouse mindset. It worked well for its time.

But there was always friction when scaling data work across teams. Getting data into the warehouse was rarely the problem. Building reliable insights on top of it—with many dependencies and many stakeholders—was.

I didn’t have a name for that gap yet.


The Mental Model That Unlocked Everything

I started revisiting fundamentals. Designing Data-Intensive Applications. Kimball’s warehouse modeling. Modern thinking around lakes and warehouses.

Then I read about the medallion structure and DBT’s way of organizing projects. One idea was a turning point:

The raw layer should be append-only.

Storage is cheap. Historical truth is not. If raw data is immutable, you gain enormous freedom later. You can rebuild, reinterpret, and fix logic without ever losing the original record of what happened.
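A minimal sketch of what append-only can mean in practice, assuming each ingested batch is written under a key that is never reused. The key layout (`raw/<source>/load_date=...`), the `raw_object_key` helper, and the in-memory `AppendOnlyStore` are hypothetical illustrations, not the layout Circulate uses:

```python
import hashlib
import json
from datetime import datetime, timezone


def raw_object_key(source: str, records: list[dict], loaded_at: datetime) -> str:
    """Build a key from the load time plus a content hash (hypothetical layout)."""
    body = json.dumps(records, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()[:12]
    stamp = loaded_at.astimezone(timezone.utc)
    return (
        f"raw/{source}/load_date={stamp:%Y-%m-%d}/"
        f"{stamp:%Y%m%dT%H%M%SZ}-{digest}.json"
    )


class AppendOnlyStore:
    """In-memory stand-in for a bucket that refuses overwrites."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # Corrections arrive as new objects; existing ones are never replaced.
        if key in self._objects:
            raise FileExistsError(f"raw objects are immutable: {key}")
        self._objects[key] = body

    def keys(self) -> list[str]:
        return sorted(self._objects)
```

A correction to a product record becomes a second object with a later timestamp. Downstream models decide which version wins, and the earlier record stays available for rebuilds.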

The second shift came when I read that Legora was using Dagster. I followed their tutorial with DBT and DLT. Something clicked that I had felt for years but never articulated:

The real problem in most data setups is not ingestion. It’s not modeling. It’s not storage.

It’s the lack of a system for describing how data products depend on each other.

Dagster made that explicit. You could describe: from these sources → through these transformations → into these insights. With lineage, dependencies, retries, and state.
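To make the idea concrete without any orchestrator, here is a rough sketch of a dependency graph between data products using only the Python standard library. The product names are hypothetical, and this is not how Dagster represents assets internally; it only shows the two questions such a graph answers: in what order to build, and what to rebuild when something changes.

```python
from graphlib import TopologicalSorter

# Hypothetical data products; each maps to the set of products it depends on.
DEPENDS_ON = {
    "raw_events": set(),
    "raw_products": set(),
    "stg_events": {"raw_events"},
    "stg_products": {"raw_products"},
    "int_user_activity": {"stg_events", "stg_products"},
    "mart_weekly_report": {"int_user_activity"},
}


def build_order(graph: dict[str, set[str]]) -> list[str]:
    """Return an order in which every product comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def downstream_of(graph: dict[str, set[str]], node: str) -> set[str]:
    """Return everything that must be rebuilt when `node` changes."""
    affected: set[str] = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for product, deps in graph.items():
            if current in deps and product not in affected:
                affected.add(product)
                frontier.append(product)
    return affected
```

Calling `downstream_of(DEPENDS_ON, "stg_products")` returns the intermediate model and the report built on it, which is the lineage question that used to be answered by asking a person.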

This was the missing layer I had felt at Burt. The missing layer at Circulate.

But paying for Dagster Cloud when we could assemble the same principles with AWS primitives didn’t make sense. So I took the ideas and rebuilt the stack natively.


MVP 1: Prove Ingestion Works

The simplest question first: can we ingest data reliably and transform it into something useful without touching a notebook?

Using DLT Hub, I ingested data into S3 and queried it with Athena. Then I used DBT to structure the transformations:

  • stg_ for cleaned source data
  • int_ for intermediate logic
  • mart_ for business-ready tables

For the first time, we had a clean flow: Raw (append-only) → Structured → Analytics.

No manual steps. No hidden logic.
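One way to keep that flow honest is a small check that no model reads from a later layer. This is a sketch under my own assumptions: the `raw` prefix, the rule itself, and the model names are illustrative, and DBT does not enforce this out of the box.

```python
# Layers in flow order; "raw" is an assumed prefix for the append-only layer.
LAYER_ORDER = ["raw", "stg", "int", "mart"]


def layer_of(model_name: str) -> str:
    """Read the layer from the model's prefix, e.g. 'stg_orders' -> 'stg'."""
    prefix = model_name.split("_", 1)[0]
    if prefix not in LAYER_ORDER:
        raise ValueError(f"unknown layer prefix in model name: {model_name}")
    return prefix


def violations(deps: dict[str, set[str]]) -> list[tuple[str, str]]:
    """List (model, parent) pairs where a model reads from a later layer."""
    bad = []
    for model, parents in deps.items():
        for parent in parents:
            model_rank = LAYER_ORDER.index(layer_of(model))
            parent_rank = LAYER_ORDER.index(layer_of(parent))
            if parent_rank > model_rank:
                bad.append((model, parent))
    return sorted(bad)
```

A staging model that selects from a mart would show up in the returned list, which is usually a sign that business logic has leaked backwards.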


MVP 2: Events as the Spine

Circulate uses Cognito for authentication. The Cognito sub is the only reliable identifier for users across systems. That made it obvious: we needed events, not just batch exports.

We set up AWS Kinesis Data Streams and started emitting events for everything important:

{
  "event": "user_created",
  "cognito_sub": "...",
  "timestamp": "...",
  "payload": { }
}

This became the backbone of the platform. Not tables. Not APIs. Events.

Those events land in the raw layer, append-only, forever.


MVP 3: Orchestration Without an Orchestrator

This is where many teams introduce Airflow or Dagster OSS. I didn’t.

I used AWS Step Functions.

Step Functions trigger:

  1. DLT ingestion for each source
  2. A full DBT run across the warehouse

No scheduler to maintain. No metadata database. No container state to worry about. Just a state machine describing the flow of data products.
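The flow above can be sketched as an Amazon States Language definition built in Python. This is an illustration, not the actual Circulate state machine: the state names, the retry settings, running ingestion branches in parallel, and the resource ARNs passed in by the caller are all assumptions.

```python
import json


def pipeline_definition(sources: list[str], ingest_resource: str, dbt_resource: str) -> dict:
    """Build a state machine: ingest every source, then run the transformations."""
    branches = [
        {
            "StartAt": f"Ingest_{source}",
            "States": {
                f"Ingest_{source}": {
                    "Type": "Task",
                    "Resource": ingest_resource,  # placeholder ARN
                    "Parameters": {"source": source},
                    # Assumed retry policy: two more attempts with backoff.
                    "Retry": [
                        {
                            "ErrorEquals": ["States.ALL"],
                            "IntervalSeconds": 30,
                            "MaxAttempts": 2,
                            "BackoffRate": 2.0,
                        }
                    ],
                    "End": True,
                }
            },
        }
        for source in sources
    ]
    return {
        "Comment": "Ingest each source, then run all transformations",
        "StartAt": "IngestAllSources",
        "States": {
            # The Parallel state finishes only when every branch has succeeded.
            "IngestAllSources": {
                "Type": "Parallel",
                "Branches": branches,
                "Next": "DbtRun",
            },
            "DbtRun": {"Type": "Task", "Resource": dbt_resource, "End": True},
        },
    }


def to_json(definition: dict) -> str:
    """Serialize the definition for deployment tooling."""
    return json.dumps(definition, indent=2)
```

The dependency between ingestion and transformation lives in the definition itself (`"Next": "DbtRun"`), so there is no separate scheduler state to keep in sync.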

At this stage, it became clear: many orchestration tools solve operational problems you only get once you’re much larger.


MVP 4: The Lambda Wall

We initially tried Lambda for execution. It quickly became painful.

The friction:

  • 15-minute timeout ceiling (DBT runs took longer)
  • 250 MB unzipped limit on zip deployment packages (DLT dependencies exceeded this)
  • Cold starts adding 10-30 seconds to every run

The pivot: Move everything to Fargate.

DLT runs in containers. DBT runs in containers. Step Functions trigger ECS tasks. Predictable, boring, and easy to reason about.

Exactly what we needed.


MVP 5: Make It Usable

A data platform nobody can use is just an internal engineering project.

After reading about how Voi approached internal analytics, we set up Steep. Then I migrated all my Jupyter scripts into proper DBT models.

The metric that mattered:

What used to take 3 days of manual work—exporting data, transforming it, correcting it, explaining it—became 1 hour of structured queries.

Anyone in the company could now explore the same data I was using. The trust was immediate because the numbers were reproducible.


The Stack

Layer            Tool
Ingestion        DLT Hub
Storage          S3 + Athena
Transformations  DBT (medallion structure)
Events           Kinesis Data Streams
Orchestration    Step Functions
Execution        Fargate
Exploration      Steep

No Airflow. No Dagster. No heavy platform engineering.

Just AWS primitives assembled with the right mental model.


What Actually Changed

Before: Data lived in people. Numbers required explanations. Every analysis was bespoke and fragile.

After: Data lived in the platform. Numbers were reproducible. Insights were built on explicit dependencies. New data sources plugged into an existing system instead of creating new side processes.

The biggest shift wasn’t technical. Data stopped being a project and became infrastructure.


The Thing I Had Misunderstood

I used to think the hard question was: How do we get data into a warehouse?

It isn’t.

The hard question is: How do we make it easy to build reliable data products on top of raw data, with clear dependencies and the ability to rebuild everything at any time?

Once you understand that, the architecture becomes straightforward.

Three weeks from first commit to production. Every number in the company now traces back to a single source of truth.