Stage, Then Transform: Why Your Crawler Should Never Touch Production Tables

When you pull data from an external source (a public register, a CSV feed, a scraped site), there is one decision that quietly determines whether the pipeline will still be working in six months. It is not “which library”, “which schedule”, or “which database”. It is: where does the data land first?

Most pipelines answer that question badly. They crawl a source, massage the rows in memory, and write straight into the production table the app already reads from. It works on day one. It breaks on day ninety.

This post lays out the pattern I wish I had internalised before I wrote my first crawler, illustrated with the pipeline we are building right now at Propi: matching first-time buyers to UK conveyancers using public SRA and Law Society data.
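
To make the title concrete before going further, here is a minimal sketch of what "stage, then transform" could look like, using Python's built-in sqlite3 module. Everything named here is an assumption for illustration, not the actual Propi schema: the staging_raw and firms tables, the land_raw and transform functions, and the "id" and "name" fields in the payload are all hypothetical. The point it shows is that the crawler only ever writes verbatim payloads to a staging table, and a separate step decides what reaches the table the app reads from.

```python
import json
import sqlite3
from datetime import datetime, timezone


def land_raw(conn, source, records):
    """Crawler side: store each record verbatim in a staging table.

    No cleaning happens here. Table and column names are hypothetical.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS staging_raw (
               id INTEGER PRIMARY KEY,
               source TEXT NOT NULL,
               fetched_at TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    fetched_at = datetime.now(timezone.utc).isoformat()
    rows = [(source, fetched_at, json.dumps(record)) for record in records]
    with conn:  # commit all staged rows together, or none of them
        conn.executemany(
            "INSERT INTO staging_raw (source, fetched_at, payload) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


def transform(conn):
    """Separate step: read staged payloads, clean them, write to production.

    Rows that fail validation are skipped but remain in staging_raw,
    so they can be inspected or reprocessed later.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS firms (firm_id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    staged = conn.execute("SELECT payload FROM staging_raw ORDER BY id").fetchall()
    promoted = 0
    with conn:
        for (payload,) in staged:
            record = json.loads(payload)
            firm_id = record.get("id")  # hypothetical field name
            name = (record.get("name") or "").strip()  # hypothetical field name
            if not firm_id or not name:
                continue  # invalid row: leave it in staging only
            conn.execute(
                "INSERT OR REPLACE INTO firms (firm_id, name) VALUES (?, ?)",
                (str(firm_id), name),
            )
            promoted += 1
    return promoted
```

With this split, a crawl that returns malformed rows still succeeds at landing them, and the production table only changes when the transform step accepts a row. Rerunning transform against the same staged data gives the same result, which is what makes a bad transform recoverable without crawling again.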

The shortcut that bites
