Stage, Then Transform: Why Your Crawler Should Never Touch Production Tables
When you pull data from an external source (a public register, a CSV feed, a scraped site), there is one decision that quietly determines whether the pipeline will still be working in six months. It is not “which library”, “which schedule”, or “which database”. It is: where does the data land first?
Most pipelines answer that question badly. They crawl a source, massage the rows in memory, and write straight into the production table the app already reads from. It works on day one. It breaks on day ninety.
This post is the pattern I wish I had internalised before I wrote my first crawler, explained with the pipeline we are building right now at Propi: matching first-time buyers to UK conveyancers using public SRA and Law Society data.



