SaleSpotter: A Technical Retrospective
Reflecting on the architecture, successes, and hard-learned lessons of building a large-scale retail intelligence platform.
Technical Solution Overview
SaleSpotter is a retail data platform designed to track and analyze price trends and promotions across major Dutch supermarket and drugstore chains (Etos, Jumbo, Albert Heijn, Kruidvat). The system is built on a modular, decoupled architecture consisting of three primary components.
1. Data Scraper (TypeScript/Playwright)
We implemented a Bronze -> Silver -> Gold data pipeline architecture to scrape, process, and store product data.
- Bronze (Raw Data): Fetching raw HTML/JSON from the source and saving it to disk (Gzip). Logic handles rate limiting, cookies, and network retries.
- Silver (Extracted Data): Parsing the raw Bronze files using Cheerio and Zod to extract structured data (prices, titles, etc.) and validate the schema.
- Gold (Normalized Data): Aggregating data from all scrapers, validating quality, and loading into the database via a separate service that watches for new Silver data.
Performance Metrics
Our scrapers achieved significant throughput during production runs:
- Etos: ~10 products/min throughput, with 2,300+ products successfully extracted in a single session.
- Kruidvat: 7,300+ products discovered across 15 categories, with a paced acquisition rate of ~8s per product to minimize detection.
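The ~8s-per-product pacing above can be approximated with a simple scheduler that spaces requests by a base interval plus random jitter, so traffic does not arrive on a metronome-regular beat. The function name and default numbers are illustrative:

```typescript
// Compute staggered start times (ms offsets) for a batch of fetches,
// spacing them ~8s apart with up to ±2s of random jitter.
function paceSchedule(
  count: number,
  baseMs = 8_000,
  jitterMs = 2_000,
  rand: () => number = Math.random,
): number[] {
  const offsets: number[] = [];
  let t = 0;
  for (let i = 0; i < count; i++) {
    offsets.push(t);
    t += baseMs + (rand() * 2 - 1) * jitterMs; // baseMs ± jitterMs
  }
  return offsets;
}
```

Each offset feeds a delay before the next fetch; at ~8s per product, a 7,300-product run spreads over roughly 16 hours, which is the cost of the slow, human-mimicry pacing described above.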
Technological Evolution (Lessons from the Trenches)
The scraper evolved through several experimental phases as we adapted to anti-bot measures:
- mTLS + Crawlee: Initial attempt using custom TLS fingerprinting. While technically elegant, it proved too slow for the required scale and was eventually detected anyway.
- Crawl4AI + Markdown: Shifted to fetching content as Markdown to simplify the Silver parsing layer. This enabled us to scrape ~3,000 products before hitting significant blocks.
- LLM-Based Silver Phase: Our most resilient parsing approach used gpt-4o-mini to extract structured attributes from Markdown, significantly increasing accuracy for dynamic and inconsistent pages where traditional CSS selectors failed.
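Whichever extraction method fed the Silver phase, its output had to pass a schema gate before promotion to Gold. In production that gate was Zod; the minimal hand-rolled validator below stands in for it so the sketch has no dependencies, and the field names are illustrative:

```typescript
interface SilverProduct {
  title: string;
  priceCents: number; // integer cents avoids floating-point drift on prices
  url: string;
}

// Validate a loosely-typed record extracted from Bronze HTML against the
// Silver schema; returns null instead of throwing on malformed rows.
function toSilver(raw: Record<string, unknown>): SilverProduct | null {
  const { title, priceCents, url } = raw;
  if (typeof title !== "string" || title.length === 0) return null;
  if (
    typeof priceCents !== "number" ||
    !Number.isInteger(priceCents) ||
    priceCents < 0
  ) return null;
  if (typeof url !== "string" || !url.startsWith("https://")) return null;
  return { title, priceCents, url };
}
```

Rejected rows are counted and logged rather than dropped silently, so a sudden spike in rejects flags a changed page layout before bad data reaches the Gold layer.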
2. Backend Service (Go)
The application follows a layered architecture using Go for high performance and modularity:
- Frameworks: Built on Gin for routing and Uber FX for robust dependency injection.
- API Architecture: Utilizes Huma for OpenAPI-driven handlers, ensuring every endpoint is well-documented and type-safe from the start.
- Persistence: Managed through GORM models interacting with a MySQL database.
- Connectivity: Integrated with Google OAuth for secure authentication and directly with the Albert Heijn API for efficient weekly product updates.
3. Infrastructure (Terraform/Ansible)
The production environment is hosted on Hetzner Cloud, managed entirely through infrastructure-as-code:
- Provisioning: Terraform for reproducible environment setup (Load Balancers, DB Servers, App Servers).
- Configuration: Ansible playbooks automate everything from private network NAT gateway configuration to security hardening (Fail2Ban, Auditd).
- Observability: A full monitoring stack including Prometheus, Grafana, Loki, and Promtail for real-time log collection and alerting.
- CI/CD: Integrated GitHub Actions workflows for automated Docker-based deployments.
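The provisioning layer above can be sketched with the Hetzner Cloud Terraform provider; resource names, server types, and the network range here are illustrative placeholders, not the production values:

```hcl
resource "hcloud_network" "private" {
  name     = "salespotter-net"
  ip_range = "10.0.0.0/16"
}

resource "hcloud_server" "app" {
  name        = "app-1"
  server_type = "cx22"
  image       = "ubuntu-24.04"
  location    = "fsn1"
}

resource "hcloud_load_balancer" "lb" {
  name               = "app-lb"
  load_balancer_type = "lb11"
  location           = "fsn1"
}
```

Ansible then takes over from the Terraform outputs, configuring the NAT gateway, hardening (Fail2Ban, Auditd), and the monitoring agents on each provisioned host.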
Why It "Failed" (Technical Challenges)
Despite the robust engineering, external factors presented significant bottlenecks:
- Anti-Bot Sophistication: Sites like Kruidvat utilize DataDome. Even with Playwright Stealth, large-scale runs required expensive residential proxies or slow, human-mimicry pacing to avoid progressive blocking.
- Dynamic Web Components: Reliance on Shadow DOM and AJAX-hydrated content meant standard scraping often missed pricing data without complex network interception or visual waiting.
- Data Volume vs. Cost: Fetching full, uncompressed HTML for thousands of items forced a trade-off between scraper speed and proxy bandwidth costs.
- Maintenance Burden: Frequent changes in supermarket HTML required constant adjustment of selectors.