SaleSpotter: A Technical Retrospective
Reflecting on the architecture, successes, and hard-learned lessons of building a large-scale retail intelligence platform.
Technical Solution Overview
SaleSpotter is a retail data platform designed to track and analyze price trends and promotions across major Dutch supermarket and drugstore chains (Etos, Jumbo, Albert Heijn, Kruidvat). The system is built on a modular, decoupled architecture consisting of three primary components.
1. Data Scraper (TypeScript/Playwright)
We implemented a Bronze -> Silver -> Gold data pipeline architecture to scrape, process, and store product data.
- Bronze (Raw Data): Fetching raw HTML/JSON from the source and saving it to disk (Gzip). Logic handles rate limiting, cookies, and network retries.
- Silver (Extracted Data): Parsing the raw Bronze files using Cheerio and Zod to extract structured data (prices, titles, etc.) and validate the schema.
- Gold (Normalized Data): Aggregating data from all scrapers, validating quality, and loading into the database via a separate service that watches for new Silver data.
Performance Metrics
Our scrapers achieved significant throughput during production runs:
- Etos: ~10 products/min throughput, with 2,300+ products successfully extracted in a single session.
- Kruidvat: 7,300+ products discovered across 15 categories, with a paced acquisition rate of ~8s per product to minimize detection.
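The ~8s-per-product pacing above can be approximated with a simple scheduler that spaces requests by a base interval plus random jitter, so traffic does not arrive on a metronome-regular beat. The function name and default numbers are illustrative:

```typescript
// Compute staggered start times (ms offsets) for a batch of fetches,
// spacing them ~8s apart with up to ±2s of random jitter.
function paceSchedule(
  count: number,
  baseMs = 8_000,
  jitterMs = 2_000,
  rand: () => number = Math.random,
): number[] {
  const offsets: number[] = [];
  let t = 0;
  for (let i = 0; i < count; i++) {
    offsets.push(t);
    t += baseMs + (rand() * 2 - 1) * jitterMs; // baseMs ± jitterMs
  }
  return offsets;
}
```

Each offset feeds a delay before the next fetch; at ~8s per product, a 7,300-product run spreads over roughly 16 hours, which is the cost of the slow, human-mimicry pacing described above.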
Technological Evolution (Lessons from the Trenches)
The scraper evolved through several experimental phases as we adapted to anti-bot measures:
- mTLS + Crawlee: Initial attempt using custom TLS fingerprinting. While technically elegant, it proved too slow for the required scale and was eventually detected anyway.
- Crawl4AI + Markdown: Shifted to fetching content as Markdown to simplify the Silver parsing layer. This enabled us to scrape ~3,000 products before hitting significant blocks.
- LLM-Based Silver Phase: Our most resilient parsing approach used gpt-4o-mini to extract structured attributes from Markdown, significantly increasing accuracy for dynamic and inconsistent pages where traditional CSS selectors failed.
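Whichever extraction method fed the Silver phase, its output had to pass a schema gate before promotion to Gold. In production that gate was Zod; the minimal hand-rolled validator below stands in for it so the sketch has no dependencies, and the field names are illustrative:

```typescript
interface SilverProduct {
  title: string;
  priceCents: number; // integer cents avoids floating-point drift on prices
  url: string;
}

// Validate a loosely-typed record extracted from Bronze HTML against the
// Silver schema; returns null instead of throwing on malformed rows.
function toSilver(raw: Record<string, unknown>): SilverProduct | null {
  const { title, priceCents, url } = raw;
  if (typeof title !== "string" || title.length === 0) return null;
  if (
    typeof priceCents !== "number" ||
    !Number.isInteger(priceCents) ||
    priceCents < 0
  ) return null;
  if (typeof url !== "string" || !url.startsWith("https://")) return null;
  return { title, priceCents, url };
}
```

Rejected rows are counted and logged rather than dropped silently, so a sudden spike in rejects flags a changed page layout before bad data reaches the Gold layer.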
2. Backend Service (Go)
The application follows a layered architecture using Go for high performance and modularity:
- Frameworks: Built on Gin for routing and Uber FX for robust dependency injection.
- API Architecture: Utilizes Huma for OpenAPI-driven handlers, ensuring every endpoint is well-documented and type-safe from the start.
- Persistence: Managed through GORM models interacting with a MySQL database.
- Connectivity: Integrated with Google OAuth for secure authentication and directly with the Albert Heijn API for efficient weekly product updates.
3. Infrastructure (Terraform/Ansible)
The production environment is hosted on Hetzner Cloud, managed entirely through infrastructure-as-code:
- Provisioning: Terraform for reproducible environment setup (Load Balancers, DB Servers, App Servers).
- Configuration: Ansible playbooks automate everything from private network NAT gateway configuration to security hardening (Fail2Ban, Auditd).
- Observability: A full monitoring stack including Prometheus, Grafana, Loki, and Promtail for real-time log collection and alerting.
- CI/CD: Integrated GitHub Actions workflows for automated Docker-based deployments.
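The provisioning layer above can be sketched with the Hetzner Cloud Terraform provider; resource names, server types, and the network range here are illustrative placeholders, not the production values:

```hcl
resource "hcloud_network" "private" {
  name     = "salespotter-net"
  ip_range = "10.0.0.0/16"
}

resource "hcloud_server" "app" {
  name        = "app-1"
  server_type = "cx22"
  image       = "ubuntu-24.04"
  location    = "fsn1"
}

resource "hcloud_load_balancer" "lb" {
  name               = "app-lb"
  load_balancer_type = "lb11"
  location           = "fsn1"
}
```

Ansible then takes over from the Terraform outputs, configuring the NAT gateway, hardening (Fail2Ban, Auditd), and the monitoring agents on each provisioned host.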
Why It "Failed" (Technical Challenges)
Despite the robust engineering, external factors presented significant bottlenecks:
- Anti-Bot Sophistication: Sites like Kruidvat utilize DataDome. Even with Playwright Stealth, large-scale runs required expensive residential proxies or slow, human-mimicry pacing to avoid progressive blocking.
- Dynamic Web Components: Reliance on Shadow DOM and AJAX-hydrated content meant standard scraping often missed pricing data without complex network interception or visual waiting.
- Data Volume vs. Cost: Fetching full, uncompressed HTML for thousands of items forced a trade-off between scraper speed and proxy bandwidth costs.
- Maintenance Burden: Frequent changes in supermarket HTML required constant adjustment of selectors.