Reflecting on the architecture, successes, and hard-learned lessons of building a large-scale retail intelligence platform.

Technical Solution Overview

SaleSpotter is a retail data platform designed to track and analyze price trends and promotions across major Dutch retail chains (Etos, Jumbo, Albert Heijn, Kruidvat). The system is built on a modular, decoupled architecture consisting of three primary components.

1. Data Scraper (TypeScript/Playwright)

We implemented a Bronze -> Silver -> Gold data pipeline architecture to scrape, process, and store product data.

  • Bronze (Raw Data): Fetching raw HTML/JSON from the source and saving it to disk gzip-compressed. This layer handles rate limiting, cookies, and network retries.
  • Silver (Extracted Data): Parsing the raw Bronze files with Cheerio and validating the extracted fields (prices, titles, etc.) against a Zod schema (see the sketch after this list).
  • Gold (Normalized Data): Aggregating data from all scrapers, validating quality, and loading into the database via a separate service that watches for new Silver data.
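
To make the pipeline concrete, here is a minimal sketch of the Bronze and Silver phases. The retry policy and the `.product-title`/`.price` selectors are hypothetical stand-ins, not the production code:

```typescript
import { readFile, writeFile } from "node:fs/promises";
import { gzipSync, gunzipSync } from "node:zlib";
import * as cheerio from "cheerio";
import { z } from "zod";

// Bronze: fetch the raw page and persist it gzip-compressed, with basic retries.
async function fetchBronze(url: string, outPath: string, retries = 3): Promise<void> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      await writeFile(outPath, gzipSync(await res.text()));
      return;
    } catch (err) {
      if (attempt === retries) throw err;
      await new Promise((r) => setTimeout(r, 1_000 * attempt)); // simple linear backoff
    }
  }
}

// Silver: the structured record we expect, enforced by Zod.
const ProductSchema = z.object({
  title: z.string().min(1),
  priceCents: z.number().int().nonnegative(),
});
type Product = z.infer<typeof ProductSchema>;

// Silver: parse a Bronze snapshot with Cheerio and validate it against the schema.
async function parseSilver(bronzePath: string): Promise<Product> {
  const html = gunzipSync(await readFile(bronzePath)).toString("utf-8");
  const $ = cheerio.load(html);
  const priceText = $(".price").first().text().trim(); // e.g. "2,99"
  return ProductSchema.parse({
    title: $(".product-title").first().text().trim(),
    priceCents: Math.round(parseFloat(priceText.replace(",", ".")) * 100),
  }); // throws if the page layout has drifted, keeping bad rows out of Silver
}
```

Because Bronze snapshots are immutable, a layout change only breaks the Silver parser; the raw pages can always be re-parsed once the selectors are fixed.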

Performance Metrics

In production runs, the scrapers sustained the following throughput:

  • Etos: ~10 products/min throughput, with 2,300+ products successfully extracted in a single session.
  • Kruidvat: 7,300+ products discovered across 15 categories, acquired at a deliberate pace of ~8s per product to minimize detection (see the pacing sketch below).
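
The "paced acquisition" above amounts to a randomized delay between product fetches, so request timing never looks machine-regular. A minimal sketch, assuming a ±2s jitter window around the ~8s base (the jitter range and the `scrapeProduct` helper are hypothetical):

```typescript
// Hypothetical per-product scraper; stands in for the real Playwright logic.
declare function scrapeProduct(url: string): Promise<void>;

// Sleep ~8s with random jitter on either side of the base interval.
function pacedDelay(baseMs = 8_000, jitterMs = 2_000): Promise<void> {
  const wait = baseMs + (Math.random() * 2 - 1) * jitterMs; // 6s-10s
  return new Promise((resolve) => setTimeout(resolve, wait));
}

async function scrapeAll(productUrls: string[]): Promise<void> {
  for (const url of productUrls) {
    await scrapeProduct(url);
    await pacedDelay();
  }
}
```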

Technological Evolution (Lessons from the Trenches)

The scraper evolved through several experimental phases as we adapted to anti-bot measures:

  • mTLS + Crawlee: Our initial attempt, built around custom TLS fingerprinting. While technically elegant, it proved too slow for the scale required and eventually became detectable anyway.
  • Crawl4AI + Markdown: Shifted to fetching content as Markdown to simplify the Silver parsing layer. This enabled us to scrape ~3,000 products before hitting significant blocks.
  • LLM-Based Silver Phase: Our most resilient parsing approach used gpt-4o-mini to extract structured attributes from the Markdown (see the sketch below), which significantly improved accuracy on dynamic, inconsistent pages where traditional CSS selectors failed.
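
As an illustration of the LLM-based phase, extraction with the openai SDK's structured-output helper might look like the sketch below; the prompt and schema fields are assumptions, not our production setup:

```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical attribute set; the real Silver schema carried more fields.
const ProductSchema = z.object({
  title: z.string(),
  price: z.number(),
  promotion: z.string().nullable(),
});

async function extractFromMarkdown(markdown: string) {
  const completion = await client.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Extract the product attributes from this product-page Markdown." },
      { role: "user", content: markdown },
    ],
    // Constrains the model output to the Zod schema.
    response_format: zodResponseFormat(ProductSchema, "product"),
  });
  return completion.choices[0].message.parsed; // schema-typed object, or null on refusal
}
```

The model keeps extracting correctly when class names churn; the trade-off is a per-page token cost on top of the proxy costs discussed below.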

2. Backend Service (Go)

The application follows a layered architecture using Go for high performance and modularity:

  • Frameworks: Built on Gin for routing and Uber FX for robust dependency injection.
  • API Architecture: Utilizes Huma for OpenAPI-driven handlers, ensuring every endpoint is well-documented and type-safe from the start.
  • Persistence: Managed through GORM models interacting with a MySQL database.
  • Connectivity: Integrated with Google OAuth for secure authentication and directly with the Albert Heijn API for efficient weekly product updates.

3. Infrastructure (Terraform/Ansible)

The production environment is hosted on Hetzner Cloud, managed entirely through infrastructure-as-code:

  • Provisioning: Terraform for reproducible environment setup (Load Balancers, DB Servers, App Servers).
  • Configuration: Ansible playbooks automate everything from private network NAT gateway configuration to security hardening (Fail2Ban, Auditd).
  • Observability: A full monitoring stack including Prometheus, Grafana, Loki, and Promtail for real-time log collection and alerting.
  • CI/CD: Integrated GitHub Actions workflows for automated Docker-based deployments.

Why It "Failed" (Technical Challenges)

Despite the robust engineering, external factors presented significant bottlenecks:

  1. Anti-Bot Sophistication: Sites like Kruidvat utilize DataDome. Even with Playwright Stealth, large-scale runs required expensive residential proxies or slow, human-mimicry pacing to avoid progressive blocking.
  2. Dynamic Web Components: Heavy reliance on Shadow DOM and AJAX-hydrated content meant standard DOM scraping often missed pricing data unless we added network interception (see the sketch after this list) or waited on visual rendering cues.
  3. Data Volume vs. Cost: Ingesting uncompressed HTML for thousands of items meant every gain in scraper speed translated directly into higher proxy bandwidth costs.
  4. Maintenance Burden: Frequent changes in supermarket HTML required constant adjustment of selectors.
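
For the second challenge, the practical workaround is to read prices off the page's own network traffic instead of the rendered DOM. A minimal Playwright sketch; the `/api/` URL filter is a hypothetical placeholder for the real pricing endpoints:

```typescript
import { chromium } from "playwright";

// Collect JSON payloads from the page's XHR/fetch traffic while it hydrates.
async function interceptPrices(productUrl: string): Promise<unknown[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const payloads: unknown[] = [];

  page.on("response", async (response) => {
    const type = response.request().resourceType();
    if ((type === "xhr" || type === "fetch") && response.url().includes("/api/")) {
      try {
        payloads.push(await response.json()); // pricing data arrives here, not in the HTML
      } catch {
        // Non-JSON body; ignore.
      }
    }
  });

  await page.goto(productUrl, { waitUntil: "networkidle" });
  await page.waitForTimeout(500); // let trailing response handlers finish
  await browser.close();
  return payloads;
}
```

Playwright's CSS locators also pierce open Shadow DOM automatically, which covers the web-component cases above as long as the shadow roots are not closed.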