Automated E-Commerce Promo Monitoring with LLM Extraction — promoPulse (Open Source Dataset)
Tags: E-Commerce · Data Pipeline · LLM Extraction · Open Source · Kaggle
The problem: promotional data is fragmented, stale, and unstructured
E-commerce teams and data scientists share the same frustration with promotional intelligence: there is no reliable, structured source of what competitors are running today. Deal pages are dynamic, expose nothing but raw HTML, and change format without notice.
Manual monitoring doesn't scale. Five retailers, daily deal cycles, seasonal campaigns, flash sales — that's hundreds of promotions per day, each with a different discount structure, expiration logic, and format. Some have coupon codes, some percentage discounts, some are BOGO, some are free shipping with a minimum order. Normalizing all of that by hand is not a pipeline; it's a full-time job.
For data scientists and ML engineers, the problem is different but adjacent: training data for e-commerce entity extraction, promotion classification, and discount modeling is either paywalled, poorly structured, or not updated frequently enough to reflect real market behavior. What's missing is a transparent, daily-updated dataset with documented methodology, verified source URLs, and consistent schema — available in CSV, JSON, and Parquet.
The solution: multi-source scraping + LLM-powered structured extraction
promoPulse is a daily-updated dataset of promotional offers, coupons, and deals from major US e-commerce retailers. It is also a live demonstration of the Indext Data Lab extraction pipeline — fully documented, with every record traceable to its original source URL.
The pipeline runs in four stages:
Multi-source fetching. Content is retrieved from public deal pages using three scraping APIs in parallel — Jina, Tavily, and Firecrawl. Multiple sources per site improve coverage and provide redundancy when one API returns incomplete content.
LLM-powered extraction. Raw page text is processed by GPT-4o-mini and Llama-3.3-70B to identify structured fields: title, promo code, discount value, discount type, expiration date, and description. LLMs handle the format variation that makes rule-based extraction brittle.
Deduplication. Two-key logic runs at both daily and cumulative history levels. Primary key when a promo code exists: (source_site, promo_code). Fallback: (source_site, title, source_url, valid_until). Same-day re-runs merge without duplicating records.
Validation and normalization. Discount values are normalized to a consistent numeric format. Promo codes are verified against raw page content. JSON output is validated with a retry mechanism on extraction failures.
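The deduplication and promo-code verification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the function names are hypothetical, records are assumed to be plain dicts using the field names listed in the dataset schema, and the "latest record wins" merge rule and the case-insensitive whole-token match are assumptions.

```python
from __future__ import annotations

import re
from typing import Iterable


def dedup_key(record: dict) -> tuple:
    """Build the two-level deduplication key for one promo record."""
    code = (record.get("promo_code") or "").strip()
    if code:
        # Primary key when a promo code exists: (source_site, promo_code).
        return ("code", record["source_site"], code)
    # Fallback key: (source_site, title, source_url, valid_until).
    return (
        "no_code",
        record["source_site"],
        record.get("title"),
        record.get("source_url"),
        record.get("valid_until"),
    )


def merge_records(existing: Iterable[dict], incoming: Iterable[dict]) -> list[dict]:
    """Merge a new run into stored records without creating duplicates.

    Assumption: when two records share a key, the most recent one wins.
    """
    merged: dict[tuple, dict] = {}
    for record in [*existing, *incoming]:
        merged[dedup_key(record)] = record
    return list(merged.values())


def code_appears_on_page(code: str, raw_text: str) -> bool:
    """Check that an extracted code occurs as a standalone token in the page text.

    Assumption: a case-insensitive whole-token match is an acceptable check.
    """
    pattern = rf"(?<![A-Za-z0-9]){re.escape(code)}(?![A-Za-z0-9])"
    return re.search(pattern, raw_text, flags=re.IGNORECASE) is not None
```

Calling merge_records(todays_records, rerun_records) a second time on the same day leaves the record count unchanged, which is the behavior described for same-day re-runs. The same function could be applied to the cumulative history.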
Stack: Jina · Tavily · Firecrawl · GPT-4o-mini · Llama-3.3-70B · Python · Parquet / CSV / JSON
The pipeline runs automatically every day at 08:00 UTC. Reliability over the last 35 days: 100%.
Results: 4,600+ records, 32 days of history, 5 retailers
Today's snapshot (2026-04-08):
Promotions found: 212
Sites scraped: 5
Coupon codes: 6
Pipeline status: OK
Per-site breakdown:
Data quality across key fields:
The low fill rates on promo_code and valid_until reflect real-world retailer behavior, not extraction failures. Most promotions are automatic checkout discounts with no published expiration.
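To check fill rates yourself, a short sketch like the following may help. It treats a field as filled when its value is a non-empty string, which is an assumption about how missing values appear in the CSV export. The sample rows are invented for illustration and are not taken from the dataset; fill_rates is a hypothetical helper name.

```python
import csv
import io


def fill_rates(rows, fields):
    """Return the share of rows with a non-empty value for each field."""
    rows = list(rows)
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if (row.get(field) or "").strip()) / len(rows)
        for field in fields
    }


# Invented sample in the shape of the CSV export (not real dataset rows).
SAMPLE_CSV = """title,promo_code,valid_until
10% off,SAVE10,2026-04-30
Free shipping,,
BOGO socks,,
Spring sale,,2026-05-01
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = fill_rates(sample_rows, ["title", "promo_code", "valid_until"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%}")
```

To run it against the real data, replace the in-memory sample with an open file handle on the current snapshot CSV.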
What you can build with this data
For e-commerce and marketing teams: benchmark your promotional frequency and discount depth against market leaders. Track which discount types competitors favor — percentage off, BOGO, free shipping, fixed amount. Identify seasonal cycles before they happen. The daily_stats.csv and site_stats.csv analytics files are ready for dashboarding without any preprocessing.
For data scientists and ML engineers: the dataset provides high-quality labeled training data for promotion entity extraction, discount classification, and e-commerce NLP models. The LLM-structured output with verified source URLs gives you full provenance on every record. The starter EDA notebook on Kaggle covers daily volume trends, discount type distribution, coupon frequency analysis, and day-of-week seasonality — ready to fork and run.
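As a starting point for this kind of analysis, here is a minimal standard-library sketch of two of the summaries mentioned above: discount type distribution and day-of-week seasonality. It is not the Kaggle notebook's code. It assumes records are dicts with the discount_type and collect_date fields, and that collect_date is an ISO date string (YYYY-MM-DD); both function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Fixed English names so the output does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def discount_type_counts(records):
    """Count promotions per discount type; missing types are grouped as 'unknown'."""
    return Counter(record.get("discount_type") or "unknown" for record in records)


def weekday_counts(records):
    """Count promotions per weekday of collection (assumes ISO dates)."""
    return Counter(
        WEEKDAYS[date.fromisoformat(record["collect_date"]).weekday()]
        for record in records
    )
```

Both functions return a Counter, so most_common() gives a ranked view that can go straight into a chart or a table.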
Dataset structure
dataset/
  current/    # Latest extraction snapshot (CSV, JSON, Parquet)
  history/    # Daily archives + full cumulative history
  analytics/  # daily_stats.csv, site_stats.csv
Each record includes: title · promo_code · discount_value · discount_type · source_site · source_url · valid_from · valid_until · description · collect_date
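If you prefer typed records over raw dicts, the ten fields can be mirrored in a small dataclass. This is a sketch under assumptions: the field names come from the list above, but the types are not documented here, so every field is treated as an optional string and empty values are mapped to None. PromoRecord is a hypothetical name, not part of the dataset.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PromoRecord:
    """One promotion row; all types are assumed, not documented."""
    title: Optional[str] = None
    promo_code: Optional[str] = None
    discount_value: Optional[str] = None
    discount_type: Optional[str] = None
    source_site: Optional[str] = None
    source_url: Optional[str] = None
    valid_from: Optional[str] = None
    valid_until: Optional[str] = None
    description: Optional[str] = None
    collect_date: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "PromoRecord":
        """Build a record from a dict, ignoring unknown keys and blanking empty strings."""
        known = {f.name for f in fields(cls)}
        cleaned = {
            key: (None if value in ("", None) else value)
            for key, value in raw.items()
            if key in known
        }
        return cls(**cleaned)
```

Rows read from the CSV or JSON files can be passed through PromoRecord.from_dict one at a time, which makes missing promo_code and valid_until values explicit as None.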
License: CC BY 4.0 — use freely for research, commercial analytics, or model training with attribution.
Download the dataset →
Kaggle: https://www.kaggle.com/datasets/indext-data-lab-ai/promos-dataset
Fork the starter EDA notebook to run your own analysis immediately.
Need a custom data pipeline?
promoPulse covers US retailers and is updated daily. If your business needs monitoring of specific competitors, a higher update frequency, additional geographies, or a fully integrated AI-driven extraction pipeline, Indext Data Lab builds these as custom data products.
Connect on LinkedIn → https://www.linkedin.com/company/indext-data-lab/