Kaggle Dataset: promoPulse (promotional offers, coupons and deals)

Automated E-Commerce Promo Monitoring with LLM Extraction — promoPulse (Open Source Dataset)

Tags: E-Commerce · Data Pipeline · LLM Extraction · Open Source · Kaggle

The problem: promotional data is fragmented, stale, and unstructured

E-commerce teams and data scientists share the same frustration with promotional intelligence: there is no reliable, structured source of what competitors are running today. Deal pages are rendered dynamically, arrive as raw HTML rather than structured data, and change format without notice.

Manual monitoring doesn't scale. Five retailers, daily deal cycles, seasonal campaigns, flash sales — that's hundreds of promotions per day, each with a different discount structure, expiration logic, and format. Some have coupon codes, some percentage discounts, some are BOGO, some are free shipping with a minimum order. Normalizing all of that by hand is not a pipeline; it's a full-time job.

For data scientists and ML engineers, the problem is different but adjacent: training data for e-commerce entity extraction, promotion classification, and discount modeling is either paywalled, poorly structured, or not updated frequently enough to reflect real market behavior. What's missing is a transparent, daily-updated dataset with documented methodology, verified source URLs, and a consistent schema, available in CSV, JSON, and Parquet.

The solution: multi-source scraping + LLM-powered structured extraction

promoPulse is a daily-updated dataset of promotional offers, coupons, and deals from major US e-commerce retailers. It is also a live demonstration of the Indext Data Lab extraction pipeline — fully documented, with every record traceable to its original source URL.

The pipeline runs in four stages:

Multi-source fetching. Content is retrieved from public deal pages using three scraping APIs in parallel — Jina, Tavily, and Firecrawl. Multiple sources per site improve coverage and provide redundancy when one API returns incomplete content.
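A minimal sketch of the parallel fetch, assuming each scraping API is wrapped in a plain Python callable that takes a URL and returns page text. The three wrappers below are hypothetical stand-ins, not the real Jina, Tavily, or Firecrawl client APIs, and keeping the longest non-empty response is an assumed selection rule rather than the documented one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

Fetcher = Callable[[str], str]


def fetch_all(url: str, fetchers: Dict[str, Fetcher]) -> Dict[str, Optional[str]]:
    """Call every fetcher for the same URL in parallel; a failure becomes None."""
    def safe_call(fetch: Fetcher) -> Optional[str]:
        try:
            return fetch(url)
        except Exception:
            return None  # one failing API should not sink the others

    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {name: pool.submit(safe_call, fn) for name, fn in fetchers.items()}
        return {name: fut.result() for name, fut in futures.items()}


def pick_best(results: Dict[str, Optional[str]]) -> Optional[str]:
    """Assumed heuristic: the longest non-empty response is the most complete."""
    candidates = [text for text in results.values() if text and text.strip()]
    return max(candidates, key=len) if candidates else None


# Hypothetical stand-ins for the three API wrappers (illustration only).
def fake_jina(url: str) -> str:
    return "20% off paper"


def fake_tavily(url: str) -> str:
    raise TimeoutError("simulated outage")


def fake_firecrawl(url: str) -> str:
    return "20% off paper with code SAVE20, ends Friday"


results = fetch_all(
    "https://example.com/deals",
    {"jina": fake_jina, "tavily": fake_tavily, "firecrawl": fake_firecrawl},
)
best = pick_best(results)
```

In this toy run the simulated Tavily outage yields None, and the longer Firecrawl text is kept over the shorter Jina text, which is the redundancy the stage is meant to provide.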

LLM-powered extraction. Raw page text is processed by GPT-4o-mini and Llama-3.3-70B to identify structured fields: title, promo code, discount value, discount type, expiration date, and description. LLMs handle the format variation that makes rule-based extraction brittle.

Deduplication. Records are deduplicated with one of two keys, both within each daily snapshot and across the cumulative history. When a promo code exists, the primary key is (source_site, promo_code); otherwise the fallback key is (source_site, title, source_url, valid_until). Same-day re-runs merge without duplicating records.
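The two-key rule can be sketched as follows. The field names come from the record schema; the helper names, the whitespace stripping, and the choice to let a later record replace an earlier one with the same key are assumptions made for illustration. The sample records are invented.

```python
from typing import Dict, Iterable, List, Tuple


def dedup_key(record: dict) -> Tuple:
    """Primary key when a promo code exists, otherwise the fallback key."""
    code = (record.get("promo_code") or "").strip()  # stripping is an assumption
    if code:
        return ("code", record["source_site"], code)
    return (
        "fallback",
        record["source_site"],
        record.get("title"),
        record.get("source_url"),
        record.get("valid_until"),
    )


def merge_records(existing: Iterable[dict], incoming: Iterable[dict]) -> List[dict]:
    """Merge two batches; a later record with the same key replaces the earlier one."""
    merged: Dict[Tuple, dict] = {}
    for record in list(existing) + list(incoming):
        merged[dedup_key(record)] = record
    return list(merged.values())


# Invented sample data: a morning run, then a same-day re-run with one new promo.
morning = [
    {"source_site": "example-shop.com", "promo_code": "SAVE50",
     "title": "50% off cards", "source_url": "https://example.com/a",
     "valid_until": None},
    {"source_site": "example-tools.com", "promo_code": None,
     "title": "Free delivery", "source_url": "https://example.com/b",
     "valid_until": "2026-04-30"},
]
rerun = [dict(r) for r in morning] + [
    {"source_site": "example-tools.com", "promo_code": None,
     "title": "Free delivery", "source_url": "https://example.com/b",
     "valid_until": "2026-05-31"},
]
history = merge_records(morning, rerun)
```

The re-run repeats both morning records and adds one promo that differs only in valid_until, so the merged history holds three records rather than five.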

Validation and normalization. Discount values are normalized to a consistent numeric format. Promo codes are verified against raw page content. JSON output is validated with a retry mechanism on extraction failures.
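A rough sketch of the three checks named above, under stated assumptions: call_llm is a hypothetical callable that takes page text and returns the model's reply as a string, the required fields are taken to be the three that are always filled in the dataset, and the promo code check is a simple case-insensitive substring match. The actual pipeline may differ on all three points.

```python
import json
import re
from typing import Callable, List, Optional

# Assumption: these three fields are required in every extracted record.
REQUIRED_FIELDS = {"title", "discount_type", "description"}


def normalize_discount(raw) -> Optional[float]:
    """Turn values such as '25%', '$10 off' or 15 into a float; None if no number."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    match = re.search(r"\d+(?:\.\d+)?", str(raw).replace(",", ""))
    return float(match.group()) if match else None


def code_on_page(code: Optional[str], page_text: str) -> bool:
    """A code counts as verified only if it appears in the raw page text."""
    return bool(code) and code.lower() in page_text.lower()


def extract_with_retry(
    call_llm: Callable[[str], str], page_text: str, max_attempts: int = 3
) -> List[dict]:
    """Ask for a JSON list of promos; retry when the reply fails validation."""
    for _ in range(max_attempts):
        reply = call_llm(page_text)
        try:
            promos = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        if isinstance(promos, list) and all(
            isinstance(p, dict) and REQUIRED_FIELDS.issubset(p) for p in promos
        ):
            return promos
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


# Hypothetical model stub: the first reply is broken, the second is valid.
replies = iter([
    "not json",
    '[{"title": "Spring sale", "discount_type": "percentage", '
    '"description": "25% off", "discount_value": "25%", "promo_code": "SPRING25"}]',
])


def fake_llm(page_text: str) -> str:
    return next(replies)


page = "Spring sale: 25% off with code SPRING25"
promos = extract_with_retry(fake_llm, page)
for promo in promos:
    promo["discount_value"] = normalize_discount(promo.get("discount_value"))
    if not code_on_page(promo.get("promo_code"), page):
        promo["promo_code"] = None  # drop codes the page does not contain
```

Checking each code against the raw text is a cheap guard against hallucinated codes: a code the model invents cannot appear on the page it was supposedly read from.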

Stack: Jina · Tavily · Firecrawl · GPT-4o-mini · Llama-3.3-70B · Python · Parquet / CSV / JSON

The pipeline runs automatically every day at 08:00 UTC. Reliability over the last 35 days: 100%.

Results: 4,600+ records, 32 days of history, 5 retailers

Today's snapshot (2026-04-08):

Promotions found   212
Sites scraped        5
Coupon codes         6
Pipeline status     OK

Per-site breakdown:

Site	Promos	Max Discount	Codes
officedepot.com	107	54.5%	1
ulta.com	52	100.0%	1
shutterfly.com	18	50.0%	4
1800flowers.com	12	50.0%	0
homedepot.com	7	30.0%	0

Data quality across key fields:

Field	Fill Rate	Note
title	100%	Always present
discount_type	100%	Always classified
description	100%	Always present
discount_value	52%	Not all promos have numeric value
valid_until	34%	Retailers often omit expiry dates
promo_code	3%	Most discounts are automatic

The low fill rates on promo_code and valid_until reflect real-world retailer behavior, not extraction failures. Most promotions are automatic checkout discounts with no published expiration.
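Fill rate here means the share of records in which a field is present and non-empty. A minimal sketch of that calculation, assuming records are loaded as a list of dictionaries and that None and the empty string both count as missing; the four sample records are invented and do not come from the dataset.

```python
from typing import Dict, Iterable, List


def fill_rates(records: List[dict], fields: Iterable[str]) -> Dict[str, float]:
    """Percentage of records where each field is present and non-empty."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        rates[field] = round(100 * filled / total, 1) if total else 0.0
    return rates


# Invented sample records for illustration.
sample = [
    {"title": "A", "discount_value": 20.0, "promo_code": "X1"},
    {"title": "B", "discount_value": None, "promo_code": None},
    {"title": "C", "discount_value": 5.0, "promo_code": ""},
    {"title": "D", "discount_value": None, "promo_code": None},
]
rates = fill_rates(sample, ["title", "discount_value", "promo_code"])
```

On this sample the function reports 100.0 for title, 50.0 for discount_value, and 25.0 for promo_code.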

What you can build with this data

For e-commerce and marketing teams: benchmark your promotional frequency and discount depth against market leaders. Track which discount types competitors favor — percentage off, BOGO, free shipping, fixed amount. Identify seasonal cycles before they happen. The daily_stats.csv and site_stats.csv analytics files are ready for dashboarding without any preprocessing.

For data scientists and ML engineers: the dataset provides high-quality labeled training data for promotion entity extraction, discount classification, and e-commerce NLP models. The LLM-structured output with verified source URLs gives you full provenance on every record. The starter EDA notebook on Kaggle covers daily volume trends, discount type distribution, coupon frequency analysis, and day-of-week seasonality — ready to fork and run.

Dataset structure

dataset/
    current/     # Latest extraction snapshot (CSV, JSON, Parquet)
    history/     # Daily archives + full cumulative history
    analytics/   # daily_stats.csv, site_stats.csv

Each record includes: title · promo_code · discount_value · discount_type · source_site · source_url · valid_from · valid_until · description · collect_date
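For typed access in Python, the record can be modeled as a dataclass. This is a sketch only: the field names are the ones listed above, but the types, the treatment of empty CSV cells as None, and the assumption that CSV column names match the field names exactly are not confirmed by the dataset documentation. The sample row is invented.

```python
import csv
import io
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PromoRecord:
    title: str
    promo_code: Optional[str]
    discount_value: Optional[float]
    discount_type: str
    source_site: str
    source_url: str
    valid_from: Optional[str]
    valid_until: Optional[str]
    description: str
    collect_date: str


def load_records(csv_text: str) -> List[PromoRecord]:
    """Parse CSV text into PromoRecord objects; empty cells become None."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        clean = {f.name: (row.get(f.name) or None) for f in fields(PromoRecord)}
        if clean["discount_value"] is not None:
            clean["discount_value"] = float(clean["discount_value"])
        records.append(PromoRecord(**clean))
    return records


# Invented sample row; real files live under dataset/current/ and dataset/history/.
sample_csv = (
    "title,promo_code,discount_value,discount_type,source_site,source_url,"
    "valid_from,valid_until,description,collect_date\n"
    "Sample deal,,15.0,percentage,example.com,https://example.com/deals,"
    ",,15% off sample items,2026-04-08\n"
)
records = load_records(sample_csv)
```

Keeping promo_code, discount_value, valid_from, and valid_until optional matches the fill rates reported above, where these fields are frequently empty.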

License: CC BY 4.0 — use freely for research, commercial analytics, or model training with attribution.

Download the dataset →

Kaggle: https://www.kaggle.com/datasets/indext-data-lab-ai/promos-dataset

Fork the starter EDA notebook to run your own analysis immediately.

Need a custom data pipeline?

promoPulse covers US retailers and is updated daily. If your business needs monitoring of specific competitors, a higher update frequency, additional geographies, or a fully integrated AI-driven extraction pipeline, Indext Data Lab builds these as custom data products.

Connect on LinkedIn → https://www.linkedin.com/company/indext-data-lab/