How to Classify 10M Pages with 95% Accuracy

Industry: SaaS / Market Intelligence
Developed for media research firms and entertainment news aggregators to automate cross-platform trend analysis and content curation.

Problem

Identifying SaaS offerings across the global web at scale is a "needle in a haystack" challenge that defies manual processing.
  • Extreme Volume: Categorizing 10 million pages manually would take years of human effort.
  • Linguistic Diversity: Web content is multilingual, making it difficult for standard scrapers to determine context and intent.
  • Resource Bottlenecks: Traditional synchronous scraping is too slow and resource-heavy for datasets of this magnitude.

Solution

A high-velocity, asynchronous AI pipeline that combines parallelized scraping with multi-layered linguistic validation.
  • Asynchronous Playwright Parser: Built a Playwright-based scraping engine that runs in parallel, processing 20–30 websites concurrently while maintaining an 80% efficiency rate.
  • AI-Driven Classification: Combined the OpenAI API with custom ML classifiers to distinguish SaaS from non-SaaS sites.
  • Global Language Detection: Implemented a specialized linguistic module achieving a 95% success rate in identifying page languages for better content contextualization.
  • Hybrid Validation Layer: Audited a pseudo-random sample of 300+ sites with automated validation to confirm a consistent 85% classification accuracy, used as the estimate for the entire 10-million-page dataset.
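
The parallel scraping step can be illustrated with a minimal Python sketch of bounded concurrency. It is not the project's actual code: the names scrape_all, fetch, and MAX_CONCURRENT are hypothetical, and the limit of 25 is simply an assumed midpoint of the 20–30 range quoted above. The fetch argument stands in for whatever coroutine loads a page.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import Optional

# Assumption: midpoint of the 20-30 concurrent sites mentioned in the text.
MAX_CONCURRENT = 25


async def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    limit: int = MAX_CONCURRENT,
) -> dict[str, Optional[str]]:
    """Fetch every URL with at most `limit` requests in flight at once.

    `fetch` is a hypothetical coroutine returning page HTML for a URL.
    Failed pages are recorded as None so one bad site cannot stop the batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> tuple[str, Optional[str]]:
        # The semaphore blocks here once `limit` workers are already active.
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                return url, None

    pairs = await asyncio.gather(*(worker(url) for url in urls))
    return dict(pairs)
```

In a real run, fetch would presumably wrap a Playwright page load (for example page.goto followed by page.content in the async API). Keeping it as a parameter lets the concurrency logic be exercised without launching a browser.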

Tech Stack

  • Scraping Engine: Playwright (Asynchronous/Parallel)
  • Intelligence Layer: OpenAI API & Custom ML Classifiers
  • Infrastructure: Resource-optimized server architecture for high-volume data handling
  • Validation Tools: ChatGPT-based automated auditing
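
The sampled audit described under Solution can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: sample_urls and accuracy_with_margin are hypothetical names, the seed is arbitrary, and the margin uses the normal approximation for a simple random sample.

```python
import math
import random


def sample_urls(urls: list[str], k: int = 300, seed: int = 42) -> list[str]:
    """Draw a reproducible pseudo-random sample of up to k URLs."""
    rng = random.Random(seed)  # fixed seed so the audit set can be regenerated
    return rng.sample(urls, min(k, len(urls)))


def accuracy_with_margin(
    predicted: dict[str, str],
    audited: dict[str, str],
    z: float = 1.96,
) -> tuple[float, float]:
    """Return (accuracy, margin of error) over the audited URLs.

    `predicted` holds pipeline labels and `audited` holds the labels assigned
    during validation. z = 1.96 corresponds to a 95% confidence level under
    the normal approximation.
    """
    n = len(audited)
    if n == 0:
        raise ValueError("audited sample is empty")
    correct = sum(1 for url, label in audited.items() if predicted.get(url) == label)
    accuracy = correct / n
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, margin
```

Under these assumptions, 255 correct labels out of 300 audited sites gives 85% accuracy with a margin of roughly ±4 percentage points, which indicates how far a 300-site sample can be trusted to speak for the full dataset.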

Results

  • Massive Scalability: Successfully processed 10 million pages, a volume that is infeasible to handle manually.
  • Cost Efficiency: Optimized server resource consumption, maximizing output while minimizing operational overhead.
  • Precision Intelligence: Delivered language identification with a 95% success rate and a SaaS database classified at 85% accuracy for market researchers.