The frustrating part
The API returns zero useful data. Not an error — just nothing. Your automation script is left staring at pixels with no idea what they are.
"What if we just trained a detector that understands UI element types — the way a human does?"
Dead end #1
Dead end #2
The pivot
The insight
HTML renders into exactly the pixels you expect, and the DOM knows every element's exact position. Instant ground truth. No human labeling needed.
python data_gen/generate.py --out datasets/ui_synth_v2 --n 3000 --seed 42Lesson learned
"93% accurate" sounds good in a research paper. In production automation, it means your agent fails on 1 in 14 interactions. That adds up fast.
match = find_by_text("screenshot.png", query="Sign in") if match: x, y = safe_click_point(match.bbox) print(f"Click at ({x}, {y})")Why EasyOCR over Tesseract
EasyOCR installs with a single pip command, supports 80+ languages, and returns word-level bounding boxes that line up perfectly with our detector output. Tesseract needs system packages and is fragile in automated environments.
"The API returns nothing. Your agent is blind. So you build it eyes."