Computer-Vision Fallback for Windows UI Automation — Local UI Locator (Open Source)
Tags: Computer Vision · Windows Automation · Open Source · Python
The problem: when native UI APIs go silent
Windows UI automation is built on accessibility APIs. Libraries like pywinauto and UIAutomation query the element tree of an application — finding a button by name, reading its state, clicking it programmatically. In theory, this works universally. In practice, it breaks constantly.
Custom-rendered controls in Electron apps, legacy Win32 with owner-draw, dynamically injected popups, and aggressively themed enterprise software often expose no accessibility tree at all. The API returns nothing — just a flat window of pixels. Your automation agent is blind.
The classic fallback is template matching: take a screenshot, find a known image, click its center. But template matching breaks the moment DPI, theme, or window scale changes. What you actually need is a model that understands types of UI elements — buttons, text fields, dropdowns — so it can locate them even when they look slightly different from training data. That model needs to run locally, add under 100 ms to each step, and install with a single pip install. Local UI Locator was built for exactly this gap.
The solution: YOLO11s + OCR + fuzzy matching
Local UI Locator is a Python library that provides a computer-vision fallback layer for Windows UI agents. It activates when the accessibility tree returns nothing, takes a screenshot, and returns actionable click coordinates.
The pipeline has four stages:
Element detection. A YOLO11s model runs on the screenshot and returns bounding boxes with element type and confidence score. It detects seven classes: button, textbox, checkbox, dropdown, icon, tab, menu_item.
Text reading. EasyOCR reads visible text within each detected bounding box. No system dependencies — pure pip-installable, supports 80+ languages.
Fuzzy matching. rapidfuzz.fuzz.token_set_ratio matches OCR output against the agent's query string. Handles word reordering, partial labels, and minor OCR substitutions robustly. An agent looking for "Sign in" will match a button labeled "Signin" or "Sign In".
Action verification. A before/after screenshot comparison, via pixel diff, OCR delta, or a combined mode, confirms the click actually had an effect.
```python
# Import path is an assumption: the library exposes find_by_text and
# safe_click_point, but the package name may differ in your install.
from local_ui_locator import find_by_text, safe_click_point

match = find_by_text("screenshot.png", query="Sign in")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
```
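`safe_click_point` turns a bounding box into a click target. A plausible minimal reimplementation (an assumption about its behavior, not the library's actual source) is the box center, clamped to the screen:

```python
def safe_click_point(bbox, screen_w=None, screen_h=None):
    """Center of an (x1, y1, x2, y2) box, optionally clamped to the screen
    so a box overhanging a monitor edge still yields a valid click target."""
    x1, y1, x2, y2 = bbox
    x = (x1 + x2) // 2
    y = (y1 + y2) // 2
    if screen_w is not None:
        x = max(0, min(x, screen_w - 1))
    if screen_h is not None:
        y = max(0, min(y, screen_h - 1))
    return x, y
```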
The library also ships a complete training pipeline. You can regenerate the synthetic dataset, retrain the model on your own UI styles, and evaluate — useful when your application uses a custom theme outside standard Windows 10/11 aesthetics.
Results: near-perfect detection across all seven classes
The model was trained on an NVIDIA RTX 5060 (8 GB, Blackwell) for 120 epochs with early stopping. The synthetic dataset of 3,000 Windows-style screenshots was generated entirely via HTML/CSS templates rendered with Playwright — no manual annotation.
| Metric | Value |
| --- | --- |
| mAP@50 | 0.989 |
| mAP@50–95 | 0.954 |
| Precision | 0.996 |
| Recall | 0.973 |
| CPU inference (M2 Pro) | 44–79 ms |
| GPU inference (RTX 5060) | 2–5 ms |
| Class | AP@50 |
| --- | --- |
| button | 0.9919 |
| textbox | 0.9771 |
| checkbox | 0.9864 |
| dropdown | 0.9829 |
| icon | 0.9950 |
| tab | 0.9950 |
| menu_item | 0.9915 |
This represents a meaningful improvement over the prior YOLOv8n baseline (mAP@50 of 0.93) — a 6-point absolute gain — while keeping CPU inference under 80 ms. For a fallback layer that fires only when the accessibility tree is empty, that latency is acceptable.
The library ships with a Gradio demo that lets you upload any screenshot, adjust confidence threshold, filter by element class, and search elements by text — useful for validating behavior on your specific application before wiring it into an agent.
Why these specific components
YOLO11s over YOLOv8n. The accuracy gain from upgrading the backbone was significant: mAP@50 went from 0.93 to 0.989. Inference time roughly doubled on CPU (~30 ms to ~60 ms), but for a fallback layer that activates only on API failure, 60 ms is a reasonable trade-off.
EasyOCR over Tesseract. EasyOCR installs via pip with zero system dependencies. Tesseract requires system package installation and can be fragile in CI/CD environments. EasyOCR also returns word-level bounding boxes that intersect cleanly with the detector output.
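The "intersect cleanly" step is plain box geometry: assign each OCR word box to the detected element that contains most of its area. A sketch with hypothetical names (not the library's API):

```python
def assign_words_to_elements(word_boxes, element_boxes, min_overlap=0.5):
    """Assign each OCR word box to the detected element covering most of it.
    Boxes are (x1, y1, x2, y2); overlap is measured as a fraction of word area."""
    def overlap_frac(word, elem):
        wx1, wy1, wx2, wy2 = word
        ex1, ey1, ex2, ey2 = elem
        ix = max(0, min(wx2, ex2) - max(wx1, ex1))
        iy = max(0, min(wy2, ey2) - max(wy1, ey1))
        word_area = max(1, (wx2 - wx1) * (wy2 - wy1))
        return (ix * iy) / word_area

    assignment = {}
    for w in word_boxes:
        best = max(element_boxes, key=lambda e: overlap_frac(w, e), default=None)
        if best is not None and overlap_frac(w, best) >= min_overlap:
            assignment[w] = best
    return assignment
```

Measuring overlap as a fraction of the *word* area (rather than IoU) is deliberate: a small word fully inside a large button should count as belonging to it.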
Synthetic data via Playwright. Rendering HTML/CSS templates with Playwright gives exact bounding box coordinates from DOM queries — no manual annotation needed. Domain randomization across themes, fonts, DPI scaling, and noise was sufficient to achieve production-grade accuracy on real Windows UI despite training on entirely synthetic images.
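Converting the DOM geometry into training labels is pure normalization. Playwright's `bounding_box()` returns `{'x', 'y', 'width', 'height'}` in pixels, while YOLO expects a normalized `class cx cy w h` line per element. A sketch (the class-id ordering here is an assumption):

```python
CLASS_IDS = {"button": 0, "textbox": 1, "checkbox": 2, "dropdown": 3,
             "icon": 4, "tab": 5, "menu_item": 6}  # ordering is an assumption

def to_yolo_label(box, cls, img_w, img_h):
    """Turn a Playwright-style bounding box ({'x','y','width','height'} in px)
    into a YOLO txt line: class cx cy w h, all normalized to [0, 1]."""
    cx = (box["x"] + box["width"] / 2) / img_w
    cy = (box["y"] + box["height"] / 2) / img_h
    w = box["width"] / img_w
    h = box["height"] / img_h
    return f"{CLASS_IDS[cls]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```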
Fuzzy matching via rapidfuzz. token_set_ratio handles partial label matches, word reordering, and minor OCR substitutions. Standard string equality would fail on the kind of OCR noise you see in real screenshots.
Known limitations
The model was trained on synthetic data only — real-world applications with heavily custom-styled controls may show a domain gap. It performs best on standard Windows 10 and 11 UI. The current release supports 7 element classes; complex widgets like date pickers, tree views, and data grids are not detected. Text content within elements is not provided by the detector — that requires the OCR layer explicitly. For non-standard applications, the included training pipeline makes it straightforward to generate additional data and fine-tune.