Windows UI Element Detector — Try the Live Demo

Tags: Computer Vision · Windows Automation · Live Demo · Open Source

What this tool does

Windows UI Element Detector is a browser-based demo of a computer-vision model that finds interactive elements in any Windows screenshot — buttons, text fields, checkboxes, dropdowns, icons, tabs, and menu items. Upload a screenshot, get bounding boxes and JSON output back in seconds.
Under the hood it runs YOLO11s fine-tuned on 3,000 synthetic Windows-style UI screenshots, with EasyOCR for text reading and rapidfuzz for fuzzy label matching. No cloud APIs. No data sent anywhere. Everything runs locally on the Space hardware.
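For intuition, the three stages compose roughly like this. The sketch below uses the public APIs of ultralytics, easyocr, and rapidfuzz directly; the weights path is a placeholder, not the project's actual model file.

from ultralytics import YOLO
import easyocr
from rapidfuzz import fuzz

# 1) Detect candidate UI elements (placeholder weights path).
model = YOLO("ui-detector.pt")
result = model("screenshot.png", conf=0.3)[0]
print(f"{len(result.boxes)} elements detected")

# 2) Read visible text; EasyOCR returns (bbox, text, confidence) triples.
reader = easyocr.Reader(["en"])
ocr_hits = reader.readtext("screenshot.png")

# 3) Fuzzy-match a target label against the OCR text.
query = "Sign in"
best = max(ocr_hits, key=lambda hit: fuzz.ratio(query.lower(), hit[1].lower())) if ocr_hits else None
if best:
    print("closest text match:", best[1])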

Who needs this

UI automation agents that rely on native accessibility APIs — pywinauto, UIAutomation — regularly fail on custom-rendered controls, Electron apps, and heavily themed enterprise software. When the accessibility tree returns nothing, you need a vision fallback. This demo lets you test whether the model works on your specific application before integrating the library into your pipeline.
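A typical integration wraps both paths in one helper: ask the accessibility tree first, fall back to the vision model when it fails. This is a hypothetical sketch, not the library's API; the pywinauto side is standard UIA usage, and find_by_text / safe_click_point are the functions shown further down this page.

from pywinauto import Application
from local_ui_locator import find_by_text, safe_click_point

def locate_button(window_title, label, screenshot="screenshot.png"):
    """Return a click point: accessibility tree first, vision fallback second."""
    try:
        app = Application(backend="uia").connect(title=window_title)
        btn = app.window(title=window_title).child_window(title=label, control_type="Button")
        rect = btn.rectangle()
        return rect.mid_point().x, rect.mid_point().y
    except Exception:
        # The accessibility tree returned nothing usable: fall back to vision.
        match = find_by_text(screenshot, query=label)
        if match is None:
            raise LookupError(f"{label!r} not found by UIA or by vision")
        return safe_click_point(match.bbox)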

How to use the demo

Upload any Windows screenshot — a dialog box, a settings panel, a full desktop window. Adjust the confidence threshold to control how many detections appear. Use the IoU slider to tune overlap suppression. Filter by class if you only care about buttons or text fields. Hit Detect.
The overlay shows bounding boxes with class labels and confidence scores. The JSON output gives you the raw data: class name, bounding box coordinates, score — ready to copy into your integration.
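The field names mirror the library's detection attributes (type, bbox, score); the values below are made-up, and the bbox convention is assumed here to be pixel [x1, y1, x2, y2]:

[
  {"type": "button", "bbox": [412, 388, 540, 424], "score": 0.97},
  {"type": "textbox", "bbox": [120, 212, 540, 248], "score": 0.94}
]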
Controls (each maps onto a standard inference parameter; see the sketch after this list):
  • Confidence threshold — lower it to catch more elements, raise it to keep only high-certainty detections
  • IoU threshold (NMS) — controls how aggressively overlapping boxes are suppressed
  • Filter classes — select specific element types or leave empty to detect all seven
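If you are reproducing the demo's behavior with plain Ultralytics, the sliders correspond directly to predict arguments. A generic sketch, with a placeholder weights path:

from ultralytics import YOLO

model = YOLO("ui-detector.pt")  # placeholder, not the project's weights file
result = model.predict(
    "screenshot.png",
    conf=0.25,     # confidence threshold: lower catches more, raise to keep only sure hits
    iou=0.45,      # NMS IoU threshold: lower suppresses overlapping boxes more aggressively
    classes=None,  # or a list of class indices to filter, e.g. [0] for one element type
)[0]
print(f"{len(result.boxes)} detections after NMS")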

Model performance

The model was trained for 120 epochs on an NVIDIA RTX 5060 (Blackwell, 8 GB), using 3,000 synthetic Windows screenshots generated with Playwright; no manual annotation was required.
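The reason Playwright removes manual annotation: the browser that renders each mock screen also knows exactly where every element sits, so labels are recorded at render time. A toy, single-element version of that idea (the real generator lives in the linked repo):

from playwright.sync_api import sync_playwright

HTML = """
<html><body style="width:800px;height:600px">
  <button id="ok" style="position:absolute;left:340px;top:280px">OK</button>
</body></html>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 800, "height": 600})
    page.set_content(HTML)
    page.screenshot(path="sample_0000.png")
    # The browser reports the exact box, so the label is exact by construction.
    box = page.locator("#ok").bounding_box()  # {"x": ..., "y": ..., "width": ..., "height": ...}
    print("button", box)
    browser.close()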
Overall metrics:
Metric                          Value
mAP@50                          0.989
mAP@50–95                       0.954
Precision                       0.996
Recall                          0.973
CPU inference (Apple M2 Pro)    44–79 ms
GPU inference (RTX 5060)        2–5 ms
Per-class AP@50:
Component    AP@50
Button       0.9919
Textbox      0.9771
Checkbox     0.9864
Dropdown     0.9829
Icon         0.9950
Tab          0.9950
Menu item    0.9915

Use it in your project

Clone the repository and install the library with a single command; model weights download automatically from HuggingFace on first run.
pip install -e .
from local_ui_locator import detect_elements, find_by_text, safe_click_point

# Detect all UI elements
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} (score={det.score:.2f})")

# Find element by visible label
match = find_by_text("screenshot.png", query="Sign in")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
Full source code, training pipeline, and synthetic dataset generator on GitHub → https://github.com/Indext-Data-Lab/windows-ui-synth

Known limitations

The model performs best on standard Windows 10 and 11 UI. Heavily custom-styled applications — games, custom-skinned enterprise tools, non-standard widget libraries — may show lower accuracy due to the synthetic training data. The detector returns bounding boxes and class labels only; text content within elements requires the OCR layer. Seven element classes are supported in this release.

Stack

YOLO11s (Ultralytics) · EasyOCR · rapidfuzz · Playwright · MIT License
HuggingFace → https://huggingface.co/spaces/IndextDataLab/windows-ui-locator
GitHub → https://github.com/Indext-Data-Lab/windows-ui-synth
Need a fully integrated AI solution for your business? Reach out through the website or connect on LinkedIn.