<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:yandex="http://news.yandex.ru" xmlns:turbo="http://turbo.yandex.ru" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>News</title>
    <link>https://indext.io</link>
    <description/>
    <language>ru</language>
    <lastBuildDate>Thu, 09 Apr 2026 18:44:27 +0300</lastBuildDate>
    <item turbo="true">
      <title>We tried to teach a computer to see buttons. It took way longer than we thought.</title>
      <link>https://indext.io/tpost/oye1babhj1-we-tried-to-teach-a-computer-to-see-butt</link>
      <amplink>https://indext.io/tpost/oye1babhj1-we-tried-to-teach-a-computer-to-see-butt?amp=true</amplink>
      <pubDate>Thu, 09 Apr 2026 18:28:00 +0300</pubDate>
      <enclosure url="https://static.tildacdn.com/tild3762-3938-4437-b832-383039346138/Screenshot_2026-04-0.png" type="image/png"/>
      <description>Building a Windows UI detector from scratch — synthetic data, a YOLO model trained on fake screenshots, and the moment we almost gave up.</description>
      <turbo:content><![CDATA[<header><h1>We tried to teach a computer to see buttons. It took way longer than we thought.</h1></header><figure><img alt="" src="https://static.tildacdn.com/tild3762-3938-4437-b832-383039346138/Screenshot_2026-04-0.png"/></figure><div class="t-redactor__text">Here's the problem nobody talks about when building AI agents for Windows: the automation APIs are a disaster. You ask the system "where's the button?" and half the time it just shrugs.</div><div class="t-redactor__text">That's how this whole thing started. We were building an AI agent that could operate Windows apps — clicking buttons, filling in forms, navigating menus. The standard approach works great in demos. In real life? It breaks constantly.</div><div class="t-redactor__text">So we built our own fallback. This is the honest story of how that went.</div><h2  class="t-redactor__h2">The problem with Windows automation (it's worse than you think)</h2><div class="t-redactor__text">Windows has a built-in system called the "accessibility tree." Think of it like a map of every button, text box, and checkbox on screen. Libraries like pywinauto read that map and tell your agent what to click.</div><div class="t-redactor__text">In theory, this is perfect. Every app should expose this map. In practice? Tons of apps don't. Custom-built controls, old Win32 software, Electron apps — they'll show you a flat wall of pixels and nothing else. Your agent is blind.</div><blockquote class="t-redactor__quote"><strong>The frustrating part</strong><br />The API returns zero useful data. Not an error — just nothing. Your automation script is left staring at pixels with no idea what they are.</blockquote><div class="t-redactor__text">The classic workaround is template matching: take a screenshot, find a saved image of the button, click the center. It works until the user changes their DPI, installs a dark theme, or opens the app on a different-resolution monitor. Then it breaks. 
Every time.</div><div class="t-redactor__text">What we actually needed was something smarter — a model that could look at any screenshot and say: "that's a button, that's a text box, that's a dropdown." Something that generalizes instead of memorizing.</div><blockquote class="t-redactor__quote">"What if we just trained a detector that understands UI element types — the way a human does?"</blockquote><h2  class="t-redactor__h2">The plan (and why the obvious version didn't work)</h2><div class="t-redactor__text">Our first thought was simple: grab a bunch of real screenshots from Windows apps, draw boxes around every button and text field, and train a YOLO object detection model on it.</div><blockquote class="t-redactor__quote"><strong>Dead end #1</strong></blockquote><div class="t-redactor__text">Manual annotation is brutal. We tried labeling real screenshots by hand. Two hours in, we had maybe 50 usable images. We needed thousands.</div><blockquote class="t-redactor__quote"><strong>Dead end #2</strong></blockquote><div class="t-redactor__text">Scraping screenshots from the internet sounds great until you realize most UI screenshots are low-res, cropped, or don't cover the element types you need in the right proportions.</div><blockquote class="t-redactor__quote"><strong>The pivot</strong></blockquote><div class="t-redactor__text">What if we just made fake screenshots? We could render HTML/CSS that looks exactly like Windows UI, and the browser gives us pixel-perfect bounding boxes from the DOM — zero manual labeling.</div><div class="t-redactor__text">This was the idea that changed everything. 
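</div><div class="t-redactor__text">Those DOM boxes convert straight into training labels. Here's a minimal sketch of that conversion (the element list is hand-written for illustration; in the real pipeline it would come from the browser, e.g. Playwright's bounding_box()): absolute pixel boxes become the normalized "class x_center y_center width height" lines that YOLO expects:</div><pre class="t-redactor__highlightcode"><code data-lang="python"># Turn absolute pixel boxes (what the DOM reports) into normalized
# YOLO label lines: "class x_center y_center width height".
CLASSES = {"button": 0, "textbox": 1, "checkbox": 2, "dropdown": 3}

def to_yolo(elements, img_w, img_h):
    lines = []
    for el in elements:
        x, y, w, h = el["x"], el["y"], el["width"], el["height"]
        cx = (x + w / 2) / img_w   # box center, normalized to 0..1
        cy = (y + h / 2) / img_h
        lines.append(f"{CLASSES[el['class']]} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}")
    return lines

# Hand-written stand-in for what the DOM would report for one render:
elements = [
    {"class": "button", "x": 860, "y": 660, "width": 120, "height": 32},
    {"class": "textbox", "x": 100, "y": 200, "width": 400, "height": 28},
]
for line in to_yolo(elements, img_w=1920, img_h=1080):
    print(line)</code></pre><div class="t-redactor__text">Write those lines to a .txt file with the same basename as each rendered PNG and standard YOLO training tooling picks them up directly.</div><div class="t-redactor__text">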
Instead of annotating real screenshots, we'd generate thousands of fake ones — each one a synthetic Windows-style screen built from HTML templates, rendered with Playwright, and automatically annotated from DOM coordinates.</div><blockquote class="t-redactor__quote"><strong>The insight</strong><br />HTML renders into exactly the pixels you expect, and the DOM knows every element's exact position. Instant ground truth. No human labeling needed.</blockquote><h2 class="t-redactor__h2">Building the fake screenshot factory</h2><div class="t-redactor__text">We wrote templates — HTML and CSS files that look like Windows 10 and 11 UI. Dialogs, settings panels, login screens, forms. Each template could randomize fonts, colors, DPI scaling, element positions, and background noise.</div><div class="t-redactor__text">Then Playwright rendered each one into a PNG and extracted the bounding box coordinates directly from the DOM. One script, 3,000 synthetic screenshots, zero hours of labeling.</div><pre class="t-redactor__highlightcode"><code data-lang="shell">python data_gen/generate.py --out datasets/ui_synth_v2 --n 3000 --seed 42</code></pre><div class="t-redactor__text">We targeted 7 element classes: buttons, textboxes, checkboxes, dropdowns, icons, tabs, and menu items. Those cover the vast majority of things an agent actually needs to interact with.</div><div class="t-redactor__text">The domain randomization was key. If every fake screenshot looked identical, the model would memorize instead of learning. We varied everything we could think of — themes, fonts, sizes, noise levels, widget states.</div><h2 class="t-redactor__h2">The model: from YOLOv8 to YOLO11</h2><div class="t-redactor__text">We started with YOLOv8n — the smallest, fastest version. It was fine. 
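</div><div class="t-redactor__text">A training run like the one described can be sketched with the Ultralytics API; the dataset config path and the early-stopping patience value below are illustrative assumptions, not our exact setup:</div><pre class="t-redactor__highlightcode"><code data-lang="python">from ultralytics import YOLO

# Start from pretrained nano weights (later swapped for "yolo11s.pt").
model = YOLO("yolov8n.pt")

model.train(
    data="datasets/ui_synth_v2/data.yaml",  # assumed dataset config path
    epochs=120,
    patience=20,  # early stopping; the exact patience value is illustrative
)</code></pre><div class="t-redactor__text">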
mAP@50 around 0.93, which sounds impressive until you deploy it in a real agent and 7 wrong clicks out of every 100 derail your workflow.</div><blockquote class="t-redactor__quote"><strong>Lesson learned</strong><br />"93% accurate" sounds good in a research paper. In production automation, it means your agent fails on roughly 1 in 14 interactions. That adds up fast.</blockquote><div class="t-redactor__text">We upgraded to YOLO11s — a bigger model in the same family. Accuracy jumped significantly. The trade-off was speed: CPU inference went from around 30ms to around 60ms. But here's the thing — this model only fires when the accessibility tree fails. It's a fallback, not the main path. 60ms is completely fine.</div><div class="t-redactor__text">Training took 120 epochs with early stopping on an RTX 5060. The results were better than we expected:</div><div class="t-redactor__embedcode"><table style="width:100%; border-collapse:collapse; font-family:Arial, sans-serif;">
  <thead>
    <tr style="background-color:#f5f5f5;">
      <th style="border:1px solid #ddd; padding:10px; text-align:left;">Metric</th>
      <th style="border:1px solid #ddd; padding:10px; text-align:left;">YOLOv8n (baseline)</th>
      <th style="border:1px solid #ddd; padding:10px; text-align:left;">YOLO11s (final)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border:1px solid #ddd; padding:10px;">mAP@50</td>
      <td style="border:1px solid #ddd; padding:10px;">0.930</td>
      <td style="border:1px solid #ddd; padding:10px;">0.989</td>
    </tr>
    <tr>
      <td style="border:1px solid #ddd; padding:10px;">mAP@50–95</td>
      <td style="border:1px solid #ddd; padding:10px;">~0.88</td>
      <td style="border:1px solid #ddd; padding:10px;">0.954</td>
    </tr>
    <tr>
      <td style="border:1px solid #ddd; padding:10px;">Precision</td>
      <td style="border:1px solid #ddd; padding:10px;">~0.94</td>
      <td style="border:1px solid #ddd; padding:10px;">0.996</td>
    </tr>
    <tr>
      <td style="border:1px solid #ddd; padding:10px;">Recall</td>
      <td style="border:1px solid #ddd; padding:10px;">~0.91</td>
      <td style="border:1px solid #ddd; padding:10px;">0.973</td>
    </tr>
    <tr>
      <td style="border:1px solid #ddd; padding:10px;">CPU inference</td>
      <td style="border:1px solid #ddd; padding:10px;">~30ms</td>
      <td style="border:1px solid #ddd; padding:10px;">44–79ms</td>
    </tr>
  </tbody>
</table></div><div class="t-redactor__text">Six percentage points on mAP@50 doesn't sound massive. But it's the difference between an agent that frustrates users and one that works reliably enough to ship.</div><h2 class="t-redactor__h2">The part we almost skipped (and shouldn't have)</h2><div class="t-redactor__text">Detecting buttons is one thing. But an agent doesn't just want coordinates — it wants to click the <em>right</em> button. "Submit" not "Cancel." "Sign in" not "Sign up."</div><div class="t-redactor__text">So we added an OCR layer. For each detected element, EasyOCR reads whatever text is visible inside the bounding box. Then when the agent says "click the Sign In button," the system fuzzy-matches that query against every detected element's text and returns the best match.</div><pre class="t-redactor__highlightcode"><code data-lang="python">match = find_by_text(&quot;screenshot.png&quot;, query=&quot;Sign in&quot;)
if match:
    x, y = safe_click_point(match.bbox)
    print(f&quot;Click at ({x}, {y})&quot;)</code></pre><div class="t-redactor__text">We used rapidfuzz for the fuzzy matching — specifically token_set_ratio, which handles word reordering and minor OCR errors really well. "Sign in" matches "Signin." "Submit form" matches "Submit." It's robust in a way that exact string matching never is.</div><blockquote class="t-redactor__quote"><strong>Why EasyOCR over Tesseract</strong><br />EasyOCR installs with a single pip command, supports 80+ languages, and returns word-level bounding boxes that line up perfectly with our detector output. Tesseract needs system packages and is fragile in automated environments.</blockquote><h2 class="t-redactor__h2">Does it actually work on real apps?</h2><div class="t-redactor__text">The honest answer: mostly yes, with some caveats.</div><div class="t-redactor__text">The model was trained entirely on synthetic data — fake Windows-style UI rendered in a browser. 
Real apps, especially heavily custom-themed ones, can look quite different. We saw a noticeable accuracy drop on Electron apps with unusual styling.</div><div class="t-redactor__text">But for standard Windows 10 and 11 apps — the vast majority of what enterprise agents interact with — it works well. The training pipeline is open source, so if your target app has unusual styling, you can generate more synthetic data that matches it and fine-tune.</div><div class="t-redactor__embedcode"><div style="display:grid; grid-template-columns:repeat(4,1fr); gap:30px; font-family:Arial, sans-serif; text-align:center;">

  <div>
    <div style="font-size:32px; font-weight:700;">0.989</div>
    <div style="font-size:14px; color:#666;">mAP@50</div>
  </div>

  <div>
    <div style="font-size:32px; font-weight:700;">3,000</div>
    <div style="font-size:14px; color:#666;">synthetic screenshots</div>
  </div>

  <div>
    <div style="font-size:32px; font-weight:700;">7</div>
    <div style="font-size:14px; color:#666;">UI classes detected</div>
  </div>

  <div>
    <div style="font-size:32px; font-weight:700;">&lt;80ms</div>
    <div style="font-size:14px; color:#666;">CPU inference</div>
  </div>

</div></div><h2  class="t-redactor__h2">What we'd do differently</h2><div class="t-redactor__text">If we started over, we'd spend more time on the synthetic data variety earlier. The biggest accuracy gains came not from model changes but from making the fake screenshots more realistic — adding more noise, more edge cases, more unusual layouts.</div><div class="t-redactor__text">We'd also add more element classes from the start. Date pickers, tree views, and data grids all exist in the real world. Right now, if an agent hits one of those, it has to fall back to coordinates. That's fine for now but it's the most obvious gap.</div><div class="t-redactor__text">And we'd integrate the action verification layer sooner. Before/after screenshot comparison to confirm a click actually worked — we added it late, and it's actually one of the most useful features for building reliable agents.</div><h2  class="t-redactor__h2">The bigger picture</h2><div class="t-redactor__text">This project started as a small piece of a larger agent framework. We needed a fallback. We built a library. Along the way we learned a lot about synthetic data generation, the real cost of accuracy in production, and why "good enough in testing" almost never means "good enough when deployed."</div><div class="t-redactor__text">The whole thing is open source — model weights on HuggingFace, code on GitHub, MIT license. If you're building Windows automation and hitting the same accessibility API dead ends we did, this is for you.</div><blockquote class="t-redactor__quote">"The API returns nothing. Your agent is blind. So you build it eyes."</blockquote><div class="t-redactor__text">That's what this is. A set of eyes for agents that need them.</div>]]></turbo:content>
    </item>
  </channel>
</rss>
