Coding Agents Grow Up

For years, programming something new followed a predictable, exhausting rhythm: write some code, hit a wall, and then disappear into a forest of documentation and StackOverflow tabs to find the trick to get it working. In 2025, that era ended for me.

Today, AI code assistants take so much drudgery out of development and debugging that the work has become mostly gratifying and rarely frustrating — the opposite of how programming was in the Before Times.

Between my work and my side interests, I often feel like I live in Visual Studio Code (VSCode — a popular open-source development environment). Early in 2025 I subscribed to GitHub Copilot, which integrates AI coding assistants into VSCode. At $10/month it’s a phenomenal bargain that offers software developers an easy way to loop in the latest models from OpenAI, Google, and Anthropic. Now when I’m trying something new (like this little project I did for fun) I can mostly stay in VSCode and work with an AI assistant that has mastered all of the documentation.

As mentioned elsewhere, my most significant side project in 2025 was helping Ukrainians develop an open-source ballistic calculator (pyballistic) in Python and Cython. I polished that off at the end of September. By that point I had begun to spend more time with GitHub Copilot, and the capabilities of its latest models gave me enough confidence and support to tackle what would previously have been an absurdly ambitious project for my day job: an Excel Real-Time Data (RTD) server for the Interactive Brokers API. (This RTD server feeds live market data, positions, and orders directly into Excel using native Excel formulas.) After two months of working seven days a week on it, I had a beautiful piece of software so solid (with every build validated by over 800 unit tests) that I began using it in live trading operations.
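The native-formula interface works through Excel’s built-in RTD function: a cell formula names the COM server’s ProgID plus topic strings, and Excel keeps the cell updated as the server pushes new values. A sketch of what such formulas look like (the ProgID and topic names here are hypothetical, not the actual server’s):

```text
=RTD("MyIbkrRtd.Server", , "AAPL", "Last")
=RTD("MyIbkrRtd.Server", , "AAPL", "Bid")
```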

Early Childhood Development

Watching these models mature over the last year has been like watching a child grow up.

Tell a child, “Clean your room.” First they’ll spend more time arguing than it would take to just do it. When they finally declare the task “done,” you might find a few toys picked up but most of the mess still there. Emphasize that “clean your room” means everything and you might find the floor clean but everything shoved under the bed.

Claude v3 was notorious for hacking shortcuts. Ask it to fix a failing test and it might just replace the test logic with a “return true;” statement. Claude v3.5 wouldn’t be so brazen, but it was still prone to hack the example rather than the task. GPT-4 and Gemini v2 would enthusiastically announce completion without checking their work, like the child who picks up two toys and concludes that his room must be clean even though the mess is visible from outside the door.

The teens came quickly: Claude v3.7 and its contemporaries would often spend more effort arguing that a failure was actually a success than it would have taken to do the work correctly.

More recent models have become more likely to keep checking and working until they succeed. Performance of the latest models is still wildly variable: a model that astonishes me with its apparent skill one day may choke on something relatively simple the next. But they are getting more consistent. And they are definitely getting more intelligent.

What is intelligence?  It becomes easy to see when you’re doing hard work with different models. One of the neat things about Copilot is that you can choose to watch the model at work. They all think “out loud,” meaning you can read their chain of thought to understand how and why they do things. When it’s not having an off day, Claude Opus is intelligent.  Given a problem:

  • It can more reliably identify what matters.
  • It has a better sense of what to look at and what to ignore.
  • It produces better assessments of what’s possible and makes better plans to get there.
  • It knows when to persist and when to change directions.

These are some of the things that separate a junior developer from a more experienced one. They are also qualities that characterize more intelligent people.

Let me show you. Have you ever wondered what it’s like to debug software? Well, debugging is one thing the newer models can usually do as well as a good human programmer. In fact, they can do it better, because they can run the process faster and interact with the code more directly. Below I’ve pasted a transcript of Claude working to find and fix a tricky bug in my RTD server. It could just as well have been a transcript of my thoughts if I had to debug it. But whereas this would have been a draining hour-plus distraction for me, the Claude instance cranked it out in minutes.


Light Interaction App

Check out this nifty little touch-screen-compatible, WebGL-powered application.

To test out the latest AI, I added GitHub Copilot to VSCode and asked it to build a simple web application that lets the user move three radiant lights (red, green, and blue) around a screen to see how additive color mixing works. (For example, if the three colors are right on top of each other it looks like a single white light.) Here’s a screenshot of that first app:

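The additive mixing the app demonstrates is just a channel-wise sum, clamped to the display maximum. A minimal sketch of the idea, not the app’s actual code:

```javascript
// Additive (light) mixing: sum each RGB channel across all lights,
// clamped to 255, the maximum an 8-bit display channel can show.
function mixLights(...colors) {
  // each color is [r, g, b] with components in 0..255
  return [0, 1, 2].map(channel =>
    Math.min(255, colors.reduce((sum, c) => sum + c[channel], 0))
  );
}

// Red + green + blue at full intensity stack to white:
mixLights([255, 0, 0], [0, 255, 0], [0, 0, 255]); // → [255, 255, 255]
```

This is why the overlapping lights read as a single white light: the channels saturate together.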
By default Copilot uses GPT-4o, but on a few examples I have found that Claude 3.7 Sonnet (another Copilot option) is capable of more sophisticated software engineering, so with that selected as my Copilot “Agent” I began enhancing this app. The most significant change – and something I’ve wanted to try for a while – was to use WebGL to take advantage of the graphics processing hardware built into most modern electronics. Thanks to that hardware acceleration, this enhanced app supports many light sources, dithering to avoid color banding, and dragging lights around the screen in real time without noticeable lag. Then I added touch-screen support so that the app can be used on mobile devices.
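Per pixel, a fragment shader for an app like this typically sums each light’s contribution with a distance falloff, then adds a tiny ordered-dither offset before quantizing to 8 bits to break up banding. Here is a CPU sketch of that math; the function, constants, and falloff curve are illustrative assumptions, not the app’s actual shader:

```javascript
// 4x4 Bayer matrix, the classic pattern for ordered dithering.
const BAYER4 = [
  [ 0,  8,  2, 10],
  [12,  4, 14,  6],
  [ 3, 11,  1,  9],
  [15,  7, 13,  5],
];

function shadePixel(x, y, lights) {
  // lights: [{x, y, color: [r, g, b], radius}]
  const rgb = [0, 0, 0];
  for (const light of lights) {
    const dx = x - light.x, dy = y - light.y;
    // Smooth falloff: 1 at the light's center, fading toward 0 past `radius`.
    const falloff = 1 / (1 + (dx * dx + dy * dy) / (light.radius * light.radius));
    for (let c = 0; c < 3; c++) rgb[c] += light.color[c] * falloff;
  }
  // Ordered dithering: add a sub-quantization offset (in -0.5..+0.4375)
  // that varies per pixel, so rounding errors spread into noise, not bands.
  const dither = BAYER4[y & 3][x & 3] / 16 - 0.5;
  return rgb.map(v => Math.max(0, Math.min(255, Math.round(v + dither))));
}
```

In the real app this math would run on the GPU for every pixel of every frame, which is what makes dozens of lights and lag-free dragging feasible.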

It took some coaching from me to get this working: At several points I observed bugs and Copilot would essentially get stuck in a loop saying, “Oh, I see what’s wrong; this should fix it,” without successfully fixing it. I had to guide the Agent through more intentional debugging methods to resolve several confusing problems. But by the end I hadn’t written or even touched much of the code. I was the designer and tester, and Copilot saved me the trouble of:

  • Scouring API documentation and sites like StackOverflow for code samples needed to make it work.
  • Learning or remembering the exact syntax of the languages involved (WebGL, JavaScript, CSS, HTML).
  • Recreating common GUI tricks, like adding code to make sure that everything is visible on a screen regardless of its size or orientation.
  • Finding and fixing minor bugs.
  • Writing debug code to understand and resolve major problems.
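As an example of that third item, one common trick for keeping a canvas crisp and fully visible on any screen is to size its backing store from the element’s CSS size times devicePixelRatio, on every resize or orientation change. A sketch of the idea under those assumptions; the names are mine, not the app’s:

```javascript
// Compute the pixel dimensions a canvas backing store needs so the image
// stays sharp on high-DPI displays (CSS size times the device pixel ratio).
function backingStoreSize(cssWidth, cssHeight, devicePixelRatio) {
  return {
    width: Math.round(cssWidth * devicePixelRatio),
    height: Math.round(cssHeight * devicePixelRatio),
  };
}

// In the browser, it would be applied on every resize/orientation change:
// window.addEventListener("resize", () => {
//   const { width, height } = backingStoreSize(
//     canvas.clientWidth, canvas.clientHeight, window.devicePixelRatio);
//   canvas.width = width;
//   canvas.height = height;
//   gl.viewport(0, 0, width, height);
// });
```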

Here’s a screenshot from the final app (shown here with all lights inverted – one of the fun features accessible by right-clicking/long-tapping):