AI Engineering Coach: Measure How You Actually Code with AI


🎯 TL;DR

AI Engineering Coach is an open-source VS Code extension from Microsoft that reads the local session logs your AI coding assistants already write, then turns them into a private analytics dashboard. It scores your prompting habits, flags anti-patterns, measures your AI-generated output, and surfaces repeated prompts you could promote into reusable skills.

It’s harness-agnostic (Claude Code, GitHub Copilot, Copilot CLI, Codex, OpenCode, and more), runs 100% locally, and does not cost you extra tokens for its core analytics. Think of it as a Strava for the way you work with AI.

Repo: github.com/microsoft/AI-Engineering-Coach

In a hurry? Jump to the install steps, then come back for the why.

The question nobody is measuring

Most of us now reach for an AI coding assistant before we reach for the keyboard. GitHub Copilot, Claude Code, Codex, Gemini CLI. They’ve quietly become the default surface for writing software. But here’s the uncomfortable question I kept coming back to:

Am I actually getting better at this, or am I just using it more?

We obsessively measure the AI: token counts, model benchmarks, latency. We almost never measure ourselves: the quality of our prompts, how often we review what the model generated before shipping it, whether we keep re-typing the same instructions, whether our repos even give the agent enough context to succeed.

That’s the gap the AI Engineering Coach fills. It doesn’t write code for you. It holds up a mirror to how you write code with AI, and that distinction is the whole point.

flowchart LR
    You([You, coding with AI]) -->|prompts · edits · tool calls| Tools["AI coding tools<br/>Copilot · Claude Code · Codex · ..."]
    Tools -->|already write| Logs[("Local session logs<br/>on disk")]
    Logs --> Coach["AI Engineering Coach<br/>reads, never writes"]
    Coach -->|reflects back| Insights["Prompt quality · anti-patterns<br/>output · reusable skills"]
    Insights -.->|so you level up| You
    style Coach fill:#cce5ff,stroke:#1f6feb,stroke-width:2px,color:#000
    style Insights fill:#e6ffed,stroke:#2da44e,color:#000
    style Logs fill:#fff4c2,stroke:#b7950b,color:#000

The loop nobody closes: your tools already write the logs. The Coach just reads them back to you, so the feedback finally points at you, not the model.
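
To make the "repeated prompts you could promote into reusable skills" idea concrete, here is a minimal sketch in Python. It is not the extension's implementation: the function names, the normalization rule, and the `min_count` threshold are all assumptions chosen for illustration, and it works on a plain list of prompt strings rather than on any real log format.

```python
import re
from collections import Counter


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts compare equal."""
    return re.sub(r"\s+", " ", prompt.strip().lower())


def repeated_prompts(prompts: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Return (normalized prompt, count) pairs seen at least `min_count` times.

    Hypothetical helper: the threshold of 3 is an arbitrary illustrative default.
    """
    counts = Counter(normalize(p) for p in prompts if p.strip())
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Made-up prompt history, standing in for whatever your tools logged.
    history = [
        "Run the tests and fix any failures",
        "run the tests and  fix any failures",
        "Add a docstring to this function",
        "RUN THE TESTS AND FIX ANY FAILURES",
    ]
    for text, n in repeated_prompts(history):
        print(f"{n}x  {text}")
```

Anything that shows up here several times is a candidate for a saved instruction or skill instead of something you keep re-typing.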

Governance for AI Agents (Part 1): The Gap, the OWASP Agentic Top 10, and Where AGT Fits


🎯 TL;DR

You’ve built a single agent, or wired up a multi-agent orchestration with LangChain / AutoGen / CrewAI / Microsoft Agent Framework. It works.

Now answer this: how are you covering the OWASP Agentic Top 10? How do you prove to a regulator that the agent did only what it was allowed to do? Prompt engineering and content filters don’t reach the layer where actions happen.

The Agent Governance Toolkit (AGT), an open-source Microsoft project I work on, puts a sub-millisecond deterministic policy decision in front of every tool call, gives every agent a cryptographic identity, and produces a tamper-evident audit trail.

This post is Part 1 of a series: what the governance gap is, what the OWASP Agentic Top 10 actually contains, and where AGT sits in your stack.

Repo: github.com/microsoft/agent-governance-toolkit

If you’ve shipped anything with LangChain, AutoGen, CrewAI, or Microsoft Agent Framework recently, you’ve probably hit the same wall I did. The agent works. It plans, calls tools, remembers things. And then you try to put it somewhere it can actually do harm (touch a database, hit a real API, run shell commands, talk to another agent), and you realise you have no good way to bound what it can do.

You have a model. You have prompts. You have a tool list. You don’t have a policy layer. So you do what we all do: stitch together if-statements, allowlists, regex filters on prompts, maybe a sandbox if you’re feeling fancy. It mostly works. Until it doesn’t.

flowchart LR
    U([User / Prompt]) --> M["LLM Planner<br/>non-deterministic"]
    M -->|tool call| T{{No policy layer}}
    T -->|just runs| DB[(Database)]
    T -->|just runs| API[(Production API)]
    T -->|just runs| SH[/Shell /]
    T -->|just runs| A2[Other Agents]
    style T fill:#ffd6d6,stroke:#c0392b,stroke-width:2px,color:#000
    style M fill:#fff4c2,stroke:#b7950b,color:#000

The agent works. The governance doesn’t exist yet. That gap is what the OWASP Agentic Top 10 and every AI regulation are pointing at.
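
To show what "a deterministic policy decision in front of every tool call" plus "a tamper-evident audit trail" can look like in principle, here is a small Python sketch. It is not AGT's API: `ToolCall`, `PolicyGate`, and the per-agent allowlist format are hypothetical names invented for this example, and a real system would also need signed agent identities and an externally anchored log head.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first audit entry


@dataclass
class ToolCall:
    agent_id: str
    tool: str
    args: dict  # must be JSON-serializable so it can be hashed


@dataclass
class PolicyGate:
    """Hypothetical gate: a per-agent allowlist plus a hash-chained audit log."""
    allowed: dict[str, set[str]]
    audit_log: list[dict] = field(default_factory=list)

    def decide(self, call: ToolCall) -> bool:
        # Deterministic: same agent + same tool always gives the same answer.
        return call.tool in self.allowed.get(call.agent_id, set())

    def execute(self, call: ToolCall, tools: dict):
        ok = self.decide(call)
        self._record(call, ok)  # log the decision before anything runs
        if not ok:
            raise PermissionError(f"{call.agent_id} may not call {call.tool}")
        return tools[call.tool](**call.args)

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def _record(self, call: ToolCall, ok: bool) -> None:
        prev = self.audit_log[-1]["hash"] if self.audit_log else GENESIS
        body = {"agent": call.agent_id, "tool": call.tool,
                "args": call.args, "allowed": ok, "prev": prev}
        self.audit_log.append({**body, "hash": self._digest(body)})

    def verify(self) -> bool:
        """Recompute the chain; an edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.audit_log:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    gate = PolicyGate(allowed={"reporter": {"read_db"}})
    tools = {"read_db": lambda table: f"rows from {table}",
             "run_shell": lambda cmd: cmd}
    print(gate.execute(ToolCall("reporter", "read_db", {"table": "orders"}), tools))
    try:
        gate.execute(ToolCall("reporter", "run_shell", {"cmd": "ls"}), tools)
    except PermissionError as err:
        print("denied:", err)
    print("audit chain intact:", gate.verify())
```

The design point is that the decision lives outside the model: the planner can propose anything, but the gate decides with plain code, and every decision, allowed or denied, lands in a log whose entries each commit to the one before.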

Running FLUX.1 OmniControl on a Consumer GPU: A Docker Implementation tested on RTX 3060


🎯 TL;DR: Subject-Driven Image Generation on 12GB VRAM

Problem: Large AI models like FLUX.1-schnell typically require datacenter GPUs with 48GB+ VRAM, but most developers and hobbyists only have access to consumer RTX cards, which usually have 6-12GB of VRAM (the exceptions being the expensive 4090/5090 cards, which go up to 32GB).

Solution: Using mmgp (Memory Management for GPU Poor) with Docker containerization enables FLUX.1 OmniControl to run on RTX 3060 12GB through 8-bit quantization, dynamic VRAM/RAM offloading, and selective layer loading. The implementation provides a Gradio web interface generating 512x512 images in ~10 seconds after initial model loading, with models persisting in system RAM to avoid reload overhead.

Technical Approach: Profile 3 configuration quantizes the T5 text encoder (8.8GB → ~4.4GB), pins the FLUX transformer (22.7GB) to reserved system RAM, and dynamically loads only active layers to VRAM during inference. Tested and validated on RTX 3060 12GB with 64GB system RAM running Windows 11 + WSL2 + Docker Desktop.
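
The arithmetic behind that approach can be checked with a few lines of Python. This is a back-of-the-envelope sketch using only the sizes quoted above. It assumes the 8.8GB T5 figure is 16-bit weights (so 8-bit halves it) and that both sets of weights stay resident in system RAM, and it ignores the VAE, CLIP, LoRA adapters, and activation memory.

```python
# Sizes quoted in this post, in GB.
VRAM_GB = 12.0          # RTX 3060
RAM_GB = 64.0           # system RAM on the test machine
T5_GB = 8.8             # T5 text encoder before quantization
TRANSFORMER_GB = 22.7   # FLUX transformer


def quantized_size(size_gb: float, from_bits: int = 16, to_bits: int = 8) -> float:
    """Weight size after quantization, assuming size scales with bits per weight."""
    return size_gb * to_bits / from_bits


def budget() -> dict:
    t5_q = quantized_size(T5_GB)
    unoptimized = T5_GB + TRANSFORMER_GB
    ram_resident = t5_q + TRANSFORMER_GB
    return {
        "unoptimized_total_gb": round(unoptimized, 1),
        "t5_quantized_gb": round(t5_q, 1),
        "ram_resident_gb": round(ram_resident, 1),
        "fits_in_vram_unoptimized": unoptimized <= VRAM_GB,
        "fits_in_ram_optimized": ram_resident <= RAM_GB,
    }


if __name__ == "__main__":
    for key, value in budget().items():
        print(f"{key}: {value}")
```

The two named components alone add up to roughly 31.5GB, far beyond 12GB of VRAM, while the quantized encoder plus the pinned transformer come to about 27.1GB, which fits comfortably in 64GB of system RAM. That is why only the active layers need to be streamed into VRAM at inference time.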

Complete Implementation: All code, Dockerfile, and setup instructions are available at github.com/Ricky-G/docker-ai-models/omnicontrol


Recently, I wanted to experiment with OmniControl, a subject-driven image generation model that extends FLUX.1-schnell with LoRA adapters for precise control over object placement. The challenge? The model requirements listed 48GB+ VRAM, and I only had an RTX 3060 with 12GB sitting in my workstation.

This is a common frustration in the AI development community. Research papers showcase impressive results on expensive datacenter hardware, but practical implementation on consumer GPUs requires significant engineering effort. Could I actually run this model locally without upgrading to an RTX 4090/5090 or paying for an Azure VM with an A100?

The answer turned out to be yes, with some clever memory management and containerization. This blog post walks through the complete process of dockerizing OmniControl to run efficiently on a 12GB consumer GPU.
