Industry

AI · Product Operations

Client

Personal Project

OpsPilot — A Multi-Agent AI System for Autonomous Product Operations

I built this because I wanted to understand agentic AI from the inside, not the outside.

The problem didn't start with technology. It started with a conversation. Talking to senior PMs and ICs at other companies, I kept hearing the same complaint: the actual thinking work, the prioritisation, the strategy, the decisions that require real judgment, was getting crowded out by the operational layer surrounding it. Reading messages. Interpreting intent. Deciding what each one needs. First-drafting from scratch. Thirty times a day across Gmail and Slack.

AI assistants exist. But they wait to be asked. You still read the message, decide what it needs, frame the prompt, and review the output. The assistant reduces drafting time. It doesn't reduce the cognitive overhead of triage. You're still the operator.

I wanted to build something that didn't wait. A system that reads the signal, understands what it requires, decides what to produce, and produces it without being prompted for each task individually. That became OpsPilot.

The Wrong First Version

The first version was a single prompt, one call to Gemini doing classification, routing, and generation simultaneously. It worked on clean inputs. It fell apart on realistic ones.

The specific failure: on ambiguous messages like "can we get a PRD for the search bar?", the model began generating content before resolving whether this was a real request or a passing comment. Structurally correct output. Wrong in substance. Classification and generation were corrupting each other by running inside the same context at the same time.

The fix was four separate agents, each with exactly one job:

• Input Agent: reads raw message, converts to standard format
• Classifier Agent: one question only, what type of task is this?
• Decision Agent: routes to the correct output type
• Execution Agent: generates the final artifact

Accuracy improved. Debuggability improved dramatically. When something broke, I could see exactly which agent caused it instead of tracing through a monolithic prompt.
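The four-agent split described above can be sketched in a few lines of Python. This is an illustrative outline, not OpsPilot's actual code: the `Message` fields, the `ROUTES` table, the output type names, and the `stub_classify` / `stub_generate` functions are all assumptions standing in for the real model calls. The point it shows is that each stage does one job and records one trace entry, so a bad result can be traced to a single stage.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    """Standard format shared by every stage (field names are hypothetical)."""
    source: str
    text: str
    task_type: str = ""
    output_type: str = ""
    artifact: str = ""
    trace: list[str] = field(default_factory=list)


# Hypothetical routing table: task type -> output type.
ROUTES = {
    "prd_request": "prd_draft",
    "client_query": "email_reply",
    "fyi": "summary",
    "kb_alert": "kb_note",
}


def input_agent(raw: dict) -> Message:
    """Read the raw payload and convert it to the standard format."""
    msg = Message(source=raw.get("source", "unknown"), text=raw["body"].strip())
    msg.trace.append(f"input: normalised message from {msg.source}")
    return msg


def classifier_agent(msg: Message, classify: Callable[[str], str]) -> Message:
    """One question only: what type of task is this?"""
    msg.task_type = classify(msg.text)
    msg.trace.append(f"classifier: task_type={msg.task_type}")
    return msg


def decision_agent(msg: Message) -> Message:
    """Route the task type to an output type; unknown types go to review."""
    msg.output_type = ROUTES.get(msg.task_type, "needs_review")
    msg.trace.append(f"decision: output_type={msg.output_type}")
    return msg


def execution_agent(msg: Message, generate: Callable[[str, str], str]) -> Message:
    """Generate the final artifact for the chosen output type."""
    msg.artifact = generate(msg.output_type, msg.text)
    msg.trace.append(f"execution: produced {len(msg.artifact)} characters")
    return msg


def run_pipeline(raw: dict, classify, generate) -> Message:
    """Run the four stages in order; each one appends to the trace."""
    msg = input_agent(raw)
    msg = classifier_agent(msg, classify)
    msg = decision_agent(msg)
    return execution_agent(msg, generate)


# Stubs standing in for the model calls (assumptions, for illustration only).
def stub_classify(text: str) -> str:
    return "prd_request" if "prd" in text.lower() else "fyi"


def stub_generate(output_type: str, text: str) -> str:
    return f"[{output_type}] draft for: {text}"
```

Calling `run_pipeline({"source": "slack", "body": "can we get a PRD for the search bar?"}, stub_classify, stub_generate)` returns a `Message` whose `trace` holds one entry per stage. In a real system the classifier and execution stages would each be a separate model call with its own prompt.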
What It Does

• Mission Control: all Gmail and Slack messages in one unified view
• Auto-Execute: one click triggers the full pipeline, no prompting per task
• Glass-Box Reasoning: transparent log of how the AI interpreted each message and why it routed the way it did. Built deliberately because black-box outputs break trust, and if you can't see the reasoning you can't catch the errors
• Document Analysis Workspace: upload PDFs or images and have a full conversation with the document, ask specific questions, request sections, get contextual answers
• Knowledge Base: all outputs saved, editable, and downloadable across sessions
• Gmail OAuth: processes real inbox messages, not simulated data

What the evaluation actually showed. And what I would build next.

I tested OpsPilot on 40 real inputs because the only honest way to know whether an AI system works is to measure it on inputs it didn't know it would see.

Results by category:

• PRD requests: 80 to 90 percent, structured inputs, explicit intent, consistent language patterns
• Client email queries: around 83 percent, similar reasons
• Informational and FYI messages: 65 to 70 percent, occasional ambiguity creates classification uncertainty
• Knowledge base alerts: 40 to 45 percent, rigid system-generated language overlaps semantically with other categories
• Overall: approximately 70 percent

The knowledge base failure is a label definition problem, not a model problem. The classifier doesn't have enough signal to distinguish a knowledge base alert from an FYI because the category boundaries I defined weren't sharp enough. The fix is prompt refinement and better label definitions, not a bigger model or more training data. That distinction is exactly what an AI PM needs to be able to make.

What I'd Build Next

The current version processes each message independently with no memory of previous messages and no awareness that Tuesday's PRD request connects to Thursday's feedback thread. Agent memory is the highest-value next problem: persistent context that lets the system understand active workstreams and connect related signals automatically. After that, expanded output types including Jira tickets, meeting summaries, and Slack thread digests.

How I'd Measure Success in Production

• Time to first draft
• Reduction in manual triage
• Output acceptance rate: how often users use the generated artifact with minimal editing

That last metric is the most honest signal. 70 percent classification accuracy means nothing if users are rewriting 80 percent of outputs.
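Both numbers discussed above, per-category classification accuracy and output acceptance rate, are simple to compute once the raw records exist. The sketch below is one way to do it under stated assumptions: the label names, the `(expected, predicted)` record shape, and the 0.2 edit-ratio threshold for "minimal editing" are hypothetical, and the sample data is invented for illustration rather than taken from the 40-input evaluation.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per expected label.

    records: iterable of (expected_label, predicted_label) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in records:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1
    return {label: correct[label] / totals[label] for label in totals}


def acceptance_rate(edit_ratios, max_edit_ratio=0.2):
    """Share of generated artifacts used with minimal editing.

    edit_ratios: for each artifact, the fraction of it the user changed (0.0 to 1.0).
    max_edit_ratio: assumed cut-off for what counts as "minimal editing".
    """
    if not edit_ratios:
        return 0.0
    accepted = sum(1 for ratio in edit_ratios if ratio <= max_edit_ratio)
    return accepted / len(edit_ratios)


# Invented sample data, for illustration only.
sample_records = [
    ("prd_request", "prd_request"),
    ("prd_request", "prd_request"),
    ("kb_alert", "fyi"),        # a knowledge base alert mistaken for an FYI
    ("kb_alert", "kb_alert"),
]
sample_edit_ratios = [0.0, 0.1, 0.5, 0.9]

per_category = accuracy_by_category(sample_records)
accepted_share = acceptance_rate(sample_edit_ratios)
```

Reporting the two side by side makes the gap visible: a classifier can score well per category while most artifacts still get heavily rewritten.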
What Building This Taught Me

I understand why multi-agent architectures outperform monolithic prompts on complex reasoning tasks, not because I read it, but because I watched a single-prompt system fail in a specific, instructive way and had to redesign around it. I know what transparent reasoning logs do for user trust because I built one and felt the difference. I know what 70 percent accuracy feels like when you thought the system was working.

Most candidates applying for AI PM roles have used LLMs. I built a system with them, got the architecture wrong the first time, understood exactly why, redesigned it, measured the output honestly, and documented what I found. That is the difference between knowing about agentic AI and knowing how it actually behaves.
