Scaling Engineering: The Agents Supporting OnePay Engineers

In The OnePay Platform Story, we described the architectural evolution from a single-product stack to a scalable, multi-product platform. In Inside OnePay's AI Journey, we laid out our three-pillar AI strategy spanning Operations, Productivity, and Product. And just recently, we provided an inside look at our Operations AI Team — the five agents transforming customer support. This article picks up where those left off, diving into the AI agents we built to supercharge the engineers who build OnePay itself.


The Engineer's Bottleneck

OnePay ships thousands of changes to production per month. Our mobile platform runs a unified React Native codebase across iOS, Android, and Web. Our backend is a constellation of microservices, data pipelines, and infrastructure spanning multiple cloud regions. The pace has to be extraordinarily fast, because millions of customers depend on us every day. AI agents already help our engineers write code and iterate fast. They are embedded in every developer's IDE, but writing code is only one part of the story.

Speed creates its own challenges. Every release flows through build systems, deployment infrastructure, and monitoring layers before it reaches a customer. Every incident demands rapid diagnosis across an ever-changing environment. Every data question requires navigating a lakehouse with catalogs spanning dozens of products and features. And every new engineer who joins needs to understand the environment and quickly start contributing to a codebase with millions of lines of code.

We asked ourselves: what if AI could shoulder the repetitive, time-consuming parts of this work, not to replace engineers, but to let them focus on the problems that actually require human creativity and judgment?

The answer is another five specialized AI agents, powered by a growing ecosystem of Model Context Protocol (MCP) servers.

What is MCP, and Why It Matters

Before diving into the agents, it's worth explaining a foundational technology choice. The Model Context Protocol (MCP) is an open standard that defines how AI agents interact with external tools and data sources. Think of it as a universal adapter between AI agents and the systems they need to work with.

Rather than building bespoke integrations for every agent-to-system connection, we invested in building MCP servers — modular, reusable bridges between AI and our infrastructure. Each MCP server exposes a set of tools that our AI agents can discover and invoke. This is the same platform philosophy we described in The OnePay Platform Story: build shared, horizontal capabilities that accelerate everything built on top.

To accelerate adoption, we developed a core SDK that provides a fluent API that any team at OnePay can use to stand up a new MCP server with minimal boilerplate. This SDK is the foundation that makes our agent ecosystem scale.
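To make the idea of a fluent API concrete, here is a minimal Python sketch of what a chained server builder could look like. It is an illustration only: `McpServerBuilder`, `McpServerSketch`, `with_sso`, `tool`, and the `list_pipelines` tool are hypothetical names, not the actual SDK surface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class McpServerSketch:
    """Hypothetical stand-in for the server object a builder would produce."""
    name: str
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    require_sso: bool = False

    def call(self, tool: str, **kwargs: object) -> object:
        # Dispatch a tool invocation by name, as an agent would.
        if tool not in self.tools:
            raise KeyError(f"unknown tool: {tool}")
        return self.tools[tool](**kwargs)


class McpServerBuilder:
    """Hypothetical fluent builder: every method returns self so calls chain."""

    def __init__(self, name: str) -> None:
        self._server = McpServerSketch(name=name)

    def with_sso(self) -> "McpServerBuilder":
        self._server.require_sso = True
        return self

    def tool(self, name: str, fn: Callable[..., object]) -> "McpServerBuilder":
        self._server.tools[name] = fn
        return self

    def build(self) -> McpServerSketch:
        return self._server


# Standing up a server is one chained expression.
server = (
    McpServerBuilder("ci-cd")
    .with_sso()
    .tool("list_pipelines", lambda project: [f"{project}/main"])
    .build()
)
```

The point of the fluent style is that shared concerns such as authentication become one-line opt-ins, so a team only writes the tool functions that are specific to its system.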

Here’s a sampling of the MCP ecosystem that we developed to supercharge developer productivity:

  • CI/CD: An MCP server that connects to Git source control, ArgoCD, Infrastructure-as-Code (IaC) systems, and other tooling, giving AI agents the ability to inspect pipelines, debug deployments, and manage infrastructure state.

  • Observability: Provides access to real-time metrics, logs, traces, monitors, dashboards, and incidents across our entire observability stack.

  • Data: Enables AI agents to interface with our lakehouse, with OAuth-secured federated access to authorized catalogs, query execution, and vector search.

In a fintech environment, giving AI agents access to production systems and customer data demands rigorous controls, and these controls are baked into the SDK that every MCP server inherits. Authentication is enforced via SSO and short-lived tokens, not static credentials. Safelists scope which hosts, projects, and operations an agent can reach; an agent can inspect a deployment but not trigger one without human authorization. Our Data MCP server uses OAuth with narrow permissions that respect our catalog's governance model, data access policies, and column-level security. The same rules apply whether a human or an AI runs the query. Secrets are vaulted, never stored in plaintext, and all tool invocations are logged for auditability.
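The safelist and audit-logging behavior described above can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the `Safelist` shape, the `authorize` function, the `*.example.internal` host pattern, and the operation names are all hypothetical.

```python
import logging
from dataclasses import dataclass
from fnmatch import fnmatchcase

audit_log = logging.getLogger("mcp.audit")  # hypothetical logger name


@dataclass(frozen=True)
class Safelist:
    host_patterns: tuple[str, ...]     # hosts the agent may reach
    read_operations: frozenset[str]    # always allowed on safelisted hosts
    gated_operations: frozenset[str]   # allowed only with human approval


def authorize(safelist: Safelist, host: str, operation: str,
              human_approved: bool = False) -> bool:
    """Return True if the call is permitted; log every decision."""
    host_ok = any(fnmatchcase(host, p) for p in safelist.host_patterns)
    if not host_ok:
        allowed = False
    elif operation in safelist.read_operations:
        allowed = True
    elif operation in safelist.gated_operations:
        allowed = human_approved
    else:
        allowed = False  # deny anything not explicitly listed
    audit_log.info("host=%s operation=%s human_approved=%s allowed=%s",
                   host, operation, human_approved, allowed)
    return allowed


SAFELIST = Safelist(
    host_patterns=("*.example.internal",),
    read_operations=frozenset({"inspect_deployment"}),
    gated_operations=frozenset({"trigger_deployment"}),
)
```

With this configuration, `authorize(SAFELIST, "argocd.example.internal", "inspect_deployment")` is permitted, while `trigger_deployment` on the same host is denied unless `human_approved=True` is passed. Unknown hosts and unlisted operations are denied by default.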

These MCP servers are not standalone products; they are building blocks. Our Platform AI Agents are where these building blocks come to life.

Meet the Engineering AI Agents

Pipeline Agent

Every engineer at OnePay interacts with CI/CD pipelines daily. A typical day might involve debugging a failed CI job, checking why an ArgoCD deployment is stuck, or verifying an IaC workspace state before a release. Each of these tasks used to require context-switching between multiple tools, decoding cryptic logs, and sometimes hunting down a platform engineer for help.

The Pipeline Agent eliminates that friction. Built on top of our CI/CD MCP Server, it gives engineers the ability to interact with their entire release pipeline through natural language directly inside their IDE.

An engineer can paste a failing CI job link and ask, "Why did this fail?" The agent retrieves the job logs, strips ANSI formatting, detects error patterns, and provides a diagnosis. It can list ArgoCD applications across any of our environments, show deployment history, compare live state against Git state, and retrieve rendered Kubernetes manifests. For infrastructure changes, it inspects IaC workspaces, lists recent runs, and surfaces plan details.
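The log-handling step (strip ANSI formatting, then detect error patterns) can be illustrated with a short Python sketch. The pattern table and labels below are hypothetical examples, not the agent's actual rules, and the regular expression covers only the common CSI color and cursor sequences.

```python
import re

# Matches CSI escape sequences such as "\x1b[31m" (color) or "\x1b[0m" (reset).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

# Hypothetical error signatures; a real table would be much larger.
ERROR_PATTERNS = {
    "out_of_memory": re.compile(r"OutOfMemoryError|exit code 137"),
    "test_failure": re.compile(r"\bFAILED\b|\b\d+ failed\b"),
    "dependency_resolution": re.compile(r"ERESOLVE|Could not resolve"),
}


def diagnose(raw_log: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, cleaned_line) for every matching line."""
    findings = []
    for line_number, line in enumerate(raw_log.splitlines(), start=1):
        clean = ANSI_ESCAPE.sub("", line).strip()
        for label, pattern in ERROR_PATTERNS.items():
            if pattern.search(clean):
                findings.append((line_number, label, clean))
    return findings


sample_log = (
    "\x1b[32mInstalling dependencies\x1b[0m\n"
    "\x1b[31mnpm ERR! code ERESOLVE\x1b[0m\n"
    "\x1b[1;31mFAILED\x1b[0m tests/test_api.py::test_login\n"
)
findings = diagnose(sample_log)
```

In this sketch, pattern matching only narrows a long log down to the few lines worth reasoning about; the diagnosis itself is left to the model.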

The agent resolves ArgoCD application names automatically, mapping environment references like "staging" or "dev" to the correct cluster endpoints. It understands our naming conventions so engineers don't have to remember infrastructure specifics.
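As a rough illustration of this kind of name resolution, the sketch below maps loose environment references to a canonical environment and cluster endpoint. The alias table, the `<service>-<env>` naming convention, and the endpoint URLs are assumptions made up for the example.

```python
# Hypothetical aliases engineers might type, mapped to canonical short names.
ENVIRONMENT_ALIASES = {
    "dev": "dev", "development": "dev",
    "staging": "stg", "stage": "stg",
    "prod": "prd", "production": "prd",
}

# Hypothetical cluster endpoints, one per canonical environment.
CLUSTER_ENDPOINTS = {
    "dev": "https://argocd.dev.example.internal",
    "stg": "https://argocd.stg.example.internal",
    "prd": "https://argocd.prd.example.internal",
}


def resolve_application(service: str, environment: str) -> tuple[str, str]:
    """Return (application_name, cluster_endpoint) for a loose reference."""
    key = environment.strip().lower()
    if key not in ENVIRONMENT_ALIASES:
        raise ValueError(f"unknown environment: {environment!r}")
    env = ENVIRONMENT_ALIASES[key]
    return f"{service}-{env}", CLUSTER_ENDPOINTS[env]
```

For example, `resolve_application("ledger", "Staging")` returns `("ledger-stg", "https://argocd.stg.example.internal")`.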

Setup takes a single command, which configures all necessary MCP servers, authenticates via SSO to the underlying platform services, and stores credentials securely in the engineer's keychain. From there, the agent is available in every coding session, operating on behalf of the developer and under their command.

Code Quality Agent

Writing code is one thing. Shipping code that meets the standards of a fintech platform serving millions of customers is another. With thousands of production changes deployed per month across a large set of services, manual code review alone cannot catch every issue. Human reviewers get fatigued, miss edge cases, and cannot possibly hold every architectural convention in memory across an ever-growing codebase. The story is significantly different when those reviewers are AI-assisted.

The Code Quality Agent is an AI-powered reviewer that automatically analyzes every merge request against a comprehensive, modular set of engineering rules that encode OnePay's operational knowledge.

Unlike generic linters or static code analysis tools, the Code Quality Agent understands our codebase, our infrastructure patterns, our framework conventions, and the specific classes of bugs that have caused real incidents. Its rule set is organized into modular use cases, each targeting a distinct category of quality and risk:

  1. Security: The agent follows a set of security best practices when reviewing code, and is additionally trained to look for aspects that are specific to the OnePay ecosystem. It detects when new endpoints are added without proper authentication. It cross-references ingress and service mesh configurations with controller implementations, flagging any unintended routes. It monitors secrets usage, ensuring new secrets are properly provisioned and secured in their respective vaults before code reaches production.

  2. Performance: It catches patterns known to cause performance degradation. It detects when objects that are intended to be initialized once are being re-initialized on every request, causing memory pressure and resource contention. It highlights the sequential processing of independent async operations that could safely run in parallel to reduce response time.

  3. Data Integrity: When database schemas or messaging streams are modified, the agent verifies that all entity schemas are updated correctly in our schema registry for all downstream systems. This ensures all schema changes are backwards compatible and correctly versioned across the entire stack. 

  4. Test Coverage: When new HTTP endpoints are introduced, the agent verifies that comprehensive service tests exist covering every status code the endpoint can return: happy paths, authentication failures, validation errors, and business outcomes. It also catches subtle anti-patterns, such as using 4xx HTTP status codes for business logic outcomes (like "declined" or "not eligible") instead of 2xx responses with result fields, a pattern that undermines alerting and client error handling.

  5. Framework Conventions: The agent enforces correct usage of our internal libraries, catches dependency version drift against our centralized catalog, flags applications that reference unvetted libraries (which can cause unnecessary deployment cascading), and detects bootstrap configuration issues that would cause deployment failures.
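To illustrate the Performance check on sequential async operations, here is a small Python sketch using `asyncio`. The `fetch` coroutine and the "balance" and "rewards" lookups are hypothetical stand-ins for two independent downstream calls.

```python
import asyncio


async def fetch(name: str, delay: float = 0.05) -> str:
    """Stand-in for an independent I/O call (e.g. a downstream service)."""
    await asyncio.sleep(delay)
    return f"{name}:ok"


async def load_sequential() -> list[str]:
    # The flagged pattern: the second call waits for the first even though
    # neither depends on the other, so total latency is the sum of both.
    balance = await fetch("balance")
    rewards = await fetch("rewards")
    return [balance, rewards]


async def load_parallel() -> list[str]:
    # The suggested fix: run both concurrently, so total latency is roughly
    # that of the slowest call. gather() preserves argument order.
    return list(await asyncio.gather(fetch("balance"), fetch("rewards")))
```

Both functions return the same result; only the latency differs. The rewrite is safe only when the operations are truly independent, which is why the rule is phrased in terms of independent operations.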

Each rule category produces severity-graded feedback, ranging from critical findings that block merges to low-priority style suggestions. Every finding includes specific remediation steps, not just a flag. The agent doesn't just tell you something is wrong; it tells you how to fix it and links to the relevant documentation.
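One way to picture severity-graded findings is as a small data structure plus a merge gate. The sketch below is an assumption about shape only: the `Finding` fields, the four severity levels, the rule names, and the documentation URLs are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: Severity
    message: str
    remediation: str   # how to fix it, not just what is wrong
    docs_url: str      # link to the relevant documentation


def blocks_merge(findings: list[Finding],
                 threshold: Severity = Severity.CRITICAL) -> bool:
    """A merge is blocked if any finding is at or above the threshold."""
    return any(f.severity >= threshold for f in findings)


example_findings = [
    Finding("style/naming", Severity.LOW, "Inconsistent handler name",
            "Rename to match the module convention",
            "https://docs.example.internal/style"),
    Finding("security/unauthenticated-endpoint", Severity.CRITICAL,
            "New endpoint has no authentication",
            "Add the standard auth filter to the route",
            "https://docs.example.internal/auth"),
]
```

Here `blocks_merge(example_findings)` is true because of the critical finding, while the low-severity suggestion alone would not block the merge.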

The rule system is designed to be extensible. We manage rules using markdown files in a configuration directory that we version control. This allows the agent's knowledge to grow with the team. Every post-mortem that identifies a preventable class of bug can be encoded into a new rule, ensuring the same mistake is never repeated.
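A version-controlled directory of markdown rules is straightforward to load. The following Python sketch assumes a hypothetical layout of one `.md` file per rule, grouped into category subdirectories, with the first heading used as the rule title; the actual layout is not described here.

```python
import tempfile
from pathlib import Path


def load_rules(rules_dir: str) -> dict[str, dict[str, str]]:
    """Load every markdown rule under rules_dir, keyed by relative path."""
    root = Path(rules_dir)
    rules = {}
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        # Use the first markdown heading as the title, else the file name.
        title = next(
            (line.lstrip("#").strip() for line in text.splitlines()
             if line.startswith("#")),
            path.stem,
        )
        key = path.relative_to(root).with_suffix("").as_posix()
        rules[key] = {"title": title, "body": text}
    return rules


# Demo with a throwaway directory containing one hypothetical rule.
with tempfile.TemporaryDirectory() as tmp:
    security_dir = Path(tmp) / "security"
    security_dir.mkdir()
    (security_dir / "unauthenticated-endpoints.md").write_text(
        "# New endpoints must require authentication\n\n"
        "Flag controllers added without an auth filter.\n",
        encoding="utf-8",
    )
    rules = load_rules(tmp)
```

Because rules are plain files, adding one after a post-mortem is an ordinary reviewed change, with the same history and rollback as any other code.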

The Code Quality Agent operates across our various source code repositories, from application code to infrastructure and data, each with tailored rule sets that reflect the unique conventions and risks of that codebase. It's not a replacement for human reviewers; it's a force multiplier that handles the systematic checks so reviewers can focus on design, architecture, and the subtleties that only human judgment can evaluate.

Incident Agent

When an incident occurs, speed matters. Our engineers coordinate in a dedicated Slack incident channel, where fast-moving conversations mix diagnosis, remediation, and coordination. After resolution, we conduct a post-mortem review. This requires a structured post-mortem document that captures the incident timeline, impact, root cause, remediation steps taken, and follow-up actions to prevent recurrence. Every post-mortem document follows the same format and is reviewed in the same weekly forum. Traditionally, the incident commander would spend hours reading through the channel after the fact to make sure the incident was accurately captured and ready for review.

The Incident Agent automates this process entirely. It follows the incident channel along with the incident responders, and produces two outputs:

Incident Summaries: Concise reports covering the issue, impact, contributing factors, remediation steps, and extracted links to observability dashboards, CI pipelines, deployments, and bug tickets. These summaries are optimized for semantic search, feeding into a knowledge base of past incidents that engineers can query when diagnosing new issues.

Post-Mortem Documents: Comprehensive documents following SRE best practices. The agent extracts timelines from the conversation, identifies root causes from the channel context, assesses user and business impact, and generates prioritized action items.

The agent is enriched with metadata from our observability platform, such as incident severity and duration, ensuring the post-mortem timeline is accurate. It operates on our shared LLM platform (the same pluggable, multi-provider backend described in Inside OnePay's AI Journey), meaning we can swap models and providers without code changes.
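Swapping models and providers without code changes usually comes down to selecting an implementation from configuration. The Python sketch below shows that pattern with two toy providers; `LlmProvider`, the provider names, and `summarize_incident` are hypothetical and do not call any real model API.

```python
from typing import Protocol


class LlmProvider(Protocol):
    """Interface every provider must satisfy (hypothetical)."""
    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UppercaseProvider:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


# Registry keyed by a configuration value rather than by import site.
PROVIDERS: dict[str, type] = {
    "echo": EchoProvider,
    "upper": UppercaseProvider,
}


def get_provider(config: dict[str, str]) -> LlmProvider:
    try:
        return PROVIDERS[config["provider"]]()
    except KeyError as exc:
        raise ValueError(f"unknown provider: {config.get('provider')!r}") from exc


def summarize_incident(channel_text: str, config: dict[str, str]) -> str:
    # Calling code depends only on the interface, never on a concrete class.
    return get_provider(config).complete(f"Summarize: {channel_text}")
```

Changing `{"provider": "echo"}` to `{"provider": "upper"}` changes the backend while `summarize_incident` stays untouched, which is the property the paragraph above describes.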

The result: incidents that used to require hours of post-mortem preparation now have draft documents ready in minutes, freeing engineers to focus on prevention.

Data Analytics Agent

OnePay's data lakehouse is built on a canonical data catalog and follows a medallion architecture. Data flows from raw Bronze landing schemas through cleaned Silver layers into curated Gold analytics marts, feature stores, and reporting datasets. Catalogs span every product vertical. Navigating this landscape has historically required deep familiarity with catalog structure, schema conventions, and SQL patterns. Data analysts, product managers, and engineers each bring different questions but face the same barrier: knowing where to look and how to query.

The Data Analytics Agent is a conversational analytics experience embedded directly in our data workspace, right where analysts and product managers already work. Users can ask complex business questions in plain English, such as "Why did transaction volume spike last Tuesday?" or "Which product vertical has the highest customer retention?" The agent formulates a research plan, executes multiple SQL queries to gather evidence from different angles, iterates on its approach based on what it discovers, and delivers a comprehensive report with citations, visualizations, and supporting data tables. This goes well beyond simple query generation; it is an autonomous research workflow that reasons through multi-step analytical problems the way a skilled analyst would.
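The "multiple queries, different angles" part of that workflow can be illustrated with a toy example. The sketch below uses Python's built-in `sqlite3` with made-up sample rows and a fixed, hand-written plan. In the real agent the plan is produced and revised by the model, and the queries run against the lakehouse rather than an in-memory database.

```python
import sqlite3

# Made-up sample data standing in for a curated transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (day TEXT, vertical TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("2024-01-01", "cards", 10.0),
        ("2024-01-02", "cards", 12.0),
        ("2024-01-02", "savings", 40.0),
        ("2024-01-02", "savings", 35.0),
    ],
)

# A fixed "research plan": each step looks at the question from one angle.
RESEARCH_PLAN = [
    ("volume_by_day",
     "SELECT day, COUNT(*) FROM transactions GROUP BY day ORDER BY day"),
    ("spike_day_by_vertical",
     "SELECT vertical, COUNT(*) FROM transactions "
     "WHERE day = '2024-01-02' GROUP BY vertical ORDER BY vertical"),
]


def run_plan(connection: sqlite3.Connection,
             plan: list[tuple[str, str]]) -> dict[str, list[tuple]]:
    """Execute each step and keep the rows as labeled evidence."""
    return {label: connection.execute(sql).fetchall() for label, sql in plan}


evidence = run_plan(conn, RESEARCH_PLAN)
```

The first query locates the spike and the second attributes it to a vertical. Keeping each result under a label is what later lets a report cite the specific query behind each claim.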

For analysts, it reduces time-to-insight. For product managers, it democratizes access to the data that informs our roadmap.

Data Help Agent

The Data Help Agent is an AI coding assistant embedded directly in our notebooks, SQL editors, and messaging channels, purpose-built for the data engineering workflow. It generates and transforms code from natural language, explains complex queries, diagnoses errors with proposed fixes, and leverages our catalog metadata to understand tables, columns, and data lineage in context. Engineers can @-mention specific tables in their prompts and receive suggestions grounded in the actual schema. It accelerates everything from exploratory analysis to pipeline development.

For data engineers, this means faster pipeline debugging, data validation, and onboarding into unfamiliar parts of the lakehouse.

Both data agents are also accessible outside of their native workflows via our Data MCP Server, integrated with our catalog through a custom OAuth application with scopes for SQL execution, warehouse access, functions, and vector search. This ensures that governance policies, access controls, and audit trails are enforced consistently, regardless of where the query originates.
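Scope-based enforcement of this kind can be sketched as a simple set comparison performed before any tool runs. The tool names and scope strings below are hypothetical placeholders for the four capability areas mentioned above.

```python
# Hypothetical mapping from MCP tool to the OAuth scopes it requires.
REQUIRED_SCOPES = {
    "run_query": {"sql", "warehouses"},
    "call_function": {"functions"},
    "vector_search": {"vector-search"},
}


def missing_scopes(tool: str, granted: set[str]) -> set[str]:
    """Return the scopes the token lacks for this tool (empty means allowed)."""
    if tool not in REQUIRED_SCOPES:
        raise ValueError(f"unknown tool: {tool!r}")
    return REQUIRED_SCOPES[tool] - granted


def is_allowed(tool: str, granted: set[str]) -> bool:
    return not missing_scopes(tool, granted)
```

A token granted only `{"sql"}` would be refused for `run_query` with `{"warehouses"}` reported as missing, regardless of whether the request came from a notebook, an IDE, or an agent.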

The Platform Behind the Agents

These five agents share a common thread: none of them were built as isolated projects. Each one is composed of shared platform capabilities.

Our MCP Core SDK provides the server builder, authentication utilities, secret store, safelist access control, metrics integration, and testing framework that every MCP server is built on. Our shared LLM platform provides the pluggable multi-model, multi-provider backend that the Incident Agent and other services rely on. Our observability stack gives us real-time insight into agent performance, tool call latency, and error rates.

This composable approach mirrors the architectural principles that have defined OnePay's platform evolution. When we add a new agent or MCP server, it inherits the security, observability, and operational maturity of the platform. And when we improve the platform, every agent gets better.

Rather than building one-off AI integrations, we’re building infrastructure that compounds.

Conclusion

Our engineers now spend significantly less time on the mundane but necessary aspects of engineering, and more time building new features, enhancing reliability, and solving critical customer issues. This is just the beginning. We have more agents in development, which we will cover in upcoming posts as part of this series.

We are building the tools that are helping us build OnePay. If this mission excites you, whether you are an engineer who wants to push the boundaries of AI-assisted development, a platform builder who thinks in systems, or a data engineer who sees the potential in making data more accessible, we would love to hear from you. Check out our open roles at onepay.com/careers.

Stay tuned.