Case Studies | Operating Leverage

Case studies

How this plays out inside a real operating workflow.

Each story below follows the same shape as a real engagement: what was true before, what we built, and what changed — measured the same way we measure every pilot.

Technology

AI Support Triage Agent

Pain

Tickets sat untouched for hours, and a third were misrouted on the first pass.

Result

63%

less time to first response

Time to first response: ~5.5 hours → ~2 hours

Tickets requiring escalation rework: 1 in 3 → 1 in 8

Senior staff time on triage: ~10 hrs/week → ~2 hrs/week

Relevant if

Your support team isn't short on people, but short on context the moment a ticket lands — and the same routing questions get answered from scratch every time.

Technology

Engineering Backlog Agent

Pain

Every bug ate 1-3 hours of investigation before an engineer could even start the fix.

Result

~15 hrs

returned to each engineer per month

Investigation time per bug: ~1-3 hours → ~15 min review

Bugs with a usable draft PR pre-attached: 0% → ~40%

Engineer time returned per month: 0 hrs → ~15 hrs/engineer

Relevant if

Engineers spend more time reproducing and investigating bugs than fixing them, and that overhead falls unevenly across the team.

Venture Capital

VC Deal & Portfolio Knowledge Brain

Pain

Deal history and portfolio knowledge were scattered across Notion, email, and partners' heads.

Result

~80%

less time lost reconstructing deal and portfolio history

Time to reconstruct deal/portfolio history before a meeting: ~half a day → minutes

New associate ramp-up on portfolio knowledge: several weeks → under a week

Quarterly portfolio review prep: multi-day data gathering → working session, data pre-assembled

Relevant if

"What do we already know about this?" is a question that only gets answered by pinging whoever happens to remember.

Insurance

Insurance Claims Intake Voice Agent

Pain

Peak calls went to voicemail, delaying claims and stalling routine policy questions.

Result

~35% → ~95%

of inbound calls answered live, no voicemail

Calls answered live during peak windows: ~35% → ~95%

Time from loss reported to FNOL recorded: up to 1 business day → same call

Appointment scheduling turnaround: 2-3 days of phone tag → booked on first call

Relevant if

Your phone line is your busiest channel and your weakest one — especially outside business hours.

Healthcare

Healthcare AI Governance Layer

Pain

Three AI pilots had standing access to PHI and billing with no audit trail.

Result

100%

of agent actions logged, permissioned, and reviewable

Agent actions logged and attributable: 0% → 100%

Billing-code changes requiring human approval: 0% (direct write access) → ~100% (two-tier approval)

Patient messages sent without a recorded human review step: untracked → 0%

Relevant if

AI pilots are already running with real access to sensitive systems, and nobody could currently produce an audit trail if asked.

Technology

Slack/Chat-Ops Async Agent

Pain

Small "can someone..." requests piled up and always landed on the same two or three people.

Result

~25/week

ad-hoc requests handled directly in Slack, without pulling someone off their work

Time to get an answer to a quick 'can someone check...' request: minutes to days → usually under a few minutes

Ad-hoc requests handled without pulling someone off their work: 0 → ~25/week

Recurring 'small stuff' requests fixed at the source: 0 → 2 (in six weeks)

Relevant if

Your team's Slack is full of 'can someone...' requests that all land on the same two or three people.

Fintech

AML/Fraud Alert Triage Agent

Pain

Hundreds of daily transaction-monitoring alerts meant analysts spent most of their time ruling out noise instead of investigating real risk.

Result

~50%

of daily alerts cleared without a full manual investigation

Low-risk alerts an analyst can clear without rebuilding the case: 0% → ~50%

Alert backlog older than 5 business days: ~600 → ~120

Agreement between agent recommendation and analyst's final disposition: n/a (no agent recommendation existed) → ~85% (up from ~55% in week one)

Relevant if

Your compliance or risk team spends more time assembling context for alerts than actually deciding what to do about them.

Legal

NDA & Contract Intake Triage Agent

Pain

Routine NDAs waited days in the same queue as contracts that actually needed a lawyer's judgment.

Result

~70%

of inbound NDAs cleared the same day without attorney review

Inbound NDAs/contracts cleared without attorney review: 0% → ~70%

Time to clear a routine, in-playbook NDA: 2-4 days → same day (usually within hours)

Attorney time on routine document review per week: ~12 hrs/week → ~3 hrs/week

Relevant if

Most of what lands in legal's inbox is close to your standard terms, but every document waits in line behind the ones that actually need a lawyer.

63%

less time to first response

A 40-person support team at a Series B vertical SaaS company

How it works

1. Ticket arrives
2. Agent pulls CRM history + policy docs
3. Drafts priority, routing & reply
4. Human reviews, edits, sends
5. Escalation packet attached if needed

Time to first response

~5.5 hours → ~2 hours

How this was measured

Measured across the general inbound queue over six weeks; P1 incidents (already fast-tracked separately) excluded.

Tickets requiring escalation rework

1 in 3 → 1 in 8

How this was measured

Counted as tickets reopened or reassigned to a different owner within 48 hours of the first reply, same six-week window.
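
As a rough illustration of that definition, the rework rate could be computed along these lines. This is a sketch under stated assumptions, not the pilot's actual reporting code: the `Ticket` type and its field names are hypothetical, and only the 48-hour window comes from the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The 48-hour window comes from the metric definition above.
REWORK_WINDOW = timedelta(hours=48)


@dataclass
class Ticket:
    # Hypothetical fields; a real export from the ticketing tool will differ.
    first_reply_at: datetime
    # Time of the first reopen or reassignment to a different owner, if any.
    first_rework_at: Optional[datetime] = None


def needed_rework(ticket: Ticket) -> bool:
    """True if the ticket was reopened or reassigned within the window."""
    if ticket.first_rework_at is None:
        return False
    elapsed = ticket.first_rework_at - ticket.first_reply_at
    return timedelta(0) <= elapsed <= REWORK_WINDOW


def rework_rate(tickets: list[Ticket]) -> float:
    """Share of tickets that needed rework (0.0 for an empty sample)."""
    if not tickets:
        return 0.0
    return sum(needed_rework(t) for t in tickets) / len(tickets)
```

A ticket reassigned 72 hours after the first reply would not count under this definition, which is one reason the window length matters when comparing before and after.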

Senior staff time on triage

~10 hrs/week → ~2 hrs/week

How this was measured

Self-reported weekly estimate from the two senior support leads, taken before rollout and again at week six.

What it’s connected to

Zendesk (ticket queue, macros, and routing rules)
CRM account records (plan tier, billing status, usage history)
Help center and internal SOP/policy docs
Escalation rules and on-call ownership map

Where humans stayed in control

The agent drafts replies and routing recommendations only — nothing sends to a customer without a human clicking send.
Escalation packets are clearly marked as agent-assembled, so the receiving engineer knows to verify, not assume.
Acceptance rate and time-to-first-response were tracked from week one, so the team could see exactly how much the agent was actually being trusted.
Agents read CRM and policy data but cannot edit account records, issue refunds, or change billing.

The problem

Support volume had outgrown the team's ability to triage manually. Every ticket needed someone to read it, check the account in the CRM, find the relevant policy or runbook, decide whether it was a billing issue, a bug, or a how-to question, and route it to the right queue. That triage step alone was consuming roughly 10 hours a week of a senior support lead's time — time that should have gone to the hardest tickets, not the routing of easy ones. Meanwhile, the average ticket sat for 5.5 hours before anyone looked at it, and about a third of tickets were misrouted on the first pass, creating rework and a second wait for the customer.

What changed

Within six weeks, time to first response dropped from 5.5 hours to about 2 hours — most of that gain came from tickets no longer sitting untouched, since the agent surfaces priority and routing the moment a ticket arrives. Tickets requiring escalation rework fell from 1 in 3 to roughly 1 in 8, because the right context now travels with the ticket the first time. The triage workload on senior staff dropped from about 10 hours a week to 2, which they redirected to the highest-severity accounts and to reviewing the agent's draft replies for quality — the actual judgment work, not the routing.

What made it hard

In its first week, the agent's priority calls were noisy: it leaned hard on billing-related keywords, which show up in plenty of routine account questions, so it kept flagging low-urgency tickets as high-priority. Acceptance of its routing suggestions started around 60%, with support leads overriding it constantly. Reweighting toward account state and escalation history over keyword matches brought acceptance above 90% by week three, but that first week of overrides is part of why the rollout took six weeks instead of two.

~15 hrs

returned to each engineer per month

A 12-engineer product team at a growth-stage B2B platform

How it works

1. Bug filed
2. Agent reproduces + scans code, logs, history
3. Posts investigation summary
4. Opens draft PR if reproduction is confident
5. Engineer reviews; normal CI + review

Investigation time per bug

~1-3 hours → ~15 min review

How this was measured

Tracked via existing time-tracking tags on tickets across an eight-week pilot, compared against the same tags from the prior quarter.

Bugs with a usable draft PR pre-attached

0% → ~40%

How this was measured

“Usable” meant the engineer who picked up the ticket built on the draft rather than discarding it — judged case by case, not against a fixed rubric.

Engineer time returned per month

0 hrs → ~15 hrs/engineer

How this was measured

Self-reported in a short weekly survey across the 12-person team, averaged over the second month of the pilot.

What it’s connected to

Issue tracker (backlog and ticket metadata)
Git repository — code, history, and open PRs
CI logs and test results
Prior incident write-ups and runbooks

Where humans stayed in control

The agent has read access to code, logs, and history — write access is limited to opening draft PRs on its own branches.
Every draft PR is labeled as agent-generated and goes through the same review and CI checks as any human-authored PR.
Nothing merges without a human approval — the agent cannot approve or merge its own changes.
If the agent can't reproduce an issue or form a confident root cause, it posts the investigation summary only and stops there, rather than guessing at a fix.

The problem

The team's velocity was being eaten by the unglamorous parts of bug fixing: reproducing the issue, searching logs and the codebase for the relevant code path, checking whether a similar incident had happened before, and writing up what was found before anyone could start the actual fix. This investigation phase routinely took an engineer 1-3 hours per ticket — often more than the fix itself. With a steady stream of incoming bugs and a fixed-size team, this overhead was a real tax on shipping new features, and it tended to fall disproportionately on whichever engineer happened to be free, regardless of whether they had context on that part of the system.

What changed

Across the team, engineers reported getting back roughly 15 hours a month each that had previously gone to investigation overhead — time that went back into feature work and into reviewing the agent's draft PRs, which took meaningfully less time than writing the fix from scratch. About 40% of incoming bugs ended up with a usable draft PR attached before a human even looked at the ticket. The team's median time from "bug filed" to "fix merged" dropped accordingly, not because the agent replaced engineering judgment, but because it removed the part of the work that was mostly about finding things, not deciding things.

What made it hard

Early on, the agent opened a draft PR for almost every bug, including several it hadn't actually reproduced — those drafts were mostly noise, and a couple of engineers started auto-closing them without reading. We tightened the policy so a PR only opens once reproduction confidence crosses a threshold; below that, it posts the investigation summary and stops. That change is most of the gap between the ~40% draft-PR rate and the much larger share of tickets that got a useful investigation summary.
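
The tightened policy can be sketched roughly as follows. This is a minimal illustration rather than the pilot's implementation: the `Investigation` type, the `next_action` function, and the 0.8 threshold are all hypothetical, since the write-up does not state the real threshold or how reproduction confidence was scored.

```python
from dataclasses import dataclass

# Hypothetical threshold; the write-up does not state the real value.
REPRO_CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Investigation:
    ticket_id: str
    reproduced: bool
    repro_confidence: float  # assumed to be a score in [0.0, 1.0]
    summary: str


def next_action(
    inv: Investigation, threshold: float = REPRO_CONFIDENCE_THRESHOLD
) -> str:
    """Decide what the agent does after investigating a bug.

    The investigation summary is always posted. A draft PR is opened only
    when the bug was reproduced and confidence meets the threshold.
    """
    if inv.reproduced and inv.repro_confidence >= threshold:
        return "post_summary_and_open_draft_pr"
    return "post_summary_only"
```

The useful property is that the fallback is the cheap, low-noise action: when in doubt, the agent posts the summary only and leaves the fix to the engineer.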

Venture Capital

Related workflow: Internal Knowledge Brain

~80%

less time lost reconstructing deal and portfolio history

A 9-person investment team at an early-growth-stage venture fund

How it works

1. Question asked (e.g. "what changed this quarter?")
2. Agent searches CRM, drive, Notion, email
3. Cross-checks figures across sources
4. Answer returned with citations + conflicts flagged
5. Partner verifies and uses in meeting

Time to reconstruct deal/portfolio history before a meeting

~half a day → minutes

How this was measured

Based on associate time logs for partner-meeting prep across a six-week sample, compared against the same prep task before rollout.

New associate ramp-up on portfolio knowledge

several weeks → under a week

How this was measured

Measured for the one new associate who joined during the pilot, against the prior two associates' ramp time as recalled by the partners who onboarded them.

Quarterly portfolio review prep

multi-day data gathering → working session, data pre-assembled

How this was measured

Compared prep time for the quarter immediately before rollout against the first quarterly review run on the new system.

What it’s connected to

Deal CRM (pipeline, diligence notes, decision rationale)
Shared drive (decks, data rooms, board materials)
Notion (meeting notes and partner write-ups)
Email (deal flow and portfolio company communication)

Where humans stayed in control

The agent only surfaces information the requesting partner or associate already has access to — it does not bypass existing document and folder permissions.
Every answer includes a citation back to the source document or thread, so claims can be checked before they're repeated in a board meeting.
The agent answers questions; it does not draft or send communications to founders, LPs, or portfolio companies on its own.
The first month ran with a lightweight 'flag this answer' step so the team could catch and correct retrieval issues before relying on it for board-facing numbers.

The problem

The fund's institutional knowledge was real but scattered — diligence notes in one partner's Notion, an updated cap table buried in an email thread, portfolio KPIs split between a CRM, a spreadsheet, and founder decks, and market research in a doc only the original author could find. Every week this had a cost: associates lost half a day before partner meetings reconstructing 'where are we with this company,' board prep meant manually re-assembling quarters of metrics, new associates took weeks to ramp because the fund's pattern-matching lived in partners' heads, and the fund sometimes couldn't tell if it had already passed on a deal that came back around.

What changed

Time spent reconstructing deal and portfolio history before meetings dropped by roughly 80% — a half-day of digging became a sourced answer checked in minutes, and quarterly portfolio reviews became working sessions instead of data-gathering exercises. New associate ramp-up on the existing portfolio dropped from several weeks to under a week, and on at least two occasions the agent surfaced a prior pass memo on a company that resurfaced with a new round — including a reminder of a concern that, in one case, turned out to still be live.

What made it hard

The hardest part wasn't retrieval — it was conflicts. A founder's deck would say one ARR number, the fund's own tracking spreadsheet would say another, and an old email thread would have a third, all 'correct' at different points in time. Early answers just picked whichever source had been modified most recently, which was sometimes the stale one. We ended up having the agent surface all three with dates and let the partner pick, rather than try to silently resolve it — slower, but it's the version the team actually trusts.
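
One way to picture the "surface everything, let the partner pick" behavior is the sketch below. It is illustrative only: the `SourcedValue` type, the `summarize_metric` function, and the idea of keying each figure on an explicit as-of date (rather than a file's last-modified time) are assumptions, not a description of the deployed system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourcedValue:
    source: str   # e.g. "founder deck", "tracking spreadsheet"
    value: float  # the reported figure, e.g. ARR
    as_of: date   # the date the figure describes, not when the file changed


def summarize_metric(values: list[SourcedValue]) -> dict:
    """Return every reported value, newest first, and flag disagreement.

    Deliberately does not pick a winner: the reader chooses.
    """
    ordered = sorted(values, key=lambda v: v.as_of, reverse=True)
    distinct = {v.value for v in ordered}
    return {
        "conflict": len(distinct) > 1,
        "candidates": [
            {"source": v.source, "value": v.value, "as_of": v.as_of.isoformat()}
            for v in ordered
        ],
    }
```

Sorting by the as-of date instead of the modification time is the design choice that matters here: an old figure in a recently edited file no longer looks like the freshest one.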

~35% → ~95%

of inbound calls answered live, no voicemail

A regional insurance brokerage handling personal-lines claims and policy service for roughly 14,000 policyholders

How it works

1. Call comes in
2. Agent answers, identifies policyholder
3. Looks up policy / claims system
4. Handles FNOL, status check, or scheduling
5. Hands off to a human if outside its lane

Calls answered live during peak windows

~35% → ~95%

How this was measured

Measured from call logs during the defined peak windows (Monday mornings, lunch hours, 48 hours post-weather-event) over four weeks post-rollout, against the same windows the prior month.

Time from loss reported to FNOL recorded

up to 1 business day → same call

How this was measured

Based on claims-system timestamps comparing FNOL creation time to call time, for calls the voice agent handled during the same four-week window.

Appointment scheduling turnaround

2-3 days of phone tag → booked on first call

How this was measured

Estimated from the prior month's callback logs versus bookings made directly during the agent's calls.

What it’s connected to

Policy management system (coverage, limits, payment status)
Claims management system (FNOL intake and claim records)
Agency calendar and scheduling (agent availability and bookings)
Call recording and transcription pipeline

Where humans stayed in control

The agent hands off to a human immediately for anything involving injury, fatality, an upset caller, or a request to speak to a person — no attempt to resolve those itself.
The agent can read policy and claims data to answer questions, but cannot bind coverage, approve a claim, issue a payment, or change a policy.
Every call is recorded and transcribed, and every FNOL the agent takes is reviewed by a human claims handler before the claim moves forward.
The handoff threshold was tuned over three weeks using real call recordings, with the agency reviewing every escalation decision during that period.

The problem

The brokerage's phone line was its busiest channel and its weakest one. Call volume spiked predictably — Monday mornings, lunch hours, and the 48 hours after any local weather event — and during those windows as few as one in three calls were answered live, with the rest going to voicemail and policyholders calling back repeatedly. The cost went beyond frustration: a first notice of loss sitting in voicemail for a day delayed the entire claims timeline, routine policy and coverage questions consumed a disproportionate share of staff time, and appointment scheduling turned into days of phone tag.

What changed

Live answer rate during peak windows went from roughly 35% to about 95% — nearly every call is now answered immediately, including after-hours and weekend calls that previously went to voicemail. FNOL intake that used to wait until the next business day now happens at the moment of the call, routine policy and coverage questions are resolved without staff involvement, and appointment scheduling that took two or three days of phone tag now happens on the first call. Every call being transcribed has also given the team a searchable record of what policyholders actually call about — which already surfaced a recurring coverage question now addressed in renewal materials.

What made it hard

The hardest part was the handoff line, not the call-answering itself. In the first week, the agent tried to talk a clearly upset policyholder through a coverage dispute instead of bringing in a person right away — technically correct information, wrong moment for it. We spent the next three weeks listening to recordings and pulling the escalation trigger earlier: any sign of frustration, ambiguity, or a request for 'someone,' and it hands off immediately, even mid-sentence.
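
The resulting handoff rule is simple enough to sketch: any single trigger is sufficient, and there is no scoring or weighing. The code below is a hedged illustration. The `TurnSignals` type and its field names are hypothetical, and how each signal is actually detected on a live call is not described in this write-up.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    """Signals for one caller utterance (all field names are hypothetical).

    How each signal is detected (keywords, a classifier, and so on) is out
    of scope for this sketch.
    """
    mentions_injury_or_fatality: bool = False
    caller_sounds_frustrated: bool = False
    intent_is_ambiguous: bool = False
    asked_for_a_person: bool = False


def should_hand_off(signals: TurnSignals) -> bool:
    """Return True if any single escalation trigger is present."""
    return any(
        (
            signals.mentions_injury_or_fatality,
            signals.caller_sounds_frustrated,
            signals.intent_is_ambiguous,
            signals.asked_for_a_person,
        )
    )
```

Evaluating this check on every caller turn, rather than once at the start of the call, is what would allow a handoff to happen mid-conversation.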

100%

of agent actions logged, permissioned, and reviewable

A multi-site outpatient healthcare group operating six clinics with roughly 180 staff

How it works

1. Patient/staff action triggers an agent
2. Sandboxed execution checks permission tier
3. Low-risk: executes directly + logs
4. Higher-risk: proposed change queued for approval
5. Staff approves/edits; audit log records all of it

Agent actions logged and attributable

0% → 100%

How this was measured

Based on the centralized audit log's coverage across all three pilots, checked at the end of the ten-week rollout against each system's own action logs.

Billing-code changes requiring human approval

0% (direct write access) → ~100% (two-tier approval)

How this was measured

Before, the billing assistant had direct write access to billing codes with no approval step. After, billing-code and cross-provider changes — about 30% of its write actions — always route to a staff approval queue; the remaining ~70% (low-risk, reversible reschedules) execute directly and are logged.

Patient messages sent without a recorded human review step

untracked → 0%

How this was measured

There was no usable prior baseline — the 'before' state had no record of whether review happened at all, which was itself part of the problem.

What it’s connected to

Patient records system (scoped read access to non-clinical fields)
Practice management system (scheduling and billing)
Patient messaging portal
Centralized audit log mapped to compliance requirements

Where humans stayed in control

Every agent action across all three systems is logged with who/what/when/on whose behalf, and before/after state for any record change.
The patient-Q&A assistant's access to records was scoped down to only the fields needed for common questions — full clinical notes are out of reach.
Billing code changes and cross-provider scheduling changes require human approval before they apply; only low-risk, reversible actions execute directly.
No patient message is sent without a human reviewing it first, and the system records the draft, any edits, and the final sent version.

The problem

Three separate AI pilots had been stood up by different teams over about a year, each solving a real problem, and each running with more access and less oversight than anyone had really decided on. One assistant had read access to the patient records system — including clinical notes — with no logging of what it looked at or why. A second had write access to the practice management system and could move appointments and adjust billing codes directly, indistinguishable from staff edits. A third drafted patient messages with no record of what was sent versus what was edited. Collectively, three systems had standing access to PHI, billing, and patient communications, no audit trails, and no consistent way to answer the question compliance was starting to ask: if something went wrong, could anyone reconstruct what happened and why?

What changed

Every one of the roughly 40,000 agent actions per month across the three systems is now logged, permissioned, and attributable. The governance layer's audit log became the primary evidence in the group's next compliance review, closing what had been an open question on the prior one. Scoping the patient-Q&A assistant's access down also stopped it from occasionally surfacing clinical detail staff didn't need. Under the billing assistant's two-tier model, about 70% of its write actions are low-risk reschedules that execute directly and get logged, while billing-code and cross-provider changes always route to a staff approval queue, which caught two proposed billing-code changes that a reviewer corrected before they applied. The patient messaging review queue, meanwhile, gave the clinics their first real visibility into how much staff actually edit the agent's drafts.
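
The two-tier routing and the audit record can be pictured with the sketch below. It is an assumption-laden illustration, not the deployed governance layer: the class names, the action-type strings, and the choice to allowlist direct actions (so anything unrecognized is queued for approval by default) are all hypothetical. The logged fields mirror the who/what/when/on-whose-behalf and before/after state described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical allowlist of low-risk, reversible action types.
# Anything not listed here is queued for staff approval by default.
DIRECT_EXECUTE = {"reschedule_same_provider"}


@dataclass
class AgentAction:
    agent: str         # which agent acted
    on_behalf_of: str  # the staff member or patient the action was for
    action_type: str
    before: dict       # record state before the change
    after: dict        # proposed or applied state after the change


@dataclass
class GovernanceLayer:
    audit_log: list = field(default_factory=list)
    approval_queue: list = field(default_factory=list)

    def submit(self, action: AgentAction) -> str:
        """Route an action by tier and write an audit entry either way."""
        needs_approval = action.action_type not in DIRECT_EXECUTE
        outcome = "queued_for_approval" if needs_approval else "executed"
        if needs_approval:
            self.approval_queue.append(action)
        self.audit_log.append(
            {
                "who": action.agent,
                "what": action.action_type,
                "when": datetime.now(timezone.utc).isoformat(),
                "on_behalf_of": action.on_behalf_of,
                "before": action.before,
                "after": action.after,
                "outcome": outcome,
            }
        )
        return outcome
```

Logging happens on both paths, so the audit trail covers actions that executed directly as well as those waiting in the approval queue.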

What made it hard

Migrating the patient-Q&A assistant's access was the trickiest piece, because scoping it down broke a few workflows the front-desk team had quietly come to depend on — staff who'd gotten used to asking it things like medication history would suddenly get 'I don't have access to that.' We spent a week going back through real questions with the front-desk team to sort which of those needed a different, narrower field added to the scoped set versus which ones genuinely shouldn't have been answerable by the assistant in the first place.

~25/week

ad-hoc requests handled directly in Slack, without pulling someone off their work

A 25-person product and engineering org at a Series A startup

How it works

1. Someone @-mentions the agent with a request
2. Agent classifies: quick lookup vs. background task
3. Quick: answered directly in-thread
4. Background: works async, posts result or draft when done
5. Any system change applied by a human via normal review

Time to get an answer to a quick 'can someone check...' request

minutes to days → usually under a few minutes

How this was measured

Compared informal Slack response-time observations from the month before rollout to logged agent response times in the read-only lane during the second month.

Ad-hoc requests handled without pulling someone off their work

0 → ~25/week

How this was measured

Counted from the agent's Slack activity log over a four-week window; excludes requests the agent declined as out of scope.

Recurring 'small stuff' requests fixed at the source

0 → 2 (in six weeks)

How this was measured

Identified from the visible request history in Slack — two repeated requests were flagged by the team and resolved permanently rather than re-asked.

What it’s connected to

Slack / Google Chat (agent operates as a workspace member)
Issue tracker — Linear/Jira (status, comments, history)
Git repository, CI, and internal admin/reporting tools (scoped read; PRs as drafts)
Shared docs (edits proposed as suggestions, not direct overwrites)

Where humans stayed in control

Any change to code, docs, or config comes back as a draft or proposal in that tool's normal review flow — the agent never writes directly to a production system.
Every request and response happens in the open Slack/Google Chat thread, never a DM or hidden channel, so the whole team can see what the agent did and why.
Access started with read-only lookups for the first two weeks; drafting and proposal capabilities were added only once that lane was working reliably.
The agent declines requests outside its scoped access rather than guessing at another way to get the information.

The problem

As the team grew to about 25 people, a steady stream of small requests flowed through Slack: pull last month's signup numbers, check why a ticket has been stuck for two weeks, update some copy, write a one-off backfill script. None were big enough to put on a roadmap, but they all needed access to a specific system, so they landed on the same two or three people regardless of whether those people were the right ones for the job. Requests posted in busy channels got a thumbs-up and disappeared, turnaround swung from minutes to days depending on who was around, and with no real cost to not answering, a lot of small-but-useful requests just quietly never got done.

What changed

Within a month, the agent was handling roughly 25 requests a week directly in Slack — quick lookups now resolve in under a few minutes instead of minutes-to-days, and background tasks like draft PRs or doc edits show up in-thread within hours instead of 'whenever someone gets to it.' The bigger shift was behavioral: because asking stopped costing anyone else their time, people started raising the small things they'd previously just lived with, and the visible thread history has already prompted two recurring requests to get fixed at the source instead of handled over and over.

What made it hard

The early failure mode wasn't capability, it was tone and scope. In week one, the agent would sometimes answer a vague request ('can someone look at the signup numbers?') by guessing what was wanted and posting a full report nobody asked for — technically responsive, but it cluttered the thread and occasionally answered the wrong question. We added a quick clarifying-question step for ambiguous requests, which slowed down the fastest lookups slightly but cut down on 'that's not what I meant' replies a lot. The other early issue was ownership confusion — a couple of times, someone assumed the agent had handled a request when it had actually declined it as out of scope. Declines are now posted explicitly, with a reason, rather than silently skipped.

~50%

of daily alerts cleared without a full manual investigation

A payments fintech's compliance team monitoring transaction activity for roughly 120,000 active accounts

How it works

1. Transaction-monitoring system fires an alert
2. Agent pulls account history, KYC profile, related-party data, prior alerts
3. Builds an investigation packet with a recommended disposition
4. Analyst reviews packet, confirms or overrides
5. Decision and reasoning logged to the case file

Low-risk alerts an analyst can clear without rebuilding the case

0% → ~50%

How this was measured

Measured as the share of daily alerts where the analyst accepted the agent's 'clear' recommendation with only a confirmation check, tracked over an eight-week pilot against the prior month's full-investigation rate.

Alert backlog older than 5 business days

~600 → ~120

How this was measured

Backlog count taken from the case-management system at the start of the pilot and again at the eight-week mark.

Agreement between agent recommendation and analyst's final disposition

n/a (no agent recommendation existed) → ~85% (up from ~55% in week one)

How this was measured

Tracked weekly from the agent's recommendation log compared against the analyst's recorded disposition for the same alert.

What it’s connected to

Transaction-monitoring / case-management system (alerts and case queues)
Core banking ledger (account history and balances)
KYC and onboarding records (customer profile, risk rating)
Sanctions and watchlist screening feeds

Where humans stayed in control

The agent recommends a disposition — clear, escalate, or file a SAR — but only a licensed compliance analyst can close an alert or file a report.
Every recommendation includes the specific data points and reasoning behind it, so the analyst can verify rather than rubber-stamp.
The agent cannot move money, freeze an account, or contact a customer — it only reads data and writes to the case file.
Agreement between the agent's recommendation and the analyst's final call was tracked from week one, so the team could see how much the recommendations were actually being trusted.

The problem

The compliance team's transaction-monitoring system threw off several hundred alerts a day — most of which, on investigation, turned out to be nothing: a customer's spending pattern shifted with a new job, a small business had a seasonal spike, a transfer matched a watchlist name by coincidence. But finding that out took real work. For every alert, an analyst had to pull the account's transaction history, prior alerts on the same customer, the KYC profile and risk rating, and check names against sanctions and watchlist feeds — often across four or five separate systems — before they could even start deciding whether it mattered. With an eight-person team working through that volume, the backlog of alerts older than five business days had grown to around 600, and the team's real fear wasn't the noise itself — it was that a genuinely SAR-worthy case was sitting in that backlog, indistinguishable from the routine ones, waiting its turn.

What changed

The agent now picks up every alert the moment it fires, pulls account history, KYC profile, related-party data, and prior alerts across the company's core systems, and assembles an investigation packet with a recommended disposition (clear, escalate, or file a SAR) along with the specific data points behind that recommendation. About half of all alerts are now low-risk enough that an analyst can confirm the agent's 'clear' recommendation in a couple of minutes rather than rebuilding the case from scratch, which cut the alert backlog from roughly 600 to about 120 within two months. The other half, the genuinely ambiguous or higher-risk alerts, now get the analyst's full attention from the start, because the agent has already done the data-gathering. The team has also filed its first SAR sourced from an alert that would previously have sat in the backlog for over a week.

What made it hard

The early version of the agent leaned heavily on transaction velocity and amount thresholds to score risk, which meant it routinely flagged 'escalate' on accounts that were simply high-volume by nature, such as small business merchant accounts processing dozens of transactions a day as a matter of course. Analysts quickly learned to ignore 'escalate' on those account types, which defeated the point. We rebuilt the scoring to compare each account against its own historical baseline and its account-type peer group rather than a single global threshold. That took about three weeks of analyst feedback to get right, but it's most of why the agreement rate between the agent's recommendation and the analyst's final disposition climbed from roughly 55% to over 85%.
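To make the difference between a global threshold and a baseline-relative score concrete, here is a minimal sketch. It is an illustration only, not the production scoring: the function names, the z-score approach, the cutoff of 3.0, the sample numbers, and the simplification to two outcomes ('escalate' or 'clear') are all assumptions.

```python
from statistics import mean, pstdev

def z_score(value, history):
    """How many standard deviations `value` sits from the mean of `history`."""
    spread = pstdev(history)
    if spread == 0:
        # Flat history: any change at all is treated as maximally unusual.
        return 0.0 if value == mean(history) else float("inf")
    return (value - mean(history)) / spread

def recommend(today_count, own_history, peer_history, cutoff=3.0):
    """Escalate only when activity is unusual for this account AND for its peer group.

    Assumption: both comparisons must exceed the cutoff. A real scorer could
    weight them differently and would also cover the 'file a SAR' outcome.
    """
    own_z = z_score(today_count, own_history)
    peer_z = z_score(today_count, peer_history)
    return "escalate" if min(own_z, peer_z) > cutoff else "clear"

# Hypothetical daily transaction counts.
merchant = recommend(47, own_history=[40, 45, 50, 42, 48], peer_history=[30, 60, 45, 50, 40])
personal = recommend(47, own_history=[2, 3, 1, 2, 2], peer_history=[1, 2, 3, 2, 4])
print(merchant, personal)
```

The same 47 transactions in a day are routine for the merchant account and far outside the norm for the personal account, which is the distinction a single global threshold could not make.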

~70%

of inbound NDAs cleared the same day without attorney review

A 4-person in-house legal team supporting a roughly 300-person company with a steady stream of inbound NDAs and vendor contracts

How it works

1. NDA or vendor contract arrives by email or contract tool
2. Agent compares every clause against the legal playbook
3. In-playbook: agent clears it and notifies the requester
4. Off-playbook: agent drafts redlines and flags the deviations
5. Attorney reviews only the flagged items and signs off

Inbound NDAs/contracts cleared without attorney review

0% → ~70%

How this was measured

Measured over a six-week pilot as the share of inbound documents the agent cleared with no attorney touch, against the prior month's baseline where every document required attorney sign-off.

Time to clear a routine, in-playbook NDA

2-4 days → same day (usually within hours)

How this was measured

Compared average turnaround time logged in the contract tool for the month before rollout against the six-week pilot.

Attorney time on routine document review per week

~12 hrs/week → ~3 hrs/week

How this was measured

Self-reported by the four attorneys in a weekly time estimate, averaged over the pilot's second month.

What it’s connected to

Contract management / e-signature system (inbound documents and execution)
Legal playbook and clause library (approved positions and fallback language)
Email (intake from sales, procurement, and counterparties)
CRM or procurement system (deal context tied to each contract)

Where humans stayed in control

The agent clears a document only when every clause matches an approved playbook position; a single off-playbook clause routes the whole document to an attorney.
Every redline the agent proposes cites the specific playbook position it's based on, so the attorney is checking reasoning, not guessing at it.
The agent cannot countersign or send an executed document — clearing a document means notifying the requester it's ready, not finalizing it.
The first month ran with an attorney spot-checking a sample of 'cleared, no review needed' documents to confirm the agent's playbook matching before that check was removed.
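The control rules above can be sketched as a small routing function. This is a hedged illustration rather than the deployed logic: `ClauseCheck`, `route`, the route labels, and the sample positions are hypothetical names chosen for the example, and sending an empty document to review is an added fail-safe assumption.

```python
from dataclasses import dataclass

@dataclass
class ClauseCheck:
    clause: str             # clause name, e.g. "term"
    matches_playbook: bool  # outcome of the playbook comparison
    playbook_position: str  # approved position the clause was checked against

def route(checks):
    """Clear only if every clause matches; otherwise the whole document goes to an attorney."""
    deviations = [c for c in checks if not c.matches_playbook]
    if not checks or deviations:
        # Each flag cites the playbook position behind it, so the reviewer sees the reasoning.
        flags = [{"clause": c.clause, "cites": c.playbook_position} for c in deviations]
        return {"route": "attorney_review", "flags": flags}
    # Clearing only notifies the requester; nothing here signs or sends a document.
    return {"route": "cleared_notify_requester", "flags": []}

standard = [ClauseCheck("term", True, "position A"), ClauseCheck("governing_law", True, "position B")]
with_deviation = standard + [ClauseCheck("non_solicit", False, "position C")]
print(route(standard)["route"], route(with_deviation)["route"])
```

One deviation is enough to route the entire document, which keeps the agent's clearance decision all-or-nothing.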

The problem

The legal team's inbox was a mix of two very different things: the occasional contract that genuinely needed a lawyer's judgment, and a steady stream of NDAs and standard vendor agreements that were, clause for clause, close to identical to the company's own template. But both arrived in the same queue, and an attorney had to read the whole document either way to confirm there wasn't a hidden deviation. With four attorneys covering around 25 inbound documents a week, routine NDAs — the kind sales sent over before a first call, or procurement needed before a vendor demo — could sit for two to four days waiting for someone to confirm what was, most of the time, already true: that it matched the standard terms. Sales and procurement had learned to build that delay into their timelines, which meant legal's queue wasn't just a backlog, it was a tax on how fast deals could move.

What changed

The agent now reviews every inbound NDA and vendor contract against the legal playbook the moment it arrives. When a document matches the playbook's approved positions, the agent clears it and notifies the requester directly — no attorney involved. When something doesn't match, the agent drafts redlines against the specific deviating clauses and routes the document to an attorney with those deviations flagged, so review starts at 'here's what's different' instead of 'read the whole thing.' About 70% of inbound NDAs now clear the same day, most within a couple of hours, and the attorneys' time on routine document review dropped from roughly 12 hours a week to about 3 — time that went back into the contracts that actually need negotiation. Sales and procurement noticed the change before legal did: NDA turnaround stopped being something they had to plan around.

What made it hard

The hardest part was teaching the agent the difference between a clause that was worded differently but meant the same thing, and one that was actually a substantive deviation. Early on, it flagged a lot of documents as 'off-playbook' simply because a counterparty had rephrased a standard confidentiality term or used a different definition that amounted to the same scope — technically different text, same legal effect. That drove the off-playbook rate to nearly half of all documents in week one, barely better than the old process. We spent about two weeks going through flagged documents with the attorneys, building out a library of 'equivalent phrasing' for the playbook's core clauses, before the in-playbook clearance rate stabilized at around 70%.
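As a rough picture of what an 'equivalent phrasing' library does, the sketch below treats a clause as in-playbook when its normalized text matches any phrasing the attorneys have approved for that clause. Everything here is assumed for illustration: the clause key, the sample wordings, and the plain string lookup, which stands in for whatever comparison the real agent uses.

```python
import re

# Hypothetical library: approved wording plus phrasings attorneys judged equivalent.
EQUIVALENT_PHRASINGS = {
    "confidentiality_term": {
        "Obligations survive for two years after termination",
        "Obligations continue for 2 years following termination",
    },
}

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so trivial differences are ignored."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def is_in_playbook(clause_name, clause_text):
    """True when the clause matches any approved phrasing for that clause name."""
    approved = {normalize(p) for p in EQUIVALENT_PHRASINGS.get(clause_name, set())}
    return normalize(clause_text) in approved

print(is_in_playbook("confidentiality_term", "Obligations continue for 2 years following termination."))
print(is_in_playbook("confidentiality_term", "Obligations survive in perpetuity."))
```

A reworded but equivalent clause passes, while a clause that changes the substance still falls through to attorney review.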

Start small, build seriously

Bring your most expensive workflow. Leave the call with a ranked plan for where AI pays off first.
