5/26/26: The 99% Reliable AI Agent: How Haize Labs Sells Enterprise AI to Banks, Insurers & Healthcare
Greetings everyone!
I knew you were hankering for some more GTM AI goodness… I could feel it in my brazen psyche.. there you were, on your phone or laptop.. refreshing every few moments for this email and newsletter to arrive.
Well no more waiting for BEHOLD! It is time!
Today we have Julia Macdonald the SVP of GTM and AI Solutions at Haize Labs and probably one of the most brilliant and kind individuals I have met.
As usual, this podcast and the goodies along with it are always free and we would love your feedback of any topic you want covered from the vast network of amazing people we know.
Please subscribe and share with your peeps
Let do it…
You can go to Youtube, Apple, Spotify as well as a whole other host of locations to hear the podcast or see the video interview.
The 99% Reliable AI Agent: How Haize Labs Sells Enterprise AI to Banks, Insurers & Healthcare
I’ve sat through enough enablement chatbot demos to know the pattern. The bot looks perfect in the demo room. Then production hits, and your customers start getting wildly different answers to the same question. One prospect gets “no discount.” The next gets “20% off, just for you.” Your team’s Slack lights up. Your CRO gets nervous. Your AI roadmap goes on a “pause for review.”
Julia MacDonald, SVP of GTM at Haize Labs, has the cleanest framing I have heard for what you are up against:
“Most agents are good enough.. until they’re not.”
That one sentence is going to cost a lot of teams their AI roadmap this year. So this week, I sat down with Julia to walk through how Haize Labs sells 99% reliable agents to banks, insurers, and immigration lawyers, and what you can steal from their playbook by Friday.
Four things came out of this conversation that I think every GTM leader needs to internalize.
1- The “Mostly Fine” trap is a GTM problem, not an engineering one
Most AI agents in the wild today are good enough for personal use. Open Claude. ChatGPT. A handful of internal copilots. The problem starts the moment that same probabilistic behavior steps into a regulated, customer-facing, multi-turn conversation.
Julia’s example is the one to memorize:
“You can’t have a situation where your agent tells Customer A ‘I can’t give you a discount,’ and Customer B ‘maybe I can give you a discount.’ That’s the reliability issue I’m talking about.”
This is what kills enterprise pilots. Not bad UX. Not slow latency. Variance across customers. And it gets worse in multi-turn conversations, because each turn compounds the drift.
Here is what I want you to take from this:
The enterprise buyer is not pricing your agent’s average performance. They are pricing the worst 1% of conversations, because that’s what triggers a compliance review, a refund, or a CFPB letter.
“We hit 95% accuracy in testing” is not a closing number anymore. The buyer needs to know what happens in the other 5%, and how often it happens, and whether it can be bounded.
If your sales motion can’t answer that question with specifics, you are losing deals to vendors who can. Haize Labs is one of those vendors.
The shift for your team this week: stop demoing your agent’s best behavior. Start demoing the boundaries it cannot cross.
2- The hypothesis-first filter (almost mathematical)
This was the part of the conversation I rewound twice. Julia does not pick verticals because they’re “hot.” She runs every use case through a near-mathematical filter before her 2-person GTM team touches it.
Her criteria, paraphrased:
The workflow is document-heavy
It cannot be outsourced to a $5/hour worker (paralegals, nurses, claims adjusters, not data entry)
There is a right vs. wrong answer that someone cares about
The use case is sufficiently well-defined that you can score it
The cost of being wrong is high enough to pay six figures to solve
A human is still on the hook for final accountability (no AI lawyers, no AI doctors, yet)
Concrete example she gave: immigration law. Document-heavy. Talent-constrained (you cannot hire your way out of it). Right and wrong are codified by visa requirements. Stakes are someone’s life trajectory. Lawyer still signs the filing.
Contrast that with: an AI that generates legal letters from a template. Document-heavy, sure. But the cost of a mediocre letter is low. There’s no scarcity. The buyer will not pay $100K for it. Wrong shape, wrong sale.
What to do this week: Take your top three AI agent use cases. Score each against those six criteria. Anything that fails two or more is a roadmap distraction. Kill it or downgrade it.
3- Code of Conduct + Supervisor Models + Red Teaming (the actual reliability stack)
This is where the Haize Labs technology gets interesting, and where I learned something I’m bringing into my own prompt engineering work at Momentum.
Most teams use “LLM as a judge” to police their agents. Another LLM, another prompt, asking “did the first model do a good job?” Julia’s team thinks that approach is too brittle, because the judge is governed by whoever wrote the rubric prompt that morning.
Haize Labs’ stack:
Code of Conduct. A comprehensive, enterprise-specific rules document. For the bank example she walked through (voice debt collection agent), this includes every CFPB regulation, every authentication rule, every prohibited phrase.
Supervisor Models. Models actually fine-tuned on that Code of Conduct. Not a judge prompt. A judge architecture.
Adversarial Red Teaming. Algorithms that try to break the agent against the Code of Conduct. They do not stop. They generate thousands of prompts (Julia mentioned one run hitting 15,000 adversarial queries) using different languages, role-play, multi-turn escalation, emojis, the works.
Iterative fine-tuning. Every successful attack becomes training data. You loop until reliability stabilizes at 99%+.
Why this matters even if you never buy Haize Labs:
“We tested it” is not a methodology. Your AI vendor evaluation needs to ask: show me your adversarial test suite, in numbers.
For internal builds, you can adopt a version of this. Write your own Code of Conduct (the rules your agent must never break). Generate adversarial prompts (Claude or GPT will help). Score every release against the same suite.
This is the loop that gets you from “demo magic” to “production-grade.” There is no shortcut.
Tactical takeaway: Before your next agent release, draft a one-page Code of Conduct. Generate 100 adversarial prompts against it. Score the agent. Track that number across versions. That alone will outperform 90% of your competitors’ QA.
4- The 2-person AI-native GTM stack (Julia’s actual workflow)
Julia runs partnership, sales, and marketing for Haize Labs with one other person. Her quote: “Without AI, I don’t think we would have survived.” Here is the motion, end to end.
Her enterprise prospecting flow:
Start with a hypothesis (one of her use cases that passed the filter)
Use AI to generate a list of companies matching that use case
Use AI to generate a list of executives at those companies
Enrich LinkedIn profiles using Clay + OpenAI
Land in Dripify for low-volume, high-touch LinkedIn outreach (not blast spam)
Her pre-call brief flow:
Custom prompt that pulls together: customer pain points, recent announcements, competitor moves, Haize capability mapping
Output: a structured brief with three likely pain points and two recommended use cases to discuss
Result, in her words: “I am smarter today than I was six months ago, every single call.”
The honest gap she named, which is worth respecting:
They are not yet doing closed-loop attribution from outreach to revenue. The reason is speed: every hour spent building the meta-agent is an hour not spent talking to enterprise prospects. That’s a real tradeoff for a 2-person team, and it’s worth naming so you don’t beat yourself up for the same one.
The strategic shift for your team:
Tooling is no longer the bottleneck. Hypothesis quality is. A 2-person team with a sharp hypothesis will outperform a 10-person team running spray-and-pray.
The pre/post-call brief workflow is the single highest-ROI thing you can build this quarter. It compounds. Every call makes you smarter on the next one.
Closed-loop measurement is the next frontier. The teams that wire it up first will have an unfair advantage.
Why this matters
The next 18 months of enterprise AI will be won by teams who solve two things at once: reliability that survives the worst 1% of conversations, and a GTM motion lean enough to make money before the funding clock runs out. Julia is doing both. She is also one of the most generous teachers I have had on this podcast.
If you build agents, you need a Code of Conduct. If you sell agents, you need the hypothesis filter. If you run a 2-person GTM team, you need Julia’s prospecting stack. If you have none of the above, you need to start this week.
That’s three hours of work. It will save you a quarter of wasted engineering.
The 99% Reliability Filter
7 criteria to test any AI agent use case before you build it, buy it, or stake your roadmap on it.
A GTM AI Academy field guide. Inspired by my podcast conversation with Julia MacDonald, SVP of GTM at Haize Labs.
I have watched dozens of GTM teams deploy AI agents in the last 18 months. Most of them work great in the demo. Then production hits. One prospect gets “no discount.” The next gets “20% off.” Your team’s Slack lights up. Your CRO gets nervous. Your AI roadmap goes on a quiet “pause for review.”
I sat down with Julia MacDonald, SVP of GTM at Haize Labs, for the GTM AI Podcast this week. Haize Labs is the only company I have seen solve the AI agent reliability problem at the enterprise level. They sell to banks, insurance companies, immigration lawyers, and healthcare systems.
That one sentence is going to cost a lot of teams their AI roadmap this year.
Here is why. Your enterprise buyer is not pricing your agent’s average performance. They are pricing the worst 1% of conversations, because that is what triggers the compliance review, the customer refund, or the CFPB letter. “We hit 95% accuracy in testing” stopped being a closing number a year ago. If you cannot tell the buyer what happens in the other 5%, how often it happens, and whether it can be bounded, the buyer finds a vendor who can.
So before you build the next agent, you need a filter. Something that tells you, in 5 minutes, whether this use case is worth your engineering quarter, your roadmap real estate, and your reputation.
The 7 criteria
You will score each use case 1 to 5 on each criterion. 5 means “strongly true.” 1 means “not true at all.” Be honest, not optimistic. The whole point of the filter is that optimism is what got the failed pilots funded in the first place.
1- Document-heavy workflow
The work being automated is fundamentally about reading, interpreting, and producing structured documents. Not a one-line answer. Real document logic.
YES: Immigration visa filings. Insurance claims adjudication. Loan underwriting packets. Pre-trial discovery review.
NO: Generic chat replies. Single-field data entry. Status lookups. “What was my last invoice?”
Ask yourself: If I removed the documents from this workflow, is there anything left for the agent to actually reason about?
If the answer is “no, the documents are the work,” you have a strong candidate. If the answer is “honestly, this could be a Zapier task,” you are not in agent territory.
2- Cannot be cheaply outsourced
The work cannot be handed to a $5/hour offshore worker without losing quality. The labor itself is skilled, certified, or contextually scarce.
YES: Paralegal review. Registered nurse triage. Claims adjuster decisions. Senior tax preparation.
NO: Lead list cleanup. Data normalization. Generic transcription. Basic moderation.
Ask yourself: If “we already outsource this for cheap and it works,” will I really pay six figures for an AI version?
This is the criterion that disqualifies more roadmap items than any other. There is a giant market for cheap labor automation. There is no market for replacing cheap labor with enterprise software. The economics force you toward regulated, specialist, or scarcity-bound roles. That is a feature of this filter, not a constraint.
3- Has a right vs. wrong answer
The workflow has verifiable outcomes. There is a way to score whether the agent did it correctly. This is what makes 99% reliability even measurable.
YES: Did the visa filing meet the regulatory requirements? Did the claim get categorized to the correct billing code? Did the underwriting hit the right tier?
NO: Did the marketing copy “feel on brand”? Did the meeting summary capture “the vibe”? Did the rep coaching feedback “land well”?
Ask yourself: Could I write a rubric that a human expert would consistently agree with?
If the answer is no, you have no signal to fine-tune against. You can ship an agent, but you cannot make it reliable in a measurable way. That is fine for some internal tools. It is not fine for an enterprise contract.
4- High stakes for being wrong
The cost of a single wrong answer is high enough that someone will pay to prevent it. Regulatory fines, customer harm, reputational damage, or significant revenue loss.
YES: Voice debt collection (CFPB exposure). Anti-money laundering screening. Drug interaction checks. Trade settlement decisions.
NO: Internal meme generation. Low-stakes ranking suggestions. Anything where a wrong answer is shrugged at.
Ask yourself: If this agent got it wrong in 5% of cases, would anyone notice or care enough to act on it?
If the answer is “nobody would notice,” nobody will pay six figures to prevent the 5%. The math collapses. The deal stalls in procurement. Stakes are what make reliability a budget line.
5- Talent scarcity or capacity ceiling
The human role being augmented is hard to hire, hard to retain, or hits a capacity wall. The buyer is not just optimizing cost. They are unblocking growth they cannot otherwise hit.
YES: Paralegals (active talent war). Nurses (shortage in every market). Senior claims adjusters (decade to train).
NO: Roles with abundant labor supply, low turnover, and predictable training pipelines.
Ask yourself: If my buyer could hire their way out of this, would they have already?
If yes, you are selling cost savings into a non-painful problem. If no, you are selling capacity expansion into an actual constraint. The second one closes. The first one does not.
6- Economically worth automating
The value of automating the workflow justifies a real software contract. Not $10 of saved labor per run. Real money: hours of specialist time, days of cycle time, or revenue at risk.
YES: Auto insurance claims cycle time from 60 days to 3 days. Months of paralegal time per filing automated. Days of underwriter review compressed into hours.
NO: Saving 30 minutes of a $15/hour task. Marginal wins that don’t move a P&L line.
Ask yourself: Can I write the ROI story on a single Post-It in a way the CFO would believe?
If the value of automation is not big enough to fit on a Post-It and survive a 30-second sniff test, the deal will die in procurement. You need a story that lands fast and lands big.
7- Human accountability preserved
A human is still on the hook for the final outcome. The agent collects, structures, recommends, or drafts. The licensed professional signs, approves, or releases. This is the difference between an AI tool and an AI liability.
YES: Agent prepares the visa filing, lawyer signs and files. Agent flags potential fraud, analyst reviews and acts.
NO: “The AI is the doctor.” “The AI is the lawyer.” “The AI is the financial advisor.” Not yet. Not for years.
Ask yourself: If this agent makes a critical mistake, is there a named human who will own it?
If the answer is “no one,” you have built a lawsuit, not a product. Julia’s quote on this was sharp: Haize Labs deliberately does not build AI lawyers or AI doctors. The legal, regulatory, and trust infrastructure for full-replacement AI is years behind the technology. Build for the assistive lane. The full-replacement lane is where pilots go to die.
How to score
For each criterion, score 1 to 5:
5 = Strongly true. Perfect fit for the criterion.
3 = Partially true. Mixed signals. Would need work.
1 = Not true at all. The criterion does not hold.
Add them up. Maximum score is 35. Your total tells you the move.
The decision tree
30 to 35: Build it. High value, well defined, accountability-safe. Prioritize this on the roadmap. Hire or partner for the reliability engineering (Code of Conduct, supervisor models, adversarial red teaming). This is the agent worth your A-team.
23 to 29: Pilot it. Real signal, but at least one criterion is shaky. Run a contained pilot with one named customer or one internal team. Define the exit criteria upfront. If the weak criterion does not move during pilot, do not graduate it to production.
15 to 22: Wait or partner. The shape is wrong for a full build. Either wait for the failing criteria to mature (models get better, regulations clarify), or buy a vendor that has already solved the hard part. Do not build this in-house.
Under 15: Kill it. This use case is a roadmap distraction. Either the work isn’t valuable enough, isn’t well defined enough, or the accountability lane is too murky. Cut it, communicate the decision, and move the team to a higher-scoring use case.
The hardest move on this list is killing the 18-point project that “someone really wants.” An executive sponsored it. A customer asked for it. Someone wants to see what AI can do. None of those are reasons to ship. The filter exists to give you cover. Show the score. Hand them the rubric. Let the math do the conversation.
Three worked examples
Let me show you the filter in action on three real-shape use cases. Scores are mine. You may score them differently for your buyer or your industry, which is the point.
Example A: AI agent that drafts customer claim denial letters for a regional health insurer
The agent ingests the original claim, the policy terms, and the adjudicator’s structured notes. It drafts the denial letter using state-specific regulatory language. A licensed claims supervisor reviews and signs every letter before it goes out.
Score:
1- Document-heavy: 5 (claim, policy, notes, regulations)
2- Cannot be cheaply outsourced: 4 (licensed adjuster work, not BPO)
3- Right vs. wrong: 5 (regulatory compliance is testable)
4- High stakes: 5 (state insurance commissioner exposure)
5- Talent scarcity: 4 (claims supervisors are hard to hire)
6- Economically worth it: 5 (hours of supervisor time per letter, hundreds per week)
7- Human accountability: 5 (supervisor signs every letter)
Total: 33 of 35. Verdict: Build it. This is exactly the shape Haize Labs would take on.
Example B: AI agent that summarizes weekly team Slack channels into a “what you missed” digest
The agent reads Slack channels and writes a 200-word digest for each team member who was out of the loop that week.
Score:
1- Document-heavy: 2 (chat is not document logic)
2- Cannot be cheaply outsourced: 1 (anyone could write this digest)
3- Right vs. wrong: 1 (no rubric for “good summary”)
4- High stakes: 1 (if it misses something, nobody is fined)
5- Talent scarcity: 1 (no one is hiring “Slack digest writers”)
6- Economically worth it: 2 (saves maybe 15 min per person per week)
7- Human accountability: 1 (no one owns a missed digest)
Total: 9 of 35. Verdict: Kill it. This is a nice-to-have personal productivity feature. It is not a fundable enterprise agent. Build it as a free internal tool with the off-the-shelf models you already have. Stop trying to make it a product.
Example C: AI agent that screens inbound sales emails for ICP fit and routes high-fit leads to AEs
The agent reads inbound emails, scores ICP fit using your customer data, and routes hot leads to the right account executive with a recommended response template.
Score:
1- Document-heavy: 3 (some structured email + CRM data, but lightweight)
2- Cannot be cheaply outsourced: 3 (SDRs do this, but they are not cheap or fast)
3- Right vs. wrong: 4 (ICP fit is fairly testable against historical close data)
4- High stakes: 3 (mis-routed leads cost revenue, but no regulatory exposure)
5- Talent scarcity: 3 (SDR turnover is real but not crisis-level)
6- Economically worth it: 4 (cuts response time, improves conversion measurably)
7- Human accountability: 4 (AE owns the lead from the moment it routes)
Total: 24 of 35. Verdict: Pilot it. Real signal, but the stakes and scarcity are mid-grade. Pilot with one segment of your inbound flow. Define a 60-day exit criterion: ICP precision must hit X, response time must drop by Y. If both move, graduate. If neither, contain.
Notice how scoring forces clarity. The Slack digest is not “a bad idea.” It is a wrong-shape idea for an enterprise agent investment. The ICP router is not “a great idea.” It is a contained pilot with measurable exit criteria. The filter turns vibes into decisions.
Why this matters more than usual right now
The next 18 months of enterprise AI will be won by teams who solve two things at once: reliability that survives the worst 1% of conversations, and a roadmap discipline that does not waste a quarter on a use case that never had a chance.
Most teams are missing both. They are demo-buyers, evaluating AI agents on what the demo can do. They are vibe-builders, greenlighting use cases because “AI is hot” or because an executive watched a keynote. The math catches up with them at the end of the quarter, when the agent ships, fails in some embarrassing way, and the CFO asks why the line item exists.
The 7 criteria above are the math. Run them honestly. The math will tell you what your gut already suspected.
My challenge to you, this week
Pick your top 3 AI agent use cases. Score them honestly against the 7 criteria. Then have a 20-minute conversation with your team about the lowest-scoring one, and decide together: kill it, downgrade it, or change it. Done by Friday. mk? mk.
If you do this, three things happen:
You save your team a quarter of wasted engineering on a use case that was never going to land.
You walk into your next leadership AI update with a defensible scoring framework, not opinions.
You become the named person other people in your org bring AI ideas to for review. The framework gives you a position. The position gives you leverage. The leverage compounds.
If you want the interactive version of this filter (live scoring, radar chart, real-time verdict), you can grab it from the GTM AI Academy site. If you want to hear Julia walk through the full Haize Labs methodology (Code of Conduct, supervisor models, adversarial red teaming, the 2-person GTM team that sells reliability to Fortune 500 banks), the full podcast episode is in the show notes.
The next 18 months of enterprise AI will be won by the people who said no to the wrong use cases this quarter. I hope this is the post that helps you say no.
More to come next week, where I will be unpacking the lossless RAG approach Julia teased at the end of our conversation.


Solid learnings. Thanks for sharing