The 0.85 rule — how RheXa decides when to send and when to stop

Every AI reply gets a confidence score. Below 0.85, it stops. Here's the engineering behind that number and why we chose it.

5 min read · Mar 3, 2026 · RheXa Team, Engineering

One of the questions we get most often from technical users is: how does RheXa know when it doesn't know something?

The answer is confidence scoring — a system that runs on every AI-generated reply before it's sent. If the score falls below 0.85, the reply is blocked and the conversation is escalated to a human. If it's 0.85 or above, it sends.

Here's how that number was chosen and what the system actually does.

Why AI systems need a confidence gate

Language models are fluent. That's their greatest strength and their greatest risk. They can produce grammatically perfect, confidently worded replies about things they're completely wrong about. There's no built-in hesitation, no "I'm not sure about this."

Without an external check, a customer could ask "do you cover SE22?" and the AI might confidently reply "yes, we cover SE22" — even if that postcode isn't in the knowledge base. It filled a gap with a plausible answer. The answer is wrong. The customer shows up expecting service you don't offer.

A confidence gate prevents this. It forces the system to evaluate how certain it actually is before it sends anything.

How the confidence score is calculated

RheXa's confidence score is a composite of three signals:

1. Retrieval similarity score
When the customer's question is converted to a vector and compared against the knowledge base, the closest matching chunks are assigned similarity scores between 0 and 1. A score of 1.0 means the query is semantically identical to that chunk. A score of 0.4 means there's a distant relationship.

If the top retrieved chunk has a similarity score below 0.6, that's a strong signal that the knowledge base doesn't contain a good answer to this question.
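The similarity check itself is straightforward once you have embeddings. A minimal sketch in Python with NumPy — the toy vectors stand in for real embeddings, which would come from whatever embedding model the pipeline uses:

```python
import numpy as np

def top_similarity(query_vec, chunk_vecs):
    """Cosine similarity of the best-matching knowledge base chunk."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return float(np.max(c @ q))

# Two toy chunk embeddings; real ones come from an embedding model.
chunks = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(top_similarity(np.array([1.0, 0.0, 0.0]), chunks))  # 1.0
```

A query identical to a stored chunk scores 1.0; one pointing in an unrelated direction scores near 0, tripping the below-0.6 signal.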

2. Intent coverage — how much of the question is addressed
We parse the customer's question into intent components. "How much does Invisalign cost and how long does treatment take?" has two intent components: price and duration. If the retrieved content covers price but not duration, coverage is 50%. Lower coverage means lower confidence.
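The coverage arithmetic is simple once the question has been parsed into intent components. A minimal sketch — naive keyword matching stands in for the real intent matcher here:

```python
def coverage(intents, retrieved_text):
    """Fraction of intent components the retrieved content addresses.
    Keyword matching is a stand-in for a real intent matcher."""
    if not intents:
        return 0.0
    text = retrieved_text.lower()
    hits = sum(1 for intent in intents if intent in text)
    return hits / len(intents)

# Price is covered, duration is not: 50% coverage.
print(coverage(["price", "duration"], "Invisalign price starts at £2,500."))  # 0.5
```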

3. Language model self-assessment
After generating a draft reply, RheXa prompts the language model to evaluate its own reply: "On a scale of 0–1, how confident are you that this reply is accurate given only the provided context?" This self-assessment is imperfect — models tend to be overconfident — but calibrated against the other signals, it adds useful information.
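Because the model's answer comes back as free text, it has to be parsed defensively. A minimal sketch — the prompt wording and parsing are illustrative, not RheXa's actual implementation:

```python
def self_assessment_prompt(reply, context):
    """Build the self-evaluation prompt sent back to the language model."""
    return (
        "On a scale of 0-1, how confident are you that this reply is "
        "accurate given only the provided context?\n\n"
        f"Context:\n{context}\n\nReply:\n{reply}\n\n"
        "Answer with a single number."
    )

def parse_score(raw):
    """Clamp the model's answer into [0, 1]; treat unparseable output as 0."""
    try:
        return min(max(float(raw.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0

print(parse_score("0.92"))  # 0.92
print(parse_score("high"))  # 0.0 -- unparseable answers count as no confidence
```

Treating garbage output as zero confidence is the safe default: a model that can't even answer the self-assessment question shouldn't boost the score.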

The three signals are weighted and combined into a final confidence score between 0 and 1.
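As a sketch, the combination step looks like this — the weights below are illustrative placeholders, not RheXa's actual calibration:

```python
def confidence(similarity, coverage, self_score):
    """Weighted blend of the three signals into one score in [0, 1].
    These weights are illustrative, not the production calibration."""
    return 0.4 * similarity + 0.35 * coverage + 0.25 * self_score

# Strong retrieval match, but only half the question covered.
score = confidence(similarity=0.9, coverage=0.5, self_score=0.8)
print(round(score, 3))  # 0.735 -- below 0.85, so this reply would be blocked
```

Note how a single weak signal (50% coverage) drags an otherwise strong reply under the threshold — exactly the behaviour you want for a two-part question where only one part was answered.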

Why 0.85 and not 0.9 or 0.7?

We tested multiple thresholds during the beta period, tracking two metrics: false positives (AI sent a wrong reply that should have been blocked) and false negatives (AI blocked a reply that was actually correct).

At 0.9: We blocked 34% of all replies. Most of them were correct. Staff spent a lot of time reviewing messages that the AI could have handled fine. False negative rate too high.

At 0.75: We blocked only 8% of replies. But the wrong-reply rate climbed to 4.2% of sent messages. For a business handling 200 messages a week, that's 8 incorrect replies per week going out without review. Too risky.

At 0.85: We blocked 17% of replies. The wrong-reply rate on sent messages dropped to 0.6%. Staff reviewed about 3–4 escalations per day on average for a 200-message-per-week business. That felt like the right balance: most conversations handled autonomously, edge cases caught by humans.
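The evaluation above can be reproduced against any labeled sample of (confidence score, was-the-reply-correct) pairs. A minimal sketch with made-up data:

```python
def rates(scored_replies, threshold):
    """Given (score, was_correct) pairs, return (blocked fraction,
    wrong-reply fraction among sent messages)."""
    sent = [(s, ok) for s, ok in scored_replies if s >= threshold]
    blocked = (len(scored_replies) - len(sent)) / len(scored_replies)
    wrong_sent = sum(1 for _, ok in sent if not ok) / len(sent) if sent else 0.0
    return blocked, wrong_sent

# Toy beta sample: score and whether a human judged the reply correct.
data = [(0.95, True), (0.88, True), (0.80, True), (0.78, False), (0.60, False)]
print(rates(data, 0.85))  # (0.6, 0.0): 60% blocked, nothing wrong sent
```

Sweeping `threshold` over a real labeled sample and plotting the two rates is how you find your own version of 0.85.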

0.85 isn't a magic number — it's a calibrated one. And it's configurable: if your business has a lower tolerance for errors (medical, legal, financial sectors), you can raise the threshold. If you're in a lower-stakes domain and want more automation, you can lower it slightly. The default is 0.85 because it works well across most service businesses.

What happens when a reply is blocked

When the confidence score falls below 0.85, two things happen simultaneously:

  1. The draft reply and the confidence breakdown are sent to your team's review queue — along with the retrieved knowledge base chunks and the customer's full message
  2. If you've configured an auto-acknowledgement, a holding message goes to the customer: "Thanks for your message — one of our team will follow up shortly." (This is optional and off by default.)

Your team sees exactly what the AI was going to say, why it was uncertain, and what it searched for in the knowledge base. You can approve the draft with one click, edit it, or write a custom reply. Either way, the customer gets a response.
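The routing decision itself amounts to a simple gate. A sketch of the flow, with hypothetical field names:

```python
def handle_reply(draft, score, threshold=0.85, auto_ack=False):
    """Send the draft if confident; otherwise escalate to the review queue.
    Field names here are illustrative, not RheXa's actual API."""
    if score >= threshold:
        return {"action": "send", "message": draft}
    result = {"action": "escalate", "queue_item": {"draft": draft, "score": score}}
    if auto_ack:
        result["customer_message"] = (
            "Thanks for your message — one of our team will follow up shortly."
        )
    return result

print(handle_reply("Yes, we cover SE21.", 0.91)["action"])   # send
print(handle_reply("Yes, we cover SE22.", 0.62)["action"])   # escalate
```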

What blocked conversations tell you about your knowledge base

Here's a secondary benefit that's easy to miss: the pattern of blocked conversations is a diagnostic tool for your knowledge base.

If you see 15 blocked conversations in a week all related to the question "do you do weekend appointments?", that's a strong signal to add a clear, unambiguous section about your weekend availability to the knowledge base. After you do, those conversations will stop being blocked.

The confidence gate isn't just a safety net. It's feedback. It tells you exactly where your knowledge base has gaps, so you can close them.
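Surfacing those gaps can be as simple as counting blocked topics. A sketch, assuming blocked conversations have already been tagged with a topic label:

```python
from collections import Counter

def gap_report(blocked_topics, min_count=5):
    """Topics that keep getting blocked point at knowledge base gaps."""
    counts = Counter(blocked_topics)
    return [(topic, n) for topic, n in counts.most_common() if n >= min_count]

# A week of blocked conversations, tagged by topic.
week = ["weekend appointments"] * 15 + ["parking"] * 2
print(gap_report(week))  # [('weekend appointments', 15)]
```

Fifteen blocks on one topic is a to-do item; two on another is probably noise, which is why the sketch filters on a minimum count.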

The principle behind the number

The 0.85 threshold reflects a philosophy: an AI that sometimes says "I'm not sure, let me get a human to help you" is more trustworthy than one that always has a confident answer. Confidence in the face of uncertainty isn't a feature. It's a bug.

The goal isn't to automate everything. The goal is to automate the things the AI can handle correctly, and pass everything else to the people who can.


