Every AI reply gets a confidence score. Below 0.85, it stops. Here's the engineering behind that number and why we chose it.
One of the questions we get most often from technical users is: how does RheXa know when it doesn't know something?
The answer is confidence scoring — a system that runs on every AI-generated reply before it's sent. If the score falls below 0.85, the reply is blocked and the conversation is escalated to a human. If it's 0.85 or above, it sends.
Here's how that number was chosen and what the system actually does.
Language models are fluent. That's their greatest strength and their greatest risk. They can produce grammatically perfect, confidently worded replies about things they're completely wrong about. There's no built-in hesitation, no "I'm not sure about this."
Without an external check, a customer could ask "do you cover SE22?" and the AI might confidently reply "yes, we cover SE22" — even if that postcode isn't in the knowledge base. It filled a gap with a plausible answer. The answer is wrong. The customer shows up expecting service you don't offer.
A confidence gate prevents this. It forces the system to evaluate how certain it actually is before it sends anything.
RheXa's confidence score is a composite of three signals:
1. Retrieval similarity score
When the customer's question is converted to a vector and compared against the knowledge base, the closest matching chunks are assigned similarity scores between 0 and 1. A score of 1.0 means the query is semantically identical to that chunk. A score of 0.4 means there's a distant relationship.
If the top retrieved chunk has a similarity score below 0.6, that's a strong signal that the knowledge base doesn't contain a good answer to this question.
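The retrieval signal reduces to a few lines of code. This is a minimal sketch, not RheXa's production implementation — the 0.6 floor comes from the text above; the function names and vector shapes are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieval_signal(query_vec, chunk_vecs, floor=0.6):
    # Top similarity across retrieved chunks, plus whether it clears the floor.
    top = max(cosine_similarity(query_vec, c) for c in chunk_vecs)
    return top, top >= floor
```

If `retrieval_signal` returns `False` for its second value, the knowledge base likely has no good answer, and that drags the composite score down.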
2. Intent coverage — how much of the question is addressed
We parse the customer's question into intent components. "How much does Invisalign cost and how long does treatment take?" has two intent components: price and duration. If the retrieved content covers price but not duration, coverage is 50%. Lower coverage means lower confidence.
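In code, the coverage signal is a simple ratio. A hypothetical sketch — the parsing of a question into intent components is assumed to happen upstream:

```python
def coverage_score(intent_components, covered):
    # Fraction of the question's intent components that the
    # retrieved content actually addresses.
    if not intent_components:
        return 0.0
    hits = sum(1 for c in intent_components if c in covered)
    return hits / len(intent_components)
```

For the Invisalign example above: `coverage_score(["price", "duration"], {"price"})` returns `0.5`.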
3. Language model self-assessment
After generating a draft reply, RheXa prompts the language model to evaluate its own reply: "On a scale of 0–1, how confident are you that this reply is accurate given only the provided context?" This self-assessment is imperfect — models tend to be overconfident — but calibrated against the other signals, it adds useful information.
The three signals are weighted and combined into a final confidence score between 0 and 1.
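A weighted blend might look like the following. The weights here are illustrative placeholders, not RheXa's actual calibration:

```python
def confidence_score(retrieval, coverage, self_assessment,
                     weights=(0.45, 0.30, 0.25)):
    # Weighted combination of the three signals, clamped to [0, 1].
    # These weights are assumptions for illustration only.
    w_r, w_c, w_s = weights
    raw = w_r * retrieval + w_c * coverage + w_s * self_assessment
    return max(0.0, min(1.0, raw))
```

In practice the weights would be tuned against labeled data, the same way the threshold was.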
We tested multiple thresholds during the beta period, tracking two metrics: false positives (AI sent a wrong reply that should have been blocked) and false negatives (AI blocked a reply that was actually correct).
At 0.9: We blocked 34% of all replies. Most of them were correct. Staff spent a lot of time reviewing messages that the AI could have handled fine. False negative rate too high.
At 0.75: We blocked only 8% of replies. But the wrong-reply rate climbed to 4.2% of sent messages. For a business handling 200 messages a week, that's 8 incorrect replies per week going out without review. Too risky.
At 0.85: We blocked 17% of replies. The wrong-reply rate on sent messages dropped to 0.6%. Staff reviewed about 3–4 escalations per day on average for a 200-message-per-week business. That felt like the right balance: most conversations handled autonomously, edge cases caught by humans.
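The sweep above boils down to replaying human-labeled replies against each candidate threshold. A simplified sketch, assuming each sample is a (confidence, was_correct) pair:

```python
def evaluate_threshold(samples, threshold):
    # samples: (confidence, was_correct) pairs from labeled beta replies.
    blocked = [ok for score, ok in samples if score < threshold]
    sent = [ok for score, ok in samples if score >= threshold]
    return {
        "blocked_rate": len(blocked) / len(samples),
        # False negatives: blocked replies that were actually correct.
        "false_negatives": sum(blocked),
        # Wrong-reply rate: share of sent replies that were incorrect.
        "wrong_reply_rate": sum(1 for ok in sent if not ok) / max(len(sent), 1),
    }
```

Running this across thresholds from 0.75 to 0.9 is what surfaced the trade-off described above.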
0.85 isn't a magical number — it's a calibrated one. And it's configurable: if your business has a lower tolerance for errors (medical, legal, financial sectors), you can raise the threshold. If you're in a lower-stakes domain and want more automation, you can lower it slightly. The default is 0.85 because it works well across most service businesses.
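The gate itself is then a one-line comparison against a per-business threshold. The sector values below are hypothetical examples of raising the bar; only the 0.85 default comes from the text:

```python
# Hypothetical per-sector thresholds; only the 0.85 default is from the text.
THRESHOLDS = {"medical": 0.92, "legal": 0.92, "default": 0.85}

def should_send(score, sector="default"):
    # Send autonomously only if confidence clears the configured threshold.
    return score >= THRESHOLDS.get(sector, THRESHOLDS["default"])
```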
When the confidence score falls below 0.85, two things happen simultaneously: the draft reply is blocked, and the conversation is escalated to your team.
Your team sees exactly what the AI was going to say, why it was uncertain, and what it searched for in the knowledge base. You can approve the draft with one click, edit it, or write a custom reply. Either way, the customer gets a response.
Here's a secondary benefit that's easy to miss: the pattern of blocked conversations is a diagnostic tool for your knowledge base.
If you see 15 blocked conversations in a week all related to the question "do you do weekend appointments?", that's a clear signal to add a clear, unambiguous section about your weekend availability to the knowledge base. After you do, those conversations will stop being blocked.
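Surfacing those recurring gaps is a counting exercise. A minimal sketch, assuming blocked conversations have already been tagged with a topic by an upstream classifier (hypothetical):

```python
from collections import Counter

def top_gaps(blocked_topics, n=5):
    # Most frequent topics among blocked conversations. Repeat offenders
    # point at sections missing from the knowledge base.
    return Counter(blocked_topics).most_common(n)
```

Fifteen blocked conversations tagged "weekend availability" would float straight to the top of this list.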
The confidence gate isn't just a safety net. It's feedback. It tells you exactly where your knowledge base has gaps, so you can close them.
The 0.85 threshold reflects a philosophy: an AI that sometimes says "I'm not sure, let me get a human to help you" is more trustworthy than one that always has a confident answer. Confidence in the face of uncertainty isn't a feature. It's a bug.
The goal isn't to automate everything. The goal is to automate the things the AI can handle correctly, and pass everything else to the people who can.
Connect WhatsApp and Gmail or Outlook in ten minutes. AI replies in your tone — with a knowledge base that knows your business.
Start your 14-day free trial →