Testlio launches AI chatbot testing service amid scrutiny

Fri, 17th Apr 2026

Testlio has launched an AI chatbot testing service designed to assess risks before chatbots are released to customers.

The service uses a framework built around four risk domains: safety and security, consistency, accuracy and logic, and user experience. Each assessment covers eight areas, rising to nine for retrieval-augmented generation systems.

The launch comes amid growing scrutiny of how companies test chatbots and virtual assistants as they take on more customer service and brand interaction work. Early adopter data from Testlio showed nearly half of high-severity issues stemmed from safety guardrails and fallback handling.

Those findings point to a common problem in production systems, where models may fail to refuse unsafe requests, escalate correctly, or handle breakdowns when they cannot answer. Human testers are central to every evaluation rather than relying solely on automated checks.

As part of the service, Testlio has introduced a scoring system called LeoPulse. It aggregates results across safety, reliability, and capability to produce what it describes as an AI confidence score that can be tracked over time as models change.

According to Testlio, the score is weighted by risk so serious failures are not masked by stronger results in less critical areas. Assessments also include prioritised issues, severity rankings, recommendations, and a client team that presents findings and next steps.

Coverage areas

The testing framework covers output accuracy and intent resolution, misinformation and hallucination, data privacy and personally identifiable information handling, safety guardrails and fallback handling, bias and fairness, context retention and memory handling, adversarial testing and AI red teaming, and localisation and multilingual behaviour. For retrieval-based systems, it also examines retrieval quality and factual grounding.

Testlio said the model was built around how chatbots fail in live use rather than in controlled evaluation settings. That distinction may matter for companies whose customer-facing systems must deal with varied prompts, edge cases, and shifting user behaviour.

Summer Weisberg, Chief Executive Officer of Testlio, described the launch as a response to reputational and trust risks created by poor chatbot interactions.

"Every interaction is a brand trust moment. When those moments go wrong; a hallucination, an off-brand response, a safety failure, they erode trust and loyalty that took years to build. Our AI Chatbot Testing solution exists to protect that trust, by putting real human judgment between your brand and the AI failures that automated tools struggle to catch," said Summer Weisberg, Chief Executive Officer, Testlio.

Human review

Testers in the company's global network are trained to assess AI behaviour beyond basic functionality, including output quality, intent resolution, hallucination detection, and bias identification.

Testlio also uses a matching system called LeoMatch to select testers based on a client's target audience and market. It said this helps mirror real-world usage conditions and reduces the time needed to assemble testing teams compared with manual selection.

The company said the matching process gets teams running three times faster than manual tester selection and uncovers twice as many critical issues, though it did not provide further methodological detail for those figures.

Among early users is Hallmark+, which Testlio said has worked with the business for five years across 13 platforms. The company said that relationship has covered more than one million users and delivered estimated annual savings of USD $1.34 million.

The rollout comes as software providers and brands face pressure to show that generative AI systems can operate safely, accurately, and consistently in consumer settings. Standard benchmark tests and scripted prompt checks have often struggled to capture the range of failures that emerge once products are widely deployed.

Customers can use the service for a one-off baseline assessment or ongoing validation as models are updated and new features are introduced. The service is available now.

ChatGPT

Key takeaways Explain why it matters Create action plan Future watch

Claude

Key takeaways Explain why it matters Create action plan Future watch

Perplexity

Key takeaways Explain why it matters Create action plan Future watch

Grok

Key takeaways Explain why it matters Create action plan Future watch

Share Share

Add us as a preferred source on Google