Google has released a blunt assessment of how reliable today’s AI chatbots really are, and the numbers aren’t flattering. Using its newly launched FACTS Benchmark Suite, the company found that even the best AI models struggle to push factual accuracy above 70%. The leader, Gemini 3 Pro, achieved an overall score of 69%, while rival systems from OpenAI, Anthropic and xAI scored even lower. The takeaway is simple and inconvenient: these chatbots still get roughly one in three answers wrong, even when they sound confident.
The benchmark matters because most existing AI tests measure whether a model can complete a task, not whether the information it produces is actually true. For industries like finance, healthcare and law, that gap can be costly. A fluent response that sounds authoritative but contains errors can cause real harm, especially if users assume the chatbot knows what it’s talking about.
What Google’s accuracy test reveals
The FACTS Benchmark Suite was developed by Google’s FACTS team together with Kaggle to test factual accuracy directly across four real-world scenarios. One test measures parametric knowledge: whether a model can answer fact-based questions using only what it learned during training. Another evaluates search, testing how well models use web tools to retrieve accurate information. A third focuses on grounding, i.e. whether the model sticks to a provided document without adding false details. The fourth covers multimodal understanding, such as correctly reading tables, charts and images.
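Google has not published the exact formula that combines these four tasks into a single FACTS score, but a plausible reading is a simple average of per-task accuracies. The sketch below illustrates that assumption; the task names mirror the four categories above, and every number is a hypothetical placeholder, not data from the benchmark.

```python
# Minimal sketch of how an overall FACTS-style score could be aggregated,
# assuming an unweighted mean over the four task categories.
# All per-task accuracies below are hypothetical placeholders.

from statistics import mean

task_accuracy = {
    "parametric_knowledge": 0.72,  # hypothetical
    "search": 0.68,                # hypothetical
    "grounding": 0.81,             # hypothetical
    "multimodal": 0.47,            # hypothetical
}

overall = mean(task_accuracy.values())
print(f"Overall score: {overall:.0%}")  # -> Overall score: 67%
```

Even under this simple averaging assumption, one weak category drags the whole score down, which is consistent with the results below: multimodal tasks were the models’ weakest area.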
The results show clear differences between the models. Gemini 3 Pro topped the rankings with a FACTS score of 69%, followed by Gemini 2.5 Pro and OpenAI’s GPT-5 at almost 62%. Grok 4 reached about 54%, while Claude Opus 4.5 landed near 51%. Multimodal tasks were consistently the weakest area, with accuracy often below 50%. That matters because these tasks involve reading charts, graphs or images, where a chatbot can easily misinterpret a sales chart or pull the wrong number from a document, producing errors that are easy to miss but difficult to undo.
The bottom line is not that chatbots are useless, but that blind trust is risky. Google’s own data suggests AI is improving, yet it still needs verification, guardrails and human oversight before it can be treated as a reliable source of truth.