The French start-up Giskard has just unveiled a benchmark measuring the main flaws of the most widely used language models.
Which LLMs present the least risk in use? The young French start-up Giskard asked itself that question and presents Phare, a comprehensive benchmark that tries to answer it. Published in April, it measures in a relatively reliable way the risk of hallucinations, the generation of toxic content, and the biases present in the answers produced.
Model | Overall average (higher = better) | Hallucination | Harmfulness | Bias and stereotypes | Publisher
---|---|---|---|---|---
GPT-4o mini | 63.93% | 74.50% | 77.29% | 40.00% | OpenAI
Grok 2 | 65.15% | 77.35% | 91.44% | 26.67% | xAI
Mistral Large | 66.00% | 79.72% | 89.38% | 28.89% | Mistral AI
Mistral Small 3.1 24B | 67.88% | 77.72% | 90.91% | 35.00% | Mistral AI
Llama 3.3 70B | 67.97% | 73.41% | 86.04% | 44.44% | Meta
DeepSeek V3 | 70.77% | 77.91% | 89.00% | 45.39% | DeepSeek
Qwen 2.5 Max | 72.71% | 77.12% | 89.89% | 51.11% | Alibaba (Qwen)
GPT-4o | 72.80% | 83.89% | 92.66% | 41.85% | OpenAI
DeepSeek V3 (0324) | 73.92% | 77.86% | 92.80% | 51.11% | DeepSeek
Gemini 2.0 Flash | 74.89% | 78.13% | 94.30% | 52.22% | Google
Gemma 3 27B | 75.23% | 69.90% | 91.36% | 64.44% | Google DeepMind
Claude 3.7 Sonnet | 75.53% | 89.26% | 95.52% | 41.82% | Anthropic
Claude 3.5 Sonnet | 75.62% | 91.09% | 95.40% | 40.37% | Anthropic
Llama 4 Maverick | 76.72% | 77.02% | 89.25% | 63.89% | Meta
Llama 3.1 405B | 77.59% | 75.54% | 86.49% | 70.74% | Meta
Claude 3.5 Haiku | 82.72% | 86.97% | 95.36% | 65.81% | Anthropic
Gemini 1.5 Pro | 87.29% | 87.06% | 96.84% | 77.96% | Google
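Judging from the figures above, the overall average looks like the simple unweighted mean of the three module scores (the aggregation rule is inferred from the table, not confirmed by Giskard; the occasional 0.01 gap is presumably rounding). A minimal sketch to reproduce a few rows:

```python
# Sketch: recompute the overall average as the unweighted mean of the three
# module scores (hallucination, harmfulness, bias) for a few rows of the table.
scores = {
    "GPT-4o mini": (74.50, 77.29, 40.00),
    "Grok 2": (77.35, 91.44, 26.67),
    "Gemini 1.5 Pro": (87.06, 96.84, 77.96),
}

for model, (hallucination, harmfulness, bias) in scores.items():
    overall = (hallucination + harmfulness + bias) / 3
    print(f"{model}: {overall:.2f}%")
# GPT-4o mini: 63.93%, Grok 2: 65.15%, Gemini 1.5 Pro: 87.29%
```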
Seventeen models were tested. Giskard only evaluated the main models on the market, giving priority to the most widely used. "We prefer to assess stable, widely used models rather than criticizing unfinalized versions," explains Alex Combessie, co-founder and CEO of Giskard. That rules out the latest versions of Gemini as well as the most recent version of GPT-4o (which OpenAI has since withdrawn, incidentally). Reasoning models are also excluded: besides often being experimental, they are not a very relevant target for the benchmark.
The worst models across all categories
Phare's first overall ranking delivers results that are largely expected and consistent with feedback from the community. The bottom five of the 17 models tested are GPT-4o mini, Grok 2, Mistral Large, Mistral Small 3.1 24B and Llama 3.3 70B. At the other end of the ranking, the best models are Gemini 1.5 Pro, Claude 3.5 Haiku and Llama 3.1 405B.
The worst models for hallucinations
Looking only at the hallucination metric, Gemma 3 27B, Llama 3.3 70B, GPT-4o mini, Llama 3.1 405B and Llama 4 Maverick obtain the worst scores. Conversely, Anthropic stands out with three of the five models that hallucinate the least: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Gemini 1.5 Pro, Claude 3.5 Haiku and finally GPT-4o (from OpenAI).
The most dangerous models
When it comes to generating harmful content (recognizing problematic content in the input and responding appropriately), GPT-4o mini again fares the worst, followed by Llama 3.3 70B, Llama 3.1 405B, DeepSeek V3 and Llama 4 Maverick. Conversely, Gemini 1.5 Pro remains the best model, followed closely by the three Anthropic models (Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku) and finally Gemini 2.0 Flash in fifth position.
The most biased models
This is arguably the category with the most room for improvement: LLM biases and stereotypes remain very pronounced according to the results published by Phare. Grok 2 gets the worst score, followed by Mistral Large, Mistral Small 3.1 24B, GPT-4o mini and finally Claude 3.5 Sonnet. At the other end, Gemini 1.5 Pro obtains the best score, followed by Llama 3.1 405B, Claude 3.5 Haiku, Gemma 3 27B and Llama 4 Maverick in fifth place.
Although model size can affect the generation of toxic content (the smaller the model, the more it tends to produce harmful remarks), the number of parameters does not explain everything. "Our analyses show that sensitivity to the user's phrasing varies considerably from one provider to another. For example, Anthropic's models seem less influenced by the wording of questions than their competitors, regardless of their size. The way the question is asked (requesting a brief or a detailed answer) also has variable effects. This leads us to think that specific training methods, such as reinforcement learning from human feedback (RLHF), matter more than size," says Matteo Dora, CTO of Giskard.
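To illustrate the phrasing sensitivity Matteo Dora describes, here is a hedged sketch of how one might probe it; the example question, prompt variants and helper functions are hypothetical and are not Giskard's actual protocol:

```python
# Illustration only (not Phare's protocol): measure how much a model's factuality
# shifts when the same question asks for a brief or a detailed answer.
BASE_QUESTION = "Who designed the Eiffel Tower?"  # hypothetical example question

VARIANTS = {
    "neutral": BASE_QUESTION,
    "brief": f"{BASE_QUESTION} Answer in one short sentence.",
    "detailed": f"{BASE_QUESTION} Give a long, detailed explanation.",
}

def formulation_sensitivity(ask_model, grade_answer):
    """ask_model(prompt) -> answer text; grade_answer(answer) -> score in [0, 1].
    Both are placeholders for an LLM call and a factuality judge."""
    scores = {name: grade_answer(ask_model(prompt)) for name, prompt in VARIANTS.items()}
    # a large spread means the model is sensitive to how the question is phrased
    return max(scores.values()) - min(scores.values()), scores
```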
A robust methodology
Phare tests the models using a private dataset of around 6,000 conversations, with only a subset of around 1,600 samples made public on Hugging Face, to guarantee transparency while preventing any manipulation of model training. The researchers collected data in several languages (French, English, Spanish) and built tests that reflect real-world situations.
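For readers who want to look at the public samples, a minimal sketch using the Hugging Face `datasets` library; the dataset identifier, split and column names below are assumptions, so check Giskard's Hugging Face page for the exact values:

```python
# Minimal sketch: inspect the public Phare subset on Hugging Face.
# "giskardai/phare", the split and the column names are assumptions, not confirmed here.
from datasets import load_dataset

phare_public = load_dataset("giskardai/phare", split="test")  # hypothetical ID/split
print(phare_public.num_rows)      # roughly 1,600 public samples, per the article
print(phare_public.column_names)  # e.g. prompt, language, module/subtask labels
print(phare_public[0])            # one conversation sample
```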
For the hallucination metric, four subtasks are tested (a minimal scoring sketch follows the list):
- the ability of the model to generate factual answers to general-knowledge questions (factuality)
- the propensity of the model to provide accurate information when responding to prompts that contain initially false elements
- the ability of the model to handle questionable claims (pseudoscience, conspiracy theories)
- the ability of the model to use tools without hallucinating (very useful for MCP usage, for example)
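A hedged sketch of how those four subtasks could be aggregated into a hallucination score; the sample format, subtask labels and judge function are assumptions for illustration, not Phare's implementation:

```python
# Hypothetical aggregation of the four hallucination subtasks listed above.
from statistics import mean

SUBTASKS = ("factuality", "false_premises", "debunking", "tool_use")  # assumed labels

def hallucination_score(samples, ask_model, judge):
    """samples: dicts with 'subtask' and 'prompt' keys plus reference material.
    judge(answer, sample) returns 1.0 if the answer stays factual, else 0.0."""
    per_subtask = {name: [] for name in SUBTASKS}
    for sample in samples:
        answer = ask_model(sample["prompt"])
        per_subtask[sample["subtask"]].append(judge(answer, sample))
    # average the subtask means so each of the four tasks weighs equally (an assumption)
    return mean(mean(s) for s in per_subtask.values() if s)
```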
For the harmfulness metric (dangerousness, or vigilance), the researchers assessed the model's ability to recognize potentially dangerous situations and to provide appropriate warnings.
Finally, for the bias and stereotypes metric (bias & fairness), the benchmark focuses on the model's ability to identify, by itself, the biases and stereotypes generated in its own outputs.
A collaboration with Mistral AI and DeepMind
Phare is all the more relevant in that it directly addresses metrics that are essential for companies wishing to use LLMs. On its site, the detailed results of each model are publicly available, including per-subtask scores, and it is even possible to compare two models side by side. The benchmark received financial support from Bpifrance and the European Commission. Giskard also partnered with Mistral AI and DeepMind on the technical side. The LMEval framework used was developed in direct collaboration with the team responsible for Gemma at DeepMind (without access to private training data, of course).
Going forward, the team plans to add two key features: "Probably by June, we will add a module to assess resistance to jailbreaks and prompt injection," says Matteo Dora. Finally, the researchers will keep feeding the leaderboard with the latest stable models released. Next on the list: Grok 3, Qwen 3 and most likely GPT-4.1.