Evaluation Metrics help you assess the quality of the answers generated by the Assistants you create in Globant Enterprise AI. They allow you to measure aspects such as accuracy, clarity, consistency and robustness of responses, enabling continuous improvement of your assistants' performance.
Each metric returns a score between 0 and 1, together with a short explanatory message provided as feedback. For most metrics, higher values indicate better performance; for Hallucination and Noise Sensitivity, lower values are better.
This metric evaluates how complete and understandable the Assistant's response is compared to the expected response.
Accuracy is measured by comparing key points, relevant data, and ideas between the expected answer and the Assistant’s response.
Clarity is assessed based on the structure, readability, and conciseness of the generated response.
A single combined score is assigned to reflect both accuracy and clarity.
- 1.0 – The response is complete and clear.
- 0.0 – The response is incomplete and confusing.
Low scores indicate missing information, irrelevant content, or poor readability.
Measures how relevant the retrieved documents are for generating an appropriate answer to the original question.
Each document is compared to the expected answer and assigned a relevance score:
- 1.0 – Fully answers the question directly.
- ≥ 0.8 – Covers most of the content with high similarity or effective paraphrasing.
- ≥ 0.5 – Provides useful, though partial, information.
- ≥ 0.3 – Has a vague connection, but is not directly useful.
- 0.0 – Irrelevant to the question.
The final score is the average of all individual document scores, weighted equally.
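The averaging step above can be sketched in a few lines. This is a minimal illustration, not the product's implementation; the per-document scores are assumed to have already been assigned by the evaluator:

```python
def context_relevance(doc_scores):
    """Average the per-document relevance scores (each in [0, 1])."""
    if not doc_scores:
        return 0.0
    return sum(doc_scores) / len(doc_scores)

# Example: one fully relevant document, one partially useful, one irrelevant.
print(context_relevance([1.0, 0.5, 0.0]))  # → 0.5
```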
Evaluates whether the assertions made by the Assistant are supported by the retrieved documents.
The evaluation method breaks the response down into individual factual statements and verifies whether each one is substantiated by the retrieved context.
The final score is calculated using the following formula:
Faithfulness = (number of statements supported by the context) / (total number of statements)
The values obtained by applying the formula can be interpreted as follows:
- 1.0 – All statements are supported.
- 0.0 – None of the statements are supported.
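The formula can be sketched as follows, assuming the per-statement verdicts (one boolean per factual statement) have already been produced by the verification step:

```python
def faithfulness(statement_supported):
    """Fraction of factual statements supported by the retrieved context.

    `statement_supported` is a list of booleans, one per statement.
    """
    if not statement_supported:
        return 0.0
    return sum(statement_supported) / len(statement_supported)

# Example: three of four statements are supported by the context.
print(faithfulness([True, True, False, True]))  # → 0.75
```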
Detects whether the response contains fabricated information or content that is not supported by the retrieved documents.
It evaluates whether the content is properly substantiated by the context.
- 0.0 – The response is fully substantiated (no hallucination).
- 1.0 – The response is entirely fabricated (unsupported).
Measures how sensitive the response is to irrelevant or misleading information within the context.
The evaluation method identifies the Assistant's assertions and verifies whether each one meets the following two conditions:
- They are supported by the retrieved documents.
- They are consistent with the expected response.
Only statements that meet both conditions are considered valid.
The final score is calculated using the following formula:
Noise Sensitivity = (number of invalid claims) / (total number of claims)
The values obtained by applying the formula can be interpreted as follows:
- 0.0 – All statements are valid (high robustness).
- 1.0 – All statements are invalid (high sensitivity to noise).
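Because a claim is valid only when both conditions hold, the ratio can be sketched like this (an illustrative example, with the two boolean verdicts per claim assumed to come from the evaluator):

```python
def noise_sensitivity(claims):
    """Fraction of invalid claims.

    `claims` is a list of (supported_by_context, consistent_with_expected)
    boolean pairs; a claim is valid only if both conditions hold.
    """
    if not claims:
        return 0.0
    invalid = sum(
        1 for supported, consistent in claims
        if not (supported and consistent)
    )
    return invalid / len(claims)

# Example: two valid claims, one unsupported, one inconsistent → 2/4 invalid.
print(noise_sensitivity(
    [(True, True), (True, True), (False, True), (True, False)]
))  # → 0.5
```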
In addition to the five main metrics, each evaluation captures automatic indicators about the Assistant's technical performance. These values do not affect the qualitative evaluation, but provide useful insights into efficiency:
- Execution time.
- Number of tokens used.
- Generation cost.
- Average relevance of the retrieved context.
Available since version 2025-04.