Evaluations
The evaluation module takes the produced inference file and the original processed dataset and runs a list of evaluations, producing a final results file in YAML format. The evaluations are represented as metric classes.
We implement several metrics, including a wrapper for the HuggingFace evaluate class (which can accept a list of metrics), EM, F1, classification metrics (accuracy, precision, recall, F1), BERTScore, and semantic similarity (using a customizable cross-encoder). The module can also run metrics from DeepEval, which offers a large collection of LLM evaluations.
Metrics can be either local or global: a local metric runs over each example individually, and the scores are collected and averaged; a global metric runs on the entire dataset at once, for example classification F1.
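As a rough illustration of the local/global distinction (a sketch only; the hypothetical classes below do not reproduce the actual ragfit metric interface):

from collections import Counter

class LocalF1:
    """Local metric: each example is scored on its own, then scores are averaged."""

    def score_one(self, prediction: str, reference: str) -> float:
        pred_tokens, ref_tokens = prediction.split(), reference.split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    def score(self, predictions, references):
        per_example = [self.score_one(p, r) for p, r in zip(predictions, references)]
        return sum(per_example) / len(per_example)


class GlobalAccuracy:
    """Global metric: computed over the entire dataset at once."""

    def score(self, predictions, references):
        return sum(p == r for p, r in zip(predictions, references)) / len(references)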
The configuration contains the following section:
answer_processor:
  _target_: ragfit.processing.answer_processors.regex.RegexAnswer
  capture_pattern: # "<ANSWER>: (.*)"
  stopping_pattern: # "[,.;]"
There is a default processor, called RegexAnswer; it can filter text based on a Python regex capture pattern, and it can also split text using a stopping pattern. For example, in the Chain-of-Thought reasoning we used in the paper, the model is instructed to explain its answer, cite sources if needed, and finally print the final answer in the following format:
<ANSWER>: ...
We can use this format as a capture pattern; thus, models that learn to answer using this pattern (i.e. obey the instruction) will score higher.
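The snippet below is a minimal sketch of this capture/stopping behaviour (it mimics what RegexAnswer does conceptually; it is not the library code):

import re

def extract_answer(text, capture_pattern=r"<ANSWER>: (.*)", stopping_pattern=r"[,.;]"):
    """Sketch of RegexAnswer-style post-processing of a generated text."""
    if capture_pattern:
        match = re.search(capture_pattern, text)
        if match:
            # Keep only the captured group, e.g. everything after "<ANSWER>: ".
            text = match.group(1)
    if stopping_pattern:
        # Cut the text at the first occurrence of the stopping pattern.
        text = re.split(stopping_pattern, text)[0]
    return text.strip()

generation = "The question is ambiguous... <ANSWER>: Paris, the capital of France"
print(extract_answer(generation))  # -> "Paris"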
Next is a list of metrics; each one is a Python class:
metrics:
  - _target_: ragfit.evaluation.metrics.HFEvaluate
    metric_names: [rouge]
  - _target_: ragfit.evaluation.metrics.EM
  - _target_: ragfit.evaluation.metrics.F1
  - _target_: ragfit.evaluation.metrics.BERTScore
    model: microsoft/deberta-large-mnli
Some metrics require additional parameters; for example, the HuggingFace evaluate wrapper requires the metric names, and BERTScore requires an embedding model.
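Each entry follows Hydra's _target_ convention: the class named by _target_ is instantiated, and the remaining keys are passed as constructor arguments. Conceptually (a sketch, not the framework's actual evaluation loop), the list is turned into metric objects like this:

from hydra.utils import instantiate
from omegaconf import OmegaConf

# The same structure as the `metrics` section above.
cfg = OmegaConf.create(
    """
    metrics:
      - _target_: ragfit.evaluation.metrics.HFEvaluate
        metric_names: [rouge]
      - _target_: ragfit.evaluation.metrics.BERTScore
        model: microsoft/deberta-large-mnli
    """
)

# Each item becomes an instance of the named class, with the extra keys
# (metric_names, model) passed to its constructor.
metrics = [instantiate(metric_cfg) for metric_cfg in cfg.metrics]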
Finally:
results_file: my-evaluation.yaml
generated_file: inference.jsonl
data_file: my-processed-data.jsonl
limit:
One needs to provide the generated inference file, the processed dataset, and a filename for the results summary. A limit on the number of rows can be provided for debugging purposes.
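Conceptually (a simplified sketch, not the framework's exact loading code), the evaluation aligns the generated file with the processed dataset row by row and can truncate both via limit:

import json

def load_jsonl(path, limit=None):
    """Read a JSONL file, optionally keeping only the first `limit` rows."""
    rows = []
    with open(path) as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            rows.append(json.loads(line))
    return rows

data = load_jsonl("my-processed-data.jsonl")
generated = load_jsonl("inference.jsonl")
assert len(data) == len(generated)  # rows correspond by position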
Running Evaluations on ASQA
As the final part of the demonstration of the framework with the ASQA dataset and Phi-3 models, we will evaluate the different RAG configurations, with and without fine-tuning.
As a reminder, ASQA has two types of answers: a long answer and short answers. We will evaluate the generated answers against the long answer using RAGAS metrics (faithfulness and relevancy) and against the short answers using the ASQA-defined STR-EM.
Short
Starting with the short answers, the label keyword is answer-short (recall the processing step), and a representative configuration looks like this:
answer_processor:
  _target_: ragfit.processing.answer_processors.regex.RegexAnswer
  capture_pattern: "<ANSWER>: (.*)"
  stopping_pattern:
metrics:
  - _target_: ragfit.evaluation.metrics.StringEM
key_names:
  generated: text
  label: answer-short
  query: query
results_file: evaluation-asqa-baseline.yaml
generated_file: asqa-baseline-dev-generated.jsonl
data_file: asqa-baseline-dev.jsonl
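StringEM checks whether the gold short answers appear verbatim in the generated text. As a rough sketch of the idea (ASQA stores a list of acceptable aliases per short answer; the metric class handles the exact bookkeeping):

def str_em(generated, short_answers):
    """Sketch of string exact match: the fraction of gold short answers for
    which at least one alias appears as a substring of the generated text."""
    text = generated.lower()
    hits = [any(alias.lower() in text for alias in aliases) for aliases in short_answers]
    return sum(hits) / len(hits)

prediction = "<ANSWER>: It premiered in 1999 and was re-released in 2004."
print(str_em(prediction, [["1999"], ["2004", "two thousand and four"]]))  # -> 1.0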
Here are the calls to evaluate the different configurations:
Baseline (the defaults in the configuration already point to the baseline files, so no overrides are needed):
python evaluation.py -cp configs/paper -cn evaluation-asqa-short
Context:
python evaluation.py -cp configs/paper -cn evaluation-asqa-short \
results_file=asqa-context-dev-generated-results.yaml \
data_file=asqa-context-dev.jsonl \
generated_file=asqa-context-dev-generated.jsonl
Context with fine-tuned model:
python evaluation.py -cp configs/paper -cn evaluation-asqa-short \
results_file=asqa-context-ft-dev-generated-results.yaml \
data_file=asqa-context-dev.jsonl \
generated_file=asqa-context-ft-dev-generated.jsonl
Chain-of-Thought:
python evaluation.py -cp configs/paper -cn evaluation-asqa-short \
results_file=asqa-cot-dev-generated-results.yaml \
data_file=asqa-cot-dev.jsonl \
generated_file=asqa-cot-dev-generated.jsonl
Chain-of-Thought with fine-tuned model:
python evaluation.py -cp configs/paper -cn evaluation-asqa-short \
results_file=asqa-cot-ft-dev-generated-results.yaml \
data_file=asqa-cot-dev.jsonl \
generated_file=asqa-cot-ft-dev-generated.jsonl
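Once the four runs have finished, the summaries can be compared side by side; the helper below assumes each results file is a flat mapping of metric names to scores (an assumption; inspect your own output files to confirm their structure):

import yaml

runs = {
    "context": "asqa-context-dev-generated-results.yaml",
    "context-ft": "asqa-context-ft-dev-generated-results.yaml",
    "cot": "asqa-cot-dev-generated-results.yaml",
    "cot-ft": "asqa-cot-ft-dev-generated-results.yaml",
}

for name, path in runs.items():
    with open(path) as fh:
        results = yaml.safe_load(fh)
    print(f"{name}: {results}")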
Long
To evaluate the generated output with respect to the full answer, we use two RAGAS metrics, namely faithfulness and relevancy. The RAGAS metrics require a context for the critic to make a judgment, so they are not relevant for the baseline configuration.
The difference in configuration is in the list of metrics and key names:
metrics:
  - _target_: ragfit.evaluation.deep.Faithfulness
    azure_endpoint: azure.endpoint.com
    azure_deployment: GPT-4-32k-Bot
    api_version: 2024-05-01-preview
  - _target_: ragfit.evaluation.deep.Relevancy
    azure_endpoint: azure.endpoint.com
    azure_deployment: GPT-4-32k-Bot
    api_version: 2024-05-01-preview
    embeddings: BAAI/bge-small-en-v1.5
key_names:
  generated: text
  label: answers
  query: query
  context: positive_passages
The relevancy metric requires an embedder: it generates probable questions based on the generated answer (and the context) and then measures their semantic similarity to the original question.
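As a simplified sketch of the idea (not the RAGAS implementation; the question-generation step, normally done by the critic LLM, is stubbed out with a fixed list):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

original_question = "When did the film premiere?"
# In RAGAS, these are produced by the critic LLM from the answer and context.
generated_questions = [
    "When was the film first shown?",
    "In which year did the movie premiere?",
]

q_emb = embedder.encode(original_question, convert_to_tensor=True)
g_emb = embedder.encode(generated_questions, convert_to_tensor=True)
relevancy = util.cos_sim(q_emb, g_emb).mean().item()
print(relevancy)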
Context:
python evaluation.py -cp configs/paper -cn evaluation-asqa-long \
results_file=asqa-context-dev-generated-results-ragas.yaml \
data_file=asqa-context-dev.jsonl \
generated_file=asqa-context-dev-generated.jsonl
Context with fine-tuned model:
python evaluation.py -cp configs/paper -cn evaluation-asqa-long \
results_file=asqa-context-ft-dev-generated-results-ragas.yaml \
data_file=asqa-context-dev.jsonl \
generated_file=asqa-context-ft-dev-generated.jsonl
Chain-of-Thought:
python evaluation.py -cp configs/paper -cn evaluation-asqa-long \
results_file=asqa-cot-dev-generated-results-ragas.yaml \
data_file=asqa-cot-dev.jsonl \
generated_file=asqa-cot-dev-generated.jsonl
Chain-of-Thought with fine-tuned model:
python evaluation.py -cp configs/paper -cn evaluation-asqa-long \
results_file=asqa-cot-ft-dev-generated-results-ragas.yaml \
data_file=asqa-cot-dev.jsonl \
generated_file=asqa-cot-ft-dev-generated.jsonl