Running an Evaluation Script for the Shared Task Datasets
- In the shared task, we use the https://github.com/kasnerz/factgenie/ framework for running the LLM evaluation (annotating the data-to-text system outputs with LLMs).
- We strongly encourage you to run it locally and to visualize and evaluate your outputs in factgenie (a minimal setup sketch follows this list).
- factgenie has a CLI command
factgenie run-llm-eval
for running the evaluation from the command line.
- factgenie can also run the same evaluation interactively in a browser at http://127.0.0.1:5000/llm-eval, which is great for debugging the prompt.
- We also use the factgenie web server for visualizing the annotations and the datasets themselves.
- Before using any dataset, its data loading needs to be set up in factgenie. Luckily, we have already added the shared task st-* datasets to factgenie for you.
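If you want to run factgenie locally, a minimal setup sketch is shown below. The exact steps are described in the factgenie README, so treat the install and server commands here as assumptions rather than authoritative instructions.

git clone https://github.com/kasnerz/factgenie.git
cd factgenie
# install factgenie and its dependencies (assuming an editable install is supported)
pip install -e .
# start the local web server (assumed invocation; see the README for the exact command)
factgenie run --host=127.0.0.1 --port 5000

Once the server is running, the interactive LLM evaluation mentioned above is available at http://127.0.0.1:5000/llm-eval.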
Looking at the result: Shared Task datasets and annotations
Before we start, let’s look at the running factgenie instance deployed at:
- There are Openweather domain input data examples released as the st24-openweather dataset.
- The outputs from a mistral model were generated using https://github.com/kasnerz/quintd/ (in particular, this version).
- The annotations with the id st24-demo-openweather-dev-llama3 were generated using the following command:
factgenie run-llm-eval \
--campaign_name st24-demo-openweather-dev-llama3 \
--dataset_name st24-openweather \
--split dev \
--llm_output_name mistral \
--llm_metric_config factgenie/llm-eval/ollama-llama3.yaml
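To evaluate your own outputs later on, the same flags apply. Here is a sketch with placeholder values: the campaign name is arbitrary, and awesome_model / st24-gsmarena refer to the hypothetical example in the next section.

factgenie run-llm-eval \
--campaign_name st24-awesome-model-gsmarena-dev \
--dataset_name st24-gsmarena \
--split dev \
--llm_output_name awesome_model \
--llm_metric_config factgenie/llm-eval/ollama-llama3.yaml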
Running the evaluation
- Go through the README.md and install dependencies.
- (Optionally) Look at the PR where we added the shared task datasets to factgenie.
- In the same PR, look at the section How to evaluate the existing outputs? and learn how to run the factgenie run-llm-eval command.
- Finally, look at how you can add your model outputs to factgenie to be evaluated. Here is the example for the mistral model. Assuming you have outputs from an awesome_model for the input data from the dev split of the st24-gsmarena dataset, you need to create the file
factgenie/outputs/st24-gsmarena/dev/awesome_model.json
with the structure shown below. The following structure is required for the LLM output file:
{
  "setup": {"id": "mistral", "model": "mistral"},
  "generated": [
    {"out": "first llm output"},
    {"out": "second llm output"},
    ...,
    {"out": "last llm output"}
  ]
}
The rest of the fields are ignored by factgenie. In this case, those extra fields were used by https://github.com/kasnerz/quintd to generate the mistral.json file and to obtain the LLM outputs based on the dataset inputs.
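To create such a file for the hypothetical awesome_model outputs above, one option is a sketch like the following; the two output strings are placeholders, and the "generated" list should contain one entry per input example of the dev split (presumably in the same order as the dataset inputs).

# create the expected directory structure for the outputs
mkdir -p factgenie/outputs/st24-gsmarena/dev
# write a minimal awesome_model.json with only the fields factgenie needs
cat > factgenie/outputs/st24-gsmarena/dev/awesome_model.json << 'EOF'
{
  "setup": {"id": "awesome_model", "model": "awesome_model"},
  "generated": [
    {"out": "first output of awesome_model"},
    {"out": "second output of awesome_model"}
  ]
}
EOF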