Running an Evaluation Script for the Shared Task Datasets

  • We use the factgenie framework (https://github.com/kasnerz/factgenie/) in the shared task for running the LLM evaluation, i.e., annotating the data-to-text system outputs using LLMs.
    • We strongly encourage you to run it locally and to visualize and evaluate your outputs in factgenie (see the setup sketch after this list).
    • factgenie provides the factgenie run-llm-eval command for running the evaluation from the CLI.
    • factgenie can also run the same evaluation interactively in a browser at http://127.0.0.1:5000/llm-eval, which is great for debugging the prompt.
  • We also use the factgenie web server for visualizing the annotations and the datasets themselves.
  • Before using any dataset, one needs to set up data loading for it in factgenie. Luckily, we have already added the shared task st-* datasets to factgenie for you.
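
To start a local factgenie instance, a minimal setup sketch could look like the following (this assumes installation from source and the factgenie run command described in the factgenie README; consult the README for the current instructions):

git clone https://github.com/kasnerz/factgenie.git
cd factgenie
pip install -e .
# start the local web server; the interactive LLM evaluation then runs at
# http://127.0.0.1:5000/llm-eval
factgenie run --host=127.0.0.1 --port 5000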

Looking at the results: Shared Task datasets and annotations

Before we start, let’s look at a running factgenie instance deployed at:

https://quest.ms.mff.cuni.cz/namuddis/factgenie/browse?dataset=st24-openweather&split=dev&example_idx=0

(Screenshot: an example from the deployed factgenie toolkit at the address above.)

  • The OpenWeather domain input data examples are released as the st24-openweather dataset.
  • The outputs from a Mistral model were generated using https://github.com/kasnerz/quintd/ (in particular, this version).
  • The annotations with the id st24-demo-openweather-dev-llama3 were generated using the following command:
factgenie run-llm-eval \
  --campaign_name st24-demo-openweather-dev-llama3 \
  --dataset_name st24-openweather \
  --split dev \
  --llm_output_name mistral \
  --llm_metric_config factgenie/llm-eval/ollama-llama3.yaml

Running the evaluation

  1. Go through the README.md and install the dependencies.
  2. (Optionally) Look at how we added the shared task datasets in the section How to evaluate the existing outputs?
  3. In the same PR, look at the section How to evaluate the existing outputs? and learn how to run the factgenie run-llm-eval command.
  4. Finally, look at how you can add your model outputs to factgenie to be evaluated. Here is the example for the Mistral model. Assuming you have outputs from an awesome_model for the input data from the dev split of the st24-gsmarena dataset, you need to create the file factgenie/outputs/st24-gsmarena/dev/awesome_model.json with the structure described in the example:

    The following structure is required for the LLM output file:

     {
         "setup": {
             "id": "mistral",
             "model": "mistral"
         },
         "generated": [{"out": "first llm output"}, {"out": "second llm output"}, ..., {"out": "last llm output"}]
     }
    

    Any other fields in the file are ignored by factgenie. In this case, the extra fields in mistral.json were used by https://github.com/kasnerz/quintd to generate the file and to obtain the LLM outputs based on the dataset inputs.
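
    If your outputs are available as a plain list of strings, a minimal Python sketch for producing such a file could look as follows (the script name, the awesome_model identifier, and the placeholder outputs are hypothetical; adapt the dataset and split to your setup):

     # build_outputs.py -- a sketch for wrapping model outputs in the structure
     # factgenie expects and writing them to the expected path
     import json
     from pathlib import Path

     # your model's outputs for the dev split of st24-gsmarena,
     # in the same order as the dataset inputs (placeholders shown here)
     outputs = ["first llm output", "second llm output"]

     payload = {
         "setup": {"id": "awesome_model", "model": "awesome_model"},
         "generated": [{"out": text} for text in outputs],
     }

     out_path = Path("factgenie/outputs/st24-gsmarena/dev/awesome_model.json")
     out_path.parent.mkdir(parents=True, exist_ok=True)
     out_path.write_text(json.dumps(payload, indent=4, ensure_ascii=False))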