Report on the second workshop of the series "Understanding and evaluating complex automation systems for journals", titled "Evaluating LLM-Written Abstracts with ChainForge" and presented by Gauransh Kumar
Evaluating LLM-Written Abstracts with ChainForge
Introduction
Video is available: https://nakala.fr/10.34847/nkl.1fb6vmgi
Gauransh Kumar (https://gauransh.dev/) is a PhD student in Computer Science at the Université de Montréal. He works with Ian Arawjo, who initially developed ChainForge.
This workshop aims to help us understand how to use LLMs more effectively.
A few definitions first:
- prompts: the instructions given as input to the LLM.
- prompt engineering: the optimization of prompts.
Prompt engineering workflow (see the sketch after this list):
- template: a string with variables that we can fill in.
- input data: for testing purposes.
- analysis of outputs.
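To make these steps concrete, here is a minimal Python sketch of the template and input-data stages; the {title} and {field} variables and the sample rows are illustrative, not the demo's actual data.

```python
# Minimal sketch of the template + input data steps.
# Variables and sample rows are illustrative only.
template = (
    "You are a senior reviewer. Write an abstract for the paper "
    "'{title}' in the field of {field}."
)

# Input data for testing: each row fills in the template's variables.
rows = [
    {"title": "Paper A", "field": "digital humanities"},
    {"title": "Paper B", "field": "sociology"},
]

prompts = [template.format(**row) for row in rows]
for p in prompts:
    print(p)  # each prompt would then be sent to one or more models
```

ChainForge's own prompt templates use the same curly-brace variable syntax.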
ChainForge = a UI for prompt engineering created by Ian Arawjo in 2023: https://chainforge.ai/play/
Architecture:
4 types of nodes:
- prompt node
- commands and prompt injections
- models
- evaluation and analysis
Demonstration
Goal: create an abstract for a paper using Retrieval-Augmented Generation (RAG) and evaluate the generated abstracts.
RAG: parts of the source documents are injected into the prompt, i.e. added context.
The first table contains 3 articles from the top 10 in SSH.
RAGForge, currently in development, will soon be part of ChainForge: it allows uploading source documents.
Chunking: segmentation of the documents; Hugging Face models are used for semantic tokenization.
Query: the instruction for the retrieval part of the workflow, e.g. 'what is the meaning of ...'.
Retrieval node: looks for the query words that appear in the chunks (BM25), or uses dense retrieval with vectors, and returns the top 5 relevant chunks (see the sketch below).
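As a standalone illustration of the chunking and retrieval steps, here is a sketch using the rank_bm25 package; the fixed-size word chunking and whitespace tokenization are simplifications of the semantic tokenization mentioned above, and this is not RAGForge's actual implementation.

```python
# Sketch of chunking + BM25 retrieval (pip install rank-bm25).
from rank_bm25 import BM25Okapi

document = "..."  # full text of a source article

# Chunking: naive fixed-size segmentation by words.
words = document.split()
chunk_size = 200
chunks = [" ".join(words[i:i + chunk_size])
          for i in range(0, len(words), chunk_size)]

# Index the chunks and retrieve the top 5 for a query.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
query = "what is the meaning of the main argument"
top_chunks = bm25.get_top_n(query.lower().split(), chunks, n=5)
# top_chunks are then injected as context into the prompt
```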
Prompt node: the overall instruction to the LLM, e.g. "You are a senior reviewer, write an abstract".
ChainForge lets you inspect what will be sent before actually sending it, which is useful given the combinatorial factor across nodes (3 prompts × 3 articles × 3 models × 2 retrieval methods, etc.).
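That fan-out is just a Cartesian product of the node settings; a quick sketch (with illustrative names) shows how fast the number of calls grows.

```python
# Every combination of prompt, article, model and retrieval method
# becomes one LLM call; the names below are illustrative.
from itertools import product

prompts = ["p1", "p2", "p3"]
articles = ["a1", "a2", "a3"]
models = ["m1", "m2", "m3"]
retrievers = ["bm25", "dense"]

runs = list(product(prompts, articles, models, retrievers))
print(len(runs))  # 3 * 3 * 3 * 2 = 54 calls before any evaluation
```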
Models node: selection of several models. NB: model API keys can be added in the settings.
Output analysis: we can look at all the generated abstracts for qualitative inspection. However, this becomes impractical with a large number of outputs, so we are interested in automating evaluation using LLM-as-judge.
LLM-evaluator node:
-> LLM as a judge: an LLM prompted to evaluate outputs against several criteria, or in this case, to score each abstract against the ground-truth abstract.
-> LLM as judge on a quantitative element: word counts.
The type of expected output can be set to numeric, binary, etc. (see the sketch below).
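A rough sketch of both evaluator styles, where call_llm is a hypothetical stand-in for whatever model API is configured, and the judge prompt wording is invented for illustration:

```python
# Sketch of the two evaluator styles. `call_llm` is a hypothetical
# stand-in for the configured model API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real model call

JUDGE_PROMPT = (
    "You are a strict reviewer. Score the candidate abstract from 1 to 10 "
    "for how faithfully it matches the ground-truth abstract.\n"
    "Ground truth: {truth}\nCandidate: {candidate}\n"
    "Answer with a single number."
)

def judge_score(candidate: str, truth: str) -> float:
    # LLM as judge, with the expected output type set to numeric.
    raw = call_llm(JUDGE_PROMPT.format(truth=truth, candidate=candidate))
    return float(raw.strip())

def word_count(candidate: str) -> int:
    # Plain quantitative check: no LLM needed for word counts.
    return len(candidate.split())
```

Running such functions over every generated abstract is roughly what the evaluator nodes automate.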
LLM-as-judge: there is controversy surrounding their use, as the judging LLM can itself be biased.
Questions
Alexia Schneider: why use RAG for the prompt instead of just prompting with the entire text?
Gauransh Kumar: to avoid hallucinations related to longer prompts.
Alexia Schneider: the problem with RAG is that it makes the workflow less reliable. One alternative for this specific task could be to select key parts of the text for the prompt based on the structure of the article (e.g. introduction, conclusion, results, etc.).
Gauransh Kumar: RAG is an extra tool that can be used if needed. But LLMs are improving, becoming more and more capable of giving good answers to longer prompts. It is entirely possible to rely on the structure of the source document, in which case the retrieval part (BM25 and vectorisation) is not necessary (see the sketch below).
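A minimal sketch of that structure-based alternative, assuming the article has already been parsed into named sections (a hypothetical pre-processing step, not something shown in the workshop):

```python
# Pick key sections of the article directly instead of retrieving chunks.
KEY_SECTIONS = ["introduction", "results", "conclusion"]

def context_from_structure(sections: dict) -> str:
    """Concatenate the key sections to use as prompt context."""
    return "\n\n".join(
        f"{name.upper()}:\n{sections[name]}"
        for name in KEY_SECTIONS
        if name in sections
    )

article = {
    "introduction": "...",
    "methods": "...",
    "results": "...",
    "conclusion": "...",
}
print(context_from_structure(article))
```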
Yves Terrat: Can we go back and adjust the prompt retroactively?
Gauransh Kumar: It is possible to add more prompts, in which case only the new prompts will be run.
Markus Reisenleitner: The answers provided in the demo are summaries, not abstracts. In the humanities, abstracts are part of the article and are written by the author. It might be interesting to use these generated summaries to provide a different understanding of what the abstract should say, in contrast to the one provided by the author.
Gauransh Kumar: abstracts are indeed very structured. If the exact structure can be described, then the LLM could do it. The abstract remains the first entry point to the article. The generated texts are not to be taken at face value.
Marcello Vitali-Rosati: Agrees on the distinction between abstracts and summaries. For Alexia: we could think about workflows to help authors produce abstracts. Maybe we can provide a number of prompts, identifying various ways and structures to produce abstracts (for example, one could be "select the most important part of the text..."); the tool could then give different abstracts and help evaluate the most appropriate one. Maybe some features could be added to Stylo.
Alexia Schneider: Yes, we can work on this. There is a converging interest for a broader public: from authors to editors and reviewers, assistant tools and features could be relevant.
Gauransh Kumar: he developed an automatic screening tool for reviews earlier -> LLMs are good at removing the noise. https://www.researchgate.net/publication/391912321_AutoRev_Automatic_Peer_Review_System_for_Academic_Research_Papers
Davin Baragiotta: is there a way to export the workflow in order to gain a bit of independence from the tool?
Gauransh Kumar: we can export the workflow from the tool, but the export is meant to be re-imported into the tool and can't be used independently. However, we can export answers (CSV format) and graphs separately.
Markus Reisenleitner: "removing the noise" is verging on editorial job. I would draw the line here and be careful in the responsibility we give to the LLM. So as not to normalize the process.
Gauransh Kumar: The decision should remain with humans. ChainForge works best as an assistant for human decisions.
Alexia Schneider: ChainForge is also useful for reflecting on the task we are asking the tool to do. As we evaluate the performance of the tool, we reflect on the task itself, structuring the prompt and the criteria for evaluation.
Gauransh Kumar: it helps you figure out whether the task is doable and whether the automation is actually worth it and saves time.
Marcello Vitali-Rosati: the economics and the way we display workflows are central: ChainForge is interesting in that it forces us to think about the complex workflow and how it should match our expectations of the output. This goes against the usual expected interaction with AI tools. It would be really nice to design a few template workflows for journals.
Yves Terrat: As a prompt expert, do you notice that models are becoming less and less sensitive to prompts?
Gauransh Kumar: Automation of prompt optimization seems to be the next logical step in the current context.
Yves Terrat: Prompt optimization is an overwhelming task and needs to be adapted to each model.
Gauransh Kumar: the goal of my PhD is to automate prompt optimization. I developed an automatic screening tool.
Alexia Schneider: can you tell us more about the automatic screening tool?
Gauransh Kumar: the tool will come out this summer -> LLMs are good at excluding the noise in systematic reviews -> methodology: 36 prompts (few-shot, CoT) optimized down to 2, across 50 datasets. It relies on a prefiltering methodology: from a research question or field, the algorithm filters a large corpus (title + abstract of scientific publications).