Report on the first workshop "Understanding and evaluating complex automation systems for journals", introduction to Agentic AI and RAG by Alexia Schneider

Video of the workshop available on Nakala : https://nakala.fr/10.34847/nkl.7eaeg131

Presentation support available here : https://alexiaschn.github.io/blog/slides/26-02-12_revue30_atelierAutomatisationSeance01.html#/title-slide

Presentation

Complex AI systems

Generative AI always relies on instructions being given in natural language. Nowadays, LLMs are integrated within complex pipelines and processes.

Mainstream chatbot applications are interacting with LLMs along with a modular system. There is an obfuscation of the way the user request or "user prompt" is being dealt with. Therefore, we cannot claim to understand the process leading an app to give a certain answer: the user prompt and subsequent prompt engineering from the user standpoint is not effectual in understanding what happened and which processes happened under the hood.

Current narrative discourse is lead by some big tech companies that have an incentive to present the current applications as autonomous, "agentive" and anthropomorphise them.

In reality, AI agents are a concatenation of calls to one or different LLMs.

Architecture of a single AI agent

The system prompt contains a series of instructions in natural language:

description of the expected personality of the LLM
description of tools defined by the dev
description of the expected formatting of the output (JSON, XML, md etc.)
+ user request

The AI Agent proceeds by iteration and concatenation of the requests. Most models nowadays have been trained to have a "thinking" process before they can add in a structured way their "final answer" or "tool call".

RAG (Retrieval-Augmented Generation)

LLMs have two major limits :

Limited contextual window (despite claims of progress in the size of that window over the past couple of years)
Fixed vectorial representation of the training dataset.

RAG fundamentals: adding data within the system prompt from a selected set of sources, usually only chunks of those sources. The selection can be designed within the pipeline by the RAG developper with reliable external sources, or it can be the result of an internet search triggered by the LLM itself.

Architecture :

Source documents are chunked to fit the system prompt context window. They can be vectorized in preparation and stored if needed.
Information Retrieval : extraction of relevant chunks in regards to the user query. Two main IR strategies. Lexical search uses upgraded Ctrl+F (TFiDF or BM25), "semantic search" aka "dense retrieval" uses the vectors to compute similarity between the user request and each chunk of the source documents.
Retrieved chunks are added to the system prompt and are meant to anchor the generated answer.

Several points of attention for understanding and evaluating RAG systems:

the external datasets can be biased too just as much as the original training dataset.
RAG and Agentic AI add multiple invisible interpretative layers ; behind a single user request lies several LLM calls with potential multiple interactions with external sources and apps.
dissemination of information is a major limitation that is hard to evaluate : when information is present in several documents it can be misinterpreted or ignored altogether if the chunk is not retrieved.

Relates to current paradigm in censorship : it is easier to never show something in a context of mass information than to do the risky thing of deleting it.

Discussion

Marcello Vitali-Rosati : "I am appalled by our willingness to adopt aberrant systems" --> multiplication of intermediaries and operations for sometimes very trivial tasks for the sake of sparing ourselves the energy of thinking. Why ask a chatbot to compute 4+4 when the calculator is right here. Great computationnel energetical cost for things that aren’t even certain. Cognitive offloading of menial tasks and narrative of emancipation for heavy tasks when in reality it implies growing dependancy towards costly infrastructure.

Alexia Schneider : Ex. 1 request sent to a LLM = 10 Google requests.

Alix Chagué (in the chat) : I am curious, do you have any details about the inner workings of the "thinking" step of an LLM. Is the LLM calling itself ? Yesterday I was trying a VLM for transcription and its "thinking" section had a lot of "But wait, the user asked for A..." etc.

Alexia Schneider : Reasoning LLM are not meant for long interactions due to their latency. They tend to be used as orchestrators sending requests to other pre-prompted smaller LLMs. The reasonning step is usually left to that larger specialized LLM, reducing latency for task completion.

Nicolas Sauret : There seems to be a relation between the thinking part of the generated answer and the final answer. Tests with Deepseek show that interpretation of the prompt differed depending of phrasing. Is there a real relation between the thinking step and the final answer ?

Alexia Schneider : Evaluation seems to conclude that there is definitely a correlation and models perform better when asked to reflect before providing their final answer. Basic prompt engineering is litteraly asking a "chain of thought" prompt meaning, prompting to think step-by-step.

Marcello Vitali-Rosati : In an agentic system, the thinking process can be left out or added to the prompt at each iteration.

Frédéric Clavert : It is necessary to put together RAG practices in SSH communities in order to avoid GAFAM tools. A small RAG well done can be more efficient and less expensive.

Marcello Vitali-Rosati : Huma-Num is working on a RAG implementation on Isidore (SSH francophone search engine). A possible avenue would be to discuss this with Stéphane Pouyllau and other interested journals and researchers.

Frédéric Clavert : PleIAs is developping small models that could be relevant for research in SSH. (in the chat): "Anecdote:** To speed up a RAG query, I had "compressed" the chunks. But all my documents started with the list of people present at the meetings of the Committee of Governors of the Central Banks of the EEC, with the same people often attending, since they typically held long terms as central bank governors. The result? The model, relying on the sources I provided to the RAG system, concluded that a governor had only attended one of the committee meetings—when in fact, they had been present at these monthly meetings for almost the entire 1970s. All the chunks containing the list of attendees had been merged into a single chunk. In short, with RAG, the devil is in the details, and users still need to have a pretty precise understanding of what’s going on."

Discussion sur les besoins spécifiques des revues.

Alexia Schneider & Marcello Vitali-Rosati : We don’t have to follow the GAFAM trends of performing article reviewing with chatbots but we can think about testing tools and methodologies. Designing a tool that could help peer reviewing for specific aspects and provide support. Ex: RAG for parts of the analysis of the lit review of an article being peer reviewed.

Nicolas Sauret : it only displaces the question of verification : either we will take the chatbot’s answer at face value, or we will have to check everything it says, in which case it is not helping us in terms of time or efficiency.

Marcello Vitali-Rosati : It would only be interesting if it would improve the quality and not time management. The idea is not to replace someone or get in a productivist mindset but rather to gain depth in the evaluation of an article by allowing us to reflect on what it means to structure correctly your bibliography for instance.

Alexia Schneider : it could be possible to train a classifier model that would support the evaluation of the lit review.

Nicolas Sauret : We go back to an almost ontological matter: what is the job of a researcher. In the end these technologies might just take more time to implement for no improvement.

Marcello Vitali-Rosati : 2 possibilities :

either we reject GAFAM solutions and take the risk of having people resort to them anyway,
or we use the technologies at the core of this and attempt to build our own models and systems that reflect our values.

Nicolas Sauret : Journals are first and foremost in need of resources (money and human resources). Are the needs of researchers and editorial team members aligned ?