Third workshop in the series "Understanding and Evaluating Complex Automation Systems for Journals", titled "Debunking myth surrounding automation and AI, the example of HTR" and presented by Alix Chagué (Université de Montréal)

Video of the workshop on the myths surrounding Handwritten Text Recognition (HTR): https://nakala.fr/10.34847/nkl.1fb6vmgi

Introduction

Almanach Research Team in Natural Language Processing (NLP): In 2018, conducted initial research on HTR using Transkribus, after which Alix continued this work with eScriptorium, an open-source, free, and locally executable platform for HTR.

CREMMA (Consortium for the Recognition of Handwriting in Ancient Materials) enabled research on various projects and helped identify key questions across numerous topics.

Alix worked on developing multiple models in various languages, including Ancient Greek and Esperanto, but primarily in French.

HTR United Platform: A shared dataset platform for training models in automatic handwritten transcription. The platform encourages and promotes the sharing of research data.

Two Key Challenges:

- Model evaluation methodology

- User perspective

Definitions

HTR

Definition: Handwritten Text Recognition = a machine learning method for labeling image content.

HTR is a machine learning technology.

An image containing text is provided, and the goal is to produce an equivalent machine-readable text. This is a transcription of the text, not a description of the document.

Image sources can vary: digitized documents, screenshots, photos, or videos taken "in the wild" (which can be used, for example, for geospatialization). Alix gives the example of tombstone images that can feed genealogical research.

HTR specifically handles handwritten data.

It began with automatic check recognition, or address recognition on mail.

Technology development accelerated from the 2010s onward.

ASR

ASR = Automatic Speech Recognition.

Results are often linked to HTR due to similar logic: starting from a non-textual recording (audio or image) to obtain text, notably through transcription.

OCR

Optical Character Recognition.

OCR technology is particularly accessible online and relatively old, with initial developments in the 1950s-60s. It was then used to accelerate data entry from printed documents.

HTR Challenges

Difficulty of handwritten scripts (unlike standard printed text): the question also arises for handwritten scripts.

Great variety of layouts, for example, medieval manuscripts with diagrams.

Several commercial software solutions now allow automatic transcription with pre-trained models.

Google Cloud Vision, and more recently, multimodal LLMs (olmOCR, LEO).

Transkribus / eScriptorium: trainable.

GPT / Gemini: can perform transcription.

How to Evaluate Models?

Blog post by Dan Cohen: The Writing Is on the Wall for Handwriting Recognition. https://newsletter.dancohen.org/archive/the-writing-is-on-the-wall-for-handwriting-recognition/

He tested Gemini 3 Pro on three images.

According to Dan Cohen, Gemini performs better than Transkribus.

Blog post by Alix Chagué: A Perfect Job is the New Very Good Job.

https://alix-tz.github.io/phd/posts/025-fr/

"Ad hoc" evaluations highlight evaluation weaknesses: no systematisation. Often a quick, qualitative evaluation based on a few texts. From this, generalisations about the model are drawn.

Dan Cohen concludes that Gemini 3 is superior to specialized software like Transkribus.

He compares the transcription of 3 images:

1. Letter from George Boole ... left-right inversion of page order and only the first page (on the right) and last page (on the left).

Compares the two transcriptions:

Transkribus respects the original document line by line. Cohen notes some transcription errors.

Gemini (prompt not provided): gives a more structured transcription (indicates right and left pages). Cohen judges that there are no errors, although there are some, more subtle ones.

Gemini also adds editorial comments to the transcription: Gemini’s output appears "enhanced" compared to Transkribus.

2. Second image: a war letter from Charles Carroll to Alexander Hamilton, binarized (contrast enhancement black/white to bring out the text).

Dan Cohen then ignores the Transkribus result and presents only the LLM result.

Gemini: logical organization of the page. Matches a note to a page number. Indicates abbreviations. Struck-out words instead of ink blots.

3. [Letter from Jane Austen](https://www.themorgan.org/collection/jane-austen/letters/33) from 1808, double page with greater difficulty.

- Jane Austen rotated her letter 90 degrees and continued writing (very common in the 19th century).

Different orientations cause interference of reading levels for the machine but not for human reading, which can ignore these interferences.

Another reading difficulty for the machine is caused by ink crossing the page.

Gemini starts, then stops, pointing out that the text is too obscure to read.

The model is not obsequious (sycophantic): it comes across as authentic.

A sign of reliability for Dan Cohen.

No evaluation protocol by Cohen.

Possible metric and evaluation method:

- Comparison with a human transcription (ground truth): this is what Alix did, finding that the Transkribus transcription has a CER (Character Error Rate) of 9%.
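As a reminder of what the metric measures: CER is usually computed as the edit (Levenshtein) distance between the machine output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch of that calculation, not the tooling used in the workshop; the function names and the example strings are invented for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: one wrong character out of eight.
ground_truth = "dear Sir"
machine_output = "deer Sir"
print(f"CER = {cer(ground_truth, machine_output):.1%}")  # CER = 12.5%
```

Because the score is normalized by the reference length, any text the system adds on top of the transcription (comments, headings) counts as insertions and inflates the CER.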

CER comparison is less used for generative models: other methods are then used since LLMs also produce editorial comments that do not allow strict character evaluation.

Specialized models (e.g., those in Transkribus) tolerate some ambiguity when certain characters are difficult to read, or when the stroke is not as expected, whereas LLMs prioritize a "readable" transcription.

Possible qualitative evaluation:

Four criteria to evaluate rigor (Lincoln & Guba 1986):

Lincoln, Y.S. and Guba, E.G. (1986), But is it rigorous? Trustworthiness and authenticity in naturalistic evaluation. New Directions for Program Evaluation, 1986: 73-84. https://doi.org/10.1002/ev.1427

- Credibility (how were the results obtained?)

- Transferability

- Reliability, called "dependability" by Lincoln and Guba (are the results constant over time?)

- Confirmability (are the results replicable?)

Weaknesses in empirical evaluations of HTR by LLMs:

- Reliability

Limitations of this type of evaluation

- Absence of the prompt in the evaluation

- Black box effect and obscure generation process

- Background updates of models cannot be recorded

- Different modes of use (thinking, etc.)
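One way to mitigate these limitations is to record, for every test, the elements that would otherwise be lost: the prompt, the model name, the mode used, and the date of the run. The sketch below is only an illustration of such a log; the class, its field names, and the example values are all hypothetical and do not correspond to any existing tool or to a protocol presented in the workshop.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvaluationRun:
    """Everything needed to understand, later, how an output was obtained."""
    model_name: str   # as displayed by the provider at the time of the run
    model_mode: str   # e.g. "thinking" or "default" (hypothetical labels)
    prompt: str       # the exact prompt sent with the image
    image_id: str     # identifier of the transcribed image
    run_at: str       # UTC timestamp, since models may be updated silently
    raw_output: str   # unedited output, editorial comments included


def record_run(model_name: str, model_mode: str, prompt: str,
               image_id: str, raw_output: str) -> EvaluationRun:
    """Create a run record stamped with the current UTC time."""
    return EvaluationRun(
        model_name=model_name,
        model_mode=model_mode,
        prompt=prompt,
        image_id=image_id,
        run_at=datetime.now(timezone.utc).isoformat(),
        raw_output=raw_output,
    )


def to_json_line(run: EvaluationRun) -> str:
    """Serialize one run as a single JSON line, suitable for an append-only log."""
    return json.dumps(asdict(run), ensure_ascii=False)


# Invented example values.
run = record_run(
    model_name="hypothetical-model",
    model_mode="default",
    prompt="Transcribe this letter line by line, without comments.",
    image_id="letter-001",
    raw_output="My dear Sir",
)
print(to_json_line(run))
```

Such a log does not open the black box, but it makes it possible to tell whether two diverging results come from a different prompt, a different mode, or a different date.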

Dan Cohen inputs a 19th-century English corpus, already widely studied, which may be implicitly already known by Gemini.

The problems of automatic transcription that Cohen assumes Gemini has solved are not the real ones. The open problems are rather non-Latin scripts and complex images, such as the Jane Austen letter, which is precisely the type of document we would like models to be able to transcribe.

Cohen’s evaluation of Gemini takes into account texts in English, by known authors. This does not correspond to all of Transkribus’s objectives.

Task definition: Is only the transcription expected?

Transkribus provides a strict transcription, while Gemini gives editorial notes, comments. This would be an aspect to better define.

Format is also a problem: specialized tools produce a structured document (e.g., ALTO XML or PAGE XML) with metadata that contextualizes the text. The LLM merges into a single semantic layer the transcription proper and what is closer to metadata.
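To illustrate the separation of layers in a structured format, here is a minimal sketch that reads a hand-written, heavily simplified ALTO fragment with Python's standard library. The sample file content is invented and has not been validated against the ALTO schema; only the namespace and the element and attribute names (`TextLine`, `String`, `CONTENT`, `VPOS`) come from the ALTO standard.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Invented, simplified sample: metadata and text live in separate elements.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Description>
    <sourceImageInformation>
      <fileName>letter_page1.jpg</fileName>
    </sourceImageInformation>
  </Description>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" HPOS="120" VPOS="200" WIDTH="900" HEIGHT="60">
            <String CONTENT="My dear Sir"/>
          </TextLine>
          <TextLine ID="line_2" HPOS="120" VPOS="280" WIDTH="900" HEIGHT="60">
            <String CONTENT="I have received your letter"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml: str) -> list[dict]:
    """Return each text line with its identifier and vertical position."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "")
                 for s in text_line.iterfind("alto:String", ALTO_NS)]
        lines.append({
            "id": text_line.get("ID"),
            "vpos": text_line.get("VPOS"),
            "text": " ".join(words),
        })
    return lines


for line in read_lines(SAMPLE_ALTO):
    print(f"{line['id']} (VPOS={line['vpos']}): {line['text']}")
```

The transcription can be extracted on its own, while the coordinates and the source image name remain available as distinct fields, which is what a single block of generated text does not offer.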

Confusion between software and model: the evaluation interface with Transkribus actually allows parameterization of the order of transcribed pages, etc., contrary to what Cohen says.

Problem regarding what is expected from HTR. Is the text expected to be:

- Readable?

- Plausible? This is what an LLM like Gemini produces.

- Perfect? And what would that mean?

What type of transcription do we want?

For example, Gemini gives annotations in markdown. Is that what we want? Do we want editorial notes? Gemini in these examples adds insertions (indications that fall under critical editing).

A whole branch of research does not concern itself with what transcription means: it reduces transcription to "reproducing the text" without a clear definition of expectations.

Implicit and uniform naturalization of transcription revealed by HTR statements and evaluations.

Necessity of a protocol/standardization/clarification on expectations.

See also: the "two-headed" model by Sergio Torres Aguilar (cf. https://hal.science/hal-04983305/document)

Discussion

Gérald Kembellec: Around 2020, worked on HTR on the correspondence corpus of Constance de Salmes. The appeal of AI was to handle a corpus with many different hands, but training was of little use given the number of hands, unless the corpus was split into sub-corpora. Is this something that would be solved today?

Alix Chagué: Models in 2020 were trained on sub-corpora. Generic models at the time could be retrained on very restricted sub-corpora of a few pages. The advantage of such models is generalization.

Gérald Kembellec: So the era of supervised learning in HTR is over?

Alix Chagué: Supervised learning is still necessary, but less heavy. LLMs are not trainable, so we must rely on their generalization capabilities. If the model is very poor on one of the five hands in the corpus, that hand would need to be worked on differently. However, in general, models have gained in generalization.

Alexia Schneider: Dan Cohen’s bias toward LLMs is frequent. Is a lack of specialization involved? Is there a debate around these topics in the field?

Alix Chagué: Wonders whether she herself may have a bias toward "traditional" models. She is not aware of any posts responding to Dan Cohen, but her own post prompted many responses, through more informal channels, from specialists who support her view, notably on the hasty nature of Dan Cohen’s evaluation. Cohen’s post should however be understood in the context of a newsletter: it is an opinion piece, not a scientific product.

Other posts tested Gemini in a more rigorous framework. One cannot deny that models like Gemini are excellent for transcription: in the community, the model’s competence is recognized. Models are good: the question remains in the lack of transparency regarding training data of generic generative LLMs.

It frequently happens that people seek "solutions" to research "problems," which is both commercial discourse and a desire for innovation.

Alexia Schneider: Digital humanities have often positioned themselves in favor of distancing, text automation. Now, the field rather positions itself in favor of a return to materiality. Should we question this bias? Could we imagine specialized large models, more transparent, trained on diverse corpora? Is the transcription step destined to be devalued, trivialized by the use of LLMs? These questions imply a questioning of values, since one can easily imagine that the same criticisms were addressed to HTR at its beginnings.

Alix Chagué: In 2018, worked on two projects, including a corpus transcription for distant reading with Transkribus, where we did not have control over training. Researchers indeed had this reaction: the transcription task was integrated into research, nourishing other editorial tasks being carried out simultaneously. LLMs reproduce methodological problems that remain current, regardless of technology. Transcription technology is also widely used in technological and commercial settings, where attention to the text differs from the philological perspective. At this crossroads, tensions arise with the ways of envisioning the task that come from the humanities.

Another question is that of standards for recording data: in the case of LLMs, one obtains a JSON output, which does not allow interoperability. One also runs the risk of navigating between two technologies that do not communicate well together, which is less the case in environments like Transkribus or eScriptorium, which allow image manipulation not managed in LLMs.

A plethora of problems are raised, but the general discourse shifts between disenchantment and enthusiasm.

Alexia Schneider: The complexity of the transcription step may have been invisible until the advent of systems like eScriptorium or Transkribus. The use of LLMs makes one believe they restore the text identically or that one faces an "improved" text, which obscures a series of choices that would usually be those of the philologist/user. The enthusiasm is more technophilic than philological.

Alix Chagué: Completely agree. Usage ends up being defined by what the tool offers/can do rather than by user expectations. The segmentation step, for example, underlies a series of questions. The use of an LLM for transcription tends instead to encourage accepting the result as it is. Transcription is always a representation, always a "betrayal," always at a degree of separation from the source document.