Minutes of the AI workshop of March 13th (Gérald Kembellec)

Workshop Recording

Presentation with Christelle Magdelaine: Manager of the CREPAC resource center, project leader for the automation of article summary production (see previous workshop).

Presentation: The Digital Palimpsest

"The Digital Palimpsest":

Referencing the article “Goyet, S. (2017). Outils d’écriture du web et industrie du texte: Du code informatique comme pratique lettrée. Réseaux, 206(6), 61‑94. https://doi.org/10.3917/res.206.0061”:

There are ways to think about and structure digital content.

"Screen Writings": There are semantic forms in sub-layers that are invisible in the graphical layer but accessible through web scraping and harvesting tools (e.g., Zotero).

Jean-Edouard Bigot and the concept of "equipped reading".

Example of Gérald’s usage: openlinksw

Alexandra Saemmer and the rhetoric of hypertext: In HTML5, the link element `<a> </a>` can convey meaning and intentionality in document structures. For example, in journalism, a value of the `rel` attribute such as "nofollow" can signal disagreement with, or at least non-endorsement of, the linked content, so that one's site is not associated with it.
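As an illustration of how this intentionality can be read back from the sub-layer, here is a minimal sketch using only Python's standard library. It separates links that carry the "nofollow" token from those that do not. The sample HTML, the `LinkCollector` class, and the `split_links` function are hypothetical names introduced for this example; they are not part of the presentation.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, rel tokens) for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel holds space-separated tokens; a missing or empty rel gives [].
        rel_tokens = (attributes.get("rel") or "").lower().split()
        self.links.append((href, rel_tokens))


# Hypothetical fragment of a news article (assumption, for illustration only).
SAMPLE = """<p>See <a href="https://example.org/source">our source</a> and
<a href="https://example.com/claim" rel="nofollow noopener">the disputed claim</a>.</p>"""


def split_links(html):
    """Return (links without nofollow, links marked nofollow)."""
    collector = LinkCollector()
    collector.feed(html)
    followed = [href for href, rel in collector.links if "nofollow" not in rel]
    marked = [href for href, rel in collector.links if "nofollow" in rel]
    return followed, marked


followed, marked = split_links(SAMPLE)
print("followed:", followed)
print("nofollow:", marked)
```

Both links look identical on screen; only the source reveals which one the author chose not to vouch for.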

Zerilli, S. (2015). Alexandra Saemmer, Rhétorique du texte numérique: Figures de la lecture, anticipations de pratiques. Lectures. https://doi.org/10.4000/lectures.18678

In terms of structuring information in sub-layers, we can also mention the pericope, as discussed in "Réflexions sur le fragment dans les pratiques scientifiques en ligne: Entre matérialité documentaire et péricope" (Kembellec & Bottini, 2017).

In exegesis, a pericope is the segmentation of content based on semantic units rather than chapters or verses. For example, the parable of the prodigal son (Luke) can be segmented into units of meaning (pericope). In digital writing, this could be a fragment of a visual document. Examples include memes that frame a portion of an image or video. The tools for this strategy are those of the Semantic Web, such as Dublin Core.
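To make the idea of a semantic sub-layer concrete, the following sketch reads Dublin Core descriptions embedded in a page as `<meta>` elements, a common convention that harvesting tools can pick up. It is an assumption-laden illustration: the sample page, the `DC.`-prefixed names it uses, and the `MetaCollector` and `extract_metadata` names are hypothetical and do not come from the presentation.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs, which are not displayed on screen."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            # A name may repeat (several creators), so values are kept in a list.
            self.metadata.setdefault(name, []).append(content)


# Hypothetical page (assumption): the head describes the document,
# while the body is the only part a reader sees.
SAMPLE_PAGE = """<html><head>
<title>Sample article</title>
<meta name="DC.title" content="Sample article">
<meta name="DC.creator" content="Doe, Jane">
<meta name="DC.creator" content="Roe, Richard">
<meta name="DC.date" content="2017-11-01">
</head><body><p>Only this sentence is shown on screen.</p></body></html>"""


def extract_metadata(html):
    """Return a dict mapping each meta name to the list of its content values."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.metadata


metadata = extract_metadata(SAMPLE_PAGE)
for name, values in metadata.items():
    print(name, values)
```

The same approach could describe a fragment rather than a whole page, provided the fragment has its own identifier to attach the description to.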

Discussion

Marcello Vitali-Rosati: Critical remark on the notion of the digital palimpsest: the palimpsest is a poetic metaphor that risks obscuring the interpretive layers of code. There is nothing "erased," rewritten, or evanescent in the structure of digital content. There is a necessary and deterministic dependence between the hardware structure and the binary representation of characters (ASCII, etc.). The transition from one layer to another is a political matter: it is a question of establishing the standards that define what text is. The palimpsest metaphor risks overshadowing these political questions.

Reference to Kittler, to counter the palimpsest metaphor.

Friedrich Kittler: Mode protégé - Les presses du réel (book). (n.d.). Retrieved March 13, 2025, from https://www.lespressesdureel.com/ouvrage.php?id=3852&menu=0

Frédéric Clavert (in the chat): If I may, regarding the discussion on the palimpsest metaphor: not coming from a discipline that uses this notion, I am struck that there are no links with everything done on web archives (Niels Brügger, for example, who spends a lot of time looking at the different levels of analysis of archived web pages -- cf. https://direct.mit.edu/books/monograph/4215/The-Archived-Web-Doing-History-in-the-Digital-Age). Without even mentioning the political logic that Marcello refers to.

Brügger, N. (2018). The Archived Web: Doing History in the Digital Age. The MIT Press. https://doi.org/10.7551/mitpress/10726.001.0001

Gérald Kembellec: One can decorate the sub-layers and lie about the content of the text. The screen writing can be completely different from what the source code/sub-layer announces. It is possible to completely lie about the semantic content. This is the game of SEO. Theoretically, what is presented in the source code and on the screen in HTML5 is supposed to be identical, and the visual should be constitutive of what is coded.
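The gap between the two layers can be checked mechanically. The sketch below is a deliberately simplified illustration: it treats the `keywords` meta element as the declared sub-layer and the text inside `<body>` as the screen layer, then reports which declared terms never appear on screen. These choices, the sample page, and the names `LayerSplitter` and `compare_layers` are assumptions made for this example. Text hidden with CSS would still count as visible here.

```python
from html.parser import HTMLParser


class LayerSplitter(HTMLParser):
    """Separate declared keywords (sub-layer) from body text (screen layer)."""

    def __init__(self):
        super().__init__()
        self.declared = set()
        self.visible_chunks = []
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.declared.update(
                k.strip().lower() for k in content.split(",") if k.strip()
            )
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            self.visible_chunks.append(data)


# Hypothetical page (assumption): the source announces a topic the screen never shows.
SAMPLE_PAGE = """<html><head>
<meta name="keywords" content="artificial intelligence, palimpsest, semantic web">
</head><body><h1>The digital palimpsest</h1>
<p>Notes on the semantic web.</p></body></html>"""


def compare_layers(html):
    """Return (declared terms shown on screen, declared terms absent from the screen)."""
    splitter = LayerSplitter()
    splitter.feed(html)
    visible_text = " ".join(splitter.visible_chunks).lower()
    shown = {k for k in splitter.declared if k in visible_text}
    return sorted(shown), sorted(splitter.declared - shown)


shown, only_in_source = compare_layers(SAMPLE_PAGE)
print("shown on screen:", shown)
print("declared in source only:", only_in_source)
```

The reverse check, terms displayed everywhere on screen but absent from the declared metadata, would follow the same pattern.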

The transition between layers can be random or constructed, so meaning can be superimposed each time one moves from one layer to another. This transition, which masks the structure, is part of indexing practices that show more information and results in the graphical layer compared to its sub-layers.

Marcello Vitali-Rosati: The transition between one protocol and another is always politically determined by the establishment of standards. The hardness of each layer means they are never ephemeral. In any case, it is a deterministic system with political devices that handle the negotiation and mediation between layers. To observe the limits of transitions between layers, one can consider the system as closed (Marcello’s view) and question the political stakes, or consider the system as open (Gérald’s view).

Gérald Kembellec (starting from the example of note-taking): If we transform Christelle’s notes with an automated tool and device, we are in the 19th century, and from this perspective, Marcello is right. However, if we put the notes inside a book, what we see is the cover; the real notes with meaning do not appear. We should understand the palimpsest as Tim Berners-Lee did when he moved from SGML in 1991 to HTML for presentation and decoupled form from content, and then with XHTML, he created a strong structure where form and content could be worked on together. The logic of XHTML aligns with Marcello’s perspective. However, with HTML5, as envisioned by the WHATWG, one can flatten meaning on the screen and have something different in the source. In this case, the writing of meaning is different from the visual rendering and is not industrial but editorial. This logic does not make sense for the dissemination of scientific publications. For example, one can visually put "AI," the buzzword, everywhere but not necessarily in the source. However, entrusting an AI to fetch metadata, i.e., on an industrial scale, leaves no room for cheating on the content.

Regarding the wars between XHTML and HTML5: two completely different approaches with schema.org and json-ld on one side and RDFa on the other. Today, we seek a balance between information and communication (a unified discipline in France). For example, Sire analyzed Google’s code: one can have a purely communicational vision rather than an informational one.

Sire, G. (2018). Web sémantique: Les politiques du sens et la rhétorique des données. Les Enjeux de l’information et de la communication, 192(2), 147‑160. https://doi.org/10.3917/enic.025.0147

Split within the W3C consortium: WHATWG boxed HTML5 into schemas, and it works very well because Google and its algorithms support this logic with a commercial vision. On the other side are the BnF and RDFa, with a more open vision of the web. These technologies carry ideological positions. The choice of authorities, structures, and vocabularies positions institutions.

Marcello Vitali-Rosati: The takeover of WHATWG by Google has made the web much less structured compared to its previous versions. With HTML5, data is less structured to fit commercial logic. Big Tech has won the browser war and web governance. This transition has been little debated.

Gérald Kembellec: Proposes using an AI to integrate carefully chosen, well-considered tags. David Shotton: integrate the principles of writing good content with collaboration between author, editor, and documentalist. The "intellectual primitives" of Bruno Latour and John Unsworth express a similar idea, but from a less anthropological and more sociological perspective on the circulation of scientific information through our shared writings on the web. Knowledge discovery would then be designed from the moment the article is written, rather than entrusted to an interface over which we have no control.

Experimentation in 2017: Authority records were automatically generated by Google thanks to the formal structuring of data with schema.org in the bio-bibliographic pages of the francophone art criticism project. The pages were rewritten in sub-layers. All semantic information about researchers in the HTML pages came from a single CSV, and the information was available without transformation (without sub-layers). The pages thus obtained a text box presenting structured information at the top of the search engine's results page.
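A minimal sketch of this kind of pipeline, from one CSV to a schema.org description embedded in each page, could look like the following. The minutes do not say which syntax the 2017 experiment used; JSON-LD is assumed here for brevity. The column names, the sample rows, and the functions `row_to_person` and `person_to_script_tag` are hypothetical, while `Person`, `birthDate`, `affiliation`, and `sameAs` are existing schema.org terms.

```python
import csv
import io
import json

# Hypothetical CSV (assumption): one row per researcher, invented values.
SAMPLE_CSV = """name,birth_date,affiliation,same_as
Jane Doe,1970-01-01,Example University,https://example.org/people/jane-doe
Richard Roe,,,
"""


def row_to_person(row):
    """Map one CSV row to a schema.org Person, skipping empty cells."""
    person = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": row["name"],
    }
    if row.get("birth_date"):
        person["birthDate"] = row["birth_date"]
    if row.get("affiliation"):
        person["affiliation"] = {"@type": "Organization", "name": row["affiliation"]}
    if row.get("same_as"):
        person["sameAs"] = row["same_as"]
    return person


def person_to_script_tag(person):
    """Serialize the description as a JSON-LD block for the page's HTML."""
    payload = json.dumps(person, ensure_ascii=False, indent=2)
    # Escape "</" so that no value can close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


people = [row_to_person(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for person in people:
    print(person_to_script_tag(person))
```

Keeping the CSV as the single source means that the visible page and its structured description can be generated from the same data, which limits the divergence between layers discussed earlier.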

Marcello Vitali-Rosati: Search engines do not consider the structured information we produce. Content is taken as plain text, and metadata is regenerated by algorithms because few journals actually include metadata. Sens public, for example, uses very rich RDFa, RAMEAU authority records, and ORCID identifiers, but this enrichment is ignored by the industrial processing tools. The issue lies more at the community level: we carry values that do not translate into the use of tools corresponding to these values.

Tension between our uses and values (DH) and the uses that the web and browsers make of the information we produce: Does it still make sense to ask for the use of tools that resist ease and focus on data structuring when browsers process text according to other logics?

Gérald Kembellec: Interest in semantic enrichment for the web peaked in 2017 with "rich snippets": blocks of Wikipedia information extracted by Google and placed at the top of the page. PeerJ deployed significant resources to produce semantic indexing: Does this work have any interest beyond the satisfaction of artisanal work? Example of RASH (Research Articles in Simplified HTML), whose articles illustrate this form/content work.

Peroni, S., Osborne, F., Iorio, A. D., Nuzzolese, A. G., Poggi, F., Vitali, F., & Motta, E. (2017). Research Articles in Simplified HTML: A Web-first format for HTML-based scholarly articles. PeerJ Computer Science, 3, e132. https://doi.org/10.7717/peerj-cs.132

Frédéric Clavert: What is the definition of "fair" between a human agent and AI?

Gérald Kembellec: Christelle and the project presented by Joaquine Barbet illustrate these fairness questions well.

Christelle Magdelaine: It is still difficult to balance what is fair against economic issues. The project on automating summary production with the analyst-indexer concluded that an LLM is not yet capable of replacing a human. The current obstacle is legal. AI should be an aid to human work, not a replacement. See the previous workshop for more details on the project conducted at CNAM.

References

Brügger, N. (2018). The Archived Web: Doing History in the Digital Age. The MIT Press. https://doi.org/10.7551/mitpress/10726.001.0001

Latour, B. (2001). Le métier de chercheur. Regard d’un anthropologue. Éditions Quæ. https://doi.org/10.3917/quae.latou.2001.01

Kembellec, G., & Bottini, T. (2017, November). Réflexions sur le fragment dans les pratiques scientifiques en ligne: Entre matérialité documentaire et péricope. 20e Colloque International sur le Document Numérique: CiDE.20. https://hal.science/hal-01700064

Kembellec, G. (2021). L’érudition numérique palimpseste. Hermès, La Revue, 87(1), 145‑158.

Friedrich Kittler: Mode protégé - Les presses du réel (book). (n.d.). Retrieved March 13, 2025, from https://www.lespressesdureel.com/ouvrage.php?id=3852&menu=0

Sire, G. (2018). Web sémantique: Les politiques du sens et la rhétorique des données. Les Enjeux de l’information et de la communication, 192(2), 147‑160. https://doi.org/10.3917/enic.025.0147

Zerilli, S. (2015). Alexandra Saemmer, Rhétorique du texte numérique: Figures de la lecture, anticipations de pratiques. Lectures. https://doi.org/10.4000/lectures.18678