Meeting with Juan Luis Gastaldi

Juan Luis Gastaldi is currently investigating the epistemological aspects of

tokenization, a key step in the Natural Language Processing (NLP) pipeline.

Aligned with the themes of Revue3.0, Gastaldi explores what NLP algorithms

can tell us about the nature of language itself. According to his hypothesis,

these algorithms reveal formal structures of language through their operation.

Understanding these structures provides insight into the notion of language as

modeled by these algorithms.

Introduction and context of the analysis

The exploration of this research question must be grounded in several theoretical

and technical premises:

1. A clear distinction must be made between the language modelling embodied

by LLMs and the interfaces of chatbots, which are in no way part of this

modelling process.

2. A clear distinction must also be established between language models,

which are matrix representations of language, and the functions responsible

for their training.

3. LLMs are of a formal nature. They are not empirical objects. Their

epistemological scope can only be understood through a formal approach,

not through experimental methods.

4. LLMs are statistical models. A statistical model is a function that generates a probability distribution over a set of data. Statistical models are therefore inherently probabilistic, and hence stochastic.

5. The reference corpus constitutes the sole material component of LLM systems. In this context, a corpus is a delimited material space, with fixed boundaries, that can be traversed by specific functions.

6. Today, the training and evaluation of LLMs primarily rely on a statistical approach known as the “maximum entropy” principle. In this framework, LLM training is performed on one subset of the corpus, while evaluation is carried out on another, disjoint subset of the same corpus. Although the corpus is material in nature, it is thereby treated as a statistical object.
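The notion of a statistical model as a function generating a probability distribution over data, trained on one subset of a corpus and evaluated on another, can be illustrated with a minimal sketch. The corpus below is a hypothetical toy example, and the model is a maximum-likelihood unigram model with add-one smoothing, not anything specific to LLMs:

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be a large, fixed text collection.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat and a dog sat on a mat .").split()

# Train on one subset of the corpus; hold out a disjoint subset for evaluation.
split = int(0.8 * len(corpus))
train, held_out = corpus[:split], corpus[split:]

# A minimal statistical model: a function from the training data to a
# probability distribution over word types (maximum-likelihood unigram model,
# with add-one smoothing so unseen words still get nonzero probability).
counts = Counter(train)
vocab = set(corpus)

def prob(word):
    return (counts[word] + 1) / (len(train) + len(vocab))

# The model's output is a genuine probability distribution: it sums to 1.
total = sum(prob(w) for w in vocab)
```

Evaluating the model then amounts to asking how much probability it assigns to the held-out words it was not trained on.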

Induction vs. deduction

For Chomsky, a grammar defines a subset of all possible expressions. Grammars are thus always deduced, not induced. LLMs appear to contradict this hypothesis because they are statistical in nature, and therefore inductive and stochastic.

However, according to Gastaldi’s hypothesis, the modelling embodied by LLMs

reveals the presence of a macroscopic structural coherence related to language in

general, independent of individual grammars. This macroscopic element could

be linked to Chomsky’s notion of a context-free grammar, but, according to

Gastaldi, it is more pertinent to identify it with a system of types, in the sense

used in programming. More broadly, Gastaldi’s approach explores how LLMs

enable us to observe the implicit structure of language as a whole.

To conduct this analysis, Gastaldi examines the notion of token, which, in the

context of LLMs, represents the fundamental unit of language.

Language units

A token is a sequence of characters that frequently appear together in a corpus. Tokens are thus derived from an underlying set of characters.
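This derivation of tokens from characters can be made concrete with a sketch of byte-pair encoding (BPE), a widely used tokenization scheme (the report does not name a specific algorithm, so BPE here stands in as a representative example, and the word frequencies are hypothetical toy data):

```python
from collections import Counter

# Hypothetical toy word frequencies standing in for a corpus.
word_freq = {"lower": 5, "lowest": 3, "newer": 6, "newest": 4}

# Start from individual characters: each word is a sequence of character tokens.
seqs = {w: tuple(w) for w in word_freq}

def most_frequent_pair(seqs, word_freq):
    """Count adjacent token pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for w, seq in seqs.items():
        for pair in zip(seq, seq[1:]):
            pairs[pair] += word_freq[w]
    return pairs.most_common(1)[0][0]

def merge(seqs, pair):
    """Fuse every occurrence of the pair into a single, larger token."""
    merged = {}
    for w, seq in seqs.items():
        out, i = [], 0
        while i < len(seq):
            if tuple(seq[i:i + 2]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged[w] = tuple(out)
    return merged

# A few merge steps: character sequences that often appear together
# progressively fuse into tokens.
for _ in range(4):
    seqs = merge(seqs, most_frequent_pair(seqs, word_freq))
```

After four merges, "lower" is segmented as ("lo", "wer") and "newer" as ("ne", "wer"): the tokens emerge purely from co-occurrence statistics over characters.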

Tokenization raises a fundamental question: what constitutes a linguistic unit,

or the smallest unit of language?

Western philosophical tradition offers two answers to this question:

1. A linguistic unit exists if and only if it has a referent in the empirical world. While structuralism rejects this hypothesis, LLMs, as formal objects, provide evidence that this referential perspective is not sufficient to define linguistic units.

2. According to the structuralist approach, linguistic units are elements that

are actualized within a system of relations, meaning within a specific

structure. They depend on the structure from which they emerge.

Markus Reisenleitner’s intervention prompted reflection on the role that this account assigns to the written and acoustic materiality of the phoneme. According to Gastaldi, the material elements of a language determine the evolution of its formal structure over time.

Gastaldi’s most recent research

In light of these perspectives, Gastaldi’s recent work seeks to highlight the

implicit structure of LLMs by analyzing them through the lens of linear algebra.

His study relies on a formal analysis of word embeddings, a key technique in the functioning of modern LLMs. Word embeddings are dense vectors computed through a vectorization process which, at its core, counts words in context across a large textual dataset in order to assign each word a representation in a continuous space. The vectorization process is therefore an implicit factorization of a matrix containing information about word usage.
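The counting-and-factorization view can be sketched with an explicit (rather than implicit) factorization: build a word-context co-occurrence matrix from a toy corpus and factorize it with a truncated SVD, yielding one dense vector per word. The corpus and window size below are hypothetical choices; methods such as word2vec perform a comparable factorization only implicitly:

```python
import numpy as np

# Hypothetical toy corpus; real embeddings are trained on billions of tokens.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased a dog",
]
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count words in context: co-occurrences within a +/-2 word window.
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    words = s.split()
    for pos, w in enumerate(words):
        for ctx in words[max(0, pos - 2):pos] + words[pos + 1:pos + 3]:
            counts[index[w], index[ctx]] += 1

# Factorize the count matrix explicitly with a truncated SVD: each word
# receives a dense, low-dimensional vector (here, 2 dimensions).
U, S, Vt = np.linalg.svd(counts)
embeddings = U[:, :2] * S[:2]   # one 2-d vector per vocabulary word
```

Keeping only the leading singular directions is what turns the sparse count matrix into a compact continuous representation.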

After this implicit factorization, the representation space must be reduced. The optimal reduction organizes the data according to their internal similarities. The space is then reduced further by decreasing the number of dimensions that compose it: a change of basis is performed, yielding eigenvectors, which organize the space optimally around its principal vectorial directions.
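The change of basis described here can be sketched with a standard eigendecomposition of a covariance matrix, the technique underlying principal component analysis; the "word vectors" below are synthetic data chosen only so that most variance lies along two directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word vectors: 20 points in 5 dimensions, constructed so that
# almost all variance is concentrated along two latent directions.
data = (rng.normal(size=(20, 2)) @ rng.normal(size=(2, 5))
        + 0.01 * rng.normal(size=(20, 5)))
data = data - data.mean(axis=0)

# Eigendecomposition of the covariance matrix yields an orthogonal basis of
# eigenvectors, ordered by how much variance each direction explains.
cov = data.T @ data / len(data)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Change of basis: re-express each point in eigenvector coordinates, then
# keep only the top 2 directions to reduce the dimensionality.
reduced = data @ eigvecs[:, :2]
```

The eigenvectors are the "vectorial directionalities" around which the reduced space is organized: the first few capture nearly all of the structure in the data.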

Through a compositional analysis, it is possible to observe fixed points in the

space of eigenvectors corresponding to the formal definition of computational

types. The identification of these structural types aligns with the structuralist

conception of language, according to which language is defined as paradigmatic,

semiotic, and hierarchical.

For more details on these analyses, we refer you to Juan Luis Gastaldi’s most recent publications, available on his official website (https://www.giannigastaldi.com/), particularly his article titled The Structure of Meaning in Language: Parallel Narratives in Linear Algebra and Category Theory (2024).