Meeting with Juan Luis Gastaldi
Juan Luis Gastaldi is currently investigating the epistemological aspects of
tokenization, a key step in the Natural Language Processing (NLP) pipeline.
Aligned with the themes of Revue3.0, Gastaldi explores what NLP algorithms
can tell us about the nature of language itself. According to his hypothesis,
these algorithms reveal formal structures of language through their operation.
Understanding these structures provides insight into the notion of language as
modeled by these algorithms.
Introduction and context of the analysis
The exploration of this research question must be grounded in several theoretical
and technical premises:
1. A clear distinction must be made between the language modelling embodied
by LLMs and the interfaces of chatbots, which are in no way part of this
modelling process.
2. A clear distinction must also be established between language models,
which are matrix representations of language, and the functions responsible
for their training.
3. LLMs are of a formal nature. They are not empirical objects. Their
epistemological scope can only be understood through a formal approach,
not through experimental methods.
4. LLMs are statistical models. A statistical model is a function that generates
a probability distribution over a set of data. Statistical models are thus
inherently probabilistic, hence stochastic.
5. The reference corpus constitutes the sole material component of LLM
systems. In this context, a corpus is indeed a defined material space, with
fixed boundaries, that can be traversed using specific functions.
6. Today, the training and evaluation of LLMs primarily rely on a statistical
approach known as the “maximum entropy” principle. In this framework,
LLM training is performed on a small subset of the corpus, while evaluation
is carried out on another small subset of the same corpus. In this context,
although the corpus is of a material nature, it is treated as a statistical
object.
Induction vs. deduction
For Chomsky, a grammar is a subset of all possible expressions. Grammars are
thus always deduced, not induced. LLMs appear to contradict this hypothesis
because they are statistical in nature, thus inductive and stochastic.
However, according to Gastaldi’s hypothesis, the modelling embodied by LLMs
reveals the presence of a macroscopic structural coherence related to language in
general, independent of individual grammars. This macroscopic element could
be linked to Chomsky’s notion of a context-free grammar, but, according to
Gastaldi, it is more pertinent to identify it with a system of types, in the sense
used in programming. More broadly, Gastaldi’s approach explores how LLMs
enable us to observe the implicit structure of language as a whole.
To conduct this analysis, Gastaldi examines the notion of token, which, in the
context of LLMs, represents the fundamental unit of language.
Language units
A token is a sequence of characters that frequently occur together in a corpus.
Tokens are thus derived from a set of characters.
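One common way tokens are derived from characters is byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent pair of symbols. The sketch below, using a hypothetical toy vocabulary, shows a single merge step; it illustrates the general idea, not the specific tokenizer of any given LLM:

```python
from collections import Counter

# Toy vocabulary: words as character tuples, with corpus frequencies.
# (Hypothetical data, for illustration only.)
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 4}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(vocab)  # ("l", "o") occurs 11 times
vocab = merge_pair(vocab, pair)   # "lo" is now a single token
```

Iterating this step yields progressively longer tokens, so the resulting units reflect nothing but the distributional regularities of the corpus.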
Tokenization raises a fundamental question: what constitutes a linguistic unit,
or the smallest unit of language?
Western philosophical tradition offers two answers to this question:
1. A linguistic unit exists if and only if it has a referent in the empirical world.
Structuralism rejects this hypothesis, and LLMs, as formal objects, provide
evidence that this referential perspective is not sufficient to define
linguistic units.
2. According to the structuralist approach, linguistic units are elements that
are actualized within a system of relations, meaning within a specific
structure. They depend on the structure from which they emerge.
Markus Reisenleitner’s intervention invited reflection on the role assigned, in this
discourse, to the written and acoustic materiality of the phoneme. According
to Gastaldi, the material elements of a language determine the evolution of its
formal structure over time.
Gastaldi’s most recent research
In light of these perspectives, Gastaldi’s recent work seeks to highlight the
implicit structure of LLMs by analyzing them through the lens of linear algebra.
His study relies on a formal analysis of word embeddings, a key technique in
the functioning of modern LLMs. Word embeddings are dense vectors computed
through a vectorization process. At its core, vectorization counts words in
context across a large textual dataset in order to assign each word a
representation in a continuous space. The vectorization process is therefore
an implicit factorization of a matrix containing information about word usage.
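This count-then-factorize view can be sketched explicitly: build a word-context co-occurrence matrix from a toy corpus, then factorize it with a truncated SVD whose row factors serve as dense embeddings. The corpus below is hypothetical, and modern LLMs learn embeddings by gradient descent rather than explicit SVD; the sketch only illustrates the implicit factorization described here:

```python
import numpy as np

# Toy corpus; the sentences are hypothetical illustrations.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
words = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(words)}

# Word-context co-occurrence matrix (context = same sentence).
M = np.zeros((len(words), len(words)))
for s in sentences:
    for w in s:
        for c in s:
            if w != c:
                M[index[w], index[c]] += 1

# Explicit factorization by truncated SVD:
# rows of U * S are dense word embeddings.
U, S, Vt = np.linalg.svd(M)
k = 2  # reduced dimensionality
embeddings = U[:, :k] * S[:k]
print(embeddings.shape)  # (vocabulary size, k)
```

Words that occur in similar contexts end up with similar rows of M, and hence with nearby embedding vectors after factorization.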
After this implicit factorization, the data representation space must be reduced.
The optimal method for this reduction organizes the data according to their
internal similarities. The space is then further reduced by decreasing the
number of dimensions that compose it: a change of basis is performed, yielding
eigenvectors that organize the space optimally around its principal directions.
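The basis change described above can be sketched as a principal component analysis: the eigenvectors of the data's covariance matrix form a new basis ordered by variance, and keeping only the top directions reduces the dimensionality of the space. The data here are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "embedding" data: 100 points in 5 dimensions (hypothetical).
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)  # center the data

# The covariance matrix captures internal similarities between dimensions.
cov = Xc.T @ Xc / (len(Xc) - 1)

# Basis change: eigenvectors give the directions of maximal variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
basis = eigvecs[:, order[:2]]      # keep the top-2 eigen-directions

X_reduced = Xc @ basis  # data expressed in the new, smaller basis
print(X_reduced.shape)  # (100, 2)
```

The first coordinate of the reduced space carries at least as much variance as the second, which is what makes this organization of the space "optimal" in the least-squares sense.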
Through a compositional analysis, it is possible to observe fixed points in the
space of eigenvectors corresponding to the formal definition of computational
types. The identification of these structural types aligns with the structuralist
conception of language, according to which language is defined as paradigmatic,
semiotic, and hierarchical.
For more details on these analyses, we refer the reader to Juan Luis Gastaldi’s
most recent publications, available on his official website
(https://www.giannigastaldi.com/), particularly his article titled The Structure
of Meaning in Language: Parallel Narratives in Linear Algebra and Category
Theory (2024).