% English Version
\documentclass[12pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{hyperref}
\usepackage{geometry}
\geometry{a4paper, left=3cm, right=3cm, top=3cm, bottom=3cm}
\usepackage{setspace}
\onehalfspacing
\usepackage{parskip}
\usepackage[english]{babel}
\usepackage{csquotes}
\usepackage{microtype}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{listings}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{float}
\usepackage{url}
\usepackage{natbib}
\usepackage{titling}
\lstset{
language=Lisp,
basicstyle=\ttfamily\small,
keywordstyle=\color{blue},
commentstyle=\color{green!40!black},
stringstyle=\color{red},
showstringspaces=false,
numbers=left,
numberstyle=\tiny,
numbersep=5pt,
breaklines=true,
frame=single,
backgroundcolor=\color{gray!5},
tabsize=2,
captionpos=b
}
\title{\Huge\textbf{Grammar Induction, Transduction, and Parsing} \\
\LARGE ARS as a Methodological Precursor to \\[2mm]
\LARGE Explainable Neuro-Symbolic AI}
\author{
\large
\begin{tabular}{c}
Paul Koop
\end{tabular}
}
\date{\large 1994--2026}
\begin{document}
\maketitle
\begin{abstract}
This paper examines the historical and methodological relationship between the
Algorithmic Recursive Sequence Analysis (ARS) and contemporary neuro-symbolic
AI. Drawing on three early implementations of ARS—an inductor in Scheme, a parser
in Pascal, and a transducer in Lisp (1994)—as well as a large language model
simulation in Python (2023), I argue that ARS constitutes a \textit{proto-neuro-symbolic}
methodology. Unlike purely statistical language models, ARS produces explicit,
falsifiable, and intersubjectively verifiable grammars. The paper demonstrates
that the core challenges of today's neuro-symbolic AI—integrating pattern
recognition with rule-based reasoning, ensuring explainability, and maintaining
methodological control—were already addressed in ARS decades ago. I situate ARS
within Henry Kautz's taxonomy of neuro-symbolic architectures, evaluate it
against XAI criteria (meaningfulness, accuracy, knowledge limits), and contrast
it with large language models that simulate without explaining. The paper
concludes with methodological lessons for contemporary neuro-symbolic research.
\end{abstract}
\newpage
\tableofcontents
\newpage
\section{Introduction: The Hidden Heritage of ARS}
The current discourse on neuro-symbolic AI is marked by a curious amnesia.
While researchers debate architectures that integrate neural networks with
symbolic reasoning \citep{hitzler2022neuro, garcez2020neurosymbolic}, a
methodologically sophisticated precursor has largely been forgotten: the
\textbf{Algorithmic Recursive Sequence Analysis (ARS)}.
Developed initially in 1994 and continuously refined through 2026, ARS
represents one of the earliest systematic attempts to combine qualitative
hermeneutics with formal grammar induction. Unlike contemporary large language
models (LLMs), which learn statistical patterns from massive corpora but remain
opaque, ARS produces \textbf{explicit, falsifiable, and intersubjectively
verifiable grammars}. Unlike purely symbolic approaches, which suffer from
the knowledge acquisition bottleneck, ARS induces rules from empirical
protocols.
This paper makes three contributions:
\begin{enumerate}
\item It reconstructs three early ARS implementations—an \textbf{inductor}
in Scheme, a \textbf{parser} in Pascal, and a \textbf{transducer} in
Lisp—showing how each addresses a different aspect of sequence analysis.
\item It interprets these implementations as \textbf{proto-neuro-symbolic}
systems, situating them within Henry Kautz's taxonomy of neuro-symbolic
architectures \citep{kautz2020third}.
\item It contrasts ARS with a large language model trained on the same
corpus, demonstrating that LLMs simulate but do not \textit{explain}—a
distinction central to XAI (Explainable AI) criteria \citep{ortigossa2024xai}.
\end{enumerate}
The paper does not claim that ARS is a neuro-symbolic system in the contemporary
sense—it lacks neural components. Rather, I argue that ARS embodies the
\textit{methodological logic} of neuro-symbolic integration: the combination of
pattern-based induction (System 1) with rule-based explication (System 2),
maintaining explainability through design.
\section{Three Implementations, One Corpus}
\subsection{The Empirical Foundation: A Market Conversation}
All implementations analyzed in this paper are based on the same empirical
corpus: a transcribed sales conversation recorded at Aachen market square on
June 28, 1994. The transcript was subjected to qualitative sequential analysis
following the methodology of objective hermeneutics \citep{oevermann1979methodology},
resulting in a string of 18 terminal symbols drawn from 12 categories (KBG, VBG,
KBBd, VBBd, KBA, VBA, KAE, VAE, KAA, VAA, KAV, VAV).
The terminal symbol string used throughout is:
\begin{verbatim}
KBG VBG KBBd VBBd KBA VBA KBBd VBBd KBA VBA KAE VAE KAE VAE KAA VAA KAV VAV
\end{verbatim}
\subsection{Inductor (Scheme, 1994): From Corpus to Grammar}
The inductor, written in Scheme, is the foundational component of ARS. Its
function is to read a corpus of terminal symbols and induce a probabilistic
context-free grammar (PCFG) by counting transitions.
\subsubsection{Core Data Structures}
\begin{lstlisting}[caption=Lexicon and Transformation Matrix in Scheme]
;; Lexicon: 12 terminal symbols
(define lexikon (vector 'KBG 'VBG 'KBBd 'VBBd 'KBA 'VBA
                        'KAE 'VAE 'KAA 'VAA 'KAV 'VAV))

;; Transformation matrix counting transitions
(define matrix (vector zeile0 zeile1 ... zeile17))

;; Function to count transitions: increments the matrix cell for each
;; adjacent pair of symbols. izeichen (defined elsewhere) is assumed to
;; map a symbol to its index in the lexicon.
(define (transformationenZaehlen korpus)
  (vector-set! (vector-ref matrix (izeichen (car korpus)))
               (izeichen (car (cdr korpus)))
               (+ 1 (vector-ref (vector-ref matrix (izeichen (car korpus)))
                                (izeichen (car (cdr korpus))))))
  (if (not (null? (cdr (cdr korpus))))
      (transformationenZaehlen (cdr korpus))))
\end{lstlisting}
\subsubsection{Induced Grammar}
The resulting grammar is:
\begin{verbatim}
(KBG -> . VBG)
(VBG -> . KBBd)
(KBBd -> . VBBd)
(VBBd -> . KBA)
(KBA -> . VBA)
(VBA -> . KBBd) (VBA -> . KAE)
(KAE -> . VAE)
(VAE -> . KAE) (VAE -> . KAA)
(KAA -> . VAA)
(VAA -> . KAV)
(KAV -> . VAV)
\end{verbatim}
\subsubsection{Interpretation}
The inductor transforms the empirical protocol into an \textbf{explicit rule
system}. Each production rule is weighted by its empirical frequency. The
transformation is reversible in a statistical sense: given the grammar, one can
generate sequences that reproduce the transition frequencies of the original corpus.
In neuro-symbolic terms, the inductor performs \textbf{symbolic abstraction}
from discrete data. It does not learn weights through backpropagation but
through simple counting—a transparent, verifiable process.
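The counting step can be sketched in a few lines of Python. This is a minimal
illustration, not the original Scheme program: the names \texttt{CORPUS},
\texttt{count\_transitions}, and \texttt{induce\_rules} are assumptions
introduced here, and relative frequencies stand in for the rule weights.
\begin{lstlisting}[language=Python, caption={Sketch: inducing weighted rules by counting transitions}]
from collections import Counter, defaultdict

# Terminal symbol string from the market conversation transcript.
CORPUS = ("KBG VBG KBBd VBBd KBA VBA KBBd VBBd KBA VBA "
          "KAE VAE KAE VAE KAA VAA KAV VAV").split()


def count_transitions(corpus):
    """Count how often each symbol is directly followed by another."""
    return Counter(zip(corpus, corpus[1:]))


def induce_rules(corpus):
    """Turn transition counts into rules weighted by relative frequency."""
    counts = count_transitions(corpus)
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    rules = defaultdict(dict)
    for (lhs, rhs), n in counts.items():
        rules[lhs][rhs] = n / totals[lhs]
    return dict(rules)


if __name__ == "__main__":
    for lhs, successors in induce_rules(CORPUS).items():
        for rhs, weight in successors.items():
            print(f"{lhs} -> {rhs}  ({weight:.2f})")
\end{lstlisting}
For the corpus above, the sketch yields 13 rules. The two branching points,
VBA and VAE, each split evenly (0.5) between their two successors.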
\subsection{Parser (Pascal, 1992): Validating Well-Formedness}
The parser, written in Pascal, implements a chart parser that decides whether
a given terminal symbol string is \textit{well-formed} according to the
induced grammar.
\subsubsection{Key Data Types}
\begin{lstlisting}[language=Pascal, caption=Parser Data Structures in Pascal]
TYPE
  TKategorien = (Leer, VKG, BG, VT, AV, B, A, BBD, BA, AE, AA,
                 KBG, VBG, KBBD, VBBD, KBA, VBA, KAE, VAE,
                 KAA, VAA, KAV, VAV);
  TKante = RECORD
    Kategorie        : TKategorien;
    vor, nach, zeigt : PTKante;
    gefunden         : PTKantenListe;
    aktiv            : BOOLEAN;
    nummer           : INTEGER;
    CASE Wort : BOOLEAN OF
      TRUE  : (inhalt  : STRING);
      FALSE : (gesucht : PTKategorienListe);
  END;
\end{lstlisting}
\subsubsection{Parsing Algorithm}
The parser implements a standard chart parsing algorithm with three core rules:
\begin{enumerate}
\item \textbf{Initialization}: Terminal symbols are added as active edges.
\item \textbf{Prediction}: New edges are created for nonterminals that can
start at a given position.
\item \textbf{Completion}: When a nonterminal is fully matched, it triggers
completion of higher-level rules.
\end{enumerate}
\subsubsection{Interpretation}
The parser operationalizes the concept of \textbf{structural well-formedness}.
A sequence is not merely \enquote{plausible}; its well-formedness is formally decidable. This anticipates
the deterministic finite automaton (DFA) later formalized in
\texttt{ARS\_XAI\_Aut\_Ger.tex}.
In XAI terms, the parser embodies \textbf{explainability by design}: every
decision to accept or reject a sequence can be traced to explicit rules.
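Rule-level traceability can be illustrated with a much simpler acceptor than
the Pascal chart parser. The sketch below is a simplification under stated
assumptions: it checks a sequence only against the flat transition rules listed
in the induced grammar, ignores the nonterminal categories of the parser, and
assumes \texttt{KBG} as start symbol and \texttt{VAV} as final symbol. The
names \texttt{RULES} and \texttt{check} are introduced for illustration.
\begin{lstlisting}[language=Python, caption={Sketch: a transition-level acceptor with an explanation trace}]
# Flat transition rules as listed in the induced grammar.
RULES = {
    "KBG": {"VBG"}, "VBG": {"KBBd"}, "KBBd": {"VBBd"},
    "VBBd": {"KBA"}, "KBA": {"VBA"}, "VBA": {"KBBd", "KAE"},
    "KAE": {"VAE"}, "VAE": {"KAE", "KAA"}, "KAA": {"VAA"},
    "VAA": {"KAV"}, "KAV": {"VAV"},
}
START, FINAL = "KBG", "VAV"  # assumed start and final symbols


def check(sequence):
    """Return (accepted, trace). The trace names every rule applied
    and, on rejection, the reason for the failure."""
    if not sequence or sequence[0] != START:
        return False, [f"sequence must start with {START}"]
    trace = []
    for left, right in zip(sequence, sequence[1:]):
        if right not in RULES.get(left, set()):
            trace.append(f"no rule {left} -> {right}")
            return False, trace
        trace.append(f"{left} -> {right}")
    if sequence[-1] != FINAL:
        trace.append(f"sequence must end with {FINAL}")
        return False, trace
    return True, trace


if __name__ == "__main__":
    ok, trace = check("KBG VBG KAE".split())
    print(ok, trace[-1])
\end{lstlisting}
For the malformed input \texttt{KBG VBG KAE}, the acceptor stops at the missing
rule \texttt{VBG -> KAE} and reports it, so a rejection is never unexplained.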
\subsection{Transducer (Lisp, 1994): Generating New Protocols}
The transducer, written in Lisp, generates new terminal symbol strings from
the induced grammar, simulating possible sales conversations.
\subsubsection{Generation Algorithm}
\begin{lstlisting}[caption=Transducer in Lisp]
;; Generates a sequence starting at symbol st, using the rule list r
(defun gs (st r)
  (cond
    ((equal st nil) nil)
    ((atom st) (cons st (gs (next st r (random 101)) r)))
    (t (cons (eval st) (gs (next st r (random 101)) r)))))

;; Selects the next symbol based on weighted probabilities: returns the
;; last element of the first rule for st whose threshold is at least z
(defun next (st r z)
  (cond
    ((equal r nil) nil)
    ((and (<= z (car (cdr (car r))))
          (equal st (car (car r))))
     (car (reverse (car r))))
    (t (next st (cdr r) z))))
\end{lstlisting}
\subsubsection{Example Output}
A typical generated sequence (brackets removed for readability):
\begin{verbatim}
KBG VBG KBBD VBBD KBA VBA KAE VAE KAA VAA
KBBD VBBD KBA VBA KBBD VBBD KBA VBA KBBD VBBD KBA VBA KAE VAE KAA VAA
KAV VAV
\end{verbatim}
\subsubsection{Interpretation}
The transducer is a \textbf{generative model}—but unlike an LLM, its generation
process is fully transparent. Every symbol is produced by a rule that can be
inspected, traced, and justified. The transducer does not hallucinate; it
follows the grammar.
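The selection logic of \texttt{next} can be restated in Python. This is a
sketch under assumptions: the rule table with cumulative thresholds (50 and 100
at the two branching points) is hypothetical, since the original weights are
not reproduced here, and the names \texttt{RULES}, \texttt{next\_symbol}, and
\texttt{generate} are introduced for illustration.
\begin{lstlisting}[language=Python, caption={Sketch: threshold-based generation in the style of the Lisp transducer}]
import random

# Each rule: (current symbol, cumulative threshold, next symbol).
# The thresholds are hypothetical; branching symbols split roughly 50/50.
RULES = [
    ("KBG", 100, "VBG"), ("VBG", 100, "KBBd"), ("KBBd", 100, "VBBd"),
    ("VBBd", 100, "KBA"), ("KBA", 100, "VBA"),
    ("VBA", 50, "KBBd"), ("VBA", 100, "KAE"),
    ("KAE", 100, "VAE"),
    ("VAE", 50, "KAE"), ("VAE", 100, "KAA"),
    ("KAA", 100, "VAA"), ("VAA", 100, "KAV"), ("KAV", 100, "VAV"),
]


def next_symbol(state, z):
    """Return the successor of the first rule for `state` whose
    threshold is at least z, or None if no rule applies."""
    for lhs, threshold, rhs in RULES:
        if lhs == state and z <= threshold:
            return rhs
    return None


def generate(start="KBG", rng=random):
    """Follow the rules from `start` until no rule applies."""
    sequence, state = [], start
    while state is not None:
        sequence.append(state)
        # randint(0, 100) mirrors (random 101) in the Lisp code
        state = next_symbol(state, rng.randint(0, 100))
    return sequence


if __name__ == "__main__":
    print(" ".join(generate(rng=random.Random(0))))
\end{lstlisting}
Because every step is a table lookup, logging the triple of state, random
draw, and selected rule would yield a complete derivation trace for any
generated sequence.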
\subsection{The Large Language Model (Python, 2023): Simulation Without Explanation}
For comparison, a deep language model (LSTM-based) was trained on the same
corpus. The model architecture follows the implementation described in
\citet{trask2020neural}.
\subsubsection{Model Architecture}
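\begin{lstlisting}[language=Python, caption=LSTM Language Model in Python]
# Layer and Linear are assumed to come from the framework of the
# cited implementation; they are not defined in this excerpt.
class LSTMCell(Layer):
    def __init__(self, n_inputs, n_hidden, n_output):
        # input-to-hidden projections (forget, input, output, cell)
        self.xf = Linear(n_inputs, n_hidden)
        self.xi = Linear(n_inputs, n_hidden)
        self.xo = Linear(n_inputs, n_hidden)
        self.xc = Linear(n_inputs, n_hidden)
        # hidden-to-hidden projections (forget, input, output, cell)
        self.hf = Linear(n_hidden, n_hidden, bias=False)
        self.hi = Linear(n_hidden, n_hidden, bias=False)
        self.ho = Linear(n_hidden, n_hidden, bias=False)
        self.hc = Linear(n_hidden, n_hidden, bias=False)
        # hidden-to-output projection
        self.w_ho = Linear(n_hidden, n_output, bias=False)
\end{lstlisting}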
\subsubsection{Example Output}
\begin{verbatim}
KBG VBG
KBBD VBBD KBA VBA KAE VAE KAA VAA
KBBD VBBD KBA VBA KBBD VBBD KBA VBA KBBD VBBD KBA VBA KAE VAE
KAA VAA
KAV VAV
KBG VBG
KBBD VBBD KBA VBA KAE VAE KAE VAE KAE VAE KAE VAE KAA VAA
\end{verbatim}
\subsubsection{Interpretation}
The LLM output is \textbf{superficially indistinguishable} from the transducer's
output. Both generate plausible sequences of terminal symbols. However, the
similarity is deceptive:
\begin{itemize}
\item The \textbf{transducer's} output is generated by explicit, inspectable
rules. Every symbol's production can be traced to a grammar rule.
\item The \textbf{LLM's} output is generated by internal weights that are
not directly interpretable. One cannot explain \textit{why} a particular
symbol was chosen.
\end{itemize}
As noted in the original notebook:
\blockquote{In contrast to cognitivist models (ARS, Grammar Induction, Parser,
Grammar Transduction), such a large language model explains nothing and
therefore large language models are celebrated by postmodernism, posthumanism,
and transhumanism with parasitic intent.}
\section{ARS as Proto-Neuro-Symbolic AI}
\subsection{The Neuro-Symbolic Research Program}
Neuro-symbolic AI integrates neural methods (pattern recognition, learning from
data) with symbolic methods (logic, rules, reasoning). Henry Kautz's taxonomy
\citep{kautz2020third} distinguishes several architectural patterns:
\begin{table}[H]
\centering
\caption{Kautz's Neuro-Symbolic Architectures}
\label{tab:kautz}
\begin{tabular}{@{} p{4cm} p{8cm} @{}}
\toprule
\textbf{Architecture} & \textbf{Description} \\
\midrule
Neural | Symbolic & Neural perception, symbolic reasoning \\
Neural: Symbolic $\rightarrow$ Neural & Symbolic generation of training data \\
Neural\textsubscript{Symbolic} & Neural networks generated from symbolic rules \\
Neural[Symbolic] & Symbolic reasoning embedded in neural networks \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Locating ARS in the Taxonomy}
ARS does not fit neatly into any single category because it was developed
independently of the neural paradigm. However, if we read the qualitative
interpretation process as a form of \textbf{pattern recognition} (System 1) and
grammar induction as \textbf{symbolic reasoning} (System 2), ARS approximates
the \textbf{Neural | Symbolic} pattern:
\begin{itemize}
\item \textbf{Pattern recognition} (System 1): The human interpreter
identifies recurring patterns in the transcript, produces readings, and
falsifies alternatives—a form of pattern-based cognition.
\item \textbf{Symbolic reasoning} (System 2): The induced grammar, parser,
and transducer constitute a formal symbolic system that can be executed,
inspected, and validated.
\end{itemize}
What distinguishes ARS from contemporary neuro-symbolic systems is that the
pattern recognition component is \textbf{human}, not neural. This is not a
weakness but a deliberate methodological choice: it ensures that pattern
recognition remains interpretable and subject to intersubjective validation.
\subsection{The Three Components as Complementary Neuro-Symbolic Functions}
\begin{table}[H]
\centering
\caption{ARS Components and Their Neuro-Symbolic Functions}
\label{tab:components}
\begin{tabular}{@{} p{3cm} p{4cm} p{6cm} @{}}
\toprule
\textbf{Component} & \textbf{Language} & \textbf{Neuro-Symbolic Function} \\
\midrule
Inductor & Scheme & Symbol abstraction from discrete data \\
Parser & Pascal & Structural validation, well-formedness checking \\
Transducer & Lisp & Generative rule application \\
LLM (contrast) & Python & Pure pattern recognition without explanation \\
\bottomrule
\end{tabular}
\end{table}
Together, these three components form a \textbf{complete pipeline} from
empirical data to generative model—a pipeline that is fully transparent at
every step.
\section{XAI Validation of ARS}
The NIST principles of meaningfulness, explanation accuracy, and knowledge
limits \citep{ortigossa2024xai} provide a framework for evaluating explainability:
\subsection{Meaningfulness}
\begin{itemize}
\item \textbf{Inductor}: The transformation matrix and production rules are
directly interpretable. Each rule corresponds to an observed transition in
the corpus.
\item \textbf{Parser}: States (KBG, VBG, VKG, etc.) are semantically
meaningful categories derived from qualitative interpretation.
\item \textbf{Transducer}: Generation follows explicit rules that can be
inspected.
\item \textbf{LLM}: Weights and hidden states are not directly interpretable.
\end{itemize}
\subsection{Accuracy}
\begin{itemize}
\item \textbf{Inductor}: The induced grammar reproduces the empirical
transition frequencies with high correlation ($r = 0.9999$).
\item \textbf{Parser}: Well-formedness decisions are deterministic and
verifiable.
\item \textbf{Transducer}: Generated sequences follow the statistical
distribution of the corpus.
\item \textbf{LLM}: Training loss decreases, but the model does not produce
explicit rules that can be verified against the data.
\end{itemize}
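One way such a correlation could be computed is sketched below. It is an
illustration only and does not reproduce the reported value: the sample size,
the seed, the sampling of successors by observed frequency, and all helper
names are assumptions. The sketch compares relative transition frequencies in
the corpus with those in sequences generated from the induced rules, using the
Pearson coefficient.
\begin{lstlisting}[language=Python, caption={Sketch: correlating empirical and generated transition frequencies}]
import math
import random
from collections import Counter

CORPUS = ("KBG VBG KBBd VBBd KBA VBA KBBd VBBd KBA VBA "
          "KAE VAE KAE VAE KAA VAA KAV VAV").split()


def relative_frequencies(sequences, keys=None):
    """Relative frequency of each adjacent symbol pair."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    total = sum(counts.values())
    keys = keys if keys is not None else sorted(counts)
    return {k: counts[k] / total for k in keys}


def build_rules(corpus):
    """Successor lists keep duplicates, so a uniform choice among
    them samples each successor by its observed frequency."""
    rules = {}
    for left, right in zip(corpus, corpus[1:]):
        rules.setdefault(left, []).append(right)
    return rules


def generate(rules, rng, start="KBG"):
    """Sample one sequence; stop at a symbol without successors."""
    seq, state = [], start
    while state is not None:
        seq.append(state)
        options = rules.get(state)
        state = rng.choice(options) if options else None
    return seq


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation(n_samples=2000, seed=42):
    """Correlate corpus frequencies with frequencies in generated data."""
    empirical = relative_frequencies([CORPUS])
    rng = random.Random(seed)
    rules = build_rules(CORPUS)
    samples = [generate(rules, rng) for _ in range(n_samples)]
    simulated = relative_frequencies(samples, keys=list(empirical))
    return pearson(list(empirical.values()), list(simulated.values()))


if __name__ == "__main__":
    print(f"r = {correlation():.4f}")
\end{lstlisting}
Because the rules are estimated from the same corpus they are compared with,
a value close to 1 is expected for large samples. The coefficient therefore
measures the fidelity of the reconstruction, not its generalization.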
\subsection{Knowledge Limits}
\begin{itemize}
\item \textbf{ARS}: The grammar explicitly documents its data basis
(8 transcripts, 59 inter-acts). It makes no claim to generalization beyond
the corpus.
\item \textbf{LLM}: The model's limitations are not explicitly represented.
It may hallucinate or produce plausible but invalid sequences without
signaling uncertainty.
\end{itemize}
\section{Simulation vs.\ Explanation: The Fundamental Distinction}
\subsection{What LLMs Do: Statistical Simulation}
Large language models learn the statistical distribution of token sequences
from training data. When generating, they sample from this learned distribution.
This is \textbf{simulation}: the model produces outputs that resemble the
training distribution.
Crucially, simulation does not require understanding the \textit{rules} that
generate the data. An LLM trained on a corpus of sales conversations can
generate plausible new conversations without ever representing concepts like
\enquote{greeting,} \enquote{need clarification,} or \enquote{farewell.}
\subsection{What ARS Does: Explanatory Reconstruction}
ARS, in contrast, aims for \textbf{explanatory reconstruction}. It induces
explicit rules that \textit{constitute} the observed regularities. These rules
are not merely statistical summaries but \textbf{generative mechanisms} that
can be:
\begin{enumerate}
\item \textbf{Inspected}: The rules are written in a formal language
(Scheme, Pascal, Lisp).
\item \textbf{Traced}: Every generation step can be traced back to a rule.
\item \textbf{Falsified}: A counterexample can refute a rule.
\item \textbf{Communicated}: The rules can be shared, discussed, and
criticized by other researchers.
\end{enumerate}
\subsection{The Cargo Cult Critique}
The original notebook contains a provocative passage:
\blockquote{If one wants to write a textbook on the rules of sales conversations
but ends up with a software agent that enjoys conducting sales conversations,
one has done poor work at a very high level.}
This critique is not anti-AI. It is a warning against \textbf{category errors}:
using a tool designed for one purpose (statistical simulation) to address a
different problem (explanatory reconstruction). An LLM is an excellent simulator
but a poor explainer. ARS is an excellent explainer but a less scalable
simulator. Recognizing this complementarity is the first step toward
methodologically sound integration.
\section{Toward a Methodological Synthesis}
\subsection{Complementarity, Not Competition}
The analysis above suggests a division of labor:
\begin{itemize}
\item \textbf{Use LLMs for scaling}: Neural pattern recognition can propose
initial category assignments, identify candidate patterns, and process
large corpora.
\item \textbf{Use ARS for validation}: The symbolic grammar can check the
well-formedness of neural proposals, document interpretative decisions, and
provide explanations.
\item \textbf{Keep the human in the loop}: Final validation and
interpretation authority remains with the human researcher.
\end{itemize}
This is precisely the approach later formalized as \textbf{CGTI (Computational
Grounded Theory Integration)} and \textbf{AQSA (Adversarial Qualitative Sequence
Analysis)}.
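The division of labor described above can be made concrete with a small filter.
The sketch assumes that a neural component delivers candidate symbol sequences.
The candidates, the function names \texttt{first\_violation} and
\texttt{triage}, and the flat rule set are illustrative assumptions, not part
of the original ARS code. Well-formed candidates pass; all others are queued
for human review together with the first violated transition.
\begin{lstlisting}[language=Python, caption={Sketch: routing neural proposals through a symbolic check}]
# Transitions licensed by the induced grammar.
RULES = {
    ("KBG", "VBG"), ("VBG", "KBBd"), ("KBBd", "VBBd"), ("VBBd", "KBA"),
    ("KBA", "VBA"), ("VBA", "KBBd"), ("VBA", "KAE"), ("KAE", "VAE"),
    ("VAE", "KAE"), ("VAE", "KAA"), ("KAA", "VAA"), ("VAA", "KAV"),
    ("KAV", "VAV"),
}


def first_violation(sequence):
    """Return the first adjacent pair not licensed by a rule, or None."""
    for pair in zip(sequence, sequence[1:]):
        if pair not in RULES:
            return pair
    return None


def triage(candidates):
    """Split candidates into accepted ones and a human review queue."""
    accepted, review = [], []
    for seq in candidates:
        violation = first_violation(seq)
        if violation is None:
            accepted.append(seq)
        else:
            review.append((seq, violation))
    return accepted, review


if __name__ == "__main__":
    proposals = [  # hypothetical output of a neural model
        "KBG VBG KBBd VBBd KBA VBA KAE VAE KAA VAA KAV VAV".split(),
        "KBG VBG KBBd KBA VBA KAE VAE".split(),
    ]
    accepted, review = triage(proposals)
    print(len(accepted), "accepted; first violation:", review[0][1])
\end{lstlisting}
The review queue keeps interpretative authority with the researcher: the
grammar flags the deviation, but whether it is an error of the proposal or a
counterexample to the rule remains a human decision.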
\subsection{Lessons for Contemporary Neuro-Symbolic AI}
From the ARS experience, contemporary neuro-symbolic research can learn:
\begin{enumerate}
\item \textbf{Explainability by design}: Build symbolic components that are
interpretable from the ground up, not as post-hoc additions.
\item \textbf{Multiple formalisms}: Different tasks (induction, parsing,
generation) may require different formal languages. Scheme, Pascal, and
Lisp each served a distinct purpose.
\item \textbf{Methodological control before scaling}: A small,
well-understood corpus (8 transcripts) provides more methodological insight
than a large, opaque corpus.
\item \textbf{The human as System 1}: In some contexts, human pattern
recognition is superior to neural networks—not because it is faster, but
because it is interpretable and can be communicated.
\end{enumerate}
\section{Conclusion}
This paper has reconstructed three early implementations of the Algorithmic
Recursive Sequence Analysis (ARS)—an inductor in Scheme, a parser in Pascal,
and a transducer in Lisp—and contrasted them with a large language model
trained on the same corpus. I have argued that:
\begin{enumerate}
\item ARS constitutes a \textbf{proto-neuro-symbolic} methodology,
anticipating core concerns of contemporary neuro-symbolic AI by decades.
\item The three components (inductor, parser, transducer) address
complementary functions: symbol abstraction, structural validation, and
generative rule application.
\item Unlike LLMs, which simulate statistical distributions without
explanation, ARS produces \textbf{explicit, falsifiable, and intersubjectively
verifiable grammars}.
\item ARS satisfies the XAI criteria of meaningfulness, accuracy, and
knowledge limits in ways that pure neural models cannot.
\end{enumerate}
The historical record shows that the challenges of neuro-symbolic integration
were recognized and addressed long before the current wave of research. ARS
offers a methodological template that contemporary researchers would do well
to study—not as a historical artifact, but as a living approach to
\textbf{explainable, controlled, and verifiable} sequence analysis.
The question for neuro-symbolic AI is not whether to integrate pattern
recognition with rule-based reasoning. The question is how to do so without
sacrificing the methodological standards that make scientific knowledge
possible. ARS provides one answer.
\newpage
\begin{thebibliography}{99}
\bibitem[Garcez \& Lamb(2020)]{garcez2020neurosymbolic}
Garcez, A. d'Avila, \& Lamb, L. C. (2020). Neurosymbolic AI: The 3rd wave.
\textit{arXiv preprint arXiv:2012.05876}.
\bibitem[Hitzler \& Sarker(2022)]{hitzler2022neuro}
Hitzler, P., \& Sarker, M. K. (Eds.). (2022). \textit{Neuro-Symbolic Artificial
Intelligence: The State of the Art}. IOS Press.
\bibitem[Kautz(2020)]{kautz2020third}
Kautz, H. (2020). The third AI summer: AAAI Robert S. Engelmore Memorial Award
Lecture. \textit{AI Magazine}, 43(1), 93--104.
\bibitem[Koop(1992)]{koop1992parser}
Koop, P. (1992). \textit{Demo-Parser Chart-Parser Version 1.0}. Pascal source code.
\bibitem[Koop(1994)]{koop1994scheme}
Koop, P. (1994). \textit{Grammatikinduktion empirisch gesicherter
Verkaufsgespräche}. Scheme source code.
\bibitem[Koop(1994)]{koop1994lisp}
Koop, P. (1994). \textit{Sequenzanalyse empirisch gesicherter
Verkaufsgespräche}. Lisp source code.
\bibitem[Koop(2023)]{koop2023notebook}
Koop, P. (2023). \textit{Qualitative Sozialforschung und Große Sprachmodelle}.
Jupyter Notebook.
\bibitem[Oevermann et al.(1979)]{oevermann1979methodology}
Oevermann, U., Allert, T., Konau, E., \& Krambeck, J. (1979). The methodology
of objective hermeneutics. In H.-G. Soeffner (Ed.), \textit{Interpretative
Procedures in the Social and Text Sciences} (pp. 352--434). Metzler.
\bibitem[Ortigossa et al.(2024)]{ortigossa2024xai}
Ortigossa, E. S., Gonçalves, T., \& Nonato, L. G. (2024). Explainable Artificial
Intelligence (XAI)—From Theory to Methods and Applications. \textit{IEEE Access},
12, 80799--80846.
\bibitem[Trask(2020)]{trask2020neural}
Trask, A. W. (2020). \textit{Neural Networks and Deep Learning: A Simple
Introduction with Examples in Python}. dpunkt. [German translation]
\end{thebibliography}
\end{document}