% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\documentclass[
  12pt,
  a4paper,
  oneside,
  titlepage
]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{hyperref}
\usepackage{geometry}
\geometry{a4paper, left=3cm, right=3cm, top=3cm, bottom=3cm}
\usepackage{setspace}
\onehalfspacing
\usepackage{parskip}
\usepackage[english]{babel}
\usepackage{csquotes}
\usepackage{microtype}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{listings}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{float}
\usepackage{url}
\usepackage{natbib}
\usepackage{titling}

% Listing style for Python code
\lstset{
  language=Python,
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue},
  commentstyle=\color{green!40!black},
  stringstyle=\color{red},
  showstringspaces=false,
  numbers=left,
  numberstyle=\tiny,
  numbersep=5pt,
  breaklines=true,
  frame=single,
  backgroundcolor=\color{gray!5},
  tabsize=2,
  captionpos=b
}

% Title
\title{\Huge\textbf{Between Interpretation and Computation} \\
       \LARGE Algorithmic Recursive Sequence Analysis as a Bridge \\
       \LARGE between Qualitative Hermeneutics and Formal Modeling}
\author{
  \large
  \begin{tabular}{c}
    Paul Koop
  \end{tabular}
}
\date{\large June/July 1994 \& 2024}

\begin{document}

\maketitle

\begin{abstract}
Qualitative social research currently faces a methodological dilemma: On one hand, 
generative AI systems promise an unprecedented scaling of interpretive work steps; 
on the other hand, due to their stochastic nature, they elude the classical validation 
logic of qualitative research. This paper argues that this dilemma can be resolved by 
revisiting formal approaches that were already present in the tradition of text 
analysis but have been eclipsed by recent developments in generative AI. As a concrete 
solution, the paper develops \textbf{Algorithmic Recursive Sequence Analysis (ARS)}, 
a procedure that transforms interpretive processes into a formal grammar, making them 
transparent, reproducible, and intersubjectively verifiable. The connection to current 
discussions on \textbf{Explainable AI (XAI)} proves to be doubly fruitful: It provides 
a conceptual framework to reflect on the quality of qualitative interpretations and 
reminds us that explainability is not a luxury but a necessity—in technology as well 
as in science. The empirical application to eight transcripts of sales conversations 
demonstrates the effectiveness of the procedure.
\end{abstract}

\newpage
\tableofcontents
\newpage

\section{Introduction: The Paradox of Qualitative Research in the Age of Generative AI}

Qualitative social research currently faces a methodological dilemma. On one hand, 
generative AI systems promise an unprecedented scaling of interpretive work steps. 
On the other hand, due to their stochastic nature, these systems elude the classical 
validation logic of qualitative research. Where the latter traditionally relies on 
detailed disclosure of the coding process and intersubjective comprehensibility, 
there is now a blind reliance on the supposed \enquote{emergence} of neural networks.

This trend is problematic because it decouples computer-assisted text analysis from 
its methodological foundations. At the same time, however, it points to a deficit 
that affects qualitative research itself: It lacks a formalized vocabulary to make 
its interpretive processes accessible to algorithmic procedures. The result is a 
choice between two unsatisfactory options: either renouncing scaling or abandoning 
methodological control.

This paper argues that this dilemma can be resolved by revisiting formal 
approaches that were already present in the tradition of text analysis but have 
been eclipsed by recent developments in generative AI. As a concrete solution, the 
paper develops \textbf{Algorithmic Recursive Sequence Analysis (ARS)}, a procedure 
that transforms interpretive processes into a formal grammar, making them transparent, 
reproducible, and intersubjectively verifiable.

The distinctive contribution of this approach lies in its connection to current 
discussions on \textbf{Explainable Artificial Intelligence (XAI)}. XAI has developed as a response 
to the opacity of neural networks \citep{Samek2019, BarredoArrieta2020}. The central 
insight is: Those who cannot comprehend the decisions of complex AI systems cannot 
trust them—and should not use them in safety-critical areas \citep{Weller2019}. This 
insight, so the thesis of this paper, can be productively applied to qualitative 
research: It also needs procedures that make its interpretive processes explainable. 
ARS is conceived as such a procedure—as a contribution to an \textbf{explainable 
qualitative research} that preserves the methodological standards of the discipline 
while simultaneously opening up to algorithmic modeling.

The paper is structured as follows: Section 2 introduces the concept of Explainable 
AI and develops the analogy to qualitative research. Section 3 presents ARS in its 
methodological architecture. Section 4 documents the empirical application to eight 
transcripts of sales conversations. Section 5 reflects on the results in light of 
the XAI discussion. Section 6 draws a conclusion and outlines perspectives.

\section{Explainable AI: Concept, Development, and Methodological Relevance}

\subsection{Origins and Fundamental Ideas of XAI}

The development of Explainable Artificial Intelligence (XAI) is closely linked to the 
realization that the increasing performance of complex AI models comes with a loss of 
transparency. In particular, deep neural networks, which achieve impressive results 
in numerous application domains, operate as \enquote{black boxes}: Their internal 
decision processes are not directly comprehensible to developers or users 
\citep[p.~2]{Samek2019}.

This opacity becomes problematic when AI systems are used in safety-critical areas—in 
medical diagnostics, jurisprudence, finance, or autonomous control 
\citep[p.~80800]{Ortigossa2024}. Wrong decisions can have serious consequences here. 
At the same time, the opacity of the models makes it difficult to identify bias and 
discrimination. A frequently cited case is the COMPAS system for recidivism prediction, 
which systematically disadvantaged African American defendants without this bias being 
recognizable from the model architecture \citep[p.~84]{BarredoArrieta2020}.

XAI research responds to this problem by developing methods to subsequently explain 
the decisions of complex models or to design interpretable models from the outset 
\citep{Mersha2024}. The term \enquote{Explainable AI} itself originates from an 
initiative of the US research agency DARPA, which from 2015 onwards specifically 
funded projects on the explainability of AI systems \citep[p.~86]{BarredoArrieta2020}. 
Since then, XAI has developed into an independent research field addressing both 
technical and ethical as well as legal questions.

An important legal driver of the XAI discussion was the European General Data Protection 
Regulation (GDPR). In particular, Recital 71 is often interpreted in research as the 
basis of a \enquote{right to explanation}, even though the regulation does not formulate 
an explicit, enforceable right to complete algorithmic disclosure \citep{Wachter2017}. 
Nevertheless, the GDPR establishes binding requirements for transparency, 
comprehensibility, and information obligations in automated decisions, thereby 
reinforcing the normative pressure to develop explainable AI systems.

\subsection{Central Concepts and Taxonomies}

The XAI literature has developed a series of concepts and distinctions to structure 
the field. \textbf{Explainability} generally denotes the property of an AI system to 
present its decisions in a way that is understandable to humans 
\citep[p.~89]{BarredoArrieta2020}. \textbf{Interpretability} aims at enabling a 
human observer to comprehend the functioning of the system \citep[p.~25]{Weller2019}. 
\textbf{Transparency} means the disclosure of systemic processes and design decisions 
\citep[p.~27]{Weller2019}.

A fundamental taxonomic distinction concerns the timing of explainability: 
\textbf{Ante-hoc methods} (also \enquote{Explanation by Design}) integrate explainability 
into the model architecture from the beginning. They use models that are 
inherently interpretable by virtue of their structure—such as decision trees or rule-based 
systems. \textbf{Post-hoc methods}, on the other hand, apply explanation techniques 
to already trained black-box models. They attempt to retrospectively reconstruct 
which input factors were decisive for a particular decision 
\citep[p.~92]{BarredoArrieta2020}.

A second distinction concerns the scope of explanation: \textbf{Global explanations} 
target the overall behavior of the model—they answer the question of how the model 
fundamentally functions. \textbf{Local explanations}, on the other hand, refer to 
individual decisions—they explain why a specific input led to a specific output 
\citep[p.~80805]{Ortigossa2024}.

A third distinction concerns methodology: \textbf{Model-specific procedures} are only 
applicable to certain model architectures (e.g., neural networks). \textbf{Model-agnostic 
procedures}, on the other hand, can be used independently of the concrete model 
architecture \citep[p.~3]{Mersha2024}.

Among the best-known XAI procedures are:

\begin{itemize}
    \item \textbf{LIME (Local Interpretable Model-agnostic Explanations)}: A 
    model-agnostic procedure that learns simple, interpretable local surrogate models 
    to explain the decisions of complex black-box models 
    \citep[p.~102]{BarredoArrieta2020}.
    
    \item \textbf{SHAP (SHapley Additive exPlanations)}: A procedure based on 
    cooperative game theory that quantifies the contribution of each input feature 
    to a prediction \citep[p.~104]{BarredoArrieta2020}.
    
    \item \textbf{Saliency Maps}: Visualizations that show for image classifiers 
    which image regions were particularly relevant for a decision \citep{Zhou2019}.
    
    \item \textbf{Layer-wise Relevance Propagation (LRP)}: A procedure that 
    propagates the prediction of a neural network backwards layer by layer, thus 
    identifying relevant input regions \citep{Montavon2019}.
\end{itemize}
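The local-surrogate idea behind LIME can be illustrated with a minimal, self-contained sketch. The toy model, the helper names (\texttt{black\_box}, \texttt{local\_surrogate\_slope}), and the kernel width are illustrative assumptions, not the actual LIME implementation: perturb the input near the instance of interest, weight the samples by proximity, and read the explanation off a simple linear surrogate.

```python
import math
import random

def black_box(x):
    # Hypothetical opaque model (stands in for a neural network).
    return x ** 3 - 2 * x

def local_surrogate_slope(f, x0, width=0.1, n=500, seed=42):
    """Fit a proximity-weighted linear surrogate around x0 (LIME-style sketch).

    Returns the slope of the local linear approximation, which serves
    as the local explanation of the black-box decision at x0.
    """
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0.0, width) for _ in range(n)]
    ys = [f(x) for x in xs]
    # Proximity kernel: samples closer to x0 count more.
    ws = [math.exp(-((x - x0) ** 2) / width ** 2) for x in xs]
    wsum = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / wsum
    ybar = sum(w * y for w, y in zip(ws, ys)) / wsum
    cov = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    return cov / var

# The local slope approximates the derivative of the black box at x0,
# e.g. f'(1) = 3 - 2 = 1 for the toy model above.
```

The surrogate is only valid near \texttt{x0}; globally, the cubic toy model is of course not linear, which is precisely the local/global distinction drawn above.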

\subsection{XAI as a Methodological Challenge}

The XAI discussion is not limited to technical procedures. It touches on fundamental 
methodological questions: What does it mean to \enquote{explain} a decision? Who is 
the addressee of the explanation? What quality criteria apply to explanations?

NIST (National Institute of Standards and Technology) has formulated three fundamental 
properties of good explanations \citep[p.~80810]{Ortigossa2024}:

\begin{enumerate}
    \item \textbf{Meaningfulness}: Explanations must be understandable to the intended 
    addressee. This requires adaptation to their prior knowledge and cognitive abilities.
    
    \item \textbf{Accuracy}: Explanations must correctly represent the actual decision 
    processes of the model. There is a potential conflict of goals with meaningfulness: 
    An accurate but highly complex explanation may be incomprehensible; a comprehensible 
    but inaccurate explanation may be misleading.
    
    \item \textbf{Knowledge Limits}: Good explanations make clear under which conditions 
    the model works reliably and where its limits lie.
\end{enumerate}

These criteria are relevant not only for technical systems. They can, as this paper 
argues, be transferred to qualitative research. Qualitative interpretations must also 
be understandable (for the scientific community), accurate (in the sense of fidelity 
to the text), and state their limits (e.g., regarding the scope of interpretation). 
The XAI discussion thus provides a conceptual framework to reflect on the quality of 
qualitative interpretations—and to develop procedures that ensure this quality.

\subsection{From XAI to Explainable Qualitative Research: An Analogy}

The transfer of the XAI perspective to qualitative research is based on an analogy 
systematized in Table~\ref{tab:analogy}:

\begin{table}[h]
\centering
\caption{Analogy between Technical XAI and Qualitative Research}
\label{tab:analogy}
\begin{tabular}{@{} p{2.5cm} p{5cm} p{5cm} @{}}
\toprule
\textbf{Dimension} & \textbf{Technical XAI} & \textbf{Qualitative Research} \\
\midrule
Problem & Opaque decisions of neural networks & Opaque interpretation processes \\
Cause & Subsymbolic representations & Implicit rule knowledge \\
Consequence & Lack of trust, undetected bias & Lack of intersubjectivity \\
Solution & Explication of decision bases & Explication of interpretation rules \\
Methods & LIME, SHAP, Saliency Maps & ARS, explicit category formation \\
Criteria & Meaningfulness, Accuracy, Knowledge Limits & Comprehensibility, Text fidelity, Scope \\
\bottomrule
\end{tabular}
\end{table}

The point of this analogy lies in the reversal of perspective: While XAI asks how to 
explain the decisions of \textit{technical} systems, explainable qualitative research 
asks how to make the interpretation processes of \textit{human} researchers explainable. 
In both cases, it is about transforming implicit, opaque operations into explicit, 
comprehensible rules.

Algorithmic Recursive Sequence Analysis, presented in what follows, is conceived 
as a procedure that accomplishes this transformation. It formalizes 
interpretation processes without automating them. It produces explicit, verifiable 
models without eliminating hermeneutic openness. And it thus creates the prerequisites 
for a qualitatively substantial but methodologically controlled use of algorithmic 
procedures.

\section{Algorithmic Recursive Sequence Analysis: Methodological Architecture}

\subsection{Basic Operations: From Transcription to Terminal Symbol String}

ARS operates on transcripts of natural interactions. The first step consists of a 
detailed sequential analysis following the logic of qualitative interpretation. 
Qualitative sequence analysis, as developed in objective hermeneutics 
\citep{Oevermann1979} and conversation analysis \citep{Sacks1974}, aims to uncover 
the latent meaning structure of interactions through the systematic reconstruction 
of their sequential order. Each speech act is analyzed with regard to its sequential 
function and its intentional quality.

The analysis follows the principle of \textbf{interpretation production and falsification} 
\citep[p.~392]{Oevermann1979}: For each sequential step, alternative interpretation 
possibilities are generated and systematically tested against the further course of 
the interaction. This procedure of \enquote{controlled interpretation} 
\citep[p.~158]{Flick2019} ensures intersubjective comprehensibility and forces the 
explication of interpretation rules.

The result of this interpretive work is a \textbf{terminal symbol string}, in which 
each speech act is represented by a symbol from a previously developed category system. 
These terminal symbols function as a formalized equivalent of qualitative coding 
\citep[p.~207]{Przyborski2021}. The following table illustrates this using an example 
from a transcript:

\begin{table}[h]
\centering
\caption{Example of Terminal Symbol Assignment}
\label{tab:terminal}
\begin{tabular}{@{} p{6cm} c p{4cm} @{}}
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} & \textbf{Interpretation} \\
\midrule
Customer: Good day & KBG & Customer greeting (initiation of interaction) \\
Salesperson: Good day & VBG & Salesperson greeting (reciprocal confirmation) \\
Customer: One portion of coarse liver sausage, please. & KBBd & Customer need (articulation of purchase desire) \\
\bottomrule
\end{tabular}
\end{table}
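The coding step shown in Table~\ref{tab:terminal} amounts to pairing each interpreted speech act with its category symbol. A minimal sketch (the pairs below are taken from the table; the helper name \texttt{terminal\_string} is hypothetical):

```python
# Illustrative coding step: each interpreted speech act is paired with its
# terminal symbol from the category system (symbols as in Table 2).
coded_sequence = [
    ("Customer: Good day", "KBG"),
    ("Salesperson: Good day", "VBG"),
    ("Customer: One portion of coarse liver sausage, please.", "KBBd"),
]

def terminal_string(coded):
    """Collapse a coded transcript into its terminal symbol string."""
    return ", ".join(symbol for _, symbol in coded)

print(terminal_string(coded_sequence))  # -> KBG, VBG, KBBd
```

The symbol assignment itself remains an interpretive decision; the code only records its result in a machine-readable form.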

\subsection{Grammar Induction: From Individual Cases to Generative Models}

Based on the terminal symbol strings, an individual grammar is induced for each 
transcript. This grammar specifies which sequence patterns are observable in the 
respective transcript and which transitions between terminal symbols are possible. 
Formally, it is a transition-based grammar operating at the level of terminal symbols, 
whose production rules are based on observed transition frequencies.

Unlike classical probabilistic context-free grammars (PCFGs) in linguistics \citep{Manning1999}, ARS dispenses with explicit 
non-terminals and deep recursive derivations. Instead, the grammar models sequential 
regularities as probabilistic transitions between formalized speech act categories. 
The term grammar is used here in a methodological, not a strictly formal-linguistic 
sense: as an explicit, generative rule system for reconstructing observable sequence 
structures.

Induction is performed by simply counting observed transitions:

\begin{lstlisting}[caption=Counting Transitions between Terminal Symbols]
# Count how often each terminal symbol is followed by each other
# symbol across all empirical symbol strings.
transitions = {}
for chain in empirical_chains:
    for i in range(len(chain) - 1):
        start, end = chain[i], chain[i + 1]
        # Create the nested counters on first occurrence.
        if start not in transitions:
            transitions[start] = {}
        if end not in transitions[start]:
            transitions[start][end] = 0
        transitions[start][end] += 1
\end{lstlisting}

\subsection{Unification and Optimization}

The individual grammars are merged into a \textbf{unified grammar} covering the 
sequence structure of all transcripts. This is subjected to an iterative adjustment 
process that gradually increases the agreement of the transition probabilities with 
the empirically observed distribution structure. The procedure follows a heuristic 
scheme: It generates artificial strings, compares their frequency distribution with 
the empirical data, and iteratively adjusts the transition probabilities.

The definition of a start symbol represents a model-theoretic simplification. It 
serves to generate syntactically consistent sequences and does not claim to fully 
capture the empirical diversity of real conversation openings.
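The unification step presupposes two operations that the counting listing above leaves implicit: normalizing counts into transition probabilities and generating artificial strings from them. The following is a minimal sketch with illustrative toy counts; \texttt{normalize} and \texttt{generate} are hypothetical helper names, and the author's full program is given in Appendix B.

```python
import random

def normalize(transitions):
    """Turn raw transition counts into relative-frequency probabilities."""
    probs = {}
    for start, ends in transitions.items():
        total = sum(ends.values())
        probs[start] = {end: count / total for end, count in ends.items()}
    return probs

def generate(probs, start, max_len=20, seed=0):
    """Generate one artificial terminal symbol string from the grammar.

    Generation stops at a symbol without outgoing transitions or at max_len.
    """
    rng = random.Random(seed)
    chain = [start]
    while chain[-1] in probs and len(chain) < max_len:
        ends = probs[chain[-1]]
        symbols = list(ends)
        weights = [ends[s] for s in symbols]
        chain.append(rng.choices(symbols, weights=weights)[0])
    return chain

# Toy counts (illustrative, not the study's data):
probs = normalize({"KBG": {"VBG": 2, "VBBd": 1}, "VBG": {"KBBd": 3}})
```

Comparing the frequency distribution of such generated strings with the empirical data is what drives the iterative adjustment described above.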

\section{Empirical Application: Eight Transcripts of Sales Conversations}

\subsection{Hypothetical Initial Grammar}

Based on the literature on sales conversations, the following hypothetical grammar 
was derived: A sales conversation (VKG) consists of greeting (BG), sales part (VT), 
and farewell (AV). The terminal symbols include KBG, VBG, KBBd, VBBd, KBA, VBA, KAE, 
VAE, KAA, VAA, KAV, VAV.

\subsection{The Eight Transcripts}

The complete transcripts can be found in Appendix A. They document interactions at 
various sales stands at Aachen market square in June/July 1994.

\subsection{Terminal Symbol Strings}

Since sales conversations can empirically begin with different speech acts, a uniform 
start symbol was defined for the generation of artificial sequences. This decision 
serves exclusively model consistency and does not affect the transition structure of 
the grammar.

The terminal symbol strings formed from the transcripts are fully documented in 
Appendix A.

\subsection{Python Implementation}

The complete Python program for grammar induction and optimization can be found in 
Appendix B. It implements the steps described in Section 3 and visualizes the 
optimization process.

\subsection{Results of Iterative Adjustment}

The optimized grammar exhibits the following structure:

\begin{table}[h]
\centering
\caption{Optimized Transition Probabilities}
\label{tab:results}
\begin{tabular}{@{} l l @{}}
\toprule
\textbf{Start Symbol} & \textbf{Following Symbols with Probabilities} \\
\midrule
KBG & VBG (0.67), VBBd (0.33) \\
VBG & KBBd (1.0) \\
KBBd & VBBd (0.67), VAA (0.17), VBA (0.17) \\
VBBd & KBA (0.44), VAA (0.22), KBBd (0.22), KAA (0.11) \\
KBA & VBA (0.5), VAA (0.5) \\
VBA & KBBd (0.5), KAE (0.25), VAA (0.25) \\
VAA & KAA (0.86), KAV (0.14) \\
KAA & VAV (0.75), VBG (0.25) \\
VAV & KAV (1.0) \\
KAE & VAE (1.0) \\
VAE & KAA (1.0) \\
KAV & KBBd (1.0) \\
\bottomrule
\end{tabular}
\end{table}

In the validation phase, in which a larger number of artificial sequences ($n = 100$) 
was generated on the basis of the optimized transition structure, empirical and 
generated frequencies agree almost perfectly ($r = 0.9999$; $p < 0.001$).

This high agreement is not to be understood as predictive performance or proof of 
generalization. Rather, it documents the structural reproducibility of the empirically 
observed transition patterns using the same grammar with an enlarged sample. At the 
same time, it must be methodologically reflected that the Pearson correlation 
coefficient for frequency vectors with constant sum (1.0) tends to yield high values. 
The correlation observed here therefore primarily confirms the internal consistency 
of the procedure, less an external validity in the sense of predictive power 
\citep[p.~489]{Flick2019}.

During the iterative optimization phase, the correlation remains stable at 
$r \approx 0.92$, which already indicates a high structural fit of the induced grammar. The 
further increase in correlation during validation is due to the larger sample of 
generated sequences with an unchanged transition structure.
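The reported coefficient can be computed directly as the Pearson correlation between the empirical and the generated frequency vectors. A self-contained sketch with illustrative numbers (not the study's actual frequencies):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative frequency vectors (NOT the study's data):
empirical = [0.30, 0.25, 0.20, 0.15, 0.10]
generated = [0.31, 0.24, 0.21, 0.14, 0.10]
```

As noted above, such vectors sum to a constant, which inflates the coefficient; the sketch therefore illustrates the computation, not an external validation.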

\section{Discussion: ARS as a Contribution to Explainable Qualitative Research}

\subsection{ARS and the XAI Criteria}

ARS fulfills the three NIST criteria for good explanations in a form adapted to 
qualitative research:

\textbf{Meaningfulness} is ensured through explicit category formation. The terminal 
symbols are semantically meaningful (KBG = customer greeting) and remain tied to the 
interpretive exploration. Any third researcher can trace which assignments were 
made. This corresponds to the principle of \enquote{communicative validation} central 
to qualitative research \citep[p.~328]{Flick2019}.

\textbf{Accuracy} is operationalized here in the sense of structural fit, not in the 
sense of predictive validity. The high agreement between empirical and generated 
frequencies shows that the grammar precisely reproduces the observed distribution 
structure of the data. In the terminology of qualitative research, one could speak 
of \enquote{appropriateness to the subject matter} \citep[p.~34]{Przyborski2021}.

\textbf{Knowledge Limits} are marked by documenting the production and falsification 
of interpretations. The grammar does not claim to capture the \enquote{true} structure 
of the interaction but reconstructs observable regularities based on interpretive 
decisions. It thus makes its own contingency visible—a methodological virtue discussed 
in qualitative research under the keyword \enquote{reflexivity} \citep[p.~129]{Flick2019}.

\subsection{Ante-hoc vs. Post-hoc: ARS as Explanation by Design}

In XAI terminology, ARS is to be classified as an \textbf{ante-hoc procedure} 
(Explanation by Design). It does not construct the grammar as a subsequent explanation 
of an already existing model but integrates explainability into the modeling process 
from the beginning. The terminal symbols are not black boxes but explicate the 
interpretive decisions. The transition probabilities are not opaque weights but 
simple relative frequencies.

This fundamentally distinguishes ARS from post-hoc procedures that attempt to 
subsequently explain the decisions of neural networks. While these procedures can 
only provide approximate insights into a principally opaque architecture, ARS is 
designed to be transparent from the ground up.

\subsection{Limits of the Analogy}

The analogy between XAI and qualitative research has limits that must be reflected 
upon. \textbf{First}, XAI primarily aims at explaining \textit{technical} systems, 
while qualitative research is about the explication of \textit{human} interpretation 
processes. The causality is different: In XAI, we explain why an algorithm made a 
particular decision; in ARS, we explain how researchers arrived at a particular 
interpretation.

\textbf{Second}, XAI operates with a different concept of truth. The explanations 
are supposed to correctly represent the actual decision processes of the model. In 
ARS, on the other hand, there are no \enquote{actual} processes that exist 
independently of interpretation. The grammar is not a discovery but a construction—one 
that must, however, prove itself against empirical evidence \citep[p.~80]{Flick2019}.

\textbf{Third}, the addressee is different. XAI explanations are directed at users, 
developers, or regulatory authorities. ARS explanations are directed at the 
scientific community of qualitative research. The criteria for meaningfulness must 
therefore be adapted to their specific discourse practice.

\subsection{Methodological Implications}

Despite these limits, the XAI perspective opens up productive questions for 
qualitative research: How can we explicate our interpretation processes so that 
they become comprehensible to others? What formats of explication are suitable? 
How can we not only claim but demonstrate the quality of our interpretations?

ARS provides a concrete answer to these questions. It formalizes interpretation 
processes without automating them. It makes interpretive decisions explicit without 
eliminating hermeneutic openness. It thus creates the prerequisites for a 
methodologically reflected use of algorithmic procedures in qualitative research.

\section{Conclusion and Outlook}

Qualitative social research faces the challenge of using the possibilities of 
algorithmic text analysis without sacrificing its methodological standards. 
Algorithmic Recursive Sequence Analysis offers a way to productively address this 
challenge. It formalizes interpretation processes without automating them. It 
produces explicit, verifiable models without eliminating hermeneutic openness.

The connection to the XAI discussion proves doubly fruitful: It provides a conceptual 
framework to reflect on the quality of qualitative interpretations. And it reminds 
us that explainability is not a luxury but a necessity—in technology as well as in 
science.

Further research could develop ARS in several directions: through the integration 
of additional formal modeling methods (Petri nets, Bayesian networks), through more 
systematic connection with computational linguistics methods, or through application 
to other types of interaction. What remains crucial is always methodological control: 
The formal procedures must respect the interpretive character of the analysis and 
must not lead to its automation.

\newpage
\begin{thebibliography}{99}

\bibitem[Barredo Arrieta et al.(2020)]{BarredoArrieta2020}
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., 
Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., 
\& Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, 
taxonomies, opportunities and challenges toward responsible AI. 
\textit{Information Fusion}, 58, 82--115.

\bibitem[Flick(2019)]{Flick2019}
Flick, U. (2019). \textit{An Introduction to Qualitative Research} (7th ed.). 
Sage Publications.

\bibitem[Manning \& Schütze(1999)]{Manning1999}
Manning, C. D., \& Schütze, H. (1999). \textit{Foundations of Statistical Natural 
Language Processing}. MIT Press.

\bibitem[Mersha et al.(2024)]{Mersha2024}
Mersha, M., et al. (2024). Explainable Artificial Intelligence: A Survey of Needs, 
Techniques, Applications, and Future Direction. \textit{Neurocomputing}, 599, 128111.

\bibitem[Montavon et al.(2019)]{Montavon2019}
Montavon, G., Binder, A., Lapuschkin, S., Samek, W., \& Müller, K.-R. (2019). 
Layer-Wise Relevance Propagation: An Overview. In W. Samek, G. Montavon, 
A. Vedaldi, L. K. Hansen, \& K.-R. Müller (Eds.), \textit{Explainable AI: 
Interpreting, Explaining and Visualizing Deep Learning} (pp. 193--210). Springer.

\bibitem[Oevermann et al.(1979)]{Oevermann1979}
Oevermann, U., Allert, T., Konau, E., \& Krambeck, J. (1979). The methodology of 
objective hermeneutics and its general research-logical significance in the social 
sciences. In H.-G. Soeffner (Ed.), \textit{Interpretive Procedures in the Social 
and Text Sciences} (pp. 352--434). Metzler.

\bibitem[Ortigossa et al.(2024)]{Ortigossa2024}
Ortigossa, E. S., Gonçalves, T., \& Nonato, L. G. (2024). EXplainable Artificial 
Intelligence (XAI)—From Theory to Methods and Applications. \textit{IEEE Access}, 
12, 80799--80846.

\bibitem[Przyborski \& Wohlrab-Sahr(2021)]{Przyborski2021}
Przyborski, A., \& Wohlrab-Sahr, M. (2021). \textit{Qualitative Social Research: 
A Workbook} (5th ed.). De Gruyter Oldenbourg. [German original: \textit{Qualitative 
Sozialforschung: Ein Arbeitsbuch}]

\bibitem[Sacks et al.(1974)]{Sacks1974}
Sacks, H., Schegloff, E. A., \& Jefferson, G. (1974). A simplest systematics for 
the organization of turn-taking for conversation. \textit{Language}, 50(4), 696--735.

\bibitem[Samek \& Müller(2019)]{Samek2019}
Samek, W., \& Müller, K.-R. (2019). Towards Explainable Artificial Intelligence. 
In W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, \& K.-R. Müller (Eds.), 
\textit{Explainable AI: Interpreting, Explaining and Visualizing Deep Learning} 
(pp. 1--10). Springer.

\bibitem[Wachter et al.(2017)]{Wachter2017}
Wachter, S., Mittelstadt, B., \& Floridi, L. (2017). Why a right to explanation 
of automated decision-making does not exist in the general data protection 
regulation. \textit{International Data Privacy Law}, 7(2), 76--99.

\bibitem[Weller(2019)]{Weller2019}
Weller, A. (2019). Transparency: Motivations and Challenges. In W. Samek, 
G. Montavon, A. Vedaldi, L. K. Hansen, \& K.-R. Müller (Eds.), 
\textit{Explainable AI: Interpreting, Explaining and Visualizing Deep Learning} 
(pp. 23--40). Springer.

\bibitem[Zhou et al.(2019)]{Zhou2019}
Zhou, B., Bau, D., Oliva, A., \& Torralba, A. (2019). Comparing the Interpretability 
of Deep Networks via Network Dissection. In W. Samek, G. Montavon, A. Vedaldi, 
L. K. Hansen, \& K.-R. Müller (Eds.), \textit{Explainable AI: Interpreting, 
Explaining and Visualizing Deep Learning} (pp. 239--252). Springer.

\end{thebibliography}

\newpage
\appendix
\section{The Eight Transcripts with Terminal Symbols}

\subsection{Transcript 1 - Butcher Shop}
\textbf{Date:} June 28, 1994, \textbf{Location:} Butcher Shop, Aachen, 11:00 AM

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 1 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer: Good day & KBG \\
Salesperson: Good day & VBG \\
Customer: One portion of coarse liver sausage, please. & KBBd \\
Salesperson: How much would you like? & VBBd \\
Customer: Two hundred grams. & KBA \\
Salesperson: Two hundred grams. Anything else? & VBA \\
Customer: Yes, then a piece of Black Forest ham. & KBBd \\
Salesperson: How large should the piece be? & VBBd \\
Customer: Around three hundred grams. & KBA \\
Salesperson: That will be eight marks twenty. & VAA \\
Customer: Here you are. & KAA \\
Salesperson: Thank you and have a nice day! & VAV \\
Customer: Thanks, you too! & KAV \\
\end{longtable}

\textbf{Terminal Symbol String 1:} KBG, VBG, KBBd, VBBd, KBA, VBA, KBBd, VBBd, KBA, VAA, KAA, VAV, KAV

\subsection{Transcript 2 - Marketplace (Cherries)}
\textbf{Date:} June 28, 1994, \textbf{Location:} Marketplace, Aachen

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 2 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Salesperson: Everyone can try cherries here, everyone can try cherries here! & VBG \\
Customer 1: Half a kilo of cherries, please. & KBBd \\
Salesperson: Half a kilo? Or a kilo? & VBBd \\
Salesperson: Three marks, please. & VAA \\
Customer 1: Thank you! & KAA \\
Salesperson: Everyone can try cherries here! & VBG \\
Customer 2: Half a kilo, please. & KBBd \\
Salesperson: Three marks, please. & VAA \\
Customer 2: Thank you! & KAA \\
\end{longtable}

\textbf{Terminal Symbol String 2:} VBG, KBBd, VBBd, VAA, KAA, VBG, KBBd, VAA, KAA

\subsection{Transcript 3 - Fish Stand}
\textbf{Date:} June 28, 1994, \textbf{Location:} Fish Stand, Marketplace, Aachen

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 3 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer: One pound of saithe, please. & KBBd \\
Salesperson: Saithe, alright. & VBBd \\
Salesperson: Four marks nineteen, please. & VAA \\
Customer: Thank you! & KAA \\
\end{longtable}

\textbf{Terminal Symbol String 3:} KBBd, VBBd, VAA, KAA

\subsection{Transcript 4 - Vegetable Stand (Detailed)}
\textbf{Date:} June 28, 1994, \textbf{Location:} Vegetable Stand, Marketplace, Aachen, 11:00 AM

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 4 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer: Listen, I'll take some mushrooms. & KBBd \\
Salesperson: Brown or white? & VBBd \\
Customer: Let's take the white ones. & KBA \\
Salesperson: They're both fresh, don't worry. & VBA \\
Customer: What about chanterelles? & KBBd \\
Salesperson: Ah, they're great! & VBA \\
Customer: Can I put them in rice salad? & KAE \\
Salesperson: Better sauté them briefly in a pan. & VAE \\
Customer: Okay, I'll do that. & KAA \\
Salesperson: Have a nice day! & VAV \\
Customer: You too! & KAV \\
\end{longtable}

\textbf{Terminal Symbol String 4:} KBBd, VBBd, KBA, VBA, KBBd, VBA, KAE, VAE, KAA, VAV, KAV

\subsection{Transcript 5 - Vegetable Stand (with KAV at beginning)}
\textbf{Date:} June 26, 1994, \textbf{Location:} Vegetable Stand, Marketplace, Aachen, 11:00 AM

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 5 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer 1: Goodbye! & KAV \\
Customer 2: I'd like a kilo of Granny Smith apples here. & KBBd \\
Salesperson: Anything else? & VBBd \\
Customer 2: Yes, another kilo of onions. & KBBd \\
Salesperson: Six marks twenty-five, please. & VAA \\
Customer 2: Goodbye! & KAV \\
\end{longtable}

\textbf{Terminal Symbol String 5:} KAV, KBBd, VBBd, KBBd, VAA, KAV

\subsection{Transcript 6 - Cheese Stand}
\textbf{Date:} June 28, 1994, \textbf{Location:} Cheese Stand, Marketplace, Aachen

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 6 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer 1: Good morning! & KBG \\
Salesperson: Good morning! & VBG \\
Customer 1: I'd like five hundred grams of Dutch Gouda. & KBBd \\
Salesperson: As a piece? & VBBd \\
Customer 1: Yes, as a piece, please. & KAA \\
\end{longtable}

\textbf{Terminal Symbol String 6:} KBG, VBG, KBBd, VBBd, KAA

\subsection{Transcript 7 - Candy Stand}
\textbf{Date:} June 28, 1994, \textbf{Location:} Candy Stand, Marketplace, Aachen, 11:30 AM

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 7 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer: I'd like one hundred grams of the mixed ones. & KBBd \\
Salesperson: For home or to take away? & VBBd \\
Customer: To take away, please. & KBA \\
Salesperson: Fifty pfennigs, please. & VAA \\
Customer: Thanks! & KAA \\
\end{longtable}

\textbf{Terminal Symbol String 7:} KBBd, VBBd, KBA, VAA, KAA

\subsection{Transcript 8 - Bakery}
\textbf{Date:} July 9, 1994, \textbf{Location:} Bakery, Aachen, 12:00 PM

\begin{longtable}{@{} p{8cm} c @{}}
\caption{Transcript 8 - Terminal Symbols}\\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endfirsthead
\multicolumn{2}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\toprule
\textbf{Transcript Excerpt} & \textbf{Terminal Symbol} \\
\midrule
\endhead
\midrule \multicolumn{2}{r}{\textit{Continued on next page}} \\
\endfoot
\bottomrule
\endlastfoot
Customer: Good day! & KBG \\
Salesperson: One portion of our best coffee, freshly ground, please. & VBBd \\
Customer: Yes, also two pieces of fruit salad and a small cup of cream. & KBBd \\
Salesperson: Alright! & VBA \\
Salesperson: That will be fourteen marks and nineteen pfennigs, please. & VAA \\
Customer: I'll pay in small change. & KAA \\
Salesperson: Thank you very much, have a nice Sunday! & VAV \\
Customer: Thanks, you too! & KAV \\
\end{longtable}

\textbf{Terminal Symbol String 8:} KBG, VBBd, KBBd, VBA, VAA, KAA, VAV, KAV

\newpage
\section{Complete Python Implementation}

\begin{lstlisting}[caption=Algorithmic Recursive Sequence Analysis 2.0 - Complete Code]
"""
Algorithmic Recursive Sequence Analysis 2.0
Grammar Induction from Eight Transcripts
Optimization through Iterative Comparison of Empirical and Generated Strings
"""

import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
from tabulate import tabulate

# ============================================================================
# 1. EMPIRICAL DATA: Terminal symbol strings from eight transcripts
# ============================================================================

empirical_chains = [
    # Transcript 1: Butcher Shop
    ['KBG', 'VBG', 'KBBd', 'VBBd', 'KBA', 'VBA', 'KBBd', 'VBBd', 'KBA', 'VAA', 'KAA', 'VAV', 'KAV'],
    # Transcript 2: Marketplace (Cherries)
    ['VBG', 'KBBd', 'VBBd', 'VAA', 'KAA', 'VBG', 'KBBd', 'VAA', 'KAA'],
    # Transcript 3: Fish Stand
    ['KBBd', 'VBBd', 'VAA', 'KAA'],
    # Transcript 4: Vegetable Stand (detailed)
    ['KBBd', 'VBBd', 'KBA', 'VBA', 'KBBd', 'VBA', 'KAE', 'VAE', 'KAA', 'VAV', 'KAV'],
    # Transcript 5: Vegetable Stand (with KAV at beginning)
    ['KAV', 'KBBd', 'VBBd', 'KBBd', 'VAA', 'KAV'],
    # Transcript 6: Cheese Stand
    ['KBG', 'VBG', 'KBBd', 'VBBd', 'KAA'],
    # Transcript 7: Candy Stand
    ['KBBd', 'VBBd', 'KBA', 'VAA', 'KAA'],
    # Transcript 8: Bakery
    ['KBG', 'VBBd', 'KBBd', 'VBA', 'VAA', 'KAA', 'VAV', 'KAV']
]

# ============================================================================
# 2. TRANSITION COUNTING AND INITIAL PROBABILITIES
# ============================================================================

def count_transitions(chains):
    """Counts transitions between terminal symbols in all chains"""
    transitions = {}
    for chain in chains:
        for i in range(len(chain) - 1):
            start, end = chain[i], chain[i + 1]
            if start not in transitions:
                transitions[start] = {}
            if end not in transitions[start]:
                transitions[start][end] = 0
            transitions[start][end] += 1
    return transitions

def calculate_probabilities(transitions):
    """Normalizes transition counts to probabilities"""
    probabilities = {}
    for start in transitions:
        total = sum(transitions[start].values())
        probabilities[start] = {end: count / total 
                               for end, count in transitions[start].items()}
    return probabilities

# Initial calculations
initial_transitions = count_transitions(empirical_chains)
initial_probabilities = calculate_probabilities(initial_transitions)

print("=" * 70)
print("ALGORITHMIC RECURSIVE SEQUENCE ANALYSIS 2.0")
print("=" * 70)
print("\n1. INITIAL TRANSITION PROBABILITIES (FROM EMPIRICAL DATA)")
print("-" * 70)

for start in sorted(initial_probabilities.keys()):
    transitions_str = ", ".join([f"{end}: {prob:.3f}" 
                                 for end, prob in initial_probabilities[start].items()])
    print(f"{start} -> {transitions_str}")

# ============================================================================
# 3. TERMINAL SYMBOLS AND START SYMBOL
# ============================================================================

terminal_symbols = sorted(list(set([item for sublist in empirical_chains 
                                     for item in sublist])))
start_symbol = empirical_chains[0][0]  # KBG as start (can be adjusted)

print(f"\nTerminal symbols ({len(terminal_symbols)}): {terminal_symbols}")
print(f"Start symbol: {start_symbol}")

# ============================================================================
# 4. GENERATION OF ARTIFICIAL CHAINS
# ============================================================================

def generate_chain(probabilities, start_symbol, max_length=20):
    """Generates a chain based on transition probabilities"""
    chain = [start_symbol]
    current = start_symbol
    
    for _ in range(max_length - 1):
        if current not in probabilities:
            break
        
        next_symbols = list(probabilities[current].keys())
        probs = list(probabilities[current].values())
        
        # If no following symbols exist, break
        if not next_symbols:
            break
            
        next_symbol = np.random.choice(next_symbols, p=probs)
        chain.append(next_symbol)
        current = next_symbol
        
        # Stop if we land at a terminal without further transitions
        if current not in probabilities:
            break
    
    return chain

def generate_multiple_chains(probabilities, start_symbol, n_chains=8, max_length=20):
    """Generates multiple chains"""
    return [generate_chain(probabilities, start_symbol, max_length) 
            for _ in range(n_chains)]

# ============================================================================
# 5. FREQUENCY ANALYSIS
# ============================================================================

def compute_frequencies(chains, terminals):
    """Computes relative frequencies of terminal symbols in chains"""
    frequency_array = np.zeros(len(terminals))
    terminal_index = {term: i for i, term in enumerate(terminals)}
    
    for chain in chains:
        for symbol in chain:
            if symbol in terminal_index:
                frequency_array[terminal_index[symbol]] += 1
    
    total = frequency_array.sum()
    if total > 0:
        frequency_array /= total  # Normalization
    
    return frequency_array

# Empirical frequencies as reference
empirical_frequencies = compute_frequencies(empirical_chains, terminal_symbols)

print("\n2. EMPIRICAL RELATIVE FREQUENCIES")
print("-" * 70)
for i, symbol in enumerate(terminal_symbols):
    print(f"{symbol}: {empirical_frequencies[i]:.4f}")

# ============================================================================
# 6. ITERATIVE GRAMMAR OPTIMIZATION
# ============================================================================

def optimize_grammar(empirical_chains, terminal_symbols, start_symbol,
                     max_iterations=1000, tolerance=0.01, target_correlation=0.9):
    """
    Optimizes the grammar through iterative comparison with generated chains.
    """
    
    # Initial probabilities from empirical data
    transitions = count_transitions(empirical_chains)
    probabilities = calculate_probabilities(transitions)
    
    # Empirical frequencies as target
    empirical_freqs = compute_frequencies(empirical_chains, terminal_symbols)
    
    best_correlation = 0
    best_significance = 1
    best_probabilities = None
    history = []
    
    print("\n3. ITERATIVE OPTIMIZATION")
    print("-" * 70)
    
    for iteration in range(max_iterations):
        # Generate 8 artificial chains
        generated_chains = generate_multiple_chains(probabilities, start_symbol, n_chains=8)
        
        # Compute frequencies of generated chains
        generated_freqs = compute_frequencies(generated_chains, terminal_symbols)
        
        # Correlation analysis
        correlation, p_value = pearsonr(empirical_freqs, generated_freqs)
        history.append((iteration, correlation, p_value))
        
        # Progress display every 50 iterations
        if iteration % 50 == 0:
            print(f"Iteration {iteration:4d}: Correlation = {correlation:.4f}, p = {p_value:.4f}")
        
        # Check termination criterion
        if correlation >= target_correlation and p_value < 0.05:
            best_correlation = correlation
            best_significance = p_value
            best_probabilities = {start: probs.copy() 
                                 for start, probs in probabilities.items()}
            print(f"\nOptimum reached at iteration {iteration}:")
            print(f"  Correlation = {correlation:.4f}")
            print(f"  Significance = {p_value:.4f}")
            break
        
        # Adjust probabilities
        for start in probabilities:
            for end in probabilities[start]:
                # Error calculation
                empirical_prob = empirical_freqs[terminal_symbols.index(end)]
                generated_prob = generated_freqs[terminal_symbols.index(end)]
                error = empirical_prob - generated_prob
                
                # Adjustment with tolerance factor (acts as a learning rate)
                probabilities[start][end] += error * tolerance
                
                # Bound to [0,1]
                probabilities[start][end] = max(0.01, min(0.99, probabilities[start][end]))
        
        # Renormalization
        for start in probabilities:
            total = sum(probabilities[start].values())
            if total > 0:
                probabilities[start] = {end: prob / total 
                                       for end, prob in probabilities[start].items()}
    
    # If the target was never reached, report the best correlation observed.
    # Intermediate probability tables are not stored, so fall back to the
    # initial empirical probabilities.
    if best_probabilities is None:
        best_idx = max(range(len(history)), key=lambda i: history[i][1])
        best_iter, best_correlation, best_significance = history[best_idx]
        best_probabilities = calculate_probabilities(count_transitions(empirical_chains))
        print(f"\nNo optimum reached. Best correlation at iteration {best_iter}:")
        print(f"  Correlation = {best_correlation:.4f}")
        print(f"  Significance = {best_significance:.4f}")
    
    return best_probabilities, best_correlation, best_significance, history

# Perform optimization
optimized_probabilities, best_corr, best_sig, history = optimize_grammar(
    empirical_chains, terminal_symbols, start_symbol,
    max_iterations=500, tolerance=0.005, target_correlation=0.9
)

# ============================================================================
# 7. OPTIMIZATION VISUALIZATION
# ============================================================================

def plot_optimization_history(history):
    """Visualizes the optimization process"""
    iterations, correlations, p_values = zip(*history)
    
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    
    # Correlation development
    ax1.plot(iterations, correlations, 'b-', linewidth=1.5)
    ax1.set_xlabel('Iteration')
    ax1.set_ylabel('Correlation (Pearson r)')
    ax1.set_title('Optimization Process: Correlation between Empirical and Generated Frequencies')
    ax1.grid(True, alpha=0.3)
    ax1.axhline(y=0.9, color='r', linestyle='--', alpha=0.5, label='Target correlation (0.9)')
    ax1.legend()
    
    # p-value development (logarithmic)
    p_values = [max(p, 1e-10) for p in p_values]  # Avoid log(0)
    ax2.semilogy(iterations, p_values, 'g-', linewidth=1.5)
    ax2.set_xlabel('Iteration')
    ax2.set_ylabel('p-value (logarithmic)')
    ax2.set_title('Significance of Correlation')
    ax2.grid(True, alpha=0.3)
    ax2.axhline(y=0.05, color='r', linestyle='--', alpha=0.5, label='Significance level (0.05)')
    ax2.legend()
    
    plt.tight_layout()
    plt.savefig('optimization_history.png', dpi=150)
    plt.show()

# Visualization (may fail, e.g., in headless environments without a display)
try:
    plot_optimization_history(history)
    print("\nOptimization history saved as 'optimization_history.png'.")
except Exception as exc:
    print(f"\n(Note: Visualization skipped: {exc})")

# ============================================================================
# 8. OUTPUT OF OPTIMIZED GRAMMAR
# ============================================================================

print("\n" + "=" * 70)
print("4. OPTIMIZED PROBABILISTIC GRAMMAR")
print("=" * 70)

# Output sorted by start symbol
for start in sorted(optimized_probabilities.keys()):
    transitions = optimized_probabilities[start]
    transitions_str = ", ".join([f"'{end}': {prob:.3f}" 
                                 for end, prob in sorted(transitions.items())])
    print(f"\n{start} -> {transitions_str}")

# ============================================================================
# 9. VALIDATION: COMPARISON OF EMPIRICAL AND GENERATED FREQUENCIES
# ============================================================================

# Generate new chains with optimized grammar
validation_chains = generate_multiple_chains(
    optimized_probabilities, start_symbol, n_chains=100, max_length=20
)
validation_frequencies = compute_frequencies(validation_chains, terminal_symbols)

print("\n" + "=" * 70)
print("5. VALIDATION: EMPIRICAL VS. GENERATED FREQUENCIES")
print("=" * 70)

table_data = []
for i, symbol in enumerate(terminal_symbols):
    table_data.append([
        symbol,
        f"{empirical_frequencies[i]:.4f}",
        f"{validation_frequencies[i]:.4f}",
        f"{abs(empirical_frequencies[i] - validation_frequencies[i]):.4f}"
    ])

print(tabulate(table_data, 
               headers=["Symbol", "Empirical", "Generated", "Difference"],
               tablefmt="grid"))

# Overall correlation
final_corr, final_p = pearsonr(empirical_frequencies, validation_frequencies)
print(f"\nCorrelation (100 generated chains): r = {final_corr:.4f}, p = {final_p:.4f}")

# ============================================================================
# 10. EXAMPLE GENERATED CHAINS
# ============================================================================

print("\n" + "=" * 70)
print("6. EXAMPLE GENERATED TERMINAL SYMBOL CHAINS")
print("=" * 70)

example_chains = generate_multiple_chains(
    optimized_probabilities, start_symbol, n_chains=5, max_length=15
)

for i, chain in enumerate(example_chains, 1):
    chain_str = " -> ".join(chain)
    print(f"\nChain {i} ({len(chain)} symbols):")
    print(f"  {chain_str}")

# ============================================================================
# 11. EXPORT GRAMMAR AS STRUCTURE
# ============================================================================

def export_grammar_as_pcfg(probabilities, filename="optimized_grammar.txt"):
    """Exports the grammar in PCFG format"""
    with open(filename, 'w', encoding='utf-8') as f:
        f.write("# Optimized probabilistic context-free grammar (PCFG)\n")
        f.write("# Generated by Algorithmic Recursive Sequence Analysis 2.0\n\n")
        
        for start in sorted(probabilities.keys()):
            transitions = probabilities[start]
            for end, prob in sorted(transitions.items()):
                f.write(f"{start} -> {end} [{prob:.3f}]\n")
    
    print(f"\nGrammar exported as '{filename}'.")

export_grammar_as_pcfg(optimized_probabilities)

print("\n" + "=" * 70)
print("ALGORITHMIC RECURSIVE SEQUENCE ANALYSIS COMPLETED")
print("=" * 70)
\end{lstlisting}
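The listing above draws from \texttt{np.random.choice} without a fixed seed, so the generated chains differ between runs. For reproducible experiments, the generation step can be tied to a seeded generator. The following self-contained sketch illustrates the idea; the two-state probability table is a hypothetical stand-in, not the optimized grammar produced by the listing.

\begin{lstlisting}[caption=Seeded chain generation (illustrative sketch)]
import numpy as np

# Hypothetical two-state probability table (stand-in for the
# optimized grammar produced by the main listing).
probabilities = {
    'KBBd': {'VBBd': 1.0},
    'VBBd': {'KAA': 0.5, 'VAA': 0.5},
}

def generate_chain_seeded(probabilities, start_symbol,
                          max_length=20, seed=0):
    """Variant of generate_chain using a dedicated, seeded RNG."""
    rng = np.random.default_rng(seed)
    chain = [start_symbol]
    current = start_symbol
    for _ in range(max_length - 1):
        if current not in probabilities:
            break
        next_symbols = list(probabilities[current])
        probs = list(probabilities[current].values())
        current = str(rng.choice(next_symbols, p=probs))
        chain.append(current)
    return chain

# The same seed always yields the same chain.
print(generate_chain_seeded(probabilities, 'KBBd', seed=42))
\end{lstlisting}

With a fixed seed, each validation run operates on identical generated chains, which makes the reported correlations repeatable.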

\end{document}