Inhalt

Aktueller Ordner: /

.ipynb (pretty JSON)

{
    "metadata": {
        "kernelspec": {
            "name": "python",
            "display_name": "Python (Pyodide)",
            "language": "python"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "python",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text\/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8"
        }
    },
    "nbformat_minor": 4,
    "nbformat": 4,
    "cells": [
        {
            "cell_type": "markdown",
            "source": "Algorithmisch Rekursive Sequenzanalyse 2.0",
            "metadata": []
        },
        {
            "cell_type": "code",
            "source": "Example for optimizing grammar",
            "metadata": {
                "trusted": true
            },
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "source": "Paul Koop November 2024 post@paul-koop.org",
            "metadata": []
        },
        {
            "cell_type": "code",
            "source": "## Problem statement and solution approach\n\n### Problem\nGiven is an empirical sequence of terminal characters obtained from a specific language model. \nThe task is to create a context-free probabilistic grammar that can generate sequences that match \nthe empirical sequence as closely as possible. The aim is to find a grammar whose probability \ndistributions are adjusted so that the relative frequencies of the generated sequences\ncome as close as possible to the terminal characters occurring in the empirical sequence.\n\n### Procedure\n1. **Initial Grammar Definition**: An initial probabilistic grammar is defined that can \ngenerate different possible sequences with different probabilities.\n  \n2. **Generation and frequency analysis**: Multiple sequences are generated based on the \ngrammar, and the relative frequency of each terminal character is calculated.\n\n3. **Correlation test**: The correlation between the relative frequencies of the terminal \ncharacters in the generated and in the empirical sequence is calculated. The Spearman rank\ncorrelation coefficient serves as a measure of agreement. Additionally, \na p-value is calculated to check the significance of the correlation.\n\n4. **Optimization by adjusting probabilities**: If the correlation is low, \n    the probabilities within the grammar are iteratively adjusted to better\n    match the generated sequences to the empirical sequence. This ensures\n    that the probabilities are normalized to 1 after each adjustment.\n\n5. **Stopping criterion**: The optimization process ends when a significantly\n    high correlation is achieved or the maximum number of iterations is reached.\n\n### Result\nAfter the optimization process, the adjusted grammar is output, along with \n    the best correlation coefficient and its significance level (p-value). A significantly high correlation means that the grammar has been successfully adapted to the empirical sequence and can serve as a model for its generation.\n",
            "metadata": {
                "trusted": true
            },
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "source": "",
            "metadata": []
        },
        {
            "cell_type": "code",
            "source": "import numpy as np\nfrom scipy.stats import spearmanr\nfrom collections import Counter\n\n# Beispielhafte empirische Sequenz\nempirical_sequence = ['KBG', 'VBG', 'KBBd', 'VBA', 'KBBd', 'VBA', 'KBA', 'VBA', \n                      'KBBd', 'VBBd', 'KBBd', 'VBBd', 'KBA', 'VBA', 'KBA', 'VBA', \n                      'KAA', 'VAA', 'KAV', 'VAV']\n\n# Grammatik mit Gewichtungen\ngrammar = {\n    \"<Start>\": [[\"<Begrüßung>\", \"<Bedarf>\", \"<Abschluss>\", \"<Verabschiedung>\", 1.0]],\n    \"<Begrüßung>\": [[\"KBG\", \"VBG\", 1.0]],\n    \"<Bedarf>\": [[\"<BedarfSegment>\", \"<Bedarf>\", 0.8], [\"<BedarfSegment>\", 0.2]],\n    \"<BedarfSegment>\": [[\"KBBd\", \"VBBd\", 0.4], [\"KBBd\", \"VBA\", 0.3], [\"KBA\", \"VBA\", 0.3]],\n    \"<Abschluss>\": [[\"KAA\", \"VAA\", 0.6], [\"VAA\", \"KAA\", 0.4]],\n    \"<Verabschiedung>\": [[\"KAV\", \"VAV\", 0.7], [\"VAV\", \"KAV\", 0.3]]\n}\n\n# Funktion zur Generierung einer Terminalsequenz basierend auf der Grammatik\ndef generate_terminal_sequence(grammar):\n    sequence = []\n    production_stack = [\"<Start>\"]\n    while production_stack:\n        production = production_stack.pop()\n        if production in grammar:\n            rules = grammar[production]\n            options = [rule[:-1] for rule in rules]  # Die Produktionsregeln ohne die Wahrscheinlichkeit\n            probabilities = [rule[-1] for rule in rules]  # Die Wahrscheinlichkeiten\n            selected_rule = np.random.choice(len(options), p=probabilities)  # Index der gewählten Regel\n            production_stack.extend(reversed(options[selected_rule]))\n        else:\n            sequence.append(production)\n    return sequence\n\n# Berechne die Häufigkeit der Terminalsymbole in einer Liste von Sequenzen\ndef calculate_frequencies(sequences):\n    flat_sequence = [symbol for sequence in sequences for symbol in sequence]\n    total_count = len(flat_sequence)\n    frequencies = {symbol: count \/ total_count for symbol, count in Counter(flat_sequence).items()}\n    return frequencies\n\n# Häufigkeiten der empirischen Sequenz\nempirical_frequencies = calculate_frequencies([empirical_sequence])\n\n# Optimierungsprozess\nbest_correlation = -1\nbest_grammar = grammar\nbest_p_value = None\n\nfor _ in range(100):  # Maximale Anzahl an Iterationen\n    # Generiere mehrere Sequenzen und berechne die Häufigkeit der Terminalzeichen\n    generated_sequences = [generate_terminal_sequence(grammar) for _ in range(1000)]\n    generated_frequencies = calculate_frequencies(generated_sequences)\n    \n    # Konvertiere Frequenzdaten für Korrelationstest\n    empirical_values = np.array([empirical_frequencies.get(symbol, 0) for symbol in empirical_frequencies])\n    generated_values = np.array([generated_frequencies.get(symbol, 0) for symbol in empirical_frequencies])\n    \n    # Berechne die Korrelation zur empirischen Sequenz\n    spearman_corr, spearman_p = spearmanr(empirical_values, generated_values)\n    \n    # Aktualisiere das beste Ergebnis, falls signifikant und besser\n    if spearman_p < 0.05 and spearman_corr > best_correlation:\n        best_correlation = spearman_corr\n        best_grammar = grammar\n        best_p_value = spearman_p\n    \n    # Wenn die Korrelation akzeptabel ist, beende die Schleife\n    if spearman_corr > 0.8:  # Setze einen gewünschten Korrelationswert fest\n        break\n    \n    # Andernfalls passe die Wahrscheinlichkeiten leicht an\n    for key in grammar:\n        for i, rule in enumerate(grammar[key]):\n            new_prob = rule[-1] + np.random.uniform(-0.05, 0.05)\n            grammar[key][i][-1] = max(0, min(new_prob, 1))  # Stelle sicher, dass die Wahrscheinlichkeiten im Bereich [0, 1] bleiben\n        # Normiere die Wahrscheinlichkeiten neu, damit ihre Summe 1 ist\n        total_prob = sum(rule[-1] for rule in grammar[key])\n        for rule in grammar[key]:\n            rule[-1] \/= total_prob\n\n# Ausgabe der optimierten Grammatik, Korrelation und Signifikanz\nprint(\"Optimized Grammar:\", best_grammar)\nprint(\"Best Spearman Correlation:\", best_correlation)\nprint(\"Significance (p-value):\", best_p_value)\nif best_p_value < 0.05:\n    print(\"The correlation is statistically significant.\")\nelse:\n    print(\"The correlation is not statistically significant.\")\n\n",
            "metadata": {
                "trusted": true
            },
            "outputs": [
                {
                    "name": "stdout",
                    "text": "Optimized Grammar: {'<Start>': [['<Begrüßung>', '<Bedarf>', '<Abschluss>', '<Verabschiedung>', 1.0]], '<Begrüßung>': [['KBG', 'VBG', 1.0]], '<Bedarf>': [['<BedarfSegment>', '<Bedarf>', 0.8], ['<BedarfSegment>', 0.2]], '<BedarfSegment>': [['KBBd', 'VBBd', 0.4], ['KBBd', 'VBA', 0.3], ['KBA', 'VBA', 0.3]], '<Abschluss>': [['KAA', 'VAA', 0.6], ['VAA', 'KAA', 0.4]], '<Verabschiedung>': [['KAV', 'VAV', 0.7], ['VAV', 'KAV', 0.3]]}\nBest Spearman Correlation: 0.9692307692307693\nSignificance (p-value): 3.778488151361357e-06\nThe correlation is statistically significant.\n",
                    "output_type": "stream"
                }
            ],
            "execution_count": 4
        },
        {
            "cell_type": "code",
            "source": "",
            "metadata": {
                "trusted": true
            },
            "outputs": [],
            "execution_count": null
        }
    ]
}