Probabilistic Data Analysis

Kai edited this page Mar 10, 2023 · 8 revisions

PROBABILISTIC ANALYSIS

This program analyzes the data produced by the probabilistic data synthesis (prob_data). Its input, prob_analysis_raw.pkl, is the output of prob_data.py: a list of entries, each formatted as follows:

[
    line_number,
    {
         verse_1: v1_occurrences,
         verse_2: v2_occurrences,
         verse_3: v3_occurrences,
         ...
    }
], ...
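
As a rough illustration of this layout, the sketch below writes a small made-up list in the same shape to a temporary pickle file and reads it back. The helper name load_raw, the placeholder verse keys, and the counts are assumptions for illustration only; the real key type and values depend on what prob_data.py writes.

```python
import pickle
import tempfile
from pathlib import Path


def load_raw(path):
    """Load the pickled list of [line_number, {verse: occurrences}] entries."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up sample in the documented shape (not real prob_data.py output).
sample = [
    [1, {"verse_1": 12, "verse_2": 3}],
    [2, {"verse_1": 5, "verse_2": 4, "verse_3": 4}],
]

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real prob_analysis_raw.pkl.
    path = Path(tmp) / "prob_analysis_raw.pkl"
    with open(path, "wb") as f:
        pickle.dump(sample, f)
    data = load_raw(path)

# Each entry unpacks into a line number and a verse-to-occurrences mapping.
for line_number, verse_counts in data:
    print(line_number, verse_counts)
```

Only unpickle files you produced yourself, since pickle can execute arbitrary code when loading.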

To understand how this data is collected, go to Probabilistic Data Synthesis.

Multiple methods of analysis are available:

  1. simple_analysis

    • Find the verse with the most occurrences. The verses are sorted by occurrence count in descending order and the top entry is taken. For clauses with one clearly correlating verse, this method is quick and unambiguous. However, for clauses with many verses but no distinct occurrence outliers (i.e. clauses without any verse citations or allusions), it is not helpful.
  2. average_analysis

    • Find verses with more occurrences than the average plus one standard deviation. Rather than taking only the single verse with the most occurrences, average_analysis takes every verse's occurrence count into account. In pseudocode, average_analysis works as follows:
    threshold = (average + standard deviation) rounded up
    for verse in clause:
        if occurrences of verse > threshold:
            add verse to output
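
The two methods above could be sketched in Python roughly as follows. This is a minimal sketch, not the project's actual code: it assumes each clause is a non-empty dict mapping verse to occurrence count, and it uses the population standard deviation (statistics.pstdev), which the page does not specify.

```python
import math
import statistics


def simple_analysis(verse_counts):
    """Return the verse with the most occurrences.

    Sorts by count in descending order and takes the top entry.
    Assumes verse_counts is non-empty.
    """
    ranked = sorted(verse_counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]


def average_analysis(verse_counts):
    """Return the verses whose count exceeds ceil(mean + standard deviation).

    Assumption: population standard deviation; the original may use another.
    """
    counts = list(verse_counts.values())
    threshold = math.ceil(statistics.mean(counts) + statistics.pstdev(counts))
    return [verse for verse, n in verse_counts.items() if n > threshold]


# Made-up clauses: one with a clear outlier, one with no outlier at all.
clear = {"verse_1": 10, "verse_2": 2, "verse_3": 1, "verse_4": 1}
flat = {"verse_1": 3, "verse_2": 3, "verse_3": 3}

clear_top = simple_analysis(clear)
clear_above = average_analysis(clear)
flat_top = simple_analysis(flat)
flat_above = average_analysis(flat)
```

On the flat clause, simple_analysis still returns a verse even though nothing stands out, while average_analysis returns an empty list. This is the case where the first method is described as unhelpful.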
