Probabilistic Data Analysis
Kai edited this page Mar 10, 2023 · 8 revisions
Analyze the data created by the probabilistic data synthesis (prob_data).
The input to this program (and the output of prob_data.py), prob_analysis_raw.pkl, is a list of data structures formatted as follows:
[
    line_number,
    {
        verse_1: v1_occurrences,
        verse_2: v2_occurrences,
        verse_3: v3_occurrences,
        ...
    }
], ...
To understand how this data is collected, go to Probabilistic Data Synthesis.
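As a rough illustration of this layout, the sketch below builds a small hypothetical sample and round-trips it through pickle in memory. The verse keys, counts, and the helper name total_occurrences are made up for the example, and reading the file with Python's pickle module is an assumption based on the .pkl extension rather than something this page states.

```python
import io
import pickle

# Hypothetical sample in the documented shape:
# a list of [line_number, {verse: occurrences, ...}] entries.
sample = [
    [1, {"John 3:16": 5, "Rom 8:28": 1, "Gen 1:1": 1}],
    [2, {"Ps 23:1": 2, "Gen 1:1": 2}],
]

# Round-trip through pickle in memory. A real run would instead open
# prob_analysis_raw.pkl in binary mode and call pickle.load on it.
buffer = io.BytesIO()
pickle.dump(sample, buffer)
buffer.seek(0)
records = pickle.load(buffer)


def total_occurrences(records):
    """Map each line number to the total occurrence count of its verses."""
    return {line_number: sum(verses.values()) for line_number, verses in records}


totals = total_occurrences(records)
```

Each entry unpacks into a line number and a dictionary of verse counts, which is the shape the analysis methods below operate on.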
Multiple methods of analysis are available:
- simple_analysis - Find the verse with the most occurrences. This is done by sorting the list of verses by occurrence count in descending order and taking the top entry. For distinctive clauses with a clear correlating verse, this method of analysis is quick and clear. However, for clauses with many verses but no distinct occurrence outliers (i.e. clauses without any verse citations or allusions), it will not be helpful.
- average_analysis - Find verses with more occurrences than the average + standard deviation. Instead of taking only the single verse with the most occurrences, average_analysis takes all of the occurrences into account. In pseudocode, average_analysis performs as follows:
    min = (average + standard deviation) rounded up
    for verse in clause:
        if occurrences of verse > min:
            add verse to output
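The two methods described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function signatures, the sample clauses, and the use of the population standard deviation (statistics.pstdev) are all assumptions, since this page does not say which form of standard deviation is used.

```python
import math
import statistics


def simple_analysis(verses):
    """Return the (verse, occurrences) pair with the highest count, or None."""
    # Sort by occurrence count in descending order and take the top entry.
    ranked = sorted(verses.items(), key=lambda item: item[1], reverse=True)
    return ranked[0] if ranked else None


def average_analysis(verses):
    """Return the verses whose count exceeds ceil(average + standard deviation)."""
    counts = list(verses.values())
    if not counts:
        return {}
    # Assumption: population standard deviation; the source does not specify.
    threshold = math.ceil(statistics.mean(counts) + statistics.pstdev(counts))
    return {verse: n for verse, n in verses.items() if n > threshold}


# Hypothetical clause with one clear outlier.
clear_clause = {"a": 10, "b": 1, "c": 1, "d": 2, "e": 1}
# Hypothetical clause with no outlier at all.
flat_clause = {"x": 2, "y": 2, "z": 2}

top_clear = simple_analysis(clear_clause)
above_clear = average_analysis(clear_clause)
top_flat = simple_analysis(flat_clause)
above_flat = average_analysis(flat_clause)
```

On the flat clause, simple_analysis still returns a top verse even though nothing stands out, while average_analysis returns an empty result, which matches the limitation noted for simple_analysis.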