
Source Text Parser

Kai edited this page Mar 10, 2023 · 11 revisions

SOURCE TEXT PARSER

Purpose:

The source text parser (found in source_parser.py) takes Koine Greek Unicode text and converts it into gclause (Greek clause) objects. A clause is defined as a series of Greek words ending with one of the following three characters: '.', ';', '·'. The parser is invoked by calling the parse_greek function with the path to a text file as its only parameter.
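
The clause definition above can be illustrated with a minimal sketch. The helper name split_clauses is hypothetical and is not taken from source_parser.py; it only shows how text might be cut at the three terminating characters, assuming words are separated by whitespace.

```python
import re

# The three clause-ending characters named in the definition above.
CLAUSE_END = r"[.;·]"

def split_clauses(text):
    """Hypothetical helper (not part of source_parser.py).

    Cuts the text at every clause-ending character and returns
    each clause as a list of its whitespace-separated words.
    """
    pieces = re.split(CLAUSE_END, text)
    # Drop empty pieces, such as the one after the final terminator.
    return [piece.split() for piece in pieces if piece.strip()]

clauses = split_clauses("εν αρχη ην ο λογοσ· και ο λογοσ ην προσ τον θεον.")
for words in clauses:
    print(words)
```

The real parser wraps each clause in a gclause object; plain word lists are used here only because the fields of gclause are not described on this page.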

Input

The file at the given path must follow one of two styles. The first style is a continuous string of Greek text; the source parser uses punctuation and newline characters (i.e. '\n') to separate the clauses in the text. The second style is formatted text. Each line begins with a string containing no whitespace, which identifies the position within the document, followed by one or more Greek words. A line may contain several punctuating characters, but it must end with one.
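
As a rough illustration of the formatted style, the sketch below reads one line into a position label and its clauses. The function name parse_formatted_line and the label "John.1.1" are assumptions made for this example; the page only requires that the label contain no whitespace.

```python
import re

PUNCTUATION = ".;·"  # the three punctuating characters

def parse_formatted_line(line):
    """Hypothetical helper (not part of source_parser.py).

    Splits 'LABEL word word ...' into the position label and a list
    of clauses, each clause being a list of words.
    """
    parts = line.strip().split(None, 1)  # label, then the rest of the line
    if len(parts) < 2:
        raise ValueError("expected a position label followed by Greek words")
    label, body = parts
    if body[-1] not in PUNCTUATION:
        raise ValueError("a formatted line must end with a punctuating character")
    pieces = re.split(r"[.;·]", body)
    clauses = [piece.split() for piece in pieces if piece.strip()]
    return label, clauses

# "John.1.1" is an assumed label format, used only for illustration.
label, clauses = parse_formatted_line(
    "John.1.1 εν αρχη ην ο λογοσ· και ο λογοσ ην προσ τον θεον."
)
print(label, len(clauses))
```

Rejecting a line that lacks a final punctuating character is a design choice of this sketch; how source_parser.py itself handles such a line is not stated here.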

Process

The parser works through each clause individually. All definite articles, as well as the Greek equivalent of "and", are removed from the Greek text to eliminate the most common word matches. The removed words are:

ο του τη τασ ταισ αι
το τον την την η τω
τησ των οι τα τουσ τοισ
και
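
The removal step can be sketched as a simple set lookup. The names REMOVED_WORDS and strip_common_words are hypothetical, and the sketch assumes the words of a clause are already in the same form as the list above (lowercase, unaccented, with σ in final position).

```python
# Word forms copied from the list above; a set ignores the repeated entry.
REMOVED_WORDS = frozenset(
    "ο του τη τασ ταισ αι το τον την η τω τησ των οι τα τουσ τοισ και".split()
)

def strip_common_words(words):
    """Hypothetical helper: drop definite articles and 'και' from a clause."""
    return [word for word in words if word not in REMOVED_WORDS]

kept = strip_common_words(["και", "ο", "λογοσ", "ην", "προσ", "τον", "θεον"])
print(kept)
```

A frozenset keeps each lookup constant-time, which matters little for 18 words but keeps the filter a single readable expression.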

Output

The output of the source parser is a list of gclause objects which will then be used in the probabilistic analysis phase.
