Home
SimilarityMatcher is essentially an experiment with string processing algorithms. The aim is to test basic notions devised by structural linguists, namely minimal pairs, segmentation, and permutation tests, as a discovery procedure for figuring out how any natural language works.

So far, the main goal is Longest Common Subsequence extraction, adapted to both character and word-like units. For the time being, processing is limited to one "layer" at a time (raw text, text + POS tags, or text + linguistic annotation). Future plans are to implement weighting and penalty schemes for editing operations, and to allow multilinear processing, that is, processing multiple layers at the same time.

Applications of this approach range from fundamental linguistics (an implementation of rule abstraction based on form analogy) to Natural Language Processing, Corpus Linguistics (extraction of collocations, repeated segments, and phraseological units), and Information Retrieval (term extraction, text summarization).
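To make the Longest Common Subsequence idea concrete, here is a minimal sketch in Python using the textbook dynamic-programming approach. It is not taken from the SimilarityMatcher code base: the function name `longest_common_subsequence` and the sample inputs are assumptions chosen for illustration. Because it accepts any sequence, the same function handles character units (a string) and word-like units (a list of tokens).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def longest_common_subsequence(a: Sequence[T], b: Sequence[T]) -> List[T]:
    """Return one longest common subsequence of a and b.

    Hypothetical helper for illustration only; works on characters
    (strings) or word-like units (lists of tokens).
    """
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover the units themselves.
    result: List[T] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Character units: a minimal pair differing in its first segment.
chars = "".join(longest_common_subsequence("walked", "talked"))

# Word-like units: two sentences sharing a frame.
words = longest_common_subsequence(
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
)
```

Here `chars` is `"alked"` and `words` is `["the", "sat", "on", "the"]`: the shared material is what the two inputs have in common, and the leftover units (`w`/`t`, `cat`/`dog`, `mat`/`rug`) are the positions where they contrast. This sketch treats every unit equally; the weighting and penalty schemes mentioned above are not modelled.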