Umlauts need additional edit distance #12

@then4p

I built a test that looks like this:

def test_umlauts(self):
  dictionary_path = os.path.join(self.fortests_path, "umlaut_dict.txt")

  edit_distance_max = 1
  prefix_length = 5
  sym_spell = SymSpell(edit_distance_max, prefix_length)
  # term is in column 0, count is in column 1
  sym_spell.load_dictionary(dictionary_path, 0, 1)

  # third argument is the maximum edit distance for this lookup
  result = sym_spell.lookup("dämen", Verbosity.TOP, 2)
  self.assertEqual(1, len(result))
  self.assertEqual("damen", result[0].term)

The dictionary file (umlaut_dict.txt) contains only this line: damen 1

However, this test fails with edit_distance_max = 1 and passes with edit_distance_max = 2, even though only one character changes between dämen and damen.

It seems there is a bug that causes umlauts like 'ä' to be treated as two characters (as 'ae' or something similar)?
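
One way to probe the "two characters" idea is to measure the edit distance over different representations of the same word. The sketch below is only an assumption about a possible cause, not something confirmed in the library's code: `levenshtein` is a hypothetical helper written for this check, not SymSpell's own distance function. It compares the precomposed form, the decomposed form (a + combining diaeresis), and the raw UTF-8 bytes against "damen".

```python
import unicodedata


def levenshtein(a, b):
    """Plain Levenshtein distance over two sequences (str or bytes).

    Hypothetical helper for this check only; not the library's implementation.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


word = "d\u00e4men"  # "dämen", written with an escape to avoid source-encoding issues

composed = unicodedata.normalize("NFC", word)    # 'ä' as a single code point
decomposed = unicodedata.normalize("NFD", word)  # 'a' followed by U+0308

d_composed = levenshtein(composed, "damen")
d_decomposed = levenshtein(decomposed, "damen")
d_bytes = levenshtein(composed.encode("utf-8"), b"damen")

print(d_composed, d_decomposed, d_bytes)
```

If the byte-level distance is the only one that comes out larger than 1, that would point toward the term being handled as an encoded byte string (for example, a non-Unicode string literal or a dictionary file read without decoding) rather than toward a flaw in the distance algorithm itself. That is a guess to be verified against the actual environment.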

If anyone has an idea where to look, I'd gladly try to fix it, but I haven't found anything yet.
