Commit 007b659

Merge pull request #179 from TranslatorSRI/fix-windows-smart-quote
We don't currently handle opening and closing (curly) quotes properly (i.e. “” and ‘’), because in the database we usually encode these as plain quotes (" and '). This PR replaces curly quotes in the query string with their plain equivalents. To test this, I had to add `clique_identifier_count` to the test data, which is how I found that an entry with `clique_identifier_count=1` always got a zero score (because `log(1) = 0`, and the boost is multiplied with the score). We now add one to `clique_identifier_count` before taking the log to fix this issue. Closes #176.
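As a rough illustration of the quote replacement described above, the following standalone sketch applies the same two substitutions outside the server. The helper name `normalize_smart_quotes` and the sample query are hypothetical and are not part of this PR.

```python
import re


def normalize_smart_quotes(text: str) -> str:
    """Replace curly double and single quotes with plain ASCII quotes.

    Hypothetical helper: the PR performs the same two substitutions inline
    in lookup() rather than in a separate function.
    """
    # Curly double quotes become a plain double quote.
    text = re.sub(r"[“”]", '"', text)
    # Curly single quotes (including the typographic apostrophe) become a plain single quote.
    return re.sub(r"[‘’]", "'", text)


if __name__ == "__main__":
    # Mirror the order used in lookup(): strip and lowercase first, then normalize quotes.
    query = " “Alzheimer’s disease” ".strip().lower()
    print(normalize_smart_quotes(query))
```

Strings that already use plain quotes pass through unchanged, so the substitution should be safe to apply to every query.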
2 parents c3e4a7f + f625532 commit 007b659

3 files changed: +202 -99 lines changed

api/server.py

Lines changed: 10 additions & 1 deletion
@@ -354,6 +354,15 @@ async def lookup(string: str,
     # First, we strip and lowercase the query since all our indexes are case-insensitive.
     string_lc = string.strip().lower()

+    # There is a possibility that the input text isn't in UTF-8.
+    # We could try a bunch of Python packages to try to determine what the encoding actually is:
+    # - https://pypi.org/project/charset-normalizer/
+    # - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit
+    # But the only issue we've actually run into so far has been the Windows smart
+    # quote (https://github.com/TranslatorSRI/NameResolution/issues/176), so for now
+    # let's detect and replace just those characters.
+    string_lc = re.sub(r"[“”]", '"', re.sub(r"[‘’]", "'", string_lc))
+
     # Do we have a search string at all?
     if string_lc == "":
         return []
@@ -439,7 +448,7 @@ async def lookup(string: str,
             "boost": [
                 # The boost is multiplied with score -- calculating the log() reduces how quickly this increases
                 # the score for increasing clique identifier counts.
-                "log(clique_identifier_count)"
+                "log(sum(clique_identifier_count, 1))"
             ],
         },
     },
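
The effect of the boost change can be checked with a little arithmetic. The sketch below is a simplified model, not the search backend's actual scoring: it assumes the boost is a plain multiplier on a base score and that `log` is base 10 (the zero at a count of 1 holds for any base). The function names and the sample base score are hypothetical.

```python
import math


def boost_factor(clique_identifier_count: int, offset: int) -> float:
    """Model the boost expression: log(count) when offset=0, log(count + 1) when offset=1.

    Assumption: base-10 logarithm. log(1) is 0 in every base.
    """
    return math.log10(clique_identifier_count + offset)


def boosted_score(base_score: float, clique_identifier_count: int, offset: int) -> float:
    """Multiply a base relevance score by the boost factor (simplified model)."""
    return base_score * boost_factor(clique_identifier_count, offset)


if __name__ == "__main__":
    base_score = 5.0  # hypothetical relevance score before boosting
    for count in (1, 2, 10, 100):
        before = boosted_score(base_score, count, offset=0)
        after = boosted_score(base_score, count, offset=1)
        print(f"count={count:>3}  before={before:.3f}  after={after:.3f}")
```

With the old expression, a clique containing a single identifier is multiplied by zero regardless of how well it matches; adding one keeps the multiplier positive while preserving the logarithmic damping for larger cliques.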
