rerank results by query

The algorithm is similar to the PHP reranking: it checks the terms from the display name against the query terms. However, instead of exact matching it uses a per-word edit distance, so that it is less strict about mismatched accents or other one-letter differences. Country names get a higher penalty because they currently don't receive a penalty during token matching. This will work badly with the legacy tokenizer. Given that the legacy tokenizer is marked for removal, it is simply not worth optimising for it.
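
To illustrate the idea of a per-word edit distance, here is a minimal sketch. It is not the code from this commit: the function names `edit_distance` and `rerank_penalty`, the use of plain Levenshtein distance, the comma stripping and the normalisation to a value between 0 and 1 are all assumptions made for the example. The extra penalty for country names is not modelled.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (assumed metric, chosen for illustration)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def rerank_penalty(query: str, display_name: str) -> float:
    """Hypothetical penalty in [0, 1].

    0 means every query word has an exact counterpart in the display
    name; 1 means no query word is close to any display-name word.
    """
    # Case-only normalisation, mirroring the normalize_text() of the diff.
    qwords = [w for w in query.lower().split() if w]
    nwords = {w.strip(',') for w in display_name.lower().split()} - {''}
    if not qwords or not nwords:
        return 0.0

    total = 0
    for qword in qwords:
        # Best match of this query word against any display-name word.
        best = min(edit_distance(qword, nword) for nword in nwords)
        # A word cannot cost more than its own length.
        total += min(best, len(qword))

    return total / sum(len(w) for w in qwords)


if __name__ == '__main__':
    names = ['Paris, France', 'Zürich, Schweiz']
    # Lower penalty sorts first.
    print(sorted(names, key=lambda n: rerank_penalty('zurich', n)))
```

With this sketch, a query for `zurich` incurs only a small penalty against `Zürich, Schweiz` because a single substitution separates the two words, whereas exact matching would treat them as entirely different.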
2023-09-19 16:18:09 +02:00
parent 5762a5bc80
commit fd26310d6a
3 changed files with 64 additions and 4 deletions
--- a/nominatim/api/search/legacy_tokenizer.py
+++ b/nominatim/api/search/legacy_tokenizer.py
@@ -127,6 +127,15 @@ class LegacyQueryAnalyzer(AbstractQueryAnalyzer):
        return query


+    def normalize_text(self, text: str) -> str:
+        """ Bring the given text into a normalized form.
+
+            This only removes case, so some difference with the normalization
+            in the phrase remains.
+        """
+        return text.lower()
+
+
    def split_query(self, query: qmod.QueryStruct) -> Tuple[List[str],
                                                            Dict[str, List[qmod.TokenRange]]]:
        """ Transliterate the phrases and split them into tokens.