introduce tokenizer modules

This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.

The legacy tokenizer implements Nominatim's traditional algorithms.
This commit is contained in:
Sarah Hoffmann
2021-04-21 09:57:17 +02:00
parent 5c7b9ef909
commit af968d4903
10 changed files with 289 additions and 0 deletions

View File

@@ -18,6 +18,12 @@ NOMINATIM_DATABASE_WEBUSER="www-data"
# Changing this value requires to run 'nominatim refresh --functions'.
NOMINATIM_DATABASE_MODULE_PATH=
# Tokenizer used for normalizing and parsing queries and names.
# The tokenizer is set up during import and cannot be changed afterwards
# without a reimport.
# Currently available tokenizers: legacy
NOMINATIM_TOKENIZER="legacy"
# Number of occurances of a word before it is considered frequent.
# Similar to the concept of stop words. Frequent partial words get ignored
# or handled differently during search.