mirror of
https://github.com/osm-search/Nominatim.git
synced 2026-02-16 15:47:58 +00:00
Merge pull request #3610 from lonvia/search-preprocessing
Add configurable query preprocessing
@@ -4,12 +4,11 @@ The tokenizer module in Nominatim is responsible for analysing the names given
|
||||
to OSM objects and the terms of an incoming query in order to make sure they
|
||||
can be matched appropriately.
|
||||
|
||||
Nominatim offers different tokenizer modules, which behave differently and have
|
||||
different configuration options. This sections describes the tokenizers and how
|
||||
they can be configured.
|
||||
Nominatim currently offers only one tokenizer module, the ICU tokenizer. This section
|
||||
describes the tokenizer and how it can be configured.
|
||||
|
||||
!!! important
|
||||
The use of a tokenizer is tied to a database installation. You need to choose
|
||||
The selection of tokenizer is tied to a database installation. You need to choose
|
||||
and configure the tokenizer before starting the initial import. Once the import
|
||||
is done, you cannot switch to another tokenizer anymore. Reconfiguring the
|
||||
chosen tokenizer is very limited as well. See the comments in each tokenizer
|
||||
@@ -43,10 +42,19 @@ On import the tokenizer processes names in the following three stages:
|
||||
See the [Token analysis](#token-analysis) section below for more
|
||||
information.
|
||||
|
||||
During query time, only normalization and transliteration are relevant.
|
||||
An incoming query is first split into name chunks (this usually means splitting
|
||||
the string at the commas) and the each part is normalised and transliterated.
|
||||
The result is used to look up places in the search index.
|
||||
During query time, the tokenizer is responsible for processing incoming
|
||||
queries. This happens in two stages:
|
||||
|
||||
1. During **query preprocessing** the incoming text is split into name
|
||||
chunks and normalised. This usually means applying the same normalisation
|
||||
as during the import process but may involve other processing like,
|
||||
for example, word break detection.
|
||||
2. The **token analysis** step breaks down the query parts into tokens,
|
||||
looks them up in the database and assigns them possible functions and
|
||||
probabilities.
|
||||
|
||||
Query preprocessing can be further customized, while the rest of the analysis
|
||||
is hard-coded.
|
||||
|
||||
### Configuration
|
||||
|
||||
@@ -58,6 +66,8 @@ have no effect.
|
||||
Here is an example configuration file:
|
||||
|
||||
``` yaml
|
||||
query-preprocessing:
|
||||
- normalize
|
||||
normalization:
|
||||
- ":: lower ()"
|
||||
- "ß > 'ss'" # German eszett is unambiguously equal to double ss
|
||||
@@ -81,6 +91,22 @@ token-analysis:
|
||||
The configuration file contains five sections:
`query-preprocessing`, `normalization`, `transliteration`, `sanitizers` and `token-analysis`.
|
||||
|
||||
#### Query preprocessing
|
||||
|
||||
The section for `query-preprocessing` defines an ordered list of functions
|
||||
that are applied to the query before the token analysis.
|
||||
|
||||
The following is a list of preprocessors that are shipped with Nominatim.
|
||||
|
||||
##### normalize
|
||||
|
||||
::: nominatim_api.query_preprocessing.normalize
|
||||
options:
|
||||
members: False
|
||||
heading_level: 6
|
||||
docstring_section_style: spacy
|
||||
|
||||
|
||||
#### Normalization and Transliteration
|
||||
|
||||
The normalization and transliteration sections each define a set of
|
||||
|
||||
@@ -14,10 +14,11 @@ of sanitizers and token analysis.
|
||||
implemented, it is not guaranteed to be stable at the moment.
|
||||
|
||||
|
||||
## Using non-standard sanitizers and token analyzers
|
||||
## Using non-standard modules
|
||||
|
||||
Sanitizer names (in the `step` property) and token analysis names (in the
|
||||
`analyzer`) may refer to externally supplied modules. There are two ways
|
||||
Sanitizer names (in the `step` property), token analysis names (in the
|
||||
`analyzer`) and query preprocessor names (in the `step` property)
|
||||
may refer to externally supplied modules. There are two ways
|
||||
to include external modules: through a library or from the project directory.
|
||||
|
||||
To include a module from a library, use the absolute import path as name and
|
||||
@@ -27,6 +28,47 @@ To use a custom module without creating a library, you can put the module
|
||||
somewhere in your project directory and then use the relative path to the
|
||||
file. Include the whole name of the file including the `.py` ending.
|
||||
|
||||
## Custom query preprocessors
|
||||
|
||||
A query preprocessor must export a single factory function `create` with
|
||||
the following signature:
|
||||
|
||||
``` python
|
||||
def create(config: QueryConfig) -> Callable[[list[Phrase]], list[Phrase]]
|
||||
```
|
||||
|
||||
The function receives the custom configuration for the preprocessor and
|
||||
returns a callable (function or class) with the actual preprocessing
|
||||
code. When a query comes in, the callable receives a list of phrases
|
||||
and needs to return the transformed list of phrases. The list and phrases
|
||||
may be changed in place or a completely new list may be generated.
|
||||
|
||||
The `QueryConfig` is a simple dictionary which contains all configuration
|
||||
options given in the yaml configuration of the ICU tokenizer. It is up to
|
||||
the function to interpret the values.
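
As a minimal sketch of this factory pattern, the following preprocessor reads one option from its configuration dictionary and returns a closure that does the actual work. The `Phrase` class here is a stand-in dataclass so that the example runs without Nominatim installed, and `max-length` is a hypothetical option name invented for illustration; it is not an option that Nominatim defines.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Phrase:
    """Stand-in for nominatim_api.search.Phrase (assumed: two fields)."""
    ptype: str
    text: str


def create(config: Dict[str, Any]) -> Callable[[List[Phrase]], List[Phrase]]:
    # 'max-length' is a hypothetical option used only in this sketch.
    # The factory interprets the configuration once, at setup time.
    max_length = int(config.get('max-length', 100))

    def _truncate(phrases: List[Phrase]) -> List[Phrase]:
        # Phrases are changed in place and the same list is returned.
        for phrase in phrases:
            phrase.text = phrase.text[:max_length]
        return phrases

    return _truncate
```

Reading the options inside `create` rather than inside the returned function keeps the per-query code path short, since the configuration does not change between queries.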
|
||||
|
||||
A `nominatim_api.search.Phrase` describes a part of the query that contains one or more independent
|
||||
search terms. Breaking a query into phrases helps reduce the number of
possible tokens Nominatim has to take into account. However, a phrase break
|
||||
is definitive: a multi-term search word cannot go over a phrase break.
|
||||
A Phrase object has two fields:
|
||||
|
||||
* `ptype` further refines the type of phrase (see list below)
|
||||
* `text` contains the query text for the phrase
|
||||
|
||||
The order of phrases matters to Nominatim when doing further processing.
|
||||
Thus, while you may split or join phrases, you should not reorder them
|
||||
unless you really know what you are doing.
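
The following sketch illustrates splitting phrases while keeping their order intact. It assumes, purely for illustration, that a semicolon should act as an additional phrase break; that rule is not part of Nominatim. `Phrase` is again a stand-in dataclass with the two fields described above, so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Phrase:
    """Stand-in for nominatim_api.search.Phrase (assumed: two fields)."""
    ptype: str
    text: str


def create(config: Dict[str, Any]) -> Callable[[List[Phrase]], List[Phrase]]:
    def _split_at_semicolon(phrases: List[Phrase]) -> List[Phrase]:
        # Build a new list; the parts are appended in their original order.
        result: List[Phrase] = []
        for phrase in phrases:
            for part in phrase.text.split(';'):
                part = part.strip()
                if part:  # drop parts that are empty after stripping
                    # Each part inherits the phrase type of its source phrase.
                    result.append(Phrase(phrase.ptype, part))
        return result

    return _split_at_semicolon
```

Because every split creates a definitive phrase break, a rule like this should only be used for separators that never occur inside a multi-term search word.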
|
||||
|
||||
Phrase types (`nominatim_api.search.PhraseType`) can further help narrow
|
||||
down how the tokens in the phrase are interpreted. The following phrase types
|
||||
are known:
|
||||
|
||||
::: nominatim_api.search.PhraseType
|
||||
options:
|
||||
heading_level: 6
|
||||
|
||||
|
||||
## Custom sanitizer modules
|
||||
|
||||
A sanitizer module must export a single factory function `create` with the
|
||||
@@ -90,21 +132,22 @@ adding extra attributes) or completely replace the list with a different one.
|
||||
The following sanitizer removes the directional prefixes from street names
|
||||
in the US:
|
||||
|
||||
!!! example
    ``` python
    import re

    def _filter_function(obj):
        # Only touch street-level objects (address rank 26 or 27) in the US.
        if obj.place.country_code == 'us' \
           and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
            for name in obj.names:
                # Strip a leading directional prefix, ignoring case.
                name.name = re.sub(r'^(north|south|west|east) ',
                                   '',
                                   name.name,
                                   flags=re.IGNORECASE)

    def create(config):
        return _filter_function
    ```
|
||||
|
||||
This is the simplest form of a sanitizer module. It defines a single
|
||||
filter function and implements the required `create()` function by returning
|
||||
@@ -128,13 +171,13 @@ sanitizers:
|
||||
|
||||
!!! warning
|
||||
This example is just a simplified showcase of how to create a sanitizer.
|
||||
It is not really read for real-world use: while the sanitizer would
|
||||
It is not really meant for real-world use: while the sanitizer would
|
||||
correctly transform `West 5th Street` into `5th Street`, it would also
|
||||
shorten a simple `North Street` to `Street`.
|
||||
|
||||
For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
|
||||
They can be found in the directory
|
||||
[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
|
||||
[`src/nominatim_db/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/src/nominatim_db/tokenizer/sanitizers).
|
||||
|
||||
|
||||
## Custom token analysis module
|
||||
|
||||
@@ -91,14 +91,19 @@ for a custom tokenizer implementation.
|
||||
|
||||
### Directory Structure
|
||||
|
||||
Nominatim expects a single file `src/nominatim_db/tokenizer/<NAME>_tokenizer.py`
|
||||
containing the Python part of the implementation.
|
||||
Nominatim expects two files containing the Python part of the implementation:
|
||||
|
||||
* `src/nominatim_db/tokenizer/<NAME>_tokenizer.py` contains the tokenizer
|
||||
code used during import and
|
||||
* `src/nominatim_api/search/<NAME>_tokenizer.py` has the code used during
|
||||
query time.
|
||||
|
||||
`<NAME>` is a unique name for the tokenizer consisting of only lower-case
|
||||
letters, digits and underscore. A tokenizer also needs to install some SQL
|
||||
functions. By convention, these should be placed in `lib-sql/tokenizer`.
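
The naming rule can be checked mechanically. The helper below is a hypothetical sketch, not part of Nominatim: it validates a tokenizer name against the rule stated above (lower-case letters, digits and underscore only) and derives the two module paths from it.

```python
import re
from pathlib import PurePosixPath
from typing import Tuple

# Only lower-case ASCII letters, digits and underscore are allowed.
_NAME_RE = re.compile(r'[a-z0-9_]+')


def tokenizer_module_paths(name: str) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return the (import-time, query-time) module paths for a tokenizer.

    Hypothetical helper for illustration only.
    """
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"Invalid tokenizer name: {name!r}")

    filename = f'{name}_tokenizer.py'
    return (PurePosixPath('src/nominatim_db/tokenizer') / filename,
            PurePosixPath('src/nominatim_api/search') / filename)
```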
|
||||
|
||||
If the tokenizer has a default configuration file, this should be saved in
|
||||
the `settings/<NAME>_tokenizer.<SUFFIX>`.
|
||||
`settings/<NAME>_tokenizer.<SUFFIX>`.
|
||||
|
||||
### Configuration and Persistence
|
||||
|
||||
@@ -110,9 +115,11 @@ are tied to a database installation and must only be read during installation
|
||||
time. If they are needed for the runtime then they must be saved into the
|
||||
`nominatim_properties` table and later loaded from there.
|
||||
|
||||
### The Python module
|
||||
### The Python modules
|
||||
|
||||
The Python module is expect to export a single factory function:
|
||||
#### `src/nominatim_db/tokenizer/`
|
||||
|
||||
The import Python module is expected to export a single factory function:
|
||||
|
||||
```python
|
||||
def create(dsn: str, data_dir: Path) -> AbstractTokenizer
|
||||
@@ -123,6 +130,20 @@ is a directory in the project directory that the tokenizer may use to save
|
||||
database-specific data. The function must return the instance of the tokenizer
|
||||
class as defined below.
|
||||
|
||||
#### `src/nominatim_api/search/`
|
||||
|
||||
The query-time Python module must also export a factory function:
|
||||
|
||||
``` python
|
||||
def create_query_analyzer(conn: SearchConnection) -> AbstractQueryAnalyzer
|
||||
```
|
||||
|
||||
The `conn` parameter contains the current search connection. See the
|
||||
[library documentation](../library/Low-Level-DB-Access.md#searchconnection-class)
|
||||
for details on the class. The function must return an instance of the query
analyzer class as defined below.
|
||||
|
||||
|
||||
### Python Tokenizer Class
|
||||
|
||||
All tokenizers must inherit from `nominatim_db.tokenizer.base.AbstractTokenizer`
|
||||
@@ -138,6 +159,13 @@ and implement the abstract functions defined there.
|
||||
options:
|
||||
heading_level: 6
|
||||
|
||||
|
||||
### Python Query Analyzer Class
|
||||
|
||||
::: nominatim_api.search.query_analyzer_factory.AbstractQueryAnalyzer
|
||||
options:
|
||||
heading_level: 6
|
||||
|
||||
### PL/pgSQL Functions
|
||||
|
||||
The tokenizer must provide access functions for the `token_info` column
|
||||
|
||||