mirror of
https://github.com/osm-search/Nominatim.git
synced 2026-03-12 05:44:06 +00:00
Merge pull request #3610 from lonvia/search-preprocessing
Add configurable query preprocessing
@@ -4,12 +4,11 @@ The tokenizer module in Nominatim is responsible for analysing the names given
 to OSM objects and the terms of an incoming query in order to make sure, they
 can be matched appropriately.
 
-Nominatim offers different tokenizer modules, which behave differently and have
-different configuration options. This sections describes the tokenizers and how
-they can be configured.
+Nominatim currently offers only one tokenizer module, the ICU tokenizer. This section
+describes the tokenizer and how it can be configured.
 
 !!! important
-    The use of a tokenizer is tied to a database installation. You need to choose
+    The selection of tokenizer is tied to a database installation. You need to choose
     and configure the tokenizer before starting the initial import. Once the import
     is done, you cannot switch to another tokenizer anymore. Reconfiguring the
     chosen tokenizer is very limited as well. See the comments in each tokenizer
@@ -43,10 +42,19 @@ On import the tokenizer processes names in the following three stages:
    See the [Token analysis](#token-analysis) section below for more
    information.
 
-During query time, only normalization and transliteration are relevant.
-An incoming query is first split into name chunks (this usually means splitting
-the string at the commas) and the each part is normalised and transliterated.
-The result is used to look up places in the search index.
+During query time, the tokenizer is responsible for processing incoming
+queries. This happens in two stages:
+
+1. During **query preprocessing** the incoming text is split into name
+   chunks and normalised. This usually means applying the same normalisation
+   as during the import process but may involve other processing like,
+   for example, word break detection.
+2. The **token analysis** step breaks down the query parts into tokens,
+   looks them up in the database and assigns them possible functions and
+   probabilities.
+
+Query processing can be further customized while the rest of the analysis
+is hard-coded.
 
 ### Configuration
 
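The two query-time stages described above can be sketched in plain Python. This is a hypothetical illustration only: the `Phrase` dataclass, `lowercase_normalize` helper and `run_query_stages` driver below are simplified stand-ins, not the actual Nominatim classes.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Phrase:
    ptype: str   # phrase type, e.g. 'NONE'
    text: str    # the query text of the phrase

# A preprocessor is an ordered transformer over the list of phrases.
PreprocessFunc = Callable[[List[Phrase]], List[Phrase]]

def lowercase_normalize(phrases: List[Phrase]) -> List[Phrase]:
    # Stands in for the ICU normalisation applied during import;
    # phrases that become empty are dropped.
    return [Phrase(p.ptype, p.text.lower().strip())
            for p in phrases if p.text.strip()]

def run_query_stages(query: str, preprocessors: List[PreprocessFunc]) -> List[Phrase]:
    # Stage 1: split the query into name chunks at the commas, then run
    # each preprocessor over the phrase list in order.
    phrases = [Phrase('NONE', chunk) for chunk in query.split(',')]
    for func in preprocessors:
        phrases = func(phrases)
    # Stage 2 (token analysis) would look these phrases up in the database.
    return phrases

result = run_query_stages('Hauptstrasse 5, Berlin', [lowercase_normalize])
```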
@@ -58,6 +66,8 @@ have no effect.
 Here is an example configuration file:
 
 ``` yaml
+query-preprocessing:
+    - normalize
 normalization:
     - ":: lower ()"
     - "ß > 'ss'" # German szet is unambiguously equal to double ss
@@ -81,6 +91,22 @@ token-analysis:
 The configuration file contains four sections:
 `normalization`, `transliteration`, `sanitizers` and `token-analysis`.
 
+#### Query preprocessing
+
+The section for `query-preprocessing` defines an ordered list of functions
+that are applied to the query before the token analysis.
+
+The following is a list of preprocessors that are shipped with Nominatim.
+
+##### normalize
+
+::: nominatim_api.query_preprocessing.normalize
+    options:
+        members: False
+        heading_level: 6
+        docstring_section_style: spacy
+
+
 #### Normalization and Transliteration
 
 The normalization and transliteration sections each define a set of
@@ -14,10 +14,11 @@ of sanitizers and token analysis.
 implemented, it is not guaranteed to be stable at the moment.
 
 
-## Using non-standard sanitizers and token analyzers
+## Using non-standard modules
 
-Sanitizer names (in the `step` property) and token analysis names (in the
-`analyzer`) may refer to externally supplied modules. There are two ways
+Sanitizer names (in the `step` property), token analysis names (in the
+`analyzer`) and query preprocessor names (in the `step` property)
+may refer to externally supplied modules. There are two ways
 to include external modules: through a library or from the project directory.
 
 To include a module from a library, use the absolute import path as name and
@@ -27,6 +28,47 @@ To use a custom module without creating a library, you can put the module
 somewhere in your project directory and then use the relative path to the
 file. Include the whole name of the file including the `.py` ending.
 
+## Custom query preprocessors
+
+A query preprocessor must export a single factory function `create` with
+the following signature:
+
+``` python
+create(self, config: QueryConfig) -> Callable[[list[Phrase]], list[Phrase]]
+```
+
+The function receives the custom configuration for the preprocessor and
+returns a callable (function or class) with the actual preprocessing
+code. When a query comes in, the callable gets a list of phrases
+and needs to return the transformed list of phrases. The list and phrases
+may be changed in place or a completely new list may be generated.
+
+The `QueryConfig` is a simple dictionary which contains all configuration
+options given in the yaml configuration of the ICU tokenizer. It is up to
+the function to interpret the values.
+
+A `nominatim_api.search.Phrase` describes a part of the query that contains one or more independent
+search terms. Breaking a query into phrases helps reduce the number of
+possible tokens Nominatim has to take into account. However, a phrase break
+is definitive: a multi-term search word cannot go over a phrase break.
+A Phrase object has two fields:
+
+* `ptype` further refines the type of phrase (see list below)
+* `text` contains the query text for the phrase
+
+The order of phrases matters to Nominatim when doing further processing.
+Thus, while you may split or join phrases, you should not reorder them
+unless you really know what you are doing.
+
+Phrase types (`nominatim_api.search.PhraseType`) can further help narrowing
+down how the tokens in the phrase are interpreted. The following phrase types
+are known:
+
+::: nominatim_api.search.PhraseType
+    options:
+        heading_level: 6
+
+
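A minimal sketch of a preprocessor module following the `create()` contract described above. `Phrase` and `QueryConfig` are simplified stand-ins for the real `nominatim_api` types so the snippet is self-contained, and the `max-length` option is a made-up example configuration key, not a shipped one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Phrase:
    ptype: str
    text: str

# The real QueryConfig is a read-only dict of the yaml options of this rule.
QueryConfig = Dict[str, object]

def create(config: QueryConfig) -> Callable[[List[Phrase]], List[Phrase]]:
    # Interpret our own (hypothetical) configuration option.
    max_len = int(config.get('max-length', 100))  # type: ignore[arg-type]

    def _process(phrases: List[Phrase]) -> List[Phrase]:
        # The phrase list may be modified in place or replaced; here a new
        # list is built, dropping phrases that exceed the configured length.
        return [p for p in phrases if len(p.text) <= max_len]

    return _process

# Usage: the factory gets the options of its yaml rule, then the returned
# callable is applied to the phrase list of every incoming query.
proc = create({'step': 'drop-long-phrases', 'max-length': 10})
result = proc([Phrase('NONE', 'Berlin'), Phrase('NONE', 'a very long phrase indeed')])
```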
 ## Custom sanitizer modules
 
 A sanitizer module must export a single factory function `create` with the
@@ -90,21 +132,22 @@ adding extra attributes) or completely replace the list with a different one.
 The following sanitizer removes the directional prefixes from street names
 in the US:
 
-``` python
-import re
-
-def _filter_function(obj):
-    if obj.place.country_code == 'us' \
-       and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
-        for name in obj.names:
-            name.name = re.sub(r'^(north|south|west|east) ',
-                               '',
-                               name.name,
-                               flags=re.IGNORECASE)
-
-def create(config):
-    return _filter_function
-```
+!!! example
+    ``` python
+    import re
+
+    def _filter_function(obj):
+        if obj.place.country_code == 'us' \
+           and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+            for name in obj.names:
+                name.name = re.sub(r'^(north|south|west|east) ',
+                                   '',
+                                   name.name,
+                                   flags=re.IGNORECASE)
+
+    def create(config):
+        return _filter_function
+    ```
 
 This is the most simple form of a sanitizer module. It defines a single
 filter function and implements the required `create()` function by returning
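This standalone snippet replays the regular expression from the sanitizer example on plain strings, showing both the intended transformation and its over-eager side effect on names that merely start with a directional word:

```python
import re

def strip_directional(name: str) -> str:
    # Same substitution as in the sanitizer example above.
    return re.sub(r'^(north|south|west|east) ', '', name, flags=re.IGNORECASE)

ok = strip_directional('West 5th Street')   # intended: directional prefix removed
oops = strip_directional('North Street')    # pitfall: plain name also truncated
```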
@@ -128,13 +171,13 @@ sanitizers:
 
 !!! warning
     This example is just a simplified show case on how to create a sanitizer.
-    It is not really read for real-world use: while the sanitizer would
+    It is not really meant for real-world use: while the sanitizer would
     correctly transform `West 5th Street` into `5th Street`, it would also
     shorten a simple `North Street` to `Street`.
 
 For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
 They can be found in the directory
-[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
+[`src/nominatim_db/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/src/nominatim_db/tokenizer/sanitizers).
 
 
 ## Custom token analysis module

@@ -91,14 +91,19 @@ for a custom tokenizer implementation.
 
 ### Directory Structure
 
-Nominatim expects a single file `src/nominatim_db/tokenizer/<NAME>_tokenizer.py`
-containing the Python part of the implementation.
+Nominatim expects two files containing the Python part of the implementation:
+
+* `src/nominatim_db/tokenizer/<NAME>_tokenizer.py` contains the tokenizer
+  code used during import and
+* `src/nominatim_api/search/<NAME>_tokenizer.py` has the code used during
+  query time.
 
 `<NAME>` is a unique name for the tokenizer consisting of only lower-case
 letters, digits and underscore. A tokenizer also needs to install some SQL
 functions. By convention, these should be placed in `lib-sql/tokenizer`.
 
 If the tokenizer has a default configuration file, this should be saved in
-the `settings/<NAME>_tokenizer.<SUFFIX>`.
+`settings/<NAME>_tokenizer.<SUFFIX>`.
 
 ### Configuration and Persistence
 
@@ -110,9 +115,11 @@ are tied to a database installation and must only be read during installation
 time. If they are needed for the runtime then they must be saved into the
 `nominatim_properties` table and later loaded from there.
 
-### The Python module
+### The Python modules
 
-The Python module is expect to export a single factory function:
+#### `src/nominatim_db/tokenizer/`
+
+The import Python module is expected to export a single factory function:
 
 ```python
 def create(dsn: str, data_dir: Path) -> AbstractTokenizer
@@ -123,6 +130,20 @@ is a directory in the project directory that the tokenizer may use to save
 database-specific data. The function must return the instance of the tokenizer
 class as defined below.
 
+#### `src/nominatim_api/search/`
+
+The query-time Python module must also export a factory function:
+
+``` python
+def create_query_analyzer(conn: SearchConnection) -> AbstractQueryAnalyzer
+```
+
+The `conn` parameter contains the current search connection. See the
+[library documentation](../library/Low-Level-DB-Access.md#searchconnection-class)
+for details on the class. The function must return the instance of the tokenizer
+class as defined below.
+
+
 ### Python Tokenizer Class
 
 All tokenizers must inherit from `nominatim_db.tokenizer.base.AbstractTokenizer`
@@ -138,6 +159,13 @@ and implement the abstract functions defined there.
     options:
         heading_level: 6
 
+
+### Python Query Analyzer Class
+
+::: nominatim_api.search.query_analyzer_factory.AbstractQueryAnalyzer
+    options:
+        heading_level: 6
+
 ### PL/pgSQL Functions
 
 The tokenizer must provide access functions for the `token_info` column

@@ -1,3 +1,5 @@
+query-preprocessing:
+    - step: normalize
 normalization:
     - ":: lower ()"
     - ":: Hans-Hant"

@@ -18,6 +18,7 @@ from .typing import SaFromClause
 from .sql.sqlalchemy_schema import SearchTables
 from .sql.sqlalchemy_types import Geometry
 from .logging import log
+from .config import Configuration
 
 T = TypeVar('T')
 
@@ -31,9 +32,11 @@ class SearchConnection:
 
     def __init__(self, conn: AsyncConnection,
                  tables: SearchTables,
-                 properties: Dict[str, Any]) -> None:
+                 properties: Dict[str, Any],
+                 config: Configuration) -> None:
         self.connection = conn
         self.t = tables
+        self.config = config
         self._property_cache = properties
         self._classtables: Optional[Set[str]] = None
         self.query_timeout: Optional[int] = None
@@ -184,7 +184,7 @@ class NominatimAPIAsync:
         assert self._tables is not None
 
         async with self._engine.begin() as conn:
-            yield SearchConnection(conn, self._tables, self._property_cache)
+            yield SearchConnection(conn, self._tables, self._property_cache, self.config)
 
     async def status(self) -> StatusResult:
         """ Return the status of the database.

src/nominatim_api/query_preprocessing/__init__.py (new file, empty)

src/nominatim_api/query_preprocessing/base.py (new file, 32 lines added)
@@ -0,0 +1,32 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2024 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Common data types and protocols for preprocessing.
+"""
+from typing import List, Callable
+
+from ..typing import Protocol
+from ..search import query as qmod
+from .config import QueryConfig
+
+QueryProcessingFunc = Callable[[List[qmod.Phrase]], List[qmod.Phrase]]
+
+
+class QueryHandler(Protocol):
+    """ Protocol for query modules.
+    """
+    def create(self, config: QueryConfig) -> QueryProcessingFunc:
+        """
+        Create a function for preprocessing a query.
+
+        Arguments:
+            config: A dictionary with the additional configuration options
+                    specified in the tokenizer configuration
+            normalizer: An instance to transliterate text
+
+        Return:
+            The result is a list of phrases modified by the preprocessor.
+        """
+        pass
src/nominatim_api/query_preprocessing/config.py (new file, 34 lines added)
@@ -0,0 +1,34 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2024 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Configuration for query preprocessors.
+"""
+from typing import Any, TYPE_CHECKING
+from collections import UserDict
+
+# working around missing generics in Python < 3.8
+# See https://github.com/python/typing/issues/60#issuecomment-869757075
+if TYPE_CHECKING:
+    _BaseUserDict = UserDict[str, Any]
+else:
+    _BaseUserDict = UserDict
+
+
+class QueryConfig(_BaseUserDict):
+    """ The `QueryConfig` class is a read-only dictionary
+        with configuration options for the preprocessor.
+        In addition to the usual dictionary functions, the class provides
+        accessors to standard preprocessor options that are used by many of the
+        preprocessors.
+    """
+
+    def set_normalizer(self, normalizer: Any) -> 'QueryConfig':
+        """ Set the normalizer function to be used.
+        """
+        self['_normalizer'] = normalizer
+
+        return self
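The behaviour of the `QueryConfig` class above can be demonstrated standalone; its body is restated here (stdlib only) so the snippet runs outside Nominatim:

```python
from collections import UserDict
from typing import Any

class QueryConfig(UserDict):
    """ Dictionary of preprocessor options plus a chainable accessor
        for the normalizer, mirroring the class in the diff above. """
    def set_normalizer(self, normalizer: Any) -> 'QueryConfig':
        self['_normalizer'] = normalizer
        return self

# The options come straight from the yaml rule, e.g. {'step': 'normalize'};
# a plain callable stands in for the ICU transliterator here.
cfg = QueryConfig({'step': 'normalize'}).set_normalizer(str.lower)
```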
src/nominatim_api/query_preprocessing/normalize.py (new file, 31 lines added)
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2024 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Normalize query text using the same ICU normalization rules that are
+applied during import. If a phrase becomes empty because the normalization
+removes all terms, then the phrase is deleted.
+
+This preprocessor does not come with any extra information. Instead it will
+use the configuration from the `normalization` section.
+"""
+from typing import cast
+
+from .config import QueryConfig
+from .base import QueryProcessingFunc
+from ..search.query import Phrase
+
+
+def create(config: QueryConfig) -> QueryProcessingFunc:
+    normalizer = config.get('_normalizer')
+
+    if not normalizer:
+        return lambda p: p
+
+    return lambda phrases: list(
+        filter(lambda p: p.text,
+               (Phrase(p.ptype, cast(str, normalizer.transliterate(p.text)))
+                for p in phrases)))
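The filtering behaviour of the normalize preprocessor above can be sketched without ICU; a plain callable stands in for the transliterator, and phrases whose text becomes empty are dropped:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Phrase:
    ptype: str
    text: str

def make_preprocessor(normalize: Callable[[str], str]
                      ) -> Callable[[List[Phrase]], List[Phrase]]:
    # Rebuild each phrase with normalized text; keep only non-empty results,
    # mirroring the filter() in the real create() above.
    return lambda phrases: [Phrase(p.ptype, t)
                            for p in phrases
                            if (t := normalize(p.text))]

proc = make_preprocessor(lambda s: s.lower().strip())
out = proc([Phrase('NONE', '  Hallo '), Phrase('NONE', '   ')])
```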
@@ -16,12 +16,14 @@ from icu import Transliterator
 
 import sqlalchemy as sa
 
+from ..errors import UsageError
 from ..typing import SaRow
 from ..sql.sqlalchemy_types import Json
 from ..connection import SearchConnection
 from ..logging import log
-from ..search import query as qmod
-from ..search.query_analyzer_factory import AbstractQueryAnalyzer
+from . import query as qmod
+from ..query_preprocessing.config import QueryConfig
+from .query_analyzer_factory import AbstractQueryAnalyzer
 
 
 DB_TO_TOKEN_TYPE = {
@@ -151,6 +153,8 @@ class ICUQueryAnalyzer(AbstractQueryAnalyzer):
         self.transliterator = await self.conn.get_cached_value('ICUTOK', 'transliterator',
                                                                _make_transliterator)
 
+        await self._setup_preprocessing()
+
         if 'word' not in self.conn.t.meta.tables:
             sa.Table('word', self.conn.t.meta,
                      sa.Column('word_id', sa.Integer),
@@ -159,15 +163,36 @@ class ICUQueryAnalyzer(AbstractQueryAnalyzer):
                      sa.Column('word', sa.Text),
                      sa.Column('info', Json))
 
+    async def _setup_preprocessing(self) -> None:
+        """ Load the rules for preprocessing and set up the handlers.
+        """
+        rules = self.conn.config.load_sub_configuration('icu_tokenizer.yaml',
+                                                        config='TOKENIZER_CONFIG')
+        preprocessing_rules = rules.get('query-preprocessing', [])
+
+        self.preprocessors = []
+
+        for func in preprocessing_rules:
+            if 'step' not in func:
+                raise UsageError("Preprocessing rule is missing the 'step' attribute.")
+            if not isinstance(func['step'], str):
+                raise UsageError("'step' attribute must be a simple string.")
+
+            module = self.conn.config.load_plugin_module(
+                func['step'], 'nominatim_api.query_preprocessing')
+            self.preprocessors.append(
+                module.create(QueryConfig(func).set_normalizer(self.normalizer)))
+
     async def analyze_query(self, phrases: List[qmod.Phrase]) -> qmod.QueryStruct:
         """ Analyze the given list of phrases and return the
             tokenized query.
         """
         log().section('Analyze query (using ICU tokenizer)')
-        normalized = list(filter(lambda p: p.text,
-                                 (qmod.Phrase(p.ptype, self.normalize_text(p.text))
-                                  for p in phrases)))
-        query = qmod.QueryStruct(normalized)
+        for func in self.preprocessors:
+            phrases = func(phrases)
+        query = qmod.QueryStruct(phrases)
+
         log().var_dump('Normalized query', query.source)
         if not query.source:
             return query

@@ -21,9 +21,11 @@ if TYPE_CHECKING:
     from typing import Any
     import sqlalchemy as sa
    import os
-    from typing_extensions import (TypeAlias as TypeAlias)
+    from typing_extensions import (TypeAlias as TypeAlias,
+                                   Protocol as Protocol)
 else:
     TypeAlias = str
+    Protocol = object
 
 StrPath = Union[str, 'os.PathLike[str]']
 
test/python/api/query_processing/test_normalize.py (new file, 34 lines added)
@@ -0,0 +1,34 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2024 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Tests for normalizing search queries.
+"""
+from pathlib import Path
+
+import pytest
+
+from icu import Transliterator
+
+import nominatim_api.search.query as qmod
+from nominatim_api.query_preprocessing.config import QueryConfig
+from nominatim_api.query_preprocessing import normalize
+
+
+def run_preprocessor_on(query, norm):
+    normalizer = Transliterator.createFromRules("normalization", norm)
+    proc = normalize.create(QueryConfig().set_normalizer(normalizer))
+
+    return proc(query)
+
+
+def test_normalize_simple():
+    norm = ':: lower();'
+    query = [qmod.Phrase(qmod.PhraseType.NONE, 'Hallo')]
+
+    out = run_preprocessor_on(query, norm)
+
+    assert len(out) == 1
+    assert out == [qmod.Phrase(qmod.PhraseType.NONE, 'hallo')]