Mirror of https://github.com/osm-search/Nominatim.git (synced 2026-02-26 11:08:13 +00:00)

Commit: add documentation for new configuration of ICU tokenizer
@@ -60,22 +60,23 @@ NOMINATIM_TOKENIZER=icu

### How it works

On import the tokenizer processes names in the following three stages:

1. During the **Sanitizer step** incoming names are cleaned up and converted to
   **full names**. This step can be used to regularize spelling, split multi-name
   tags into their parts and tag names with additional attributes. See the
   [Sanitizers section](#sanitizers) below for available cleaning routines.
2. The **Normalization** part removes all information from the full names
   that is not relevant for search.
3. The **Token analysis** step takes the normalized full names and creates
   all transliterated variants under which the name should be searchable.
   See the [Token analysis](#token-analysis) section below for more
   information.

During query time, only normalization and transliteration are relevant.
An incoming query is first split into name chunks (this usually means splitting
the string at the commas) and each part is normalized and transliterated.
The result is used to look up places in the search index.
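The three import stages above can be pictured with a heavily simplified Python sketch. All function names here are illustrative stand-ins, not the actual Nominatim API:

```python
# Illustrative sketch of the three import stages; names and behaviour
# are simplified stand-ins, not the actual Nominatim implementation.

def sanitize(tags):
    """Sanitizer step: turn raw tags into a list of full names."""
    full_names = []
    for value in tags.values():
        # e.g. split multi-name values such as "Biel/Bienne"
        full_names.extend(part.strip() for part in value.split('/'))
    return full_names

def normalize(name):
    """Normalization: keep only search-relevant information."""
    return ' '.join(name.lower().split())

def analyze(name):
    """Token analysis: create the searchable (transliterated) variants."""
    return {name}  # identity transliteration in this sketch

tags = {'name': 'Biel/Bienne'}
index_terms = set()
for full_name in sanitize(tags):
    index_terms |= analyze(normalize(full_name))

print(sorted(index_terms))  # ['biel', 'bienne']
```

At query time only the normalize/analyze half of this pipeline would run on each comma-separated chunk of the query.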
### Configuration
@@ -93,21 +94,36 @@ normalization:

```
transliteration:
    - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
    - ":: Ascii ()"
sanitizers:
    - step: split-name-list
token-analysis:
    - analyzer: generic
      variants:
          - !include icu-rules/variants-ca.yaml
          - words:
              - road -> rd
              - bridge -> bdge,br,brdg,bri,brg
```
The configuration file contains four sections:
`normalization`, `transliteration`, `sanitizers` and `token-analysis`.

#### Normalization and Transliteration

The normalization and transliteration sections each define a set of
ICU rules that are applied to the names.

The **normalization** rules are applied after sanitization. They should remove
any information that is not relevant for search at all. Usual rules to be
applied here are: lower-casing, removal of special characters, cleanup of
spaces.

The **transliteration** rules are applied at the end of the tokenization
process to transfer the name into an ASCII representation. Transliteration can
be useful to allow for further fuzzy matching, especially between different
scripts.

Each section must contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from external yaml files using the
@@ -119,6 +135,85 @@ and may again include other files.

YAML syntax. You should therefore always enclose the ICU rules in
double-quotes.

#### Sanitizers

The sanitizers section defines an ordered list of functions that are applied
to the name and address tags before they are further processed by the tokenizer.
They allow cleaning up the tagging and bringing it into a standardized form more
suitable for building the search index.

!!! hint
    Sanitizers only have an effect on how the search index is built. They
    do not change the information about each place that is saved in the
    database. In particular, they have no influence on how the results are
    displayed. The returned results always show the original information as
    stored in the OpenStreetMap database.

Each entry contains information about a sanitizer to be applied. It has a
mandatory parameter `step` which gives the name of the sanitizer. Depending
on the type, it may have additional parameters to configure its operation.

The order of the list matters. The sanitizers are applied exactly in the order
that is configured. Each sanitizer works on the results of the previous one.

The following is a list of sanitizers that are shipped with Nominatim.
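The ordering rule can be modelled roughly like this; the function names and behaviour are hypothetical simplifications, not the real sanitizer interface:

```python
# Each sanitizer consumes the output of the previous one; order matters.

def split_name_list(names):
    """Split multi-value names at ';' into separate entries."""
    result = []
    for name in names:
        result.extend(part.strip() for part in name.split(';'))
    return result

def strip_brace_terms(names):
    """Add a variant with the bracketed addendum removed."""
    result = list(names)
    for name in names:
        if '(' in name:
            result.append(name.split('(')[0].strip())
    return result

pipeline = [split_name_list, strip_brace_terms]

names = ['Halle (Saale)']
for sanitizer in pipeline:
    names = sanitizer(names)

print(names)  # ['Halle (Saale)', 'Halle']
```

Swapping the two steps would change the result, which is why the configured order is significant.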

##### split-name-list

::: nominatim.tokenizer.sanitizers.split_name_list
    selection:
        members: False
    rendering:
        heading_level: 6
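For illustration, splitting with the default `,;` delimiter set could look like this sketch (not the module's actual code):

```python
import re

# Build a splitting regex from the configured delimiter set
# (default ',;'), discarding whitespace around the delimiters.
delimiters = ',;'
splitter = re.compile(r'\s*[{}]\s*'.format(re.escape(delimiters)))

parts = splitter.split('Paris, Lutetia;Parigi')
print(parts)  # ['Paris', 'Lutetia', 'Parigi']
```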

##### strip-brace-terms

::: nominatim.tokenizer.sanitizers.strip_brace_terms
    selection:
        members: False
    rendering:
        heading_level: 6

##### tag-analyzer-by-language

::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language
    selection:
        members: False
    rendering:
        heading_level: 6
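The suffix-based tagging can be pictured with this hedged sketch; the real module additionally handles country defaults, whitelists and the append/replace modes:

```python
import re

# A name tag like 'name:de' carries a 2- or 3-letter language suffix;
# the sanitizer derives the analyzer to use from that suffix.
SUFFIX_RE = re.compile(r':(?P<lang>[a-z]{2,3})$')

def analyzer_for(kind):
    match = SUFFIX_RE.search(kind)
    return match.group('lang') if match else None

print(analyzer_for('name:de'))  # de
print(analyzer_for('name'))     # None
```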

#### Token Analysis

Token analyzers take a full name and transform it into one or more normalized
forms that are then saved in the search index. In its simplest form, the
analyzer only applies the transliteration rules. More complex analyzers
create additional spelling variants of a name. This is useful to handle
decomposition and abbreviation.

The ICU tokenizer may use different analyzers for different names. To select
the analyzer to be used, the name must be tagged with the `analyzer` attribute
by a sanitizer (see for example the
[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).

The token-analysis section contains the list of configured analyzers. Each
analyzer must have an `id` parameter that uniquely identifies the analyzer.
The only exception is the default analyzer that is used when no special
analyzer was selected.

Different analyzer implementations may exist. To select the implementation,
the `analyzer` parameter must be set. Currently there is only one
implementation, `generic`, which is described in the following.

##### Generic token analyzer

The generic analyzer is able to create variants from a list of given
abbreviation and decomposition replacements. It takes one optional parameter
`variants` which lists the replacements to apply. If the section is
omitted, then the generic analyzer becomes a simple analyzer that only
applies the transliteration.
The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
@@ -144,7 +239,7 @@ term.

words in the configuration because then it is possible to change the
rules for normalization later without having to adapt the variant rules.
###### Decomposition

In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:
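As a hedged sketch of what a suffix rule such as `~strasse => str` can produce, the term matches whether it is attached to the previous word or stands alone, and variants are created for both spellings (the real matcher is considerably more involved):

```python
# Sketch: a '~' suffix rule matches the term attached to the previous
# word or standing alone, producing variants for both spellings.
def suffix_variants(name, source='strasse', target='str'):
    if not name.endswith(source):
        return {name}
    stem = name[:-len(source)].strip()
    if not stem:
        return {source, target}
    return {stem + source, stem + target,
            stem + ' ' + source, stem + ' ' + target}

print(sorted(suffix_variants('hauptstrasse')))
# ['haupt str', 'haupt strasse', 'hauptstr', 'hauptstrasse']
```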
@@ -171,7 +266,7 @@ To avoid automatic decomposition, use the '|' notation:

simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
###### Initial and final terms

It is also possible to restrict replacements to the beginning and end of a
name:
@@ -184,7 +279,7 @@ name:

So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".
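The anchoring behaviour can be mimicked with a regular expression; this is illustrative only, not how the analyzer is implemented:

```python
import re

# A '^'-anchored rule such as '^ south => s' only fires when the
# term starts the name.
def replace_initial(name, source='south', target='s'):
    return re.sub(r'^{}\b'.format(source), target, name)

print(replace_initial('south 45th street'))           # s 45th street
print(replace_initial('the south beach restaurant'))  # unchanged
```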
###### Replacements vs. variants

The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
@@ -1,5 +1,9 @@

"""
Sanitizer that splits lists of names into their components.

Arguments:
    delimiters: Define the set of characters to be used for
                splitting the list. (default: `,;`)
"""
import re
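Based on the `delimiters` argument documented above, a configuration overriding the default delimiter set might look like this (the delimiter value is an illustrative choice):

```
sanitizers:
    - step: split-name-list
      delimiters: ";/"
```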
@@ -7,9 +11,7 @@ from nominatim.errors import UsageError

def create(func):
    """ Create a name processing function that splits name values with
        multiple values into their components.
    """
    delimiter_set = set(func.get('delimiters', ',;'))
    if not delimiter_set:
@@ -1,11 +1,12 @@

"""
This sanitizer creates additional name variants for names that have
addendums in brackets (e.g. "Halle (Saale)"). The additional variant contains
only the main name part with the bracket part removed.
"""

def create(_):
    """ Create a name processing function that creates additional name variants
        for bracket addendums.
    """
    def _process(obj):
        """ Add variants for names that have a bracket extension.
@@ -1,5 +1,28 @@

"""
This sanitizer sets the `analyzer` property depending on the
language of the tag. The language is taken from the suffix of the name.
If a name already has an analyzer tagged, then this is kept.

Arguments:
    filter-kind: Restrict the names the sanitizer should be applied to
                 to the given tags. The parameter expects a list of
                 regular expressions which are matched against `kind`.
                 Note that a match against the full string is expected.
    whitelist: Restrict the set of languages that should be tagged.
               Expects a list of acceptable suffixes. When unset,
               all 2- and 3-letter lower-case codes are accepted.
    use-defaults: Configure what happens when the name has no suffix.
                  When set to 'all', a variant is created for
                  each of the default languages in the country
                  the feature is in. When set to 'mono', a variant is
                  only created when exactly one language is spoken
                  in the country. The default is to do nothing with
                  the default languages of a country.
    mode: Define how the variants are created and may be 'replace' or
          'append'. When set to 'append' the original name (without
          any analyzer tagged) is retained. (default: replace)
"""
import re
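Putting the documented arguments together, a configuration for this sanitizer could look like the following; all values are illustrative choices, not recommended defaults:

```
sanitizers:
    - step: tag-analyzer-by-language
      filter-kind: [".*name.*"]
      whitelist: [de, fr, it]
      use-defaults: mono
      mode: append
```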
@@ -75,24 +98,6 @@ class _AnalyzerByLanguage:

def create(config):
    """ Create a function that sets the analyzer property depending on the
        language of the tag.
    """
    return _AnalyzerByLanguage(config)