# Customization of the Database

This section explains in detail how to configure a Nominatim import and
the various means to use external data.

## External postcode data

Nominatim creates a table of known postcode centroids during import. This table
is used for searches of postcodes and for adding postcodes to places where the
OSM data does not provide one. These postcode centroids are mainly computed
from the OSM data itself. In addition, Nominatim supports reading postcode
information from an external CSV file, to supplement the postcodes that are
missing in OSM.

To enable external postcode support, simply put one CSV file per country into
your project directory and name it `<CC>_postcodes.csv`. `<CC>` must be the
two-letter country code for which the file provides data. The file may also be
gzipped, in which case it must be called `<CC>_postcodes.csv.gz`.

The CSV file must use commas as a delimiter and have a header line. Nominatim
expects three columns to be present: `postcode`, `lat` and `lon`. All other
columns are ignored. `lon` and `lat` must describe the x and y coordinates of the
postcode centroids in WGS84.
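
For illustration, a `de_postcodes.csv` file could start like this (the rows
shown are examples only):

```
postcode,lat,lon
01067,51.0605,13.7264
01069,51.0393,13.7380
```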

The postcode files are loaded only when there is data for the given country
in your database. For example, if there is a `us_postcodes.csv` file in your
project directory but you import only an excerpt of Italy, then the US postcodes
will simply be ignored.

As a rule, the external postcode data should be put into the project directory
**before** starting the initial import. Still, you can add, remove and update the
external postcode data at any time. Simply run:

```
nominatim refresh --postcodes
```

to make the changes visible in your database. Be aware, however, that the changes
only have an immediate effect on searches for postcodes. Postcodes that were
added to places are only updated when they are reindexed. That usually happens
only during replication updates.

## Installing Tiger housenumber data for the US

Nominatim is able to use the official [TIGER](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html)
address set to complement the OSM house number data in the US. You can add
TIGER data to your own Nominatim instance by following these steps. The
entire US adds about 10GB to your database.

1. Get preprocessed TIGER 2021 data:

        cd $PROJECT_DIR
        wget https://nominatim.org/data/tiger2021-nominatim-preprocessed.csv.tar.gz

2. Import the data into your Nominatim database:

        nominatim add-data --tiger-data tiger2021-nominatim-preprocessed.csv.tar.gz

3. Enable use of the Tiger data in your `.env` by adding:

        echo NOMINATIM_USE_US_TIGER_DATA=yes >> .env

4. Apply the new settings:

        nominatim refresh --functions

See the [developer's guide](../develop/data-sources.md#us-census-tiger) for more
information on how the data got preprocessed.

## Special phrases import

As described in the [Import chapter](Import.md), it is possible to
import special phrases from the wiki with the following command:

```sh
nominatim special-phrases --import-from-wiki
```

It is also possible to import phrases from a CSV file. To do so, use the
following command:

```sh
nominatim special-phrases --import-from-csv <csv file>
```
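
The file must contain a header and one phrase per line. A minimal example,
assuming the `phrase,class,type,operator` column layout (with `-` standing
for phrases without an operator), might look like this:

```
phrase,class,type,operator
restaurant,amenity,restaurant,-
bars,amenity,bar,-
```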

Note that the two previous import commands will replace the phrases in your
database. This means that if you import phrases from a CSV file, only the
phrases present in the CSV file will be kept in the database. All other phrases
will be removed.

If you want to only add new phrases and not update the existing ones, you can
add the argument `--no-replace` to the import command. For example:

```sh
nominatim special-phrases --import-from-csv <csv file> --no-replace
```

This will add the phrases present in the CSV file to the database without
removing the other ones.

# Tokenizers

The tokenizer module in Nominatim is responsible for analysing the names given
to OSM objects and the terms of an incoming query in order to make sure that
they can be matched appropriately.

Nominatim offers different tokenizer modules, which behave differently and have
different configuration options. This section describes the tokenizers and how
they can be configured.

!!! important
    The use of a tokenizer is tied to a database installation. You need to choose
    and configure the tokenizer before starting the initial import. Once the import
    is done, you cannot switch to another tokenizer anymore. Reconfiguring the
    chosen tokenizer is only possible to a very limited extent as well. See the
    comments in each tokenizer section.

## Legacy tokenizer

The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special PostgreSQL module to normalize names and queries.
This tokenizer is currently the default.

To enable the tokenizer, add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=legacy
```

The PostgreSQL module for the tokenizer is available in the `module` directory
and also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with

```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```

This is particularly useful when the database runs on a different server.
See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.

There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.

## ICU tokenizer

The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
normalize names and queries. It also offers configurable decomposition and
abbreviation handling.

To enable the tokenizer, add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=icu
```

### How it works

On import the tokenizer processes names in the following three stages:

1. During the **Sanitizer step** incoming names are cleaned up and converted to
   **full names**. This step can be used to regularize spelling, split multi-name
   tags into their parts and tag names with additional attributes. See the
   [Sanitizers section](#sanitizers) below for available cleaning routines.
2. The **Normalization** part removes all information from the full names
   that is not relevant for search.
3. The **Token analysis** step takes the normalized full names and creates
   all transliterated variants under which the name should be searchable.
   See the [Token analysis](#token-analysis) section below for more
   information.

During query time, only normalization and transliteration are relevant.
An incoming query is first split into name chunks (this usually means splitting
the string at the commas) and each part is normalised and transliterated.
The result is used to look up places in the search index.
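
As an illustration, assume the example configuration shown in the next section
and an OSM name `Hauptstraße;Berliner Straße`. The `split-name-list` sanitizer
splits it into the two full names `Hauptstraße` and `Berliner Straße`. The
normalization rules lower-case them and replace `ß`, giving `hauptstrasse` and
`berliner strasse`. Transliteration then produces the ASCII search forms, which
in this case are unchanged because the normalized names are already ASCII.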

### Configuration

The ICU tokenizer is configured using a YAML file, the location of which can
be set with `NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import
and then saved as part of the internal database status. Later changes to the
variable have no effect.
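
The variable is set in your project configuration like the other options,
for example:

```
NOMINATIM_TOKENIZER_CONFIG=<path to your YAML configuration file>
```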

Here is an example configuration file:

``` yaml
normalization:
    - ":: lower ()"
    - "ß > 'ss'" # German eszett is unambiguously equal to double ss
transliteration:
    - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
    - ":: Ascii ()"
sanitizers:
    - step: split-name-list
token-analysis:
    - analyzer: generic
      variants:
          - !include icu-rules/variants-ca.yaml
          - words:
              - road -> rd
              - bridge -> bdge,br,brdg,bri,brg
```

The configuration file contains four sections:
`normalization`, `transliteration`, `sanitizers` and `token-analysis`.

#### Normalization and Transliteration

The normalization and transliteration sections each define a set of
ICU rules that are applied to the names.

The **normalization** rules are applied after sanitation. They should remove
any information that is not relevant for search at all. Usual rules to be
applied here are: lower-casing, removal of special characters, cleanup of
spaces.

The **transliteration** rules are applied at the end of the tokenization
process to transfer the name into an ASCII representation. Transliteration can
be useful to allow for further fuzzy matching, especially between different
scripts.

Each section must contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
`!include` tag. The included file must contain a valid YAML list of ICU rules
and may again include other files.
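
As a minimal sketch (not the shipped rule set), an included file could look
like this; the three rules decompose characters, strip the combining marks
(i.e. accents) and recompose the remainder:

``` yaml
- ":: NFD ()"
- "[[:Nonspacing Mark:]] >"
- ":: NFC ()"
```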

!!! warning
    The ICU rule syntax contains special characters that conflict with the
    YAML syntax. You should therefore always enclose the ICU rules in
    double-quotes.

#### Sanitizers

The sanitizers section defines an ordered list of functions that are applied
to the name and address tags before they are further processed by the tokenizer.
They allow you to clean up the tagging and bring it into a standardized form more
suitable for building the search index.

!!! hint
    Sanitizers only have an effect on how the search index is built. They
    do not change the information about each place that is saved in the
    database. In particular, they have no influence on how the results are
    displayed. The returned results always show the original information as
    stored in the OpenStreetMap database.

Each entry contains information about a sanitizer to be applied. It has a
mandatory parameter `step` which gives the name of the sanitizer. Depending
on the type, it may have additional parameters to configure its operation.

The order of the list matters. The sanitizers are applied exactly in the order
that is configured. Each sanitizer works on the results of the previous one.
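
For example, a configuration could chain two of the sanitizers listed below
and tweak a sanitizer-specific option; the `delimiters` parameter shown here
for `split-name-list` defines the characters at which names are split:

``` yaml
sanitizers:
    - step: strip-brace-terms
    - step: split-name-list
      delimiters: ";,"
```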

The following is a list of sanitizers that are shipped with Nominatim.

##### split-name-list

::: nominatim.tokenizer.sanitizers.split_name_list
    selection:
        members: False
    rendering:
        heading_level: 6

##### strip-brace-terms

::: nominatim.tokenizer.sanitizers.strip_brace_terms
    selection:
        members: False
    rendering:
        heading_level: 6

##### tag-analyzer-by-language

::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language
    selection:
        members: False
    rendering:
        heading_level: 6

#### Token Analysis

Token analyzers take a full name and transform it into one or more normalized
forms that are then saved in the search index. In its simplest form, the
analyzer only applies the transliteration rules. More complex analyzers
create additional spelling variants of a name. This is useful to handle
decomposition and abbreviation.

The ICU tokenizer may use different analyzers for different names. To select
the analyzer to be used, the name must be tagged with the `analyzer` attribute
by a sanitizer (see for example the
[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).

The token-analysis section contains the list of configured analyzers. Each
analyzer must have an `id` parameter that uniquely identifies the analyzer.
The only exception is the default analyzer, which is used when no special
analyzer was selected.

Different analyzer implementations may exist. To select the implementation,
the `analyzer` parameter must be set. Currently there is only one
implementation, `generic`, which is described in the following.
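
As a sketch, a `token-analysis` section with a default analyzer plus a second
analyzer for names tagged with the id `de` (as a sanitizer like
[tag-analyzer-by-language](#tag-analyzer-by-language) might assign) could look
like this; the variant rule is only an example:

``` yaml
token-analysis:
    - analyzer: generic
    - id: de
      analyzer: generic
      variants:
          - words:
              - ~straße -> str
```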

##### Generic token analyzer

The generic analyzer is able to create variants from a list of given
abbreviation and decomposition replacements. It takes one optional parameter
`variants` which lists the replacements to apply. If the section is
omitted, then the generic analyzer becomes a simple analyzer that only
applies the transliteration.

The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
string is reached.

The variants section must contain a list of replacement groups. Each group
defines a set of properties that describes where the replacements are
applicable. In addition, the `words` section defines the list of replacements
to be made. The basic replacement description is of the form:

```
<source>[,<source>[...]] => <target>[,<target>[...]]
```

The left side contains one or more `source` terms to be replaced. The right side
lists one or more replacements. Each source is replaced with each replacement
term.

!!! tip
    The source and target terms are internally normalized using the
    normalization rules given in the configuration. This ensures that the
    strings match as expected. In fact, it is better to use unnormalized
    words in the configuration because then it is possible to change the
    rules for normalization later without having to adapt the variant rules.

###### Decomposition

In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:

``` yaml
- ~strasse => str # matches "strasse" as full word and in suffix position
- hinter~ => hntr # matches "hinter" as full word and in prefix position
```

There is no facility to match a string in the middle of the word. The suffix
and prefix notation automatically trigger the decomposition mode: two variants
are created for each replacement, one with the replacement attached to the word
and one separate. So in the above example, the tokenization of "hauptstrasse" will
create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
triggers the variants "rote str" and "rotestr". By having decomposition work
both ways, it is sufficient to create the variants at index time. The variant
rules are not applied at query time.

To avoid automatic decomposition, use the '|' notation:

``` yaml
- ~strasse |=> str
```

simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".

###### Initial and final terms

It is also possible to restrict replacements to the beginning and end of a
name:

``` yaml
- ^south => s # matches only at the beginning of the name
- road$ => rd # matches only at the end of the name
```

So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".

###### Replacements vs. variants

The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
have to write `source => source,target`. As this is a frequent case, there is
a shortcut notation for it:

```
<source>[,<source>[...]] -> <target>[,<target>[...]]
```

The simple arrow causes an additional variant to be added. Note that
decomposition has an effect here on the source as well. So a rule

``` yaml
- "~strasse -> str"
```

means that for a word like `hauptstrasse` four variants are created:
`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.

### Reconfiguration

Changing the configuration after the import is currently not possible, although
this feature may be added at a later time.