# Customization of the Database

This section explains in detail how to configure a Nominatim import and
the various means to use external data.

## External postcode data

Nominatim creates a table of known postcode centroids during import. This table
is used for searches of postcodes and for adding postcodes to places where the
OSM data does not provide one. These postcode centroids are mainly computed
from the OSM data itself. In addition, Nominatim supports reading postcode
information from an external CSV file, to supplement the postcodes that are
missing in OSM.

To enable external postcode support, simply put one CSV file per country into
your project directory and name it `<CC>_postcodes.csv`. `<CC>` must be the
two-letter country code for which the file provides data. The file may also be
gzipped, in which case it must be called `<CC>_postcodes.csv.gz`.

The CSV file must use commas as a delimiter and have a header line. Nominatim
expects three columns to be present: `postcode`, `lat` and `lon`. All other
columns are ignored. `lon` and `lat` must describe the x and y coordinates of the
postcode centroids in WGS84.
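
For illustration, a `de_postcodes.csv` file could start like this (the rows
shown are examples only):

```
postcode,lat,lon
01067,51.0605,13.7264
01069,51.0393,13.7380
```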

The postcode files are loaded only when there is data for the given country
in your database. For example, if there is a `us_postcodes.csv` file in your
project directory but you import only an excerpt of Italy, then the US postcodes
will simply be ignored.

As a rule, the external postcode data should be put into the project directory
**before** starting the initial import. Still, you can add, remove and update the
external postcode data at any time. Simply run:

```
nominatim refresh --postcodes
```

to make the changes visible in your database. Be aware, however, that the changes
only have an immediate effect on searches for postcodes. Postcodes that were
added to places are only updated when they are reindexed. That usually happens
only during replication updates.

## Installing Tiger housenumber data for the US

Nominatim is able to use the official [TIGER](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html)
address set to complement the OSM house number data in the US. You can add
TIGER data to your own Nominatim instance by following these steps. The
entire US adds about 10GB to your database.

1. Get preprocessed TIGER 2021 data:

        cd $PROJECT_DIR
        wget https://nominatim.org/data/tiger2021-nominatim-preprocessed.csv.tar.gz

2. Import the data into your Nominatim database:

        nominatim add-data --tiger-data tiger2021-nominatim-preprocessed.csv.tar.gz

3. Enable use of the Tiger data in your `.env` by adding:

        echo NOMINATIM_USE_US_TIGER_DATA=yes >> .env

4. Apply the new settings:

        nominatim refresh --functions

See the [developer's guide](../develop/data-sources.md#us-census-tiger) for more
information on how the data got preprocessed.

## Special phrases import

As described in the [Import chapter](Import.md), it is possible to
import special phrases from the wiki with the following command:

```sh
nominatim special-phrases --import-from-wiki
```

It is also possible to import phrases from a CSV file. To do so, use the
following command:

```sh
nominatim special-phrases --import-from-csv <csv file>
```
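
The file must contain a header and one phrase per line. A minimal example,
assuming the `phrase,class,type,operator` column layout (with `-` standing
for phrases without an operator), might look like this:

```
phrase,class,type,operator
restaurant,amenity,restaurant,-
bars,amenity,bar,-
```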

Note that the two previous import commands will replace the phrases in your
database. This means that if you import phrases from a CSV file, only the
phrases present in the CSV file will be kept in the database. All other phrases
will be removed.

If you want to only add new phrases and not update the existing ones, you can
add the argument `--no-replace` to the import command. For example:

```sh
nominatim special-phrases --import-from-csv <csv file> --no-replace
```

This will add the phrases present in the CSV file to the database without
removing the other ones.

# Tokenizers

The tokenizer module in Nominatim is responsible for analysing the names given
to OSM objects and the terms of an incoming query in order to make sure that
they can be matched appropriately.

Nominatim offers different tokenizer modules, which behave differently and have
different configuration options. This section describes the tokenizers and how
they can be configured.

!!! important
    The use of a tokenizer is tied to a database installation. You need to choose
    and configure the tokenizer before starting the initial import. Once the import
    is done, you cannot switch to another tokenizer anymore. Reconfiguring the
    chosen tokenizer is only possible to a very limited extent as well. See the
    comments in each tokenizer section.

## Legacy tokenizer

The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special PostgreSQL module to normalize names and queries.
This tokenizer is currently the default.

To enable the tokenizer, add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=legacy
```

The PostgreSQL module for the tokenizer is available in the `module` directory
and also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with

```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```

This is particularly useful when the database runs on a different server.
See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.

There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.

## ICU tokenizer

The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
normalize names and queries. It also offers configurable decomposition and
abbreviation handling.

To enable the tokenizer, add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=icu
```

### How it works

On import the tokenizer processes names in the following three stages:

1. During the **Sanitizer step** incoming names are cleaned up and converted to
   **full names**. This step can be used to regularize spelling, split multi-name
   tags into their parts and tag names with additional attributes. See the
   [Sanitizers section](#sanitizers) below for available cleaning routines.
2. The **Normalization** part removes all information from the full names
   that is not relevant for search.
3. The **Token analysis** step takes the normalized full names and creates
   all transliterated variants under which the name should be searchable.
   See the [Token analysis](#token-analysis) section below for more
   information.

During query time, only normalization and transliteration are relevant.
An incoming query is first split into name chunks (this usually means splitting
the string at the commas) and each part is normalised and transliterated.
The result is used to look up places in the search index.
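
As an illustration, assume the example configuration shown in the next section
and an OSM name `Hauptstraße;Berliner Straße`. The `split-name-list` sanitizer
splits it into the two full names `Hauptstraße` and `Berliner Straße`. The
normalization rules lower-case them and replace `ß`, giving `hauptstrasse` and
`berliner strasse`. Transliteration then produces the ASCII search forms, which
in this case are unchanged because the normalized names are already ASCII.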

### Configuration

The ICU tokenizer is configured using a YAML file, the location of which can
be set with `NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import
and then saved as part of the internal database status. Later changes to the
variable have no effect.
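
The variable is set in your project configuration like the other options,
for example:

```
NOMINATIM_TOKENIZER_CONFIG=<path to your YAML configuration file>
```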

Here is an example configuration file:

``` yaml
normalization:
    - ":: lower ()"
    - "ß > 'ss'" # German eszett is unambiguously equal to double ss
transliteration:
    - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
    - ":: Ascii ()"
sanitizers:
    - step: split-name-list
token-analysis:
    - analyzer: generic
      variants:
          - !include icu-rules/variants-ca.yaml
          - words:
              - road -> rd
              - bridge -> bdge,br,brdg,bri,brg
```

The configuration file contains four sections:
`normalization`, `transliteration`, `sanitizers` and `token-analysis`.

#### Normalization and Transliteration

The normalization and transliteration sections each define a set of
ICU rules that are applied to the names.

The **normalization** rules are applied after sanitation. They should remove
any information that is not relevant for search at all. Usual rules to be
applied here are: lower-casing, removal of special characters, cleanup of
spaces.

The **transliteration** rules are applied at the end of the tokenization
process to transfer the name into an ASCII representation. Transliteration can
be useful to allow for further fuzzy matching, especially between different
scripts.

Each section must contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
`!include` tag. The included file must contain a valid YAML list of ICU rules
and may again include other files.
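
As a minimal sketch (not the shipped rule set), an included file could look
like this; the three rules decompose characters, strip the combining marks
(i.e. accents) and recompose the remainder:

``` yaml
- ":: NFD ()"
- "[[:Nonspacing Mark:]] >"
- ":: NFC ()"
```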

!!! warning
    The ICU rule syntax contains special characters that conflict with the
    YAML syntax. You should therefore always enclose the ICU rules in
    double-quotes.

#### Sanitizers

The sanitizers section defines an ordered list of functions that are applied
to the name and address tags before they are further processed by the tokenizer.
They allow you to clean up the tagging and bring it into a standardized form more
suitable for building the search index.

!!! hint
    Sanitizers only have an effect on how the search index is built. They
    do not change the information about each place that is saved in the
    database. In particular, they have no influence on how the results are
    displayed. The returned results always show the original information as
    stored in the OpenStreetMap database.

Each entry contains information about a sanitizer to be applied. It has a
mandatory parameter `step` which gives the name of the sanitizer. Depending
on the type, it may have additional parameters to configure its operation.

The order of the list matters. The sanitizers are applied exactly in the order
that is configured. Each sanitizer works on the results of the previous one.
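
For example, a configuration could chain two of the sanitizers listed below
and tweak a sanitizer-specific option; the `delimiters` parameter shown here
for `split-name-list` defines the characters at which names are split:

``` yaml
sanitizers:
    - step: strip-brace-terms
    - step: split-name-list
      delimiters: ";,"
```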

The following is a list of sanitizers that are shipped with Nominatim.

##### split-name-list

::: nominatim.tokenizer.sanitizers.split_name_list
    selection:
        members: False
    rendering:
        heading_level: 6

##### strip-brace-terms

::: nominatim.tokenizer.sanitizers.strip_brace_terms
    selection:
        members: False
    rendering:
        heading_level: 6

##### tag-analyzer-by-language

::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language
    selection:
        members: False
    rendering:
        heading_level: 6

#### Token Analysis

Token analyzers take a full name and transform it into one or more normalized
forms that are then saved in the search index. In its simplest form, the
analyzer only applies the transliteration rules. More complex analyzers
create additional spelling variants of a name. This is useful to handle
decomposition and abbreviation.

The ICU tokenizer may use different analyzers for different names. To select
the analyzer to be used, the name must be tagged with the `analyzer` attribute
by a sanitizer (see for example the
[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).

The token-analysis section contains the list of configured analyzers. Each
analyzer must have an `id` parameter that uniquely identifies the analyzer.
The only exception is the default analyzer, which is used when no special
analyzer was selected.

Different analyzer implementations may exist. To select the implementation,
the `analyzer` parameter must be set. Currently there is only one
implementation, `generic`, which is described in the following.
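
As a sketch, a `token-analysis` section with a default analyzer plus a second
analyzer for names tagged with the id `de` (as a sanitizer like
[tag-analyzer-by-language](#tag-analyzer-by-language) might assign) could look
like this; the variant rule is only an example:

``` yaml
token-analysis:
    - analyzer: generic
    - id: de
      analyzer: generic
      variants:
          - words:
              - ~straße -> str
```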

##### Generic token analyzer

The generic analyzer is able to create variants from a list of given
abbreviation and decomposition replacements. It takes one optional parameter
`variants` which lists the replacements to apply. If the section is
omitted, then the generic analyzer becomes a simple analyzer that only
applies the transliteration.

The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
string is reached.

The variants section must contain a list of replacement groups. Each group
defines a set of properties that describes where the replacements are
applicable. In addition, the `words` section defines the list of replacements
to be made. The basic replacement description is of the form:

```
<source>[,<source>[...]] => <target>[,<target>[...]]
```

The left side contains one or more `source` terms to be replaced. The right side
lists one or more replacements. Each source is replaced with each replacement
term.

!!! tip
    The source and target terms are internally normalized using the
    normalization rules given in the configuration. This ensures that the
    strings match as expected. In fact, it is better to use unnormalized
    words in the configuration because then it is possible to change the
    rules for normalization later without having to adapt the variant rules.

###### Decomposition

In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:

``` yaml
- ~strasse => str # matches "strasse" as full word and in suffix position
- hinter~ => hntr # matches "hinter" as full word and in prefix position
```

There is no facility to match a string in the middle of the word. The suffix
and prefix notation automatically trigger the decomposition mode: two variants
are created for each replacement, one with the replacement attached to the word
and one separate. So in the above example, the tokenization of "hauptstrasse" will
create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
triggers the variants "rote str" and "rotestr". By having decomposition work
both ways, it is sufficient to create the variants at index time. The variant
rules are not applied at query time.

To avoid automatic decomposition, use the '|' notation:

``` yaml
- ~strasse |=> str
```

simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".

###### Initial and final terms

It is also possible to restrict replacements to the beginning and end of a
name:

``` yaml
- ^south => s # matches only at the beginning of the name
- road$ => rd # matches only at the end of the name
```

So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".

###### Replacements vs. variants

The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
have to write `source => source,target`. As this is a frequent case, there is
a shortcut notation for it:

```
<source>[,<source>[...]] -> <target>[,<target>[...]]
```

The simple arrow causes an additional variant to be added. Note that
decomposition has an effect here on the source as well. So a rule

``` yaml
- "~strasse -> str"
```

means that for a word like `hauptstrasse` four variants are created:
`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.

### Reconfiguration

Changing the configuration after the import is currently not possible, although
this feature may be added at a later time.