add documentation for postcode customization

2026-03-09 11:34:07 +00:00 · 2022-06-20 17:42:12 +02:00
parent 7f2ad4ac7e
commit 5be320368c
4 changed files with 176 additions and 2 deletions
--- a/docs/customize/Country-Settings.md
+++ b/docs/customize/Country-Settings.md
@@ -0,0 +1,149 @@
 # Customizing Per-Country Data
 Whenever an OSM object is imported into Nominatim, the object is first assigned
 a country. Nominatim can use this information to adapt various aspects of
 the address computation to the local customs of the country. This section
 explains how country assignment works and the principal per-country
 localizations.
 ## Country assignment
 Countries are assigned on the basis of country data from the OpenStreetMap
 input data itself. Countries are expected to be tagged according to the
 [administrative boundary schema](https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative):
 an OSM relation with `boundary=administrative` and `admin_level=2`. Nominatim
 uses the country code to distinguish the countries.
 If there is no country data available for a point, then Nominatim uses the
 fallback data imported from `data/country_osm_grid.sql.gz`. This was computed
 from OSM data as well but is guaranteed to cover all countries.
 Some OSM objects may also be located outside any country, for example a buoy
 in the middle of the ocean. These objects do not get any country assigned and
 get a default treatment when it comes to localized handling of data.
 ## Per-country settings
 ### Global country settings
 The main place to configure settings per country is the file
 `settings/country_settings.yaml`. This file has one section per country that
 is recognised by Nominatim. Each section is tagged with the country code
 (in lower case) and contains the different localization information. Only
 countries which are listed in this file are taken into account for computations.
 For example, the section for Andorra looks like this:
 ```
 ad:
    partition: 35
    languages: ca
    names: !include country-names/ad.yaml
    postcode:
      pattern: "(ddd)"
      output: AD\1
 ```
 The individual settings are described below.
 #### `partition`
 Nominatim internally splits the data into multiple tables to improve
 performance. The partition number tells Nominatim into which table to put
 the country. This is purely internal management and has no effect on the
 output data.
 The default is to have one partition per country.
 #### `languages`
 A comma-separated list of ISO-639 language codes of default languages in the
 country. These are the languages used in name tags without a language suffix.
 Note that this is not necessarily the same as the list of official languages
 in the country. There may be officially recognised languages in a country
 which are only ever used in name tags with the appropriate language suffixes.
 Conversely, a non-official language may appear a lot in the name tags, for
 example when used as an unofficial Lingua Franca.
 List the languages in order of frequency of appearance with the most frequently
 used language first. It is not recommended to add languages when there are only
 very few occurrences.
 If only one language is listed, then Nominatim will 'auto-complete' the
 language of names without an explicit language-suffix.
 #### `names`
 List of names of the country and its translations. These names are used as
 a baseline. It is always possible to search countries by the given names, no
 matter what other names are in the OSM data. They are also used as a fallback
 when a needed translation is not available.
 !!! Note
    The list of names per country is currently fairly large because Nominatim
    supports translations in many languages by default. That is why the
    name lists have been separated out into extra files. You can find the
    name lists in the file `settings/country-names/<country code>.yaml`.
    The names section in the main country settings file only refers to these
    files via the special `!include` directive.
 #### `postcode`
 Describes the format of the postcode that is in use in the country.
 When a country has no official postcodes, set this to `no`. Example:
 ```
 ae:
    postcode: no
 ```
 When a country has a postcode, you need to state the postcode pattern and
 the default output format. Example:
 ```
 bm:
    postcode:
      pattern: "(ll)[ -]?(dd)"
      output: \1 \2
 ```
 The **pattern** is a regular expression that describes the possible formats
 accepted as a postcode. The pattern follows the standard syntax for
 [regular expressions in Python](https://docs.python.org/3/library/re.html#regular-expression-syntax)
 with two extra shortcuts: `d` is a shortcut for a single digit (`[0-9]`)
 and `l` for a single ASCII letter (`[A-Z]`).
 Use match groups to indicate groups in the postcode that may optionally be
 separated with a space or a hyphen.
 For example, the postcode for Bermuda above always consists of two letters
 and two digits. They may optionally be separated by a space or hyphen. That
 means that Nominatim will consider `AB56`, `AB 56` and `AB-56` to be spelling
 variants of one and the same postcode.
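The shortcut expansion described above can be sketched in a few lines of Python. This is an illustration only, not Nominatim's actual implementation: the helper name `expand_shortcuts` is hypothetical, and the naive string replacement assumes that the pattern contains no other literal `d` or `l` characters.

```python
import re

def expand_shortcuts(pattern: str) -> str:
    """Replace the shortcuts `d` and `l` with their character classes.

    Hypothetical helper: a naive replacement that is only safe when the
    pattern contains no other literal 'd' or 'l'.
    """
    return pattern.replace('d', '[0-9]').replace('l', '[A-Z]')

# Pattern taken from the Bermuda example.
bermuda = re.compile(expand_shortcuts('(ll)[ -]?(dd)'))

# The first three candidates are accepted as variants, the last two are not.
for candidate in ('AB56', 'AB 56', 'AB-56', 'AB_56', 'A56'):
    print(candidate, bool(bermuda.fullmatch(candidate)))
```

The sketch uses `fullmatch` so that trailing or leading characters cause a rejection instead of a partial match.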
 Never add the country code in front of the postcode pattern. Nominatim will
 automatically accept variants with a country code prefix for all postcodes.
 The **output** field is an optional field that describes what the canonical
 spelling of the postcode should be. The format is the
 [regular expression expand syntax](https://docs.python.org/3/library/re.html#re.Match.expand) referring back to the bracket groups in the pattern.
 Most simple postcodes only have one spelling variant. In that case, the
 **output** can be omitted. The postcode will simply be used as is.
 In the Bermuda example above, the canonical spelling would be to have a space
 between letters and digits.
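As a rough sketch of how an **output** template maps every accepted variant to one canonical spelling, the following uses Python's `re.Match.expand` directly. The names `PATTERN`, `OUTPUT` and `canonical` are made up for this illustration, and the shortcuts are written out by hand; Nominatim's own code may apply further normalization that is not shown here.

```python
import re
from typing import Optional

# Bermuda pattern with the `d` and `l` shortcuts already written out.
PATTERN = re.compile(r'([A-Z]{2})[ -]?([0-9]{2})')
# The `output` template from the Bermuda example.
OUTPUT = r'\1 \2'

def canonical(postcode: str) -> Optional[str]:
    """Return the canonical spelling, or None if the pattern does not match."""
    match = PATTERN.fullmatch(postcode)
    return match.expand(OUTPUT) if match else None

# All three spelling variants collapse to the same canonical form.
assert canonical('AB56') == canonical('AB-56') == canonical('AB 56') == 'AB 56'
```

Without a template that rebuilds the postcode from its groups, `AB56` and `AB-56` would remain two different strings, which is the situation the warning below refers to.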
 !!! Warning
    When your postcode pattern covers multiple variants of the postcode, then
    you must explicitly state the canonical output or Nominatim will not
    handle the variations correctly.
 ### Other country-specific configuration
 There are some other configuration files where you can set localized settings
 according to the assigned country. These are:
 * [Place ranking configuration](Ranking.md)
 Please see the linked documentation sections for more information.
--- a/docs/customize/Tokenizers.md
+++ b/docs/customize/Tokenizers.md
@@ -205,6 +205,14 @@ The following is a list of sanitizers that are shipped with Nominatim.
    rendering:
        heading_level: 6
 ##### clean-postcodes
 ::: nominatim.tokenizer.sanitizers.clean_postcodes
    selection:
        members: False
    rendering:
        heading_level: 6
 #### Token Analysis
@@ -222,8 +230,12 @@ by a sanitizer (see for example the
 The token-analysis section contains the list of configured analyzers. Each
 analyzer must have an `id` parameter that uniquely identifies the analyzer.
 The only exception is the default analyzer that is used when no special
-analyzer was selected. There is one special id '@housenumber'. If an analyzer
+analyzer was selected. There are analyzers with special ids:
-with that name is present, it is used for normalization of house numbers.
+
 * '@housenumber'. If an analyzer with that name is present, it is used
   for normalization of house numbers.
 * '@postcode'. If an analyzer with that name is present, it is used
   for normalization of postcodes.
 Different analyzer implementations may exist. To select the implementation,
 the `analyzer` parameter must be set. The different implementations are
@@ -356,6 +368,14 @@ house numbers of the form '3 a', '3A', '3-A' etc. are all considered equivalent.
 The analyzer cannot be customized.
 ##### Postcode token analyzer
 The analyzer `postcodes` is purpose-made to analyze postcodes. It supports
 a 'lookup' variant of the token, which produces variants with optional
 spaces. Use it together with the clean-postcodes sanitizer.
 The analyzer cannot be customized.
 ### Reconfiguration
 Changing the configuration after the import is currently not possible, although
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -28,6 +28,7 @@ pages:
        - 'Overview': 'customize/Overview.md'
        - 'Import Styles': 'customize/Import-Styles.md'
        - 'Configuration Settings': 'customize/Settings.md'
        - 'Per-Country Data': 'customize/Country-Settings.md'
        - 'Place Ranking' : 'customize/Ranking.md'
        - 'Tokenizers' : 'customize/Tokenizers.md'
        - 'Special Phrases': 'customize/Special-Phrases.md'
--- a/nominatim/tokenizer/sanitizers/clean_postcodes.py
+++ b/nominatim/tokenizer/sanitizers/clean_postcodes.py
@@ -15,6 +15,10 @@ Arguments:
                        postcode centroids of a country but is still searchable.
                        When set to 'no', non-conforming postcodes are not
                        searchable either.
    default-pattern:    Pattern to use, when there is none available for the
                        country in question. Warning: will not be used for
                        objects that have no country assigned. These are always
                        assumed to have no postcode.
 """
 from nominatim.data.postcode_format import PostcodeFormatter