forked from hans/Nominatim
Merge pull request #2757 from lonvia/filter-postcodes
Add filtering, normalisation and variants for postcodes
@@ -13,4 +13,4 @@ ignored-classes=NominatimArgs,closing
 # 'too-many-ancestors' is triggered already by deriving from UserDict
 disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use
-good-names=i,x,y,fd,db
+good-names=i,x,y,fd,db,cc
docs/customize/Country-Settings.md (new file, 149 lines)
@@ -0,0 +1,149 @@
# Customizing Per-Country Data

Whenever an OSM object is imported into Nominatim, the object is first assigned
a country. Nominatim can use this information to adapt various aspects of
the address computation to the local customs of the country. This section
explains how country assignment works and the principal per-country
localizations.

## Country assignment

Countries are assigned on the basis of country data from the OpenStreetMap
input data itself. Countries are expected to be tagged according to the
[administrative boundary schema](https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative):
an OSM relation with `boundary=administrative` and `admin_level=2`. Nominatim
uses the country code to distinguish the countries.

If there is no country data available for a point, then Nominatim uses the
fallback data imported from `data/country_osm_grid.sql.gz`. This was computed
from OSM data as well but is guaranteed to cover all countries.

Some OSM objects may also be located outside any country, for example a buoy
in the middle of the ocean. These objects do not get any country assigned and
get a default treatment when it comes to localized handling of data.
## Per-country settings

### Global country settings

The main place to configure settings per country is the file
`settings/country_settings.yaml`. This file has one section per country that
is recognised by Nominatim. Each section is tagged with the country code
(in lower case) and contains the different localization information. Only
countries which are listed in this file are taken into account for computations.

For example, the section for Andorra looks like this:

```
partition: 35
languages: ca
names: !include country-names/ad.yaml
postcode:
    pattern: "(ddd)"
    output: AD\1
```

The individual settings are described below.
#### `partition`

Nominatim internally splits the data into multiple tables to improve
performance. The partition number tells Nominatim into which table to put
the country. This is purely internal management and has no effect on the
output data.

The default is to have one partition per country.
#### `languages`

A comma-separated list of ISO-639 language codes of the default languages in the
country. These are the languages used in name tags without a language suffix.
Note that this is not necessarily the same as the list of official languages
in the country. There may be officially recognised languages in a country
which are only ever used in name tags with the appropriate language suffixes.
Conversely, a non-official language may appear a lot in the name tags, for
example when used as an unofficial lingua franca.

List the languages in order of frequency of appearance with the most frequently
used language first. It is not recommended to add languages when there are only
very few occurrences.

If only one language is listed, then Nominatim will 'auto-complete' the
language of names without an explicit language suffix.
#### `names`

List of names of the country and its translations. These names are used as
a baseline. It is always possible to search countries by the given names, no
matter what other names are in the OSM data. They are also used as a fallback
when a needed translation is not available.

!!! Note
    The list of names per country is currently fairly large because Nominatim
    supports translations in many languages per default. That is why the
    name lists have been separated out into extra files. You can find the
    name lists in the file `settings/country-names/<country code>.yaml`.
    The names section in the main country settings file only refers to these
    files via the special `!include` directive.
#### `postcode`

Describes the format of the postcode that is in use in the country.

When a country has no official postcodes, set this to `no`. Example:

```
ae:
    postcode: no
```

When a country has a postcode, you need to state the postcode pattern and
the default output format. Example:

```
bm:
    postcode:
        pattern: "(ll)[ -]?(dd)"
        output: \1 \2
```
The **pattern** is a regular expression that describes the possible formats
accepted as a postcode. The pattern follows the standard syntax for
[regular expressions in Python](https://docs.python.org/3/library/re.html#regular-expression-syntax)
with two extra shortcuts: `d` is a shortcut for a single digit (`[0-9]`)
and `l` for a single ASCII letter (`[A-Z]`).

Use match groups to indicate groups in the postcode that may optionally be
separated with a space or a hyphen.

For example, the postcode for Bermuda above always consists of two letters
and two digits. They may optionally be separated by a space or hyphen. That
means that Nominatim will consider `AB56`, `AB 56` and `AB-56` spelling variants
of one and the same postcode.
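The shortcut expansion and the optional-separator matching can be pictured with a small Python sketch. The helper name `compile_postcode_pattern` is hypothetical; it only mirrors the documented `d`/`l` expansion, not Nominatim's internal code:

```python
import re

def compile_postcode_pattern(pattern: str) -> re.Pattern:
    # Expand the two documented shortcuts into ordinary character classes:
    # 'd' -> a single digit, 'l' -> a single ASCII letter.
    return re.compile(pattern.replace('d', '[0-9]').replace('l', '[A-Z]'))

# The Bermuda pattern from the example above: two letters, an optional
# space or hyphen, then two digits.
bm = compile_postcode_pattern('(ll)[ -]?(dd)')

for candidate in ('AB56', 'AB 56', 'AB-56'):
    m = bm.fullmatch(candidate)
    # All three spellings match and capture the same two groups.
    assert m is not None and m.groups() == ('AB', '56')
```

Note that the naive string replace only works because `d` and `l` do not occur with another meaning inside these patterns.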
Never add the country code in front of the postcode pattern. Nominatim will
automatically accept variants with a country code prefix for all postcodes.

The **output** field is an optional field that describes what the canonical
spelling of the postcode should be. The format is the
[regular expression expand syntax](https://docs.python.org/3/library/re.html#re.Match.expand)
referring back to the bracket groups in the pattern.
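The expand syntax can be demonstrated with plain `re` (an illustration only, not Nominatim's own code): the `output` string is applied to a successful match to produce the canonical spelling.

```python
import re

# Bermuda: two letters plus two digits, separator optional on input,
# but the canonical output always uses a single space ("\1 \2").
pattern = re.compile('([A-Z][A-Z])[ -]?([0-9][0-9])')

def canonical(postcode: str) -> str:
    m = pattern.fullmatch(postcode)
    if m is None:
        raise ValueError(f'not a valid postcode: {postcode!r}')
    # re.Match.expand substitutes \1, \2 with the captured groups.
    return m.expand(r'\1 \2')

# All three spelling variants normalize to the same canonical form.
assert {canonical('AB56'), canonical('AB 56'), canonical('AB-56')} == {'AB 56'}
```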
Most simple postcodes only have one spelling variant. In that case, the
**output** can be omitted. The postcode will simply be used as is.

In the Bermuda example above, the canonical spelling would be to have a space
between letters and digits.

!!! Warning
    When your postcode pattern covers multiple variants of the postcode, then
    you must explicitly state the canonical output or Nominatim will not
    handle the variations correctly.

### Other country-specific configuration

There are some other configuration files where you can set localized settings
according to the assigned country. These are:

* [Place ranking configuration](Ranking.md)

Please see the linked documentation sections for more information.
@@ -205,6 +205,14 @@ The following is a list of sanitizers that are shipped with Nominatim.
     rendering:
         heading_level: 6
 
+##### clean-postcodes
+
+::: nominatim.tokenizer.sanitizers.clean_postcodes
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
 
 #### Token Analysis
@@ -222,8 +230,12 @@ by a sanitizer (see for example the
 The token-analysis section contains the list of configured analyzers. Each
 analyzer must have an `id` parameter that uniquely identifies the analyzer.
 The only exception is the default analyzer that is used when no special
-analyzer was selected. There is one special id '@housenumber'. If an analyzer
-with that name is present, it is used for normalization of house numbers.
+analyzer was selected. There are analyzers with special ids:
+
+* '@housenumber'. If an analyzer with that name is present, it is used
+  for normalization of house numbers.
+* '@postcode'. If an analyzer with that name is present, it is used
+  for normalization of postcodes.
 
 Different analyzer implementations may exist. To select the implementation,
 the `analyzer` parameter must be set. The different implementations are
@@ -356,6 +368,14 @@ house numbers of the form '3 a', '3A', '3-A' etc. are all considered equivalent.
 
 The analyzer cannot be customized.
 
+##### Postcode token analyzer
+
+The analyzer `postcodes` is purpose-made to analyze postcodes. It supports
+a 'lookup' variant of the token, which produces variants with optional
+spaces. Use it together with the clean-postcodes sanitizer.
+
+The analyzer cannot be customized.
+
 ### Reconfiguration
 
 Changing the configuration after the import is currently not possible, although
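The 'lookup' variants with optional spaces can be pictured with a short sketch. This is purely illustrative; the function name `space_variants` is hypothetical and the real analyzer (in the `nominatim.tokenizer.token_analysis` package) works on normalized tokens:

```python
import itertools

def space_variants(postcode):
    # For a canonical postcode containing spaces, produce the spellings
    # with and without each space, e.g. 'AB 56' -> {'AB 56', 'AB56'}.
    parts = postcode.split(' ')
    variants = set()
    for seps in itertools.product((' ', ''), repeat=len(parts) - 1):
        out = parts[0]
        for sep, part in zip(seps, parts[1:]):
            out += sep + part
        variants.add(out)
    return variants

assert space_variants('AB 56') == {'AB 56', 'AB56'}
assert space_variants('123') == {'123'}
```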
@@ -245,11 +245,11 @@ Currently, tokenizers are encouraged to make sure that matching works against
 both the search token list and the match token list.
 
 ```sql
-FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
 ```
 
-Return the normalized version of the given postcode. This function must return
-the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
+Return the postcode for the object, if any exists. The postcode must be in
+the form that should also be presented to the end-user.
 
 ```sql
 FUNCTION token_strip_info(info JSONB) RETURNS JSONB
 ```
@@ -28,6 +28,7 @@ pages:
     - 'Overview': 'customize/Overview.md'
     - 'Import Styles': 'customize/Import-Styles.md'
     - 'Configuration Settings': 'customize/Settings.md'
+    - 'Per-Country Data': 'customize/Country-Settings.md'
     - 'Place Ranking' : 'customize/Ranking.md'
     - 'Tokenizers' : 'customize/Tokenizers.md'
     - 'Special Phrases': 'customize/Special-Phrases.md'
@@ -25,7 +25,12 @@ class Postcode
     public function __construct($iId, $sPostcode, $sCountryCode = '')
     {
         $this->iId = $iId;
-        $this->sPostcode = $sPostcode;
+        $iSplitPos = strpos($sPostcode, '@');
+        if ($iSplitPos === false) {
+            $this->sPostcode = $sPostcode;
+        } else {
+            $this->sPostcode = substr($sPostcode, 0, $iSplitPos);
+        }
         $this->sCountryCode = empty($sCountryCode) ? '' : $sCountryCode;
     }
 
@@ -190,13 +190,17 @@ class Tokenizer
                 if ($aWord['word'] !== null
                     && pg_escape_string($aWord['word']) == $aWord['word']
                 ) {
-                    $sNormPostcode = $this->normalizeString($aWord['word']);
-                    if (strpos($sNormQuery, $sNormPostcode) !== false) {
-                        $oValidTokens->addToken(
-                            $sTok,
-                            new Token\Postcode($iId, $aWord['word'], null)
-                        );
+                    $iSplitPos = strpos($aWord['word'], '@');
+                    if ($iSplitPos === false) {
+                        $sPostcode = $aWord['word'];
+                    } else {
+                        $sPostcode = substr($aWord['word'], 0, $iSplitPos);
                     }
+
+                    $oValidTokens->addToken(
+                        $sTok,
+                        new Token\Postcode($iId, $sPostcode, null)
+                    );
                 }
                 break;
             case 'S': // tokens for classification terms (special phrases)
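Both PHP hunks implement the same storage convention: a word-table entry may hold `canonical@variant-base`, and readers keep only the part before the first `@`. In Python terms (an illustrative sketch; the helper name `display_postcode` is hypothetical):

```python
def display_postcode(word):
    # The word table may store e.g. 'AB 56@ab 56', where the part after
    # the '@' is the variant base used for lookup. Only the canonical
    # part before the '@' is shown to the user.
    split_pos = word.find('@')
    return word if split_pos < 0 else word[:split_pos]

assert display_postcode('AB 56@ab 56') == 'AB 56'
assert display_postcode('12345') == '12345'
```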
@@ -320,6 +320,11 @@ BEGIN
     location := ROW(null, null, null, hstore('ref', place.postcode), 'place',
                     'postcode', null, null, false, true, 5, 0)::addressline;
     RETURN NEXT location;
+  ELSEIF place.address is not null and place.address ? 'postcode'
+         and not place.address->'postcode' SIMILAR TO '%(,|;)%' THEN
+    location := ROW(null, null, null, hstore('ref', place.address->'postcode'), 'place',
+                    'postcode', null, null, false, true, 5, 0)::addressline;
+    RETURN NEXT location;
   END IF;
 
   RETURN;
@@ -156,7 +156,6 @@ DECLARE
   linegeo GEOMETRY;
   splitline GEOMETRY;
   sectiongeo GEOMETRY;
-  interpol_postcode TEXT;
   postcode TEXT;
   stepmod SMALLINT;
 BEGIN
@@ -174,8 +173,6 @@ BEGIN
                                     ST_PointOnSurface(NEW.linegeo),
                                     NEW.linegeo);
 
-  interpol_postcode := token_normalized_postcode(NEW.address->'postcode');
-
   NEW.token_info := token_strip_info(NEW.token_info);
   IF NEW.address ? '_inherited' THEN
     NEW.address := hstore('interpolation', NEW.address->'interpolation');
@@ -207,6 +204,11 @@ BEGIN
   FOR nextnode IN
     SELECT DISTINCT ON (nodeidpos)
            osm_id, address, geometry,
+           -- Take the postcode from the node only if it has a housenumber itself.
+           -- Note that there is a corner-case where the node has a wrongly
+           -- formatted postcode and therefore 'postcode' contains a derived
+           -- variant.
+           CASE WHEN address ? 'postcode' THEN placex.postcode ELSE NULL::text END as postcode,
            substring(address->'housenumber','[0-9]+')::integer as hnr
       FROM placex, generate_series(1, array_upper(waynodes, 1)) nodeidpos
      WHERE osm_type = 'N' and osm_id = waynodes[nodeidpos]::BIGINT
@@ -260,13 +262,10 @@ BEGIN
         endnumber := newend;
 
         -- determine postcode
-        postcode := coalesce(interpol_postcode,
-                             token_normalized_postcode(prevnode.address->'postcode'),
-                             token_normalized_postcode(nextnode.address->'postcode'),
-                             postcode);
-        IF postcode is NULL THEN
-          SELECT token_normalized_postcode(placex.postcode)
-            FROM placex WHERE place_id = NEW.parent_place_id INTO postcode;
+        postcode := coalesce(prevnode.postcode, nextnode.postcode, postcode);
+        IF postcode is NULL and NEW.parent_place_id > 0 THEN
+          SELECT placex.postcode FROM placex
+            WHERE place_id = NEW.parent_place_id INTO postcode;
         END IF;
         IF postcode is NULL THEN
           postcode := get_nearest_postcode(NEW.country_code, nextnode.geometry);
@@ -992,7 +992,7 @@ BEGIN
       {% if debug %}RAISE WARNING 'Got parent details from search name';{% endif %}
 
       -- determine postcode
-      NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
+      NEW.postcode := coalesce(token_get_postcode(NEW.token_info),
                                location.postcode,
                                get_nearest_postcode(NEW.country_code, NEW.centroid));
 
@@ -1150,8 +1150,7 @@ BEGIN
 
   {% if debug %}RAISE WARNING 'RETURN insert_addresslines: %, %, %', NEW.parent_place_id, NEW.postcode, nameaddress_vector;{% endif %}
 
-  NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
-                           NEW.postcode);
+  NEW.postcode := coalesce(token_get_postcode(NEW.token_info), NEW.postcode);
 
   -- if we have a name add this to the name search table
   IF NEW.name IS NOT NULL THEN
@@ -97,10 +97,10 @@ AS $$
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
-CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
+CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
   RETURNS TEXT
 AS $$
-  SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode)) END;
+  SELECT info->>'postcode';
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
@@ -223,3 +223,26 @@ BEGIN
 END;
 $$
 LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION create_postcode_word(postcode TEXT, lookup_terms TEXT[])
+  RETURNS BOOLEAN
+  AS $$
+DECLARE
+  existing INTEGER;
+BEGIN
+  SELECT count(*) INTO existing
+    FROM word WHERE word = postcode and type = 'P';
+
+  IF existing > 0 THEN
+    RETURN TRUE;
+  END IF;
+
+  -- postcodes don't need word ids
+  INSERT INTO word (word_token, type, word)
+    SELECT lookup_term, 'P', postcode FROM unnest(lookup_terms) as lookup_term;
+
+  RETURN FALSE;
+END;
+$$
+LANGUAGE plpgsql;
@@ -97,10 +97,10 @@ AS $$
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
-CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
+CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
   RETURNS TEXT
 AS $$
-  SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode)) END;
+  SELECT info->>'postcode';
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
nominatim/data/__init__.py (new file, 0 lines)
nominatim/data/postcode_format.py (new file, 109 lines)
@@ -0,0 +1,109 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Functions for formatting postcodes according to their country-specific
format.
"""
import re

from nominatim.errors import UsageError
from nominatim.tools import country_info


class CountryPostcodeMatcher:
    """ Matches and formats a postcode according to a format definition
        of the given country.
    """
    def __init__(self, country_code, config):
        if 'pattern' not in config:
            raise UsageError("Field 'pattern' required for 'postcode' "
                             f"for country '{country_code}'")

        pc_pattern = config['pattern'].replace('d', '[0-9]').replace('l', '[A-Z]')

        self.norm_pattern = re.compile(f'\\s*(?:{country_code.upper()}[ -]?)?(.*)\\s*')
        self.pattern = re.compile(pc_pattern)

        self.output = config.get('output', r'\g<0>')


    def match(self, postcode):
        """ Match the given postcode against the postcode pattern for this
            matcher. Returns a `re.Match` object if the match was successful
            and None otherwise.
        """
        # Upper-case, strip spaces and leading country code.
        normalized = self.norm_pattern.fullmatch(postcode.upper())

        if normalized:
            return self.pattern.fullmatch(normalized.group(1))

        return None


    def normalize(self, match):
        """ Return the default format of the postcode for the given match.
            `match` must be a `re.Match` object previously returned by
            `match()`.
        """
        return match.expand(self.output)


class PostcodeFormatter:
    """ Container for different postcode formats of the world and
        access functions.
    """
    def __init__(self):
        # Objects without a country code can't have a postcode per definition.
        self.country_without_postcode = {None}
        self.country_matcher = {}
        self.default_matcher = CountryPostcodeMatcher('', {'pattern': '.*'})

        for ccode, prop in country_info.iterate('postcode'):
            if prop is False:
                self.country_without_postcode.add(ccode)
            elif isinstance(prop, dict):
                self.country_matcher[ccode] = CountryPostcodeMatcher(ccode, prop)
            else:
                raise UsageError(f"Invalid entry 'postcode' for country '{ccode}'")


    def set_default_pattern(self, pattern):
        """ Set the postcode match pattern to use, when a country does not
            have a specific pattern or is marked as country without postcode.
        """
        self.default_matcher = CountryPostcodeMatcher('', {'pattern': pattern})


    def get_matcher(self, country_code):
        """ Return the CountryPostcodeMatcher for the given country.
            Returns None if the country doesn't have a postcode and the
            default matcher if there is no specific matcher configured for
            the country.
        """
        if country_code in self.country_without_postcode:
            return None

        return self.country_matcher.get(country_code, self.default_matcher)


    def match(self, country_code, postcode):
        """ Match the given postcode against the postcode pattern for this
            matcher. Returns a `re.Match` object if the country has a pattern
            and the match was successful or None if the match failed.
        """
        if country_code in self.country_without_postcode:
            return None

        return self.country_matcher.get(country_code, self.default_matcher).match(postcode)


    def normalize(self, country_code, match):
        """ Return the default format of the postcode for the given match.
            `match` must be a `re.Match` object previously returned by
            `match()`.
        """
        return self.country_matcher.get(country_code, self.default_matcher).normalize(match)
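The matcher can be exercised standalone. The snippet below inlines a trimmed, dependency-free copy of `CountryPostcodeMatcher` (so it runs without the `nominatim` package) and applies the Andorra configuration from the documentation:

```python
import re

# A trimmed copy of CountryPostcodeMatcher, shown here only so the
# example is self-contained; the full class lives in
# nominatim/data/postcode_format.py above.
class CountryPostcodeMatcher:
    def __init__(self, country_code, config):
        pc_pattern = config['pattern'].replace('d', '[0-9]').replace('l', '[A-Z]')
        self.norm_pattern = re.compile(f'\\s*(?:{country_code.upper()}[ -]?)?(.*)\\s*')
        self.pattern = re.compile(pc_pattern)
        self.output = config.get('output', r'\g<0>')

    def match(self, postcode):
        normalized = self.norm_pattern.fullmatch(postcode.upper())
        return self.pattern.fullmatch(normalized.group(1)) if normalized else None

    def normalize(self, match):
        return match.expand(self.output)

# Andorra from the documentation: three digits, canonically prefixed 'AD'.
ad = CountryPostcodeMatcher('ad', {'pattern': '(ddd)', 'output': r'AD\1'})

m = ad.match('ad-123')       # country-code prefix is accepted and stripped
assert m is not None
assert ad.normalize(m) == 'AD123'
assert ad.match('12') is None  # too short: does not match the pattern
```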
@@ -11,7 +11,6 @@ libICU instead of the PostgreSQL module.
 import itertools
 import json
 import logging
-import re
 from textwrap import dedent
 
 from nominatim.db.connection import connect
@@ -291,33 +290,72 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
         """ Update postcode tokens in the word table from the location_postcode
             table.
         """
-        to_delete = []
+        analyzer = self.token_analysis.analysis.get('@postcode')
+
         with self.conn.cursor() as cur:
-            # This finds us the rows in location_postcode and word that are
-            # missing in the other table.
-            cur.execute("""SELECT * FROM
-                            (SELECT pc, word FROM
-                              (SELECT distinct(postcode) as pc FROM location_postcode) p
-                              FULL JOIN
-                              (SELECT word FROM word WHERE type = 'P') w
-                              ON pc = word) x
-                           WHERE pc is null or word is null""")
-
-            with CopyBuffer() as copystr:
-                for postcode, word in cur:
-                    if postcode is None:
-                        to_delete.append(word)
-                    else:
-                        copystr.add(self._search_normalized(postcode),
-                                    'P', postcode)
-
-                if to_delete:
-                    cur.execute("""DELETE FROM WORD
-                                   WHERE type ='P' and word = any(%s)
-                                """, (to_delete, ))
-
-                copystr.copy_out(cur, 'word',
-                                 columns=['word_token', 'type', 'word'])
+            # First get all postcode names currently in the word table.
+            cur.execute("SELECT DISTINCT word FROM word WHERE type = 'P'")
+            word_entries = set((entry[0] for entry in cur))
+
+            # Then compute the required postcode names from the postcode table.
+            needed_entries = set()
+            cur.execute("SELECT country_code, postcode FROM location_postcode")
+            for cc, postcode in cur:
+                info = PlaceInfo({'country_code': cc,
+                                  'class': 'place', 'type': 'postcode',
+                                  'address': {'postcode': postcode}})
+                address = self.sanitizer.process_names(info)[1]
+                for place in address:
+                    if place.kind == 'postcode':
+                        if analyzer is None:
+                            postcode_name = place.name.strip().upper()
+                            variant_base = None
+                        else:
+                            postcode_name = analyzer.normalize(place.name)
+                            variant_base = place.get_attr("variant")
+
+                        if variant_base:
+                            needed_entries.add(f'{postcode_name}@{variant_base}')
+                        else:
+                            needed_entries.add(postcode_name)
+                        break
+
+        # Now update the word table.
+        self._delete_unused_postcode_words(word_entries - needed_entries)
+        self._add_missing_postcode_words(needed_entries - word_entries)
+
+    def _delete_unused_postcode_words(self, tokens):
+        if tokens:
+            with self.conn.cursor() as cur:
+                cur.execute("DELETE FROM word WHERE type = 'P' and word = any(%s)",
+                            (list(tokens), ))
+
+    def _add_missing_postcode_words(self, tokens):
+        if not tokens:
+            return
+
+        analyzer = self.token_analysis.analysis.get('@postcode')
+        terms = []
+
+        for postcode_name in tokens:
+            if '@' in postcode_name:
+                term, variant = postcode_name.split('@', 2)
+                term = self._search_normalized(term)
+                variants = {term}
+                if analyzer is not None:
+                    variants.update(analyzer.get_variants_ascii(variant))
+                variants = list(variants)
+            else:
+                variants = [self._search_normalized(postcode_name)]
+            terms.append((postcode_name, variants))
+
+        if terms:
+            with self.conn.cursor() as cur:
+                cur.execute_values("""SELECT create_postcode_word(pc, var)
+                                      FROM (VALUES %s) AS v(pc, var)""",
+                                   terms)
 
 
     def update_special_phrases(self, phrases, should_replace):
@@ -473,7 +511,7 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
     def _process_place_address(self, token_info, address):
         for item in address:
             if item.kind == 'postcode':
-                self._add_postcode(item.name)
+                token_info.set_postcode(self._add_postcode(item))
             elif item.kind == 'housenumber':
                 token_info.add_housenumber(*self._compute_housenumber_token(item))
             elif item.kind == 'street':
@@ -605,26 +643,38 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
         return full_tokens, partial_tokens
 
 
-    def _add_postcode(self, postcode):
+    def _add_postcode(self, item):
         """ Make sure the normalized postcode is present in the word table.
         """
-        if re.search(r'[:,;]', postcode) is None:
-            postcode = self.normalize_postcode(postcode)
-
-            if postcode not in self._cache.postcodes:
-                term = self._search_normalized(postcode)
-                if not term:
-                    return
-
-                with self.conn.cursor() as cur:
-                    # no word_id needed for postcodes
-                    cur.execute("""INSERT INTO word (word_token, type, word)
-                                   (SELECT %s, 'P', pc FROM (VALUES (%s)) as v(pc)
-                                    WHERE NOT EXISTS
-                                      (SELECT * FROM word
-                                       WHERE type = 'P' and word = pc))
-                                """, (term, postcode))
-                self._cache.postcodes.add(postcode)
+        analyzer = self.token_analysis.analysis.get('@postcode')
+
+        if analyzer is None:
+            postcode_name = item.name.strip().upper()
+            variant_base = None
+        else:
+            postcode_name = analyzer.normalize(item.name)
+            variant_base = item.get_attr("variant")
+
+        if variant_base:
+            postcode = f'{postcode_name}@{variant_base}'
+        else:
+            postcode = postcode_name
+
+        if postcode not in self._cache.postcodes:
+            term = self._search_normalized(postcode_name)
+            if not term:
+                return None
+
+            variants = {term}
+            if analyzer is not None and variant_base:
+                variants.update(analyzer.get_variants_ascii(variant_base))
+
+            with self.conn.cursor() as cur:
+                cur.execute("SELECT create_postcode_word(%s, %s)",
+                            (postcode, list(variants)))
+            self._cache.postcodes.add(postcode)
+
+        return postcode_name
 
 
 class _TokenInfo:
@@ -637,6 +687,7 @@ class _TokenInfo:
         self.street_tokens = set()
         self.place_tokens = set()
         self.address_tokens = {}
+        self.postcode = None
 
 
     @staticmethod
@@ -665,6 +716,9 @@ class _TokenInfo:
         if self.address_tokens:
             out['addr'] = self.address_tokens
 
+        if self.postcode:
+            out['postcode'] = self.postcode
+
         return out
 
 
@@ -701,6 +755,11 @@ class _TokenInfo:
         if partials:
             self.address_tokens[key] = self._mk_array(partials)
 
+    def set_postcode(self, postcode):
+        """ Set the postcode to the given one.
+        """
+        self.postcode = postcode
+
 
 class _TokenCache:
     """ Cache for token information to avoid repeated database queries.
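The `_add_postcode` rewrite above keys postcodes in the word table by their canonical form, tagged with the variant base when the country pattern produced one. A minimal sketch of that keying rule (the helper name is invented for illustration, mirroring the f-string in `_add_postcode`):

```python
def postcode_word_key(postcode_name, variant_base=None):
    # Canonical form alone, or canonical form tagged with the variant base
    # when pattern matching extracted one (e.g. 'AD675' with base '675').
    return f'{postcode_name}@{variant_base}' if variant_base else postcode_name

print(postcode_word_key('AD675', '675'))  # AD675@675
print(postcode_word_key('01982'))         # 01982
```

This keeps two postcodes that normalize to the same search term (but have different canonical spellings) distinct in the cache and the word table.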

@@ -467,8 +467,9 @@ class LegacyNameAnalyzer(AbstractAnalyzer):
             if key == 'postcode':
                 # Make sure the normalized postcode is present in the word table.
                 if re.search(r'[:,;]', value) is None:
-                    self._cache.add_postcode(self.conn,
-                                             self.normalize_postcode(value))
+                    norm_pc = self.normalize_postcode(value)
+                    token_info.set_postcode(norm_pc)
+                    self._cache.add_postcode(self.conn, norm_pc)
             elif key in ('housenumber', 'streetnumber', 'conscriptionnumber'):
                 hnrs.append(value)
             elif key == 'street':
@@ -527,6 +528,11 @@ class _TokenInfo:
         self.data['hnr_tokens'], self.data['hnr'] = cur.fetchone()
 
 
+    def set_postcode(self, postcode):
+        """ Set or replace the postcode token with the given value.
+        """
+        self.data['postcode'] = postcode
+
     def add_street(self, conn, street):
         """ Add addr:street match terms.
         """

nominatim/tokenizer/sanitizers/clean_postcodes.py (new file, 74 lines)
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Sanitizer that filters postcodes by their officially allowed pattern.
+
+Arguments:
+    convert-to-address: If set to 'yes' (the default), then postcodes that do
+                        not conform with their country-specific pattern are
+                        converted to an address component. That means that
+                        the postcode does not take part when computing the
+                        postcode centroids of a country but is still searchable.
+                        When set to 'no', non-conforming postcodes are not
+                        searchable either.
+    default-pattern: Pattern to use, when there is none available for the
+                     country in question. Warning: will not be used for
+                     objects that have no country assigned. These are always
+                     assumed to have no postcode.
+"""
+from nominatim.data.postcode_format import PostcodeFormatter
+
+
+class _PostcodeSanitizer:
+
+    def __init__(self, config):
+        self.convert_to_address = config.get_bool('convert-to-address', True)
+        self.matcher = PostcodeFormatter()
+
+        default_pattern = config.get('default-pattern')
+        if default_pattern is not None and isinstance(default_pattern, str):
+            self.matcher.set_default_pattern(default_pattern)
+
+
+    def __call__(self, obj):
+        if not obj.address:
+            return
+
+        postcodes = ((i, o) for i, o in enumerate(obj.address) if o.kind == 'postcode')
+
+        for pos, postcode in postcodes:
+            formatted = self.scan(postcode.name, obj.place.country_code)
+
+            if formatted is None:
+                if self.convert_to_address:
+                    postcode.kind = 'unofficial_postcode'
+                else:
+                    obj.address.pop(pos)
+            else:
+                postcode.name = formatted[0]
+                postcode.set_attr('variant', formatted[1])
+
+
+    def scan(self, postcode, country):
+        """ Check the postcode for correct formatting and return the
+            normalized version. Returns None if the postcode does not
+            correspond to the official format of the given country.
+        """
+        match = self.matcher.match(country, postcode)
+        if match is None:
+            return None
+
+        return self.matcher.normalize(country, match),\
+               ' '.join(filter(lambda p: p is not None, match.groups()))
+
+
+
+def create(config):
+    """ Create a postcode processing function.
+    """
+    return _PostcodeSanitizer(config)
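The filtering step of the new sanitizer can be sketched with plain regular expressions. The per-country patterns below are made up for illustration (the real ones come from `PostcodeFormatter` and the country configuration, which are not part of this diff), and `scan` here only approximates the real normalization:

```python
import re

# Illustrative per-country patterns -- NOT Nominatim's real country data.
PATTERNS = {
    'de': r'\d{5}',
    'nl': r'(\d{4}) ?([A-Z]{2})',
}
DEFAULT_PATTERN = r'[A-Z0-9- ]{3,12}'

def scan(postcode, country):
    """ Return (normalized postcode, variant base) or None when the
        postcode does not match the country's official format.
    """
    pattern = PATTERNS.get(country, DEFAULT_PATTERN)
    match = re.fullmatch(pattern, postcode.strip().upper())
    if match is None:
        return None
    # Joining the capture groups with a space yields the 'variant' attribute
    # that the token analysis later expands into spelling variants.
    return match.group(0), ' '.join(g for g in match.groups() if g is not None)

print(scan('3993dx', 'nl'))   # ('3993DX', '3993 DX')
print(scan('ABC', 'de'))      # None
```

Non-matching postcodes would then either be re-tagged as `unofficial_postcode` or dropped, depending on `convert-to-address`.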
@@ -44,6 +44,20 @@ class SanitizerConfig(UserDict):
         return values
 
 
+    def get_bool(self, param, default=None):
+        """ Extract a configuration parameter as a boolean.
+            The parameter must be one of the YAML boolean values or a
+            user error will be raised. If `default` is given, then the parameter
+            may also be missing or empty.
+        """
+        value = self.data.get(param, default)
+
+        if not isinstance(value, bool):
+            raise UsageError(f"Parameter '{param}' must be a boolean value ('yes' or 'no').")
+
+        return value
+
+
     def get_delimiter(self, default=',;'):
         """ Return the 'delimiter' parameter in the configuration as a
             compiled regular expression that can be used to split the names on the
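A standalone sketch of the `get_bool` semantics added above. `UsageError` is defined locally here as a stand-in for Nominatim's own exception class; the key point is that YAML has already turned `yes`/`no` into real booleans, so anything else is a configuration mistake:

```python
class UsageError(Exception):
    """ Stand-in for Nominatim's UsageError. """

def get_bool(config, param, default=None):
    # `config` plays the role of SanitizerConfig's underlying dict.
    value = config.get(param, default)
    if not isinstance(value, bool):
        raise UsageError(f"Parameter '{param}' must be a boolean value ('yes' or 'no').")
    return value

print(get_bool({'convert-to-address': False}, 'convert-to-address'))  # False
print(get_bool({}, 'convert-to-address', True))                       # True
```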

@@ -48,8 +48,7 @@ class _AnalyzerByLanguage:
         self.deflangs = {}
 
         if use_defaults in ('mono', 'all'):
-            for ccode, prop in country_info.iterate():
-                clangs = prop['languages']
+            for ccode, clangs in country_info.iterate('languages'):
                 if len(clangs) == 1 or use_defaults == 'all':
                     if self.whitelist:
                         self.deflangs[ccode] = [l for l in clangs if l in self.whitelist]

nominatim/tokenizer/token_analysis/postcodes.py (new file, 65 lines)
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Specialized processor for postcodes. Supports a 'lookup' variant of the
+token, which produces variants with optional spaces.
+"""
+
+from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
+
+### Configuration section
+
+def configure(rules, normalization_rules): # pylint: disable=W0613
+    """ All behaviour is currently hard-coded.
+    """
+    return None
+
+### Analysis section
+
+def create(normalizer, transliterator, config): # pylint: disable=W0613
+    """ Create a new token analysis instance for this module.
+    """
+    return PostcodeTokenAnalysis(normalizer, transliterator)
+
+
+class PostcodeTokenAnalysis:
+    """ Special normalization and variant generation for postcodes.
+
+        This analyser must not be used with anything but postcodes as
+        it follows some special rules: `normalize` doesn't necessarily
+        need to return a standard form as per normalization rules. It
+        needs to return the canonical form of the postcode that is also
+        used for output. `get_variants_ascii` then needs to ensure that
+        the generated variants once more follow the standard normalization
+        and transliteration, so that postcodes are correctly recognised by
+        the search algorithm.
+    """
+    def __init__(self, norm, trans):
+        self.norm = norm
+        self.trans = trans
+
+        self.mutator = MutationVariantGenerator(' ', (' ', ''))
+
+
+    def normalize(self, name):
+        """ Return the standard form of the postcode.
+        """
+        return name.strip().upper()
+
+
+    def get_variants_ascii(self, norm_name):
+        """ Compute the spelling variants for the given normalized postcode.
+
+            Takes the canonical form of the postcode, normalizes it using the
+            standard rules and then creates variants of the result where
+            all spaces are optional.
+        """
+        # Postcodes follow their own transliteration rules.
+        # Make sure at this point, that the terms are normalized in a way
+        # that they are searchable with the standard transliteration rules.
+        return [self.trans.transliterate(term) for term in
+                self.mutator.generate([self.norm.transliterate(norm_name)]) if term]
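The `MutationVariantGenerator(' ', (' ', ''))` used above makes every space in a postcode optional. The same idea can be reproduced with `itertools.product` in a self-contained sketch (the function name is invented; the real generator also handles normalization and transliteration):

```python
from itertools import product

def space_variants(term):
    """ All spellings of `term` where each space is either kept or dropped,
        mimicking MutationVariantGenerator(' ', (' ', '')). """
    parts = term.split(' ')
    # Each gap between parts is filled with either a space or nothing.
    return [''.join(p + sep for p, sep in zip(parts, seps + ('',)))
            for seps in product((' ', ''), repeat=len(parts) - 1)]

print(space_variants('AD 675'))    # ['AD 675', 'AD675']
print(space_variants('399174'))    # ['399174']
```

A postcode with n spaces thus yields 2^n searchable variants, which is why "3993 DX", "3993DX" and "3993dx" in the BDD tests below all resolve to the same canonical postcode.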
@@ -84,10 +84,20 @@ def setup_country_config(config):
     _COUNTRY_INFO.load(config)
 
 
-def iterate():
+def iterate(prop=None):
     """ Iterate over country code and properties.
+
+        When `prop` is None, all countries are returned with their complete
+        set of properties.
+
+        If `prop` is given, then only countries are returned where the
+        given property is set. The second item of the tuple contains only
+        the content of the given property.
     """
-    return _COUNTRY_INFO.items()
+    if prop is None:
+        return _COUNTRY_INFO.items()
+
+    return ((c, p[prop]) for c, p in _COUNTRY_INFO.items() if prop in p)
 
 
 def setup_country_tables(dsn, sql_dir, ignore_partitions=False):
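The new `prop` filter in `iterate` can be exercised standalone; the country data below is a toy stand-in for the loaded configuration:

```python
_COUNTRY_INFO = {  # toy stand-in for the loaded country configuration
    'de': {'languages': ['de']},
    'ch': {'languages': ['de', 'fr', 'it', 'rm']},
    'aq': {},  # no languages configured
}

def iterate(prop=None):
    if prop is None:
        return _COUNTRY_INFO.items()
    # Countries without the property are skipped; the property's value
    # replaces the full property dict in the yielded tuple.
    return ((c, p[prop]) for c, p in _COUNTRY_INFO.items() if prop in p)

print(dict(iterate('languages')))
# {'de': ['de'], 'ch': ['de', 'fr', 'it', 'rm']}
```

This is what lets `_AnalyzerByLanguage` above drop the manual `prop['languages']` lookup.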

@@ -8,6 +8,7 @@
 Functions for importing, updating and otherwise maintaining the table
 of artificial postcode centroids.
 """
+from collections import defaultdict
 import csv
 import gzip
 import logging
@@ -16,6 +17,8 @@ from math import isfinite
 from psycopg2 import sql as pysql
 
 from nominatim.db.connection import connect
+from nominatim.utils.centroid import PointsCentroid
+from nominatim.data.postcode_format import PostcodeFormatter
 
 LOG = logging.getLogger()
 
@@ -30,20 +33,31 @@ def _to_float(num, max_value):
 
     return num
 
-class _CountryPostcodesCollector:
+class _PostcodeCollector:
     """ Collector for postcodes of a single country.
     """
 
-    def __init__(self, country):
+    def __init__(self, country, matcher):
         self.country = country
-        self.collected = {}
+        self.matcher = matcher
+        self.collected = defaultdict(PointsCentroid)
+        self.normalization_cache = None
 
 
     def add(self, postcode, x, y):
         """ Add the given postcode to the collection cache. If the postcode
             already existed, it is overwritten with the new centroid.
         """
-        self.collected[postcode] = (x, y)
+        if self.matcher is not None:
+            if self.normalization_cache and self.normalization_cache[0] == postcode:
+                normalized = self.normalization_cache[1]
+            else:
+                match = self.matcher.match(postcode)
+                normalized = self.matcher.normalize(match) if match else None
+                self.normalization_cache = (postcode, normalized)
 
+            if normalized:
+                self.collected[normalized] += (x, y)
 
 
     def commit(self, conn, analyzer, project_dir):
@@ -93,16 +107,16 @@ class _CountryPostcodesCollector:
                                WHERE country_code = %s""",
                            (self.country, ))
                for postcode, x, y in cur:
-                    newx, newy = self.collected.pop(postcode, (None, None))
-                    if newx is not None:
-                        dist = (x - newx)**2 + (y - newy)**2
-                        if dist > 0.0000001:
+                    pcobj = self.collected.pop(postcode, None)
+                    if pcobj:
+                        newx, newy = pcobj.centroid()
+                        if (x - newx) > 0.0000001 or (y - newy) > 0.0000001:
                             to_update.append((postcode, newx, newy))
                     else:
                         to_delete.append(postcode)
 
-        to_add = [(k, v[0], v[1]) for k, v in self.collected.items()]
-        self.collected = []
+        to_add = [(k, *v.centroid()) for k, v in self.collected.items()]
+        self.collected = None
 
         return to_add, to_delete, to_update
 
@@ -125,8 +139,10 @@ class _CountryPostcodesCollector:
                 postcode = analyzer.normalize_postcode(row['postcode'])
                 if postcode not in self.collected:
                     try:
-                        self.collected[postcode] = (_to_float(row['lon'], 180),
-                                                    _to_float(row['lat'], 90))
+                        # Do the float conversion separately, it might throw
+                        centroid = (_to_float(row['lon'], 180),
+                                    _to_float(row['lat'], 90))
+                        self.collected[postcode] += centroid
                     except ValueError:
                         LOG.warning("Bad coordinates %s, %s in %s country postcode file.",
                                     row['lat'], row['lon'], self.country)
@@ -158,6 +174,7 @@ def update_postcodes(dsn, project_dir, tokenizer):
        potentially enhances it with external data and then updates the
        postcodes in the table 'location_postcode'.
    """
+    matcher = PostcodeFormatter()
    with tokenizer.name_analyzer() as analyzer:
        with connect(dsn) as conn:
            # First get the list of countries that currently have postcodes.
@@ -169,19 +186,17 @@ def update_postcodes(dsn, project_dir, tokenizer):
            # Recompute the list of valid postcodes from placex.
            with conn.cursor(name="placex_postcodes") as cur:
                cur.execute("""
-               SELECT cc as country_code, pc, ST_X(centroid), ST_Y(centroid)
+               SELECT cc, pc, ST_X(centroid), ST_Y(centroid)
                FROM (SELECT
                        COALESCE(plx.country_code,
                                 get_country_code(ST_Centroid(pl.geometry))) as cc,
-                       token_normalized_postcode(pl.address->'postcode') as pc,
-                       ST_Centroid(ST_Collect(COALESCE(plx.centroid,
-                                                       ST_Centroid(pl.geometry)))) as centroid
+                       pl.address->'postcode' as pc,
+                       COALESCE(plx.centroid, ST_Centroid(pl.geometry)) as centroid
                      FROM place AS pl LEFT OUTER JOIN placex AS plx
                          ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
-                   WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null
-                   GROUP BY cc, pc) xx
+                   WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null) xx
                WHERE pc IS NOT null AND cc IS NOT null
-               ORDER BY country_code, pc""")
+               ORDER BY cc, pc""")
 
                collector = None
 
@@ -189,7 +204,7 @@ def update_postcodes(dsn, project_dir, tokenizer):
                    if collector is None or country != collector.country:
                        if collector is not None:
                            collector.commit(conn, analyzer, project_dir)
-                       collector = _CountryPostcodesCollector(country)
+                       collector = _PostcodeCollector(country, matcher.get_matcher(country))
                        todo_countries.discard(country)
                    collector.add(postcode, x, y)
 
@@ -198,7 +213,8 @@ def update_postcodes(dsn, project_dir, tokenizer):
 
            # Now handle any countries that are only in the postcode table.
            for country in todo_countries:
-               _CountryPostcodesCollector(country).commit(conn, analyzer, project_dir)
+               fmt = matcher.get_matcher(country)
+               _PostcodeCollector(country, fmt).commit(conn, analyzer, project_dir)
 
            conn.commit()

nominatim/utils/__init__.py (new file, empty)

nominatim/utils/centroid.py (new file, 48 lines)
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Functions for computation of centroids.
+"""
+from collections.abc import Collection
+
+class PointsCentroid:
+    """ Centroid computation from single points using an online algorithm.
+        More points may be added at any time.
+
+        Coordinates are internally treated as a 7-digit fixed-point float
+        (i.e. in OSM style).
+    """
+
+    def __init__(self):
+        self.sum_x = 0
+        self.sum_y = 0
+        self.count = 0
+
+    def centroid(self):
+        """ Return the centroid of all points collected so far.
+        """
+        if self.count == 0:
+            raise ValueError("No points available for centroid.")
+
+        return (float(self.sum_x/self.count)/10000000,
+                float(self.sum_y/self.count)/10000000)
+
+
+    def __len__(self):
+        return self.count
+
+
+    def __iadd__(self, other):
+        if isinstance(other, Collection) and len(other) == 2:
+            if all(isinstance(p, (float, int)) for p in other):
+                x, y = other
+                self.sum_x += int(x * 10000000)
+                self.sum_y += int(y * 10000000)
+                self.count += 1
+                return self
+
+        raise ValueError("Can only add 2-element tuples to centroid.")
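How the new class is meant to be used: points accumulate via `+=` and the centroid is the fixed-point mean. A condensed, runnable sketch of the class above together with a usage example (type checks omitted for brevity):

```python
class PointsCentroid:
    """ Online centroid over points stored as 7-digit fixed-point ints. """

    def __init__(self):
        self.sum_x = self.sum_y = self.count = 0

    def __iadd__(self, other):
        x, y = other
        # Accumulate in fixed-point to avoid drift from summing many floats.
        self.sum_x += int(x * 10000000)
        self.sum_y += int(y * 10000000)
        self.count += 1
        return self

    def centroid(self):
        if self.count == 0:
            raise ValueError("No points available for centroid.")
        return (self.sum_x / self.count / 10000000,
                self.sum_y / self.count / 10000000)

pc = PointsCentroid()
pc += (8.5, 47.25)
pc += (8.75, 47.75)
print(pc.centroid())  # (8.625, 47.5)
```

This is what lets `_PostcodeCollector` replace its old single `(x, y)` per postcode with a running average over all places sharing the same normalized postcode.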
File diff suppressed because it is too large
@@ -32,6 +32,9 @@ sanitizers:
           - streetnumber
         convert-to-name:
           - (\A|.*,)[^\d,]{3,}(,.*|\Z)
+    - step: clean-postcodes
+      convert-to-address: yes
+      default-pattern: "[A-Z0-9- ]{3,12}"
     - step: split-name-list
     - step: strip-brace-terms
     - step: tag-analyzer-by-language
@@ -43,6 +46,8 @@ token-analysis:
     - analyzer: generic
     - id: "@housenumber"
       analyzer: housenumbers
+    - id: "@postcode"
+      analyzer: postcodes
     - id: bg
       analyzer: generic
       mode: variant-only

@@ -163,25 +163,8 @@ Feature: Import of postcodes
             | de      | 01982    | country:de |
         And there are word tokens for postcodes 01982
 
-    Scenario: Different postcodes with the same normalization can both be found
-        Given the places
-            | osm | class | type  | addr+postcode | addr+housenumber | geometry   |
-            | N34 | place | house | EH4 7EA       | 111              | country:gb |
-            | N35 | place | house | E4 7EA        | 111              | country:gb |
-        When importing
-        Then location_postcode contains exactly
-            | country | postcode | geometry   |
-            | gb      | EH4 7EA  | country:gb |
-            | gb      | E4 7EA   | country:gb |
-        When sending search query "EH4 7EA"
-        Then results contain
-            | type     | display_name |
-            | postcode | EH4 7EA      |
-        When sending search query "E4 7EA"
-        Then results contain
-            | type     | display_name |
-            | postcode | E4 7EA       |
-
+    @Fail
     Scenario: search and address ranks for GB post codes correctly assigned
         Given the places
             | osm | class | type     | postcode | geometry |
@@ -195,55 +178,19 @@ Feature: Import of postcodes
             | E45 2    | gb      | 23          | 5 |
             | Y45      | gb      | 21          | 5 |
 
-    Scenario: wrongly formatted GB postcodes are down-ranked
+    @fail-legacy
+    Scenario: Postcodes outside all countries are not added to the postcode and word table
         Given the places
-            | osm | class | type     | postcode | geometry   |
-            | N1  | place | postcode | EA452CD  | country:gb |
-            | N2  | place | postcode | E45 23   | country:gb |
+            | osm | class | type  | addr+postcode | addr+housenumber | addr+place  | geometry  |
+            | N34 | place | house | 01982         | 111              | Null Island | 0 0.00001 |
+        And the places
+            | osm | class | type   | name        | geometry |
+            | N1  | place | hamlet | Null Island | 0 0      |
        When importing
        Then location_postcode contains exactly
-            | postcode | country | rank_search | rank_address |
-            | EA452CD  | gb      | 30          | 30 |
-            | E45 23   | gb      | 30          | 30 |
-
-    Scenario: search and address rank for DE postcodes correctly assigned
-        Given the places
-            | osm | class | type     | postcode | geometry   |
-            | N1  | place | postcode | 56427    | country:de |
-            | N2  | place | postcode | 5642     | country:de |
-            | N3  | place | postcode | 5642A    | country:de |
-            | N4  | place | postcode | 564276   | country:de |
-        When importing
-        Then location_postcode contains exactly
-            | postcode | country | rank_search | rank_address |
-            | 56427    | de      | 21          | 11 |
-            | 5642     | de      | 30          | 30 |
-            | 5642A    | de      | 30          | 30 |
-            | 564276   | de      | 30          | 30 |
-
-    Scenario: search and address rank for other postcodes are correctly assigned
-        Given the places
-            | osm | class | type     | postcode | geometry   |
-            | N1  | place | postcode | 1        | country:ca |
-            | N2  | place | postcode | X3       | country:ca |
-            | N3  | place | postcode | 543      | country:ca |
-            | N4  | place | postcode | 54dc     | country:ca |
-            | N5  | place | postcode | 12345    | country:ca |
-            | N6  | place | postcode | 55TT667  | country:ca |
-            | N7  | place | postcode | 123-65   | country:ca |
-            | N8  | place | postcode | 12 445 4 | country:ca |
-            | N9  | place | postcode | A1:bc10  | country:ca |
-        When importing
-        Then location_postcode contains exactly
-            | postcode | country | rank_search | rank_address |
-            | 1        | ca      | 21          | 11 |
-            | X3       | ca      | 21          | 11 |
-            | 543      | ca      | 21          | 11 |
-            | 54DC     | ca      | 21          | 11 |
-            | 12345    | ca      | 21          | 11 |
-            | 55TT667  | ca      | 21          | 11 |
-            | 123-65   | ca      | 25          | 11 |
-            | 12 445 4 | ca      | 25          | 11 |
-            | A1:BC10  | ca      | 25          | 11 |
+            | country | postcode | geometry |
+        And there are no word tokens for postcodes 01982
+        When sending search query "111, 01982 Null Island"
+        Then results contain
+            | osm | display_name            |
+            | N34 | 111, Null Island, 01982 |

@@ -168,14 +168,6 @@ Feature: Import and search of names
             | ID | osm |
             | 0  | R1  |
 
-    Scenario: Unprintable characters in postcodes are ignored
-        Given the named places
-            | osm  | class   | type   | address                    | geometry   |
-            | N234 | amenity | prison | 'postcode' : u'1234\u200e' | country:de |
-        When importing
-        And sending search query "1234"
-        Then result 0 has not attributes osm_type
-
     Scenario Outline: Housenumbers with special characters are found
         Given the grid
             | 1 |  |  |  | 2 |

test/bdd/db/query/postcodes.feature (new file, 97 lines)
@@ -0,0 +1,97 @@
+@DB
+Feature: Querying for postcode variants
+
+    Scenario: Postcodes in Singapore (6-digit postcode)
+        Given the grid with origin SG
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name   | addr+postcode | geometry |
+            | W1  | highway | path | Lorang | 399174        | 10,11    |
+        When importing
+        When sending search query "399174"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 399174       |
+
+
+    @fail-legacy
+    Scenario Outline: Postcodes in the Netherlands (mixed postcode with spaces)
+        Given the grid with origin NL
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name     | addr+postcode | geometry |
+            | W1  | highway | path | De Weide | <postcode>    | 10,11    |
+        When importing
+        When sending search query "3993 DX"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 3993 DX      |
+        When sending search query "3993dx"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 3993 DX      |
+
+        Examples:
+            | postcode |
+            | 3993 DX  |
+            | 3993DX   |
+            | 3993 dx  |
+
+
+    @fail-legacy
+    Scenario: Postcodes in Singapore (6-digit postcode)
+        Given the grid with origin SG
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name   | addr+postcode | geometry |
+            | W1  | highway | path | Lorang | 399174        | 10,11    |
+        When importing
+        When sending search query "399174"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 399174       |
+
+
+    @fail-legacy
+    Scenario Outline: Postcodes in Andorra (with country code)
+        Given the grid with origin AD
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name   | addr+postcode | geometry |
+            | W1  | highway | path | Lorang | <postcode>    | 10,11    |
+        When importing
+        When sending search query "675"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | AD675        |
+        When sending search query "AD675"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | AD675        |
+
+        Examples:
+            | postcode |
+            | 675      |
+            | AD 675   |
+            | AD675    |
+
+
+    Scenario: Different postcodes with the same normalization can both be found
+        Given the places
+            | osm | class | type  | addr+postcode | addr+housenumber | geometry   |
+            | N34 | place | house | EH4 7EA       | 111              | country:gb |
+            | N35 | place | house | E4 7EA        | 111              | country:gb |
+        When importing
+        Then location_postcode contains exactly
+            | country | postcode | geometry   |
+            | gb      | EH4 7EA  | country:gb |
+            | gb      | E4 7EA   | country:gb |
+        When sending search query "EH4 7EA"
+        Then results contain
+            | type     | display_name |
+            | postcode | EH4 7EA      |
+        When sending search query "E4 7EA"
+        Then results contain
+            | type     | display_name |
+            | postcode | E4 7EA       |
@@ -18,13 +18,19 @@ from nominatim.tokenizer import factory as tokenizer_factory

 def check_database_integrity(context):
     """ Check some generic constraints on the tables.
     """
-    # place_addressline should not have duplicate (place_id, address_place_id)
-    cur = context.db.cursor()
-    cur.execute("""SELECT count(*) FROM
-                    (SELECT place_id, address_place_id, count(*) as c
-                     FROM place_addressline GROUP BY place_id, address_place_id) x
-                   WHERE c > 1""")
-    assert cur.fetchone()[0] == 0, "Duplicates found in place_addressline"
+    with context.db.cursor() as cur:
+        # place_addressline should not have duplicate (place_id, address_place_id)
+        cur.execute("""SELECT count(*) FROM
+                        (SELECT place_id, address_place_id, count(*) as c
+                         FROM place_addressline GROUP BY place_id, address_place_id) x
+                       WHERE c > 1""")
+        assert cur.fetchone()[0] == 0, "Duplicates found in place_addressline"
+
+        # word table must not have empty word_tokens
+        if context.nominatim.tokenizer != 'legacy':
+            cur.execute("SELECT count(*) FROM word WHERE word_token = ''")
+            assert cur.fetchone()[0] == 0, "Empty word tokens found in word table"


 ################################ GIVEN ##################################
102  test/python/tokenizer/sanitizers/test_clean_postcodes.py  Normal file
@@ -0,0 +1,102 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for the sanitizer that normalizes postcodes.
"""
import pytest

from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.indexer.place_info import PlaceInfo
from nominatim.tools import country_info

@pytest.fixture
def sanitize(def_config, request):
    country_info.setup_country_config(def_config)
    sanitizer_args = {'step': 'clean-postcodes'}
    for mark in request.node.iter_markers(name="sanitizer_params"):
        sanitizer_args.update({k.replace('_', '-') : v for k,v in mark.kwargs.items()})

    def _run(country=None, **kwargs):
        pi = {'address': kwargs}
        if country is not None:
            pi['country_code'] = country

        _, address = PlaceSanitizer([sanitizer_args]).process_names(PlaceInfo(pi))

        return sorted([(p.kind, p.name) for p in address])

    return _run


@pytest.mark.parametrize("country", (None, 'ae'))
def test_postcode_no_country(sanitize, country):
    assert sanitize(country=country, postcode='23231') == [('unofficial_postcode', '23231')]


@pytest.mark.parametrize("country", (None, 'ae'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_no_country_drop(sanitize, country):
    assert sanitize(country=country, postcode='23231') == []


@pytest.mark.parametrize("postcode", ('12345', ' 12345 ', 'de 12345',
                                      'DE12345', 'DE 12345', 'DE-12345'))
def test_postcode_pass_good_format(sanitize, postcode):
    assert sanitize(country='de', postcode=postcode) == [('postcode', '12345')]


@pytest.mark.parametrize("postcode", ('123456', '', '   ', '.....',
                                      'DE  12345', 'DEF12345', 'CH 12345'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_drop_bad_format(sanitize, postcode):
    assert sanitize(country='de', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('1234', '9435', '99000'))
def test_postcode_cyprus_pass(sanitize, postcode):
    assert sanitize(country='cy', postcode=postcode) == [('postcode', postcode)]


@pytest.mark.parametrize("postcode", ('91234', '99a45', '567'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_cyprus_fail(sanitize, postcode):
    assert sanitize(country='cy', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('123456', 'A33F2G7'))
def test_postcode_kazakhstan_pass(sanitize, postcode):
    assert sanitize(country='kz', postcode=postcode) == [('postcode', postcode)]


@pytest.mark.parametrize("postcode", ('V34T6Y923456', '99345'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_kazakhstan_fail(sanitize, postcode):
    assert sanitize(country='kz', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('675 34', '67534', 'SE-675 34', 'SE67534'))
def test_postcode_sweden_pass(sanitize, postcode):
    assert sanitize(country='se', postcode=postcode) == [('postcode', '675 34')]


@pytest.mark.parametrize("postcode", ('67 345', '671123'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_sweden_fail(sanitize, postcode):
    assert sanitize(country='se', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('AB1', '123-456-7890', '1 as 44'))
@pytest.mark.sanitizer_params(default_pattern='[A-Z0-9- ]{3,12}')
def test_postcode_default_pattern_pass(sanitize, postcode):
    assert sanitize(country='an', postcode=postcode) == [('postcode', postcode.upper())]


@pytest.mark.parametrize("postcode", ('C', '12', 'ABC123DEF 456', '1234,5678', '11223;11224'))
@pytest.mark.sanitizer_params(convert_to_address=False, default_pattern='[A-Z0-9- ]{3,12}')
def test_postcode_default_pattern_fail(sanitize, postcode):
    assert sanitize(country='an', postcode=postcode) == []
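The behaviour these tests pin down can be sketched with a simplified stand-in for the `clean-postcodes` step. The per-country patterns below are hard-coded assumptions for illustration; the real sanitizer reads them from the country configuration and is stricter (it rejects, for instance, a double space after the country prefix, which this sketch would accept).

```python
import re

# Hypothetical per-country postcode patterns (the real ones come from
# Nominatim's country configuration, not from this table).
PATTERNS = {
    'de': re.compile(r'\d{5}'),
    'cy': re.compile(r'\d{4,5}'),
}

def clean_postcode(country, raw, convert_to_address=True):
    """Return ('postcode', value), ('unofficial_postcode', raw) or None."""
    pattern = PATTERNS.get(country)
    if pattern is None:
        # No pattern known for the country: keep the value as an
        # unofficial postcode unless conversion is disabled.
        return ('unofficial_postcode', raw) if convert_to_address else None
    # Strip an optional country-code prefix such as "DE-" or "DE ".
    candidate = raw.strip()
    if candidate.upper().startswith(country.upper()):
        candidate = candidate[len(country):].lstrip(' -')
    if pattern.fullmatch(candidate) is None:
        return None
    return ('postcode', candidate)
```

The three outcomes mirror the test groups above: well-formed values are kept as `postcode`, malformed ones are dropped, and values for countries without a pattern fall back to `unofficial_postcode` (or are dropped when `convert_to_address` is off).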
@@ -72,7 +72,8 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,

     def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
                      variants=('~gasse -> gasse', 'street => st', ),
-                     sanitizers=[], with_housenumber=False):
+                     sanitizers=[], with_housenumber=False,
+                     with_postcode=False):
         cfgstr = {'normalization': list(norm),
                   'sanitizers': sanitizers,
                   'transliteration': list(trans),
@@ -81,6 +82,9 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
         if with_housenumber:
             cfgstr['token-analysis'].append({'id': '@housenumber',
                                              'analyzer': 'housenumbers'})
+        if with_postcode:
+            cfgstr['token-analysis'].append({'id': '@postcode',
+                                             'analyzer': 'postcodes'})
         (test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(cfgstr))
         tok.loader = nominatim.tokenizer.icu_rule_loader.ICURuleLoader(test_config)
@@ -246,28 +250,69 @@ def test_normalize_postcode(analyzer):
         anl.normalize_postcode('38 Б') == '38 Б'


-def test_update_postcodes_from_db_empty(analyzer, table_factory, word_table):
-    table_factory('location_postcode', 'postcode TEXT',
-                  content=(('1234',), ('12 34',), ('AB23',), ('1234',)))
-
-    with analyzer() as anl:
-        anl.update_postcodes_from_db()
-
-    assert word_table.count() == 3
-    assert word_table.get_postcodes() == {'1234', '12 34', 'AB23'}
-
-
-def test_update_postcodes_from_db_add_and_remove(analyzer, table_factory, word_table):
-    table_factory('location_postcode', 'postcode TEXT',
-                  content=(('1234',), ('45BC', ), ('XX45', )))
-    word_table.add_postcode(' 1234', '1234')
-    word_table.add_postcode(' 5678', '5678')
-
-    with analyzer() as anl:
-        anl.update_postcodes_from_db()
-
-    assert word_table.count() == 3
-    assert word_table.get_postcodes() == {'1234', '45BC', 'XX45'}
+class TestPostcodes:
+
+    @pytest.fixture(autouse=True)
+    def setup(self, analyzer, sql_functions):
+        sanitizers = [{'step': 'clean-postcodes'}]
+        with analyzer(sanitizers=sanitizers, with_postcode=True) as anl:
+            self.analyzer = anl
+            yield anl
+
+
+    def process_postcode(self, cc, postcode):
+        return self.analyzer.process_place(PlaceInfo({'country_code': cc,
+                                                      'address': {'postcode': postcode}}))
+
+
+    def test_update_postcodes_from_db_empty(self, table_factory, word_table):
+        table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
+                      content=(('de', '12345'), ('se', '132 34'),
+                               ('bm', 'AB23'), ('fr', '12345')))
+
+        self.analyzer.update_postcodes_from_db()
+
+        assert word_table.count() == 5
+        assert word_table.get_postcodes() == {'12345', '132 34@132 34', 'AB 23@AB 23'}
+
+
+    def test_update_postcodes_from_db_ambigious(self, table_factory, word_table):
+        table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
+                      content=(('in', '123456'), ('sg', '123456')))
+
+        self.analyzer.update_postcodes_from_db()
+
+        assert word_table.count() == 3
+        assert word_table.get_postcodes() == {'123456', '123456@123 456'}
+
+
+    def test_update_postcodes_from_db_add_and_remove(self, table_factory, word_table):
+        table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
+                      content=(('ch', '1234'), ('bm', 'BC 45'), ('bm', 'XX45')))
+        word_table.add_postcode(' 1234', '1234')
+        word_table.add_postcode(' 5678', '5678')
+
+        self.analyzer.update_postcodes_from_db()
+
+        assert word_table.count() == 5
+        assert word_table.get_postcodes() == {'1234', 'BC 45@BC 45', 'XX 45@XX 45'}
+
+
+    def test_process_place_postcode_simple(self, word_table):
+        info = self.process_postcode('de', '12345')
+
+        assert info['postcode'] == '12345'
+
+        assert word_table.get_postcodes() == {'12345', }
+
+
+    def test_process_place_postcode_with_space(self, word_table):
+        info = self.process_postcode('in', '123 567')
+
+        assert info['postcode'] == '123567'
+
+        assert word_table.get_postcodes() == {'123567@123 567', }
+
+
 def test_update_special_phrase_empty_table(analyzer, word_table):
@@ -437,13 +482,6 @@ class TestPlaceAddress:
         assert word_table.get_postcodes() == {pcode, }


-    @pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
-    def test_process_place_bad_postcode(self, word_table, pcode):
-        self.process_address(postcode=pcode)
-
-        assert not word_table.get_postcodes()
-
-
     @pytest.mark.parametrize('hnr', ['123a', '1', '101'])
     def test_process_place_housenumbers_simple(self, hnr, getorcreate_hnr_id):
         info = self.process_address(housenumber=hnr)
@@ -0,0 +1,60 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for special postcode analysis and variant generation.
"""
import pytest

from icu import Transliterator

import nominatim.tokenizer.token_analysis.postcodes as module
from nominatim.errors import UsageError

DEFAULT_NORMALIZATION = """ :: NFD ();
                            '🜳' > ' ';
                            [[:Nonspacing Mark:] [:Cf:]] >;
                            :: lower ();
                            [[:Punctuation:][:Space:]]+ > ' ';
                            :: NFC ();
                        """

DEFAULT_TRANSLITERATION = """ :: Latin ();
                              '🜵' > ' ';
                          """

@pytest.fixture
def analyser():
    rules = { 'analyzer': 'postcodes'}
    config = module.configure(rules, DEFAULT_NORMALIZATION)

    trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
    norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)

    return module.create(norm, trans, config)


def get_normalized_variants(proc, name):
    norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
    return proc.get_variants_ascii(norm.transliterate(name).strip())


@pytest.mark.parametrize('name,norm', [('12', '12'),
                                       ('A 34 ', 'A 34'),
                                       ('34-av', '34-AV')])
def test_normalize(analyser, name, norm):
    assert analyser.normalize(name) == norm


@pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}),
                                               ('AB-998', {'ab 998', 'ab998'}),
                                               ('23 FGH D3', {'23 fgh d3', '23fgh d3',
                                                              '23 fghd3', '23fghd3'})])
def test_get_variants_ascii(analyser, postcode, variants):
    out = analyser.get_variants_ascii(postcode)

    assert len(out) == len(set(out))
    assert set(out) == variants
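The variant semantics asserted above — after normalization, each internal space may independently be kept or dropped — can be reproduced with a standalone helper. This is a hypothetical sketch, not the module under test; it only handles spaces and assumes punctuation such as `-` has already been normalized to a space.

```python
from itertools import product

def postcode_variants(postcode):
    """Generate all lower-case variants where each space is optional."""
    parts = postcode.lower().split(' ')
    variants = set()
    # For each gap between parts, independently choose ' ' or ''.
    for seps in product((' ', ''), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(seps, parts[1:]):
            pieces.append(sep + part)
        variants.add(''.join(pieces))
    return variants
```

A postcode with two internal spaces therefore yields 2² = 4 variants, matching the "23 FGH D3" case in the parametrization.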
@@ -11,7 +11,7 @@ import subprocess

 import pytest

-from nominatim.tools import postcodes
+from nominatim.tools import postcodes, country_info
 import dummy_tokenizer

 class MockPostcodeTable:
@@ -64,11 +64,26 @@ class MockPostcodeTable:
 def tokenizer():
     return dummy_tokenizer.DummyTokenizer(None, None)


 @pytest.fixture
-def postcode_table(temp_db_conn, placex_table):
+def postcode_table(def_config, temp_db_conn, placex_table):
+    country_info.setup_country_config(def_config)
     return MockPostcodeTable(temp_db_conn)


+@pytest.fixture
+def insert_implicit_postcode(placex_table, place_row):
+    """
+        Inserts data into the placex and place table
+        which can then be used to compute one postcode.
+    """
+    def _insert_implicit_postcode(osm_id, country, geometry, address):
+        placex_table.add(osm_id=osm_id, country=country, geom=geometry)
+        place_row(osm_id=osm_id, geom='SRID=4326;'+geometry, address=address)
+
+    return _insert_implicit_postcode
+
+
 def test_postcodes_empty(dsn, postcode_table, place_table,
                          tmp_path, tokenizer):
     postcodes.update_postcodes(dsn, tmp_path, tokenizer)
@@ -193,7 +208,22 @@ def test_can_compute(dsn, table_factory):
     table_factory('place')
     assert postcodes.can_compute(dsn)


 def test_no_placex_entry(dsn, tmp_path, temp_db_cursor, place_row, postcode_table, tokenizer):
+    # Rewrite the get_country_code function to verify its execution.
+    temp_db_cursor.execute("""
+        CREATE OR REPLACE FUNCTION get_country_code(place geometry)
+        RETURNS TEXT AS $$ BEGIN
+        RETURN 'yy';
+        END; $$ LANGUAGE plpgsql;
+        """)
+    place_row(geom='SRID=4326;POINT(10 12)', address=dict(postcode='AB 4511'))
+    postcodes.update_postcodes(dsn, tmp_path, tokenizer)
+
+    assert postcode_table.row_set == {('yy', 'AB 4511', 10, 12)}
+
+
+def test_discard_badly_formatted_postcodes(dsn, tmp_path, temp_db_cursor, place_row, postcode_table, tokenizer):
     # Rewrite the get_country_code function to verify its execution.
     temp_db_cursor.execute("""
         CREATE OR REPLACE FUNCTION get_country_code(place geometry)
@@ -204,16 +234,4 @@ def test_no_placex_entry(dsn, tmp_path, temp_db_cursor, place_row, postcode_tabl
     place_row(geom='SRID=4326;POINT(10 12)', address=dict(postcode='AB 4511'))
     postcodes.update_postcodes(dsn, tmp_path, tokenizer)

-    assert postcode_table.row_set == {('fr', 'AB 4511', 10, 12)}
-
-
-@pytest.fixture
-def insert_implicit_postcode(placex_table, place_row):
-    """
-        Inserts data into the placex and place table
-        which can then be used to compute one postcode.
-    """
-    def _insert_implicit_postcode(osm_id, country, geometry, address):
-        placex_table.add(osm_id=osm_id, country=country, geom=geometry)
-        place_row(osm_id=osm_id, geom='SRID=4326;'+geometry, address=address)
-
-    return _insert_implicit_postcode
+    assert not postcode_table.row_set
56  test/python/utils/test_centroid.py  Normal file
@@ -0,0 +1,56 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for centroid computation.
"""
import pytest

from nominatim.utils.centroid import PointsCentroid

def test_empty_set():
    c = PointsCentroid()

    with pytest.raises(ValueError, match='No points'):
        c.centroid()


@pytest.mark.parametrize("centroid", [(0,0), (-1, 3), [0.0000032, 88.4938]])
def test_one_point_centroid(centroid):
    c = PointsCentroid()

    c += centroid

    assert len(c.centroid()) == 2
    assert c.centroid() == (pytest.approx(centroid[0]), pytest.approx(centroid[1]))


def test_multipoint_centroid():
    c = PointsCentroid()

    c += (20.0, -10.0)
    assert c.centroid() == (pytest.approx(20.0), pytest.approx(-10.0))
    c += (20.2, -9.0)
    assert c.centroid() == (pytest.approx(20.1), pytest.approx(-9.5))
    c += (20.2, -9.0)
    assert c.centroid() == (pytest.approx(20.13333), pytest.approx(-9.333333))


def test_manypoint_centroid():
    c = PointsCentroid()

    for _ in range(10000):
        c += (4.564732, -0.000034)

    assert c.centroid() == (pytest.approx(4.564732), pytest.approx(-0.000034))


@pytest.mark.parametrize("param", ["aa", None, 5, [1, 2, 3], (3, None), ("a", 3.9)])
def test_add_non_tuple(param):
    c = PointsCentroid()

    with pytest.raises(ValueError, match='2-element tuples'):
        c += param
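A minimal implementation matching the behaviour these tests describe can be sketched as follows. This is an illustrative stand-in, not Nominatim's class; the real `PointsCentroid` may accumulate differently (for example on a scaled integer grid to bound floating-point drift).

```python
class PointsCentroid:
    """Minimal sketch of an incremental 2D centroid accumulator."""

    def __init__(self):
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.count = 0

    def __iadd__(self, point):
        # Accept only 2-element tuples/lists of numbers, as the tests expect.
        if not isinstance(point, (tuple, list)) or len(point) != 2 \
           or not all(isinstance(c, (int, float)) for c in point):
            raise ValueError("centroid only takes 2-element tuples of numbers")
        self.sum_x += point[0]
        self.sum_y += point[1]
        self.count += 1
        return self

    def centroid(self):
        if self.count == 0:
            raise ValueError("No points available for centroid.")
        return (self.sum_x / self.count, self.sum_y / self.count)
```

Overloading `__iadd__` keeps the call site (`c += (x, y)`) cheap: only two running sums and a counter are stored, never the points themselves.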