Merge pull request #2460 from lonvia/multiple-analyzers

Add support for multiple token analyzers
Sarah Hoffmann
2021-10-09 14:41:09 +02:00
committed by GitHub
22 changed files with 1340 additions and 551 deletions

View File

@@ -60,22 +60,23 @@ NOMINATIM_TOKENIZER=icu
### How it works
On import the tokenizer processes names in the following four stages:
On import the tokenizer processes names in the following three stages:
1. The **Normalization** part removes all non-relevant information from the
input.
2. Incoming names are now converted to **full names**. This process is currently
hard coded and mostly serves to handle name tags from OSM that contain
multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
3. Next the tokenizer creates **variants** from the full names. These variants
cover decomposition and abbreviation handling. Variants are saved to the
database, so that it is not necessary to create the variants for a search
query.
4. The final **Tokenization** step converts the names to a simple ASCII form,
potentially removing further spelling variants for better matching.
1. During the **Sanitizer step** incoming names are cleaned up and converted to
**full names**. This step can be used to regularize spelling, split multi-name
tags into their parts and tag names with additional attributes. See the
[Sanitizers section](#sanitizers) below for available cleaning routines.
2. The **Normalization** part removes all information from the full names
that is not relevant for search.
3. The **Token analysis** step takes the normalized full names and creates
all transliterated variants under which the name should be searchable.
See the [Token analysis](#token-analysis) section below for more
information.
At query time only stage 1) and 4) are used. The query is normalized and
tokenized and the resulting string used for searching in the database.
At query time, only normalization and transliteration are relevant.
An incoming query is first split into name chunks (this usually means splitting
the string at the commas) and then each part is normalised and transliterated.
The result is used to look up places in the search index.
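As a rough, hedged illustration of this query-time path, the sketch below uses PyICU directly with placeholder rules (not the rule sets actually shipped with Nominatim):

```python
from icu import Transliterator

# Placeholder rules for illustration only.
NORM_RULES = ":: lower (); [[:Punctuation:][:Space:]]+ > ' ';"
TRANS_RULES = ":: Latin (); [:Space:]+ > ' ';"

normalizer = Transliterator.createFromRules("norm", NORM_RULES)
to_ascii = Transliterator.createFromRules("trans", TRANS_RULES)

def process_query(query):
    # Split into name chunks at the commas, then normalise and
    # transliterate each chunk before looking it up in the search index.
    return [to_ascii.transliterate(normalizer.transliterate(chunk)).strip()
            for chunk in query.split(',')]

print(process_query("Jim-Knopf-Straße, Königswinter"))
```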
### Configuration
@@ -93,21 +94,36 @@ normalization:
transliteration:
- !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
- ":: Ascii ()"
variants:
- language: de
words:
- ~haus => haus
- ~strasse -> str
- language: en
words:
- road -> rd
- bridge -> bdge,br,brdg,bri,brg
sanitizers:
- step: split-name-list
token-analysis:
- analyzer: generic
variants:
- !include icu-rules/variants-ca.yaml
- words:
- road -> rd
- bridge -> bdge,br,brdg,bri,brg
```
The configuration file contains three sections:
`normalization`, `transliteration`, `variants`.
The configuration file contains four sections:
`normalization`, `transliteration`, `sanitizers` and `token-analysis`.
The normalization and transliteration sections each must contain a list of
#### Normalization and Transliteration
The normalization and transliteration sections each define a set of
ICU rules that are applied to the names.
The **normalisation** rules are applied after the sanitizer step. They should remove
any information that is not relevant for search at all. Usual rules to be
applied here are: lower-casing, removing of special characters, cleanup of
spaces.
The **transliteration** rules are applied at the end of the tokenization
process to convert the name into an ASCII representation. Transliteration can
be useful to allow for further fuzzy matching, especially between different
scripts.
Each section must contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
@@ -119,6 +135,85 @@ and may again include other files.
YAML syntax. You should therefore always enclose the ICU rules in
double-quotes.
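As a hedged sketch of what happens internally: the listed rules are joined in file order with `;` and compiled into a single ICU transliterator (the rules below are placeholders, not the shipped rule set):

```python
from icu import Transliterator

normalization_section = [
    ":: lower ()",
    "[[:Punctuation:][:Space:]]+ > ' '",
]
# The loader concatenates the rules in the order they appear in the file.
compiled = ";".join(normalization_section) + ";"
normalizer = Transliterator.createFromRules("icu_normalization", compiled)

assert normalizer.transliterate("Biel/Bienne").strip() == "biel bienne"
```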
#### Sanitizers
The sanitizers section defines an ordered list of functions that are applied
to the name and address tags before they are further processed by the tokenizer.
They allow cleaning up the tagging and bringing it into a standardized form more
suitable for building the search index.
!!! hint
Sanitizers only have an effect on how the search index is built. They
do not change the information about each place that is saved in the
database. In particular, they have no influence on how the results are
displayed. The returned results always show the original information as
stored in the OpenStreetMap database.
Each entry describes one sanitizer to be applied. It has a
mandatory parameter `step` which gives the name of the sanitizer. Depending
on the type, it may have additional parameters to configure its operation.
The order of the list matters. The sanitizers are applied exactly in the order
that is configured. Each sanitizer works on the results of the previous one.
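A hedged sketch of such a pipeline, driven the same way the unit tests added in this pull request exercise the sanitizers (the place data is made up):

```python
from nominatim.indexer.place_info import PlaceInfo
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer

place = PlaceInfo({'name': {'name': 'Biel/Bienne (BE)'},
                   'country_code': 'ch'})

# Sanitizers run in the configured order: the name is first split at '/',
# then a bracket-free variant is added for "Bienne (BE)".
names, _ = PlaceSanitizer([{'step': 'split-name-list', 'delimiters': '/'},
                           {'step': 'strip-brace-terms'}]).process_names(place)

print([(n.name, n.kind, n.suffix) for n in names])
```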
The following is a list of sanitizers that are shipped with Nominatim.
##### split-name-list
::: nominatim.tokenizer.sanitizers.split_name_list
selection:
members: False
rendering:
heading_level: 6
##### strip-brace-terms
::: nominatim.tokenizer.sanitizers.strip_brace_terms
selection:
members: False
rendering:
heading_level: 6
##### tag-analyzer-by-language
::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language
selection:
members: False
rendering:
heading_level: 6
#### Token Analysis
Token analyzers take a full name and transform it into one or more normalized
forms that are then saved in the search index. In its simplest form, the
analyzer only applies the transliteration rules. More complex analyzers
create additional spelling variants of a name. This is useful to handle
decomposition and abbreviation.
The ICU tokenizer may use different analyzers for different names. To select
the analyzer to be used, the name must be tagged with the `analyzer` attribute
by a sanitizer (see for example the
[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).
The token-analysis section contains the list of configured analyzers. Each
analyzer must have an `id` parameter that uniquely identifies the analyzer.
The only exception is the default analyzer, which has no `id` and is used
whenever no specific analyzer has been selected for a name.
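A hedged sketch of how this selection plays out at indexing time, paraphrasing the changes to `icu_tokenizer.py` further down in this pull request:

```python
def variants_for(token_analysis, name, norm_name):
    # The 'analyzer' attribute set by a sanitizer selects the matching
    # entry from the token-analysis section; names without the attribute
    # fall back to the default analyzer (the entry without an id).
    analyzer_id = name.get_attr('analyzer')
    return token_analysis.analysis[analyzer_id].get_variants_ascii(norm_name)
```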
Different analyzer implementations may exist. To select the implementation,
the `analyzer` parameter must be set. Currently there is only one implementation,
`generic`, which is described in the following.
##### Generic token analyzer
The generic analyzer is able to create variants from a given list of
abbreviation and decomposition replacements. It takes one optional parameter
`variants` which lists the replacements to apply. If the section is
omitted, then the generic analyzer becomes a simple analyzer that only
applies the transliteration.
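As a hedged illustration, the sketch below drives the new `generic` analysis module directly (the module path is the internal one introduced by this change; the rules are placeholders):

```python
from icu import Transliterator
from nominatim.tokenizer.token_analysis import generic

norm_rules = ":: lower (); [[:Punctuation:][:Space:]]+ > ' ';"
to_ascii = Transliterator.createFromRules("to_ascii",
                                          ":: Latin (); [:Space:]+ > ' ';")

# One decomposing suffix rule; without the 'variants' parameter the
# analyzer would only transliterate.
cfg = generic.configure({'analyzer': 'generic',
                         'variants': [{'words': ['~strasse -> str']}]},
                        norm_rules)
analyzer = generic.create(to_ascii, cfg)

# Expected variants: hauptstrasse, hauptstr, haupt strasse, haupt str
print(sorted(analyzer.get_variants_ascii('hauptstrasse')))
```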
The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
@@ -144,7 +239,7 @@ term.
words in the configuration because then it is possible to change the
rules for normalization later without having to adapt the variant rules.
#### Decomposition
###### Decomposition
In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:
@@ -171,7 +266,7 @@ To avoid automatic decomposition, use the '|' notation:
simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
#### Initial and final terms
###### Initial and final terms
It is also possible to restrict replacements to the beginning and end of a
name:
@@ -184,7 +279,7 @@ name:
So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".
#### Replacements vs. variants
###### Replacements vs. variants
The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd

View File

@@ -12,6 +12,27 @@ from nominatim.errors import UsageError
LOG = logging.getLogger()
def flatten_config_list(content, section=''):
""" Flatten YAML configuration lists that contain include sections
which are lists themselves.
"""
if not content:
return []
if not isinstance(content, list):
raise UsageError(f"List expected in section '{section}'.")
output = []
for ele in content:
if isinstance(ele, list):
output.extend(flatten_config_list(ele, section))
else:
output.append(ele)
return output
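# Editorial sketch (not part of this change): nested '!include' results
# collapse into a single flat list, e.g.
#   flatten_config_list(['a > b', ['c > d', ['e > f']]], 'transliteration')
#   returns ['a > b', 'c > d', 'e > f'].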
class Configuration:
""" Load and manage the project configuration.

View File

@@ -194,15 +194,13 @@ class AbstractTokenizer(ABC):
""" Check that the database is set up correctly and ready for being
queried.
Returns:
If an issue was found, return an error message with the
description of the issue as well as hints for the user on
how to resolve the issue.
Arguments:
config: Read-only object with configuration options.
Return `None`, if no issue was found.
Returns:
If an issue was found, return an error message with the
description of the issue as well as hints for the user on
how to resolve the issue. If everything is okay, return `None`.
"""
pass

View File

@@ -1,104 +0,0 @@
"""
Processor for names that are imported into the database based on the
ICU library.
"""
from collections import defaultdict
import itertools
from icu import Transliterator
import datrie
class ICUNameProcessor:
""" Collects the different transformation rules for normalisation of names
and provides the functions to apply the transformations.
"""
def __init__(self, norm_rules, trans_rules, replacements):
self.normalizer = Transliterator.createFromRules("icu_normalization",
norm_rules)
self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
trans_rules +
";[:Space:]+ > ' '")
self.search = Transliterator.createFromRules("icu_search",
norm_rules + trans_rules)
# Intermediate reorder by source. Also compute required character set.
immediate = defaultdict(list)
chars = set()
for variant in replacements:
if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
replstr = variant.replacement[:-1]
else:
replstr = variant.replacement
immediate[variant.source].append(replstr)
chars.update(variant.source)
# Then copy to datrie
self.replacements = datrie.Trie(''.join(chars))
for src, repllist in immediate.items():
self.replacements[src] = repllist
def get_normalized(self, name):
""" Normalize the given name, i.e. remove all elements not relevant
for search.
"""
return self.normalizer.transliterate(name).strip()
def get_variants_ascii(self, norm_name):
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
baseform = '^ ' + norm_name + ' ^'
partials = ['']
startpos = 0
pos = 0
force_space = False
while pos < len(baseform):
full, repl = self.replacements.longest_prefix_item(baseform[pos:],
(None, None))
if full is not None:
done = baseform[startpos:pos]
partials = [v + done + r
for v, r in itertools.product(partials, repl)
if not force_space or r.startswith(' ')]
if len(partials) > 128:
# If too many variants are produced, they are unlikely
# to be helpful. Only use the original term.
startpos = 0
break
startpos = pos + len(full)
if full[-1] == ' ':
startpos -= 1
force_space = True
pos = startpos
else:
pos += 1
force_space = False
# No variants detected? Fast return.
if startpos == 0:
trans_name = self.to_ascii.transliterate(norm_name).strip()
return [trans_name] if trans_name else []
return self._compute_result_set(partials, baseform[startpos:])
def _compute_result_set(self, partials, prefix):
results = set()
for variant in partials:
vname = variant + prefix
trans_name = self.to_ascii.transliterate(vname[1:-1]).strip()
if trans_name:
results.add(trans_name)
return list(results)
def get_search_normalized(self, name):
""" Return the normalized version of the name (including transliteration)
to be applied at search time.
"""
return self.search.transliterate(' ' + name + ' ').strip()

View File

@@ -1,19 +1,17 @@
"""
Helper class to create ICU rules from a configuration file.
"""
import importlib
import io
import json
import logging
import itertools
import re
from icu import Transliterator
from nominatim.config import flatten_config_list
from nominatim.db.properties import set_property, get_property
from nominatim.errors import UsageError
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
import nominatim.tokenizer.icu_variants as variants
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
import nominatim.tools.country_info
LOG = logging.getLogger()
@@ -22,33 +20,15 @@ DBCFG_IMPORT_TRANS_RULES = "tokenizer_import_transliteration"
DBCFG_IMPORT_ANALYSIS_RULES = "tokenizer_import_analysis_rules"
def _flatten_config_list(content):
if not content:
return []
if not isinstance(content, list):
raise UsageError("List expected in ICU configuration.")
output = []
for ele in content:
if isinstance(ele, list):
output.extend(_flatten_config_list(ele))
else:
output.append(ele)
return output
class VariantRule:
""" Saves a single variant expansion.
An expansion consists of the normalized replacement term and
a dictionary of properties that describe when the expansion applies.
def _get_section(rules, section):
""" Get the section named 'section' from the rules. If the section does
not exist, raise a usage error with a meaningful message.
"""
if section not in rules:
LOG.fatal("Section '%s' not found in tokenizer config.", section)
raise UsageError("Syntax error in tokenizer configuration file.")
def __init__(self, replacement, properties):
self.replacement = replacement
self.properties = properties or {}
return rules[section]
class ICURuleLoader:
@@ -59,12 +39,13 @@ class ICURuleLoader:
rules = config.load_sub_configuration('icu_tokenizer.yaml',
config='TOKENIZER_CONFIG')
self.variants = set()
# Make sure country information is available to analyzers and sanitizers.
nominatim.tools.country_info.setup_country_config(config)
self.normalization_rules = self._cfg_to_icu_rules(rules, 'normalization')
self.transliteration_rules = self._cfg_to_icu_rules(rules, 'transliteration')
self.analysis_rules = self._get_section(rules, 'variants')
self._parse_variant_list()
self.analysis_rules = _get_section(rules, 'token-analysis')
self._setup_analysis()
# Load optional sanitizer rule set.
self.sanitizer_rules = rules.get('sanitizers', [])
@@ -77,7 +58,7 @@ class ICURuleLoader:
self.normalization_rules = get_property(conn, DBCFG_IMPORT_NORM_RULES)
self.transliteration_rules = get_property(conn, DBCFG_IMPORT_TRANS_RULES)
self.analysis_rules = json.loads(get_property(conn, DBCFG_IMPORT_ANALYSIS_RULES))
self._parse_variant_list()
self._setup_analysis()
def save_config_to_db(self, conn):
@@ -98,9 +79,8 @@ class ICURuleLoader:
def make_token_analysis(self):
""" Create a token analyser from the reviouly loaded rules.
"""
return ICUNameProcessor(self.normalization_rules,
self.transliteration_rules,
self.variants)
return ICUTokenAnalysis(self.normalization_rules,
self.transliteration_rules, self.analysis)
def get_search_rules(self):
@@ -115,159 +95,66 @@ class ICURuleLoader:
rules.write(self.transliteration_rules)
return rules.getvalue()
def get_normalization_rules(self):
""" Return rules for normalisation of a term.
"""
return self.normalization_rules
def get_transliteration_rules(self):
""" Return the rules for converting a string into its asciii representation.
"""
return self.transliteration_rules
def get_replacement_pairs(self):
""" Return the list of possible compound decompositions with
application of abbreviations included.
The result is a list of pairs: the first item is the sequence to
replace, the second is a list of replacements.
def _setup_analysis(self):
""" Process the rules used for creating the various token analyzers.
"""
return self.variants
self.analysis = {}
if not isinstance(self.analysis_rules, list):
raise UsageError("Configuration section 'token-analysis' must be a list.")
for section in self.analysis_rules:
name = section.get('id', None)
if name in self.analysis:
if name is None:
LOG.fatal("ICU tokenizer configuration has two default token analyzers.")
else:
LOG.fatal("ICU tokenizer configuration has two token "
"analyzers with id '%s'.", name)
raise UsageError("Syntax error in ICU tokenizer config.")
self.analysis[name] = TokenAnalyzerRule(section, self.normalization_rules)
@staticmethod
def _get_section(rules, section):
""" Get the section named 'section' from the rules. If the section does
not exist, raise a usage error with a meaningful message.
"""
if section not in rules:
LOG.fatal("Section '%s' not found in tokenizer config.", section)
raise UsageError("Syntax error in tokenizer configuration file.")
return rules[section]
def _cfg_to_icu_rules(self, rules, section):
def _cfg_to_icu_rules(rules, section):
""" Load an ICU ruleset from the given section. If the section is a
simple string, it is interpreted as a file name and the rules are
loaded verbatim from the given file. The filename is expected to be
relative to the tokenizer rule file. If the section is a list then
each line is assumed to be a rule. All rules are concatenated and returned.
"""
content = self._get_section(rules, section)
content = _get_section(rules, section)
if content is None:
return ''
return ';'.join(_flatten_config_list(content)) + ';'
return ';'.join(flatten_config_list(content, section)) + ';'
def _parse_variant_list(self):
rules = self.analysis_rules
self.variants.clear()
if not rules:
return
rules = _flatten_config_list(rules)
vmaker = _VariantMaker(self.normalization_rules)
properties = []
for section in rules:
# Create the property field and deduplicate against existing
# instances.
props = variants.ICUVariantProperties.from_rules(section)
for existing in properties:
if existing == props:
props = existing
break
else:
properties.append(props)
for rule in (section.get('words') or []):
self.variants.update(vmaker.compute(rule, props))
class _VariantMaker:
""" Generater for all necessary ICUVariants from a single variant rule.
All text in rules is normalized to make sure the variants match later.
class TokenAnalyzerRule:
""" Factory for a single analysis module. The class saves the configuration
and creates a new token analyzer on request.
"""
def __init__(self, norm_rules):
self.norm = Transliterator.createFromRules("rule_loader_normalization",
norm_rules)
def __init__(self, rules, normalization_rules):
# Find the analysis module
module_name = 'nominatim.tokenizer.token_analysis.' \
+ _get_section(rules, 'analyzer').replace('-', '_')
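# e.g. 'analyzer: generic' resolves to nominatim.tokenizer.token_analysis.generic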
analysis_mod = importlib.import_module(module_name)
self.create = analysis_mod.create
def compute(self, rule, props):
""" Generator for all ICUVariant tuples from a single variant rule.
"""
parts = re.split(r'(\|)?([=-])>', rule)
if len(parts) != 4:
raise UsageError("Syntax error in variant rule: " + rule)
decompose = parts[1] is None
src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
repl_terms = (self.norm.transliterate(t.strip()) for t in parts[3].split(','))
# If the source should be kept, add a 1:1 replacement
if parts[2] == '-':
for src in src_terms:
if src:
for froms, tos in _create_variants(*src, src[0], decompose):
yield variants.ICUVariant(froms, tos, props)
for src, repl in itertools.product(src_terms, repl_terms):
if src and repl:
for froms, tos in _create_variants(*src, repl, decompose):
yield variants.ICUVariant(froms, tos, props)
def _parse_variant_word(self, name):
name = name.strip()
match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
if match is None or (match.group(1) == '~' and match.group(3) == '~'):
raise UsageError("Invalid variant word descriptor '{}'".format(name))
norm_name = self.norm.transliterate(match.group(2))
if not norm_name:
return None
return norm_name, match.group(1), match.group(3)
_FLAG_MATCH = {'^': '^ ',
'$': ' ^',
'': ' '}
def _create_variants(src, preflag, postflag, repl, decompose):
if preflag == '~':
postfix = _FLAG_MATCH[postflag]
# suffix decomposition
src = src + postfix
repl = repl + postfix
yield src, repl
yield ' ' + src, ' ' + repl
if decompose:
yield src, ' ' + repl
yield ' ' + src, repl
elif postflag == '~':
# prefix decomposition
prefix = _FLAG_MATCH[preflag]
src = prefix + src
repl = prefix + repl
yield src, repl
yield src + ' ', repl + ' '
if decompose:
yield src, repl + ' '
yield src + ' ', repl
else:
prefix = _FLAG_MATCH[preflag]
postfix = _FLAG_MATCH[postflag]
yield prefix + src + postfix, prefix + repl + postfix
# Load the configuration.
self.config = analysis_mod.configure(rules, normalization_rules)

View File

@@ -0,0 +1,23 @@
"""
Container class collecting all components required to transform an OSM name
into a Nominatim token.
"""
from icu import Transliterator
class ICUTokenAnalysis:
""" Container class collecting the transliterators and token analysis
modules for a single NameAnalyser instance.
"""
def __init__(self, norm_rules, trans_rules, analysis_rules):
self.normalizer = Transliterator.createFromRules("icu_normalization",
norm_rules)
trans_rules += ";[:Space:]+ > ' '"
self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
trans_rules)
self.search = Transliterator.createFromRules("icu_search",
norm_rules + trans_rules)
self.analysis = {name: arules.create(self.to_ascii, arules.config)
for name, arules in analysis_rules.items()}

View File

@@ -164,7 +164,7 @@ class LegacyICUTokenizer(AbstractTokenizer):
""" Count the partial terms from the names in the place table.
"""
words = Counter()
name_proc = self.loader.make_token_analysis()
analysis = self.loader.make_token_analysis()
with conn.cursor(name="words") as cur:
cur.execute(""" SELECT v, count(*) FROM
@@ -172,12 +172,10 @@ class LegacyICUTokenizer(AbstractTokenizer):
WHERE length(v) < 75 GROUP BY v""")
for name, cnt in cur:
terms = set()
for word in name_proc.get_variants_ascii(name_proc.get_normalized(name)):
if ' ' in word:
terms.update(word.split())
for term in terms:
words[term] += cnt
word = analysis.search.transliterate(name)
if word and ' ' in word:
for term in set(word.split()):
words[term] += cnt
return words
@@ -209,14 +207,14 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
def _search_normalized(self, name):
""" Return the search token transliteration of the given name.
"""
return self.token_analysis.get_search_normalized(name)
return self.token_analysis.search.transliterate(name).strip()
def _normalized(self, name):
""" Return the normalized version of the given name with all
non-relevant information removed.
"""
return self.token_analysis.get_normalized(name)
return self.token_analysis.normalizer.transliterate(name).strip()
def get_word_token_info(self, words):
@@ -456,6 +454,7 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
if addr_terms:
token_info.add_address_terms(addr_terms)
def _compute_partial_tokens(self, name):
""" Normalize the given term, split it into partial words and return
the token list for them.
@@ -492,19 +491,25 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
partial_tokens = set()
for name in names:
analyzer_id = name.get_attr('analyzer')
norm_name = self._normalized(name.name)
full, part = self._cache.names.get(norm_name, (None, None))
if analyzer_id is None:
token_id = norm_name
else:
token_id = f'{norm_name}@{analyzer_id}'
full, part = self._cache.names.get(token_id, (None, None))
if full is None:
variants = self.token_analysis.get_variants_ascii(norm_name)
variants = self.token_analysis.analysis[analyzer_id].get_variants_ascii(norm_name)
if not variants:
continue
with self.conn.cursor() as cur:
cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
(norm_name, variants))
(token_id, variants))
full, part = cur.fetchone()
self._cache.names[norm_name] = (full, part)
self._cache.names[token_id] = (full, part)
full_tokens.add(full)
partial_tokens.update(part)

View File

@@ -1,25 +0,0 @@
"""
Data structures for saving variant expansions for ICU tokenizer.
"""
from collections import namedtuple
_ICU_VARIANT_PORPERTY_FIELDS = ['lang']
class ICUVariantProperties(namedtuple('_ICUVariantProperties', _ICU_VARIANT_PORPERTY_FIELDS)):
""" Data container for saving properties that describe when a variant
should be applied.
Property instances are hashable.
"""
@classmethod
def from_rules(cls, _):
""" Create a new property type from a generic dictionary.
The function only takes into account the properties that are
understood presently and ignores all others.
"""
return cls(lang=None)
ICUVariant = namedtuple('ICUVariant', ['source', 'replacement', 'properties'])

View File

@@ -1,5 +1,9 @@
"""
Name processor that splits name values with multiple values into their components.
Sanitizer that splits lists of names into their components.
Arguments:
delimiters: Define the set of characters to be used for
splitting the list. (default: `,;`)
"""
import re
@@ -7,9 +11,7 @@ from nominatim.errors import UsageError
def create(func):
""" Create a name processing function that splits name values with
multiple values into their components. The optional parameter
'delimiters' can be used to define the characters that should be used
for splitting. The default is ',;'.
multiple values into their components.
"""
delimiter_set = set(func.get('delimiters', ',;'))
if not delimiter_set:
@@ -24,7 +26,6 @@ def create(func):
new_names = []
for name in obj.names:
split_names = regexp.split(name.name)
print(split_names)
if len(split_names) == 1:
new_names.append(name)
else:

View File

@@ -1,11 +1,12 @@
"""
Sanitizer handling names with addendums in braces.
This sanitizer creates additional name variants for names that have
addendums in brackets (e.g. "Halle (Saale)"). The additional variant contains
only the main name part with the bracket part removed.
"""
def create(_):
""" Create a name processing function that creates additional name variants
when a name has an addendum in brackets (e.g. "Halle (Saale)"). The
additional variant only contains the main name without the bracket part.
for bracket addendums.
"""
def _process(obj):
""" Add variants for names that have a bracket extension.

View File

@@ -0,0 +1,103 @@
"""
This sanitizer sets the `analyzer` property depending on the
language of the tag. The language is taken from the suffix of the name.
If a name already has an analyzer tagged, then this is kept.
Arguments:
filter-kind: Restrict the names the sanitizer is applied to by their
kind. The parameter expects a list of regular expressions
which are matched against `kind`. Note that a match against
the full string is expected.
whitelist: Restrict the set of languages that should be tagged.
Expects a list of acceptable suffixes. When unset,
all 2- and 3-letter lower-case codes are accepted.
use-defaults: Configure what happens when the name has no suffix.
When set to 'all', a variant is created for
each of the default languages in the country
the feature is in. When set to 'mono', a variant is
only created when exactly one language is spoken
in the country. The default is to do nothing with
the default languages of a country.
mode: Define how the variants are created and may be 'replace' or
'append'. When set to 'append' the original name (without
any analyzer tagged) is retained. (default: replace)
"""
import re
from nominatim.tools import country_info
class _AnalyzerByLanguage:
""" Processor for tagging the language of names in a place.
"""
def __init__(self, config):
if 'filter-kind' in config:
self.regexes = [re.compile(regex) for regex in config['filter-kind']]
else:
self.regexes = None
self.replace = config.get('mode', 'replace') != 'append'
self.whitelist = config.get('whitelist')
self.__compute_default_languages(config.get('use-defaults', 'no'))
def __compute_default_languages(self, use_defaults):
self.deflangs = {}
if use_defaults in ('mono', 'all'):
for ccode, prop in country_info.iterate():
clangs = prop['languages']
if len(clangs) == 1 or use_defaults == 'all':
if self.whitelist:
self.deflangs[ccode] = [l for l in clangs if l in self.whitelist]
else:
self.deflangs[ccode] = clangs
def _kind_matches(self, kind):
if self.regexes is None:
return True
return any(regex.fullmatch(kind) for regex in self.regexes)
def _suffix_matches(self, suffix):
if self.whitelist is None:
return len(suffix) in (2, 3) and suffix.islower()
return suffix in self.whitelist
def __call__(self, obj):
if not obj.names:
return
more_names = []
for name in (n for n in obj.names
if not n.has_attr('analyzer') and self._kind_matches(n.kind)):
if name.suffix:
langs = [name.suffix] if self._suffix_matches(name.suffix) else None
else:
langs = self.deflangs.get(obj.place.country_code)
if langs:
if self.replace:
name.set_attr('analyzer', langs[0])
else:
more_names.append(name.clone(attr={'analyzer': langs[0]}))
more_names.extend(name.clone(attr={'analyzer': l}) for l in langs[1:])
obj.names.extend(more_names)
def create(config):
""" Create a function that sets the analyzer property depending on the
language of the tag.
"""
return _AnalyzerByLanguage(config)

View File

@@ -0,0 +1,224 @@
"""
Generic processor for names that creates abbreviation variants.
"""
from collections import defaultdict, namedtuple
import itertools
import re
from icu import Transliterator
import datrie
from nominatim.config import flatten_config_list
from nominatim.errors import UsageError
### Configuration section
ICUVariant = namedtuple('ICUVariant', ['source', 'replacement'])
def configure(rules, normalization_rules):
""" Extract and preprocess the configuration for this module.
"""
config = {}
config['replacements'], config['chars'] = _get_variant_config(rules.get('variants'),
normalization_rules)
config['variant_only'] = rules.get('mode', '') == 'variant-only'
return config
def _get_variant_config(rules, normalization_rules):
""" Convert the variant definition from the configuration into
replacement sets.
"""
immediate = defaultdict(list)
chars = set()
if rules:
vset = set()
rules = flatten_config_list(rules, 'variants')
vmaker = _VariantMaker(normalization_rules)
for section in rules:
for rule in (section.get('words') or []):
vset.update(vmaker.compute(rule))
# Intermediate reorder by source. Also compute required character set.
for variant in vset:
if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
replstr = variant.replacement[:-1]
else:
replstr = variant.replacement
immediate[variant.source].append(replstr)
chars.update(variant.source)
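# Editorial note: for a rule like '~berg -> bg' this ends up as
# {' berg ': [' berg', ' bg', 'berg', 'bg'], 'berg ': [' berg', ' bg', 'berg', 'bg']}
# (list order may vary); compare the updated loader tests further down.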
return list(immediate.items()), ''.join(chars)
class _VariantMaker:
""" Generater for all necessary ICUVariants from a single variant rule.
All text in rules is normalized to make sure the variants match later.
"""
def __init__(self, norm_rules):
self.norm = Transliterator.createFromRules("rule_loader_normalization",
norm_rules)
def compute(self, rule):
""" Generator for all ICUVariant tuples from a single variant rule.
"""
parts = re.split(r'(\|)?([=-])>', rule)
if len(parts) != 4:
raise UsageError("Syntax error in variant rule: " + rule)
decompose = parts[1] is None
src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
repl_terms = (self.norm.transliterate(t).strip() for t in parts[3].split(','))
# If the source should be kept, add a 1:1 replacement
if parts[2] == '-':
for src in src_terms:
if src:
for froms, tos in _create_variants(*src, src[0], decompose):
yield ICUVariant(froms, tos)
for src, repl in itertools.product(src_terms, repl_terms):
if src and repl:
for froms, tos in _create_variants(*src, repl, decompose):
yield ICUVariant(froms, tos)
def _parse_variant_word(self, name):
name = name.strip()
match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
if match is None or (match.group(1) == '~' and match.group(3) == '~'):
raise UsageError("Invalid variant word descriptor '{}'".format(name))
norm_name = self.norm.transliterate(match.group(2)).strip()
if not norm_name:
return None
return norm_name, match.group(1), match.group(3)
_FLAG_MATCH = {'^': '^ ',
'$': ' ^',
'': ' '}
def _create_variants(src, preflag, postflag, repl, decompose):
if preflag == '~':
postfix = _FLAG_MATCH[postflag]
# suffix decomposition
src = src + postfix
repl = repl + postfix
yield src, repl
yield ' ' + src, ' ' + repl
if decompose:
yield src, ' ' + repl
yield ' ' + src, repl
elif postflag == '~':
# prefix decomposition
prefix = _FLAG_MATCH[preflag]
src = prefix + src
repl = prefix + repl
yield src, repl
yield src + ' ', repl + ' '
if decompose:
yield src, repl + ' '
yield src + ' ', repl
else:
prefix = _FLAG_MATCH[preflag]
postfix = _FLAG_MATCH[postflag]
yield prefix + src + postfix, prefix + repl + postfix
### Analysis section
def create(transliterator, config):
""" Create a new token analysis instance for this module.
"""
return GenericTokenAnalysis(transliterator, config)
class GenericTokenAnalysis:
""" Collects the different transformation rules for normalisation of names
and provides the functions to apply the transformations.
"""
def __init__(self, to_ascii, config):
self.to_ascii = to_ascii
self.variant_only = config['variant_only']
# Set up datrie
if config['replacements']:
self.replacements = datrie.Trie(config['chars'])
for src, repllist in config['replacements']:
self.replacements[src] = repllist
else:
self.replacements = None
def get_variants_ascii(self, norm_name):
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
baseform = '^ ' + norm_name + ' ^'
partials = ['']
startpos = 0
if self.replacements is not None:
pos = 0
force_space = False
while pos < len(baseform):
full, repl = self.replacements.longest_prefix_item(baseform[pos:],
(None, None))
if full is not None:
done = baseform[startpos:pos]
partials = [v + done + r
for v, r in itertools.product(partials, repl)
if not force_space or r.startswith(' ')]
if len(partials) > 128:
# If too many variants are produced, they are unlikely
# to be helpful. Only use the original term.
startpos = 0
break
startpos = pos + len(full)
if full[-1] == ' ':
startpos -= 1
force_space = True
pos = startpos
else:
pos += 1
force_space = False
# No variants detected? Fast return.
if startpos == 0:
if self.variant_only:
return []
trans_name = self.to_ascii.transliterate(norm_name).strip()
return [trans_name] if trans_name else []
return self._compute_result_set(partials, baseform[startpos:],
norm_name if self.variant_only else '')
def _compute_result_set(self, partials, prefix, exclude):
results = set()
for variant in partials:
vname = (variant + prefix)[1:-1].strip()
if vname != exclude:
trans_name = self.to_ascii.transliterate(vname).strip()
if trans_name:
results.add(trans_name)
return list(results)

View File

@@ -13,12 +13,21 @@ class _CountryInfo:
def __init__(self):
self._info = {}
def load(self, config):
""" Load the country properties from the configuration files,
if they are not loaded yet.
"""
if not self._info:
self._info = config.load_sub_configuration('country_settings.yaml')
# Convert languages into a list for simpler handling.
for prop in self._info.values():
if 'languages' not in prop:
prop['languages'] = []
elif not isinstance(prop['languages'], list):
prop['languages'] = [x.strip()
for x in prop['languages'].split(',')]
def items(self):
""" Return tuples of (country_code, property dict) as iterable.
@@ -36,6 +45,12 @@ def setup_country_config(config):
_COUNTRY_INFO.load(config)
def iterate():
""" Iterate over country code and properties.
"""
return _COUNTRY_INFO.items()
def setup_country_tables(dsn, sql_dir, ignore_partitions=False):
""" Create and populate the tables with basic static data that provides
the background for geocoding. Data is assumed to not yet exist.
@@ -50,10 +65,7 @@ def setup_country_tables(dsn, sql_dir, ignore_partitions=False):
partition = 0
else:
partition = props.get('partition')
if ',' in (props.get('languages', ',') or ','):
lang = None
else:
lang = props['languages']
lang = props['languages'][0] if len(props['languages']) == 1 else None
params.append((ccode, partition, lang))
with connect(dsn) as conn:

View File

@@ -171,7 +171,7 @@ bt:
# (Bouvet Island)
bv:
partition: 185
languages: no
languages: "no"
# Botswana (Botswana)
bw:
@@ -1006,7 +1006,7 @@ si:
# (Svalbard and Jan Mayen)
sj:
partition: 197
languages: no
languages: "no"
# Slovakia (Slovensko)
sk:

View File

@@ -27,34 +27,160 @@ transliteration:
sanitizers:
- step: split-name-list
- step: strip-brace-terms
variants:
- !include icu-rules/variants-bg.yaml
- !include icu-rules/variants-ca.yaml
- !include icu-rules/variants-cs.yaml
- !include icu-rules/variants-da.yaml
- !include icu-rules/variants-de.yaml
- !include icu-rules/variants-el.yaml
- !include icu-rules/variants-en.yaml
- !include icu-rules/variants-es.yaml
- !include icu-rules/variants-et.yaml
- !include icu-rules/variants-eu.yaml
- !include icu-rules/variants-fi.yaml
- !include icu-rules/variants-fr.yaml
- !include icu-rules/variants-gl.yaml
- !include icu-rules/variants-hu.yaml
- !include icu-rules/variants-it.yaml
- !include icu-rules/variants-ja.yaml
- !include icu-rules/variants-mg.yaml
- !include icu-rules/variants-ms.yaml
- !include icu-rules/variants-nl.yaml
- !include icu-rules/variants-no.yaml
- !include icu-rules/variants-pl.yaml
- !include icu-rules/variants-pt.yaml
- !include icu-rules/variants-ro.yaml
- !include icu-rules/variants-ru.yaml
- !include icu-rules/variants-sk.yaml
- !include icu-rules/variants-sl.yaml
- !include icu-rules/variants-sv.yaml
- !include icu-rules/variants-tr.yaml
- !include icu-rules/variants-uk.yaml
- !include icu-rules/variants-vi.yaml
- step: tag-analyzer-by-language
filter-kind: [".*name.*"]
whitelist: [bg,ca,cs,da,de,el,en,es,et,eu,fi,fr,gl,hu,it,ja,mg,ms,nl,no,pl,pt,ro,ru,sk,sl,sv,tr,uk,vi]
use-defaults: all
mode: append
token-analysis:
- analyzer: generic
- id: bg
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-bg.yaml
- id: ca
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-ca.yaml
- id: cs
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-cs.yaml
- id: da
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-da.yaml
- id: de
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-de.yaml
- id: el
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-el.yaml
- id: en
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-en.yaml
- id: es
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-es.yaml
- id: et
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-et.yaml
- id: eu
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-eu.yaml
- id: fi
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-fi.yaml
- id: fr
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-fr.yaml
- id: gl
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-gl.yaml
- id: hu
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-hu.yaml
- id: it
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-it.yaml
- id: ja
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-ja.yaml
- id: mg
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-mg.yaml
- id: ms
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-ms.yaml
- id: nl
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-nl.yaml
- id: no
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-no.yaml
- id: pl
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-pl.yaml
- id: pt
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-pt.yaml
- id: ro
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-ro.yaml
- id: ru
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-ru.yaml
- id: sk
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-sk.yaml
- id: sl
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-sl.yaml
- id: sv
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-sv.yaml
- id: tr
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-tr.yaml
- id: uk
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-uk.yaml
- id: vi
analyzer: generic
mode: variant-only
variants:
- !include icu-rules/variants-vi.yaml

View File

@@ -52,7 +52,7 @@ Feature: Import and search of names
Scenario: Special characters in name
Given the places
| osm | class | type | name |
| osm | class | type | name+name:de |
| N1 | place | locality | Jim-Knopf-Straße |
| N2 | place | locality | Smith/Weston |
| N3 | place | locality | space mountain |

View File

@@ -69,10 +69,11 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
variants=('~gasse -> gasse', 'street => st', ),
sanitizers=[]):
cfgstr = {'normalization' : list(norm),
'sanitizers' : sanitizers,
'transliteration' : list(trans),
'variants' : [ {'words': list(variants)}]}
cfgstr = {'normalization': list(norm),
'sanitizers': sanitizers,
'transliteration': list(trans),
'token-analysis': [{'analyzer': 'generic',
'variants': [{'words': list(variants)}]}]}
(test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(cfgstr))
tok.loader = ICURuleLoader(test_config)
@@ -168,9 +169,7 @@ def test_init_word_table(tokenizer_factory, test_config, place_row, word_table):
tok.init_new_db(test_config)
assert word_table.get_partial_words() == {('test', 1),
('no', 1), ('area', 2),
('holz', 1), ('strasse', 1),
('str', 1)}
('no', 1), ('area', 2)}
def test_init_from_project(monkeypatch, test_config, tokenizer_factory):

View File

@@ -1,104 +0,0 @@
"""
Tests for import name normalisation and variant generation.
"""
from textwrap import dedent
import pytest
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.errors import UsageError
@pytest.fixture
def cfgfile(def_config, tmp_path):
project_dir = tmp_path / 'project_dir'
project_dir.mkdir()
def_config.project_dir = project_dir
def _create_config(*variants, **kwargs):
content = dedent("""\
normalization:
- ":: NFD ()"
- "'🜳' > ' '"
- "[[:Nonspacing Mark:] [:Cf:]] >"
- ":: lower ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
- ":: NFC ()"
transliteration:
- ":: Latin ()"
- "'🜵' > ' '"
""")
content += "variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
for k, v in kwargs:
content += " {}: {}\n".format(k, v)
(project_dir / 'icu_tokenizer.yaml').write_text(content)
return def_config
return _create_config
def get_normalized_variants(proc, name):
return proc.get_variants_ascii(proc.get_normalized(name))
def test_variants_empty(cfgfile):
config = cfgfile('saint -> 🜵', 'street -> st')
proc = ICURuleLoader(config).make_token_analysis()
assert get_normalized_variants(proc, '🜵') == []
assert get_normalized_variants(proc, '🜳') == []
assert get_normalized_variants(proc, 'saint') == ['saint']
VARIANT_TESTS = [
(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
(('weg => wg',), "holzweg", {'holzweg'}),
(('weg -> wg',), "holzweg", {'holzweg'}),
(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg => w',), "holzweg", {'holz w', 'holzw'}),
(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
(('weg => wg',), "Meier Weg", {'meier wg'}),
(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
{'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
(('am => a', 'bach => b'), "am bach", {'a b'}),
(('am => a', '~bach => b'), "am bach", {'a b'}),
(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
(('saint -> s,st', 'street -> st'), "Saint Johns Street",
{'saint johns street', 's johns street', 'st johns street',
'saint johns st', 's johns st', 'st johns st'}),
(('river$ -> r',), "River Bend Road", {'river bend road'}),
(('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
(('^north => n',), "North 2nd Street", {'n 2nd street'}),
(('^north => n',), "Airport North", {'airport north'}),
(('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
(('am => a',), "am am am am am am am am", {'a a a a a a a a'})
]
@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
def test_variants(cfgfile, rules, name, variants):
config = cfgfile(*rules)
proc = ICURuleLoader(config).make_token_analysis()
result = get_normalized_variants(proc, name)
assert len(result) == len(set(result))
assert set(get_normalized_variants(proc, name)) == variants
def test_search_normalized(cfgfile):
config = cfgfile('~street => s,st', 'master => mstr')
proc = ICURuleLoader(config).make_token_analysis()
assert proc.get_search_normalized('Master Street') == 'master street'
assert proc.get_search_normalized('Earnes St') == 'earnes st'
assert proc.get_search_normalized('Nostreet') == 'nostreet'

View File

@@ -34,8 +34,8 @@ def cfgrules(test_config):
- ":: Latin ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
""")
content += "variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
content += "token-analysis:\n - analyzer: generic\n variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
for k, v in kwargs:
content += " {}: {}\n".format(k, v)
(test_config.project_dir / 'icu_tokenizer.yaml').write_text(content)
@@ -49,20 +49,21 @@ def test_empty_rule_set(test_config):
(test_config.project_dir / 'icu_tokenizer.yaml').write_text(dedent("""\
normalization:
transliteration:
variants:
token-analysis:
- analyzer: generic
variants:
"""))
rules = ICURuleLoader(test_config)
assert rules.get_search_rules() == ''
assert rules.get_normalization_rules() == ''
assert rules.get_transliteration_rules() == ''
assert list(rules.get_replacement_pairs()) == []
CONFIG_SECTIONS = ('normalization', 'transliteration', 'variants')
CONFIG_SECTIONS = ('normalization', 'transliteration', 'token-analysis')
@pytest.mark.parametrize("section", CONFIG_SECTIONS)
def test_missing_section(section, test_config):
rule_cfg = { s: {} for s in CONFIG_SECTIONS if s != section}
rule_cfg = { s: [] for s in CONFIG_SECTIONS if s != section}
(test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(rule_cfg))
with pytest.raises(UsageError):
@@ -107,7 +108,9 @@ def test_transliteration_rules_from_file(test_config):
transliteration:
- "'ax' > 'b'"
- !include transliteration.yaml
variants:
token-analysis:
- analyzer: generic
variants:
"""))
transpath = test_config.project_dir / ('transliteration.yaml')
transpath.write_text('- "x > y"')
@@ -119,6 +122,15 @@ def test_transliteration_rules_from_file(test_config):
assert trans.transliterate(" axxt ") == " byt "
def test_search_rules(cfgrules):
config = cfgrules('~street => s,st', 'master => mstr')
proc = ICURuleLoader(config).make_token_analysis()
assert proc.search.transliterate('Master Street').strip() == 'master street'
assert proc.search.transliterate('Earnes St').strip() == 'earnes st'
assert proc.search.transliterate('Nostreet').strip() == 'nostreet'
class TestGetReplacements:
@pytest.fixture(autouse=True)
@@ -127,9 +139,9 @@ class TestGetReplacements:
def get_replacements(self, *variants):
loader = ICURuleLoader(self.cfgrules(*variants))
rules = loader.get_replacement_pairs()
rules = loader.analysis[None].config['replacements']
return set((v.source, v.replacement) for v in rules)
return sorted((k, sorted(v)) for k,v in rules)
@pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
@@ -141,131 +153,122 @@ class TestGetReplacements:
def test_add_full(self):
repl = self.get_replacements("foo -> bar")
assert repl == {(' foo ', ' bar '), (' foo ', ' foo ')}
assert repl == [(' foo ', [' bar', ' foo'])]
def test_replace_full(self):
repl = self.get_replacements("foo => bar")
assert repl == {(' foo ', ' bar ')}
assert repl == [(' foo ', [' bar'])]
def test_add_suffix_no_decompose(self):
repl = self.get_replacements("~berg |-> bg")
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
(' berg ', ' berg '), (' berg ', ' bg ')}
assert repl == [(' berg ', [' berg', ' bg']),
('berg ', ['berg', 'bg'])]
def test_replace_suffix_no_decompose(self):
repl = self.get_replacements("~berg |=> bg")
assert repl == {('berg ', 'bg '), (' berg ', ' bg ')}
assert repl == [(' berg ', [' bg']),('berg ', ['bg'])]
def test_add_suffix_decompose(self):
repl = self.get_replacements("~berg -> bg")
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
(' berg ', ' berg '), (' berg ', 'berg '),
('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg ')}
assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
('berg ', [' berg', ' bg', 'berg', 'bg'])]
def test_replace_suffix_decompose(self):
repl = self.get_replacements("~berg => bg")
assert repl == {('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg ')}
assert repl == [(' berg ', [' bg', 'bg']),
('berg ', [' bg', 'bg'])]
def test_add_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |-> hnt")
assert repl == {(' hinter', ' hinter'), (' hinter ', ' hinter '),
(' hinter', ' hnt'), (' hinter ', ' hnt ')}
assert repl == [(' hinter', [' hinter', ' hnt']),
(' hinter ', [' hinter', ' hnt'])]
def test_replace_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |=> hnt")
assert repl == {(' hinter', ' hnt'), (' hinter ', ' hnt ')}
assert repl == [(' hinter', [' hnt']), (' hinter ', [' hnt'])]
def test_add_prefix_compose(self):
repl = self.get_replacements("hinter~-> h")
assert repl == {(' hinter', ' hinter'), (' hinter', ' hinter '),
(' hinter', ' h'), (' hinter', ' h '),
(' hinter ', ' hinter '), (' hinter ', ' hinter'),
(' hinter ', ' h '), (' hinter ', ' h')}
assert repl == [(' hinter', [' h', ' h ', ' hinter', ' hinter ']),
(' hinter ', [' h', ' h', ' hinter', ' hinter'])]
def test_replace_prefix_compose(self):
repl = self.get_replacements("hinter~=> h")
assert repl == {(' hinter', ' h'), (' hinter', ' h '),
(' hinter ', ' h '), (' hinter ', ' h')}
assert repl == [(' hinter', [' h', ' h ']),
(' hinter ', [' h', ' h'])]
def test_add_beginning_only(self):
repl = self.get_replacements("^Premier -> Pr")
assert repl == {('^ premier ', '^ premier '), ('^ premier ', '^ pr ')}
assert repl == [('^ premier ', ['^ pr', '^ premier'])]
def test_replace_beginning_only(self):
repl = self.get_replacements("^Premier => Pr")
assert repl == {('^ premier ', '^ pr ')}
assert repl == [('^ premier ', ['^ pr'])]
def test_add_final_only(self):
repl = self.get_replacements("road$ -> rd")
assert repl == {(' road ^', ' road ^'), (' road ^', ' rd ^')}
assert repl == [(' road ^', [' rd ^', ' road ^'])]
def test_replace_final_only(self):
repl = self.get_replacements("road$ => rd")
assert repl == {(' road ^', ' rd ^')}
assert repl == [(' road ^', [' rd ^'])]
def test_decompose_only(self):
repl = self.get_replacements("~foo -> foo")
assert repl == {('foo ', 'foo '), ('foo ', ' foo '),
(' foo ', 'foo '), (' foo ', ' foo ')}
assert repl == [(' foo ', [' foo', 'foo']),
('foo ', [' foo', 'foo'])]
def test_add_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
(' berg ', ' berg '), (' berg ', ' bg '),
('berg ^', 'berg ^'), ('berg ^', ' berg ^'),
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
(' berg ^', 'berg ^'), (' berg ^', 'bg ^'),
(' berg ^', ' berg ^'), (' berg ^', ' bg ^')}
assert repl == [(' berg ', [' berg', ' bg']),
(' berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^']),
('berg ', ['berg', 'bg']),
('berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^'])]
def test_replace_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")
assert repl == {('berg ', 'bg '), (' berg ', ' bg '),
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
(' berg ^', 'bg ^'), (' berg ^', ' bg ^')}
assert repl == [(' berg ', [' bg']),
(' berg ^', [' bg ^', 'bg ^']),
('berg ', ['bg']),
('berg ^', [' bg ^', 'bg ^'])]
def test_add_multiple_suffix(self):
repl = self.get_replacements("~berg,~burg -> bg")
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
(' berg ', ' berg '), (' berg ', 'berg '),
('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg '),
('burg ', 'burg '), ('burg ', ' burg '),
(' burg ', ' burg '), (' burg ', 'burg '),
('burg ', 'bg '), ('burg ', ' bg '),
(' burg ', 'bg '), (' burg ', ' bg ')}
assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
(' burg ', [' bg', ' burg', 'bg', 'burg']),
('berg ', [' berg', ' bg', 'berg', 'bg']),
('burg ', [' bg', ' burg', 'bg', 'burg'])]

View File

@@ -0,0 +1,259 @@
"""
Tests for the sanitizer that enables language-dependent analyzers.
"""
import pytest
from nominatim.indexer.place_info import PlaceInfo
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.tools.country_info import setup_country_config
class TestWithDefaults:
@staticmethod
def run_sanitizer_on(country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}]).process_names(place)
return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
def test_no_names(self):
assert self.run_sanitizer_on('de') == []
def test_simple(self):
res = self.run_sanitizer_on('fr', name='Foo',name_de='Zoo', ref_abc='M')
assert res == [('Foo', 'name', None, {}),
('M', 'ref', 'abc', {'analyzer': 'abc'}),
('Zoo', 'name', 'de', {'analyzer': 'de'})]
@pytest.mark.parametrize('suffix', ['DE', 'asbc'])
def test_illegal_suffix(self, suffix):
assert self.run_sanitizer_on('fr', **{'name_' + suffix: 'Foo'}) \
== [('Foo', 'name', suffix, {})]
class TestFilterKind:
@staticmethod
def run_sanitizer_on(filt, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': 'de'})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'filter-kind': filt}]).process_names(place)
return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
def test_single_exact_name(self):
res = self.run_sanitizer_on(['name'], name_fr='A', ref_fr='12',
shortname_fr='C', name='D')
assert res == [('12', 'ref', 'fr', {}),
('A', 'name', 'fr', {'analyzer': 'fr'}),
('C', 'shortname', 'fr', {}),
('D', 'name', None, {})]
def test_single_pattern(self):
res = self.run_sanitizer_on(['.*name'],
name_fr='A', ref_fr='12', namexx_fr='B',
shortname_fr='C', name='D')
assert res == [('12', 'ref', 'fr', {}),
('A', 'name', 'fr', {'analyzer': 'fr'}),
('B', 'namexx', 'fr', {}),
('C', 'shortname', 'fr', {'analyzer': 'fr'}),
('D', 'name', None, {})]
def test_multiple_patterns(self):
res = self.run_sanitizer_on(['.*name', 'ref'],
name_fr='A', ref_fr='12', oldref_fr='X',
namexx_fr='B', shortname_fr='C', name='D')
assert res == [('12', 'ref', 'fr', {'analyzer': 'fr'}),
('A', 'name', 'fr', {'analyzer': 'fr'}),
('B', 'namexx', 'fr', {}),
('C', 'shortname', 'fr', {'analyzer': 'fr'}),
('D', 'name', None, {}),
('X', 'oldref', 'fr', {})]
class TestDefaultCountry:
@pytest.fixture(autouse=True)
def setup_country(self, def_config):
setup_country_config(def_config)
@staticmethod
def run_sanitizer_append(mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
'mode': 'append'}]).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
for p in name)
return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
@staticmethod
def run_sanitizer_replace(mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
'mode': 'replace'}]).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
for p in name)
return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
def test_missing_country(self):
place = PlaceInfo({'name': {'name': 'something'}})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': 'all',
'mode': 'replace'}]).process_names(place)
assert len(name) == 1
assert name[0].name == 'something'
assert name[0].suffix is None
assert 'analyzer' not in name[0].attr
def test_mono_unknown_country(self):
expect = [('XX', '')]
assert self.run_sanitizer_replace('mono', 'xx', name='XX') == expect
assert self.run_sanitizer_append('mono', 'xx', name='XX') == expect
def test_mono_monoling_replace(self):
res = self.run_sanitizer_replace('mono', 'de', name='Foo')
assert res == [('Foo', 'de')]
def test_mono_monoling_append(self):
res = self.run_sanitizer_append('mono', 'de', name='Foo')
assert res == [('Foo', ''), ('Foo', 'de')]
def test_mono_multiling(self):
expect = [('XX', '')]
assert self.run_sanitizer_replace('mono', 'ch', name='XX') == expect
assert self.run_sanitizer_append('mono', 'ch', name='XX') == expect
def test_all_unknown_country(self):
expect = [('XX', '')]
assert self.run_sanitizer_replace('all', 'xx', name='XX') == expect
assert self.run_sanitizer_append('all', 'xx', name='XX') == expect
def test_all_monoling_replace(self):
res = self.run_sanitizer_replace('all', 'de', name='Foo')
assert res == [('Foo', 'de')]
def test_all_monoling_append(self):
res = self.run_sanitizer_append('all', 'de', name='Foo')
assert res == [('Foo', ''), ('Foo', 'de')]
def test_all_multiling_append(self):
res = self.run_sanitizer_append('all', 'ch', name='XX')
assert res == [('XX', ''),
('XX', 'de'), ('XX', 'fr'), ('XX', 'it'), ('XX', 'rm')]
def test_all_multiling_replace(self):
res = self.run_sanitizer_replace('all', 'ch', name='XX')
assert res == [('XX', 'de'), ('XX', 'fr'), ('XX', 'it'), ('XX', 'rm')]
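# Summary of the behaviour pinned down above (a reading of the expected
# results, not a specification): 'use-defaults: mono' only tags names when
# the country has a single default language, 'all' tags one copy per default
# language; 'mode: append' keeps the untagged original next to the tagged
# copies, 'mode: replace' drops it once at least one language could be
# assigned.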
class TestCountryWithWhitelist:
@staticmethod
def run_sanitizer_on(mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
'mode': 'replace',
'whitelist': ['de', 'fr', 'ru']}]).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
for p in name)
return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
def test_mono_monoling(self):
assert self.run_sanitizer_on('mono', 'de', name='Foo') == [('Foo', 'de')]
assert self.run_sanitizer_on('mono', 'pt', name='Foo') == [('Foo', '')]
def test_mono_multiling(self):
assert self.run_sanitizer_on('mono', 'ca', name='Foo') == [('Foo', '')]
def test_all_monoling(self):
assert self.run_sanitizer_on('all', 'de', name='Foo') == [('Foo', 'de')]
assert self.run_sanitizer_on('all', 'pt', name='Foo') == [('Foo', '')]
def test_all_multiling(self):
assert self.run_sanitizer_on('all', 'ca', name='Foo') == [('Foo', 'fr')]
assert self.run_sanitizer_on('all', 'ch', name='Foo') \
== [('Foo', 'de'), ('Foo', 'fr')]
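# Reading of the tests above: when 'use-defaults' and 'whitelist' are
# combined, the country's default languages are filtered through the
# whitelist, e.g. Switzerland (de, fr, it, rm) against ['de', 'fr', 'ru']
# leaves only 'de' and 'fr'.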
class TestWhiteList:
@staticmethod
def run_sanitizer_on(whitelist, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'mode': 'replace',
'whitelist': whitelist}]).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
for p in name)
return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
def test_in_whitelist(self):
assert self.run_sanitizer_on(['de', 'xx'], ref_xx='123') == [('123', 'xx')]
def test_not_in_whitelist(self):
assert self.run_sanitizer_on(['de', 'xx'], ref_yy='123') == [('123', '')]
def test_empty_whitelist(self):
assert self.run_sanitizer_on([], ref_yy='123') == [('123', '')]

View File

@@ -0,0 +1,265 @@
"""
Tests for import name normalisation and variant generation.
"""
import pytest
from icu import Transliterator
import nominatim.tokenizer.token_analysis.generic as module
from nominatim.errors import UsageError
DEFAULT_NORMALIZATION = """ :: NFD ();
'🜳' > ' ';
[[:Nonspacing Mark:] [:Cf:]] >;
:: lower ();
[[:Punctuation:][:Space:]]+ > ' ';
:: NFC ();
"""
DEFAULT_TRANSLITERATION = """ :: Latin ();
'🜵' > ' ';
"""
def make_analyser(*variants, variant_only=False):
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
if variant_only:
rules['mode'] = 'variant-only'
config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
return module.create(trans, config)
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
return proc.get_variants_ascii(norm.transliterate(name).strip())
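# The two helpers above mirror the import pipeline in miniature: a raw name
# is first run through the normalization rules, then the analyzer expands it
# into all transliterated ASCII variants via get_variants_ascii(). Example
# taken from the tables below:
#   proc = make_analyser('~weg => weg')
#   set(get_normalized_variants(proc, "holzweg"))  # == {'holz weg', 'holzweg'}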
def test_no_variants():
rules = { 'analyzer': 'generic' }
config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
proc = module.create(trans, config)
assert get_normalized_variants(proc, '大德!') == ['dà dé']
def test_variants_empty():
proc = make_analyser('saint -> 🜵', 'street -> st')
assert get_normalized_variants(proc, '🜵') == []
assert get_normalized_variants(proc, '🜳') == []
assert get_normalized_variants(proc, 'saint') == ['saint']
VARIANT_TESTS = [
(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
(('weg => wg',), "holzweg", {'holzweg'}),
(('weg -> wg',), "holzweg", {'holzweg'}),
(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg => w',), "holzweg", {'holz w', 'holzw'}),
(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
(('weg => wg',), "Meier Weg", {'meier wg'}),
(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
{'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
(('am => a', 'bach => b'), "am bach", {'a b'}),
(('am => a', '~bach => b'), "am bach", {'a b'}),
(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
(('saint -> s,st', 'street -> st'), "Saint Johns Street",
{'saint johns street', 's johns street', 'st johns street',
'saint johns st', 's johns st', 'st johns st'}),
(('river$ -> r',), "River Bend Road", {'river bend road'}),
(('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
(('^north => n',), "North 2nd Street", {'n 2nd street'}),
(('^north => n',), "Airport North", {'airport north'}),
(('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
(('am => a',), "am am am am am am am am", {'a a a a a a a a'})
]
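# Notation used in the variant rules above, as far as it can be read off the
# expected sets: '->' adds the right-hand side as an extra variant and keeps
# the original, '=>' replaces the original; a leading '~' also matches the
# term as the final part of a compound and allows it to be split off, so
# 'holzweg' additionally yields 'holz weg'; '^' and '$' anchor a rule to the
# beginning or end of the name.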
@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
def test_variants(rules, name, variants):
proc = make_analyser(*rules)
result = get_normalized_variants(proc, name)
assert len(result) == len(set(result))
assert set(result) == variants
VARIANT_ONLY_TESTS = [
(('weg => wg',), "hallo", set()),
(('weg => wg',), "Meier Weg", {'meier wg'}),
(('weg -> wg',), "Meier Weg", {'meier wg'}),
]
@pytest.mark.parametrize("rules,name,variants", VARIANT_ONLY_TESTS)
def test_variants_only(rules, name, variants):
proc = make_analyser(*rules, variant_only=True)
result = get_normalized_variants(proc, name)
assert len(result) == len(set(result))
assert set(result) == variants
class TestGetReplacements:
@staticmethod
def configure_rules(*variants):
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
return module.configure(rules, DEFAULT_NORMALIZATION)
def get_replacements(self, *variants):
config = self.configure_rules(*variants)
return sorted((k, sorted(v)) for k, v in config['replacements'])
@pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
'~foo~ -> bar', 'fo~ o -> bar'])
def test_invalid_variant_description(self, variant):
with pytest.raises(UsageError):
self.configure_rules(variant)
@pytest.mark.parametrize("rule", ["!!! -> bar", "bar => !!!"])
def test_ignore_unnormalizable_terms(self, rule):
repl = self.get_replacements(rule)
assert repl == []
def test_add_full(self):
repl = self.get_replacements("foo -> bar")
assert repl == [(' foo ', [' bar', ' foo'])]
def test_replace_full(self):
repl = self.get_replacements("foo => bar")
assert repl == [(' foo ', [' bar'])]
def test_add_suffix_no_decompose(self):
repl = self.get_replacements("~berg |-> bg")
assert repl == [(' berg ', [' berg', ' bg']),
('berg ', ['berg', 'bg'])]
def test_replace_suffix_no_decompose(self):
repl = self.get_replacements("~berg |=> bg")
assert repl == [(' berg ', [' bg']), ('berg ', ['bg'])]
def test_add_suffix_decompose(self):
repl = self.get_replacements("~berg -> bg")
assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
('berg ', [' berg', ' bg', 'berg', 'bg'])]
def test_replace_suffix_decompose(self):
repl = self.get_replacements("~berg => bg")
assert repl == [(' berg ', [' bg', 'bg']),
('berg ', [' bg', 'bg'])]
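# Reading of the four suffix tests above: the '|' modifier in '|->' and '|=>'
# switches decomposition off, so the attached form ('berg ') only yields
# attached variants and the free-standing form (' berg ') only free-standing
# ones, instead of each producing both as in the plain '->'/'=>' case.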
def test_add_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |-> hnt")
assert repl == [(' hinter', [' hinter', ' hnt']),
(' hinter ', [' hinter', ' hnt'])]
def test_replace_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |=> hnt")
assert repl == [(' hinter', [' hnt']), (' hinter ', [' hnt'])]
def test_add_prefix_compose(self):
repl = self.get_replacements("hinter~-> h")
assert repl == [(' hinter', [' h', ' h ', ' hinter', ' hinter ']),
(' hinter ', [' h', ' h', ' hinter', ' hinter'])]
def test_replace_prefix_compose(self):
repl = self.get_replacements("hinter~=> h")
assert repl == [(' hinter', [' h', ' h ']),
(' hinter ', [' h', ' h'])]
def test_add_beginning_only(self):
repl = self.get_replacements("^Premier -> Pr")
assert repl == [('^ premier ', ['^ pr', '^ premier'])]
def test_replace_beginning_only(self):
repl = self.get_replacements("^Premier => Pr")
assert repl == [('^ premier ', ['^ pr'])]
def test_add_final_only(self):
repl = self.get_replacements("road$ -> rd")
assert repl == [(' road ^', [' rd ^', ' road ^'])]
def test_replace_final_only(self):
repl = self.get_replacements("road$ => rd")
assert repl == [(' road ^', [' rd ^'])]
def test_decompose_only(self):
repl = self.get_replacements("~foo -> foo")
assert repl == [(' foo ', [' foo', 'foo']),
('foo ', [' foo', 'foo'])]
def test_add_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")
assert repl == [(' berg ', [' berg', ' bg']),
(' berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^']),
('berg ', ['berg', 'bg']),
('berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^'])]
def test_replace_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")
assert repl == [(' berg ', [' bg']),
(' berg ^', [' bg ^', 'bg ^']),
('berg ', ['bg']),
('berg ^', [' bg ^', 'bg ^'])]
@pytest.mark.parametrize('rule', ["~berg,~burg -> bg",
"~berg, ~burg -> bg",
"~berg,,~burg -> bg"])
def test_add_multiple_suffix(self, rule):
repl = self.get_replacements(rule)
assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
(' burg ', [' bg', ' burg', 'bg', 'burg']),
('berg ', [' berg', ' bg', 'berg', 'bg']),
('burg ', [' bg', ' burg', 'bg', 'burg'])]