diff --git a/docs/admin/Tokenizers.md b/docs/admin/Tokenizers.md index 6f8898c8..90d0fb5e 100644 --- a/docs/admin/Tokenizers.md +++ b/docs/admin/Tokenizers.md @@ -60,22 +60,23 @@ NOMINATIM_TOKENIZER=icu ### How it works -On import the tokenizer processes names in the following four stages: +On import the tokenizer processes names in the following three stages: -1. The **Normalization** part removes all non-relevant information from the - input. -2. Incoming names are now converted to **full names**. This process is currently - hard coded and mostly serves to handle name tags from OSM that contain - multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)). -3. Next the tokenizer creates **variants** from the full names. These variants - cover decomposition and abbreviation handling. Variants are saved to the - database, so that it is not necessary to create the variants for a search - query. -4. The final **Tokenization** step converts the names to a simple ASCII form, - potentially removing further spelling variants for better matching. +1. During the **Sanitizer step** incoming names are cleaned up and converted to + **full names**. This step can be used to regularize spelling, split multi-name + tags into their parts and tag names with additional attributes. See the + [Sanitizers section](#sanitizers) below for available cleaning routines. +2. The **Normalization** part removes all information from the full names + that are not relevant for search. +3. The **Token analysis** step takes the normalized full names and creates + all transliterated variants under which the name should be searchable. + See the [Token analysis](#token-analysis) section below for more + information. -At query time only stage 1) and 4) are used. The query is normalized and -tokenized and the resulting string used for searching in the database. +During query time, only normalization and transliteration are relevant. +An incoming query is first split into name chunks (this usually means splitting +the string at the commas) and the each part is normalised and transliterated. +The result is used to look up places in the search index. ### Configuration @@ -93,21 +94,36 @@ normalization: transliteration: - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml - ":: Ascii ()" -variants: - - language: de - words: - - ~haus => haus - - ~strasse -> str - - language: en - words: - - road -> rd - - bridge -> bdge,br,brdg,bri,brg +sanitizers: + - step: split-name-list +token-analysis: + - analyzer: generic + variants: + - !include icu-rules/variants-ca.yaml + - words: + - road -> rd + - bridge -> bdge,br,brdg,bri,brg ``` -The configuration file contains three sections: -`normalization`, `transliteration`, `variants`. +The configuration file contains four sections: +`normalization`, `transliteration`, `sanitizers` and `token-analysis`. -The normalization and transliteration sections each must contain a list of +#### Normalization and Transliteration + +The normalization and transliteration sections each define a set of +ICU rules that are applied to the names. + +The **normalisation** rules are applied after sanitation. They should remove +any information that is not relevant for search at all. Usual rules to be +applied here are: lower-casing, removing of special characters, cleanup of +spaces. + +The **transliteration** rules are applied at the end of the tokenization +process to transfer the name into an ASCII representation. 
Transliteration can +be useful to allow for further fuzzy matching, especially between different +scripts. + +Each section must contain a list of [ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html). The rules are applied in the order in which they appear in the file. You can also include additional rules from external yaml file using the @@ -119,6 +135,85 @@ and may again include other files. YAML syntax. You should therefore always enclose the ICU rules in double-quotes. +#### Sanitizers + +The sanitizers section defines an ordered list of functions that are applied +to the name and address tags before they are further processed by the tokenizer. +They allows to clean up the tagging and bring it to a standardized form more +suitable for building the search index. + +!!! hint + Sanitizers only have an effect on how the search index is built. They + do not change the information about each place that is saved in the + database. In particular, they have no influence on how the results are + displayed. The returned results always show the original information as + stored in the OpenStreetMap database. + +Each entry contains information of a sanitizer to be applied. It has a +mandatory parameter `step` which gives the name of the sanitizer. Depending +on the type, it may have additional parameters to configure its operation. + +The order of the list matters. The sanitizers are applied exactly in the order +that is configured. Each sanitizer works on the results of the previous one. + +The following is a list of sanitizers that are shipped with Nominatim. + +##### split-name-list + +::: nominatim.tokenizer.sanitizers.split_name_list + selection: + members: False + rendering: + heading_level: 6 + +##### strip-brace-terms + +::: nominatim.tokenizer.sanitizers.strip_brace_terms + selection: + members: False + rendering: + heading_level: 6 + +##### tag-analyzer-by-language + +::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language + selection: + members: False + rendering: + heading_level: 6 + + + +#### Token Analysis + +Token analyzers take a full name and transform it into one or more normalized +form that are then saved in the search index. In its simplest form, the +analyzer only applies the transliteration rules. More complex analyzers +create additional spelling variants of a name. This is useful to handle +decomposition and abbreviation. + +The ICU tokenizer may use different analyzers for different names. To select +the analyzer to be used, the name must be tagged with the `analyzer` attribute +by a sanitizer (see for example the +[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)). + +The token-analysis section contains the list of configured analyzers. Each +analyzer must have an `id` parameter that uniquely identifies the analyzer. +The only exception is the default analyzer that is used when no special +analyzer was selected. + +Different analyzer implementations may exist. To select the implementation, +the `analyzer` parameter must be set. Currently there is only one implementation +`generic` which is described in the following. + +##### Generic token analyzer + +The generic analyzer is able to create variants from a list of given +abbreviation and decomposition replacements. It takes one optional parameter +`variants` which lists the replacements to apply. If the section is +omitted, then the generic analyzer becomes a simple analyzer that only +applies the transliteration. 
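+For illustration, a minimal sketch of a `token-analysis` entry that uses the
+generic analyzer with a small custom replacement list might look like this
+(the abbreviation rules shown here are examples only, not the shipped
+configuration):
+
+```yaml
+token-analysis:
+  - analyzer: generic
+    variants:
+      - words:
+          - ~strasse => str   # replace the suffix "strasse" with "str"
+          - road -> rd        # keep "road" and additionally index "rd"
+```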
+ The variants section defines lists of replacements which create alternative spellings of a name. To create the variants, a name is scanned from left to right and the longest matching replacement is applied until the end of the @@ -144,7 +239,7 @@ term. words in the configuration because then it is possible to change the rules for normalization later without having to adapt the variant rules. -#### Decomposition +###### Decomposition In its standard form, only full words match against the source. There is a special notation to match the prefix and suffix of a word: @@ -171,7 +266,7 @@ To avoid automatic decomposition, use the '|' notation: simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str". -#### Initial and final terms +###### Initial and final terms It is also possible to restrict replacements to the beginning and end of a name: @@ -184,7 +279,7 @@ name: So the first example would trigger a replacement for "south 45th street" but not for "the south beach restaurant". -#### Replacements vs. variants +###### Replacements vs. variants The replacement syntax `source => target` works as a pure replacement. It changes the name instead of creating a variant. To create an additional version, you'd diff --git a/nominatim/config.py b/nominatim/config.py index 64614bf1..f316280b 100644 --- a/nominatim/config.py +++ b/nominatim/config.py @@ -12,6 +12,27 @@ from nominatim.errors import UsageError LOG = logging.getLogger() + +def flatten_config_list(content, section=''): + """ Flatten YAML configuration lists that contain include sections + which are lists themselves. + """ + if not content: + return [] + + if not isinstance(content, list): + raise UsageError(f"List expected in section '{section}'.") + + output = [] + for ele in content: + if isinstance(ele, list): + output.extend(flatten_config_list(ele, section)) + else: + output.append(ele) + + return output + + class Configuration: """ Load and manage the project configuration. diff --git a/nominatim/tokenizer/base.py b/nominatim/tokenizer/base.py index 53289c78..02bc312f 100644 --- a/nominatim/tokenizer/base.py +++ b/nominatim/tokenizer/base.py @@ -194,15 +194,13 @@ class AbstractTokenizer(ABC): """ Check that the database is set up correctly and ready for being queried. - Returns: - If an issue was found, return an error message with the - description of the issue as well as hints for the user on - how to resolve the issue. - Arguments: config: Read-only object with configuration options. - Return `None`, if no issue was found. + Returns: + If an issue was found, return an error message with the + description of the issue as well as hints for the user on + how to resolve the issue. If everything is okay, return `None`. """ pass diff --git a/nominatim/tokenizer/icu_name_processor.py b/nominatim/tokenizer/icu_name_processor.py deleted file mode 100644 index 544f5ebc..00000000 --- a/nominatim/tokenizer/icu_name_processor.py +++ /dev/null @@ -1,104 +0,0 @@ -""" -Processor for names that are imported into the database based on the -ICU library. -""" -from collections import defaultdict -import itertools - -from icu import Transliterator -import datrie - - -class ICUNameProcessor: - """ Collects the different transformation rules for normalisation of names - and provides the functions to apply the transformations. 
- """ - - def __init__(self, norm_rules, trans_rules, replacements): - self.normalizer = Transliterator.createFromRules("icu_normalization", - norm_rules) - self.to_ascii = Transliterator.createFromRules("icu_to_ascii", - trans_rules + - ";[:Space:]+ > ' '") - self.search = Transliterator.createFromRules("icu_search", - norm_rules + trans_rules) - - # Intermediate reorder by source. Also compute required character set. - immediate = defaultdict(list) - chars = set() - for variant in replacements: - if variant.source[-1] == ' ' and variant.replacement[-1] == ' ': - replstr = variant.replacement[:-1] - else: - replstr = variant.replacement - immediate[variant.source].append(replstr) - chars.update(variant.source) - # Then copy to datrie - self.replacements = datrie.Trie(''.join(chars)) - for src, repllist in immediate.items(): - self.replacements[src] = repllist - - - def get_normalized(self, name): - """ Normalize the given name, i.e. remove all elements not relevant - for search. - """ - return self.normalizer.transliterate(name).strip() - - def get_variants_ascii(self, norm_name): - """ Compute the spelling variants for the given normalized name - and transliterate the result. - """ - baseform = '^ ' + norm_name + ' ^' - partials = [''] - - startpos = 0 - pos = 0 - force_space = False - while pos < len(baseform): - full, repl = self.replacements.longest_prefix_item(baseform[pos:], - (None, None)) - if full is not None: - done = baseform[startpos:pos] - partials = [v + done + r - for v, r in itertools.product(partials, repl) - if not force_space or r.startswith(' ')] - if len(partials) > 128: - # If too many variants are produced, they are unlikely - # to be helpful. Only use the original term. - startpos = 0 - break - startpos = pos + len(full) - if full[-1] == ' ': - startpos -= 1 - force_space = True - pos = startpos - else: - pos += 1 - force_space = False - - # No variants detected? Fast return. - if startpos == 0: - trans_name = self.to_ascii.transliterate(norm_name).strip() - return [trans_name] if trans_name else [] - - return self._compute_result_set(partials, baseform[startpos:]) - - - def _compute_result_set(self, partials, prefix): - results = set() - - for variant in partials: - vname = variant + prefix - trans_name = self.to_ascii.transliterate(vname[1:-1]).strip() - if trans_name: - results.add(trans_name) - - return list(results) - - - def get_search_normalized(self, name): - """ Return the normalized version of the name (including transliteration) - to be applied at search time. - """ - return self.search.transliterate(' ' + name + ' ').strip() diff --git a/nominatim/tokenizer/icu_rule_loader.py b/nominatim/tokenizer/icu_rule_loader.py index 330179bb..b8551038 100644 --- a/nominatim/tokenizer/icu_rule_loader.py +++ b/nominatim/tokenizer/icu_rule_loader.py @@ -1,19 +1,17 @@ """ Helper class to create ICU rules from a configuration file. 
""" +import importlib import io import json import logging -import itertools -import re - -from icu import Transliterator +from nominatim.config import flatten_config_list from nominatim.db.properties import set_property, get_property from nominatim.errors import UsageError -from nominatim.tokenizer.icu_name_processor import ICUNameProcessor from nominatim.tokenizer.place_sanitizer import PlaceSanitizer -import nominatim.tokenizer.icu_variants as variants +from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis +import nominatim.tools.country_info LOG = logging.getLogger() @@ -22,33 +20,15 @@ DBCFG_IMPORT_TRANS_RULES = "tokenizer_import_transliteration" DBCFG_IMPORT_ANALYSIS_RULES = "tokenizer_import_analysis_rules" -def _flatten_config_list(content): - if not content: - return [] - - if not isinstance(content, list): - raise UsageError("List expected in ICU configuration.") - - output = [] - for ele in content: - if isinstance(ele, list): - output.extend(_flatten_config_list(ele)) - else: - output.append(ele) - - return output - - -class VariantRule: - """ Saves a single variant expansion. - - An expansion consists of the normalized replacement term and - a dicitonary of properties that describe when the expansion applies. +def _get_section(rules, section): + """ Get the section named 'section' from the rules. If the section does + not exist, raise a usage error with a meaningful message. """ + if section not in rules: + LOG.fatal("Section '%s' not found in tokenizer config.", section) + raise UsageError("Syntax error in tokenizer configuration file.") - def __init__(self, replacement, properties): - self.replacement = replacement - self.properties = properties or {} + return rules[section] class ICURuleLoader: @@ -59,12 +39,13 @@ class ICURuleLoader: rules = config.load_sub_configuration('icu_tokenizer.yaml', config='TOKENIZER_CONFIG') - self.variants = set() + # Make sure country information is available to analyzers and sanatizers. + nominatim.tools.country_info.setup_country_config(config) self.normalization_rules = self._cfg_to_icu_rules(rules, 'normalization') self.transliteration_rules = self._cfg_to_icu_rules(rules, 'transliteration') - self.analysis_rules = self._get_section(rules, 'variants') - self._parse_variant_list() + self.analysis_rules = _get_section(rules, 'token-analysis') + self._setup_analysis() # Load optional sanitizer rule set. self.sanitizer_rules = rules.get('sanitizers', []) @@ -77,7 +58,7 @@ class ICURuleLoader: self.normalization_rules = get_property(conn, DBCFG_IMPORT_NORM_RULES) self.transliteration_rules = get_property(conn, DBCFG_IMPORT_TRANS_RULES) self.analysis_rules = json.loads(get_property(conn, DBCFG_IMPORT_ANALYSIS_RULES)) - self._parse_variant_list() + self._setup_analysis() def save_config_to_db(self, conn): @@ -98,9 +79,8 @@ class ICURuleLoader: def make_token_analysis(self): """ Create a token analyser from the reviouly loaded rules. """ - return ICUNameProcessor(self.normalization_rules, - self.transliteration_rules, - self.variants) + return ICUTokenAnalysis(self.normalization_rules, + self.transliteration_rules, self.analysis) def get_search_rules(self): @@ -115,159 +95,66 @@ class ICURuleLoader: rules.write(self.transliteration_rules) return rules.getvalue() + def get_normalization_rules(self): """ Return rules for normalisation of a term. """ return self.normalization_rules + def get_transliteration_rules(self): """ Return the rules for converting a string into its asciii representation. 
""" return self.transliteration_rules - def get_replacement_pairs(self): - """ Return the list of possible compound decompositions with - application of abbreviations included. - The result is a list of pairs: the first item is the sequence to - replace, the second is a list of replacements. + + def _setup_analysis(self): + """ Process the rules used for creating the various token analyzers. """ - return self.variants + self.analysis = {} + + if not isinstance(self.analysis_rules, list): + raise UsageError("Configuration section 'token-analysis' must be a list.") + + for section in self.analysis_rules: + name = section.get('id', None) + if name in self.analysis: + if name is None: + LOG.fatal("ICU tokenizer configuration has two default token analyzers.") + else: + LOG.fatal("ICU tokenizer configuration has two token " + "analyzers with id '%s'.", name) + raise UsageError("Syntax error in ICU tokenizer config.") + self.analysis[name] = TokenAnalyzerRule(section, self.normalization_rules) @staticmethod - def _get_section(rules, section): - """ Get the section named 'section' from the rules. If the section does - not exist, raise a usage error with a meaningful message. - """ - if section not in rules: - LOG.fatal("Section '%s' not found in tokenizer config.", section) - raise UsageError("Syntax error in tokenizer configuration file.") - - return rules[section] - - - def _cfg_to_icu_rules(self, rules, section): + def _cfg_to_icu_rules(rules, section): """ Load an ICU ruleset from the given section. If the section is a simple string, it is interpreted as a file name and the rules are loaded verbatim from the given file. The filename is expected to be relative to the tokenizer rule file. If the section is a list then each line is assumed to be a rule. All rules are concatenated and returned. """ - content = self._get_section(rules, section) + content = _get_section(rules, section) if content is None: return '' - return ';'.join(_flatten_config_list(content)) + ';' + return ';'.join(flatten_config_list(content, section)) + ';' - def _parse_variant_list(self): - rules = self.analysis_rules - - self.variants.clear() - - if not rules: - return - - rules = _flatten_config_list(rules) - - vmaker = _VariantMaker(self.normalization_rules) - - properties = [] - for section in rules: - # Create the property field and deduplicate against existing - # instances. - props = variants.ICUVariantProperties.from_rules(section) - for existing in properties: - if existing == props: - props = existing - break - else: - properties.append(props) - - for rule in (section.get('words') or []): - self.variants.update(vmaker.compute(rule, props)) - - -class _VariantMaker: - """ Generater for all necessary ICUVariants from a single variant rule. - - All text in rules is normalized to make sure the variants match later. +class TokenAnalyzerRule: + """ Factory for a single analysis module. The class saves the configuration + and creates a new token analyzer on request. """ - def __init__(self, norm_rules): - self.norm = Transliterator.createFromRules("rule_loader_normalization", - norm_rules) + def __init__(self, rules, normalization_rules): + # Find the analysis module + module_name = 'nominatim.tokenizer.token_analysis.' \ + + _get_section(rules, 'analyzer').replace('-', '_') + analysis_mod = importlib.import_module(module_name) + self.create = analysis_mod.create - - def compute(self, rule, props): - """ Generator for all ICUVariant tuples from a single variant rule. 
- """ - parts = re.split(r'(\|)?([=-])>', rule) - if len(parts) != 4: - raise UsageError("Syntax error in variant rule: " + rule) - - decompose = parts[1] is None - src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')] - repl_terms = (self.norm.transliterate(t.strip()) for t in parts[3].split(',')) - - # If the source should be kept, add a 1:1 replacement - if parts[2] == '-': - for src in src_terms: - if src: - for froms, tos in _create_variants(*src, src[0], decompose): - yield variants.ICUVariant(froms, tos, props) - - for src, repl in itertools.product(src_terms, repl_terms): - if src and repl: - for froms, tos in _create_variants(*src, repl, decompose): - yield variants.ICUVariant(froms, tos, props) - - - def _parse_variant_word(self, name): - name = name.strip() - match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name) - if match is None or (match.group(1) == '~' and match.group(3) == '~'): - raise UsageError("Invalid variant word descriptor '{}'".format(name)) - norm_name = self.norm.transliterate(match.group(2)) - if not norm_name: - return None - - return norm_name, match.group(1), match.group(3) - - -_FLAG_MATCH = {'^': '^ ', - '$': ' ^', - '': ' '} - - -def _create_variants(src, preflag, postflag, repl, decompose): - if preflag == '~': - postfix = _FLAG_MATCH[postflag] - # suffix decomposition - src = src + postfix - repl = repl + postfix - - yield src, repl - yield ' ' + src, ' ' + repl - - if decompose: - yield src, ' ' + repl - yield ' ' + src, repl - elif postflag == '~': - # prefix decomposition - prefix = _FLAG_MATCH[preflag] - src = prefix + src - repl = prefix + repl - - yield src, repl - yield src + ' ', repl + ' ' - - if decompose: - yield src, repl + ' ' - yield src + ' ', repl - else: - prefix = _FLAG_MATCH[preflag] - postfix = _FLAG_MATCH[postflag] - - yield prefix + src + postfix, prefix + repl + postfix + # Load the configuration. + self.config = analysis_mod.configure(rules, normalization_rules) diff --git a/nominatim/tokenizer/icu_token_analysis.py b/nominatim/tokenizer/icu_token_analysis.py new file mode 100644 index 00000000..f27a2fbe --- /dev/null +++ b/nominatim/tokenizer/icu_token_analysis.py @@ -0,0 +1,23 @@ +""" +Container class collecting all components required to transform an OSM name +into a Nominatim token. +""" + +from icu import Transliterator + +class ICUTokenAnalysis: + """ Container class collecting the transliterators and token analysis + modules for a single NameAnalyser instance. + """ + + def __init__(self, norm_rules, trans_rules, analysis_rules): + self.normalizer = Transliterator.createFromRules("icu_normalization", + norm_rules) + trans_rules += ";[:Space:]+ > ' '" + self.to_ascii = Transliterator.createFromRules("icu_to_ascii", + trans_rules) + self.search = Transliterator.createFromRules("icu_search", + norm_rules + trans_rules) + + self.analysis = {name: arules.create(self.to_ascii, arules.config) + for name, arules in analysis_rules.items()} diff --git a/nominatim/tokenizer/icu_tokenizer.py b/nominatim/tokenizer/icu_tokenizer.py index 2ece10f2..12d1eccd 100644 --- a/nominatim/tokenizer/icu_tokenizer.py +++ b/nominatim/tokenizer/icu_tokenizer.py @@ -164,7 +164,7 @@ class LegacyICUTokenizer(AbstractTokenizer): """ Count the partial terms from the names in the place table. 
""" words = Counter() - name_proc = self.loader.make_token_analysis() + analysis = self.loader.make_token_analysis() with conn.cursor(name="words") as cur: cur.execute(""" SELECT v, count(*) FROM @@ -172,12 +172,10 @@ class LegacyICUTokenizer(AbstractTokenizer): WHERE length(v) < 75 GROUP BY v""") for name, cnt in cur: - terms = set() - for word in name_proc.get_variants_ascii(name_proc.get_normalized(name)): - if ' ' in word: - terms.update(word.split()) - for term in terms: - words[term] += cnt + word = analysis.search.transliterate(name) + if word and ' ' in word: + for term in set(word.split()): + words[term] += cnt return words @@ -209,14 +207,14 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer): def _search_normalized(self, name): """ Return the search token transliteration of the given name. """ - return self.token_analysis.get_search_normalized(name) + return self.token_analysis.search.transliterate(name).strip() def _normalized(self, name): """ Return the normalized version of the given name with all non-relevant information removed. """ - return self.token_analysis.get_normalized(name) + return self.token_analysis.normalizer.transliterate(name).strip() def get_word_token_info(self, words): @@ -456,6 +454,7 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer): if addr_terms: token_info.add_address_terms(addr_terms) + def _compute_partial_tokens(self, name): """ Normalize the given term, split it into partial words and return then token list for them. @@ -492,19 +491,25 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer): partial_tokens = set() for name in names: + analyzer_id = name.get_attr('analyzer') norm_name = self._normalized(name.name) - full, part = self._cache.names.get(norm_name, (None, None)) + if analyzer_id is None: + token_id = norm_name + else: + token_id = f'{norm_name}@{analyzer_id}' + + full, part = self._cache.names.get(token_id, (None, None)) if full is None: - variants = self.token_analysis.get_variants_ascii(norm_name) + variants = self.token_analysis.analysis[analyzer_id].get_variants_ascii(norm_name) if not variants: continue with self.conn.cursor() as cur: cur.execute("SELECT (getorcreate_full_word(%s, %s)).*", - (norm_name, variants)) + (token_id, variants)) full, part = cur.fetchone() - self._cache.names[norm_name] = (full, part) + self._cache.names[token_id] = (full, part) full_tokens.add(full) partial_tokens.update(part) diff --git a/nominatim/tokenizer/icu_variants.py b/nominatim/tokenizer/icu_variants.py deleted file mode 100644 index 93272f58..00000000 --- a/nominatim/tokenizer/icu_variants.py +++ /dev/null @@ -1,25 +0,0 @@ -""" -Data structures for saving variant expansions for ICU tokenizer. -""" -from collections import namedtuple - -_ICU_VARIANT_PORPERTY_FIELDS = ['lang'] - - -class ICUVariantProperties(namedtuple('_ICUVariantProperties', _ICU_VARIANT_PORPERTY_FIELDS)): - """ Data container for saving properties that describe when a variant - should be applied. - - Property instances are hashable. - """ - @classmethod - def from_rules(cls, _): - """ Create a new property type from a generic dictionary. - - The function only takes into account the properties that are - understood presently and ignores all others. 
- """ - return cls(lang=None) - - -ICUVariant = namedtuple('ICUVariant', ['source', 'replacement', 'properties']) diff --git a/nominatim/tokenizer/sanitizers/split_name_list.py b/nominatim/tokenizer/sanitizers/split_name_list.py index f1514203..86385985 100644 --- a/nominatim/tokenizer/sanitizers/split_name_list.py +++ b/nominatim/tokenizer/sanitizers/split_name_list.py @@ -1,5 +1,9 @@ """ -Name processor that splits name values with multiple values into their components. +Sanitizer that splits lists of names into their components. + +Arguments: + delimiters: Define the set of characters to be used for + splitting the list. (default: `,;`) """ import re @@ -7,9 +11,7 @@ from nominatim.errors import UsageError def create(func): """ Create a name processing function that splits name values with - multiple values into their components. The optional parameter - 'delimiters' can be used to define the characters that should be used - for splitting. The default is ',;'. + multiple values into their components. """ delimiter_set = set(func.get('delimiters', ',;')) if not delimiter_set: @@ -24,7 +26,6 @@ def create(func): new_names = [] for name in obj.names: split_names = regexp.split(name.name) - print(split_names) if len(split_names) == 1: new_names.append(name) else: diff --git a/nominatim/tokenizer/sanitizers/strip_brace_terms.py b/nominatim/tokenizer/sanitizers/strip_brace_terms.py index ec91bac9..caadc815 100644 --- a/nominatim/tokenizer/sanitizers/strip_brace_terms.py +++ b/nominatim/tokenizer/sanitizers/strip_brace_terms.py @@ -1,11 +1,12 @@ """ -Sanitizer handling names with addendums in braces. +This sanitizer creates additional name variants for names that have +addendums in brackets (e.g. "Halle (Saale)"). The additional variant contains +only the main name part with the bracket part removed. """ def create(_): """ Create a name processing function that creates additional name variants - when a name has an addendum in brackets (e.g. "Halle (Saale)"). The - additional variant only contains the main name without the bracket part. + for bracket addendums. """ def _process(obj): """ Add variants for names that have a bracket extension. diff --git a/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py b/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py new file mode 100644 index 00000000..739e9313 --- /dev/null +++ b/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py @@ -0,0 +1,103 @@ +""" +This sanitizer sets the `analyzer` property depending on the +language of the tag. The language is taken from the suffix of the name. +If a name already has an analyzer tagged, then this is kept. + +Arguments: + + filter-kind: Restrict the names the sanitizer should be applied to + to the given tags. The parameter expects a list of + regular expressions which are matched against `kind`. + Note that a match against the full string is expected. + whitelist: Restrict the set of languages that should be tagged. + Expects a list of acceptable suffixes. When unset, + all 2- and 3-letter lower-case codes are accepted. + use-defaults: Configure what happens when the name has no suffix. + When set to 'all', a variant is created for + each of the default languages in the country + the feature is in. When set to 'mono', a variant is + only created, when exactly one language is spoken + in the country. The default is to do nothing with + the default languages of a country. + mode: Define how the variants are created and may be 'replace' or + 'append'. 
When set to 'append' the original name (without + any analyzer tagged) is retained. (default: replace) + +""" +import re + +from nominatim.tools import country_info + +class _AnalyzerByLanguage: + """ Processor for tagging the language of names in a place. + """ + + def __init__(self, config): + if 'filter-kind' in config: + self.regexes = [re.compile(regex) for regex in config['filter-kind']] + else: + self.regexes = None + + self.replace = config.get('mode', 'replace') != 'append' + self.whitelist = config.get('whitelist') + + self.__compute_default_languages(config.get('use-defaults', 'no')) + + + def __compute_default_languages(self, use_defaults): + self.deflangs = {} + + if use_defaults in ('mono', 'all'): + for ccode, prop in country_info.iterate(): + clangs = prop['languages'] + if len(clangs) == 1 or use_defaults == 'all': + if self.whitelist: + self.deflangs[ccode] = [l for l in clangs if l in self.whitelist] + else: + self.deflangs[ccode] = clangs + + + def _kind_matches(self, kind): + if self.regexes is None: + return True + + return any(regex.fullmatch(kind) for regex in self.regexes) + + + def _suffix_matches(self, suffix): + if self.whitelist is None: + return len(suffix) in (2, 3) and suffix.islower() + + return suffix in self.whitelist + + + def __call__(self, obj): + if not obj.names: + return + + more_names = [] + + for name in (n for n in obj.names + if not n.has_attr('analyzer') and self._kind_matches(n.kind)): + if name.suffix: + langs = [name.suffix] if self._suffix_matches(name.suffix) else None + else: + langs = self.deflangs.get(obj.place.country_code) + + + if langs: + if self.replace: + name.set_attr('analyzer', langs[0]) + else: + more_names.append(name.clone(attr={'analyzer': langs[0]})) + + more_names.extend(name.clone(attr={'analyzer': l}) for l in langs[1:]) + + obj.names.extend(more_names) + + +def create(config): + """ Create a function that sets the analyzer property depending on the + language of the tag. + """ + return _AnalyzerByLanguage(config) diff --git a/nominatim/tokenizer/token_analysis/__init__.py b/nominatim/tokenizer/token_analysis/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/nominatim/tokenizer/token_analysis/generic.py b/nominatim/tokenizer/token_analysis/generic.py new file mode 100644 index 00000000..4b47889e --- /dev/null +++ b/nominatim/tokenizer/token_analysis/generic.py @@ -0,0 +1,224 @@ +""" +Generic processor for names that creates abbreviation variants. +""" +from collections import defaultdict, namedtuple +import itertools +import re + +from icu import Transliterator +import datrie + +from nominatim.config import flatten_config_list +from nominatim.errors import UsageError + +### Configuration section + +ICUVariant = namedtuple('ICUVariant', ['source', 'replacement']) + +def configure(rules, normalization_rules): + """ Extract and preprocess the configuration for this module. + """ + config = {} + + config['replacements'], config['chars'] = _get_variant_config(rules.get('variants'), + normalization_rules) + config['variant_only'] = rules.get('mode', '') == 'variant-only' + + return config + + +def _get_variant_config(rules, normalization_rules): + """ Convert the variant definition from the configuration into + replacement sets. 
+ """ + immediate = defaultdict(list) + chars = set() + + if rules: + vset = set() + rules = flatten_config_list(rules, 'variants') + + vmaker = _VariantMaker(normalization_rules) + + for section in rules: + for rule in (section.get('words') or []): + vset.update(vmaker.compute(rule)) + + # Intermediate reorder by source. Also compute required character set. + for variant in vset: + if variant.source[-1] == ' ' and variant.replacement[-1] == ' ': + replstr = variant.replacement[:-1] + else: + replstr = variant.replacement + immediate[variant.source].append(replstr) + chars.update(variant.source) + + return list(immediate.items()), ''.join(chars) + + +class _VariantMaker: + """ Generater for all necessary ICUVariants from a single variant rule. + + All text in rules is normalized to make sure the variants match later. + """ + + def __init__(self, norm_rules): + self.norm = Transliterator.createFromRules("rule_loader_normalization", + norm_rules) + + + def compute(self, rule): + """ Generator for all ICUVariant tuples from a single variant rule. + """ + parts = re.split(r'(\|)?([=-])>', rule) + if len(parts) != 4: + raise UsageError("Syntax error in variant rule: " + rule) + + decompose = parts[1] is None + src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')] + repl_terms = (self.norm.transliterate(t).strip() for t in parts[3].split(',')) + + # If the source should be kept, add a 1:1 replacement + if parts[2] == '-': + for src in src_terms: + if src: + for froms, tos in _create_variants(*src, src[0], decompose): + yield ICUVariant(froms, tos) + + for src, repl in itertools.product(src_terms, repl_terms): + if src and repl: + for froms, tos in _create_variants(*src, repl, decompose): + yield ICUVariant(froms, tos) + + + def _parse_variant_word(self, name): + name = name.strip() + match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name) + if match is None or (match.group(1) == '~' and match.group(3) == '~'): + raise UsageError("Invalid variant word descriptor '{}'".format(name)) + norm_name = self.norm.transliterate(match.group(2)).strip() + if not norm_name: + return None + + return norm_name, match.group(1), match.group(3) + + +_FLAG_MATCH = {'^': '^ ', + '$': ' ^', + '': ' '} + + +def _create_variants(src, preflag, postflag, repl, decompose): + if preflag == '~': + postfix = _FLAG_MATCH[postflag] + # suffix decomposition + src = src + postfix + repl = repl + postfix + + yield src, repl + yield ' ' + src, ' ' + repl + + if decompose: + yield src, ' ' + repl + yield ' ' + src, repl + elif postflag == '~': + # prefix decomposition + prefix = _FLAG_MATCH[preflag] + src = prefix + src + repl = prefix + repl + + yield src, repl + yield src + ' ', repl + ' ' + + if decompose: + yield src, repl + ' ' + yield src + ' ', repl + else: + prefix = _FLAG_MATCH[preflag] + postfix = _FLAG_MATCH[postflag] + + yield prefix + src + postfix, prefix + repl + postfix + + +### Analysis section + +def create(transliterator, config): + """ Create a new token analysis instance for this module. + """ + return GenericTokenAnalysis(transliterator, config) + + +class GenericTokenAnalysis: + """ Collects the different transformation rules for normalisation of names + and provides the functions to apply the transformations. 
+ """ + + def __init__(self, to_ascii, config): + self.to_ascii = to_ascii + self.variant_only = config['variant_only'] + + # Set up datrie + if config['replacements']: + self.replacements = datrie.Trie(config['chars']) + for src, repllist in config['replacements']: + self.replacements[src] = repllist + else: + self.replacements = None + + + def get_variants_ascii(self, norm_name): + """ Compute the spelling variants for the given normalized name + and transliterate the result. + """ + baseform = '^ ' + norm_name + ' ^' + partials = [''] + + startpos = 0 + if self.replacements is not None: + pos = 0 + force_space = False + while pos < len(baseform): + full, repl = self.replacements.longest_prefix_item(baseform[pos:], + (None, None)) + if full is not None: + done = baseform[startpos:pos] + partials = [v + done + r + for v, r in itertools.product(partials, repl) + if not force_space or r.startswith(' ')] + if len(partials) > 128: + # If too many variants are produced, they are unlikely + # to be helpful. Only use the original term. + startpos = 0 + break + startpos = pos + len(full) + if full[-1] == ' ': + startpos -= 1 + force_space = True + pos = startpos + else: + pos += 1 + force_space = False + + # No variants detected? Fast return. + if startpos == 0: + if self.variant_only: + return [] + + trans_name = self.to_ascii.transliterate(norm_name).strip() + return [trans_name] if trans_name else [] + + return self._compute_result_set(partials, baseform[startpos:], + norm_name if self.variant_only else '') + + + def _compute_result_set(self, partials, prefix, exclude): + results = set() + + for variant in partials: + vname = (variant + prefix)[1:-1].strip() + if vname != exclude: + trans_name = self.to_ascii.transliterate(vname).strip() + if trans_name: + results.add(trans_name) + + return list(results) diff --git a/nominatim/tools/country_info.py b/nominatim/tools/country_info.py index e04a8693..635d1584 100644 --- a/nominatim/tools/country_info.py +++ b/nominatim/tools/country_info.py @@ -13,12 +13,21 @@ class _CountryInfo: def __init__(self): self._info = {} + def load(self, config): """ Load the country properties from the configuration files, if they are not loaded yet. """ if not self._info: self._info = config.load_sub_configuration('country_settings.yaml') + # Convert languages into a list for simpler handling. + for prop in self._info.values(): + if 'languages' not in prop: + prop['languages'] = [] + elif not isinstance(prop['languages'], list): + prop['languages'] = [x.strip() + for x in prop['languages'].split(',')] + def items(self): """ Return tuples of (country_code, property dict) as iterable. @@ -36,6 +45,12 @@ def setup_country_config(config): _COUNTRY_INFO.load(config) +def iterate(): + """ Iterate over country code and properties. + """ + return _COUNTRY_INFO.items() + + def setup_country_tables(dsn, sql_dir, ignore_partitions=False): """ Create and populate the tables with basic static data that provides the background for geocoding. Data is assumed to not yet exist. 
@@ -50,10 +65,7 @@ def setup_country_tables(dsn, sql_dir, ignore_partitions=False): partition = 0 else: partition = props.get('partition') - if ',' in (props.get('languages', ',') or ','): - lang = None - else: - lang = props['languages'] + lang = props['languages'][0] if len(props['languages']) == 1 else None params.append((ccode, partition, lang)) with connect(dsn) as conn: diff --git a/settings/country_settings.yaml b/settings/country_settings.yaml index 77b137a1..dcbb1847 100644 --- a/settings/country_settings.yaml +++ b/settings/country_settings.yaml @@ -171,7 +171,7 @@ bt: # (Bouvet Island) bv: partition: 185 - languages: no + languages: "no" # Botswana (Botswana) bw: @@ -1006,7 +1006,7 @@ si: # (Svalbard and Jan Mayen) sj: partition: 197 - languages: no + languages: "no" # Slovakia (Slovensko) sk: diff --git a/settings/icu_tokenizer.yaml b/settings/icu_tokenizer.yaml index 08b7a7ff..41760c49 100644 --- a/settings/icu_tokenizer.yaml +++ b/settings/icu_tokenizer.yaml @@ -27,34 +27,160 @@ transliteration: sanitizers: - step: split-name-list - step: strip-brace-terms -variants: - - !include icu-rules/variants-bg.yaml - - !include icu-rules/variants-ca.yaml - - !include icu-rules/variants-cs.yaml - - !include icu-rules/variants-da.yaml - - !include icu-rules/variants-de.yaml - - !include icu-rules/variants-el.yaml - - !include icu-rules/variants-en.yaml - - !include icu-rules/variants-es.yaml - - !include icu-rules/variants-et.yaml - - !include icu-rules/variants-eu.yaml - - !include icu-rules/variants-fi.yaml - - !include icu-rules/variants-fr.yaml - - !include icu-rules/variants-gl.yaml - - !include icu-rules/variants-hu.yaml - - !include icu-rules/variants-it.yaml - - !include icu-rules/variants-ja.yaml - - !include icu-rules/variants-mg.yaml - - !include icu-rules/variants-ms.yaml - - !include icu-rules/variants-nl.yaml - - !include icu-rules/variants-no.yaml - - !include icu-rules/variants-pl.yaml - - !include icu-rules/variants-pt.yaml - - !include icu-rules/variants-ro.yaml - - !include icu-rules/variants-ru.yaml - - !include icu-rules/variants-sk.yaml - - !include icu-rules/variants-sl.yaml - - !include icu-rules/variants-sv.yaml - - !include icu-rules/variants-tr.yaml - - !include icu-rules/variants-uk.yaml - - !include icu-rules/variants-vi.yaml + - step: tag-analyzer-by-language + filter-kind: [".*name.*"] + whitelist: [bg,ca,cs,da,de,el,en,es,et,eu,fi,fr,gl,hu,it,ja,mg,ms,nl,no,pl,pt,ro,ru,sk,sl,sv,tr,uk,vi] + use-defaults: all + mode: append +token-analysis: + - analyzer: generic + - id: bg + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-bg.yaml + - id: ca + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-ca.yaml + - id: cs + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-cs.yaml + - id: da + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-da.yaml + - id: de + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-de.yaml + - id: el + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-el.yaml + - id: en + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-en.yaml + - id: es + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-es.yaml + - id: et + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-et.yaml + - id: eu + analyzer: generic + mode: variant-only + variants: + - !include 
icu-rules/variants-eu.yaml + - id: fi + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-fi.yaml + - id: fr + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-fr.yaml + - id: gl + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-gl.yaml + - id: hu + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-hu.yaml + - id: it + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-it.yaml + - id: ja + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-ja.yaml + - id: mg + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-mg.yaml + - id: ms + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-ms.yaml + - id: nl + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-nl.yaml + - id: no + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-no.yaml + - id: pl + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-pl.yaml + - id: pt + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-pt.yaml + - id: ro + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-ro.yaml + - id: ru + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-ru.yaml + - id: sk + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-sk.yaml + - id: sl + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-sl.yaml + - id: sv + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-sv.yaml + - id: tr + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-tr.yaml + - id: uk + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-uk.yaml + - id: vi + analyzer: generic + mode: variant-only + variants: + - !include icu-rules/variants-vi.yaml diff --git a/test/bdd/db/query/normalization.feature b/test/bdd/db/query/normalization.feature index b8a760f9..deaa635e 100644 --- a/test/bdd/db/query/normalization.feature +++ b/test/bdd/db/query/normalization.feature @@ -52,7 +52,7 @@ Feature: Import and search of names Scenario: Special characters in name Given the places - | osm | class | type | name | + | osm | class | type | name+name:de | | N1 | place | locality | Jim-Knopf-Straße | | N2 | place | locality | Smith/Weston | | N3 | place | locality | space mountain | diff --git a/test/python/test_tokenizer_icu.py b/test/python/test_tokenizer_icu.py index 9a6f5a94..6a2f2f8b 100644 --- a/test/python/test_tokenizer_icu.py +++ b/test/python/test_tokenizer_icu.py @@ -69,10 +69,11 @@ def analyzer(tokenizer_factory, test_config, monkeypatch, def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',), variants=('~gasse -> gasse', 'street => st', ), sanitizers=[]): - cfgstr = {'normalization' : list(norm), - 'sanitizers' : sanitizers, - 'transliteration' : list(trans), - 'variants' : [ {'words': list(variants)}]} + cfgstr = {'normalization': list(norm), + 'sanitizers': sanitizers, + 'transliteration': list(trans), + 'token-analysis': [{'analyzer': 'generic', + 'variants': [{'words': list(variants)}]}]} (test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(cfgstr)) tok.loader = ICURuleLoader(test_config) @@ -168,9 +169,7 @@ def 
test_init_word_table(tokenizer_factory, test_config, place_row, word_table): tok.init_new_db(test_config) assert word_table.get_partial_words() == {('test', 1), - ('no', 1), ('area', 2), - ('holz', 1), ('strasse', 1), - ('str', 1)} + ('no', 1), ('area', 2)} def test_init_from_project(monkeypatch, test_config, tokenizer_factory): diff --git a/test/python/test_tokenizer_icu_name_processor.py b/test/python/test_tokenizer_icu_name_processor.py deleted file mode 100644 index d0ed21ec..00000000 --- a/test/python/test_tokenizer_icu_name_processor.py +++ /dev/null @@ -1,104 +0,0 @@ -""" -Tests for import name normalisation and variant generation. -""" -from textwrap import dedent - -import pytest - -from nominatim.tokenizer.icu_rule_loader import ICURuleLoader - -from nominatim.errors import UsageError - -@pytest.fixture -def cfgfile(def_config, tmp_path): - project_dir = tmp_path / 'project_dir' - project_dir.mkdir() - def_config.project_dir = project_dir - - def _create_config(*variants, **kwargs): - content = dedent("""\ - normalization: - - ":: NFD ()" - - "'🜳' > ' '" - - "[[:Nonspacing Mark:] [:Cf:]] >" - - ":: lower ()" - - "[[:Punctuation:][:Space:]]+ > ' '" - - ":: NFC ()" - transliteration: - - ":: Latin ()" - - "'🜵' > ' '" - """) - content += "variants:\n - words:\n" - content += '\n'.join((" - " + s for s in variants)) + '\n' - for k, v in kwargs: - content += " {}: {}\n".format(k, v) - (project_dir / 'icu_tokenizer.yaml').write_text(content) - - return def_config - - return _create_config - - -def get_normalized_variants(proc, name): - return proc.get_variants_ascii(proc.get_normalized(name)) - - -def test_variants_empty(cfgfile): - config = cfgfile('saint -> 🜵', 'street -> st') - - proc = ICURuleLoader(config).make_token_analysis() - - assert get_normalized_variants(proc, '🜵') == [] - assert get_normalized_variants(proc, '🜳') == [] - assert get_normalized_variants(proc, 'saint') == ['saint'] - - -VARIANT_TESTS = [ -(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}), -(('weg => wg',), "holzweg", {'holzweg'}), -(('weg -> wg',), "holzweg", {'holzweg'}), -(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}), -(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}), -(('~weg => w',), "holzweg", {'holz w', 'holzw'}), -(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}), -(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}), -(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}), -(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}), -(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}), -(('weg => wg',), "Meier Weg", {'meier wg'}), -(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}), -(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße", - {'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}), -(('am => a', 'bach => b'), "am bach", {'a b'}), -(('am => a', '~bach => b'), "am bach", {'a b'}), -(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}), -(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}), -(('saint -> s,st', 'street -> st'), "Saint Johns Street", - {'saint johns street', 's johns street', 'st johns street', - 'saint johns st', 's johns st', 'st johns st'}), -(('river$ -> r',), "River Bend Road", {'river bend road'}), -(('river$ -> r',), "Bent River", {'bent river', 'bent r'}), -(('^north => n',), "North 2nd Street", {'n 2nd street'}), -(('^north => n',), "Airport North", {'airport north'}), -(('am -> a',), "am am am am am am am am", {'am am am 
am am am am am'}), -(('am => a',), "am am am am am am am am", {'a a a a a a a a'}) -] - -@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS) -def test_variants(cfgfile, rules, name, variants): - config = cfgfile(*rules) - proc = ICURuleLoader(config).make_token_analysis() - - result = get_normalized_variants(proc, name) - - assert len(result) == len(set(result)) - assert set(get_normalized_variants(proc, name)) == variants - - -def test_search_normalized(cfgfile): - config = cfgfile('~street => s,st', 'master => mstr') - proc = ICURuleLoader(config).make_token_analysis() - - assert proc.get_search_normalized('Master Street') == 'master street' - assert proc.get_search_normalized('Earnes St') == 'earnes st' - assert proc.get_search_normalized('Nostreet') == 'nostreet' diff --git a/test/python/test_tokenizer_icu_rule_loader.py b/test/python/test_tokenizer_icu_rule_loader.py index 6ec53edc..e22ccd4b 100644 --- a/test/python/test_tokenizer_icu_rule_loader.py +++ b/test/python/test_tokenizer_icu_rule_loader.py @@ -34,8 +34,8 @@ def cfgrules(test_config): - ":: Latin ()" - "[[:Punctuation:][:Space:]]+ > ' '" """) - content += "variants:\n - words:\n" - content += '\n'.join((" - " + s for s in variants)) + '\n' + content += "token-analysis:\n - analyzer: generic\n variants:\n - words:\n" + content += '\n'.join((" - " + s for s in variants)) + '\n' for k, v in kwargs: content += " {}: {}\n".format(k, v) (test_config.project_dir / 'icu_tokenizer.yaml').write_text(content) @@ -49,20 +49,21 @@ def test_empty_rule_set(test_config): (test_config.project_dir / 'icu_tokenizer.yaml').write_text(dedent("""\ normalization: transliteration: - variants: + token-analysis: + - analyzer: generic + variants: """)) rules = ICURuleLoader(test_config) assert rules.get_search_rules() == '' assert rules.get_normalization_rules() == '' assert rules.get_transliteration_rules() == '' - assert list(rules.get_replacement_pairs()) == [] -CONFIG_SECTIONS = ('normalization', 'transliteration', 'variants') +CONFIG_SECTIONS = ('normalization', 'transliteration', 'token-analysis') @pytest.mark.parametrize("section", CONFIG_SECTIONS) def test_missing_section(section, test_config): - rule_cfg = { s: {} for s in CONFIG_SECTIONS if s != section} + rule_cfg = { s: [] for s in CONFIG_SECTIONS if s != section} (test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(rule_cfg)) with pytest.raises(UsageError): @@ -107,7 +108,9 @@ def test_transliteration_rules_from_file(test_config): transliteration: - "'ax' > 'b'" - !include transliteration.yaml - variants: + token-analysis: + - analyzer: generic + variants: """)) transpath = test_config.project_dir / ('transliteration.yaml') transpath.write_text('- "x > y"') @@ -119,6 +122,15 @@ def test_transliteration_rules_from_file(test_config): assert trans.transliterate(" axxt ") == " byt " +def test_search_rules(cfgrules): + config = cfgrules('~street => s,st', 'master => mstr') + proc = ICURuleLoader(config).make_token_analysis() + + assert proc.search.transliterate('Master Street').strip() == 'master street' + assert proc.search.transliterate('Earnes St').strip() == 'earnes st' + assert proc.search.transliterate('Nostreet').strip() == 'nostreet' + + class TestGetReplacements: @pytest.fixture(autouse=True) @@ -127,9 +139,9 @@ class TestGetReplacements: def get_replacements(self, *variants): loader = ICURuleLoader(self.cfgrules(*variants)) - rules = loader.get_replacement_pairs() + rules = loader.analysis[None].config['replacements'] - return set((v.source, v.replacement) 
for v in rules) + return sorted((k, sorted(v)) for k,v in rules) @pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar', @@ -141,131 +153,122 @@ class TestGetReplacements: def test_add_full(self): repl = self.get_replacements("foo -> bar") - assert repl == {(' foo ', ' bar '), (' foo ', ' foo ')} + assert repl == [(' foo ', [' bar', ' foo'])] def test_replace_full(self): repl = self.get_replacements("foo => bar") - assert repl == {(' foo ', ' bar ')} + assert repl == [(' foo ', [' bar'])] def test_add_suffix_no_decompose(self): repl = self.get_replacements("~berg |-> bg") - assert repl == {('berg ', 'berg '), ('berg ', 'bg '), - (' berg ', ' berg '), (' berg ', ' bg ')} + assert repl == [(' berg ', [' berg', ' bg']), + ('berg ', ['berg', 'bg'])] def test_replace_suffix_no_decompose(self): repl = self.get_replacements("~berg |=> bg") - assert repl == {('berg ', 'bg '), (' berg ', ' bg ')} + assert repl == [(' berg ', [' bg']),('berg ', ['bg'])] def test_add_suffix_decompose(self): repl = self.get_replacements("~berg -> bg") - assert repl == {('berg ', 'berg '), ('berg ', ' berg '), - (' berg ', ' berg '), (' berg ', 'berg '), - ('berg ', 'bg '), ('berg ', ' bg '), - (' berg ', 'bg '), (' berg ', ' bg ')} + assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']), + ('berg ', [' berg', ' bg', 'berg', 'bg'])] def test_replace_suffix_decompose(self): repl = self.get_replacements("~berg => bg") - assert repl == {('berg ', 'bg '), ('berg ', ' bg '), - (' berg ', 'bg '), (' berg ', ' bg ')} + assert repl == [(' berg ', [' bg', 'bg']), + ('berg ', [' bg', 'bg'])] def test_add_prefix_no_compose(self): repl = self.get_replacements("hinter~ |-> hnt") - assert repl == {(' hinter', ' hinter'), (' hinter ', ' hinter '), - (' hinter', ' hnt'), (' hinter ', ' hnt ')} + assert repl == [(' hinter', [' hinter', ' hnt']), + (' hinter ', [' hinter', ' hnt'])] def test_replace_prefix_no_compose(self): repl = self.get_replacements("hinter~ |=> hnt") - assert repl == {(' hinter', ' hnt'), (' hinter ', ' hnt ')} + assert repl == [(' hinter', [' hnt']), (' hinter ', [' hnt'])] def test_add_prefix_compose(self): repl = self.get_replacements("hinter~-> h") - assert repl == {(' hinter', ' hinter'), (' hinter', ' hinter '), - (' hinter', ' h'), (' hinter', ' h '), - (' hinter ', ' hinter '), (' hinter ', ' hinter'), - (' hinter ', ' h '), (' hinter ', ' h')} + assert repl == [(' hinter', [' h', ' h ', ' hinter', ' hinter ']), + (' hinter ', [' h', ' h', ' hinter', ' hinter'])] def test_replace_prefix_compose(self): repl = self.get_replacements("hinter~=> h") - assert repl == {(' hinter', ' h'), (' hinter', ' h '), - (' hinter ', ' h '), (' hinter ', ' h')} + assert repl == [(' hinter', [' h', ' h ']), + (' hinter ', [' h', ' h'])] def test_add_beginning_only(self): repl = self.get_replacements("^Premier -> Pr") - assert repl == {('^ premier ', '^ premier '), ('^ premier ', '^ pr ')} + assert repl == [('^ premier ', ['^ pr', '^ premier'])] def test_replace_beginning_only(self): repl = self.get_replacements("^Premier => Pr") - assert repl == {('^ premier ', '^ pr ')} + assert repl == [('^ premier ', ['^ pr'])] def test_add_final_only(self): repl = self.get_replacements("road$ -> rd") - assert repl == {(' road ^', ' road ^'), (' road ^', ' rd ^')} + assert repl == [(' road ^', [' rd ^', ' road ^'])] def test_replace_final_only(self): repl = self.get_replacements("road$ => rd") - assert repl == {(' road ^', ' rd ^')} + assert repl == [(' road ^', [' rd ^'])] def test_decompose_only(self): repl = 
self.get_replacements("~foo -> foo") - assert repl == {('foo ', 'foo '), ('foo ', ' foo '), - (' foo ', 'foo '), (' foo ', ' foo ')} + assert repl == [(' foo ', [' foo', 'foo']), + ('foo ', [' foo', 'foo'])] def test_add_suffix_decompose_end_only(self): repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg") - assert repl == {('berg ', 'berg '), ('berg ', 'bg '), - (' berg ', ' berg '), (' berg ', ' bg '), - ('berg ^', 'berg ^'), ('berg ^', ' berg ^'), - ('berg ^', 'bg ^'), ('berg ^', ' bg ^'), - (' berg ^', 'berg ^'), (' berg ^', 'bg ^'), - (' berg ^', ' berg ^'), (' berg ^', ' bg ^')} + assert repl == [(' berg ', [' berg', ' bg']), + (' berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^']), + ('berg ', ['berg', 'bg']), + ('berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^'])] def test_replace_suffix_decompose_end_only(self): repl = self.get_replacements("~berg |=> bg", "~berg$ => bg") - assert repl == {('berg ', 'bg '), (' berg ', ' bg '), - ('berg ^', 'bg ^'), ('berg ^', ' bg ^'), - (' berg ^', 'bg ^'), (' berg ^', ' bg ^')} + assert repl == [(' berg ', [' bg']), + (' berg ^', [' bg ^', 'bg ^']), + ('berg ', ['bg']), + ('berg ^', [' bg ^', 'bg ^'])] def test_add_multiple_suffix(self): repl = self.get_replacements("~berg,~burg -> bg") - assert repl == {('berg ', 'berg '), ('berg ', ' berg '), - (' berg ', ' berg '), (' berg ', 'berg '), - ('berg ', 'bg '), ('berg ', ' bg '), - (' berg ', 'bg '), (' berg ', ' bg '), - ('burg ', 'burg '), ('burg ', ' burg '), - (' burg ', ' burg '), (' burg ', 'burg '), - ('burg ', 'bg '), ('burg ', ' bg '), - (' burg ', 'bg '), (' burg ', ' bg ')} + assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']), + (' burg ', [' bg', ' burg', 'bg', 'burg']), + ('berg ', [' berg', ' bg', 'berg', 'bg']), + ('burg ', [' bg', ' burg', 'bg', 'burg'])] diff --git a/test/python/tokenizer/sanitizers/test_tag_analyzer_by_language.py b/test/python/tokenizer/sanitizers/test_tag_analyzer_by_language.py new file mode 100644 index 00000000..e4a836fa --- /dev/null +++ b/test/python/tokenizer/sanitizers/test_tag_analyzer_by_language.py @@ -0,0 +1,259 @@ +""" +Tests for the sanitizer that enables language-dependent analyzers. 
+""" +import pytest + +from nominatim.indexer.place_info import PlaceInfo +from nominatim.tokenizer.place_sanitizer import PlaceSanitizer +from nominatim.tools.country_info import setup_country_config + +class TestWithDefaults: + + @staticmethod + def run_sanitizer_on(country, **kwargs): + place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}, + 'country_code': country}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}]).process_names(place) + + return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name]) + + + def test_no_names(self): + assert self.run_sanitizer_on('de') == [] + + + def test_simple(self): + res = self.run_sanitizer_on('fr', name='Foo',name_de='Zoo', ref_abc='M') + + assert res == [('Foo', 'name', None, {}), + ('M', 'ref', 'abc', {'analyzer': 'abc'}), + ('Zoo', 'name', 'de', {'analyzer': 'de'})] + + + @pytest.mark.parametrize('suffix', ['DE', 'asbc']) + def test_illegal_suffix(self, suffix): + assert self.run_sanitizer_on('fr', **{'name_' + suffix: 'Foo'}) \ + == [('Foo', 'name', suffix, {})] + + +class TestFilterKind: + + @staticmethod + def run_sanitizer_on(filt, **kwargs): + place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}, + 'country_code': 'de'}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language', + 'filter-kind': filt}]).process_names(place) + + return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name]) + + + def test_single_exact_name(self): + res = self.run_sanitizer_on(['name'], name_fr='A', ref_fr='12', + shortname_fr='C', name='D') + + assert res == [('12', 'ref', 'fr', {}), + ('A', 'name', 'fr', {'analyzer': 'fr'}), + ('C', 'shortname', 'fr', {}), + ('D', 'name', None, {})] + + + def test_single_pattern(self): + res = self.run_sanitizer_on(['.*name'], + name_fr='A', ref_fr='12', namexx_fr='B', + shortname_fr='C', name='D') + + assert res == [('12', 'ref', 'fr', {}), + ('A', 'name', 'fr', {'analyzer': 'fr'}), + ('B', 'namexx', 'fr', {}), + ('C', 'shortname', 'fr', {'analyzer': 'fr'}), + ('D', 'name', None, {})] + + + def test_multiple_patterns(self): + res = self.run_sanitizer_on(['.*name', 'ref'], + name_fr='A', ref_fr='12', oldref_fr='X', + namexx_fr='B', shortname_fr='C', name='D') + + assert res == [('12', 'ref', 'fr', {'analyzer': 'fr'}), + ('A', 'name', 'fr', {'analyzer': 'fr'}), + ('B', 'namexx', 'fr', {}), + ('C', 'shortname', 'fr', {'analyzer': 'fr'}), + ('D', 'name', None, {}), + ('X', 'oldref', 'fr', {})] + + +class TestDefaultCountry: + + @pytest.fixture(autouse=True) + def setup_country(self, def_config): + setup_country_config(def_config) + + @staticmethod + def run_sanitizer_append(mode, country, **kwargs): + place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}, + 'country_code': country}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language', + 'use-defaults': mode, + 'mode': 'append'}]).process_names(place) + + assert all(isinstance(p.attr, dict) for p in name) + assert all(len(p.attr) <= 1 for p in name) + assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer']) + for p in name) + + return sorted([(p.name, p.attr.get('analyzer', '')) for p in name]) + + + @staticmethod + def run_sanitizer_replace(mode, country, **kwargs): + place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}, + 'country_code': country}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language', + 'use-defaults': mode, + 'mode': 'replace'}]).process_names(place) + + assert all(isinstance(p.attr, dict) 
for p in name) + assert all(len(p.attr) <= 1 for p in name) + assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer']) + for p in name) + + return sorted([(p.name, p.attr.get('analyzer', '')) for p in name]) + + + def test_missing_country(self): + place = PlaceInfo({'name': {'name': 'something'}}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language', + 'use-defaults': 'all', + 'mode': 'replace'}]).process_names(place) + + assert len(name) == 1 + assert name[0].name == 'something' + assert name[0].suffix is None + assert 'analyzer' not in name[0].attr + + + def test_mono_unknown_country(self): + expect = [('XX', '')] + + assert self.run_sanitizer_replace('mono', 'xx', name='XX') == expect + assert self.run_sanitizer_append('mono', 'xx', name='XX') == expect + + + def test_mono_monoling_replace(self): + res = self.run_sanitizer_replace('mono', 'de', name='Foo') + + assert res == [('Foo', 'de')] + + + def test_mono_monoling_append(self): + res = self.run_sanitizer_append('mono', 'de', name='Foo') + + assert res == [('Foo', ''), ('Foo', 'de')] + + + def test_mono_multiling(self): + expect = [('XX', '')] + + assert self.run_sanitizer_replace('mono', 'ch', name='XX') == expect + assert self.run_sanitizer_append('mono', 'ch', name='XX') == expect + + + def test_all_unknown_country(self): + expect = [('XX', '')] + + assert self.run_sanitizer_replace('all', 'xx', name='XX') == expect + assert self.run_sanitizer_append('all', 'xx', name='XX') == expect + + + def test_all_monoling_replace(self): + res = self.run_sanitizer_replace('all', 'de', name='Foo') + + assert res == [('Foo', 'de')] + + + def test_all_monoling_append(self): + res = self.run_sanitizer_append('all', 'de', name='Foo') + + assert res == [('Foo', ''), ('Foo', 'de')] + + + def test_all_multiling_append(self): + res = self.run_sanitizer_append('all', 'ch', name='XX') + + assert res == [('XX', ''), + ('XX', 'de'), ('XX', 'fr'), ('XX', 'it'), ('XX', 'rm')] + + + def test_all_multiling_replace(self): + res = self.run_sanitizer_replace('all', 'ch', name='XX') + + assert res == [('XX', 'de'), ('XX', 'fr'), ('XX', 'it'), ('XX', 'rm')] + + +class TestCountryWithWhitelist: + + @staticmethod + def run_sanitizer_on(mode, country, **kwargs): + place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}, + 'country_code': country}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language', + 'use-defaults': mode, + 'mode': 'replace', + 'whitelist': ['de', 'fr', 'ru']}]).process_names(place) + + assert all(isinstance(p.attr, dict) for p in name) + assert all(len(p.attr) <= 1 for p in name) + assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer']) + for p in name) + + return sorted([(p.name, p.attr.get('analyzer', '')) for p in name]) + + + def test_mono_monoling(self): + assert self.run_sanitizer_on('mono', 'de', name='Foo') == [('Foo', 'de')] + assert self.run_sanitizer_on('mono', 'pt', name='Foo') == [('Foo', '')] + + + def test_mono_multiling(self): + assert self.run_sanitizer_on('mono', 'ca', name='Foo') == [('Foo', '')] + + + def test_all_monoling(self): + assert self.run_sanitizer_on('all', 'de', name='Foo') == [('Foo', 'de')] + assert self.run_sanitizer_on('all', 'pt', name='Foo') == [('Foo', '')] + + + def test_all_multiling(self): + assert self.run_sanitizer_on('all', 'ca', name='Foo') == [('Foo', 'fr')] + assert self.run_sanitizer_on('all', 'ch', name='Foo') \ + == [('Foo', 'de'), ('Foo', 'fr')] + + +class TestWhiteList: + + @staticmethod + def run_sanitizer_on(whitelist, 
**kwargs): + place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}}) + name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language', + 'mode': 'replace', + 'whitelist': whitelist}]).process_names(place) + + assert all(isinstance(p.attr, dict) for p in name) + assert all(len(p.attr) <= 1 for p in name) + assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer']) + for p in name) + + return sorted([(p.name, p.attr.get('analyzer', '')) for p in name]) + + + def test_in_whitelist(self): + assert self.run_sanitizer_on(['de', 'xx'], ref_xx='123') == [('123', 'xx')] + + + def test_not_in_whitelist(self): + assert self.run_sanitizer_on(['de', 'xx'], ref_yy='123') == [('123', '')] + + + def test_empty_whitelist(self): + assert self.run_sanitizer_on([], ref_yy='123') == [('123', '')] diff --git a/test/python/tokenizer/token_analysis/test_generic.py b/test/python/tokenizer/token_analysis/test_generic.py new file mode 100644 index 00000000..02a95f25 --- /dev/null +++ b/test/python/tokenizer/token_analysis/test_generic.py @@ -0,0 +1,265 @@ +""" +Tests for import name normalisation and variant generation. +""" +import pytest + +from icu import Transliterator + +import nominatim.tokenizer.token_analysis.generic as module +from nominatim.errors import UsageError + +DEFAULT_NORMALIZATION = """ :: NFD (); + '🜳' > ' '; + [[:Nonspacing Mark:] [:Cf:]] >; + :: lower (); + [[:Punctuation:][:Space:]]+ > ' '; + :: NFC (); + """ + +DEFAULT_TRANSLITERATION = """ :: Latin (); + '🜵' > ' '; + """ + +def make_analyser(*variants, variant_only=False): + rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]} + if variant_only: + rules['mode'] = 'variant-only' + config = module.configure(rules, DEFAULT_NORMALIZATION) + trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION) + + return module.create(trans, config) + + +def get_normalized_variants(proc, name): + norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) + return proc.get_variants_ascii(norm.transliterate(name).strip()) + + +def test_no_variants(): + rules = { 'analyzer': 'generic' } + config = module.configure(rules, DEFAULT_NORMALIZATION) + trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION) + + proc = module.create(trans, config) + + assert get_normalized_variants(proc, '大德!') == ['dà dé'] + + +def test_variants_empty(): + proc = make_analyser('saint -> 🜵', 'street -> st') + + assert get_normalized_variants(proc, '🜵') == [] + assert get_normalized_variants(proc, '🜳') == [] + assert get_normalized_variants(proc, 'saint') == ['saint'] + + +VARIANT_TESTS = [ +(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}), +(('weg => wg',), "holzweg", {'holzweg'}), +(('weg -> wg',), "holzweg", {'holzweg'}), +(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}), +(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}), +(('~weg => w',), "holzweg", {'holz w', 'holzw'}), +(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}), +(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}), +(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}), +(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}), +(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}), +(('weg => wg',), "Meier Weg", {'meier wg'}), +(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}), +(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße", + {'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}), +(('am => a', 'bach => 
b'), "am bach", {'a b'}),
+(('am => a', '~bach => b'), "am bach", {'a b'}),
+(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
+(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
+(('saint -> s,st', 'street -> st'), "Saint Johns Street",
+ {'saint johns street', 's johns street', 'st johns street',
+  'saint johns st', 's johns st', 'st johns st'}),
+(('river$ -> r',), "River Bend Road", {'river bend road'}),
+(('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
+(('^north => n',), "North 2nd Street", {'n 2nd street'}),
+(('^north => n',), "Airport North", {'airport north'}),
+(('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
+(('am => a',), "am am am am am am am am", {'a a a a a a a a'})
+]
+
+@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
+def test_variants(rules, name, variants):
+    proc = make_analyser(*rules)
+
+    result = get_normalized_variants(proc, name)
+
+    assert len(result) == len(set(result))
+    assert set(get_normalized_variants(proc, name)) == variants
+
+
+VARIANT_ONLY_TESTS = [
+(('weg => wg',), "hallo", set()),
+(('weg => wg',), "Meier Weg", {'meier wg'}),
+(('weg -> wg',), "Meier Weg", {'meier wg'}),
+]
+
+@pytest.mark.parametrize("rules,name,variants", VARIANT_ONLY_TESTS)
+def test_variants_only(rules, name, variants):
+    proc = make_analyser(*rules, variant_only=True)
+
+    result = get_normalized_variants(proc, name)
+
+    assert len(result) == len(set(result))
+    assert set(get_normalized_variants(proc, name)) == variants
+
+
+class TestGetReplacements:
+
+    @staticmethod
+    def configure_rules(*variants):
+        rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
+        return module.configure(rules, DEFAULT_NORMALIZATION)
+
+
+    def get_replacements(self, *variants):
+        config = self.configure_rules(*variants)
+
+        return sorted((k, sorted(v)) for k,v in config['replacements'])
+
+
+    @pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
+                                         '~foo~ -> bar', 'fo~ o -> bar'])
+    def test_invalid_variant_description(self, variant):
+        with pytest.raises(UsageError):
+            self.configure_rules(variant)
+
+
+    @pytest.mark.parametrize("rule", ["!!!
-> bar", "bar => !!!"]) + def test_ignore_unnormalizable_terms(self, rule): + repl = self.get_replacements(rule) + + assert repl == [] + + + def test_add_full(self): + repl = self.get_replacements("foo -> bar") + + assert repl == [(' foo ', [' bar', ' foo'])] + + + def test_replace_full(self): + repl = self.get_replacements("foo => bar") + + assert repl == [(' foo ', [' bar'])] + + + def test_add_suffix_no_decompose(self): + repl = self.get_replacements("~berg |-> bg") + + assert repl == [(' berg ', [' berg', ' bg']), + ('berg ', ['berg', 'bg'])] + + + def test_replace_suffix_no_decompose(self): + repl = self.get_replacements("~berg |=> bg") + + assert repl == [(' berg ', [' bg']),('berg ', ['bg'])] + + + def test_add_suffix_decompose(self): + repl = self.get_replacements("~berg -> bg") + + assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']), + ('berg ', [' berg', ' bg', 'berg', 'bg'])] + + + def test_replace_suffix_decompose(self): + repl = self.get_replacements("~berg => bg") + + assert repl == [(' berg ', [' bg', 'bg']), + ('berg ', [' bg', 'bg'])] + + + def test_add_prefix_no_compose(self): + repl = self.get_replacements("hinter~ |-> hnt") + + assert repl == [(' hinter', [' hinter', ' hnt']), + (' hinter ', [' hinter', ' hnt'])] + + + def test_replace_prefix_no_compose(self): + repl = self.get_replacements("hinter~ |=> hnt") + + assert repl == [(' hinter', [' hnt']), (' hinter ', [' hnt'])] + + + def test_add_prefix_compose(self): + repl = self.get_replacements("hinter~-> h") + + assert repl == [(' hinter', [' h', ' h ', ' hinter', ' hinter ']), + (' hinter ', [' h', ' h', ' hinter', ' hinter'])] + + + def test_replace_prefix_compose(self): + repl = self.get_replacements("hinter~=> h") + + assert repl == [(' hinter', [' h', ' h ']), + (' hinter ', [' h', ' h'])] + + + def test_add_beginning_only(self): + repl = self.get_replacements("^Premier -> Pr") + + assert repl == [('^ premier ', ['^ pr', '^ premier'])] + + + def test_replace_beginning_only(self): + repl = self.get_replacements("^Premier => Pr") + + assert repl == [('^ premier ', ['^ pr'])] + + + def test_add_final_only(self): + repl = self.get_replacements("road$ -> rd") + + assert repl == [(' road ^', [' rd ^', ' road ^'])] + + + def test_replace_final_only(self): + repl = self.get_replacements("road$ => rd") + + assert repl == [(' road ^', [' rd ^'])] + + + def test_decompose_only(self): + repl = self.get_replacements("~foo -> foo") + + assert repl == [(' foo ', [' foo', 'foo']), + ('foo ', [' foo', 'foo'])] + + + def test_add_suffix_decompose_end_only(self): + repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg") + + assert repl == [(' berg ', [' berg', ' bg']), + (' berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^']), + ('berg ', ['berg', 'bg']), + ('berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^'])] + + + def test_replace_suffix_decompose_end_only(self): + repl = self.get_replacements("~berg |=> bg", "~berg$ => bg") + + assert repl == [(' berg ', [' bg']), + (' berg ^', [' bg ^', 'bg ^']), + ('berg ', ['bg']), + ('berg ^', [' bg ^', 'bg ^'])] + + + @pytest.mark.parametrize('rule', ["~berg,~burg -> bg", + "~berg, ~burg -> bg", + "~berg,,~burg -> bg"]) + def test_add_multiple_suffix(self, rule): + repl = self.get_replacements(rule) + + assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']), + (' burg ', [' bg', ' burg', 'bg', 'burg']), + ('berg ', [' berg', ' bg', 'berg', 'bg']), + ('burg ', [' bg', ' burg', 'bg', 'burg'])]
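
For readers following the test changes, here is a minimal sketch of how the `generic` token-analysis module exercised above is driven end to end. It mirrors the `make_analyser()`/`get_normalized_variants()` helpers from `test_generic.py`; the two ICU rule strings are deliberately trimmed-down stand-ins for illustration, not the rules shipped with Nominatim.

```python
# Sketch only: drives the generic token analyzer the same way the tests above do.
# The normalization/transliteration rules here are simplified assumptions.
from icu import Transliterator

import nominatim.tokenizer.token_analysis.generic as module

NORMALIZATION = ":: lower (); [[:Punctuation:][:Space:]]+ > ' ';"
TRANSLITERATION = ":: Latin ();"

# Variant rules use the same syntax as the 'words' lists in the tests.
rules = {'analyzer': 'generic',
         'variants': [{'words': ['saint -> s,st', 'street -> st']}]}

config = module.configure(rules, NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", TRANSLITERATION)
analyzer = module.create(trans, config)

# Names are normalized before being handed to the analyzer,
# just as get_normalized_variants() does in the tests.
norm = Transliterator.createFromRules("test_norm", NORMALIZATION)
name = norm.transliterate("Saint Johns Street").strip()

# Per the VARIANT_TESTS table, this yields the six combinations of
# saint/s/st with street/st.
print(sorted(analyzer.get_variants_ascii(name)))
```

Setting `mode: 'variant-only'` in the rules (exercised by `test_variants_only` above) additionally drops the unmodified base form from the result.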