Mirror of https://github.com/osm-search/Nominatim.git
synced 2026-03-07 10:34:08 +00:00

Merge pull request #2460 from lonvia/multiple-analyzers

Add support for multiple token analyzers
@@ -60,22 +60,23 @@ NOMINATIM_TOKENIZER=icu
 
 ### How it works
 
-On import the tokenizer processes names in the following four stages:
+On import the tokenizer processes names in the following three stages:
 
-1. The **Normalization** part removes all non-relevant information from the
-   input.
-2. Incoming names are now converted to **full names**. This process is currently
-   hard coded and mostly serves to handle name tags from OSM that contain
-   multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
-3. Next the tokenizer creates **variants** from the full names. These variants
-   cover decomposition and abbreviation handling. Variants are saved to the
-   database, so that it is not necessary to create the variants for a search
-   query.
-4. The final **Tokenization** step converts the names to a simple ASCII form,
-   potentially removing further spelling variants for better matching.
+1. During the **Sanitizer step** incoming names are cleaned up and converted to
+   **full names**. This step can be used to regularize spelling, split multi-name
+   tags into their parts and tag names with additional attributes. See the
+   [Sanitizers section](#sanitizers) below for available cleaning routines.
+2. The **Normalization** part removes all information from the full names
+   that is not relevant for search.
+3. The **Token analysis** step takes the normalized full names and creates
+   all transliterated variants under which the name should be searchable.
+   See the [Token analysis](#token-analysis) section below for more
+   information.
 
-At query time only stage 1) and 4) are used. The query is normalized and
-tokenized and the resulting string used for searching in the database.
+During query time, only normalization and transliteration are relevant.
+An incoming query is first split into name chunks (this usually means splitting
+the string at the commas) and then each part is normalised and transliterated.
+The result is used to look up places in the search index.
 
 ### Configuration
 
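The three import stages of the new pipeline can be sketched in plain Python. This is a deliberately tiny toy, not Nominatim's actual API: the real sanitizers, normalization and token analysis are ICU-rule driven and configurable, and every helper name below is invented for illustration.

```python
# Toy illustration of the three import stages; all helpers are invented
# stand-ins, not Nominatim's implementation.

def sanitize(tag_value):
    # Stage 1: split a multi-name tag like "Biel/Bienne" into full names.
    return [part.strip() for part in tag_value.split('/') if part.strip()]

def normalize(name):
    # Stage 2: remove information irrelevant for search (here: case, spacing).
    return ' '.join(name.lower().split())

ABBREVIATIONS = {'strasse': 'str'}   # stand-in for the configured variant rules

def analyze(norm_name):
    # Stage 3: emit all searchable variants of the normalized name.
    variants = {norm_name}
    for word, abbr in ABBREVIATIONS.items():
        if norm_name.endswith(word):
            variants.add(norm_name[:-len(word)] + abbr)
    return sorted(variants)

full_names = sanitize('Hauptstrasse/Main Street')   # ['Hauptstrasse', 'Main Street']
norm_names = [normalize(n) for n in full_names]     # ['hauptstrasse', 'main street']
variants = analyze(norm_names[0])                   # ['hauptstr', 'hauptstrasse']
```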
@@ -93,21 +94,36 @@ normalization:
 transliteration:
     - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
     - ":: Ascii ()"
-variants:
-    - language: de
-      words:
-        - ~haus => haus
-        - ~strasse -> str
-    - language: en
-      words:
-        - road -> rd
-        - bridge -> bdge,br,brdg,bri,brg
+sanitizers:
+    - step: split-name-list
+token-analysis:
+    - analyzer: generic
+      variants:
+          - !include icu-rules/variants-ca.yaml
+          - words:
+              - road -> rd
+              - bridge -> bdge,br,brdg,bri,brg
 ```
 
-The configuration file contains three sections:
-`normalization`, `transliteration`, `variants`.
+The configuration file contains four sections:
+`normalization`, `transliteration`, `sanitizers` and `token-analysis`.
 
-The normalization and transliteration sections each must contain a list of
+#### Normalization and Transliteration
+
+The normalization and transliteration sections each define a set of
+ICU rules that are applied to the names.
+
+The **normalisation** rules are applied after sanitization. They should remove
+any information that is not relevant for search at all. Usual rules to be
+applied here are: lower-casing, removal of special characters, cleanup of
+spaces.
+
+The **transliteration** rules are applied at the end of the tokenization
+process to transform the name into an ASCII representation. Transliteration can
+be useful to allow for further fuzzy matching, especially between different
+scripts.
+
+Each section must contain a list of
 [ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
 The rules are applied in the order in which they appear in the file.
 You can also include additional rules from external yaml file using the
@@ -119,6 +135,85 @@ and may again include other files.
 YAML syntax. You should therefore always enclose the ICU rules in
 double-quotes.
 
+#### Sanitizers
+
+The sanitizers section defines an ordered list of functions that are applied
+to the name and address tags before they are further processed by the tokenizer.
+They allow the tagging to be cleaned up and brought into a standardized form
+more suitable for building the search index.
+
+!!! hint
+    Sanitizers only have an effect on how the search index is built. They
+    do not change the information about each place that is saved in the
+    database. In particular, they have no influence on how the results are
+    displayed. The returned results always show the original information as
+    stored in the OpenStreetMap database.
+
+Each entry contains information about a sanitizer to be applied. It has a
+mandatory parameter `step` which gives the name of the sanitizer. Depending
+on the type, it may have additional parameters to configure its operation.
+
+The order of the list matters. The sanitizers are applied exactly in the order
+that is configured. Each sanitizer works on the results of the previous one.
+
+The following is a list of sanitizers that are shipped with Nominatim.
+
+##### split-name-list
+
+::: nominatim.tokenizer.sanitizers.split_name_list
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
+##### strip-brace-terms
+
+::: nominatim.tokenizer.sanitizers.strip_brace_terms
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
+##### tag-analyzer-by-language
+
+::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
+
+#### Token Analysis
+
+Token analyzers take a full name and transform it into one or more normalized
+forms that are then saved in the search index. In its simplest form, the
+analyzer only applies the transliteration rules. More complex analyzers
+create additional spelling variants of a name. This is useful to handle
+decomposition and abbreviation.
+
+The ICU tokenizer may use different analyzers for different names. To select
+the analyzer to be used, the name must be tagged with the `analyzer` attribute
+by a sanitizer (see for example the
+[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).
+
+The token-analysis section contains the list of configured analyzers. Each
+analyzer must have an `id` parameter that uniquely identifies the analyzer.
+The only exception is the default analyzer that is used when no special
+analyzer was selected.
+
+Different analyzer implementations may exist. To select the implementation,
+the `analyzer` parameter must be set. Currently there is only one
+implementation, `generic`, which is described in the following.
+
+##### Generic token analyzer
+
+The generic analyzer is able to create variants from a list of given
+abbreviation and decomposition replacements. It takes one optional parameter
+`variants` which lists the replacements to apply. If the section is
+omitted, then the generic analyzer becomes a simple analyzer that only
+applies the transliteration.
+
 The variants section defines lists of replacements which create alternative
 spellings of a name. To create the variants, a name is scanned from left to
 right and the longest matching replacement is applied until the end of the
@@ -144,7 +239,7 @@ term.
 words in the configuration because then it is possible to change the
 rules for normalization later without having to adapt the variant rules.
 
-#### Decomposition
+###### Decomposition
 
 In its standard form, only full words match against the source. There
 is a special notation to match the prefix and suffix of a word:
@@ -171,7 +266,7 @@ To avoid automatic decomposition, use the '|' notation:
 
 simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
 
-#### Initial and final terms
+###### Initial and final terms
 
 It is also possible to restrict replacements to the beginning and end of a
 name:
@@ -184,7 +279,7 @@ name:
 So the first example would trigger a replacement for "south 45th street" but
 not for "the south beach restaurant".
 
-#### Replacements vs. variants
+###### Replacements vs. variants
 
 The replacement syntax `source => target` works as a pure replacement. It changes
 the name instead of creating a variant. To create an additional version, you'd
@@ -12,6 +12,27 @@ from nominatim.errors import UsageError
 
 LOG = logging.getLogger()
 
 
+def flatten_config_list(content, section=''):
+    """ Flatten YAML configuration lists that contain include sections
+        which are lists themselves.
+    """
+    if not content:
+        return []
+
+    if not isinstance(content, list):
+        raise UsageError(f"List expected in section '{section}'.")
+
+    output = []
+    for ele in content:
+        if isinstance(ele, list):
+            output.extend(flatten_config_list(ele, section))
+        else:
+            output.append(ele)
+
+    return output
+
+
 class Configuration:
     """ Load and manage the project configuration.
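To see what the new helper does, here is how nested `!include` lists collapse into a flat rule list that can then be joined into a single ICU rule string. The function is reproduced from the hunk above, with `ValueError` standing in for Nominatim's `UsageError` so the snippet is self-contained; the rule strings are made up.

```python
# Reproduction of the helper for illustration (ValueError replaces UsageError).
def flatten_config_list(content, section=''):
    if not content:
        return []
    if not isinstance(content, list):
        raise ValueError(f"List expected in section '{section}'.")
    output = []
    for ele in content:
        if isinstance(ele, list):
            # An !include of a list shows up as a nested list: recurse.
            output.extend(flatten_config_list(ele, section))
        else:
            output.append(ele)
    return output

rules = [':: lower ()', [':: NFD ()', [':: NFC ()']], ':: Ascii ()']
flat = flatten_config_list(rules, 'normalization')
# flat == [':: lower ()', ':: NFD ()', ':: NFC ()', ':: Ascii ()']
icu_rules = ';'.join(flat) + ';'   # what _cfg_to_icu_rules hands to ICU
```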
@@ -194,15 +194,13 @@ class AbstractTokenizer(ABC):
         """ Check that the database is set up correctly and ready for being
             queried.
 
-            Returns:
-              If an issue was found, return an error message with the
-              description of the issue as well as hints for the user on
-              how to resolve the issue.
-
             Arguments:
               config: Read-only object with configuration options.
 
-            Return `None`, if no issue was found.
+            Returns:
+              If an issue was found, return an error message with the
+              description of the issue as well as hints for the user on
+              how to resolve the issue. If everything is okay, return `None`.
         """
         pass
@@ -1,104 +0,0 @@
-"""
-Processor for names that are imported into the database based on the
-ICU library.
-"""
-from collections import defaultdict
-import itertools
-
-from icu import Transliterator
-import datrie
-
-
-class ICUNameProcessor:
-    """ Collects the different transformation rules for normalisation of names
-        and provides the functions to apply the transformations.
-    """
-
-    def __init__(self, norm_rules, trans_rules, replacements):
-        self.normalizer = Transliterator.createFromRules("icu_normalization",
-                                                         norm_rules)
-        self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
-                                                       trans_rules +
-                                                       ";[:Space:]+ > ' '")
-        self.search = Transliterator.createFromRules("icu_search",
-                                                     norm_rules + trans_rules)
-
-        # Intermediate reorder by source. Also compute required character set.
-        immediate = defaultdict(list)
-        chars = set()
-        for variant in replacements:
-            if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
-                replstr = variant.replacement[:-1]
-            else:
-                replstr = variant.replacement
-            immediate[variant.source].append(replstr)
-            chars.update(variant.source)
-        # Then copy to datrie
-        self.replacements = datrie.Trie(''.join(chars))
-        for src, repllist in immediate.items():
-            self.replacements[src] = repllist
-
-
-    def get_normalized(self, name):
-        """ Normalize the given name, i.e. remove all elements not relevant
-            for search.
-        """
-        return self.normalizer.transliterate(name).strip()
-
-    def get_variants_ascii(self, norm_name):
-        """ Compute the spelling variants for the given normalized name
-            and transliterate the result.
-        """
-        baseform = '^ ' + norm_name + ' ^'
-        partials = ['']
-
-        startpos = 0
-        pos = 0
-        force_space = False
-        while pos < len(baseform):
-            full, repl = self.replacements.longest_prefix_item(baseform[pos:],
-                                                               (None, None))
-            if full is not None:
-                done = baseform[startpos:pos]
-                partials = [v + done + r
-                            for v, r in itertools.product(partials, repl)
-                            if not force_space or r.startswith(' ')]
-                if len(partials) > 128:
-                    # If too many variants are produced, they are unlikely
-                    # to be helpful. Only use the original term.
-                    startpos = 0
-                    break
-                startpos = pos + len(full)
-                if full[-1] == ' ':
-                    startpos -= 1
-                    force_space = True
-                pos = startpos
-            else:
-                pos += 1
-                force_space = False
-
-        # No variants detected? Fast return.
-        if startpos == 0:
-            trans_name = self.to_ascii.transliterate(norm_name).strip()
-            return [trans_name] if trans_name else []
-
-        return self._compute_result_set(partials, baseform[startpos:])
-
-
-    def _compute_result_set(self, partials, prefix):
-        results = set()
-
-        for variant in partials:
-            vname = variant + prefix
-            trans_name = self.to_ascii.transliterate(vname[1:-1]).strip()
-            if trans_name:
-                results.add(trans_name)
-
-        return list(results)
-
-
-    def get_search_normalized(self, name):
-        """ Return the normalized version of the name (including transliteration)
-            to be applied at search time.
-        """
-        return self.search.transliterate(' ' + name + ' ').strip()
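The core of the removed `get_variants_ascii` is a left-to-right scan that branches on every matched replacement source. It can be approximated without `datrie`: the sketch below uses a plain dict and a linear longest-prefix search in place of `Trie.longest_prefix_item`, and omits transliteration, the 128-variant cap and the force-space bookkeeping. The replacement table in the example is made up.

```python
# Simplified sketch of the longest-prefix variant expansion performed by
# the removed ICUNameProcessor (no datrie, no transliteration).

def longest_prefix(replacements, text):
    # Linear stand-in for datrie's longest_prefix_item.
    best = None
    for src in replacements:
        if text.startswith(src) and (best is None or len(src) > len(best)):
            best = src
    return best

def expand(name, replacements):
    # Names are delimited with '^ ' / ' ^' so rules can anchor on word ends.
    base = '^ ' + name + ' ^'
    variants = ['']
    pos = startpos = 0
    while pos < len(base):
        src = longest_prefix(replacements, base[pos:])
        if src is None:
            pos += 1
            continue
        done = base[startpos:pos]
        # Branch: every accumulated variant continues with every replacement.
        variants = [v + done + r for v in variants for r in replacements[src]]
        startpos = pos = pos + len(src)
    return {(v + base[startpos:]).strip('^ ') for v in variants}

expand('hauptstrasse', {'strasse ^': ['strasse ^', 'str ^']})
# {'hauptstrasse', 'hauptstr'}
```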
@@ -1,19 +1,17 @@
 """
 Helper class to create ICU rules from a configuration file.
 """
+import importlib
 import io
 import json
 import logging
-import itertools
-import re
 
-from icu import Transliterator
-
+from nominatim.config import flatten_config_list
 from nominatim.db.properties import set_property, get_property
 from nominatim.errors import UsageError
-from nominatim.tokenizer.icu_name_processor import ICUNameProcessor
 from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
-import nominatim.tokenizer.icu_variants as variants
+from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
+import nominatim.tools.country_info
 
 LOG = logging.getLogger()
 
@@ -22,33 +20,15 @@ DBCFG_IMPORT_TRANS_RULES = "tokenizer_import_transliteration"
 DBCFG_IMPORT_ANALYSIS_RULES = "tokenizer_import_analysis_rules"
 
 
-def _flatten_config_list(content):
-    if not content:
-        return []
-
-    if not isinstance(content, list):
-        raise UsageError("List expected in ICU configuration.")
-
-    output = []
-    for ele in content:
-        if isinstance(ele, list):
-            output.extend(_flatten_config_list(ele))
-        else:
-            output.append(ele)
-
-    return output
-
-
-class VariantRule:
-    """ Saves a single variant expansion.
-
-        An expansion consists of the normalized replacement term and
-        a dicitonary of properties that describe when the expansion applies.
+def _get_section(rules, section):
+    """ Get the section named 'section' from the rules. If the section does
+        not exist, raise a usage error with a meaningful message.
     """
+    if section not in rules:
+        LOG.fatal("Section '%s' not found in tokenizer config.", section)
+        raise UsageError("Syntax error in tokenizer configuration file.")
 
-    def __init__(self, replacement, properties):
-        self.replacement = replacement
-        self.properties = properties or {}
+    return rules[section]
 
 
 class ICURuleLoader:
@@ -59,12 +39,13 @@ class ICURuleLoader:
         rules = config.load_sub_configuration('icu_tokenizer.yaml',
                                               config='TOKENIZER_CONFIG')
 
-        self.variants = set()
+        # Make sure country information is available to analyzers and sanitizers.
+        nominatim.tools.country_info.setup_country_config(config)
 
         self.normalization_rules = self._cfg_to_icu_rules(rules, 'normalization')
         self.transliteration_rules = self._cfg_to_icu_rules(rules, 'transliteration')
-        self.analysis_rules = self._get_section(rules, 'variants')
-        self._parse_variant_list()
+        self.analysis_rules = _get_section(rules, 'token-analysis')
+        self._setup_analysis()
 
         # Load optional sanitizer rule set.
         self.sanitizer_rules = rules.get('sanitizers', [])
@@ -77,7 +58,7 @@ class ICURuleLoader:
         self.normalization_rules = get_property(conn, DBCFG_IMPORT_NORM_RULES)
         self.transliteration_rules = get_property(conn, DBCFG_IMPORT_TRANS_RULES)
         self.analysis_rules = json.loads(get_property(conn, DBCFG_IMPORT_ANALYSIS_RULES))
-        self._parse_variant_list()
+        self._setup_analysis()
 
 
     def save_config_to_db(self, conn):
@@ -98,9 +79,8 @@ class ICURuleLoader:
     def make_token_analysis(self):
         """ Create a token analyser from the reviouly loaded rules.
         """
-        return ICUNameProcessor(self.normalization_rules,
-                                self.transliteration_rules,
-                                self.variants)
+        return ICUTokenAnalysis(self.normalization_rules,
+                                self.transliteration_rules, self.analysis)
 
 
     def get_search_rules(self):
@@ -115,159 +95,66 @@ class ICURuleLoader:
         rules.write(self.transliteration_rules)
         return rules.getvalue()
 
 
     def get_normalization_rules(self):
         """ Return rules for normalisation of a term.
         """
         return self.normalization_rules
 
 
     def get_transliteration_rules(self):
         """ Return the rules for converting a string into its asciii representation.
         """
         return self.transliteration_rules
 
-    def get_replacement_pairs(self):
-        """ Return the list of possible compound decompositions with
-            application of abbreviations included.
-            The result is a list of pairs: the first item is the sequence to
-            replace, the second is a list of replacements.
-        """
-        return self.variants
+
+    def _setup_analysis(self):
+        """ Process the rules used for creating the various token analyzers.
+        """
+        self.analysis = {}
+
+        if not isinstance(self.analysis_rules, list):
+            raise UsageError("Configuration section 'token-analysis' must be a list.")
+
+        for section in self.analysis_rules:
+            name = section.get('id', None)
+            if name in self.analysis:
+                if name is None:
+                    LOG.fatal("ICU tokenizer configuration has two default token analyzers.")
+                else:
+                    LOG.fatal("ICU tokenizer configuration has two token "
+                              "analyzers with id '%s'.", name)
+                raise UsageError("Syntax error in ICU tokenizer config.")
+            self.analysis[name] = TokenAnalyzerRule(section, self.normalization_rules)
 
 
     @staticmethod
-    def _get_section(rules, section):
-        """ Get the section named 'section' from the rules. If the section does
-            not exist, raise a usage error with a meaningful message.
-        """
-        if section not in rules:
-            LOG.fatal("Section '%s' not found in tokenizer config.", section)
-            raise UsageError("Syntax error in tokenizer configuration file.")
-
-        return rules[section]
-
-
-    def _cfg_to_icu_rules(self, rules, section):
+    def _cfg_to_icu_rules(rules, section):
         """ Load an ICU ruleset from the given section. If the section is a
             simple string, it is interpreted as a file name and the rules are
             loaded verbatim from the given file. The filename is expected to be
            relative to the tokenizer rule file. If the section is a list then
             each line is assumed to be a rule. All rules are concatenated and returned.
         """
-        content = self._get_section(rules, section)
+        content = _get_section(rules, section)
 
         if content is None:
             return ''
 
-        return ';'.join(_flatten_config_list(content)) + ';'
-
-
-    def _parse_variant_list(self):
-        rules = self.analysis_rules
-
-        self.variants.clear()
-
-        if not rules:
-            return
-
-        rules = _flatten_config_list(rules)
-
-        vmaker = _VariantMaker(self.normalization_rules)
-
-        properties = []
-        for section in rules:
-            # Create the property field and deduplicate against existing
-            # instances.
-            props = variants.ICUVariantProperties.from_rules(section)
-            for existing in properties:
-                if existing == props:
-                    props = existing
-                    break
-            else:
-                properties.append(props)
-
-            for rule in (section.get('words') or []):
-                self.variants.update(vmaker.compute(rule, props))
-
-
-class _VariantMaker:
-    """ Generater for all necessary ICUVariants from a single variant rule.
-
-        All text in rules is normalized to make sure the variants match later.
-    """
-
-    def __init__(self, norm_rules):
-        self.norm = Transliterator.createFromRules("rule_loader_normalization",
-                                                   norm_rules)
-
-    def compute(self, rule, props):
-        """ Generator for all ICUVariant tuples from a single variant rule.
-        """
-        parts = re.split(r'(\|)?([=-])>', rule)
-        if len(parts) != 4:
-            raise UsageError("Syntax error in variant rule: " + rule)
-
-        decompose = parts[1] is None
-        src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
-        repl_terms = (self.norm.transliterate(t.strip()) for t in parts[3].split(','))
-
-        # If the source should be kept, add a 1:1 replacement
-        if parts[2] == '-':
-            for src in src_terms:
-                if src:
-                    for froms, tos in _create_variants(*src, src[0], decompose):
-                        yield variants.ICUVariant(froms, tos, props)
-
-        for src, repl in itertools.product(src_terms, repl_terms):
-            if src and repl:
-                for froms, tos in _create_variants(*src, repl, decompose):
-                    yield variants.ICUVariant(froms, tos, props)
-
-
-    def _parse_variant_word(self, name):
-        name = name.strip()
-        match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
-        if match is None or (match.group(1) == '~' and match.group(3) == '~'):
-            raise UsageError("Invalid variant word descriptor '{}'".format(name))
-        norm_name = self.norm.transliterate(match.group(2))
-        if not norm_name:
-            return None
-
-        return norm_name, match.group(1), match.group(3)
-
-
-_FLAG_MATCH = {'^': '^ ',
-               '$': ' ^',
-               '': ' '}
-
-
-def _create_variants(src, preflag, postflag, repl, decompose):
-    if preflag == '~':
-        postfix = _FLAG_MATCH[postflag]
-        # suffix decomposition
-        src = src + postfix
-        repl = repl + postfix
-
-        yield src, repl
-        yield ' ' + src, ' ' + repl
-
-        if decompose:
-            yield src, ' ' + repl
-            yield ' ' + src, repl
-    elif postflag == '~':
-        # prefix decomposition
-        prefix = _FLAG_MATCH[preflag]
-        src = prefix + src
-        repl = prefix + repl
-
-        yield src, repl
-        yield src + ' ', repl + ' '
-
-        if decompose:
-            yield src, repl + ' '
-            yield src + ' ', repl
-    else:
-        prefix = _FLAG_MATCH[preflag]
-        postfix = _FLAG_MATCH[postflag]
-
-        yield prefix + src + postfix, prefix + repl + postfix
+        return ';'.join(flatten_config_list(content, section)) + ';'
+
+
+class TokenAnalyzerRule:
+    """ Factory for a single analysis module. The class saves the configuration
+        and creates a new token analyzer on request.
+    """
+
+    def __init__(self, rules, normalization_rules):
+        # Find the analysis module
+        module_name = 'nominatim.tokenizer.token_analysis.' \
+                      + _get_section(rules, 'analyzer').replace('-', '_')
+        analysis_mod = importlib.import_module(module_name)
+        self.create = analysis_mod.create
+
+        # Load the configuration.
+        self.config = analysis_mod.configure(rules, normalization_rules)
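The duplicate-id check in the new `_setup_analysis` boils down to building a dict keyed by each rule's optional `id`, with `None` reserved for the default analyzer. A minimal stand-alone sketch (the `'de'` id and `ValueError` are stand-ins for illustration):

```python
# Mirror of the _setup_analysis bookkeeping, without the module loading:
# each rule section is keyed by its optional 'id'; None marks the default.
def build_registry(analysis_rules):
    registry = {}
    for section in analysis_rules:
        name = section.get('id', None)
        if name in registry:
            raise ValueError(f"duplicate token analyzer id: {name!r}")
        registry[name] = section
    return registry

registry = build_registry([{'analyzer': 'generic'},
                           {'id': 'de', 'analyzer': 'generic'}])
# registry keys: None (the default analyzer) and 'de'
```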
nominatim/tokenizer/icu_token_analysis.py (new file, 23 lines)
@@ -0,0 +1,23 @@
+"""
+Container class collecting all components required to transform an OSM name
+into a Nominatim token.
+"""
+
+from icu import Transliterator
+
+class ICUTokenAnalysis:
+    """ Container class collecting the transliterators and token analysis
+        modules for a single NameAnalyser instance.
+    """
+
+    def __init__(self, norm_rules, trans_rules, analysis_rules):
+        self.normalizer = Transliterator.createFromRules("icu_normalization",
+                                                         norm_rules)
+        trans_rules += ";[:Space:]+ > ' '"
+        self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
+                                                       trans_rules)
+        self.search = Transliterator.createFromRules("icu_search",
+                                                     norm_rules + trans_rules)
+
+        self.analysis = {name: arules.create(self.to_ascii, arules.config)
+                         for name, arules in analysis_rules.items()}
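The last two lines of the new class wire the configured analyzers together: every rule object exposes a `create` factory and a prebuilt `config`, and the container instantiates one analyzer per id. A stand-alone sketch with dummy rule objects (everything here is invented; no ICU involved, `str.lower` stands in for the transliterator):

```python
class DummyRule:
    # Stand-in for TokenAnalyzerRule: holds a config and a create() factory.
    def __init__(self, suffix):
        self.config = suffix

    def create(self, to_ascii, config):
        return lambda name: to_ascii(name) + config

to_ascii = str.lower   # stand-in for the icu_to_ascii transliterator
rules = {None: DummyRule(''), 'de': DummyRule(':de')}

# The same dict comprehension as in ICUTokenAnalysis.__init__:
analysis = {name: arules.create(to_ascii, arules.config)
            for name, arules in rules.items()}

analysis['de']('Haus')   # 'haus:de'
```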
@@ -164,7 +164,7 @@ class LegacyICUTokenizer(AbstractTokenizer):
         """ Count the partial terms from the names in the place table.
         """
         words = Counter()
-        name_proc = self.loader.make_token_analysis()
+        analysis = self.loader.make_token_analysis()
 
         with conn.cursor(name="words") as cur:
             cur.execute(""" SELECT v, count(*) FROM
@@ -172,12 +172,10 @@ class LegacyICUTokenizer(AbstractTokenizer):
                             WHERE length(v) < 75 GROUP BY v""")
 
             for name, cnt in cur:
-                terms = set()
-                for word in name_proc.get_variants_ascii(name_proc.get_normalized(name)):
-                    if ' ' in word:
-                        terms.update(word.split())
-                for term in terms:
-                    words[term] += cnt
+                word = analysis.search.transliterate(name)
+                if word and ' ' in word:
+                    for term in set(word.split()):
+                        words[term] += cnt
 
         return words
 
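The rewritten loop no longer expands variants while counting: it transliterates each full name once and counts every distinct term of multi-word results. A sketch of the new counting logic, with a plain callable standing in for `analysis.search.transliterate`:

```python
from collections import Counter

def count_partial_terms(rows, transliterate):
    # mirrors the rewritten loop: one transliteration per name, and each
    # distinct term of a multi-word result is counted once per row count
    words = Counter()
    for name, cnt in rows:
        word = transliterate(name)
        if word and ' ' in word:
            for term in set(word.split()):
                words[term] += cnt
    return words

counts = count_partial_terms([('Grosse Halle', 3), ('Halle', 5)], str.lower)
print(counts)
```

Single-word names ('Halle' above) contribute nothing: only partial terms, i.e. words that occur inside multi-word names, are counted.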
@@ -209,14 +207,14 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
     def _search_normalized(self, name):
         """ Return the search token transliteration of the given name.
         """
-        return self.token_analysis.get_search_normalized(name)
+        return self.token_analysis.search.transliterate(name).strip()
 
 
     def _normalized(self, name):
         """ Return the normalized version of the given name with all
            non-relevant information removed.
         """
-        return self.token_analysis.get_normalized(name)
+        return self.token_analysis.normalizer.transliterate(name).strip()
 
 
     def get_word_token_info(self, words):
@@ -456,6 +454,7 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
         if addr_terms:
             token_info.add_address_terms(addr_terms)
 
 
     def _compute_partial_tokens(self, name):
         """ Normalize the given term, split it into partial words and return
            then token list for them.
@@ -492,19 +491,25 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
         partial_tokens = set()
 
         for name in names:
+            analyzer_id = name.get_attr('analyzer')
             norm_name = self._normalized(name.name)
-            full, part = self._cache.names.get(norm_name, (None, None))
+            if analyzer_id is None:
+                token_id = norm_name
+            else:
+                token_id = f'{norm_name}@{analyzer_id}'
+
+            full, part = self._cache.names.get(token_id, (None, None))
             if full is None:
-                variants = self.token_analysis.get_variants_ascii(norm_name)
+                variants = self.token_analysis.analysis[analyzer_id].get_variants_ascii(norm_name)
                 if not variants:
                     continue
 
                 with self.conn.cursor() as cur:
                     cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
-                                (norm_name, variants))
+                                (token_id, variants))
                     full, part = cur.fetchone()
 
-                self._cache.names[norm_name] = (full, part)
+                self._cache.names[token_id] = (full, part)
 
             full_tokens.add(full)
             partial_tokens.update(part)
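With multiple analyzers, the cache and database key is no longer the normalized name alone: names processed by a language-specific analyzer get an `@<analyzer>` suffix so that, e.g., the German and the default variants of the same spelling stay distinct. A hypothetical helper mirroring the added branch:

```python
def make_token_id(norm_name, analyzer_id):
    # hypothetical helper (the diff inlines this as an if/else):
    # tokens from a language-specific analyzer get a distinct key
    if analyzer_id is None:
        return norm_name
    return f'{norm_name}@{analyzer_id}'

print(make_token_id('hauptstrasse', None))
print(make_token_id('hauptstrasse', 'de'))
```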
@@ -1,25 +0,0 @@
-"""
-Data structures for saving variant expansions for ICU tokenizer.
-"""
-from collections import namedtuple
-
-_ICU_VARIANT_PORPERTY_FIELDS = ['lang']
-
-
-class ICUVariantProperties(namedtuple('_ICUVariantProperties', _ICU_VARIANT_PORPERTY_FIELDS)):
-    """ Data container for saving properties that describe when a variant
-        should be applied.
-
-        Property instances are hashable.
-    """
-    @classmethod
-    def from_rules(cls, _):
-        """ Create a new property type from a generic dictionary.
-
-            The function only takes into account the properties that are
-            understood presently and ignores all others.
-        """
-        return cls(lang=None)
-
-
-ICUVariant = namedtuple('ICUVariant', ['source', 'replacement', 'properties'])
@@ -1,5 +1,9 @@
 """
-Name processor that splits name values with multiple values into their components.
+Sanitizer that splits lists of names into their components.
+
+Arguments:
+    delimiters: Define the set of characters to be used for
+                splitting the list. (default: `,;`)
 """
 import re
 
@@ -7,9 +11,7 @@ from nominatim.errors import UsageError
 
 def create(func):
     """ Create a name processing function that splits name values with
-        multiple values into their components. The optional parameter
-        'delimiters' can be used to define the characters that should be used
-        for splitting. The default is ',;'.
+        multiple values into their components.
     """
     delimiter_set = set(func.get('delimiters', ',;'))
     if not delimiter_set:
@@ -24,7 +26,6 @@ def create(func):
         new_names = []
         for name in obj.names:
             split_names = regexp.split(name.name)
-            print(split_names)
             if len(split_names) == 1:
                 new_names.append(name)
             else:
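The hunk removes a leftover debug `print` and moves the delimiter documentation into the module docstring. The compiled splitting pattern itself is not visible in this hunk; assuming a character class over the configured delimiters with optional surrounding whitespace, the split behaves like:

```python
import re

# assumed shape of the splitting regex; only the delimiter option and
# the regexp.split() call are shown in the diff
delimiters = ',;'  # the sanitizer's documented default
regexp = re.compile(r'\s*[{}]\s*'.format(re.escape(delimiters)))

parts = regexp.split('Biel;Bienne, Biel/Bienne')
print(parts)
```

Note that `/` is not a delimiter by default, so `Biel/Bienne` survives as one name.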
@@ -1,11 +1,12 @@
 """
-Sanitizer handling names with addendums in braces.
+This sanitizer creates additional name variants for names that have
+addendums in brackets (e.g. "Halle (Saale)"). The additional variant contains
+only the main name part with the bracket part removed.
 """
 
 def create(_):
     """ Create a name processing function that creates additional name variants
-        when a name has an addendum in brackets (e.g. "Halle (Saale)"). The
-        additional variant only contains the main name without the bracket part.
+        for bracket addendums.
     """
     def _process(obj):
         """ Add variants for names that have a bracket extension.
nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py | 103 (new file)
@@ -0,0 +1,103 @@
+"""
+This sanitizer sets the `analyzer` property depending on the
+language of the tag. The language is taken from the suffix of the name.
+If a name already has an analyzer tagged, then this is kept.
+
+Arguments:
+
+    filter-kind: Restrict the names the sanitizer should be applied to
+                 to the given tags. The parameter expects a list of
+                 regular expressions which are matched against `kind`.
+                 Note that a match against the full string is expected.
+    whitelist: Restrict the set of languages that should be tagged.
+               Expects a list of acceptable suffixes. When unset,
+               all 2- and 3-letter lower-case codes are accepted.
+    use-defaults: Configure what happens when the name has no suffix.
+                  When set to 'all', a variant is created for
+                  each of the default languages in the country
+                  the feature is in. When set to 'mono', a variant is
+                  only created, when exactly one language is spoken
+                  in the country. The default is to do nothing with
+                  the default languages of a country.
+    mode: Define how the variants are created and may be 'replace' or
+          'append'. When set to 'append' the original name (without
+          any analyzer tagged) is retained. (default: replace)
+
+"""
+import re
+
+from nominatim.tools import country_info
+
+class _AnalyzerByLanguage:
+    """ Processor for tagging the language of names in a place.
+    """
+
+    def __init__(self, config):
+        if 'filter-kind' in config:
+            self.regexes = [re.compile(regex) for regex in config['filter-kind']]
+        else:
+            self.regexes = None
+
+        self.replace = config.get('mode', 'replace') != 'append'
+        self.whitelist = config.get('whitelist')
+
+        self.__compute_default_languages(config.get('use-defaults', 'no'))
+
+
+    def __compute_default_languages(self, use_defaults):
+        self.deflangs = {}
+
+        if use_defaults in ('mono', 'all'):
+            for ccode, prop in country_info.iterate():
+                clangs = prop['languages']
+                if len(clangs) == 1 or use_defaults == 'all':
+                    if self.whitelist:
+                        self.deflangs[ccode] = [l for l in clangs if l in self.whitelist]
+                    else:
+                        self.deflangs[ccode] = clangs
+
+
+    def _kind_matches(self, kind):
+        if self.regexes is None:
+            return True
+
+        return any(regex.fullmatch(kind) for regex in self.regexes)
+
+
+    def _suffix_matches(self, suffix):
+        if self.whitelist is None:
+            return len(suffix) in (2, 3) and suffix.islower()
+
+        return suffix in self.whitelist
+
+
+    def __call__(self, obj):
+        if not obj.names:
+            return
+
+        more_names = []
+
+        for name in (n for n in obj.names
+                     if not n.has_attr('analyzer') and self._kind_matches(n.kind)):
+            if name.suffix:
+                langs = [name.suffix] if self._suffix_matches(name.suffix) else None
+            else:
+                langs = self.deflangs.get(obj.place.country_code)
+
+
+            if langs:
+                if self.replace:
+                    name.set_attr('analyzer', langs[0])
+                else:
+                    more_names.append(name.clone(attr={'analyzer': langs[0]}))
+
+                more_names.extend(name.clone(attr={'analyzer': l}) for l in langs[1:])
+
+        obj.names.extend(more_names)
+
+
+def create(config):
+    """ Create a function that sets the analyzer property depending on the
+        language of the tag.
+    """
+    return _AnalyzerByLanguage(config)
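The default language detection in `_suffix_matches` can be summarized standalone: without a whitelist, any 2- or 3-letter lower-case name suffix (e.g. the `de` in `name:de`) counts as a language code.

```python
def suffix_matches(suffix, whitelist=None):
    # default rule from _suffix_matches: accept 2- and 3-letter
    # lower-case codes when no whitelist is configured
    if whitelist is None:
        return len(suffix) in (2, 3) and suffix.islower()
    return suffix in whitelist

print(suffix_matches('de'), suffix_matches('DE'), suffix_matches('gsw', ['de', 'fr']))
```

With a whitelist configured, only the listed suffixes are accepted, so `gsw` above is rejected even though it is a valid 3-letter code.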
nominatim/tokenizer/token_analysis/__init__.py | 0 (new file)
nominatim/tokenizer/token_analysis/generic.py | 224 (new file)
@@ -0,0 +1,224 @@
+"""
+Generic processor for names that creates abbreviation variants.
+"""
+from collections import defaultdict, namedtuple
+import itertools
+import re
+
+from icu import Transliterator
+import datrie
+
+from nominatim.config import flatten_config_list
+from nominatim.errors import UsageError
+
+### Configuration section
+
+ICUVariant = namedtuple('ICUVariant', ['source', 'replacement'])
+
+def configure(rules, normalization_rules):
+    """ Extract and preprocess the configuration for this module.
+    """
+    config = {}
+
+    config['replacements'], config['chars'] = _get_variant_config(rules.get('variants'),
+                                                                  normalization_rules)
+    config['variant_only'] = rules.get('mode', '') == 'variant-only'
+
+    return config
+
+
+def _get_variant_config(rules, normalization_rules):
+    """ Convert the variant definition from the configuration into
+        replacement sets.
+    """
+    immediate = defaultdict(list)
+    chars = set()
+
+    if rules:
+        vset = set()
+        rules = flatten_config_list(rules, 'variants')
+
+        vmaker = _VariantMaker(normalization_rules)
+
+        for section in rules:
+            for rule in (section.get('words') or []):
+                vset.update(vmaker.compute(rule))
+
+        # Intermediate reorder by source. Also compute required character set.
+        for variant in vset:
+            if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
+                replstr = variant.replacement[:-1]
+            else:
+                replstr = variant.replacement
+            immediate[variant.source].append(replstr)
+            chars.update(variant.source)
+
+    return list(immediate.items()), ''.join(chars)
+
+
+class _VariantMaker:
+    """ Generater for all necessary ICUVariants from a single variant rule.
+
+        All text in rules is normalized to make sure the variants match later.
+    """
+
+    def __init__(self, norm_rules):
+        self.norm = Transliterator.createFromRules("rule_loader_normalization",
+                                                   norm_rules)
+
+
+    def compute(self, rule):
+        """ Generator for all ICUVariant tuples from a single variant rule.
+        """
+        parts = re.split(r'(\|)?([=-])>', rule)
+        if len(parts) != 4:
+            raise UsageError("Syntax error in variant rule: " + rule)
+
+        decompose = parts[1] is None
+        src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
+        repl_terms = (self.norm.transliterate(t).strip() for t in parts[3].split(','))
+
+        # If the source should be kept, add a 1:1 replacement
+        if parts[2] == '-':
+            for src in src_terms:
+                if src:
+                    for froms, tos in _create_variants(*src, src[0], decompose):
+                        yield ICUVariant(froms, tos)
+
+        for src, repl in itertools.product(src_terms, repl_terms):
+            if src and repl:
+                for froms, tos in _create_variants(*src, repl, decompose):
+                    yield ICUVariant(froms, tos)
+
+
+    def _parse_variant_word(self, name):
+        name = name.strip()
+        match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
+        if match is None or (match.group(1) == '~' and match.group(3) == '~'):
+            raise UsageError("Invalid variant word descriptor '{}'".format(name))
+        norm_name = self.norm.transliterate(match.group(2)).strip()
+        if not norm_name:
+            return None
+
+        return norm_name, match.group(1), match.group(3)
+
+
+_FLAG_MATCH = {'^': '^ ',
+               '$': ' ^',
+               '': ' '}
+
+
+def _create_variants(src, preflag, postflag, repl, decompose):
+    if preflag == '~':
+        postfix = _FLAG_MATCH[postflag]
+        # suffix decomposition
+        src = src + postfix
+        repl = repl + postfix
+
+        yield src, repl
+        yield ' ' + src, ' ' + repl
+
+        if decompose:
+            yield src, ' ' + repl
+            yield ' ' + src, repl
+    elif postflag == '~':
+        # prefix decomposition
+        prefix = _FLAG_MATCH[preflag]
+        src = prefix + src
+        repl = prefix + repl
+
+        yield src, repl
+        yield src + ' ', repl + ' '
+
+        if decompose:
+            yield src, repl + ' '
+            yield src + ' ', repl
+    else:
+        prefix = _FLAG_MATCH[preflag]
+        postfix = _FLAG_MATCH[postflag]
+
+        yield prefix + src + postfix, prefix + repl + postfix
+
+
+### Analysis section
+
+def create(transliterator, config):
+    """ Create a new token analysis instance for this module.
+    """
+    return GenericTokenAnalysis(transliterator, config)
+
+
+class GenericTokenAnalysis:
+    """ Collects the different transformation rules for normalisation of names
+        and provides the functions to apply the transformations.
+    """
+
+    def __init__(self, to_ascii, config):
+        self.to_ascii = to_ascii
+        self.variant_only = config['variant_only']
+
+        # Set up datrie
+        if config['replacements']:
+            self.replacements = datrie.Trie(config['chars'])
+            for src, repllist in config['replacements']:
+                self.replacements[src] = repllist
+        else:
+            self.replacements = None
+
+
+    def get_variants_ascii(self, norm_name):
+        """ Compute the spelling variants for the given normalized name
+            and transliterate the result.
+        """
+        baseform = '^ ' + norm_name + ' ^'
+        partials = ['']
+
+        startpos = 0
+        if self.replacements is not None:
+            pos = 0
+            force_space = False
+            while pos < len(baseform):
+                full, repl = self.replacements.longest_prefix_item(baseform[pos:],
+                                                                   (None, None))
+                if full is not None:
+                    done = baseform[startpos:pos]
+                    partials = [v + done + r
+                                for v, r in itertools.product(partials, repl)
+                                if not force_space or r.startswith(' ')]
+                    if len(partials) > 128:
+                        # If too many variants are produced, they are unlikely
+                        # to be helpful. Only use the original term.
+                        startpos = 0
+                        break
+                    startpos = pos + len(full)
+                    if full[-1] == ' ':
+                        startpos -= 1
+                        force_space = True
+                    pos = startpos
+                else:
+                    pos += 1
+                    force_space = False
+
+        # No variants detected? Fast return.
+        if startpos == 0:
+            if self.variant_only:
+                return []
+
+            trans_name = self.to_ascii.transliterate(norm_name).strip()
+            return [trans_name] if trans_name else []
+
+        return self._compute_result_set(partials, baseform[startpos:],
+                                        norm_name if self.variant_only else '')
+
+
+    def _compute_result_set(self, partials, prefix, exclude):
+        results = set()
+
+        for variant in partials:
+            vname = (variant + prefix)[1:-1].strip()
+            if vname != exclude:
+                trans_name = self.to_ascii.transliterate(vname).strip()
+                if trans_name:
+                    results.add(trans_name)
+
+        return list(results)
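`_VariantMaker.compute` relies on `re.split` with capturing groups to break a rule string into exactly four parts: the source word list, an optional `|` flag that suppresses decomposition, the operator character (`=` for replace, `-` for keep-and-replace), and the replacement list. For example:

```python
import re

# rule grammar: "<sources> <op> <replacements>" where <op> is '->' or
# '=>', optionally written '|->' / '|=>' to suppress decomposition
parts = re.split(r'(\|)?([=-])>', '~strasse,~straße => str')
print(parts)

with_flag = re.split(r'(\|)?([=-])>', 'street |=> st')
print(with_flag)
```

Because both groups capture, a well-formed rule always splits into four elements; anything else raises a `UsageError` in `compute`.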
@@ -13,12 +13,21 @@ class _CountryInfo:
     def __init__(self):
         self._info = {}
 
 
     def load(self, config):
         """ Load the country properties from the configuration files,
            if they are not loaded yet.
         """
         if not self._info:
             self._info = config.load_sub_configuration('country_settings.yaml')
+            # Convert languages into a list for simpler handling.
+            for prop in self._info.values():
+                if 'languages' not in prop:
+                    prop['languages'] = []
+                elif not isinstance(prop['languages'], list):
+                    prop['languages'] = [x.strip()
+                                         for x in prop['languages'].split(',')]
 
 
     def items(self):
         """ Return tuples of (country_code, property dict) as iterable.
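The added loader code guarantees that every country's `languages` property is a list, regardless of how it was written in the YAML. A standalone sketch of the same normalization:

```python
def normalize_languages(prop):
    # mirrors the added loader code: 'languages' always ends up a list,
    # splitting comma-separated strings and defaulting to empty
    if 'languages' not in prop:
        prop['languages'] = []
    elif not isinstance(prop['languages'], list):
        prop['languages'] = [x.strip() for x in prop['languages'].split(',')]
    return prop

print(normalize_languages({'languages': 'de, fr ,it'}))
```

This is what lets downstream consumers such as the `tag-analyzer-by-language` sanitizer iterate over `prop['languages']` without re-parsing.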
@@ -36,6 +45,12 @@ def setup_country_config(config):
     _COUNTRY_INFO.load(config)
 
 
+def iterate():
+    """ Iterate over country code and properties.
+    """
+    return _COUNTRY_INFO.items()
+
+
 def setup_country_tables(dsn, sql_dir, ignore_partitions=False):
     """ Create and populate the tables with basic static data that provides
         the background for geocoding. Data is assumed to not yet exist.
@@ -50,10 +65,7 @@ def setup_country_tables(dsn, sql_dir, ignore_partitions=False):
             partition = 0
         else:
             partition = props.get('partition')
-        if ',' in (props.get('languages', ',') or ','):
-            lang = None
-        else:
-            lang = props['languages']
+        lang = props['languages'][0] if len(props['languages']) == 1 else None
         params.append((ccode, partition, lang))
 
     with connect(dsn) as conn:
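Since `languages` is now always a list, the old string-based comma check collapses into a one-liner: a country gets a default language only when exactly one language is listed.

```python
def default_language(languages):
    # the simplified rule from the hunk: a country-wide default language
    # exists only when exactly one language is listed
    return languages[0] if len(languages) == 1 else None

print(default_language(['de']))
print(default_language(['de', 'fr', 'it']))
```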
@@ -171,7 +171,7 @@ bt:
 # (Bouvet Island)
 bv:
     partition: 185
-    languages: no
+    languages: "no"
 
 # Botswana (Botswana)
 bw:
@@ -1006,7 +1006,7 @@ si:
 # (Svalbard and Jan Mayen)
 sj:
     partition: 197
-    languages: no
+    languages: "no"
 
 # Slovakia (Slovensko)
 sk:
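The quoting matters because YAML 1.1 resolves several bare scalars to booleans, so an unquoted `no` loads as `False` rather than the Norwegian language code. A quick check against the YAML 1.1 boolean scalar set (the exact set can vary by loader; this list follows the YAML 1.1 `bool` tag):

```python
# scalars the YAML 1.1 bool tag resolves to True/False (case-insensitive)
YAML_11_BOOLS = {'y', 'n', 'yes', 'no', 'true', 'false', 'on', 'off'}

def needs_quoting(scalar):
    # a language code that would parse as a boolean must be quoted
    # to survive as a string
    return scalar.lower() in YAML_11_BOOLS

print(needs_quoting('no'), needs_quoting('de'))
```

Of the codes in use, only Norwegian's `no` collides, hence the two quoted occurrences above.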
@@ -27,34 +27,160 @@ transliteration:
 sanitizers:
     - step: split-name-list
     - step: strip-brace-terms
-variants:
-    - !include icu-rules/variants-bg.yaml
-    - !include icu-rules/variants-ca.yaml
-    - !include icu-rules/variants-cs.yaml
-    - !include icu-rules/variants-da.yaml
-    - !include icu-rules/variants-de.yaml
-    - !include icu-rules/variants-el.yaml
-    - !include icu-rules/variants-en.yaml
-    - !include icu-rules/variants-es.yaml
-    - !include icu-rules/variants-et.yaml
-    - !include icu-rules/variants-eu.yaml
-    - !include icu-rules/variants-fi.yaml
-    - !include icu-rules/variants-fr.yaml
-    - !include icu-rules/variants-gl.yaml
-    - !include icu-rules/variants-hu.yaml
-    - !include icu-rules/variants-it.yaml
-    - !include icu-rules/variants-ja.yaml
-    - !include icu-rules/variants-mg.yaml
-    - !include icu-rules/variants-ms.yaml
-    - !include icu-rules/variants-nl.yaml
-    - !include icu-rules/variants-no.yaml
-    - !include icu-rules/variants-pl.yaml
-    - !include icu-rules/variants-pt.yaml
-    - !include icu-rules/variants-ro.yaml
-    - !include icu-rules/variants-ru.yaml
-    - !include icu-rules/variants-sk.yaml
-    - !include icu-rules/variants-sl.yaml
-    - !include icu-rules/variants-sv.yaml
-    - !include icu-rules/variants-tr.yaml
-    - !include icu-rules/variants-uk.yaml
-    - !include icu-rules/variants-vi.yaml
+    - step: tag-analyzer-by-language
+      filter-kind: [".*name.*"]
+      whitelist: [bg,ca,cs,da,de,el,en,es,et,eu,fi,fr,gl,hu,it,ja,mg,ms,nl,no,pl,pt,ro,ru,sk,sl,sv,tr,uk,vi]
+      use-defaults: all
+      mode: append
+token-analysis:
+    - analyzer: generic
+    - id: bg
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-bg.yaml
+    - id: ca
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-ca.yaml
+    - id: cs
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-cs.yaml
+    - id: da
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-da.yaml
+    - id: de
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-de.yaml
+    - id: el
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-el.yaml
+    - id: en
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-en.yaml
+    - id: es
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-es.yaml
+    - id: et
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-et.yaml
+    - id: eu
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-eu.yaml
+    - id: fi
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-fi.yaml
+    - id: fr
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-fr.yaml
+    - id: gl
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-gl.yaml
+    - id: hu
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-hu.yaml
+    - id: it
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-it.yaml
+    - id: ja
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-ja.yaml
+    - id: mg
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-mg.yaml
+    - id: ms
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-ms.yaml
+    - id: nl
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-nl.yaml
+    - id: no
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-no.yaml
+    - id: pl
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-pl.yaml
+    - id: pt
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-pt.yaml
+    - id: ro
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-ro.yaml
+    - id: ru
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-ru.yaml
+    - id: sk
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-sk.yaml
+    - id: sl
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-sl.yaml
+    - id: sv
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-sv.yaml
+    - id: tr
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-tr.yaml
+    - id: uk
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-uk.yaml
+    - id: vi
+      analyzer: generic
+      mode: variant-only
+      variants:
+          - !include icu-rules/variants-vi.yaml
@@ -52,7 +52,7 @@ Feature: Import and search of names
 
     Scenario: Special characters in name
         Given the places
-          | osm | class | type     | name |
+          | osm | class | type     | name+name:de |
           | N1  | place | locality | Jim-Knopf-Straße |
           | N2  | place | locality | Smith/Weston |
           | N3  | place | locality | space mountain |
@@ -69,10 +69,11 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
     def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
                      variants=('~gasse -> gasse', 'street => st', ),
                      sanitizers=[]):
-        cfgstr = {'normalization' : list(norm),
-                  'sanitizers' : sanitizers,
-                  'transliteration' : list(trans),
-                  'variants' : [ {'words': list(variants)}]}
+        cfgstr = {'normalization': list(norm),
+                  'sanitizers': sanitizers,
+                  'transliteration': list(trans),
+                  'token-analysis': [{'analyzer': 'generic',
+                                      'variants': [{'words': list(variants)}]}]}
         (test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(cfgstr))
         tok.loader = ICURuleLoader(test_config)
 
@@ -168,9 +169,7 @@ def test_init_word_table(tokenizer_factory, test_config, place_row, word_table):
     tok.init_new_db(test_config)
 
     assert word_table.get_partial_words() == {('test', 1),
-                                              ('no', 1), ('area', 2),
-                                              ('holz', 1), ('strasse', 1),
-                                              ('str', 1)}
+                                              ('no', 1), ('area', 2)}
 
 
 def test_init_from_project(monkeypatch, test_config, tokenizer_factory):
@@ -1,104 +0,0 @@
-"""
-Tests for import name normalisation and variant generation.
-"""
-from textwrap import dedent
-
-import pytest
-
-from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
-
-from nominatim.errors import UsageError
-
-@pytest.fixture
-def cfgfile(def_config, tmp_path):
-    project_dir = tmp_path / 'project_dir'
-    project_dir.mkdir()
-    def_config.project_dir = project_dir
-
-    def _create_config(*variants, **kwargs):
-        content = dedent("""\
-        normalization:
-            - ":: NFD ()"
-            - "'🜳' > ' '"
-            - "[[:Nonspacing Mark:] [:Cf:]] >"
-            - ":: lower ()"
-            - "[[:Punctuation:][:Space:]]+ > ' '"
-            - ":: NFC ()"
-        transliteration:
-            - ":: Latin ()"
-            - "'🜵' > ' '"
-        """)
-        content += "variants:\n - words:\n"
-        content += '\n'.join((" - " + s for s in variants)) + '\n'
-        for k, v in kwargs:
-            content += " {}: {}\n".format(k, v)
-        (project_dir / 'icu_tokenizer.yaml').write_text(content)
-
-        return def_config
-
-    return _create_config
-
-
-def get_normalized_variants(proc, name):
-    return proc.get_variants_ascii(proc.get_normalized(name))
-
-
-def test_variants_empty(cfgfile):
-    config = cfgfile('saint -> 🜵', 'street -> st')
-
-    proc = ICURuleLoader(config).make_token_analysis()
-
-    assert get_normalized_variants(proc, '🜵') == []
-    assert get_normalized_variants(proc, '🜳') == []
-    assert get_normalized_variants(proc, 'saint') == ['saint']
-
-
-VARIANT_TESTS = [
-    (('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
-    (('weg => wg',), "holzweg", {'holzweg'}),
-    (('weg -> wg',), "holzweg", {'holzweg'}),
-    (('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
-    (('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
-    (('~weg => w',), "holzweg", {'holz w', 'holzw'}),
-    (('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
-    (('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
-    (('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
-    (('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
-    (('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
-    (('weg => wg',), "Meier Weg", {'meier wg'}),
-    (('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
-    (('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
-     {'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
-    (('am => a', 'bach => b'), "am bach", {'a b'}),
-    (('am => a', '~bach => b'), "am bach", {'a b'}),
-    (('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
-    (('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
-    (('saint -> s,st', 'street -> st'), "Saint Johns Street",
-     {'saint johns street', 's johns street', 'st johns street',
-      'saint johns st', 's johns st', 'st johns st'}),
-    (('river$ -> r',), "River Bend Road", {'river bend road'}),
-    (('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
-    (('^north => n',), "North 2nd Street", {'n 2nd street'}),
-    (('^north => n',), "Airport North", {'airport north'}),
-    (('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
-    (('am => a',), "am am am am am am am am", {'a a a a a a a a'})
-]
-
-@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
-def test_variants(cfgfile, rules, name, variants):
-    config = cfgfile(*rules)
-    proc = ICURuleLoader(config).make_token_analysis()
-
-    result = get_normalized_variants(proc, name)
-
-    assert len(result) == len(set(result))
-    assert set(get_normalized_variants(proc, name)) == variants
-
-
-def test_search_normalized(cfgfile):
-    config = cfgfile('~street => s,st', 'master => mstr')
-    proc = ICURuleLoader(config).make_token_analysis()
-
-    assert proc.get_search_normalized('Master Street') == 'master street'
-    assert proc.get_search_normalized('Earnes St') == 'earnes st'
-    assert proc.get_search_normalized('Nostreet') == 'nostreet'

@@ -34,8 +34,8 @@ def cfgrules(test_config):
         - ":: Latin ()"
         - "[[:Punctuation:][:Space:]]+ > ' '"
     """)
-    content += "variants:\n - words:\n"
+    content += "token-analysis:\n - analyzer: generic\n   variants:\n     - words:\n"
     content += '\n'.join((" - " + s for s in variants)) + '\n'
     for k, v in kwargs:
         content += " {}: {}\n".format(k, v)
     (test_config.project_dir / 'icu_tokenizer.yaml').write_text(content)

@@ -49,20 +49,21 @@ def test_empty_rule_set(test_config):
     (test_config.project_dir / 'icu_tokenizer.yaml').write_text(dedent("""\
         normalization:
         transliteration:
-        variants:
+        token-analysis:
+          - analyzer: generic
+            variants:
         """))

     rules = ICURuleLoader(test_config)
     assert rules.get_search_rules() == ''
     assert rules.get_normalization_rules() == ''
     assert rules.get_transliteration_rules() == ''
-    assert list(rules.get_replacement_pairs()) == []

-CONFIG_SECTIONS = ('normalization', 'transliteration', 'variants')
+CONFIG_SECTIONS = ('normalization', 'transliteration', 'token-analysis')

 @pytest.mark.parametrize("section", CONFIG_SECTIONS)
 def test_missing_section(section, test_config):
-    rule_cfg = { s: {} for s in CONFIG_SECTIONS if s != section}
+    rule_cfg = { s: [] for s in CONFIG_SECTIONS if s != section}
     (test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(rule_cfg))

     with pytest.raises(UsageError):

@@ -107,7 +108,9 @@ def test_transliteration_rules_from_file(test_config):
         transliteration:
         - "'ax' > 'b'"
         - !include transliteration.yaml
-        variants:
+        token-analysis:
+          - analyzer: generic
+            variants:
         """))
     transpath = test_config.project_dir / ('transliteration.yaml')
     transpath.write_text('- "x > y"')

@@ -119,6 +122,15 @@ def test_transliteration_rules_from_file(test_config):
     assert trans.transliterate(" axxt ") == " byt "


+def test_search_rules(cfgrules):
+    config = cfgrules('~street => s,st', 'master => mstr')
+    proc = ICURuleLoader(config).make_token_analysis()
+
+    assert proc.search.transliterate('Master Street').strip() == 'master street'
+    assert proc.search.transliterate('Earnes St').strip() == 'earnes st'
+    assert proc.search.transliterate('Nostreet').strip() == 'nostreet'
+
+
 class TestGetReplacements:

     @pytest.fixture(autouse=True)
@@ -127,9 +139,9 @@ class TestGetReplacements:

     def get_replacements(self, *variants):
         loader = ICURuleLoader(self.cfgrules(*variants))
-        rules = loader.get_replacement_pairs()
+        rules = loader.analysis[None].config['replacements']

-        return set((v.source, v.replacement) for v in rules)
+        return sorted((k, sorted(v)) for k,v in rules)


     @pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
@@ -141,131 +153,122 @@ class TestGetReplacements:
     def test_add_full(self):
         repl = self.get_replacements("foo -> bar")

-        assert repl == {(' foo ', ' bar '), (' foo ', ' foo ')}
+        assert repl == [(' foo ', [' bar', ' foo'])]


     def test_replace_full(self):
         repl = self.get_replacements("foo => bar")

-        assert repl == {(' foo ', ' bar ')}
+        assert repl == [(' foo ', [' bar'])]


     def test_add_suffix_no_decompose(self):
         repl = self.get_replacements("~berg |-> bg")

-        assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
-                        (' berg ', ' berg '), (' berg ', ' bg ')}
+        assert repl == [(' berg ', [' berg', ' bg']),
+                        ('berg ', ['berg', 'bg'])]


     def test_replace_suffix_no_decompose(self):
         repl = self.get_replacements("~berg |=> bg")

-        assert repl == {('berg ', 'bg '), (' berg ', ' bg ')}
+        assert repl == [(' berg ', [' bg']),('berg ', ['bg'])]


     def test_add_suffix_decompose(self):
         repl = self.get_replacements("~berg -> bg")

-        assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
-                        (' berg ', ' berg '), (' berg ', 'berg '),
-                        ('berg ', 'bg '), ('berg ', ' bg '),
-                        (' berg ', 'bg '), (' berg ', ' bg ')}
+        assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
+                        ('berg ', [' berg', ' bg', 'berg', 'bg'])]


     def test_replace_suffix_decompose(self):
         repl = self.get_replacements("~berg => bg")

-        assert repl == {('berg ', 'bg '), ('berg ', ' bg '),
-                        (' berg ', 'bg '), (' berg ', ' bg ')}
+        assert repl == [(' berg ', [' bg', 'bg']),
+                        ('berg ', [' bg', 'bg'])]


     def test_add_prefix_no_compose(self):
         repl = self.get_replacements("hinter~ |-> hnt")

-        assert repl == {(' hinter', ' hinter'), (' hinter ', ' hinter '),
-                        (' hinter', ' hnt'), (' hinter ', ' hnt ')}
+        assert repl == [(' hinter', [' hinter', ' hnt']),
+                        (' hinter ', [' hinter', ' hnt'])]


     def test_replace_prefix_no_compose(self):
         repl = self.get_replacements("hinter~ |=> hnt")

-        assert repl == {(' hinter', ' hnt'), (' hinter ', ' hnt ')}
+        assert repl == [(' hinter', [' hnt']), (' hinter ', [' hnt'])]


     def test_add_prefix_compose(self):
         repl = self.get_replacements("hinter~-> h")

-        assert repl == {(' hinter', ' hinter'), (' hinter', ' hinter '),
-                        (' hinter', ' h'), (' hinter', ' h '),
-                        (' hinter ', ' hinter '), (' hinter ', ' hinter'),
-                        (' hinter ', ' h '), (' hinter ', ' h')}
+        assert repl == [(' hinter', [' h', ' h ', ' hinter', ' hinter ']),
+                        (' hinter ', [' h', ' h', ' hinter', ' hinter'])]


     def test_replace_prefix_compose(self):
         repl = self.get_replacements("hinter~=> h")

-        assert repl == {(' hinter', ' h'), (' hinter', ' h '),
-                        (' hinter ', ' h '), (' hinter ', ' h')}
+        assert repl == [(' hinter', [' h', ' h ']),
+                        (' hinter ', [' h', ' h'])]


     def test_add_beginning_only(self):
         repl = self.get_replacements("^Premier -> Pr")

-        assert repl == {('^ premier ', '^ premier '), ('^ premier ', '^ pr ')}
+        assert repl == [('^ premier ', ['^ pr', '^ premier'])]


     def test_replace_beginning_only(self):
         repl = self.get_replacements("^Premier => Pr")

-        assert repl == {('^ premier ', '^ pr ')}
+        assert repl == [('^ premier ', ['^ pr'])]


     def test_add_final_only(self):
         repl = self.get_replacements("road$ -> rd")

-        assert repl == {(' road ^', ' road ^'), (' road ^', ' rd ^')}
+        assert repl == [(' road ^', [' rd ^', ' road ^'])]


     def test_replace_final_only(self):
         repl = self.get_replacements("road$ => rd")

-        assert repl == {(' road ^', ' rd ^')}
+        assert repl == [(' road ^', [' rd ^'])]


     def test_decompose_only(self):
         repl = self.get_replacements("~foo -> foo")

-        assert repl == {('foo ', 'foo '), ('foo ', ' foo '),
-                        (' foo ', 'foo '), (' foo ', ' foo ')}
+        assert repl == [(' foo ', [' foo', 'foo']),
+                        ('foo ', [' foo', 'foo'])]


     def test_add_suffix_decompose_end_only(self):
         repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")

-        assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
-                        (' berg ', ' berg '), (' berg ', ' bg '),
-                        ('berg ^', 'berg ^'), ('berg ^', ' berg ^'),
-                        ('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
-                        (' berg ^', 'berg ^'), (' berg ^', 'bg ^'),
-                        (' berg ^', ' berg ^'), (' berg ^', ' bg ^')}
+        assert repl == [(' berg ', [' berg', ' bg']),
+                        (' berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^']),
+                        ('berg ', ['berg', 'bg']),
+                        ('berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^'])]


     def test_replace_suffix_decompose_end_only(self):
         repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")

-        assert repl == {('berg ', 'bg '), (' berg ', ' bg '),
-                        ('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
-                        (' berg ^', 'bg ^'), (' berg ^', ' bg ^')}
+        assert repl == [(' berg ', [' bg']),
+                        (' berg ^', [' bg ^', 'bg ^']),
+                        ('berg ', ['bg']),
+                        ('berg ^', [' bg ^', 'bg ^'])]


     def test_add_multiple_suffix(self):
         repl = self.get_replacements("~berg,~burg -> bg")

-        assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
-                        (' berg ', ' berg '), (' berg ', 'berg '),
-                        ('berg ', 'bg '), ('berg ', ' bg '),
-                        (' berg ', 'bg '), (' berg ', ' bg '),
-                        ('burg ', 'burg '), ('burg ', ' burg '),
-                        (' burg ', ' burg '), (' burg ', 'burg '),
-                        ('burg ', 'bg '), ('burg ', ' bg '),
-                        (' burg ', 'bg '), (' burg ', ' bg ')}
+        assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
+                        (' burg ', [' bg', ' burg', 'bg', 'burg']),
+                        ('berg ', [' berg', ' bg', 'berg', 'bg']),
+                        ('burg ', [' bg', ' burg', 'bg', 'burg'])]

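The rewritten `get_replacements` helper no longer returns a flat set of (source, replacement) pairs; it groups all replacement strings per source into a sorted list of `(source, [replacements...])` pairs. For the simple full-word case, the regrouping can be sketched as follows (illustrative only; prefix/suffix rules differ in how trailing spaces are handled):

```python
from collections import defaultdict

def group_pairs(pairs):
    # Convert old-style {(source, replacement), ...} pairs into the
    # new sorted [(source, [replacements...]), ...] representation,
    # dropping the trailing space of full-word replacements.
    grouped = defaultdict(list)
    for src, repl in pairs:
        grouped[src].append(repl.rstrip())
    return sorted((k, sorted(v)) for k, v in grouped.items())

old_style = {(' foo ', ' bar '), (' foo ', ' foo ')}
assert group_pairs(old_style) == [(' foo ', [' bar', ' foo'])]
```

The grouped form matches how the generic analyzer stores its `replacements` config, so the tests compare against sorted lists instead of sets.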
@@ -0,0 +1,259 @@
+"""
+Tests for the sanitizer that enables language-dependent analyzers.
+"""
+import pytest
+
+from nominatim.indexer.place_info import PlaceInfo
+from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
+from nominatim.tools.country_info import setup_country_config
+
+class TestWithDefaults:
+
+    @staticmethod
+    def run_sanitizer_on(country, **kwargs):
+        place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
+                           'country_code': country})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}]).process_names(place)
+
+        return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
+
+
+    def test_no_names(self):
+        assert self.run_sanitizer_on('de') == []
+
+
+    def test_simple(self):
+        res = self.run_sanitizer_on('fr', name='Foo', name_de='Zoo', ref_abc='M')
+
+        assert res == [('Foo', 'name', None, {}),
+                       ('M', 'ref', 'abc', {'analyzer': 'abc'}),
+                       ('Zoo', 'name', 'de', {'analyzer': 'de'})]
+
+
+    @pytest.mark.parametrize('suffix', ['DE', 'asbc'])
+    def test_illegal_suffix(self, suffix):
+        assert self.run_sanitizer_on('fr', **{'name_' + suffix: 'Foo'}) \
+                   == [('Foo', 'name', suffix, {})]
+
+
+class TestFilterKind:
+
+    @staticmethod
+    def run_sanitizer_on(filt, **kwargs):
+        place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
+                           'country_code': 'de'})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
+                                   'filter-kind': filt}]).process_names(place)
+
+        return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
+
+
+    def test_single_exact_name(self):
+        res = self.run_sanitizer_on(['name'], name_fr='A', ref_fr='12',
+                                    shortname_fr='C', name='D')
+
+        assert res == [('12', 'ref', 'fr', {}),
+                       ('A', 'name', 'fr', {'analyzer': 'fr'}),
+                       ('C', 'shortname', 'fr', {}),
+                       ('D', 'name', None, {})]
+
+
+    def test_single_pattern(self):
+        res = self.run_sanitizer_on(['.*name'],
+                                    name_fr='A', ref_fr='12', namexx_fr='B',
+                                    shortname_fr='C', name='D')
+
+        assert res == [('12', 'ref', 'fr', {}),
+                       ('A', 'name', 'fr', {'analyzer': 'fr'}),
+                       ('B', 'namexx', 'fr', {}),
+                       ('C', 'shortname', 'fr', {'analyzer': 'fr'}),
+                       ('D', 'name', None, {})]
+
+
+    def test_multiple_patterns(self):
+        res = self.run_sanitizer_on(['.*name', 'ref'],
+                                    name_fr='A', ref_fr='12', oldref_fr='X',
+                                    namexx_fr='B', shortname_fr='C', name='D')
+
+        assert res == [('12', 'ref', 'fr', {'analyzer': 'fr'}),
+                       ('A', 'name', 'fr', {'analyzer': 'fr'}),
+                       ('B', 'namexx', 'fr', {}),
+                       ('C', 'shortname', 'fr', {'analyzer': 'fr'}),
+                       ('D', 'name', None, {}),
+                       ('X', 'oldref', 'fr', {})]
+
+
+class TestDefaultCountry:
+
+    @pytest.fixture(autouse=True)
+    def setup_country(self, def_config):
+        setup_country_config(def_config)
+
+    @staticmethod
+    def run_sanitizer_append(mode, country, **kwargs):
+        place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
+                           'country_code': country})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
+                                   'use-defaults': mode,
+                                   'mode': 'append'}]).process_names(place)
+
+        assert all(isinstance(p.attr, dict) for p in name)
+        assert all(len(p.attr) <= 1 for p in name)
+        assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
+                   for p in name)
+
+        return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
+
+
+    @staticmethod
+    def run_sanitizer_replace(mode, country, **kwargs):
+        place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
+                           'country_code': country})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
+                                   'use-defaults': mode,
+                                   'mode': 'replace'}]).process_names(place)
+
+        assert all(isinstance(p.attr, dict) for p in name)
+        assert all(len(p.attr) <= 1 for p in name)
+        assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
+                   for p in name)
+
+        return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
+
+
+    def test_missing_country(self):
+        place = PlaceInfo({'name': {'name': 'something'}})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
+                                   'use-defaults': 'all',
+                                   'mode': 'replace'}]).process_names(place)
+
+        assert len(name) == 1
+        assert name[0].name == 'something'
+        assert name[0].suffix is None
+        assert 'analyzer' not in name[0].attr
+
+
+    def test_mono_unknown_country(self):
+        expect = [('XX', '')]
+
+        assert self.run_sanitizer_replace('mono', 'xx', name='XX') == expect
+        assert self.run_sanitizer_append('mono', 'xx', name='XX') == expect
+
+
+    def test_mono_monoling_replace(self):
+        res = self.run_sanitizer_replace('mono', 'de', name='Foo')
+
+        assert res == [('Foo', 'de')]
+
+
+    def test_mono_monoling_append(self):
+        res = self.run_sanitizer_append('mono', 'de', name='Foo')
+
+        assert res == [('Foo', ''), ('Foo', 'de')]
+
+
+    def test_mono_multiling(self):
+        expect = [('XX', '')]
+
+        assert self.run_sanitizer_replace('mono', 'ch', name='XX') == expect
+        assert self.run_sanitizer_append('mono', 'ch', name='XX') == expect
+
+
+    def test_all_unknown_country(self):
+        expect = [('XX', '')]
+
+        assert self.run_sanitizer_replace('all', 'xx', name='XX') == expect
+        assert self.run_sanitizer_append('all', 'xx', name='XX') == expect
+
+
+    def test_all_monoling_replace(self):
+        res = self.run_sanitizer_replace('all', 'de', name='Foo')
+
+        assert res == [('Foo', 'de')]
+
+
+    def test_all_monoling_append(self):
+        res = self.run_sanitizer_append('all', 'de', name='Foo')
+
+        assert res == [('Foo', ''), ('Foo', 'de')]
+
+
+    def test_all_multiling_append(self):
+        res = self.run_sanitizer_append('all', 'ch', name='XX')
+
+        assert res == [('XX', ''),
+                       ('XX', 'de'), ('XX', 'fr'), ('XX', 'it'), ('XX', 'rm')]
+
+
+    def test_all_multiling_replace(self):
+        res = self.run_sanitizer_replace('all', 'ch', name='XX')
+
+        assert res == [('XX', 'de'), ('XX', 'fr'), ('XX', 'it'), ('XX', 'rm')]
+
+
+class TestCountryWithWhitelist:
+
+    @staticmethod
+    def run_sanitizer_on(mode, country, **kwargs):
+        place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
+                           'country_code': country})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
+                                   'use-defaults': mode,
+                                   'mode': 'replace',
+                                   'whitelist': ['de', 'fr', 'ru']}]).process_names(place)
+
+        assert all(isinstance(p.attr, dict) for p in name)
+        assert all(len(p.attr) <= 1 for p in name)
+        assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
+                   for p in name)
+
+        return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
+
+
+    def test_mono_monoling(self):
+        assert self.run_sanitizer_on('mono', 'de', name='Foo') == [('Foo', 'de')]
+        assert self.run_sanitizer_on('mono', 'pt', name='Foo') == [('Foo', '')]
+
+
+    def test_mono_multiling(self):
+        assert self.run_sanitizer_on('mono', 'ca', name='Foo') == [('Foo', '')]
+
+
+    def test_all_monoling(self):
+        assert self.run_sanitizer_on('all', 'de', name='Foo') == [('Foo', 'de')]
+        assert self.run_sanitizer_on('all', 'pt', name='Foo') == [('Foo', '')]
+
+
+    def test_all_multiling(self):
+        assert self.run_sanitizer_on('all', 'ca', name='Foo') == [('Foo', 'fr')]
+        assert self.run_sanitizer_on('all', 'ch', name='Foo') \
+                   == [('Foo', 'de'), ('Foo', 'fr')]
+
+
+class TestWhiteList:
+
+    @staticmethod
+    def run_sanitizer_on(whitelist, **kwargs):
+        place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}})
+        name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
+                                   'mode': 'replace',
+                                   'whitelist': whitelist}]).process_names(place)
+
+        assert all(isinstance(p.attr, dict) for p in name)
+        assert all(len(p.attr) <= 1 for p in name)
+        assert all(not p.attr or ('analyzer' in p.attr and p.attr['analyzer'])
+                   for p in name)
+
+        return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
+
+
+    def test_in_whitelist(self):
+        assert self.run_sanitizer_on(['de', 'xx'], ref_xx='123') == [('123', 'xx')]
+
+
+    def test_not_in_whitelist(self):
+        assert self.run_sanitizer_on(['de', 'xx'], ref_yy='123') == [('123', '')]
+
+
+    def test_empty_whitelist(self):
+        assert self.run_sanitizer_on([], ref_yy='123') == [('123', '')]

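These tests exercise the `tag-analyzer-by-language` sanitizer, which derives an `analyzer` attribute from the language suffix of a name tag (e.g. `name:de`) and honours an optional whitelist. A stripped-down sketch of the core tagging idea (hypothetical `tag_by_language` helper; the real sanitizer additionally handles country defaults and append/replace modes):

```python
def tag_by_language(names, whitelist):
    # names: mapping of OSM-style keys like 'name:de' to values.
    # Returns sorted (value, analyzer) pairs; the analyzer stays
    # empty when the language suffix is missing or not whitelisted.
    tagged = []
    for key, value in names.items():
        _, _, suffix = key.partition(':')
        analyzer = suffix if suffix in whitelist else ''
        tagged.append((value, analyzer))
    return sorted(tagged)

# Mirrors the TestWhiteList cases above.
assert tag_by_language({'ref:xx': '123'}, ['de', 'xx']) == [('123', 'xx')]
assert tag_by_language({'ref:yy': '123'}, ['de', 'xx']) == [('123', '')]
```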
test/python/tokenizer/token_analysis/test_generic.py (new file, 265 lines)
@@ -0,0 +1,265 @@
+"""
+Tests for import name normalisation and variant generation.
+"""
+import pytest
+
+from icu import Transliterator
+
+import nominatim.tokenizer.token_analysis.generic as module
+from nominatim.errors import UsageError
+
+DEFAULT_NORMALIZATION = """ :: NFD ();
+                            '🜳' > ' ';
+                            [[:Nonspacing Mark:] [:Cf:]] >;
+                            :: lower ();
+                            [[:Punctuation:][:Space:]]+ > ' ';
+                            :: NFC ();
+                        """
+
+DEFAULT_TRANSLITERATION = """ :: Latin ();
+                              '🜵' > ' ';
+                          """
+
+def make_analyser(*variants, variant_only=False):
+    rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
+    if variant_only:
+        rules['mode'] = 'variant-only'
+    config = module.configure(rules, DEFAULT_NORMALIZATION)
+    trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
+
+    return module.create(trans, config)
+
+
+def get_normalized_variants(proc, name):
+    norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+    return proc.get_variants_ascii(norm.transliterate(name).strip())
+
+
+def test_no_variants():
+    rules = { 'analyzer': 'generic' }
+    config = module.configure(rules, DEFAULT_NORMALIZATION)
+    trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
+
+    proc = module.create(trans, config)
+
+    assert get_normalized_variants(proc, '大德!') == ['dà dé']
+
+
+def test_variants_empty():
+    proc = make_analyser('saint -> 🜵', 'street -> st')
+
+    assert get_normalized_variants(proc, '🜵') == []
+    assert get_normalized_variants(proc, '🜳') == []
+    assert get_normalized_variants(proc, 'saint') == ['saint']
+
+
+VARIANT_TESTS = [
+    (('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
+    (('weg => wg',), "holzweg", {'holzweg'}),
+    (('weg -> wg',), "holzweg", {'holzweg'}),
+    (('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
+    (('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
+    (('~weg => w',), "holzweg", {'holz w', 'holzw'}),
+    (('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
+    (('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
+    (('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
+    (('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
+    (('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
+    (('weg => wg',), "Meier Weg", {'meier wg'}),
+    (('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
+    (('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
+     {'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
+    (('am => a', 'bach => b'), "am bach", {'a b'}),
+    (('am => a', '~bach => b'), "am bach", {'a b'}),
+    (('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
+    (('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
+    (('saint -> s,st', 'street -> st'), "Saint Johns Street",
+     {'saint johns street', 's johns street', 'st johns street',
+      'saint johns st', 's johns st', 'st johns st'}),
+    (('river$ -> r',), "River Bend Road", {'river bend road'}),
+    (('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
+    (('^north => n',), "North 2nd Street", {'n 2nd street'}),
+    (('^north => n',), "Airport North", {'airport north'}),
+    (('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
+    (('am => a',), "am am am am am am am am", {'a a a a a a a a'})
+]
+
+@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
+def test_variants(rules, name, variants):
+    proc = make_analyser(*rules)
+
+    result = get_normalized_variants(proc, name)
+
+    assert len(result) == len(set(result))
+    assert set(get_normalized_variants(proc, name)) == variants
+
+
+VARIANT_ONLY_TESTS = [
+    (('weg => wg',), "hallo", set()),
+    (('weg => wg',), "Meier Weg", {'meier wg'}),
+    (('weg -> wg',), "Meier Weg", {'meier wg'}),
+]
+
+@pytest.mark.parametrize("rules,name,variants", VARIANT_ONLY_TESTS)
+def test_variants_only(rules, name, variants):
+    proc = make_analyser(*rules, variant_only=True)
+
+    result = get_normalized_variants(proc, name)
+
+    assert len(result) == len(set(result))
+    assert set(get_normalized_variants(proc, name)) == variants
+
+
+class TestGetReplacements:
+
+    @staticmethod
+    def configure_rules(*variants):
+        rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
+        return module.configure(rules, DEFAULT_NORMALIZATION)
+
+
+    def get_replacements(self, *variants):
+        config = self.configure_rules(*variants)
+
+        return sorted((k, sorted(v)) for k,v in config['replacements'])
+
+
+    @pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
+                                         '~foo~ -> bar', 'fo~ o -> bar'])
+    def test_invalid_variant_description(self, variant):
+        with pytest.raises(UsageError):
+            self.configure_rules(variant)
+
+
+    @pytest.mark.parametrize("rule", ["!!! -> bar", "bar => !!!"])
+    def test_ignore_unnormalizable_terms(self, rule):
+        repl = self.get_replacements(rule)
+
+        assert repl == []
+
+
+    def test_add_full(self):
+        repl = self.get_replacements("foo -> bar")
+
+        assert repl == [(' foo ', [' bar', ' foo'])]
+
+
+    def test_replace_full(self):
+        repl = self.get_replacements("foo => bar")
+
+        assert repl == [(' foo ', [' bar'])]
+
+
+    def test_add_suffix_no_decompose(self):
+        repl = self.get_replacements("~berg |-> bg")
+
+        assert repl == [(' berg ', [' berg', ' bg']),
+                        ('berg ', ['berg', 'bg'])]
+
+
+    def test_replace_suffix_no_decompose(self):
|
||||||
|
repl = self.get_replacements("~berg |=> bg")
|
||||||
|
|
||||||
|
assert repl == [(' berg ', [' bg']),('berg ', ['bg'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_suffix_decompose(self):
|
||||||
|
repl = self.get_replacements("~berg -> bg")
|
||||||
|
|
||||||
|
assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
|
||||||
|
('berg ', [' berg', ' bg', 'berg', 'bg'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_suffix_decompose(self):
|
||||||
|
repl = self.get_replacements("~berg => bg")
|
||||||
|
|
||||||
|
assert repl == [(' berg ', [' bg', 'bg']),
|
||||||
|
('berg ', [' bg', 'bg'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_prefix_no_compose(self):
|
||||||
|
repl = self.get_replacements("hinter~ |-> hnt")
|
||||||
|
|
||||||
|
assert repl == [(' hinter', [' hinter', ' hnt']),
|
||||||
|
(' hinter ', [' hinter', ' hnt'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_prefix_no_compose(self):
|
||||||
|
repl = self.get_replacements("hinter~ |=> hnt")
|
||||||
|
|
||||||
|
assert repl == [(' hinter', [' hnt']), (' hinter ', [' hnt'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_prefix_compose(self):
|
||||||
|
repl = self.get_replacements("hinter~-> h")
|
||||||
|
|
||||||
|
assert repl == [(' hinter', [' h', ' h ', ' hinter', ' hinter ']),
|
||||||
|
(' hinter ', [' h', ' h', ' hinter', ' hinter'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_prefix_compose(self):
|
||||||
|
repl = self.get_replacements("hinter~=> h")
|
||||||
|
|
||||||
|
assert repl == [(' hinter', [' h', ' h ']),
|
||||||
|
(' hinter ', [' h', ' h'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_beginning_only(self):
|
||||||
|
repl = self.get_replacements("^Premier -> Pr")
|
||||||
|
|
||||||
|
assert repl == [('^ premier ', ['^ pr', '^ premier'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_beginning_only(self):
|
||||||
|
repl = self.get_replacements("^Premier => Pr")
|
||||||
|
|
||||||
|
assert repl == [('^ premier ', ['^ pr'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_final_only(self):
|
||||||
|
repl = self.get_replacements("road$ -> rd")
|
||||||
|
|
||||||
|
assert repl == [(' road ^', [' rd ^', ' road ^'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_final_only(self):
|
||||||
|
repl = self.get_replacements("road$ => rd")
|
||||||
|
|
||||||
|
assert repl == [(' road ^', [' rd ^'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_decompose_only(self):
|
||||||
|
repl = self.get_replacements("~foo -> foo")
|
||||||
|
|
||||||
|
assert repl == [(' foo ', [' foo', 'foo']),
|
||||||
|
('foo ', [' foo', 'foo'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_suffix_decompose_end_only(self):
|
||||||
|
repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")
|
||||||
|
|
||||||
|
assert repl == [(' berg ', [' berg', ' bg']),
|
||||||
|
(' berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^']),
|
||||||
|
('berg ', ['berg', 'bg']),
|
||||||
|
('berg ^', [' berg ^', ' bg ^', 'berg ^', 'bg ^'])]
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_suffix_decompose_end_only(self):
|
||||||
|
repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")
|
||||||
|
|
||||||
|
assert repl == [(' berg ', [' bg']),
|
||||||
|
(' berg ^', [' bg ^', 'bg ^']),
|
||||||
|
('berg ', ['bg']),
|
||||||
|
('berg ^', [' bg ^', 'bg ^'])]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('rule', ["~berg,~burg -> bg",
|
||||||
|
"~berg, ~burg -> bg",
|
||||||
|
"~berg,,~burg -> bg"])
|
||||||
|
def test_add_multiple_suffix(self, rule):
|
||||||
|
repl = self.get_replacements(rule)
|
||||||
|
|
||||||
|
assert repl == [(' berg ', [' berg', ' bg', 'berg', 'bg']),
|
||||||
|
(' burg ', [' bg', ' burg', 'bg', 'burg']),
|
||||||
|
('berg ', [' berg', ' bg', 'berg', 'bg']),
|
||||||
|
('burg ', [' bg', ' burg', 'bg', 'burg'])]
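The tests above exercise the variant rule notation: `->` adds a variant while keeping the original spelling searchable, `=>` replaces it outright, `~` marks a term that may attach to a neighbouring word, and `^`/`$` anchor a rule to the beginning/end of the name. As a rough illustration of the add-versus-replace distinction only, not the actual Nominatim analyzer code, a minimal sketch for single full-word rules:

```python
# Illustration of the '->' (add) vs. '=>' (replace) rule semantics tested
# above. Hypothetical helper, not part of the Nominatim code base; it only
# handles a single full-word rule with one replacement term.

def expand_simple(rule: str, name: str) -> set:
    """Apply one full-word variant rule to a space-separated name."""
    if '=>' in rule:
        src, repl = (part.strip() for part in rule.split('=>'))
        keep_original = False          # '=>' discards the source spelling
    else:
        src, repl = (part.strip() for part in rule.split('->'))
        keep_original = True           # '->' keeps it as an extra variant

    words = name.lower().split()
    replaced = [repl if word == src else word for word in words]

    variants = {' '.join(replaced)}
    if keep_original:
        variants.add(' '.join(words))
    return variants
```

Under these assumptions, `expand_simple('weg => wg', 'Meier Weg')` yields only `{'meier wg'}`, while `expand_simple('weg -> wg', 'Meier Weg')` yields `{'meier weg', 'meier wg'}`, matching the corresponding `VARIANT_TESTS` entries; the decomposition (`~`) and anchoring (`^`, `$`) behaviour checked by `TestGetReplacements` is deliberately out of scope here.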