introduce sanitizer step before token analysis

Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
This commit is contained in:
Sarah Hoffmann
2021-09-30 21:30:13 +02:00
parent 16daa57e47
commit 8171fe4571
8 changed files with 259 additions and 58 deletions

View File

@@ -0,0 +1,28 @@
"""
Name processor that splits name values with multiple values into their components.
"""
import re
def create(func):
""" Create a name processing function that splits name values with
multiple values into their components. The optional parameter
'delimiters' can be used to define the characters that should be used
for splitting. The default is ',;'.
"""
regexp = re.compile('[{}]'.format(func.get('delimiters', ',;')))
def _process(obj):
if not obj.names:
return
new_names = []
for name in obj.names:
split_names = regexp.split(name.name)
if len(split_names) == 1:
new_names.append(name)
else:
new_names.extend(name.clone(name=n) for n in split_names)
obj.names = new_names
return _process