Merge pull request #3675 from TuringVerified/generic-preprocessors

Add generic preprocessors
This commit is contained in:
Sarah Hoffmann
2025-04-01 20:14:43 +02:00
committed by GitHub
3 changed files with 123 additions and 3 deletions

View File

@@ -67,7 +67,13 @@ Here is an example configuration file:
``` yaml
query-preprocessing:
- normalize
- step: split_japanese_phrases
- step: regex_replace
replacements:
- pattern: https?://[^\s]* # Filter URLs starting with http or https
replace: ''
- step: normalize
normalization:
- ":: lower ()"
- "ß > 'ss'" # German szet is unambiguously equal to double ss
@@ -88,8 +94,8 @@ token-analysis:
replacements: ['ä', 'ae']
```
The configuration file contains four sections:
`normalization`, `transliteration`, `sanitizers` and `token-analysis`.
The configuration file contains five sections:
`query-preprocessing`, `normalization`, `transliteration`, `sanitizers` and `token-analysis`.
#### Query preprocessing
@@ -106,6 +112,19 @@ The following is a list of preprocessors that are shipped with Nominatim.
heading_level: 6
docstring_section_style: spacy
##### regex-replace
::: nominatim_api.query_preprocessing.regex_replace
options:
members: False
heading_level: 6
docstring_section_style: spacy
description:
This option runs any given regex pattern on the input and replaces values accordingly
replacements:
- pattern: regex pattern
replace: string to replace with
#### Normalization and Transliteration