add documentation for sanitizer interface
Also switches mkdocstrings to 0.18 with the rather unfortunate consequence that now mkdocstrings-python-legacy is needed as well.
@@ -40,7 +40,8 @@ It has the following additional requirements:

The documentation is built with mkdocs:

* [mkdocs](https://www.mkdocs.org/) >= 1.1.2
* [mkdocstrings](https://mkdocstrings.github.io/)
* [mkdocstrings](https://mkdocstrings.github.io/) >= 0.16
* [mkdocstrings-python-legacy](https://mkdocstrings.github.io/python-legacy/)

### Installing prerequisites on Ubuntu/Debian
82
docs/develop/ICU-Tokenizer-Modules.md
Normal file
@@ -0,0 +1,82 @@
# Writing custom sanitizer and token analysis modules for the ICU tokenizer

The [ICU tokenizer](../customize/Tokenizers.md#icu-tokenizer) provides a
highly customizable method to pre-process and normalize the name information
of the input data before it is added to the search index. It comes with a
selection of sanitizers and token analyzers which you can use to adapt your
installation to your needs. If the provided modules are not enough, you can
also provide your own implementations. This section describes how to do that.

## Using non-standard sanitizers and token analyzers

Sanitizer names (in the `step` property) and token analysis names (in the
`analyzer`) may refer to externally supplied modules. There are two ways
to include external modules: through a library or from the project directory.

To include a module from a library, use the absolute import path as name and
make sure the library can be found in your PYTHONPATH.

To use a custom module without creating a library, you can put the module
somewhere in your project directory and then use the relative path to the
file. Include the whole name of the file including the `.py` ending.
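The two naming schemes can be illustrated with a small resolver. This is only a sketch of the idea and not Nominatim's actual loader: the function name `load_module` and the rule "a name ending in `.py` is a file below the project directory, anything else is an import path" are assumptions made for this example.

```python
import importlib
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module(name: str, project_dir: Path) -> ModuleType:
    """Resolve a module name as described above (illustrative only)."""
    if name.endswith('.py'):
        # Relative file path: load the file directly from the project directory.
        path = project_dir / name
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"Cannot load module from {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    # Absolute import path: the library must be importable via PYTHONPATH.
    return importlib.import_module(name)
```

Under these assumptions, `load_module('my_sanitizer.py', project_dir)` would load a file from the project directory, while `load_module('mylib.sanitizers.custom', project_dir)` would import from an installed library. Both names are hypothetical.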

## Custom sanitizer modules

A sanitizer module must export a single factory function `create` with the
following signature:

``` python
def create(config: SanitizerConfig) -> Callable[[ProcessInfo], None]
```

The function receives the custom configuration for the sanitizer and must
return a callable (function or class) that transforms the name and address
terms of a place. When a place is processed, a `ProcessInfo` object
is created from the information that was queried from the database. This
object is handed to each configured sanitizer in sequence, so that each
sanitizer receives the result of processing from the previous sanitizer.
After the last sanitizer is finished, the resulting name and address lists
are forwarded to the token analysis module.
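The following sketch shows what a function-based sanitizer and the sequential hand-over could look like. It is not taken from Nominatim: the option name `max-length`, the helper `run_sanitizers` and the `SimpleNamespace` stand-ins for `ProcessInfo` and `PlaceName` are assumptions made to keep the example self-contained.

```python
from types import SimpleNamespace
from typing import Any, Callable, Mapping


def create(config: Mapping[str, Any]) -> Callable[[Any], None]:
    """Factory: read the configuration once, return the sanitation function."""
    # 'max-length' is a made-up option for this sketch.
    max_length = int(config.get('max-length', 100))

    def _drop_long_names(obj: Any) -> None:
        # Replace the list; later sanitizers see the filtered result.
        obj.names = [n for n in obj.names if len(n.name) <= max_length]

    return _drop_long_names


def run_sanitizers(sanitizers, obj):
    """Hand the same object to each sanitizer in order (illustrative only)."""
    for sanitize in sanitizers:
        sanitize(obj)
    return obj.names, obj.address


# Stand-ins for ProcessInfo and PlaceName, for demonstration only.
info = SimpleNamespace(
    place=None,
    names=[SimpleNamespace(name='Main Street'), SimpleNamespace(name='x' * 200)],
    address=[],
)
names, address = run_sanitizers([create({'max-length': 50})], info)
```

Reading the configuration inside `create` rather than inside the returned function keeps the per-place work small, since the factory runs only once.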

Sanitizer functions are instantiated once and then called for each place
that is imported or updated. They don't need to be thread-safe.
If multi-threading is used, each thread creates its own instance of
the function.
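Because an instance is created once and reused for every place, a class with `__call__` is a natural fit when the sanitizer needs prepared state. The sketch below is an assumption-laden example, not code from Nominatim: the class name `StripBrackets`, its behaviour and the `SimpleNamespace` stand-ins are made up for illustration.

```python
import re
from types import SimpleNamespace
from typing import Any, Mapping


class StripBrackets:
    """Callable sanitizer that removes bracketed additions from names."""

    def __init__(self, config: Mapping[str, Any]) -> None:
        # Setup happens once per instance, not once per place.
        self.pattern = re.compile(r'\s*\([^)]*\)')
        # Plain instance state needs no locking: an instance is never
        # shared between threads.
        self.processed = 0

    def __call__(self, obj: Any) -> None:
        for name in obj.names:
            name.name = self.pattern.sub('', name.name).strip()
        self.processed += 1


def create(config: Mapping[str, Any]) -> StripBrackets:
    return StripBrackets(config)


# One instance, called once per place (stand-in objects only).
sanitizer = create({})
for raw in ('Hauptstraße (alt)', 'Ring Road'):
    place = SimpleNamespace(names=[SimpleNamespace(name=raw)], address=[])
    sanitizer(place)
```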

### Sanitizer configuration

::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
    rendering:
        show_source: no
        heading_level: 6

### The sanitation function

The sanitation function receives a single object with three members:

* `place`: read-only information about the place being processed.
  See PlaceInfo below.
* `names`: The current list of names for the place. Each name is a
  PlaceName object.
* `address`: The current list of address names for the place. Each name
  is a PlaceName object.

While the `place` member is provided for information only, the `names` and
`address` lists are meant to be manipulated by the sanitizer. It may add and
remove entries, change information within a single entry (for example by
adding extra attributes) or completely replace the list with a different one.
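The kinds of manipulation listed above can be sketched as follows. The classes `Name` and `Info` are hypothetical stand-ins for `PlaceName` and `ProcessInfo` so that the example runs on its own; their fields and the attribute key `split` are assumptions, not the documented interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Name:
    """Hypothetical stand-in for PlaceName."""
    name: str
    kind: str = 'name'
    suffix: Optional[str] = None
    attr: Dict[str, str] = field(default_factory=dict)


@dataclass
class Info:
    """Hypothetical stand-in for the object passed to the sanitizer."""
    place: object
    names: List[Name]
    address: List[Name]


def sanitize(obj: Info) -> None:
    # Add entries and replace the list: split "A;B" into one entry per name.
    expanded = []
    for entry in obj.names:
        parts = [p.strip() for p in entry.name.split(';') if p.strip()]
        for part in parts:
            new = Name(part, entry.kind, entry.suffix, dict(entry.attr))
            if len(parts) > 1:
                # Change a single entry: attach an extra (made-up) attribute.
                new.attr['split'] = 'yes'
            expanded.append(new)
    obj.names = expanded

    # Remove entries: drop address parts that are empty after stripping.
    obj.address = [a for a in obj.address if a.name.strip()]


info = Info(
    place=None,
    names=[Name('Foo;Bar', suffix='en'), Name('Baz')],
    address=[Name('  '), Name('Berlin', kind='city')],
)
sanitize(info)
```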

#### PlaceInfo - information about the place

::: nominatim.data.place_info.PlaceInfo
    rendering:
        show_source: no
        heading_level: 6


#### PlaceName - extended naming information

::: nominatim.tokenizer.sanitizers.base.PlaceName
    rendering:
        show_source: no
        heading_level: 6