Commit Graph

151 Commits

Author SHA1 Message Date
Sarah Hoffmann
cbbcbb1fd7 move country_info into data submodule 2022-07-06 11:08:36 +02:00
Sarah Hoffmann
bce93d60bd move PlaceInfo into data submodule
This data structure is shared between indexer and tokenizer.
2022-07-06 10:54:47 +02:00
Sarah Hoffmann
612d34930b handle postcodes properly on word table updates
update_postcodes_from_db() needs to do the full postcode treatment
in order to derive the correct word table entries.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
5be320368c add documentation for postcode customization 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
7f2ad4ac7e fix linting issue 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
0f00f4968c fix up BDD tests for postcode changes
Includes smaller code fixes found by the tests.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
37b2c6a830 port legacy tokenizer to new postcode handling
Also documents the changes to the SQL functions of the tokenizer.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
e86db3001f fix postcode pattern for Mozambique
Optional groups are not implemented yet.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
80ea13437d move postcode matcher in a separate file 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
b7704833e4 icu: switch postcodes to using the pre-formatted one 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
ca7b46511d introduce and use analyzer for postcodes 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
18864afa8a postcodes: introduce a default pattern for countries without postcodes 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
5ba75df507 postcode: generate a generic form 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
9172696324 postcodes: add support for optional spaces 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
baee6f3de0 postcodes: strip leading country codes 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
90d4d339db initial postcode cleaner for simple patterns
Moves postcodes that are either in countries without a postcode
system or don't correspond to the local pattern for postcodes into
a field for a normal address part. Makes them searchable but not as
a special address. This has two consequences: they are no longer a
skippable part of the address and the postcodes cannot be searched
on their own.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
8080625747 remove postcodes from countries that don't have them
The postcodes will only be removed as a 'computed postcode' they
are still searchable for the given object.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
1821f68ca0 exclude addr:inclusion from search 2022-05-31 14:19:19 +02:00
Sarah Hoffmann
d14a585cc9 pylint: disable no-self-use check
This checker encourages bad behaviour (namely changing the static
status of a function during inheritence) and will be made optional
in upcoming versions of pylint.
2022-05-11 10:25:00 +02:00
Sarah Hoffmann
bb2bd76f91 pylint: avoid explicit use of format() function
Use psycopg2 SQL formatters for SQL and formatted string literals
everywhere else.
2022-05-11 09:48:56 +02:00
Sarah Hoffmann
7e70e5f503 always state encoding when opening files in text mode
Also applies to Path.write_text().
2022-05-10 15:36:29 +02:00
Sarah Hoffmann
a0ed80d821 restore the tokenizer directory when missing
Automatically repopulate the tokenizer/ directory with the PHP stub
and the postgresql module, when the directory is missing. This allows
to switch working directories and in particular run the service
from a different maschine then where it was installed.
Users still need to make sure that .env files are set up correctly
or they will shoot themselves in the foot.

See #2515.
2022-03-20 11:31:42 +01:00
Sarah Hoffmann
15beeef6ce do not expand records in select list
An expression of the form 'SELECT (func()).*' will be expanded
by Postgresql _before_ execution with the result that the function
will be called as many times as there are fields in the record.
This is not what we want. The function call needs to go into
the FROM clause instead.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
92bc3cd0a7 fix linting issue 2022-03-01 09:34:32 +01:00
Sarah Hoffmann
4a3bbd0319 adapt housenumber cleanup to new word table structure 2022-03-01 09:34:32 +01:00
Sarah Hoffmann
13ed184efd housenumber analyzer: avoid creating too many variants
Housenumber fields with lots of text are likely bad data. So is
data with many changes from letter to digit. Exclude them from adding
optional spaces.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
f03a05f6bb add new analyser for houenumbers
This analyser makes spaces optional.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
a6903651fc add framework for analysing housenumbers
This lays the groundwork for adding variants for housenumbers.
When analysis is enabled, then the 'word' field in the word table
is used as usual, so that variants can be created. There will be
only one analyser allowed which must have the fixed name
'@housenumber'.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
b8c544cc98 icu: move token deduplication into TokenInfo
Puts collection into one common place.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
243725aae1 icu: move housenumber token computation out of TokenInfo
This was the last function to use the cache. There is a more clean
separation of responsibility now.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
0bb59b2e22 handle unknown analyzer
When changing something in the default configuration of the sanatizers
that refers to an analyzer that is not yet loaded, there shouldn't be
any errors.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
837d44391c move generation of normalized token form to analyzer
This gives the analyzer more flexibility in choosing the normalized
form. In particular, an analyzer creating different variants can choose
the variant that will be used as the canonical form.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
f74228830d bdd: run full import on tests
This uncovered a couple of outdated/wrong tests which have been
fixed, too.
2022-02-24 14:27:51 +01:00
Sarah Hoffmann
a3e4e8e5cd delete unused country name tokens 2022-02-23 09:23:06 +01:00
Sarah Hoffmann
38c3ef3da0 add tests for get_string_list()
Renaming test file for sanitizer config because pytest requires
unique names for test files.
2022-02-07 11:22:24 +01:00
Sarah Hoffmann
610f2cc254 sanitizer: move helpers into a configuration class 2022-02-07 10:48:00 +01:00
Sarah Hoffmann
a79a3210e6 implement is-a-name option for housenumbers 2022-02-07 09:27:11 +01:00
Sarah Hoffmann
3ce123ab69 do not clean housenumbers in reverse-only mode 2022-01-20 20:21:13 +01:00
Sarah Hoffmann
d8b7a51ab6 add actual removal of housenumber tokens 2022-01-20 20:18:15 +01:00
Sarah Hoffmann
344a2bfc1a add new command for cleaning word tokens
Just pulls outdated housenumbers for the moment.
2022-01-20 20:05:15 +01:00
Sarah Hoffmann
1e5a8561c0 fix linting issues 2022-01-20 16:00:23 +01:00
Sarah Hoffmann
f3c9578bca complete documentation for new clean-houseunubmers sanatizer 2022-01-20 15:49:32 +01:00
Sarah Hoffmann
3741afa6dc generalize filter-kind parameter for sanatizers
Now behaves the same for tag_analyzer_by_language and
clean_housenumbers. Adds tests.
2022-01-20 15:42:42 +01:00
Sarah Hoffmann
4774e45218 clean_housenumbers: make kinds and delimiters configurable
Also adds unit tests for various options.
2022-01-20 12:07:12 +01:00
Sarah Hoffmann
206ee87188 factor out housenumber splitting into sanitizer 2022-01-19 17:27:50 +01:00
Sarah Hoffmann
3df560ea38 fix linting error 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
adbaf700cd move parsing of mutation config to setup phase 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
b453b0ea95 introduce mutation variants to generic token analyser
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.

Add spelling variations for German umlauts.
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
0192a7af96 move variant configuration reading in separate file 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
630ad38a67 refactor variant production to use generators 2022-01-18 11:09:21 +01:00