Nominatim

Author	SHA1	Message	Date
Sarah Hoffmann	243725aae1	icu: move housenumber token computation out of TokenInfo This was the last function to use the cache. There is a cleaner separation of responsibility now.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	0bb59b2e22	handle unknown analyzer When changing something in the default configuration of the sanitizers that refers to an analyzer that is not yet loaded, there shouldn't be any errors.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	837d44391c	move generation of normalized token form to analyzer This gives the analyzer more flexibility in choosing the normalized form. In particular, an analyzer creating different variants can choose the variant that will be used as the canonical form.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	f74228830d	bdd: run full import on tests This uncovered a couple of outdated/wrong tests which have been fixed, too.	2022-02-24 14:27:51 +01:00
Sarah Hoffmann	a3e4e8e5cd	delete unused country name tokens	2022-02-23 09:23:06 +01:00
Sarah Hoffmann	38c3ef3da0	add tests for get_string_list() Renaming test file for sanitizer config because pytest requires unique names for test files.	2022-02-07 11:22:24 +01:00
Sarah Hoffmann	610f2cc254	sanitizer: move helpers into a configuration class	2022-02-07 10:48:00 +01:00
Sarah Hoffmann	a79a3210e6	implement is-a-name option for housenumbers	2022-02-07 09:27:11 +01:00
Sarah Hoffmann	3ce123ab69	do not clean housenumbers in reverse-only mode	2022-01-20 20:21:13 +01:00
Sarah Hoffmann	d8b7a51ab6	add actual removal of housenumber tokens	2022-01-20 20:18:15 +01:00
Sarah Hoffmann	344a2bfc1a	add new command for cleaning word tokens Just pulls outdated housenumbers for the moment.	2022-01-20 20:05:15 +01:00
Sarah Hoffmann	1e5a8561c0	fix linting issues	2022-01-20 16:00:23 +01:00
Sarah Hoffmann	f3c9578bca	complete documentation for new clean-housenumbers sanitizer	2022-01-20 15:49:32 +01:00
Sarah Hoffmann	3741afa6dc	generalize filter-kind parameter for sanitizers Now behaves the same for tag_analyzer_by_language and clean_housenumbers. Adds tests.	2022-01-20 15:42:42 +01:00
Sarah Hoffmann	4774e45218	clean_housenumbers: make kinds and delimiters configurable Also adds unit tests for various options.	2022-01-20 12:07:12 +01:00
Sarah Hoffmann	206ee87188	factor out housenumber splitting into sanitizer	2022-01-19 17:27:50 +01:00
Sarah Hoffmann	3df560ea38	fix linting error	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	adbaf700cd	move parsing of mutation config to setup phase	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	b453b0ea95	introduce mutation variants to generic token analyser Mutations are regular-expression-based replacements that are applied after variants have been computed. They are meant to be used for variations on character level. Add spelling variations for German umlauts.	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	0192a7af96	move variant configuration reading in separate file	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	630ad38a67	refactor variant production to use generators	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	c3788d765e	add consistent SPDX copyright headers	2022-01-03 16:23:58 +01:00
Sarah Hoffmann	f9b56a8581	correctly match abbreviated addr:street This only works when addr:street is abbreviated and the street name isn't. It does not work the other way around.	2021-12-08 21:58:43 +01:00
Sarah Hoffmann	7f7d2fd5b3	skip most addr: tags with suffixes Only one addr: tag can be processed currently, so make sure it is the one without suffixes to not get odd data. addr:street is the exception because it uses a different matching mechanism.	2021-12-06 14:55:10 +01:00
Sarah Hoffmann	44cfce1ca4	revert to using full names for street name matching Using partial names turned out to not work well because there are often similarly named streets next to each other. It also prevents us from being able to take into account all addr:street:* tags. This change gets all the full term tokens for the addr:street tags from the DB. As they are used for matching only, we can assume that the term must already be there or there will be no match. This avoids creating unused full name tags.	2021-12-06 11:38:38 +01:00
Sarah Hoffmann	7beccb7997	remove unnecessary pass statements	2021-12-02 15:54:24 +01:00
Sarah Hoffmann	14a78f55cd	more unit tests for tokenizers	2021-12-02 15:46:36 +01:00
Sarah Hoffmann	37eeccbf4c	ICU: use normalization from config in PHP The TERM_NORMALIZATION config option is no longer applicable. That was already documented but not yet implemented.	2021-10-27 11:32:44 +02:00
Sarah Hoffmann	53dbe58ada	do not count words when in reverse-only mode	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	85797acf1e	ICU: add an index over word_ids Needed for keyword lookup in the details response.	2021-10-25 21:33:27 +02:00
Sarah Hoffmann	ec7184c533	icu: no longer precompute terms The ICU analyzer no longer drops frequent partials, so it is no longer necessary to know the frequencies in advance.	2021-10-19 11:52:28 +02:00
Sarah Hoffmann	e8e2502e2f	make word recount a tokenizer-specific function	2021-10-19 11:21:16 +02:00
Sarah Hoffmann	6c79a60e19	add documentation for new configuration of ICU tokenizer	2021-10-07 11:55:53 +02:00
Sarah Hoffmann	2a94bfc703	fix argument description for check_database	2021-10-07 09:49:13 +02:00
Sarah Hoffmann	299934fd2a	reorganize and complete tests around generic token analysis	2021-10-06 17:03:37 +02:00
Sarah Hoffmann	b18d042832	add tests for sanitizer tagging language	2021-10-06 12:29:25 +02:00
Sarah Hoffmann	97a10ec218	apply variants by languages Adds a tagger for names by language so that the analyzer of that language is used. Thus variants are now only applied to names in the specific language and only to name tags, no longer to reference-like tags.	2021-10-06 11:09:54 +02:00
Sarah Hoffmann	d35400a7d7	use analyser provided in the 'analyzer' property Implements per-name choice of analyzer. If a non-default analyzer is chosen, then the 'word' identifier is extended with the name of the analyzer, so that we still have unique items.	2021-10-05 14:10:32 +02:00
Sarah Hoffmann	92f6ec2328	remove support for properties on variants Those are not going to be used in the near future, so no need to carry that code around just now.	2021-10-05 10:29:36 +02:00
Sarah Hoffmann	9ba2019470	precompute replacements while loading configuration	2021-10-05 10:20:08 +02:00
Sarah Hoffmann	c171d88194	move parsing of token analysis config to analyzer Adds a second callback for the analyzer which is responsible for parsing the configuration rules and converting it to whatever format necessary. This way, each analyzer implementation can define its own configuration rules.	2021-10-04 18:31:58 +02:00
Sarah Hoffmann	7cfcbacfc7	make token analyzers configurable modules Adds a mandatory section 'analyzer' to the token-analysis entries which defines which analyser to use. Currently there is exactly one, generic, which implements the former ICUNameProcessor.	2021-10-04 17:37:34 +02:00
Sarah Hoffmann	52847b61a3	extend ICU config to accommodate multiple analysers Adds parsing of multiple variant lists from the configuration. Every entry except one must have a unique 'id' parameter to distinguish the entries. The entry without id is considered the default. Currently only the list without an id is used for analysis.	2021-10-04 16:40:28 +02:00
Sarah Hoffmann	5a36559834	move flatten_config_list into config module For general usage by other modules.	2021-10-04 11:56:54 +02:00
Sarah Hoffmann	732cd27d2e	add unit tests for new sanitizer functions	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	8171fe4571	introduce sanitizer step before token analysis Sanitizer functions allow transforming name and address tags before they are handed to the tokenizer. These transformations are visible only to the tokenizer and thus only have an influence on the search terms and address match terms for a place. Currently two sanitizers are implemented which are responsible for splitting names with multiple values and removing bracket additions. Both were previously hard-coded in the tokenizer.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	16daa57e47	unify ICUNameProcessorRules and ICURuleLoader There is no need for the additional layer of indirection that the ICUNameProcessorRules class adds. The ICURuleLoader can fill the database properties directly.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	5e5addcdbf	fix typo	2021-09-29 14:16:09 +02:00
Sarah Hoffmann	be65c8303f	export more data for the tokenizer name preparation Adds class, type, country and rank to the exported information and removes the rather odd hack for countries. Whether a place represents a country boundary can now be computed by the tokenizer.	2021-09-29 11:54:14 +02:00
Sarah Hoffmann	231250f2eb	add wrapper class for place data passed to tokenizer This is mostly for convenience and documentation purposes.	2021-09-29 11:54:07 +02:00

122 Commits