Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2026-02-16 15:47:58 +00:00

Author	SHA1	Message	Date
Sarah Hoffmann	8d082c13e0	adapt to new type annotations from typeshed Some more functions from psycopg are now properly annotated. No ignoring necessary anymore.	2022-08-09 11:06:54 +02:00
Sarah Hoffmann	51b6d16dc6	overhaul the token analysis interface The functional split between the two functions is now that the first one creates the ID that is used in the word table and the second one creates the variants. There no longer is a requirement that the ID is the normalized version. We might later reintroduce the requirement that a normalized version be available but it doesn't necessarily need to be through the ID. The function that creates the ID now gets the full PlaceName. That way it might take into account attributes that were set by the sanitizers. Finally rename both functions to something more sane.	2022-07-29 15:14:11 +02:00
Sarah Hoffmann	34d27ed45c	move PlaceName into the generic data module	2022-07-29 11:42:20 +02:00
Kian-Meng Ang	f5e52e748f	docs: fix typos	2022-07-20 22:05:31 +08:00
Sarah Hoffmann	9963261d8d	add type annotations to special phrase importer	2022-07-18 09:54:29 +02:00
Sarah Hoffmann	6c6bbe5747	add type annotations for ICU tokenizer	2022-07-18 09:47:57 +02:00
Sarah Hoffmann	bce93d60bd	move PlaceInfo into data submodule This data structure is shared between indexer and tokenizer.	2022-07-06 10:54:47 +02:00
Sarah Hoffmann	612d34930b	handle postcodes properly on word table updates update_postcodes_from_db() needs to do the full postcode treatment in order to derive the correct word table entries.	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	7f2ad4ac7e	fix linting issue	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	0f00f4968c	fix up BDD tests for postcode changes Includes smaller code fixes found by the tests.	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	80ea13437d	move postcode matcher into a separate file	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	b7704833e4	icu: switch postcodes to using the pre-formatted one	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	ca7b46511d	introduce and use analyzer for postcodes	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	1821f68ca0	exclude addr:inclusion from search	2022-05-31 14:19:19 +02:00
Sarah Hoffmann	d14a585cc9	pylint: disable no-self-use check This checker encourages bad behaviour (namely changing the static status of a function during inheritance) and will be made optional in upcoming versions of pylint.	2022-05-11 10:25:00 +02:00
Sarah Hoffmann	7e70e5f503	always state encoding when opening files in text mode Also applies to Path.write_text().	2022-05-10 15:36:29 +02:00
Sarah Hoffmann	a0ed80d821	restore the tokenizer directory when missing Automatically repopulate the tokenizer/ directory with the PHP stub and the postgresql module, when the directory is missing. This allows switching working directories and, in particular, running the service from a different machine than the one where it was installed. Users still need to make sure that .env files are set up correctly or they will shoot themselves in the foot. See #2515.	2022-03-20 11:31:42 +01:00
Sarah Hoffmann	15beeef6ce	do not expand records in select list An expression of the form 'SELECT (func()).*' will be expanded by Postgresql _before_ execution with the result that the function will be called as many times as there are fields in the record. This is not what we want. The function call needs to go into the FROM clause instead.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	92bc3cd0a7	fix linting issue	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	4a3bbd0319	adapt housenumber cleanup to new word table structure	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	a6903651fc	add framework for analysing housenumbers This lays the groundwork for adding variants for housenumbers. When analysis is enabled, then the 'word' field in the word table is used as usual, so that variants can be created. There will be only one analyser allowed which must have the fixed name '@housenumber'.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	b8c544cc98	icu: move token deduplication into TokenInfo Puts collection into one common place.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	243725aae1	icu: move housenumber token computation out of TokenInfo This was the last function to use the cache. There is a cleaner separation of responsibility now.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	0bb59b2e22	handle unknown analyzer When changing something in the default configuration of the sanitizers that refers to an analyzer that is not yet loaded, there shouldn't be any errors.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	837d44391c	move generation of normalized token form to analyzer This gives the analyzer more flexibility in choosing the normalized form. In particular, an analyzer creating different variants can choose the variant that will be used as the canonical form.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	f74228830d	bdd: run full import on tests This uncovered a couple of outdated/wrong tests which have been fixed, too.	2022-02-24 14:27:51 +01:00
Sarah Hoffmann	a3e4e8e5cd	delete unused country name tokens	2022-02-23 09:23:06 +01:00
Sarah Hoffmann	3ce123ab69	do not clean housenumbers in reverse-only mode	2022-01-20 20:21:13 +01:00
Sarah Hoffmann	d8b7a51ab6	add actual removal of housenumber tokens	2022-01-20 20:18:15 +01:00
Sarah Hoffmann	344a2bfc1a	add new command for cleaning word tokens Just pulls outdated housenumbers for the moment.	2022-01-20 20:05:15 +01:00
Sarah Hoffmann	206ee87188	factor out housenumber splitting into sanitizer	2022-01-19 17:27:50 +01:00
Sarah Hoffmann	c3788d765e	add consistent SPDX copyright headers	2022-01-03 16:23:58 +01:00
Sarah Hoffmann	f9b56a8581	correctly match abbreviated addr:street This only works when addr:street is abbreviated and the street name isn't. It does not work the other way around.	2021-12-08 21:58:43 +01:00
Sarah Hoffmann	7f7d2fd5b3	skip most addr: tags with suffixes Only one addr: tag can be processed currently, so make sure it is the one without suffixes to not get odd data. addr:street is the exception because it uses a different matching mechanism.	2021-12-06 14:55:10 +01:00
Sarah Hoffmann	44cfce1ca4	revert to using full names for street name matching Using partial names turned out to not work well because there are often similarly named streets next to each other. It also prevents us from being able to take into account all addr:street:* tags. This change gets all the full term tokens for the addr:street tags from the DB. As they are used for matching only, we can assume that the term must already be there or there will be no match. This avoids creating unused full name tags.	2021-12-06 11:38:38 +01:00
Sarah Hoffmann	37eeccbf4c	ICU: use normalization from config in PHP The TERM_NORMALIZATION config option is no longer applicable. That was already documented but not yet implemented.	2021-10-27 11:32:44 +02:00
Sarah Hoffmann	53dbe58ada	do not count words when in reverse-only mode	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	85797acf1e	ICU: add an index over word_ids Needed for keyword lookup in the details response.	2021-10-25 21:33:27 +02:00
Sarah Hoffmann	ec7184c533	icu: no longer precompute terms The ICU analyzer no longer drops frequent partials, so it is no longer necessary to know the frequencies in advance.	2021-10-19 11:52:28 +02:00
Sarah Hoffmann	e8e2502e2f	make word recount a tokenizer-specific function	2021-10-19 11:21:16 +02:00
Sarah Hoffmann	d35400a7d7	use analyser provided in the 'analyzer' property Implements per-name choice of analyzer. If a non-default analyzer is chosen, then the 'word' identifier is extended with the name of the analyzer, so that we still have unique items.	2021-10-05 14:10:32 +02:00
Sarah Hoffmann	8171fe4571	introduce sanitizer step before token analysis Sanitizer functions allow transforming name and address tags before they are handed to the tokenizer. These transformations are visible only to the tokenizer and thus only influence the search terms and address match terms for a place. Currently two sanitizers are implemented, which are responsible for splitting names with multiple values and removing bracket additions. Both were previously hard-coded in the tokenizer.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	16daa57e47	unify ICUNameProcessorRules and ICURuleLoader There is no need for the additional layer of indirection that the ICUNameProcessorRules class adds. The ICURuleLoader can fill the database properties directly.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	be65c8303f	export more data for the tokenizer name preparation Adds class, type, country and rank to the exported information and removes the rather odd hack for countries. Whether a place represents a country boundary can now be computed by the tokenizer.	2021-09-29 11:54:14 +02:00
Sarah Hoffmann	231250f2eb	add wrapper class for place data passed to tokenizer This is mostly for convenience and documentation purposes.	2021-09-29 11:54:07 +02:00
Sarah Hoffmann	bb18479d5b	remove unused parameter	2021-09-27 14:58:43 +02:00
Sarah Hoffmann	bd7c7ddad0	icu tokenizer: switch to matching against partial names When matching address parts from addr:* tags against place names, the address names were so far converted to full names and those were compared to the place names. This can become problematic with the new ICU tokenizer once we introduce creation of different variants depending on the place name context. It wouldn't be clear which variant to produce to get a match, so we would have to create all of them. To work around this issue, switch to using the partial terms for matching. This introduces a larger fuzziness between matches but that shouldn't be a problem because matching is always geographically restricted. The search terms created for address parts have a different problem: they are already created before we even know if they are going to be used. This can lead to spurious entries in the word table, which slows down searching. This problem can also be circumvented by using only partial terms for the search terms. In terms of searching that means that the address terms would not get the full-word boost, but given that the case where an address part does not exist as an OSM object should be the exception, this is likely acceptable.	2021-09-27 11:36:19 +02:00
Sarah Hoffmann	b894d2c04a	fix indent	2021-09-04 10:30:35 +02:00
Sarah Hoffmann	1c42780bb5	introduce generic YAML config loader Adds a function to the Configuration class to load a YAML file. This means that searching for the file is generalised and works the same now for all configuration files. Changes the search logic, so that it is always possible to have a custom version of the configuration file in the project directory. Move ICU tokenizer to use new load function.	2021-09-03 18:20:07 +02:00
Sarah Hoffmann	118858a55e	rename legacy_icu tokenizer to icu tokenizer The new icu tokenizer is now no longer compatible with the old legacy tokenizer in terms of data structures. Therefore there is also no longer a need to refer to the legacy tokenizer in the name.	2021-08-17 23:11:47 +02:00

50 Commits