Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2026-02-14 18:37:58 +00:00

Author	SHA1	Message	Date
Sarah Hoffmann	8f3845660f	add full tokens to addresses This is now needed to weigh results.	2024-05-02 11:47:35 +02:00
Sarah Hoffmann	07b7fd1dbb	add address counts to tokens	2024-03-18 11:25:48 +01:00
Sarah Hoffmann	81eed0680c	recreate word table when refreshing counts The counting touches a large part of the word table, leaving bloated tables and indexes. Thus recreate the table instead and swap it in.	2024-02-04 21:35:10 +01:00
Paweł Wroniszewski	2cae37ccde	Revert country settings	2023-10-20 12:50:28 +02:00
Paweł Wroniszewski	fbe40e005d	Properly validate postcodes with country code Include postcode pattern in postcode normalisation regex, instead of removing it from postcode pattern in config. It properly handles postcode validation and normalization when country code is part of the postcode, e.g. for Isle of Man, Jersey, Anguilla, Andorra, Cayman Islands and more. Fixes #3227.	2023-10-17 01:04:07 +02:00
miku0	67e1c7dc72	Moved KANJI_MAP to icu-rules	2023-07-31 11:57:49 +00:00
miku0	4d61cc87cf	Add the test of reconbine_place	2023-07-31 02:39:56 +00:00
miku0	0722495434	add japanese sanitizer	2023-07-26 07:54:58 +00:00
Sarah Hoffmann	d7a3039c2a	also switch legacy tokenizer to new street/place choice behaviour	2023-06-30 17:03:17 +02:00
Sarah Hoffmann	645ea5a057	use information from tokenizer to determine street vs. place address So far the SQL logic used the information from the address field to determine if an address is attached to a street or place. This changes the logic to use the information provided in the token_info. This allows sanitizers to enforce a certain parenting without changing the visible address information.	2023-06-30 11:08:25 +02:00
biswajit-k	8f03c80ce8	generalize filter for sanitizers	2023-04-01 19:24:09 +05:30
biswajit-k	ca149fb796	Adds sanitizer for preventing certain tags from entering the search index based on parameters fix: pylint error added docs for delete tags sanitizer fixed typos in docs and code comments fix: python typechecking error fixed rank address type Revert "fixed typos in docs and code comments" This reverts commit 6839eea755a87f557895f30524fb5c03dd983d60. added default parameters and refactored code added test for all parameters	2023-03-09 14:18:39 +05:30
Sarah Hoffmann	fd3dec8efe	add sanitizer for TIGER tags Currently only takes over cleaning the tiger:county data. This was done by the import until now.	2022-11-23 10:37:27 +01:00
Sarah Hoffmann	51b6d16dc6	overhaul the token analysis interface The functional split between the two functions is now that the first one creates the ID that is used in the word table and the second one creates the variants. There no longer is a requirement that the ID is the normalized version. We might later reintroduce the requirement that a normalized version be available but it doesn't necessarily need to be through the ID. The function that creates the ID now gets the full PlaceName. That way it might take into account attributes that were set by the sanitizers. Finally rename both functions to something more sane.	2022-07-29 15:14:11 +02:00
Sarah Hoffmann	c8873d34af	harmonize interface of token analysis module The configure() function now receives a Transliterator object instead of the ICU rules. This harmonizes the parameters with the create function.	2022-07-29 10:43:07 +02:00
Sarah Hoffmann	6d41046b15	add support for external sanitizer modules	2022-07-25 16:10:19 +02:00
Sarah Hoffmann	62eedbb8f6	add type hints for sanitizers	2022-07-18 09:47:57 +02:00
Sarah Hoffmann	cbbcbb1fd7	move country_info into data submodule	2022-07-06 11:08:36 +02:00
Sarah Hoffmann	bce93d60bd	move PlaceInfo into data submodule This data structure is shared between indexer and tokenizer.	2022-07-06 10:54:47 +02:00
Sarah Hoffmann	612d34930b	handle postcodes properly on word table updates update_postcodes_from_db() needs to do the full postcode treatment in order to derive the correct word table entries.	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	80ea13437d	move postcode matcher in a separate file	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	18864afa8a	postcodes: introduce a default pattern for countries without postcodes	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	9172696324	postcodes: add support for optional spaces	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	baee6f3de0	postcodes: strip leading country codes	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	28ab2f6048	add postcodes patterns without optional spaces	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	90d4d339db	initial postcode cleaner for simple patterns Moves postcodes that are either in countries without a postcode system or don't correspond to the local pattern for postcodes into a field for a normal address part. Makes them searchable but not as a special address. This has two consequences: they are no longer a skippable part of the address and the postcodes cannot be searched on their own.	2022-06-23 23:42:31 +02:00
Marc Tobias	0de83c4a51	fix typos of name Nominatim	2022-05-05 01:04:47 +02:00
Sarah Hoffmann	a0ed80d821	restore the tokenizer directory when missing Automatically repopulate the tokenizer/ directory with the PHP stub and the postgresql module, when the directory is missing. This allows switching working directories and in particular running the service from a different machine than where it was installed. Users still need to make sure that .env files are set up correctly or they will shoot themselves in the foot. See #2515.	2022-03-20 11:31:42 +01:00
Sarah Hoffmann	0a9f971e44	add tests for new analyzed housenumbers	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	837d44391c	move generation of normalized token form to analyzer This gives the analyzer more flexibility in choosing the normalized form. In particular, an analyzer creating different variants can choose the variant that will be used as the canonical form.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	a6b4e8ff67	add tests for housenumber-as-name feature	2022-02-07 11:45:12 +01:00
Sarah Hoffmann	38c3ef3da0	add tests for get_string_list() Renaming test file for sanitizer config because pytest requires unique names for test files.	2022-02-07 11:22:24 +01:00
Sarah Hoffmann	610f2cc254	sanitizer: move helpers into a configuration class	2022-02-07 10:48:00 +01:00
Sarah Hoffmann	c170d323d9	add tests for cleaning housenumbers	2022-01-20 23:47:20 +01:00
Sarah Hoffmann	d09db09849	adapt ICU tests to new housenumber sanitizer Restrict tests to making sure that handing in multiple housenumbers works.	2022-01-20 16:05:49 +01:00
Sarah Hoffmann	3741afa6dc	generalize filter-kind parameter for sanitizers Now behaves the same for tag_analyzer_by_language and clean_housenumbers. Adds tests.	2022-01-20 15:42:42 +01:00
Sarah Hoffmann	4774e45218	clean_housenumbers: make kinds and delimiters configurable Also adds unit tests for various options.	2022-01-20 12:07:12 +01:00
Sarah Hoffmann	b453b0ea95	introduce mutation variants to generic token analyser Mutations are regular-expression-based replacements that are applied after variants have been computed. They are meant to be used for variations on character level. Add spelling variations for German umlauts.	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	c3788d765e	add consistent SPDX copyright headers	2022-01-03 16:23:58 +01:00
Sarah Hoffmann	7f7d2fd5b3	skip most addr: tags with suffixes Only one addr: tag can be processed currently, so make sure it is the one without suffixes to not get odd data. addr:street is the exception because it uses a different matching mechanism.	2021-12-06 14:55:10 +01:00
Sarah Hoffmann	44cfce1ca4	revert to using full names for street name matching Using partial names turned out to not work well because there are often similarly named streets next to each other. It also prevents us from being able to take into account all addr:street:* tags. This change gets all the full term tokens for the addr:street tags from the DB. As they are used for matching only, we can assume that the term must already be there or there will be no match. This avoids creating unused full name tags.	2021-12-06 11:38:38 +01:00
Sarah Hoffmann	5a9fb6eaf7	specify text type in test SQL Older versions of postgres fail otherwise.	2021-12-03 13:56:23 +01:00
Sarah Hoffmann	14a78f55cd	more unit tests for tokenizers	2021-12-02 15:46:36 +01:00
Sarah Hoffmann	c8958a22d2	tests: add fixture for making test project directory	2021-11-30 18:01:46 +01:00
Sarah Hoffmann	b90e719da5	organise python tests in subdirectories The directories follow the same structure as the modules in nominatim/.	2021-11-30 11:22:26 +01:00
Sarah Hoffmann	299934fd2a	reorganize and complete tests around generic token analysis	2021-10-06 17:03:37 +02:00
Sarah Hoffmann	b18d042832	add tests for sanitizer tagging language	2021-10-06 12:29:25 +02:00
Sarah Hoffmann	97a10ec218	apply variants by languages Adds a tagger for names by language so that the analyzer of that language is used. Thus variants are now only applied to names in the specific language and only tag name tags, no longer to reference-like tags.	2021-10-06 11:09:54 +02:00
Sarah Hoffmann	7cfcbacfc7	make token analyzers configurable modules Adds a mandatory section 'analyzer' to the token-analysis entries which defines which analyser to use. Currently there is exactly one, generic, which implements the former ICUNameProcessor.	2021-10-04 17:37:34 +02:00
Sarah Hoffmann	732cd27d2e	add unit tests for new sanitizer functions	2021-10-01 12:27:24 +02:00

50 Commits