Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2026-02-15 19:07:58 +00:00

Author	SHA1	Message	Date
marc tobias	247afe1f56	sanitizer no longer strips name parts in brackets when more parts follow	2025-08-23 01:06:35 +02:00
Sarah Hoffmann	186f562dd7	remove automatic setup of tokenizer directory. ICU tokenizer doesn't need any extra data anymore, so it doesn't make sense to create a directory which then remains empty. If a tokenizer needs such a directory in the future, it needs to create it on its own and make sure to handle the situation correctly where no project directory is used at all.	2025-04-02 20:20:04 +02:00
Sarah Hoffmann	be4ba370ef	adapt tests to extended results	2025-03-31 14:52:50 +02:00
Sarah Hoffmann	4cc788f69e	enable flake for Python tests	2025-03-09 15:33:24 +01:00
Sarah Hoffmann	a574b98e4a	remove postcode computation for word table during import	2025-03-04 08:57:59 +01:00
Sarah Hoffmann	13db4c9731	replace datrie library with a more simple pure-Python class	2025-02-24 10:24:21 +01:00
Sarah Hoffmann	b87d6226fb	remove legacy tokenizer and direct tests	2024-09-21 11:38:08 +02:00
Sarah Hoffmann	3742fa2929	make DB helper functions free functions. Also changes the drop function so that it can drop multiple tables at once.	2024-07-29 08:49:30 +02:00
Sarah Hoffmann	4da4cbfe27	reduce from 3 to 2 packages	2024-06-28 09:13:22 +02:00
Sarah Hoffmann	2bab0ca060	port unit tests to new python package layout	2024-06-26 11:52:47 +02:00
Sarah Hoffmann	8f3845660f	add full tokens to addresses. This is now needed to weigh results.	2024-05-02 11:47:35 +02:00
Sarah Hoffmann	07b7fd1dbb	add address counts to tokens	2024-03-18 11:25:48 +01:00
Sarah Hoffmann	81eed0680c	recreate word table when refreshing counts. The counting touches a large part of the word table, leaving bloated tables and indexes. Thus recreate the table instead and swap it in.	2024-02-04 21:35:10 +01:00
Paweł Wroniszewski	2cae37ccde	Revert country settings	2023-10-20 12:50:28 +02:00
Paweł Wroniszewski	fbe40e005d	Properly validate postcodes with country code. Include postcode pattern in postcode normalisation regex, instead of removing it from postcode pattern in config. It properly handles postcode validation and normalization when country code is part of the postcode, e.g. for Isle of Man, Jersey, Anguilla, Andorra, Cayman Islands and more. Fixes #3227.	2023-10-17 01:04:07 +02:00
miku0	67e1c7dc72	Moved KANJI_MAP to icu-rules	2023-07-31 11:57:49 +00:00
miku0	4d61cc87cf	Add the test of reconbine_place	2023-07-31 02:39:56 +00:00
miku0	0722495434	add japanese sanitizer	2023-07-26 07:54:58 +00:00
Sarah Hoffmann	d7a3039c2a	also switch legacy tokenizer to new street/place choice behaviour	2023-06-30 17:03:17 +02:00
Sarah Hoffmann	645ea5a057	use information from tokenizer to determine street vs. place address. So far the SQL logic used the information from the address field to determine if an address is attached to a street or place. This changes the logic to use the information provided in the token_info. This allows sanitizers to enforce a certain parenting without changing the visible address information.	2023-06-30 11:08:25 +02:00
biswajit-k	8f03c80ce8	generalize filter for sanitizers	2023-04-01 19:24:09 +05:30
biswajit-k	ca149fb796	Adds sanitizer for preventing certain tags to enter search index based on parameters. fix: pylint error; added docs for delete tags sanitizer; fixed typos in docs and code comments; fix: python typechecking error; fixed rank address type; Revert "fixed typos in docs and code comments" (This reverts commit 6839eea755a87f557895f30524fb5c03dd983d60.); added default parameters and refactored code; added test for all parameters	2023-03-09 14:18:39 +05:30
Sarah Hoffmann	fd3dec8efe	add sanitizer for TIGER tags. Currently only takes over cleaning the tiger:county data. This was done by the import until now.	2022-11-23 10:37:27 +01:00
Sarah Hoffmann	51b6d16dc6	overhaul the token analysis interface. The functional split between the two functions is now that the first one creates the ID that is used in the word table and the second one creates the variants. There no longer is a requirement that the ID is the normalized version. We might later reintroduce the requirement that a normalized version be available but it doesn't necessarily need to be through the ID. The function that creates the ID now gets the full PlaceName. That way it might take into account attributes that were set by the sanitizers. Finally rename both functions to something more sane.	2022-07-29 15:14:11 +02:00
Sarah Hoffmann	c8873d34af	harmonize interface of token analysis module. The configure() function now receives a Transliterator object instead of the ICU rules. This harmonizes the parameters with the create function.	2022-07-29 10:43:07 +02:00
Sarah Hoffmann	6d41046b15	add support for external sanitizer modules	2022-07-25 16:10:19 +02:00
Sarah Hoffmann	62eedbb8f6	add type hints for sanitizers	2022-07-18 09:47:57 +02:00
Sarah Hoffmann	cbbcbb1fd7	move country_info into data submodule	2022-07-06 11:08:36 +02:00
Sarah Hoffmann	bce93d60bd	move PlaceInfo into data submodule. This data structure is shared between indexer and tokenizer.	2022-07-06 10:54:47 +02:00
Sarah Hoffmann	612d34930b	handle postcodes properly on word table updates. update_postcodes_from_db() needs to do the full postcode treatment in order to derive the correct word table entries.	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	80ea13437d	move postcode matcher in a separate file	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	18864afa8a	postcodes: introduce a default pattern for countries without postcodes	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	9172696324	postcodes: add support for optional spaces	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	baee6f3de0	postcodes: strip leading country codes	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	28ab2f6048	add postcodes patterns without optional spaces	2022-06-23 23:42:31 +02:00
Sarah Hoffmann	90d4d339db	initial postcode cleaner for simple patterns. Moves postcodes that are either in countries without a postcode system or don't correspond to the local pattern for postcodes into a field for a normal address part. Makes them searchable but not as a special address. This has two consequences: they are no longer a skippable part of the address and the postcodes cannot be searched on their own.	2022-06-23 23:42:31 +02:00
Marc Tobias	0de83c4a51	fix typos of name Nominatim	2022-05-05 01:04:47 +02:00
Sarah Hoffmann	a0ed80d821	restore the tokenizer directory when missing. Automatically repopulate the tokenizer/ directory with the PHP stub and the postgresql module, when the directory is missing. This allows switching working directories and in particular running the service from a different machine than where it was installed. Users still need to make sure that .env files are set up correctly or they will shoot themselves in the foot. See #2515.	2022-03-20 11:31:42 +01:00
Sarah Hoffmann	0a9f971e44	add tests for new analyzed housenumbers	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	837d44391c	move generation of normalized token form to analyzer. This gives the analyzer more flexibility in choosing the normalized form. In particular, an analyzer creating different variants can choose the variant that will be used as the canonical form.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	a6b4e8ff67	add tests for housenumber-as-name feature	2022-02-07 11:45:12 +01:00
Sarah Hoffmann	38c3ef3da0	add tests for get_string_list(). Renaming test file for sanitizer config because pytest requires unique names for test files.	2022-02-07 11:22:24 +01:00
Sarah Hoffmann	610f2cc254	sanitizer: move helpers into a configuration class	2022-02-07 10:48:00 +01:00
Sarah Hoffmann	c170d323d9	add tests for cleaning housenumbers	2022-01-20 23:47:20 +01:00
Sarah Hoffmann	d09db09849	adapt ICU tests to new housenumber sanitizer. Restrict tests to making sure that handing in multiple housenumbers works.	2022-01-20 16:05:49 +01:00
Sarah Hoffmann	3741afa6dc	generalize filter-kind parameter for sanitizers. Now behaves the same for tag_analyzer_by_language and clean_housenumbers. Adds tests.	2022-01-20 15:42:42 +01:00
Sarah Hoffmann	4774e45218	clean_housenumbers: make kinds and delimiters configurable. Also adds unit tests for various options.	2022-01-20 12:07:12 +01:00
Sarah Hoffmann	b453b0ea95	introduce mutation variants to generic token analyser. Mutations are regular-expression-based replacements that are applied after variants have been computed. They are meant to be used for variations on character level. Add spelling variations for German umlauts.	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	c3788d765e	add consistent SPDX copyright headers	2022-01-03 16:23:58 +01:00
Sarah Hoffmann	7f7d2fd5b3	skip most addr: tags with suffixes. Only one addr: tag can be processed currently, so make sure it is the one without suffixes to not get odd data. addr:street is the exception because it uses a different matching mechanism.	2021-12-06 14:55:10 +01:00

60 Commits