forked from hans/Nominatim
Merge pull request #2757 from lonvia/filter-postcodes
Add filtering, normalisation and variants for postcodes
@@ -13,4 +13,4 @@ ignored-classes=NominatimArgs,closing
 # 'too-many-ancestors' is triggered already by deriving from UserDict
 disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use
-good-names=i,x,y,fd,db
+good-names=i,x,y,fd,db,cc
docs/customize/Country-Settings.md (new file, 149 lines)
@@ -0,0 +1,149 @@
# Customizing Per-Country Data

Whenever an OSM object is imported into Nominatim, the object is first assigned
a country. Nominatim can use this information to adapt various aspects of
the address computation to the local customs of the country. This section
explains how country assignment works and the principal per-country
localizations.

## Country assignment

Countries are assigned on the basis of country data from the OpenStreetMap
input data itself. Countries are expected to be tagged according to the
[administrative boundary schema](https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative):
an OSM relation with `boundary=administrative` and `admin_level=2`. Nominatim
uses the country code to distinguish the countries.

If there is no country data available for a point, then Nominatim uses the
fallback data imported from `data/country_osm_grid.sql.gz`. This was computed
from OSM data as well but is guaranteed to cover all countries.

Some OSM objects may also be located outside any country, for example a buoy
in the middle of the ocean. These objects do not get any country assigned and
get a default treatment when it comes to localized handling of data.
## Per-country settings

### Global country settings

The main place to configure settings per country is the file
`settings/country_settings.yaml`. This file has one section per country that
is recognised by Nominatim. Each section is tagged with the country code
(in lower case) and contains the different localization information. Only
countries which are listed in this file are taken into account for computations.

For example, the section for Andorra looks like this:

```
partition: 35
languages: ca
names: !include country-names/ad.yaml
postcode:
    pattern: "(ddd)"
    output: AD\1
```

The individual settings are described below.
#### `partition`

Nominatim internally splits the data into multiple tables to improve
performance. The partition number tells Nominatim into which table to put
the country. This is purely internal management and has no effect on the
output data.

The default is to have one partition per country.
#### `languages`

A comma-separated list of ISO-639 language codes of the default languages in the
country. These are the languages used in name tags without a language suffix.
Note that this is not necessarily the same as the list of official languages
in the country. There may be officially recognised languages in a country
which are only ever used in name tags with the appropriate language suffixes.
Conversely, a non-official language may appear a lot in the name tags, for
example when used as an unofficial lingua franca.

List the languages in order of frequency of appearance with the most frequently
used language first. It is not recommended to add languages when there are only
very few occurrences.

If only one language is listed, then Nominatim will 'auto-complete' the
language of names without an explicit language suffix.
#### `names`

List of names of the country and its translations. These names are used as
a baseline. It is always possible to search countries by the given names, no
matter what other names are in the OSM data. They are also used as a fallback
when a needed translation is not available.

!!! Note
    The list of names per country is currently fairly large because Nominatim
    supports translations in many languages per default. That is why the
    name lists have been separated out into extra files. You can find the
    name lists in the file `settings/country-names/<country code>.yaml`.
    The names section in the main country settings file only refers to these
    files via the special `!include` directive.
#### `postcode`

Describes the format of the postcode that is in use in the country.

When a country has no official postcodes, set this to `no`. Example:

```
ae:
    postcode: no
```

When a country has a postcode, you need to state the postcode pattern and
the default output format. Example:

```
bm:
    postcode:
        pattern: "(ll)[ -]?(dd)"
        output: \1 \2
```
The **pattern** is a regular expression that describes the possible formats
accepted as a postcode. The pattern follows the standard syntax for
[regular expressions in Python](https://docs.python.org/3/library/re.html#regular-expression-syntax)
with two extra shortcuts: `d` is a shortcut for a single digit (`[0-9]`)
and `l` for a single ASCII letter (`[A-Z]`).

Use match groups to indicate groups in the postcode that may optionally be
separated with a space or a hyphen.

For example, the postcode for Bermuda above always consists of two letters
and two digits. They may optionally be separated by a space or hyphen. That
means that Nominatim will consider `AB56`, `AB 56` and `AB-56` spelling variants
of one and the same postcode.
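The shortcut expansion and the optional-separator matching can be pictured with a small Python sketch. The helper name `compile_postcode_pattern` is hypothetical; it only mirrors the documented `d`/`l` expansion, not Nominatim's internal code:

```python
import re

def compile_postcode_pattern(pattern: str) -> re.Pattern:
    # Expand the two documented shortcuts into ordinary character classes:
    # 'd' -> a single digit, 'l' -> a single ASCII letter.
    return re.compile(pattern.replace('d', '[0-9]').replace('l', '[A-Z]'))

# The Bermuda pattern from the example above: two letters, an optional
# space or hyphen, then two digits.
bm = compile_postcode_pattern('(ll)[ -]?(dd)')

for candidate in ('AB56', 'AB 56', 'AB-56'):
    m = bm.fullmatch(candidate)
    # All three spellings match and capture the same two groups.
    assert m is not None and m.groups() == ('AB', '56')
```

Note that the naive string replace only works because `d` and `l` do not occur with another meaning inside these patterns.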
Never add the country code in front of the postcode pattern. Nominatim will
automatically accept variants with a country code prefix for all postcodes.

The **output** field is an optional field that describes what the canonical
spelling of the postcode should be. The format is the
[regular expression expand syntax](https://docs.python.org/3/library/re.html#re.Match.expand)
referring back to the bracket groups in the pattern.
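The expand syntax can be demonstrated with plain `re` (an illustration only, not Nominatim's own code): the `output` string is applied to a successful match to produce the canonical spelling.

```python
import re

# Bermuda: two letters plus two digits, separator optional on input,
# but the canonical output always uses a single space ("\1 \2").
pattern = re.compile('([A-Z][A-Z])[ -]?([0-9][0-9])')

def canonical(postcode: str) -> str:
    m = pattern.fullmatch(postcode)
    if m is None:
        raise ValueError(f'not a valid postcode: {postcode!r}')
    # re.Match.expand substitutes \1, \2 with the captured groups.
    return m.expand(r'\1 \2')

# All three spelling variants normalize to the same canonical form.
assert {canonical('AB56'), canonical('AB 56'), canonical('AB-56')} == {'AB 56'}
```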
Most simple postcodes only have one spelling variant. In that case, the
**output** can be omitted. The postcode will simply be used as is.

In the Bermuda example above, the canonical spelling would be to have a space
between letters and digits.

!!! Warning
    When your postcode pattern covers multiple variants of the postcode, then
    you must explicitly state the canonical output or Nominatim will not
    handle the variations correctly.

### Other country-specific configuration

There are some other configuration files where you can set localized settings
according to the assigned country. These are:

* [Place ranking configuration](Ranking.md)

Please see the linked documentation sections for more information.
@@ -205,6 +205,14 @@ The following is a list of sanitizers that are shipped with Nominatim.
     rendering:
         heading_level: 6
 
+##### clean-postcodes
+
+::: nominatim.tokenizer.sanitizers.clean_postcodes
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
 
 #### Token Analysis
@@ -222,8 +230,12 @@ by a sanitizer (see for example the
 The token-analysis section contains the list of configured analyzers. Each
 analyzer must have an `id` parameter that uniquely identifies the analyzer.
 The only exception is the default analyzer that is used when no special
-analyzer was selected. There is one special id '@housenumber'. If an analyzer
-with that name is present, it is used for normalization of house numbers.
+analyzer was selected. There are analyzers with special ids:
+
+* '@housenumber'. If an analyzer with that name is present, it is used
+  for normalization of house numbers.
+* '@postcode'. If an analyzer with that name is present, it is used
+  for normalization of postcodes.
 
 Different analyzer implementations may exist. To select the implementation,
 the `analyzer` parameter must be set. The different implementations are
@@ -356,6 +368,14 @@ house numbers of the form '3 a', '3A', '3-A' etc. are all considered equivalent.
 
 The analyzer cannot be customized.
 
+##### Postcode token analyzer
+
+The analyzer `postcodes` is purpose-made to analyze postcodes. It supports
+a 'lookup' variant of the token, which produces variants with optional
+spaces. Use it together with the clean-postcodes sanitizer.
+
+The analyzer cannot be customized.
+
 ### Reconfiguration
 
 Changing the configuration after the import is currently not possible, although
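The 'lookup' variants with optional spaces can be pictured with a short sketch. This is purely illustrative; the function name `space_variants` is hypothetical and the real analyzer (in the `nominatim.tokenizer.token_analysis` package) works on normalized tokens:

```python
import itertools

def space_variants(postcode):
    # For a canonical postcode containing spaces, produce the spellings
    # with and without each space, e.g. 'AB 56' -> {'AB 56', 'AB56'}.
    parts = postcode.split(' ')
    variants = set()
    for seps in itertools.product((' ', ''), repeat=len(parts) - 1):
        out = parts[0]
        for sep, part in zip(seps, parts[1:]):
            out += sep + part
        variants.add(out)
    return variants

assert space_variants('AB 56') == {'AB 56', 'AB56'}
assert space_variants('123') == {'123'}
```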
@@ -245,11 +245,11 @@ Currently, tokenizers are encouraged to make sure that matching works against
 both the search token list and the match token list.
 
 ```sql
-FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
 ```
 
-Return the normalized version of the given postcode. This function must return
-the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
+Return the postcode for the object, if any exists. The postcode must be in
+the form that should also be presented to the end-user.
 
 ```sql
 FUNCTION token_strip_info(info JSONB) RETURNS JSONB
 ```
@@ -28,6 +28,7 @@ pages:
     - 'Overview': 'customize/Overview.md'
     - 'Import Styles': 'customize/Import-Styles.md'
     - 'Configuration Settings': 'customize/Settings.md'
+    - 'Per-Country Data': 'customize/Country-Settings.md'
     - 'Place Ranking' : 'customize/Ranking.md'
     - 'Tokenizers' : 'customize/Tokenizers.md'
     - 'Special Phrases': 'customize/Special-Phrases.md'
@@ -25,7 +25,12 @@ class Postcode
     public function __construct($iId, $sPostcode, $sCountryCode = '')
     {
         $this->iId = $iId;
-        $this->sPostcode = $sPostcode;
+        $iSplitPos = strpos($sPostcode, '@');
+        if ($iSplitPos === false) {
+            $this->sPostcode = $sPostcode;
+        } else {
+            $this->sPostcode = substr($sPostcode, 0, $iSplitPos);
+        }
         $this->sCountryCode = empty($sCountryCode) ? '' : $sCountryCode;
     }
 
@@ -190,13 +190,17 @@ class Tokenizer
                 if ($aWord['word'] !== null
                     && pg_escape_string($aWord['word']) == $aWord['word']
                 ) {
-                    $sNormPostcode = $this->normalizeString($aWord['word']);
-                    if (strpos($sNormQuery, $sNormPostcode) !== false) {
-                        $oValidTokens->addToken(
-                            $sTok,
-                            new Token\Postcode($iId, $aWord['word'], null)
-                        );
+                    $iSplitPos = strpos($aWord['word'], '@');
+                    if ($iSplitPos === false) {
+                        $sPostcode = $aWord['word'];
+                    } else {
+                        $sPostcode = substr($aWord['word'], 0, $iSplitPos);
                     }
+
+                    $oValidTokens->addToken(
+                        $sTok,
+                        new Token\Postcode($iId, $sPostcode, null)
+                    );
                 }
                 break;
             case 'S': // tokens for classification terms (special phrases)
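Both PHP hunks implement the same storage convention: a word-table entry may hold `canonical@variant-base`, and readers keep only the part before the first `@`. In Python terms (an illustrative sketch; the helper name `display_postcode` is hypothetical):

```python
def display_postcode(word):
    # The word table may store e.g. 'AB 56@ab 56', where the part after
    # the '@' is the variant base used for lookup. Only the canonical
    # part before the '@' is shown to the user.
    split_pos = word.find('@')
    return word if split_pos < 0 else word[:split_pos]

assert display_postcode('AB 56@ab 56') == 'AB 56'
assert display_postcode('12345') == '12345'
```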
@@ -320,6 +320,11 @@ BEGIN
     location := ROW(null, null, null, hstore('ref', place.postcode), 'place',
                     'postcode', null, null, false, true, 5, 0)::addressline;
     RETURN NEXT location;
+  ELSEIF place.address is not null and place.address ? 'postcode'
+         and not place.address->'postcode' SIMILAR TO '%(,|;)%' THEN
+    location := ROW(null, null, null, hstore('ref', place.address->'postcode'), 'place',
+                    'postcode', null, null, false, true, 5, 0)::addressline;
+    RETURN NEXT location;
   END IF;
 
   RETURN;
@@ -156,7 +156,6 @@ DECLARE
   linegeo GEOMETRY;
   splitline GEOMETRY;
   sectiongeo GEOMETRY;
-  interpol_postcode TEXT;
   postcode TEXT;
   stepmod SMALLINT;
 BEGIN
@@ -174,8 +173,6 @@ BEGIN
                                     ST_PointOnSurface(NEW.linegeo),
                                     NEW.linegeo);
 
-  interpol_postcode := token_normalized_postcode(NEW.address->'postcode');
-
   NEW.token_info := token_strip_info(NEW.token_info);
   IF NEW.address ? '_inherited' THEN
     NEW.address := hstore('interpolation', NEW.address->'interpolation');
@@ -207,6 +204,11 @@ BEGIN
   FOR nextnode IN
     SELECT DISTINCT ON (nodeidpos)
            osm_id, address, geometry,
+           -- Take the postcode from the node only if it has a housenumber itself.
+           -- Note that there is a corner-case where the node has a wrongly
+           -- formatted postcode and therefore 'postcode' contains a derived
+           -- variant.
+           CASE WHEN address ? 'postcode' THEN placex.postcode ELSE NULL::text END as postcode,
            substring(address->'housenumber','[0-9]+')::integer as hnr
       FROM placex, generate_series(1, array_upper(waynodes, 1)) nodeidpos
      WHERE osm_type = 'N' and osm_id = waynodes[nodeidpos]::BIGINT
@@ -260,13 +262,10 @@ BEGIN
         endnumber := newend;
 
         -- determine postcode
-        postcode := coalesce(interpol_postcode,
-                             token_normalized_postcode(prevnode.address->'postcode'),
-                             token_normalized_postcode(nextnode.address->'postcode'),
-                             postcode);
-        IF postcode is NULL THEN
-          SELECT token_normalized_postcode(placex.postcode)
-            FROM placex WHERE place_id = NEW.parent_place_id INTO postcode;
+        postcode := coalesce(prevnode.postcode, nextnode.postcode, postcode);
+        IF postcode is NULL and NEW.parent_place_id > 0 THEN
+          SELECT placex.postcode FROM placex
+            WHERE place_id = NEW.parent_place_id INTO postcode;
         END IF;
         IF postcode is NULL THEN
           postcode := get_nearest_postcode(NEW.country_code, nextnode.geometry);
@@ -992,7 +992,7 @@ BEGIN
       {% if debug %}RAISE WARNING 'Got parent details from search name';{% endif %}
 
       -- determine postcode
-      NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
+      NEW.postcode := coalesce(token_get_postcode(NEW.token_info),
                                location.postcode,
                                get_nearest_postcode(NEW.country_code, NEW.centroid));
 
@@ -1150,8 +1150,7 @@ BEGIN
 
   {% if debug %}RAISE WARNING 'RETURN insert_addresslines: %, %, %', NEW.parent_place_id, NEW.postcode, nameaddress_vector;{% endif %}
 
-  NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
-                           NEW.postcode);
+  NEW.postcode := coalesce(token_get_postcode(NEW.token_info), NEW.postcode);
 
   -- if we have a name add this to the name search table
   IF NEW.name IS NOT NULL THEN
@@ -97,10 +97,10 @@ AS $$
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
-CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
+CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
   RETURNS TEXT
 AS $$
-  SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode)) END;
+  SELECT info->>'postcode';
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
@@ -223,3 +223,26 @@ BEGIN
 END;
 $$
 LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION create_postcode_word(postcode TEXT, lookup_terms TEXT[])
+  RETURNS BOOLEAN
+  AS $$
+DECLARE
+  existing INTEGER;
+BEGIN
+  SELECT count(*) INTO existing
+    FROM word WHERE word = postcode and type = 'P';
+
+  IF existing > 0 THEN
+    RETURN TRUE;
+  END IF;
+
+  -- postcodes don't need word ids
+  INSERT INTO word (word_token, type, word)
+    SELECT lookup_term, 'P', postcode FROM unnest(lookup_terms) as lookup_term;
+
+  RETURN FALSE;
+END;
+$$
+LANGUAGE plpgsql;
@@ -97,10 +97,10 @@ AS $$
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
-CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
+CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
   RETURNS TEXT
 AS $$
-  SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode)) END;
+  SELECT info->>'postcode';
 $$ LANGUAGE SQL IMMUTABLE STRICT;
 
 
nominatim/data/__init__.py (new file, 0 lines)
nominatim/data/postcode_format.py (new file, 109 lines)
@@ -0,0 +1,109 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Functions for formatting postcodes according to their country-specific
format.
"""
import re

from nominatim.errors import UsageError
from nominatim.tools import country_info


class CountryPostcodeMatcher:
    """ Matches and formats a postcode according to a format definition
        of the given country.
    """
    def __init__(self, country_code, config):
        if 'pattern' not in config:
            raise UsageError("Field 'pattern' required for 'postcode' "
                             f"for country '{country_code}'")

        pc_pattern = config['pattern'].replace('d', '[0-9]').replace('l', '[A-Z]')

        self.norm_pattern = re.compile(f'\\s*(?:{country_code.upper()}[ -]?)?(.*)\\s*')
        self.pattern = re.compile(pc_pattern)

        self.output = config.get('output', r'\g<0>')


    def match(self, postcode):
        """ Match the given postcode against the postcode pattern for this
            matcher. Returns a `re.Match` object if the match was successful
            and None otherwise.
        """
        # Upper-case, strip spaces and leading country code.
        normalized = self.norm_pattern.fullmatch(postcode.upper())

        if normalized:
            return self.pattern.fullmatch(normalized.group(1))

        return None


    def normalize(self, match):
        """ Return the default format of the postcode for the given match.
            `match` must be a `re.Match` object previously returned by
            `match()`.
        """
        return match.expand(self.output)


class PostcodeFormatter:
    """ Container for different postcode formats of the world and
        access functions.
    """
    def __init__(self):
        # Objects without a country code can't have a postcode per definition.
        self.country_without_postcode = {None}
        self.country_matcher = {}
        self.default_matcher = CountryPostcodeMatcher('', {'pattern': '.*'})

        for ccode, prop in country_info.iterate('postcode'):
            if prop is False:
                self.country_without_postcode.add(ccode)
            elif isinstance(prop, dict):
                self.country_matcher[ccode] = CountryPostcodeMatcher(ccode, prop)
            else:
                raise UsageError(f"Invalid entry 'postcode' for country '{ccode}'")


    def set_default_pattern(self, pattern):
        """ Set the postcode match pattern to use, when a country does not
            have a specific pattern or is marked as country without postcode.
        """
        self.default_matcher = CountryPostcodeMatcher('', {'pattern': pattern})


    def get_matcher(self, country_code):
        """ Return the CountryPostcodeMatcher for the given country.
            Returns None if the country doesn't have a postcode and the
            default matcher if there is no specific matcher configured for
            the country.
        """
        if country_code in self.country_without_postcode:
            return None

        return self.country_matcher.get(country_code, self.default_matcher)


    def match(self, country_code, postcode):
        """ Match the given postcode against the postcode pattern for this
            matcher. Returns a `re.Match` object if the country has a pattern
            and the match was successful or None if the match failed.
        """
        if country_code in self.country_without_postcode:
            return None

        return self.country_matcher.get(country_code, self.default_matcher).match(postcode)


    def normalize(self, country_code, match):
        """ Return the default format of the postcode for the given match.
            `match` must be a `re.Match` object previously returned by
            `match()`.
        """
        return self.country_matcher.get(country_code, self.default_matcher).normalize(match)
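The matcher can be exercised standalone. The snippet below inlines a trimmed, dependency-free copy of `CountryPostcodeMatcher` (so it runs without the `nominatim` package) and applies the Andorra configuration from the documentation:

```python
import re

# A trimmed copy of CountryPostcodeMatcher, shown here only so the
# example is self-contained; the full class lives in
# nominatim/data/postcode_format.py above.
class CountryPostcodeMatcher:
    def __init__(self, country_code, config):
        pc_pattern = config['pattern'].replace('d', '[0-9]').replace('l', '[A-Z]')
        self.norm_pattern = re.compile(f'\\s*(?:{country_code.upper()}[ -]?)?(.*)\\s*')
        self.pattern = re.compile(pc_pattern)
        self.output = config.get('output', r'\g<0>')

    def match(self, postcode):
        normalized = self.norm_pattern.fullmatch(postcode.upper())
        return self.pattern.fullmatch(normalized.group(1)) if normalized else None

    def normalize(self, match):
        return match.expand(self.output)

# Andorra from the documentation: three digits, canonically prefixed 'AD'.
ad = CountryPostcodeMatcher('ad', {'pattern': '(ddd)', 'output': r'AD\1'})

m = ad.match('ad-123')       # country-code prefix is accepted and stripped
assert m is not None
assert ad.normalize(m) == 'AD123'
assert ad.match('12') is None  # too short: does not match the pattern
```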
@@ -11,7 +11,6 @@ libICU instead of the PostgreSQL module.
 import itertools
 import json
 import logging
-import re
 from textwrap import dedent
 
 from nominatim.db.connection import connect
@@ -291,33 +290,72 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
         """ Update postcode tokens in the word table from the location_postcode
             table.
         """
-        to_delete = []
+        analyzer = self.token_analysis.analysis.get('@postcode')
+
         with self.conn.cursor() as cur:
-            # This finds us the rows in location_postcode and word that are
-            # missing in the other table.
-            cur.execute("""SELECT * FROM
-                            (SELECT pc, word FROM
-                              (SELECT distinct(postcode) as pc FROM location_postcode) p
-                              FULL JOIN
-                              (SELECT word FROM word WHERE type = 'P') w
-                              ON pc = word) x
-                           WHERE pc is null or word is null""")
-
-            with CopyBuffer() as copystr:
-                for postcode, word in cur:
-                    if postcode is None:
-                        to_delete.append(word)
-                    else:
-                        copystr.add(self._search_normalized(postcode),
-                                    'P', postcode)
-
-                if to_delete:
-                    cur.execute("""DELETE FROM WORD
-                                   WHERE type ='P' and word = any(%s)
-                                """, (to_delete, ))
-
-                copystr.copy_out(cur, 'word',
-                                 columns=['word_token', 'type', 'word'])
+            # First get all postcode names currently in the word table.
+            cur.execute("SELECT DISTINCT word FROM word WHERE type = 'P'")
+            word_entries = set((entry[0] for entry in cur))
+
+            # Then compute the required postcode names from the postcode table.
+            needed_entries = set()
+            cur.execute("SELECT country_code, postcode FROM location_postcode")
+            for cc, postcode in cur:
+                info = PlaceInfo({'country_code': cc,
+                                  'class': 'place', 'type': 'postcode',
+                                  'address': {'postcode': postcode}})
+                address = self.sanitizer.process_names(info)[1]
+                for place in address:
+                    if place.kind == 'postcode':
+                        if analyzer is None:
+                            postcode_name = place.name.strip().upper()
+                            variant_base = None
+                        else:
+                            postcode_name = analyzer.normalize(place.name)
+                            variant_base = place.get_attr("variant")
+
+                        if variant_base:
+                            needed_entries.add(f'{postcode_name}@{variant_base}')
+                        else:
+                            needed_entries.add(postcode_name)
+                        break
+
+        # Now update the word table.
+        self._delete_unused_postcode_words(word_entries - needed_entries)
+        self._add_missing_postcode_words(needed_entries - word_entries)
+
+    def _delete_unused_postcode_words(self, tokens):
+        if tokens:
+            with self.conn.cursor() as cur:
+                cur.execute("DELETE FROM word WHERE type = 'P' and word = any(%s)",
+                            (list(tokens), ))
+
+    def _add_missing_postcode_words(self, tokens):
+        if not tokens:
+            return
+
+        analyzer = self.token_analysis.analysis.get('@postcode')
+        terms = []
+
+        for postcode_name in tokens:
+            if '@' in postcode_name:
+                term, variant = postcode_name.split('@', 2)
+                term = self._search_normalized(term)
+                variants = {term}
+                if analyzer is not None:
+                    variants.update(analyzer.get_variants_ascii(variant))
+                variants = list(variants)
+            else:
+                variants = [self._search_normalized(postcode_name)]
+            terms.append((postcode_name, variants))
+
+        if terms:
+            with self.conn.cursor() as cur:
+                cur.execute_values("""SELECT create_postcode_word(pc, var)
+                                      FROM (VALUES %s) AS v(pc, var)""",
+                                   terms)
 
 
     def update_special_phrases(self, phrases, should_replace):
@@ -473,7 +511,7 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
     def _process_place_address(self, token_info, address):
         for item in address:
             if item.kind == 'postcode':
-                self._add_postcode(item.name)
+                token_info.set_postcode(self._add_postcode(item))
             elif item.kind == 'housenumber':
                 token_info.add_housenumber(*self._compute_housenumber_token(item))
             elif item.kind == 'street':
@@ -605,26 +643,38 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
         return full_tokens, partial_tokens
 
 
-    def _add_postcode(self, postcode):
+    def _add_postcode(self, item):
         """ Make sure the normalized postcode is present in the word table.
         """
-        if re.search(r'[:,;]', postcode) is None:
-            postcode = self.normalize_postcode(postcode)
-
-            if postcode not in self._cache.postcodes:
-                term = self._search_normalized(postcode)
-                if not term:
-                    return
-
-                with self.conn.cursor() as cur:
-                    # no word_id needed for postcodes
-                    cur.execute("""INSERT INTO word (word_token, type, word)
-                                   (SELECT %s, 'P', pc FROM (VALUES (%s)) as v(pc)
-                                    WHERE NOT EXISTS
-                                      (SELECT * FROM word
-                                       WHERE type = 'P' and word = pc))
-                                """, (term, postcode))
-                self._cache.postcodes.add(postcode)
+        analyzer = self.token_analysis.analysis.get('@postcode')
+
+        if analyzer is None:
+            postcode_name = item.name.strip().upper()
+            variant_base = None
+        else:
+            postcode_name = analyzer.normalize(item.name)
+            variant_base = item.get_attr("variant")
+
+        if variant_base:
+            postcode = f'{postcode_name}@{variant_base}'
+        else:
+            postcode = postcode_name
+
+        if postcode not in self._cache.postcodes:
+            term = self._search_normalized(postcode_name)
+            if not term:
+                return None
+
+            variants = {term}
+            if analyzer is not None and variant_base:
+                variants.update(analyzer.get_variants_ascii(variant_base))
+
+            with self.conn.cursor() as cur:
+                cur.execute("SELECT create_postcode_word(%s, %s)",
+                            (postcode, list(variants)))
+            self._cache.postcodes.add(postcode)
+
+        return postcode_name
 
 
 class _TokenInfo:
@@ -637,6 +687,7 @@ class _TokenInfo:
         self.street_tokens = set()
         self.place_tokens = set()
         self.address_tokens = {}
+        self.postcode = None
 
 
     @staticmethod
@@ -665,6 +716,9 @@ class _TokenInfo:
         if self.address_tokens:
             out['addr'] = self.address_tokens
 
+        if self.postcode:
+            out['postcode'] = self.postcode
+
         return out
 
 
@@ -701,6 +755,11 @@ class _TokenInfo:
         if partials:
             self.address_tokens[key] = self._mk_array(partials)
 
+    def set_postcode(self, postcode):
+        """ Set the postcode to the given one.
+        """
+        self.postcode = postcode
+
 
 class _TokenCache:
     """ Cache for token information to avoid repeated database queries.
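The `_add_postcode` rewrite above keys postcodes in the word table by their canonical form, tagged with the variant base when the country pattern produced one. A minimal sketch of that keying rule (the helper name is invented for illustration, mirroring the f-string in `_add_postcode`):

```python
def postcode_word_key(postcode_name, variant_base=None):
    # Canonical form alone, or canonical form tagged with the variant base
    # when pattern matching extracted one (e.g. 'AD675' with base '675').
    return f'{postcode_name}@{variant_base}' if variant_base else postcode_name

print(postcode_word_key('AD675', '675'))  # AD675@675
print(postcode_word_key('01982'))         # 01982
```

This keeps two postcodes that normalize to the same search term (but have different canonical spellings) distinct in the cache and the word table.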

@@ -467,8 +467,9 @@ class LegacyNameAnalyzer(AbstractAnalyzer):
             if key == 'postcode':
                 # Make sure the normalized postcode is present in the word table.
                 if re.search(r'[:,;]', value) is None:
-                    self._cache.add_postcode(self.conn,
-                                             self.normalize_postcode(value))
+                    norm_pc = self.normalize_postcode(value)
+                    token_info.set_postcode(norm_pc)
+                    self._cache.add_postcode(self.conn, norm_pc)
             elif key in ('housenumber', 'streetnumber', 'conscriptionnumber'):
                 hnrs.append(value)
             elif key == 'street':
@@ -527,6 +528,11 @@ class _TokenInfo:
         self.data['hnr_tokens'], self.data['hnr'] = cur.fetchone()
 
 
+    def set_postcode(self, postcode):
+        """ Set or replace the postcode token with the given value.
+        """
+        self.data['postcode'] = postcode
+
     def add_street(self, conn, street):
         """ Add addr:street match terms.
         """

nominatim/tokenizer/sanitizers/clean_postcodes.py (new file, 74 lines)
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Sanitizer that filters postcodes by their officially allowed pattern.
+
+Arguments:
+    convert-to-address: If set to 'yes' (the default), then postcodes that do
+                        not conform with their country-specific pattern are
+                        converted to an address component. That means that
+                        the postcode does not take part when computing the
+                        postcode centroids of a country but is still searchable.
+                        When set to 'no', non-conforming postcodes are not
+                        searchable either.
+    default-pattern: Pattern to use, when there is none available for the
+                     country in question. Warning: will not be used for
+                     objects that have no country assigned. These are always
+                     assumed to have no postcode.
+"""
+from nominatim.data.postcode_format import PostcodeFormatter
+
+
+class _PostcodeSanitizer:
+
+    def __init__(self, config):
+        self.convert_to_address = config.get_bool('convert-to-address', True)
+        self.matcher = PostcodeFormatter()
+
+        default_pattern = config.get('default-pattern')
+        if default_pattern is not None and isinstance(default_pattern, str):
+            self.matcher.set_default_pattern(default_pattern)
+
+
+    def __call__(self, obj):
+        if not obj.address:
+            return
+
+        postcodes = ((i, o) for i, o in enumerate(obj.address) if o.kind == 'postcode')
+
+        for pos, postcode in postcodes:
+            formatted = self.scan(postcode.name, obj.place.country_code)
+
+            if formatted is None:
+                if self.convert_to_address:
+                    postcode.kind = 'unofficial_postcode'
+                else:
+                    obj.address.pop(pos)
+            else:
+                postcode.name = formatted[0]
+                postcode.set_attr('variant', formatted[1])
+
+
+    def scan(self, postcode, country):
+        """ Check the postcode for correct formatting and return the
+            normalized version. Returns None if the postcode does not
+            correspond to the official format of the given country.
+        """
+        match = self.matcher.match(country, postcode)
+        if match is None:
+            return None
+
+        return self.matcher.normalize(country, match),\
+               ' '.join(filter(lambda p: p is not None, match.groups()))
+
+
+
+def create(config):
+    """ Create a postcode processing function.
+    """
+    return _PostcodeSanitizer(config)
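The filtering step of the new sanitizer can be sketched with plain regular expressions. The per-country patterns below are made up for illustration (the real ones come from `PostcodeFormatter` and the country configuration, which are not part of this diff), and `scan` here only approximates the real normalization:

```python
import re

# Illustrative per-country patterns -- NOT Nominatim's real country data.
PATTERNS = {
    'de': r'\d{5}',
    'nl': r'(\d{4}) ?([A-Z]{2})',
}
DEFAULT_PATTERN = r'[A-Z0-9- ]{3,12}'

def scan(postcode, country):
    """ Return (normalized postcode, variant base) or None when the
        postcode does not match the country's official format.
    """
    pattern = PATTERNS.get(country, DEFAULT_PATTERN)
    match = re.fullmatch(pattern, postcode.strip().upper())
    if match is None:
        return None
    # Joining the capture groups with a space yields the 'variant' attribute
    # that the token analysis later expands into spelling variants.
    return match.group(0), ' '.join(g for g in match.groups() if g is not None)

print(scan('3993dx', 'nl'))   # ('3993DX', '3993 DX')
print(scan('ABC', 'de'))      # None
```

Non-matching postcodes would then either be re-tagged as `unofficial_postcode` or dropped, depending on `convert-to-address`.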
@@ -44,6 +44,20 @@ class SanitizerConfig(UserDict):
         return values
 
 
+    def get_bool(self, param, default=None):
+        """ Extract a configuration parameter as a boolean.
+            The parameter must be one of the YAML boolean values or a
+            user error will be raised. If `default` is given, then the parameter
+            may also be missing or empty.
+        """
+        value = self.data.get(param, default)
+
+        if not isinstance(value, bool):
+            raise UsageError(f"Parameter '{param}' must be a boolean value ('yes' or 'no').")
+
+        return value
+
+
     def get_delimiter(self, default=',;'):
         """ Return the 'delimiter' parameter in the configuration as a
             compiled regular expression that can be used to split the names on the
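A standalone sketch of the `get_bool` semantics added above. `UsageError` is defined locally here as a stand-in for Nominatim's own exception class; the key point is that YAML has already turned `yes`/`no` into real booleans, so anything else is a configuration mistake:

```python
class UsageError(Exception):
    """ Stand-in for Nominatim's UsageError. """

def get_bool(config, param, default=None):
    # `config` plays the role of SanitizerConfig's underlying dict.
    value = config.get(param, default)
    if not isinstance(value, bool):
        raise UsageError(f"Parameter '{param}' must be a boolean value ('yes' or 'no').")
    return value

print(get_bool({'convert-to-address': False}, 'convert-to-address'))  # False
print(get_bool({}, 'convert-to-address', True))                       # True
```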

@@ -48,8 +48,7 @@ class _AnalyzerByLanguage:
         self.deflangs = {}
 
         if use_defaults in ('mono', 'all'):
-            for ccode, prop in country_info.iterate():
-                clangs = prop['languages']
+            for ccode, clangs in country_info.iterate('languages'):
                 if len(clangs) == 1 or use_defaults == 'all':
                     if self.whitelist:
                         self.deflangs[ccode] = [l for l in clangs if l in self.whitelist]

nominatim/tokenizer/token_analysis/postcodes.py (new file, 65 lines)
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Specialized processor for postcodes. Supports a 'lookup' variant of the
+token, which produces variants with optional spaces.
+"""
+
+from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
+
+### Configuration section
+
+def configure(rules, normalization_rules): # pylint: disable=W0613
+    """ All behaviour is currently hard-coded.
+    """
+    return None
+
+### Analysis section
+
+def create(normalizer, transliterator, config): # pylint: disable=W0613
+    """ Create a new token analysis instance for this module.
+    """
+    return PostcodeTokenAnalysis(normalizer, transliterator)
+
+
+class PostcodeTokenAnalysis:
+    """ Special normalization and variant generation for postcodes.
+
+        This analyser must not be used with anything but postcodes as
+        it follows some special rules: `normalize` doesn't necessarily
+        need to return a standard form as per normalization rules. It
+        needs to return the canonical form of the postcode that is also
+        used for output. `get_variants_ascii` then needs to ensure that
+        the generated variants once more follow the standard normalization
+        and transliteration, so that postcodes are correctly recognised by
+        the search algorithm.
+    """
+    def __init__(self, norm, trans):
+        self.norm = norm
+        self.trans = trans
+
+        self.mutator = MutationVariantGenerator(' ', (' ', ''))
+
+
+    def normalize(self, name):
+        """ Return the standard form of the postcode.
+        """
+        return name.strip().upper()
+
+
+    def get_variants_ascii(self, norm_name):
+        """ Compute the spelling variants for the given normalized postcode.
+
+            Takes the canonical form of the postcode, normalizes it using the
+            standard rules and then creates variants of the result where
+            all spaces are optional.
+        """
+        # Postcodes follow their own transliteration rules.
+        # Make sure at this point, that the terms are normalized in a way
+        # that they are searchable with the standard transliteration rules.
+        return [self.trans.transliterate(term) for term in
+                self.mutator.generate([self.norm.transliterate(norm_name)]) if term]
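The `MutationVariantGenerator(' ', (' ', ''))` used above makes every space in a postcode optional. The same idea can be reproduced with `itertools.product` in a self-contained sketch (the function name is invented; the real generator also handles normalization and transliteration):

```python
from itertools import product

def space_variants(term):
    """ All spellings of `term` where each space is either kept or dropped,
        mimicking MutationVariantGenerator(' ', (' ', '')). """
    parts = term.split(' ')
    # Each gap between parts is filled with either a space or nothing.
    return [''.join(p + sep for p, sep in zip(parts, seps + ('',)))
            for seps in product((' ', ''), repeat=len(parts) - 1)]

print(space_variants('AD 675'))    # ['AD 675', 'AD675']
print(space_variants('399174'))    # ['399174']
```

A postcode with n spaces thus yields 2^n searchable variants, which is why "3993 DX", "3993DX" and "3993dx" in the BDD tests below all resolve to the same canonical postcode.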
@@ -84,10 +84,20 @@ def setup_country_config(config):
     _COUNTRY_INFO.load(config)
 
 
-def iterate():
+def iterate(prop=None):
     """ Iterate over country code and properties.
+
+        When `prop` is None, all countries are returned with their complete
+        set of properties.
+
+        If `prop` is given, then only countries are returned where the
+        given property is set. The second item of the tuple contains only
+        the content of the given property.
     """
-    return _COUNTRY_INFO.items()
+    if prop is None:
+        return _COUNTRY_INFO.items()
+
+    return ((c, p[prop]) for c, p in _COUNTRY_INFO.items() if prop in p)
 
 
 def setup_country_tables(dsn, sql_dir, ignore_partitions=False):
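The new `prop` filter in `iterate` can be exercised standalone; the country data below is a toy stand-in for the loaded configuration:

```python
_COUNTRY_INFO = {  # toy stand-in for the loaded country configuration
    'de': {'languages': ['de']},
    'ch': {'languages': ['de', 'fr', 'it', 'rm']},
    'aq': {},  # no languages configured
}

def iterate(prop=None):
    if prop is None:
        return _COUNTRY_INFO.items()
    # Countries without the property are skipped; the property's value
    # replaces the full property dict in the yielded tuple.
    return ((c, p[prop]) for c, p in _COUNTRY_INFO.items() if prop in p)

print(dict(iterate('languages')))
# {'de': ['de'], 'ch': ['de', 'fr', 'it', 'rm']}
```

This is what lets `_AnalyzerByLanguage` above drop the manual `prop['languages']` lookup.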

@@ -8,6 +8,7 @@
 Functions for importing, updating and otherwise maintaining the table
 of artificial postcode centroids.
 """
+from collections import defaultdict
 import csv
 import gzip
 import logging
@@ -16,6 +17,8 @@ from math import isfinite
 from psycopg2 import sql as pysql
 
 from nominatim.db.connection import connect
+from nominatim.utils.centroid import PointsCentroid
+from nominatim.data.postcode_format import PostcodeFormatter
 
 LOG = logging.getLogger()
 
@@ -30,20 +33,31 @@ def _to_float(num, max_value):
 
     return num
 
-class _CountryPostcodesCollector:
+class _PostcodeCollector:
     """ Collector for postcodes of a single country.
     """
 
-    def __init__(self, country):
+    def __init__(self, country, matcher):
         self.country = country
-        self.collected = {}
+        self.matcher = matcher
+        self.collected = defaultdict(PointsCentroid)
+        self.normalization_cache = None
 
 
     def add(self, postcode, x, y):
         """ Add the given postcode to the collection cache. If the postcode
             already existed, it is overwritten with the new centroid.
         """
-        self.collected[postcode] = (x, y)
+        if self.matcher is not None:
+            if self.normalization_cache and self.normalization_cache[0] == postcode:
+                normalized = self.normalization_cache[1]
+            else:
+                match = self.matcher.match(postcode)
+                normalized = self.matcher.normalize(match) if match else None
+                self.normalization_cache = (postcode, normalized)
 
+            if normalized:
+                self.collected[normalized] += (x, y)
 
 
     def commit(self, conn, analyzer, project_dir):
@@ -93,16 +107,16 @@ class _CountryPostcodesCollector:
                                WHERE country_code = %s""",
                            (self.country, ))
                for postcode, x, y in cur:
-                    newx, newy = self.collected.pop(postcode, (None, None))
-                    if newx is not None:
-                        dist = (x - newx)**2 + (y - newy)**2
-                        if dist > 0.0000001:
+                    pcobj = self.collected.pop(postcode, None)
+                    if pcobj:
+                        newx, newy = pcobj.centroid()
+                        if (x - newx) > 0.0000001 or (y - newy) > 0.0000001:
                             to_update.append((postcode, newx, newy))
                     else:
                         to_delete.append(postcode)
 
-        to_add = [(k, v[0], v[1]) for k, v in self.collected.items()]
-        self.collected = []
+        to_add = [(k, *v.centroid()) for k, v in self.collected.items()]
+        self.collected = None
 
         return to_add, to_delete, to_update
 
@@ -125,8 +139,10 @@ class _CountryPostcodesCollector:
                 postcode = analyzer.normalize_postcode(row['postcode'])
                 if postcode not in self.collected:
                     try:
-                        self.collected[postcode] = (_to_float(row['lon'], 180),
-                                                    _to_float(row['lat'], 90))
+                        # Do the float conversion separately, it might throw
+                        centroid = (_to_float(row['lon'], 180),
+                                    _to_float(row['lat'], 90))
+                        self.collected[postcode] += centroid
                     except ValueError:
                         LOG.warning("Bad coordinates %s, %s in %s country postcode file.",
                                     row['lat'], row['lon'], self.country)
@@ -158,6 +174,7 @@ def update_postcodes(dsn, project_dir, tokenizer):
        potentially enhances it with external data and then updates the
        postcodes in the table 'location_postcode'.
    """
+    matcher = PostcodeFormatter()
    with tokenizer.name_analyzer() as analyzer:
        with connect(dsn) as conn:
            # First get the list of countries that currently have postcodes.
@@ -169,19 +186,17 @@ def update_postcodes(dsn, project_dir, tokenizer):
            # Recompute the list of valid postcodes from placex.
            with conn.cursor(name="placex_postcodes") as cur:
                cur.execute("""
-               SELECT cc as country_code, pc, ST_X(centroid), ST_Y(centroid)
+               SELECT cc, pc, ST_X(centroid), ST_Y(centroid)
                FROM (SELECT
                        COALESCE(plx.country_code,
                                 get_country_code(ST_Centroid(pl.geometry))) as cc,
-                       token_normalized_postcode(pl.address->'postcode') as pc,
-                       ST_Centroid(ST_Collect(COALESCE(plx.centroid,
-                                                       ST_Centroid(pl.geometry)))) as centroid
+                       pl.address->'postcode' as pc,
+                       COALESCE(plx.centroid, ST_Centroid(pl.geometry)) as centroid
                      FROM place AS pl LEFT OUTER JOIN placex AS plx
                          ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
-                   WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null
-                   GROUP BY cc, pc) xx
+                   WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null) xx
                WHERE pc IS NOT null AND cc IS NOT null
-               ORDER BY country_code, pc""")
+               ORDER BY cc, pc""")
 
                collector = None
 
@@ -189,7 +204,7 @@ def update_postcodes(dsn, project_dir, tokenizer):
                    if collector is None or country != collector.country:
                        if collector is not None:
                            collector.commit(conn, analyzer, project_dir)
-                       collector = _CountryPostcodesCollector(country)
+                       collector = _PostcodeCollector(country, matcher.get_matcher(country))
                        todo_countries.discard(country)
                    collector.add(postcode, x, y)
 
@@ -198,7 +213,8 @@ def update_postcodes(dsn, project_dir, tokenizer):
 
            # Now handle any countries that are only in the postcode table.
            for country in todo_countries:
-               _CountryPostcodesCollector(country).commit(conn, analyzer, project_dir)
+               fmt = matcher.get_matcher(country)
+               _PostcodeCollector(country, fmt).commit(conn, analyzer, project_dir)
 
            conn.commit()

nominatim/utils/__init__.py (new file, empty)

nominatim/utils/centroid.py (new file, 48 lines)
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Functions for computation of centroids.
+"""
+from collections.abc import Collection
+
+class PointsCentroid:
+    """ Centroid computation from single points using an online algorithm.
+        More points may be added at any time.
+
+        Coordinates are internally treated as a 7-digit fixed-point float
+        (i.e. in OSM style).
+    """
+
+    def __init__(self):
+        self.sum_x = 0
+        self.sum_y = 0
+        self.count = 0
+
+    def centroid(self):
+        """ Return the centroid of all points collected so far.
+        """
+        if self.count == 0:
+            raise ValueError("No points available for centroid.")
+
+        return (float(self.sum_x/self.count)/10000000,
+                float(self.sum_y/self.count)/10000000)
+
+
+    def __len__(self):
+        return self.count
+
+
+    def __iadd__(self, other):
+        if isinstance(other, Collection) and len(other) == 2:
+            if all(isinstance(p, (float, int)) for p in other):
+                x, y = other
+                self.sum_x += int(x * 10000000)
+                self.sum_y += int(y * 10000000)
+                self.count += 1
+                return self
+
+        raise ValueError("Can only add 2-element tuples to centroid.")
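How the new class is meant to be used: points accumulate via `+=` and the centroid is the fixed-point mean. A condensed, runnable sketch of the class above together with a usage example (type checks omitted for brevity):

```python
class PointsCentroid:
    """ Online centroid over points stored as 7-digit fixed-point ints. """

    def __init__(self):
        self.sum_x = self.sum_y = self.count = 0

    def __iadd__(self, other):
        x, y = other
        # Accumulate in fixed-point to avoid drift from summing many floats.
        self.sum_x += int(x * 10000000)
        self.sum_y += int(y * 10000000)
        self.count += 1
        return self

    def centroid(self):
        if self.count == 0:
            raise ValueError("No points available for centroid.")
        return (self.sum_x / self.count / 10000000,
                self.sum_y / self.count / 10000000)

pc = PointsCentroid()
pc += (8.5, 47.25)
pc += (8.75, 47.75)
print(pc.centroid())  # (8.625, 47.5)
```

This is what lets `_PostcodeCollector` replace its old single `(x, y)` per postcode with a running average over all places sharing the same normalized postcode.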
File diff suppressed because it is too large
@@ -32,6 +32,9 @@ sanitizers:
           - streetnumber
         convert-to-name:
           - (\A|.*,)[^\d,]{3,}(,.*|\Z)
+    - step: clean-postcodes
+      convert-to-address: yes
+      default-pattern: "[A-Z0-9- ]{3,12}"
     - step: split-name-list
     - step: strip-brace-terms
     - step: tag-analyzer-by-language
@@ -43,6 +46,8 @@ token-analysis:
     - analyzer: generic
     - id: "@housenumber"
       analyzer: housenumbers
+    - id: "@postcode"
+      analyzer: postcodes
     - id: bg
       analyzer: generic
       mode: variant-only

@@ -163,25 +163,8 @@ Feature: Import of postcodes
             | de      | 01982    | country:de |
         And there are word tokens for postcodes 01982
 
-    Scenario: Different postcodes with the same normalization can both be found
-        Given the places
-            | osm | class | type  | addr+postcode | addr+housenumber | geometry   |
-            | N34 | place | house | EH4 7EA       | 111              | country:gb |
-            | N35 | place | house | E4 7EA        | 111              | country:gb |
-        When importing
-        Then location_postcode contains exactly
-            | country | postcode | geometry   |
-            | gb      | EH4 7EA  | country:gb |
-            | gb      | E4 7EA   | country:gb |
-        When sending search query "EH4 7EA"
-        Then results contain
-            | type     | display_name |
-            | postcode | EH4 7EA      |
-        When sending search query "E4 7EA"
-        Then results contain
-            | type     | display_name |
-            | postcode | E4 7EA       |
-
+    @Fail
     Scenario: search and address ranks for GB post codes correctly assigned
         Given the places
             | osm | class | type     | postcode | geometry |
@@ -195,55 +178,19 @@ Feature: Import of postcodes
             | E45 2    | gb      | 23          | 5 |
             | Y45      | gb      | 21          | 5 |
 
-    Scenario: wrongly formatted GB postcodes are down-ranked
+    @fail-legacy
+    Scenario: Postcodes outside all countries are not added to the postcode and word table
         Given the places
-            | osm | class | type     | postcode | geometry   |
-            | N1  | place | postcode | EA452CD  | country:gb |
-            | N2  | place | postcode | E45 23   | country:gb |
+            | osm | class | type  | addr+postcode | addr+housenumber | addr+place  | geometry  |
+            | N34 | place | house | 01982         | 111              | Null Island | 0 0.00001 |
+        And the places
+            | osm | class | type   | name        | geometry |
+            | N1  | place | hamlet | Null Island | 0 0      |
        When importing
        Then location_postcode contains exactly
-            | postcode | country | rank_search | rank_address |
-            | EA452CD  | gb      | 30          | 30 |
-            | E45 23   | gb      | 30          | 30 |
-
-    Scenario: search and address rank for DE postcodes correctly assigned
-        Given the places
-            | osm | class | type     | postcode | geometry   |
-            | N1  | place | postcode | 56427    | country:de |
-            | N2  | place | postcode | 5642     | country:de |
-            | N3  | place | postcode | 5642A    | country:de |
-            | N4  | place | postcode | 564276   | country:de |
-        When importing
-        Then location_postcode contains exactly
-            | postcode | country | rank_search | rank_address |
-            | 56427    | de      | 21          | 11 |
-            | 5642     | de      | 30          | 30 |
-            | 5642A    | de      | 30          | 30 |
-            | 564276   | de      | 30          | 30 |
-
-    Scenario: search and address rank for other postcodes are correctly assigned
-        Given the places
-            | osm | class | type     | postcode | geometry   |
-            | N1  | place | postcode | 1        | country:ca |
-            | N2  | place | postcode | X3       | country:ca |
-            | N3  | place | postcode | 543      | country:ca |
-            | N4  | place | postcode | 54dc     | country:ca |
-            | N5  | place | postcode | 12345    | country:ca |
-            | N6  | place | postcode | 55TT667  | country:ca |
-            | N7  | place | postcode | 123-65   | country:ca |
-            | N8  | place | postcode | 12 445 4 | country:ca |
-            | N9  | place | postcode | A1:bc10  | country:ca |
-        When importing
-        Then location_postcode contains exactly
-            | postcode | country | rank_search | rank_address |
-            | 1        | ca      | 21          | 11 |
-            | X3       | ca      | 21          | 11 |
-            | 543      | ca      | 21          | 11 |
-            | 54DC     | ca      | 21          | 11 |
-            | 12345    | ca      | 21          | 11 |
-            | 55TT667  | ca      | 21          | 11 |
-            | 123-65   | ca      | 25          | 11 |
-            | 12 445 4 | ca      | 25          | 11 |
-            | A1:BC10  | ca      | 25          | 11 |
+            | country | postcode | geometry |
+        And there are no word tokens for postcodes 01982
+        When sending search query "111, 01982 Null Island"
+        Then results contain
+            | osm | display_name            |
+            | N34 | 111, Null Island, 01982 |

@@ -168,14 +168,6 @@ Feature: Import and search of names
             | ID | osm |
             | 0  | R1  |
 
-    Scenario: Unprintable characters in postcodes are ignored
-        Given the named places
-            | osm  | class   | type   | address                    | geometry   |
-            | N234 | amenity | prison | 'postcode' : u'1234\u200e' | country:de |
-        When importing
-        And sending search query "1234"
-        Then result 0 has not attributes osm_type
-
     Scenario Outline: Housenumbers with special characters are found
         Given the grid
             | 1 |  |  |  | 2 |

test/bdd/db/query/postcodes.feature (new file, 97 lines)
@@ -0,0 +1,97 @@
+@DB
+Feature: Querying for postcode variants
+
+    Scenario: Postcodes in Singapore (6-digit postcode)
+        Given the grid with origin SG
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name   | addr+postcode | geometry |
+            | W1  | highway | path | Lorang | 399174        | 10,11    |
+        When importing
+        When sending search query "399174"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 399174       |
+
+
+    @fail-legacy
+    Scenario Outline: Postcodes in the Netherlands (mixed postcode with spaces)
+        Given the grid with origin NL
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name     | addr+postcode | geometry |
+            | W1  | highway | path | De Weide | <postcode>    | 10,11    |
+        When importing
+        When sending search query "3993 DX"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 3993 DX      |
+        When sending search query "3993dx"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 3993 DX      |
+
+        Examples:
+            | postcode |
+            | 3993 DX  |
+            | 3993DX   |
+            | 3993 dx  |
+
+
+    @fail-legacy
+    Scenario: Postcodes in Singapore (6-digit postcode)
+        Given the grid with origin SG
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name   | addr+postcode | geometry |
+            | W1  | highway | path | Lorang | 399174        | 10,11    |
+        When importing
+        When sending search query "399174"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | 399174       |
+
+
+    @fail-legacy
+    Scenario Outline: Postcodes in Andorra (with country code)
+        Given the grid with origin AD
+            | 10 |  |  |  | 11 |
+        And the places
+            | osm | class   | type | name   | addr+postcode | geometry |
+            | W1  | highway | path | Lorang | <postcode>    | 10,11    |
+        When importing
+        When sending search query "675"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | AD675        |
+        When sending search query "AD675"
+        Then results contain
+            | ID | type     | display_name |
+            | 0  | postcode | AD675        |
+
+        Examples:
+            | postcode |
+            | 675      |
+            | AD 675   |
+            | AD675    |
+
+
+    Scenario: Different postcodes with the same normalization can both be found
+        Given the places
+            | osm | class | type  | addr+postcode | addr+housenumber | geometry   |
+            | N34 | place | house | EH4 7EA       | 111              | country:gb |
+            | N35 | place | house | E4 7EA        | 111              | country:gb |
+        When importing
+        Then location_postcode contains exactly
+            | country | postcode | geometry   |
+            | gb      | EH4 7EA  | country:gb |
+            | gb      | E4 7EA   | country:gb |
+        When sending search query "EH4 7EA"
+        Then results contain
+            | type     | display_name |
+            | postcode | EH4 7EA      |
+        When sending search query "E4 7EA"
+        Then results contain
+            | type     | display_name |
+            | postcode | E4 7EA       |
@@ -18,13 +18,19 @@ from nominatim.tokenizer import factory as tokenizer_factory

 def check_database_integrity(context):
     """ Check some generic constraints on the tables.
     """
-    # place_addressline should not have duplicate (place_id, address_place_id)
-    cur = context.db.cursor()
-    cur.execute("""SELECT count(*) FROM
-                    (SELECT place_id, address_place_id, count(*) as c
-                     FROM place_addressline GROUP BY place_id, address_place_id) x
-                   WHERE c > 1""")
-    assert cur.fetchone()[0] == 0, "Duplicates found in place_addressline"
+    with context.db.cursor() as cur:
+        # place_addressline should not have duplicate (place_id, address_place_id)
+        cur.execute("""SELECT count(*) FROM
+                        (SELECT place_id, address_place_id, count(*) as c
+                         FROM place_addressline GROUP BY place_id, address_place_id) x
+                       WHERE c > 1""")
+        assert cur.fetchone()[0] == 0, "Duplicates found in place_addressline"
+
+        # word table must not have empty word_tokens
+        if context.nominatim.tokenizer != 'legacy':
+            cur.execute("SELECT count(*) FROM word WHERE word_token = ''")
+            assert cur.fetchone()[0] == 0, "Empty word tokens found in word table"


 ################################ GIVEN ##################################
102  test/python/tokenizer/sanitizers/test_clean_postcodes.py  Normal file
@@ -0,0 +1,102 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for the sanitizer that normalizes postcodes.
"""
import pytest

from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.indexer.place_info import PlaceInfo
from nominatim.tools import country_info

@pytest.fixture
def sanitize(def_config, request):
    country_info.setup_country_config(def_config)
    sanitizer_args = {'step': 'clean-postcodes'}
    for mark in request.node.iter_markers(name="sanitizer_params"):
        sanitizer_args.update({k.replace('_', '-') : v for k,v in mark.kwargs.items()})

    def _run(country=None, **kwargs):
        pi = {'address': kwargs}
        if country is not None:
            pi['country_code'] = country

        _, address = PlaceSanitizer([sanitizer_args]).process_names(PlaceInfo(pi))

        return sorted([(p.kind, p.name) for p in address])

    return _run


@pytest.mark.parametrize("country", (None, 'ae'))
def test_postcode_no_country(sanitize, country):
    assert sanitize(country=country, postcode='23231') == [('unofficial_postcode', '23231')]


@pytest.mark.parametrize("country", (None, 'ae'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_no_country_drop(sanitize, country):
    assert sanitize(country=country, postcode='23231') == []


@pytest.mark.parametrize("postcode", ('12345', ' 12345 ', 'de 12345',
                                      'DE12345', 'DE 12345', 'DE-12345'))
def test_postcode_pass_good_format(sanitize, postcode):
    assert sanitize(country='de', postcode=postcode) == [('postcode', '12345')]


@pytest.mark.parametrize("postcode", ('123456', '', '   ', '.....',
                                      'DE  12345', 'DEF12345', 'CH 12345'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_drop_bad_format(sanitize, postcode):
    assert sanitize(country='de', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('1234', '9435', '99000'))
def test_postcode_cyprus_pass(sanitize, postcode):
    assert sanitize(country='cy', postcode=postcode) == [('postcode', postcode)]


@pytest.mark.parametrize("postcode", ('91234', '99a45', '567'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_cyprus_fail(sanitize, postcode):
    assert sanitize(country='cy', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('123456', 'A33F2G7'))
def test_postcode_kazakhstan_pass(sanitize, postcode):
    assert sanitize(country='kz', postcode=postcode) == [('postcode', postcode)]


@pytest.mark.parametrize("postcode", ('V34T6Y923456', '99345'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_kazakhstan_fail(sanitize, postcode):
    assert sanitize(country='kz', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('675 34', '67534', 'SE-675 34', 'SE67534'))
def test_postcode_sweden_pass(sanitize, postcode):
    assert sanitize(country='se', postcode=postcode) == [('postcode', '675 34')]


@pytest.mark.parametrize("postcode", ('67 345', '671123'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_sweden_fail(sanitize, postcode):
    assert sanitize(country='se', postcode=postcode) == []


@pytest.mark.parametrize("postcode", ('AB1', '123-456-7890', '1 as 44'))
@pytest.mark.sanitizer_params(default_pattern='[A-Z0-9- ]{3,12}')
def test_postcode_default_pattern_pass(sanitize, postcode):
    assert sanitize(country='an', postcode=postcode) == [('postcode', postcode.upper())]


@pytest.mark.parametrize("postcode", ('C', '12', 'ABC123DEF 456', '1234,5678', '11223;11224'))
@pytest.mark.sanitizer_params(convert_to_address=False, default_pattern='[A-Z0-9- ]{3,12}')
def test_postcode_default_pattern_fail(sanitize, postcode):
    assert sanitize(country='an', postcode=postcode) == []
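The behaviour these tests pin down can be sketched with a simplified stand-in for the `clean-postcodes` step. The per-country patterns below are hard-coded assumptions for illustration; the real sanitizer reads them from the country configuration and is stricter (it rejects, for instance, a double space after the country prefix, which this sketch would accept).

```python
import re

# Hypothetical per-country postcode patterns (the real ones come from
# Nominatim's country configuration, not from this table).
PATTERNS = {
    'de': re.compile(r'\d{5}'),
    'cy': re.compile(r'\d{4,5}'),
}

def clean_postcode(country, raw, convert_to_address=True):
    """Return ('postcode', value), ('unofficial_postcode', raw) or None."""
    pattern = PATTERNS.get(country)
    if pattern is None:
        # No pattern known for the country: keep the value as an
        # unofficial postcode unless conversion is disabled.
        return ('unofficial_postcode', raw) if convert_to_address else None
    # Strip an optional country-code prefix such as "DE-" or "DE ".
    candidate = raw.strip()
    if candidate.upper().startswith(country.upper()):
        candidate = candidate[len(country):].lstrip(' -')
    if pattern.fullmatch(candidate) is None:
        return None
    return ('postcode', candidate)
```

The three outcomes mirror the test groups above: well-formed values are kept as `postcode`, malformed ones are dropped, and values for countries without a pattern fall back to `unofficial_postcode` (or are dropped when `convert_to_address` is off).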
@@ -72,7 +72,8 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,

     def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
                      variants=('~gasse -> gasse', 'street => st', ),
-                     sanitizers=[], with_housenumber=False):
+                     sanitizers=[], with_housenumber=False,
+                     with_postcode=False):
         cfgstr = {'normalization': list(norm),
                   'sanitizers': sanitizers,
                   'transliteration': list(trans),
@@ -81,6 +82,9 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
         if with_housenumber:
             cfgstr['token-analysis'].append({'id': '@housenumber',
                                              'analyzer': 'housenumbers'})
+        if with_postcode:
+            cfgstr['token-analysis'].append({'id': '@postcode',
+                                             'analyzer': 'postcodes'})
         (test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(cfgstr))
         tok.loader = nominatim.tokenizer.icu_rule_loader.ICURuleLoader(test_config)
@@ -246,28 +250,69 @@ def test_normalize_postcode(analyzer):
         anl.normalize_postcode('38 Б') == '38 Б'


-def test_update_postcodes_from_db_empty(analyzer, table_factory, word_table):
-    table_factory('location_postcode', 'postcode TEXT',
-                  content=(('1234',), ('12 34',), ('AB23',), ('1234',)))
-
-    with analyzer() as anl:
-        anl.update_postcodes_from_db()
-
-    assert word_table.count() == 3
-    assert word_table.get_postcodes() == {'1234', '12 34', 'AB23'}
-
-
-def test_update_postcodes_from_db_add_and_remove(analyzer, table_factory, word_table):
-    table_factory('location_postcode', 'postcode TEXT',
-                  content=(('1234',), ('45BC', ), ('XX45', )))
-    word_table.add_postcode(' 1234', '1234')
-    word_table.add_postcode(' 5678', '5678')
-
-    with analyzer() as anl:
-        anl.update_postcodes_from_db()
-
-    assert word_table.count() == 3
-    assert word_table.get_postcodes() == {'1234', '45BC', 'XX45'}
+class TestPostcodes:
+
+    @pytest.fixture(autouse=True)
+    def setup(self, analyzer, sql_functions):
+        sanitizers = [{'step': 'clean-postcodes'}]
+        with analyzer(sanitizers=sanitizers, with_postcode=True) as anl:
+            self.analyzer = anl
+            yield anl
+
+
+    def process_postcode(self, cc, postcode):
+        return self.analyzer.process_place(PlaceInfo({'country_code': cc,
+                                                      'address': {'postcode': postcode}}))
+
+
+    def test_update_postcodes_from_db_empty(self, table_factory, word_table):
+        table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
+                      content=(('de', '12345'), ('se', '132 34'),
+                               ('bm', 'AB23'), ('fr', '12345')))
+
+        self.analyzer.update_postcodes_from_db()
+
+        assert word_table.count() == 5
+        assert word_table.get_postcodes() == {'12345', '132 34@132 34', 'AB 23@AB 23'}
+
+
+    def test_update_postcodes_from_db_ambigious(self, table_factory, word_table):
+        table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
+                      content=(('in', '123456'), ('sg', '123456')))
+
+        self.analyzer.update_postcodes_from_db()
+
+        assert word_table.count() == 3
+        assert word_table.get_postcodes() == {'123456', '123456@123 456'}
+
+
+    def test_update_postcodes_from_db_add_and_remove(self, table_factory, word_table):
+        table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
+                      content=(('ch', '1234'), ('bm', 'BC 45'), ('bm', 'XX45')))
+        word_table.add_postcode(' 1234', '1234')
+        word_table.add_postcode(' 5678', '5678')
+
+        self.analyzer.update_postcodes_from_db()
+
+        assert word_table.count() == 5
+        assert word_table.get_postcodes() == {'1234', 'BC 45@BC 45', 'XX 45@XX 45'}
+
+
+    def test_process_place_postcode_simple(self, word_table):
+        info = self.process_postcode('de', '12345')
+
+        assert info['postcode'] == '12345'
+
+        assert word_table.get_postcodes() == {'12345', }
+
+
+    def test_process_place_postcode_with_space(self, word_table):
+        info = self.process_postcode('in', '123 567')
+
+        assert info['postcode'] == '123567'
+
+        assert word_table.get_postcodes() == {'123567@123 567', }
+
+
 def test_update_special_phrase_empty_table(analyzer, word_table):
@@ -437,13 +482,6 @@ class TestPlaceAddress:
         assert word_table.get_postcodes() == {pcode, }


-    @pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
-    def test_process_place_bad_postcode(self, word_table, pcode):
-        self.process_address(postcode=pcode)
-
-        assert not word_table.get_postcodes()
-
-
     @pytest.mark.parametrize('hnr', ['123a', '1', '101'])
     def test_process_place_housenumbers_simple(self, hnr, getorcreate_hnr_id):
         info = self.process_address(housenumber=hnr)
@@ -0,0 +1,60 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for special postcode analysis and variant generation.
"""
import pytest

from icu import Transliterator

import nominatim.tokenizer.token_analysis.postcodes as module
from nominatim.errors import UsageError

DEFAULT_NORMALIZATION = """ :: NFD ();
                            '🜳' > ' ';
                            [[:Nonspacing Mark:] [:Cf:]] >;
                            :: lower ();
                            [[:Punctuation:][:Space:]]+ > ' ';
                            :: NFC ();
                        """

DEFAULT_TRANSLITERATION = """ :: Latin ();
                              '🜵' > ' ';
                          """

@pytest.fixture
def analyser():
    rules = { 'analyzer': 'postcodes'}
    config = module.configure(rules, DEFAULT_NORMALIZATION)

    trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
    norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)

    return module.create(norm, trans, config)


def get_normalized_variants(proc, name):
    norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
    return proc.get_variants_ascii(norm.transliterate(name).strip())


@pytest.mark.parametrize('name,norm', [('12', '12'),
                                       ('A 34 ', 'A 34'),
                                       ('34-av', '34-AV')])
def test_normalize(analyser, name, norm):
    assert analyser.normalize(name) == norm


@pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}),
                                               ('AB-998', {'ab 998', 'ab998'}),
                                               ('23 FGH D3', {'23 fgh d3', '23fgh d3',
                                                              '23 fghd3', '23fghd3'})])
def test_get_variants_ascii(analyser, postcode, variants):
    out = analyser.get_variants_ascii(postcode)

    assert len(out) == len(set(out))
    assert set(out) == variants
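The variant semantics asserted above — after normalization, each internal space may independently be kept or dropped — can be reproduced with a standalone helper. This is a hypothetical sketch, not the module under test; it only handles spaces and assumes punctuation such as `-` has already been normalized to a space.

```python
from itertools import product

def postcode_variants(postcode):
    """Generate all lower-case variants where each space is optional."""
    parts = postcode.lower().split(' ')
    variants = set()
    # For each gap between parts, independently choose ' ' or ''.
    for seps in product((' ', ''), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(seps, parts[1:]):
            pieces.append(sep + part)
        variants.add(''.join(pieces))
    return variants
```

A postcode with two internal spaces therefore yields 2² = 4 variants, matching the "23 FGH D3" case in the parametrization.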
@@ -11,7 +11,7 @@ import subprocess

 import pytest

-from nominatim.tools import postcodes
+from nominatim.tools import postcodes, country_info
 import dummy_tokenizer

 class MockPostcodeTable:
@@ -64,11 +64,26 @@ class MockPostcodeTable:
 def tokenizer():
     return dummy_tokenizer.DummyTokenizer(None, None)


 @pytest.fixture
-def postcode_table(temp_db_conn, placex_table):
+def postcode_table(def_config, temp_db_conn, placex_table):
+    country_info.setup_country_config(def_config)
     return MockPostcodeTable(temp_db_conn)


+@pytest.fixture
+def insert_implicit_postcode(placex_table, place_row):
+    """
+        Inserts data into the placex and place table
+        which can then be used to compute one postcode.
+    """
+    def _insert_implicit_postcode(osm_id, country, geometry, address):
+        placex_table.add(osm_id=osm_id, country=country, geom=geometry)
+        place_row(osm_id=osm_id, geom='SRID=4326;'+geometry, address=address)
+
+    return _insert_implicit_postcode
+
+
 def test_postcodes_empty(dsn, postcode_table, place_table,
                          tmp_path, tokenizer):
     postcodes.update_postcodes(dsn, tmp_path, tokenizer)
@@ -193,7 +208,22 @@ def test_can_compute(dsn, table_factory):
     table_factory('place')
     assert postcodes.can_compute(dsn)


 def test_no_placex_entry(dsn, tmp_path, temp_db_cursor, place_row, postcode_table, tokenizer):
+    # Rewrite the get_country_code function to verify its execution.
+    temp_db_cursor.execute("""
+        CREATE OR REPLACE FUNCTION get_country_code(place geometry)
+        RETURNS TEXT AS $$ BEGIN
+        RETURN 'yy';
+        END; $$ LANGUAGE plpgsql;
+        """)
+    place_row(geom='SRID=4326;POINT(10 12)', address=dict(postcode='AB 4511'))
+    postcodes.update_postcodes(dsn, tmp_path, tokenizer)
+
+    assert postcode_table.row_set == {('yy', 'AB 4511', 10, 12)}
+
+
+def test_discard_badly_formatted_postcodes(dsn, tmp_path, temp_db_cursor, place_row, postcode_table, tokenizer):
     # Rewrite the get_country_code function to verify its execution.
     temp_db_cursor.execute("""
         CREATE OR REPLACE FUNCTION get_country_code(place geometry)
@@ -204,16 +234,4 @@ def test_no_placex_entry(dsn, tmp_path, temp_db_cursor, place_row, postcode_tabl
     place_row(geom='SRID=4326;POINT(10 12)', address=dict(postcode='AB 4511'))
     postcodes.update_postcodes(dsn, tmp_path, tokenizer)

-    assert postcode_table.row_set == {('fr', 'AB 4511', 10, 12)}
-
-
-@pytest.fixture
-def insert_implicit_postcode(placex_table, place_row):
-    """
-        Inserts data into the placex and place table
-        which can then be used to compute one postcode.
-    """
-    def _insert_implicit_postcode(osm_id, country, geometry, address):
-        placex_table.add(osm_id=osm_id, country=country, geom=geometry)
-        place_row(osm_id=osm_id, geom='SRID=4326;'+geometry, address=address)
-
-    return _insert_implicit_postcode
+    assert not postcode_table.row_set
56  test/python/utils/test_centroid.py  Normal file
@@ -0,0 +1,56 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for centroid computation.
"""
import pytest

from nominatim.utils.centroid import PointsCentroid

def test_empty_set():
    c = PointsCentroid()

    with pytest.raises(ValueError, match='No points'):
        c.centroid()


@pytest.mark.parametrize("centroid", [(0,0), (-1, 3), [0.0000032, 88.4938]])
def test_one_point_centroid(centroid):
    c = PointsCentroid()

    c += centroid

    assert len(c.centroid()) == 2
    assert c.centroid() == (pytest.approx(centroid[0]), pytest.approx(centroid[1]))


def test_multipoint_centroid():
    c = PointsCentroid()

    c += (20.0, -10.0)
    assert c.centroid() == (pytest.approx(20.0), pytest.approx(-10.0))
    c += (20.2, -9.0)
    assert c.centroid() == (pytest.approx(20.1), pytest.approx(-9.5))
    c += (20.2, -9.0)
    assert c.centroid() == (pytest.approx(20.13333), pytest.approx(-9.333333))


def test_manypoint_centroid():
    c = PointsCentroid()

    for _ in range(10000):
        c += (4.564732, -0.000034)

    assert c.centroid() == (pytest.approx(4.564732), pytest.approx(-0.000034))


@pytest.mark.parametrize("param", ["aa", None, 5, [1, 2, 3], (3, None), ("a", 3.9)])
def test_add_non_tuple(param):
    c = PointsCentroid()

    with pytest.raises(ValueError, match='2-element tuples'):
        c += param
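A minimal implementation matching the behaviour these tests describe can be sketched as follows. This is an illustrative stand-in, not Nominatim's class; the real `PointsCentroid` may accumulate differently (for example on a scaled integer grid to bound floating-point drift).

```python
class PointsCentroid:
    """Minimal sketch of an incremental 2D centroid accumulator."""

    def __init__(self):
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.count = 0

    def __iadd__(self, point):
        # Accept only 2-element tuples/lists of numbers, as the tests expect.
        if not isinstance(point, (tuple, list)) or len(point) != 2 \
           or not all(isinstance(c, (int, float)) for c in point):
            raise ValueError("centroid only takes 2-element tuples of numbers")
        self.sum_x += point[0]
        self.sum_y += point[1]
        self.count += 1
        return self

    def centroid(self):
        if self.count == 0:
            raise ValueError("No points available for centroid.")
        return (self.sum_x / self.count, self.sum_y / self.count)
```

Overloading `__iadd__` keeps the call site (`c += (x, y)`) cheap: only two running sums and a counter are stored, never the points themselves.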