Linked places may bring in extra names. These names need to be
processed by the tokenizer, which means that the linking must
happen before the data is handed to the tokenizer. Move the lookup
of the linked place into the preparation stage and update the name
fields there. Everything else is still done in the indexing stage.
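A minimal sketch of the name merging that now happens in the
preparation stage; the function name and data layout are illustrative,
not the actual Nominatim code:

    def merge_linked_names(place_names, linked_names):
        """Merge the names of the linked place into the place's own
        name dictionary before the tokenizer runs. The place's own
        names take precedence."""
        merged = dict(linked_names)
        merged.update(place_names)
        return merged

For example, merging {'name:en': 'Rome'} from the linked place into
{'name': 'Roma'} gives the tokenizer both keys to work with.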
The new icu tokenizer is no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in its
name.
PostgreSQL is very bad at creating statistics for jsonb
columns. The result is that the query planner tends to
use JIT for queries with a WHERE clause over 'info' even when
there is an index.
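One way to keep the planner from JIT-compiling such queries is to
switch JIT off for the affected connection. The SET command is
standard PostgreSQL; the psycopg2 wiring around it is only an
illustration:

    import psycopg2

    conn = psycopg2.connect(dbname='nominatim')
    with conn.cursor() as cur:
        # Disable JIT for this session so the misjudged cost of the
        # jsonb lookup cannot trigger expensive JIT compilation.
        cur.execute('SET jit = off')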
The BDD tests can no longer make assumptions about the structure of
the word table because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.
Compound decomposition now creates a full name variant on
import, just like abbreviations do. This simplifies query-time
normalization and opens a path for changing the abbreviation
and compound decomposition lists for an existing database.
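A simplified sketch of the variant expansion; the rule format and
function name are invented for illustration:

    def expand_variants(name, replacements):
        """Apply abbreviation and decomposition replacements to a
        normalized name and return all resulting full-name variants."""
        variants = {name}
        for src, targets in replacements.items():
            new_variants = set()
            for variant in variants:
                if src in variant:
                    for target in targets:
                        new_variants.add(variant.replace(src, target))
            variants |= new_variants
        return variants

With replacements like {'strasse': ['strasse', 'str', ' strasse']} the
name 'hauptstrasse' yields the full variants 'hauptstrasse', 'hauptstr'
and 'haupt strasse', all of which are now created at import time.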
We previously added searching through rank 30 in a house
number search to enable searches for house number + name.
This had the unintended side effect that rank 30 objects
are also returned in a search that dropped the house number
from the query. This is wrong because POIs cannot function
as the parent of a house number.
This fix drops all rank 30 objects from the results for a
house number search if they do not match the requested house
number.
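The fix boils down to a filter like the following sketch (the field
names are illustrative):

    def filter_housenumber_results(results, requested_hnr):
        """Drop rank 30 objects (POIs) that do not carry the requested
        house number; they cannot act as the parent of a house number
        and therefore must not appear as fallback results."""
        return [r for r in results
                if r['rank_address'] < 30
                or r['housenumber'] == requested_hnr]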
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
  (but not the part after the bracket; see the sketch below)
Fixes #244.
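One possible reading of these rules, as a sketch only and not the
actual tokenizer code:

    import re

    def full_words(name):
        """Full words are the comma- or semicolon-separated parts of
        the text before an opening bracket; the part after the bracket
        is not turned into a full word."""
        before_bracket = name.split('(', 1)[0]
        return [w.strip() for w in re.split('[,;]', before_bracket)
                if w.strip()]

    def partial_words(name):
        """Partial words are single tokens only and never contain
        internal spaces."""
        return [p for w in full_words(name) for p in w.split()]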
The ICU library only offers transliterations for a limited set of
scripts. Add transliterations for the missing scripts from the
PostgreSQL module. This means that the same selection of scripts is
supported as with the old module.
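With PyICU, additional hand-written rules can be chained after the
stock ICU transforms roughly like this; the rule shown is only a
placeholder for the rules taken over from the PostgreSQL module:

    from icu import Transliterator, UTransDirection

    # Placeholder rule; the real rule sets cover the scripts that ICU
    # does not transliterate out of the box.
    EXTRA_RULES = '\u0259 > e;'

    base = Transliterator.createInstance('Any-Latin; Latin-ASCII',
                                         UTransDirection.FORWARD)
    extra = Transliterator.createFromRules('ExtraScripts', EXTRA_RULES,
                                           UTransDirection.FORWARD)

    def transliterate(text):
        return extra.transliterate(base.transliterate(text))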
The tokenizer to be used can be chosen with -DTOKENIZER.
Adapt all tests so that they work with the legacy_icu tokenizer.
Move the lookup in the word table to a function in the tokenizer.
Special phrases are temporarily imported from the wiki until
we have an implementation that can import them from a file. TIGER
tests do not work yet.
This adds an installation step for the tokenizer's PHP code. The
PHP code is split into two parts: the updateable code is found in
lib-php, while the tokenizer installs an additional script in the
project directory which includes the code from lib-php and
defines all settings that are static for the database. The website
code then always includes the PHP from the project directory.
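Conceptually the installation step writes a small stub into the
project directory, roughly like this sketch (file names, constants
and paths are illustrative):

    from pathlib import Path

    PHP_STUB = """<?php
    // Settings that are static for this database.
    @define('CONST_Max_Word_Frequency', {max_word_freq});
    // Pull in the updateable tokenizer code from lib-php.
    require_once('{lib_php_dir}/tokenizer/legacy_icu_tokenizer.php');
    """

    def install_php_stub(project_dir, lib_php_dir, max_word_freq):
        stub = Path(project_dir) / 'tokenizer.php'
        stub.write_text(PHP_STUB.format(max_word_freq=max_word_freq,
                                        lib_php_dir=lib_php_dir))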
Indexing is now split into three parts: first comes a preparation
step that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE, which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation, which is still done inside the update trigger in
the database.
The second processing step doesn't do anything useful yet.
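As a sketch, with table and column names simplified and a
hypothetical tokenizer call, the loop now looks roughly like this:

    from psycopg2.extras import Json

    def index_places(conn, rank, tokenizer):
        with conn.cursor() as cur:
            # Step 1: preparation - collect the places that need indexing.
            cur.execute("""SELECT place_id, name FROM placex
                           WHERE indexed_status > 0 AND rank_address = %s""",
                        (rank,))
            places = cur.fetchall()
            for place_id, name in places:
                # Step 2: transform the data in Python; currently this
                # only computes the token information for the names.
                token_info = tokenizer.process_place(name)
                # Step 3: write back. The UPDATE sets indexed_status
                # together with the new fields and fires the trigger
                # that still computes the address inside the database.
                cur.execute("""UPDATE placex
                               SET indexed_status = 0, token_info = %s
                               WHERE place_id = %s""",
                            (Json(token_info), place_id))
        conn.commit()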
Drops all calls to PHP utility functions. nominatim cli functions
are used where possible, so that the tests stay as close to the
final code as possible.
With the PHP calls removed, the test code now only uses osm2pgsql
and the database module from the build directory.
Always look up the closest housenumber before looking up
interpolations. This ensures that closer housenumbers are
preferred over interpolations.
Fixes #2214.
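The effect can be illustrated with a sketch over candidate lists of
(distance, place) pairs:

    def pick_result(housenumbers, interpolations):
        """Prefer the closest real house number; an interpolation only
        wins if it is strictly closer."""
        house = min(housenumbers, key=lambda c: c[0], default=None)
        interp = min(interpolations, key=lambda c: c[0], default=None)
        if house is None:
            return interp
        if interp is not None and interp[0] < house[0]:
            return interp
        return house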
So far the data directory constant has pointed to the source
directory so that it could be used with different subdirectories.
Now only the data subdirectory itself is used with the constant,
so point it at that directory directly.
This replaces {data_dir}/settings throughout the code, so that
the configuration may be placed somewhere else in the directory
structure (e.g. in /etc).
The PHP scripts need to know the location of the nominatim
tool in order to call it. This is handed in as an environment
variable, so it can be set by the Python script.
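A sketch of the hand-over; the variable name and paths are
illustrative only:

    import os
    import subprocess

    # The Python side exports the location of the nominatim tool before
    # invoking the PHP script, which reads it via getenv().
    env = dict(os.environ, NOMINATIM_TOOL='/usr/local/bin/nominatim')
    subprocess.run(['php', 'website/search.php'], env=env, check=True)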
In the future, the BDD tests will simply set up the required
test database themselves. Like the template database, it
is not reimported when it already exists unless that is explicitly
forced.
This currently makes most of the API tests fail because they still
point to old test data.