Sarah Hoffmann
70f154be8b
switch word tokens to new word table layout
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
4342b28882
switch special phrases to new word table format
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
5394b1fa1b
switch postcode tokens to new word table layout
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
5ab0a63fd6
switch housenumber tokens to new word table layout
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
1618aba5f2
switch country name tokens to new word table layout
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
8377528952
new word table layout for icu tokenizer
...
The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
e42349c963
replace add-data function with native Python code
2021-07-26 10:41:37 +02:00
Sarah Hoffmann
878835e4bd
move add-data subcommand into a separate file
2021-07-25 18:14:12 +02:00
Sarah Hoffmann
2c8242c8df
remove special code for pre-9.5 PostgreSQL
...
9.5 is now the minimum requirement.
2021-07-19 10:24:57 +02:00
Sarah Hoffmann
e7d6f89aca
increase minimum version for PostgreSQL to 9.5
...
This is the minimum version we can test with the CI.
With 9.5 there is also complete support for jsonb available.
2021-07-19 10:21:19 +02:00
Sarah Hoffmann
14f777da18
use psycopg's SQL quoting where possible
...
Use the SQL formatting supplied with psycopg whenever the
query needs to be put together from snippets.
2021-07-12 22:05:22 +02:00
Sarah Hoffmann
6f6681ce67
add helper function for execute_values
...
Make psycopg2's convenience function accessible through
the cursor.
2021-07-12 21:08:20 +02:00
Sarah Hoffmann
06602b4ec0
provide wrapper function for DROP TABLE
...
Use psycopg2 formatting to ensure correct quoting.
2021-07-12 20:32:46 +02:00
Sarah Hoffmann
cf98cff2a1
more formatting fixes
...
Found by flake8.
2021-07-12 17:45:42 +02:00
Sarah Hoffmann
f8b5a63de3
factor out connection reset code
2021-07-12 14:58:44 +02:00
Sarah Hoffmann
568316f07c
simplify analyse function
2021-07-12 14:47:50 +02:00
Sarah Hoffmann
daa597b300
split up variant computation for better readability
2021-07-12 14:43:50 +02:00
Sarah Hoffmann
47adb2a3fc
reorganise process_place function
...
Move address processing into its own function as it is
rather extensive.
2021-07-12 11:57:55 +02:00
Sarah Hoffmann
fff0012249
simplify website setup code
...
Use format strings and move the variable quoting code into a
separate function.
2021-07-12 11:41:05 +02:00
Sarah Hoffmann
d5a1883b62
avoid repeated patterns for table name
2021-07-12 11:33:09 +02:00
Sarah Hoffmann
a08ef43e40
simplify if statements
2021-07-12 11:28:47 +02:00
Sarah Hoffmann
3661f7a321
avoid multiple returns of same value
...
Found by Sonarqube.
2021-07-11 18:23:42 +02:00
Sarah Hoffmann
a2edbbf78a
cannot use capture_output in subprocess.run
...
Only available since Python 3.7.
2021-07-06 22:57:42 +02:00
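On interpreters older than Python 3.7, the effect of capture_output can be reproduced by passing both pipes explicitly. The following is a minimal sketch of that approach; the helper name run_captured is hypothetical and not taken from the code base.

```python
import subprocess
import sys


def run_captured(cmd):
    """Run cmd and capture stdout and stderr without capture_output."""
    # Passing PIPE for both streams is what capture_output=True does
    # on Python 3.7 and later.
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, check=False)


# Usage: run the current interpreter with a one-line script.
result = run_captured([sys.executable, '-c', 'print("ok")'])
```

The captured output is available as bytes in result.stdout and result.stderr.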
Sarah Hoffmann
1e86dc1d93
remove default parameter for namedtuple
...
This is only available from Python 3.7 onwards.
2021-07-06 22:57:42 +02:00
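The commit above simply drops the parameter. Where defaults are still wanted on older interpreters, one common workaround is to set them on the generated constructor. This is a sketch only; the Variant type and its fields are hypothetical and exist purely for illustration.

```python
from collections import namedtuple

# Hypothetical tuple type, not taken from the code base.
Variant = namedtuple('Variant', ['source', 'replacement', 'properties'])

# Defaults are applied to the rightmost fields, so only 'properties'
# becomes optional. This also works on Python versions before 3.7.
Variant.__new__.__defaults__ = (None,)

v = Variant('street', 'st')
```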
Sarah Hoffmann
62d5984b1b
limit the number of variants that can be produced
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
c32551b4e0
restrict partial word counting to names of reasonable length
...
The partial word count does not split names to save a bit of time.
As a result, it might encounter unreasonably long names which in
truth consist of multiple words. No accurate statistics are needed,
so simply restrict the count to words shorter than 75 characters.
2021-07-04 10:28:28 +02:00
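The length restriction described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function name count_partials and the in-memory Counter are hypothetical, and only the limit of 75 characters comes from the commit message.

```python
from collections import Counter

MAX_WORD_LENGTH = 75  # limit taken from the commit message


def count_partials(terms):
    """Count how often each term occurs, ignoring overlong terms."""
    counts = Counter()
    for term in terms:
        # Overlong terms are most likely unsplit multi-word names.
        if len(term) < MAX_WORD_LENGTH:
            counts[term] += 1
    return counts


counts = count_partials(['street', 'street', 'road', 'x' * 80])
```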
Sarah Hoffmann
e85f7e7aa9
fix subsequent replacements
...
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.
Also forbid composing a word after a space was added at the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
7b0f6b7905
leave ICU variant properties empty for now
...
Saving unused properties causes unnecessary duplicates.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
b9fbfeff67
only consider partials in multi-words for initial count
...
This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
62828fc5c1
switch to a more flexible variant description format
...
The new format combines compound splitting and abbreviation.
It also allows rules to be restricted by additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a6aa6360e0
use yaml tag syntax to mark include files
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
f70930b1a0
make compound decomposition a pure import feature
...
Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compound decomposition lists for an existing database.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
9ff4f66f55
complete tests for icu tokenizer
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
32ca631b74
fix full term token in special phrases
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e81084f35
complete tests for rule loader
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a0a7b05c9f
correctly quote strings when copying in data
...
Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
2021-07-04 10:28:20 +02:00
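A class of the kind described above might look like the following sketch. The class name CopyLines and its methods are hypothetical; the escaping follows the PostgreSQL COPY text format, where columns are tab-separated, NULL is written as \N and backslash, tab, newline and carriage return are backslash-escaped.

```python
import io


class CopyLines:
    """Collect rows in PostgreSQL COPY text format (hypothetical helper)."""

    def __init__(self):
        self.buffer = io.StringIO()

    @staticmethod
    def quote(value):
        # NULL is written as \N in COPY text format.
        if value is None:
            return '\\N'
        # Escape the backslash first so later escapes are not doubled.
        return (str(value).replace('\\', '\\\\')
                          .replace('\t', '\\t')
                          .replace('\n', '\\n')
                          .replace('\r', '\\r'))

    def add(self, *columns):
        """Append one row, quoting every column."""
        self.buffer.write('\t'.join(self.quote(c) for c in columns))
        self.buffer.write('\n')

    def getvalue(self):
        return self.buffer.getvalue()


buf = CopyLines()
buf.add('a\tb', None, 42)
```

Keeping the quoting inside the class means callers cannot forget to escape a column before it reaches the copy stream.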
Sarah Hoffmann
2f6e4edcdb
update unit tests for adapted abbreviation code
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b
adapt tests for ICU tokenizer
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249
move abbreviation computation into import phase
...
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.
Adds a new dependency on the Python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
6ba00e6aee
icu tokenizer: move transliteration rules into a separate file
...
The tokenizer configuration has become difficult to handle
due to the additional manual transliteration rules. Allow
a separate rule file that is passed to the ICU library
as is.
2021-07-04 10:28:20 +02:00
AntoJvlt
3676310efe
Improved performance of the postcodes query and some code cleaning
2021-06-12 15:46:08 +02:00
AntoJvlt
1c175e3a67
Clean and update tests for postcodes
2021-06-09 09:31:32 +02:00
AntoJvlt
47fb7cd3a8
Use place_exists() in can_compute() for postcodes
2021-06-09 09:31:32 +02:00
AntoJvlt
a4733eed90
Use place instead of placex to compute postcodes
2021-06-09 09:31:32 +02:00
Sarah Hoffmann
bc981d0261
fix insertion of special terms and countries into word table
...
Special terms need to be prefixed by a space because they are
full terms.
For countries avoid duplicate entries of word tokens.
Adds tests for adding country terms.
2021-06-02 20:22:39 +02:00
Sarah Hoffmann
72625dc72a
call freeze after running a non-updateable import
...
Some of the tables will have already been removed but
the tables for indexing are still there and should be
dropped.
2021-06-02 11:08:48 +02:00
Sarah Hoffmann
cc2f152d70
commit changes to replication log table
...
Fixes #2350.
2021-05-26 11:47:08 +02:00
Sarah Hoffmann
a0e85cc17c
only initialise tokenizer for refresh functions where needed
...
Fixes #2347.
2021-05-25 19:16:22 +02:00
Sarah Hoffmann
24c986c842
add tests for new full name computation with ICU
2021-05-24 10:41:42 +02:00
Sarah Hoffmann
4f4d15c28a
reorganize keyword creation for legacy tokenizer
...
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
(but not the part after the bracket)
Fixes #244.
2021-05-24 10:41:42 +02:00
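The three rules listed in the commit above can be sketched in Python as follows. This is an illustration only and not the legacy tokenizer's code: the function name compute_keywords is hypothetical, normalization is omitted, and stripping brackets from partial words is an extra assumption.

```python
import re


def compute_keywords(name):
    """Return (full_words, partial_words) for a name; illustrative only."""
    full_words = set()
    # Comma and semicolon separate full words.
    for part in re.split(r'[,;]', name):
        part = part.strip()
        if not part:
            continue
        full_words.add(part)
        # The text before an opening bracket is a full word of its own,
        # the text after the bracket is not.
        if '(' in part:
            head = part.split('(', 1)[0].strip()
            if head:
                full_words.add(head)
    # Partial words never contain spaces (brackets dropped by assumption).
    partial_words = {word for full in full_words
                     for word in re.findall(r'[^\s()]+', full)}
    return full_words, partial_words


full, partial = compute_keywords('Main Street (North); High Road')
```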