Sarah Hoffmann
62d5984b1b
limit the number of variants that can be produced
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
e85f7e7aa9
fix subsequent replacements
...
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.
Also forbit composing a word after a space was added in the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
b9fbfeff67
only consider partials in multi-words for initial count
...
This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
62828fc5c1
switch to a more flexible variant description format
...
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a6aa6360e0
use yaml tag syntax to mark include files
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
0d80a9b897
tests for composing decomposed suffixes
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
f70930b1a0
make compund decomposition pure import feature
...
Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compund decomposition lists for an existing database.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
9ff4f66f55
complete tests for icu tokenizer
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e81084f35
complete tests for rule loader
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a0a7b05c9f
correctly quote strings when copying in data
...
Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2f6e4edcdb
update unit tests for adapted abbreviation code
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b
adapt tests for ICU tokenizer
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249
move abbreviation computation into import phase
...
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.
New dependency for python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
e7b4fc70e7
make sure old data gets deleted on place type change
...
When changing from some other place type to place=postcode
make sure that the old place type entry in the place table
is deleted.
2021-06-18 10:58:41 +02:00
Sarah Hoffmann
457982e1d2
update postcode in place if it already exists
2021-06-18 00:28:52 +02:00
Sarah Hoffmann
aa558e6080
Merge pull request #2369 from lonvia/exclude-poi-from-housenumber-search
...
Do not return POIs when dropping house number in query
2021-06-17 15:30:05 +02:00
Sarah Hoffmann
fe11d3cbbd
do not return POIs when dropping house number in query
...
We've previously added searching through rank 30 in a house
number search to enable searches for house number+name.
This had the unintended side effect that rank 30 objects
are also returned in s search that dropped the house number
from the query. This is wrong because POIs cannot function
as a parent to a house number.
This fix drops all rank 30 objects from the results for a
house number search if they do not match the requested house
number.
2021-06-17 14:21:20 +02:00
AntoJvlt
3676310efe
Improved performance of the postcodes query and some code cleaning
2021-06-12 15:46:08 +02:00
AntoJvlt
1c175e3a67
Clean and update tests for postcodes
2021-06-09 09:31:32 +02:00
AntoJvlt
e879814e43
Update tests for postcodes
2021-06-09 09:31:32 +02:00
Sarah Hoffmann
3aac51c81f
switch BDD tests to always use search API
2021-06-06 15:27:52 +02:00
Sarah Hoffmann
bc981d0261
fix insertion of special terms and countries into word table
...
Special terms need to be prefixed by a space because they are
full terms.
For countries avoid duplicate entries of word tokens.
Adds tests for adding country terms.
2021-06-02 20:22:39 +02:00
Sarah Hoffmann
24c986c842
add tests for new full name computation with ICU
2021-05-24 10:41:42 +02:00
Sarah Hoffmann
4f4d15c28a
reorganize keyword creation for legacy tokenizer
...
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
(but not the part after the bracket)
Fixes #244 .
2021-05-24 10:41:42 +02:00
Sarah Hoffmann
10143e0ac7
Merge pull request #2342 from lonvia/icu-tokenizer-ci
...
Add BDD tests with icu tokenizer to CI runs
2021-05-22 10:36:35 +02:00
Sarah Hoffmann
00094c43d1
enable Tiger BDD API test for legacy_icu
2021-05-21 22:39:56 +02:00
Sarah Hoffmann
430c316e45
test: fix linting errors
2021-05-19 23:07:39 +02:00
Sarah Hoffmann
01f5a9ff84
test: more use of table_factory
2021-05-19 17:37:03 +02:00
Sarah Hoffmann
af52eed0dd
test: avoid use of tempfile module
...
Use the tmp_path fixture instead which provides automatic
cleanup.
2021-05-19 16:43:26 +02:00
Sarah Hoffmann
f93d0fa957
test: use src_dir fixture instead of self-computed paths
2021-05-19 16:03:54 +02:00
Sarah Hoffmann
c06a1d007a
test: replace raw execute() with fixture code where possible
2021-05-19 12:11:04 +02:00
Sarah Hoffmann
65bd749918
test: use table_rows() and execute_values() where possible
...
Some uses of scalar() could also be replaced with convenience
functions from the word table mock.
2021-05-19 10:51:10 +02:00
Sarah Hoffmann
510eb53f53
test: move Testingcursor into separate class
...
Also adds more convenience functions: counting with a where
statement and a wrapper to execute_values().
2021-05-19 10:30:36 +02:00
Sarah Hoffmann
16bb007135
Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer
...
Do not hide errors when importing tokenizer
2021-05-18 23:00:10 +02:00
Sarah Hoffmann
b2722650d4
do not hide errors when importing tokenizer
...
Explicitly check for the tokenizer source file to check that
the name is correct. We can't use the import error for that
because it hides other import errors like a missing
library.
Fixes #2327 .
2021-05-18 16:28:21 +02:00
AntoJvlt
3206bf59df
Resolve conflicts
2021-05-17 13:52:35 +02:00
AntoJvlt
8b8dfc46eb
Added --no-replace command for special phrases importation and added corresponding tests
2021-05-17 13:25:06 +02:00
AntoJvlt
06aab389ed
Code cleaning and SPLoader deleted
2021-05-16 16:59:12 +02:00
AntoJvlt
fb0ebb5bf0
Add tests for the new SPWikiLoader and SPCsvLoader
2021-05-16 16:10:06 +02:00
Sarah Hoffmann
925726222f
Merge pull request #2323 from darkshredder/disable-search-reverse-only
...
Feat: Disabled search API for --reverse-only imports
2021-05-14 10:40:22 +02:00
Sarah Hoffmann
7d621389ee
adapt tests to new TIGER CSV format
2021-05-14 00:02:50 +02:00
Darkshredder
e5ffc59cd5
feat: Added reverse-only-search validation
2021-05-14 02:36:21 +05:30
Sarah Hoffmann
5feece64c1
use WorkerPool for Tiger data import
...
Requires adding an option that SQL errors are ignored.
2021-05-13 20:36:50 +02:00
Sarah Hoffmann
f5977dac75
ignore invalid coordinates in external postcodes
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
8f2746fe24
ignore entries without country code
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
1ccd4360b4
correctly handle removing all postcodes for country
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
bf864b2c54
index postcodes after refreshing
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
4abaf71234
add and extend tests for new postcode handling
2021-05-13 14:15:42 +02:00
AntoJvlt
9d83da830f
Introduction of SPCsvLoader to load special phrases from a csv file
2021-05-10 23:26:39 +02:00
AntoJvlt
00959fac57
Refactoring loading of external special phrases and importation process by introducing SPLoader and SPWikiLoader
2021-05-10 21:49:31 +02:00