Commit Graph

7 Commits

Author SHA1 Message Date
Sarah Hoffmann
62d5984b1b limit the number of variants that can be produced 2021-07-04 10:28:28 +02:00
Sarah Hoffmann
e85f7e7aa9 fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbit composing a word after a space was added in the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
62828fc5c1 switch to a more flexible variant description format
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
9ff4f66f55 complete tests for icu tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2f6e4edcdb update unit tests for adapted abbreviation code 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b adapt tests for ICU tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249 move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency for python library datrie.
2021-07-04 10:28:20 +02:00