Sarah Hoffmann
7ebd121abc
give word break slight advantage towards continuation
...
prefers longer words
2025-07-11 11:01:21 +02:00
Sarah Hoffmann
4634ad0720
rebalance word transition penalties
2025-07-11 11:01:21 +02:00
Sarah Hoffmann
4a9253a0a9
simplify QueryNode penalty and initial assignment
2025-07-11 11:01:09 +02:00
Sarah Hoffmann
2ef0e20a3f
reorganise token reranking
...
As the reranking is about changing penalties in the presence of other
tokens, change the data structure to have the other tokens readily
available.
2025-04-11 13:38:34 +02:00
Sarah Hoffmann
497e27bb9a
move partial token into a separate field in the query struct
...
Exactly one partial token is expected, and it is usually
present.
2025-04-11 08:57:34 +02:00
Sarah Hoffmann
2c61fe08a0
use word_token length when penalizing against postcodes
2025-03-19 09:52:40 +01:00
Sarah Hoffmann
7b3c725f2a
postcode token should have transliterated term in word_token
2025-03-19 09:52:40 +01:00
Sarah Hoffmann
921db8bb2f
cache all info of ICUQueryAnalyser in a single object
2025-03-04 08:58:57 +01:00
Sarah Hoffmann
e67ae701ac
show token begin and end in debug output
2025-03-04 08:57:59 +01:00
Sarah Hoffmann
fc1c6261ed
add postcode parser
2025-03-04 08:57:37 +01:00
Sarah Hoffmann
6759edfb5d
make word generation from query a class method
2025-03-04 08:57:37 +01:00
Sarah Hoffmann
e362a965e1
search: merge QueryPart array with QueryNodes
...
The basic information on terms is pretty much always used together
with the node information. Merging them together saves some
allocation while making lookup easier at the same time.
2025-03-04 08:57:37 +01:00
Sarah Hoffmann
31412e0674
replace TokenType enum with simple char constants
2025-02-21 10:23:41 +01:00
Sarah Hoffmann
4577669213
replace BreakType enum with simple char constants
2025-02-21 09:57:48 +01:00
Sarah Hoffmann
b56edf3d0a
avoid yielding when extracting words from query
2025-02-20 23:32:39 +01:00
Sarah Hoffmann
abc911079e
remove word_number counting for phrases
...
We can just examine the break types to know if we are dealing
with a partial token.
2025-02-20 17:36:50 +01:00
Sarah Hoffmann
55c3176957
strip normalisation results of normal and special spaces
2025-02-19 14:40:35 +01:00
Sarah Hoffmann
d984100e23
add inner word break penalty
2025-01-07 21:42:25 +01:00
Sarah Hoffmann
499110f549
add SOFT_PHRASE break and enable parsing
...
Also enables parsing of PART breaks.
2025-01-06 17:10:24 +01:00
Sarah Hoffmann
2b87c016db
generalize normalization step for search query
...
It is now possible to configure functions for changing the query
input before it is analysed by the tokenizer.
Code is a cleaned-up version of the implementation by @miku.
2024-12-13 14:31:08 +01:00
Sarah Hoffmann
1f07967787
fix style issue found by flake8
2024-11-10 22:47:14 +01:00
Sarah Hoffmann
a690605a96
remove support for unindexed tokens
...
This was a special feature of the legacy tokenizer, which would not
index very frequent tokens.
2024-09-22 10:39:10 +02:00
Sarah Hoffmann
4da4cbfe27
reduce from 3 to 2 packages
2024-06-28 09:13:22 +02:00
Sarah Hoffmann
6e89310a92
split code into submodules
2024-06-26 11:52:47 +02:00