docs: extend explanation of query phrase

Sarah Hoffmann
2021-08-16 09:57:01 +02:00
parent c4b8a3b768
commit 2e82a6ce03


@@ -50,7 +50,7 @@ tokenizer's internal token lists and creating a list of all token IDs for
 the specific place. This list is later needed in the PL/pgSQL part where the
 indexer needs to add the token IDs to the appropriate search tables. To be
 able to communicate the list between the Python part and the pl/pgSQL trigger,
-the placex table contains a special JSONB column `token_info` which is there
+the `placex` table contains a special JSONB column `token_info` which is there
 for the exclusive use of the tokenizer.

 The Python part of the tokenizer returns structured information about the
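The layout of `token_info` is private to each tokenizer, so the following is only a hypothetical sketch of how the Python side might serialise its token ID lists for hand-over to the PL/pgSQL trigger. The field names (`names`, `street`) and the helper function are invented for illustration and do not appear in Nominatim.

```python
import json

def build_token_info(name_token_ids, address_token_ids):
    """Serialise token ID lists for hand-over to the PL/pgSQL trigger.

    Hypothetical example: the real payload layout is an internal detail
    of each tokenizer implementation.
    """
    return json.dumps({
        "names": name_token_ids,      # token IDs derived from the place's names
        "street": address_token_ids,  # token IDs used for address matching
    })

print(build_token_info([1, 5, 23], [42]))
```

Because the column is JSONB and used exclusively by the tokenizer, the indexer itself never interprets this structure; it only passes it through to the trigger.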
@@ -67,12 +67,17 @@ consequently not create any special indexes on it.

 ### Querying

-The tokenizer is responsible for the initial parsing of the query. It needs
-to split the query into appropriate words and terms and match them against
-the saved tokens in the database. It then returns the list of possibly matching
-tokens and the list of possible splits to the query parser. The parser uses
-this information to compute all possible interpretations of the query and
-rank them accordingly.
+At query time, Nominatim builds up multiple _interpretations_ of the search
+query. Each of these interpretations is tried against the database in order
+of the likelihood with which they match the search query. The first
+interpretation that yields results wins.
+
+The interpretations are encapsulated in the `SearchDescription` class. An
+instance of this class is created by applying a sequence of _search tokens_
+to an initially empty `SearchDescription`. It is the responsibility of the
+tokenizer to parse the search query and derive all possible sequences of
+search tokens. To that end, the tokenizer needs to parse the search query
+and look up matching words in its own data structures.

 ## Tokenizer API
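The apply-tokens-to-an-empty-description idea can be sketched in Python. This is a deliberately reduced illustration, not Nominatim's actual `SearchDescription` class: the real one carries far more state (rankings, operator flags, address parts), and the token names and penalty values below are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchDescription:
    """Much simplified stand-in for Nominatim's SearchDescription."""
    tokens: List[str] = field(default_factory=list)
    penalty: float = 0.0  # lower penalty = more likely interpretation

    def apply(self, token: str, penalty: float) -> "SearchDescription":
        """Return a new description extended by one search token."""
        return SearchDescription(self.tokens + [token], self.penalty + penalty)

# Build interpretations by applying token sequences to an empty description,
# then try them against the database in order of likelihood (lowest penalty
# first); the first interpretation that yields results wins.
empty = SearchDescription()
interpretations = [
    empty.apply("name:berlin", 0.0),
    empty.apply("name:berlin", 0.0).apply("country:de", 0.1),
]
interpretations.sort(key=lambda s: s.penalty)
print([s.tokens for s in interpretations])
```

The design choice worth noting is that `apply` returns a new instance instead of mutating, which makes it cheap to branch one partial interpretation into several alternatives.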
@@ -301,6 +306,14 @@ public function extractTokensFromPhrases(array &$aPhrases) : TokenList

 Parse the given phrases, splitting them into word lists, and retrieve the
 matching tokens.

+The phrase array may take on two forms. In unstructured searches (using the
+`q=` parameter) the search query is split at the commas and the elements are
+put into an ordered list. For structured searches the phrase array is an
+associative array where the key designates the type of the term (street, city,
+county etc.). The tokenizer may ignore the phrase type at this stage of
+parsing. Matching the phrase type to the appropriate search token type is done
+later, when the SearchDescription is built.
+
 For each phrase in the list of phrases, the function must analyse the phrase
 string and then call `setWordSets()` to communicate the result of the analysis.
 A word set is a list of strings, where each string refers to a search token.
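The analysis step behind the word sets can be sketched as follows. This illustration (in Python, although this part of Nominatim is PHP) merely enumerates every way to partition a phrase's words into consecutive groups; a real tokenizer would additionally discard groupings whose strings do not match any known token in its data structures. The function name and sample phrase are invented for the example.

```python
from typing import List

def word_sets(words: List[str]) -> List[List[str]]:
    """Enumerate all partitions of a word list into consecutive groups.

    Each grouping is one candidate word set: a list of strings where each
    string would later be looked up as a search token.
    """
    if not words:
        return [[]]
    result = []
    for i in range(1, len(words) + 1):
        head = ' '.join(words[:i])          # first i words form one string
        for rest in word_sets(words[i:]):   # recurse on the remaining words
            result.append([head] + rest)
    return result

print(word_sets(['hauptstr', '134']))
# → [['hauptstr', '134'], ['hauptstr 134']]
```

Note that the number of partitions grows as 2^(n-1) in the number of words, which is why matching against known tokens early matters for longer phrases.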