mirror of
https://github.com/osm-search/Nominatim.git
synced 2026-02-15 19:07:58 +00:00
add documentation for PHP part of tokenizer
@@ -248,3 +248,65 @@ permanently. The indexer calls this function when all processing is done and
replaces the content of the `token_info` column with the returned value before
the trigger stores the information in the database. May return NULL if no
information should be stored permanently.

### PHP Tokenizer class

The PHP tokenizer class is instantiated once per request and is responsible for
analyzing the incoming query. Multiple requests may be in flight in
parallel.

The class is expected to be found under the name `\Nominatim\Tokenizer`. To
find the class, the PHP code includes the file `tokenizer/tokenizer.php` in the
project directory. This file must be created when the tokenizer is first set up
on import. The file should initialize any configuration variables by setting
PHP constants and then require the file with the actual implementation of the
tokenizer.

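To make the expected shape of `tokenizer/tokenizer.php` more concrete, here is a
minimal sketch that renders such a bootstrap file as text. The helper
`renderTokenizerBootstrap()`, the constant name
`CONST_Example_Max_Word_Frequency` and the implementation path are all
hypothetical and chosen for illustration only. How the file is actually written
during import is up to the tokenizer.

```php
<?php
// Sketch only: renders the text of a bootstrap file such as
// tokenizer/tokenizer.php. The function name, the constant names and the
// implementation path are hypothetical, not part of Nominatim.
function renderTokenizerBootstrap(array $aConstants, string $sImplementationFile): string
{
    $sOut = "<?php\n";

    // One define() per configuration value the tokenizer needs at query time.
    foreach ($aConstants as $sName => $mValue) {
        $sOut .= sprintf(
            "define(%s, %s);\n",
            var_export((string) $sName, true),
            var_export($mValue, true)
        );
    }

    // Finally pull in the file that declares the tokenizer class itself.
    $sOut .= sprintf("require_once(%s);\n", var_export($sImplementationFile, true));

    return $sOut;
}

$sBootstrap = renderTokenizerBootstrap(
    ['CONST_Example_Max_Word_Frequency' => 50000],
    '/path/to/example_tokenizer_impl.php'
);
echo $sBootstrap;
```

The rendered text contains one `define()` line per constant followed by a single
`require_once()` line, which is the order the paragraph above asks for:
configuration first, implementation second.
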
The tokenizer class must implement the following functions:

```php
public function __construct(object &$oDB)
```

The constructor of the class receives a database connection that can be used
to query persistent data in the database.

```php
public function checkStatus()
```

Check that the tokenizer can access its persistent data structures. If there
is an issue, throw an `\Exception`.

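The constructor and `checkStatus()` could fit together roughly as in the sketch
below. It assumes a database object that offers a `tableExists()` method and a
tokenizer whose persistent data lives in a table called `word`. Both are
assumptions made for this example, as are the class names `SketchDB` and
`SketchTokenizer`. A real implementation has to be named `\Nominatim\Tokenizer`.

```php
<?php
// Hypothetical stand-in for the database connection handed to the tokenizer.
class SketchDB
{
    private $aTables;

    public function __construct(array $aTables)
    {
        $this->aTables = $aTables;
    }

    public function tableExists(string $sTable): bool
    {
        return in_array($sTable, $this->aTables, true);
    }
}

// Hypothetical tokenizer skeleton showing only the two functions above.
class SketchTokenizer
{
    private $oDB;

    public function __construct(object &$oDB)
    {
        // Keep the connection for later lookups of persistent data.
        $this->oDB =& $oDB;
    }

    public function checkStatus()
    {
        // The table name 'word' is an assumption for this sketch.
        if (!$this->oDB->tableExists('word')) {
            throw new \Exception('Tokenizer data is not accessible: table "word" is missing.');
        }
    }
}

$oDB = new SketchDB(['word']);
$oTokenizer = new SketchTokenizer($oDB);
$oTokenizer->checkStatus(); // passes silently, the table is present
```

Because the constructor takes its argument by reference, the caller has to pass
a variable rather than an expression such as `new SketchDB([...])`.
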
```php
public function normalizeString(string $sTerm) : string
```

Normalize a string to a form that can be used for comparisons when reordering
results. Nominatim reweighs results by how well the final display string matches
the actual query. Before comparing result and query, names and query are
normalised with this function. The tokenizer can thus remove all properties
that should not be taken into account for reweighing, e.g. special characters
or case.

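As an illustration of the kind of normalization described above, the following
sketch lower-cases the term and replaces every run of characters that are
neither letters nor digits with a single space. These rules and the class name
`SketchNormalizer` are assumptions for this example. A real tokenizer is free to
apply entirely different rules.

```php
<?php
// Sketch only: SketchNormalizer is a hypothetical class, not part of Nominatim.
class SketchNormalizer
{
    public function normalizeString(string $sTerm): string
    {
        // Case should not influence the comparison. Fall back to the
        // ASCII-only strtolower() when the mbstring extension is missing.
        $sLower = function_exists('mb_strtolower')
            ? mb_strtolower($sTerm, 'UTF-8')
            : strtolower($sTerm);

        // Replace runs of anything that is not a letter or digit by one space.
        $sClean = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $sLower);

        // preg_replace() returns null on invalid UTF-8 input.
        return trim($sClean ?? '');
    }
}

$oNormalizer = new SketchNormalizer();
$sQuery = $oNormalizer->normalizeString('  Main St., BERLIN!');
$sName = $oNormalizer->normalizeString('main st berlin');
```

With these rules the query and the name normalize to the same string, so
punctuation and case no longer affect the reweighing.
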
```php
public function tokensForSpecialTerm(string $sTerm) : array
```

Return the list of special term tokens that match the given term.

```php
public function extractTokensFromPhrases(array &$aPhrases) : TokenList
```

Parse the given phrases, splitting them into word lists and retrieving the
matching tokens.

For each phrase in the list of phrases, the function must analyse the phrase
string and then call `setWordSets()` on the phrase to communicate the result of
the analysis. A word set is a list of strings, where each string refers to a
search token. A phrase may have multiple interpretations. Therefore a list of
word sets is usually attached to the phrase. The search tokens themselves are
returned by the function in an associative array, where the key corresponds to
the strings given in the word sets. The value is a list of search tokens. Thus
a single string in the list of word sets may refer to multiple search tokens.

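The relation between phrases, word sets and the returned tokens can be sketched
as follows. Everything here is a simplification under stated assumptions:
`SketchPhrase` is a hypothetical stand-in for the phrase objects, the tokens come
from an in-memory array passed as a second parameter instead of the database,
plain integers stand in for search tokens, and the function returns a plain
associative array rather than a `TokenList`.

```php
<?php
// Hypothetical stand-in for a phrase object with the setWordSets() hook.
class SketchPhrase
{
    private $sPhrase;
    private $aWordSets = [];

    public function __construct(string $sPhrase)
    {
        $this->sPhrase = $sPhrase;
    }

    public function getPhrase(): string
    {
        return $this->sPhrase;
    }

    public function setWordSets(array $aWordSets)
    {
        $this->aWordSets = $aWordSets;
    }

    public function getWordSets(): array
    {
        return $this->aWordSets;
    }
}

// Enumerate every way of grouping consecutive words into terms.
// Each grouping is one word set, i.e. one interpretation of the phrase.
function createWordSets(array $aWords): array
{
    if (count($aWords) === 0) {
        return [[]];
    }
    $aResult = [];
    for ($i = 1; $i <= count($aWords); $i++) {
        $sHead = implode(' ', array_slice($aWords, 0, $i));
        foreach (createWordSets(array_slice($aWords, $i)) as $aRest) {
            $aResult[] = array_merge([$sHead], $aRest);
        }
    }
    return $aResult;
}

// $aKnownTokens maps a term to its list of search tokens (assumed lookup).
function sketchExtractTokens(array &$aPhrases, array $aKnownTokens): array
{
    $aTokens = [];
    foreach ($aPhrases as $oPhrase) {
        $aWords = preg_split('/\s+/', $oPhrase->getPhrase(), -1, PREG_SPLIT_NO_EMPTY);
        $aWordSets = createWordSets($aWords);
        $oPhrase->setWordSets($aWordSets);

        // Collect tokens for every string that appears in some word set.
        foreach ($aWordSets as $aWordSet) {
            foreach ($aWordSet as $sWord) {
                if (isset($aKnownTokens[$sWord])) {
                    $aTokens[$sWord] = $aKnownTokens[$sWord];
                }
            }
        }
    }
    return $aTokens;
}

$aPhrases = [new SketchPhrase('new york')];
$aTokens = sketchExtractTokens($aPhrases, ['new york' => [101], 'york' => [7, 8]]);
```

In this sketch the phrase `new york` receives two word sets, one reading it as
the two terms `new` and `york` and one reading it as the single term `new york`.
The key `york` maps to two tokens, which shows how one string in a word set can
refer to several search tokens. The number of word sets grows exponentially with
the number of words, so an implementation would likely need to cap it.
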