Greek language analysis generates unexpected empty tokens
Closed, Resolved (Public)

Description

While looking into T192502 (which looks at empty tokens created by ICU folding), I discovered that the monolithic Greek analyzer generates some empty tokens, too, particularly for these words: εστάτο, εστερ, εστέρ, έστερ, έστέρ, εστέρα, εστέρας, εστέρες, εστέρησε, εστερία, εστερικό, εστερικού, εστερικών, εστέρο, εστέρος, εστέρων, ήσανε, ότερ, οτέρι, ότερι, οτερό, οτέρο.

As a result, searching for any of them finds the others. Some are related, but as far as I can tell, searching for εστάτο (estáto) should not return articles with Εστέρες (estéres) and Οτερό (oteró) in the title as top hits—yet that's what happens!
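The conflation can be illustrated with a toy inverted index. This is only a sketch: `toy_analyze` is a hypothetical stand-in that returns an empty token for a few of the words listed above, not the real Greek stemmer, and the titles are made up for the example.

```python
import unicodedata
from collections import defaultdict

# A few of the words reported above as producing an empty token.
EMPTY_TOKEN_WORDS = {"εστάτο", "εστέρες", "οτερό"}


def toy_analyze(word):
    """Hypothetical stand-in for the Greek analysis chain (NOT the real stemmer)."""
    word = unicodedata.normalize("NFC", word).lower()
    return "" if word in EMPTY_TOKEN_WORDS else word


def build_index(titles):
    """Map each analyzed token to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            index[toy_analyze(word)].add(title)
    return index


def search(index, query):
    """Look up the analyzed form of the query in the index."""
    return index.get(toy_analyze(query), set())


titles = ["Εστέρες", "Οτερό", "Αθήνα"]
index = build_index(titles)

# All three problem words share the empty token "", so they match each other.
hits = search(index, "εστάτο")
```

Because every affected word is indexed under the same empty string, a query for any one of them retrieves documents containing any of the others.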

A straightforward solution would be to unpack the Greek analyzer and add a filter for empty tokens. These words would no longer be conflated, and exact matches would still be available through the plain index.
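As a rough sketch of what that could look like, here are Elasticsearch analysis settings written as a Python dict. The unpacked chain follows the general shape Elasticsearch documents for its built-in Greek analyzer (keyword marking omitted); the analyzer name `text` and the filter name `remove_empty` are assumptions for illustration and may not match what CirrusSearch actually uses.

```python
# Sketch of an unpacked Greek analyzer with an empty-token filter appended.
GREEK_ANALYSIS = {
    "filter": {
        "greek_lowercase": {"type": "lowercase", "language": "greek"},
        "greek_stop": {"type": "stop", "stopwords": "_greek_"},
        "greek_stemmer": {"type": "stemmer", "language": "greek"},
        # Assumed name: a length filter with min=1 drops zero-length tokens.
        "remove_empty": {"type": "length", "min": 1},
    },
    "analyzer": {
        # Assumed analyzer name.
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
                "greek_lowercase",
                "greek_stop",
                "greek_stemmer",
                "remove_empty",  # must run after the stemmer
            ],
        }
    },
}


def remove_empty(tokens):
    """Mimic what a length filter with min=1 does to a token stream."""
    return [token for token in tokens if len(token) >= 1]
```

The empty-token filter is placed last because the empty tokens only appear after stemming; placing it earlier in the chain would not catch them.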

Event Timeline

EBjune triaged this task as Medium priority. Aug 30 2018, 5:25 PM
EBjune moved this task from needs triage to Up Next on the Discovery-Search board.

Change 494846 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

https://gerrit.wikimedia.org/r/494846

Unpacking the Greek analyzer exposes the lowercase filter, which then gets upgraded to icu_normalizer, losing the Greek-specific processing in it! So we need to keep the Greek lowercasing even when we do ICU normalization. After that, everything is copacetic. Full write-up on MediaWiki.
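The idea can be sketched roughly as follows. This is not the CirrusSearch code from the patch; `upgrade_chain` and its arguments are hypothetical, and the rule "a lowercase filter with a `language` setting is kept" is an assumption made for the illustration.

```python
def upgrade_chain(chain, filter_defs):
    """Upgrade lowercase filters to icu_normalizer in a filter chain.

    A generic lowercase filter is replaced outright. A language-specific
    one (assumed here to be any lowercase filter with a "language" key)
    is kept, and icu_normalizer is added right after it.
    """
    upgraded = []
    for name in chain:
        # Built-in filters have no custom definition; treat the name as the type.
        definition = filter_defs.get(name, {"type": name})
        if definition.get("type") != "lowercase":
            upgraded.append(name)
        elif "language" in definition:
            upgraded.extend([name, "icu_normalizer"])
        else:
            upgraded.append("icu_normalizer")
    return upgraded


filter_defs = {"greek_lowercase": {"type": "lowercase", "language": "greek"}}
greek_chain = upgrade_chain(["greek_lowercase", "greek_stemmer"], filter_defs)
generic_chain = upgrade_chain(["lowercase", "stop"], {})
```

With this rule the Greek chain keeps its own lowercasing ahead of ICU normalization, while a chain using the plain lowercase filter is upgraded as before.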

Change 494846 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

https://gerrit.wikimedia.org/r/494846