Maniphest T203117

Greek language analysis generates unexpected empty tokens
Closed, Resolved · Public
Assigned To

Authored By

	TJones
	Aug 29 2018, 9:03 PM

Description

While looking into T192502 (which looks at empty tokens created by ICU folding), I discovered that the monolithic Greek analyzer generates some empty tokens, too, particularly for these words: εστάτο, εστερ, εστέρ, έστερ, έστέρ, εστέρα, εστέρας, εστέρες, εστέρησε, εστερία, εστερικό, εστερικού, εστερικών, εστέρο, εστέρος, εστέρων, ήσανε, ότερ, οτέρι, ότερι, οτερό, οτέρο.

As a result, searching for any of them finds the others. Some are related, but as far as I can tell, searching for εστάτο (estáto) should not return articles with Εστέρες (estéres) and Οτερό (oteró) in the title as top hits—yet that's what happens!
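The conflation can be illustrated with a toy inverted index. This is only a sketch: `toy_analyze` is a hypothetical stand-in that hard-codes an empty token for three of the words listed above, and it does not reproduce the real Greek stemmer or the CirrusSearch index.

```python
from collections import defaultdict

# Hypothetical stand-in for the Greek analyzer. These three words are among
# those the task reports as producing empty tokens; the result is hard-coded
# here instead of being computed by a real stemmer.
EMPTY_TOKEN_WORDS = {"εστάτο", "εστέρες", "οτερό"}


def toy_analyze(word):
    """Return the index token for a word ("" models the unexpected empty token)."""
    lowered = word.lower()
    return "" if lowered in EMPTY_TOKEN_WORDS else lowered


def build_index(titles):
    """Map each analyzed token to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            index[toy_analyze(word)].add(title)
    return index


def search(index, query):
    """Return every title sharing at least one analyzed token with the query."""
    hits = set()
    for word in query.split():
        hits |= index.get(toy_analyze(word), set())
    return hits


titles = ["Εστέρες", "Οτερό", "Αθήνα"]
index = build_index(titles)
print(sorted(search(index, "εστάτο")))
```

Because all three words collapse to the same empty token, the query for εστάτο matches both unrelated titles, while Αθήνα is unaffected.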

A straightforward solution would be to unpack the Greek analyzer and add a filter for empty tokens. These words would no longer be conflated, and exact matches would still be available through the plain index.
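A minimal sketch of what such an unpacked configuration might look like, written as a Python dict of Elasticsearch analysis settings. The filter chain follows the documented way to rebuild the built-in `greek` analyzer (with the optional keyword-marker filter left out), and the empty-token filter is a `length` filter with `min: 1`. The names `remove_empty`, `text`, and `unpacked_greek_analysis` are assumptions for illustration and are not taken from the actual patch.

```python
def unpacked_greek_analysis():
    """Return illustrative analysis settings for an unpacked Greek analyzer."""
    return {
        "filter": {
            # Greek-specific lowercasing, as used by the built-in analyzer.
            "greek_lowercase": {"type": "lowercase", "language": "greek"},
            "greek_stop": {"type": "stop", "stopwords": "_greek_"},
            "greek_stemmer": {"type": "stemmer", "language": "greek"},
            # Assumed name: drops any token whose length is below 1,
            # i.e. the empty tokens left behind by the stemmer.
            "remove_empty": {"type": "length", "min": 1},
        },
        "analyzer": {
            # Assumed analyzer name.
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "greek_lowercase",
                    "greek_stop",
                    "greek_stemmer",
                    "remove_empty",
                ],
            }
        },
    }


settings = unpacked_greek_analysis()
print(settings["analyzer"]["text"]["filter"])
```

The empty-token filter comes last so that it sees the stemmer's output. Placing it earlier would have no effect, since the tokens only become empty after stemming.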

Details

	Subject	Repo	Branch	Lines +/-
	Add Greek empty-token filter and keep lang-specific lowercasing	mediawiki/extensions/CirrusSearch	master	+439 -50

Related Objects

Status	Assigned	Task
Resolved	TJones	T203117 Greek language analysis generates unexpected empty tokens
Resolved	TJones	T217602 Properly handle language-specific lowercasing in language analyzers
Resolved	TJones	T217806 Reindex Greek, Turkish, and Irish wikis to keep lang-specific lowercasing & enable empty-token filtering (Greek)

Event Timeline

TJones created this task. Aug 29 2018, 9:03 PM

Restricted Application added a subscriber: Aklapper. Aug 29 2018, 9:03 PM

TJones added a project: Discovery-Search. Aug 29 2018, 9:03 PM

• EBjune triaged this task as Medium priority. Aug 30 2018, 5:25 PM

• EBjune moved this task from needs triage to Up Next on the Discovery-Search board.

TJones moved this task from Up Next to search-icebox on the Discovery-Search board. Nov 13 2018, 6:47 PM

TJones moved this task from search-icebox to making others happy on the Discovery-Search board. Jan 29 2019, 7:07 PM

TJones moved this task from making others happy to Language Stuff on the Discovery-Search board.

TJones claimed this task. Feb 26 2019, 4:48 PM

TJones moved this task from Language Stuff to Current work on the Discovery-Search board.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones mentioned this in T217602: Properly handle language-specific lowercasing in language analyzers. Mar 4 2019, 8:49 PM

Change 494846 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

https://gerrit.wikimedia.org/r/494846

gerritbot added a project: Patch-For-Review. Mar 6 2019, 10:57 PM

TJones moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Mar 6 2019, 11:00 PM

TJones mentioned this in T217806: Reindex Greek, Turkish, and Irish wikis to keep lang-specific lowercasing & enable empty-token filtering (Greek).

Unpacking the Greek analyzer exposes its lowercase filter, which then gets upgraded to icu_normalizer, losing the Greek-specific processing that the lowercase filter performs! So, we need to keep the Greek lowercasing even when we also do ICU normalization. After that, everything is copacetic. Full write-up on MediaWiki.
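A rough standard-library sketch of why the Greek lowercasing has to stay in the chain. It assumes, based on the Lucene documentation for its Greek lowercase filter, that Greek-specific lowercasing also removes some diacritics and maps final sigma to sigma, whereas generic case folding plus Unicode normalization does not strip the accent. Both functions are hypothetical approximations: `icu_like_fold` is not the real ICU filter, and `greek_like_lowercase` strips all combining marks rather than only the ones the real filter handles.

```python
import unicodedata


def icu_like_fold(token):
    """Approximate generic normalization: case folding plus NFKC (not real ICU)."""
    return unicodedata.normalize("NFKC", token.casefold())


def greek_like_lowercase(token):
    """Approximate Greek lowercasing: lowercase, drop accents, final sigma to sigma."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Remove combining marks such as the tonos (Unicode category Mn).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped).replace("ς", "σ")


def combined(token):
    """Keep the Greek step first, then apply the generic normalization."""
    return icu_like_fold(greek_like_lowercase(token))


for word in ["Έστερ", "εστερ"]:
    print(word, icu_like_fold(word), combined(word))
```

With generic folding alone, the accented and unaccented spellings stay distinct, so later filters written for unaccented lowercase input may see tokens they do not expect. Running the Greek step first brings both spellings to the same form before the generic normalization is applied.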