Test Elastic 6.8 language analyzers
Closed, Resolved · Public · 3 Estimated Story Points

Description

User Story: As a search engineer, I don't want there to be any big language analysis surprises when we upgrade to Elasticsearch 6.8.

We've already seen one unexpected change in 6.8 where a chess piece [♙] became searchable, which caused an existing test to fail.

We can do a relatively quick check on 500–1000 random documents from a selection of Wikipedias and Wiktionaries, to test language-specific analysis and a variety of scripts. I will also test some additional "rare" characters (such as ♙, ☥, 〃, and 〆—see T211824: Investigate a “rare-character” index).
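
For illustration, here is a minimal sketch of the kind of comparison involved (not the actual harness). It assumes local 6.5 and 6.8 test clusters on hypothetical ports, the Python requests library, and the built-in standard analyzer, and simply reports tokens that show up on only one side.

```
"""Minimal sketch of the comparison (not the actual harness), assuming
local 6.5 and 6.8 test clusters and the Python requests library."""
import requests

OLD = "http://localhost:9200"  # hypothetical Elasticsearch 6.5 node
NEW = "http://localhost:9201"  # hypothetical Elasticsearch 6.8 node


def tokens(base_url, analyzer, text):
    """Return the token strings the named analyzer produces for `text`."""
    resp = requests.post(f"{base_url}/_analyze",
                         json={"analyzer": analyzer, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]


def report_diff(analyzer, text):
    """Print tokens that only one of the two versions produces."""
    old, new = tokens(OLD, analyzer, text), tokens(NEW, analyzer, text)
    if old != new:
        print(f"{analyzer}: lost {sorted(set(old) - set(new))}, "
              f"gained {sorted(set(new) - set(old))}")


# Rare characters from T211824 mixed in with ordinary text.
report_diff("standard", "pawn ♙ ankh ☥ ditto 〃 shime 〆")
```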

If there are no big issues, it should be relatively quick. If there are any big issues, well then it'll be a good thing we found them. (We really want to detect problems like the Chinese punctuation problem in T172653—though admittedly that was not caused by an upgrade.)

We did a similar analysis when we upgraded from ES 5 to ES 6. (See T194849 and the previous write-up.)

Acceptance Criteria:

  • Report on language analysis diffs between current ES version (6.5) and ES 6.8
  • New phab tickets for any big issues that need to be addressed

Event Timeline

Full details are on MediaWiki.

Between Elastic 6.5 and 6.8, changes in Lucene have altered tokenization by the standard tokenizer:

  • A lot of "interesting" Unicode characters are now surviving tokenization.
  • The tokenizer no longer splits on narrow no-break spaces (U+202F).

It also turns out that both of the above were already true for the ICU tokenizer in ES 6.5.
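
A quick way to see the NNBSP behavior for yourself is to run a string joined by U+202F through both tokenizers on a 6.8 node. This is just a hedged sketch; it assumes a local node with the analysis-icu plugin installed and the Python requests library.

```
"""Quick NNBSP check, assuming a local 6.8 node with the analysis-icu
plugin; if the behavior above holds, neither tokenizer splits the string."""
import requests

text = "avant\u202fgarde"  # two words joined by a narrow no-break space
for tokenizer in ("standard", "icu_tokenizer"):
    resp = requests.post("http://localhost:9200/_analyze",
                         json={"tokenizer": tokenizer, "text": text})
    print(tokenizer, [t["token"] for t in resp.json()["tokens"]])
```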

The Nori (Korean) tokenizer has changed the way it defines character sets (regular vs. "extended"). It still breaks on clearly different scripts (Hangul, Cyrillic, Latin, Greek), but the change leads to lots of small differences in Korean tokens: 6.4% fewer tokens in my Korean Wikipedia sample and 9.5% fewer in my Korean Wiktionary sample.
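
A sketch of how such token counts could be gathered is below. It assumes 6.5 and 6.8 test nodes with the analysis-nori plugin installed; the one-line sample is a stand-in for the real document samples.

```
"""Sketch of a token-count comparison for Korean, assuming 6.5 and 6.8
test nodes with the analysis-nori plugin; `samples` stands in for the
real document sample."""
import requests

OLD = "http://localhost:9200"  # hypothetical 6.5 node
NEW = "http://localhost:9201"  # hypothetical 6.8 node
samples = ["위키백과는 우리 모두의 백과사전입니다."]  # stand-in sample text


def count_tokens(base_url, texts):
    """Total number of tokens the plugin's built-in `nori` analyzer produces."""
    total = 0
    for text in texts:
        resp = requests.post(f"{base_url}/_analyze",
                             json={"analyzer": "nori", "text": text})
        total += len(resp.json()["tokens"])
    return total


old_n, new_n = count_tokens(OLD, samples), count_tokens(NEW, samples)
print(f"6.5: {old_n} tokens; 6.8: {new_n} tokens "
      f"({100.0 * (new_n - old_n) / old_n:+.1f}%)")
```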

Next Steps

To be discussed with the Search Team. Options include:

  • Do nothing immediately. There are very few tokens affected, and some of the issues are already present in 6.5.
  • Do nothing until ES 7.10. It's hard to keep up with all the small changes in every new version of Elastic and Lucene, so just wait until ES 7.10 and re-assess then. Who knows what may have been changed or fixed?
  • Start fixing stuff, following something like the list below, roughly sorted by priority/complexity:
    • Add a NNBSP => space character filter to plain filters and unpacked text filters everywhere (see the settings sketch after this list).
    • Look into tuning the aggressive_splitting filter for English and Italian to see if we can get what we want out of it without losing all the interesting rare characters coming out of the 6.8 tokenizers.
    • Look into harmonizing tokenization + aggressive_splitting across all (non-monolithic) languages so that similar inputs get similar outputs when possible (e.g., when not dependent on dictionaries or other language-specific processing). See T219550.
      • Consider upgrading all languages that use the standard tokenizer to the ICU tokenizer.
      • Consider using some form of aggressive_splitting everywhere possible. See T219108.
    • Look into patching our version of the Hebrew tokenizer to allow interesting Unicode characters to come through.
    • If various issues still exist in ES 7.10:
      • Open an upstream ticket for the standard and ICU tokenizers to split on NNBSP characters.
      • Open an upstream ticket for the Thai tokenizer to allow interesting Unicode characters to come through.
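
To make the first few items above more concrete, here is an illustrative settings sketch, not the actual CirrusSearch configuration: a mapping char_filter that rewrites NNBSP (U+202F) to a regular space, wired into a custom analyzer that uses the ICU tokenizer and a generic word_delimiter filter standing in for our aggressive_splitting settings. All names and parameters here are assumptions for the example.

```
"""Illustrative settings only (not the actual CirrusSearch config): a
mapping char_filter turning NNBSP into a plain space, plus a custom
analyzer using the ICU tokenizer (requires the analysis-icu plugin) and
a generic word_delimiter filter standing in for aggressive_splitting."""
import requests

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                # Hypothetical name; maps U+202F (NNBSP) to U+0020 (space).
                "nnbsp_to_space": {
                    "type": "mapping",
                    "mappings": ["\\u202F=>\\u0020"],
                }
            },
            "filter": {
                # Stand-in for the aggressive_splitting settings; the real
                # filter would need tuning so rare characters survive.
                "demo_splitting": {
                    "type": "word_delimiter",
                    "preserve_original": True,
                }
            },
            "analyzer": {
                "demo_text": {
                    "type": "custom",
                    "char_filter": ["nnbsp_to_space"],
                    "tokenizer": "icu_tokenizer",
                    "filter": ["demo_splitting", "lowercase"],
                }
            },
        }
    }
}

# Create a scratch index and try the analyzer on an NNBSP-joined string.
requests.put("http://localhost:9200/analysis_demo", json=settings)
resp = requests.post("http://localhost:9200/analysis_demo/_analyze",
                     json={"analyzer": "demo_text", "text": "avant\u202fgarde"})
print([t["token"] for t in resp.json()["tokens"]])
```

In the real config this would apply to the plain and unpacked text analysis chains rather than a scratch index like this one.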

The consensus today was that we should just let it ride for 6.8 and re-assess in 7.10. If these problems still exist, then we can start fixing them there. I'll update the 7.10 task (T301131).