Test Elastic 6.8 language analyzers
Closed, Resolved · Public · 3 Estimated Story Points

Description

User Story: As a search engineer, I don't want there to be any big language analysis surprises when we upgrade to Elasticsearch 6.8.

We've already seen one unexpected change in 6.8 where a chess piece [♙] became searchable, which caused an existing test to fail.

We can do a relatively quick check on 500–1000 random documents from a selection of Wikipedias and Wiktionaries, to test language-specific analysis and a variety of scripts. I will also test some additional "rare" characters (such as ♙, ☥, 〃, and 〆—see T211824: Investigate a “rare-character” index).
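
For illustration, here is a minimal sketch of the kind of comparison involved (not the actual harness). It assumes local 6.5 and 6.8 test clusters on hypothetical ports, the Python requests library, and the built-in standard analyzer, and simply reports tokens that show up on only one side.

```
"""Minimal sketch of the comparison (not the actual harness), assuming
local 6.5 and 6.8 test clusters and the Python requests library."""
import requests

OLD = "http://localhost:9200"  # hypothetical Elasticsearch 6.5 node
NEW = "http://localhost:9201"  # hypothetical Elasticsearch 6.8 node


def tokens(base_url, analyzer, text):
    """Return the token strings the named analyzer produces for `text`."""
    resp = requests.post(f"{base_url}/_analyze",
                         json={"analyzer": analyzer, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]


def report_diff(analyzer, text):
    """Print tokens that only one of the two versions produces."""
    old, new = tokens(OLD, analyzer, text), tokens(NEW, analyzer, text)
    if old != new:
        print(f"{analyzer}: lost {sorted(set(old) - set(new))}, "
              f"gained {sorted(set(new) - set(old))}")


# Rare characters from T211824 mixed in with ordinary text.
report_diff("standard", "pawn ♙ ankh ☥ ditto 〃 shime 〆")
```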

If there are no big issues, it should be relatively quick. If there are any big issues, well then it'll be a good thing we found them. (We really want to detect problems like the Chinese punctuation problem in T172653—though admittedly that was not caused by an upgrade.)

We did a similar analysis when we upgraded from ES 5 to ES 6. (See T194849 and the previous write-up.)

Acceptance Criteria:

  • Report on language analysis diffs between current ES version (6.5) and ES 6.8
  • New phab tickets for any big issues that need to be addressed

Event Timeline

Full details are on MediaWiki.

Between Elastic 6.5 and 6.8, changes in Lucene have altered tokenization by the standard tokenizer:

  • A lot of "interesting" Unicode characters are now surviving tokenization.
  • The tokenizer no longer splits on narrow no-break spaces (U+202F).

It also turns out that both of the above were already true for the ICU tokenizer in ES 6.5.
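
A quick way to see the NNBSP behavior for yourself is to run a string joined by U+202F through both tokenizers on a 6.8 node. This is just a hedged sketch; it assumes a local node with the analysis-icu plugin installed and the Python requests library.

```
"""Quick NNBSP check, assuming a local 6.8 node with the analysis-icu
plugin; if the behavior above holds, neither tokenizer splits the string."""
import requests

text = "avant\u202fgarde"  # two words joined by a narrow no-break space
for tokenizer in ("standard", "icu_tokenizer"):
    resp = requests.post("http://localhost:9200/_analyze",
                         json={"tokenizer": tokenizer, "text": text})
    print(tokenizer, [t["token"] for t in resp.json()["tokens"]])
```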

The Nori (Korean) tokenizer has changed the way it defines character sets (regular vs. "extended"). It still breaks on clearly different scripts (Hangul, Cyrillic, Latin, Greek), but the change leads to lots of small differences in Korean tokens: 6.4% fewer tokens in my Korean Wikipedia sample and 9.5% fewer in my Korean Wiktionary sample.
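
A sketch of how such token counts could be gathered is below. It assumes 6.5 and 6.8 test nodes with the analysis-nori plugin installed; the one-line sample is a stand-in for the real document samples.

```
"""Sketch of a token-count comparison for Korean, assuming 6.5 and 6.8
test nodes with the analysis-nori plugin; `samples` stands in for the
real document sample."""
import requests

OLD = "http://localhost:9200"  # hypothetical 6.5 node
NEW = "http://localhost:9201"  # hypothetical 6.8 node
samples = ["위키백과는 우리 모두의 백과사전입니다."]  # stand-in sample text


def count_tokens(base_url, texts):
    """Total number of tokens the plugin's built-in `nori` analyzer produces."""
    total = 0
    for text in texts:
        resp = requests.post(f"{base_url}/_analyze",
                             json={"analyzer": "nori", "text": text})
        total += len(resp.json()["tokens"])
    return total


old_n, new_n = count_tokens(OLD, samples), count_tokens(NEW, samples)
print(f"6.5: {old_n} tokens; 6.8: {new_n} tokens "
      f"({100.0 * (new_n - old_n) / old_n:+.1f}%)")
```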

Next Steps

To be discussed with the Search Team. Options include:

  • Do nothing immediately. There are very few tokens affected, and some of the issues are already present in 6.5.
  • Do nothing until ES 7.10. It's hard to keep up with all the small changes in every new version of Elastic and Lucene, so just wait until ES 7.10 and re-assess then. Who knows what may have been changed or fixed?
  • Start fixing stuff, following something like the list below, roughly sorted by priority/complexity:
    • Add a NNBSP => space character filter to plain filters and unpacked text filters everywhere (see the settings sketch after this list).
    • Look into tuning the aggressive_splitting filter for English and Italian to see if we can get what we want out of it without losing all the interesting rare characters coming out of the 6.8 tokenizers.
    • Look into harmonizing tokenization + aggressive_splitting across all (non-monolithic) languages so that similar inputs get similar outputs when possible (e.g., when not dependent on dictionaries or other language-specific processing). See T219550.
      • Consider upgrading all languages that use the standard tokenizer to the ICU tokenizer.
      • Consider using some form of aggressive_splitting everywhere possible. See T219108.
    • Look into patching our version of the Hebrew tokenizer to allow interesting Unicode characters to come through.
    • If various issues still exist in ES 7.10:
      • Open an upstream ticket for the standard and ICU tokenizers to split on NNBSP characters.
      • Open an upstream ticket for the Thai tokenizer to allow interesting Unicode characters to come through.
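
To make the first few items above more concrete, here is an illustrative settings sketch, not the actual CirrusSearch configuration: a mapping char_filter that rewrites NNBSP (U+202F) to a regular space, wired into a custom analyzer that uses the ICU tokenizer and a generic word_delimiter filter standing in for our aggressive_splitting settings. All names and parameters here are assumptions for the example.

```
"""Illustrative settings only (not the actual CirrusSearch config): a
mapping char_filter turning NNBSP into a plain space, plus a custom
analyzer using the ICU tokenizer (requires the analysis-icu plugin) and
a generic word_delimiter filter standing in for aggressive_splitting."""
import requests

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                # Hypothetical name; maps U+202F (NNBSP) to U+0020 (space).
                "nnbsp_to_space": {
                    "type": "mapping",
                    "mappings": ["\\u202F=>\\u0020"],
                }
            },
            "filter": {
                # Stand-in for the aggressive_splitting settings; the real
                # filter would need tuning so rare characters survive.
                "demo_splitting": {
                    "type": "word_delimiter",
                    "preserve_original": True,
                }
            },
            "analyzer": {
                "demo_text": {
                    "type": "custom",
                    "char_filter": ["nnbsp_to_space"],
                    "tokenizer": "icu_tokenizer",
                    "filter": ["demo_splitting", "lowercase"],
                }
            },
        }
    }
}

# Create a scratch index and try the analyzer on an NNBSP-joined string.
requests.put("http://localhost:9200/analysis_demo", json=settings)
resp = requests.post("http://localhost:9200/analysis_demo/_analyze",
                     json={"analyzer": "demo_text", "text": "avant\u202fgarde"})
print([t["token"] for t in resp.json()["tokens"]])
```

In the real config this would apply to the plain and unpacked text analysis chains rather than a scratch index like this one.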

The consensus today was that we should just let it ride for 6.8 and re-assess in 7.10. If these problems still exist, then we can start fixing them there. I'll update the 7.10 task (T301131).