User Story: As a search engineer, I don't want there to be any big language analysis surprises when we upgrade to Elasticsearch 6.8.
We've already seen one unexpected change in 6.8 where a chess piece [♙] became searchable, which caused an existing test to fail.
We can do a relatively quick check on 500–1000 random documents from a selection of Wikipedias and Wiktionaries, to test language-specific analysis and a variety of scripts. I will also test some additional "rare" characters (such as ♙, ☥, 〃, and 〆—see T211824: Investigate a “rare-character” index).
If there are no big issues, it should be relatively quick. If there are any big issues, well then it'll be a good thing we found them. (We really want to detect problems like the Chinese punctuation problem in T172653—though admittedly that was not caused by an upgrade.)
We did a similar analysis when we upgraded from ES 5 to ES 6. (See T194849 and the previous write-up.)
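A minimal sketch of how the comparison could be scripted, assuming the old and new versions are both reachable over HTTP and that comparing token sets per sample is enough for a first pass. It uses Elasticsearch's `_analyze` endpoint; the host URLs, index name, and analyzer name in the usage note are hypothetical placeholders, and the helper names (`analyze`, `diff_tokens`, `compare_samples`) are made up for illustration, not taken from any existing tooling.

```python
import json
import urllib.request


def analyze(base_url, index, analyzer, text):
    """Return the token strings that `analyzer` on `index` produces for `text`."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


def diff_tokens(old_tokens, new_tokens):
    """Compare two token lists as sets and report tokens lost and gained."""
    old_set, new_set = set(old_tokens), set(new_tokens)
    return {
        "lost": sorted(old_set - new_set),
        "gained": sorted(new_set - old_set),
    }


def compare_samples(samples, analyze_old, analyze_new):
    """Run each sample through both analyzers; keep only samples that differ."""
    report = {}
    for text in samples:
        diff = diff_tokens(analyze_old(text), analyze_new(text))
        if diff["lost"] or diff["gained"]:
            report[text] = diff
    return report
```

With two test instances running (hypothetical URLs and names), the analyzer callables could be built with `functools.partial(analyze, "http://localhost:9200", "enwiki_sample", "text")` for 6.5 and the same call against the 6.8 host, then passed to `compare_samples` along with the sampled document text and the rare characters. Comparing sets ignores token order and frequency, so a fuller report would also want positions and offsets from the `_analyze` response.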
Acceptance Criteria:
- Report on language analysis diffs between current ES version (6.5) and ES 6.8
- New phab tickets for any big issues that need to be addressed