Reindex Arabic & Thai wikis to enable unpacked versions
Closed, Resolved · Public · 3 Estimated Story Points

Description

Once T294147 is deployed (probably in MediaWiki_1.40/wmf.5), we can reindex the relevant wikis to activate ICU tokenization, ICU normalization, ICU folding, and homoglyph normalization.

There are currently 7 wikis for Arabic (ar) and 6 wikis for Thai (th).

Acceptance Criteria

  • All wikis in the relevant languages are reindexed
  • A before-and-after analysis for each language's Wikipedia is provided

Event Timeline

Full write-up on MediaWiki.

  • Arabic showed the usual small recall improvements from ICU folding.
    • Small decrease in the zero-results rate (ZRR), from 23.7% to 23.2%; moderate increase in recall (affecting 21.0% of queries)
  • Thai saw much larger changes because of the introduction of a different tokenizer.
    • ZRR went up! 16.1% to 17.6% (net +1.5 percentage points: 1.9% of queries went from some results to zero, and 0.4% went from zero results to some)
    • In general, far more queries got fewer results than got more (44.2% vs 8.7%)
    • Recall generally decreased, but hopefully precision increased.
      • Without doing significantly more analysis with a Thai speaker, I can't say for sure that the changes are all positive, but they are generally in line with what we saw before reindexing, and they highlight the cases where the Thai tokenizer seems to be overly aggressive (breaking text up into one-character tokens).
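As a quick sanity check on the Thai ZRR figures above, here is a minimal sketch of the arithmetic. The function name `zrr_after_reindex` is hypothetical and not part of any analysis tool used for this task. It assumes all three inputs are percentages of the same query sample.

```python
def zrr_after_reindex(zrr_before: float, lost_to_zero: float, gained_from_zero: float) -> float:
    """Return the new zero-results rate, in percent of all queries.

    zrr_before:       percent of queries with zero results before reindexing
    lost_to_zero:     percent of queries that went from some results to zero
    gained_from_zero: percent of queries that went from zero results to some
    """
    return zrr_before + lost_to_zero - gained_from_zero


# Thai figures quoted above (percent of queries in the sample).
before = 16.1
after = zrr_after_reindex(before, lost_to_zero=1.9, gained_from_zero=0.4)

print(f"net change: {after - before:+.1f} percentage points")
print(f"ZRR after:  {after:.1f}%")
```

With the Thai numbers this reproduces the reported 17.6%, which confirms that the 1.9% and 0.4% figures account for the whole net change.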
TJones changed the point value for this task from 2 to 3.

Changing story points because this was more involved than usual due to the changes to Thai tokenization.