𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping
Closed, Resolved · Public · 3 Estimated Story Points

Description

See T176197#3737728 and following comments.

After discussions on Wikipedia and Wiktionary, the general consensus is that we should enable the hiragana to katakana (H2K) mapping for French, Russian, Italian, and Swedish because it has a small effect but seems useful.

With the new ICU tokenizer, the H2K mapping causes problems and does very little positive, so we should just turn it off.

Event Timeline

TJones renamed this task from Enable hiragana/katakana mapping for other languages. to Enable hiragana/katakana mapping for other languages. Nov 13 2017, 6:52 PM
TJones updated the task description.

Not directly part of this ticket, but part of the related discussion: we should decide whether we should keep going down the list of unpacked analyzers, and whether we should pro-actively unpack the other analyzers, and whether we should just enable it for languages where it has a small but positive impact (and doesn't cause problems with the analyzer).

Let's talk about this during sprint planning tomorrow!

We'll announce this upcoming work on the various wikis' village pumps (we're not sure of exact timelines), and we'll post again before we release the changes into production.

I think we'll be able to work on this in Q3, FY2017/18, as time allows.

TJones raised the priority of this task from Medium to High. Aug 27 2020, 9:46 PM
TJones removed TJones as the assignee of this task. Apr 3 2024, 4:58 PM
TJones moved this task from In Progress to Incoming on the Discovery-Search (Current work) board.
TJones moved this task from Language Stuff to needs triage on the Discovery-Search board.

While we had planned to expand the deployment of the hiragana-to-katakana mapping from English to most other languages (though not Japanese), testing revealed that doing the mapping pre-tokenization interfered with the new ICU tokenizer's ability to parse Japanese text (on non-Japanese wikis).

Enabling the mapping after tokenization didn't do much because there aren't that many cross-kana terms in the dictionary used by the ICU tokenizer. It's also very expensive in the off-the-shelf form (around +10% to load/reindex times), or it would require development and deployment of a new filter—neither of which seems worth it.
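For anyone reading along who hasn't looked at the mapping itself, here is a rough Python sketch of the idea. It is not the CirrusSearch or Elasticsearch implementation. The one solid fact it relies on is the Unicode layout: hiragana U+3041 to U+3096 sit exactly 0x60 below their katakana counterparts. Everything else (`TOY_DICTIONARY`, `toy_tokenize`, and the two `map_*` helpers) is hypothetical and only stands in for a dictionary-based tokenizer to show why the order of operations matters.

```python
# Hiragana U+3041..U+3096 map to katakana U+30A1..U+30F6 by adding 0x60.
HIRAGANA_START = 0x3041  # ぁ
HIRAGANA_END = 0x3096    # ゖ
KANA_OFFSET = 0x60

# Hypothetical stand-in for a tokenizer dictionary; NOT the ICU dictionary.
TOY_DICTIONARY = {"ひらがな", "です"}


def hiragana_to_katakana(text: str) -> str:
    """Replace each hiragana character with its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match against TOY_DICTIONARY, else single characters.

    A deliberately crude assumption about how a dictionary-based
    tokenizer behaves; the real ICU tokenizer is far more involved.
    """
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_DICTIONARY:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


def map_before_tokenizing(text: str) -> list[str]:
    """Mapping first: the tokenizer no longer sees the original hiragana."""
    return toy_tokenize(hiragana_to_katakana(text))


def map_after_tokenizing(text: str) -> list[str]:
    """Tokenizing first: word boundaries come from the original text."""
    return [hiragana_to_katakana(tok) for tok in toy_tokenize(text)]
```

In this toy setup, mapping first turns the hiragana into katakana that the dictionary doesn't know, so the input falls apart into single characters, while mapping afterwards keeps the word boundaries. That is only an illustration of the kind of interference described above, not a reproduction of it.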

In the spirit of cross-language analysis harmonization—and because it's now clear that it's interfering with the ICU tokenizer—I'm removing the cross-kana mapping from the English analysis chain.

Full write-up on MediaWiki.

New patch to follow shortly.

Change #1020719 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Disable Hiragana-to-Katakana Mapping

https://gerrit.wikimedia.org/r/1020719

Change #1020719 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Disable Hiragana-to-Katakana Mapping

https://gerrit.wikimedia.org/r/1020719

TJones renamed this task from Enable hiragana/katakana mapping for other languages to 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping for other languages. Apr 18 2024, 7:02 PM
TJones renamed this task from 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping for other languages to 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping. Apr 18 2024, 7:04 PM
TJones updated the task description.