𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To

Authored By

	TJones
	Nov 13 2017, 6:29 PM

Description

~~See T176197#3737728 and following comments.~~

~~After discussions on Wikipedia and Wiktionary, the general consensus is that we should enable the hiragana-to-katakana (H2K) mapping for French, Russian, Italian, and Swedish because it has a small effect but seems useful.~~

With the new ICU tokenizer, the H2K mapping causes problems and does very little positive, so we should just turn it off.

Details

	Subject	Repo	Branch	Lines +/-
	Disable Hiragana-to-Katakana Mapping	mediawiki/extensions/CirrusSearch	master	+16 -770


Related Objects

Status	Assigned	Task
Resolved	TJones	T176197 Allow hiragana searches to find katakana results and vice versa
Invalid	None	T218613 {EPIC} Search: Local Impact. Making making bigger improvements for smaller (or underrepresented) communities
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	TJones	T180387 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping

Event Timeline

TJones created this task. · Nov 13 2017, 6:29 PM

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). · Nov 13 2017, 6:29 PM

Not directly part of this ticket, but part of the related discussion: we should decide whether we should keep going down the list of unpacked analyzers, and whether we should pro-actively unpack the other analyzers, and whether we should just enable it for languages where it has a small but positive impact (and doesn't cause problems with the analyzer).

Pamputt subscribed. · Nov 13 2017, 7:48 PM

In T180387#3756152, @TJones wrote:

Not directly part of this ticket, but part of the related discussion: we should decide whether we should keep going down the list of unpacked analyzers, and whether we should pro-actively unpack the other analyzers, and whether we should just enable it for languages where it has a small but positive impact (and doesn't cause problems with the analyzer).

Let's talk about this during sprint planning tomorrow!

We'll make an announcement on the various wikis' village pumps about this upcoming work (we're not sure of the exact timelines), and we'll post again before we release the changes into production.

deryckchan unsubscribed. · Nov 21 2017, 12:14 AM

I think we'll be able to work on this in Q3 of FY2017/18, as time allows.

debt moved this task from This Quarter to Language Stuff on the Discovery-Search board. · Jan 29 2019, 6:54 PM

• Nuria added a parent task: T218613: {EPIC} Search: Local Impact. Making making bigger improvements for smaller (or underrepresented) communities. · Mar 18 2019, 10:09 PM

TJones mentioned this in T219550: [EPIC] Harmonize language analysis across languages. · Mar 28 2019, 7:46 PM

TJones added a parent task: T219550: [EPIC] Harmonize language analysis across languages.

TJones removed TJones as the assignee of this task. · Mar 18 2020, 6:31 PM

TJones raised the priority of this task from Medium to High. · Aug 27 2020, 9:46 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search, Discovery-ARCHIVED. · Apr 3 2024, 3:31 PM

TJones set the point value for this task to 3.

TJones moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

TJones claimed this task. · Apr 3 2024, 3:33 PM

TJones removed TJones as the assignee of this task. · Apr 3 2024, 4:58 PM

TJones moved this task from In Progress to Incoming on the Discovery-Search (Current work) board.

TJones edited projects, added Discovery-Search; removed Discovery-Search (Current work).

TJones moved this task from Language Stuff to needs triage on the Discovery-Search board.

debt unsubscribed. · Apr 4 2024, 2:33 PM

TJones claimed this task. · Apr 9 2024, 8:27 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

While we had planned to expand the deployment of the hiragana-to-katakana mapping from English to most other languages (though not Japanese), testing revealed that doing the mapping pre-tokenization interfered with the new ICU tokenizer's ability to parse Japanese text (on non-Japanese wikis).
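For readers unfamiliar with the mapping itself, here is a minimal sketch of what a character-level hiragana-to-katakana mapping does. It is an illustration only, not the CirrusSearch implementation: the names `KANA_OFFSET`, `H2K_TABLE`, and `hiragana_to_katakana` are hypothetical, and the sketch relies only on the fact that the Unicode hiragana block (U+3041 to U+3096) and the corresponding katakana block sit 0x60 code points apart.

```python
# Illustrative sketch only; names here are hypothetical, not from CirrusSearch.
# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are laid out in
# parallel in Unicode, 0x60 code points apart.
KANA_OFFSET = 0x60
H2K_TABLE = {cp: cp + KANA_OFFSET for cp in range(0x3041, 0x3097)}


def hiragana_to_katakana(text: str) -> str:
    """Replace each hiragana character with its katakana counterpart."""
    # Characters outside the hiragana range are left unchanged.
    return text.translate(H2K_TABLE)


print(hiragana_to_katakana("ひらがな and カタカナ"))  # ヒラガナ and カタカナ
```

Because a mapping like this rewrites the raw text before any tokenizer runs, the tokenizer only ever sees the katakana form of words that were originally written in hiragana, which is the kind of interaction described in the comment above.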

Enabling the mapping after tokenization didn't do much because there aren't that many cross-kana terms in the dictionary used by the ICU tokenizer. It's also very expensive in the off-the-shelf form (around +10% to load/reindex times), or it would require development and deployment of a new filter—neither of which seems worth it.

In the spirit of cross-language analysis harmonization—and because it's now clear that it's interfering with the ICU tokenizer—I'm removing the cross-kana mapping from the English analysis chain.

Full write-up on MediaWiki.

New patch to follow shortly.

Change #1020719 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Disable Hiragana-to-Katakana Mapping

https://gerrit.wikimedia.org/r/1020719

gerritbot added a project: Patch-For-Review. · Apr 17 2024, 7:27 PM

Change #1020719 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Disable Hiragana-to-Katakana Mapping

https://gerrit.wikimedia.org/r/1020719

ReleaseTaggerBot added a project: MW-1.43-notes (1.43.0-wmf.2; 2024-04-23). · Apr 18 2024, 11:00 AM

TJones moved this task from In Progress to To Be Deployed on the Discovery-Search (Current work) board. · Apr 18 2024, 1:57 PM

TJones renamed this task from Enable hiragana/katakana mapping for other languages to 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping for other languages. · Apr 18 2024, 7:02 PM

TJones renamed this task from 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping for other languages to 𝖤̶𝗇̶𝖺̶𝖻̶𝗅̶𝖾̶ Disable hiragana/katakana mapping. · Apr 18 2024, 7:04 PM

TJones updated the task description.

Maintenance_bot removed a project: Patch-For-Review. · Apr 26 2024, 6:41 PM

TJones mentioned this in T363734: Reindex all wikis to enable dotted I fix, Yiddish ligatures, maybe Arabic normalization. · Apr 29 2024, 4:43 PM

EBernhardson moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. · Jun 4 2024, 7:32 PM

Gehel closed this task as Resolved. · Jun 7 2024, 9:26 AM
