Automatic language detection misidentifies language in some cases
Open, Medium, Public

Description

The service to automatically detect the language of a text (T99666, API, model card) is failing in some cases. This ticket compiles some of those cases to identify whether there is room for improvement.

| Text | Identified language | Expected language |
| --- | --- | --- |
| Moon | Wolof (wo) | English (en) |
| Languages | Ilocano (ilo) | English (en) |
| Add languages | Ilocano (ilo) | English (en) |
| Translate section | Chhattisgarhi (hne) | English (en) |
| Edit language links | Bambara (bm) | English (en) |
| Translate this page | Kinyarwanda (rw) | English (en) |
| Read an automatic translation | Mizo (lu) | English (en) |
| Jazz is a music genre | Sardinian (sc) | English (en) |

The model is expected to work better for longer texts. From the above, the most notable cases seem to be:

  • "Read an automatic translation" being identified as Mizo, given the source text is a bit longer than the other cases.
  • "Translate section" being identified as Chhattisgarhi despite Chhattisgarhi using the Devanagari script instead of latin script.

This has been tested using the API. For example, the request below checks the language of the text "Read an automatic translation" and gets Mizo (lu) back as the identified language:

> curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Read an automatic translation"}' -H "Content-type: application/json"

> {"language":"lus_Latn","wikicode":"lu","languagename":"Mizo","score":0.35616445541381836}

Event Timeline

The model expects sentences; that is how it was trained. For example, a word like "Moon" can appear in many Latin-script languages as a proper noun, a book title, etc. Prediction quality increases as more words are provided, since the model then has more context for each word.

Because of the presence of English words in many Latin-script languages, prediction accuracy for English among them is low. One would expect predicting English to be the easiest task; however, when accommodating many Latin-script languages, accurately identifying a text as English is not easy :-) But as we get more words in the text, this issue disappears.

For languages that use their own scripts, this limitation does not apply.
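
To illustrate, here is a quick sketch that queries the public endpoint from the description with progressively longer texts (the longest sentence is an invented example, and actual scores depend on the deployed model):

```python
import requests

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict"

# Progressively longer English inputs; the expectation is that the identified
# language converges to English and the score rises as context grows.
samples = [
    "Moon",
    "Read an automatic translation",
    "Read an automatic translation of this article in another language",
]

for text in samples:
    response = requests.post(URL, json={"text": text}, timeout=10)
    response.raise_for_status()
    data = response.json()
    print(f"{text!r} -> {data['languagename']} ({data['wikicode']}), score={data['score']:.2f}")
```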

Thanks for the context, Santhosh. I added a note in the ticket description to highlight the most notable cases:

  • "Read an automatic translation" being identified as Mizo, give the source text is a bit longer than the other cases.
  • "Translate section" being identified as Chhattisgarhi despite Chhattisgarhi using the Devanagari script instead of latin script.

I wonder if it would make sense to use a different approach for short strings, where we know the response is likely to be unreliable. For example, we could try to identify the language among a smaller set of the most spoken languages, or take other approaches.
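
As a rough sketch of the allowlist idea (hypothetical: it assumes direct access to a fastText-style model that returns ranked predictions with labels like "eng_Latn", which the API response format suggests but does not expose):

```python
import fasttext

# Hypothetical allowlist of widely spoken languages, in the model's label format.
ALLOWED = {
    "__label__eng_Latn", "__label__spa_Latn", "__label__fra_Latn",
    "__label__deu_Latn", "__label__por_Latn", "__label__rus_Cyrl",
}

SHORT_TEXT_WORDS = 4  # hypothetical cutoff for treating input as "short"

def detect(model, text):
    """Return (label, score); for short inputs, prefer allowlisted languages."""
    labels, scores = model.predict(text, k=20)  # top-20 ranked candidates
    if len(text.split()) >= SHORT_TEXT_WORDS:
        return labels[0], scores[0]
    # For short strings, take the highest-ranked allowlisted candidate.
    for label, score in zip(labels, scores):
        if label in ALLOWED:
            return label, score
    return labels[0], scores[0]  # fall back to the raw top prediction
```

This would trade coverage of the long tail of languages for more predictable behavior on short UI-like strings such as the ones in the table above.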

In any case, we don't need to explore this right now. As we observe more cases of language misidentification, we'll get a better sense of how to address the issues and whether these approaches are needed.

A short-term solution we can consider for the MinT test instance is to apply language detection only when there is enough content for it to be likely to succeed.
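
For example, a minimal guard of this kind (the threshold is a hypothetical starting point, to be tuned against observed cases):

```python
MIN_WORDS = 5  # hypothetical threshold; tune against cases like the table above

def should_detect_language(text: str) -> bool:
    """Run automatic detection only when the input carries enough context."""
    return len(text.split()) >= MIN_WORDS

# Usage: skip detection (and e.g. keep the page's declared language)
# when the text is too short for a reliable prediction.
```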