Automatic language detection misidentifies language in some cases
Open, Medium, Public

Description

The service to automatically detect the language of a text (T99666, API, model card) is failing in some cases. This ticket compiles some of those cases to identify whether there is room for improvement.

| Text | Identified language | Expected language |
| --- | --- | --- |
| Moon | Wolof (wo) | English (en) |
| Languages | Ilocano (ilo) | English (en) |
| Add languages | Ilocano (ilo) | English (en) |
| Translate section | Chhattisgarhi (hne) | English (en) |
| Edit language links | Bambara (bm) | English (en) |
| Translate this page | Kinyarwanda (rw) | English (en) |
| Read an automatic translation | Mizo (lu) | English (en) |
| Jazz is a music genre | Sardinian (sc) | English (en) |

The model is expected to work better for longer texts. From the above, the most notable cases seem to be:

  • "Read an automatic translation" being identified as Mizo, given the source text is a bit longer than the other cases.
  • "Translate section" being identified as Chhattisgarhi despite Chhattisgarhi using the Devanagari script instead of latin script.

This has been tested using the API. For example, the request below checks the language of the text "Read an automatic translation" and gets Mizo (lu) back as the identified language:

> curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Read an automatic translation"}' -H "Content-type: application/json"

> {"language":"lus_Latn","wikicode":"lu","languagename":"Mizo","score":0.35616445541381836}

Event Timeline

The model expects sentences; that is how it was trained. For example, a word like "Moon" can appear in many Latin-script languages as a proper noun, a book title, etc. Prediction quality increases as more words are provided, since the model then has more context for each word.

Because of the presence of English words in many Latin-script languages, prediction accuracy for English among them is low. One would expect predicting English to be the easiest task; however, when accommodating many Latin-script languages, accurately identifying a text as English is not easy :-) But as we get more words in the text, this issue disappears.

For languages that use their own scripts, this limitation does not apply.
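
To illustrate, here is a quick sketch that queries the public endpoint from the description with progressively longer texts (the longest sentence is an invented example, and actual scores depend on the deployed model):

```python
import requests

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict"

# Progressively longer English inputs; the expectation is that the identified
# language converges to English and the score rises as context grows.
samples = [
    "Moon",
    "Read an automatic translation",
    "Read an automatic translation of this article in another language",
]

for text in samples:
    response = requests.post(URL, json={"text": text}, timeout=10)
    response.raise_for_status()
    data = response.json()
    print(f"{text!r} -> {data['languagename']} ({data['wikicode']}), score={data['score']:.2f}")
```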

Thanks for the context, Santhosh. I added a note in the ticket description to highlight the most notable cases:

  • "Read an automatic translation" being identified as Mizo, give the source text is a bit longer than the other cases.
  • "Translate section" being identified as Chhattisgarhi despite Chhattisgarhi using the Devanagari script instead of latin script.

I wonder if it would make sense to use a different approach for short strings, where we know the response is likely to be unreliable. For example, we could try to identify the language among a smaller set of the most spoken languages, or take other approaches.
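
As a rough sketch of the allowlist idea (hypothetical: it assumes direct access to a fastText-style model that returns ranked predictions with labels like "eng_Latn", which the API response format suggests but does not expose):

```python
import fasttext

# Hypothetical allowlist of widely spoken languages, in the model's label format.
ALLOWED = {
    "__label__eng_Latn", "__label__spa_Latn", "__label__fra_Latn",
    "__label__deu_Latn", "__label__por_Latn", "__label__rus_Cyrl",
}

SHORT_TEXT_WORDS = 4  # hypothetical cutoff for treating input as "short"

def detect(model, text):
    """Return (label, score); for short inputs, prefer allowlisted languages."""
    labels, scores = model.predict(text, k=20)  # top-20 ranked candidates
    if len(text.split()) >= SHORT_TEXT_WORDS:
        return labels[0], scores[0]
    # For short strings, take the highest-ranked allowlisted candidate.
    for label, score in zip(labels, scores):
        if label in ALLOWED:
            return label, score
    return labels[0], scores[0]  # fall back to the raw top prediction
```

This would trade coverage of the long tail of languages for more predictable behavior on short UI-like strings such as the ones in the table above.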

In any case, we don't need to explore this right now. As we observe more cases of language misidentification, we'll get a better sense of how to address the issues and whether these approaches are needed.

A short-term solution we can consider for the MinT test instance is to apply language detection only when there is enough content for it to be likely to succeed.
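
For example, a minimal guard of this kind (the threshold is a hypothetical starting point, to be tuned against observed cases):

```python
MIN_WORDS = 5  # hypothetical threshold; tune against cases like the table above

def should_detect_language(text: str) -> bool:
    """Run automatic detection only when the input carries enough context."""
    return len(text.split()) >= MIN_WORDS

# Usage: skip detection (and e.g. keep the page's declared language)
# when the text is too short for a reliable prediction.
```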