The service that automatically detects the language of a text (T99666, API, model card) fails in some cases. This ticket compiles some of those cases to identify whether there is room for improvement.
Text | Identified language | Expected language |
---|---|---|
Moon | Wolof (wo) | English (en) |
Languages | Ilocano (ilo) | English (en) |
Add languages | Ilocano (ilo) | English (en) |
Translate section | Chhattisgarhi (hne) | English (en) |
Edit language links | Bambara (bm) | English (en) |
Translate this page | Kinyarwanda (rw) | English (en) |
Read an automatic translation | Mizo (lu) | English (en) |
Jazz is a music genre | Sardinian (sc) | English (en) |
The model is expected to work better for longer texts. From the above, the most notable cases seem to be:
- "Read an automatic translation" being identified as Mizo, even though the source text is a bit longer than in the other cases.
- "Translate section" being identified as Chhattisgarhi, despite Chhattisgarhi using the Devanagari script rather than the Latin script.
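The second case suggests a cheap sanity check: compare the script of the input text with the script implied by the predicted label. The following is a minimal sketch, not part of the service. It assumes labels follow the `lang_Script` form of the `lus_Latn` label returned by the API, and that Chhattisgarhi would be reported as `hne_Deva` (an assumption, not verified against the API). The names `SCRIPT_NAME_PREFIX`, `dominant_script` and `script_mismatch` are hypothetical.

```python
import unicodedata

# Hypothetical mapping from ISO 15924 script codes to the prefix that
# Unicode character names use for that script. Only two scripts are covered.
SCRIPT_NAME_PREFIX = {"Latn": "LATIN", "Deva": "DEVANAGARI"}


def dominant_script(text):
    """Return the script code covering most letters in text, or None."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation and combining marks
        name = unicodedata.name(ch, "")
        for code, prefix in SCRIPT_NAME_PREFIX.items():
            if name.startswith(prefix):
                counts[code] = counts.get(code, 0) + 1
    return max(counts, key=counts.get) if counts else None


def script_mismatch(text, label):
    """True when the label's script suffix differs from the text's script."""
    # Assumes labels look like "lus_Latn"; a label without "_" never mismatches.
    _, _, label_script = label.partition("_")
    text_script = dominant_script(text)
    return bool(label_script) and text_script is not None and text_script != label_script
```

Under these assumptions, `script_mismatch("Translate section", "hne_Deva")` would flag the prediction, since the text is entirely in Latin letters. Whether such a check belongs in the model, the service, or the client is left open here.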
This has been tested using the API. For example, the request below checks the language of the text "Read an automatic translation" and gets Mizo (lu) as the identified language, with a score of about 0.36:
> curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Read an automatic translation"}' -H "Content-type: application/json"
> {"language":"lus_Latn","wikicode":"lu","languagename":"Mizo","score":0.35616445541381836}
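To re-check all the cases from the table at once, the same request can be scripted. The sketch below is one possible way to do it, using only the Python standard library. It assumes the endpoint accepts anonymous requests exactly as in the curl call above and that the response always carries the `wikicode` and `score` fields shown there. The helper names `post_json` and `find_misidentified` are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict"


def post_json(url, payload):
    """POST payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def find_misidentified(cases, send=post_json):
    """Return (text, expected, identified, score) for every failing case.

    cases is an iterable of (text, expected_wikicode) pairs. send can be
    replaced with a stub so the logic can be exercised without network access.
    """
    failures = []
    for text, expected in cases:
        result = send(API_URL, {"text": text})
        identified = result.get("wikicode")
        if identified != expected:
            failures.append((text, expected, identified, result.get("score")))
    return failures
```

For instance, `find_misidentified([("Moon", "en"), ("Translate section", "en")])` would query the live service and list the texts whose identified wikicode is not `en`. Results may differ from the table above if the model has been updated since.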