Mint translating wrong letter in Punjabi
Open, Stalled, Medium | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Start translating an article from English to Punjabi using Content Translation (with Mint)
  • e.g. Ganesh Ghosh (in screenshot)

What happens?:
Mint outputs the letter ਡ਼ instead of the correct letter ੜ in words

What should have happened instead?:
The letter ੜ should be produced instead of ਡ਼

Other information (browser name/version, screenshots, etc.):

Screenshot (2).png (677×1 px, 167 KB)

Event Timeline

Thanks for reporting, @Kuldeepburjbhalaike.

On my end both characters looked the same:

Screenshot 2024-06-26 at 12.27.30.png (165×1 px, 23 KB)

However, your screenshot above showed a different rendering of the character. Maybe the browser/OS is adding some noise there. So I'm sharing my initial observations in case others are confused when reading the ticket.

Although both characters looked the same, they seemed to correspond to different codes in Unicode:

             Proposed     Current
Character:   ੜ            ਡ਼
Unicode:     \u0a5c       \u0a21 \u0a3c
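To check the difference without relying on how a font renders the two forms, the code points can be listed directly. This is a minimal sketch using only the Python standard library; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


proposed = "\u0a5c"       # single code point
current = "\u0a21\u0a3c"  # base letter followed by a combining sign

for label, value in (("proposed", proposed), ("current", current)):
    print(label, describe(value))

# The two strings are different even when a font draws them identically.
print("same string:", proposed == current)
```

The proposed form is one code point, while the current form is a two-code-point sequence, so string comparison and search treat them as different text.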

Post-processing transformations can automatically replace one character with the other. One important consideration is for native speakers to confirm whether occurrences of the problematic character are expected under some circumstances or whether it is always safe to replace it. Looking at search results for each on Punjabi Wikipedia, there are 0 pages containing the problematic character, and many containing the proposed one. So the replacement seems to be safe.
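A post-processing step of this kind could look roughly like the sketch below. It assumes native speakers have confirmed the replacement is always safe, which is still an open question on this task. The function name `replace_dda_nukta` is hypothetical and is not part of Mint or Content Translation.

```python
DDA_NUKTA = "\u0a21\u0a3c"  # sequence currently produced by the MT output
RRA = "\u0a5c"              # proposed single code point


def replace_dda_nukta(text: str) -> str:
    """Replace every DDA + nukta sequence with RRA.

    Assumption: the sequence is never the intended spelling in Punjabi text.
    """
    return text.replace(DDA_NUKTA, RRA)


sample = "x" + DDA_NUKTA + "y"
print([hex(ord(ch)) for ch in replace_dda_nukta(sample)])
```

A plain DDA without the nukta is left untouched, because only the exact two-code-point sequence is matched.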

Thanks for the reply, @Pginer-WMF. Yes, they also appear the same for me, but only on mobile; my Windows Chrome browser renders them differently. Also pinging @satdeep_gill to add further input.

Screenshot (4).png (605×1 px, 102 KB)

Looking at search results for each on Punjabi Wikipedia, there are 0 pages containing the problematic character, and many containing the proposed one. So the replacement seems to be safe.

There are newly created pages which contain this unwanted character, but I don't know why they aren't appearing in the search, e.g. ਬਾਰਾਮੂਲਾ ਰੇਲਵੇ ਸਟੇਸ਼ਨ

DDA + nukta forming the same ligature rendering as RRA is a common issue in Gurmukhi fonts. For example, Ektype's Mukta has this issue. Giving the nukta form and RRA the same shape is not advised, yet many fonts do it. This is the reason why you see two different shapes as reported above. Common users are not aware of this encoding difference; they focus only on the rendering and use the two interchangeably. This wrong usage then appears in the corpus. For example, in many Dravidian scripts I have seen people using 0 (zero) in place of ഠ, : (colon) instead of ഃ (visarga), and so on. A neural MT system learns these, and the same issues appear in MT output. I have seen this issue in many other languages too.

However, a blind find/replace for this kind of issue creates more problems than it solves. This is different from Unicode normalization or frequent error correction, i.e. you cannot blindly replace all DDA + nukta with RRA.

  • Use a font that is free from this bad shaping.
  • Let us know if this issue is a one-off or occurs frequently in MT output. We can inform the IndicTrans2 team about it. They might be able to do some corpus cleanup before the next training.
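To judge whether the issue is a one-off or frequent, the two forms can be counted over a batch of MT output. This is only a sketch: the function name `count_forms` and the sample segments are made up for illustration, and real input would be translated segments collected from Content Translation.

```python
from collections import Counter

DDA_NUKTA = "\u0a21\u0a3c"  # two-code-point sequence reported in this task
RRA = "\u0a5c"              # single code point expected by the reporter


def count_forms(segments):
    """Count occurrences of each form across a list of text segments."""
    counts = Counter({"dda_nukta": 0, "rra": 0})
    for segment in segments:
        counts["dda_nukta"] += segment.count(DDA_NUKTA)
        counts["rra"] += segment.count(RRA)
    return counts


# Hypothetical segments, standing in for real MT output.
segments = ["a" + DDA_NUKTA + "b" + RRA, RRA + RRA]
print(dict(count_forms(segments)))
```

A high share of the two-code-point form across many segments would suggest a corpus-level problem worth reporting upstream, rather than an isolated output.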
Nikerabbit changed the task status from Open to Stalled. Thu, Jun 27, 8:42 AM
Nikerabbit moved this task from Needs Triage to MT on the ContentTranslation board.