Unicode: Difference between revisions

Content deleted Content added
→‎Issues: Move the section on homoglyphs next to the section on unification, as the proverbial Scylla and Charybdis.
Line 724:
[[File:I acute - soft dotted and Lithuanian dot.svg|thumb|right|Localised forms of the letter í ({{serif|I}} with [[acute accent]])]]
Whether the lowercase letter {{serif|I}} is expected to retain its [[tittle]] when a diacritic applies also depends on local conventions.
=== Security ===
Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substituting them can produce an identifier or URL that looks correct but leads to a different location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Homoglyphs can also be used to manipulate the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022 |chapter-url=https://ieeexplore.ieee.org/document/9833641 |location=San Francisco, CA, US |publisher=IEEE |pages=1987–2004 |arxiv=2106.09898 |doi=10.1109/SP46214.2022.9833641 |isbn=978-1-66541-316-9 |s2cid=235485405}}</ref> Mitigation requires disallowing these characters, displaying them differently, or requiring that they resolve to the same identifier;<ref>{{Cite web |last=Engineering |first=Spotify |date=2013-06-18 |title=Creative usernames and Spotify account hijacking |url=https://engineering.atspotify.com/2013/06/creative-usernames/ |access-date=2023-04-15 |website=Spotify Engineering |language=en-US}}</ref> all of these approaches are complicated by the huge and constantly changing set of characters.<ref>{{Cite journal |last=Wheeler |first=David A. |year=2020 |title=Countermeasures |url=https://www.jstor.org/stable/resrep25332.7 |journal=Initial Analysis of Underhanded Source Code |pages=4–1}}</ref><ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |access-date=27 June 2022 |website=Unicode}}</ref>
In 2021, two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], released a security advisory asserting that [[Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting these marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
 
=== Mapping to legacy character sets ===
Line 754 ⟶ 759:
 
While Unicode defines the script designator (name) to be "{{tt|[[ʼPhags-pa script|Phags_Pa]]}}", in that script's character names, a hyphen is added: {{Unichar|A840|PHAGS-PA LETTER KA}}.<ref name=USA24>{{Cite web |year=2021 |title=Unicode Standard Annex #24: Unicode Script Property |url=https://www.unicode.org/reports/tr24/ |access-date=29 April 2022 |publisher=The Unicode Consortium |at=2.2 Relation to ISO 15924 Codes}}</ref><ref>{{Cite web |year=2023 |title=Scripts-15.1.0.txt |url=https://www.unicode.org/Public/UNIDATA/Scripts.txt |access-date=12 September 2023 |publisher=The Unicode Consortium}}</ref> This, however, is not an anomaly, but the rule: hyphens are replaced by underscores in script designators.<ref name=USA24 />
 
=== Security ===
Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substituting them can produce an identifier or URL that looks correct but leads to a different location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Homoglyphs can also be used to manipulate the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022 |chapter-url=https://ieeexplore.ieee.org/document/9833641 |location=San Francisco, CA, US |publisher=IEEE |pages=1987–2004 |arxiv=2106.09898 |doi=10.1109/SP46214.2022.9833641 |isbn=978-1-66541-316-9 |s2cid=235485405}}</ref> Mitigation requires disallowing these characters, displaying them differently, or requiring that they resolve to the same identifier;<ref>{{Cite web |last=Engineering |first=Spotify |date=2013-06-18 |title=Creative usernames and Spotify account hijacking |url=https://engineering.atspotify.com/2013/06/creative-usernames/ |access-date=2023-04-15 |website=Spotify Engineering |language=en-US}}</ref> all of these approaches are complicated by the huge and constantly changing set of characters.<ref>{{Cite journal |last=Wheeler |first=David A. |year=2020 |title=Countermeasures |url=https://www.jstor.org/stable/resrep25332.7 |journal=Initial Analysis of Underhanded Source Code |pages=4–1}}</ref><ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |access-date=27 June 2022 |website=Unicode}}</ref>
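
As a rough illustration of the "disallow" style of mitigation, the following Python sketch flags labels that mix letters from more than one script. It is a heuristic only: it approximates a character's script from the first word of its Unicode character name, which is an assumption made for brevity and not the method of UTR #36. The function names <code>scripts_in</code> and <code>looks_mixed_script</code> are hypothetical.

```python
import unicodedata


def scripts_in(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of a character's Unicode name
    (e.g. "LATIN", "CYRILLIC", "GREEK") stands in for its script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(label):
    """Return True if the label mixes letters from more than one script."""
    return len(scripts_in(label)) > 1


# "p\u0430ypal" contains U+0430 CYRILLIC SMALL LETTER A in place of Latin "a".
print(looks_mixed_script("paypal"))
print(looks_mixed_script("p\u0430ypal"))
```

A check of this kind does not catch a label written entirely in one non-Latin script that happens to resemble a Latin word, which is one reason real deployments tend to rely on maintained confusable-character data.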
 
In 2021, two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], released a security advisory asserting that [[Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting these marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
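
The kind of check that such highlighting relies on can be sketched in a few lines of Python. The example below is a minimal sketch, not the implementation used by any particular editor: it reports every character whose bidirectional class is an explicit embedding, override, or isolate control. The function name <code>find_bidi_controls</code> and the sample source string are hypothetical.

```python
import unicodedata

# Bidirectional classes of the explicit directional formatting characters
# (embeddings, overrides, isolates, and their terminators).
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF",
                        "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(source):
    """Return (line, column, character name) for each explicit bidi control."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# Hypothetical snippet with U+202E and U+2066 hidden inside a string literal.
sample = "x = 1\nif level != 'user\u202e \u2066# check': pass\n"
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A tool built on this idea could reject such files outright or merely render the characters visibly; the sketch only locates them.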
 
== See also ==