UTF-8 or Not
"there is no way to predict its encoding".
This is perfectly true. At a recent meeting of BAA-London (2009-11-27) Nicolas Delcros put it nicely:
"there is no deterministic way to decide the encoding".
A very large number of monkeys, each provided with a computer set up with an IME that offers 256 typeable characters and maps them to bytes 00-FF, while not necessarily producing the complete works of Shakespeare, could, given sufficient time, produce a file identical to one encoded in any encoding scheme. Bytes will be bytes!
a problem
Now I have a real-world, though minor, problem: my program has to read an .INI file produced by a user in a text editor. I could supply one for him to amend, pre-configured in the encoding of my choice, but it's so simple that he might well bin it and create another anyway.
In most cases it shouldn't matter as, so far, the file is expected to contain fairly minimal content, probably in English.
But he may be jealous of his European heritage and want to spell his name with ä (a-umlaut) or æ (ash) rather than ae. He may have a keyboard set up to allow him to do this without recourse to an "Insert Special Character" dialogue.
I may be lucky enough that the particular text editor he is using realises that the use of ANY character beyond U+007F is a potential problem for any other program reading the file, and asks him to choose the encoding in which to save it.
He or I may want to include some APL code in the ini file.
As I say, I have a minor problem. I may get a file in ANSI. I may get a file in UCS-2, UCS-4, UTF-8, you name it!
To mitigate the problem I have included in the user guide:
"The file character encoding can be either ANSI or UTF-8".
I reason that if the user is Unicode-literate he'll read this and know what to do. If he's not, he'll ignore it and the file will almost certainly be ANSI.
Either way I've precluded almost all possibility of getting caught with UCS-2 or any of the others.
As I stressed above there is no way we can be sure that a valid UTF-8 encoded file was actually intended to be so. Our monkeys may succeed in producing something that just happens to map to a somewhat shorter piece of Anglo-Saxon, French or APL.
But there are one or two things we can do.
UTF-8
The designers of UTF-8 decided to bias it in favour of existing ASCII files that include only byte values (00-7F). This isn't unreasonable if you consider that an 8-bit space-saving scheme, and understand that that's all it is, is of absolutely no use to someone almost the entirety of whose character set has code-points that can't be represented in fewer than 16 bits.
To do this they organised it such that the first 128 code-points (U+0000-U+007F) map directly onto the first 128 bytes (00-7F), all of which have '0' as their first bit.
All characters beyond U+007F are encoded into 2 or more bytes, themselves all greater than 7F: a start byte and one or more continuation bytes. Continuation bytes all start with '10'. The start byte is determined such that the number of leading '1's is the total number of bytes in the encoding.
Thus start bytes are all at least '11000000' (C0). The value of the code-point being encoded is the concatenation of the remaining bits of the start and continuation bytes. This scheme could allow for up to 4,398,046,511,104 code-points (U+40000000000) and up to seven possible encodings of any one of them.
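For example, the code-point U+00E9 ('é', 233) needs more than 7 bits and so becomes a 2-byte sequence. A sketch in a Dyalog session, anticipating the dyadic ⎕UCS described below:

      'UTF-8'⎕UCS'é'          ⍝ U+00E9 encodes as two bytes
195 169
      ⍉2⊥⍣¯1 195 169          ⍝ their bit patterns: 110xxxxx then 10xxxxxx
1 1 0 0 0 0 1 1
1 0 1 0 1 0 0 1
      (64×32|195)+64|169      ⍝ concatenating the payload bits recovers the code-point
233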
To avoid ambiguity, only the shortest possible encoding of any particular code-point is permitted. It was later decided to limit UTF-8 to 4-byte encodings (U+0000-U+10FFFF), providing for a possible 1,114,112 code-points.
In consequence of these rules, C0, C1 and all values F5 and above are excluded as start bytes and thus cannot occur anywhere.
So we have 13 excluded bytes (C0, C1, F5-FF) and a requirement for at least one byte of the 51 (C2-F4) followed immediately by at least one of the 64 (80-BF).
This is sufficient to exclude nearly all non-UTF-8 files. Any file containing an excluded byte, a start byte not followed immediately by a continuation byte, or a continuation byte not preceded immediately by either a start byte or another continuation byte is not validly UTF-8 encoded.
In fact the only invalid files remaining are those in which the number of continuation bytes differs from that dictated by the start byte.
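Both failure modes are easy to exhibit with the dyadic ⎕UCS described below (a session sketch; C0 AF is the classic over-long encoding of '/'):

      'UTF-8'⎕UCS 192 175     ⍝ C0 AF: over-long, C0 can never occur
DOMAIN ERROR
      'UTF-8'⎕UCS 226 130     ⍝ E2 demands two continuation bytes; only one follows
DOMAIN ERROR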
This leaves the false positives discussed above. We can estimate the likelihood of their occurrence through simian digitising as:
      pc←'K2F6.2,<%>'∘⎕FMT   ⍝ percent
      nsb←192÷256            ⍝ a valid non-start byte
      vsb←nsb×51÷256         ⍝ a start byte following a non-start byte
      pc+ve2←vsb×64÷256      ⍝ a valid two-byte encoding
  3.74%
      pc+ve3←ve2×64÷256      ⍝ a valid three-byte encoding
  0.93%
      pc+ve4←ve3×64÷256      ⍝ a valid four-byte encoding
  0.23%
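The two-byte figure can be sanity-checked by actually employing some simulated monkeys. A sketch; the displayed proportion will vary from run to run around the 3.74% computed above:

      b←¯1+?1e6⍴256                    ⍝ a million random bytes
      n←b≤191                          ⍝ non-start bytes 00-BF
      s←(b≥194)∧b≤244                  ⍝ start bytes C2-F4
      c←(b≥128)∧b≤191                  ⍝ continuation bytes 80-BF
      pc(+/(¯2↓n)∧(1↓¯1↓s)∧2↓c)÷1e6    ⍝ positions starting a valid two-byte pattern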
In fact the actual likelihood is probably much less even than this given that in ANSI a start byte is one of "ÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô" and a continuation byte one of " ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿".
We need an extended-Latin letter or the multiplication sign followed by one of a fairly obscure collection of symbols that includes the diaeresis and the macron.
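Those two lists can be derived in the session. The continuation list shows only the printable bytes A0-BF, bytes 80-9F being control characters in Latin-1; and note the invisible soft hyphen (AD) lurking between ¬ and ®:

      ⎕UCS 193+⍳51            ⍝ start bytes C2-F4 as ANSI characters
ÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô
      ⎕UCS 159+⍳32            ⍝ printable continuation bytes A0-BF
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿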
an exception
We can demonstrate the possibility by imagining an Anglo-Saxon APLer writing the following:
a Ð¨×¯2+ð¨¨¨w
where capital eth 'Ð' and small eth 'ð' are functions and 'a' and 'w' are arrays.
      ⍉2⊥⍣¯1 ⎕UCS'a Ð¨×¯2+ð¨¨¨w'
0 1 1 0 0 0 0 1
0 0 1 0 0 0 0 0
1 1 0 1 0 0 0 0
1 0 1 0 1 0 0 0
1 1 0 1 0 1 1 1
1 0 1 0 1 1 1 1
0 0 1 1 0 0 1 0
0 0 1 0 1 0 1 1
1 1 1 1 0 0 0 0
1 0 1 0 1 0 0 0
1 0 1 0 1 0 0 0
1 0 1 0 1 0 0 0
0 1 1 1 0 1 1 1
The 'Ð¨', the '×¯' and the 'ð¨¨¨' are valid 2-, 2- and 4-byte encodings, of 'Ш', 'ׯ' and '𨨨' respectively. Strangely, in Firefox the second appears after the 2: it is a right-to-left Hebrew character, and bidirectional rendering moves it!
      ⎕UCS ⎕←'UTF-8'⎕UCS ⎕UCS'a Ð¨×¯2+ð¨¨¨w'
a Шׯ2+𨨨w
97 32 1064 1519 50 43 166440 119
In sum, we can easily verify that a file is not a valid UTF-8 encoding. But even if it is valid it may not have been intended as such, and many simple text files written in English will be identical whether encoded in UTF-8 or ANSI.
a solution
The Dyalog dyadic system function ⎕UCS with a left argument of 'UTF-8' encodes Unicode text into a UTF-8 integer vector and decodes such an integer vector back into Unicode text.
It does all the validation required to ascertain that it is dealing with a valid UTF-8 encoding and gives DOMAIN ERROR if not. This makes our task a lot easier. Here's an attempt at a more or less fully validating function:
cu←{
    {
        ⍵↓⍨(⊃⍵)=⎕UCS 65279  ⍝ remove byte order mark if present
    }⍵{                     ⍝ input text
        ∧/⍵<194:⍺           ⍝ no start bytes
        ∨/⍵>244:⍺           ⍝ beyond max
        ⍝ start not followed by continuation
        ∨/((⍵>193)∧⍵<245)>1↓((⍵>127)∧⍵<192),0:⍺
        ⍝ continuation not preceded by start or continuation
        ∨/((⍵>127)∧⍵<192)>¯1↓1,(⍵>127)∧⍵<245:⍺
        ⍝ let ⎕UCS test for start followed by wrong number of continuations
        11::⍺               ⍝ Return text if invalid UTF-8
        'UTF-8'⎕UCS ⍵       ⍝ decode integers to text
    }⎕UCS ⍵                 ⍝ convert to integers
 ⍝ convert UTF-8
 ⍝ ⍵ text string, bytes - UTF-8 or ANSI
 ⍝ ← clear text
}
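A sketch of it in use, on the Anglo-Saxon line from above and on a string that is valid ANSI but not valid UTF-8:

      cu 'a Ð¨×¯2+ð¨¨¨w'     ⍝ the bytes happen to be valid UTF-8, so they are decoded
a Шׯ2+𨨨w
      cu 'naïve'             ⍝ ï (EF) is a start byte with no continuations: returned as given
naïve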
It turns out on timings that the first two tests are definitely worth doing. Being simple scalar comparisons, they avoid the range-membership testing that ⎕UCS would inevitably do again if we got that far.
They add some 10-20% to the time to decode a true UTF-8 encoding, while avoiding the time that ⎕UCS would take to check an invalid one.
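Such timings can be reproduced with a simple harness along these lines (a sketch, not the original measurements; ⎕AI[2] is the accumulated compute time in milliseconds, and 'big' is a hypothetical test string):

      time←{t←2⊃⎕AI ⋄ z←⍺⍺ ⍵ ⋄ ⎕←(2⊃⎕AI)-t ⋄ z}  ⍝ show ms taken to apply a monadic function
      big←13000⍴'a Ð¨×¯2+ð¨¨¨w'                  ⍝ a large string whose bytes are valid UTF-8
      z←cu time big                              ⍝ prints the time; z gets the decoded text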
Nevertheless, the more complicated checks for membership of the start and continuation ranges add more time than they can possibly save. So here's the thing:
cu←{
    {
        ⍵↓⍨(⊃⍵)=⎕UCS 65279  ⍝ remove byte order mark if present
    }⍵{                     ⍝ input text
        ∧/⍵<194:⍺           ⍝ no start bytes
        ∨/⍵>244:⍺           ⍝ beyond max
        ⍝ let ⎕UCS check the rest
        11::⍺               ⍝ Return text if invalid UTF-8
        'UTF-8'⎕UCS ⍵       ⍝ decode integers to text
    }⎕UCS ⍵                 ⍝ convert to integers
 ⍝ convert UTF-8
 ⍝ ⍵ text string, bytes - UTF-8 or ANSI
 ⍝ ← clear text
}
-- PhilLast 2009-12-07 16:22:04