Unicode for APLers

History

ASCII

Originally, the English characters were encoded in the first seven bits of a byte. That gives 128 different values with which to encode what we call characters. 32 so-called control characters (e.g. tab, bell, form feed) occupy the first positions; the rest was used for the English letters, digits and punctuation.

The last 128 positions with the values 128-255 were free and often used for something else, for example, Greek symbols or APL symbols. Because Greek and APL together exceed 128 characters this meant you could see either of them but not both at the same time. Note also that you need to know what the upper 128 characters are meant to be.

So people started to use these upper 128 positions for whatever purpose they needed.

ANSI

To sort the mess out, ANSI was introduced. With ANSI, code pages were used to define what the positions 128-255 are good for. Because code pages were well defined, it became easier to deal with the details. There was also a mechanism to change code pages on the fly, and even multi-language code pages were introduced. But never could one display Greek and APL characters at the same time in the same place.

However, when it came to Asian languages with potentially several thousand characters, 256 possible combinations were definitely not enough anyway.

The Final Step: Unicode

Unicode does not define bits, bytes or fonts; it associates characters with code points (numbers). This is a theoretical concept, an idea. Before we look at the bits and bytes and fonts, we need to get clear about the idea.

All the symbols accepted as being part of "the" Unicode definition finally get a particular number, a single code point, usually written in hexadecimal notation. For example, the character uppercase A (decimal 65) is Unicode hex 0041, written as U+0041. If you know the hex code you can enter the letter in Word. Try it: enter 0041 and then press Alt+X! This also works in an edit field of a dialog created with Basic. Note that although this works in Word and WordPad, it is unlikely to work elsewhere: this is not a general but an application-specific way of entering Unicode characters.

This is what the string "APL" looks like in Unicode hex codes:

U+0041 U+0050 U+004C
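You can verify such code points in any Unicode APL; a minimal session sketch, assuming Dyalog APL 12 or later, where the system function ⎕UCS converts between characters and code points:

      ⎕UCS 'APL'       ⍝ characters to decimal code points
65 80 76
      ⎕UCS 65 80 76    ⍝ and back again
APL

⎕UCS works in decimal: 65 80 76 are just U+0041 U+0050 U+004C written in base 10.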

The decimal Unicode code point of the letter A is, however, 65. When you know the decimal code point of a character you can enter it in any application under Windows that supports the IME (discussed below) by holding down the Alt key and then typing the digits, although you can omit leading zeros.

Note that the definition of what a letter is gets tricky with Unicode. For example, is the German ß a letter in its own right or just a funny way of writing "sz"? In some languages letters change their shape if located at the end of a word. Are they different letters then, depending on their position? You probably know that a few APL characters were taken from the Greek alphabet, for example the letter rho. Although the thing looks the same in Greek and in APL it is nevertheless considered to be a different letter in Greek and APL.

However, that is all discussed and decided by the Unicode Consortium already, so we can save some years and focus on encoding. Unfortunately, they got some APL characters wrong. For details look at:

http://tinyurl.com/2yro69

Encoding is about how Unicode is implemented. A code point might be encoded in one, two or four bytes, or even more than that. An encoding may take the same number of bytes for all letters, or take a dynamic approach, using a potentially different number of bytes per letter. But keep in mind that this has nothing to do with Unicode itself; it is a matter of how Unicode is implemented. Some people still believe that Unicode is a two-byte thing that allows you to define 65536 characters. Well, that's simply wrong. In fact the official Unicode definition already contains more than 65536 code points.
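You can convince yourself that code points beyond 65536 exist; this sketch assumes a Unicode edition of Dyalog and a font that contains the glyph:

      ⎕UCS 120120    ⍝ U+1D538, the double-struck capital A, well beyond 65536
𝔸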

UCS-2

One of the earliest implementations of Unicode was called UCS-2 (the precursor of UTF-16) due to the fact that 2 bytes were used for the encoding. That was when people started to believe that Unicode was a two-byte definition, because they confused the general concept of Unicode with a particular implementation.

Although UCS-2 has been used internally by Windows for many years, it was not really successful due to some built-in drawbacks. First of all, it uses two bytes for everything. That means that a document that contains nothing but English characters occupies twice the space of the ANSI version, without any advantage. Furthermore, UCS-2 cannot deal with characters not included in the first 65536 Unicode positions. More importantly, there is a compatibility issue: with UCS-2, simple ANSI files cannot be displayed correctly.

UTF-8

Things changed for the better when the brilliant idea of UTF-8 was introduced. In the UTF-8 encoding, all ASCII positions remain the same, and they are encoded in a single byte. For the first 128 code points there is literally no difference between ASCII and UTF-8.

If the top bit of the first byte of a UTF-8 encoded character is set (value > 127) it indicates that the Unicode character encoding requires at least two bytes, and possibly more. The number of bytes required is determined by examining the bit pattern of the first byte. Each byte is a combination of some bits of character data together with top bits which indicate its type (first character of two byte sequence, first character of three byte sequence, second or subsequent character, etc).

That means that a string consisting entirely of ASCII characters can be encoded as UTF-8 and still looks the same - it is already a valid UTF-8 string - and it does not occupy more space either. Therefore one can say that UTF-8 is a superset of ASCII.
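This, and the byte layout described above, are easy to check. With a left argument of 'UTF-8', Dyalog's ⎕UCS (version 12 on) converts between characters and the bytes of their UTF-8 encoding:

      'UTF-8' ⎕UCS 'APL'    ⍝ the UTF-8 bytes of pure ASCII: identical to the code points
65 80 76
      'UTF-8' ⎕UCS 'ř'      ⍝ a non-ASCII character takes two bytes
197 153
      ⍉(8⍴2)⊤197 153        ⍝ the bit patterns: 110... opens a two-byte sequence, 10... continues it
1 1 0 0 0 1 0 1
1 0 0 1 1 0 0 1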

Although this has strong advantages, it has a disadvantage too: navigating within a string gets messy, because some characters need one byte, others two or even more; in UTF-8 up to four bytes (up to six in the original design). That means you are in trouble when you want to go from a particular position inside a string "three letters backwards", because you cannot know what that means in terms of bytes without examining the bit patterns carefully.

A similar problem occurs in terms of length: Imagine a file that contains the string "Dvořak" in UTF-8 encoding. If the contents of the file is read, which length will be reported – 6? 7? 8? 9?

Software needs to be prepared to deal properly with those problems.
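In a Unicode APL the difference is easy to demonstrate; a short Dyalog session, again using ⎕UCS:

      ⍴ 'Dvořak'                ⍝ the number of characters
6
      ⍴ 'UTF-8' ⎕UCS 'Dvořak'   ⍝ the number of bytes in the UTF-8 encoding
7

Which of the candidate lengths a program reports depends on whether it counts characters or bytes, and on whether things like a BOM or line endings are included.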

Why Unicode after all?

Given that Unicode adds some complexity to the originally simple task of dealing with characters, why is almost everybody moving towards Unicode then?

As soon as the mess is sorted out, Unicode, and especially UTF-8 because of its downward compatibility, offers an easy way to deal with any characters and therefore any language. Internationalisation, a very difficult goal, is getting a bit easier. Software able to deal properly with UTF-8 will keep the programmer in her comfort zone. For example, Dyalog Version 12 is a true Unicode APL. Therefore, the byte combination that would result in

Dvořak

in an APL variable would still have a shape of 6 because the internal representation, which is different, is kept hidden from the APL programmer.

If you enter 0159 into a Word document and then press Alt+X, you will get the "ř" character. If you copy this letter into the clipboard and then paste it into a session window of Dyalog version 12, this character will become visible. Of course this is true only if a Unicode font is used, and "APL385 Unicode" is such a font.

However, if you execute

⎕av⍳'ř'

you get 257 as result, even in version 12. This is because ⎕av has not changed: it is still there, its length is still 256, and the ř is not contained in ⎕av. In other words, Dyalog can deal with letters not contained in ⎕av.
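In version 12 you can ask for the code point directly; ⎕UCS reports the Unicode value even for characters outside ⎕av:

      ⎕UCS 'ř'    ⍝ its Unicode code point: hex 0159 = decimal 345
345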

For APLers, UTF-8 is a dream because we can be sure then that all the hassle with APL characters will disappear.

Well, almost. There are still some strange effects. For example, Internet Explorer 7, which is a much better browser than version 6 was, might or might not display UTF-8 characters correctly, and nobody understands why this is. (Don't worry, on most machines it works just fine.)

Misc

BOM: Byte Order Mark

The original intention of a BOM was to identify a file as Unicode, to deal with byte order issues, or to identify UTF-8. Some applications, including Microsoft Notepad, add a BOM even to files which did not contain one in the first place.

This list was taken from the MSDN:

EF BB BF      UTF-8
FF FE         UTF-16, little endian
FE FF         UTF-16, big endian
FF FE 00 00   UTF-32, little endian
00 00 FE FF   UTF-32, big endian

Displaying a UTF-8 file in HEX would therefore possibly but not necessarily show the sequence

EF BB BF

in the first 3 bytes, although it shouldn't: UTF-8 does not need a BOM, and the Unicode standard recommends against one. If a file carries one anyway, some applications will fail.

The concept of a BOM was introduced because nobody can tell what kind of encoding is used in a file. You can only guess, although it might be an informed guess. Of course the BOM doesn't really settle the question either, because a file might carry those bytes just by accident.
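If you want to check for a UTF-8 BOM programmatically, here is a minimal sketch using Dyalog's native file functions; the file name is made up and error handling is omitted:

      tn←'example.txt' ⎕NTIE 0    ⍝ tie the file ('example.txt' is a hypothetical name)
      bytes←256|⎕NREAD tn 83 3    ⍝ read the first 3 bytes; 83 = 8-bit signed, 256| maps them to 0..255
      bytes≡239 187 191           ⍝ 1 = the file starts with the UTF-8 BOM (hex EF BB BF)
1
      ⎕NUNTIE tn                  ⍝ untie the file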

Input Method Editor (IME)

An IME is a program or operating system component that allows computer users to enter characters and symbols not found on their keyboard, such as Chinese, Japanese or APL characters. One can specify a keyboard shortcut to switch easily between different keyboard layouts.

Unfortunately, this is not implemented consistently. Some software packages might refuse to pass a particular keyboard combination to the IME and use it as an internal shortcut instead. In other words, in bad implementations the combination Ctrl+S with an APL IME might save the document instead of producing the APL Upstile character. However, things are getting better in this area. Many applications are now switching shortcuts occupied by a certain IME to other keystrokes temporarily as soon as they recognize the problem.

With version 12, Dyalog provides the IME mechanism as the standard input method. You first must tell Windows which keyboard layout to use. For this, run "Start / Settings / Regional and Language Options", select the "Languages" tab and then press the "Details" button (on XP). There you see your default keyboard definition and the "Dyalog APL 12 Keyboard". It is recommended to define a shortcut for an easy change between the different layouts.

By the way: did you ever reboot your machine because the keyboard went crazy? If you are working on a non-English system you are likely to come across this problem once in a while. This is because by default Microsoft defines keyboard shortcuts to switch between the English keyboard and your own. APLers especially are in danger of hitting that shortcut accidentally. Changing to an English layout on a non-English keyboard does not make much sense, so I suggest deactivating those shortcuts.

APL385 Unicode

This font is a true monospace Unicode font, ideal for displaying APL characters in a UTF-8 context. Prior to version 12, Dyalog itself was partly able to deal with Unicode but did not expose Unicode to the APL programmer. Consequently, it still used a non-Unicode font ("Dyalog Std", "Dyalog Alt" or "Causeway") in the session. But if you save a script to a file and then use a tool like UltraEdit to edit this file, it can be displayed correctly only with "APL385 Unicode", because the file is a true Unicode file. Understandably, that caused quite some confusion.

Let us be clear about this: one cannot display most non-English characters in a Unicode file with "Dyalog Std", "Dyalog Alt" or "Causeway" – that is simply not possible. Because version 12 of Dyalog is a true Unicode APL, it uses the "APL385 Unicode" font for the session as well – the confusion will then disappear.

GrahamSteer used this approach to allow any APLer to send and receive APL code, even though their interpreters are not Unicode capable. Whilst the functions are written with APL+WIN they should be easily customised for any interpreter. See AplToUnicode for details.

Courier New, Arial and their Siblings

Interestingly enough, you will also see APL characters if one of the standard Windows fonts is used. That might look ugly, especially when a non-monospaced font like Arial is used, but you can see the APL characters. Good news, but why is this? It is simply because these fonts contain, by far not all, but many of the Unicode characters. Luckily, most if not all of the APL glyphs are among them.

Or so it looks. Despite the common belief that APL is "Greek", a large group of the APL characters are actually standard maths symbols like ÷×∧∨≠≥≤⊂⊃∩∪*≡. Naturally they all show up in any good Unicode font. Only a small set of APL characters actually stems from Greek (⍺⍵⍴∊⍳), while a much larger group is "made up", like ⍒⍋⌽⍉⊖⍫∇⌿⍀⍝⍎⍕⍞⎕¯⍟⍣←→≢⍠⌹⌈⌊⌶⊢⊣⍥↑↓⍨. There is also a group of characters with Greek ancestry: ⍷⍸. Finally, many APL characters are ordinary ASCII characters: |/\~!=+-_<>. These lists are certainly not complete and depend on the dialect anyway.

Important Remark

Still, you need to know about the encoding of a file. Even if a file carries a BOM in the first 3 bytes, there is a chance that it is not a UTF-8 file but an image file or an exe that happens to hold these three bytes. If the file does not contain a BOM, well, there is no way to predict its encoding: you have to know. Of course the BOM is a reasonable source of identification if it is a document with an extension like TXT or HTML.

HTML Specialties

An HTML page, however, can contain APL characters even if the page is not UTF-8 encoded. There is a special syntax in HTML to achieve that, called character entity references, HTML entities or named entities.

Everybody who has come in touch with HTML knows this, because this special syntax is the only way to display the less-than and greater-than characters in an HTML page, since these characters are normally used to open and close the so-called HTML tags.

Therefore, this line:

&lt;APL is <b>very</b> nice indeed!&gt;

displays:

<APL is very nice indeed!>

This special syntax can also be used to display any Unicode character, and that works even if the HTML page declares itself as Western-encoded, or does not declare any encoding information at all. The only thing you need to know is the value of each character you want to show. See UnicodeAplTable for a listing. However, there is a performance penalty for using HTML entities; see the examples, which are all simple HTML pages illustrating different aspects.

Contains Unicode chars and declares itself as Unicode

Contains Unicode chars but declares itself as ISO-8859-1

Contains HTML entities and no encoding information at all
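Should you ever need such numeric entities, they are easy to compute from the code points; a throwaway Dyalog expression (not part of the original examples):

      ∊{'&#',(⍕⎕UCS ⍵),';'}¨'⍳⍴'    ⍝ build the numeric HTML entity for each glyph
&#9075;&#9076;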

HTML entities cannot be used in the APL wiki because you cannot insert pure HTML for security reasons. To demonstrate different encodings and HTML entities see the attached HTML files. Worse, you can't copy such characters into the clipboard and then insert them into any session, Unicode capable or not.

In short: don't even consider using this to display APL characters, or any characters at all apart from & (&amp;), < (&lt;) and > (&gt;)!

The Future

With version 12, Dyalog becomes truly Unicode capable; Dyalog APL, APL2 and APLX are then all truly Unicode capable. Given that browsers, e-mail clients and news readers are configured correctly, this should allow us to exchange APL code in all sorts of ways, and to paste it into the session of any of these interpreters and execute it without a problem.

Written by Kai Jaeger - 2008-12-19

Update 2012-07-18

3.5 years later it's time for an update: things got much better!

UTF-8 is used everywhere. Most applications deal properly with IME-related tasks. Best of all: the new IME that came along with Dyalog version 13.0 is a big improvement over earlier versions. Note that this version of the IME can also be used with earlier versions of Dyalog (12.0 and 12.1) and can be downloaded independently from the Dyalog web site. It can even be used with other APLs! Well, they have to be true Unicode APLs.

Font embedding is now common and not only used by the APL wiki but also by Vector, SIGAPL and APL Germany to name just a few.

The future is bright!


CategoryArticles CategoryUnicode
