Unicode for APLers

History

ASCII

Originally, the English characters were encoded in the first seven bits of a byte. That gives 128 possible values for encoding something we call a character. The first 32 positions were reserved for so-called control characters (e.g. tab, bell, form feed); the rest was used for the English letters, digits and punctuation.

The last 128 positions with the values 128-255 were free and often used for something else, for example Greek symbols or APL symbols. Because Greek and APL together exceed 128 characters, you could see either of them but not both at the same time. Note also that you need to know what the upper 128 characters are meant to be.

So people started to use these upper 128 positions for whatever purpose they needed.

ANSI

To sort the mess out, ANSI was introduced. In ANSI, code pages were used to define what the positions 128-255 are good for. Because code pages were at least well defined, it got easier to deal with the details. There was also a mechanism available to change code pages on the fly, and even multilanguage code pages were introduced. But never could one display Greek and APL characters at the same time in the same place.

However, when it came to Asian languages with potentially several thousands of characters, 256 possible combinations were definitely not enough anyway.

The Final Step: Unicode

Unicode does not define how characters look or how they are stored; it associates each character with a code point (a number). This is a theoretical concept, an idea. It has absolutely nothing to do with bits and bytes and fonts. Before we look at the bits and bytes and fonts, we need to get clear about the idea.

All the symbols accepted as being part of "the" Unicode definition finally get a particular number, a single code point, usually written in hexadecimal. For example, the character uppercase A (65) is Unicode 0041, written as U+0041. If you know the hex code, you can enter the letter in Word. Try it: enter 0041 and then press Alt+X! This is also possible in an edit field of a dialog created with Basic.

This is what the string "APL" looks like in Unicode hex codes:

U+0041 U+0050 U+004C
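
If you want to check code points yourself outside of Word, any language that exposes them will do. Here is a minimal Python sketch (Python is used purely for illustration; nothing here is APL-specific):

# Print the Unicode code point of each character in "APL"
for c in "APL":
    print("U+%04X" % ord(c))    # ord returns the code point as a plain number
# prints U+0041, U+0050 and U+004C, one per line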

There are websites which allow you to enter, for example, APL code and get back the Unicode hex codes.

Note that what counts as a letter can be a difficult question. Is, for example, the German ß a letter in its own right or just a funny way of writing "sz"? In some languages letters change their shape when located at the end of a word. Are they different letters then, depending on their position? You probably know that a few APL characters were taken from the Greek alphabet, for example the letter rho. Although it looks the same in Greek and in APL, it is nevertheless considered a different character: the Greek small rho is U+03C1, while the APL rho got its own code point, U+2374.

However, all that has been discussed and decided by the Unicode Consortium already, so we can save some years and focus on encoding. Unfortunately, they got some APL characters wrong. For details look at:

http://tinyurl.com/2yro69

Encoding is how Unicode is implemented. A code point might be encoded in one, two or four bytes, or even more than that. An encoding may take the same number of bytes for every character, or take a more dynamic approach and use a different number of bytes depending on the character. But keep in mind that this has nothing to do with Unicode itself; it is a matter of how Unicode is implemented. It is still believed by some people that Unicode is a two-byte thing that allows you to define 65536 characters. That is simply wrong. In fact Unicode already contains more than 65536 characters.
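
A small Python illustration of this point: the number of bytes per character is a property of the chosen encoding, not of Unicode itself, and code points beyond U+FFFF exist and are handled fine.

# Byte counts per character depend entirely on the encoding:
for c in ("A", "ř", "\U0001D539"):     # U+0041, U+0159, U+1D539 (beyond U+FFFF)
    sizes = [len(c.encode(e)) for e in ("utf-8", "utf-16-le", "utf-32-le")]
    print("U+%04X" % ord(c), sizes)
# U+0041 [1, 2, 4]
# U+0159 [2, 2, 4]
# U+1D539 [4, 4, 4]    (in UTF-16 this one needs a surrogate pair)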

UCS-2

One of the earliest implementations of Unicode was called UCS-2 (the fixed-width predecessor of UTF-16) due to the fact that 2 bytes were used for the encoding. That was when people started to believe that Unicode was a two-byte definition, because they confused the general concept of Unicode with a particular implementation.

Although UCS-2 has been used internally by Windows for many years, it was not really successful due to some built-in drawbacks. First of all, it uses two bytes for everything. That means that a document containing nothing but English characters occupies twice the space of the ANSI version, without any advantage. Furthermore, UCS-2 cannot deal with characters outside the first 65536 Unicode positions. More importantly, there is a compatibility issue: with UCS-2, plain ANSI files cannot be displayed correctly.

UTF-8

Things changed for the better when the brilliant idea of UTF-8 was introduced. In the UTF-8 encoding, all ASCII positions remain the same, and they are encoded in a single byte. For the first 128 code points there is literally no difference between ASCII and UTF-8.
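
A quick Python sketch of this compatibility (illustrative only):

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8:
print("APL".encode("ascii") == "APL".encode("utf-8"))   # True
# Characters outside ASCII need two or more bytes:
print("ř".encode("utf-8"))                               # b'\xc5\x99'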

Although this has strong advantages, it has a disadvantage too: navigating within a string gets messy, because some characters need one byte, others two or even more (the current definition of UTF-8 allows up to four bytes per character; the original design went up to six). That means you are in trouble when you want to move "three letters backwards" from a particular position inside a string, because you do not know what that means in terms of bytes.

A similar problem occurs with length: imagine a file that contains the string "Dvořak" in UTF-8 encoding. If the contents of the file are read, which length will be reported – 6? 7? 8? 9?

Software needs to be prepared to deal properly with those problems.
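
The answer depends on what is counted, as this small Python sketch shows (the BOM case is included because many Windows tools write one):

import codecs
s = "Dvořak"
print(len(s))                                    # 6  - characters (code points)
print(len(s.encode("utf-8")))                    # 7  - bytes; "ř" takes two
print(len(codecs.BOM_UTF8 + s.encode("utf-8")))  # 10 - bytes on disk with a UTF-8 BOM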

Why Unicode After All?

Given that Unicode adds some complexity to the originally simple task of dealing with characters, why is almost everybody moving towards Unicode then?

As soon as the mess is sorted out, Unicode, and especially UTF-8 because of its backward compatibility with ASCII, offers an easy way to deal with any character and therefore any language. Internationalisation, a very difficult goal, is getting a bit easier. Software able to deal properly with UTF-8 will keep the programmer in her comfort zone. For example, Dyalog Version 12 is a true Unicode APL. Therefore, the byte combination that would result in

Dvořak

in an APL variable would still have a shape of 6, because the internal representation, which is different, is kept hidden from the APL programmer.

If you enter 0159 into a Word document and then press Alt+X, you will get the "ř" character. If you copy this letter to the clipboard and then paste it into a session window of Dyalog Version 12, the character will become visible. Of course this is true only if a Unicode font is used, and "APL385 Unicode" is such a font.

However, if you execute

⎕AV⍳'ř'

you get 257 as the result, even in Version 12. This is because ⎕AV has not changed: it is still there, its length is still 256, and ř is not contained in ⎕AV. In other words, Dyalog can deal with characters not contained in ⎕AV.

For APLers, UTF-8 is a dream because we can be sure that all the hassle with APL characters will disappear.

Well, almost. There are still some strange effects. For example, Internet Explorer 7, which is a much better browser than version 6 was, might or might not display UTF-8 characters correctly, and nobody understands why. (Don't worry, on most machines it works just fine.)

Misc

BOM: Byte Order Mark

The original intention of a BOM was to identify a file as Unicode and to settle byte-order issues, or simply to identify UTF-8, where byte order plays no role. Some applications, including Microsoft Notepad, add a BOM even to files which did not contain one in the first place.

This list was taken from the MSDN:

EF BB BF      UTF-8
FF FE         UTF-16, little endian
FE FF         UTF-16, big endian
FF FE 00 00   UTF-32, little endian
00 00 FE FF   UTF-32, big endian

Displaying a UTF-8 file in hex would therefore possibly, but not necessarily, show the sequence

EF BB BF

in the first 3 bytes.
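
A minimal Python sketch of such BOM sniffing (the function name detect_bom is made up for this example):

import codecs

def detect_bom(data: bytes) -> str:
    """Return the encoding suggested by a leading BOM, or 'unknown'."""
    # Check the four-byte marks first: FF FE is also the prefix of FF FE 00 00.
    if data.startswith(codecs.BOM_UTF32_LE): return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE): return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):     return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE): return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE): return "utf-16-be"
    return "unknown"

print(detect_bom(b"\xef\xbb\xbfDvo\xc5\x99ak"))   # utf-8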

Input Method Editor (IME)

An IME is a program or operating system component that allows computer users to enter characters and symbols not found on their keyboard, such as Chinese, Japanese or APL characters. One can specify a keyboard shortcut to switch easily between different keyboard layouts.

Unfortunately, this is not implemented consistently. Some software packages might refuse to pass a particular keyboard combination on to the IME and use it as an internal shortcut instead. In other words, in bad implementations the combination Ctrl+S with an APL IME might save the document instead of producing the APL upstile character. However, things are getting better in this area: many applications now temporarily remap shortcuts occupied by the active IME to other keystrokes as soon as they recognize the problem.

With Version 12, Dyalog provides the IME mechanism as the standard input method. You first must tell Windows which keyboard layout to use. For this, run "Start / Settings / Regional and Language Options", select the "Languages" tab and then press the "Details" button (on XP). There you see your default keyboard definition and the "Dyalog APL 12 Keyboard". It is recommended to define a shortcut for an easy change between the different layouts.

By the way: did you ever reboot your machine because the keyboard went crazy? If you are working on a non-English system you are likely to encounter this problem. This is because by default Microsoft defines keyboard shortcuts to change between the English keyboard and your own. APLers especially are in danger of hitting such a shortcut accidentally. Changing to an English layout on a non-English keyboard does not make much sense, so I suggest deactivating these shortcuts.

APL385 Unicode

This font is a true monospaced Unicode font. It is ideal for displaying APL characters in a UTF-8 context. Prior to Version 12, Dyalog itself was partly able to deal with Unicode but did not expose Unicode to the APL programmer. Consequently, it still used a non-Unicode font ("Dyalog Std", "Dyalog Alt" or "Causeway") in the session. But if you save a script to file and then use a tool like UltraEdit to edit this file, it can be displayed correctly only with "APL385 Unicode", because the file is a true Unicode file. Understandably, that caused quite some confusion.

Let us be clear about this: one cannot display most non-English characters in a Unicode file with "Dyalog Std", "Dyalog Alt" or "Causeway"; that is simply not possible. Because Version 12 of Dyalog is a true Unicode APL, it will use the "APL385 Unicode" font for the session as well, and the confusion will disappear.

Courier New, Arial And Their Siblings

Interestingly enough, you will also see APL characters if one of the standard Windows fonts is used. That might look ugly, especially with a non-monospaced font like Arial, but you can see the APL characters. Good news, but why is this? It is simply because these fonts contain many, though by far not all, of the Unicode characters. Luckily, most if not all of the APL glyphs are among them.

Important Remark

Still, you need to know the encoding of a file. Even if a file starts with the three bytes of a UTF-8 BOM, there is a chance that it is not a UTF-8 file but an image file or an exe that just happens to hold these three bytes. If the file does not contain a BOM at all, there is no way to predict its encoding: you simply have to know it. Of course the BOM is a reasonable source of identification if the file is a document with an extension like TXT or HTML.

HTML Specialties

An HTML page, however, can contain APL characters even if the page is not UTF-8 encoded. There is a special syntax available in HTML to achieve that. It is called character entity references, HTML entities or named entities.

Everybody who has come in touch with HTML knows this, because this special syntax is the only way to display the less-than and greater-than characters in an HTML page, since these characters are normally used to open and close the so-called HTML tags.

Therefore, this line:

&lt;APL is <b>very</b> nice indeed!&gt;

displays:

<APL is very nice indeed!>

This special syntax can also be used to display any Unicode character, and that works even if the HTML page declares itself as Western-encoded, or refuses to declare any encoding information at all. The only thing you need to know is the value of the character you want to show up; see UnicodeAplTable for a listing. For example, &#x2374; displays the APL rho (⍴). However, there is a performance penalty for using HTML entities. See the examples; they are all simple HTML pages illustrating different aspects:

Contains Unicode chars and declares itself as Unicode

Contains Unicode chars but declares itself as ISO-8859-1

Contains HTML entities and no encoding information at all

GrahamSteer used this approach to allow APLWin users to send APL code, although APLWin is not Unicode capable. See AplPlusWinToUnicode for details.
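
The idea behind such a conversion can be sketched in a few lines of Python (a rough illustration only, not GrahamSteer's actual code; the function name is made up):

def to_entities(s: str) -> str:
    """Replace every non-ASCII character by a numeric character reference."""
    return "".join(c if ord(c) < 128 else "&#x%X;" % ord(c) for c in s)

print(to_entities("shape←⍴matrix"))   # shape&#x2190;&#x2374;matrix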

HTML entities cannot be used in the APL wiki because you cannot insert pure HTML for security reasons. To demonstrate different encodings and HTML entities, see the attached HTML files.

The Future

With Version 12, Dyalog is becoming truly Unicode capable. With that, Dyalog APL, APL2 and APLX are all truly Unicode capable. Given that browsers, e-mail clients and news readers are configured correctly, this should allow us to exchange APL code in all sorts of ways, and to paste it into the session of any of these interpreters and execute it without a problem.

Written by Kai Jaeger - 2008-12-19


CategoryArticles CategoryUnicode