Unicode for APLers

Note that this article was written back in 2008. That's why it refers to XP and version 12 as the current versions of Windows and Dyalog.

However, most of the information given here is still accurate, and there is also an update at the bottom of the article.

History

ASCII

Originally, the English characters were encoded in the first seven bits of a byte. That means 128 different positions to encode something we call a character. 32 so-called control characters (e.g. tab, bell, form feed) were kept in the first positions; the rest was used for the English characters, numbers and punctuation.

The last 128 positions with the values 128-255 were free and often used for something else, for example, Greek symbols or APL symbols.

Because Greek and APL together exceed 128 characters this meant you could see either of them but not both at the same time. Note also that you need to know what the upper 128 characters are meant to be.

So people started to use these upper 128 positions for whatever purpose they needed.

ANSI

To sort the mess out, ANSI was introduced. In ANSI, code pages were used to define what the positions 128-255 are good for. Because code pages were at least well defined, it became easier to deal with the details.

There was also a mechanism available to change code pages on the fly. There were even multi-language code pages introduced. But never ever could one display Greek and APL characters at the same time at the same place.

However, when it came to Asian languages with potentially several thousands of characters, 256 possible combinations were definitely not enough anyway.

The Final Step: Unicode

Unicode does not define characters as bit patterns; it associates them with code points (numbers). This is a theoretical concept, an idea. It has absolutely nothing to do with bits and bytes and fonts. Before we look at the bits and bytes and fonts, we need to get clear about the idea.

All the symbols accepted as being part of "the" Unicode definition finally get a particular number, a single code point, usually written in hexadecimal notation. For example, the character uppercase A (decimal 65) has the Unicode hex value 0041. It is written as U+0041.

If you know the hex code you can enter the letter in Word. Try it: enter 0041 and then press Alt+X! This is also possible in an edit field of a dialog created with Basic.

Note that although this works in Word and Wordpad it is unlikely to work elsewhere: this is not a general but an application specific way of entering Unicode characters.

This is what the string "APL" looks like in Unicode hex codes:

U+0041 U+0050 U+004C
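If you have a Unicode-capable Dyalog (version 12) at hand, you can verify this with the system function ⎕UCS, which converts characters to their code points (shown in decimal) and back:

      ⎕UCS 'APL'
65 80 76
      ⎕UCS 65 80 76
APL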

In decimal notation the Unicode code point of the letter A is 65, as the example above shows. When you know the decimal code point of a character, you can enter it in any application under Windows that supports the IME (discussed below) by holding down the Alt key and then typing the digits, although you can omit leading zeros.

Note that the definition of what a letter is gets tricky with Unicode. For example, is the German ß a letter in its own right or just a funny way of writing "sz"? In some languages letters change their shape if located at the end of a word. Are they different letters then, depending on their position? You probably know that a few APL characters were taken from the Greek alphabet, for example the letter ⍴ (rho).

But note that although the symbol looks the same in Greek and in APL, it is nevertheless considered to be a different letter in Greek and in APL.

However, that has all been discussed and decided by the Unicode Consortium already, so we can save some years and focus on encoding. Unfortunately, they got some APL characters wrong. For details look at:

http://tinyurl.com/2yro69

Encoding is about how Unicode is implemented. A code point might be encoded in one, two or four bytes, or even more than that. An encoding may take the same number of bytes for all characters, or take a dynamic approach, using a potentially different number of bytes for each character.

Keep in mind that this has nothing to do with Unicode; it is a matter of how Unicode is implemented. It is still believed by some people that Unicode is a two-byte-thing that allows you to define 65536 characters. Well, that's simply wrong. In fact the official Unicode definition already contains more than 65536 code points.
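The difference is easy to demonstrate in Dyalog version 12: monadic ⎕UCS reports the code point, a plain number independent of any encoding, while dyadic ⎕UCS with 'UTF-8' as left argument reports the bytes of one particular encoding:

      ⎕UCS '⍴'
9076
      'UTF-8' ⎕UCS '⍴'
226 141 180

One code point (9076, which is hex 2374), but three bytes in UTF-8: code point and encoding are clearly different things.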

UCS-2

One of the earliest implementations of Unicode was called UCS-2 due to the fact that 2 bytes were used for the encoding. (Its successor, UTF-16, extends UCS-2 with so-called surrogate pairs to reach the characters beyond the first 65536.) That was when people started to believe that Unicode was a two-byte definition, because they confused the general concept of Unicode with a particular implementation.

Although UCS-2 has been used internally by Windows for many years, it was not really successful due to some built-in drawbacks. First of all, it uses two bytes for everything. That means that a document that contains nothing but English characters occupies twice the space of the ANSI version, without any advantage.

Furthermore, UCS-2 cannot deal with characters not included in the first 65536 Unicode positions. More importantly, there is a compatibility issue: with UCS-2, simple ANSI files cannot be displayed correctly.

UTF-8

Things changed for the better when the brilliant idea of UTF-8 was introduced. In the UTF-8 encoding, all ASCII positions remain the same, and they are encoded in a single byte. For the first 128 code points there is literally no difference between ASCII and UTF-8.

If the top bit of the first byte of a UTF-8 encoded character is set (value > 127) it indicates that the Unicode character encoding requires at least two bytes, and possibly more.

The number of bytes required is determined by examining the bit pattern of the first byte. Each byte is a combination of some bits of character data together with top bits which indicate its type (first byte of a two-byte sequence, first byte of a three-byte sequence, second or subsequent byte, etc).
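For example, ř (code point 345) needs two bytes: the first starts with the bits 110, announcing a two-byte sequence, the second with 10, marking a continuation byte; the remaining 11 bits hold the code point. Dyalog version 12 can make this visible:

      'UTF-8' ⎕UCS 'ř'
197 153
      ⍉(8⍴2)⊤197 153
1 1 0 0 0 1 0 1
1 0 0 1 1 0 0 1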

That means that a string which consists of nothing but ASCII characters can be encoded to UTF-8 and it will still look the same – it is already a valid UTF-8 string – and it does not occupy more space either. Therefore, one can say that UTF-8 is a superset of ASCII.

Although this has strong advantages, it has a disadvantage, too: navigating within a string gets messy, because some characters need one byte, others two or even more; UTF-8 as defined today needs up to 4 bytes per character (the original design even allowed for up to 6).

That means you are in trouble when you want to go from a particular position inside a string "three letters backwards", because you would not know what that means in terms of bytes without examining the bit patterns carefully.

A similar problem occurs in terms of length: imagine a file that contains the string "Dvořak" in UTF-8 encoding. If the contents of the file are read, which length will be reported – 6? 7? 8? 9?

Software needs to be prepared to deal properly with those problems.
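Dyalog version 12 makes it easy to check the answer: the string has 6 characters but its UTF-8 encoding takes 7 bytes, because ř needs two (a BOM, if present, would add another 3 to a byte-oriented count):

      ⍴'Dvořak'
6
      ⍴'UTF-8' ⎕UCS 'Dvořak'
7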

Why Unicode after all?

Given that Unicode adds some complexity to the originally simple task of dealing with characters, why is almost everybody moving towards Unicode then?

As soon as the mess is sorted out, Unicode, and especially UTF-8 because of its downward compatibility, offers an easy way to deal with any characters and therefore any language. Internationalisation, a very difficult goal, is getting a bit easier.

Software able to deal properly with UTF-8 will keep the programmer in her comfort zone. For example, Dyalog Version 12 is a true Unicode APL. Therefore, the byte combination that would result in

Dvořak

in an APL variable would still have a shape of 6 because the internal representation, which is different, is kept hidden from the APL programmer.
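For illustration, converting those 7 UTF-8 bytes back into characters with ⎕UCS yields a character vector of shape 6:

      'UTF-8' ⎕UCS 68 118 111 197 153 97 107
Dvořak
      ⍴'UTF-8' ⎕UCS 68 118 111 197 153 97 107
6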

If you enter 0159 into a Word document and then press Alt+X, you will get the "ř" character. If you copy this letter to the clipboard and then insert it into a session window of Dyalog version 12, this character will become visible. Of course this is true only if a Unicode font is used, and "APL385 Unicode" is such a font.

However, if you execute

⎕av⍳'ř'

you get 257 as the result, even in Version 12. This is because ⎕av has not changed: it is still there, its length is still 256, and ř is not contained in ⎕av. In other words, Dyalog can deal with letters not contained in ⎕av.
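⎕UCS, on the other hand, handles such a character without any problem, by its code point:

      ⎕UCS 'ř'
345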

For APLers, UTF-8 is a dream because we can be sure then that all the hassle with APL characters will disappear.

Well, almost. There are still some strange effects. For example, Internet Explorer 7, which is a much better browser than version 6 was, might or might not display UTF-8 characters correctly, and nobody understands why. (Don't worry, on most machines it works just fine.)

Misc

BOM: Byte Order Mark

The original intention of a BOM was to identify a file as Unicode, to indicate the byte order, or to identify UTF-8. Some applications, including Microsoft Notepad, add a BOM even to files which did not contain one in the first place.

This list was taken from the MSDN:

EF BB BF      UTF-8
FF FE         UTF-16, little endian
FE FF         UTF-16, big endian
FF FE 00 00   UTF-32, little endian
00 00 FE FF   UTF-32, big endian

Displaying a UTF-8 file in HEX would therefore possibly but not necessarily show the sequence

EF BB BF

in the first 3 bytes, although it shouldn't: the official definition neither requires nor recommends a BOM for UTF-8. If a file carries one anyway, some applications will fail.

The reason why the concept of a BOM was introduced was that nobody can tell what kind of encoding is used in a file. You can only guess, although it might be an informed guess. The BOM does not settle the question, of course, because a file might carry such bytes just by accident.
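If the raw bytes of a file are available as an integer vector, checking for the UTF-8 BOM is a one-liner in APL; the function name HasUtf8Bom is made up for this sketch:

HasUtf8Bom←{239 187 191≡3↑⍵}  ⍝ 1 if the byte vector ⍵ starts with EF BB BF

      HasUtf8Bom 239 187 191 72 105
1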

Input Method Editor (IME)

An IME is a program or operating system component that allows computer users to enter characters and symbols not found on their keyboard, like Chinese, Japanese or APL characters. One can specify a keyboard shortcut to switch easily between different keyboard layouts.

Unfortunately, this is not implemented consistently. Some software packages might refuse to pass a particular keyboard combination to the IME and use it as an internal shortcut instead.

In other words, in bad implementations the combination Ctrl+S with an APL IME might save the document instead of producing the APL upstile character. However, things are getting better in this area. Many applications now temporarily switch shortcuts occupied by a certain IME to other keystrokes as soon as they recognize the problem.

With Version 12, Dyalog provides the IME mechanism as the standard input method. You first must tell Windows which keyboard layout to use. For this run "Start / Settings / Regional and Language Options", select the "Languages" tab and then press the "Details" button (XP). There you see your default keyboard definition and the "Dyalog APL 12 Keyboard". It is recommended to define a shortcut for an easy change between the different layouts.

By the way: did you ever reboot your machine because the keyboard went crazy? If you are working with a non-American keyboard you are likely to come across this problem once in a while. This is because by default Microsoft defines keyboard shortcuts to change between an American keyboard and your own keyboard. APLers especially are in danger of hitting that shortcut accidentally.

Changing to an American keyboard layout you don't actually have does not make much sense. For that reason I strongly suggest deactivating the shortcuts. This will save you some headaches.

APL385-Unicode

This font is a true monospaced Unicode font. It is ideal for displaying APL characters in a UTF-8 context. Prior to version 12, Dyalog itself was partly able to deal with Unicode but did not use Unicode from an APL programmer's point of view.

Let us be clear about this: one cannot display most non-English characters in a Unicode file with "Dyalog Std" or "Dyalog Alt" or "Causeway" – that is simply not possible. Because version 12 of Dyalog is a true Unicode APL, it will use the "APL385 Unicode" font for the session as well – the confusion will disappear then.

GrahamSteer used this approach to allow any APLer to send and receive APL code, even though their interpreters are not Unicode capable. Whilst the functions are written with APL+WIN they should be easily customised for any interpreter. See AplToUnicode for details.

Courier New, Arial and their Siblings

Interestingly enough, you will also see APL characters if one of the standard Windows fonts is used. That might look ugly, especially when a non-monospaced font like Arial is used, but you can see the APL characters. Good news, but why is this? It is simply because these fonts contain many (though by far not all) of the Unicode characters. Luckily, most if not all of the APL glyphs are among them.

Or so it looks. Despite common belief that APL is "Greek", a large group of the APL characters are actually standard maths symbols like ÷×∧∨≠≥≤⊂⊃∩∪*≡. Naturally they all show up in any good Unicode font. Only a small set of the APL characters actually stems from Greek (⍺⍵⍴∊⍳), while a much larger group is "made up", like ⍒⍋⌽⍉⊖⍫∇⌿⍀⍝⍎⍕⍞⎕¯⍟⍣←→≢⍠⌹⌈⌊⌶⊢⊣⍥⍸↑↓⍨.

There is also a group of characters with Greek ancestry: ∆ and ⍙.

Finally, many APL characters are ordinary ASCII characters: |/\~!=+-_<>. However, these lists are incomplete and depend on the dialect anyway.

Important Remark

Still, you need to know the encoding of a file. Even if a file starts with the UTF-8 BOM bytes, there is a (small) chance that it is not a UTF-8 file but an image file or an exe holding these three bytes. If the file does not contain a BOM, well, there is no way to predict its encoding: you simply have to know it. (However, an informed-guess strategy is possible; see Utf8orNot for details.)

Of course the BOM is a reasonable source of identification if it is a document with an extension like TXT or HTML.

HTML Specialties

An HTML page, however, can show APL chars even if the page is not UTF-8 encoded. There is a special syntax available in HTML to achieve that. It is called character entity references or HTML entities or named entities.

Anybody who has come in touch with HTML knows this, because this special syntax is the only way to display the less-than and greater-than characters in an HTML page, since these characters are normally used to open and close the so-called HTML tags.

Therefore, this line:

&lt;APL is <b>very</b> nice indeed!&gt;

displays:

<APL is very nice indeed!>

This special syntax can also be used to display any Unicode character, and that works even if the HTML page declares itself as Western-encoded, or refuses to declare any encoding information at all. The only things you need to know are the values of the characters you want to show up. See UnicodeAplTable for a listing.
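Besides named entities like &lt; there are numeric character references: &#9076; (decimal) and &#x2374; (hex) both produce the character ⍴. Therefore, this line in an HTML source:

&#x2374;&#x2190;&#x2373;10

displays:

⍴←⍳10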

However, there is a performance penalty for using HTML entities; see the examples, which are all simple HTML pages illustrating different aspects:

Contains Unicode chars and declares itself as Unicode

Contains Unicode chars but declares itself as ISO-8859-1

Contains HTML entities and no encoding information at all

HTML entities cannot be used in the APL wiki because you cannot insert pure HTML for security reasons; that is why different encodings and HTML entities are demonstrated in the attached HTML files instead. Worse, you cannot copy such characters into the clipboard and then insert them into any session, Unicode capable or not.

In short: don't even consider using this to display APL characters, or any characters at all apart from those like &lt; and &gt; that HTML syntax itself requires.

The Future

With Version 12, Dyalog is getting truly Unicode capable. Therefore, Dyalog APL and APL2 are both truly Unicode capable. Given that browsers, e-mail clients and news readers are configured correctly, that should allow us to exchange APL in all sorts of ways, and to insert it into the session of one of these interpreters and execute it without a problem.

Written by Kai Jaeger - 2008-12-19

Update 2012-07-18

3.5 years later it's time for an update: things got much better!

UTF-8 is used everywhere. Most applications deal properly with IME-related tasks. Best of all: the new IME that came along with Dyalog version 13.0 is a big improvement over earlier versions. Note that this version of the IME can also be used with earlier versions of Dyalog (12.0 and 12.1) and can be downloaded independently from the Dyalog web site. It can even be used with other APLs! Well, they have to be true Unicode APLs.

Font embedding is now common and not only used by the APL wiki but also by Vector, SIGAPL and APL Germany to name just a few.

The future is bright!


CategoryArticles CategoryUnicode
