Differences between revisions 1 and 26 (spanning 25 versions)

APL to Unicode

Whilst the material described below relates specifically to APL+WIN, it should be readily customisable to work with any APL interpreter that is not already Unicode capable.

Currently the APL to Unicode functions write the Unicode to native text files, from which it can be cut and pasted into emails, newsgroups, web pages etc. Similarly, the Unicode to APL function requires the Unicode to be cut and pasted from its source into a native text file prior to conversion.

The Unicode can be copied and pasted to and from the text files using Windows Notepad with the APL385 Unicode font. When you save a file with "Save As", make sure you select UTF-8 as the encoding.

My original aim was to work directly via the clipboard, but the amount of APL code required to manage the Windows clipboard is too large to display here. APL+WIN has built-in user commands (]clipcopy and ]clippaste) to do the job, and I suggest APL+WIN users use those if they want to go directly via the clipboard. Users of other interpreters no doubt have their own equivalents they can use.

AplToUtf8 takes the name of a function and converts the code to Unicode UTF-8 encoding. As it stands this function simply deals with whole functions but can easily be generalised to work with any character string input. For a quick and dirty job just comment out the first two lines of working code for it to work on simple character input.

 ∇  AplToUtf8 f

⍝Get a character representation of the function
f←⎕cr f

⍝Append carriage return (⎕tcnl) and line feed (⎕tclf) characters to each line
f←(f,⎕tcnl),⎕tclf

⍝Convert each character to its unicode binary value
f←∊Utf8 ¨∆avutf8[⎕av⍳,f]

⍝Add the encoding header (the UTF-8 byte order mark, bytes 239 187 191) and convert the bits back to characters
f←82 ⎕dr 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1,f

⍝File the character stream
f FileData 'c:\unicode.txt'

 ∇
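
As a cross-check of the header line, the 24 bits prefixed before the `82 ⎕dr` conversion can be packed into bytes outside APL. The sketch below is in Python (not part of the original listing; the helper name `bits_to_bytes` is made up for illustration) and assumes the bits are packed most significant bit first, eight to a byte.

```python
import codecs

# The 24 header bits copied from the AplToUtf8 listing
HEADER_BITS = [1, 1, 1, 0, 1, 1, 1, 1,
               1, 0, 1, 1, 1, 0, 1, 1,
               1, 0, 1, 1, 1, 1, 1, 1]


def bits_to_bytes(bits):
    """Pack a flat list of 0/1 values into bytes, most significant bit first."""
    if len(bits) % 8:
        raise ValueError("bit count must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


header = bits_to_bytes(HEADER_BITS)
print(header == codecs.BOM_UTF8)   # compare with the standard UTF-8 signature
print(list(header), sum(header))
```

The three bytes are 239 187 191, the UTF-8 byte order mark, and they sum to 617, which is the value Utf8ToApl tests for when it strips the header.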

Utf8, which is called under each (¨) in the above, simply implements the UTF-8 specification to create the Unicode byte structure for each character: one to four bytes, depending on how many bits the code point needs. Anyone interested in the byte structure can find it in the Description section of http://en.wikipedia.org/wiki/UTF-8.

 ∇ r←Utf8 c

  ⍝Determine the number of bytes required to represent the character in unicode
   r←+/(⌈/((21⍴2)⊤c)/⌽⍳21)>0 7 11 16

  ⍝Convert the character to bytes according to the UTF-8 specification
   :Select r
   :Case 1
       r←⍎⍕0,(7⍴2)⊤c
   :Case 2
       r←⍎⍕(1 1 0,5↑r),1 0,5↓r←(11⍴2)⊤c
   :Case 3
       r←⍎⍕(1 1 1 0,4↑r),(1 0,6↑4↓r),1 0,10↓r←(16⍴2)⊤c
   :Case 4
       r←⍎⍕(1 1 1 1 0,3↑r),(1 0,6↑3↓r),(1 0,6↑9↓r),1 0,15↓r←(21⍴2)⊤c
   :EndSelect
∇
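
For readers who want to check the four branches against a known-good encoder, here is a rough Python transcription of Utf8 (a sketch, not the author's code; the names `utf8_bits` and `utf8_bytes` are invented for this example). It follows the same steps: count the significant bits, compare against 0 7 11 16 to pick the byte count, then splice the marker bits between slices of the code point. One deliberate difference, labelled in the code, is that code point 0 is treated as a one-byte character, whereas the APL listing has no branch for it.

```python
def utf8_bits(c):
    """Return the UTF-8 encoding of code point c as a flat list of bits."""
    if not 0 <= c <= 0x10FFFF:
        raise ValueError("code point out of range")

    def to_bits(value, width):
        # Fixed-width binary expansion, like (width⍴2)⊤value
        return [(value >> (width - 1 - i)) & 1 for i in range(width)]

    # Number of bytes: how many of the limits 0 7 11 16 the bit count exceeds.
    # Assumption: 0 is forced to one byte (the APL listing leaves it unhandled).
    n_bytes = max(1, sum(c.bit_length() > limit for limit in (0, 7, 11, 16)))

    if n_bytes == 1:
        return [0] + to_bits(c, 7)
    if n_bytes == 2:
        r = to_bits(c, 11)
        return [1, 1, 0] + r[:5] + [1, 0] + r[5:]
    if n_bytes == 3:
        r = to_bits(c, 16)
        return [1, 1, 1, 0] + r[:4] + [1, 0] + r[4:10] + [1, 0] + r[10:]
    r = to_bits(c, 21)
    return ([1, 1, 1, 1, 0] + r[:3] + [1, 0] + r[3:9]
            + [1, 0] + r[9:15] + [1, 0] + r[15:])


def utf8_bytes(c):
    """Pack the bit list from utf8_bits into a bytes object."""
    bits = utf8_bits(c)
    return bytes(
        int("".join(str(b) for b in bits[i:i + 8]), 2)
        for i in range(0, len(bits), 8)
    )


# Compare with Python's built-in encoder for a few code points,
# including 9077 (⍵) from the translation vector.
for cp in (65, 233, 8364, 9077, 0x1F600):
    print(cp, utf8_bytes(cp) == chr(cp).encode("utf-8"))
```

Surrogate code points (55296 to 57343) are not valid in UTF-8 and are not handled specially by either version.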

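The function FileData is a simple utility function to file the result: it takes the character stream as its left argument and the name of the native file as its right argument. I am sure you all have your own versions.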

Utf8ToApl is the reverse function. It assumes that the unicode resides in a native text file.

 ∇  r←Utf8ToApl;v

⍝Tie the native file containing the unicode
'c:\unicode.txt' ⎕ntie ¯1

⍝Read the bits from the file
v←⎕nread ¯1 11,(⎕nsize ¯1),0

⍝Untie the file
⎕nuntie ¯1

⍝Initialise the results vector
r←0⍴0

⍝Convert the bits to integers
v←2⊥⍉((.125×⍴v),8)⍴v

⍝Strip off the encoding header (byte order mark 239 187 191, which sums to 617) if present
:if 617=+/3↑v
   v←3↓v
:endif

⍝Decode the unicode bytes back to integers in accordance with the UTF-8 specification
:while 0≠⍴v

⍝Determine how many bytes represent the next character
    :select +/(↑v)>0 127 223 239
    :case 1
        r←r,2⊥1↓(8⍴2)⊤v[1]
        v←1↓v
    :case 2
        r←r,2⊥(3↓(8⍴2)⊤v[1]),2↓(8⍴2)⊤v[2]
        v←2↓v
    :case 3
        r←r,2⊥(4↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),2↓(8⍴2)⊤v[3]
        v←3↓v
    :case 4
        r←r,2⊥(5↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),(2↓(8⍴2)⊤v[3]),2↓(8⍴2)⊤v[4]
        v←4↓v
    :endselect

:endwhile

⍝Convert unicode integers back to ⎕av characters, dropping line feeds (⎕av position 11)
r←⎕av[(∆avutf8⍳r)∼11]

∇
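
The decoding loop can be sketched the same way in Python (again an illustration, not the original code; `decode_utf8` is a hypothetical name). It classifies the lead byte against the limits 127 223 239, masks off the marker bits and joins the payload bits. Three departures from the APL listing are assumptions made for this sketch and are marked in the comments: the header is compared byte for byte rather than by its sum, a zero byte is accepted as a one-byte character, and a truncated sequence raises an error. Like the original, it does not validate continuation bytes.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    v = list(data)

    # Assumption: exact comparison with the byte order mark instead of 617=+/3↑v
    if v[:3] == [239, 187, 191]:
        v = v[3:]

    code_points = []
    i = 0
    while i < len(v):
        lead = v[i]
        # Byte count from the lead byte; -1 (not 0) so that a zero byte counts as 1
        n = sum(lead > limit for limit in (-1, 127, 223, 239))
        if i + n > len(v):
            raise ValueError("truncated sequence at byte %d" % i)

        if n == 1:
            cp = lead & 0x7F
        elif n == 2:
            cp = ((lead & 0x1F) << 6) | (v[i + 1] & 0x3F)
        elif n == 3:
            cp = (((lead & 0x0F) << 12)
                  | ((v[i + 1] & 0x3F) << 6)
                  | (v[i + 2] & 0x3F))
        else:
            cp = (((lead & 0x07) << 18)
                  | ((v[i + 1] & 0x3F) << 12)
                  | ((v[i + 2] & 0x3F) << 6)
                  | (v[i + 3] & 0x3F))

        code_points.append(cp)
        i += n
    return code_points


sample = "r\u2190\u237310"                       # r←⍳10
encoded = b"\xef\xbb\xbf" + sample.encode("utf-8")
print(decode_utf8(encoded) == [ord(ch) for ch in sample])
```

The resulting integers are what Utf8ToApl looks up in ∆avutf8 to get back to ⎕av positions.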

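∆avutf8 is a 256-element integer vector used to map the APL+WIN ⎕AV positions to their unicode code-points: the first element is the code point for ⎕AV[1], the second for ⎕AV[2], and so on. Some code points, such as 124 and 43, appear at more than one position, so the reverse lookup in Utf8ToApl returns the first matching position.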

    0    1    2 9079 9674  168 8592    7    8    9   10 8834   12   13 8835 9055
   16   17   18 9067   20   21 9068 9077 8593 8595 8594   27 8867 8866 9035 9042
   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
   96   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111
  112  113  114  115  116  117  118  119  120  121  122  123  166  125 8764  127
  199  252  233  226  228  224 8800  231  234  235  232  239  238 8968  196 8970
  201 8710  215  244  246 9109  251 9054 9017  214  220  162  163   63 9066 9064
  225  237  243  250  241  209 9053 9024  191 9015  337  248  253  161  171  187
 9109 9109 9109  124  124  124  124   43   43  124  124   43   43   43   43   43
  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207
   45  209  210  211  212  213  214   43  216  217  218  219  220  221  124  255
 9082  223 9075 9060  227 9073 8869 8868 9021 8854 9074 9023 8711 9033 8714 8745
 8801 9049 8805 8804 9045 9038  247   34 8728 9675 8744 9076 8746  175  124    0

I used the functions to convert themselves into unicode for the wiki. They should readily translate to any interpreter if not usable directly.

Anyone wishing to create their own versions for another interpreter first needs to create the appropriate translation vector for that interpreter. To get you started I have reproduced the APL+WIN atomic vector below. Another excellent resource is Adrian Smith's article in Vector: http://www.vector.org.uk/resource/uniref.pdf.

   ⍷◊¨←    ⊂  ⊃⍟åæì⍫ÙÒ⍬⍵↑↓→ ⊣⊢⍋⍒ !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{¦}∼
Çüéâäà≠çêëèïî⌈Ä⌊É∆×ôö⎕û⍞⌹ÖÜ¢£?⍪⍨áíóúñÑ⍝⍀¿⌷őøý¡«»⎕⎕⎕||||++||+++++
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÑÒÓÔÕÖ+ØÙÚÛÜÝ|ÿ⍺ß⍳⍤ã⍱⊥⊤⌽⊖⍲⌿∇⍉∊∩≡⍙≥≤⍕⍎÷"∘○∨⍴∪¯|
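
When building a translation vector for another interpreter, it can help to check which positions share a code point, because a first-match reverse lookup such as `∆avutf8⍳r` cannot send those positions back to where they came from. The Python sketch below (illustrative only; `round_trip_losses` is a made-up name) reports every zero-based position that would come back as a different position. It is run here on one sixteen-element row of the ∆avutf8 table, the row that begins 9109 9109 9109.

```python
def round_trip_losses(translation):
    """Map each position that does not survive a round trip to the position
    it comes back as. Positions are zero-based; the reverse lookup is
    first-match, like dyadic iota."""
    first_seen = {}
    losses = {}
    for position, code_point in enumerate(translation):
        if code_point in first_seen:
            losses[position] = first_seen[code_point]
        else:
            first_seen[code_point] = position
    return losses


# One row of the ∆avutf8 table (the row starting 9109 9109 9109)
row = [9109, 9109, 9109, 124, 124, 124, 124, 43,
       43, 124, 124, 43, 43, 43, 43, 43]

losses = round_trip_losses(row)
print(len(losses), "of", len(row), "positions do not round-trip")
print(losses)
```

Only three distinct code points occur in that row, so thirteen of its sixteen positions come back as an earlier position. That is harmless where the characters are deliberately approximated, but it is worth knowing before relying on a round trip.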

I could not resist the challenge when one reader commented that these functions were not very "APL like", so I created a new set at AplToUnicodeII.

Author: GrahamSteer

CategoryUnicode

-  Revision 1 as of 2008-12-22 14:26:23
  Size: 4200
  Editor: anonymous
  Comment:
+  Revision 26 as of 2009-01-05 17:49:55
  Size: 6763
  Editor: anonymous
  Comment: APL+WIN atomic vector added
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-= APL+WIN to Unicode =
At present (2008-12-22) it isn't possible to copy and paste unicode from and to APL+WIN.
+## page was renamed from AplPlusWinToUnicode
= APL to Unicode =
Whilst the material described below relates specifically to APL+WIN is should be readily customisable to work with any APL interpreter that is not already unicode capable.
-Line 4:
+Line 5:
-GrahamSteer has provided the following functions to enable this facility.
+Currently the APL to unicode functions write the unicode to native text files from which it can be cut and pasted into emails, newsgroups, web pages etc. Similarly the unicode to APL function requires the unicode to be cut and pasted from its source into a native text file prior to conversion.
-Line 6:
+Line 7:
-`AplToUtf8` takes the name of a function and converts the code, not just to unicode but to UTF-8.
No doubt it could be amended to accept one or more lines of code if an entire function were not available or required.
+The unicode can be copied and pasted to the text files using MSNotePad with the APL385 Unicode font. Also make sure you select UTF-8 as the encoding when doing a "Save as" when you save a file.

My original aim was to work directly via the clipboard but the amount of APL code required to manage the windows clipboard is prohibitive for displaying here. APL+WIN has in-built user commands (]clipcopy and ]clippaste) to do the job and I suggest APL+WIN users use those if they want to go directly via the clipboard. Users of other interpreters no doubt have their own equivalents they can use.

`AplToUtf8` takes the name of a function and converts the code to Unicode UTF-8 encoding. As it stands this function simply deals with whole functions but can easily be generalised to work with any character string input. For a quick and dirty job just comment out the first two lines of working code for it to work on simple character input.
-Line 10:
+Line 14:
- ∇ AplToUtf8 f
+ ∇  AplToUtf8 f
-Line 12:
+Line 16:
-  ⍝Get a character representation of the function
   f←⎕cr f
+⍝Get a character representation of the function
f←⎕cr f
-Line 15:
+Line 19:
-  ⍝Append new line and carriage return characters
   f←(f,⎕av[14]),⎕av[11]
+⍝Append new line and carriage return characters
f←(f,⎕tcnl),⎕tclf
-Line 18:
+Line 22:
-  ⍝Convert each character to its unicode binary value
   f←∊Utf8 ¨∆aplToUtf8[¯1+⎕av⍳,f;2]
+⍝Convert each character to its unicode binary value
f←∊Utf8 ¨∆avutf8[⎕av⍳,f]
-Line 21:
+Line 25:
-  ⍝Add the encoding level header and convert back to ascii characters
   f←82 ⎕dr 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1,f
+⍝Add the encoding level header and convert back to ascii characters
f←82 ⎕dr 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1,f
-Line 24:
+Line 28:
-  ⍝File the character stream
   f FileData 'c:\test.txt'
+⍝File the character stream
f FileData 'c:\unicode.txt'
-Line 28:
+Line 33:
-`Utf8` which is called under each `(¨)` in the above is an eloquent restating of the definition of the encoding it implements.
+`Utf8` which is called under each `(¨)` in the above simply implements the UTF-8 specification to create the unicode byte structure for each character. Anyone interested in the byte structure can see it here: http://en.wikipedia.org/wiki/UTF-8 and scroll down to the Description section.
-Line 37:
+Line 41:
-  ⍝Convert the character to bytes according to the UTF≡8 specification
+  ⍝Convert the character to bytes according to the UTF-8 specification
-Line 50:
+Line 54:
+The function `FileData` is a simple utility function to file the result. I am sure you all have your own versions.
-Line 51:
+Line 56:
-`∆aplToUtf8`, below, is a two column integer matrix that maps the APL+WIN ⎕AV positions to their unicode code-points.

Graham used the functions to convert themselves into unicode for the wiki. To retrieve them from here for use in APL+WIN you would presumably have to copy them and correct the apl characters manually unless you already had the reverse translation functions in your workspace!

The function `FileData` is left as an exercise for the reader.
+`Utf8ToApl` is the reverse function. It assumes that the unicode resides in a native text file.
-Line 58:
+Line 59:
-	1
2	2
3	9079
4	9674
5	168
6	8592
7	7
8	8
9	9
10	10
11	8834
12	12
13	13
14	8835
15	9055
16	16
17	17
18	18
19	19
20	20
21	21
22	9068
23	9077
24	8593
25	8595
26	8594
27	27
28	8867
29	8866
30	9035
31	9042
32	32
33	33
34	34
35	35
36	36
37	37
38	38
39	39
40	40
41	41
42	42
43	43
44	44
45	8801
46	46
47	47
48	48
49	49
50	50
51	51
52	52
53	53
54	54
55	55
56	56
57	57
58	58
59	59
60	60
61	61
62	62
63	63
64	64
65	65
66	66
67	67
68	68
69	69
70	70
71	71
72	72
73	73
74	74
75	75
76	76
77	77
78	78
79	79
80	80
81	81
82	82
83	83
84	84
85	85
86	86
87	87
88	88
89	89
90	90
91	91
92	92
93	93
94	94
95	95
96	96
97	97
98	98
99	99
100	100
101	101
102	102
103	103
104	104
105	105
106	106
107	107
108	108
109	109
110	110
111	111
112	112
113	113
114	114
115	115
116	116
117	117
118	118
119	119
120	120
121	121
122	122
123	123
124	124
125	125
126	8764
127	127
128	128
129	63
130	63
131	63
132	63
133	63
134	8800
135	63
136	63
137	63
138	63
139	63
140	63
141	8968
142	63
143	8970
144	63
145	8710
146	215
147	63
148	63
149	9109
150	63
151	9054
152	9017
153	63
154	63
155	63
156	63
157	63
158	9066
159	63
160	63
161	63
162	63
163	63
164	63
165	63
166	9053
167	9024
168	63
169	63
170	63
171	63
172	63
173	63
174	63
175	63
176	63
177	63
178	63
179	63
180	63
181	63
182	63
183	63
184	63
185	63
186	63
187	63
188	63
189	63
190	63
191	63
192	63
193	63
194	63
195	63
196	63
197	63
198	63
199	63
200	63
201	63
202	63
203	63
204	63
205	63
206	63
207	63
208	63
209	63
210	63
211	63
212	63
213	63
214	63
215	63
216	63
217	63
218	63
219	63
220	63
221	63
222	63
223	63
224	9082
225	63
226	9075
227	63
228	63
229	9073
230	8869
231	8868
232	9021
233	8854
234	9074
235	9023
236	8711
237	9033
238	8714
239	9067
240	63
241	9049
242	8805
243	8804
244	9045
245	9038
246	247
247	63
248	8728
249	9675
250	8744
251	9076
252	63
253	175
254	124
255	63
256	63
+ ∇  r←Utf8ToApl;v

⍝Tie the native file containing the unicode
'c:\unicode.txt' ⎕ntie ¯1

⍝Read the bits from the file
v←⎕nread ¯1 11,(⎕nsize ¯1),0

⍝Untie the file
⎕nuntie ¯1

⍝Initialise the results vector
r←0⍴0

⍝Convert the bits to integers
v←2⊥⍉((.125×⍴v),8)⍴v

⍝Strip off the encoding header if present
:if 617=+/3↑v
   v←3↓v
:endif

⍝Decode the unicode bytes back to integers in accordance with the UTF-8 specification
:while 0≠⍴v

⍝Determine how many bytes represent the next character
    :select +/(↑v)>0 127 223 239
    :case 1
        r←r,2⊥1↓(8⍴2)⊤v[1]
        v←1↓v
    :case 2
        r←r,2⊥(3↓(8⍴2)⊤v[1]),2↓(8⍴2)⊤v[2]
        v←2↓v
    :case 3
        r←r,2⊥(4↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),2↓(8⍴2)⊤v[3]
        v←3↓v
    :case 4
        r←r,2⊥(5↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),(2↓(8⍴2)⊤v[3]),2↓(8⍴2)⊤v[4]
        v←4↓v
    :endselect

:endwhile

⍝Convert unicode integers back to ⎕av characters
r←⎕av[(∆avutf8⍳r)∼11]

∇
-Line 315:
+Line 107:
+`∆avutf8` is a vector used to map the APL+WIN ⎕AV positions to their unicode code-points.
-Line 316:
+Line 109:
+{{{
    0    1    2 9079 9674  168 8592    7    8    9   10 8834   12   13 8835 9055
   16   17   18 9067   20   21 9068 9077 8593 8595 8594   27 8867 8866 9035 9042
   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
   96   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111
  112  113  114  115  116  117  118  119  120  121  122  123  166  125 8764  127
  199  252  233  226  228  224 8800  231  234  235  232  239  238 8968  196 8970
  201 8710  215  244  246 9109  251 9054 9017  214  220  162  163   63 9066 9064
  225  237  243  250  241  209 9053 9024  191 9015  337  248  253  161  171  187
 9109 9109 9109  124  124  124  124   43   43  124  124   43   43   43   43   43
  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207
   45  209  210  211  212  213  214   43  216  217  218  219  220  221  124  255
 9082  223 9075 9060  227 9073 8869 8868 9021 8854 9074 9023 8711 9033 8714 9067
 8801 9049 8805 8804 9045 9038  247   34 8728 9675 8744 9076 8745  175  124    0

}}}
I used the functions to convert themselves into unicode for the wiki. They should readily translate to any interpreter if not usable directly.

Anyone wishing to create their own versions for another interpreter needs firstly to create the appropriate translation vector for their interpreter. To get you started I have reproduced the APL+WIN atomic vector below. Another excellent resource is Adrian Smith's article in Vector http://www.vector.org.uk/resource/uniref.pdf. 

{{{
   ⍷◊¨←    ⊂  ⊃⍟åæì⍫ÙÒ⍬⍵↑↓→ ⊣⊢⍋⍒ !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{¦}∼
Çüéâäà≠çêëèïî⌈Ä⌊É∆×ôö⎕û⍞⌹ÖÜ¢£?⍪⍨áíóúñÑ⍝⍀¿⌷őøý¡«»⎕⎕⎕||||++||+++++
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ-ÑÒÓÔÕÖ+ØÙÚÛÜÝ|ÿ⍺ß⍳⍤ã⍱⊥⊤⌽⊖⍲⌿∇⍉∊∩≡⍙≥≤⍕⍎÷"∘○∨⍴∪¯|

}}}
I could not resist the challenge when one reader commented that these functions were not very "APL like" so I created a new set at [[AplToUnicodeII]]

Author: GrahamSteer

----