Differences between revisions 19 and 20
Revision 19 as of 2008-12-25 10:37:11
Size: 9486
Editor: anonymous
Comment: reorganised text
Revision 20 as of 2009-01-03 17:17:01
Size: 6121
Editor: anonymous
Comment: full APL+WIN atomic vector unicode code points added

APL to Unicode

Whilst the material described below relates specifically to APL+WIN, it should be readily customisable to work with any APL interpreter that is not already Unicode capable.

Currently the APL to unicode functions write the unicode to native text files from which it can be cut and pasted into emails, newsgroups, web pages etc. Similarly the unicode to APL function requires the unicode to be cut and pasted from its source into a native text file prior to conversion.

The unicode can be copied and pasted into the text files using MS Notepad with the APL385 Unicode font. When saving a file with "Save As", make sure you select UTF-8 as the encoding.

My original aim was to work directly via the clipboard but the amount of APL code required to manage the windows clipboard is prohibitive for displaying here. APL+WIN has in-built user commands (]clipcopy and ]clippaste) to do the job and I suggest APL+WIN users use those if they want to go directly via the clipboard. Users of other interpreters no doubt have their own equivalents they can use.

AplToUtf8 takes the name of a function and converts the code to Unicode UTF-8 encoding. As it stands this function simply deals with whole functions but can easily be generalised to work with any character string input. For a quick and dirty job just comment out the first two lines of working code for it to work on simple character input.

 ∇  AplToUtf8 f

⍝Get a character representation of the function
f←⎕cr f

⍝Append newline (⎕tcnl) and line feed (⎕tclf) characters to each row
f←(f,⎕tcnl),⎕tclf

⍝Convert each character to its unicode binary value
f←∊Utf8 ¨∆avutf8[⎕av⍳,f]

⍝Prepend the UTF-8 byte order mark (bytes 239 187 191, written as 24 bits) and convert the bit stream to characters
f←82 ⎕dr 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1,f

⍝File the character stream
f FileData 'c:\unicode.txt'
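
The 24 bits prepended in the last step are the three bytes EF BB BF, the UTF-8 byte order mark. As a cross-check outside APL, the following Python sketch packs the same bit literal into bytes and compares it with the constant from the standard library. Python itself and the helper name `bits_to_bytes` are illustrative assumptions, not part of the original functions.

```python
import codecs

# The 24-bit header literal from AplToUtf8, copied verbatim.
HEADER_BITS = [1, 1, 1, 0, 1, 1, 1, 1,
               1, 0, 1, 1, 1, 0, 1, 1,
               1, 0, 1, 1, 1, 1, 1, 1]


def bits_to_bytes(bits):
    """Pack a flat list of 0/1 values into bytes, 8 bits per byte, high bit first.

    Hypothetical helper for illustration only.
    """
    if len(bits) % 8:
        raise ValueError("bit count must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


header = bits_to_bytes(HEADER_BITS)
is_bom = header == codecs.BOM_UTF8
```

The three byte values are 239, 187 and 191. Their sum is 617, the value that Utf8ToApl tests for when it strips the header.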

Utf8, which is called under each (¨) above, simply implements the UTF-8 specification to create the byte structure for each character. Anyone interested in the byte structure can find it in the Description section of http://en.wikipedia.org/wiki/UTF-8.

 ∇ r←Utf8 c

  ⍝Determine the number of bytes required to represent the character in unicode
   r←+/(⌈/((21⍴2)⊤c)/⌽⍳21)>0 7 11 16

  ⍝Convert the character to bytes according to the UTF-8 specification
   :Select r
   :Case 1
       r←⍎⍕0,(7⍴2)⊤c
   :Case 2
       r←⍎⍕(1 1 0,5↑r),1 0,5↓r←(11⍴2)⊤c
   :Case 3
       r←⍎⍕(1 1 1 0,4↑r),(1 0,6↑4↓r),1 0,10↓r←(16⍴2)⊤c
   :Case 4
       r←⍎⍕(1 1 1 1 0,3↑r),(1 0,6↑3↓r),(1 0,6↑9↓r),1 0,15↓r←(21⍴2)⊤c
   :EndSelect
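
For readers without APL+WIN to hand, the same byte layout can be sketched in Python. This is only an illustration of the UTF-8 rules that Utf8 follows, using the same bit-length thresholds (7, 11 and 16). The function name `utf8_encode` is an assumption, and the sketch returns bytes rather than the bit vector the APL function produces. It does not reject surrogate code points.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes by hand (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    n = cp.bit_length()  # number of significant bits, as computed in Utf8
    if n <= 7:
        # 0xxxxxxx
        return bytes([cp])
    if n <= 11:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if n <= 16:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Code point 9109 (the quad character) needs three bytes.
quad = utf8_encode(9109)
```

Comparing the result with Python's built-in encoder, for example `utf8_encode(9109) == chr(9109).encode("utf-8")`, is a quick way to check a port of Utf8 to another interpreter.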

The function FileData is a simple utility function to file the result. I am sure you all have your own versions.

Utf8ToApl is the reverse function. It assumes that the unicode resides in a native text file.

 ∇  r←Utf8ToApl;v

⍝Tie the native file containing the unicode
'c:\unicode.txt' ⎕ntie ¯1

⍝Read the bits from the file
v←⎕nread ¯1 11,(⎕nsize ¯1),0

⍝Untie the file
⎕nuntie ¯1

⍝Initialise the results vector
r←0⍴0

⍝Convert the bits to integers
v←2⊥⍉((.125×⍴v),8)⍴v

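⍝Strip off the byte order mark (bytes 239 187 191) if present
⍝(comparing the bytes themselves rather than their sum, 617, avoids matching other byte triples with the same sum)
:if ∧/239 187 191=3↑v
   v←3↓v
:endif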

⍝Decode the unicode bytes back to integers in accordance with the UTF-8 specification
:while 0≠⍴v

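⍝Determine how many bytes represent the next character
⍝(the first threshold is ¯1 rather than 0 so that a zero byte counts as one byte instead of matching no case)
    :select +/(↑v)>¯1 127 223 239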
    :case 1
        r←r,2⊥1↓(8⍴2)⊤v[1]
        v←1↓v
    :case 2
        r←r,2⊥(3↓(8⍴2)⊤v[1]),2↓(8⍴2)⊤v[2]
        v←2↓v
    :case 3
        r←r,2⊥(4↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),2↓(8⍴2)⊤v[3]
        v←3↓v
    :case 4
        r←r,2⊥(5↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),(2↓(8⍴2)⊤v[3]),2↓(8⍴2)⊤v[4]
        v←4↓v
    :endselect

:endwhile

⍝Convert unicode integers back to ⎕av characters
r←⎕av[(∆avutf8⍳r)∼11]
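
The decoding loop can be sketched in Python as a cross-check. The sketch classifies each lead byte with the thresholds 127, 223 and 239 used in Utf8ToApl, strips the byte order mark if present, and returns a list of code points. The function name `utf8_decode` is an assumption. Unlike the APL function it works on bytes already in memory, treats a zero byte as a one-byte character, and does not validate continuation bytes.

```python
import codecs


def utf8_decode(data):
    """Decode UTF-8 bytes into a list of code points (illustrative sketch)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[3:]  # strip the three-byte header
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # 1 to 4 bytes, decided by the value of the lead byte.
        n = 1 + (lead > 127) + (lead > 223) + (lead > 239)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        if n == 1:
            cp = lead & 0x7F
        else:
            # Keep the payload bits of the lead byte: 5, 4 or 3 bits.
            cp = lead & (0xFF >> (n + 1))
            # Each continuation byte contributes its low 6 bits.
            for b in data[i + 1:i + n]:
                cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += n
    return code_points


sample = "r←⎕av[⍳3]\r\n"
decoded = utf8_decode(codecs.BOM_UTF8 + sample.encode("utf-8"))
```

The final step of Utf8ToApl, mapping the code points back to ⎕av positions, is not covered by this sketch.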

∆avutf8 is a vector used to map the APL+WIN ⎕AV positions to their unicode code-points.

    0    1    2 9079 9674  168 8592    7    8    9   10 8834   12   13 8835 9055
   16   17   18 9067   20   21 9068 9077 8593 8595 8594   27 8867 8866 9035 9042
   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
   96   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111
  112  113  114  115  116  117  118  119  120  121  122  123  164  125 8764  127
  199  252  233  226  228  224 8800  231  234  235  232  239  238 8968  196 8970
  201 8710  215  244  246 9109  251 9054 9017  214  220  162  163   63 9066 9064
  225  237  243  250  241  209 9053 9024  191 9015  337  248  253  161  171  187
 9109 9109 9109  124  124  124  124   43   43  124  124   43   43   43   43   43
  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207
   45  209  210  211  212  213  214   43  216  217  218  219  220  221  124  255
 9082 9083 9075 9060   63 9073 8869 8868 9021 8854 9074 9023 8711 9033 8714 9067
 8801 9049 8805 8804 9045 9038  247   34 8728 9675 8744 9076 8746  175  124    0
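
How the vector is used in both directions can be illustrated with a short Python sketch. The table below is copied from the listing above. The names `AVUTF8`, `av_to_codepoints` and `codepoints_to_av` are assumptions made for the illustration, and positions are 1-origin to match the APL indexing.

```python
# Translation vector copied from the table above: one code point per ⎕AV position.
AVUTF8 = [
    0, 1, 2, 9079, 9674, 168, 8592, 7, 8, 9, 10, 8834, 12, 13, 8835, 9055,
    16, 17, 18, 9067, 20, 21, 9068, 9077, 8593, 8595, 8594, 27, 8867, 8866, 9035, 9042,
    32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
    48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
    64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
    80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
    96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
    112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 164, 125, 8764, 127,
    199, 252, 233, 226, 228, 224, 8800, 231, 234, 235, 232, 239, 238, 8968, 196, 8970,
    201, 8710, 215, 244, 246, 9109, 251, 9054, 9017, 214, 220, 162, 163, 63, 9066, 9064,
    225, 237, 243, 250, 241, 209, 9053, 9024, 191, 9015, 337, 248, 253, 161, 171, 187,
    9109, 9109, 9109, 124, 124, 124, 124, 43, 43, 124, 124, 43, 43, 43, 43, 43,
    192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
    45, 209, 210, 211, 212, 213, 214, 43, 216, 217, 218, 219, 220, 221, 124, 255,
    9082, 9083, 9075, 9060, 63, 9073, 8869, 8868, 9021, 8854, 9074, 9023, 8711, 9033, 8714, 9067,
    8801, 9049, 8805, 8804, 9045, 9038, 247, 34, 8728, 9675, 8744, 9076, 8746, 175, 124, 0,
]

LF_POSITION = 11  # 1-origin position of line feed, removed by the final step of Utf8ToApl


def av_to_codepoints(positions):
    """Forward lookup with 1-origin positions, like indexing ∆avutf8."""
    return [AVUTF8[p - 1] for p in positions]


def codepoints_to_av(code_points):
    """Reverse lookup: first matching 1-origin position, with line feeds dropped.

    list.index raises ValueError for a code point that is not in the table,
    whereas APL's dyadic iota returns one more than the vector length.
    """
    positions = [AVUTF8.index(cp) + 1 for cp in code_points]
    return [p for p in positions if p != LF_POSITION]


round_trip = codepoints_to_av(av_to_codepoints([150, 177]))
```

Because some code points occur more than once in the table (9109 and 124, for example), the reverse lookup returns the first matching position, so a round trip does not always restore the original ⎕AV position.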

I used the functions to convert themselves into unicode for the wiki. Anyone wishing to create their own versions for another interpreter first needs to create the appropriate translation vector for that interpreter. An excellent starting point is Adrian Smith's article in Vector: http://www.vector.org.uk/resource/uniref.pdf. The functions themselves should translate readily to any interpreter if they are not usable directly.

Author: GrahamSteer


CategoryUnicode

AplToUnicode (last edited 2009-01-17 09:47:54 by anonymous)