Differences between revisions 7 and 9 (spanning 2 versions)

APL+WIN to Unicode - Work in Progress

As of 2008-12-22 it is not possible to copy and paste Unicode text to or from APL+WIN, at least up to version 5.

I am developing a set of functions to enable this facility; the current prototypes are below. The eventual aim is to skip the file stage and work directly via the clipboard. For now, the Unicode text can be copied from and pasted into the text files using Notepad with the APL385 Unicode font. If you save a file from Notepad, make sure you select UTF-8 as the encoding in the "Save As" dialog.

AplToUtf8 takes the name of a function and converts its source to UTF-8 encoded Unicode. At the moment it handles only whole functions, but it can easily be generalised to accept any character string. For a quick and dirty job, comment out the first two lines of working code (the ⎕cr call and the line-ending append) and it will work on simple character input.

 ∇  AplToUtf8 f                                                        
                                                                   
⍝Get a character representation of the function                     
f←⎕cr f                                                            
                                                                   
⍝Append newline (⎕tcnl) and line feed (⎕tclf) characters to each line
f←(f,⎕tcnl),⎕tclf                                                  
                                                                   
⍝Convert each character to its unicode binary value                
f←∊Utf8 ¨∆avutf8[⎕av⍳,f]                                           
                                                                   
⍝Prepend the UTF-8 byte-order mark (EF BB BF) and pack the bits into characters
f←82 ⎕dr 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1,f         
                                                                   
⍝File the character stream                                         
f FileData 'c:\test.txt'                                           

 ∇
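
For readers porting this to another environment, the same pipeline can be cross-checked with a minimal Python sketch. It is not the APL+WIN code: the name `apl_to_utf8` and the small `SAMPLE_TABLE` are hypothetical, positions are 0-origin, and Python's built-in encoder stands in for the Utf8 function. The sample entries are taken from the ∆avutf8 vector on this page.

```python
import codecs

# Hypothetical excerpt of the translation vector: 0-origin position -> code point.
# The real mapping is the full 256-element vector listed on this page.
SAMPLE_TABLE = {6: 8592, 10: 10, 13: 13, 51: 51, 114: 114, 226: 9075}


def apl_to_utf8(av_positions, table):
    """Map legacy character positions to code points, then encode as UTF-8.

    The three-byte mark EF BB BF is prepended, as in the 82 ⎕dr line above.
    """
    text = "".join(chr(table[pos]) for pos in av_positions)
    return codecs.BOM_UTF8 + text.encode("utf-8")


# The positions for "r←⍳3" followed by carriage return and line feed.
encoded = apl_to_utf8([114, 6, 226, 51, 13, 10], SAMPLE_TABLE)
print(encoded.hex(" "))
```

The resulting bytes could then be written with `open(path, "wb")`, which plays the role of FileData here.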

Utf8, which is applied to each character with the each operator (¨) above, implements the UTF-8 specification to build the byte structure for a single Unicode code point, returned as a vector of bits. Anyone interested in the byte structure can find it in the Description section of http://en.wikipedia.org/wiki/UTF-8.

 ∇ r←Utf8 c

  ⍝Determine the number of bytes required to represent the character in unicode
  ⍝(the 0, and ¯1 make code point 0 count as one byte instead of matching no case)
   r←+/(⌈/0,((21⍴2)⊤c)/⌽⍳21)>¯1 7 11 16

  ⍝Convert the character to bytes according to the UTF-8 specification
   :Select r
   :Case 1
       r←⍎⍕0,(7⍴2)⊤c
   :Case 2
       r←⍎⍕(1 1 0,5↑r),1 0,5↓r←(11⍴2)⊤c
   :Case 3
       r←⍎⍕(1 1 1 0,4↑r),(1 0,6↑4↓r),1 0,10↓r←(16⍴2)⊤c
   :Case 4
       r←⍎⍕(1 1 1 1 0,3↑r),(1 0,6↑3↓r),(1 0,6↑9↓r),1 0,15↓r←(21⍴2)⊤c
   :EndSelect
∇
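
As a cross-check of the bit layout, here is a minimal Python sketch of the same four cases. The name `utf8_encode` is hypothetical, and the sketch returns bytes rather than a bit vector. It uses the same thresholds as above (7, 11 and 16 significant bits) and treats code point 0 as a one-byte character.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes, one case per length."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    bits = cp.bit_length()  # number of significant bits, 0 for cp == 0
    if bits <= 7:
        # 0xxxxxxx
        return bytes([cp])
    if bits <= 11:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if bits <= 16:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's own encoder for a few boundary values.
for sample in (0, 63, 127, 128, 2047, 2048, 8592, 65536):
    assert utf8_encode(sample) == chr(sample).encode("utf-8")
```

The boundary values in the loop sit on either side of each length threshold, which is where an off-by-one in the comparison vector would show up.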

∆avutf8, below, is a vector that maps each APL+WIN ⎕AV position to its Unicode code point. The many entries of 63 are placeholders (63 is the code point of the question mark) for positions whose code point has not yet been filled in; anyone who knows the corresponding Unicode code points for those places in the APL+WIN atomic vector should feel free to add them. Please also feel free to correct any errors in the vector that show up when using it with the above functions.

I used the functions to convert themselves into Unicode for the wiki. Anyone wishing to create their own versions for another interpreter first needs to create the appropriate translation vector for that interpreter. An excellent place to start is Adrian Smith's article in Vector: http://www.vector.org.uk/resource/uniref.pdf. The functions themselves should translate readily if they are not directly usable.

The function FileData is a simple utility that writes its left argument to the file named in its right argument. I am sure you all have your own versions, but for completeness I may list it when the more important work is finished.

    0    1    2 9079 9674  168 8592    7    8    9   10 8834   12   13 8835 9055
   16   17   18 9067   20   21 9068 9077 8593 8595 8594   27 8867 8866 9035 9042
   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
   96   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111
  112  113  114  115  116  117  118  119  120  121  122  123  124  125 8764  127
   63   63   63   63   63   63 8800   63   63   63   63   63   63 8968   63 8970
   63 8710  215   63   63 9109   63 9054 9017   63   63   63   63   63 9066   63
   63   63   63   63   63   63 9053 9024   63   63   63   63   63   63   63   63
   63   63   63   63   63   63   63   63   63   63   63   63   63   63   63   63
   63   63   63   63   63   63   63   63   63   63   63   63   63   63   63   63
   63   63   63   63   63   63   63   63   63   63   63   63   63   63   63   63
 9082   63 9075   63   63 9073 8869 8868 9021 8854 9074 9023 8711 9033 8714 9067
 8801 9049 8805 8804 9045 9038  247   63 8728 9675 8744 9076   63  175  124   63
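
To help with filling in the vector, the following Python sketch lists the positions that still hold the placeholder 63, and any code point that occurs at more than one position (the reverse lookup can only ever return the first of those). The helper names are hypothetical and positions are 0-origin here, so add 1 for the origin used in the APL functions; the data is the vector as listed above.

```python
# The translation vector from this page, indexed by 0-origin position.
AVUTF8 = (
    [0, 1, 2, 9079, 9674, 168, 8592, 7, 8, 9, 10, 8834, 12, 13, 8835, 9055]
    + [16, 17, 18, 9067, 20, 21, 9068, 9077,
       8593, 8595, 8594, 27, 8867, 8866, 9035, 9042]
    + list(range(32, 126)) + [8764, 127]            # positions 32..127
    + [63, 63, 63, 63, 63, 63, 8800, 63, 63, 63, 63, 63, 63, 8968, 63, 8970]
    + [63, 8710, 215, 63, 63, 9109, 63, 9054,
       9017, 63, 63, 63, 63, 63, 9066, 63]
    + [63, 63, 63, 63, 63, 63, 9053, 9024, 63, 63, 63, 63, 63, 63, 63, 63]
    + [63] * 48                                     # positions 176..223
    + [9082, 63, 9075, 63, 63, 9073, 8869, 8868,
       9021, 8854, 9074, 9023, 8711, 9033, 8714, 9067]
    + [8801, 9049, 8805, 8804, 9045, 9038, 247, 63,
       8728, 9675, 8744, 9076, 63, 175, 124, 63]
)

QUESTION_MARK = 63  # also the 0-origin position of the genuine '?'


def placeholder_positions(table):
    """Positions holding 63 that are not the question mark itself."""
    return [i for i, cp in enumerate(table)
            if cp == QUESTION_MARK and i != QUESTION_MARK]


def duplicated_codepoints(table):
    """Map each repeated non-placeholder code point to all its positions."""
    seen = {}
    for i, cp in enumerate(table):
        if cp != QUESTION_MARK:
            seen.setdefault(cp, []).append(i)
    return {cp: pos for cp, pos in seen.items() if len(pos) > 1}


print(len(placeholder_positions(AVUTF8)), "positions still need a code point")
print(duplicated_codepoints(AVUTF8))
```

Whether a repeated code point is an error or intentional is not something the sketch can decide; it only points at the entries worth checking.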

Here is the first prototype of the reverse function, Utf8ToApl. It assumes that the Unicode text resides in a native text file (the path c:\test.txt is hard-coded).

 ∇  r←Utf8ToApl;v                                                          
                                                                       
⍝Tie the native file containing the unicode                            
'c:\test.txt' ⎕ntie ¯1                                                 
                                                                       
⍝Read the bits from the file                                           
v←⎕nread ¯1 11,(⎕nsize ¯1),0                                           
                                                                       
⍝Untie the file                                                        
⎕nuntie ¯1                                                             
                                                                       
⍝Initialise the results vector                                         
r←0⍴0                                                                  
                                                                       
⍝Convert the bits to integers                                          
v←2⊥⍉((.125×⍴v),8)⍴v                                                   
                                                                       
⍝Strip off the UTF-8 byte-order mark (239 187 191) if present
:if ∧/239 187 191=3↑v                                                  
   v←3↓v                                                               
:endif                                                                 
                                                                       
⍝Decode the unicode bytes back to integers in accordance with the UTF-8 specification         
:while 0≠⍴v                                                            
                                                                       
⍝Determine how many bytes represent the next character
⍝(¯1 rather than 0 so that a zero byte counts as one byte and the loop advances)
    :select +/(↑v)>¯1 127 223 239                                      
    :case 1                                                            
        r←r,2⊥1↓(8⍴2)⊤v[1]                                             
        v←1↓v                                                          
    :case 2                                                            
        r←r,2⊥(3↓(8⍴2)⊤v[1]),2↓(8⍴2)⊤v[2]                              
        v←2↓v                                                          
    :case 3                                                            
        r←r,2⊥(4↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),2↓(8⍴2)⊤v[3]               
        v←3↓v                                                          
    :case 4                                                            
        r←r,2⊥(5↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),(2↓(8⍴2)⊤v[3]),2↓(8⍴2)⊤v[4]
        v←4↓v                                                          
    :endselect                                                         
                                                                       
:endwhile                                                              
                                                                       
⍝Convert unicode integers back to ⎕av characters, dropping line feeds (position 11)
r←⎕av[(∆avutf8⍳r)∼11]                                                  

∇
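
The decoding loop can be cross-checked with a minimal Python sketch. The name `utf8_decode` is hypothetical. It follows the same steps (strip the leading mark, pick the sequence length from the first byte, concatenate the payload bits), and as an addition of its own it rejects malformed input instead of trying to decode it.

```python
BOM = b"\xef\xbb\xbf"


def utf8_decode(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    if data[:3] == BOM:
        data = data[3:]
    codepoints = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte decides how many bytes make up the character.
        if lead <= 127:
            length, cp = 1, lead
        elif 192 <= lead <= 223:
            length, cp = 2, lead & 0x1F
        elif 224 <= lead <= 239:
            length, cp = 3, lead & 0x0F
        elif 240 <= lead <= 247:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated sequence at offset %d" % i)
        for byte in tail:
            if (byte & 0xC0) != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (byte & 0x3F)  # append six payload bits
        codepoints.append(cp)
        i += length
    return codepoints


print(utf8_decode(BOM + "r←⍳3".encode("utf-8")))
```

Mapping the resulting code points back to character positions is then a lookup in the translation vector, which is what the final line of Utf8ToApl does.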

CategoryUnicode

Revision 7 as of 2008-12-22 22:35:44 (size: 8442, editor: anonymous, no comment)
Revision 9 as of 2008-12-23 11:23:28 (size: 8915, editor: anonymous, comment: Vector reference added)