APL to Unicode
Whilst the material described below relates specifically to APL+WIN, it should be readily customisable to work with any APL interpreter that is not already unicode capable.
Currently the APL to unicode functions write the unicode to native text files from which it can be cut and pasted into emails, newsgroups, web pages etc. Similarly the unicode to APL function requires the unicode to be cut and pasted from its source into a native text file prior to conversion.
The unicode can be copied and pasted to the text files using Microsoft Notepad with the APL385 Unicode font. Make sure you select UTF-8 as the encoding in the "Save as" dialog when you save a file.
My original aim was to work directly via the clipboard but the amount of APL code required to manage the Windows clipboard is prohibitive for displaying here. APL+WIN has in-built user commands (]clipcopy and ]clippaste) to do the job and I suggest APL+WIN users use those if they want to go directly via the clipboard. Users of other interpreters no doubt have their own equivalents they can use.
AplToUtf8 takes the name of a function and converts the code to Unicode UTF-8 encoding. As it stands this function simply deals with whole functions but can easily be generalised to work with any character string input. For a quick and dirty job just comment out the first two lines of working code for it to work on simple character input.
∇ AplToUtf8 f
⍝Get a character representation of the function
f←⎕cr f
⍝Append new line and carriage return characters
f←(f,⎕tcnl),⎕tclf
⍝Convert each character to its unicode binary value
f←∊Utf8 ¨∆avutf8[⎕av⍳,f]
⍝Add the encoding level header and convert back to ascii characters
f←82 ⎕dr 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1,f
⍝File the character stream
f FileData 'c:\unicode.txt'
∇
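The 24 bits that AplToUtf8 prepends as the "encoding level header" can be checked outside APL. The following is a minimal Python sketch, not part of the original functions, that packs those bits into bytes; the names `HEADER_BITS` and `bits_to_bytes` are made up for this illustration. It suggests the header is the standard UTF-8 byte order mark, whose three bytes sum to 617, the value Utf8ToApl tests for.

```python
import codecs

# The 24 header bits prepended by AplToUtf8, copied from the listing above.
HEADER_BITS = [1, 1, 1, 0, 1, 1, 1, 1,
               1, 0, 1, 1, 1, 0, 1, 1,
               1, 0, 1, 1, 1, 1, 1, 1]


def bits_to_bytes(bits):
    """Pack a flat list of 0/1 values into bytes, most significant bit first."""
    if len(bits) % 8:
        raise ValueError("bit count must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


header = bits_to_bytes(HEADER_BITS)

# Compare with the UTF-8 byte order mark known to the standard library.
print(header == codecs.BOM_UTF8)
# Sum of the three byte values, as tested by Utf8ToApl.
print(sum(header))
```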
Utf8, which is called under each (¨) in the above, simply implements the UTF-8 specification to create the unicode byte structure for each character. Anyone interested in the byte structure can find it in the Description section of http://en.wikipedia.org/wiki/UTF-8.
∇ r←Utf8 c
⍝Determine the number of bytes required to represent the character in unicode
r←+/(⌈/((21⍴2)⊤c)/⌽⍳21)>0 7 11 16
⍝Convert the character to bytes according to the UTF-8 specification
:Select r
:Case 1
    r←⍎⍕0,(7⍴2)⊤c
:Case 2
    r←⍎⍕(1 1 0,5↑r),1 0,5↓r←(11⍴2)⊤c
:Case 3
    r←⍎⍕(1 1 1 0,4↑r),(1 0,6↑4↓r),1 0,10↓r←(16⍴2)⊤c
:Case 4
    r←⍎⍕(1 1 1 1 0,3↑r),(1 0,6↑3↓r),(1 0,6↑9↓r),1 0,15↓r←(21⍴2)⊤c
:EndSelect
∇
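For readers who want to cross-check the byte layout outside APL, here is a minimal Python sketch of the same idea. It is not a translation of Utf8: it returns byte values rather than bits, the function name `utf8_bytes` is invented for this example, and it treats code point 0 as a single byte as an assumption. It reuses the thresholds 0, 7, 11 and 16 to choose the sequence length.

```python
def utf8_bytes(code_point):
    """Encode one code point (at most 21 bits) as a list of UTF-8 byte values."""
    if not 0 <= code_point < (1 << 21):
        raise ValueError("code point does not fit in 21 bits")
    n_bits = code_point.bit_length()
    # Same thresholds as the APL listing: more than 0, 7, 11 or 16 significant bits.
    n_bytes = sum(n_bits > limit for limit in (0, 7, 11, 16))
    if n_bytes <= 1:
        # Assumption of this sketch: code point 0 is emitted as one zero byte.
        return [code_point]
    lead_prefix = {2: 0b11000000, 3: 0b11100000, 4: 0b11110000}[n_bytes]
    out = []
    # Build the continuation bytes from the low end, six payload bits at a time.
    for _ in range(n_bytes - 1):
        out.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Whatever remains goes into the lead byte after its length prefix.
    out.append(lead_prefix | code_point)
    return out[::-1]


# Compare against Python's own encoder for a few code points used on this page.
for cp in (65, 247, 8592, 9079):
    print(cp, utf8_bytes(cp), list(chr(cp).encode("utf-8")))
```

Like the APL original, the sketch does not reject surrogate code points or other values that a strict UTF-8 encoder would refuse.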
The function FileData is a simple utility function to file the result. I am sure you all have your own versions.
Utf8ToApl is the reverse function. It assumes that the unicode resides in a native text file.
∇ r←Utf8ToApl;v
⍝Tie the native file containing the unicode
'c:\unicode.txt' ⎕ntie ¯1
⍝Read the bits from the file
v←⎕nread ¯1 11,(⎕nsize ¯1),0
⍝Untie the file
⎕nuntie ¯1
⍝Initialise the results vector
r←0⍴0
⍝Convert the bits to integers
v←2⊥⍉((.125×⍴v),8)⍴v
⍝Strip off the encoding header if present
:if 617=+/3↑v
    v←3↓v
:endif
⍝Decode the unicode bytes back to integers in accordance with the UTF-8 specification
:while 0≠⍴v
    ⍝Determine how many bytes represent the next character
    :select +/(↑v)>0 127 223 239
    :case 1
        r←r,2⊥1↓(8⍴2)⊤v[1]
        v←1↓v
    :case 2
        r←r,2⊥(3↓(8⍴2)⊤v[1]),2↓(8⍴2)⊤v[2]
        v←2↓v
    :case 3
        r←r,2⊥(4↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),2↓(8⍴2)⊤v[3]
        v←3↓v
    :case 4
        r←r,2⊥(5↓(8⍴2)⊤v[1]),(2↓(8⍴2)⊤v[2]),(2↓(8⍴2)⊤v[3]),2↓(8⍴2)⊤v[4]
        v←4↓v
    :endselect
:endwhile
⍝Convert unicode integers back to ⎕av characters
r←⎕av[(∆avutf8⍳r)∼11]
∇
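The decoding loop can be sketched in Python as a cross-check. This is an illustration under stated assumptions rather than a port: the name `utf8_code_points` is invented, the header is recognised by comparing the exact three bytes instead of testing for the sum 617, and a zero lead byte is treated as a one-byte character. The lead-byte thresholds 0, 127, 223 and 239 are the ones used in the listing.

```python
import codecs


def utf8_code_points(data):
    """Decode UTF-8 bytes into a list of code points, skipping a leading header."""
    # Assumption: match the three header bytes exactly rather than by their sum.
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Same lead-byte thresholds as the APL listing; max() keeps a zero
        # byte as a one-byte character (an assumption of this sketch).
        n_bytes = max(1, sum(lead > limit for limit in (0, 127, 223, 239)))
        chunk = data[i:i + n_bytes]
        if len(chunk) < n_bytes:
            raise ValueError("truncated UTF-8 sequence")
        # Drop the length prefix from the lead byte: 7, 5, 4 or 3 payload bits.
        value = lead & 0x7F if n_bytes == 1 else lead & (0xFF >> (n_bytes + 1))
        # Each continuation byte contributes its low six bits.
        for byte in chunk[1:]:
            value = (value << 6) | (byte & 0x3F)
        points.append(value)
        i += n_bytes
    return points


sample = "r←⍳5"
encoded = codecs.BOM_UTF8 + sample.encode("utf-8")
print(utf8_code_points(encoded) == [ord(ch) for ch in sample])
```

As in the APL function, malformed input such as a stray continuation byte is not detected; only a sequence cut short at the end of the data raises an error here.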
∆avutf8 is a vector used to map the APL+WIN ⎕AV positions to their unicode code-points. The many entries with the value 63 are question marks, standing in for positions whose code points I do not know; anyone who knows the corresponding unicode code points for those places in the APL+WIN atomic vector should feel free to add them in. Also please feel free to correct any errors in the vector that might show up when using it with the above functions.
0 1 2 9079 9674 168 8592 7 8 9 10 8834 12 13 8835 9055 16 17 18 9067 20 21 9068 9077 8593 8595 8594 27 8867 8866 9035 9042 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 8764 127 63 63 63 63 63 63 8800 63 63 63 63 63 63 8968 63 8970 63 8710 215 63 63 9109 63 9054 9017 63 63 63 63 63 9066 63 63 63 63 63 63 63 9053 9024 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 9082 63 9075 63 63 9073 8869 8868 9021 8854 9074 9023 8711 9033 8714 9067 8801 9049 8805 8804 9045 9038 247 63 8728 9675 8744 9076 63 175 124 63
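How the vector is used in both directions can be sketched in Python. This is a hedged illustration only: `AV_TO_UNICODE` holds just the first 128 entries copied from the vector above, positions are counted from 0 here (the APL listings index ⎕av according to the interpreter's index origin), and the helper names are invented for the example.

```python
# First 128 entries of the vector above: 32 control positions, then the
# unchanged range 32..125, then 8764 and 127.
AV_TO_UNICODE = [
    0, 1, 2, 9079, 9674, 168, 8592, 7, 8, 9, 10, 8834, 12, 13, 8835, 9055,
    16, 17, 18, 9067, 20, 21, 9068, 9077, 8593, 8595, 8594, 27, 8867, 8866,
    9035, 9042,
] + list(range(32, 126)) + [8764, 127]


def av_to_unicode(av_positions):
    """Map 0-based atomic vector positions to unicode code points."""
    return [AV_TO_UNICODE[pos] for pos in av_positions]


def unicode_to_av(code_points):
    """Map code points back to positions, taking the first match as dyadic iota does."""
    return [AV_TO_UNICODE.index(cp) for cp in code_points]


# Position 6 holds 8592, the code point of the left arrow.
print(av_to_unicode([6]), chr(8592))
# Within this excerpt every entry is unique, so the round trip is lossless.
print(unicode_to_av(av_to_unicode([3, 6, 65])))
```

In the full 256-entry vector the value 63 occurs many times and 9067 occurs twice, so a first-match reverse lookup sends every such code point back to the first position that holds it. Filling in the unknown entries is what makes the round trip lossless for the upper half of the atomic vector.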
I used the functions to convert themselves into unicode for the wiki. Anyone wishing to create their own versions for another interpreter needs firstly to create the appropriate translation vector for their interpreter. An excellent place to start for anyone wishing to do this is Adrian Smith's article in Vector http://www.vector.org.uk/resource/uniref.pdf. The functions themselves should readily translate to any interpreter if not usable directly.
Author: GrahamSteer