Revision 54 as of 2017-02-16 19:12:54
CSV to APL
CSV stands for comma-separated values. Such files are still used to transport tabular data between applications that are not directly connected.
They can be edited with any spreadsheet application, such as Microsoft Excel.
There are some things one needs to know about CSV files in order to deal with them:
- Fields are separated by commas. Well, mostly.
- Records are separated by the system's end-of-line characters: CRLF (ASCII 13 Dec or 0D Hex, followed by ASCII 10 Dec or 0A Hex) for Windows, LF for Unix, and CR for classic Mac OS.
- If a field contains a comma, a double quote, or one of the end-of-line characters, then either the character(s) or the whole contents need to be escaped.
Excel escapes these values by embedding the field inside a set of double quotes. For example, a single cell with the text apples, carrots, and oranges becomes "apples, carrots, and oranges".
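As a rough illustration of this quoting rule, the following dfn wraps a field in double quotes and doubles any double quotes inside it. The name Quote is made up for this sketch; the doubling step mirrors what Array2CsvWithDyalog does further down.

```apl
⍝ Hypothetical helper, not part of the functions in this article
Quote←{'"',(⍵/⍨1+'"'=⍵),'"'}     ⍝ replicate each " twice, then wrap the field in "

Quote 'apples, carrots, and oranges'
⍝ "apples, carrots, and oranges"
```

A real writer would only quote fields that actually contain a separator, a double quote, or a line break.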
Strictly speaking, the delimiter is not defined in the specs. Some banks offer downloads that use a semicolon as the separator instead of a comma. You might wonder why the format is then called Comma Separated Values, but anyway. Several versions of Excel do not recognize a semicolon as a separator.
For details and background information see: http://www.csvreader.com/csv_format.php
Note that the format comes with a nasty built-in problem: there is no way to recognize a cell as being numeric. Converting only those cells that contain a proper number does not help: if you enter a number with a leading quote, Excel handles it as text, but that information is not stored in the CSV file.
The only solution is therefore to make an informed guess. This informed guess can vary from file to file and person to person, so please look at the functions Csv2MatrixWithDyalog and Csv2Numeric below to see if you want to change them before you start using them.
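To make the "informed guess" more concrete, here is a small sketch of the system function that Csv2Numeric relies on. ⎕VFI splits a character vector at blanks and reports, for each piece, whether it is a valid number and what its value is. The variable names valid and nums are chosen only for this illustration.

```apl
(valid nums)←⎕VFI '12 abc 3.5 1E3'
valid      ⍝ 1 0 1 1          "abc" is not a number
nums       ⍝ 12 0 3.5 1000    invalid pieces are reported as 0
```

A text cell holding 1 and a numeric cell holding 1 both arrive as the single character 1, so a check of this kind cannot tell them apart.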
Reading a CSV file using Dyalog APL
Given an Excel spreadsheet that looks like this:
| numeric | char      | date       | currency    | misc                         |
| 1       | 1         | 2015-03-24 | 1.23        | Yes                          |
| 2       | Hello     | 2015-01-01 | ¯10         |                              |
| 3       |           | 1999-12-31 |             | No                           |
| 4       | More      | 2001-02-01 | 123456789.1 | "Are your sure?"             |
|         | Less      |            |             |                              |
| 5       | Much more | 2014-04-03 | 0           | apples, carrots, and bananas |
After saving this as a CSV file, the file can be read into APL. The variable would look like this:
Converting this into an APL matrix is a two-step process:
- Partition the simple string from file
- Extract the data and build up the APL matrix
First step: partition the string being read from file
With the following function this variable can be transformed into an APL array where every item represents a record. Data masked by double quotes (") remain unchanged.
The function can deal with files from Unix, Mac and Windows.
    r←PartitionRecordsWithDyalog string;masked;cr;lf;bool
    ⍝ Takes a string and partitions records.
    ⍝ Can deal with Mac/Unix/Windows files.
    ⍝ For that, CR+LF pairs as well as single CRs are converted into LF.
    ⍝ LF is then used to partition "string".
     ⎕IO←1 ⋄ ⎕ML←3
     (cr lf)←⎕UCS 13 10                  ⍝ <CarriageReturn> and <LineFeed>
     :If 0<+/bool←(cr,lf)⍷string         ⍝ Are there any cr+lf in "string"?
         string←(~bool)/string           ⍝ Let only the lf survive
     :EndIf
     :If 0<+/bool←cr=string              ⍝ Are there still any cr's?
         (bool/string)←lf                ⍝ Convert them to lf
     :EndIf
    ⍝ In the remaining string there might be lf's inside text. Those
    ⍝ need to be masked before we decide where records really start.
     masked←~{⍵∨≠\⍵}'"'=string           ⍝ What is not escaped (between "")
     :If 1∊bool←lf=masked/string         ⍝ Are there any unmasked lf in "string"?
         r←(~masked\bool)⊂string
     :Else                               ⍝ So it's a single record
         r←⊂string
     :EndIf
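The masking expression used above is easier to follow on a tiny example. The sketch below applies it to a single record and shows which commas remain visible as separators. The variable names are chosen for this illustration only.

```apl
s←'a,"b,c",d'
quote←'"'=s              ⍝ 0 0 1 0 0 0 1 0 0   where the double quotes are
inside←{⍵∨≠\⍵}quote      ⍝ 0 0 1 1 1 1 1 0 0   quoted stretch, both quotes included
unmasked←~inside         ⍝ 1 1 0 0 0 0 0 1 1   what is not escaped
unmasked∧','=s           ⍝ 0 1 0 0 0 0 0 1 0   only the two real separators survive
```

The same idea protects line feeds inside quoted text when records are partitioned.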
Second step: extract the real data
    r←{sep}Csv2MatrixWithDyalog csv;bool;⎕IO
    ⍝ Convert vector-of-text-vectors "csv" that is assumed to
    ⍝ come from a *.csv file and which got already partitioned
    ⍝ into an APL matrix. Takes care of escaped stuff etc.
    ⍝ "sep" defaults to a comma.
     ⎕IO←1 ⋄ ⎕ML←3
     sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep'
     r←(⌽∨\0≠⌽↑∘⍴¨csv)/csv                ⍝ Remove empty stuff from the end if any
     r←sep,¨r                             ⍝ Add starting separator
     bool←{~{⍵∨≠\⍵}'"'=⍵}¨r               ⍝ Prepare booleans useful to mask escaped stuff
     r←r{⎕ML←1 ⋄ ⍺⊂⍨⍵=sep}¨bool{⍺\⍺/⍵}¨r  ⍝ Partition fields by unmasked separators
     r←⊃{1↓¨⍵}¨r                          ⍝ Drop separator and transform to a matrix
     r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r             ⍝ Remove leading and trailing "
     r←Csv2Numeric r                      ⍝ Convert numeric cells
     r←(~'""'∘⍷¨r)/¨r                     ⍝ Reduce double-" to single ones
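The field-splitting step can be sketched on one record as follows. This is a simplified variant of the idea, not the exact expression used in the function: it combines the quote mask with the separator positions and then uses partitioned enclose, which requires ⎕ML←1. The variable names are illustrative.

```apl
⎕ML←1                           ⍝ dyadic ⊂ is partitioned enclose
s←',a,"b,c",d'                  ⍝ one record with the leading separator already added
unmasked←~{⍵∨≠\⍵}'"'=s          ⍝ 1 = outside double quotes
starts←unmasked∧','=s           ⍝ 1 0 1 0 0 0 0 0 1 0   each 1 starts a new field
fields←1↓¨starts⊂s              ⍝ three fields:  a   "b,c"   d
```

The leading separator guarantees that the first field starts a partition too, which is why the function adds one to every record before splitting.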
    r←{ignore}Csv2Numeric r;buffer
    ⍝ Transform cells that contain digits into numeric values, BUT:
    ⍝ * Commas are ignored.
    ⍝ * "$£€¥" are ignored because the left argument "ignore" defaults to those.
    ⍝ * Blanks are removed
    ⍝ Example:
    ⍝ (¯10 3 4 1234.5 12 1000 '1A')←Csv2Numeric '-10' '3' '4' '123,4.5' '£12' '1E3' '1A'
     ignore←{0<⎕NC ⍵:⍎⍵ ⋄ '$£€¥'}'ignore'
     buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r             ⍝ "buffer" is a copy of r with "¯" for "-"
     r←buffer{(0∊⍴⍵):'' ⋄ ,↑1⊃v←⎕VFI ⍺~' ,',ignore:↑2⊃v ⋄ ⍵}¨r    ⍝ Make fields with appropriate content numeric
Putting it all together
    r←{sep} DealWithCsv filename;data
    ⍝ Read "filename" which is assumed to be a *.csv file
    ⍝ and convert it into a matrix
     sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep'
     data←FileRead filename
     data←PartitionRecordsWithDyalog data
     r←sep Csv2MatrixWithDyalog data
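DealWithCsv calls a function FileRead that is not shown in this article. The sketch below is one possible stand-in, under the assumption that the system function ⎕NGET is available in your version of Dyalog. The file names are made up.

```apl
⍝ Hypothetical stand-in for FileRead: the first item of ⎕NGET's result is the file content
FileRead←{⎕IO←1 ⋄ 1⊃⎕NGET ⍵}

matrix←DealWithCsv 'C:\temp\example.csv'      ⍝ comma is the default separator
matrix←';' DealWithCsv 'C:\temp\bank.csv'     ⍝ semicolon-separated file
```

Any function that returns the file contents as a simple character vector would do equally well here.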
The resulting variable in APL would look like this:
Note that the 1 in the second row/second column got converted into a number because the cell consists of digits only. However, in the original Excel spreadsheet that cell is text; this is indicated by the small green triangle. This information is not contained in the CSV file.
Writing a CSV file using Dyalog APL
Given an APL array like:
          ⎕←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3
    APL is fine, very fine
    1 2.2 ¯3
The following function takes such an array as right argument and converts it into a string that can be written to a file with the extension ".csv". The left argument defaults to "windows" and can be "unix" or "mac" as well. Note that the left argument is case sensitive. The left argument is used to determine the appropriate record separator.
    r←{os}Array2CsvWithDyalog array;cr;lf;sep;bool;IsChar;dq
     ⎕IO←1 ⋄ ⎕ML←3
     :If 0=⎕NC'os'
         os←'windows'
     :EndIf
     'Invalid left argument; must be one of: windows, unix, mac'⎕SIGNAL 11/⍨~(⊂os)∊'windows' 'unix' 'mac'
     'Right argument must have a depth of 2'⎕SIGNAL 11/⍨2≠≡array
     'Right argument must be either a matrix or a vector'⎕SIGNAL 11/⍨~(⍴⍴array)∊1 2
     (cr lf)←⎕TC[2 3]                                    ⍝ <CarriageReturn> and <LineFeed>
     sep←('windows' 'unix' 'mac'⍳⊂os)⊃(cr,lf)lf cr       ⍝ Select proper record separator
     IsChar←{0 2∊⍨10|⎕DR ⍵}                              ⍝ Version 12 compatible
     bool←,~IsChar¨array                                 ⍝ Locate numbers
     (bool/,array)←{('-',⍵)[('¯',⍵)⍳⍵]}¨⍕¨bool/,array    ⍝ Make numbers text and convert ¯ to -
     dq←,'"'∊¨array                                      ⍝ Where are double quotes in the text?
     (dq/,array)←{⍵/⍨1+'"'=⍵}¨dq/,array                  ⍝ Double the double quotes
     bool←dq∨,(lf∊¨array)∨','∊¨array                     ⍝ Where are special chars used?
     (bool/,array)←{'"',⍵,'"'}¨bool/,array               ⍝ Escape field with special chars
     array←{⊃{⍺,',',⍵}/⍵}¨↓array                         ⍝ Separate fields by comma
     r←⊃,/array,¨⊂sep                                    ⍝ Make it simple
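The expression that turns the APL high minus into an ordinary minus may look cryptic at first. The sketch below isolates it under the made-up name Minus. It looks up every character in the text prefixed with ¯ and then picks from the same text prefixed with -, so only ¯ is replaced.

```apl
Minus←{('-',⍵)[('¯',⍵)⍳⍵]}     ⍝ hypothetical name for the idiom used above

Minus '¯3'
⍝ -3
Minus '2.2'
⍝ 2.2
```

This works with ⎕IO set to either 0 or 1, because the lookup and the indexing use the same index origin.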
    ⍝ Example:
          #.Array2CsvWithDyalog 2 3⍴'APL' 'really "really" is' 'fine, very fine' 1 2.2 ¯3
    APL,"really ""really"" is","fine, very fine"
    1,2.2,-3
Author: KaiJaeger
Update -- KaiJaeger 2012-08-05 11:06:46: incorporated a couple of findings/suggestions from EllisMorgan.
Update -- KaiJaeger 2015-03-24 11:26:39: bug fix: empty cells were not handled correctly.
Update -- KaiJaeger 2016-02-02 13:28:21: improvements as suggested by PierreGilbert.