8489
Comment:
|
8040
|
Deletions are marked like this. | Additions are marked like this. |
Line 2: | Line 2: |
[[TableOfContents]] | <<TableOfContents>> |
Line 14: | Line 15: |
For details and background information see http://www.csvreader.com/csv_format.php | * Strictly speaking, the delimiter is not defined in the specs. Some banks offer downloads where a semicolon is used instead of a comma. You might wonder why the name of this format is '''Comma''' Seperated Values, so, but anyway. For details and background information see: http://www.csvreader.com/csv_format.php |
Line 18: | Line 22: |
== Reading a CSV file == | == Reading and writing CSV files using APLX == Reading and writing CSV files in APLX is very straightforward. You can just use the {{{⎕IMPORT}}} and {{{⎕EXPORT}}} system commands, specifying CSV as the format to use: {{{ ⍝ Reading a CSV file... myVariable←⎕IMPORT 'C:\Users\simon\Desktop\spreadsheet_data.csv' 'csv' ⍝ Writing a CSV file some_data←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3 some_data ⎕EXPORT 'C:\Users\simon\Desktop\new_data.csv' 'csv' }}} For another example of reading a CSV file of Google finance data and charting it, see [[CSVandChartingAplx|here]] == Reading a CSV file using Dyalog APL == |
Line 22: | Line 40: |
attachment:cvsexcel3.jpg | {{attachment:cvsexcel3.jpg}} |
Line 26: | Line 44: |
attachment:csvapl.jpg | {{attachment:csvapl.jpg}} |
Line 34: | Line 52: |
=== Step 1: Partition The String Being Read From File === | === First Step: Partition The String Being Read From File === |
Line 38: | Line 56: |
==== APL2 Version ==== {{{ r←{ignoreBetween}PartitionRecordsWithAPL2 string;masked;cr;lf;bool ⍝ Takes a string and partitions records. ⍝ Can deal with Mac/Unix/Windows files. ⍝ For that, CR+LF as well as single LFs are converted into CR. ⍝ CR is then used to partition "string". ⍝ Note that everything between "ignoreBetween" is ignored. ⍝ This can be used to masked stuff between "" (CSV files), for example. ⎕IO←1 (cr lf)←⎕TC[2 3] ⍝ <CarriageReturn> and <LineFeed> →L01×⍳2=⎕NC'ignoreBetween' ignoreBetween←'' ⍝ establish default L01: →L02×⍳masked←0∊⍴ignoreBetween masked←~masked∨≠\masked←'"'=string ⍝ what is not escaped (between "") L02: →L03×⍳~0∊bool←~(cr,lf)⍷masked/string ⍝ are there any unmasked cr/lf in "string"? bool←(~masked)∨masked\bool ⍝ "insert" the masked string[1+(~bool)/⍳⍴bool]←cr ⍝ convert lf into cr string←bool/string ⍝ remove original cr masked←bool/masked →L04 L03:→L03×⍳1∊bool←lf=masked/string ⍝ Are there any unmasked lf in "string"? (((~masked)∨masked\bool)/string)←cr ⍝ change them to cr L04:r←(cr≠masked/string)⊂masked/string ⍝ use unmasked cr for partitioning }}} ==== Dyalog Version ==== |
|
Line 89: | Line 78: |
(((~masked)∨masked\bool)/string)←cr ⍝ change them to cr | ((masked\bool)/string)←cr ⍝ change them to cr |
Line 91: | Line 80: |
r←(cr≠masked/string)⊂masked/string | r←¯1↓(+\1,1↓masked\cr=masked/string)⊂string r←(0,1↓1⍴⍨⍴r)↓¨r |
Line 94: | Line 84: |
=== Step 2: Extract The Real Data === | === Second Step: Extract The Real Data === |
Line 96: | Line 86: |
==== APL2 Version ==== | |
Line 98: | Line 87: |
r←Csv2MatrixWithAPL2 csv;buffer;⎕IO;isNotEmpty;mask;bool | r←{sep}Csv2MatrixWithDyalog csv;bool;⎕IO;buffer;isNum |
Line 102: | Line 91: |
⎕IO←1 r←(⌽∨\0≠⌽↑¨⍴¨r)/r ⍝ remove empty stuff from the end if any mask←~mask∨¨≠\¨mask←'"'=¨r ⍝ what is not escaped (between "") r←⊃(','≠¨mask\¨mask/¨r)⊂¨r ⍝ partition fields by commas r←('"'=¨↑¨r)↓¨r ⍝ remove leading " r←(-'"'=¨↑¨¯1↑¨r)↓¨r ⍝ remove trailing " isNotEmpty←0<↑¨⍴¨r ⍝ remember empty fields bool←,isNotEmpty∧∧/¨r∊¨⊂'0123456789.' ⍝ fields which contains only ... (bool/,r)←⍎¨bool/,r ⍝ Make those numeric |
⍝ "sep" defaults to a comma but that can be changed by specifying a left argument. ⎕IO←1 ⋄ ⎕ML←3 sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep' r←(⌽∨\0≠⌽↑∘⍴¨csv)/csv ⍝ remove empty stuff from the end if any bool←{~{⍵∨≠\⍵}'"'=⍵}¨r ⍝ prepare booleans useful to mask escaped stuff r←⊃r{⍺⊂⍨⍵≠sep}¨bool{⍺\⍺/⍵}¨r ⍝ partition fields by unmasked commas r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r ⍝ remove leading and trailing " buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r ⍝ "buffer" is a copy of r with "¯" for "-" buffer←{0=+/bool←','=w←⍵:⍵ ⋄ (bool/w)←'.' ⋄ w}¨buffer ⍝ "," gets "." r←buffer{↑1⊃v←⎕VFI ⍺:↑2⊃v ⋄ ⍵}¨r ⍝ make fields whith appropriate content numeric scalars |
Line 113: | Line 103: |
==== Dyalog Version ==== | === The Final Step === |
Line 115: | Line 105: |
{{{ r←Csv2MatrixWithDyalog csv;bool;⎕IO ⍝ Convert vector-of-text-vectors "csv" that is assumed to ⍝ come from a *.csv file and which got already partinioned ⍝ into an APL matrix. Takes care of escaped stuff. ⎕IO←1 ⋄ ⎕ML←3 r/⍨←⌽∨\0≠⌽↑∘⍴¨r ⍝ remove empty stuff from the end if any bool←{~{⍵∨≠\⍵}'"'=⍵}¨r ⍝ prepare booleans useful to mask escaped stuff r←⊃r{⍺⊂⍨⍵≠','}¨bool{⍺\⍺/⍵}¨r ⍝ partition fields by unmasked commas r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r ⍝ remove leading and trailing " r←{↑1⊃v←⎕VFI ⍵:↑2⊃v ⋄ ⍵}¨r ⍝ make fields whith appropriate content numeric scalars }}} == The final step == Put it all together (here for the Dyalog version): |
Put it all together: |
Line 142: | Line 117: |
attachment:csvinapl.jpg | {{attachment:csvinapl.jpg}} |
Line 146: | Line 121: |
== Writing a CSV file == | == Writing a CSV file using Dyalog APL == |
Line 148: | Line 123: |
Given this APL array: attachment:array2csv.jpg The following functions take such an array as right argument and convert them into a string that can be written to a file with the extension ".csv". The left argument defaults to "windows" and can be "unix" or "mac" as well. Note that the left argument is case sensitive. The left argument is used to determine the appropriate record seperator. === APL2 Version === (Volunteers needed) === Dyalog Version === |
Given an APL array like: |
Line 161: | Line 126: |
r←{os}Array2CsvWithDyalog array;cr;lf;sep;bool | ⎕←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3 APL is fine, very fine 1 2.2 ¯3 }}} The following functions take such an array as right argument and convert it into a string that can be written to a file with the extension ".csv". The left argument defaults to "windows" and can be "unix" or "mac" as well. Note that the left argument is case sensitive. The left argument is used to determine the appropriate record separator. {{{ r←{os}Array2CsvWithDyalog array;cr;lf;sep;bool;IsChar |
Line 170: | Line 143: |
sep←('windows' 'unix' 'mac'⍳⊂os)⊃(cr,lf)lf cr ⍝ select proper record seperator | sep←('windows' 'unix' 'mac'⍳⊂os)⊃(cr,lf)lf cr ⍝ select proper record separator IsChar←{0 2∊⍨10|⎕DR ⍵} ⍝ Version 12 compatible bool←,~IsChar¨array ⍝ locate number (bool/,array)←⍕¨bool/,array ⍝ make numbers text |
Line 173: | Line 149: |
array←{⊃{⍺,',',⍵}/⍵}¨↓array ⍝ separate fields by comma | |
Line 174: | Line 151: |
((r='¯')/r)←'-' ⍝ Handle ¯ }}} {{{ ⍝ Example: #.Array2Csv 2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3 APL,is,"fine, very fine" 1,2.2,-3 |
CSV to APL
Contents
CSV stands for comma separated values. Those files are still used to transport tabular data between applications that are not directly connected. The files can be edited with any spreadsheet application like Microsoft Excel.
There are some things one need to know about CSV file in order to deal with them:
- Fields are separated by commas
- Records are separated with system end of line characters, CRLF (ASCII 13 Dec or 0D Hex and ASCII 10 Dec or 0A Hex respectively) for Windows, LF for Unix, and CR for Mac
- If a field contains either a comma or one of the end line chars, either the char(s) or the whole contents needs to be escaped. Excel escapes these values by embedding the field inside a set of double quotes. For example, a single cell with the text apples, carrots, and oranges becomes "apples, carrots, and oranges"
Strictly speaking, the delimiter is not defined in the specs. Some banks offer downloads where a semicolon is used instead of a comma. You might wonder why the name of this format is Comma Seperated Values, so, but anyway.
For details and background information see:
Note that the format comes with a nasty built-in-problem: there is no way to recognize a cell as being numeric. Converting cells which only contain a proper number does not help because if you enter a digit with a leading quote, Excel handles this as text but again this cannot be recognized as text in the csv file. The only solution is therefore to make an informed guess.
Reading and writing CSV files using APLX
Reading and writing CSV files in APLX is very straightforward. You can just use the ⎕IMPORT and ⎕EXPORT system commands, specifying CSV as the format to use:
⍝ Reading a CSV file... myVariable←⎕IMPORT 'C:\Users\simon\Desktop\spreadsheet_data.csv' 'csv' ⍝ Writing a CSV file some_data←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3 some_data ⎕EXPORT 'C:\Users\simon\Desktop\new_data.csv' 'csv'
For another example of reading a CSV file of Google finance data and charting it, see here
Reading a CSV file using Dyalog APL
Given an Excel spreadsheet that looks like this:
Saving this into a csv file, the file can be read into APL. The variable would look like this:
To convert this into an APL matrix is a two-step-process:
- partition the simple string from file
- extract the data and build up the APL matrix
First Step: Partition The String Being Read From File
With the following two functions this variable can be transformed into an APL array where every item represents a record. Data masked by " remain unchanged. The functions can deal with files from Unix, Mac and Windows.
r←{ignoreBetween}PartitionRecordsWithDyalog string;masked;cr;lf;bool ⍝ Takes a string and partitions records. ⍝ Can deal with Mac/Unix/Windows files. ⍝ For that, CR+LF as well as single LFs are converted into CR. ⍝ CR is then used to partition "string". ⍝ Note that everything between "ignoreBetween" is ignored. ⍝ This can be used to masked stuff between "" (CSV files), for example. ⎕IO←1 ⋄ ⎕ML←3 (cr lf)←⎕TC[2 3] ⍝ <CarriageReturn> and <LineFeed> :If 0=⎕NC'ignoreBetween' ignoreBetween←'' ⍝ establish default :EndIf :If ~masked←0∊⍴ignoreBetween masked←~{⍵∨≠\⍵}'"'=string ⍝ what is not escaped (between "") :EndIf :If 0∊bool←~(cr,lf)⍷masked/string ⍝ are there any unmasked cr/lf in "string"? bool←(~masked)∨masked\bool ⍝ "insert" the masked string[1+{⍵/⍳⍴⍵}~bool]←cr ⍝ convert lf into cr string←bool/string ⍝ remove original cr masked←bool/masked :ElseIf 1∊bool←lf=masked/string ⍝ Are there any unmasked lf in "string"? ((masked\bool)/string)←cr ⍝ change them to cr :EndIf r←¯1↓(+\1,1↓masked\cr=masked/string)⊂string r←(0,1↓1⍴⍨⍴r)↓¨r
Second Step: Extract The Real Data
r←{sep}Csv2MatrixWithDyalog csv;bool;⎕IO;buffer;isNum ⍝ Convert vector-of-text-vectors "csv" that is assumed to ⍝ come from a *.csv file and which got already partinioned ⍝ into an APL matrix. Takes care of escaped stuff. ⍝ "sep" defaults to a comma but that can be changed by specifying a left argument. ⎕IO←1 ⋄ ⎕ML←3 sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep' r←(⌽∨\0≠⌽↑∘⍴¨csv)/csv ⍝ remove empty stuff from the end if any bool←{~{⍵∨≠\⍵}'"'=⍵}¨r ⍝ prepare booleans useful to mask escaped stuff r←⊃r{⍺⊂⍨⍵≠sep}¨bool{⍺\⍺/⍵}¨r ⍝ partition fields by unmasked commas r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r ⍝ remove leading and trailing " buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r ⍝ "buffer" is a copy of r with "¯" for "-" buffer←{0=+/bool←','=w←⍵:⍵ ⋄ (bool/w)←'.' ⋄ w}¨buffer ⍝ "," gets "." r←buffer{↑1⊃v←⎕VFI ⍺:↑2⊃v ⋄ ⍵}¨r ⍝ make fields whith appropriate content numeric scalars
The Final Step
Put it all together:
r←DealWithCsv filename;data ⍝ Read "filename" which is assumed to be a *.csv file ⍝ and converts it into a matrix data←FileRead filename data←'"'PartitionRecordsWithDyalog data r←Csv2MatrixWithDyalog data
The resulting variable in APL would look like this:
Note that the 1 in the second row/second column got converted into the number because the contents of the cell remained of digits only. However, in the original Excel spreadsheet that cell is text; this is indicated by the small green triangle. This information is not contained in the CSV file.
Writing a CSV file using Dyalog APL
Given an APL array like:
⎕←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3 APL is fine, very fine 1 2.2 ¯3
The following functions take such an array as right argument and convert it into a string that can be written to a file with the extension ".csv". The left argument defaults to "windows" and can be "unix" or "mac" as well. Note that the left argument is case sensitive. The left argument is used to determine the appropriate record separator.
r←{os}Array2CsvWithDyalog array;cr;lf;sep;bool;IsChar ⎕IO←1 ⋄ ⎕ML←3 :If 0=⎕NC'os' os←'windows' :EndIf 'Invalid left argument; must be one of: windows, unix, mac'⎕SIGNAL 11/⍨~(⊂os)∊'windows' 'unix' 'mac' 'Right argument must have a depth of 2'⎕SIGNAL 11/⍨2≠≡array 'Right argument must be either a matrix or a vector'⎕SIGNAL 11/⍨~(⍴⍴array)∊1 2 (cr lf)←⎕TC[2 3] ⍝ <CarriageReturn> and <LineFeed> sep←('windows' 'unix' 'mac'⍳⊂os)⊃(cr,lf)lf cr ⍝ select proper record separator IsChar←{0 2∊⍨10|⎕DR ⍵} ⍝ Version 12 compatible bool←,~IsChar¨array ⍝ locate number (bool/,array)←⍕¨bool/,array ⍝ make numbers text bool←,(lf∊¨array)∨','∊¨array ⍝ where are special chars used? (bool/,array)←{'"',⍵,'"'}¨bool/,array ⍝ escape field with special chars array←{⊃{⍺,',',⍵}/⍵}¨↓array ⍝ separate fields by comma r←⊃,/array,¨⊂sep ⍝ make it simpel ((r='¯')/r)←'-' ⍝ Handle ¯
⍝ Example: #.Array2Csv 2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3 APL,is,"fine, very fine" 1,2.2,-3
Author: KaiJaeger