Differences between revisions 3 and 42 (spanning 39 versions)
Revision 3 as of 2007-11-25 18:38:24
Size: 3610
Editor: KaiJaeger
Comment:
Revision 42 as of 2011-06-11 07:50:27
Size: 10005
Editor: EllisMorgan
Comment: corrections
Deletions are marked like this. Additions are marked like this.
Line 2: Line 2:
<<TableOfContents>>
Line 7: Line 8:
 * Fields are separated by commas  * Fields are separated by commas. Well, mostly.
Line 11: Line 12:
 * If a field contains either a comma or one of the end line chars, either the char(s) or the whole contents needs to be escaped. Excel escapes these values by embedding the field inside a set of double quotes. For example, a single cell with the text apples, carrots, and oranges becomes "apples, carrots, and oranges"  * If a field contains either a comma or a double quote or one of the end line chars, either the char(s) or the whole contents needs to be escaped. Excel escapes these values by embedding the field inside a set of double quotes. For example, a single cell with the text `apples, carrots, and oranges` becomes `"apples, carrots, and oranges"`.
Line 13: Line 14:
For details and background information see http://www.csvreader.com/csv_format.php  * Strictly speaking, the delimiter is not defined in the specs. Some banks offer downloads where a semicolon is used instead of a comma. You might wonder why the name of this format is '''Comma''' Separated Values, but anyway. Several versions of Excel do not recognize a semicolon as a separator.
Line 15: Line 16:
Note that the format comes with some a nasty built-in-problem: there is no way to recognize a cell as being numeric. Converting cells which contains appropriate does not help because if you enter a digit with a leading quote, Excel handles this as text but again this cannot be recognized as text in the csv file. The only solution is therefore to make an informed guess. For details and background information see:
Line 17: Line 18:
 . http://www.csvreader.com/csv_format.php

Note that the format comes with a nasty built-in-problem: there is no way to recognize a cell as being numeric. Converting cells which only contain a proper number does not help because if you enter a digit with a leading quote, Excel handles this as text but again this cannot be recognized as text in the csv file. The only solution is therefore to make an informed guess. This informed guess can vary from file to file and person to person, so please look at the functions {{{Csv2MatrixWithDyalog}}} and {{{Csv2Numeric}}} below to see if you want to change them before you start using this.

== Reading and writing CSV files using APLX ==
Reading and writing CSV files in APLX is straightforward; You can just use the {{{⎕IMPORT}}} and {{{⎕EXPORT}}} system commands, specifying CSV as the format to use:

{{{
⍝ Reading a CSV file...
myVariable←⎕IMPORT 'C:\Users\simon\Desktop\spreadsheet_data.csv' 'csv'

⍝ Writing a CSV file
some_data←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3
some_data ⎕EXPORT 'C:\Users\simon\Desktop\new_data.csv' 'csv'
}}}
For another example of reading a CSV file of Google finance data and charting it, see [[CSVandChartingAplx|here]]

== Reading a CSV file using Dyalog APL ==
Line 19: Line 38:
attachment:cvsexcel3.jpg {{attachment:cvsexcel3.jpg}}
Line 23: Line 42:
attachment:csvapl.jpg {{attachment:csvapl.jpg}}
Line 25: Line 44:
With the following two functions this variable can be transformed into an APL matrix. To convert this into an APL matrix is a two-step-process:
Line 27: Line 46:
== APL2 Version ==  * partition the simple string from file

 * extract the data and build up the APL matrix

=== First step: partition the string being read from file ===
With the following two functions this variable can be transformed into an APL array where every item represents a record. Data masked by " remain unchanged. The functions can deal with files from Unix, Mac and Windows.
Line 29: Line 54:
 r←Csv2MatrixWithAPL2 csv;crlf;buffer;bool;⎕IO;isNotEmpty
⍝ Convert a simple string "csv" that is assumed to come from a *.csv
⍝ file into an APL matrix. Takes care of escaped stuff.
 ⎕IO←1
 crlf←⎕AV[4 3] ⍝ That's what CR LF is in Dyalog APL - tell me what it is in APL2?!
 buffer←crlf,csv
 r←1↓¨buffer⊂⍨~crlf⍷buffer ⍝ partion records
 r←(⌽∨\0≠⌽↑¨⍴¨r)/r ⍝ remove empty stuff from the end
 bool←~bool∨¨≠\¨bool←'"'=¨r ⍝ ignore what's escaped (between " and ")
 r←⊃(','≠¨bool\¨bool/¨r)⊂¨r ⍝ partition fields by commas
 r←('"'=¨↑¨r)↓¨r ⍝ remove leading "
 r←(-'"'=¨↑¨¯1↑¨r)↓¨r ⍝ remove trailing "
 isNotEmpty←0<↑¨⍴¨r ⍝ remember empty fields
 bool←,isNotEmpty∧∧/¨r∊¨⊂⎕D,'.' ⍝ fields which contains digits only, and maybe a .
 (bool/,r)←⍎¨bool/,r ⍝ Make those numeric
r←PartitionRecordsWithDyalog string;masked;cr;lf;bool;
⍝ Takes a string and partitions records.
⍝ Can deal with Mac/Unix/Windows files.
⍝ For that, CR+LF as well as single LFs are converted into CR.
⍝ CR is then used to partition "string".
 ⎕IO←1 ⋄ ⎕ML←3
 (cr lf)←⎕TC[2 3] ⍝ <CarriageReturn> and <LineFeed>
 :If 0<+/bool←(cr,lf)⍷string ⍝ are there any cr+lf in "string"?
     string←(~bool)/string ⍝ Let only the cr survive
 :EndIf
 :If 0<+/bool←cr=string ⍝ Are there still any cr's?
     (bool/string)←lf ⍝ Convert them to lf
 :EndIf
⍝ In the remaining string, there might be lf's inside text, Those
⍝ need to be masked before we decide where recors really starts.
 masked←~{⍵∨≠\⍵}'"'=string ⍝ what is not escaped (between "")
 :If 1∊bool←lf=masked/string ⍝ are there any unmasked lf in "string"?
     r←(~masked\bool)⊂string
 :Else ⍝ so it's a single record
     r←⊂string
 :EndIf
}}}
=== Second step: extract the real data ===
{{{
  r←{sep}Csv2MatrixWithDyalog csv;bool;⎕IO
⍝ Convert vector-of-text-vectors "csv" that is assumed to
⍝ come from a *.csv file and which got already partitioned
⍝ into an APL matrix. Takes care of escaped stuff.
⍝ "sep" defaults to a comma but that can be changed by specifying a left argument.
 ⎕IO←1 ⋄ ⎕ML←3
 sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep'
⍝ r←(⌽∨\0≠⌽↑∘⍴¨csv)/csv ⍝ remove empty stuff from the end if any
 r←sep,¨(⌽∨\0≠⌽↑∘⍴¨csv)/csv ⍝ remove empty stuff from the end if any
 bool←{~{⍵∨≠\⍵}'"'=⍵}¨r ⍝ prepare booleans useful to mask escaped stuff
 r←⊃r{⍺⊂⍨⍵≠sep}¨bool{⍺\⍺/⍵}¨r ⍝ partition fields by unmasked commas
 r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r ⍝ remove leading and trailing "
 r←Csv2Numeric r ⍝ Convert numeric cells
 r←(~'""'∘⍷¨r)/¨r ⍝ double-" into a single one
 →0

⍝ test← ,⊂'-12,"€ 1,23","1,23 €",1234 abc,"£1,234"'
⍝ (,¨1 5⍴¯12 '€ 1,23' 1.23 1234 '£1,234')≡ Csv2MatrixWithDyalog test
Line 46: Line 98:
== Dyalog Version == The original version above changes "1,23" into 1.23. It also converts "2 Bridge Street" into 2.
I want to convert "£1,235" into 1235 and to leave addresses unchanged, so are are two alternatives :

The original:
Line 48: Line 103:
 r←Csv2MatrixWithDyalog csv;crlf;bool;⎕IO
⍝ Convert a simple string "csv" that is assumed to come from a *.csv
⍝ file into an APL matrix. Takes care of escaped stuff.
 ⎕IO←1 ⋄ ⎕ML←3
 crlf←⎕AV[4 3] ⍝ that's what CR LF is in Dyalog APL
 r←{1↓¨⍵⊂⍨~crlf⍷⍵}crlf,csv ⍝ partion records
 r/⍨←⌽∨\0≠⌽↑∘⍴¨r ⍝ remove empty stuff from the end
 bool←{~{⍵∨≠\⍵}'"'=⍵}¨r ⍝ prepare booleans useful to mask escaped stuff
 r←⊃r{⍺⊂⍨⍵≠','}¨bool{⍺\⍺/⍵}¨r ⍝ partition fields by unmasked commas
 r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r ⍝ remove leading and trailing "
 r←{↑1⊃v←⎕VFI ⍵:↑2⊃v ⋄ ⍵}¨r ⍝ make fields whith appropriate content numeric scalars
r←Csv2Numeric r;buffer
⍝ Kai's version from APLWIKI page CsvToApl
 buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r ⍝ "buffer" is a copy of r with "¯" for "-"
 buffer←{0=+/bool←','=w←⍵:⍵ ⋄ (bool/w)←'.' ⋄ w}¨buffer ⍝ "," gets "."
 r←buffer{↑1⊃v←⎕VFI ⍺:↑2⊃v ⋄ ⍵}¨r ⍝ make fields with appropriate content numeric scalars
 →0

⍝ test← '-12' '€ 1,23' '1,23 €' '1234 abc' '£1,234'
⍝ ¯12 '€ 1,23' 1.23 1234 '£1,234'≡ Csv2Numeric test
⍝ € the euro symbol
Line 61: Line 115:
The resulting vaiable in APL would look like this: and an alternative:
{{{
 r←Csv2NumericEllis r;buffer
⍝ Ellis's version
 buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r ⍝ "buffer" is a copy of r with "¯" for "-"
 buffer←{('∘',⍵)[1+(' '≠⍵)×⍳⍴⍵]}¨buffer~¨⊂'£,' ⍝ blank to jot, remove £ and comma
 r←buffer{↑1⊃v←⎕VFI ⍺:↑2⊃v ⋄ ⍵}¨r ⍝ make fields with appropriate content numeric scalars
 →0
Line 63: Line 124:
attachment:csvinapl.jpg ⍝ test← '-12' '€ 1,23' '1,23 €' '1234 abc' '£1,234'
⍝ ¯12 '€ 1,23' '1,23 €' '1234 abc' 1234≡ Csv2NumericEllis test
Line 65: Line 127:
}}}

I have another that only converts when a whole column is potentially numeric. It is so easy for
you to write your own version that I do not see any point in providing more alternatives
=== The Final Step ===
Put it all together:

{{{
 r←{sep} DealWithCsv filename;data
⍝ Read "filename" which is assumed to be a *.csv file
⍝ and convert it into a matrix
 sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep'
 data←FileRead filename
 data←PartitionRecordsWithDyalog data
 r←sep Csv2MatrixWithDyalog data
}}}
The resulting variable in APL would look like this:

{{attachment:csvinapl.jpg}}

Note that the 1 in the second row/second column got converted into the number because the contents of the cell remained of digits only. However, in the original Excel spreadsheet that cell is text; this is indicated by the small green triangle. This information is not contained in the CSV file.

== Writing a CSV file using Dyalog APL ==
Given an APL array like:

{{{
   ⎕←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3
 APL is fine, very fine
   1 2.2 ¯3
}}}
The following function takes such an array as right argument and converts it into a string that can be written to a file with the extension ".csv". The left argument defaults to "windows" and can be "unix" or "mac" as well. Note that the left argument is case sensitive. The left argument is used to determine the appropriate record separator.

{{{
 r←{os}Array2CsvWithDyalog array;cr;lf;sep;bool;IsChar;dq
 ⎕IO←1 ⋄ ⎕ML←3
 :If 0=⎕NC'os'
     os←'windows'
 :EndIf
 'Invalid left argument; must be one of: windows, unix, mac'⎕SIGNAL 11/⍨~(⊂os)∊'windows' 'unix' 'mac'
 'Right argument must have a depth of 2'⎕SIGNAL 11/⍨2≠≡array
 'Right argument must be either a matrix or a vector'⎕SIGNAL 11/⍨~(⍴⍴array)∊1 2
 (cr lf)←⎕TC[2 3] ⍝ <CarriageReturn> and <LineFeed>
 sep←('windows' 'unix' 'mac'⍳⊂os)⊃(cr,lf)lf cr ⍝ select proper record separator
 IsChar←{0 2∊⍨10|⎕DR ⍵} ⍝ Version 12 compatible
 bool←,~IsChar¨array ⍝ locate number
 (bool/,array)←⍕¨bool/,array ⍝ make numbers text
 dq←,'"'∊¨array ⍝ Where are double quotes in the text?
 (dq/,array)←{b←'"'=w←⍵ ⋄ (b/w)←⊂'""' ⋄ ⊃,/w}¨dq/,array ⍝ Double the double quotes
 bool←dq∨,(lf∊¨array)∨','∊¨array ⍝ where are special chars used?
 (bool/,array)←{'"',⍵,'"'}¨bool/,array ⍝ escape field with special chars
 bool←'"'∊¨array ⍝ where are special chars used?
 array←{⊃{⍺,',',⍵}/⍵}¨↓array ⍝ separate fields by comma
 r←⊃,/array,¨⊂sep ⍝ make it simpel
 ((r='¯')/r)←'-'
}}}
{{{
⍝ Example:
#.Array2CsvWithDyalog 2 3⍴'APL' 'really "really" is' 'fine, very fine' 1 2.2 ¯3
APL,"really ""really"" is","fine, very fine"
1,2.2,-3
}}}

CSV to APL

CSV stands for comma separated values. Those files are still used to transport tabular data between applications that are not directly connected. The files can be edited with any spreadsheet application like Microsoft Excel.

There are some things one need to know about CSV file in order to deal with them:

  • Fields are separated by commas. Well, mostly.
  • Records are separated with system end of line characters, CRLF (ASCII 13 Dec or 0D Hex and ASCII 10 Dec or 0A Hex respectively) for Windows, LF for Unix, and CR for Mac
  • If a field contains either a comma or a double quote or one of the end line chars, either the char(s) or the whole contents needs to be escaped. Excel escapes these values by embedding the field inside a set of double quotes. For example, a single cell with the text apples, carrots, and oranges becomes "apples, carrots, and oranges".

  • Strictly speaking, the delimiter is not defined in the specs. Some banks offer downloads where a semicolon is used instead of a comma. You might wonder why the name of this format is Comma Separated Values, but anyway. Several versions of Excel do not recognize a semicolon as a separator.

For details and background information see:

Note that the format comes with a nasty built-in-problem: there is no way to recognize a cell as being numeric. Converting cells which only contain a proper number does not help because if you enter a digit with a leading quote, Excel handles this as text but again this cannot be recognized as text in the csv file. The only solution is therefore to make an informed guess. This informed guess can vary from file to file and person to person, so please look at the functions Csv2MatrixWithDyalog and Csv2Numeric below to see if you want to change them before you start using this.

Reading and writing CSV files using APLX

Reading and writing CSV files in APLX is straightforward; You can just use the ⎕IMPORT and ⎕EXPORT system commands, specifying CSV as the format to use:

⍝ Reading a CSV file...
myVariable←⎕IMPORT 'C:\Users\simon\Desktop\spreadsheet_data.csv' 'csv'

⍝ Writing a CSV file
some_data←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3
some_data ⎕EXPORT 'C:\Users\simon\Desktop\new_data.csv' 'csv'

For another example of reading a CSV file of Google finance data and charting it, see here

Reading a CSV file using Dyalog APL

Given an Excel spreadsheet that looks like this:

Saving this into a csv file, the file can be read into APL. The variable would look like this:

csvapl.jpg

To convert this into an APL matrix is a two-step-process:

  • partition the simple string from file
  • extract the data and build up the APL matrix

First step: partition the string being read from file

With the following two functions this variable can be transformed into an APL array where every item represents a record. Data masked by " remain unchanged. The functions can deal with files from Unix, Mac and Windows.

r←PartitionRecordsWithDyalog string;masked;cr;lf;bool;
⍝ Takes a string and partitions records.
⍝ Can deal with Mac/Unix/Windows files.
⍝ For that, CR+LF as well as single LFs are converted into CR.
⍝ CR is then used to partition "string".
 ⎕IO←1 ⋄ ⎕ML←3
 (cr lf)←⎕TC[2 3]                           ⍝ <CarriageReturn> and <LineFeed>
 :If 0<+/bool←(cr,lf)⍷string                ⍝ are there any cr+lf in "string"?
     string←(~bool)/string                  ⍝ Let only the cr survive
 :EndIf
 :If 0<+/bool←cr=string                     ⍝ Are there still any cr's?
     (bool/string)←lf                       ⍝ Convert them to lf
 :EndIf
⍝ In the remaining string, there might be lf's inside text, Those
⍝ need to be masked before we decide where recors really starts.
 masked←~{⍵∨≠\⍵}'"'=string                  ⍝ what is not escaped (between "")
 :If 1∊bool←lf=masked/string                ⍝ are there any unmasked lf in "string"?
     r←(~masked\bool)⊂string
 :Else ⍝ so it's a single record
     r←⊂string
 :EndIf

Second step: extract the real data

  r←{sep}Csv2MatrixWithDyalog csv;bool;⎕IO
⍝ Convert vector-of-text-vectors "csv" that is assumed to
⍝ come from  a *.csv file and which got already partitioned
⍝ into an APL matrix. Takes care of escaped stuff.
⍝ "sep" defaults to a comma but that can be changed by specifying a left argument.
 ⎕IO←1 ⋄ ⎕ML←3
 sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep'
⍝ r←(⌽∨\0≠⌽↑∘⍴¨csv)/csv         ⍝ remove empty stuff from the end if any
 r←sep,¨(⌽∨\0≠⌽↑∘⍴¨csv)/csv     ⍝ remove empty stuff from the end if any
 bool←{~{⍵∨≠\⍵}'"'=⍵}¨r         ⍝ prepare booleans useful to mask escaped stuff
 r←⊃r{⍺⊂⍨⍵≠sep}¨bool{⍺\⍺/⍵}¨r   ⍝ partition fields by unmasked commas
 r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r       ⍝ remove leading and trailing "
 r←Csv2Numeric r                ⍝ Convert numeric cells
 r←(~'""'∘⍷¨r)/¨r               ⍝ double-" into a single one
 →0

⍝ test← ,⊂'-12,"€ 1,23","1,23 €",1234 abc,"£1,234"'
⍝ (,¨1 5⍴¯12 '€ 1,23' 1.23 1234 '£1,234')≡ Csv2MatrixWithDyalog test

The original version above changes "1,23" into 1.23. It also converts "2 Bridge Street" into 2. I want to convert "£1,235" into 1235 and to leave addresses unchanged, so are are two alternatives :

The original:

r←Csv2Numeric r;buffer
⍝ Kai's version from APLWIKI page CsvToApl
 buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r  ⍝ "buffer" is a copy of r with "¯" for "-"
 buffer←{0=+/bool←','=w←⍵:⍵ ⋄ (bool/w)←'.' ⋄ w}¨buffer  ⍝ "," gets "."
 r←buffer{↑1⊃v←⎕VFI ⍺:↑2⊃v ⋄ ⍵}¨r ⍝ make fields with appropriate content numeric scalars
 →0

⍝ test← '-12'  '€ 1,23'  '1,23 €'  '1234 abc'  '£1,234'
⍝ ¯12 '€ 1,23' 1.23 1234 '£1,234'≡ Csv2Numeric test
⍝ € the euro symbol

and an alternative:

 r←Csv2NumericEllis r;buffer
⍝ Ellis's version
 buffer←{0=+/bool←'-'=w←⍵:⍵ ⋄ (bool/w)←'¯' ⋄ w}¨r  ⍝ "buffer" is a copy of r with "¯" for "-"
 buffer←{('∘',⍵)[1+(' '≠⍵)×⍳⍴⍵]}¨buffer~¨⊂'£,'  ⍝ blank to jot, remove £ and comma
 r←buffer{↑1⊃v←⎕VFI ⍺:↑2⊃v ⋄ ⍵}¨r ⍝ make fields with appropriate content numeric scalars
 →0

⍝ test← '-12'  '€ 1,23'  '1,23 €'  '1234 abc'  '£1,234'
⍝ ¯12  '€ 1,23'  '1,23 €'  '1234 abc'  1234≡ Csv2NumericEllis test

I have another that only converts when a whole column is potentially numeric. It is so easy for you to write your own version that I do not see any point in providing more alternatives

The Final Step

Put it all together:

 r←{sep} DealWithCsv filename;data
⍝ Read "filename" which is assumed to be a *.csv file
⍝ and convert it into a matrix
 sep←{2=⎕NC ⍵:⍎⍵ ⋄ ','}'sep'
 data←FileRead filename
 data←PartitionRecordsWithDyalog data
 r←sep Csv2MatrixWithDyalog data

The resulting variable in APL would look like this:

csvinapl.jpg

Note that the 1 in the second row/second column got converted into the number because the contents of the cell remained of digits only. However, in the original Excel spreadsheet that cell is text; this is indicated by the small green triangle. This information is not contained in the CSV file.

Writing a CSV file using Dyalog APL

Given an APL array like:

   ⎕←2 3⍴'APL' 'is' 'fine, very fine' 1 2.2 ¯3
 APL   is  fine, very fine
   1  2.2    ¯3

The following function takes such an array as right argument and converts it into a string that can be written to a file with the extension ".csv". The left argument defaults to "windows" and can be "unix" or "mac" as well. Note that the left argument is case sensitive. The left argument is used to determine the appropriate record separator.

 r←{os}Array2CsvWithDyalog array;cr;lf;sep;bool;IsChar;dq
 ⎕IO←1 ⋄ ⎕ML←3
 :If 0=⎕NC'os'
     os←'windows'
 :EndIf
 'Invalid left argument; must be one of: windows, unix, mac'⎕SIGNAL 11/⍨~(⊂os)∊'windows' 'unix' 'mac'
 'Right argument must have a depth of 2'⎕SIGNAL 11/⍨2≠≡array
 'Right argument must be either a matrix or a vector'⎕SIGNAL 11/⍨~(⍴⍴array)∊1 2
 (cr lf)←⎕TC[2 3]                                ⍝ <CarriageReturn> and <LineFeed>
 sep←('windows' 'unix' 'mac'⍳⊂os)⊃(cr,lf)lf cr   ⍝ select proper record separator
 IsChar←{0 2∊⍨10|⎕DR ⍵}                          ⍝ Version 12 compatible
 bool←,~IsChar¨array                             ⍝ locate number
 (bool/,array)←⍕¨bool/,array                     ⍝ make numbers text
 dq←,'"'∊¨array                                   ⍝ Where are double quotes in the text?
 (dq/,array)←{b←'"'=w←⍵ ⋄ (b/w)←⊂'""' ⋄ ⊃,/w}¨dq/,array ⍝ Double the double quotes
 bool←dq∨,(lf∊¨array)∨','∊¨array                 ⍝ where are special chars used?
 (bool/,array)←{'"',⍵,'"'}¨bool/,array           ⍝ escape field with special chars
 bool←'"'∊¨array                                 ⍝ where are special chars used?
 array←{⊃{⍺,',',⍵}/⍵}¨↓array                     ⍝ separate fields by comma
 r←⊃,/array,¨⊂sep                                ⍝ make it simpel
 ((r='¯')/r)←'-'

⍝ Example:
#.Array2CsvWithDyalog 2 3⍴'APL' 'really "really" is' 'fine, very fine' 1 2.2 ¯3
APL,"really ""really"" is","fine, very fine"
1,2.2,-3

Author: KaiJaeger

CsvToApl (last edited 2017-02-16 19:12:54 by KaiJaeger)