CSV to APL

CSV stands for comma-separated values. CSV files are still used to transport tabular data between applications that are not directly connected. They can be edited with any spreadsheet application, such as Microsoft Excel.

There are a few things one needs to know about CSV files in order to deal with them:

  • Fields are separated by commas
  • Records are separated with system end of line characters, CRLF (ASCII 13 Dec or 0D Hex and ASCII 10 Dec or 0A Hex respectively) for Windows, LF for Unix, and CR for Mac
  • If a field contains a comma or an end-of-line character, either the character(s) or the whole field needs to be escaped. Excel escapes such values by enclosing the field in double quotes. For example, a single cell with the text apples, carrots, and oranges becomes "apples, carrots, and oranges"
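
The quoting rule can be cross-checked outside APL. The following is a minimal sketch in Python (not part of the APL code on this page), assuming that Python's standard csv module with its QUOTE_MINIMAL setting behaves like Excel for this simple case; the helper name to_csv_line is made up for the illustration.

```python
import csv
import io

def to_csv_line(fields):
    """Serialise one record; to_csv_line is a hypothetical helper name."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes only the fields that need it, for example
    # fields containing a comma, a double quote or a line break.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerow(fields)
    return buffer.getvalue()

print(repr(to_csv_line(["fruit", "apples, carrots, and oranges"])))
# 'fruit,"apples, carrots, and oranges"\r\n'
```

Only the second field is wrapped in double quotes, because only that field contains a comma.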

For details and background information see http://www.csvreader.com/csv_format.php

Note that the format comes with a nasty built-in problem: there is no way to tell from the file whether a cell is numeric. Converting every cell that contains nothing but a proper number does not solve this: if you enter a number with a leading single quote, Excel treats the cell as text, but this cannot be recognized in the CSV file, where the cell looks exactly like a numeric one. The only solution is therefore to make an informed guess.
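
The loss of type information is easy to reproduce. The sketch below uses Python's standard csv module instead of Excel, which is an assumption made for the sake of a runnable illustration; round_trip is a hypothetical helper name.

```python
import csv
import io

def round_trip(rows):
    """Write rows as CSV text and read them back (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return list(csv.reader(io.StringIO(buffer.getvalue())))

# The first cell holds the number 1, the second the text "1".
print(round_trip([[1, "1"]]))
# [['1', '1']]
```

Both cells come back as the same string, so whoever reads the file has to guess which one was meant to be a number.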

Reading a CSV file

Given an Excel spreadsheet that looks like this:

attachment:cvsexcel3.jpg

Once the spreadsheet has been saved as a CSV file, the file can be read into APL. The resulting variable looks like this:

attachment:csvapl.jpg

Converting this into an APL matrix is a two-step process:

  • partition the simple string read from the file into records
  • extract the data and build up the APL matrix

Step 1: Partition The String Being Read From File

Either of the following two functions (one for APL2, one for Dyalog) transforms this variable into an APL vector in which every item represents a record. Data masked by double quotes (") remains unchanged. The functions can deal with files from Unix, Mac and Windows.

APL2 Version

r←{ignoreBetween}PartitionRecordsWithAPL2 string;masked;cr;lf;bool
⍝ Takes a string and partitions records.
⍝ Can deal with Mac/Unix/Windows files.
⍝ For that, CR+LF as well as single LFs are converted into CR.
⍝ CR is then used to partition "string".
⍝ Note that everything between "ignoreBetween" is ignored.
⍝ This can be used to mask stuff between "" (CSV files), for example.
 ⎕IO←1 
 (cr lf)←⎕TC[2 3]                       ⍝ <CarriageReturn> and <LineFeed>
 →L01×⍳2=⎕NC'ignoreBetween'
 ignoreBetween←''                       ⍝ establish default
L01:
 →L02×⍳masked←0∊⍴ignoreBetween
 masked←~masked∨≠\masked←'"'=string     ⍝ what is not escaped (between "")
L02:
 →L03×⍳~0∊bool←~(cr,lf)⍷masked/string   ⍝ are there any unmasked cr/lf in "string"?
 bool←(~masked)∨masked\bool             ⍝ "insert" the masked
 string[1+(~bool)/⍳⍴bool]←cr            ⍝ convert lf into cr
 string←bool/string                     ⍝ remove original cr
 masked←bool/masked
 →L04
L03:→L04×⍳~1∊bool←masked∧lf=string      ⍝ Are there any unmasked lf in "string"? If not, skip
 (bool/string)←cr                       ⍝ change them to cr
L04:r←((~masked)∨cr≠string)⊂string      ⍝ use unmasked cr for partitioning; escaped data is kept

Dyalog Version

r←{ignoreBetween}PartitionRecordsWithDyalog string;masked;cr;lf;bool
⍝ Takes a string and partitions records.
⍝ Can deal with Mac/Unix/Windows files.
⍝ For that, CR+LF as well as single LFs are converted into CR.
⍝ CR is then used to partition "string".
⍝ Note that everything between "ignoreBetween" is ignored.
⍝ This can be used to mask stuff between "" (CSV files), for example.
 ⎕IO←1 ⋄ ⎕ML←3
 (cr lf)←⎕TC[2 3]                         ⍝ <CarriageReturn> and <LineFeed>
 :If 0=⎕NC'ignoreBetween'
     ignoreBetween←''                     ⍝ establish default
 :EndIf
 :If ~masked←0∊⍴ignoreBetween
     masked←~{⍵∨≠\⍵}'"'=string            ⍝ what is not escaped (between "")
 :EndIf
 :If 0∊bool←~(cr,lf)⍷masked/string        ⍝ are there any unmasked cr/lf in "string"?
     bool←(~masked)∨masked\bool           ⍝ "insert" the masked
     string[1+{⍵/⍳⍴⍵}~bool]←cr            ⍝ convert lf into cr
     string←bool/string                   ⍝ remove original cr
     masked←bool/masked
 :ElseIf 1∊bool←masked∧lf=string          ⍝ Are there any unmasked lf in "string"?
     (bool/string)←cr                     ⍝ change them to cr
 :EndIf
 r←((~masked)∨cr≠string)⊂string           ⍝ use unmasked cr for partitioning; escaped data is kept
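
For readers who want to cross-check the result in another language, the same idea can be sketched in Python. This is an illustration under stated assumptions, not a translation of the APL functions: partition_records is a hypothetical name, the text is walked character by character instead of using boolean masks, and empty records are dropped.

```python
def partition_records(text, quote='"'):
    """Split text at CRLF, LF or CR, ignoring line breaks inside quotes."""
    records, current, in_quotes = [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if quote and ch == quote:
            in_quotes = not in_quotes          # flips at every quote, like the ≠\ scan
            current.append(ch)
        elif ch in "\r\n" and not in_quotes:
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1                         # treat CR+LF as a single separator
            if current:                        # skip empty records
                records.append("".join(current))
            current = []
        else:
            current.append(ch)                 # includes line breaks inside quotes
        i += 1
    if current:
        records.append("".join(current))
    return records

print(partition_records('a,b\r\nc,"x\r\ny"\r\n'))
# ['a,b', 'c,"x\r\ny"']
```

The line break inside the quoted field stays part of the second record, while the unquoted CR+LF pairs separate the records.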

Step 2: Extract The Real Data

APL2 Version

r←Csv2MatrixWithAPL2 csv;buffer;⎕IO;isNotEmpty;mask;bool
⍝ Convert the vector of text vectors "csv", which is assumed to
⍝ come from a *.csv file and to be partitioned into records already,
⍝ into an APL matrix. Takes care of escaped stuff.
 ⎕IO←1
 r←(⌽∨\0≠⌽↑¨⍴¨csv)/csv           ⍝ remove empty stuff from the end if any
 mask←~mask∨¨≠\¨mask←'"'=¨r      ⍝ what is not escaped (between "")
 r←⊃(','≠¨mask\¨mask/¨r)⊂¨r      ⍝ partition fields by commas
 r←('"'=¨↑¨r)↓¨r                 ⍝ remove leading "
 r←(-'"'=¨↑¨¯1↑¨r)↓¨r            ⍝ remove trailing "
 isNotEmpty←0<↑¨⍴¨r              ⍝ remember empty fields
 bool←,isNotEmpty∧∧/¨r∊¨⊂'0123456789.' ⍝ fields which contain only ...
 (bool/,r)←⍎¨bool/,r             ⍝ Make those numeric

Dyalog Version

r←Csv2MatrixWithDyalog csv;bool;⎕IO
⍝ Convert the vector of text vectors "csv", which is assumed to
⍝ come from a *.csv file and to be partitioned into records already,
⍝ into an APL matrix. Takes care of escaped stuff.
 ⎕IO←1 ⋄ ⎕ML←3
 r←csv/⍨⌽∨\0≠⌽↑∘⍴¨csv           ⍝ remove empty stuff from the end if any
 bool←{~{⍵∨≠\⍵}'"'=⍵}¨r         ⍝ prepare booleans useful to mask escaped stuff
 r←⊃r{⍺⊂⍨⍵≠','}¨bool{⍺\⍺/⍵}¨r   ⍝ partition fields by unmasked commas
 r←{'"'≠1⍴⍵:⍵ ⋄ ¯1↓1↓⍵}¨r       ⍝ remove leading and trailing "
 r←{↑1⊃v←⎕VFI ⍵:↑2⊃v ⋄ ⍵}¨r     ⍝ make fields with appropriate content numeric scalars
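
The second step can also be sketched in Python for comparison. All names below are hypothetical and the rules are assumptions chosen for the illustration: empty fields are kept as empty strings, short rows are padded with empty strings, and a field is guessed to be numeric only if it consists of ASCII digits with at most one dot.

```python
def split_fields(record, quote='"'):
    """Split one record at commas that are not inside quotes."""
    fields, current, in_quotes = [], [], False
    for ch in record:
        if ch == quote:
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

def unquote(field, quote='"'):
    """Drop the enclosing quotes of a quoted field."""
    if len(field) >= 2 and field.startswith(quote) and field.endswith(quote):
        return field[1:-1]
    return field

def guess_number(field):
    """Informed guess: digits with at most one dot become a number."""
    digits = field.replace(".", "", 1)
    if digits.isascii() and digits.isdigit():
        return float(field) if "." in field else int(field)
    return field

def records_to_matrix(records):
    """Build a rectangular list of rows from a list of record strings."""
    rows = [[guess_number(unquote(f)) for f in split_fields(r)] for r in records]
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]  # pad short rows

print(records_to_matrix(['name,qty', '"apples, carrots, and oranges",3']))
# [['name', 'qty'], ['apples, carrots, and oranges', 3]]
```

As in the APL functions, the guess cannot tell a text cell that happens to hold digits from a genuinely numeric cell.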

The final step

Put it all together (here for the Dyalog version):

 r←DealWithCsv filename;data
⍝ Read "filename", which is assumed to be a *.csv file,
⍝ and convert it into a matrix
 data←FileRead filename
 data←'"'PartitionRecordsWithDyalog data
 r←Csv2MatrixWithDyalog data

The resulting variable in APL would look like this:

attachment:csvinapl.jpg

Note that the 1 in the second row, second column was converted into a number because the cell consists of digits only. In the original Excel spreadsheet, however, that cell is text, as indicated by the small green triangle. This information is not contained in the CSV file.

Writing a CSV file

Author: KaiJaeger

CsvToApl (last edited 2017-02-16 19:12:54 by KaiJaeger)