Module DSV

DSV.py - Cliff Wells, 2002
  Import/export DSV (delimiter separated values, a generalization of CSV).

$Id: DSV.py 3878 2007-01-09 22:28:37Z djpham $
Modified by Joe Pham <djpham@bitpim.org> to accommodate wxPython 2.8+

Basic use:

   from DSV import DSV

   data = open(filename).read() # raw file contents as a single string
   qualifier = DSV.guessTextQualifier(data) # optional
   data = DSV.organizeIntoLines(data, textQualifier = qualifier)
   delimiter = DSV.guessDelimiter(data) # optional
   data = DSV.importDSV(data, delimiter = delimiter, textQualifier = qualifier)
   hasHeader = DSV.guessHeaders(data) # optional

If you already know the delimiter and text qualifier, you may skip the
optional 'guessing' steps: they rely on heuristics and, although they seem
to work well, there is no guarantee they are correct. They are best used
to make a good guess regarding the data structure and then let the user
confirm it.

As such there is a 'wizard' to aid in this process (use this in lieu of
the above code - requires wxPython):

   from DSV import DSV

   dlg = DSV.ImportWizardDialog(parent, -1, 'DSV Import Wizard', filename)
   dlg.ShowModal()
   results = dlg.ImportData() # may return None
   if results is not None:
       headers, data = results
   dlg.Destroy()

The dlg.ImportData() method may also take a function as an optional argument
specifying what it should do about malformed rows.  See the example at the bottom
of this file. A few common functions are provided in this file (padRow, skipRow,
useRow).
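
A custom handler follows the same four-argument prototype as padRow, skipRow
and useRow. The sketch below is written for Python 3 and assumes that a
handler returns the row to keep, or None to drop it; this page does not state
the return convention, so check the module source before relying on it.
fit_row and drop_row are hypothetical names, not part of DSV.

```python
# Hypothetical error handlers; the return convention (row or None) is an
# assumption, not something this documentation confirms.

def fit_row(oldrow, newrow, columns, maxColumns):
    """Force the parsed row to exactly `columns` fields."""
    # Pad short rows with empty strings, then cut off any surplus fields.
    padded = newrow + [''] * max(0, columns - len(newrow))
    return padded[:columns]


def drop_row(oldrow, newrow, columns, maxColumns):
    """Discard any row with an unexpected number of columns."""
    return None
```

A handler like fit_row trades data loss (truncated fields) for a strictly
rectangular result, which may or may not suit the data being imported.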

Requires Python 2.0 or later
Wizards tested with wxPython 2.2.5/NT 4.0, 2.3.2/Win2000 and Linux/GTK (RedHat 7.x)


Version: 1.4

Classes
  InvalidDelimiter
  InvalidTextQualifier
  InvalidData
  InvalidNumberOfColumns
  ImportWizardPanel_Delimiters(wx.Panel)
      A wx.Panel that provides a basic interface for validating and
      changing the parameters for importing a delimited text file.
  ImportWizardDialog(wx.Dialog)
      A dialog allowing the user to preview and change the options for
      importing a file.

Functions
  guessTextQualifier(input)
      Tries to guess if the text qualifier (a character delimiting
      ambiguous data) is a single or double quote (or None).
  guessDelimiter(input, textQualifier='"')
      Tries to guess the delimiter.
  modeOfLengths(input)
      Finds the mode (most frequently occurring value) of the lengths of
      the lines.
  guessHeaders(input, columns=0)
      Decides whether row 0 is a header row.
  organizeIntoLines(input, textQualifier='"', limit=None)
      Takes raw data (as from file.read()) and organizes it into lines.
  padRow(oldrow, newrow, columns, maxColumns)
      Pads all rows to the same length with empty strings.
  skipRow(oldrow, newrow, columns, maxColumns)
      Skips any inconsistent rows.
  useRow(oldrow, newrow, columns, maxColumns)
      Returns row unchanged.
  importDSV(input, delimiter=',', textQualifier='"', columns=0, updateFunction=None, errorHandler=None)
      Parses lines of data in CSV format.
  exportDSV(input, delimiter=',', textQualifier='"', quoteall=0)
      Exports to DSV (delimiter-separated values) format.

Variables
  __version__ = '1.4'

Function Details

guessTextQualifier(input)

PROTOTYPE:
  guessTextQualifier(input)
DESCRIPTION:
  tries to guess if the text qualifier (a character delimiting ambiguous data)
  is a single or double-quote (or None)
ARGUMENTS:
  - input is raw data as a string
RETURNS:
  single character or None
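
One way such a guess could work is sketched below in Python 3. This is an
illustrative heuristic only, not the algorithm DSV.py uses:
sketch_guess_qualifier is a hypothetical name, and the rule (the most
frequent quote character that occurs in pairs) is an assumption.

```python
def sketch_guess_qualifier(data, candidates=('"', "'")):
    """Return the most frequent paired quote character, or None."""
    best, best_count = None, 0
    for ch in candidates:
        count = data.count(ch)
        # A qualifier wraps fields, so it should appear an even number of
        # times; a lone apostrophe such as in "it's" is ignored.
        if count >= 2 and count % 2 == 0 and count > best_count:
            best, best_count = ch, count
    return best
```

Like any heuristic of this kind, it can be fooled (for example by text with
two apostrophes), which is why the result is best confirmed by the user.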

guessDelimiter(input, textQualifier='"')

PROTOTYPE:
  guessDelimiter(input, textQualifier = '"')
DESCRIPTION:
  Tries to guess the delimiter.
ARGUMENTS:
  - input is raw data as string
  - textQualifier is a character used to delimit ambiguous data
RETURNS:
  single character or None
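
A minimal Python 3 sketch of a delimiter heuristic follows. It is not the
module's implementation: the function names, the candidate list, and the
rule (pick the candidate that occurs the same non-zero number of times on
every line, outside qualified text) are all assumptions made for
illustration.

```python
def count_outside_quotes(line, ch, qualifier):
    """Count occurrences of ch in line that are not inside qualified text."""
    inside, total = False, 0
    for c in line:
        if qualifier and c == qualifier:
            inside = not inside
        elif c == ch and not inside:
            total += 1
    return total


def sketch_guess_delimiter(data, text_qualifier='"', candidates=',;\t|:'):
    """Return the candidate with a consistent per-line count, or None."""
    # Simplification: splitlines() also breaks on newlines inside qualifiers.
    lines = [line for line in data.splitlines() if line.strip()]
    if not lines:
        return None
    best, best_count = None, 0
    for ch in candidates:
        counts = [count_outside_quotes(line, ch, text_qualifier)
                  for line in lines]
        consistent = all(c == counts[0] for c in counts)
        if counts[0] > 0 and consistent and counts[0] > best_count:
            best, best_count = ch, counts[0]
    return best
```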

modeOfLengths(input)

PROTOTYPE:
  modeOfLengths(input)
DESCRIPTION:
  Finds the mode (most frequently occurring value) of the lengths of the lines.
ARGUMENTS:
  - input is list of lists of data
RETURNS:
  mode as integer
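
The same computation can be sketched in Python 3 with collections.Counter.
sketch_mode_of_lengths is a hypothetical name, and how the real function
breaks ties or handles empty input is not documented here; this version
returns the first-seen length on a tie and raises ValueError on empty input.

```python
from collections import Counter


def sketch_mode_of_lengths(rows):
    """Return the most common row length in a list of lists."""
    if not rows:
        raise ValueError("no rows to measure")
    lengths = Counter(len(row) for row in rows)
    # most_common(1) yields a single (length, frequency) pair.
    return lengths.most_common(1)[0][0]
```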

guessHeaders(input, columns=0)

PROTOTYPE:
  guessHeaders(input, columns = 0)
DESCRIPTION:
  Decides whether row 0 is a header row
ARGUMENTS:
  - input is a list of lists of data (as returned by importDSV)
  - columns is either the expected number of columns in each row or 0
RETURNS:
  - true if data has header row
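
Header detection is inherently heuristic. The Python 3 sketch below shows
one plausible rule, not the one DSV.py implements: row 0 is treated as a
header when none of its cells look numeric while at least one column below
it is entirely numeric. Both function names are hypothetical.

```python
def looks_numeric(cell):
    """True if the string cell parses as a number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def sketch_guess_headers(rows):
    """Guess whether rows[0] is a header row (list of lists of strings)."""
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    if any(looks_numeric(cell) for cell in header):
        return False
    for index in range(len(header)):
        # Collect this column from every body row long enough to have it.
        column = [row[index] for row in body if index < len(row)]
        if column and all(looks_numeric(cell) for cell in column):
            return True
    return False
```

Data made up purely of text columns defeats this rule, which illustrates why
the result should be offered to the user as a suggestion.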

organizeIntoLines(input, textQualifier='"', limit=None)

PROTOTYPE:
  organizeIntoLines(input, textQualifier = '"', limit = None)
DESCRIPTION:
  Takes raw data (as from file.read()) and organizes it into lines.
  Newlines that occur within text qualifiers are treated as normal
  characters, not line delimiters.
ARGUMENTS:
  - input is raw data as a string
  - textQualifier is a character used to delimit ambiguous data
  - limit is an integer specifying the maximum number of lines to organize
RETURNS:
  list of strings
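
The qualifier-aware splitting described above can be sketched in Python 3 as
a small state machine. This is an illustration under stated assumptions, not
the module's code: sketch_organize_into_lines is a hypothetical name, and
normalizing CR/LF and CR to LF before splitting is a choice made here.

```python
def sketch_organize_into_lines(data, text_qualifier='"', limit=None):
    """Split raw text into lines, keeping newlines inside qualified text."""
    if limit is not None and limit <= 0:
        return []
    # Assumption: treat CR/LF and bare CR as LF everywhere.
    data = data.replace('\r\n', '\n').replace('\r', '\n')
    lines, current, inside = [], [], False
    for ch in data:
        if text_qualifier and ch == text_qualifier:
            inside = not inside          # entering or leaving qualified text
            current.append(ch)
        elif ch == '\n' and not inside:
            lines.append(''.join(current))
            current = []
            if limit is not None and len(lines) >= limit:
                return lines             # stop once `limit` lines are collected
        else:
            current.append(ch)
    if current:
        lines.append(''.join(current))   # final line without a trailing newline
    return lines
```

A doubled qualifier ("") toggles the state twice, so it leaves the
inside/outside tracking unchanged.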

importDSV(input, delimiter=',', textQualifier='"', columns=0, updateFunction=None, errorHandler=None)

PROTOTYPE:
  importDSV(input, delimiter = ',', textQualifier = '"', columns = 0,
            updateFunction = None, errorHandler = None)
DESCRIPTION:
  parses lines of data in CSV format
ARGUMENTS:
  - input is a list of strings (built by organizeIntoLines)
  - delimiter is the character used to delimit columns
  - textQualifier is the character used to delimit ambiguous data
  - columns is the expected number of columns in each row or 0
  - updateFunction is a callback function called once per record (could be
    used for updating progress bars). Its prototype is
       updateFunction(percentDone)
       - percentDone is an integer between 0 and 100
  - errorHandler is a callback invoked whenever a row has an unexpected number
    of columns. Its prototype is
       errorHandler(oldrow, newrow, columns, maxColumns)
          where
          - oldrow is the unparsed data
          - newrow is the parsed data
          - columns is the expected length of a row
          - maxColumns is the longest row in the data
RETURNS:
  list of lists of data
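
To make the two callbacks concrete, here is a simplified Python 3 sketch of
an import loop that calls them with the prototypes given above. It is not
the module's parser. The names split_line and sketch_import are
hypothetical, and three behaviours are assumptions: columns = 0 falls back
to the longest row, the handler's return value replaces the row (None drops
it), and qualifier characters are stripped from the parsed fields.

```python
def split_line(line, delimiter=',', text_qualifier='"'):
    """Split one line, ignoring delimiters inside qualified text."""
    fields, current, inside = [], [], False
    for ch in line:
        if text_qualifier and ch == text_qualifier:
            inside = not inside              # qualifier itself is dropped
        elif ch == delimiter and not inside:
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))
    return fields


def sketch_import(lines, delimiter=',', text_qualifier='"', columns=0,
                  update_function=None, error_handler=None):
    parsed = [split_line(line, delimiter, text_qualifier) for line in lines]
    max_columns = max((len(row) for row in parsed), default=0)
    expected = columns or max_columns        # assumption: 0 means "longest row"
    result = []
    for index, (old, new) in enumerate(zip(lines, parsed), start=1):
        if len(new) != expected and error_handler is not None:
            new = error_handler(old, new, expected, max_columns)
        if new is not None:                  # assumption: None drops the row
            result.append(new)
        if update_function is not None:
            update_function(index * 100 // len(lines))  # integer 0..100
    return result


# Usage: record progress and pad short rows with empty strings.
progress = []
rows = sketch_import(
    ['a,b,c', '1,"2,5"', 'x,y,z'],
    update_function=progress.append,
    error_handler=lambda old, new, cols, longest: new + [''] * (cols - len(new)),
)
```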

exportDSV(input, delimiter=',', textQualifier='"', quoteall=0)

PROTOTYPE:
  exportDSV(input, delimiter = ',', textQualifier = '"', quoteall = 0)
DESCRIPTION:
  Exports to DSV (delimiter-separated values) format.
ARGUMENTS:
  - input is list of lists of data (as returned by importDSV)
  - delimiter is character used to delimit columns
  - textQualifier is character used to delimit ambiguous data
  - quoteall is boolean specifying whether to quote all data or only data
    that requires it
RETURNS:
  data as string
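
For comparison, the effect of quoteall can be reproduced with the Python 3
standard library's csv module, as sketched below. This uses csv.writer
rather than DSV itself; sketch_export is a hypothetical name, and the '\n'
line terminator is an assumption, since the line endings exportDSV emits are
not documented here.

```python
import csv
import io


def sketch_export(rows, delimiter=',', text_qualifier='"', quoteall=False):
    """Serialize a list of lists to delimiter-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar=text_qualifier,
        # QUOTE_ALL quotes every field; QUOTE_MINIMAL only those that need it.
        quoting=csv.QUOTE_ALL if quoteall else csv.QUOTE_MINIMAL,
        lineterminator='\n',
    )
    writer.writerows(rows)
    return buffer.getvalue()
```

With the default settings, only fields containing the delimiter, the
qualifier, or a line break are quoted.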


Variables Details

__version__

Bugs/Caveats:

  • Although I've tested this stuff on varied data, I'm sure there are cases that I haven't seen that will choke any one of these routines (or at least return invalid data). This is beta code!
  • guessTextQualifier() algorithm is limited to quotes (double or single).
  • Surprising feature: hitting <enter> on a wxSpinCtrl causes a seg fault under Linux/GTK (not Win32). Strangely, pressing <tab> seems OK. Therefore, I had to use wxSpinButton. Also, spurious spin events get generated for both of these controls (e.g. when calling wxBeginBusyCursor).
  • Keyboard navigation needs to be implemented on wizards
  • There may be issues with cr/lf translation, although I haven't yet seen any.

Why another CSV tool?:

  • Because I needed a more flexible CSV importer: one that could accept different delimiters (not just commas or tabs), make an intelligent guess regarding file structure (for user convenience), be compatible with the files output by MS Excel, and be easily integrated with a wizard. All of the modules I had seen prior to this fell short on one count or another.
  • It seemed interesting.

To do:

  • Better guessTextQualifier() algorithm. In the perfect world I envision, I can use any character as a text qualifier, not just quotes.
  • Finish wizards and move them into separate module.
  • Better guessHeaders() algorithm, although this is difficult.
  • Optimize map() calls - try to eliminate lambda when possible.
  • Optimize memory usage. Presently the entire file is loaded and then saved as a list. A better approach might be to analyze a smaller part of the file and then return an iterator to step through it.

Value:
  '1.4'