[Bioperl-l] SeqIO::table

Hilmar Lapp hlapp at gnf.org
Fri Apr 8 20:15:26 EDT 2005


I wrote two new SeqIO-compliant streams that return Bio::Seq objects 
from a table: one reads column-delimited ASCII text, the other reads a 
worksheet in an Excel file.

The table in either format is presumed to contain one sequence per line 
(or row). The parser lets you identify a few columns with implied 
semantic meaning (display_id, accession, species, sequence string). Any 
of the remaining columns can be selectively preserved in the annotation 
bundle.

The motivation was that several comprehensive gene family publications 
made their data available as manually curated spreadsheets. I needed 
these data as a SeqIO-compliant stream, and going through an 
intermediary FASTA file can mess up the annotation a lot.

If anybody else is interested in this, or thinks it could be of general 
interest, I'll commit it to bioperl.

I've enclosed the supported arguments for the SeqIO::table::new method; 
they should give an idea of what is configurable. The Excel parser 
supports the same arguments plus the name of the worksheet.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

Named parameters supported by the proposed Bio::SeqIO::table:

              -comment leading character(s) introducing a comment line
              -header  the number of header lines to skip; the first
                       non-comment header line will be used to obtain
                       column names; column names will be used as the
                       default tags for attaching annotation.
              -delim   the delimiter for columns as a regular expression;
                       consecutive occurrences of the delimiter will
                       not be collapsed.
              -display_id the one-based index of the column containing
                       the display ID of the sequence
              -accession_number the one-based index of the column
                       containing the accession number of the sequence
              -seq     the one-based index of the column containing
                       the sequence string
              -species the one-based index of the column containing the
                       species for the sequence record; if not a
                       number, will be used as the static species
                       common to all records
              -annotation controls which additional columns are preserved
                       as annotation. If a scalar, it is a flag: when
                       true, all additional columns are preserved,
                       tagged with their column headers, or with
                       'colX' (X being the one-based column index) if
                       there is no column header. If a reference to an
                       array, only the columns listed (one-based
                       indexes) are preserved, tagged as before. If a
                       reference to a hash, the keys are the one-based
                       indexes of the columns to preserve and the
                       values are the tags under which the annotation
                       is attached. If not provided or undef, no
                       additional annotation is preserved.
              -trim    flag determining whether or not all values should
                       be trimmed of leading and trailing white space
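
To make the parameter semantics above concrete, here is a minimal sketch 
of how a single table line could be turned into a record. It is written 
in plain Python only so that it is self-contained; it is not the BioPerl 
implementation and does not call any BioPerl API. The function names 
(parse_row, resolve_annotation_tags) and the returned dictionary layout 
are hypothetical. It also assumes that "additional columns" means the 
columns not already claimed by -display_id, -accession_number, -seq or a 
numeric -species.

```python
import re


def resolve_annotation_tags(annotation, n_cols, headers=None, used=()):
    """Return {one-based column index: tag} following the -annotation rules."""

    def default_tag(i):
        # Use the column header when there is one, otherwise 'colX'.
        if headers and i <= len(headers) and headers[i - 1]:
            return headers[i - 1]
        return f"col{i}"

    if annotation is None:                      # not provided / undef
        return {}
    if isinstance(annotation, dict):            # hash ref: index -> tag
        return dict(annotation)
    if isinstance(annotation, (list, tuple)):   # array ref: chosen columns
        return {i: default_tag(i) for i in annotation}
    if annotation:                              # true scalar flag: all others
        return {i: default_tag(i)
                for i in range(1, n_cols + 1) if i not in used}
    return {}                                   # false scalar flag


def parse_row(line, delim=r"\t", trim=False, display_id=None,
              accession_number=None, seq=None, species=None,
              annotation=None, headers=None):
    """Split one line and pick out the semantic and annotation columns."""
    # re.split keeps empty fields, so consecutive delimiters are not collapsed.
    values = re.split(delim, line.rstrip("\n"))
    if trim:
        values = [v.strip() for v in values]

    def col(i):
        # One-based lookup; None for a missing or unspecified column.
        if i is None or not 1 <= i <= len(values):
            return None
        return values[i - 1]

    record = {
        "display_id": col(display_id),
        "accession_number": col(accession_number),
        "seq": col(seq),
    }
    used = {i for i in (display_id, accession_number, seq) if i is not None}

    if isinstance(species, int):
        # A number is read as a column index.
        record["species"] = col(species)
        used.add(species)
    else:
        # Anything else is a static species shared by all records.
        record["species"] = species

    tags = resolve_annotation_tags(annotation, len(values), headers, used)
    record["annotation"] = {tag: col(i) for i, tag in tags.items()}
    return record
```

Called with annotation=True, every column not used for the ID, accession, 
sequence or species ends up in the annotation under its header name; 
called with a dictionary such as {4: "gene_family"}, only column 4 is 
kept, under that tag.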


