[Bioperl-l] SeqIO::table

Brian Osborne brian_osborne at cognia.com
Sun Apr 17 11:00:00 EDT 2005


Hilmar,

Yes, this is a good idea, like the existing 'tab' format but with more
information.

Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Hilmar Lapp
Sent: Friday, April 08, 2005 8:15 PM
To: Bioperl
Subject: [Bioperl-l] SeqIO::table


I wrote two new SeqIO-compliant streams that return Bio::Seq objects
from a table, held either in delimited ASCII text or in a worksheet
inside an Excel file.

The table in either format is presumed to contain one seq per line (or
row). The parser allows you to identify a few columns with implied
semantic meaning (display_id, accession, species, sequence string). All
other columns may be selectively chosen to be preserved in the
annotation bundle.

The motivation for this was that several comprehensive gene family
publications made their data available in manually curated
spreadsheets. I needed these data as a SeqIO-compliant stream, and
going through an intermediary fasta file can mess up the annotation a
lot.

If anybody else is interested in this, or thinks it could be of general
interest, I'll commit it to bioperl.

I've enclosed the supported arguments for the SeqIO::table::new method;
this will give an idea of what is configurable. The Excel parser
supports the same arguments, plus the name of the worksheet.

	-hilmar
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

Named parameters supported by the proposed Bio::SeqIO::table:

              -comment leading character(s) introducing a comment line
              -header  the number of header lines to skip; the first
                       non-comment header line will be used to obtain
                       column names; column names will be used as the
                       default tags for attaching annotation.
              -delim   the delimiter for columns as a regular expression;
                       consecutive occurrences of the delimiter will
                       not be collapsed.
              -display_id the one-based index of the column containing
                       the display ID of the sequence
              -accession_number the one-based index of the column
                       containing the accession number of the sequence
              -seq     the one-based index of the column containing
                       the sequence string of the sequence
              -species the one-based index of the column containing the
                       species for the sequence record; if not a
                       number, will be used as the static species
                       common to all records
              -annotation if provided and a scalar, a flag indicating
                       whether all additional columns are to be
                       preserved as annotation; the tags will be the
                       column headers if there is a header line, and
                       otherwise 'colX', where X is the one-based
                       column index. If a reference to an array, only
                       those columns (one-based index) will be
                       preserved as annotation, with tags as before.
                       If a reference to a hash, the keys are the
                       one-based indexes of the columns to be
                       preserved, and the values are the tags under
                       which the annotation is to be attached. If not
                       provided or supplied as undef, no additional
                       annotation will be preserved.
              -trim    flag determining whether or not all values should
                       be trimmed of leading and trailing white space
