[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Wed Aug 2 09:25:27 UTC 2006

On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> Question One
> ============
> Is reading sequence files an important function to you, and if so which 
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW

> If you have had to write your own code to read a "common" file format 
> which BioPython doesn't support, please get in touch.

EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
pretty).

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
> title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

Mostly (f), a homegrown Pyrex/Flex parser.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
> (a) Bio.Fasta.Dictionary
> (b) Bio.GenBank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires 
> creation of an index using the index_file function

No, but I do create dictionaries on-the-fly from (name, sequence)
tuples, where necessary.
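
As a rough illustration of the on-the-fly approach, here is a minimal
sketch in plain Python with no Biopython dependency.  The read_fasta
helper is hypothetical (it is not the Pyrex/Flex parser mentioned above),
and it assumes the record name is the first whitespace-delimited token
of the title line and that names are unique within the file.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) tuples from a FASTA-formatted handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the ID is the first token after '>'
            tokens = line[1:].split(None, 1)
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Made-up example data; a real file handle works the same way.
example = io.StringIO(">seq1 first record\nACGT\nACGT\n>seq2\nGGCC\n")

# dict() consumes the iterator of (name, sequence) tuples directly.
records = dict(read_fasta(example))
```

Note that dict() keeps only the last record for a repeated name, so a
duplicate check would be needed for files where IDs are not unique.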

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the 
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you 
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the records 
> using an iterator but saving them in a list).
> 
> Do you have any additional comments on this?  For example, flexibility 
> versus memory requirements.

Depending on what I need to do, I might use different approaches.  If
I'm filtering sequences on, say, sequence composition, I'll use an
iterator.  If I need to cross-reference sequences from the file to some
other set of sequences by ID, I'll use a dictionary.  In each case, I
will generally either use a for loop or build a dictionary on-the-fly.
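
The two access patterns might be sketched as follows.  This is only an
illustration: gc_fraction and filter_by_gc are hypothetical helper
names, GC content stands in for "sequence composition", and the records
are made-up (name, sequence) tuples rather than the output of any
particular parser.

```python
def gc_fraction(seq):
    """Return the fraction of G and C characters in seq (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_by_gc(records, minimum):
    """Lazily yield (name, seq) tuples whose GC fraction is >= minimum."""
    for name, seq in records:
        if gc_fraction(seq) >= minimum:
            yield name, seq


# Made-up records standing in for a parser's output.
records = [("a", "GGCC"), ("b", "ATAT"), ("c", "GCAT")]

# Iterator style: one pass, one record in memory at a time.
gc_rich = list(filter_by_gc(iter(records), 0.5))

# Dictionary style: random access by ID for cross-referencing.
by_id = dict(records)
other_ids = ["c", "x", "a"]  # IDs from some other (hypothetical) data set
shared = [(i, by_id[i]) for i in other_ids if i in by_id]
```

The iterator version never holds more than one record at a time, while
the dictionary version trades memory for lookups by ID.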

> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as FastaRecords 
> or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

I'd rather have SeqRecords.  SeqRecords are particularly useful for
annotations and attaching data to the sequence which, later, gets
written out in some format other than FASTA sequence format.  For
operations where no further information is associated with the sequence,
they offer equivalent functionality to FastaRecords.  

Currently I default to (name, seq) tuples, and only create SeqRecords
when necessary, but this is only because tuples are what the parser I
use returns.

> Question Five - GenBank files: GenbankRecord or SeqRecord
> ==========================================================
> If you use GenBank files, do you use:
> (a) Bio.GenBank.FeatureParser which returns SeqRecord objects
> (b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects
> 
> Do you care much either way?  For me the only significant difference is 
> that feature locations are held as objects in the SeqRecord, and as the 
> raw string in the Record.

I use Bio.GenBank.FeatureParser because I prefer the storage of features
(which are what I'm generally interested in) as SeqFeature objects.

> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an 
> event/callback model, where the scanner component generates parsing 
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or 
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ... (please 
> provide details).

I care mostly about performance on large files and the convenient
representation of sequences and features.  Where parsers for a file
format, such as EMBL, have not been available (or easy to find), I have
sometimes used the Bio.ParserSupport classes and the Scanner/Consumer
pattern.
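
For readers unfamiliar with the pattern, a generic sketch follows.  It
does not use or reproduce the Bio.ParserSupport API: LineScanner and
TitleConsumer are hypothetical names, the event names are invented for
this example, and the input is a made-up fragment that borrows only the
two-letter line codes (ID, DE, //) of EMBL flat files.

```python
import io


class TitleConsumer:
    """Collects the events fired by the scanner into plain dictionaries."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = {"id": None, "description": []}

    def identifier(self, text):
        # Assumption: the identifier is the first token before any ';'
        self._current["id"] = text.split(";")[0].split()[0]

    def description(self, text):
        self._current["description"].append(text)

    def end_record(self):
        self._current["description"] = " ".join(self._current["description"])
        self.records.append(self._current)
        self._current = None


class LineScanner:
    """Reads lines and turns each recognised line code into an event."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            line = line.rstrip("\n")
            code, text = line[:2], line[2:].strip()
            if code == "ID":
                consumer.start_record()
                consumer.identifier(text)
                in_record = True
            elif code == "DE" and in_record:
                consumer.description(text)
            elif code == "//" and in_record:
                consumer.end_record()
                in_record = False
            # Any other line code is ignored in this sketch.


# Made-up data, not a real database entry.
data = io.StringIO(
    "ID   TOY0001; SV 1; linear\n"
    "DE   toy record spanning\n"
    "DE   two lines\n"
    "//\n"
)
consumer = TitleConsumer()
LineScanner().feed(data, consumer)
```

The scanner knows only the file layout and the consumer knows only what
to build, so a different consumer can produce a different object from
the same scanner.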

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             