[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Leighton Pritchard lpritc at scri.sari.ac.uk
Wed Aug 2 11:23:27 UTC 2006


On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote:
> GFF (General Feature Format)
> 
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> 
> GFF and PTT aren't exactly what I would call sequence files, in that
> they don't contain any sequence data.  

Fair point, but GFF3 (see below) can optionally carry sequence data, and
I use them for exactly what you say here:

> those files could be turned into SeqRecords or SeqFeatures (with empty
> sequences).

I was thinking that GFF3 would be more useful than GFF:

http://song.sourceforge.net/gff3.shtml

NCBI have already gone over to this on bacterial genomes, at least
(e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff),
and it's a much richer format than the original specification.  Andrew
Dalke has already written a GFF3 parser/writer, which is available at

http://www.dalkescientific.com/PyGFF3-0.5.tar.gz

I've not used this in anger, yet...
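
A minimal sketch of reading GFF3 feature lines into sequence-less
records might look like the following.  This is not the PyGFF3 API; the
function names and the dictionary layout are assumptions made for
illustration.  It relies only on GFF3's nine tab-separated columns and
on the ##FASTA directive that introduces the optional sequence section,
which this sketch skips rather than reads.

```python
from urllib.parse import unquote

# Column names as given in the GFF3 specification.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Return a dict for one GFF3 feature line, or None for blanks/comments."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    columns = line.split("\t")
    if len(columns) != len(GFF3_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d"
                         % len(columns))
    feature = dict(zip(GFF3_COLUMNS, columns))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 holds tag=value pairs separated by ';'; values may be
    # comma-separated lists and are URL-escaped.
    attributes = {}
    if feature["attributes"] != ".":
        for pair in feature["attributes"].split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


def iter_gff3_features(lines):
    """Yield feature dicts, stopping at the optional ##FASTA section."""
    for line in lines:
        if line.startswith("##FASTA"):
            break
        feature = parse_gff3_line(line)
        if feature is not None:
            yield feature
```

Each yielded dict could then be mapped onto a SeqFeature attached to a
SeqRecord with an empty sequence, as suggested above.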

> It looks like there is enough overlap between the EMBL and Genbank to
> make sharing code between them a good idea.  Certainly EMBL was a file
> format I was thinking we should try to support.

In a scanner/consumer pattern it's easy enough.  I've not looked under
the hood of the new GenBank parser yet, to see what you've done.  Most
of my contact with EMBL format is with headerless feature tables and
Artemis, which aren't directly similar to GenBank entries. 
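
To illustrate why the scanner/consumer split makes this easy, here is a
rough sketch.  It is not the existing Bio.GenBank code; the class and
function names are hypothetical.  It handles only feature key lines,
where EMBL prefixes the line with "FT" and GenBank indents with five
spaces, and in both the key starts in column 6.  The two scanners differ
only in that prefix and feed the same consumer.

```python
class FeatureConsumer:
    """Format-independent consumer: collects (key, location) pairs."""

    def __init__(self):
        self.features = []

    def feature(self, key, location):
        self.features.append((key, location))


def _scan_feature_keys(lines, prefix, consumer):
    """Send each feature key line starting with `prefix` to the consumer."""
    for line in lines:
        # Key lines have a non-blank character in column 6 (index 5);
        # qualifier and continuation lines are blank there.
        if line.startswith(prefix) and line[5:6].strip():
            parts = line[5:].split(None, 1)
            if len(parts) == 2:
                consumer.feature(parts[0], parts[1].strip())


def scan_embl_features(lines, consumer):
    """Scanner for EMBL-style feature tables (lines begin 'FT')."""
    _scan_feature_keys(lines, "FT   ", consumer)


def scan_genbank_features(lines, consumer):
    """Scanner for GenBank-style feature tables (five-space indent)."""
    _scan_feature_keys(lines, "     ", consumer)
```

A headerless Artemis-style table would go through scan_embl_features
unchanged, since nothing here depends on ID or LOCUS lines.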

> Reading your other comments, it looks like you wouldn't miss FastaRecord
> or GenBank records if they were phased out.

Not personally, but others may have strong opinions and breakable code,
yet.

> Personally, I'm suggesting we standardise on having any Sequence
> IO framework return SeqRecord objects.
> 
> I think we should have a generic "Sequence Iterator" object to do this
> which takes a file handle, subclassed for each file format - giving a
> "Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

> I'm inclined not to give any choice of parser object (e.g.
> Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
> SeqRecord.

It may be a side-issue, but should a Clustal parser return an Alignment
object or iterate over SeqRecord objects?  And for that matter, what
about other MSA files in FASTA format?  I think we ought to allow parsers
to return an Alignment where the user requests it, which is
functionality I'm not currently aware of in the FASTA sequence parsers.
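
As a sketch of how both could coexist: keep the record iterator as the
basic interface and layer an "alignment on request" helper on top.  The
names below are hypothetical, and plain (title, sequence) tuples stand
in for SeqRecord and Alignment objects.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA-format handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def read_alignment(handle):
    """Collect all records, insisting that they share one length."""
    records = list(iter_fasta(handle))
    lengths = {len(sequence) for _, sequence in records}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    return records


# Usage: the same file can be read either way.
example = ">a\nAC-G\n>b\nACTG\n"
as_records = list(iter_fasta(io.StringIO(example)))
as_alignment = read_alignment(io.StringIO(example))
```

The length check is the only thing that separates the two views here; a
real Alignment object would presumably carry more than that.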

> The individual readers should offer some level of control, for example
> the title2ids function for Fasta files lets the user decide how the
> title line should be broken up into id/name/description.  Also for some
> file formats the user should be able to specify the alphabet.

Could the alphabet be optionally specified by the user on parsing, with
a warning or error raised if there are non-compliant symbols in the
file?  That would serve as a quick validator for bad sequences, or a
reminder to the occasionally forgetful that, for example, they're not
working with nucleotide sequences today <cough, embarrassed glance at
floor> ;)
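
Something along these lines is what I have in mind.  It is only a
sketch: the function name, the strict flag and the letter set are
assumptions for illustration, not an existing Biopython interface.

```python
import warnings

# Hypothetical stand-in for a nucleotide alphabet's letters.
DNA_LETTERS = frozenset("ACGTN")


def check_alphabet(sequence, letters, strict=False):
    """Return True if every symbol in `sequence` is in `letters`.

    Otherwise warn, or raise ValueError when strict is True.
    """
    bad = sorted(set(sequence.upper()) - set(letters))
    if bad:
        message = "symbols not in alphabet: " + "".join(bad)
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    return not bad
```

A parser could call this on each record as it is read, so feeding a
protein file to a parser told to expect DNA fails (or complains) at once.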

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             