[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Bruce Southey bsouthey at gmail.com
Tue Aug 22 13:52:10 UTC 2006


Hi,
To date I have only used the SwissProt code from BioPython, so I am
really only lurking, but here are some responses.

Bruce

On 8/21/06, Peter (BioPython Dev) <biopython-dev at maubp.freeserve.co.uk> wrote:
> You probably noticed I sent out a "Dealing with sequence files"
> questionnaire on the main discussion list:
>
> http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html
>
> I've had four replies to date (off the list), and with the previous list
> discussion and counting myself that makes eight views.  Not a very big
> sample I know.
>
> > Question One
> > ============
> > Is reading sequence files an important function to you, and if so which
> > file formats in particular (e.g. Fasta, GenBank, ...)
>
> Fasta very popular, with GenBank also scoring highly.  Michiel and I
> both use clustalw.  Apart from EMBL (next question) there wasn't any
> other popular file format given.

This is not a surprise, because most applications also use FASTA as
their default format, although most do not accept a comment line.
Thus, FASTA is the most important format.


>
> I'm tempted to ask again regarding multiple alignment formats.
>
> > Question Two
> > ============
> > Are there any sequence formats you would like to be able to read using
> > BioPython that are not currently supported (e.g. EMBL, ...)
>
> It may have been a leading question, but several respondents would like
> to be able to read in EMBL format.
>
> Other requests included:
>
> XML based 454 sequence files
> UniGene sequence cluster format
>
> Leighton mentioned:
>
> PTT (Protein table files)
> GFF (General Feature Format)
>
> And I wanted to be able to read Stockholm alignments.

I would like to be able to use a custom format that is based on the
FASTA format. That is, allowing non-standard characters to be included
as part of the sequence, which I later remove. Perhaps this is just a
matter of being able to subclass the parser.
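
As a rough illustration of the idea above, here is a minimal sketch in
plain Python (it does not use any BioPython API). The function names
read_fasta and strip_nonstandard and the STANDARD character set are
hypothetical, chosen only for this example; the parser keeps whatever
characters appear in the sequence lines, and the clean-up is a separate
step applied afterwards.

```python
import io

# Assumption for illustration only: the 20 standard amino acid letters.
# This is not a BioPython alphabet definition.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(handle):
    """Yield (title, raw_sequence) pairs from a FASTA-style handle.

    Sequence lines are kept exactly as written, so non-standard
    characters survive the parsing step.
    """
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None and line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def strip_nonstandard(seq, allowed=STANDARD):
    """Return seq with every character outside `allowed` removed."""
    return "".join(c for c in seq if c.upper() in allowed)


# Made-up record with extra marker characters in the sequence.
example = io.StringIO(">pep1 test peptide\nMKT*AY-\nIAK#QR\n")
records = [(title, strip_nonstandard(seq)) for title, seq in read_fasta(example)]
```

Keeping the parsing and the clean-up separate means the same reader
could serve both plain FASTA and a custom variant; a subclass would
only need to change the clean-up step.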


>
> > Question Three - Reading Fasta Files
> > ====================================
> > Which of the following do you currently use (and why)?:
> >
> > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
> > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> > (c) Bio.Fasta with your own parser (Could you tell us more?)
> > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> > (e) Bio.FormatIO (giving SeqRecord objects)
> > (f) Other (Could you tell us more?)
>
> A range covering (a), (b) and (d) plus DIY parsers.
>
> > Question Four - Reading GenBank Files
> > =====================================
> > Which of the following do you currently use (and why)?:
> >
> > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
> > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
> > (c) Other (Could you tell us more?)
>
> Both (a) and (b) with no clear majority.
>
> > Question Five - Record Access...
> > ================================
> > When loading a file with multiple sequences do you use:
> >
> > (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> > records one by one in the order from the file.
> >
> > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> > random access to the records using their identifier.
> >
> > (c) A list giving random access by index number (e.g. load the records
> > using an iterator but save them in a list).
>
> Most of you use iterators, storing records in memory as required.

(a), an iterator interface.
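
For reference, the three access styles in Question Five can all be
built on one iterator. The following is a minimal sketch in plain
Python, not the Bio.Fasta.Iterator or Bio.Fasta.Dictionary code;
iter_records is a hypothetical helper, and taking the first word of the
title line as the identifier is an assumption.

```python
import io


def iter_records(handle):
    """Yield (identifier, sequence) pairs in file order."""
    ident, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                yield ident, "".join(chunks)
            # Assumption: the identifier is the first word of the title.
            parts = line[1:].split(None, 1)
            ident, chunks = (parts[0] if parts else ""), []
        elif ident is not None and line:
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)


# Made-up two-record file used for all three styles.
DATA = ">seq1 first\nACGT\nAC\n>seq2 second\nGGCC\n"

# (a) Iterator: one record at a time, nothing kept in memory.
lengths = [len(seq) for _, seq in iter_records(io.StringIO(DATA))]

# (b) Dictionary: random access by identifier (all records in memory).
by_id = dict(iter_records(io.StringIO(DATA)))

# (c) List: random access by index number (all records in memory).
as_list = list(iter_records(io.StringIO(DATA)))
```

Note that (b) and (c) here simply load everything the iterator
produces, so they trade memory for random access; an on-disk index
would be a different design.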

>
> > Question Six - Martel, Scanners and Consumers
> > =============================================
> > Some of BioPython's existing parsers (e.g. those using Martel) use an
> > event/callback model, where the scanner component generates parsing
> > events which are dealt with by the consumer component.
> >
> > Do any of you use this system to modify existing parser behaviour, or
> > use it as part of your own personal file parser?
> >
> > (a) I don't know, or don't care.  I just use the parsers provided.
> > (b) I use this framework to modify a parser in order to do ... (please
> > provide details).
>
> Almost everyone said (a), which I think is a good thing if we are going
> to try and re-work BioPython's sequence reading.

(a), I just use the parsers provided.
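
For anyone unfamiliar with the event/callback model described in
Question Six, here is a minimal sketch of the general pattern in plain
Python. The class and method names (Scanner, Consumer, start_record,
title, sequence, end_record) are hypothetical and are not the actual
Martel or BioPython interfaces: the scanner walks the file and fires
events, and the consumer decides what to do with them.

```python
import io


class Consumer:
    """Base consumer: ignores every event by default."""

    def start_record(self):
        pass

    def title(self, text):
        pass

    def sequence(self, text):
        pass

    def end_record(self):
        pass


class Scanner:
    """Reads FASTA-style text and reports events to a consumer."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record()
                consumer.title(line[1:])
                in_record = True
            elif in_record and line:
                consumer.sequence(line)
        if in_record:
            consumer.end_record()


class LengthConsumer(Consumer):
    """Example consumer: records only the length of each sequence."""

    def __init__(self):
        self.lengths = {}
        self._title = None

    def title(self, text):
        self._title = text
        self.lengths[text] = 0

    def sequence(self, text):
        self.lengths[self._title] += len(text)


consumer = LengthConsumer()
Scanner().feed(io.StringIO(">a\nACGT\nAC\n>b\nGG\n"), consumer)
```

Modifying parser behaviour in this model means writing a new consumer
and leaving the scanner alone, which is the kind of customisation
option (b) in the question asks about.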

>
> > And finally...
> > ==============
> > Do you have any general questions or comments?
>
> Several people have commented that BioPerl has a nice unified system
> with good documentation.
>
> -----------------------------------------------------------------------
>
> Where next...
>
> I think my code could be included "in parallel" with the existing
> parsers, without the upheaval of creating a new branch etc.
>
> I have started thinking about writing files too.
>
> Part of this will involve trying to be as consistent as possible about
> mapping annotations from different file formats to the SeqRecord
> object's annotations dictionary.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059
>
> My code currently on bug 2059 is written as a single python file,
> provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea
> long term as more file formats are supported.
>
> If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
> slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
> filenames would clash on Windows.  Some people are using the code in
> Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
> and my new fasta interface.
>
> Alternatively, the new system could be put in Bio.SequenceIO or are
> there any other suggestions?
>
> Peter


