[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Wed Aug 16 14:44:36 UTC 2006

On Wed, Aug 16, 2006 at 03:00:36PM +0100, Peter wrote:
> Albert Krewinkel wrote:
> >The _parse_genbank_features function could also be used to parse embl
> >or DDBJ features, therefore I think it should be named differently.
> 
> First of all, that bit of code is for a new feature which I personally
> wanted - to be able to iterate over CDS features in a genbank file.
> 
> But yes, I did have in mind that it (and the GenBank parser) could be
> re-used to deal with EMBL files.  I have not yet taken the time to
> learn the EMBL file format and how it corresponds to the GenBank file
> format - but I agree a lot of the code could be shared.

I will try to build something similar for EMBL files within the next
few days.  This should be easy, since features really should look the
same in both formats:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
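
A rough sketch of why the sharing should be easy, assuming the
fixed-column layout described in the feature table document linked
above: the feature key starts in column 6 and the location in column
22, with EMBL putting "FT" in the first two columns where GenBank and
DDBJ leave blanks.  The name iter_features and the tuples it yields
are made up for illustration and are not existing Biopython code.  It
also assumes it is handed only the feature lines themselves, without
the "FEATURES" or "FH" header lines:

```python
def iter_features(lines):
    """Yield (key, location, qualifiers) tuples from feature table lines.

    Hypothetical helper, not part of Biopython.  Accepts GenBank/DDBJ
    style lines (columns 1-5 blank) and EMBL style lines ("FT" prefix).
    """
    key = None
    location = ""
    qualifiers = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("FT"):
            # EMBL: blank out the line code so both layouts match.
            line = "  " + line[2:]
        if not line.strip():
            continue
        name = line[5:20].strip()  # columns 6-20: feature key
        rest = line[21:].strip()   # column 22 onwards: location or qualifier
        if name:
            # A new feature starts; emit the previous one first.
            if key is not None:
                yield key, location, qualifiers
            key, location, qualifiers = name, rest, []
        elif key is None:
            continue  # stray continuation line before any feature key
        elif rest.startswith("/"):
            qualifiers.append(rest)
        elif qualifiers:
            # Wrapped qualifier value; joined with a space here, which
            # would be wrong for /translation (an accepted simplification).
            qualifiers[-1] += " " + rest
        else:
            # Wrapped location string.
            location += rest
    if key is not None:
        yield key, location, qualifiers
```

With something like this, iterating over CDS features would just mean
filtering on the key, whichever of the formats the lines came from.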

> >Since there is a lot of clean up effort right now: How about moving
> >the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> >are closely related and separate modules only clutter the namespace.
> 
> What real benefit does that give us?  It will cause a certain amount
> of upheaval in the short term as people will have to change their
> import statements on existing scripts.  If we do start a new branch
> for "big changes" then I have no real problem with this suggestion.

Agree.

> >To me, this seems to be a general problem. It's very difficult to find
> >a tool to use for a certain problem if one doesn't already know what
> >to look for.  I'd pretty much favour to create modules like
> >Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
> >a very big change, and therefore I'd like to follow Marc's suggestion
> >of splitting off a branch.  In general, I pretty much agree with what
> >Marc said in his <rant />.
> >
> >I cannot estimate how much work it would be to maintain two separate
> >biopython distributions, so please forgive me if I re-suggest
> >something completely idiotic here.  I just don't believe there is much
> >that could be lost that way.
> 
> BioPython probably would benefit from a little reorganising - and for
> anything drastic like moving entire modules about, a new branch makes
> sense.  On the other hand, do we have the man-power to do it?  Are any
> of the developers familiar with all of (or even most of) the existing
> modules?  I would guess I have used less than half of the modules - I
> have looked at the very basics of Bio.PDB for example, but have never
> tried Bio.NMR

I attached a file which I created when I was teaching myself
biopython. It provides a basic grouping for the current biopython
modules.  Naturally, it's by no means complete and probably wrong in
some places.

> I would favour gradual incremental (and backwards compatible) changes.
> Such as adding a new sequence reading module and then marking the old
> code as deprecated.

I think we could do both: A new branch might make it easier to see
which modules are useful the way they are and which are not.  Even if
this separate branch is never released itself, it would still be handy
for coordinating the reorganisation.
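
For the incremental route, marking old code as deprecated could look
roughly like the sketch below.  It uses the standard library warnings
module; the two function names are placeholders invented for this
example and do not refer to any existing Biopython reader:

```python
import warnings


def read_sequences_new(handle):
    """Hypothetical replacement reader: yields non-empty, stripped lines."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read_sequences_old(handle):
    """Hypothetical legacy entry point, kept for backwards compatibility."""
    warnings.warn(
        "read_sequences_old is deprecated; use read_sequences_new instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    # Delegate to the new code so there is only one implementation.
    return list(read_sequences_new(handle))
```

Existing scripts keep working, and their authors get a hint about what
to switch to before the old name is finally removed.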

> For example of some small changes, have any of you looked at:
> 
> Bug 2057 - SeqRecord has no __str__ or __repr__
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
> 
> Bug 1963 - Adding __str__ method to codon tables and translators
> http://bugzilla.open-bio.org/show_bug.cgi?id=1963
> 
> Little things in themselves that I think would help.

True.  My (naive) hope is that such things would be by-products of a
new branch.  I have to admit that this is probably not possible
without doing a code sprint.
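
To illustrate how small a change like the one in bug 2057 is, here is
a minimal sketch.  MiniSeqRecord is a hypothetical stand-in written
for this example, not the real SeqRecord class, and the output format
is only one possible choice:

```python
class MiniSeqRecord:
    """Hypothetical stand-in for SeqRecord, for illustration only."""

    def __init__(self, seq, id="<unknown id>", description=""):
        self.seq = seq
        self.id = id
        self.description = description

    def __repr__(self):
        # Unambiguous form, useful at the interactive prompt.
        return "MiniSeqRecord(seq=%r, id=%r, description=%r)" % (
            self.seq, self.id, self.description)

    def __str__(self):
        # Human-readable summary; sequences over 20 letters are truncated.
        seq = self.seq if len(self.seq) <= 20 else self.seq[:17] + "..."
        return "ID: %s\nDescription: %s\nSeq: %s" % (
            self.id, self.description, seq)
```

Printing such an object then gives a short summary instead of the
default object address.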

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics
-------------- next part --------------
Databases:
 o NCBI
   - UniGene
   - GenBank
   - PubMed
   - Entrez
   - LocusLink
   - Geo
 o Kabat
 o KEGG
 o SwissProt
 o Medline
 o biblio (pywebsvcs dependency is mentioned only in the module itself)
 o dbdefs
 o InterPro
 o Gobase
 o Enzyme
 o Rebase

Models and Simulations:
 o Ais
 o MetaTool
 o Pathway
 o ECell

Algorithms, Machine Learning and Pattern Recognition:
 o HMM
 o NeuralNetwork
 o Cluster
 o LogisticRegression, Statistics
 o GA
 o MarkovModel
 o pairwise2
 o NaiveBayes
 o MaxEntropy

Alignments:
 o Align
 o Blast
 o AlignAce
 o Clusalw
 o Fasta
 o FSSP
 o SubsMat
 o Search (WUBLAST output)
 o Saf
 o IntelliGenetics

Applications:
 o Application
 o Emboss
 o Nexus
 o AlignAce
 o Blast
 o MEME
 o Sequencing
 o Wise

Data Structures:
 o KDTree
 o trie

Sequences:
 o GFF
 o Seq
 o SeqUtils
 o SeqFeature
 o SeqRecord
 o Alphabet
 o Transcribe
 o Translate
 o lcc
 o Encodings
 o Data
 o NBRF

SeqIO:
 o writers
 o Writer
 o SeqIO
 o builders
 o Fasta
 o Index

Utilities:
 o utils.py
 o ParserSupport
 o File
 o Tools
 o Mindy
 o HotRand
 o config
 o formatdefs
 o MarkupEditor
 o DocSQL (wouldn't usage of SQL-Object be nicer? (if possible))
 o EUtils.ReseekFile
 o Std, StdHandler
 o PropertyManager
 o MultiProc
 o Decode
 o FilteredReader

Graphics:
 o Graphics

Web-Based:
 o GenBank
 o NetCache
 o EUtils
 o WWW

Microarrays:
 o Affy

Structure:
 o NMR
 o PDB
 o Crystal
 o Ndb
 o SCOP
 o SVDSuperimposer

Motifs:
 o MEME
 o Prosite
 o CDD
 o Compass

References:
 o Medline, PubMed
 o DBXref

Restriction:
 o Restriction
 o CAPS