[Biopython-dev] GSoC SearchIO project

Eric Talevich eric.talevich at gmail.com
Sat Apr 7 12:13:16 EDT 2012


On Sat, Apr 7, 2012 at 12:43 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> --- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > The reason for using SearchIO (despite not being PEP8
> > compatible - something I regret in the naming of SeqIO
> > and the pattern it set) is to match SeqIO and AlignIO and
> > BioPerl. Anyone familiar with BioPerl will immediately see
> > what it is for - and some of the student applicants have
> > already used BioPerl's SearchIO. Personally I find this
> > quite a compelling argument.
>
> Sorry but I am not convinced. I doubt that somebody familiar with
> BioPerl's Align and AlignIO modules will have trouble finding the parser in
> Biopython if in Biopython there is only a Bio.Align module. Also this means
> that some modules in Biopython are split up in Module and ModuleIO, whereas
> most others are not. In this particular case, for consistency you would
> have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a
> clean module organization in Biopython instead of strictly following what
> BioPerl did.
>

How about Bio.Search, for now?

We had a similar discussion at the end of GSoC 2009, when we decided to
merge Tree and TreeIO (names inspired by BioPerl) to create Phylo (because
not all trees are phylogenies, although there is also a Perl module called
Bio::Phylo). Since the *IO namespaces have only 4 public functions, plus a
<Format>IO.py module for each supported I/O format, it's not too cluttered.
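
To make the layout described above concrete, here is a minimal sketch of
a small *IO-style namespace: four public functions (parse, read, write,
convert) that dispatch to per-format code. The format names "tab" and
"csv", the record shape (query, target, score), and the helper names are
all assumptions for illustration, not Biopython's actual code.

```python
import io

# Hypothetical per-format code. In a real package each format would live in
# its own <Format>IO.py module; they share one file here to stay runnable.
def _parse_delimited(handle, sep):
    """Yield (query, target, score) tuples, one per non-empty line."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        query, target, score = line.split(sep)
        yield (query, target, float(score))


def _write_delimited(records, handle, sep):
    """Write records one per line and return how many were written."""
    count = 0
    for query, target, score in records:
        handle.write(sep.join([query, target, str(score)]) + "\n")
        count += 1
    return count


# Format registries: the only place that knows which formats exist.
_PARSERS = {
    "tab": lambda handle: _parse_delimited(handle, "\t"),
    "csv": lambda handle: _parse_delimited(handle, ","),
}
_WRITERS = {
    "tab": lambda records, handle: _write_delimited(records, handle, "\t"),
    "csv": lambda records, handle: _write_delimited(records, handle, ","),
}


def parse(handle, fmt):
    """Return an iterator over all records in the handle."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError(f"Unknown format: {fmt!r}") from None
    return parser(handle)


def read(handle, fmt):
    """Return the single record in the handle, or raise ValueError."""
    records = list(parse(handle, fmt))
    if len(records) != 1:
        raise ValueError(f"Expected exactly one record, found {len(records)}")
    return records[0]


def write(records, handle, fmt):
    """Write records in the given format and return the record count."""
    try:
        writer = _WRITERS[fmt]
    except KeyError:
        raise ValueError(f"Unknown format: {fmt!r}") from None
    return writer(records, handle)


def convert(in_handle, in_fmt, out_handle, out_fmt):
    """Parse one format and write another; return the record count."""
    return write(parse(in_handle, in_fmt), out_handle, out_fmt)


# Usage: convert two tab-separated records to comma-separated form.
source = io.StringIO("q1\tt1\t52.0\nq1\tt2\t30.5\n")
target = io.StringIO()
converted = convert(source, "tab", target, "csv")
```

With this shape, supporting another format means adding one parser/writer
pair and registering it; the four public functions stay the same.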

Likewise, at the end of this GSoC it may be clearer whether the new
sub-package should have a different name. (SearchIO seems to have been
plenty effective at drawing attention to the project.) But in any case, I
support putting all the new work under one sub-package, rather than two.


> > That said, the name SearchIO isn't the clearest in the
> > world for a newcomer - however I haven't come up
> > with anything significantly better myself. Perhaps there
> > is a better name out there, which would justify breaking
> > the pattern? I've considered pairwise and palign, but
> > neither feels right.
>
> How about including this module as a submodule in Bio.Align? If we think
> of Bio.Align as a general module for alignments, then pairwise alignments
> fit in it too. It depends a bit on the exact API, but I expect that we can
> come up with something elegant.
>
>
Does anything in Bio.Align already operate on SeqFeature objects?

Given that BLAST or HMMer output could be interpreted as (1) a series of
annotated features/regions on target sequences, or (2) a series of pairwise
alignments [*], perhaps it would be most effective to support those aspects
separately, through (1) Bio.Search or Bio.Feature [**], and (2) Bio.Align
or Bio.AlignIO.

[*] The multiple sequence alignment produced by HMMer is in a format we
already handle (Stockholm). Some people want to convert BLAST output to a
multiple sequence alignment, too, and while I suppose we could support that
in a literal sense, the result would be worse than the output of pretty
much any other alignment program so I don't think we should.

[**] A Bio.Feature module could involve GFF parsing and the variant
parsers, too. It would contain I/O functions that emit SeqFeatures, of
course.
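
The two interpretations above, (1) a feature on the target sequence and
(2) a pairwise alignment, can be sketched as two views derived from one
parsed hit. Everything below is hypothetical: the Hsp, Feature and
PairwiseAlignment classes are stand-ins written for this example, not
Biopython objects. A real implementation would presumably emit
Bio.SeqFeature.SeqFeature objects for (1) and Bio.Align objects for (2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    """One hypothetical parsed hit region from search output."""
    query_id: str
    target_id: str
    target_start: int  # 0-based, inclusive
    target_end: int    # 0-based, exclusive
    query_aln: str     # aligned query string, gaps as "-"
    target_aln: str    # aligned target string, gaps as "-"
    score: float


@dataclass(frozen=True)
class Feature:
    """Stand-in for an annotated region on a sequence."""
    seq_id: str
    start: int
    end: int
    type: str
    qualifiers: dict


@dataclass(frozen=True)
class PairwiseAlignment:
    """Stand-in for a two-row alignment."""
    query_id: str
    target_id: str
    query: str
    target: str


def as_feature(hsp):
    """View (1): the hit as an annotated region on the target sequence."""
    return Feature(
        seq_id=hsp.target_id,
        start=hsp.target_start,
        end=hsp.target_end,
        type="similarity",
        qualifiers={"query": hsp.query_id, "score": hsp.score},
    )


def as_pairwise(hsp):
    """View (2): the hit as a pairwise alignment of query and target."""
    if len(hsp.query_aln) != len(hsp.target_aln):
        raise ValueError("Aligned strings must have equal length")
    return PairwiseAlignment(
        hsp.query_id, hsp.target_id, hsp.query_aln, hsp.target_aln
    )


# Usage: derive both views from the same parsed hit.
hsp = Hsp("q1", "t1", 10, 15, "ACG-T", "ACGTT", 42.0)
feature = as_feature(hsp)
alignment = as_pairwise(hsp)
```

Keeping both views as thin conversions of one parsed record would let the
parsing live in a single sub-package while the feature and alignment
handling stay wherever they already are.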

