[Biopython-dev] Alignment object

Brad Chapman chapmanb at 50mail.com
Wed Mar 3 14:12:15 UTC 2010


Kevin and Peter;

> I find pysam pretty limited for doing more than reading and subsetting
> SAM/BAM files.  I'm planning to add a constructor and helper functions for
> creating new aligned reads.  The current AlignedRead object is also
> read-only, which will need to be relaxed for many serious applications.
> Until then, I'm writing (text) SAM records and piping them to samtools to
> encode in BAM format (see the script attached to one of my earlier emails).

Agreed. These sound like good improvements.
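
As a rough sketch of the interim approach described above (writing text
SAM records for samtools to encode), something like the following could
format a single record. The helper name format_sam_record and its
parameters are hypothetical and not part of pysam; the eleven
tab-separated mandatory fields follow the SAM format specification.

```python
def format_sam_record(qname, flag, rname, pos, mapq, cigar, seq, qual,
                      rnext="*", pnext=0, tlen=0, tags=()):
    """Return one tab-separated SAM alignment line (hypothetical helper).

    pos and pnext are 1-based, as in text SAM. tags are pre-formatted
    optional fields such as "NM:i:0".
    """
    # QUAL may be "*" (not stored); otherwise it must match SEQ in length.
    if qual != "*" and len(seq) != len(qual):
        raise ValueError("SEQ and QUAL must be the same length")
    # Mandatory fields in the order the SAM specification defines them.
    fields = [qname, flag, rname, pos, mapq, cigar,
              rnext, pnext, tlen, seq, qual]
    fields.extend(tags)
    return "\t".join(str(field) for field in fields)


record = format_sam_record("read1", 0, "chr1", 100, 60, "4M",
                           "ACGT", "IIII", tags=["NM:i:0"])
print(record)
```

Lines built this way, preceded by a suitable header, could then be piped
to samtools for BAM encoding as Kevin describes.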

> Scalability is okay for conversion to pileup format, but not what I'd
> consider great.  But I agree, pysam is a good starting point.  I just wish
> that the read identifiers and attributes were available via the C API,
> since those are often needed when, e.g., writing a genotype caller.

Do you think we could build off of what pysam has? The project hasn't
seemed especially active, but it would be great to have a unified
code base in Python for dealing with BAM files. They use Mercurial
for revision control, so worst case we can always fork this on
Bitbucket and work off of that. Galaxy has a fork for their use:

http://bitbucket.org/kanwei/kanwei-pysam/

The Bioconductor folks also seem to be standardizing around SAM/BAM for
their analysis pipelines, so practically we may be able to borrow
some of their APIs once they have a released version of Rsamtools.

> What do you think about the fact I am introducing an "improved"
> version of the existing Bio.Align.Generic.Alignment class under
> Bio.Align.MultipleSeqAlignment?

Yes please. I don't think Generic is that great and am happy to see
it improved upon.

> That's actually several questions in one - should this be a new
> object or just enhance the old one? I favour a new object here
> because I want to *enforce* the fact that all the rows are the
> same length, but I doubt people are using the flexibility of
> the current alignment object in this way.
> 
> Next where should the new object live? I find the current use
> of Bio.Align.Generic somewhat hidden away, thus my
> suggestion of using Bio.Align directly.
> 
> Next, what should the new object be called? We could reuse
> the old name of Alignment but it is a bit vague and would
> cause confusion given the existing object is also called that.
> I have used MultipleSeqAlignment but am open to suggestions
> (e.g. MulSeqAlignment is shorter).

I like MultipleSeqAlignment, and agree it should sit as close to the
top level of Bio.Align as possible. If you think a new object is better,
go for that and we can put Generic on a deprecation path. It's great you
are cleaning this up.
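
To make the "enforce equal row lengths" idea concrete, here is a minimal
sketch of how such a check might look. The class name
EqualLengthAlignment and its methods are hypothetical, used only for
illustration; this is not the proposed MultipleSeqAlignment
implementation, and plain strings stand in for sequence records.

```python
class EqualLengthAlignment:
    """Hypothetical sketch: a row container that rejects length mismatches."""

    def __init__(self, rows=()):
        self._rows = []
        for row in rows:
            self.append(row)

    def append(self, row):
        # Enforce the invariant when a row is added, not when it is used.
        if self._rows and len(row) != len(self._rows[0]):
            raise ValueError(
                "row length %d does not match alignment length %d"
                % (len(row), len(self._rows[0]))
            )
        self._rows.append(row)

    def get_alignment_length(self):
        # Number of columns; zero for an empty alignment.
        return len(self._rows[0]) if self._rows else 0

    def __len__(self):
        # Number of rows.
        return len(self._rows)


aln = EqualLengthAlignment(["ACGT-A", "AC-TTA"])
try:
    aln.append("ACG")  # wrong length, so this should be refused
    rejected = False
except ValueError:
    rejected = True
print(len(aln), aln.get_alignment_length(), rejected)
```

Checking at insertion time means any code that later slices columns can
rely on every row having the same length.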

Brad


