[Biopython-dev] Alignment object

Peter biopython at maubp.freeserve.co.uk
Wed Mar 3 10:57:09 EST 2010


On Wed, Mar 3, 2010 at 2:12 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Kevin and Peter;
>
>> I find pysam pretty limited for doing more than reading and subsetting
>> SAM/BAM files.  I'm planning to add a constructor and helper functions for
>> creating new aligned reads.  The current AlignedRead object is also
>> read-only, which will need to be relaxed for many serious applications.
>>  Until then, I'm writing (text) SAM records and piping them to samtools to
>> encode in BAM format (see the script attached to one of my earlier emails).
>
> Agreed. These sound like good improvements.
>
>> Scalability is okay for conversion to pileup format, but not what I'd
>> consider great.  But I agree, pysam is a good starting point.  I just wish
>> that the read identifiers and attributes were  available via the C API,
>> since those are often needed when, e.g., writing a genotype caller.
>
> Do you think we could build off of what pysam has? The project hasn't
> seemed especially active, but it would be great to have a unified
> code base in python for dealing with BAM files. They use mercurial
> for revision control, so worst case we can always fork this on
> bitbucket and work off of that. Galaxy has a fork for their use:
>
> http://bitbucket.org/kanwei/kanwei-pysam/
>
> The bioconductor folks also seem to be standardizing around
> SAM/BAM for their analysis pipelines, so practically we may be
> able to borrow some of their APIs once they have a released
> version of Rsamtools.

I agree that we should work towards supporting SAM (and perhaps
also BAM) in Biopython, and other projects' APIs can be very
useful for inspiration or guidance.

I was aware of pysam but am concerned about the dependencies:
Pyrex 0.9.8 or later, Python 2.6 or later, plus of course SAMtools
itself. These may all be fine on Linux, but will likely be trouble
for us on other platforms (especially Windows).

Is anyone aware of any other SAM/BAM parser in Python?

>> What do you think about the fact I am introducing an "improved"
>> version of the existing Bio.Align.Generic.Alignment class under
>> Bio.Align.MultipleSeqAlignment?
>
> Yes please. I don't think Generic is that great and am happy to see
> it improved upon.
>
>> That's actually several questions in one - should this be a new
>> object or just enhance the old one? I favour a new object here
>> because I want to *enforce* the fact that all the rows are the
>> same length, but I doubt people are using the flexibility of
>> the current alignment object in this way.
>>
>> Next where should the new object live? I find the current use
>> of Bio.Align.Generic somewhat hidden away, thus my
>> suggestion of using Bio.Align directly.
>>
>> Next, what should the new object be called? We could reuse
>> the old name of Alignment but it is a bit vague and would
>> cause confusion given the existing object is also called that.
>> I have used MultipleSeqAlignment but am open to suggestions
>> (e.g. MulSeqAlignment is shorter).
>
> I like MultipleSeqAlignment, and agree it should be as top level as
> possible in Bio.Align. If you think a new object is better, go for
> that and we can move Generic on a deprecation path. It's great you
> are cleaning this up.

OK then - I've been wanting to "clean this up" for some time.
I'll make time to merge what I have so far (which shouldn't be
controversial) and update the tutorial.
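
To make the "enforce equal row lengths" idea concrete, here is a rough
sketch. The class name EqualLengthAlignment and its methods are
hypothetical and only for illustration; this is not the code being
merged into Bio.Align, just the general shape of the length check.

```python
class EqualLengthAlignment:
    """Hypothetical stand-in for the proposed alignment object.

    Rows are stored as (name, sequence string) pairs. This is an
    illustrative assumption, not the Biopython implementation.
    """

    def __init__(self, rows=()):
        self._rows = []
        for name, seq in rows:
            self.append(name, seq)

    def get_alignment_length(self):
        # Number of columns; zero for an empty alignment.
        return len(self._rows[0][1]) if self._rows else 0

    def append(self, name, seq):
        # Reject any row whose length differs from the rows already stored.
        if self._rows and len(seq) != self.get_alignment_length():
            raise ValueError(
                "Row %r has length %d, expected %d"
                % (name, len(seq), self.get_alignment_length())
            )
        self._rows.append((name, seq))

    def __len__(self):
        # Number of rows in the alignment.
        return len(self._rows)
```

With this check in place, appending a row such as ("c", "ACG") to an
alignment whose existing rows have five columns raises ValueError
instead of silently producing a ragged alignment.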

I would also like to investigate moving the useful bits of the
SummaryInfo class into methods of the main alignment class.

Testing would be very welcome!

Peter


