[Biopython-dev] MAF Parser/Indexer

Mon Apr 2 20:33:51 EDT 2012

Hi Peter,

Thank you for the feedback. I will try to make sure this code is well 
tested before the next release.

> Is there any more about reverse complemented sequences
> and how they are handled, for in simple iterators, but more
> so when indexing? What I'm getting at here is the non-typical
> treatment of start and end being relative to the reverse
> complemented sequence for minus strand alignments. Here
> most tools/formats always count from the first base on the
> forward strand.

I'm not sure I fully understand you, but I hope I do. In theory
strandedness could be an issue, but in practice the reference species in
a multiz MAF file is always on the plus strand. To make sure the user
isn't passing a MAF file containing blocks with mixed strands to
MafIndex.get_spliced(), there's a check in there that all strands for
the reference species are the same. We also assume that coordinates
specified in a block always run in the ascending direction (i.e. they
are given as a zero-based 'start' and a 'size', and we take the block to
cover positions start through start + size - 1).
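
To make the coordinate question concrete, here is a minimal sketch of
the two ideas above. It is not the actual MafIndex code. The names
MafRecord, forward_interval and check_single_strand are made up for
illustration, and the minus-strand conversion assumes the usual MAF
convention that 'start' on a '-' line is counted from the start of the
reverse-complemented source sequence of length srcSize.

```python
from collections import namedtuple

# Hypothetical container for the fields of one MAF "s" line
# (not a Biopython class).
MafRecord = namedtuple("MafRecord", "src start size strand src_size")


def forward_interval(rec):
    """Return the half-open (start, end) interval on the forward strand.

    Plus-strand lines already count from the first forward base.
    Minus-strand lines count from the start of the reverse complement,
    so the interval is mirrored using the total source length.
    """
    if rec.strand == "+":
        return rec.start, rec.start + rec.size
    if rec.strand == "-":
        end = rec.src_size - rec.start
        return end - rec.size, end
    raise ValueError("unexpected strand: %r" % (rec.strand,))


def check_single_strand(records):
    """Raise ValueError if the reference records mix strands.

    Returns the shared strand, or None for an empty input.
    """
    strands = set(rec.strand for rec in records)
    if len(strands) > 1:
        raise ValueError("mixed strands for the reference species")
    return strands.pop() if strands else None


plus = MafRecord("hg18.chr7", 100, 10, "+", 1000)
minus = MafRecord("mm4.chr6", 100, 10, "-", 1000)
print(forward_interval(plus))
print(forward_interval(minus))
```

With something like this, a minus-strand hit in a non-reference species
could still be reported in forward-strand terms, while the reference
species is only checked for consistency.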

There could still be an issue if the best alignment for a particular
species swaps strands between alignment blocks and/or exons of a
transcript. It might be safe to say, though, that the user is interested
in the best alignment however it occurs, and not necessarily in strand
consistency.

WRT MultipleSeqAlignment objects produced by get_spliced(), all
annotation properties are lost upon slicing, so it is up to the user to
keep track of what's what. I do remember that we talked about a way to
maintain these annotations even after slicing. Any thoughts?

Thanks,
Andrew