[Biopython-dev] Start/end co-ordinates in SeqRecord objects?

Tue May 22 21:31:48 UTC 2012

Apologies, I accidentally hit send before finishing.

--

Peter,

It sort of seems like every letter in a sequence needs to have its own 
annotation, mapping it to its chromosome/sequence and position of 
origin. In this way, when multiple sequences are sliced and concatenated 
the annotation is preserved. For example,

a = GappedSeq("ATGATG")
                ^    ^
                |    chr1:6
                chr1:1

b = GappedSeq("GGG")
                ^ ^
                | chr1:502
                chr1:500

b = b.reverse_complement()

c = a + b = GappedSeq("ATGATGGGG")

Such that c[1].someproperty = "chr1[+]2" while chr[7].someproperty = 
"chr1[-]501". Strand information could be preserved on a per-letter 
basis and flipped from -1 to +1 upon reverse_complement(). The API could 
find and report contiguous stretches by analyzing these per-letter 
annotations, for example:

 >>> print c
GappedSeq('ATGATGGGG', someproperty=["chr1[+]1-6", "chr1[-]500-502"])

The issue of gaps and of translating multiple alignments of gapped 
sequences could be resolved by having a convention where gaps always 
belong to the right-nearest gap except in the case of right-terminal 
gaps. For example:

a = GappedSeq("----AGCG-ATG---")
                000001234456666

a[0] = GappedSeq("----A")
a[1] = GappedSeq("G")
a[4] = GappedSeq("-A")
a[6] = GappedSeq("G---")

A nucleotide triplet of this sequence would thus look like this:

a[:3] = GappedSeq("----AGC")
a[-3:] = GappedSeq("ATG---")

In the case of slicing a MultipleSeqAlignment of GappedSeq objects, 
there would have to be an "anchor" sequence (like there is in UCSC MAF 
files) with which other sequences in the alignment are sliced in 
reference to. For example:

a = GappedSeq("----AGCG-ATG---", id="a", anchor=True)
b = GappedSeq("AG--GG---ATAG--", id="b")
c = GappedSeq("A--CGG---ATAGGG", id="c")

d = GappedSeqAlignment([a, b, c])

 >>> print d[:,:3]
SingleLetterAlphabet() alignment with 3 rows and 7 columns
----AGC a, anchor=True
AG--GG- b
A--CGG- c

One problem with this might be how to translate the multiple 
alignment... in this case, should b and c have no translation?

Thanks,
Andrew

On 05/21/2012 12:37 PM, Peter Cock wrote:
> Hello all,
>
> This is something I talked to Bow a little about during our last
> weekly meeting for his GSoC meeting, but it is broader than
> just SearchIO...
>
> When describing  BLAST results, or FASTA alignments, or
> indeed many other local alignments you typically have a
> (gapped) query sequence and match sequence fragment,
> and the co-ordinates describing which part of the full query
> and matched sequence this is. i.e. You are told the start
> and end of the subsequence (and perhaps strand).
>
> The same essentially applies to some multiple alignment
> formats in AlignIO as well, including Stockholm/PFAM
> (where this is encoded into the record name as identifier
> slash start-end), FASTA output (which will be handled via
> SearchIO in future) and MAF.
>
> http://biopython.org/wiki/Multiple_Alignment_Format
>
> Indeed thinking about how best to handle this was the
> main reason I haven't merged Andrew's MAF branch yet.
>
> (There are subtleties, for instance how is the strand given
> in the file, do you get start+end explicitly or must the end
> be inferred from the start and the sequence, etc).
>
> Currently recording these in the SeqRecord's annotation
> dictionary 'works', but does not exploit the structure. In
> particular, if the SeqRecord is sliced to get a fragment
> of the alignment, this co-ordinate information is lost. It
> would be nice if this preserved the start/end/strand and
> updated it accordingly.
>
> One idea for doing this is to introduce a new location
> property to the SeqRecord (defaulting to None), which
> would be a FeatureLocation object normally used for
> SeqFeature objects.
>
> If an operation couldn't preserve or update the location,
> it would become None. Note that slicing we will generally
> need to know the gap characters of the sequence (in
> order to recalculate the sub-sequence's start/end), which
> for the parsers may mean some minor updates to ensure
> the default alphabet specifies the '-' gap character.
>
> On the order hand, perhaps this 'location' property idea
> is overly complicated?
>
> Maybe all we need is a common convention about which
> keys to use in the annotation dictionary, and how to store
> the information (e.g. Python counting, start<  end, and
> strand as +1 or -1 if present)?
>
> Thoughts or feedback please? Would a worked example
> help with my explanation?
>
> Thanks,
>
> Peter

-- 
Andrew Sczesnak
Bioinformatician, Littman Lab
Howard Hughes Medical Institute
New York University School of Medicine
540 First Avenue
New York, NY 10016

p: (212) 263-6921
f: (212) 263-1498
e: andrew.sczesnak at med.nyu.edu