[Bioperl-l] Bug #2936
Francisco J. Ossandón
fossandonc at hotmail.com
Mon Mar 18 03:09:17 UTC 2013
Wow, trans-splicing is a phenomena I was not aware about, thanks for telling
me. Then, the Guide Strand of the Split-location object must always check if
all the sublocations are present in the same strand; so if all sublocations
are positive the Guide have to be positive, if all are negative the Guide
have to be negative as well, and if there is a mix of strands in
sublocations the Guide have to be undef... The changes I made in flip_strand
didnt check for the strand consistency in the sublocations when setting the
Guide Strand, so I have to add that.
Ok, how about this, in the following lines I will analyze all possibilities
for location coordinates, step by step, from simpler to complex. Please
follow my reasoning in every step and tell me if there is something you
dont agree, so I can work on the code and use these cases as tests, and
then make another pull request. Lets see when there is a reversal of the
sublocations order needed and when is not. For this time, I will only check
cases where the DNA comes from the same chromosome/plasmid/contig
One thing to keep in mind is that the flipped coordinates should reconstruct
the reverse-complement (revcom) of the original sequence.
Example sequence used below (fictional 20bp circular plasmid):
AAAAACCCCCGGGGGTTTTT
### Reasoning: For the following cases, where ALL segments are in the same
strand, NO ORDER REVERSAL is needed because for the complement-strand
sequences the code can join all the positive sequences first and then
complement it complete at the final step. Also, since all segments are in
the same strand, Guide Strand have to be set with the same value than the
sublocations.
# Single location
Coordinates: 1..5 => AAAAA
Revcom: complement(1..5) => TTTTT
# Basic split, both locations in same strand
Coords: join(6..10,16..20) => CCCCCTTTTT
Revcom: complement(join(6..10,16..20)) => AAAAAGGGGG
# Cut by origin-split, same strand, single sequence that pass through origin
Coords: join(16..20,1..2) => TTTTTAA
Revcom: complement(join(16..20,1..2)) => TTAAAAA
# Cut by origin-combo split, same strand, 2 sequences with 1st passing
through origin
Coords: join(19..20,1..2,11..13) => TTAAGGG
Revcom: complement(join(19..20,1..2,11..13)) => CCCTTAA
# Cut by origin-combo split, same strand, 2 sequences with 2nd passing
through origin
Coords: join(6..10,19..20,1..4) => CCCCCTTAAAA
Revcom: complement(join(6..10,19..20,1..4)) => TTTTAAGGGGG
### Reasoning: For the following cases, where segments are in MIXED strands,
ORDER REVERSAL is needed because the code have to complement some segments
and others not, so it have to complement those segments first and then join
them at the last step. Also, since there is no strand consistency in all
segments, Guide Strand have to be set with undef value.
# Trans-splicing, 2 sequences in different strands, 2nd in complement
Coords: join(6..10,complement(16..20)) => CCCCCAAAAA
Revcom: join(16..20,complement(6..10)) => TTTTTGGGGG
# Trans-splicing, 2 sequences in different strands, 1st in complement
Coords: join(complement(16..20),6..10) => AAAAACCCCC
Revcom: join(complement(6..10),16..20) => GGGGGTTTTT
# Trans-splicing w/cut by origin, 2 sequences with 1st passing through
origin, 2nd in complement
Coords: join(19..20,1..3,complement(11..13)) => TTAAACCC
Revcom: join(11..13,complement(1..3),complement(19..20)) => GGGTTTAA
# Trans-splicing w/cut by origin, 2 sequences with 1st passing through
origin, 1st in complement
Coords: join(complement(1..3),complement(19..20), 11..13) => TTTAAGGG
Revcom: join(complement(11..13), 19..20,1..3) => CCCTTAAA
# Trans-splicing w/cut by origin, 2 sequences with 2nd passing through
origin, 2nd in complement
Coords: join(6..10,complement(1..2),complement(18..20),) => CCCCCTTAAA
Revcom: join(18..20,1..2, complement(6..10)) => TTTAAGGGGG
# Trans-splicing w/cut by origin, 2 sequences with 2nd passing through
origin, 1st in complement
Coords: join(complement(6..10),18..20,1..2) => GGGGGTTTAA
Revcom: join(complement(1..2),complement(18..20), 6..10) => TTAAACCCCC
####
As summary, Guide Strand must ensure that all the sublocations have the same
strand value set before using that value, and be undef if there is no
consistency (like in trans-splicing); so any new "add_sub_Location" should
check the other sublocations strands and set Guide Strand according to the
consistency. Establishing Guide Strand value from the beginning and updating
it accordingly with any change in the Split location object, should
correctly signal when to reverse the sublocations order and when not.
Thoughts??
Francisco J. Ossandon
-----Mensaje original-----
De: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de Fields,
Christopher J
Enviado el: sábado, 16 de marzo de 2013 11:02
Para: Peter Cock
CC: bioperl mailing list; Francisco J. Ossandón
Asunto: Re: [Bioperl-l] Bug #2936
On Mar 16, 2013, at 7:46 AM, Peter Cock <p.j.a.cock at googlemail.com>
wrote:
> On Fri, Mar 15, 2013 at 9:36 PM, Francisco J. Ossandón wrote:
>>
>> ... Another thing that
>> is weird to me is that the sublocations are free to have different
>> strands values (like the first being positive strand and the second
>> being negative strand), since I can't think of one example where that
>> can happen in real genomes. In fact one of the tests in PrimarySeq.t
>> is designed exactly to have sublocations in opposite strands at the
>> same time and then extract the sequence, so I wonder if I'm wrong and
there are real cases like that...
>>
>
> This is a real biological phenomena - trans-splicing, often in tRNA
> genes, for example:
> http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation#transplice
> d
>
> As a result the BioPerl / BioSQL / Biopython etc location models do
> have to cope with this corner case. Worse, there are examples where
> pieces from different chromosomes are spliced together - which is even
> harder to deal with - like my favourite pathological example, nad1 in
> NC_016406 (and NC_016402), which has the following GenBank location
string:
>
> join(complement(149815..150200),
> complement(295492..295573),complement(293787..293978),
> NC_016402.1:6618..6676,181647..181905)
>
> See also:
> http://blastedbio.blogspot.co.uk/2012/03/missing-external-exons-in-gen
> bank-with.html
>
> Peter
Exactly. So, in this implementation the problem is the internal logic used
for Bio::Location deals with cases like this, but there is an assumption
made to deal with some cases like this that is wrong for others (say,
circular chromosomes where a feature wraps the origin, where sorting messes
things up). It really needs a top to bottom overhaul to be explicit about
the join, but such an overhaul will break some expected behavior, hence my
suggestion of putting this in a bioperl v2.
chris
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list