[Biojava-dev] Using DNASequence reverseComplement
Andy Yates
ayates at ebi.ac.uk
Tue May 18 09:16:32 UTC 2010
Hey Trevor,
So first things first have a look at the current SVN status (or GitHub) as I did a commit last night bringing in a lot of bugfixes in & throwing more unsupported errors I haven't had time to implement (mostly to do with the stranded version of sequence as string). If that doesn't work then let me know asap and I'll take your example as another test case to make sure we don't regress.
There's also some new bits of code in there which may help you out. There's a Sequence which will return the same compound for a given length no matter what you call out to (something I thought may be of some use for large runs of Ns) without consuming the space for that. There's also some code in there for working with Edits and for handling translations which don't translate well; a feature of Ensembl databases and something Scooter was after.
It's a pity it needs that but maybe there is something we can do to help yourself out and lock down the access for others (some kind of hint if you allow this level of access).
I'm glad that you're happy we can accommodate your needs. If you're happy with what's been done would you be happy to merge these changes & publish them to github as a fork? If not fair enough :)
Andy
On 18 May 2010, at 10:00, PATERSON Trevor wrote:
> Hi andy
>
> I know this is all work in progress so I'm not too hung up on everything not working out the box..
>
> What I am doing is stitching together contigs into an assembly, so I have an AssembledDNASequence Object,
> that extends DNASequence but defers its method to an Assembly Object which contains a Map of DNASequences
> with ranges and orientations. This map is created from a single SQL call, and is populated with DNASequence
> Objects that lazy load their actual sequence data from the database when required
>
> The only way I can check that the map is giving my the correct assembly is to print out
> the sequence as string of each component in order,
> ie
> Frag 1, from 1 to 200
> Gap of 150 N
> Frag 2, from 1 to 5000 REVERSED
> Frag 3, from 150 to 6000
> Gap of 2000 N
>
> etc
>
> I think your suggestion actually demonstrates the 'bug' ( or 'feature' ;)
>
> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString()
> returns the whole reverse complement - not just 1-4: AACCCGGGGTTTTT
> Because the getSequenceAsString is looking at the parent of the view :
>
> Something even more horrible happens if you try to bound that query
> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString(1,4,Strand.POSITIVE)
> Returns AAAA ( ie this is 1-4 of the parent, not even the reverse complement.)
> And perhaps more interestingly
> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString(1,4,Strand.NEGATIVE))
> Also returns AAAA (which I just plain can't understand..)
>
> ___________________________________________________________________________________
>
> Anyway - I am probably at a stage where I don't want to do any more development on this at the moment...
> I am reasonably happy that an EnsemblAPI will be able to use the BioJava Sequence objects down the road.
>
> I have made an EnsemblDNASequenceReader that extends and implements ProxySequenceReader
> This is used to create a DataSourceAwareDNASequence and responsible for lazy loading the actual sequence data
> from Ensembl when required
>
> To do this I needed to give the AbstractSequences's (SequenceReader) sequenceStorage property and its getter protected rather
> than private access, because there needs to be exchange of info from one to the other
>
> I will need to implement some sort of chunking and caching for retrieval of larger sequences, and maybe even pass
> all the methods to a buffered reader for large sequences, but I'm not going to worry about this at the moment.
>
> There are plenty of other things for me to protoype in ensembl at the moment!
>
> Cheers Trevor
>
>
>
>
>
>
>
>
>> -----Original Message-----
>> From: Andy Yates [mailto:andyyatz at gmail.com] On Behalf Of Andy Yates
>> Sent: 17 May 2010 17:08
>> To: PATERSON Trevor
>> Cc: biojava-dev at lists.open-bio.org
>> Subject: Re: [Biojava-dev] Using DNASequence reverseComplement
>>
>> Hi Trevor,
>>
>> You've stumbled right into something myself & Scooter are
>> trying to clean-up now. The assumption I had originally made
>> is that all operations on things like getSequenceAsString()
>> would go via the view since that's where the logic is located
>> for both the reversing & the complementing code. That call
>> now delegates onto the backing store and not the view which
>> means you get these very odd results happening.
>>
>> For the moment I think the following code would do what
>> you're expecting:
>>
>> seq.getReverseComplement().getSubSequence(1,4).getSequenceAsString()
>>
>> It's annoying because you're in the reverse coordinate system
>> you've got to reverse the original coordinates so asking for
>> position 11,14 just isn't going to work. The other way of
>> working with this would be to construct the views yourself
>> and pass in a subsequence of the original sequence i.e.
>>
>> new ReversedSequenceView<NucleotideCompound>(new
>> ComplementSequenceView<NucleotideCompound>(seq.getSubSequence(11,14));
>>
>> This is really a problem with the erasure of the Sequence
>> types from DNASequence. If DNASequence returned the same type
>> from its subsequence method then you would just call revComp
>> on that and it would have been fine.
>>
>> The thing to take away from are:
>>
>> * getSequenceAsString(Integer, Integer, Strand) is not well
>> supported atmo
>>
>> * So long as we are sure it should remain it will be
>>
>> * There should be no reason to materialise the Sequence into
>> a String to get a part of the API working. If there is then
>> we've messed up
>>
>> Andy
>>
>> p.s. The strand stuff is confusing; originally it was meant
>> to be +ve & -ve strands but assumed that the Sequence you had
>> was always on the +ve strand. Eventually the meaning will
>> come back but will require the methods to be more aware of
>> the strand DNA is on to make the right call about what you
>> want to do. This all ties in with circular genomes support
>> and locations
>>
>> On 17 May 2010, at 16:24, PATERSON Trevor wrote:
>>
>>> Sorry for raising that behemoth earlier..
>>>
>>> I have a separate problem with the DNASequence API -
>>>
>>> Probably I just don't understand how to use the View objects
>>>
>>>
>>> If I make a DNASequence
>>>
>>> DNASequence seq = new DNASequence("AAAAACCCCGGGTT");
>>>
>>> i.e. length = 14,
>>>
>>> I might reasonably want to get the ReverseComplement of
>> bases 11-14, which should 'be' "AACC"
>>>
>>> But I cannot manage to get this in one easy step....
>>>
>>> seq.toString(): AAAAACCCCGGGTT --> FINE
>>>
>>> seq.getReverseComplement().getSequenceAsString():
>> AACCCGGGGTTTTT -->
>>> FINE
>>>
>>> But when I try to use bounds on this complement - methods
>> refer back
>>> to the original seq's iterator, not the complement
>>>
>>>
>> seq.getReverseComplement().getSequenceAsString(11,14,Strand.PO
>> SITIVE): GGTT
>>> i.e the same as seq.getSequenceAsString(11,14,Strand.POSITIVE)
>>>
>> seq.getReverseComplement().getSequenceAsString(11,14,Strand.NE
>> GATIVE): TTGG
>>> i.e the same as seq.getSequenceAsString(11,14,Strand.NEGATIVE)
>>>
>>> Is this the desired behaviour? How would I get the desired
>> reverseComplement fragment?
>>>
>>> The only obvious way that I can see is
>>>
>>> DNASequence subseq = new
>> DNASequence(seq.getSequenceAsString(11, 14, Strand.POSITIVE));
>>> System.out.println(""+
>>> subseq.getReverseComplement().getSequenceAsString());
>>>
>>>
>> ______________________________________________________________________
>>> _______________________
>>>
>>> On a related point I was mightily confused by the
>>> Strand.POSITIVE/Strand.NEGATIVE enumeration
>>>
>>> I was naively interpreting them to refer to the strand of the DNA:
>>> Whereas they infact refer to the directionality of the Iterator *on
>>> the same Strand*
>>>
>>> A better name might be Direction:FORWARDS/Direction.BACKWARDS?
>>> Positive and negative strand has loaded biological meaning
>> for newbies
>>> like me ( sense versus antisense ) So I made the assumption that a
>>> Strand.NEGATIVE call would itself reverseComplement
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>>
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>> --
>> Andrew Yates Ensembl Genomes Engineer
>> EMBL-EBI Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
More information about the biojava-dev
mailing list