[Biojava-l] How to find a sequence within a larger sequence and flip it

Mark Schreiber markjschreiber at gmail.com
Fri Sep 19 12:43:44 UTC 2008


Hi -

You don't have to convert to a String to find a match. The
SymbolListCharSequence class wraps a SymbolList as a CharSequence, which lets
you run regular expressions etc. against it to identify the match. You can
also use KnuthMorrisPrattSearch to find exact matches.
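
To illustrate the CharSequence route, here is a rough sketch in plain Java.
It deliberately uses no BioJava classes so that it runs on its own; a String
stands in for the CharSequence, where in real code you would presumably pass
a SymbolListCharSequence wrapping your SymbolList. The names FlipSketch,
indexOf and flip are made up for this example, and "flip" here means
end-for-end reversal as in Doug's example, not reverse complement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of BioJava.
class FlipSketch {

    // Returns the 0-based start of the first literal occurrence of motif
    // in seq, or -1 if there is none. Works on any CharSequence.
    static int indexOf(CharSequence seq, CharSequence motif) {
        Pattern p = Pattern.compile(Pattern.quote(motif.toString()));
        Matcher m = p.matcher(seq);
        return m.find() ? m.start() : -1;
    }

    // Returns seq with the first occurrence of motif reversed end-for-end.
    // If motif does not occur, seq is returned unchanged.
    static String flip(CharSequence seq, CharSequence motif) {
        int start = indexOf(seq, motif);
        if (start < 0) {
            return seq.toString();
        }
        int end = start + motif.length();
        StringBuilder out = new StringBuilder(seq.length());
        out.append(seq, 0, start);                       // part before the match
        out.append(new StringBuilder(motif).reverse());  // flipped match
        out.append(seq, end, seq.length());              // part after the match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(flip("aaaagacttttt", "gact"));
    }
}
```

Running main on the sequences from the original question prints
aaaatcagtttt, which is the goal sequence Doug gave.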

Finally, to find non-exact matches you can use the SmithWaterman or
NeedlemanWunsch alignment classes.
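
For anyone unfamiliar with the idea behind local alignment, here is a
minimal sketch of Smith-Waterman scoring in plain Java. It is not the BioJava
API: the class name LocalAlignSketch, the method bestLocal and the scores
(match +2, mismatch -1, linear gap -2) are assumptions chosen only for
illustration, and it reports just the best score and where that alignment
ends rather than doing a full traceback.

```java
// Hypothetical helper class, not part of BioJava.
class LocalAlignSketch {

    // Illustrative scoring values only; real work would use a proper
    // substitution matrix and gap model.
    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    // Returns {bestScore, endInText, endInQuery}. The end positions are
    // 1-based and inclusive; {0, 0, 0} means no positive-scoring alignment.
    static int[] bestLocal(String text, String query) {
        int[][] h = new int[text.length() + 1][query.length() + 1];
        int best = 0, bestI = 0, bestJ = 0;
        for (int i = 1; i <= text.length(); i++) {
            for (int j = 1; j <= query.length(); j++) {
                int sub = text.charAt(i - 1) == query.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + sub;  // align the two symbols
                int up = h[i - 1][j] + GAP;        // gap in the query
                int left = h[i][j - 1] + GAP;      // gap in the text
                // Local alignment: a cell never drops below zero.
                int score = Math.max(0, Math.max(diag, Math.max(up, left)));
                h[i][j] = score;
                if (score > best) {
                    best = score;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {best, bestI, bestJ};
    }
}
```

With these scores an exact occurrence of a four-symbol query scores 8, and a
query that differs from the target at one position still scores above zero,
which is what makes the approach useful for non-exact matches.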

- Mark

On Fri, Sep 19, 2008 at 4:42 PM, Richard Holland
<holland at eaglegenomics.com> wrote:

> Hello.
>
> To be honest, I think you've already got the only way to quickly
> locate a subsequence within a sequence. For whatever reason, the
> Sequence and SymbolList interfaces lack any kind of indexOf() or
> find() functions, and the SequenceTools class, usually the provider of
> all things useful, also fails to fill the gap.
>
> You're right about there being a SymbolList edit facility. This only
> works on SymbolLists that have declared themselves editable, which
> will depend on how your SymbolList objects were created. What you do
> is create a new Edit object, based on starting position in the
> original sequence, length of sequence to remove in the original, and
> the SymbolList you want to use to replace the removed bits. Then you
> pass this to the edit() method on the SymbolList/Sequence object you
> want to replace.
>
> So, the end result is only a small improvement on your original plan,
> but here goes:
>
>  1. Create your sequence.
>  2. Create your other sequence.
>  3. Convert both to strings and use an indexOf in the String object to
> locate the subsequence in the original sequence.
>  4. Use string tools to flip the subsequence then create a new
> SymbolList based on it.
>  5. If the original sequence is editable, use the Edit method
> described above to replace a chunk of it with the new flipped
> subsequence. Otherwise, construct a new string using the String object
> methods and construct a new original sequence based on that instead.
>
> cheers.
> Richard
>
> 2008/9/19 Doug Swisher <big.swish at gmail.com>:
> > Hi,
> >
> > I'm pretty new to BioJava, and I'm a bit stuck.  I'm hoping someone can
> > help out a bit...even if it's just a hint as to where to look next.
> >
> > I have a long DNA sequence and a shorter sequence that exists within the
> > larger one.  I want to find the location of the smaller sequence within
> > the larger one, and then create a new sequence with the small one flipped
> > end-for-end.  That's confusing, so let me give an example.
> >
> > Long sequence: aaaagacttttt
> > Short sequence: gact
> > Goal sequence: aaaatcagtttt
> >
> > To find the location of the short sequence within the larger one, I could
> > certainly do some string manipulation:
> >
> >    SymbolList bigDNA = DNATools.createDNA("aaaagacttttt");
> >    SymbolList subDNA = DNATools.createDNA("gact");
> >    int start = bigDNA.seqString().indexOf(subDNA.seqString());
> >
> > While that would work, I'm wondering if there is a more efficient method
> > that avoids the conversion to strings (in my real code, I start with
> > Sequences, not strings; I used SymbolLists here for simplicity).
> >
> > To "excise" the short sequence, flip it around, and construct a new
> > SymbolList, I could also do some string manipulation, as in the
> > following:
> >
> >    StringBuilder middle = new StringBuilder(subDNA.seqString());
> >    // Everything before the match: the end index is start, not subDNA.length()
> >    String leftPart = bigDNA.seqString().substring(0, start);
> >    String rightPart = bigDNA.seqString().substring(start + subDNA.length(),
> >        bigDNA.length());
> >    SymbolList goalDNA = DNATools.createDNA(leftPart + middle.reverse() +
> >        rightPart);
> >
> > Looking at the documentation, such as ProjectionUtils or
> > SymbolList.edit(), it appears there might be some support for
> > manipulating the sequence directly.  Is there a way to do it, without
> > again dropping "down" to strings?
> >
> > Thanks in advance for any assistance.
> >
> > Cheers,
> > -Doug
> >
> > P.S. Yeah, the second code snippet is pretty inefficient; I was trying to
> > be clear rather than efficient.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/


