[Biojava-l] K-mers
jitesh dundas
jbdundas at gmail.com
Sat Oct 30 09:40:35 UTC 2010
I got your point, Andy. Thanks.
On Sat, Oct 30, 2010 at 2:50 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> You should be aware I just found a bug in the code. This has been fixed but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or if Andreas can post up the continuous
> integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
> > That is good news. Thanks for the directions, Andy.
> >
> > I have already started on this. Let me analyze and write the code now.
> >
> > Maybe a next month deadline is not unreachable in this case.
> >
> > Here we go!
> > JD
> >
> > On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> So we've got some basic kmer work now in SVN. If you look in the class
> >> SequenceMixin there are two static methods there for generating the two
> >> types of k-mers. It's not developed with Map storage in mind & I'll leave
> >> the door open there for anyone else to come in & develop it. The k-mers are
> >> also not unique across the sequence but it's a start :)
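
For anyone picking up the Map storage idea, here is a minimal sketch of how duplicate k-mers could be collapsed into counts. It works on a plain String rather than a BioJava Sequence, and the class and method names (KmerCountSketch, countKmers) are made up for illustration; they are not part of SequenceMixin or any BioJava API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava: counts overlapping k-mers in a plain String.
final class KmerCountSketch {

    // Returns each distinct k-mer mapped to the number of times it occurs.
    // LinkedHashMap keeps the k-mers in order of first appearance.
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Slide a window of width k along the sequence, one base at a time.
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // ATGATG has four overlapping 3-mers, but ATG occurs twice.
        System.out.println(countKmers("ATGATG", 3)); // prints {ATG=2, TGA=1, GAT=1}
    }
}
```

Each substring call copies k characters, so this trades memory for simplicity; it is meant to show the counting step only.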
> >>
> >> Share & enjoy!
> >>
> >> Andy
> >>
> >> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
> >>
> >>> I agree Andy. These have become standard functionalities that
> >>> scientists do these days. I am all for implementing that in BioJava3.
> >>> Java isn't that efficient for such functionalities so we will surely
> >>> need more effort compared to the same in Python/Perl.
> >>>
> >>> Regards,
> >>> Jitesh Dundas
> >>>
> >>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> So if it's a suffix tree that's quite a fixed data structure, so
> >>>> developing a pluggable mechanism there would be hard. I think there also
> >>>> has to be a limit as to what we can sensibly do. If people want to
> >>>> contribute this kind of work though then it'll all be very well received
> >>>> (with the corresponding test environment/cases of course).
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Andy
> >>>>
> >>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
> >>>>
> >>>>> It might be useful to make the K-mer storage mechanism pluggable. This
> >>>>> would allow a developer to use anything from a simple MultiMap to a NoSQL
> >>>>> key-value database to store K-mers. You could plug in custom map
> >>>>> implementations to allow you to keep a count of the number of instances of
> >>>>> particular K-mers that were found. It might also be useful to be able to do
> >>>>> set operations on those K-mer collections. You could use it to determine
> >>>>> which K-mers were present in a pathogen and not in a host.
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
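
A rough sketch of what such a pluggable store could look like, assuming plain String k-mers. The interface and class names (KmerStore, InMemoryKmerStore, KmerStoreSketch) are hypothetical and do not exist in BioJava. The in-memory HashMap backend stands in for whatever a developer might plug in, and onlyIn shows the "in the pathogen but not in the host" set difference.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical interface: an in-memory map, a key-value database, etc. could implement it.
interface KmerStore {
    void add(String kmer);
    int count(String kmer);
    Set<String> kmers();
}

// Simplest possible backend: a HashMap held in memory.
final class InMemoryKmerStore implements KmerStore {
    private final Map<String, Integer> counts = new HashMap<>();

    public void add(String kmer) { counts.merge(kmer, 1, Integer::sum); }
    public int count(String kmer) { return counts.getOrDefault(kmer, 0); }
    public Set<String> kmers() { return new HashSet<>(counts.keySet()); }
}

final class KmerStoreSketch {

    // Fills the given store with every overlapping k-mer of the sequence.
    static KmerStore index(String sequence, int k, KmerStore store) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= sequence.length(); i++) {
            store.add(sequence.substring(i, i + k));
        }
        return store;
    }

    // Set difference: k-mers present in 'a' but absent from 'b'.
    static Set<String> onlyIn(KmerStore a, KmerStore b) {
        Set<String> result = new HashSet<>(a.kmers()); // copy, so the store is not modified
        result.removeAll(b.kmers());
        return result;
    }
}
```

Usage would be along the lines of indexing a pathogen sequence and a host sequence into two stores and calling KmerStoreSketch.onlyIn(pathogen, host). Because the callers only see the interface, the HashMap could be swapped for another backend without touching them.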
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> card.ly: <http://card.ly/phidias51>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com> wrote:
> >>>>>
> >>>>>> Hi Andy,
> >>>>>>
> >>>>>> This is good to have. I feel that including it as a part of core may not be
> >>>>>> necessary but having it as part of the Genomic module in biojava3 would be
> >>>>>> nice. There is a project, Bioinformatica
> >>>>>> (http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence),
> >>>>>> which does something similar although not exactly. It counts the k-mers in a
> >>>>>> given fasta file but it does not count k-mers for each sequence within the
> >>>>>> file, just all within a file. This is a good feature to have, especially if
> >>>>>> one is trying to find patterns within sequences, which is what I am trying to
> >>>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
> >>>>>> that counts k-mer frequency for each sequence. The way to go would be to use
> >>>>>> suffix trees. Again, I don't know if biojava has a suffix tree api or not
> >>>>>> since I haven't used java in a while and am just switching back to it. A
> >>>>>> paper on using suffix trees to generate genome-wide k-mer frequencies is:
> >>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
> >>>>>> software is tallymer). It would be some work to implement this in java as a
> >>>>>> module for biojava3 but I can see that this will be helpful. Again, for small
> >>>>>> fasta files, it might not be efficient to create a suffix tree but for bigger
> >>>>>> files, I think that might be the way to go.
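
To make the "per sequence, not per file" distinction concrete, here is a small sketch in plain Java. It is not a suffix tree and not BioJava code: it uses a hand-rolled FASTA reader and a map per record, and the names (FastaKmerSketch, countPerSequence) are invented for this example. It assumes well-formed FASTA where every record starts with a '>' header line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava.
final class FastaKmerSketch {

    // Maps each FASTA header (without '>') to that record's own k-mer counts.
    static Map<String, Map<String, Integer>> countPerSequence(Reader fasta, int k)
            throws IOException {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Map<String, Integer>> perSequence = new LinkedHashMap<>();
        BufferedReader in = new BufferedReader(fasta);
        String header = null;
        StringBuilder residues = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    perSequence.put(header, count(residues, k));
                }
                header = line.substring(1);
                residues.setLength(0);
            } else if (header != null) {
                // Sequence lines of one record are joined before counting.
                residues.append(line);
            }
        }
        if (header != null) {
            perSequence.put(header, count(residues, k));
        }
        return perSequence;
    }

    // Counts overlapping k-mers; TreeMap keeps them sorted for readable output.
    private static Map<String, Integer> count(CharSequence seq, int k) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.subSequence(i, i + k).toString(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Passing a java.io.FileReader or StringReader gives one count table per record. For genome-sized input a suffix-tree or similar index as described above would likely scale better than one map entry per k-mer.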
> >>>>>>
> >>>>>> That's just my two cents. What do you think?
> >>>>>>
> >>>>>> -vishal
> >>>>>>
> >>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>>>>
> >>>>>>> Hi Vishal,
> >>>>>>>
> >>>>>>> As far as I am aware there is nothing which will generate them in
> >>>>>>> BioJava
> >>>>>>> at the moment. However it is possible to do it with BioJava3:
> >>>>>>>
> >>>>>>> public static void main(String[] args) {
> >>>>>>>     DNASequence d = new DNASequence("ATGATC");
> >>>>>>>     System.out.println("Non-Overlap");
> >>>>>>>     nonOverlap(d);
> >>>>>>>     System.out.println("Overlap");
> >>>>>>>     overlap(d);
> >>>>>>> }
> >>>>>>>
> >>>>>>> public static final int KMER = 3;
> >>>>>>>
> >>>>>>> //Generate triplets overlapping
> >>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>>>>>     List<WindowedSequence<NucleotideCompound>> l =
> >>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>>>>>     for(int i=1; i<=KMER; i++) {
> >>>>>>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>>>>>             i, d.getLength());
> >>>>>>>         WindowedSequence<NucleotideCompound> w =
> >>>>>>>             new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>>>>>         l.add(w);
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     //Will return ATG, ATC, TGA & GAT
> >>>>>>>     for(WindowedSequence<NucleotideCompound> w: l) {
> >>>>>>>         for(List<NucleotideCompound> subList: w) {
> >>>>>>>             System.out.println(subList);
> >>>>>>>         }
> >>>>>>>     }
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Generate triplet Compound lists non-overlapping
> >>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>>>>>     WindowedSequence<NucleotideCompound> w =
> >>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>>>>>     //Will return ATG & ATC
> >>>>>>>     for(List<NucleotideCompound> subList: w) {
> >>>>>>>         System.out.println(subList);
> >>>>>>>     }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The disadvantage of all of these solutions is that they generate lists of
> >>>>>>> Compounds so kmer generation can/will be a memory intensive operation. This
> >>>>>>> doesn't mean it has to be, since sub sequences are thin wrappers around an
> >>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
> >>>>>>> iterates through each window rather than stepping through, delegating onto
> >>>>>>> each base in turn (hence why we get ATG & ATC before TGA).
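
The ordering difference can be illustrated with plain Strings. This is only a sketch: the names (KmerOrderSketch, byFrame, byPosition) are hypothetical, indices are 0-based here whereas the BioJava calls above are 1-based, and substring copies characters rather than wrapping the underlying sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
final class KmerOrderSketch {

    // Frame by frame: all windows starting at offset 0, then offset 1, and so on.
    // This mirrors the order produced by building one windowed view per offset.
    static List<String> byFrame(String seq, int k) {
        requirePositive(k);
        List<String> out = new ArrayList<>();
        for (int frame = 0; frame < k; frame++) {
            for (int i = frame; i + k <= seq.length(); i += k) {
                out.add(seq.substring(i, i + k));
            }
        }
        return out;
    }

    // Base by base: a single pass, so windows come out in sequence order.
    static List<String> byPosition(String seq, int k) {
        requirePositive(k);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    private static void requirePositive(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
    }

    public static void main(String[] args) {
        System.out.println(byFrame("ATGATC", 3));    // prints [ATG, ATC, TGA, GAT]
        System.out.println(byPosition("ATGATC", 3)); // prints [ATG, TGA, GAT, ATC]
    }
}
```

Both methods return the same k-mers; only the order differs, which matters if positions along the sequence need to be tracked.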
> >>>>>>>
> >>>>>>> As for unique k-mers that's something which would require a bit more
> >>>>>>> engineering & would be better suited to a solution built around a Trie
> >>>>>>> (prefix tree).
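
A very small sketch of the Trie idea, again on plain Strings and with invented names (KmerTrieSketch, addAll, uniqueKmers); nothing like this is claimed to exist in BioJava. Each k-mer is a path of k characters from the root, shared prefixes are stored once, and the node at the end of the path holds the count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava.
final class KmerTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // greater than 0 only on nodes that end a stored k-mer
    }

    private final Node root = new Node();

    // Inserts every overlapping k-mer of the sequence into the trie.
    void addAll(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= seq.length(); i++) {
            Node node = root;
            for (int j = i; j < i + k; j++) {
                node = node.children.computeIfAbsent(seq.charAt(j), c -> new Node());
            }
            node.count++;
        }
    }

    // Number of times the k-mer was inserted, or 0 if it was never seen.
    int count(String kmer) {
        Node node = root;
        for (int j = 0; j < kmer.length() && node != null; j++) {
            node = node.children.get(kmer.charAt(j));
        }
        return node == null ? 0 : node.count;
    }

    // Distinct k-mers in lexicographic order (TreeMap children are sorted).
    List<String> uniqueKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }
}
```

For example, after addAll("ATGATG", 3) the distinct k-mers are ATG, GAT and TGA, with ATG counted twice. A production version would probably use a fixed four-way child array for DNA rather than a map per node.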
> >>>>>>>
> >>>>>>> Hope this helps,
> >>>>>>>
> >>>>>>> Andy
> >>>>>>>
> >>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
> >>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> >>>>>>>> counts for every sequence in a fasta file. If something like this exists it
> >>>>>>>> would save me some time to write the code.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Vishal
> >>>>>>>> _______________________________________________
> >>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>
> >>>>>>> --
> >>>>>>> Andrew Yates Ensembl Genomes Engineer
> >>>>>>> EMBL-EBI Tel: +44-(0)1223-492538
> >>>>>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
> >>>>>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> *Vishal Thapar, Ph.D.*
> >>>>>> *Scientific informatics Analyst
> >>>>>> Cold Spring Harbor Lab
> >>>>>> Quick Bldg, Lowe Lab
> >>>>>> 1 Bungtown Road
> >>>>>> Cold Spring Harbor, NY - 11724*
> >>>>
> >>
>