From asandro1501 at gmail.com  Fri Oct  1 12:52:50 2010
From: asandro1501 at gmail.com (Alex Silva)
Date: Fri, 1 Oct 2010 13:52:50 -0300
Subject: [Biojava-l] Help files genbank
Message-ID: <AANLkTintytv9WHwLAja6mUYWrP1YBG-a+sL9fxX7iq8n@mail.gmail.com>

Hi

I am asking again for help reading a file format in genbank, I need to do
the analysis of the headers. I could not use any because I am a beginner in
java. Does anyone have some code that you used for this?


Em portugu?s

Estou solicitando novamente uma ajuda para leitura de arquivos no formato
genbank, preciso fazer a analise dos cabe?alhos. N?o consegui utilizar
nenhum porque sou iniciante em java. Algu?m tem algum c?digo que tenha
utilizado para isso?

-- 
Alex Silva
G.R.A. Sistemas Corporativos
msn: gra.sistemas at hotmail.com
55-9165-7378


From holland at eaglegenomics.com  Fri Oct  1 12:56:09 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 1 Oct 2010 17:56:09 +0100
Subject: [Biojava-l] Help files genbank
In-Reply-To: <AANLkTintytv9WHwLAja6mUYWrP1YBG-a+sL9fxX7iq8n@mail.gmail.com>
References: <AANLkTintytv9WHwLAja6mUYWrP1YBG-a+sL9fxX7iq8n@mail.gmail.com>
Message-ID: <D24FA959-4A56-47E8-B326-B6CEFE893ECC@eaglegenomics.com>

This is a good starting point: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_and_writing_files.


On 1 Oct 2010, at 17:52, Alex Silva wrote:

> Hi
> 
> I am asking again for help reading a file format in genbank, I need to do
> the analysis of the headers. I could not use any because I am a beginner in
> java. Does anyone have some code that you used for this?
> 
> 
> 
> 
> Em portugu?s
> 
> Estou solicitando novamente uma ajuda para leitura de arquivos no formato
> genbank, preciso fazer a analise dos cabe?alhos. N?o consegui utilizar
> nenhum porque sou iniciante em java. Algu?m tem algum c?digo que tenha
> utilizado para isso?
> 
> -- 
> Alex Silva
> G.R.A. Sistemas Corporativos
> msn: gra.sistemas at hotmail.com
> 55-9165-7378
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From pjotr.public23 at thebird.nl  Sat Oct  2 05:15:06 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Sat, 2 Oct 2010 11:15:06 +0200
Subject: [Biojava-l] BioJava <-> R
Message-ID: <20101002091506.GA17702@thebird.nl>

Anyone here who has real experience using the JRI? Who would be
interested, and have some exposure to, invoking R from Java through a
native interface in bioinformatics?

Pj.

From hlapp at drycafe.net  Sat Oct  2 21:26:49 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 2 Oct 2010 21:26:49 -0400
Subject: [Biojava-l] BioJava <-> R
In-Reply-To: <20101002091506.GA17702@thebird.nl>
References: <20101002091506.GA17702@thebird.nl>
Message-ID: <74DF3E4D-FC22-4719-9E6B-08248B14D4AA@drycafe.net>

We use this in the Mesquite<->R bridge. I haven't worked much on the  
Java to R side, but it seems to work well.

http://mesquiteproject.org/packages/Mesquite.R/

	-hilmar

On Oct 2, 2010, at 5:15 AM, Pjotr Prins wrote:

> Anyone here who has real experience using the JRI? Who would be
> interested, and have some exposure to, invoking R from Java through a
> native interface in bioinformatics?
>
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From andrew.mcsweeny at rockets.utoledo.edu  Tue Oct 12 17:41:07 2010
From: andrew.mcsweeny at rockets.utoledo.edu (McSweeny, Andrew J)
Date: Tue, 12 Oct 2010 21:41:07 +0000
Subject: [Biojava-l] How to share code while protecting copyrights?
Message-ID: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>

Hi,

I am working on a project which simulates sexual reproduction in a population of digital organisms.  Their genome is just a contig from hg18.  It's pretty interesting and I can talk more about it in the future....

Anyways, how can I share my code for this project without having to worry that someone else will use it to publish a paper before my group does?

I'm certain nobody in the open source community would do that, but how do I convince my group that opening our project to BioJava is a good idea?

-Andrew


From andreas at sdsc.edu  Wed Oct 13 02:02:34 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 12 Oct 2010 23:02:34 -0700
Subject: [Biojava-l] biojava 3.0 release plan
Message-ID: <AANLkTi=9X-uUEXJPk=To36nxj6JJBr4xazE09jW3RTso@mail.gmail.com>

Hi,

BioJava 3 has matured massively in SVN during this year and it is time to
prepare a first release. I propose the following release plan. See also two
other topics for discussion below.

Release Plan 3.0

* Alpha release build(s)
  during the next days I will start to provide a first alpha release build.
This will be followed by semi-regular follow up alpha builds (depending on
SVN activity)

- During the next weeks any missing features should be committed to SVN.
 Refactoring of code can still be done during this time.
- Add and update documentation in wiki
- Module maintainers: check compile warnings for your modules in automated
builds. Make sure no compile warnings are being displayed.


* Beta release build(s)
  the first beta release is scheduled for the weekend Nov 21st.

- From this point on only minor changes (bug fixes) should be added to the
code base
- Module maintainers: check and update javadoc for your modules

* Release 3.0
  The 3.0 Release is scheduled for Dez 12th


There are two things we should still discuss:

* backwards compatibility:
the current "core" module contains tons of legacy 1.7 code. Shall I go ahead
and delete this module?

* documentation:
The wiki contains tons of documentation for 1.7 which will not be useful for
3.0. As a procedure for cleaning this up and avoiding confusion I suggest to
move all 1.7 related docu into a special section of the wiki. All toplevel
links to documentation should point to 3.0. Any other suggestions?


Andreas

From markjschreiber at gmail.com  Wed Oct 13 05:26:04 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:26:04 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
Message-ID: <AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>

Hi -

My understanding of copyright is that it is yours as soon as you assert that
it is your creation. You can simply add a copyright statement to each file
containing the code (in the header for example). The reality is that
defending copyright is your responsibility. If someone violates it, you have
to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it can be
used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick one
of these that best matches your needs. BioJava code is LGPL so if you want
your code to go into the BioJava code base you will need to make your code
LGPL.

It's always a good idea to add @author tags to Java code to ensure
appropriate attribution.

Finally, if someone steals your code and publishes results before you then
you can always make a complaint to the journal editors. If it is a reputable
journal, and you have reasonable proof the editor should take some action
such as forcing a retraction.  You can also make a distribution agreement
saying that if someone uses this code they agree not to publish without
first consulting you.

If you want to make it really water tight, get a lawyer and explain
specifically what you want to share and what you want to protect or prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms.  Their genome is just a contig from hg18.
>  It's pretty interesting and I can talk more about it in the future....
>
> Anyways, how can I share my code for this project without having to worry
> that someone else will use it to publish a paper before my group does?
>
> I'm certain nobody in the open source community would do that, but how do I
> convince my group that opening our project to BioJava is a good idea?
>
> -Andrew
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From markjschreiber at gmail.com  Wed Oct 13 05:28:05 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:28:05 +0200
Subject: [Biojava-l] [Biojava-dev] biojava 3.0 release plan
In-Reply-To: <AANLkTi=9X-uUEXJPk=To36nxj6JJBr4xazE09jW3RTso@mail.gmail.com>
References: <AANLkTi=9X-uUEXJPk=To36nxj6JJBr4xazE09jW3RTso@mail.gmail.com>
Message-ID: <AANLkTik3LJLoBsDcwtfhCQbw-tyRM2cE+mWb0xg7EnGJ@mail.gmail.com>

Hi Andreas -

Excellent work from the team this year.

I would recommend removing as much legacy code as possible and removing
(preferably rewriting) the legacy documentation. I think it would be better
to have no docs than out of date docs.

- Mark

On Wed, Oct 13, 2010 at 8:02 AM, Andreas Prlic <andreas at sdsc.edu> wrote:

> Hi,
>
> BioJava 3 has matured massively in SVN during this year and it is time to
> prepare a first release. I propose the following release plan. See also two
> other topics for discussion below.
>
> Release Plan 3.0
>
> * Alpha release build(s)
>  during the next days I will start to provide a first alpha release build.
> This will be followed by semi-regular follow up alpha builds (depending on
> SVN activity)
>
> - During the next weeks any missing features should be committed to SVN.
>  Refactoring of code can still be done during this time.
> - Add and update documentation in wiki
> - Module maintainers: check compile warnings for your modules in automated
> builds. Make sure no compile warnings are being displayed.
>
>
> * Beta release build(s)
>  the first beta release is scheduled for the weekend Nov 21st.
>
> - From this point on only minor changes (bug fixes) should be added to the
> code base
> - Module maintainers: check and update javadoc for your modules
>
> * Release 3.0
>  The 3.0 Release is scheduled for Dez 12th
>
>
> There are two things we should still discuss:
>
> * backwards compatibility:
> the current "core" module contains tons of legacy 1.7 code. Shall I go
> ahead
> and delete this module?
>
> * documentation:
> The wiki contains tons of documentation for 1.7 which will not be useful
> for
> 3.0. As a procedure for cleaning this up and avoiding confusion I suggest
> to
> move all 1.7 related docu into a special section of the wiki. All toplevel
> links to documentation should point to 3.0. Any other suggestions?
>
>
> Andreas
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From paolo.romano at istge.it  Wed Oct 13 06:17:27 2010
From: paolo.romano at istge.it (Paolo Romano)
Date: Wed, 13 Oct 2010 12:17:27 +0200
Subject: [Biojava-l] NETTAB 2010 Biological Wikis: Call for posters and
 participation
Message-ID: <201010131018.o9DAHTjq009877@clus2.istge.it>

Apologizes for duplications
====

Joint NETTAB 2010 and BBCC 2010 workshop

Biological Wikis

November 29 - December 1, 2010
Congress Center, University of Naples "Federico II", Naples, Italy

http://www.nettab.org/2010/


The joint NETTAB and BBCC 2010 workshop on "Biological Wikis" 
promises to be a great meeting for all researchers involved in the 
exploitation of wikis in biology.
Come and discuss your ideas and doubts with such scientists as Alex 
Bateman, Alexander Pico, Andrew Su, Dan Bolser, Robert Hoffmann, 
Thomas Kelder, Mike Cariaso, Adam Godzik, Luca Toldo and many other 
who, we hope, will join the workshop.

It's a great chance to follow smart tutorials and lectures on 
WikiPathways, WikiGenes, Semantic Wiki, PDBWiki, Gene Wiki and a 
proficient use of Wikipedia.
See a list of keynote speakers and tutorials at 
http://www.nettab.org/2010/progr.html .

There still is time to submit abstracts for posters and software 
demonstrations until next October 17, 2010!
The complete Call is available on-line at 
http://www.nettab.org/2010/call.html .

Registration is open at http://www.nettab.org/2010/rform.html .
Register within next October 29, 2010 and take profit of early 
registration fees.

A reduction of 20 euro applies to all fees for members of ISCB and 
other societies and networks.
More reductions are foreseen for PhD students.

Further information is availble at http://www.nettab.org/2010/ .

Looking forward to seeing you soon in Naples.

Paolo Romano

Paolo Romano (paolo.romano at istge.it)
Bioinformatics
National Cancer Research Institute (IST)
Largo Rosanna Benzi, 10, I-16132, Genova, Italy
Tel: +39-010-5737-288  Fax: +39-010-5737-295  Skype: p.romano
Web: http://www.nettab.org/promano/


From pjotr.public23 at thebird.nl  Wed Oct 13 07:15:41 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:15:41 +0200
Subject: [Biojava-l] BioJava translation
Message-ID: <20101013111541.GA512@thebird.nl>

I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
rather slow. In fact, the biopython equivalent in native Python is
twice as fast. EMBOSS is again magnitudes faster. I am using
something like 

  rna = RNATools.createRNA(nucleotides);
  aa = RNATools.translate(rna);

Embarrassingly, even the R version is faster in the GeneR module, as
it uses a C module. 

I have a feeling this has to do with typed object creation at every
level, whereas Python and others uses plain character Strings. 

Any plans for speeding this up on the JVM? 

Pj.

From pjotr.public23 at thebird.nl  Wed Oct 13 07:40:37 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:40:37 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
Message-ID: <20101013114037.GA1166@thebird.nl>

Great! You mean BJ3 translation should work? Do you have a short
example of use?

Pj.

On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

From holland at eaglegenomics.com  Wed Oct 13 07:27:05 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 12:27:05 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>

BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

On 13 Oct 2010, at 12:15, Pjotr Prins wrote:

> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is again magnitudes faster. I am using
> something like 
> 
>  rna = RNATools.createRNA(nucleotides);
>  aa = RNATools.translate(rna);
> 
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module. 
> 
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others uses plain character Strings. 
> 
> Any plans for speeding this up on the JVM? 
> 
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From holland at eaglegenomics.com  Wed Oct 13 07:42:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 12:42:21 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013114037.GA1166@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
Message-ID: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>

Afraid I'm a bit out of touch but someone else on this list should be able to help. Andy or Andreas maybe?

On 13 Oct 2010, at 12:40, Pjotr Prins wrote:

> Great! You mean BJ3 translation should work? Do you have a short
> example of use?
> 
> Pj.
> 
> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From pjotr.public23 at thebird.nl  Wed Oct 13 07:48:07 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:48:07 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>
Message-ID: <20101013114807.GA1569@thebird.nl>

On Wed, Oct 13, 2010 at 12:42:21PM +0100, Richard Holland wrote:
> Afraid I'm a bit out of touch but someone else on this list should
> be able to help. Andy or Andreas maybe?

It is not on the wiki yet, and I must admit I get lost in the source
tree. Any short example will do, translating from an ntseq (String) to
aaseq (String).

Pj.

From ayates at ebi.ac.uk  Wed Oct 13 07:50:25 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 12:50:25 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013114037.GA1166@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
Message-ID: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>

As of the moment there are the translation test cases which is the best documentation:

http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java

This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing.

Translation supports a lot of different modes with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available

Andy


On 13 Oct 2010, at 12:40, Pjotr Prins wrote:

> Great! You mean BJ3 translation should work? Do you have a short
> example of use?
> 
> Pj.
> 
> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From koen.bruynseels at cropdesign.com  Wed Oct 13 08:16:00 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Wed, 13 Oct 2010 14:16:00 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID: <OF5E0BDE0E.450C15D1-ONC12577BB.00436226-C12577BB.00436226@basf-c-s.be>


I will be out of the office starting  10/12/2010 and will not return until
10/14/2010.

I will respond to your message when I return.


From andreas at sdsc.edu  Wed Oct 13 11:42:44 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 08:42:44 -0700
Subject: [Biojava-l] BioJava translation
In-Reply-To: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
Message-ID: <AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>

Hi Andy,

any chance to add some wiki documentation for this as well? Would be
great....

Andreas


On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> As of the moment there are the translation test cases which is the best
> documentation:
>
>
> http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java
>
> This hopefully will give you a good idea about how to go about it. I was
> managing over 1000 translations per second of BRCA2 going from mRNA to
> peptide with checks. YMMV but I hope this is a lot faster than what you're
> currently seeing.
>
> Translation supports a lot of different modes with TranscriptionEngine
> being the place to configure this. The Javadoc should be good enough to help
> you through the different modes available
>
> Andy
>
>
> On 13 Oct 2010, at 12:40, Pjotr Prins wrote:
>
> > Great! You mean BJ3 translation should work? Do you have a short
> > example of use?
> >
> > Pj.
> >
> > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> >> BJ3 should be replacing most sequence operations with string operations,
> making the whole thing much faster.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From ayates at ebi.ac.uk  Wed Oct 13 11:46:58 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 16:46:58 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
Message-ID: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>

I will try my best to

Andy

On 13 Oct 2010, at 16:42, Andreas Prlic wrote:

> 
> Hi Andy,
> 
> any chance to add some wiki documentation for this as well? Would be great....
> 
> Andreas
> 
> 
> On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> As of the moment there are the translation test cases which is the best documentation:
> 
> http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java
> 
> This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing.
> 
> Translation supports a lot of different modes with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available
> 
> Andy
> 
> 
> On 13 Oct 2010, at 12:40, Pjotr Prins wrote:
> 
> > Great! You mean BJ3 translation should work? Do you have a short
> > example of use?
> >
> > Pj.
> >
> > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> 
> 
> -- 
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Wed Oct 13 11:58:44 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 17:58:44 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
	<3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
Message-ID: <20101013155844.GA2918@thebird.nl>

On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
> I will try my best to

Make sure to add the sequence should be uppercase. Took me a while to
crack that, as I only got a null pointer exception.

Pj.

From holland at eaglegenomics.com  Wed Oct 13 12:02:24 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 17:02:24 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013155844.GA2918@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
	<3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
	<20101013155844.GA2918@thebird.nl>
Message-ID: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>

whuh??? Shouldn't we be coding to cater for all case mixtures?!


On 13 Oct 2010, at 16:58, Pjotr Prins wrote:

> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
>> I will try my best to
> 
> Make sure to add the sequence should be uppercase. Took me a while to
> crack that, as I only got a null pointer exception.
> 
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From ayates at ebi.ac.uk  Wed Oct 13 12:11:40 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 17:11:40 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
	<3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
	<20101013155844.GA2918@thebird.nl>
	<1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>
Message-ID: <7740A206-98A0-4FBC-9CF8-B1AC0DE7D859@ebi.ac.uk>

I also thought we were as well. I can investigate

On 13 Oct 2010, at 17:02, Richard Holland wrote:

> whuh??? Shouldn't we be coding to cater for all case mixtures?!
> 
> 
> On 13 Oct 2010, at 16:58, Pjotr Prins wrote:
> 
>> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
>>> I will try my best to
>> 
>> Make sure to add the sequence should be uppercase. Took me a while to
>> crack that, as I only got a null pointer exception.
>> 
>> Pj.
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Wed Oct 13 12:13:36 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 18:13:36 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
Message-ID: <20101013161336.GA3184@thebird.nl>

On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

Good news, BJ3 is a lot faster! The previous version took 2 minutes
for the C.elegans genome (33 Mb), the BJ3 version takes 27sec on my
modest Thinkpad X61 laptop. After parsing the Fasta and turning it
into an upper case string the actual translation takes 16sec.

Only the C implementations are faster.

Here the relevant Scala code:

import bio._
import java.io._
import org.biojava3.core.sequence._
import org.biojava3.core.sequence.transcription.TranscriptionEngine
import org.biojava3.core.sequence.io.IUPACParser

// <cut> fetching infile from command line...

IUPACParser.getInstance().getTable(1);  // not sure we need this
IUPACParser.getInstance().getTable("UNIVERSAL");
val engine = TranscriptionEngine.getDefault()
val f = new FastaReader(infile)
f.foreach { 
  res => 
    val (id,tag,dna) = res
    println(List(">",id).mkString) 
    val dna2 = new DNASequence(dna.mkString.toUpperCase)
    val rna = dna2.getRNASequence(engine)
    println(rna.getProteinSequence(engine))
  }
}

prints:

>B0222.10
MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
>B0222.11
MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
(...)

Pj.


From ayates at ebi.ac.uk  Wed Oct 13 12:25:41 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 17:25:41 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013161336.GA3184@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
Message-ID: <F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>

That's great news and should be even faster once we get rid of the requirement to upper case since you're having to parse the same sequence twice.

I wonder what the C version does to make itself even faster

Andy

On 13 Oct 2010, at 17:13, Pjotr Prins wrote:

> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
> 
> Good news, BJ3 is a lot faster! The previous version took 2 minutes
> for the C.elegans genome (33 Mb), the BJ3 version takes 27sec on my
> modest Thinkpad X61 laptop. After parsing the Fasta and turning it
> into an upper case string the actual translation takes 16sec.
> 
> Only the C implementations are faster.
> 
> Here the relevant Scala code:
> 
> import bio._
> import java.io._
> import org.biojava3.core.sequence._
> import org.biojava3.core.sequence.transcription.TranscriptionEngine
> import org.biojava3.core.sequence.io.IUPACParser
> 
> // <cut> fetching infile from command line...
> 
> IUPACParser.getInstance().getTable(1);  // not sure we need this
> IUPACParser.getInstance().getTable("UNIVERSAL");
> val engine = TranscriptionEngine.getDefault()
> val f = new FastaReader(infile)
> f.foreach { 
>  res => 
>    val (id,tag,dna) = res
>    println(List(">",id).mkString) 
>    val dna2 = new DNASequence(dna.mkString.toUpperCase)
>    val rna = dna2.getRNASequence(engine)
>    println(rna.getProteinSequence(engine))
>  }
> }
> 
> prints:
> 
>> B0222.10
> MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
>> B0222.11
> MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
> (...)
> 
> Pj.
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Wed Oct 13 12:34:23 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 18:34:23 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
Message-ID: <20101013163423.GA3849@thebird.nl>

On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote:
> That's great news and should be even faster once we get rid of the requirement to upper case since you're having to parse the same sequence twice.
> 
> I wonder what the C version does to make itself even faster

The EMBOSS implementation is fastest by a mile - takes less than 3
seconds. But the code is, uhm, hard to read.

I think table lookups will win in C, whatever you try. But it may be an
interesting exercise if we can get close. Note I am perhaps not using the
fastest JVM.

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)

Pj.

From willishf at ufl.edu  Wed Oct 13 13:16:01 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 13:16:01 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013163423.GA3849@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
Message-ID: <AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>

The Biojava3 has an additional validation layer and object creation going
from DNA sequence to RNA sequence and then using the appropriate translation
rules to return a protein sequence. Could be easily twice as fast if you
went from DNA sequence to ProteinSequence which would put it at 8 seconds.
We are going to carry a performance penalty setting everything up as a
proper object versus doing a simple String to String translation.


On Wed, Oct 13, 2010 at 12:34 PM, Pjotr Prins <pjotr.public23 at thebird.nl>wrote:

> On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote:
> > That's great news and should be even faster once we get rid of the
> requirement to upper case since you're having to parse the same sequence
> twice.
> >
> > I wonder what the C version does to make itself even faster
>
> The EMBOSS implementation is fastest by a mile - takes less than 3
> seconds. But the code is, uhm, hard to read.
>
> I think table lookups will win in C, whatever you try. But it may be an
> interesting exercise if we can get close. Note I am perhaps not using the
> fastest JVM.
>
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)
>
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>

From pjotr.public23 at thebird.nl  Wed Oct 13 14:17:12 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 20:17:12 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
	<AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
Message-ID: <20101013181712.GA4482@thebird.nl>

I think it is a good idea. From a purist point of view you may object
(it is not biological), but most libraries do exactly that.

If direct translation gets it down to 8sec, we may well half that
with further tweaking.

Pj.

On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> The Biojava3 has an additional validation layer and object creation going
> from DNA sequence to RNA sequence and then using the appropriate translation
> rules to return a protein sequence. Could be easily twice as fast if you
> went from DNA sequence to ProteinSequence which would put it at 8 seconds.
> We are going to carry a performance penalty setting everything up as a
> proper object versus doing a simple String to String translation.

From darnells at dnastar.com  Wed Oct 13 14:21:52 2010
From: darnells at dnastar.com (Steve Darnell)
Date: Wed, 13 Oct 2010 13:21:52 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>

Andrew,

Forgive me for being pessimistic, but I do not believe you can
publically distribute your code without running the risk of being
scooped.  Mark's suggestions are very good; however, the safest route
would be to withhold distribution of your code until your work is
published (or at very least accepted).

Also, I would suggest this argument for convincing your group to use
BioJava (disclaimer - I am not a lawyer).

Under the LGPL, you are not obligated to release your source code if:

(1) you create a "work based on the library" (e.g. direct modifications
or additions to the licensed work) but do not distribute it, and
(2) you create a "work that uses the library" by dynamically linking
your work to the licensed work (see distribution clause #5 of the LGPL:
http://www.gnu.org/licenses/lgpl-2.1.html)

If you follow choice #2, you can license and distribute your work under
terms of your group's choosing (open or closed, submit it to the BioJava
developers for inclusion or not) while gaining the benefit of reusing
BioJava.

~Steve

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org
[mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
Schreiber
Sent: Wednesday, October 13, 2010 4:26 AM
To: McSweeny, Andrew J
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] How to share code while protecting copyrights?

Hi -

My understanding of copyright is that it is yours as soon as you assert
that
it is your creation. You can simply add a copyright statement to each
file
containing the code (in the header for example). The reality is that
defending copyright is your responsibility. If someone violates it, you
have
to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it
can be
used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick
one
of these that best matches your needs. BioJava code is LGPL so if you
want
your code to go into the BioJava code base you will need to make your
code
LGPL.

It's always a good idea to add @author tags to Java code to ensure
appropriate attribution.

Finally, if someone steals your code and publishes results before you
then
you can always make a complaint to the journal editors. If it is a
reputable
journal, and you have reasonable proof the editor should take some
action
such as forcing a retraction.  You can also make a distribution
agreement
saying that if someone uses this code they agree not to publish without
first consulting you.

If you want to make it really water tight, get a lawyer and explain
specifically what you want to share and what you want to protect or
prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms.  Their genome is just a contig from
hg18.
>  It's pretty interesting and I can talk more about it in the
future....
>
> Anyways, how can I share my code for this project without having to
worry
> that someone else will use it to publish a paper before my group does?
>
> I'm certain nobody in the open source community would do that, but how
do I
> convince my group that opening our project to BioJava is a good idea?
>
> -Andrew
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


From andreas at sdsc.edu  Wed Oct 13 14:48:32 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 11:48:32 -0700
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
Message-ID: <AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>

> Forgive me for being pessimistic, but I do not believe you can
> publically distribute your code without running the risk of being
> scooped.  Mark's suggestions are very good; however, the safest route
> would be to withhold distribution of your code until your work is
> published (or at very least accepted).
>


I think that is too conservative - if getting scooped is an issue, I would
release the code shortly before submission of the first manuscript to a
journal. That way the source code can form part of the publication and the
referees can view the code during the review process.  Many views/downloads
of articles happen in the first few weeks after publication. Having a link
to the source code in the paper can be a great advertisement for the open
source project and help in community-building.

Andreas


>
> Also, I would suggest this argument for convincing your group to use
> BioJava (disclaimer - I am not a lawyer).
>
> Under the LGPL, you are not obligated to release your source code if:
>
> (1) you create a "work based on the library" (e.g. direct modifications
> or additions to the licensed work) but do not distribute it, and
> (2) you create a "work that uses the library" by dynamically linking
> your work to the licensed work (see distribution clause #5 of the LGPL:
> http://www.gnu.org/licenses/lgpl-2.1.html)
>
> If you follow choice #2, you can license and distribute your work under
> terms of your group's choosing (open or closed, submit it to the BioJava
> developers for inclusion or not) while gaining the benefit of reusing
> BioJava.
>
> ~Steve
>
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
> Schreiber
> Sent: Wednesday, October 13, 2010 4:26 AM
> To: McSweeny, Andrew J
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] How to share code while protecting copyrights?
>
> Hi -
>
> My understanding of copyright is that it is yours as soon as you assert
> that
> it is your creation. You can simply add a copyright statement to each
> file
> containing the code (in the header for example). The reality is that
> defending copyright is your responsibility. If someone violates it, you
> have
> to take them to court or issue a legal letter.
>
> You can also put an appropriate license on the code specifying how it
> can be
> used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick
> one
> of these that best matches your needs. BioJava code is LGPL so if you
> want
> your code to go into the BioJava code base you will need to make your
> code
> LGPL.
>
> It's always a good idea to add @author tags to Java code to ensure
> appropriate attribution.
>
> Finally, if someone steals your code and publishes results before you
> then
> you can always make a complaint to the journal editors. If it is a
> reputable
> journal, and you have reasonable proof the editor should take some
> action
> such as forcing a retraction.  You can also make a distribution
> agreement
> saying that if someone uses this code they agree not to publish without
> first consulting you.
>
> If you want to make it really water tight, get a lawyer and explain
> specifically what you want to share and what you want to protect or
> prevent.
>
> - Mark
>
> On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
> andrew.mcsweeny at rockets.utoledo.edu> wrote:
>
> > Hi,
> >
> > I am working on a project which simulates sexual reproduction in a
> > population of digital organisms.  Their genome is just a contig from
> hg18.
> >  It's pretty interesting and I can talk more about it in the
> future....
> >
> > Anyways, how can I share my code for this project without having to
> worry
> > that someone else will use it to publish a paper before my group does?
> >
> > I'm certain nobody in the open source community would do that, but how
> do I
> > convince my group that opening our project to BioJava is a good idea?
> >
> > -Andrew
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas.prlic at gmail.com  Wed Oct 13 15:18:12 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 13 Oct 2010 12:18:12 -0700
Subject: [Biojava-l] Questions related to biojava
In-Reply-To: <COL117-W299595E149AB88CB62D441A7550@phx.gbl>
References: <COL117-W299595E149AB88CB62D441A7550@phx.gbl>
Message-ID: <AANLkTi=oDrwtXBt3qxQinw7zUNN81uzYGOWEZ=rc2vkv@mail.gmail.com>

Hi Madhu,

best to keep such mails on the mailing list, otherwise they might get lost
in my flood of emails... see my reply below.

On Wed, Oct 13, 2010 at 12:08 PM, Madhusudan Gujral <mgujral2000 at hotmail.com
> wrote:

>  Hi Andreas,
>
>  I have couple of questions related to biojava. I would greatly appreciate
> if you could provide directions.
>
>  Is the biojava version 3.0 mature?
>  Is there any pom file for biojava that I can work with?
>  Is there a single tool to validate a fasta file?
>
>

- biojava 3.0 is in preparation of getting released. It is not release ready
but some of the modules are already used in some production environments
-  not sure what you mean with this question. You can see the source code in
SVN/git and there is also an automated build server providing snapshot
builds that can be used for Maven installations.
- what kind of vallidation do you have in mind? biojava3-core can do FASTA
parsing for you...

Andreas

From willishf at ufl.edu  Wed Oct 13 15:16:39 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 15:16:39 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013181712.GA4482@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
	<AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
	<20101013181712.GA4482@thebird.nl>
Message-ID: <AANLkTikZBaPj4dvn-sKCqOWh2LVbVjmfG9yHwpFzUzUf@mail.gmail.com>

Pjotr

What is an extra 8 seconds among friends if you know you are going to get
the correct answer and you can change the rules if needed!!!

Are you parsing the C.elgans genome or DNA representation of each protein in
the C.elgans genome?

If you take out the println statement that will help speed things up a
bunch. Java System.out is always slow.

I am checking on the problem with upper case. That shouldn't be an issue.

Thanks

Scooter


On Wed, Oct 13, 2010 at 2:17 PM, Pjotr Prins <pjotr.public23 at thebird.nl>wrote:

> I think it is a good idea. From a purist point of view you may object
> (it is not biological), but most libraries do exactly that.
>
> If direct translation gets it down to 8sec, we may well half that
> with further tweaking.
>
> Pj.
>
> On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> > The Biojava3 has an additional validation layer and object creation going
> > from DNA sequence to RNA sequence and then using the appropriate
> translation
> > rules to return a protein sequence. Could be easily twice as fast if you
> > went from DNA sequence to ProteinSequence which would put it at 8
> seconds.
> > We are going to carry a performance penalty setting everything up as a
> > proper object versus doing a simple String to String translation.
>
>

From pjotr.public23 at thebird.nl  Wed Oct 13 17:05:46 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 23:05:46 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>
Message-ID: <20101013210546.GB5479@thebird.nl>

Is that idea of getting scooped realistic?

All my code is online, that is my scientific track record, next to my
papers.

Online OSS code may bring benefits when other people find bugs, or
even improve things. I don't worry about getting scooped. First it is
easy to prove it is mine, exactly because it is out in the open, and
second it takes more than plain old code to get something published in
a journal.

In the rare case an idea is so sensitive and easy to copy, you can
publish it with some part missing.

I think too much code sits on planks gathering dust, just because
people have these worries. It is old school. We are in the business
of moving science forward - writing beautiful tools. Nothing less.

Pj.

On Wed, Oct 13, 2010 at 11:48:32AM -0700, Andreas Prlic wrote:
> > Forgive me for being pessimistic, but I do not believe you can
> > publically distribute your code without running the risk of being
> > scooped.  Mark's suggestions are very good; however, the safest route
> > would be to withhold distribution of your code until your work is
> > published (or at very least accepted).

From andreas at sdsc.edu  Wed Oct 13 17:24:54 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 14:24:54 -0700
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101013210546.GB5479@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>
	<20101013210546.GB5479@thebird.nl>
Message-ID: <AANLkTinQatqghBZONzGcxRx9ySzDdZHnOBopri8TTHxZ@mail.gmail.com>

nicely put :-)

A

On Wed, Oct 13, 2010 at 2:05 PM, Pjotr Prins <pjotr.public23 at thebird.nl>wrote:

> Is that idea of getting scooped realistic?
>
> All my code is online, that is my scientific track record, next to my
> papers.
>
> Online OSS code may bring benefits when other people find bugs, or
> even improve things. I don't worry about getting scooped. First it is
> easy to prove it is mine, exactly because it is out in the open, and
> second it takes more than plain old code to get something published in
> a journal.
>
> In the rare case an idea is so sensitive and easy to copy, you can
> publish it with some part missing.
>
> I think too much code sits on planks gathering dust, just because
> people have these worries. It is old school. We are in the business
> of moving science forward - writing beautiful tools. Nothing less.
>
> Pj.
>
> On Wed, Oct 13, 2010 at 11:48:32AM -0700, Andreas Prlic wrote:
> > > Forgive me for being pessimistic, but I do not believe you can
> > > publically distribute your code without running the risk of being
> > > scooped.  Mark's suggestions are very good; however, the safest route
> > > would be to withhold distribution of your code until your work is
> > > published (or at very least accepted).
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From hlapp at drycafe.net  Wed Oct 13 17:44:36 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 13 Oct 2010 16:44:36 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
Message-ID: <A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>

How and when you want to be attributed in publications, and what you  
want someone else not to publish on, is an ethical matter. Licenses  
are legal instruments and not suited for ethical questions or social  
conventions. Rather, this is addressed by ethical and social  
conventions and requests.

A good example is the Ft Lauderdale agreement, which is not a legal  
instrument but an ethical request of those who peruse immediate- 
release sequencing data. If you have ethical or social requests to  
make of those who peruse your code, state them explicitly in a README  
and in the code.

By their nature, you can't legally enforce them. However, ethical  
behavior is policed - by all of us as a scientific community, not in  
the courts.

	-hilmar

On Oct 13, 2010, at 1:21 PM, Steve Darnell wrote:

> Andrew,
>
> Forgive me for being pessimistic, but I do not believe you can
> publically distribute your code without running the risk of being
> scooped.  Mark's suggestions are very good; however, the safest route
> would be to withhold distribution of your code until your work is
> published (or at very least accepted).
>
> Also, I would suggest this argument for convincing your group to use
> BioJava (disclaimer - I am not a lawyer).
>
> Under the LGPL, you are not obligated to release your source code if:
>
> (1) you create a "work based on the library" (e.g. direct  
> modifications
> or additions to the licensed work) but do not distribute it, and
> (2) you create a "work that uses the library" by dynamically linking
> your work to the licensed work (see distribution clause #5 of the  
> LGPL:
> http://www.gnu.org/licenses/lgpl-2.1.html)
>
> If you follow choice #2, you can license and distribute your work  
> under
> terms of your group's choosing (open or closed, submit it to the  
> BioJava
> developers for inclusion or not) while gaining the benefit of reusing
> BioJava.
>
> ~Steve
>
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
> Schreiber
> Sent: Wednesday, October 13, 2010 4:26 AM
> To: McSweeny, Andrew J
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] How to share code while protecting  
> copyrights?
>
> Hi -
>
> My understanding of copyright is that it is yours as soon as you  
> assert
> that
> it is your creation. You can simply add a copyright statement to each
> file
> containing the code (in the header for example). The reality is that
> defending copyright is your responsibility. If someone violates it,  
> you
> have
> to take them to court or issue a legal letter.
>
> You can also put an appropriate license on the code specifying how it
> can be
> used. Examples include GPL, LGPL, BSD, Apache License etc. You can  
> pick
> one
> of these that best matches your needs. BioJava code is LGPL so if you
> want
> your code to go into the BioJava code base you will need to make your
> code
> LGPL.
>
> It's always a good idea to add @author tags to Java code to ensure
> appropriate attribution.
>
> Finally, if someone steals your code and publishes results before you
> then
> you can always make a complaint to the journal editors. If it is a
> reputable
> journal, and you have reasonable proof the editor should take some
> action
> such as forcing a retraction.  You can also make a distribution
> agreement
> saying that if someone uses this code they agree not to publish  
> without
> first consulting you.
>
> If you want to make it really water tight, get a lawyer and explain
> specifically what you want to share and what you want to protect or
> prevent.
>
> - Mark
>
> On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
> andrew.mcsweeny at rockets.utoledo.edu> wrote:
>
>> Hi,
>>
>> I am working on a project which simulates sexual reproduction in a
>> population of digital organisms.  Their genome is just a contig from
> hg18.
>> It's pretty interesting and I can talk more about it in the
> future....
>>
>> Anyways, how can I share my code for this project without having to
> worry
>> that someone else will use it to publish a paper before my group  
>> does?
>>
>> I'm certain nobody in the open source community would do that, but  
>> how
> do I
>> convince my group that opening our project to BioJava is a good idea?
>>
>> -Andrew
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From ayates at ebi.ac.uk  Wed Oct 13 18:52:17 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 23:52:17 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <AANLkTikZBaPj4dvn-sKCqOWh2LVbVjmfG9yHwpFzUzUf@mail.gmail.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
	<AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
	<20101013181712.GA4482@thebird.nl>
	<AANLkTikZBaPj4dvn-sKCqOWh2LVbVjmfG9yHwpFzUzUf@mail.gmail.com>
Message-ID: <7E59B83F-8371-4F79-AC4C-57D1A49A9398@ebi.ac.uk>

LOL well you could always parallelise it :)

I've gone & pushed a new version of the translator code to the SVN repo so it'll filter through to the public server soon. There's an added test case as well. The overall impact of this change seems to be about 25 translations of BRCA2 per second so it is significant; our current limit looks to be approx. 200 per second.

I hope you find this is faster without the need to edit & parse a Sequence String twice

Andy

On 13 Oct 2010, at 20:16, Scooter Willis wrote:

> Pjotr
> 
> What is an extra 8 seconds among friends if you know you are going to get the correct answer and you can change the rules if needed!!!
> 
> Are you parsing the C.elgans genome or DNA representation of each protein in the C.elgans genome? 
> 
> If you take out the println statement that will help speed things up a bunch. Java System.out is always slow.
> 
> I am checking on the problem with upper case. That shouldn't be an issue.
> 
> Thanks
> 
> Scooter
> 
> 
> On Wed, Oct 13, 2010 at 2:17 PM, Pjotr Prins <pjotr.public23 at thebird.nl> wrote:
> I think it is a good idea. From a purist point of view you may object
> (it is not biological), but most libraries do exactly that.
> 
> If direct translation gets it down to 8sec, we may well half that
> with further tweaking.
> 
> Pj.
> 
> On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> > The Biojava3 has an additional validation layer and object creation going
> > from DNA sequence to RNA sequence and then using the appropriate translation
> > rules to return a protein sequence. Could be easily twice as fast if you
> > went from DNA sequence to ProteinSequence which would put it at 8 seconds.
> > We are going to carry a performance penalty setting everything up as a
> > proper object versus doing a simple String to String translation.
> 
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Thu Oct 14 03:00:12 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Thu, 14 Oct 2010 09:00:12 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>
Message-ID: <20101014070012.GA7296@thebird.nl>

On Wed, Oct 13, 2010 at 04:44:36PM -0500, Hilmar Lapp wrote:
> By their nature, you can't legally enforce them. However, ethical  
> behavior is policed - by all of us as a scientific community, not in the 
> courts.

I know people who make it their business to pursue companies that do
not honour OSS licenses. The companies always have to retrack.

Is there any precedent in science where open source software was used
to scoop research? And how did that scientist fare?

With scientists I can't see it happening. Getting caught out that way
will hurt all future prospects for an individual or group.

With this reasoning you are best off putting code in the public domain
as fast as possible.

Pj.

From hlapp at drycafe.net  Thu Oct 14 10:47:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Thu, 14 Oct 2010 09:47:19 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101014070012.GA7296@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>
	<20101014070012.GA7296@thebird.nl>
Message-ID: <FB150474-3E31-4795-BC3E-1D4B2A12EB4C@drycafe.net>


On Oct 14, 2010, at 2:00 AM, Pjotr Prins wrote:

> I know people who make it their business to pursue companies that do
> not honour OSS licenses. The companies always have to retrack.


Of course. That's a legal issue. Attribution on publications, or what  
someone publishes on reusing your stuff, is not a legal issue.

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri Oct 15 07:53:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 Oct 2010 12:53:13 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID: <AANLkTimigiSLuDnXyV1EdaPeasvPu2=1Dned71pAbT7h@mail.gmail.com>

On Wed, Oct 13, 2010 at 12:15 PM, Pjotr Prins <pjotr.public23 at thebird.nl> wrote:
> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is again magnitudes faster. I am using
> something like
>
> ?rna = RNATools.createRNA(nucleotides);
> ?aa = RNATools.translate(rna);
>
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module.
>
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others uses plain character Strings.
>
> Any plans for speeding this up on the JVM?
>
> Pj.

Actually (assuming you are not explicitly using strings),
Biopython would also be using objects for each sequence,
which does impose a speed penalty.

Peter


From kurka at mikro.biologie.tu-muenchen.de  Tue Oct 19 07:25:31 2010
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Tue, 19 Oct 2010 13:25:31 +0200
Subject: [Biojava-l] feature request - full query description from blast
	result
Message-ID: <4CBD802B.7030809@mikro.biologie.tu-muenchen.de>

Hi all,

I just read in a blast file and I want to get the full query description.
For example, when I have that query:
Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
         (1208 letters)

I get as query-information locus_tag= CD0002
The rest is truncated.

In the biojava-mailinglist I found the same question
http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html

And Mark suggested to make a request for improvement, but as I see it,
nothing happened. So I would like to ask, if you can change it. Or is it
changed and I don't see it.

Thanks,
Hedwig

From sb.genny at gmail.com  Thu Oct 21 10:28:53 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:28:53 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
References: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
Message-ID: <AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>

Hi

I want to develop phylogenetics application in biojava..but need help to do
that..Kindly help me in developing some applications..

Thanks in anticipation

Regards,
Sobia Idrees


On Tue, Oct 19, 2010 at 9:00 PM, <biojava-l-request at lists.open-bio.org>wrote:

> Send Biojava-l mailing list submissions to
>        biojava-l at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://lists.open-bio.org/mailman/listinfo/biojava-l
> or, via email, send a message with subject or body 'help' to
>        biojava-l-request at lists.open-bio.org
>
> You can reach the person managing the list at
>        biojava-l-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Biojava-l digest..."
>
>
> Today's Topics:
>
>   1. feature request - full query description from blast       result
>      (Hedwig Kurka)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 19 Oct 2010 13:25:31 +0200
> From: Hedwig Kurka <kurka at mikro.biologie.tu-muenchen.de>
> Subject: [Biojava-l] feature request - full query description from
>        blast   result
> To: biojava-l at lists.open-bio.org
> Message-ID: <4CBD802B.7030809 at mikro.biologie.tu-muenchen.de>
> Content-Type: text/plain; charset=ISO-8859-15
>
> Hi all,
>
> I just read in a blast file and I want to get the full query description.
> For example, when I have that query:
> Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
> III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
> ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
>         (1208 letters)
>
> I get as query-information locus_tag= CD0002
> The rest is truncated.
>
> In the biojava-mailinglist I found the same question
> http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html
>
> And Mark suggested to make a request for improvement, but as I see it,
> nothing happened. So I would like to ask, if you can change it. Or is it
> changed and I don't see it.
>
> Thanks,
> Hedwig
>
>
> ------------------------------
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
> End of Biojava-l Digest, Vol 93, Issue 9
> ****************************************
>

From sb.genny at gmail.com  Thu Oct 21 10:30:35 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:30:35 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: <AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>
References: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
	<AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>
Message-ID: <AANLkTi=x2mrAFoYMf2v1iE=HN8fkV=0R2wS4tHrVLsRn@mail.gmail.com>

Hi

I have developed some web based and desktop based applications using
biojava..Can it be published in Biojava journal?

Thanks,
Sobia Idrees

On Thu, Oct 21, 2010 at 7:28 PM, sobia idrees <sb.genny at gmail.com> wrote:

> Hi
>
> I want to develop phylogenetics application in biojava..but need help to do
> that..Kindly help me in developing some applications..
>
> Thanks in anticipation
>
> Regards,
> Sobia Idrees
>
>
> On Tue, Oct 19, 2010 at 9:00 PM, <biojava-l-request at lists.open-bio.org>wrote:
>
>> Send Biojava-l mailing list submissions to
>>        biojava-l at lists.open-bio.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>        http://lists.open-bio.org/mailman/listinfo/biojava-l
>> or, via email, send a message with subject or body 'help' to
>>        biojava-l-request at lists.open-bio.org
>>
>> You can reach the person managing the list at
>>        biojava-l-owner at lists.open-bio.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of Biojava-l digest..."
>>
>>
>> Today's Topics:
>>
>>   1. feature request - full query description from blast       result
>>      (Hedwig Kurka)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 19 Oct 2010 13:25:31 +0200
>> From: Hedwig Kurka <kurka at mikro.biologie.tu-muenchen.de>
>> Subject: [Biojava-l] feature request - full query description from
>>        blast   result
>> To: biojava-l at lists.open-bio.org
>> Message-ID: <4CBD802B.7030809 at mikro.biologie.tu-muenchen.de>
>> Content-Type: text/plain; charset=ISO-8859-15
>>
>> Hi all,
>>
>> I just read in a blast file and I want to get the full query description.
>> For example, when I have that query:
>> Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
>> III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
>> ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
>>         (1208 letters)
>>
>> I get as query-information locus_tag= CD0002
>> The rest is truncated.
>>
>> In the biojava-mailinglist I found the same question
>> http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html
>>
>> And Mark suggested to make a request for improvement, but as I see it,
>> nothing happened. So I would like to ask, if you can change it. Or is it
>> changed and I don't see it.
>>
>> Thanks,
>> Hedwig
>>
>>
>> ------------------------------
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>>
>> End of Biojava-l Digest, Vol 93, Issue 9
>> ****************************************
>>
>
>

From holland at eaglegenomics.com  Thu Oct 21 10:41:35 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 21 Oct 2010 15:41:35 +0100
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: <AANLkTi=x2mrAFoYMf2v1iE=HN8fkV=0R2wS4tHrVLsRn@mail.gmail.com>
References: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
	<AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>
	<AANLkTi=x2mrAFoYMf2v1iE=HN8fkV=0R2wS4tHrVLsRn@mail.gmail.com>
Message-ID: <97591963-F741-45C1-8E9D-231A5D05D4DA@eaglegenomics.com>

There is no such thing as a Biojava journal. You would need to submit your paper to one of the main bioinformatics journals.

cheers,
Richard

On 21 Oct 2010, at 15:30, sobia idrees wrote:

> Hi
> 
> I have developed some web based and desktop based applications using
> biojava..Can it be published in Biojava journal?
> 
> Thanks,
> Sobia Idrees
> 
> On Thu, Oct 21, 2010 at 7:28 PM, sobia idrees <sb.genny at gmail.com> wrote:
> 
>> Hi
>> 
>> I want to develop phylogenetics application in biojava..but need help to do
>> that..Kindly help me in developing some applications..
>> 
>> Thanks in anticipation
>> 
>> Regards,
>> Sobia Idrees
>> 
>> 
>> On Tue, Oct 19, 2010 at 9:00 PM, <biojava-l-request at lists.open-bio.org>wrote:
>> 
>>> Send Biojava-l mailing list submissions to
>>>       biojava-l at lists.open-bio.org
>>> 
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>       http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> or, via email, send a message with subject or body 'help' to
>>>       biojava-l-request at lists.open-bio.org
>>> 
>>> You can reach the person managing the list at
>>>       biojava-l-owner at lists.open-bio.org
>>> 
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of Biojava-l digest..."
>>> 
>>> 
>>> Today's Topics:
>>> 
>>>  1. feature request - full query description from blast       result
>>>     (Hedwig Kurka)
>>> 
>>> 
>>> ----------------------------------------------------------------------
>>> 
>>> Message: 1
>>> Date: Tue, 19 Oct 2010 13:25:31 +0200
>>> From: Hedwig Kurka <kurka at mikro.biologie.tu-muenchen.de>
>>> Subject: [Biojava-l] feature request - full query description from
>>>       blast   result
>>> To: biojava-l at lists.open-bio.org
>>> Message-ID: <4CBD802B.7030809 at mikro.biologie.tu-muenchen.de>
>>> Content-Type: text/plain; charset=ISO-8859-15
>>> 
>>> Hi all,
>>> 
>>> I just read in a blast file and I want to get the full query description.
>>> For example, when I have that query:
>>> Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
>>> III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
>>> ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
>>>        (1208 letters)
>>> 
>>> I get as query-information locus_tag= CD0002
>>> The rest is truncated.
>>> 
>>> In the biojava-mailinglist I found the same question
>>> http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html
>>> 
>>> And Mark suggested to make a request for improvement, but as I see it,
>>> nothing happened. So I would like to ask, if you can change it. Or is it
>>> changed and I don't see it.
>>> 
>>> Thanks,
>>> Hedwig
>>> 
>>> 
>>> ------------------------------
>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> 
>>> End of Biojava-l Digest, Vol 93, Issue 9
>>> ****************************************
>>> 
>> 
>> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From jc.lucky at laposte.net  Fri Oct 22 04:11:43 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Fri, 22 Oct 2010 10:11:43 +0200 (CEST)
Subject: [Biojava-l] Retrieve Information from GenBank file
Message-ID: <31170592.35650.1287735103724.JavaMail.www@wwinf8210>


Hi

I'm trying to convert a GenBank file into a rdf file. The gene of interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945

With the below code I can read the GenBank file and I manage to retrieve information and convert them in a rdf format. However I don't succeed in retrieving some information such as Title, protein or product. According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is possible to do so. 
Please help me find what I do wrong or what should be done to achieve my goal.

//read the GeneBank File
public static RichSequenceIterator readFile(String input,
RichSequenceBuilderFactory seqFactory,
Namespace ns)
throws IOException, NoSuchElementException, BioException
{
ns = null;
InputStream stream = new FileInputStream(input);
BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile,ns); 
return seqs;
}

//Retrieve information and convert them in rdf format
public void writeToRDFFile(RichSequenceIterator rsi, String output)
throws IOException, NoSuchElementException, BioException {
//create model for the ontology
OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
OntClass parents;
String URI = "http://pbr.wur.nl/#";

while(rsi.hasNext())
{
RichSequence seq = rsi.nextRichSequence();
String id = seq.getName(); 
parents = model.createClass(URI + id);
Set author = seq.getRankedDocRefs();//code to clean up Set&convert toString
String definition = seq.getDescription(); //code to clean up String
//Add to model
parents.addProperty(DC.description, definition);
parents.addProperty(DC.publisher, authors);
parents.addComment(taxonomy, "EN");
parents.addProperty(DC.type, organism);
//print in rdf format
model.write(out, "RDF/XML");
out.close(); }
}


Thanks,
Jean-Charles Ferri?res

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From andreas at sdsc.edu  Fri Oct 22 15:56:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 22 Oct 2010 12:56:49 -0700
Subject: [Biojava-l] 3.0-alpha2
Message-ID: <AANLkTimUbGR+LpMj-f8wVctMz_hv4E-5DZw542xNtMeh@mail.gmail.com>

Hi,

In preparation for the upcoming biojava 3 release, 3.0-alpha2  has
just been released on http://biojava.org/download/maven/

Andreas

From cfriedline at vcu.edu  Sun Oct 24 10:38:46 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:38:46 -0400
Subject: [Biojava-l] Test Message
Message-ID: <AANLkTinVV_ZhkATN1sXn=UnS0A39RcjV-nEqdSDfVvpq@mail.gmail.com>

Per Andreas, this is a test.

Chris

-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From cfriedline at vcu.edu  Sun Oct 24 10:57:48 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:57:48 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
Message-ID: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>

Hello,

I am getting a weird problem with protein alignment using
NeedlemanWunsch in 1.7.1, in that the alignment does not span the
entire length of the proteins.  I've verified that this should not
happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
I'm reluctant to switch to BioJava3 at this time, since performance is
about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
350,000 of them.

An example of this alignment error, is shown here: http://pastebin.com/mdX516R6

Notice that the alignment stops 1 amino acid short of the end in both
cases.  The parameters for the alignment are: BLOSUM50, gapOpen=10,
gapExtend=2.

Thanks,
Chris

-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From andreas.draeger at uni-tuebingen.de  Sun Oct 24 12:01:05 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Draeger)
Date: Sun, 24 Oct 2010 18:01:05 +0200
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
Message-ID: <4CC45841.5080604@uni-tuebingen.de>

Hi Chris,

Thank you for reprorting this problem. It would be very nice if you
could also provide your source code. Then I would like to test what
happens. You can send source code, substitution matrix, and the two
example protein sequences that cause the problems directly to me. I'll
then have a look into it.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dr?ger
Eberhard Karls University T?bingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 T?bingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From cfriedline at vcu.edu  Sun Oct 24 14:04:25 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 14:04:25 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <4CC45841.5080604@uni-tuebingen.de>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
	<4CC45841.5080604@uni-tuebingen.de>
Message-ID: <AANLkTimdTxvpTuJC1j8Mfk9W1bXs3S8+5xw5UjNvzQ4Q@mail.gmail.com>

Thanks, Andreas.  I've sent you the information that you asked for below.

Chris

On Sun, Oct 24, 2010 at 12:01 PM, Andreas Draeger
<andreas.draeger at uni-tuebingen.de> wrote:
> Hi Chris,
>
> Thank you for reprorting this problem. It would be very nice if you
> could also provide your source code. Then I would like to test what
> happens. You can send source code, substitution matrix, and the two
> example protein sequences that cause the problems directly to me. I'll
> then have a look into it.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dr?ger
> Eberhard Karls University T?bingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 T?bingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: ? +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From koen.bruynseels at cropdesign.com  Mon Oct 25 12:15:59 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Mon, 25 Oct 2010 18:15:59 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID: <OFA688DB7B.E88D2042-ONC12577C7.00595AE1-C12577C7.00595AE1@basf-c-s.be>


I will be out of the office starting  10/25/2010 and will not return until
11/02/2010.

I will respond to your message when I return.


From andreas at sdsc.edu  Tue Oct 26 14:42:29 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 11:42:29 -0700
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
Message-ID: <AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>

Hi Chris,

about your comment that the biojava3-alignment is slower than the 1.7
one: Do you have any data if this is coming from the io or is the
actual alignment calculation slower?

Andreas

On Sun, Oct 24, 2010 at 7:57 AM, Chris Friedline <cfriedline at vcu.edu> wrote:
> Hello,
>
> I am getting a weird problem with protein alignment using
> NeedlemanWunsch in 1.7.1, in that the alignment does not span the
> entire length of the proteins. ?I've verified that this should not
> happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
> I'm reluctant to switch to BioJava3 at this time, since performance is
> about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
> 350,000 of them.
>
> An example of this alignment error, is shown here: http://pastebin.com/mdX516R6
>
> Notice that the alignment stops 1 amino acid short of the end in both
> cases. ?The parameters for the alignment are: BLOSUM50, gapOpen=10,
> gapExtend=2.
>
> Thanks,
> Chris
>
> --
> PhD Candidate, Integrative Life Sciences
> Virginia Commonwealth University
> Richmond, VA
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From cfriedline at vcu.edu  Tue Oct 26 15:21:39 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:21:39 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
	<AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>
Message-ID: <AANLkTinhUCwcdsOtJg5jdG-wwqiCx9oevSqfzLi_96de@mail.gmail.com>

Hi Andreas,

The io should be the same, since I've used the same set of genes for testing
both.  So, I'm guessing it's either the alignment calculation or the new
biojava design contributing to the slowness.

Chris

On Tue, Oct 26, 2010 at 2:42 PM, Andreas Prlic <andreas at sdsc.edu> wrote:

> Hi Chris,
>
> about your comment that the biojava3-alignment is slower than the 1.7
> one: Do you have any data if this is coming from the io or is the
> actual alignment calculation slower?
>
> Andreas
>
> On Sun, Oct 24, 2010 at 7:57 AM, Chris Friedline <cfriedline at vcu.edu>
> wrote:
> > Hello,
> >
> > I am getting a weird problem with protein alignment using
> > NeedlemanWunsch in 1.7.1, in that the alignment does not span the
> > entire length of the proteins.  I've verified that this should not
> > happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
> > I'm reluctant to switch to BioJava3 at this time, since performance is
> > about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
> > 350,000 of them.
> >
> > An example of this alignment error, is shown here:
> http://pastebin.com/mdX516R6
> >
> > Notice that the alignment stops 1 amino acid short of the end in both
> > cases.  The parameters for the alignment are: BLOSUM50, gapOpen=10,
> > gapExtend=2.
> >
> > Thanks,
> > Chris
> >
> > --
> > PhD Candidate, Integrative Life Sciences
> > Virginia Commonwealth University
> > Richmond, VA
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From cfriedline at vcu.edu  Tue Oct 26 15:29:30 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:29:30 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTi=4nF1_t6z=zot2xnRWR_9KRxBeTuyfk_PXD0P_@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
	<AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>
	<AANLkTinJ01kX5wBK-wno=rLOMjRpHeYrF4NEi+5jCPBz@mail.gmail.com>
	<AANLkTi=4nF1_t6z=zot2xnRWR_9KRxBeTuyfk_PXD0P_@mail.gmail.com>
Message-ID: <AANLkTikBy2Ub3e2OWsc1+9hHca+bRJ859XWZUDNUsS69@mail.gmail.com>

That's something I'll need to go back and revisit after my deadline
passes at the end of this week. Initially, I was creating them on the
fly at the time of alignment, but it would be more efficient to store
them that way in the gene object itself. ?I was also passing an
InputStreamReader for the substitution matrix each time (pulling the
matrix from my jar), but storing it as a string would also be a better
option, especially since I'm threading and there are so many
alignments.

Chris

On Tue, Oct 26, 2010 at 3:23 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>
> ok, how do you create the biojava3 Sequence objects? just trying to
> find out where the bottlenecks are, so we can fix them...
>
> A
>
> On Tue, Oct 26, 2010 at 12:20 PM, Chris Friedline <cfriedline at vcu.edu> wrote:
> > Hi,
> > The io should be the same, since I've used the same set of genes for testing
> > both. ?So, it's either the alignment calculation or the new biojava design
> > contributing to the slowness.
> > Chris
> >
> > On Tue, Oct 26, 2010 at 2:42 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>
> >> Hi Chris,
> >>
> >> about your comment that the biojava3-alignment is slower than the 1.7
> >> one: Do you have any data if this is coming from the io or is the
> >> actual alignment calculation slower?
> >>
> >> Andreas
> >>
> >> On Sun, Oct 24, 2010 at 7:57 AM, Chris Friedline <cfriedline at vcu.edu>
> >> wrote:
> >> > Hello,
> >> >
> >> > I am getting a weird problem with protein alignment using
> >> > NeedlemanWunsch in 1.7.1, in that the alignment does not span the
> >> > entire length of the proteins. ?I've verified that this should not
> >> > happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
> >> > I'm reluctant to switch to BioJava3 at this time, since performance is
> >> > about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
> >> > 350,000 of them.
> >> >
> >> > An example of this alignment error, is shown here:
> >> > http://pastebin.com/mdX516R6
> >> >
> >> > Notice that the alignment stops 1 amino acid short of the end in both
> >> > cases. ?The parameters for the alignment are: BLOSUM50, gapOpen=10,
> >> > gapExtend=2.
> >> >
> >> > Thanks,
> >> > Chris
> >> >
> >> > --
> >> > PhD Candidate, Integrative Life Sciences
> >> > Virginia Commonwealth University
> >> > Richmond, VA
> >> > _______________________________________________
> >> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> >> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >> >
> >>
> >>
> >>
> >> --
> >> -----------------------------------------------------------------------
> >> Dr. Andreas Prlic
> >> Senior Scientist, RCSB PDB Protein Data Bank
> >> University of California, San Diego
> >> (+1) 858.246.0526
> >> -----------------------------------------------------------------------
> >
> >
> >
> > --
> > PhD Candidate, Integrative Life Sciences
> > Virginia Commonwealth University
> > Richmond, VA
> >
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------


--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From andreas.draeger at uni-tuebingen.de  Tue Oct 26 18:18:00 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Tue, 26 Oct 2010 23:18:00 +0100
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTikBy2Ub3e2OWsc1+9hHca+bRJ859XWZUDNUsS69@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>	<AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>	<AANLkTinJ01kX5wBK-wno=rLOMjRpHeYrF4NEi+5jCPBz@mail.gmail.com>	<AANLkTi=4nF1_t6z=zot2xnRWR_9KRxBeTuyfk_PXD0P_@mail.gmail.com>
	<AANLkTikBy2Ub3e2OWsc1+9hHca+bRJ859XWZUDNUsS69@mail.gmail.com>
Message-ID: <4CC75398.7000301@uni-tuebingen.de>

Hi all,

By the way, I would like to mention that the bug has been fixed. It was 
a problem with the way how the alignment was presented to the user 
afterwards, i.e., a problem of the formatting algorithm. The alignment 
itself was correct and also when obtaining the GappedSequences after the 
alignment, these were correct. The problem was that the formatter was 
started with the original lenght of the sequences, which is usually to 
short after inserting gaps. This is now solved and the alignment should 
work fine now.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dr?ger
Eberhard Karls University T?bingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 T?bingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From dasarnow at gmail.com  Tue Oct 26 23:54:43 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 20:54:43 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
Message-ID: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>

Hi all,
Let me first say thanks to all the BioJava community members for
delivering such a useful set of libraries, and that I'm still a newbie
when it comes to BioJava (and Java) so forgive me if my question is
too trivial.

I am doing work on lots (at least thousands) of PDB files from RCSB.
As is commonly known, these are often rife with errors which can lead
to exceptions during parsing with PDBFileParser.  Because
PDBFileParser's methods contain their own try-catch blocks, exception
propagation stops there and my code proceeds blindly along regardless
of any error checking I do.  I would like to catch the exceptions up
in my code where the parser is called, so that I can branch to a
continue statement and have my batch processing loops move on to the
next file.
Should I edit out the try-catch blocks and compile my own version of
the library?  Or should I test the returned StructureImpl objects for
possession of the fields in question?  In that case, I'm not sure
which properties will give the most general success information...and
I'd rather not have to check for /every/ property being correct.

If there is some great way to check if an exception was caught down a
series of nested method calls, please hit me over the head with it.

Thanks!

-da

From andreas at sdsc.edu  Wed Oct 27 00:11:28 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 21:11:28 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
Message-ID: <AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>

Hi Daniel,

can you explain a bit more what you are doing, in particular what
errors you would like to deal with on your end?  You should not need
to worry too much about exception handling. Are there any special
cases you are interested in?  In this case we should support you with
a clean interface rather than exception handling from your end...

Andreas


On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Hi all,
> Let me first say thanks to all the BioJava community members for
> delivering such a useful set of libraries, and that I'm still a newbie
> when it comes to BioJava (and Java) so forgive me if my question is
> too trivial.
>
> I am doing work on lots (at least thousands) of PDB files from RCSB.
> As is commonly known, these are often rife with errors which can lead
> to exceptions during parsing with PDBFileParser. ?Because
> PDBFileParser's methods contain their own try-catch blocks, exception
> propagation stops there and my code proceeds blindly along regardless
> of any error checking I do. ?I would like to catch the exceptions up
> in my code where the parser is called, so that I can branch to a
> continue statement and have my batch processing loops move on to the
> next file.
> Should I edit out the try-catch blocks and compile my own version of
> the library? ?Or should I test the returned StructureImpl objects for
> possession of the fields in question? ?In that case, I'm not sure
> which properties will give the most general success information...and
> I'd rather not have to check for /every/ property being correct.
>
> If there is some great way to check if an exception was caught down a
> series of nested method calls, please hit me over the head with it.
>
> Thanks!
>
> -da
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From dasarnow at gmail.com  Wed Oct 27 00:59:56 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 21:59:56 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
Message-ID: <AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>

Glad to hear it, who doesn't like support or clean interfaces?.  No
offense intended, by the way, with respect to PDB errors - obviously
the PDB is an indispensable resource for all protein scientists.

I am looking at many (fixed-length) pieces of protein chains and doin'
stuff with 'em.  My current code has a pair of nested while loops; the
outer iterates over PDB entries (locally rsync'd copy), parsing them
and the inner iterates over the pieces from each.  When
StructureExceptions come out of my PDBFileReader object I want to
continue the outer loop, moving on to the next set of files without
executing any of the code that depends on correct StructureImpl
objects from the reader (database updates, the inner loop).
Since the reader's methods have their own try-catch blocks, a thrown
StructureException is stopped there and never reaches my own error
handling.  I just need to know when those errors occur so I can skip
those proteins - I am presuming that the correct entries will outweigh
the problem ones by a significant factor and the overall data wont be
seriously impacted.

-da

On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> can you explain a bit more what you are doing, in particular what
> errors you would like to deal with on your end? ?You should not need
> to worry too much about exception handling. Are there any special
> cases you are interested in? ?In this case we should support you with
> a clean interface rather than exception handling from your end...
>
> Andreas
>
>
>
> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> Hi all,
>> Let me first say thanks to all the BioJava community members for
>> delivering such a useful set of libraries, and that I'm still a newbie
>> when it comes to BioJava (and Java) so forgive me if my question is
>> too trivial.
>>
>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>> As is commonly known, these are often rife with errors which can lead
>> to exceptions during parsing with PDBFileParser. ?Because
>> PDBFileParser's methods contain their own try-catch blocks, exception
>> propagation stops there and my code proceeds blindly along regardless
>> of any error checking I do. ?I would like to catch the exceptions up
>> in my code where the parser is called, so that I can branch to a
>> continue statement and have my batch processing loops move on to the
>> next file.
>> Should I edit out the try-catch blocks and compile my own version of
>> the library? ?Or should I test the returned StructureImpl objects for
>> possession of the fields in question? ?In that case, I'm not sure
>> which properties will give the most general success information...and
>> I'd rather not have to check for /every/ property being correct.
>>
>> If there is some great way to check if an exception was caught down a
>> series of nested method calls, please hit me over the head with it.
>>
>> Thanks!
>>
>> -da
>> _______________________________________________
>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From dasarnow at gmail.com  Wed Oct 27 01:03:59 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 22:03:59 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
Message-ID: <AANLkTi=56JhQc=Joddg2i1foNW4sw9r-QTHHTEF_p133@mail.gmail.com>

I think that would be perfect...and of course I'm happy perform
testing on whatever gets cooked up.

-da

2010/10/26 Amr Al-Hossary <amr_alhossary at hotmail.com>:
> We can?add some thing like an exception tracing queue, that can be?checked
> for later by the caller.
>
> would that be OK?
>
> Amr
>
>> Date: Tue, 26 Oct 2010 21:11:28 -0700
>> From: andreas at sdsc.edu
>> To: dasarnow at gmail.com
>> CC: biojava-l at lists.open-bio.org
>> Subject: Re: [Biojava-l] Bad PDB files and batch processing with
>> PDBFileReader
>>
>> Hi Daniel,
>>
>> can you explain a bit more what you are doing, in particular what
>> errors you would like to deal with on your end? You should not need
>> to worry too much about exception handling. Are there any special
>> cases you are interested in? In this case we should support you with
>> a clean interface rather than exception handling from your end...
>>
>> Andreas
>>
>>
>>
>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com>
>> wrote:
>> > Hi all,
>> > Let me first say thanks to all the BioJava community members for
>> > delivering such a useful set of libraries, and that I'm still a newbie
>> > when it comes to BioJava (and Java) so forgive me if my question is
>> > too trivial.
>> >
>> > I am doing work on lots (at least thousands) of PDB files from RCSB.
>> > As is commonly known, these are often rife with errors which can lead
>> > to exceptions during parsing with PDBFileParser. ?Because
>> > PDBFileParser's methods contain their own try-catch blocks, exception
>> > propagation stops there and my code proceeds blindly along regardless
>> > of any error checking I do. ?I would like to catch the exceptions up
>> > in my code where the parser is called, so that I can branch to a
>> > continue statement and have my batch processing loops move on to the
>> > next file.
>> > Should I edit out the try-catch blocks and compile my own version of
>> > the library? ?Or should I test the returned StructureImpl objects for
>> > possession of the fields in question? ?In that case, I'm not sure
>> > which properties will give the most general success information...and
>> > I'd rather not have to check for /every/ property being correct.
>> >
>> > If there is some great way to check if an exception was caught down a
>> > series of nested method calls, please hit me over the head with it.
>> >
>> > Thanks!
>> >
>> > -da
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From andreas at sdsc.edu  Wed Oct 27 01:19:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 22:19:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
Message-ID: <AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>

Hi Daniel,

PDB files are better nowadays, due to remediation, however there are
still issues..

it sounds like you just want to figure out how to do the try/catch
block properly. You could do something like that:
		
		boolean splitFileOrganisation = true;
		AtomCache cache = new
AtomCache("/path/to/your/installation/",splitFileOrganisation);
		
		String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
		
		for (String pdbID : pdbIDs){
			
			try {
				Structure s = cache.getStructure(pdbID);
				if ( s == null) {
					System.out.println("could not find structure " + pdbID);
					continue;
				}
				// do something with the structure - your inner loop
				System.out.println(s);
				
			} catch (Exception e){
				// something crazy happened...
				System.err.println("Can't load structure " + pdbID + " reason: " +
e.getMessage());
				e.printStackTrace();
			}
		}
		

On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Glad to hear it, who doesn't like support or clean interfaces?. ?No
> offense intended, by the way, with respect to PDB errors - obviously
> the PDB is an indispensable resource for all protein scientists.
>
> I am looking at many (fixed-length) pieces of protein chains and doin'
> stuff with 'em. ?My current code has a pair of nested while loops; the
> outer iterates over PDB entries (locally rsync'd copy), parsing them
> and the inner iterates over the pieces from each. ?When
> StructureExceptions come out of my PDBFileReader object I want to
> continue the outer loop, moving on to the next set of files without
> executing any of the code that depends on correct StructureImpl
> objects from the reader (database updates, the inner loop).
> Since the reader's methods have their own try-catch blocks, a thrown
> StructureException is stopped there and never reaches my own error
> handling. ?I just need to know when those errors occur so I can skip
> those proteins - I am presuming that the correct entries will outweigh
> the problem ones by a significant factor and the overall data wont be
> seriously impacted.
>
> -da
>
> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> can you explain a bit more what you are doing, in particular what
>> errors you would like to deal with on your end? ?You should not need
>> to worry too much about exception handling. Are there any special
>> cases you are interested in? ?In this case we should support you with
>> a clean interface rather than exception handling from your end...
>>
>> Andreas
>>
>>
>>
>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> Hi all,
>>> Let me first say thanks to all the BioJava community members for
>>> delivering such a useful set of libraries, and that I'm still a newbie
>>> when it comes to BioJava (and Java) so forgive me if my question is
>>> too trivial.
>>>
>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>> As is commonly known, these are often rife with errors which can lead
>>> to exceptions during parsing with PDBFileParser. ?Because
>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>> propagation stops there and my code proceeds blindly along regardless
>>> of any error checking I do. ?I would like to catch the exceptions up
>>> in my code where the parser is called, so that I can branch to a
>>> continue statement and have my batch processing loops move on to the
>>> next file.
>>> Should I edit out the try-catch blocks and compile my own version of
>>> the library? ?Or should I test the returned StructureImpl objects for
>>> possession of the fields in question? ?In that case, I'm not sure
>>> which properties will give the most general success information...and
>>> I'd rather not have to check for /every/ property being correct.
>>>
>>> If there is some great way to check if an exception was caught down a
>>> series of nested method calls, please hit me over the head with it.
>>>
>>> Thanks!
>>>
>>> -da
>>> _______________________________________________
>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>>


From andreas at sdsc.edu  Wed Oct 27 02:01:38 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 23:01:38 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
Message-ID: <AANLkTin_ucD4u_CgyMwbDX9+mFTVYksjSzZVtUAuefvp@mail.gmail.com>

Hi Amr,

2010/10/26 Amr Al-Hossary <amr_alhossary at hotmail.com>:
> We can?add some thing like an exception tracing queue, that can be?checked
> for later by the caller.

thanks for your suggestion. In terms of API I would prefer if we can
separare a user from inconsistencies in the files and I hope we won't
need such a queue...  If something is off, the code is written to
ignore or work around issues...

Abdreas


> would that be OK?
>
> Amr
>
>> Date: Tue, 26 Oct 2010 21:11:28 -0700
>> From: andreas at sdsc.edu
>> To: dasarnow at gmail.com
>> CC: biojava-l at lists.open-bio.org
>> Subject: Re: [Biojava-l] Bad PDB files and batch processing with
>> PDBFileReader
>>
>> Hi Daniel,
>>
>> can you explain a bit more what you are doing, in particular what
>> errors you would like to deal with on your end? You should not need
>> to worry too much about exception handling. Are there any special
>> cases you are interested in? In this case we should support you with
>> a clean interface rather than exception handling from your end...
>>
>> Andreas
>>
>>
>>
>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com>
>> wrote:
>> > Hi all,
>> > Let me first say thanks to all the BioJava community members for
>> > delivering such a useful set of libraries, and that I'm still a newbie
>> > when i! t comes to BioJava (and Java) so forgive me if my question is
>> > too trivial.
>> >
>> > I am doing work on lots (at least thousands) of PDB files from RCSB.
>> > As is commonly known, these are often rife with errors which can lead
>> > to exceptions during parsing with PDBFileParser. ?Because
>> > PDBFileParser's methods contain their own try-catch blocks, exception
>> > propagation stops there and my code proceeds blindly along regardless
>> > of any error checking I do. ?I would like to catch the exceptions up
>> > in my code where the parser is called, so that I can branch to a
>> > continue statement and have my batch processing loops move on to the
>> > next file.
>> > Should I edit out the try-catch blocks and compile my own version of
>> > the library? ?Or should I test the returned StructureImpl objects for
>> > possession of the fields i! n question? ?In that case, I'm not sure
>> > which proper ties will give the most general success information...and
>> > I'd rather not have to check for /every/ property being correct.
>> >
>> > If there is some great way to check if an exception was caught down a
>> > series of nested method calls, please hit me over the head with it.
>> >
>> > Thanks!
>> >
>> > -da
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>>
>>
>>


From dasarnow at gmail.com  Wed Oct 27 03:26:22 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 00:26:22 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
Message-ID: <AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>

I assume AtomCache is a new class in BioJava3?

I must give you my embarrassed apology...after a bunch of testing I
finally figured out that I had misunderstood where the Parser's error
handling returns control and started going after the wrong exceptions.
 It does looks like if setParseCAOnly is true, the reader excepts on
chains with no CA's instead of just skipping them, though the other
chains are still parsed into the structure.

-da

On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> PDB files are better nowadays, due to remediation, however there are
> still issues..
>
> it sounds like you just want to figure out how to do the try/catch
> block properly. You could do something like that:
>
> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
> ? ? ? ? ? ? ? ?AtomCache cache = new
> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>
> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>
> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>
> ? ? ? ? ? ? ? ? ? ? ? ?try {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>
> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
> e.getMessage());
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
> ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?}
>
>
>
>
> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>> offense intended, by the way, with respect to PDB errors - obviously
>> the PDB is an indispensable resource for all protein scientists.
>>
>> I am looking at many (fixed-length) pieces of protein chains and doin'
>> stuff with 'em. ?My current code has a pair of nested while loops; the
>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>> and the inner iterates over the pieces from each. ?When
>> StructureExceptions come out of my PDBFileReader object I want to
>> continue the outer loop, moving on to the next set of files without
>> executing any of the code that depends on correct StructureImpl
>> objects from the reader (database updates, the inner loop).
>> Since the reader's methods have their own try-catch blocks, a thrown
>> StructureException is stopped there and never reaches my own error
>> handling. ?I just need to know when those errors occur so I can skip
>> those proteins - I am presuming that the correct entries will outweigh
>> the problem ones by a significant factor and the overall data wont be
>> seriously impacted.
>>
>> -da
>>
>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> can you explain a bit more what you are doing, in particular what
>>> errors you would like to deal with on your end? ?You should not need
>>> to worry too much about exception handling. Are there any special
>>> cases you are interested in? ?In this case we should support you with
>>> a clean interface rather than exception handling from your end...
>>>
>>> Andreas
>>>
>>>
>>>
>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> Hi all,
>>>> Let me first say thanks to all the BioJava community members for
>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>> too trivial.
>>>>
>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>> As is commonly known, these are often rife with errors which can lead
>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>> propagation stops there and my code proceeds blindly along regardless
>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>> in my code where the parser is called, so that I can branch to a
>>>> continue statement and have my batch processing loops move on to the
>>>> next file.
>>>> Should I edit out the try-catch blocks and compile my own version of
>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>> possession of the fields in question? ?In that case, I'm not sure
>>>> which properties will give the most general success information...and
>>>> I'd rather not have to check for /every/ property being correct.
>>>>
>>>> If there is some great way to check if an exception was caught down a
>>>> series of nested method calls, please hit me over the head with it.
>>>>
>>>> Thanks!
>>>>
>>>> -da
>>>> _______________________________________________
>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>
>>>
>


From jc.lucky at laposte.net  Wed Oct 27 04:11:13 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 10:11:13 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
Message-ID: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>


I tried once again with the new version of BioJava but without succeding. Any idea or suggestion?

Thanks in advance
Regards,

Jean-Charles Ferri?res


> Message du 22/10/10 10:11
> De : "jc.lucky" 
> A : biojava-l at lists.open-bio.org
> Copie ? : 
> Objet : [Biojava-l] Retrieve Information from GenBank file
>
> 
> Hi
> 
> I'm trying to convert a GenBank file into a rdf file. The gene of interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> 
> With the below code I can read the GenBank file and I manage to retrieve information and convert them in a rdf format. However I don't succeed in retrieving some information such as Title, protein or product. According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is possible to do so. 
> Please help me find what I do wrong or what should be done to achieve my goal.
> 
> //read the GeneBank File
> public static RichSequenceIterator readFile(String input,
> RichSequenceBuilderFactory seqFactory,
> Namespace ns)
> throws IOException, NoSuchElementException, BioException
> {
> ns = null;
> InputStream stream = new FileInputStream(input);
> BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
> RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile,ns); 
> return seqs;
> }
> 
> //Retrieve information and convert them in rdf format
> public void writeToRDFFile(RichSequenceIterator rsi, String output)
> throws IOException, NoSuchElementException, BioException {
> //create model for the ontology
> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
> OntClass parents;
> String URI = "http://pbr.wur.nl/#";
> 
> while(rsi.hasNext())
> {
> RichSequence seq = rsi.nextRichSequence();
> String id = seq.getName(); 
> parents = model.createClass(URI + id);
> Set author = seq.getRankedDocRefs();//code to clean up Set&convert toString
> String definition = seq.getDescription(); //code to clean up String
> //Add to model
> parents.addProperty(DC.description, definition);
> parents.addProperty(DC.publisher, authors);
> parents.addComment(taxonomy, "EN");
> parents.addProperty(DC.type, organism);
> //print in rdf format
> model.write(out, "RDF/XML");
> out.close(); }
> }
> 
> 
> Thanks,
> Jean-Charles Ferri?res
_____________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From willishf at ufl.edu  Wed Oct 27 06:41:06 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 27 Oct 2010 06:41:06 -0400
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
Message-ID: <AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>

Jean-Charles

I have it on my list to do a GenBank parser but haven't had the time. I
can't promise anything in the next couple weeks. Can you send some details
about what a typical use case is for your purpose? Are you trying to get the
sequence data or are you more interested in the features?

Thanks

Scooter

On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky <jc.lucky at laposte.net> wrote:

>
> I tried once again with the new version of BioJava but without succeding.
> Any idea or suggestion?
>
> Thanks in advance
> Regards,
>
> Jean-Charles Ferri?res
>
>
> > Message du 22/10/10 10:11
> > De : "jc.lucky"
> > A : biojava-l at lists.open-bio.org
> > Copie ? :
> > Objet : [Biojava-l] Retrieve Information from GenBank file
> >
> >
> > Hi
> >
> > I'm trying to convert a GenBank file into a rdf file. The gene of
> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> >
> > With the below code I can read the GenBank file and I manage to retrieve
> information and convert them in a rdf format. However I don't succeed in
> retrieving some information such as Title, protein or product. According to
> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
> possible to do so.
> > Please help me find what I do wrong or what should be done to achieve my
> goal.
> >
> > //read the GeneBank File
> > public static RichSequenceIterator readFile(String input,
> > RichSequenceBuilderFactory seqFactory,
> > Namespace ns)
> > throws IOException, NoSuchElementException, BioException
> > {
> > ns = null;
> > InputStream stream = new FileInputStream(input);
> > BufferedReader rdfFile = new BufferedReader(new
> InputStreamReader(stream));
> > RichSequenceIterator seqs =
> RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
> > return seqs;
> > }
> >
> > //Retrieve information and convert them in rdf format
> > public void writeToRDFFile(RichSequenceIterator rsi, String output)
> > throws IOException, NoSuchElementException, BioException {
> > //create model for the ontology
> > OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
> null);
> > OntClass parents;
> > String URI = "http://pbr.wur.nl/#";
> >
> > while(rsi.hasNext())
> > {
> > RichSequence seq = rsi.nextRichSequence();
> > String id = seq.getName();
> > parents = model.createClass(URI + id);
> > Set author = seq.getRankedDocRefs();//code to clean up Set&convert
> toString
> > String definition = seq.getDescription(); //code to clean up String
> > //Add to model
> > parents.addProperty(DC.description, definition);
> > parents.addProperty(DC.publisher, authors);
> > parents.addComment(taxonomy, "EN");
> > parents.addProperty(DC.type, organism);
> > //print in rdf format
> > model.write(out, "RDF/XML");
> > out.close(); }
> > }
> >
> >
> > Thanks,
> > Jean-Charles Ferri?res
> _____________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
> Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous
> tente ?
> Je cr?e ma bo?te mail www.laposte.net
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From jc.lucky at laposte.net  Wed Oct 27 09:03:55 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:03:55 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
	<AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
Message-ID: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>


I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data.
My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future.

Thanks,

Jean-Charles


> Message du 27/10/10 12:41
> De : "Scooter Willis" 
> A : "jc.lucky" 
> Copie ? : "biojava-l lists open-bio org" 
> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>
> Jean-Charles
> 
> I have it on my list to do a GenBank parser but haven't had the time. I
> can't promise anything in the next couple weeks. Can you send some details
> about what a typical use case is for your purpose? Are you trying to get the
> sequence data or are you more interested in the features?
> 
> Thanks
> 
> Scooter
> 
> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky  wrote:
> 
> >
> > I tried once again with the new version of BioJava but without succeding.
> > Any idea or suggestion?
> >
> > Thanks in advance
> > Regards,
> >
> > Jean-Charles Ferri?res
> >
> >
> > > Message du 22/10/10 10:11
> > > De : "jc.lucky"
> > > A : biojava-l at lists.open-bio.org
> > > Copie ? :
> > > Objet : [Biojava-l] Retrieve Information from GenBank file
> > >
> > >
> > > Hi
> > >
> > > I'm trying to convert a GenBank file into a rdf file. The gene of
> > interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> > >
> > > With the below code I can read the GenBank file and I manage to retrieve
> > information and convert them in a rdf format. However I don't succeed in
> > retrieving some information such as Title, protein or product. According to
> > this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
> > possible to do so.
> > > Please help me find what I do wrong or what should be done to achieve my
> > goal.
> > >
> > > //read the GeneBank File
> > > public static RichSequenceIterator readFile(String input,
> > > RichSequenceBuilderFactory seqFactory,
> > > Namespace ns)
> > > throws IOException, NoSuchElementException, BioException
> > > {
> > > ns = null;
> > > InputStream stream = new FileInputStream(input);
> > > BufferedReader rdfFile = new BufferedReader(new
> > InputStreamReader(stream));
> > > RichSequenceIterator seqs =
> > RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
> > > return seqs;
> > > }
> > >
> > > //Retrieve information and convert them in rdf format
> > > public void writeToRDFFile(RichSequenceIterator rsi, String output)
> > > throws IOException, NoSuchElementException, BioException {
> > > //create model for the ontology
> > > OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
> > null);
> > > OntClass parents;
> > > String URI = "http://pbr.wur.nl/#";
> > >
> > > while(rsi.hasNext())
> > > {
> > > RichSequence seq = rsi.nextRichSequence();
> > > String id = seq.getName();
> > > parents = model.createClass(URI + id);
> > > Set author = seq.getRankedDocRefs();//code to clean up Set&convert
> > toString
> > > String definition = seq.getDescription(); //code to clean up String
> > > //Add to model
> > > parents.addProperty(DC.description, definition);
> > > parents.addProperty(DC.publisher, authors);
> > > parents.addComment(taxonomy, "EN");
> > > parents.addProperty(DC.type, organism);
> > > //print in rdf format
> > > model.write(out, "RDF/XML");
> > > out.close(); }
> > > }
> > >
> > >
> > > Thanks,
> > > Jean-Charles Ferri?res
> > _____________________________________________
> > > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From holland at eaglegenomics.com  Wed Oct 27 09:16:56 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 27 Oct 2010 14:16:56 +0100
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
	<AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
	<21411489.155159.1288184635185.JavaMail.www@wwinf8222>
Message-ID: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>

Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().

This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2

cheers,
Richard

On 27 Oct 2010, at 14:03, jc.lucky wrote:

> 
> I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data.
> My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future.
> 
> Thanks,
> 
> Jean-Charles
> 
> 
> 
>> Message du 27/10/10 12:41
>> De : "Scooter Willis" 
>> A : "jc.lucky" 
>> Copie ? : "biojava-l lists open-bio org" 
>> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>> 
>> Jean-Charles
>> 
>> I have it on my list to do a GenBank parser but haven't had the time. I
>> can't promise anything in the next couple weeks. Can you send some details
>> about what a typical use case is for your purpose? Are you trying to get the
>> sequence data or are you more interested in the features?
>> 
>> Thanks
>> 
>> Scooter
>> 
>> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky  wrote:
>> 
>>> 
>>> I tried once again with the new version of BioJava but without succeding.
>>> Any idea or suggestion?
>>> 
>>> Thanks in advance
>>> Regards,
>>> 
>>> Jean-Charles Ferri?res
>>> 
>>> 
>>>> Message du 22/10/10 10:11
>>>> De : "jc.lucky"
>>>> A : biojava-l at lists.open-bio.org
>>>> Copie ? :
>>>> Objet : [Biojava-l] Retrieve Information from GenBank file
>>>> 
>>>> 
>>>> Hi
>>>> 
>>>> I'm trying to convert a GenBank file into a rdf file. The gene of
>>> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
>>>> 
>>>> With the below code I can read the GenBank file and I manage to retrieve
>>> information and convert them in a rdf format. However I don't succeed in
>>> retrieving some information such as Title, protein or product. According to
>>> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
>>> possible to do so.
>>>> Please help me find what I do wrong or what should be done to achieve my
>>> goal.
>>>> 
>>>> //read the GeneBank File
>>>> public static RichSequenceIterator readFile(String input,
>>>> RichSequenceBuilderFactory seqFactory,
>>>> Namespace ns)
>>>> throws IOException, NoSuchElementException, BioException
>>>> {
>>>> ns = null;
>>>> InputStream stream = new FileInputStream(input);
>>>> BufferedReader rdfFile = new BufferedReader(new
>>> InputStreamReader(stream));
>>>> RichSequenceIterator seqs =
>>> RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
>>>> return seqs;
>>>> }
>>>> 
>>>> //Retrieve information and convert them in rdf format
>>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
>>>> throws IOException, NoSuchElementException, BioException {
>>>> //create model for the ontology
>>>> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
>>> null);
>>>> OntClass parents;
>>>> String URI = "http://pbr.wur.nl/#";
>>>> 
>>>> while(rsi.hasNext())
>>>> {
>>>> RichSequence seq = rsi.nextRichSequence();
>>>> String id = seq.getName();
>>>> parents = model.createClass(URI + id);
>>>> Set author = seq.getRankedDocRefs();//code to clean up Set&convert
>>> toString
>>>> String definition = seq.getDescription(); //code to clean up String
>>>> //Add to model
>>>> parents.addProperty(DC.description, definition);
>>>> parents.addProperty(DC.publisher, authors);
>>>> parents.addComment(taxonomy, "EN");
>>>> parents.addProperty(DC.type, organism);
>>>> //print in rdf format
>>>> model.write(out, "RDF/XML");
>>>> out.close(); }
>>>> }
>>>> 
>>>> 
>>>> Thanks,
>>>> Jean-Charles Ferri?res
>>> _____________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
> Je cr?e ma bo?te mail www.laposte.net
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From jc.lucky at laposte.net  Wed Oct 27 09:34:22 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:34:22 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
	<AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
	<21411489.155159.1288184635185.JavaMail.www@wwinf8222>
	<3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>
Message-ID: <6229150.91865.1288186462649.JavaMail.www@wwinf8218>


Thanks for your reply and indeed as mentioned at the bottom that is what I use to try to retrieve the maximum of information. However and that is my problem the methods described do not provide the required information.
For example getRankedDocRefs() provides authors and Journals but no TITLE
getFeaturesSet() only provides /organism, /mol_type and /db_xref
Thereby I was asking for help and suggestion fo how to fix this "problem".

Best,
Jean-Charles


> Message du 27/10/10 15:17
> De : "Richard Holland" 
> A : "jc.lucky" 
> Copie ? : "Scooter Willis" , "biojava-l lists open-bio org" 
> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>
> 
> Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().
> 
> This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2
> 
> cheers,
> Richard
> 
> On 27 Oct 2010, at 14:03, jc.lucky wrote:
> 
> > 
> > I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data.
> > My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future.
> > 
> > Thanks,
> > 
> > Jean-Charles
> > 
> > 
> > 
> >> Message du 27/10/10 12:41
> >> De : "Scooter Willis" 
> >> A : "jc.lucky" 
> >> Copie ? : "biojava-l lists open-bio org" 
> >> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
> >> 
> >> Jean-Charles
> >> 
> >> I have it on my list to do a GenBank parser but haven't had the time. I
> >> can't promise anything in the next couple weeks. Can you send some details
> >> about what a typical use case is for your purpose? Are you trying to get the
> >> sequence data or are you more interested in the features?
> >> 
> >> Thanks
> >> 
> >> Scooter
> >> 
> >> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky wrote:
> >> 
> >>> 
> >>> I tried once again with the new version of BioJava but without succeding.
> >>> Any idea or suggestion?
> >>> 
> >>> Thanks in advance
> >>> Regards,
> >>> 
> >>> Jean-Charles Ferri?res
> >>> 
> >>> 
> >>>> Message du 22/10/10 10:11
> >>>> De : "jc.lucky"
> >>>> A : biojava-l at lists.open-bio.org
> >>>> Copie ? :
> >>>> Objet : [Biojava-l] Retrieve Information from GenBank file
> >>>> 
> >>>> 
> >>>> Hi
> >>>> 
> >>>> I'm trying to convert a GenBank file into a rdf file. The gene of
> >>> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> >>>> 
> >>>> With the below code I can read the GenBank file and I manage to retrieve
> >>> information and convert them in a rdf format. However I don't succeed in
> >>> retrieving some information such as Title, protein or product. According to
> >>> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
> >>> possible to do so.
> >>>> Please help me find what I do wrong or what should be done to achieve my
> >>> goal.
> >>>> 
> >>>> //read the GeneBank File
> >>>> public static RichSequenceIterator readFile(String input,
> >>>> RichSequenceBuilderFactory seqFactory,
> >>>> Namespace ns)
> >>>> throws IOException, NoSuchElementException, BioException
> >>>> {
> >>>> ns = null;
> >>>> InputStream stream = new FileInputStream(input);
> >>>> BufferedReader rdfFile = new BufferedReader(new
> >>> InputStreamReader(stream));
> >>>> RichSequenceIterator seqs =
> >>> RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
> >>>> return seqs;
> >>>> }
> >>>> 
> >>>> //Retrieve information and convert them in rdf format
> >>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
> >>>> throws IOException, NoSuchElementException, BioException {
> >>>> //create model for the ontology
> >>>> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
> >>> null);
> >>>> OntClass parents;
> >>>> String URI = "http://pbr.wur.nl/#";
> >>>> 
> >>>> while(rsi.hasNext())
> >>>> {
> >>>> RichSequence seq = rsi.nextRichSequence();
> >>>> String id = seq.getName();
> >>>> parents = model.createClass(URI + id);
> >>>> Set author = seq.getRankedDocRefs();//code to clean up Set&convert
> >>> toString
> >>>> String definition = seq.getDescription(); //code to clean up String
> >>>> //Add to model
> >>>> parents.addProperty(DC.description, definition);
> >>>> parents.addProperty(DC.publisher, authors);
> >>>> parents.addComment(taxonomy, "EN");
> >>>> parents.addProperty(DC.type, organism);
> >>>> //print in rdf format
> >>>> model.write(out, "RDF/XML");
> >>>> out.close(); }
> >>>> }
> >>>> 
> >>>> 
> >>>> Thanks,
> >>>> Jean-Charles Ferri?res
> >>> _____________________________________________
> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> > 
> > Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
> > Je cr?e ma bo?te mail www.laposte.net
> > 
> > 
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> 
> 
> 

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From andreas at sdsc.edu  Wed Oct 27 20:47:50 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 27 Oct 2010 17:47:50 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
Message-ID: <AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>

> I assume AtomCache is a new class in BioJava3?

yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0

>
> I must give you my embarrassed apology...after a bunch of testing I
> finally figured out that I had misunderstood where the Parser's error
> handling returns control and started going after the wrong exceptions.
> ?It does looks like if setParseCAOnly is true, the reader excepts on
> chains with no CA's instead of just skipping them, though the other
> chains are still parsed into the structure.

This sounds like there might be  a problem with CA only.. do you have
an example ID? also: are you on biojava 1.7 or 3.0 ?

Andreas


>
> -da
>
> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> PDB files are better nowadays, due to remediation, however there are
>> still issues..
>>
>> it sounds like you just want to figure out how to do the try/catch
>> block properly. You could do something like that:
>>
>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>> ? ? ? ? ? ? ? ?AtomCache cache = new
>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>
>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>
>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>
>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>
>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>> e.getMessage());
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>> ? ? ? ? ? ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ? ?}
>>
>>
>>
>>
>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>> offense intended, by the way, with respect to PDB errors - obviously
>>> the PDB is an indispensable resource for all protein scientists.
>>>
>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>> and the inner iterates over the pieces from each. ?When
>>> StructureExceptions come out of my PDBFileReader object I want to
>>> continue the outer loop, moving on to the next set of files without
>>> executing any of the code that depends on correct StructureImpl
>>> objects from the reader (database updates, the inner loop).
>>> Since the reader's methods have their own try-catch blocks, a thrown
>>> StructureException is stopped there and never reaches my own error
>>> handling. ?I just need to know when those errors occur so I can skip
>>> those proteins - I am presuming that the correct entries will outweigh
>>> the problem ones by a significant factor and the overall data wont be
>>> seriously impacted.
>>>
>>> -da
>>>
>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Daniel,
>>>>
>>>> can you explain a bit more what you are doing, in particular what
>>>> errors you would like to deal with on your end? ?You should not need
>>>> to worry too much about exception handling. Are there any special
>>>> cases you are interested in? ?In this case we should support you with
>>>> a clean interface rather than exception handling from your end...
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>> Hi all,
>>>>> Let me first say thanks to all the BioJava community members for
>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>> too trivial.
>>>>>
>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>> As is commonly known, these are often rife with errors which can lead
>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>> in my code where the parser is called, so that I can branch to a
>>>>> continue statement and have my batch processing loops move on to the
>>>>> next file.
>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>> which properties will give the most general success information...and
>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>
>>>>> If there is some great way to check if an exception was caught down a
>>>>> series of nested method calls, please hit me over the head with it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -da
>>>>> _______________________________________________
>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>>
>>>>
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From dasarnow at gmail.com  Thu Oct 28 00:05:18 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 21:05:18 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To: <AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
Message-ID: <AANLkTikjOML9RtOJJj2pRH-URhC7WVJdVaTobY82qWJF@mail.gmail.com>

I'm using 1.7, partially because my distro had a package for it and
partially because I was initially using the online Javadoc a lot.
PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
chain F appears to parse correctly.

-da

org.biojava.bio.structure.StructureException: could not find chain A
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: could not find chain B
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with
chainId >A<
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with
chainId >B<
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)


On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>> I assume AtomCache is a new class in BioJava3?
>
> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>
>>
>> I must give you my embarrassed apology...after a bunch of testing I
>> finally figured out that I had misunderstood where the Parser's error
>> handling returns control and started going after the wrong exceptions.
>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>> chains with no CA's instead of just skipping them, though the other
>> chains are still parsed into the structure.
>
> This sounds like there might be ?a problem with CA only.. do you have
> an example ID? also: are you on biojava 1.7 or 3.0 ?
>
> Andreas
>
>
>
>>
>> -da
>>
>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> PDB files are better nowadays, due to remediation, however there are
>>> still issues..
>>>
>>> it sounds like you just want to figure out how to do the try/catch
>>> block properly. You could do something like that:
>>>
>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>
>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>
>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>
>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>
>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>> e.getMessage());
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>> ? ? ? ? ? ? ? ?}
>>>
>>>
>>>
>>>
>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>> the PDB is an indispensable resource for all protein scientists.
>>>>
>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>> and the inner iterates over the pieces from each. ?When
>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>> continue the outer loop, moving on to the next set of files without
>>>> executing any of the code that depends on correct StructureImpl
>>>> objects from the reader (database updates, the inner loop).
>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>> StructureException is stopped there and never reaches my own error
>>>> handling. ?I just need to know when those errors occur so I can skip
>>>> those proteins - I am presuming that the correct entries will outweigh
>>>> the problem ones by a significant factor and the overall data wont be
>>>> seriously impacted.
>>>>
>>>> -da
>>>>
>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> can you explain a bit more what you are doing, in particular what
>>>>> errors you would like to deal with on your end? ?You should not need
>>>>> to worry too much about exception handling. Are there any special
>>>>> cases you are interested in? ?In this case we should support you with
>>>>> a clean interface rather than exception handling from your end...
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>> Hi all,
>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>> too trivial.
>>>>>>
>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>> continue statement and have my batch processing loops move on to the
>>>>>> next file.
>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>> which properties will give the most general success information...and
>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>
>>>>>> If there is some great way to check if an exception was caught down a
>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -da
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From andreas at sdsc.edu  Thu Oct 28 13:28:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 10:28:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
Message-ID: <AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>

Hi Daniel,

I just checked, this is a bug which is already resolved in 3.0... If
it is an issue for you, you might want to upgrade... (should be very
easy, if you start using Maven ...)

Thanks,
Andreas

On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> I'm using 1.7, partially because my distro had a package for it and
> partially because I was initially using the online Javadoc a lot.
> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
> chain F appears to parse correctly.
>
> -da
>
> org.biojava.bio.structure.StructureException: could not find chain A
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
> org.biojava.bio.structure.StructureException: could not find chain B
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
> org.biojava.bio.structure.StructureException: did not find chain with
> chainId >A<
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
> org.biojava.bio.structure.StructureException: did not find chain with
> chainId >B<
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
>
>
> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> I assume AtomCache is a new class in BioJava3?
>>
>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>
>>>
>>> I must give you my embarrassed apology...after a bunch of testing I
>>> finally figured out that I had misunderstood where the Parser's error
>>> handling returns control and started going after the wrong exceptions.
>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>> chains with no CA's instead of just skipping them, though the other
>>> chains are still parsed into the structure.
>>
>> This sounds like there might be ?a problem with CA only.. do you have
>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>
>> Andreas
>>
>>
>>
>>>
>>> -da
>>>
>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Daniel,
>>>>
>>>> PDB files are better nowadays, due to remediation, however there are
>>>> still issues..
>>>>
>>>> it sounds like you just want to figure out how to do the try/catch
>>>> block properly. You could do something like that:
>>>>
>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>
>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>
>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>
>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>
>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>> e.getMessage());
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>> ? ? ? ? ? ? ? ?}
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>
>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>> and the inner iterates over the pieces from each. ?When
>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>> continue the outer loop, moving on to the next set of files without
>>>>> executing any of the code that depends on correct StructureImpl
>>>>> objects from the reader (database updates, the inner loop).
>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>> StructureException is stopped there and never reaches my own error
>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>> the problem ones by a significant factor and the overall data wont be
>>>>> seriously impacted.
>>>>>
>>>>> -da
>>>>>
>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>> to worry too much about exception handling. Are there any special
>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>> a clean interface rather than exception handling from your end...
>>>>>>
>>>>>> Andreas
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>> too trivial.
>>>>>>>
>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>> next file.
>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>> which properties will give the most general success information...and
>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>
>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -da
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From vishalthapar at gmail.com  Thu Oct 28 13:40:49 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Thu, 28 Oct 2010 13:40:49 -0400
Subject: [Biojava-l] K-mers
Message-ID: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>

Hi All,

I had a quick question: Does Biojava have a method to generate k-mers or
K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
counts for every sequence in a fasta file. If something like this exists it
would save me some time to write the code.

Thanks,

Vishal

From jayunit100 at gmail.com  Thu Oct 28 15:43:17 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 15:43:17 -0400
Subject: [Biojava-l] biojava maven integration
Message-ID: <AANLkTinjaDZ_BGUXk3F=DbNbxppky8548zqm2d2PNvP8@mail.gmail.com>

Hi guys, I added the following to my pom file

  <dependency>
        <groupId>org.biojava</groupId>
        <artifactId>biojava</artifactId>
        <version>3.0-alpha2</version>
   </dependency>

<repository>
        <id>biojava-maven-repo</id>
        <name>BioJava repository</name>
        <url>http://www.biojava.org/download/maven/</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
        <releases>
            <enabled>true</enabled>
        </releases>
    </repository>
 <repository>

But to no avail.  Does anyone know how to add biojava3 to the libraries in a
maven managed application >?

Thanks.

From jayunit100 at gmail.com  Thu Oct 28 18:51:25 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 18:51:25 -0400
Subject: [Biojava-l] biojava maven integration
In-Reply-To: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
References: <AANLkTinjaDZ_BGUXk3F=DbNbxppky8548zqm2d2PNvP8@mail.gmail.com>
	<5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
Message-ID: <AANLkTi=FV0-Wh6Abh0DOGCWZfMicobGo9gjkcRMSJUBx@mail.gmail.com>

Does anybody have a maven POM example of how to integrate biojava into my
application ?
Thanks!

Im currently using biojava 1.7, and have put it in my own, local maven
repository.


On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy <andy.law at roslin.ed.ac.uk> wrote:

> Not 100% certain but I *think* you want to depend on biojava-core rather
> than biojava.
>
> Later,
>
> Andy
>
> On 28 Oct 2010, at 20:43, Jay Vyas wrote:
>
> > Hi guys, I added the following to my pom file
> >
> >  <dependency>
> >        <groupId>org.biojava</groupId>
> >        <artifactId>biojava</artifactId>
> >        <version>3.0-alpha2</version>
> >   </dependency>
> >
> > <repository>
> >        <id>biojava-maven-repo</id>
> >        <name>BioJava repository</name>
> >        <url>http://www.biojava.org/download/maven/</url>
> >        <snapshots>
> >            <enabled>true</enabled>
> >        </snapshots>
> >        <releases>
> >            <enabled>true</enabled>
> >        </releases>
> >    </repository>
> > <repository>
> >
> > But to no avail.  Does anyone know how to add biojava3 to the libraries
> in a
> > maven managed application >?
> >
> > Thanks.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


-- 
Jay Vyas
MMSB/UCHC

From dasarnow at gmail.com  Thu Oct 28 19:45:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:45:05 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
	<AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
Message-ID: <AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>

It's not a big deal - after all if you use CA only, chains with no
CA's aren't important, and the error messages aren't that long.  But
I'm going to switch anyway...
I'm getting the dreaded "can't read line length in file" error while
trying to checkout biojava-live/trunk, though.

-da

On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> I just checked, this is a bug which is already resolved in 3.0... If
> it is an issue for you, you might want to upgrade... (should be very
> easy, if you start using Maven ...)
>
> Thanks,
> Andreas
>
> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> I'm using 1.7, partially because my distro had a package for it and
>> partially because I was initially using the online Javadoc a lot.
>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
>> chain F appears to parse correctly.
>>
>> -da
>>
>> org.biojava.bio.structure.StructureException: could not find chain A
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: could not find chain B
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: did not find chain with
>> chainId >A<
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: did not find chain with
>> chainId >B<
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>
>>
>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> I assume AtomCache is a new class in BioJava3?
>>>
>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>
>>>>
>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>> finally figured out that I had misunderstood where the Parser's error
>>>> handling returns control and started going after the wrong exceptions.
>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>>> chains with no CA's instead of just skipping them, though the other
>>>> chains are still parsed into the structure.
>>>
>>> This sounds like there might be ?a problem with CA only.. do you have
>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>
>>> Andreas
>>>
>>>
>>>
>>>>
>>>> -da
>>>>
>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>> still issues..
>>>>>
>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>> block properly. You could do something like that:
>>>>>
>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>>
>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>
>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>>
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>>
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>>> e.getMessage());
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>> ? ? ? ? ? ? ? ?}
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>
>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>> and the inner iterates over the pieces from each. ?When
>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>> objects from the reader (database updates, the inner loop).
>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>> StructureException is stopped there and never reaches my own error
>>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>> the problem ones by a significant factor and the overall data wont be
>>>>>> seriously impacted.
>>>>>>
>>>>>> -da
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>
>>>>>>> Andreas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>> too trivial.
>>>>>>>>
>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>> next file.
>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>>> which properties will give the most general success information...and
>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>
>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -da
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From dasarnow at gmail.com  Thu Oct 28 19:51:25 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:51:25 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
	<AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
	<AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>
Message-ID: <AANLkTimzVwJLyzyTZvmf0=og+t7xyc2z4OKTf1RNN5Cy@mail.gmail.com>

Ahh, I suppose that is the "problem" referred to in the wiki?  I
checked out successfully from the repository on github.

-da

On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
> It's not a big deal - after all if you use CA only, chains with no
> CA's aren't important, and the error messages aren't that long. ?But
> I'm going to switch anyway...
> I'm getting the dreaded "can't read line length in file" error while
> trying to checkout biojava-live/trunk, though.
>
> -da
>
> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> I just checked, this is a bug which is already resolved in 3.0... If
>> it is an issue for you, you might want to upgrade... (should be very
>> easy, if you start using Maven ...)
>>
>> Thanks,
>> Andreas
>>
>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> I'm using 1.7, partially because my distro had a package for it and
>>> partially because I was initially using the online Javadoc a lot.
>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
>>> chain F appears to parse correctly.
>>>
>>> -da
>>>
>>> org.biojava.bio.structure.StructureException: could not find chain A
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: could not find chain B
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: did not find chain with
>>> chainId >A<
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: did not find chain with
>>> chainId >B<
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>
>>>
>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> I assume AtomCache is a new class in BioJava3?
>>>>
>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>
>>>>>
>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>> handling returns control and started going after the wrong exceptions.
>>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>>>> chains with no CA's instead of just skipping them, though the other
>>>>> chains are still parsed into the structure.
>>>>
>>>> This sounds like there might be ?a problem with CA only.. do you have
>>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>>>>
>>>>> -da
>>>>>
>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>> still issues..
>>>>>>
>>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>>> block properly. You could do something like that:
>>>>>>
>>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>>>
>>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>>
>>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>>>
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>>>
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>>>> e.getMessage());
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>> ? ? ? ? ? ? ? ?}
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>
>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>> and the inner iterates over the pieces from each. ?When
>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>> the problem ones by a significant factor and the overall data wont be
>>>>>>> seriously impacted.
>>>>>>>
>>>>>>> -da
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>> too trivial.
>>>>>>>>>
>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>> next file.
>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>
>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -da
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Dr. Andreas Prlic
>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>> University of California, San Diego
>>>> (+1) 858.246.0526
>>>> -----------------------------------------------------------------------
>>>>
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>


From andreas at sdsc.edu  Thu Oct 28 20:06:55 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 17:06:55 -0700
Subject: [Biojava-l] biojava maven integration
In-Reply-To: <AANLkTi=FV0-Wh6Abh0DOGCWZfMicobGo9gjkcRMSJUBx@mail.gmail.com>
References: <AANLkTinjaDZ_BGUXk3F=DbNbxppky8548zqm2d2PNvP8@mail.gmail.com>
	<5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
	<AANLkTi=FV0-Wh6Abh0DOGCWZfMicobGo9gjkcRMSJUBx@mail.gmail.com>
Message-ID: <AANLkTikzeXetdY_kf1P-GEQPueYj7MQfkwYWQ7OgH=q=@mail.gmail.com>

Hi Jay,

here is some UI code that is using biojava from Maven:

http://github.com/biojava/RCSB_SequenceViewer/blob/master/pom.xml

Andreas


On Thu, Oct 28, 2010 at 3:51 PM, Jay Vyas <jayunit100 at gmail.com> wrote:
> Does anybody have a maven POM example of how to integrate biojava into my
> application ?
> Thanks!
>
> Im currently using biojava 1.7, and have put it in my own, local maven
> repository.
>
>
>
>
> On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy <andy.law at roslin.ed.ac.uk> wrote:
>
>> Not 100% certain but I *think* you want to depend on biojava-core rather
>> than biojava.
>>
>> Later,
>>
>> Andy
>>
>> On 28 Oct 2010, at 20:43, Jay Vyas wrote:
>>
>> > Hi guys, I added the following to my pom file
>> >
>> > ?<dependency>
>> > ? ? ? ?<groupId>org.biojava</groupId>
>> > ? ? ? ?<artifactId>biojava</artifactId>
>> > ? ? ? ?<version>3.0-alpha2</version>
>> > ? </dependency>
>> >
>> > <repository>
>> > ? ? ? ?<id>biojava-maven-repo</id>
>> > ? ? ? ?<name>BioJava repository</name>
>> > ? ? ? ?<url>http://www.biojava.org/download/maven/</url>
>> > ? ? ? ?<snapshots>
>> > ? ? ? ? ? ?<enabled>true</enabled>
>> > ? ? ? ?</snapshots>
>> > ? ? ? ?<releases>
>> > ? ? ? ? ? ?<enabled>true</enabled>
>> > ? ? ? ?</releases>
>> > ? ?</repository>
>> > <repository>
>> >
>> > But to no avail. ?Does anyone know how to add biojava3 to the libraries
>> in a
>> > maven managed application >?
>> >
>> > Thanks.
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
>
> --
> Jay Vyas
> MMSB/UCHC
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Thu Oct 28 20:08:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 17:08:49 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTimzVwJLyzyTZvmf0=og+t7xyc2z4OKTf1RNN5Cy@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
	<AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
	<AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>
	<AANLkTimzVwJLyzyTZvmf0=og+t7xyc2z4OKTf1RNN5Cy@mail.gmail.com>
Message-ID: <AANLkTik5WLTFevA1iCBHPbLutcyA6TbxioaisKGAs-gY@mail.gmail.com>

good, I was just about to say that... ;-)

Andreas


On Thu, Oct 28, 2010 at 4:51 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Ahh, I suppose that is the "problem" referred to in the wiki? ?I
> checked out successfully from the repository on github.
>
> -da
>
> On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> It's not a big deal - after all if you use CA only, chains with no
>> CA's aren't important, and the error messages aren't that long. ?But
>> I'm going to switch anyway...
>> I'm getting the dreaded "can't read line length in file" error while
>> trying to checkout biojava-live/trunk, though.
>>
>> -da
>>
>> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> I just checked, this is a bug which is already resolved in 3.0... If
>>> it is an issue for you, you might want to upgrade... (should be very
>>> easy, if you start using Maven ...)
>>>
>>> Thanks,
>>> Andreas
>>>
>>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> I'm using 1.7, partially because my distro had a package for it and
>>>> partially because I was initially using the online Javadoc a lot.
>>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
>>>> chain F appears to parse correctly.
>>>>
>>>> -da
>>>>
>>>> org.biojava.bio.structure.StructureException: could not find chain A
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: could not find chain B
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with
>>>> chainId >A<
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with
>>>> chainId >B<
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>>
>>>>
>>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> I assume AtomCache is a new class in BioJava3?
>>>>>
>>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>>
>>>>>>
>>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>>> handling returns control and started going after the wrong exceptions.
>>>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>>>>> chains with no CA's instead of just skipping them, though the other
>>>>>> chains are still parsed into the structure.
>>>>>
>>>>> This sounds like there might be ?a problem with CA only.. do you have
>>>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -da
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>>> still issues..
>>>>>>>
>>>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>>>> block properly. You could do something like that:
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>>>>> e.getMessage());
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>>> ? ? ? ? ? ? ? ?}
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>>
>>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>>> and the inner iterates over the pieces from each. ?When
>>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>>> the problem ones by a significant factor and the overall data wont be
>>>>>>>> seriously impacted.
>>>>>>>>
>>>>>>>> -da
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>>
>>>>>>>>> Andreas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>>> too trivial.
>>>>>>>>>>
>>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>>> next file.
>>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>>
>>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -da
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From ayates at ebi.ac.uk  Fri Oct 29 04:12:09 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 09:12:09 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
Message-ID: <A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>

Hi Vishal,

As far as I am aware there is nothing which will generate them in BioJava at the moment. However it is possible to do it with BioJava3:

public static void main(String[] args) {
    DNASequence d = new DNASequence("ATGATC");
    System.out.println("Non-Overlap");
    nonOverlap(d);
    System.out.println("Overlap");
    overlap(d);
}

public static final int KMER = 3;

//Generate triplets overlapping
public static void overlap(Sequence<NucleotideCompound> d) {
    List<WindowedSequence<NucleotideCompound>> l =
            new ArrayList<WindowedSequence<NucleotideCompound>>();
    for(int i=1; i<=KMER; i++) {
        SequenceView<NucleotideCompound> sub = d.getSubSequence(
                i, d.getLength());
        WindowedSequence<NucleotideCompound> w =
            new WindowedSequence<NucleotideCompound>(sub, KMER);
        l.add(w);
    }

    //Will return ATG, ATC, TGA & GAT
    for(WindowedSequence<NucleotideCompound> w: l) {
        for(List<NucleotideCompound> subList: w) {
            System.out.println(subList);
        }
    }
}

//Generate triplet Compound lists non-overlapping
public static void nonOverlap(Sequence<NucleotideCompound> d) {
    WindowedSequence<NucleotideCompound> w = 
            new WindowedSequence<NucleotideCompound>(d, KMER);
    //Will return ATG & ATC
    for(List<NucleotideCompound> subList: w) {
        System.out.println(subList);
    }
}

The disadvantage of all of these solutions is that they generate lists of Compounds so kmer generation can/will be a memory intensive operation. This does mean it has to be since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA)

As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).

Hope this helps,

Andy

On 28 Oct 2010, at 18:40, Vishal Thapar wrote:

> Hi All,
> 
> I had a quick question: Does Biojava have a method to generate k-mers or
> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> counts for every sequence in a fasta file. If something like this exists it
> would save me some time to write the code.
> 
> Thanks,
> 
> Vishal
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Fri Oct 29 05:12:53 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 14:42:53 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
Message-ID: <AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>

Dear Friends,

Thanks to Vishal & Andy for this. I actually needed this code too..
Vishal, I think Andy's suggestions may be a good option to include in
BioJava 3. Would you like to add this to the BioJava 3.

Thanks again.

Regards,
Jitesh Dundas

On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava at
> the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>     DNASequence d = new DNASequence("ATGATC");
>     System.out.println("Non-Overlap");
>     nonOverlap(d);
>     System.out.println("Overlap");
>     overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>     List<WindowedSequence<NucleotideCompound>> l =
>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>     for(int i=1; i<=KMER; i++) {
>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                 i, d.getLength());
>         WindowedSequence<NucleotideCompound> w =
>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>         l.add(w);
>     }
>
>     //Will return ATG, ATC, TGA & GAT
>     for(WindowedSequence<NucleotideCompound> w: l) {
>         for(List<NucleotideCompound> subList: w) {
>             System.out.println(subList);
>         }
>     }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>     WindowedSequence<NucleotideCompound> w =
>             new WindowedSequence<NucleotideCompound>(d, KMER);
>     //Will return ATG & ATC
>     for(List<NucleotideCompound> subList: w) {
>         System.out.println(subList);
>     }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does mean it has to be since sub sequences are thin wrappers around an
> underlying sequence. Also the overlap solution is non-optimal since it
> iterates through each window rather than stepping through delegating onto
> each base in turn (hence why we get ATG & ATC before TGA)
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
>> Hi All,
>>
>> I had a quick question: Does Biojava have a method to generate k-mers or
>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>> counts for every sequence in a fasta file. If something like this exists
>> it
>> would save me some time to write the code.
>>
>> Thanks,
>>
>> Vishal
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk  Fri Oct 29 05:20:36 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 10:20:36 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
Message-ID: <B903229A-1A5A-432C-9307-4D01ACD8DFB3@ebi.ac.uk>

Okay couple of points here:

1). Which biojava3 module? This sounds like something for the genomic module rather than core

2). It'll need some more work. I'm not happy about using the WindowedSequenceView in its current state. I think an alteration to avoid it making Lists would be a good idea (plus recent developments in the API as to its main use means this is a viable change). Also it should return the overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6

Comments?

Andy

On 29 Oct 2010, at 10:12, jitesh dundas wrote:

> Dear Friends,
> 
> Thanks to Vishal & Andy for this. I actually needed this code too..
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to the BioJava 3.
> 
> Thanks again.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Vishal,
>> 
>> As far as I am aware there is nothing which will generate them in BioJava at
>> the moment. However it is possible to do it with BioJava3:
>> 
>> public static void main(String[] args) {
>>    DNASequence d = new DNASequence("ATGATC");
>>    System.out.println("Non-Overlap");
>>    nonOverlap(d);
>>    System.out.println("Overlap");
>>    overlap(d);
>> }
>> 
>> public static final int KMER = 3;
>> 
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>    List<WindowedSequence<NucleotideCompound>> l =
>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>    for(int i=1; i<=KMER; i++) {
>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>                i, d.getLength());
>>        WindowedSequence<NucleotideCompound> w =
>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>        l.add(w);
>>    }
>> 
>>    //Will return ATG, ATC, TGA & GAT
>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>        for(List<NucleotideCompound> subList: w) {
>>            System.out.println(subList);
>>        }
>>    }
>> }
>> 
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>    WindowedSequence<NucleotideCompound> w =
>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>    //Will return ATG & ATC
>>    for(List<NucleotideCompound> subList: w) {
>>        System.out.println(subList);
>>    }
>> }
>> 
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds so kmer generation can/will be a memory intensive operation. This
>> does mean it has to be since sub sequences are thin wrappers around an
>> underlying sequence. Also the overlap solution is non-optimal since it
>> iterates through each window rather than stepping through delegating onto
>> each base in turn (hence why we get ATG & ATC before TGA)
>> 
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>> 
>> Hope this helps,
>> 
>> Andy
>> 
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>> 
>>> Hi All,
>>> 
>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>> counts for every sequence in a fasta file. If something like this exists
>>> it
>>> would save me some time to write the code.
>>> 
>>> Thanks,
>>> 
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Fri Oct 29 06:00:44 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:30:44 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
Message-ID: <AANLkTimLoXvOO1aHBrRno4=P7CJ923iGUoQ+558LoBWE@mail.gmail.com>

Dear Sir,

Is there any way to detect patterns in the recorded k-mers .

I have a large set of miRNAs (study for mutations and patgerns for
gastric cancer).I made a record of k-mers for each sequence but the
patterns that are generated are difficult to track.

Can BioJava do this point. Regular Expressions in Java maybe useful here..

Request expert advise  in this.Any other s/w that might be useful.

Thanks,
Jitesh Dundas

On 10/29/10, jitesh dundas <jbdundas at gmail.com> wrote:
> Dear Friends,
>
> Thanks to Vishal & Andy for this. I actually needed this code too..
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to the BioJava 3.
>
> Thanks again.
>
> Regards,
> Jitesh Dundas
>
> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Vishal,
>>
>> As far as I am aware there is nothing which will generate them in BioJava
>> at
>> the moment. However it is possible to do it with BioJava3:
>>
>> public static void main(String[] args) {
>>     DNASequence d = new DNASequence("ATGATC");
>>     System.out.println("Non-Overlap");
>>     nonOverlap(d);
>>     System.out.println("Overlap");
>>     overlap(d);
>> }
>>
>> public static final int KMER = 3;
>>
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>     List<WindowedSequence<NucleotideCompound>> l =
>>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>>     for(int i=1; i<=KMER; i++) {
>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>                 i, d.getLength());
>>         WindowedSequence<NucleotideCompound> w =
>>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>>         l.add(w);
>>     }
>>
>>     //Will return ATG, ATC, TGA & GAT
>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>         for(List<NucleotideCompound> subList: w) {
>>             System.out.println(subList);
>>         }
>>     }
>> }
>>
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>     WindowedSequence<NucleotideCompound> w =
>>             new WindowedSequence<NucleotideCompound>(d, KMER);
>>     //Will return ATG & ATC
>>     for(List<NucleotideCompound> subList: w) {
>>         System.out.println(subList);
>>     }
>> }
>>
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds so kmer generation can/will be a memory intensive operation.
>> This
>> does mean it has to be since sub sequences are thin wrappers around an
>> underlying sequence. Also the overlap solution is non-optimal since it
>> iterates through each window rather than stepping through delegating onto
>> each base in turn (hence why we get ATG & ATC before TGA)
>>
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>>
>> Hope this helps,
>>
>> Andy
>>
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>
>>> Hi All,
>>>
>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>> counts for every sequence in a fasta file. If something like this exists
>>> it
>>> would save me some time to write the code.
>>>
>>> Thanks,
>>>
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>

From jbdundas at gmail.com  Fri Oct 29 06:04:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:34:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <B903229A-1A5A-432C-9307-4D01ACD8DFB3@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
	<B903229A-1A5A-432C-9307-4D01ACD8DFB3@ebi.ac.uk>
Message-ID: <AANLkTimktS_moccBSOXLKMDxYVe8Y91LrDOxndHWUP_P@mail.gmail.com>

You are right again my friend.Definitely that would hang up my machine
with the xml file parsing activity.

This is about sequence alignment and related modules.

I will look at this today and send a fix on that.Hope that you can help.

PS: what about pattern matching in sequences?interesting  to have in
biojava 3 ?

Regards,
JD

On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Okay couple of points here:
>
> 1). Which biojava3 module? This sounds like something for the genomic module
> rather than core
>
> 2). It'll need some more work. I'm not happy about using the
> WindowedSequenceView in its current state. I think an alteration to avoid it
> making Lists would be a good idea (plus recent developments in the API as to
> its main use means this is a viable change). Also it should return the
> overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6
>
> Comments?
>
> Andy
>
> On 29 Oct 2010, at 10:12, jitesh dundas wrote:
>
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too..
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to the BioJava 3.
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at
>>> the moment. However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>>
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists
>>>> it
>>>> would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>

From ayates at ebi.ac.uk  Fri Oct 29 06:09:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 11:09:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTimLoXvOO1aHBrRno4=P7CJ923iGUoQ+558LoBWE@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
	<AANLkTimLoXvOO1aHBrRno4=P7CJ923iGUoQ+558LoBWE@mail.gmail.com>
Message-ID: <5832FAFE-FEC3-4A7C-9469-3C334551900B@ebi.ac.uk>

One of the disadvantages of the Sequence based system is that we have no support for searching in sequences with patterns like regular expressions. Whilst it's possible to convert a Sequence into a String & then perform the expression but that is a sub-optimal solution.

Looking at the Pattern code in Java6 it can take in a CharSequence which one could write an adaptor to make a Sequence act as a CharSequence for the matching procedure but really it looks like a lot of work.

As for a way of doing matching to sequence HMMER3 is awesome :)

Andy

On 29 Oct 2010, at 11:00, jitesh dundas wrote:

> Dear Sir,
> 
> Is there any way to detect patterns in the recorded k-mers .
> 
> I have a large set of miRNAs (study for mutations and patgerns for
> gastric cancer).I made a record of k-mers for each sequence but the
> patterns that are generated are difficult to track.
> 
> Can BioJava do this point. Regular Expressions in Java maybe useful here..
> 
> Request expert advise  in this.Any other s/w that might be useful.
> 
> Thanks,
> Jitesh Dundas
> 
> On 10/29/10, jitesh dundas <jbdundas at gmail.com> wrote:
>> Dear Friends,
>> 
>> Thanks to Vishal & Andy for this. I actually needed this code too..
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to the BioJava 3.
>> 
>> Thanks again.
>> 
>> Regards,
>> Jitesh Dundas
>> 
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>> 
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at
>>> the moment. However it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>> 
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists
>>>> it
>>>> would save me some time to write the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jnarayan81 at gmail.com  Fri Oct 29 07:46:11 2010
From: jnarayan81 at gmail.com (jitendra narayan)
Date: Fri, 29 Oct 2010 17:16:11 +0530
Subject: [Biojava-l] New Biojava Logo
Message-ID: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>

Dear All
I have designed a n new biojava logo. Please see the detail of it:
http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
<http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your valuable
suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo


thanks

-- 
Jitendra Narayan
Bioinformatist
www.bioinformaticsonline.com

From genjasp at gmail.com  Fri Oct 29 09:05:57 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 29 Oct 2010 15:05:57 +0200
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
Message-ID: <AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>

Great Logo!!!

:D

2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> Dear All
> I have designed a n new biojava logo. Please see the detail of it:
> http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your valuable
> suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
>
>
> thanks
>
> --
> Jitendra Narayan
> Bioinformatist
> www.bioinformaticsonline.com
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Alessandro Cipriani
(+39) 3206009509
(+39) 3931311792
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz


From vishalthapar at gmail.com  Fri Oct 29 12:27:11 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Fri, 29 Oct 2010 12:27:11 -0400
Subject: [Biojava-l] K-mers
In-Reply-To: <A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
Message-ID: <AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>

Hi Andy,

This is good to have. I feel that including it as a part of core may not be
necessary but having it as part of Genomic module in biojava3 will be nice.
There is a project Bioinformatica
http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
does something similar although not exactly. It counts the k-mers in a
given fasta file but it does not count k-mers for each sequence within the
file, just all within a file. This is a good feature to have specially if
one is trying to find patterns within sequences which is what I am trying to
do. It would most certainly be helpful to have a k-mer counting algorithm
that counts k-mer frequency for each sequence. The way to go would be to use
suffix trees. Again I don't know if biojava has a suffix tree api or not
since I haven't used java in a while and am just switching back to it. A
paper on using suffix trees to generate genome wide k-mer frequencies is:
http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software
is tallymer). It would be some work to implement this in java as a module
for biojava3 but I can see that this will be helpful. Again, for small fasta
files, it might not be efficient to create a suffix tree but for bigger
files, I think that might be the way to go.

Thats just my two cents.What do you think?

-vishal

On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava
> at the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>    DNASequence d = new DNASequence("ATGATC");
>    System.out.println("Non-Overlap");
>    nonOverlap(d);
>    System.out.println("Overlap");
>    overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>    List<WindowedSequence<NucleotideCompound>> l =
>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>    for(int i=1; i<=KMER; i++) {
>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                i, d.getLength());
>        WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>        l.add(w);
>    }
>
>    //Will return ATG, ATC, TGA & GAT
>    for(WindowedSequence<NucleotideCompound> w: l) {
>        for(List<NucleotideCompound> subList: w) {
>            System.out.println(subList);
>        }
>    }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>    WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(d, KMER);
>    //Will return ATG & ATC
>    for(List<NucleotideCompound> subList: w) {
>        System.out.println(subList);
>    }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does mean it has to be since sub sequences are thin wrappers around an
> underlying sequence. Also the overlap solution is non-optimal since it
> iterates through each window rather than stepping through delegating onto
> each base in turn (hence why we get ATG & ATC before TGA)
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists
> it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


-- 
*Vishal Thapar, Ph.D.*
*Scientific informatics Analyst
Cold Spring Harbor Lab
Quick Bldg, Lowe Lab
1 Bungtown Road
Cold Spring Harbor, NY - 11724*

From phidias51 at gmail.com  Fri Oct 29 12:56:45 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 09:56:45 -0700
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
Message-ID: <AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>

It might be useful to make the K-mer storage mechanism pluggable.  This
would allow a developer to use anything from a simple MultiMap, to a NoSQL
key-value database to store K-mers.  You could plugin custom map
implementations to allow you to keep a count of the number of instances of
particular K-mers that were found.  It might also be useful to be able to do
set operations on those K-mer collections.  You could use it to determine
which K-mers were present in a pathogen and not in a host.
http://www.ncbi.nlm.nih.gov/pubmed/20428334
http://www.ncbi.nlm.nih.gov/pubmed/16403026

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com>wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as a part of core may not be
> necessary but having it as part of Genomic module in biojava3 will be nice.
> There is a project Bioinformatica
>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
> does something similar although not exactly. It counts the k-mers in a
> given fasta file but it does not count k-mers for each sequence within the
> file, just all within a file. This is a good feature to have specially if
> one is trying to find patterns within sequences which is what I am trying
> to
> do. It would most certainly be helpful to have a k-mer counting algorithm
> that counts k-mer frequency for each sequence. The way to go would be to
> use
> suffix trees. Again I don't know if biojava has a suffix tree api or not
> since I haven't used java in a while and am just switching back to it. A
> paper on using suffix trees to generate genome wide k-mer frequencies is:
> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> software
> is tallymer). It would be some work to implement this in java as a module
> for biojava3 but I can see that this will be helpful. Again, for small
> fasta
> files, it might not be efficient to create a suffix tree but for bigger
> files, I think that might be the way to go.
>
> Thats just my two cents.What do you think?
>
> -vishal
>
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>
> > Hi Vishal,
> >
> > As far as I am aware there is nothing which will generate them in BioJava
> > at the moment. However it is possible to do it with BioJava3:
> >
> > public static void main(String[] args) {
> >    DNASequence d = new DNASequence("ATGATC");
> >    System.out.println("Non-Overlap");
> >    nonOverlap(d);
> >    System.out.println("Overlap");
> >    overlap(d);
> > }
> >
> > public static final int KMER = 3;
> >
> > //Generate triplets overlapping
> > public static void overlap(Sequence<NucleotideCompound> d) {
> >    List<WindowedSequence<NucleotideCompound>> l =
> >            new ArrayList<WindowedSequence<NucleotideCompound>>();
> >    for(int i=1; i<=KMER; i++) {
> >        SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >                i, d.getLength());
> >        WindowedSequence<NucleotideCompound> w =
> >            new WindowedSequence<NucleotideCompound>(sub, KMER);
> >        l.add(w);
> >    }
> >
> >    //Will return ATG, ATC, TGA & GAT
> >    for(WindowedSequence<NucleotideCompound> w: l) {
> >        for(List<NucleotideCompound> subList: w) {
> >            System.out.println(subList);
> >        }
> >    }
> > }
> >
> > //Generate triplet Compound lists non-overlapping
> > public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >    WindowedSequence<NucleotideCompound> w =
> >            new WindowedSequence<NucleotideCompound>(d, KMER);
> >    //Will return ATG & ATC
> >    for(List<NucleotideCompound> subList: w) {
> >        System.out.println(subList);
> >    }
> > }
> >
> > The disadvantage of all of these solutions is that they generate lists of
> > Compounds so kmer generation can/will be a memory intensive operation.
> This
> > does mean it has to be since sub sequences are thin wrappers around an
> > underlying sequence. Also the overlap solution is non-optimal since it
> > iterates through each window rather than stepping through delegating onto
> > each base in turn (hence why we get ATG & ATC before TGA)
> >
> > As for unique k-mers that's something which would require a bit more
> > engineering & would be better suited to a solution built around a Trie
> > (prefix tree).
> >
> > Hope this helps,
> >
> > Andy
> >
> > On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >
> > > Hi All,
> > >
> > > I had a quick question: Does Biojava have a method to generate k-mers
> or
> > > K-mer counting in a given Fasta Sequence / File? Basically, I want
> k-mer
> > > counts for every sequence in a fasta file. If something like this
> exists
> > it
> > > would save me some time to write the code.
> > >
> > > Thanks,
> > >
> > > Vishal
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > --
> > Andrew Yates                   Ensembl Genomes Engineer
> > EMBL-EBI                       Tel: +44-(0)1223-492538
> > Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> > Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >
> >
> >
> >
> >
>
>
> --
> *Vishal Thapar, Ph.D.*
> *Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724*
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk  Fri Oct 29 14:32:45 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:32:45 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
Message-ID: <C65577E4-26B7-4DE6-A5F2-B7E8994F2CF0@ebi.ac.uk>

Hi Vishal,

There's no suffix tree impl in BioJava but if you want to give it a shot then go for it :). I'm interested in how they work but no time to implement it. As for efficiency give it a shot & lets see what it does. 

Andy

On 29 Oct 2010, at 17:27, Vishal Thapar wrote:

> Hi Andy,
> 
> This is good to have. I feel that including it as a part of core may not be necessary but having it as part of Genomic module in biojava3 will be nice. There is a project Bioinformatica http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which does something similar although not exactly. It counts the k-mers in a given fasta file but it does not count k-mers for each sequence within the file, just all within a file. This is a good feature to have specially if one is trying to find patterns within sequences which is what I am trying to do. It would most certainly be helpful to have a k-mer counting algorithm that counts k-mer frequency for each sequence. The way to go would be to use suffix trees. Again I don't know if biojava has a suffix tree api or not since I haven't used java in a while and am just switching back to it. A paper on using suffix trees to generate genome wide k-mer frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software is tallymer). It would be some work to implement this in java as a module for biojava3 but I can see that this will be helpful. Again, for small fasta files, it might not be efficient to create a suffix tree but for bigger files, I think that might be the way to go.
> 
> Thats just my two cents.What do you think?
> 
> -vishal
> 
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Vishal,
> 
> As far as I am aware there is nothing which will generate them in BioJava at the moment. However it is possible to do it with BioJava3:
> 
> public static void main(String[] args) {
>    DNASequence d = new DNASequence("ATGATC");
>    System.out.println("Non-Overlap");
>    nonOverlap(d);
>    System.out.println("Overlap");
>    overlap(d);
> }
> 
> public static final int KMER = 3;
> 
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>    List<WindowedSequence<NucleotideCompound>> l =
>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>    for(int i=1; i<=KMER; i++) {
>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                i, d.getLength());
>        WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>        l.add(w);
>    }
> 
>    //Will return ATG, ATC, TGA & GAT
>    for(WindowedSequence<NucleotideCompound> w: l) {
>        for(List<NucleotideCompound> subList: w) {
>            System.out.println(subList);
>        }
>    }
> }
> 
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>    WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(d, KMER);
>    //Will return ATG & ATC
>    for(List<NucleotideCompound> subList: w) {
>        System.out.println(subList);
>    }
> }
> 
> The disadvantage of all of these solutions is that they generate lists of Compounds so kmer generation can/will be a memory intensive operation. This does mean it has to be since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA)
> 
> As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).
> 
> Hope this helps,
> 
> Andy
> 
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> 
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> 
> 
> -- 
> Vishal Thapar, Ph.D.
> Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From ayates at ebi.ac.uk  Fri Oct 29 14:35:43 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:35:43 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
Message-ID: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>

So if it's a suffix tree that's quite a fixed data structure so the chances of developing a pluggable mechanism there would be hard. I think there also has to be a limit as to what we can sensibly do. If people want to contribute this kind of work though then it's all be very well received (with the corresponding test environment/cases of course).

Cheers,

Andy

On 29 Oct 2010, at 17:56, Mark Fortner wrote:

> It might be useful to make the K-mer storage mechanism pluggable.  This
> would allow a developer to use anything from a simple MultiMap, to a NoSQL
> key-value database to store K-mers.  You could plugin custom map
> implementations to allow you to keep a count of the number of instances of
> particular K-mers that were found.  It might also be useful to be able to do
> set operations on those K-mer collections.  You could use it to determine
> which K-mers were present in a pathogen and not in a host.
> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> http://www.ncbi.nlm.nih.gov/pubmed/16403026
> 
> Cheers,
> 
> Mark
> 
> card.ly: <http://card.ly/phidias51>
> 
> 
> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com>wrote:
> 
>> Hi Andy,
>> 
>> This is good to have. I feel that including it as a part of core may not be
>> necessary but having it as part of Genomic module in biojava3 will be nice.
>> There is a project Bioinformatica
>> 
>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>> does something similar although not exactly. It counts the k-mers in a
>> given fasta file but it does not count k-mers for each sequence within the
>> file, just all within a file. This is a good feature to have specially if
>> one is trying to find patterns within sequences which is what I am trying
>> to
>> do. It would most certainly be helpful to have a k-mer counting algorithm
>> that counts k-mer frequency for each sequence. The way to go would be to
>> use
>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>> since I haven't used java in a while and am just switching back to it. A
>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>> software
>> is tallymer). It would be some work to implement this in java as a module
>> for biojava3 but I can see that this will be helpful. Again, for small
>> fasta
>> files, it might not be efficient to create a suffix tree but for bigger
>> files, I think that might be the way to go.
>> 
>> Thats just my two cents.What do you think?
>> 
>> -vishal
>> 
>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> 
>>> Hi Vishal,
>>> 
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at the moment. However it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>               i, d.getLength());
>>>       WindowedSequence<NucleotideCompound> w =
>>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>       l.add(w);
>>>   }
>>> 
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>       for(List<NucleotideCompound> subList: w) {
>>>           System.out.println(subList);
>>>       }
>>>   }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>           new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>   }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: Does Biojava have a method to generate k-mers
>> or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>> k-mer
>>>> counts for every sequence in a fasta file. If something like this
>> exists
>>> it
>>>> would save me some time to write the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> --
>> *Vishal Thapar, Ph.D.*
>> *Scientific informatics Analyst
>> Cold Spring Harbor Lab
>> Quick Bldg, Lowe Lab
>> 1 Bungtown Road
>> Cold Spring Harbor, NY - 11724*
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jayunit100 at gmail.com  Fri Oct 29 14:40:46 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 14:40:46 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
Message-ID: <AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>

Hi guys : Im trying to break up a biojava project built on 1.7 into biojava
3, and am having to look up some modules etc...
Im having trouble finding biojava3 javadocs ?  Unfortunately, the
'googleable' java docs are all from 1.7 .....

Where is the formal/generated javadoc info for biojava3 ? is it online ?

From phidias51 at gmail.com  Fri Oct 29 14:48:53 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 11:48:53 -0700
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID: <AANLkTimiWU4PfB==xcVkgTo4TfEnNe5qoRuJUrvAUYVy@mail.gmail.com>

I was thinking more along the lines of using something that implements the
Map interface.  This would allow a developer to easily unit test the code
without having to load the data for a genome.  You would also be able to
provide different implementations to suit your needs.  If you wanted to use
a suffix tree as the underlying implementation, that would be OK, but you
would have other options as well.

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 11:35 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> So if it's a suffix tree that's quite a fixed data structure so the chances
> of developing a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work though then it's all be very well received
> (with the corresponding test environment/cases of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
> > It might be useful to make the K-mer storage mechanism pluggable.  This
> > would allow a developer to use anything from a simple MultiMap, to a
> NoSQL
> > key-value database to store K-mers.  You could plugin custom map
> > implementations to allow you to keep a count of the number of instances
> of
> > particular K-mers that were found.  It might also be useful to be able to
> do
> > set operations on those K-mer collections.  You could use it to determine
> > which K-mers were present in a pathogen and not in a host.
> > http://www.ncbi.nlm.nih.gov/pubmed/20428334
> > http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >
> > Cheers,
> >
> > Mark
> >
> > card.ly: <http://card.ly/phidias51>
> >
> >
> > On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com
> >wrote:
> >
> >> Hi Andy,
> >>
> >> This is good to have. I feel that including it as a part of core may not
> be
> >> necessary but having it as part of Genomic module in biojava3 will be
> nice.
> >> There is a project Bioinformatica
> >>
> >>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
> >> does something similar although not exactly. It counts the k-mers in a
> >> given fasta file but it does not count k-mers for each sequence within
> the
> >> file, just all within a file. This is a good feature to have specially
> if
> >> one is trying to find patterns within sequences which is what I am
> trying
> >> to
> >> do. It would most certainly be helpful to have a k-mer counting
> algorithm
> >> that counts k-mer frequency for each sequence. The way to go would be to
> >> use
> >> suffix trees. Again I don't know if biojava has a suffix tree api or not
> >> since I haven't used java in a while and am just switching back to it. A
> >> paper on using suffix trees to generate genome wide k-mer frequencies
> is:
> >> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> >> software
> >> is tallymer). It would be some work to implement this in java as a
> module
> >> for biojava3 but I can see that this will be helpful. Again, for small
> >> fasta
> >> files, it might not be efficient to create a suffix tree but for bigger
> >> files, I think that might be the way to go.
> >>
> >> Thats just my two cents.What do you think?
> >>
> >> -vishal
> >>
> >> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As far as I am aware there is nothing which will generate them in
> BioJava
> >>> at the moment. However it is possible to do it with BioJava3:
> >>>
> >>> public static void main(String[] args) {
> >>>   DNASequence d = new DNASequence("ATGATC");
> >>>   System.out.println("Non-Overlap");
> >>>   nonOverlap(d);
> >>>   System.out.println("Overlap");
> >>>   overlap(d);
> >>> }
> >>>
> >>> public static final int KMER = 3;
> >>>
> >>> //Generate triplets overlapping
> >>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>   List<WindowedSequence<NucleotideCompound>> l =
> >>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>   for(int i=1; i<=KMER; i++) {
> >>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>               i, d.getLength());
> >>>       WindowedSequence<NucleotideCompound> w =
> >>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>       l.add(w);
> >>>   }
> >>>
> >>>   //Will return ATG, ATC, TGA & GAT
> >>>   for(WindowedSequence<NucleotideCompound> w: l) {
> >>>       for(List<NucleotideCompound> subList: w) {
> >>>           System.out.println(subList);
> >>>       }
> >>>   }
> >>> }
> >>>
> >>> //Generate triplet Compound lists non-overlapping
> >>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>   WindowedSequence<NucleotideCompound> w =
> >>>           new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>   //Will return ATG & ATC
> >>>   for(List<NucleotideCompound> subList: w) {
> >>>       System.out.println(subList);
> >>>   }
> >>> }
> >>>
> >>> The disadvantage of all of these solutions is that they generate lists
> of
> >>> Compounds so kmer generation can/will be a memory intensive operation.
> >> This
> >>> does mean it has to be since sub sequences are thin wrappers around an
> >>> underlying sequence. Also the overlap solution is non-optimal since it
> >>> iterates through each window rather than stepping through delegating
> onto
> >>> each base in turn (hence why we get ATG & ATC before TGA)
> >>>
> >>> As for unique k-mers that's something which would require a bit more
> >>> engineering & would be better suited to a solution built around a Trie
> >>> (prefix tree).
> >>>
> >>> Hope this helps,
> >>>
> >>> Andy
> >>>
> >>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I had a quick question: Does Biojava have a method to generate k-mers
> >> or
> >>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
> >> k-mer
> >>>> counts for every sequence in a fasta file. If something like this
> >> exists
> >>> it
> >>>> would save me some time to write the code.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Vishal
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>
> >>> --
> >>> Andrew Yates                   Ensembl Genomes Engineer
> >>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> *Vishal Thapar, Ph.D.*
> >> *Scientific informatics Analyst
> >> Cold Spring Harbor Lab
> >> Quick Bldg, Lowe Lab
> >> 1 Bungtown Road
> >> Cold Spring Harbor, NY - 11724*
> >> _______________________________________________
> >> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>

From jbdundas at gmail.com  Fri Oct 29 14:50:11 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 00:20:11 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID: <AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>

I agree Andy. These have become standard functionalities that
scientists do these days. I am all for implementing that in BioJava3.
Java isn't that efficient for such functionalities so we will surely
need more effort compared to the same in Python/Perl.

Regards,
Jitesh Dundas

On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> So if it's a suffix tree that's quite a fixed data structure so the chances
> of developing a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work though then it's all be very well received
> (with the corresponding test environment/cases of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
>> It might be useful to make the K-mer storage mechanism pluggable.  This
>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>> key-value database to store K-mers.  You could plugin custom map
>> implementations to allow you to keep a count of the number of instances of
>> particular K-mers that were found.  It might also be useful to be able to
>> do
>> set operations on those K-mer collections.  You could use it to determine
>> which K-mers were present in a pathogen and not in a host.
>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>
>> Cheers,
>>
>> Mark
>>
>> card.ly: <http://card.ly/phidias51>
>>
>>
>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>> <vishalthapar at gmail.com>wrote:
>>
>>> Hi Andy,
>>>
>>> This is good to have. I feel that including it as a part of core may not
>>> be
>>> necessary but having it as part of Genomic module in biojava3 will be
>>> nice.
>>> There is a project Bioinformatica
>>>
>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>> does something similar although not exactly. It counts the k-mers in a
>>> given fasta file but it does not count k-mers for each sequence within
>>> the
>>> file, just all within a file. This is a good feature to have specially if
>>> one is trying to find patterns within sequences which is what I am trying
>>> to
>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>> that counts k-mer frequency for each sequence. The way to go would be to
>>> use
>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>> since I haven't used java in a while and am just switching back to it. A
>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>> software
>>> is tallymer). It would be some work to implement this in java as a module
>>> for biojava3 but I can see that this will be helpful. Again, for small
>>> fasta
>>> files, it might not be efficient to create a suffix tree but for bigger
>>> files, I think that might be the way to go.
>>>
>>> Thats just my two cents.What do you think?
>>>
>>> -vishal
>>>
>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> As far as I am aware there is nothing which will generate them in
>>>> BioJava
>>>> at the moment. However it is possible to do it with BioJava3:
>>>>
>>>> public static void main(String[] args) {
>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>   System.out.println("Non-Overlap");
>>>>   nonOverlap(d);
>>>>   System.out.println("Overlap");
>>>>   overlap(d);
>>>> }
>>>>
>>>> public static final int KMER = 3;
>>>>
>>>> //Generate triplets overlapping
>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>   for(int i=1; i<=KMER; i++) {
>>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>               i, d.getLength());
>>>>       WindowedSequence<NucleotideCompound> w =
>>>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>       l.add(w);
>>>>   }
>>>>
>>>>   //Will return ATG, ATC, TGA & GAT
>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>       for(List<NucleotideCompound> subList: w) {
>>>>           System.out.println(subList);
>>>>       }
>>>>   }
>>>> }
>>>>
>>>> //Generate triplet Compound lists non-overlapping
>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>   WindowedSequence<NucleotideCompound> w =
>>>>           new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>   //Will return ATG & ATC
>>>>   for(List<NucleotideCompound> subList: w) {
>>>>       System.out.println(subList);
>>>>   }
>>>> }
>>>>
>>>> The disadvantage of all of these solutions is that they generate lists
>>>> of
>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>> iterates through each window rather than stepping through delegating
>>>> onto
>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>
>>>> As for unique k-mers that's something which would require a bit more
>>>> engineering & would be better suited to a solution built around a Trie
>>>> (prefix tree).
>>>>
>>>> Hope this helps,
>>>>
>>>> Andy
>>>>
>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>> or
>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>> k-mer
>>>>> counts for every sequence in a fasta file. If something like this
>>> exists
>>>> it
>>>>> would save me some time to write the code.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vishal
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>> --
>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Vishal Thapar, Ph.D.*
>>> *Scientific informatics Analyst
>>> Cold Spring Harbor Lab
>>> Quick Bldg, Lowe Lab
>>> 1 Bungtown Road
>>> Cold Spring Harbor, NY - 11724*
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From willishf at ufl.edu  Fri Oct 29 15:20:19 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 15:20:19 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
	<AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
Message-ID: <AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>

Jay

I don't think we have pushed the biojava3 docs up to a place where google
can find them. From the nightly build
http://www.biojava.org/download/maven/org/biojava/ you can find javadocs in
the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
into standalone jar files when possible but it is still a very cross
dependent code base. Then the newer modules labeled biojava3- are a clean
break from 1.7 so depending on what you are doing it may be easy/difficult
to start using the newer biojava3 code without lots of changes in your code.

Thanks

Scooter

On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas <jayunit100 at gmail.com> wrote:

> Hi guys : Im trying to break up a biojava project built on 1.7 into biojava
> 3, and am having to look up some modules etc...
> Im having trouble finding biojava3 javadocs ?  Unfortunately, the
> 'googleable' java docs are all from 1.7 .....
>
> Where is the formal/generated javadoc info for biojava3 ? is it online ?
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>

From markjschreiber at gmail.com  Fri Oct 29 15:25:12 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 29 Oct 2010 15:25:12 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
	<AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
	<AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>
Message-ID: <AANLkTi=dn1nx4sqs1wNdrOP3OYRXzdfd8sy7JPD+_UdD@mail.gmail.com>

It might pay to put the link to the docs on the top level page.

You may need to get an Admin to change the front page.

On Fri, Oct 29, 2010 at 3:20 PM, Scooter Willis <willishf at ufl.edu> wrote:

> Jay
>
> I don't think we have pushed the biojava3 docs up to a place where google
> can find them. From the nightly build
> http://www.biojava.org/download/maven/org/biojava/ you can find javadocs
> in
> the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
> into standalone jar files when possible but it is still a very cross
> dependent code base. Then the newer modules labeled biojava3- are a clean
> break from 1.7 so depending on what you are doing it may be easy/difficult
> to start using the newer biojava3 code without lots of changes in your
> code.
>
> Thanks
>
> Scooter
>
> On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas <jayunit100 at gmail.com> wrote:
>
> > Hi guys : Im trying to break up a biojava project built on 1.7 into
> biojava
> > 3, and am having to look up some modules etc...
> > Im having trouble finding biojava3 javadocs ?  Unfortunately, the
> > 'googleable' java docs are all from 1.7 .....
> >
> > Where is the formal/generated javadoc info for biojava3 ? is it online ?
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk  Fri Oct 29 15:34:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 20:34:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
Message-ID: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>

So we've got some basic kmer work now in SVN. If you look in the class SequenceMixin there are two static methods there for generating the two types of k-mers. It's not developed with Map storage in mind & I'll leave the door open there for anyone else to come in & develop it. The k-mers are also not unique across the sequence but it's a start :)

Share & enjoy!

Andy

On 29 Oct 2010, at 19:50, jitesh dundas wrote:

> I agree Andy. These have become standard functionalities that
> scientists do these days. I am all for implementing that in BioJava3.
> Java isn't that efficient for such functionalities so we will surely
> need more effort compared to the same in Python/Perl.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So if it's a suffix tree that's quite a fixed data structure so the chances
>> of developing a pluggable mechanism there would be hard. I think there also
>> has to be a limit as to what we can sensibly do. If people want to
>> contribute this kind of work though then it's all be very well received
>> (with the corresponding test environment/cases of course).
>> 
>> Cheers,
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>> 
>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>>> key-value database to store K-mers.  You could plugin custom map
>>> implementations to allow you to keep a count of the number of instances of
>>> particular K-mers that were found.  It might also be useful to be able to
>>> do
>>> set operations on those K-mer collections.  You could use it to determine
>>> which K-mers were present in a pathogen and not in a host.
>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>> 
>>> Cheers,
>>> 
>>> Mark
>>> 
>>> card.ly: <http://card.ly/phidias51>
>>> 
>>> 
>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>> <vishalthapar at gmail.com>wrote:
>>> 
>>>> Hi Andy,
>>>> 
>>>> This is good to have. I feel that including it as a part of core may not
>>>> be
>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>> nice.
>>>> There is a project Bioinformatica
>>>> 
>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>> does something similar although not exactly. It counts the k-mers in a
>>>> given fasta file but it does not count k-mers for each sequence within
>>>> the
>>>> file, just all within a file. This is a good feature to have specially if
>>>> one is trying to find patterns within sequences which is what I am trying
>>>> to
>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>>> that counts k-mer frequency for each sequence. The way to go would be to
>>>> use
>>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>>> since I haven't used java in a while and am just switching back to it. A
>>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>> software
>>>> is tallymer). It would be some work to implement this in java as a module
>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>> fasta
>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>> files, I think that might be the way to go.
>>>> 
>>>> Thats just my two cents.What do you think?
>>>> 
>>>> -vishal
>>>> 
>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> 
>>>>> Hi Vishal,
>>>>> 
>>>>> As far as I am aware there is nothing which will generate them in
>>>>> BioJava
>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>> 
>>>>> public static void main(String[] args) {
>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>  System.out.println("Non-Overlap");
>>>>>  nonOverlap(d);
>>>>>  System.out.println("Overlap");
>>>>>  overlap(d);
>>>>> }
>>>>> 
>>>>> public static final int KMER = 3;
>>>>> 
>>>>> //Generate triplets overlapping
>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>              i, d.getLength());
>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>      l.add(w);
>>>>>  }
>>>>> 
>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>          System.out.println(subList);
>>>>>      }
>>>>>  }
>>>>> }
>>>>> 
>>>>> //Generate triplet Compound lists non-overlapping
>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>  //Will return ATG & ATC
>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>      System.out.println(subList);
>>>>>  }
>>>>> }
>>>>> 
>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>> of
>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>> This
>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>> iterates through each window rather than stepping through delegating
>>>>> onto
>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>> 
>>>>> As for unique k-mers that's something which would require a bit more
>>>>> engineering & would be better suited to a solution built around a Trie
>>>>> (prefix tree).
>>>>> 
>>>>> Hope this helps,
>>>>> 
>>>>> Andy
>>>>> 
>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>> or
>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>> k-mer
>>>>>> counts for every sequence in a fasta file. If something like this
>>>> exists
>>>>> it
>>>>>> would save me some time to write the code.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Vishal
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>>> --
>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Vishal Thapar, Ph.D.*
>>>> *Scientific informatics Analyst
>>>> Cold Spring Harbor Lab
>>>> Quick Bldg, Lowe Lab
>>>> 1 Bungtown Road
>>>> Cold Spring Harbor, NY - 11724*
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Fri Oct 29 15:43:38 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 01:13:38 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
Message-ID: <AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>

That is good news.Thanks for the directions Andy.

I have already started on this.Let me analyze and write the code now.

Maybe a next month deadline is not unreachable in this case.

Here we go!
JD

On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> So we've got some basic kmer work now in SVN. If you look in the class
> SequenceMixin there are two static methods there for generating the two
> types of k-mers. It's not developed with Map storage in mind & I'll leave
> the door open there for anyone else to come in & develop it. The k-mers are
> also not unique across the sequence but it's a start :)
>
> Share & enjoy!
>
> Andy
>
> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>
>> I agree Andy. These have become standard functionalities that
>> scientists do these days. I am all for implementing that in BioJava3.
>> Java isn't that efficient for such functionalities so we will surely
>> need more effort compared to the same in Python/Perl.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> So if it's a suffix tree that's quite a fixed data structure so the
>>> chances
>>> of developing a pluggable mechanism there would be hard. I think there
>>> also
>>> has to be a limit as to what we can sensibly do. If people want to
>>> contribute this kind of work though then it's all be very well received
>>> (with the corresponding test environment/cases of course).
>>>
>>> Cheers,
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>
>>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>> NoSQL
>>>> key-value database to store K-mers.  You could plugin custom map
>>>> implementations to allow you to keep a count of the number of instances
>>>> of
>>>> particular K-mers that were found.  It might also be useful to be able
>>>> to
>>>> do
>>>> set operations on those K-mer collections.  You could use it to
>>>> determine
>>>> which K-mers were present in a pathogen and not in a host.
>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>
>>>> Cheers,
>>>>
>>>> Mark
>>>>
>>>> card.ly: <http://card.ly/phidias51>
>>>>
>>>>
>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>> <vishalthapar at gmail.com>wrote:
>>>>
>>>>> Hi Andy,
>>>>>
>>>>> This is good to have. I feel that including it as a part of core may
>>>>> not
>>>>> be
>>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>>> nice.
>>>>> There is a project Bioinformatica
>>>>>
>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>>> does something similar although not exactly. It counts the k-mers in a
>>>>> given fasta file but it does not count k-mers for each sequence within
>>>>> the
>>>>> file, just all within a file. This is a good feature to have specially
>>>>> if
>>>>> one is trying to find patterns within sequences which is what I am
>>>>> trying
>>>>> to
>>>>> do. It would most certainly be helpful to have a k-mer counting
>>>>> algorithm
>>>>> that counts k-mer frequency for each sequence. The way to go would be
>>>>> to
>>>>> use
>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
>>>>> not
>>>>> since I haven't used java in a while and am just switching back to it.
>>>>> A
>>>>> paper on using suffix trees to generate genome wide k-mer frequencies
>>>>> is:
>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>>> software
>>>>> is tallymer). It would be some work to implement this in java as a
>>>>> module
>>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>>> fasta
>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>> files, I think that might be the way to go.
>>>>>
>>>>> Thats just my two cents.What do you think?
>>>>>
>>>>> -vishal
>>>>>
>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>> BioJava
>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>
>>>>>> public static void main(String[] args) {
>>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>>  System.out.println("Non-Overlap");
>>>>>>  nonOverlap(d);
>>>>>>  System.out.println("Overlap");
>>>>>>  overlap(d);
>>>>>> }
>>>>>>
>>>>>> public static final int KMER = 3;
>>>>>>
>>>>>> //Generate triplets overlapping
>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>              i, d.getLength());
>>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>      l.add(w);
>>>>>>  }
>>>>>>
>>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>>          System.out.println(subList);
>>>>>>      }
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>  //Will return ATG & ATC
>>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>>      System.out.println(subList);
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>> of
>>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>>> This
>>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>> iterates through each window rather than stepping through delegating
>>>>>> onto
>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>
>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>> (prefix tree).
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>> or
>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>> k-mer
>>>>>>> counts for every sequence in a fasta file. If something like this
>>>>> exists
>>>>>> it
>>>>>>> would save me some time to write the code.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Vishal
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>> --
>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Vishal Thapar, Ph.D.*
>>>>> *Scientific informatics Analyst
>>>>> Cold Spring Harbor Lab
>>>>> Quick Bldg, Lowe Lab
>>>>> 1 Bungtown Road
>>>>> Cold Spring Harbor, NY - 11724*
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>

From jayunit100 at gmail.com  Fri Oct 29 17:39:34 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 17:39:34 -0400
Subject: [Biojava-l] JavaDocs and Backwards compatibility
Message-ID: <AANLkTin75ggrrpFNE7DhhgcYnxYd3yPEXjKnWPww2p2z@mail.gmail.com>

Thanks, I am now all up to date with biojava 3.0 and it really works well.

It really would be valuable to have some public biojava java docs !

This is because, for example, when I completely removed biojava 1.7, and
replaced it with biojava 3.0,  it was somewhat tedious to refactor/find old
classes under new package names, for example :

For example,

 org.biojava3.alignment.
SimpleSubstitutionMatrix;
 org.biojava3.alignment.template.SubstitutionMatrix;

From andreas at sdsc.edu  Fri Oct 29 17:59:23 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 29 Oct 2010 14:59:23 -0700
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <AANLkTi=dn1nx4sqs1wNdrOP3OYRXzdfd8sy7JPD+_UdD@mail.gmail.com>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
	<AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
	<AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>
	<AANLkTi=dn1nx4sqs1wNdrOP3OYRXzdfd8sy7JPD+_UdD@mail.gmail.com>
Message-ID: <AANLkTimg8dFAXtWe1ZsiTeNLWQTk96LE8nfrpnxC3-Vn@mail.gmail.com>

Ideally I would like to see the automated build system also deploy the
latest javadocs on the website. I guess I should play around with the
maven site-plugin if it can do that ... or does anybody have a
recommendation for any other plugin?

Andreas

On Fri, Oct 29, 2010 at 12:25 PM, Mark Schreiber
<markjschreiber at gmail.com> wrote:
> It might pay to put the link to the docs on the top level page.
>
> You may need to get an Admin to change the front page.
>
> On Fri, Oct 29, 2010 at 3:20 PM, Scooter Willis <willishf at ufl.edu> wrote:
>
>> Jay
>>
>> I don't think we have pushed the biojava3 docs up to a place where google
>> can find them. From the nightly build
>> http://www.biojava.org/download/maven/org/biojava/ you can find javadocs
>> in
>> the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
>> into standalone jar files when possible but it is still a very cross
>> dependent code base. Then the newer modules labeled biojava3- are a clean
>> break from 1.7 so depending on what you are doing it may be easy/difficult
>> to start using the newer biojava3 code without lots of changes in your
>> code.
>>
>> Thanks
>>
>> Scooter
>>
>> On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas <jayunit100 at gmail.com> wrote:
>>
>> > Hi guys : Im trying to break up a biojava project built on 1.7 into
>> biojava
>> > 3, and am having to look up some modules etc...
>> > Im having trouble finding biojava3 javadocs ? ?Unfortunately, the
>> > 'googleable' java docs are all from 1.7 .....
>> >
>> > Where is the formal/generated javadoc info for biojava3 ? is it online ?
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>> >
>> _______________________________________________
>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From simon.rayner.cn at gmail.com  Fri Oct 29 19:38:13 2010
From: simon.rayner.cn at gmail.com (simon rayner)
Date: Sat, 30 Oct 2010 07:38:13 +0800
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
	<AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
Message-ID: <AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>

just a suggestion, but might beans falling out the cup suggest that biojava
is unstable?  just offering feedback, i still think it looks very slick!

On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani <genjasp at gmail.com>wrote:

> Great Logo!!!
>
> :D
>
> 2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> > Dear All
> > I have designed a n new biojava logo. Please see the detail of it:
> > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your
> valuable
> > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> >
> >
> > thanks
> >
> > --
> > Jitendra Narayan
> > Bioinformatist
> > www.bioinformaticsonline.com
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> Alessandro Cipriani
> (+39) 3206009509
> (+39) 3931311792
> http://www.cipriania.it
> skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com>
> msn:jaspzz
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Simon Rayner

State Key Laboratory of Virology
Wuhan Institute of Virology
Chinese Academy of Sciences
Wuhan, Hubei 430071
P.R.China

+86 (27) 87199895 (office)
+86 18627113001 (cell)

From phidias51 at gmail.com  Fri Oct 29 19:49:54 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 16:49:54 -0700
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
	<AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
	<AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>
Message-ID: <AANLkTi=S_5J+9OLH6g93U4i5MOKo0eP-7Q4vmx4nGJVK@mail.gmail.com>

The first logo looks nice; however, I don't see anything in it that connects
it to biology.  The second logo is too close to Oracle's logo, and I suspect
would require written permission from them in order to use it.

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 4:38 PM, simon rayner <simon.rayner.cn at gmail.com>wrote:

> just a suggestion, but might beans falling out the cup suggest that biojava
> is unstable?  just offering feedback, i still think it looks very slick!
>
> On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani <genjasp at gmail.com
> >wrote:
>
> > Great Logo!!!
> >
> > :D
> >
> > 2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> > > Dear All
> > > I have designed a n new biojava logo. Please see the detail of it:
> > > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > > <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your
> > valuable
> > > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> > >
> > >
> > > thanks
> > >
> > > --
> > > Jitendra Narayan
> > > Bioinformatist
> > > www.bioinformaticsonline.com
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> >
> >
> >
> > --
> > Alessandro Cipriani
> > (+39) 3206009509
> > (+39) 3931311792
> > http://www.cipriania.it
> > skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com> <
> skype%3Agenjasp at gmail.com <skype%253Agenjasp at gmail.com>>
> > msn:jaspzz
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> Simon Rayner
>
> State Key Laboratory of Virology
> Wuhan Institute of Virology
> Chinese Academy of Sciences
> Wuhan, Hubei 430071
> P.R.China
>
> +86 (27) 87199895 (office)
> +86 18627113001 (cell)
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From willishf at ufl.edu  Fri Oct 29 20:02:32 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 20:02:32 -0400
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTi=S_5J+9OLH6g93U4i5MOKo0eP-7Q4vmx4nGJVK@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
	<AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
	<AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>
	<AANLkTi=S_5J+9OLH6g93U4i5MOKo0eP-7Q4vmx4nGJVK@mail.gmail.com>
Message-ID: <AANLkTi=Rb8XAaNhT3bSkO6MNqAh8H2_tw39to-d3=15e@mail.gmail.com>

Jitendra

Could you morph from the coffee liquid to a DNA helix?

Scooter

On Fri, Oct 29, 2010 at 7:49 PM, Mark Fortner <phidias51 at gmail.com> wrote:

> The first logo looks nice; however, I don't see anything in it that
> connects
> it to biology.  The second logo is too close to Oracle's logo, and I
> suspect
> would require written permission from them in order to use it.
>
> Cheers,
>
> Mark
>
> card.ly: <http://card.ly/phidias51>
>
>
> On Fri, Oct 29, 2010 at 4:38 PM, simon rayner <simon.rayner.cn at gmail.com
> >wrote:
>
> > just a suggestion, but might beans falling out the cup suggest that
> biojava
> > is unstable?  just offering feedback, i still think it looks very slick!
> >
> > On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani <genjasp at gmail.com
> > >wrote:
> >
> > > Great Logo!!!
> > >
> > > :D
> > >
> > > 2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> > > > Dear All
> > > > I have designed a n new biojava logo. Please see the detail of it:
> > > > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > > > <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your
> > > valuable
> > > > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> > > >
> > > >
> > > > thanks
> > > >
> > > > --
> > > > Jitendra Narayan
> > > > Bioinformatist
> > > > www.bioinformaticsonline.com
> > > > _______________________________________________
> > > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > >
> > >
> > >
> > >
> > > --
> > > Alessandro Cipriani
> > > (+39) 3206009509
> > > (+39) 3931311792
> > > http://www.cipriania.it
> > > skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com> <
> skype%3Agenjasp at gmail.com <skype%253Agenjasp at gmail.com>> <
> > skype%3Agenjasp at gmail.com <skype%253Agenjasp at gmail.com> <
> skype%253Agenjasp at gmail.com <skype%25253Agenjasp at gmail.com>>>
> > > msn:jaspzz
> > >
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> >
> >
> >
> > --
> > Simon Rayner
> >
> > State Key Laboratory of Virology
> > Wuhan Institute of Virology
> > Chinese Academy of Sciences
> > Wuhan, Hubei 430071
> > P.R.China
> >
> > +86 (27) 87199895 (office)
> > +86 18627113001 (cell)
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>

From ayates at ebi.ac.uk  Sat Oct 30 05:20:30 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Sat, 30 Oct 2010 10:20:30 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
	<AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
Message-ID: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>

You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or if Andreas can post up the continuous integration server address there will be a release tonight. 

Just goes to show you should always do more testing than you think :).

Andy

On 29 Oct 2010, at 20:43, jitesh dundas wrote:

> That is good news.Thanks for the directions Andy.
> 
> I have already started on this.Let me analyze and write the code now.
> 
> Maybe a next month deadline is not unreachable in this case.
> 
> Here we go!
> JD
> 
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers are
>> also not unique across the sequence but it's a start :)
>> 
>> Share & enjoy!
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>> 
>>> I agree Andy. These have become standard functionalities that
>>> scientists do these days. I am all for implementing that in BioJava3.
>>> Java isn't that efficient for such functionalities so we will surely
>>> need more effort compared to the same in Python/Perl.
>>> 
>>> Regards,
>>> Jitesh Dundas
>>> 
>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> So if it's a suffix tree that's quite a fixed data structure so the
>>>> chances
>>>> of developing a pluggable mechanism there would be hard. I think there
>>>> also
>>>> has to be a limit as to what we can sensibly do. If people want to
>>>> contribute this kind of work though then it's all be very well received
>>>> (with the corresponding test environment/cases of course).
>>>> 
>>>> Cheers,
>>>> 
>>>> Andy
>>>> 
>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>> 
>>>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>>> NoSQL
>>>>> key-value database to store K-mers.  You could plugin custom map
>>>>> implementations to allow you to keep a count of the number of instances
>>>>> of
>>>>> particular K-mers that were found.  It might also be useful to be able
>>>>> to
>>>>> do
>>>>> set operations on those K-mer collections.  You could use it to
>>>>> determine
>>>>> which K-mers were present in a pathogen and not in a host.
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Mark
>>>>> 
>>>>> card.ly: <http://card.ly/phidias51>
>>>>> 
>>>>> 
>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>> <vishalthapar at gmail.com>wrote:
>>>>> 
>>>>>> Hi Andy,
>>>>>> 
>>>>>> This is good to have. I feel that including it as a part of core may
>>>>>> not
>>>>>> be
>>>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>>>> nice.
>>>>>> There is a project Bioinformatica
>>>>>> 
>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>>>> does something similar although not exactly. It counts the k-mers in a
>>>>>> given fasta file but it does not count k-mers for each sequence within
>>>>>> the
>>>>>> file, just all within a file. This is a good feature to have specially
>>>>>> if
>>>>>> one is trying to find patterns within sequences which is what I am
>>>>>> trying
>>>>>> to
>>>>>> do. It would most certainly be helpful to have a k-mer counting
>>>>>> algorithm
>>>>>> that counts k-mer frequency for each sequence. The way to go would be
>>>>>> to
>>>>>> use
>>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
>>>>>> not
>>>>>> since I haven't used java in a while and am just switching back to it.
>>>>>> A
>>>>>> paper on using suffix trees to generate genome wide k-mer frequencies
>>>>>> is:
>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>>>> software
>>>>>> is tallymer). It would be some work to implement this in java as a
>>>>>> module
>>>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>>>> fasta
>>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>>> files, I think that might be the way to go.
>>>>>> 
>>>>>> Thats just my two cents.What do you think?
>>>>>> 
>>>>>> -vishal
>>>>>> 
>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>> 
>>>>>>> Hi Vishal,
>>>>>>> 
>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>> BioJava
>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>> 
>>>>>>> public static void main(String[] args) {
>>>>>>> DNASequence d = new DNASequence("ATGATC");
>>>>>>> System.out.println("Non-Overlap");
>>>>>>> nonOverlap(d);
>>>>>>> System.out.println("Overlap");
>>>>>>> overlap(d);
>>>>>>> }
>>>>>>> 
>>>>>>> public static final int KMER = 3;
>>>>>>> 
>>>>>>> //Generate triplets overlapping
>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>> List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>> for(int i=1; i<=KMER; i++) {
>>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>             i, d.getLength());
>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>     l.add(w);
>>>>>>> }
>>>>>>> 
>>>>>>> //Will return ATG, ATC, TGA & GAT
>>>>>>> for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>>         System.out.println(subList);
>>>>>>>     }
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>> WindowedSequence<NucleotideCompound> w =
>>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>> //Will return ATG & ATC
>>>>>>> for(List<NucleotideCompound> subList: w) {
>>>>>>>     System.out.println(subList);
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>> of
>>>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>>>> This
>>>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>>> iterates through each window rather than stepping through delegating
>>>>>>> onto
>>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>> 
>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>> (prefix tree).
>>>>>>> 
>>>>>>> Hope this helps,
>>>>>>> 
>>>>>>> Andy
>>>>>>> 
>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>> 
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>>> or
>>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>>> k-mer
>>>>>>>> counts for every sequence in a fasta file. If something like this
>>>>>> exists
>>>>>>> it
>>>>>>>> would save me some time to write the code.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Vishal
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>> 
>>>>>>> --
>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> *Vishal Thapar, Ph.D.*
>>>>>> *Scientific informatics Analyst
>>>>>> Cold Spring Harbor Lab
>>>>>> Quick Bldg, Lowe Lab
>>>>>> 1 Bungtown Road
>>>>>> Cold Spring Harbor, NY - 11724*
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>> 
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>>>> --
>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Sat Oct 30 05:40:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 15:10:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
	<AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
	<1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID: <AANLkTikbG5xQG6uresAVQ4-QLVhUobQobEQ8ti8j=1nD@mail.gmail.com>

I got your point Andy. .Thanks.

On Sat, Oct 30, 2010 at 2:50 PM, Andy Yates <ayates at ebi.ac.uk> wrote:

> You should be aware I just found a bug in the code. This has been fixed but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or if Andreas can post up the continuous
> integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
> > That is good news.Thanks for the directions Andy.
> >
> > I have already started on this.Let me analyze and write the code now.
> >
> > Maybe a next month deadline is not unreachable in this case.
> >
> > Here we go!
> > JD
> >
> > On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> So we've got some basic kmer work now in SVN. If you look in the class
> >> SequenceMixin there are two static methods there for generating the two
> >> types of k-mers. It's not developed with Map storage in mind & I'll
> leave
> >> the door open there for anyone else to come in & develop it. The k-mers
> are
> >> also not unique across the sequence but it's a start :)
> >>
> >> Share & enjoy!
> >>
> >> Andy
> >>
> >> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
> >>
> >>> I agree Andy. These have become standard functionalities that
> >>> scientists do these days. I am all for implementing that in BioJava3.
> >>> Java isn't that efficient for such functionalities so we will surely
> >>> need more effort compared to the same in Python/Perl.
> >>>
> >>> Regards,
> >>> Jitesh Dundas
> >>>
> >>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> So if it's a suffix tree that's quite a fixed data structure so the
> >>>> chances
> >>>> of developing a pluggable mechanism there would be hard. I think there
> >>>> also
> >>>> has to be a limit as to what we can sensibly do. If people want to
> >>>> contribute this kind of work though then it's all be very well
> received
> >>>> (with the corresponding test environment/cases of course).
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Andy
> >>>>
> >>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
> >>>>
> >>>>> It might be useful to make the K-mer storage mechanism pluggable.
>  This
> >>>>> would allow a developer to use anything from a simple MultiMap, to a
> >>>>> NoSQL
> >>>>> key-value database to store K-mers.  You could plugin custom map
> >>>>> implementations to allow you to keep a count of the number of
> instances
> >>>>> of
> >>>>> particular K-mers that were found.  It might also be useful to be
> able
> >>>>> to
> >>>>> do
> >>>>> set operations on those K-mer collections.  You could use it to
> >>>>> determine
> >>>>> which K-mers were present in a pathogen and not in a host.
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> card.ly: <http://card.ly/phidias51>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
> >>>>> <vishalthapar at gmail.com>wrote:
> >>>>>
> >>>>>> Hi Andy,
> >>>>>>
> >>>>>> This is good to have. I feel that including it as a part of core may
> >>>>>> not
> >>>>>> be
> >>>>>> necessary but having it as part of Genomic module in biojava3 will
> be
> >>>>>> nice.
> >>>>>> There is a project Bioinformatica
> >>>>>>
> >>>>>>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
> >>>>>> does something similar although not exactly. It counts the k-mers in
> a
> >>>>>> given fasta file but it does not count k-mers for each sequence
> within
> >>>>>> the
> >>>>>> file, just all within a file. This is a good feature to have
> specially
> >>>>>> if
> >>>>>> one is trying to find patterns within sequences which is what I am
> >>>>>> trying
> >>>>>> to
> >>>>>> do. It would most certainly be helpful to have a k-mer counting
> >>>>>> algorithm
> >>>>>> that counts k-mer frequency for each sequence. The way to go would
> be
> >>>>>> to
> >>>>>> use
> >>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
> >>>>>> not
> >>>>>> since I haven't used java in a while and am just switching back to
> it.
> >>>>>> A
> >>>>>> paper on using suffix trees to generate genome wide k-mer
> frequencies
> >>>>>> is:
> >>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> >>>>>> software
> >>>>>> is tallymer). It would be some work to implement this in java as a
> >>>>>> module
> >>>>>> for biojava3 but I can see that this will be helpful. Again, for
> small
> >>>>>> fasta
> >>>>>> files, it might not be efficient to create a suffix tree but for
> bigger
> >>>>>> files, I think that might be the way to go.
> >>>>>>
> >>>>>> Thats just my two cents.What do you think?
> >>>>>>
> >>>>>> -vishal
> >>>>>>
> >>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk>
> wrote:
> >>>>>>
> >>>>>>> Hi Vishal,
> >>>>>>>
> >>>>>>> As far as I am aware there is nothing which will generate them in
> >>>>>>> BioJava
> >>>>>>> at the moment. However it is possible to do it with BioJava3:
> >>>>>>>
> >>>>>>> public static void main(String[] args) {
> >>>>>>> DNASequence d = new DNASequence("ATGATC");
> >>>>>>> System.out.println("Non-Overlap");
> >>>>>>> nonOverlap(d);
> >>>>>>> System.out.println("Overlap");
> >>>>>>> overlap(d);
> >>>>>>> }
> >>>>>>>
> >>>>>>> public static final int KMER = 3;
> >>>>>>>
> >>>>>>> //Generate triplets overlapping
> >>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>>>>> List<WindowedSequence<NucleotideCompound>> l =
> >>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>>>>> for(int i=1; i<=KMER; i++) {
> >>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>>>>>             i, d.getLength());
> >>>>>>>     WindowedSequence<NucleotideCompound> w =
> >>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>>>>>     l.add(w);
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Will return ATG, ATC, TGA & GAT
> >>>>>>> for(WindowedSequence<NucleotideCompound> w: l) {
> >>>>>>>     for(List<NucleotideCompound> subList: w) {
> >>>>>>>         System.out.println(subList);
> >>>>>>>     }
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Generate triplet Compound lists non-overlapping
> >>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>>>>> WindowedSequence<NucleotideCompound> w =
> >>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>>>>> //Will return ATG & ATC
> >>>>>>> for(List<NucleotideCompound> subList: w) {
> >>>>>>>     System.out.println(subList);
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The disadvantage of all of these solutions is that they generate
> lists
> >>>>>>> of
> >>>>>>> Compounds so kmer generation can/will be a memory intensive
> operation.
> >>>>>> This
> >>>>>>> does mean it has to be since sub sequences are thin wrappers around
> an
> >>>>>>> underlying sequence. Also the overlap solution is non-optimal since
> it
> >>>>>>> iterates through each window rather than stepping through
> delegating
> >>>>>>> onto
> >>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
> >>>>>>>
> >>>>>>> As for unique k-mers that's something which would require a bit
> more
> >>>>>>> engineering & would be better suited to a solution built around a
> Trie
> >>>>>>> (prefix tree).
> >>>>>>>
> >>>>>>> Hope this helps,
> >>>>>>>
> >>>>>>> Andy
> >>>>>>>
> >>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> I had a quick question: Does Biojava have a method to generate
> k-mers
> >>>>>> or
> >>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
> >>>>>> k-mer
> >>>>>>>> counts for every sequence in a fasta file. If something like this
> >>>>>> exists
> >>>>>>> it
> >>>>>>>> would save me some time to write the code.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Vishal
> >>>>>>>> _______________________________________________
> >>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>
> >>>>>>> --
> >>>>>>> Andrew Yates                   Ensembl Genomes Engineer
> >>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> *Vishal Thapar, Ph.D.*
> >>>>>> *Scientific informatics Analyst
> >>>>>> Cold Spring Harbor Lab
> >>>>>> Quick Bldg, Lowe Lab
> >>>>>> 1 Bungtown Road
> >>>>>> Cold Spring Harbor, NY - 11724*
> >>>>>> _______________________________________________
> >>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>
> >>>>> _______________________________________________
> >>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
> >>>> --
> >>>> Andrew Yates                   Ensembl Genomes Engineer
> >>>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
> >>
> >> --
> >> Andrew Yates                   Ensembl Genomes Engineer
> >> EMBL-EBI                       Tel: +44-(0)1223-492538
> >> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>
> >>
> >>
> >>
> >>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>

From andreas at sdsc.edu  Sat Oct 30 06:50:48 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sat, 30 Oct 2010 06:50:48 -0400
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
	<AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
	<1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID: <AANLkTik+WUeseWqDSnLkba6N+35xADYtnMQ9xVgGmDtp@mail.gmail.com>

just kicked off a new build.. alpha4 should be on the servers
shortly... you don't need cruisecontrol for a release. Anybody with an
ssh account on portal.open-bio (and set up ssh keys correctly) can do
mvn release:clean release:prepare release:perform

A

On Sat, Oct 30, 2010 at 5:20 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or if Andreas can post up the continuous integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
>> That is good news.Thanks for the directions Andy.
>>
>> I have already started on this.Let me analyze and write the code now.
>>
>> Maybe a next month deadline is not unreachable in this case.
>>
>> Here we go!
>> JD
>>
>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> So we've got some basic kmer work now in SVN. If you look in the class
>>> SequenceMixin there are two static methods there for generating the two
>>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>>> the door open there for anyone else to come in & develop it. The k-mers are
>>> also not unique across the sequence but it's a start :)
>>>
>>> Share & enjoy!
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>>
>>>> I agree Andy. These have become standard functionalities that
>>>> scientists do these days. I am all for implementing that in BioJava3.
>>>> Java isn't that efficient for such functionalities so we will surely
>>>> need more effort compared to the same in Python/Perl.
>>>>
>>>> Regards,
>>>> Jitesh Dundas
>>>>
>>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> So if it's a suffix tree that's quite a fixed data structure so the
>>>>> chances
>>>>> of developing a pluggable mechanism there would be hard. I think there
>>>>> also
>>>>> has to be a limit as to what we can sensibly do. If people want to
>>>>> contribute this kind of work though then it's all be very well received
>>>>> (with the corresponding test environment/cases of course).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Andy
>>>>>
>>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>>
>>>>>> It might be useful to make the K-mer storage mechanism pluggable. ?This
>>>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>>>> NoSQL
>>>>>> key-value database to store K-mers. ?You could plugin custom map
>>>>>> implementations to allow you to keep a count of the number of instances
>>>>>> of
>>>>>> particular K-mers that were found. ?It might also be useful to be able
>>>>>> to
>>>>>> do
>>>>>> set operations on those K-mer collections. ?You could use it to
>>>>>> determine
>>>>>> which K-mers were present in a pathogen and not in a host.
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> card.ly: <http://card.ly/phidias51>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>>> <vishalthapar at gmail.com>wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> This is good to have. I feel that including it as a part of core may
>>>>>>> not
>>>>>>> be
>>>>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>>>>> nice.
>>>>>>> There is a project Bioinformatica
>>>>>>>
>>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>>>>> does something similar although not exactly. It counts the k-mers in a
>>>>>>> given fasta file but it does not count k-mers for each sequence within
>>>>>>> the
>>>>>>> file, just all within a file. This is a good feature to have specially
>>>>>>> if
>>>>>>> one is trying to find patterns within sequences which is what I am
>>>>>>> trying
>>>>>>> to
>>>>>>> do. It would most certainly be helpful to have a k-mer counting
>>>>>>> algorithm
>>>>>>> that counts k-mer frequency for each sequence. The way to go would be
>>>>>>> to
>>>>>>> use
>>>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
>>>>>>> not
>>>>>>> since I haven't used java in a while and am just switching back to it.
>>>>>>> A
>>>>>>> paper on using suffix trees to generate genome wide k-mer frequencies
>>>>>>> is:
>>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>>>>> software
>>>>>>> is tallymer). It would be some work to implement this in java as a
>>>>>>> module
>>>>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>>>>> fasta
>>>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>>>> files, I think that might be the way to go.
>>>>>>>
>>>>>>> Thats just my two cents.What do you think?
>>>>>>>
>>>>>>> -vishal
>>>>>>>
>>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>>> BioJava
>>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>>>
>>>>>>>> public static void main(String[] args) {
>>>>>>>> DNASequence d = new DNASequence("ATGATC");
>>>>>>>> System.out.println("Non-Overlap");
>>>>>>>> nonOverlap(d);
>>>>>>>> System.out.println("Overlap");
>>>>>>>> overlap(d);
>>>>>>>> }
>>>>>>>>
>>>>>>>> public static final int KMER = 3;
>>>>>>>>
>>>>>>>> //Generate triplets overlapping
>>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>> List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>> ? ? ? ? new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>> for(int i=1; i<=KMER; i++) {
>>>>>>>> ? ? SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>> ? ? ? ? ? ? i, d.getLength());
>>>>>>>> ? ? WindowedSequence<NucleotideCompound> w =
>>>>>>>> ? ? ? ? new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>> ? ? l.add(w);
>>>>>>>> }
>>>>>>>>
>>>>>>>> //Will return ATG, ATC, TGA & GAT
>>>>>>>> for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>> ? ? for(List<NucleotideCompound> subList: w) {
>>>>>>>> ? ? ? ? System.out.println(subList);
>>>>>>>> ? ? }
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>> WindowedSequence<NucleotideCompound> w =
>>>>>>>> ? ? ? ? new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>> //Will return ATG & ATC
>>>>>>>> for(List<NucleotideCompound> subList: w) {
>>>>>>>> ? ? System.out.println(subList);
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>>> of
>>>>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>>>>> This
>>>>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>>>> iterates through each window rather than stepping through delegating
>>>>>>>> onto
>>>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>>>
>>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>>> (prefix tree).
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>>>> or
>>>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>>>> k-mer
>>>>>>>>> counts for every sequence in a fasta file. If something like this
>>>>>>> exists
>>>>>>>> it
>>>>>>>>> would save me some time to write the code.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Vishal
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
>>>>>>>> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
>>>>>>>> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
>>>>>>>> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Vishal Thapar, Ph.D.*
>>>>>>> *Scientific informatics Analyst
>>>>>>> Cold Spring Harbor Lab
>>>>>>> Quick Bldg, Lowe Lab
>>>>>>> 1 Bungtown Road
>>>>>>> Cold Spring Harbor, NY - 11724*
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>>> --
>>>>> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
>>>>> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>
>>> --
>>> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
>>> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>
> --
> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From dasarnow at gmail.com  Sun Oct 31 19:56:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 31 Oct 2010 16:56:05 -0700
Subject: [Biojava-l] Superimposing structure pieces
Message-ID: <AANLkTimubnTZ3qKFvVqhbNpFuV+EjVNBV1=4k+XMnQLj@mail.gmail.com>

I've been trying to pull out pieces of protein chains and superimpose
them...my current code (as generic-ified code snips below) works, but
I wonder if it couldn't be faster.
Has anyone worked on similar methods?  Any other advice?

Best regards everyone,
da

Getting residue CA's as Atom[]:

for (int i; i < length; i++) {
    someAtoms[i] = someChain.getSeqResGroup(start + i).getAtom("CA");
}

Superimposing/aligning:

SVDSuperimposer svds = new SVDSuperimposer(someAtoms1, someAtoms2);
Matrix rot = svds.getRotation();
Atom trans = svds.getTranslation();
for (int i = 0; i < length; i++) {
    Calc.rotate(someAtoms1[i], rot);
    Calc.shift(someAtoms1[i], trans);
}
SVDSuperimposer.getRmsd(someAtoms1, someAtoms2);

From andreas at sdsc.edu  Sun Oct 31 23:08:00 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 31 Oct 2010 23:08:00 -0400
Subject: [Biojava-l] Superimposing structure pieces
In-Reply-To: <AANLkTimubnTZ3qKFvVqhbNpFuV+EjVNBV1=4k+XMnQLj@mail.gmail.com>
References: <AANLkTimubnTZ3qKFvVqhbNpFuV+EjVNBV1=4k+XMnQLj@mail.gmail.com>
Message-ID: <AANLkTimm+xX45xCYUZVuu9bPFEhXT2fgNY44C0L-qzRq@mail.gmail.com>

Hi Daniel,

couple of thoughts when I see this:

- in case you have not seen this yet, take a look at docu on structure
alignment: http://biojava.org/wiki/BioJava:CookBook:PDB:align
- the direction of your rotations is wrong,the SVDSuperimposer gives
you the operations to be applied on the second atom set.
- there is some utility methods in StructureTools, that might come in
handy. e.g.
           Atom[] ca1 = StructureTools.getAtomCAArray(structure1);
           Atom[] ca2 = StructureTools.getAtomCAArray(structure2);

- any particular reason why you are working with SEQRES records? for
the superposition it might be sufficient to work with the ATOM records
only, which can give you a quicker parsing of the files, since you can
turn off the alignment of ATOM and SEQRES. Having said that, there can
be situations when you actually might want it, e.g. see
SmithWaterman3Daligner, which does a sequence based structure
alignment...

hope that helps,

Andreas


On Sun, Oct 31, 2010 at 7:56 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> I've been trying to pull out pieces of protein chains and superimpose
> them...my current code (as generic-ified code snips below) works, but
> I wonder if it couldn't be faster.
> Has anyone worked on similar methods? ?Any other advice?
>
> Best regards everyone,
> da
>
> Getting residue CA's as Atom[]:
>
> for (int i; i < length; i++) {
> ? ?someAtoms[i] = someChain.getSeqResGroup(start + i).getAtom("CA");
> }
>
> Superimposing/aligning:
>
> SVDSuperimposer svds = new SVDSuperimposer(someAtoms1, someAtoms2);
> Matrix rot = svds.getRotation();
> Atom trans = svds.getTranslation();
> for (int i = 0; i < length; i++) {
> ? ?Calc.rotate(someAtoms1[i], rot);
> ? ?Calc.shift(someAtoms1[i], trans);
> }
> SVDSuperimposer.getRmsd(someAtoms1, someAtoms2);
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From asandro1501 at gmail.com  Fri Oct  1 16:52:50 2010
From: asandro1501 at gmail.com (Alex Silva)
Date: Fri, 1 Oct 2010 13:52:50 -0300
Subject: [Biojava-l] Help files genbank
Message-ID: <AANLkTintytv9WHwLAja6mUYWrP1YBG-a+sL9fxX7iq8n@mail.gmail.com>

Hi

I am asking again for help reading a file format in genbank, I need to do
the analysis of the headers. I could not use any because I am a beginner in
java. Does anyone have some code that you used for this?


Em portugu?s

Estou solicitando novamente uma ajuda para leitura de arquivos no formato
genbank, preciso fazer a analise dos cabe?alhos. N?o consegui utilizar
nenhum porque sou iniciante em java. Algu?m tem algum c?digo que tenha
utilizado para isso?

-- 
Alex Silva
G.R.A. Sistemas Corporativos
msn: gra.sistemas at hotmail.com
55-9165-7378


From holland at eaglegenomics.com  Fri Oct  1 16:56:09 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 1 Oct 2010 17:56:09 +0100
Subject: [Biojava-l] Help files genbank
In-Reply-To: <AANLkTintytv9WHwLAja6mUYWrP1YBG-a+sL9fxX7iq8n@mail.gmail.com>
References: <AANLkTintytv9WHwLAja6mUYWrP1YBG-a+sL9fxX7iq8n@mail.gmail.com>
Message-ID: <D24FA959-4A56-47E8-B326-B6CEFE893ECC@eaglegenomics.com>

This is a good starting point: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_and_writing_files.


On 1 Oct 2010, at 17:52, Alex Silva wrote:

> Hi
> 
> I am asking again for help reading a file format in genbank, I need to do
> the analysis of the headers. I could not use any because I am a beginner in
> java. Does anyone have some code that you used for this?
> 
> 
> 
> 
> Em portugu?s
> 
> Estou solicitando novamente uma ajuda para leitura de arquivos no formato
> genbank, preciso fazer a analise dos cabe?alhos. N?o consegui utilizar
> nenhum porque sou iniciante em java. Algu?m tem algum c?digo que tenha
> utilizado para isso?
> 
> -- 
> Alex Silva
> G.R.A. Sistemas Corporativos
> msn: gra.sistemas at hotmail.com
> 55-9165-7378
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From pjotr.public23 at thebird.nl  Sat Oct  2 09:15:06 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Sat, 2 Oct 2010 11:15:06 +0200
Subject: [Biojava-l] BioJava <-> R
Message-ID: <20101002091506.GA17702@thebird.nl>

Anyone here who has real experience using the JRI? Who would be
interested, and have some exposure to, invoking R from Java through a
native interface in bioinformatics?

Pj.


From hlapp at drycafe.net  Sun Oct  3 01:26:49 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 2 Oct 2010 21:26:49 -0400
Subject: [Biojava-l] BioJava <-> R
In-Reply-To: <20101002091506.GA17702@thebird.nl>
References: <20101002091506.GA17702@thebird.nl>
Message-ID: <74DF3E4D-FC22-4719-9E6B-08248B14D4AA@drycafe.net>

We use this in the Mesquite<->R bridge. I haven't worked much on the  
Java to R side, but it seems to work well.

http://mesquiteproject.org/packages/Mesquite.R/

	-hilmar

On Oct 2, 2010, at 5:15 AM, Pjotr Prins wrote:

> Anyone here who has real experience using the JRI? Who would be
> interested, and have some exposure to, invoking R from Java through a
> native interface in bioinformatics?
>
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From andrew.mcsweeny at rockets.utoledo.edu  Tue Oct 12 21:41:07 2010
From: andrew.mcsweeny at rockets.utoledo.edu (McSweeny, Andrew J)
Date: Tue, 12 Oct 2010 21:41:07 +0000
Subject: [Biojava-l] How to share code while protecting copyrights?
Message-ID: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>

Hi,

I am working on a project which simulates sexual reproduction in a population of digital organisms.  Their genome is just a contig from hg18.  It's pretty interesting and I can talk more about it in the future....

Anyways, how can I share my code for this project without having to worry that someone else will use it to publish a paper before my group does?

I'm certain nobody in the open source community would do that, but how do I convince my group that opening our project to BioJava is a good idea?

-Andrew


From andreas at sdsc.edu  Wed Oct 13 06:02:34 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 12 Oct 2010 23:02:34 -0700
Subject: [Biojava-l] biojava 3.0 release plan
Message-ID: <AANLkTi=9X-uUEXJPk=To36nxj6JJBr4xazE09jW3RTso@mail.gmail.com>

Hi,

BioJava 3 has matured massively in SVN during this year and it is time to
prepare a first release. I propose the following release plan. See also two
other topics for discussion below.

Release Plan 3.0

* Alpha release build(s)
  during the next days I will start to provide a first alpha release build.
This will be followed by semi-regular follow up alpha builds (depending on
SVN activity)

- During the next weeks any missing features should be committed to SVN.
 Refactoring of code can still be done during this time.
- Add and update documentation in wiki
- Module maintainers: check compile warnings for your modules in automated
builds. Make sure no compile warnings are being displayed.


* Beta release build(s)
  the first beta release is scheduled for the weekend Nov 21st.

- From this point on only minor changes (bug fixes) should be added to the
code base
- Module maintainers: check and update javadoc for your modules

* Release 3.0
  The 3.0 Release is scheduled for Dez 12th


There are two things we should still discuss:

* backwards compatibility:
the current "core" module contains tons of legacy 1.7 code. Shall I go ahead
and delete this module?

* documentation:
The wiki contains tons of documentation for 1.7 which will not be useful for
3.0. As a procedure for cleaning this up and avoiding confusion I suggest to
move all 1.7 related docu into a special section of the wiki. All toplevel
links to documentation should point to 3.0. Any other suggestions?


Andreas


From markjschreiber at gmail.com  Wed Oct 13 09:26:04 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:26:04 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
Message-ID: <AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>

Hi -

My understanding of copyright is that it is yours as soon as you assert that
it is your creation. You can simply add a copyright statement to each file
containing the code (in the header for example). The reality is that
defending copyright is your responsibility. If someone violates it, you have
to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it can be
used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick one
of these that best matches your needs. BioJava code is LGPL so if you want
your code to go into the BioJava code base you will need to make your code
LGPL.

It's always a good idea to add @author tags to Java code to ensure
appropriate attribution.

Finally, if someone steals your code and publishes results before you then
you can always make a complaint to the journal editors. If it is a reputable
journal, and you have reasonable proof the editor should take some action
such as forcing a retraction.  You can also make a distribution agreement
saying that if someone uses this code they agree not to publish without
first consulting you.

If you want to make it really water tight, get a lawyer and explain
specifically what you want to share and what you want to protect or prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms.  Their genome is just a contig from hg18.
>  It's pretty interesting and I can talk more about it in the future....
>
> Anyways, how can I share my code for this project without having to worry
> that someone else will use it to publish a paper before my group does?
>
> I'm certain nobody in the open source community would do that, but how do I
> convince my group that opening our project to BioJava is a good idea?
>
> -Andrew
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From markjschreiber at gmail.com  Wed Oct 13 09:28:05 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:28:05 +0200
Subject: [Biojava-l] [Biojava-dev] biojava 3.0 release plan
In-Reply-To: <AANLkTi=9X-uUEXJPk=To36nxj6JJBr4xazE09jW3RTso@mail.gmail.com>
References: <AANLkTi=9X-uUEXJPk=To36nxj6JJBr4xazE09jW3RTso@mail.gmail.com>
Message-ID: <AANLkTik3LJLoBsDcwtfhCQbw-tyRM2cE+mWb0xg7EnGJ@mail.gmail.com>

Hi Andreas -

Excellent work from the team this year.

I would recommend removing as much legacy code as possible and removing
(preferably rewriting) the legacy documentation. I think it would be better
to have no docs than out of date docs.

- Mark

On Wed, Oct 13, 2010 at 8:02 AM, Andreas Prlic <andreas at sdsc.edu> wrote:

> Hi,
>
> BioJava 3 has matured massively in SVN during this year and it is time to
> prepare a first release. I propose the following release plan. See also two
> other topics for discussion below.
>
> Release Plan 3.0
>
> * Alpha release build(s)
>  during the next days I will start to provide a first alpha release build.
> This will be followed by semi-regular follow up alpha builds (depending on
> SVN activity)
>
> - During the next weeks any missing features should be committed to SVN.
>  Refactoring of code can still be done during this time.
> - Add and update documentation in wiki
> - Module maintainers: check compile warnings for your modules in automated
> builds. Make sure no compile warnings are being displayed.
>
>
> * Beta release build(s)
>  the first beta release is scheduled for the weekend Nov 21st.
>
> - From this point on only minor changes (bug fixes) should be added to the
> code base
> - Module maintainers: check and update javadoc for your modules
>
> * Release 3.0
>  The 3.0 Release is scheduled for Dez 12th
>
>
> There are two things we should still discuss:
>
> * backwards compatibility:
> the current "core" module contains tons of legacy 1.7 code. Shall I go
> ahead
> and delete this module?
>
> * documentation:
> The wiki contains tons of documentation for 1.7 which will not be useful
> for
> 3.0. As a procedure for cleaning this up and avoiding confusion I suggest
> to
> move all 1.7 related docu into a special section of the wiki. All toplevel
> links to documentation should point to 3.0. Any other suggestions?
>
>
> Andreas
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>


From paolo.romano at istge.it  Wed Oct 13 10:17:27 2010
From: paolo.romano at istge.it (Paolo Romano)
Date: Wed, 13 Oct 2010 12:17:27 +0200
Subject: [Biojava-l] NETTAB 2010 Biological Wikis: Call for posters and
 participation
Message-ID: <201010131018.o9DAHTjq009877@clus2.istge.it>

Apologizes for duplications
====

Joint NETTAB 2010 and BBCC 2010 workshop

Biological Wikis

November 29 - December 1, 2010
Congress Center, University of Naples "Federico II", Naples, Italy

http://www.nettab.org/2010/


The joint NETTAB and BBCC 2010 workshop on "Biological Wikis" 
promises to be a great meeting for all researchers involved in the 
exploitation of wikis in biology.
Come and discuss your ideas and doubts with such scientists as Alex 
Bateman, Alexander Pico, Andrew Su, Dan Bolser, Robert Hoffmann, 
Thomas Kelder, Mike Cariaso, Adam Godzik, Luca Toldo and many other 
who, we hope, will join the workshop.

It's a great chance to follow smart tutorials and lectures on 
WikiPathways, WikiGenes, Semantic Wiki, PDBWiki, Gene Wiki and a 
proficient use of Wikipedia.
See a list of keynote speakers and tutorials at 
http://www.nettab.org/2010/progr.html .

There still is time to submit abstracts for posters and software 
demonstrations until next October 17, 2010!
The complete Call is available on-line at 
http://www.nettab.org/2010/call.html .

Registration is open at http://www.nettab.org/2010/rform.html .
Register within next October 29, 2010 and take profit of early 
registration fees.

A reduction of 20 euro applies to all fees for members of ISCB and 
other societies and networks.
More reductions are foreseen for PhD students.

Further information is availble at http://www.nettab.org/2010/ .

Looking forward to seeing you soon in Naples.

Paolo Romano

Paolo Romano (paolo.romano at istge.it)
Bioinformatics
National Cancer Research Institute (IST)
Largo Rosanna Benzi, 10, I-16132, Genova, Italy
Tel: +39-010-5737-288  Fax: +39-010-5737-295  Skype: p.romano
Web: http://www.nettab.org/promano/


From pjotr.public23 at thebird.nl  Wed Oct 13 11:15:41 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:15:41 +0200
Subject: [Biojava-l] BioJava translation
Message-ID: <20101013111541.GA512@thebird.nl>

I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
rather slow. In fact, the biopython equivalent in native Python is
twice as fast. EMBOSS is again magnitudes faster. I am using
something like 

  rna = RNATools.createRNA(nucleotides);
  aa = RNATools.translate(rna);

Embarrassingly, even the R version is faster in the GeneR module, as
it uses a C module. 

I have a feeling this has to do with typed object creation at every
level, whereas Python and others uses plain character Strings. 

Any plans for speeding this up on the JVM? 

Pj.


From pjotr.public23 at thebird.nl  Wed Oct 13 11:40:37 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:40:37 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
Message-ID: <20101013114037.GA1166@thebird.nl>

Great! You mean BJ3 translation should work? Do you have a short
example of use?

Pj.

On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.


From holland at eaglegenomics.com  Wed Oct 13 11:27:05 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 12:27:05 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>

BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

On 13 Oct 2010, at 12:15, Pjotr Prins wrote:

> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is again magnitudes faster. I am using
> something like 
> 
>  rna = RNATools.createRNA(nucleotides);
>  aa = RNATools.translate(rna);
> 
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module. 
> 
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others uses plain character Strings. 
> 
> Any plans for speeding this up on the JVM? 
> 
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From holland at eaglegenomics.com  Wed Oct 13 11:42:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 12:42:21 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013114037.GA1166@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
Message-ID: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>

Afraid I'm a bit out of touch but someone else on this list should be able to help. Andy or Andreas maybe?

On 13 Oct 2010, at 12:40, Pjotr Prins wrote:

> Great! You mean BJ3 translation should work? Do you have a short
> example of use?
> 
> Pj.
> 
> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From pjotr.public23 at thebird.nl  Wed Oct 13 11:48:07 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:48:07 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>
Message-ID: <20101013114807.GA1569@thebird.nl>

On Wed, Oct 13, 2010 at 12:42:21PM +0100, Richard Holland wrote:
> Afraid I'm a bit out of touch but someone else on this list should
> be able to help. Andy or Andreas maybe?

It is not on the wiki yet, and I must admit I get lost in the source
tree. Any short example will do, translating from an ntseq (String) to
aaseq (String).

Pj.


From ayates at ebi.ac.uk  Wed Oct 13 11:50:25 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 12:50:25 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013114037.GA1166@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
Message-ID: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>

As of the moment there are the translation test cases which is the best documentation:

http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java

This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing.

Translation supports a lot of different modes with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available

Andy


On 13 Oct 2010, at 12:40, Pjotr Prins wrote:

> Great! You mean BJ3 translation should work? Do you have a short
> example of use?
> 
> Pj.
> 
> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From koen.bruynseels at cropdesign.com  Wed Oct 13 12:16:00 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Wed, 13 Oct 2010 14:16:00 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID: <OF5E0BDE0E.450C15D1-ONC12577BB.00436226-C12577BB.00436226@basf-c-s.be>


I will be out of the office starting  10/12/2010 and will not return until
10/14/2010.

I will respond to your message when I return.


From andreas at sdsc.edu  Wed Oct 13 15:42:44 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 08:42:44 -0700
Subject: [Biojava-l] BioJava translation
In-Reply-To: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
Message-ID: <AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>

Hi Andy,

any chance to add some wiki documentation for this as well? Would be
great....

Andreas


On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> As of the moment there are the translation test cases which is the best
> documentation:
>
>
> http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java
>
> This hopefully will give you a good idea about how to go about it. I was
> managing over 1000 translations per second of BRCA2 going from mRNA to
> peptide with checks. YMMV but I hope this is a lot faster than what you're
> currently seeing.
>
> Translation supports a lot of different modes with TranscriptionEngine
> being the place to configure this. The Javadoc should be good enough to help
> you through the different modes available
>
> Andy
>
>
> On 13 Oct 2010, at 12:40, Pjotr Prins wrote:
>
> > Great! You mean BJ3 translation should work? Do you have a short
> > example of use?
> >
> > Pj.
> >
> > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> >> BJ3 should be replacing most sequence operations with string operations,
> making the whole thing much faster.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From ayates at ebi.ac.uk  Wed Oct 13 15:46:58 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 16:46:58 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
Message-ID: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>

I will try my best to

Andy

On 13 Oct 2010, at 16:42, Andreas Prlic wrote:

> 
> Hi Andy,
> 
> any chance to add some wiki documentation for this as well? Would be great....
> 
> Andreas
> 
> 
> On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> As of the moment there are the translation test cases which is the best documentation:
> 
> http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java
> 
> This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing.
> 
> Translation supports a lot of different modes with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available
> 
> Andy
> 
> 
> On 13 Oct 2010, at 12:40, Pjotr Prins wrote:
> 
> > Great! You mean BJ3 translation should work? Do you have a short
> > example of use?
> >
> > Pj.
> >
> > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> 
> 
> -- 
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Wed Oct 13 15:58:44 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 17:58:44 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
	<3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
Message-ID: <20101013155844.GA2918@thebird.nl>

On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
> I will try my best to

Make sure to add the sequence should be uppercase. Took me a while to
crack that, as I only got a null pointer exception.

Pj.


From holland at eaglegenomics.com  Wed Oct 13 16:02:24 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 17:02:24 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013155844.GA2918@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
	<3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
	<20101013155844.GA2918@thebird.nl>
Message-ID: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>

whuh??? Shouldn't we be coding to cater for all case mixtures?!


On 13 Oct 2010, at 16:58, Pjotr Prins wrote:

> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
>> I will try my best to
> 
> Make sure to add the sequence should be uppercase. Took me a while to
> crack that, as I only got a null pointer exception.
> 
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From ayates at ebi.ac.uk  Wed Oct 13 16:11:40 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 17:11:40 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013114037.GA1166@thebird.nl>
	<5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
	<AANLkTi=RcHRR37QiZPtyg2g8fQEpU5+7+maPeo3=U9JA@mail.gmail.com>
	<3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
	<20101013155844.GA2918@thebird.nl>
	<1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>
Message-ID: <7740A206-98A0-4FBC-9CF8-B1AC0DE7D859@ebi.ac.uk>

I also thought we were as well. I can investigate

On 13 Oct 2010, at 17:02, Richard Holland wrote:

> whuh??? Shouldn't we be coding to cater for all case mixtures?!
> 
> 
> On 13 Oct 2010, at 16:58, Pjotr Prins wrote:
> 
>> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
>>> I will try my best to
>> 
>> Make sure to add the sequence should be uppercase. Took me a while to
>> crack that, as I only got a null pointer exception.
>> 
>> Pj.
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Wed Oct 13 16:13:36 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 18:13:36 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
Message-ID: <20101013161336.GA3184@thebird.nl>

On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

Good news, BJ3 is a lot faster! The previous version took 2 minutes
for the C.elegans genome (33 Mb), the BJ3 version takes 27sec on my
modest Thinkpad X61 laptop. After parsing the Fasta and turning it
into an upper case string the actual translation takes 16sec.

Only the C implementations are faster.

Here the relevant Scala code:

import bio._
import java.io._
import org.biojava3.core.sequence._
import org.biojava3.core.sequence.transcription.TranscriptionEngine
import org.biojava3.core.sequence.io.IUPACParser

// <cut> fetching infile from command line...

IUPACParser.getInstance().getTable(1);  // not sure we need this
IUPACParser.getInstance().getTable("UNIVERSAL");
val engine = TranscriptionEngine.getDefault()
val f = new FastaReader(infile)
f.foreach { 
  res => 
    val (id,tag,dna) = res
    println(List(">",id).mkString) 
    val dna2 = new DNASequence(dna.mkString.toUpperCase)
    val rna = dna2.getRNASequence(engine)
    println(rna.getProteinSequence(engine))
  }
}

prints:

>B0222.10
MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
>B0222.11
MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
(...)

Pj.


From ayates at ebi.ac.uk  Wed Oct 13 16:25:41 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 17:25:41 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013161336.GA3184@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
Message-ID: <F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>

That's great news and should be even faster once we get rid of the requirement to upper case since you're having to parse the same sequence twice.

I wonder what the C version does to make itself even faster

Andy

On 13 Oct 2010, at 17:13, Pjotr Prins wrote:

> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
> 
> Good news, BJ3 is a lot faster! The previous version took 2 minutes
> for the C.elegans genome (33 Mb), the BJ3 version takes 27sec on my
> modest Thinkpad X61 laptop. After parsing the Fasta and turning it
> into an upper case string the actual translation takes 16sec.
> 
> Only the C implementations are faster.
> 
> Here the relevant Scala code:
> 
> import bio._
> import java.io._
> import org.biojava3.core.sequence._
> import org.biojava3.core.sequence.transcription.TranscriptionEngine
> import org.biojava3.core.sequence.io.IUPACParser
> 
> // <cut> fetching infile from command line...
> 
> IUPACParser.getInstance().getTable(1);  // not sure we need this
> IUPACParser.getInstance().getTable("UNIVERSAL");
> val engine = TranscriptionEngine.getDefault()
> val f = new FastaReader(infile)
> f.foreach { 
>  res => 
>    val (id,tag,dna) = res
>    println(List(">",id).mkString) 
>    val dna2 = new DNASequence(dna.mkString.toUpperCase)
>    val rna = dna2.getRNASequence(engine)
>    println(rna.getProteinSequence(engine))
>  }
> }
> 
> prints:
> 
>> B0222.10
> MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
>> B0222.11
> MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
> (...)
> 
> Pj.
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Wed Oct 13 16:34:23 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 18:34:23 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
Message-ID: <20101013163423.GA3849@thebird.nl>

On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote:
> That's great news and should be even faster once we get rid of the requirement to upper case since you're having to parse the same sequence twice.
> 
> I wonder what the C version does to make itself even faster

The EMBOSS implementation is fastest by a mile - takes less than 3
seconds. But the code is, uhm, hard to read.

I think table lookups will win in C, whatever you try. But it may be an
interesting exercise if we can get close. Note I am perhaps not using the
fastest JVM.

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)

Pj.


From willishf at ufl.edu  Wed Oct 13 17:16:01 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 13:16:01 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013163423.GA3849@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
Message-ID: <AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>

The Biojava3 has an additional validation layer and object creation going
from DNA sequence to RNA sequence and then using the appropriate translation
rules to return a protein sequence. Could be easily twice as fast if you
went from DNA sequence to ProteinSequence which would put it at 8 seconds.
We are going to carry a performance penalty setting everything up as a
proper object versus doing a simple String to String translation.


On Wed, Oct 13, 2010 at 12:34 PM, Pjotr Prins <pjotr.public23 at thebird.nl>wrote:

> On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote:
> > That's great news and should be even faster once we get rid of the
> requirement to upper case since you're having to parse the same sequence
> twice.
> >
> > I wonder what the C version does to make itself even faster
>
> The EMBOSS implementation is fastest by a mile - takes less than 3
> seconds. But the code is, uhm, hard to read.
>
> I think table lookups will win in C, whatever you try. But it may be an
> interesting exercise if we can get close. Note I am perhaps not using the
> fastest JVM.
>
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)
>
> Pj.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>


From pjotr.public23 at thebird.nl  Wed Oct 13 18:17:12 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 20:17:12 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
	<AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
Message-ID: <20101013181712.GA4482@thebird.nl>

I think it is a good idea. From a purist point of view you may object
(it is not biological), but most libraries do exactly that.

If direct translation gets it down to 8sec, we may well half that
with further tweaking.

Pj.

On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> The Biojava3 has an additional validation layer and object creation going
> from DNA sequence to RNA sequence and then using the appropriate translation
> rules to return a protein sequence. Could be easily twice as fast if you
> went from DNA sequence to ProteinSequence which would put it at 8 seconds.
> We are going to carry a performance penalty setting everything up as a
> proper object versus doing a simple String to String translation.


From darnells at dnastar.com  Wed Oct 13 18:21:52 2010
From: darnells at dnastar.com (Steve Darnell)
Date: Wed, 13 Oct 2010 13:21:52 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>

Andrew,

Forgive me for being pessimistic, but I do not believe you can
publically distribute your code without running the risk of being
scooped.  Mark's suggestions are very good; however, the safest route
would be to withhold distribution of your code until your work is
published (or at very least accepted).

Also, I would suggest this argument for convincing your group to use
BioJava (disclaimer - I am not a lawyer).

Under the LGPL, you are not obligated to release your source code if:

(1) you create a "work based on the library" (e.g. direct modifications
or additions to the licensed work) but do not distribute it, and
(2) you create a "work that uses the library" by dynamically linking
your work to the licensed work (see distribution clause #5 of the LGPL:
http://www.gnu.org/licenses/lgpl-2.1.html)

If you follow choice #2, you can license and distribute your work under
terms of your group's choosing (open or closed, submit it to the BioJava
developers for inclusion or not) while gaining the benefit of reusing
BioJava.

~Steve

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org
[mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
Schreiber
Sent: Wednesday, October 13, 2010 4:26 AM
To: McSweeny, Andrew J
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] How to share code while protecting copyrights?

Hi -

My understanding of copyright is that it is yours as soon as you assert
that
it is your creation. You can simply add a copyright statement to each
file
containing the code (in the header for example). The reality is that
defending copyright is your responsibility. If someone violates it, you
have
to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it
can be
used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick
one
of these that best matches your needs. BioJava code is LGPL so if you
want
your code to go into the BioJava code base you will need to make your
code
LGPL.

It's always a good idea to add @author tags to Java code to ensure
appropriate attribution.

Finally, if someone steals your code and publishes results before you
then
you can always make a complaint to the journal editors. If it is a
reputable
journal, and you have reasonable proof the editor should take some
action
such as forcing a retraction.  You can also make a distribution
agreement
saying that if someone uses this code they agree not to publish without
first consulting you.

If you want to make it really water tight, get a lawyer and explain
specifically what you want to share and what you want to protect or
prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms.  Their genome is just a contig from
hg18.
>  It's pretty interesting and I can talk more about it in the
future....
>
> Anyways, how can I share my code for this project without having to
worry
> that someone else will use it to publish a paper before my group does?
>
> I'm certain nobody in the open source community would do that, but how
do I
> convince my group that opening our project to BioJava is a good idea?
>
> -Andrew
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


From andreas at sdsc.edu  Wed Oct 13 18:48:32 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 11:48:32 -0700
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
Message-ID: <AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>

> Forgive me for being pessimistic, but I do not believe you can
> publically distribute your code without running the risk of being
> scooped.  Mark's suggestions are very good; however, the safest route
> would be to withhold distribution of your code until your work is
> published (or at very least accepted).
>


I think that is too conservative - if getting scooped is an issue, I would
release the code shortly before submission of the first manuscript to a
journal. That way the source code can form part of the publication and the
referees can view the code during the review process.  Many views/downloads
of articles happen in the first few weeks after publication. Having a link
to the source code in the paper can be a great advertisement for the open
source project and help in community-building.

Andreas


>
> Also, I would suggest this argument for convincing your group to use
> BioJava (disclaimer - I am not a lawyer).
>
> Under the LGPL, you are not obligated to release your source code if:
>
> (1) you create a "work based on the library" (e.g. direct modifications
> or additions to the licensed work) but do not distribute it, and
> (2) you create a "work that uses the library" by dynamically linking
> your work to the licensed work (see distribution clause #5 of the LGPL:
> http://www.gnu.org/licenses/lgpl-2.1.html)
>
> If you follow choice #2, you can license and distribute your work under
> terms of your group's choosing (open or closed, submit it to the BioJava
> developers for inclusion or not) while gaining the benefit of reusing
> BioJava.
>
> ~Steve
>
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
> Schreiber
> Sent: Wednesday, October 13, 2010 4:26 AM
> To: McSweeny, Andrew J
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] How to share code while protecting copyrights?
>
> Hi -
>
> My understanding of copyright is that it is yours as soon as you assert
> that
> it is your creation. You can simply add a copyright statement to each
> file
> containing the code (in the header for example). The reality is that
> defending copyright is your responsibility. If someone violates it, you
> have
> to take them to court or issue a legal letter.
>
> You can also put an appropriate license on the code specifying how it
> can be
> used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick
> one
> of these that best matches your needs. BioJava code is LGPL so if you
> want
> your code to go into the BioJava code base you will need to make your
> code
> LGPL.
>
> It's always a good idea to add @author tags to Java code to ensure
> appropriate attribution.
>
> Finally, if someone steals your code and publishes results before you
> then
> you can always make a complaint to the journal editors. If it is a
> reputable
> journal, and you have reasonable proof the editor should take some
> action
> such as forcing a retraction.  You can also make a distribution
> agreement
> saying that if someone uses this code they agree not to publish without
> first consulting you.
>
> If you want to make it really water tight, get a lawyer and explain
> specifically what you want to share and what you want to protect or
> prevent.
>
> - Mark
>
> On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
> andrew.mcsweeny at rockets.utoledo.edu> wrote:
>
> > Hi,
> >
> > I am working on a project which simulates sexual reproduction in a
> > population of digital organisms.  Their genome is just a contig from
> hg18.
> >  It's pretty interesting and I can talk more about it in the
> future....
> >
> > Anyways, how can I share my code for this project without having to
> worry
> > that someone else will use it to publish a paper before my group does?
> >
> > I'm certain nobody in the open source community would do that, but how
> do I
> > convince my group that opening our project to BioJava is a good idea?
> >
> > -Andrew
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas.prlic at gmail.com  Wed Oct 13 19:18:12 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 13 Oct 2010 12:18:12 -0700
Subject: [Biojava-l] Questions related to biojava
In-Reply-To: <COL117-W299595E149AB88CB62D441A7550@phx.gbl>
References: <COL117-W299595E149AB88CB62D441A7550@phx.gbl>
Message-ID: <AANLkTi=oDrwtXBt3qxQinw7zUNN81uzYGOWEZ=rc2vkv@mail.gmail.com>

Hi Madhu,

best to keep such mails on the mailing list, otherwise they might get lost
in my flood of emails... see my reply below.

On Wed, Oct 13, 2010 at 12:08 PM, Madhusudan Gujral <mgujral2000 at hotmail.com
> wrote:

>  Hi Andreas,
>
>  I have couple of questions related to biojava. I would greatly appreciate
> if you could provide directions.
>
>  Is the biojava version 3.0 mature?
>  Is there any pom file for biojava that I can work with?
>  Is there a single tool to validate a fasta file?
>
>

- biojava 3.0 is in preparation of getting released. It is not release ready
but some of the modules are already used in some production environments
-  not sure what you mean with this question. You can see the source code in
SVN/git and there is also an automated build server providing snapshot
builds that can be used for Maven installations.
- what kind of vallidation do you have in mind? biojava3-core can do FASTA
parsing for you...

Andreas


From willishf at ufl.edu  Wed Oct 13 19:16:39 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 15:16:39 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013181712.GA4482@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
	<AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
	<20101013181712.GA4482@thebird.nl>
Message-ID: <AANLkTikZBaPj4dvn-sKCqOWh2LVbVjmfG9yHwpFzUzUf@mail.gmail.com>

Pjotr

What is an extra 8 seconds among friends if you know you are going to get
the correct answer and you can change the rules if needed!!!

Are you parsing the C.elgans genome or DNA representation of each protein in
the C.elgans genome?

If you take out the println statement that will help speed things up a
bunch. Java System.out is always slow.

I am checking on the problem with upper case. That shouldn't be an issue.

Thanks

Scooter


On Wed, Oct 13, 2010 at 2:17 PM, Pjotr Prins <pjotr.public23 at thebird.nl>wrote:

> I think it is a good idea. From a purist point of view you may object
> (it is not biological), but most libraries do exactly that.
>
> If direct translation gets it down to 8sec, we may well half that
> with further tweaking.
>
> Pj.
>
> On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> > The Biojava3 has an additional validation layer and object creation going
> > from DNA sequence to RNA sequence and then using the appropriate
> translation
> > rules to return a protein sequence. Could be easily twice as fast if you
> > went from DNA sequence to ProteinSequence which would put it at 8
> seconds.
> > We are going to carry a performance penalty setting everything up as a
> > proper object versus doing a simple String to String translation.
>
>


From pjotr.public23 at thebird.nl  Wed Oct 13 21:05:46 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 23:05:46 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>
Message-ID: <20101013210546.GB5479@thebird.nl>

Is that idea of getting scooped realistic?

All my code is online, that is my scientific track record, next to my
papers.

Online OSS code may bring benefits when other people find bugs, or
even improve things. I don't worry about getting scooped. First it is
easy to prove it is mine, exactly because it is out in the open, and
second it takes more than plain old code to get something published in
a journal.

In the rare case an idea is so sensitive and easy to copy, you can
publish it with some part missing.

I think too much code sits on planks gathering dust, just because
people have these worries. It is old school. We are in the business
of moving science forward - writing beautiful tools. Nothing less.

Pj.

On Wed, Oct 13, 2010 at 11:48:32AM -0700, Andreas Prlic wrote:
> > Forgive me for being pessimistic, but I do not believe you can
> > publically distribute your code without running the risk of being
> > scooped.  Mark's suggestions are very good; however, the safest route
> > would be to withhold distribution of your code until your work is
> > published (or at very least accepted).


From andreas at sdsc.edu  Wed Oct 13 21:24:54 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 14:24:54 -0700
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101013210546.GB5479@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<AANLkTi=MBGFw6HpYksENcf6_BqywPC0zS6OsonSiqx7Y@mail.gmail.com>
	<20101013210546.GB5479@thebird.nl>
Message-ID: <AANLkTinQatqghBZONzGcxRx9ySzDdZHnOBopri8TTHxZ@mail.gmail.com>

nicely put :-)

A

On Wed, Oct 13, 2010 at 2:05 PM, Pjotr Prins <pjotr.public23 at thebird.nl>wrote:

> Is that idea of getting scooped realistic?
>
> All my code is online, that is my scientific track record, next to my
> papers.
>
> Online OSS code may bring benefits when other people find bugs, or
> even improve things. I don't worry about getting scooped. First it is
> easy to prove it is mine, exactly because it is out in the open, and
> second it takes more than plain old code to get something published in
> a journal.
>
> In the rare case an idea is so sensitive and easy to copy, you can
> publish it with some part missing.
>
> I think too much code sits on planks gathering dust, just because
> people have these worries. It is old school. We are in the business
> of moving science forward - writing beautiful tools. Nothing less.
>
> Pj.
>
> On Wed, Oct 13, 2010 at 11:48:32AM -0700, Andreas Prlic wrote:
> > > Forgive me for being pessimistic, but I do not believe you can
> > > publically distribute your code without running the risk of being
> > > scooped.  Mark's suggestions are very good; however, the safest route
> > > would be to withhold distribution of your code until your work is
> > > published (or at very least accepted).
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From hlapp at drycafe.net  Wed Oct 13 21:44:36 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 13 Oct 2010 16:44:36 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
Message-ID: <A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>

How and when you want to be attributed in publications, and what you  
want someone else not to publish on, is an ethical matter. Licenses  
are legal instruments and not suited for ethical questions or social  
conventions. Rather, this is addressed by ethical and social  
conventions and requests.

A good example is the Ft Lauderdale agreement, which is not a legal  
instrument but an ethical request of those who peruse immediate- 
release sequencing data. If you have ethical or social requests to  
make of those who peruse your code, state them explicitly in a README  
and in the code.

By their nature, you can't legally enforce them. However, ethical  
behavior is policed - by all of us as a scientific community, not in  
the courts.

	-hilmar

On Oct 13, 2010, at 1:21 PM, Steve Darnell wrote:

> Andrew,
>
> Forgive me for being pessimistic, but I do not believe you can
> publically distribute your code without running the risk of being
> scooped.  Mark's suggestions are very good; however, the safest route
> would be to withhold distribution of your code until your work is
> published (or at very least accepted).
>
> Also, I would suggest this argument for convincing your group to use
> BioJava (disclaimer - I am not a lawyer).
>
> Under the LGPL, you are not obligated to release your source code if:
>
> (1) you create a "work based on the library" (e.g. direct  
> modifications
> or additions to the licensed work) but do not distribute it, and
> (2) you create a "work that uses the library" by dynamically linking
> your work to the licensed work (see distribution clause #5 of the  
> LGPL:
> http://www.gnu.org/licenses/lgpl-2.1.html)
>
> If you follow choice #2, you can license and distribute your work  
> under
> terms of your group's choosing (open or closed, submit it to the  
> BioJava
> developers for inclusion or not) while gaining the benefit of reusing
> BioJava.
>
> ~Steve
>
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
> Schreiber
> Sent: Wednesday, October 13, 2010 4:26 AM
> To: McSweeny, Andrew J
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] How to share code while protecting  
> copyrights?
>
> Hi -
>
> My understanding of copyright is that it is yours as soon as you  
> assert
> that
> it is your creation. You can simply add a copyright statement to each
> file
> containing the code (in the header for example). The reality is that
> defending copyright is your responsibility. If someone violates it,  
> you
> have
> to take them to court or issue a legal letter.
>
> You can also put an appropriate license on the code specifying how it
> can be
> used. Examples include GPL, LGPL, BSD, Apache License etc. You can  
> pick
> one
> of these that best matches your needs. BioJava code is LGPL so if you
> want
> your code to go into the BioJava code base you will need to make your
> code
> LGPL.
>
> It's always a good idea to add @author tags to Java code to ensure
> appropriate attribution.
>
> Finally, if someone steals your code and publishes results before you
> then
> you can always make a complaint to the journal editors. If it is a
> reputable
> journal, and you have reasonable proof the editor should take some
> action
> such as forcing a retraction.  You can also make a distribution
> agreement
> saying that if someone uses this code they agree not to publish  
> without
> first consulting you.
>
> If you want to make it really water tight, get a lawyer and explain
> specifically what you want to share and what you want to protect or
> prevent.
>
> - Mark
>
> On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
> andrew.mcsweeny at rockets.utoledo.edu> wrote:
>
>> Hi,
>>
>> I am working on a project which simulates sexual reproduction in a
>> population of digital organisms.  Their genome is just a contig from
> hg18.
>> It's pretty interesting and I can talk more about it in the
> future....
>>
>> Anyways, how can I share my code for this project without having to
> worry
>> that someone else will use it to publish a paper before my group  
>> does?
>>
>> I'm certain nobody in the open source community would do that, but  
>> how
> do I
>> convince my group that opening our project to BioJava is a good idea?
>>
>> -Andrew
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From ayates at ebi.ac.uk  Wed Oct 13 22:52:17 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 23:52:17 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <AANLkTikZBaPj4dvn-sKCqOWh2LVbVjmfG9yHwpFzUzUf@mail.gmail.com>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	<F29FBE5F-76A0-467F-9228-6092DED0D781@ebi.ac.uk>
	<20101013163423.GA3849@thebird.nl>
	<AANLkTi=wCT_mOPDfEMeue1htfY321PRuC9Gu9F=J=7Qf@mail.gmail.com>
	<20101013181712.GA4482@thebird.nl>
	<AANLkTikZBaPj4dvn-sKCqOWh2LVbVjmfG9yHwpFzUzUf@mail.gmail.com>
Message-ID: <7E59B83F-8371-4F79-AC4C-57D1A49A9398@ebi.ac.uk>

LOL well you could always parallelise it :)

I've gone & pushed a new version of the translator code to the SVN repo so it'll filter through to the public server soon. There's an added test case as well. The overall impact of this change seems to be about 25 translations of BRCA2 per second so it is significant; our current limit looks to be approx. 200 per second.

I hope you find this is faster without the need to edit & parse a Sequence String twice

Andy

On 13 Oct 2010, at 20:16, Scooter Willis wrote:

> Pjotr
> 
> What is an extra 8 seconds among friends if you know you are going to get the correct answer and you can change the rules if needed!!!
> 
> Are you parsing the C.elgans genome or DNA representation of each protein in the C.elgans genome? 
> 
> If you take out the println statement that will help speed things up a bunch. Java System.out is always slow.
> 
> I am checking on the problem with upper case. That shouldn't be an issue.
> 
> Thanks
> 
> Scooter
> 
> 
> On Wed, Oct 13, 2010 at 2:17 PM, Pjotr Prins <pjotr.public23 at thebird.nl> wrote:
> I think it is a good idea. From a purist point of view you may object
> (it is not biological), but most libraries do exactly that.
> 
> If direct translation gets it down to 8sec, we may well half that
> with further tweaking.
> 
> Pj.
> 
> On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> > The Biojava3 has an additional validation layer and object creation going
> > from DNA sequence to RNA sequence and then using the appropriate translation
> > rules to return a protein sequence. Could be easily twice as fast if you
> > went from DNA sequence to ProteinSequence which would put it at 8 seconds.
> > We are going to carry a performance penalty setting everything up as a
> > proper object versus doing a simple String to String translation.
> 
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From pjotr.public23 at thebird.nl  Thu Oct 14 07:00:12 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Thu, 14 Oct 2010 09:00:12 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>
Message-ID: <20101014070012.GA7296@thebird.nl>

On Wed, Oct 13, 2010 at 04:44:36PM -0500, Hilmar Lapp wrote:
> By their nature, you can't legally enforce them. However, ethical  
> behavior is policed - by all of us as a scientific community, not in the 
> courts.

I know people who make it their business to pursue companies that do
not honour OSS licenses. The companies always have to retrack.

Is there any precedent in science where open source software was used
to scoop research? And how did that scientist fare?

With scientists I can't see it happening. Getting caught out that way
will hurt all future prospects for an individual or group.

With this reasoning you are best off putting code in the public domain
as fast as possible.

Pj.


From hlapp at drycafe.net  Thu Oct 14 14:47:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Thu, 14 Oct 2010 09:47:19 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101014070012.GA7296@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	<AANLkTi=ohMNA-UTjoSV0urYFY5meJVz9=uNfSjbcFwv-@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610026E5DAE@FS1.dnastar.com>
	<A60950CB-2553-4D49-A9A7-2F8AEA981A7D@drycafe.net>
	<20101014070012.GA7296@thebird.nl>
Message-ID: <FB150474-3E31-4795-BC3E-1D4B2A12EB4C@drycafe.net>


On Oct 14, 2010, at 2:00 AM, Pjotr Prins wrote:

> I know people who make it their business to pursue companies that do
> not honour OSS licenses. The companies always have to retrack.


Of course. That's a legal issue. Attribution on publications, or what  
someone publishes on reusing your stuff, is not a legal issue.

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri Oct 15 11:53:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 Oct 2010 12:53:13 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID: <AANLkTimigiSLuDnXyV1EdaPeasvPu2=1Dned71pAbT7h@mail.gmail.com>

On Wed, Oct 13, 2010 at 12:15 PM, Pjotr Prins <pjotr.public23 at thebird.nl> wrote:
> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is again magnitudes faster. I am using
> something like
>
> ?rna = RNATools.createRNA(nucleotides);
> ?aa = RNATools.translate(rna);
>
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module.
>
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others uses plain character Strings.
>
> Any plans for speeding this up on the JVM?
>
> Pj.

Actually (assuming you are not explicitly using strings),
Biopython would also be using objects for each sequence,
which does impose a speed penalty.

Peter


From kurka at mikro.biologie.tu-muenchen.de  Tue Oct 19 11:25:31 2010
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Tue, 19 Oct 2010 13:25:31 +0200
Subject: [Biojava-l] feature request - full query description from blast
	result
Message-ID: <4CBD802B.7030809@mikro.biologie.tu-muenchen.de>

Hi all,

I just read in a blast file and I want to get the full query description.
For example, when I have that query:
Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
         (1208 letters)

I get as query-information locus_tag= CD0002
The rest is truncated.

In the biojava-mailinglist I found the same question
http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html

And Mark suggested to make a request for improvement, but as I see it,
nothing happened. So I would like to ask, if you can change it. Or is it
changed and I don't see it.

Thanks,
Hedwig


From sb.genny at gmail.com  Thu Oct 21 14:28:53 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:28:53 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
References: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
Message-ID: <AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>

Hi

I want to develop phylogenetics application in biojava..but need help to do
that..Kindly help me in developing some applications..

Thanks in anticipation

Regards,
Sobia Idrees


On Tue, Oct 19, 2010 at 9:00 PM, <biojava-l-request at lists.open-bio.org>wrote:

> Send Biojava-l mailing list submissions to
>        biojava-l at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://lists.open-bio.org/mailman/listinfo/biojava-l
> or, via email, send a message with subject or body 'help' to
>        biojava-l-request at lists.open-bio.org
>
> You can reach the person managing the list at
>        biojava-l-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Biojava-l digest..."
>
>
> Today's Topics:
>
>   1. feature request - full query description from blast       result
>      (Hedwig Kurka)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 19 Oct 2010 13:25:31 +0200
> From: Hedwig Kurka <kurka at mikro.biologie.tu-muenchen.de>
> Subject: [Biojava-l] feature request - full query description from
>        blast   result
> To: biojava-l at lists.open-bio.org
> Message-ID: <4CBD802B.7030809 at mikro.biologie.tu-muenchen.de>
> Content-Type: text/plain; charset=ISO-8859-15
>
> Hi all,
>
> I just read in a blast file and I want to get the full query description.
> For example, when I have that query:
> Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
> III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
> ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
>         (1208 letters)
>
> I get as query-information locus_tag= CD0002
> The rest is truncated.
>
> In the biojava-mailinglist I found the same question
> http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html
>
> And Mark suggested to make a request for improvement, but as I see it,
> nothing happened. So I would like to ask, if you can change it. Or is it
> changed and I don't see it.
>
> Thanks,
> Hedwig
>
>
> ------------------------------
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
> End of Biojava-l Digest, Vol 93, Issue 9
> ****************************************
>


From sb.genny at gmail.com  Thu Oct 21 14:30:35 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:30:35 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: <AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>
References: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
	<AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>
Message-ID: <AANLkTi=x2mrAFoYMf2v1iE=HN8fkV=0R2wS4tHrVLsRn@mail.gmail.com>

Hi

I have developed some web based and desktop based applications using
biojava..Can it be published in Biojava journal?

Thanks,
Sobia Idrees

On Thu, Oct 21, 2010 at 7:28 PM, sobia idrees <sb.genny at gmail.com> wrote:

> Hi
>
> I want to develop phylogenetics application in biojava..but need help to do
> that..Kindly help me in developing some applications..
>
> Thanks in anticipation
>
> Regards,
> Sobia Idrees
>
>
> On Tue, Oct 19, 2010 at 9:00 PM, <biojava-l-request at lists.open-bio.org>wrote:
>
>> Send Biojava-l mailing list submissions to
>>        biojava-l at lists.open-bio.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>        http://lists.open-bio.org/mailman/listinfo/biojava-l
>> or, via email, send a message with subject or body 'help' to
>>        biojava-l-request at lists.open-bio.org
>>
>> You can reach the person managing the list at
>>        biojava-l-owner at lists.open-bio.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of Biojava-l digest..."
>>
>>
>> Today's Topics:
>>
>>   1. feature request - full query description from blast       result
>>      (Hedwig Kurka)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 19 Oct 2010 13:25:31 +0200
>> From: Hedwig Kurka <kurka at mikro.biologie.tu-muenchen.de>
>> Subject: [Biojava-l] feature request - full query description from
>>        blast   result
>> To: biojava-l at lists.open-bio.org
>> Message-ID: <4CBD802B.7030809 at mikro.biologie.tu-muenchen.de>
>> Content-Type: text/plain; charset=ISO-8859-15
>>
>> Hi all,
>>
>> I just read in a blast file and I want to get the full query description.
>> For example, when I have that query:
>> Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
>> III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
>> ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
>>         (1208 letters)
>>
>> I get as query-information locus_tag= CD0002
>> The rest is truncated.
>>
>> In the biojava-mailinglist I found the same question
>> http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html
>>
>> And Mark suggested to make a request for improvement, but as I see it,
>> nothing happened. So I would like to ask, if you can change it. Or is it
>> changed and I don't see it.
>>
>> Thanks,
>> Hedwig
>>
>>
>> ------------------------------
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>>
>> End of Biojava-l Digest, Vol 93, Issue 9
>> ****************************************
>>
>
>


From holland at eaglegenomics.com  Thu Oct 21 14:41:35 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 21 Oct 2010 15:41:35 +0100
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: <AANLkTi=x2mrAFoYMf2v1iE=HN8fkV=0R2wS4tHrVLsRn@mail.gmail.com>
References: <mailman.3.1287504006.17895.biojava-l@lists.open-bio.org>
	<AANLkTi=ApGT5g0LO8w=dnJNbV+6q5r2U55r_oSGH09s7@mail.gmail.com>
	<AANLkTi=x2mrAFoYMf2v1iE=HN8fkV=0R2wS4tHrVLsRn@mail.gmail.com>
Message-ID: <97591963-F741-45C1-8E9D-231A5D05D4DA@eaglegenomics.com>

There is no such thing as a Biojava journal. You would need to submit your paper to one of the main bioinformatics journals.

cheers,
Richard

On 21 Oct 2010, at 15:30, sobia idrees wrote:

> Hi
> 
> I have developed some web based and desktop based applications using
> biojava..Can it be published in Biojava journal?
> 
> Thanks,
> Sobia Idrees
> 
> On Thu, Oct 21, 2010 at 7:28 PM, sobia idrees <sb.genny at gmail.com> wrote:
> 
>> Hi
>> 
>> I want to develop phylogenetics application in biojava..but need help to do
>> that..Kindly help me in developing some applications..
>> 
>> Thanks in anticipation
>> 
>> Regards,
>> Sobia Idrees
>> 
>> 
>> On Tue, Oct 19, 2010 at 9:00 PM, <biojava-l-request at lists.open-bio.org>wrote:
>> 
>>> Send Biojava-l mailing list submissions to
>>>       biojava-l at lists.open-bio.org
>>> 
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>       http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> or, via email, send a message with subject or body 'help' to
>>>       biojava-l-request at lists.open-bio.org
>>> 
>>> You can reach the person managing the list at
>>>       biojava-l-owner at lists.open-bio.org
>>> 
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of Biojava-l digest..."
>>> 
>>> 
>>> Today's Topics:
>>> 
>>>  1. feature request - full query description from blast       result
>>>     (Hedwig Kurka)
>>> 
>>> 
>>> ----------------------------------------------------------------------
>>> 
>>> Message: 1
>>> Date: Tue, 19 Oct 2010 13:25:31 +0200
>>> From: Hedwig Kurka <kurka at mikro.biologie.tu-muenchen.de>
>>> Subject: [Biojava-l] feature request - full query description from
>>>       blast   result
>>> To: biojava-l at lists.open-bio.org
>>> Message-ID: <4CBD802B.7030809 at mikro.biologie.tu-muenchen.de>
>>> Content-Type: text/plain; charset=ISO-8859-15
>>> 
>>> Hi all,
>>> 
>>> I just read in a blast file and I want to get the full query description.
>>> For example, when I have that query:
>>> Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
>>> III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
>>> ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
>>>        (1208 letters)
>>> 
>>> I get as query-information locus_tag= CD0002
>>> The rest is truncated.
>>> 
>>> In the biojava-mailinglist I found the same question
>>> http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html
>>> 
>>> And Mark suggested to make a request for improvement, but as I see it,
>>> nothing happened. So I would like to ask, if you can change it. Or is it
>>> changed and I don't see it.
>>> 
>>> Thanks,
>>> Hedwig
>>> 
>>> 
>>> ------------------------------
>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> 
>>> End of Biojava-l Digest, Vol 93, Issue 9
>>> ****************************************
>>> 
>> 
>> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From jc.lucky at laposte.net  Fri Oct 22 08:11:43 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Fri, 22 Oct 2010 10:11:43 +0200 (CEST)
Subject: [Biojava-l] Retrieve Information from GenBank file
Message-ID: <31170592.35650.1287735103724.JavaMail.www@wwinf8210>


Hi

I'm trying to convert a GenBank file into a rdf file. The gene of interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945

With the below code I can read the GenBank file and I manage to retrieve information and convert them in a rdf format. However I don't succeed in retrieving some information such as Title, protein or product. According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is possible to do so. 
Please help me find what I do wrong or what should be done to achieve my goal.

//read the GeneBank File
public static RichSequenceIterator readFile(String input,
RichSequenceBuilderFactory seqFactory,
Namespace ns)
throws IOException, NoSuchElementException, BioException
{
ns = null;
InputStream stream = new FileInputStream(input);
BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile,ns); 
return seqs;
}

//Retrieve information and convert them in rdf format
public void writeToRDFFile(RichSequenceIterator rsi, String output)
throws IOException, NoSuchElementException, BioException {
//create model for the ontology
OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
OntClass parents;
String URI = "http://pbr.wur.nl/#";

while(rsi.hasNext())
{
RichSequence seq = rsi.nextRichSequence();
String id = seq.getName(); 
parents = model.createClass(URI + id);
Set author = seq.getRankedDocRefs();//code to clean up Set&convert toString
String definition = seq.getDescription(); //code to clean up String
//Add to model
parents.addProperty(DC.description, definition);
parents.addProperty(DC.publisher, authors);
parents.addComment(taxonomy, "EN");
parents.addProperty(DC.type, organism);
//print in rdf format
model.write(out, "RDF/XML");
out.close(); }
}


Thanks,
Jean-Charles Ferri?res

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From andreas at sdsc.edu  Fri Oct 22 19:56:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 22 Oct 2010 12:56:49 -0700
Subject: [Biojava-l] 3.0-alpha2
Message-ID: <AANLkTimUbGR+LpMj-f8wVctMz_hv4E-5DZw542xNtMeh@mail.gmail.com>

Hi,

In preparation for the upcoming biojava 3 release, 3.0-alpha2  has
just been released on http://biojava.org/download/maven/

Andreas


From cfriedline at vcu.edu  Sun Oct 24 14:38:46 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:38:46 -0400
Subject: [Biojava-l] Test Message
Message-ID: <AANLkTinVV_ZhkATN1sXn=UnS0A39RcjV-nEqdSDfVvpq@mail.gmail.com>

Per Andreas, this is a test.

Chris

-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From cfriedline at vcu.edu  Sun Oct 24 14:57:48 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:57:48 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
Message-ID: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>

Hello,

I am getting a weird problem with protein alignment using
NeedlemanWunsch in 1.7.1, in that the alignment does not span the
entire length of the proteins.  I've verified that this should not
happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
I'm reluctant to switch to BioJava3 at this time, since performance is
about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
350,000 of them.

An example of this alignment error, is shown here: http://pastebin.com/mdX516R6

Notice that the alignment stops 1 amino acid short of the end in both
cases.  The parameters for the alignment are: BLOSUM50, gapOpen=10,
gapExtend=2.

Thanks,
Chris

-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From andreas.draeger at uni-tuebingen.de  Sun Oct 24 16:01:05 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Draeger)
Date: Sun, 24 Oct 2010 18:01:05 +0200
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
Message-ID: <4CC45841.5080604@uni-tuebingen.de>

Hi Chris,

Thank you for reprorting this problem. It would be very nice if you
could also provide your source code. Then I would like to test what
happens. You can send source code, substitution matrix, and the two
example protein sequences that cause the problems directly to me. I'll
then have a look into it.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dr?ger
Eberhard Karls University T?bingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 T?bingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091


From cfriedline at vcu.edu  Sun Oct 24 18:04:25 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 14:04:25 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <4CC45841.5080604@uni-tuebingen.de>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
	<4CC45841.5080604@uni-tuebingen.de>
Message-ID: <AANLkTimdTxvpTuJC1j8Mfk9W1bXs3S8+5xw5UjNvzQ4Q@mail.gmail.com>

Thanks, Andreas.  I've sent you the information that you asked for below.

Chris

On Sun, Oct 24, 2010 at 12:01 PM, Andreas Draeger
<andreas.draeger at uni-tuebingen.de> wrote:
> Hi Chris,
>
> Thank you for reprorting this problem. It would be very nice if you
> could also provide your source code. Then I would like to test what
> happens. You can send source code, substitution matrix, and the two
> example protein sequences that cause the problems directly to me. I'll
> then have a look into it.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dr?ger
> Eberhard Karls University T?bingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 T?bingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: ? +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From koen.bruynseels at cropdesign.com  Mon Oct 25 16:15:59 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Mon, 25 Oct 2010 18:15:59 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID: <OFA688DB7B.E88D2042-ONC12577C7.00595AE1-C12577C7.00595AE1@basf-c-s.be>


I will be out of the office starting  10/25/2010 and will not return until
11/02/2010.

I will respond to your message when I return.


From andreas at sdsc.edu  Tue Oct 26 18:42:29 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 11:42:29 -0700
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
Message-ID: <AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>

Hi Chris,

about your comment that the biojava3-alignment is slower than the 1.7
one: Do you have any data if this is coming from the io or is the
actual alignment calculation slower?

Andreas

On Sun, Oct 24, 2010 at 7:57 AM, Chris Friedline <cfriedline at vcu.edu> wrote:
> Hello,
>
> I am getting a weird problem with protein alignment using
> NeedlemanWunsch in 1.7.1, in that the alignment does not span the
> entire length of the proteins. ?I've verified that this should not
> happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
> I'm reluctant to switch to BioJava3 at this time, since performance is
> about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
> 350,000 of them.
>
> An example of this alignment error, is shown here: http://pastebin.com/mdX516R6
>
> Notice that the alignment stops 1 amino acid short of the end in both
> cases. ?The parameters for the alignment are: BLOSUM50, gapOpen=10,
> gapExtend=2.
>
> Thanks,
> Chris
>
> --
> PhD Candidate, Integrative Life Sciences
> Virginia Commonwealth University
> Richmond, VA
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From cfriedline at vcu.edu  Tue Oct 26 19:21:39 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:21:39 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
	<AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>
Message-ID: <AANLkTinhUCwcdsOtJg5jdG-wwqiCx9oevSqfzLi_96de@mail.gmail.com>

Hi Andreas,

The io should be the same, since I've used the same set of genes for testing
both.  So, I'm guessing it's either the alignment calculation or the new
biojava design contributing to the slowness.

Chris

On Tue, Oct 26, 2010 at 2:42 PM, Andreas Prlic <andreas at sdsc.edu> wrote:

> Hi Chris,
>
> about your comment that the biojava3-alignment is slower than the 1.7
> one: Do you have any data if this is coming from the io or is the
> actual alignment calculation slower?
>
> Andreas
>
> On Sun, Oct 24, 2010 at 7:57 AM, Chris Friedline <cfriedline at vcu.edu>
> wrote:
> > Hello,
> >
> > I am getting a weird problem with protein alignment using
> > NeedlemanWunsch in 1.7.1, in that the alignment does not span the
> > entire length of the proteins.  I've verified that this should not
> > happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
> > I'm reluctant to switch to BioJava3 at this time, since performance is
> > about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
> > 350,000 of them.
> >
> > An example of this alignment error, is shown here:
> http://pastebin.com/mdX516R6
> >
> > Notice that the alignment stops 1 amino acid short of the end in both
> > cases.  The parameters for the alignment are: BLOSUM50, gapOpen=10,
> > gapExtend=2.
> >
> > Thanks,
> > Chris
> >
> > --
> > PhD Candidate, Integrative Life Sciences
> > Virginia Commonwealth University
> > Richmond, VA
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


-- 
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From cfriedline at vcu.edu  Tue Oct 26 19:29:30 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:29:30 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTi=4nF1_t6z=zot2xnRWR_9KRxBeTuyfk_PXD0P_@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>
	<AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>
	<AANLkTinJ01kX5wBK-wno=rLOMjRpHeYrF4NEi+5jCPBz@mail.gmail.com>
	<AANLkTi=4nF1_t6z=zot2xnRWR_9KRxBeTuyfk_PXD0P_@mail.gmail.com>
Message-ID: <AANLkTikBy2Ub3e2OWsc1+9hHca+bRJ859XWZUDNUsS69@mail.gmail.com>

That's something I'll need to go back and revisit after my deadline
passes at the end of this week. Initially, I was creating them on the
fly at the time of alignment, but it would be more efficient to store
them that way in the gene object itself. ?I was also passing an
InputStreamReader for the substitution matrix each time (pulling the
matrix from my jar), but storing it as a string would also be a better
option, especially since I'm threading and there are so many
alignments.

Chris

On Tue, Oct 26, 2010 at 3:23 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>
> ok, how do you create the biojava3 Sequence objects? just trying to
> find out where the bottlenecks are, so we can fix them...
>
> A
>
> On Tue, Oct 26, 2010 at 12:20 PM, Chris Friedline <cfriedline at vcu.edu> wrote:
> > Hi,
> > The io should be the same, since I've used the same set of genes for testing
> > both. ?So, it's either the alignment calculation or the new biojava design
> > contributing to the slowness.
> > Chris
> >
> > On Tue, Oct 26, 2010 at 2:42 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>
> >> Hi Chris,
> >>
> >> about your comment that the biojava3-alignment is slower than the 1.7
> >> one: Do you have any data if this is coming from the io or is the
> >> actual alignment calculation slower?
> >>
> >> Andreas
> >>
> >> On Sun, Oct 24, 2010 at 7:57 AM, Chris Friedline <cfriedline at vcu.edu>
> >> wrote:
> >> > Hello,
> >> >
> >> > I am getting a weird problem with protein alignment using
> >> > NeedlemanWunsch in 1.7.1, in that the alignment does not span the
> >> > entire length of the proteins. ?I've verified that this should not
> >> > happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI.
> >> > I'm reluctant to switch to BioJava3 at this time, since performance is
> >> > about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
> >> > 350,000 of them.
> >> >
> >> > An example of this alignment error, is shown here:
> >> > http://pastebin.com/mdX516R6
> >> >
> >> > Notice that the alignment stops 1 amino acid short of the end in both
> >> > cases. ?The parameters for the alignment are: BLOSUM50, gapOpen=10,
> >> > gapExtend=2.
> >> >
> >> > Thanks,
> >> > Chris
> >> >
> >> > --
> >> > PhD Candidate, Integrative Life Sciences
> >> > Virginia Commonwealth University
> >> > Richmond, VA
> >> > _______________________________________________
> >> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> >> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >> >
> >>
> >>
> >>
> >> --
> >> -----------------------------------------------------------------------
> >> Dr. Andreas Prlic
> >> Senior Scientist, RCSB PDB Protein Data Bank
> >> University of California, San Diego
> >> (+1) 858.246.0526
> >> -----------------------------------------------------------------------
> >
> >
> >
> > --
> > PhD Candidate, Integrative Life Sciences
> > Virginia Commonwealth University
> > Richmond, VA
> >
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------


--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA


From andreas.draeger at uni-tuebingen.de  Tue Oct 26 22:18:00 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Tue, 26 Oct 2010 23:18:00 +0100
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <AANLkTikBy2Ub3e2OWsc1+9hHca+bRJ859XWZUDNUsS69@mail.gmail.com>
References: <AANLkTimwkMk=PfL-caqQ1zFXHMaKUADEr=vEXNRYGE1r@mail.gmail.com>	<AANLkTim4kLMYWK8504wDZG2Lk8MBOTEU6TNDRJU2cyLt@mail.gmail.com>	<AANLkTinJ01kX5wBK-wno=rLOMjRpHeYrF4NEi+5jCPBz@mail.gmail.com>	<AANLkTi=4nF1_t6z=zot2xnRWR_9KRxBeTuyfk_PXD0P_@mail.gmail.com>
	<AANLkTikBy2Ub3e2OWsc1+9hHca+bRJ859XWZUDNUsS69@mail.gmail.com>
Message-ID: <4CC75398.7000301@uni-tuebingen.de>

Hi all,

By the way, I would like to mention that the bug has been fixed. It was 
a problem with the way how the alignment was presented to the user 
afterwards, i.e., a problem of the formatting algorithm. The alignment 
itself was correct and also when obtaining the GappedSequences after the 
alignment, these were correct. The problem was that the formatter was 
started with the original lenght of the sequences, which is usually to 
short after inserting gaps. This is now solved and the alignment should 
work fine now.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dr?ger
Eberhard Karls University T?bingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 T?bingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091


From dasarnow at gmail.com  Wed Oct 27 03:54:43 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 20:54:43 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
Message-ID: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>

Hi all,
Let me first say thanks to all the BioJava community members for
delivering such a useful set of libraries, and that I'm still a newbie
when it comes to BioJava (and Java) so forgive me if my question is
too trivial.

I am doing work on lots (at least thousands) of PDB files from RCSB.
As is commonly known, these are often rife with errors which can lead
to exceptions during parsing with PDBFileParser.  Because
PDBFileParser's methods contain their own try-catch blocks, exception
propagation stops there and my code proceeds blindly along regardless
of any error checking I do.  I would like to catch the exceptions up
in my code where the parser is called, so that I can branch to a
continue statement and have my batch processing loops move on to the
next file.
Should I edit out the try-catch blocks and compile my own version of
the library?  Or should I test the returned StructureImpl objects for
possession of the fields in question?  In that case, I'm not sure
which properties will give the most general success information...and
I'd rather not have to check for /every/ property being correct.

If there is some great way to check if an exception was caught down a
series of nested method calls, please hit me over the head with it.

Thanks!

-da


From andreas at sdsc.edu  Wed Oct 27 04:11:28 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 21:11:28 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
Message-ID: <AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>

Hi Daniel,

can you explain a bit more what you are doing, in particular what
errors you would like to deal with on your end?  You should not need
to worry too much about exception handling. Are there any special
cases you are interested in?  In this case we should support you with
a clean interface rather than exception handling from your end...

Andreas


On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Hi all,
> Let me first say thanks to all the BioJava community members for
> delivering such a useful set of libraries, and that I'm still a newbie
> when it comes to BioJava (and Java) so forgive me if my question is
> too trivial.
>
> I am doing work on lots (at least thousands) of PDB files from RCSB.
> As is commonly known, these are often rife with errors which can lead
> to exceptions during parsing with PDBFileParser. ?Because
> PDBFileParser's methods contain their own try-catch blocks, exception
> propagation stops there and my code proceeds blindly along regardless
> of any error checking I do. ?I would like to catch the exceptions up
> in my code where the parser is called, so that I can branch to a
> continue statement and have my batch processing loops move on to the
> next file.
> Should I edit out the try-catch blocks and compile my own version of
> the library? ?Or should I test the returned StructureImpl objects for
> possession of the fields in question? ?In that case, I'm not sure
> which properties will give the most general success information...and
> I'd rather not have to check for /every/ property being correct.
>
> If there is some great way to check if an exception was caught down a
> series of nested method calls, please hit me over the head with it.
>
> Thanks!
>
> -da
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From dasarnow at gmail.com  Wed Oct 27 04:59:56 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 21:59:56 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
Message-ID: <AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>

Glad to hear it, who doesn't like support or clean interfaces?.  No
offense intended, by the way, with respect to PDB errors - obviously
the PDB is an indispensable resource for all protein scientists.

I am looking at many (fixed-length) pieces of protein chains and doin'
stuff with 'em.  My current code has a pair of nested while loops; the
outer iterates over PDB entries (locally rsync'd copy), parsing them
and the inner iterates over the pieces from each.  When
StructureExceptions come out of my PDBFileReader object I want to
continue the outer loop, moving on to the next set of files without
executing any of the code that depends on correct StructureImpl
objects from the reader (database updates, the inner loop).
Since the reader's methods have their own try-catch blocks, a thrown
StructureException is stopped there and never reaches my own error
handling.  I just need to know when those errors occur so I can skip
those proteins - I am presuming that the correct entries will outweigh
the problem ones by a significant factor and the overall data wont be
seriously impacted.

-da

On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> can you explain a bit more what you are doing, in particular what
> errors you would like to deal with on your end? ?You should not need
> to worry too much about exception handling. Are there any special
> cases you are interested in? ?In this case we should support you with
> a clean interface rather than exception handling from your end...
>
> Andreas
>
>
>
> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> Hi all,
>> Let me first say thanks to all the BioJava community members for
>> delivering such a useful set of libraries, and that I'm still a newbie
>> when it comes to BioJava (and Java) so forgive me if my question is
>> too trivial.
>>
>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>> As is commonly known, these are often rife with errors which can lead
>> to exceptions during parsing with PDBFileParser. ?Because
>> PDBFileParser's methods contain their own try-catch blocks, exception
>> propagation stops there and my code proceeds blindly along regardless
>> of any error checking I do. ?I would like to catch the exceptions up
>> in my code where the parser is called, so that I can branch to a
>> continue statement and have my batch processing loops move on to the
>> next file.
>> Should I edit out the try-catch blocks and compile my own version of
>> the library? ?Or should I test the returned StructureImpl objects for
>> possession of the fields in question? ?In that case, I'm not sure
>> which properties will give the most general success information...and
>> I'd rather not have to check for /every/ property being correct.
>>
>> If there is some great way to check if an exception was caught down a
>> series of nested method calls, please hit me over the head with it.
>>
>> Thanks!
>>
>> -da
>> _______________________________________________
>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From dasarnow at gmail.com  Wed Oct 27 05:03:59 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 22:03:59 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
Message-ID: <AANLkTi=56JhQc=Joddg2i1foNW4sw9r-QTHHTEF_p133@mail.gmail.com>

I think that would be perfect...and of course I'm happy perform
testing on whatever gets cooked up.

-da

2010/10/26 Amr Al-Hossary <amr_alhossary at hotmail.com>:
> We can?add some thing like an exception tracing queue, that can be?checked
> for later by the caller.
>
> would that be OK?
>
> Amr
>
>> Date: Tue, 26 Oct 2010 21:11:28 -0700
>> From: andreas at sdsc.edu
>> To: dasarnow at gmail.com
>> CC: biojava-l at lists.open-bio.org
>> Subject: Re: [Biojava-l] Bad PDB files and batch processing with
>> PDBFileReader
>>
>> Hi Daniel,
>>
>> can you explain a bit more what you are doing, in particular what
>> errors you would like to deal with on your end? You should not need
>> to worry too much about exception handling. Are there any special
>> cases you are interested in? In this case we should support you with
>> a clean interface rather than exception handling from your end...
>>
>> Andreas
>>
>>
>>
>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com>
>> wrote:
>> > Hi all,
>> > Let me first say thanks to all the BioJava community members for
>> > delivering such a useful set of libraries, and that I'm still a newbie
>> > when it comes to BioJava (and Java) so forgive me if my question is
>> > too trivial.
>> >
>> > I am doing work on lots (at least thousands) of PDB files from RCSB.
>> > As is commonly known, these are often rife with errors which can lead
>> > to exceptions during parsing with PDBFileParser. ?Because
>> > PDBFileParser's methods contain their own try-catch blocks, exception
>> > propagation stops there and my code proceeds blindly along regardless
>> > of any error checking I do. ?I would like to catch the exceptions up
>> > in my code where the parser is called, so that I can branch to a
>> > continue statement and have my batch processing loops move on to the
>> > next file.
>> > Should I edit out the try-catch blocks and compile my own version of
>> > the library? ?Or should I test the returned StructureImpl objects for
>> > possession of the fields in question? ?In that case, I'm not sure
>> > which properties will give the most general success information...and
>> > I'd rather not have to check for /every/ property being correct.
>> >
>> > If there is some great way to check if an exception was caught down a
>> > series of nested method calls, please hit me over the head with it.
>> >
>> > Thanks!
>> >
>> > -da
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From andreas at sdsc.edu  Wed Oct 27 05:19:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 22:19:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
Message-ID: <AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>

Hi Daniel,

PDB files are better nowadays, due to remediation, however there are
still issues..

it sounds like you just want to figure out how to do the try/catch
block properly. You could do something like that:
		
		boolean splitFileOrganisation = true;
		AtomCache cache = new
AtomCache("/path/to/your/installation/",splitFileOrganisation);
		
		String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
		
		for (String pdbID : pdbIDs){
			
			try {
				Structure s = cache.getStructure(pdbID);
				if ( s == null) {
					System.out.println("could not find structure " + pdbID);
					continue;
				}
				// do something with the structure - your inner loop
				System.out.println(s);
				
			} catch (Exception e){
				// something crazy happened...
				System.err.println("Can't load structure " + pdbID + " reason: " +
e.getMessage());
				e.printStackTrace();
			}
		}
		

On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Glad to hear it, who doesn't like support or clean interfaces?. ?No
> offense intended, by the way, with respect to PDB errors - obviously
> the PDB is an indispensable resource for all protein scientists.
>
> I am looking at many (fixed-length) pieces of protein chains and doin'
> stuff with 'em. ?My current code has a pair of nested while loops; the
> outer iterates over PDB entries (locally rsync'd copy), parsing them
> and the inner iterates over the pieces from each. ?When
> StructureExceptions come out of my PDBFileReader object I want to
> continue the outer loop, moving on to the next set of files without
> executing any of the code that depends on correct StructureImpl
> objects from the reader (database updates, the inner loop).
> Since the reader's methods have their own try-catch blocks, a thrown
> StructureException is stopped there and never reaches my own error
> handling. ?I just need to know when those errors occur so I can skip
> those proteins - I am presuming that the correct entries will outweigh
> the problem ones by a significant factor and the overall data wont be
> seriously impacted.
>
> -da
>
> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> can you explain a bit more what you are doing, in particular what
>> errors you would like to deal with on your end? ?You should not need
>> to worry too much about exception handling. Are there any special
>> cases you are interested in? ?In this case we should support you with
>> a clean interface rather than exception handling from your end...
>>
>> Andreas
>>
>>
>>
>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> Hi all,
>>> Let me first say thanks to all the BioJava community members for
>>> delivering such a useful set of libraries, and that I'm still a newbie
>>> when it comes to BioJava (and Java) so forgive me if my question is
>>> too trivial.
>>>
>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>> As is commonly known, these are often rife with errors which can lead
>>> to exceptions during parsing with PDBFileParser. ?Because
>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>> propagation stops there and my code proceeds blindly along regardless
>>> of any error checking I do. ?I would like to catch the exceptions up
>>> in my code where the parser is called, so that I can branch to a
>>> continue statement and have my batch processing loops move on to the
>>> next file.
>>> Should I edit out the try-catch blocks and compile my own version of
>>> the library? ?Or should I test the returned StructureImpl objects for
>>> possession of the fields in question? ?In that case, I'm not sure
>>> which properties will give the most general success information...and
>>> I'd rather not have to check for /every/ property being correct.
>>>
>>> If there is some great way to check if an exception was caught down a
>>> series of nested method calls, please hit me over the head with it.
>>>
>>> Thanks!
>>>
>>> -da
>>> _______________________________________________
>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>>


From andreas at sdsc.edu  Wed Oct 27 06:01:38 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 23:01:38 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<BLU150-w3435EF8891DE863CBCC8278E430@phx.gbl>
Message-ID: <AANLkTin_ucD4u_CgyMwbDX9+mFTVYksjSzZVtUAuefvp@mail.gmail.com>

Hi Amr,

2010/10/26 Amr Al-Hossary <amr_alhossary at hotmail.com>:
> We can?add some thing like an exception tracing queue, that can be?checked
> for later by the caller.

thanks for your suggestion. In terms of API I would prefer if we can
separare a user from inconsistencies in the files and I hope we won't
need such a queue...  If something is off, the code is written to
ignore or work around issues...

Abdreas


> would that be OK?
>
> Amr
>
>> Date: Tue, 26 Oct 2010 21:11:28 -0700
>> From: andreas at sdsc.edu
>> To: dasarnow at gmail.com
>> CC: biojava-l at lists.open-bio.org
>> Subject: Re: [Biojava-l] Bad PDB files and batch processing with
>> PDBFileReader
>>
>> Hi Daniel,
>>
>> can you explain a bit more what you are doing, in particular what
>> errors you would like to deal with on your end? You should not need
>> to worry too much about exception handling. Are there any special
>> cases you are interested in? In this case we should support you with
>> a clean interface rather than exception handling from your end...
>>
>> Andreas
>>
>>
>>
>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com>
>> wrote:
>> > Hi all,
>> > Let me first say thanks to all the BioJava community members for
>> > delivering such a useful set of libraries, and that I'm still a newbie
>> > when i! t comes to BioJava (and Java) so forgive me if my question is
>> > too trivial.
>> >
>> > I am doing work on lots (at least thousands) of PDB files from RCSB.
>> > As is commonly known, these are often rife with errors which can lead
>> > to exceptions during parsing with PDBFileParser. ?Because
>> > PDBFileParser's methods contain their own try-catch blocks, exception
>> > propagation stops there and my code proceeds blindly along regardless
>> > of any error checking I do. ?I would like to catch the exceptions up
>> > in my code where the parser is called, so that I can branch to a
>> > continue statement and have my batch processing loops move on to the
>> > next file.
>> > Should I edit out the try-catch blocks and compile my own version of
>> > the library? ?Or should I test the returned StructureImpl objects for
>> > possession of the fields i! n question? ?In that case, I'm not sure
>> > which proper ties will give the most general success information...and
>> > I'd rather not have to check for /every/ property being correct.
>> >
>> > If there is some great way to check if an exception was caught down a
>> > series of nested method calls, please hit me over the head with it.
>> >
>> > Thanks!
>> >
>> > -da
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>>
>>
>>


From dasarnow at gmail.com  Wed Oct 27 07:26:22 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 00:26:22 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
Message-ID: <AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>

I assume AtomCache is a new class in BioJava3?

I must give you my embarrassed apology...after a bunch of testing I
finally figured out that I had misunderstood where the Parser's error
handling returns control and started going after the wrong exceptions.
 It does looks like if setParseCAOnly is true, the reader excepts on
chains with no CA's instead of just skipping them, though the other
chains are still parsed into the structure.

-da

On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> PDB files are better nowadays, due to remediation, however there are
> still issues..
>
> it sounds like you just want to figure out how to do the try/catch
> block properly. You could do something like that:
>
> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
> ? ? ? ? ? ? ? ?AtomCache cache = new
> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>
> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>
> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>
> ? ? ? ? ? ? ? ? ? ? ? ?try {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>
> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
> e.getMessage());
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
> ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ?}
>
>
>
>
> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>> offense intended, by the way, with respect to PDB errors - obviously
>> the PDB is an indispensable resource for all protein scientists.
>>
>> I am looking at many (fixed-length) pieces of protein chains and doin'
>> stuff with 'em. ?My current code has a pair of nested while loops; the
>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>> and the inner iterates over the pieces from each. ?When
>> StructureExceptions come out of my PDBFileReader object I want to
>> continue the outer loop, moving on to the next set of files without
>> executing any of the code that depends on correct StructureImpl
>> objects from the reader (database updates, the inner loop).
>> Since the reader's methods have their own try-catch blocks, a thrown
>> StructureException is stopped there and never reaches my own error
>> handling. ?I just need to know when those errors occur so I can skip
>> those proteins - I am presuming that the correct entries will outweigh
>> the problem ones by a significant factor and the overall data wont be
>> seriously impacted.
>>
>> -da
>>
>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> can you explain a bit more what you are doing, in particular what
>>> errors you would like to deal with on your end? ?You should not need
>>> to worry too much about exception handling. Are there any special
>>> cases you are interested in? ?In this case we should support you with
>>> a clean interface rather than exception handling from your end...
>>>
>>> Andreas
>>>
>>>
>>>
>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> Hi all,
>>>> Let me first say thanks to all the BioJava community members for
>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>> too trivial.
>>>>
>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>> As is commonly known, these are often rife with errors which can lead
>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>> propagation stops there and my code proceeds blindly along regardless
>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>> in my code where the parser is called, so that I can branch to a
>>>> continue statement and have my batch processing loops move on to the
>>>> next file.
>>>> Should I edit out the try-catch blocks and compile my own version of
>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>> possession of the fields in question? ?In that case, I'm not sure
>>>> which properties will give the most general success information...and
>>>> I'd rather not have to check for /every/ property being correct.
>>>>
>>>> If there is some great way to check if an exception was caught down a
>>>> series of nested method calls, please hit me over the head with it.
>>>>
>>>> Thanks!
>>>>
>>>> -da
>>>> _______________________________________________
>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>
>>>
>


From jc.lucky at laposte.net  Wed Oct 27 08:11:13 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 10:11:13 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
Message-ID: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>


I tried once again with the new version of BioJava but without succeding. Any idea or suggestion?

Thanks in advance
Regards,

Jean-Charles Ferri?res


> Message du 22/10/10 10:11
> De : "jc.lucky" 
> A : biojava-l at lists.open-bio.org
> Copie ? : 
> Objet : [Biojava-l] Retrieve Information from GenBank file
>
> 
> Hi
> 
> I'm trying to convert a GenBank file into a rdf file. The gene of interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> 
> With the below code I can read the GenBank file and I manage to retrieve information and convert them in a rdf format. However I don't succeed in retrieving some information such as Title, protein or product. According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is possible to do so. 
> Please help me find what I do wrong or what should be done to achieve my goal.
> 
> //read the GeneBank File
> public static RichSequenceIterator readFile(String input,
> RichSequenceBuilderFactory seqFactory,
> Namespace ns)
> throws IOException, NoSuchElementException, BioException
> {
> ns = null;
> InputStream stream = new FileInputStream(input);
> BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
> RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile,ns); 
> return seqs;
> }
> 
> //Retrieve information and convert them in rdf format
> public void writeToRDFFile(RichSequenceIterator rsi, String output)
> throws IOException, NoSuchElementException, BioException {
> //create model for the ontology
> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
> OntClass parents;
> String URI = "http://pbr.wur.nl/#";
> 
> while(rsi.hasNext())
> {
> RichSequence seq = rsi.nextRichSequence();
> String id = seq.getName(); 
> parents = model.createClass(URI + id);
> Set author = seq.getRankedDocRefs();//code to clean up Set&convert toString
> String definition = seq.getDescription(); //code to clean up String
> //Add to model
> parents.addProperty(DC.description, definition);
> parents.addProperty(DC.publisher, authors);
> parents.addComment(taxonomy, "EN");
> parents.addProperty(DC.type, organism);
> //print in rdf format
> model.write(out, "RDF/XML");
> out.close(); }
> }
> 
> 
> Thanks,
> Jean-Charles Ferri?res
_____________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From willishf at ufl.edu  Wed Oct 27 10:41:06 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 27 Oct 2010 06:41:06 -0400
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
Message-ID: <AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>

Jean-Charles

I have it on my list to do a GenBank parser but haven't had the time. I
can't promise anything in the next couple weeks. Can you send some details
about what a typical use case is for your purpose? Are you trying to get the
sequence data or are you more interested in the features?

Thanks

Scooter

On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky <jc.lucky at laposte.net> wrote:

>
> I tried once again with the new version of BioJava but without succeding.
> Any idea or suggestion?
>
> Thanks in advance
> Regards,
>
> Jean-Charles Ferri?res
>
>
> > Message du 22/10/10 10:11
> > De : "jc.lucky"
> > A : biojava-l at lists.open-bio.org
> > Copie ? :
> > Objet : [Biojava-l] Retrieve Information from GenBank file
> >
> >
> > Hi
> >
> > I'm trying to convert a GenBank file into a rdf file. The gene of
> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> >
> > With the below code I can read the GenBank file and I manage to retrieve
> information and convert them in a rdf format. However I don't succeed in
> retrieving some information such as Title, protein or product. According to
> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
> possible to do so.
> > Please help me find what I do wrong or what should be done to achieve my
> goal.
> >
> > //read the GeneBank File
> > public static RichSequenceIterator readFile(String input,
> > RichSequenceBuilderFactory seqFactory,
> > Namespace ns)
> > throws IOException, NoSuchElementException, BioException
> > {
> > ns = null;
> > InputStream stream = new FileInputStream(input);
> > BufferedReader rdfFile = new BufferedReader(new
> InputStreamReader(stream));
> > RichSequenceIterator seqs =
> RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
> > return seqs;
> > }
> >
> > //Retrieve information and convert them in rdf format
> > public void writeToRDFFile(RichSequenceIterator rsi, String output)
> > throws IOException, NoSuchElementException, BioException {
> > //create model for the ontology
> > OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
> null);
> > OntClass parents;
> > String URI = "http://pbr.wur.nl/#";
> >
> > while(rsi.hasNext())
> > {
> > RichSequence seq = rsi.nextRichSequence();
> > String id = seq.getName();
> > parents = model.createClass(URI + id);
> > Set author = seq.getRankedDocRefs();//code to clean up Set&convert
> toString
> > String definition = seq.getDescription(); //code to clean up String
> > //Add to model
> > parents.addProperty(DC.description, definition);
> > parents.addProperty(DC.publisher, authors);
> > parents.addComment(taxonomy, "EN");
> > parents.addProperty(DC.type, organism);
> > //print in rdf format
> > model.write(out, "RDF/XML");
> > out.close(); }
> > }
> >
> >
> > Thanks,
> > Jean-Charles Ferri?res
> _____________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
> Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous
> tente ?
> Je cr?e ma bo?te mail www.laposte.net
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From jc.lucky at laposte.net  Wed Oct 27 13:03:55 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:03:55 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
	<AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
Message-ID: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>


I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data.
My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future.

Thanks,

Jean-Charles


> Message du 27/10/10 12:41
> De : "Scooter Willis" 
> A : "jc.lucky" 
> Copie ? : "biojava-l lists open-bio org" 
> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>
> Jean-Charles
> 
> I have it on my list to do a GenBank parser but haven't had the time. I
> can't promise anything in the next couple weeks. Can you send some details
> about what a typical use case is for your purpose? Are you trying to get the
> sequence data or are you more interested in the features?
> 
> Thanks
> 
> Scooter
> 
> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky  wrote:
> 
> >
> > I tried once again with the new version of BioJava but without succeding.
> > Any idea or suggestion?
> >
> > Thanks in advance
> > Regards,
> >
> > Jean-Charles Ferri?res
> >
> >
> > > Message du 22/10/10 10:11
> > > De : "jc.lucky"
> > > A : biojava-l at lists.open-bio.org
> > > Copie ? :
> > > Objet : [Biojava-l] Retrieve Information from GenBank file
> > >
> > >
> > > Hi
> > >
> > > I'm trying to convert a GenBank file into a rdf file. The gene of
> > interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> > >
> > > With the below code I can read the GenBank file and I manage to retrieve
> > information and convert them in a rdf format. However I don't succeed in
> > retrieving some information such as Title, protein or product. According to
> > this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
> > possible to do so.
> > > Please help me find what I do wrong or what should be done to achieve my
> > goal.
> > >
> > > //read the GeneBank File
> > > public static RichSequenceIterator readFile(String input,
> > > RichSequenceBuilderFactory seqFactory,
> > > Namespace ns)
> > > throws IOException, NoSuchElementException, BioException
> > > {
> > > ns = null;
> > > InputStream stream = new FileInputStream(input);
> > > BufferedReader rdfFile = new BufferedReader(new
> > InputStreamReader(stream));
> > > RichSequenceIterator seqs =
> > RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
> > > return seqs;
> > > }
> > >
> > > //Retrieve information and convert them in rdf format
> > > public void writeToRDFFile(RichSequenceIterator rsi, String output)
> > > throws IOException, NoSuchElementException, BioException {
> > > //create model for the ontology
> > > OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
> > null);
> > > OntClass parents;
> > > String URI = "http://pbr.wur.nl/#";
> > >
> > > while(rsi.hasNext())
> > > {
> > > RichSequence seq = rsi.nextRichSequence();
> > > String id = seq.getName();
> > > parents = model.createClass(URI + id);
> > > Set author = seq.getRankedDocRefs();//code to clean up Set&convert
> > toString
> > > String definition = seq.getDescription(); //code to clean up String
> > > //Add to model
> > > parents.addProperty(DC.description, definition);
> > > parents.addProperty(DC.publisher, authors);
> > > parents.addComment(taxonomy, "EN");
> > > parents.addProperty(DC.type, organism);
> > > //print in rdf format
> > > model.write(out, "RDF/XML");
> > > out.close(); }
> > > }
> > >
> > >
> > > Thanks,
> > > Jean-Charles Ferri?res
> > _____________________________________________
> > > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From holland at eaglegenomics.com  Wed Oct 27 13:16:56 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 27 Oct 2010 14:16:56 +0100
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
	<AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
	<21411489.155159.1288184635185.JavaMail.www@wwinf8222>
Message-ID: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>

Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().

This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2

cheers,
Richard

On 27 Oct 2010, at 14:03, jc.lucky wrote:

> 
> I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data.
> My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future.
> 
> Thanks,
> 
> Jean-Charles
> 
> 
> 
>> Message du 27/10/10 12:41
>> De : "Scooter Willis" 
>> A : "jc.lucky" 
>> Copie ? : "biojava-l lists open-bio org" 
>> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>> 
>> Jean-Charles
>> 
>> I have it on my list to do a GenBank parser but haven't had the time. I
>> can't promise anything in the next couple weeks. Can you send some details
>> about what a typical use case is for your purpose? Are you trying to get the
>> sequence data or are you more interested in the features?
>> 
>> Thanks
>> 
>> Scooter
>> 
>> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky  wrote:
>> 
>>> 
>>> I tried once again with the new version of BioJava but without succeding.
>>> Any idea or suggestion?
>>> 
>>> Thanks in advance
>>> Regards,
>>> 
>>> Jean-Charles Ferri?res
>>> 
>>> 
>>>> Message du 22/10/10 10:11
>>>> De : "jc.lucky"
>>>> A : biojava-l at lists.open-bio.org
>>>> Copie ? :
>>>> Objet : [Biojava-l] Retrieve Information from GenBank file
>>>> 
>>>> 
>>>> Hi
>>>> 
>>>> I'm trying to convert a GenBank file into a rdf file. The gene of
>>> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
>>>> 
>>>> With the below code I can read the GenBank file and I manage to retrieve
>>> information and convert them in a rdf format. However I don't succeed in
>>> retrieving some information such as Title, protein or product. According to
>>> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
>>> possible to do so.
>>>> Please help me find what I do wrong or what should be done to achieve my
>>> goal.
>>>> 
>>>> //read the GeneBank File
>>>> public static RichSequenceIterator readFile(String input,
>>>> RichSequenceBuilderFactory seqFactory,
>>>> Namespace ns)
>>>> throws IOException, NoSuchElementException, BioException
>>>> {
>>>> ns = null;
>>>> InputStream stream = new FileInputStream(input);
>>>> BufferedReader rdfFile = new BufferedReader(new
>>> InputStreamReader(stream));
>>>> RichSequenceIterator seqs =
>>> RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
>>>> return seqs;
>>>> }
>>>> 
>>>> //Retrieve information and convert them in rdf format
>>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
>>>> throws IOException, NoSuchElementException, BioException {
>>>> //create model for the ontology
>>>> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
>>> null);
>>>> OntClass parents;
>>>> String URI = "http://pbr.wur.nl/#";
>>>> 
>>>> while(rsi.hasNext())
>>>> {
>>>> RichSequence seq = rsi.nextRichSequence();
>>>> String id = seq.getName();
>>>> parents = model.createClass(URI + id);
>>>> Set author = seq.getRankedDocRefs();//code to clean up Set&convert
>>> toString
>>>> String definition = seq.getDescription(); //code to clean up String
>>>> //Add to model
>>>> parents.addProperty(DC.description, definition);
>>>> parents.addProperty(DC.publisher, authors);
>>>> parents.addComment(taxonomy, "EN");
>>>> parents.addProperty(DC.type, organism);
>>>> //print in rdf format
>>>> model.write(out, "RDF/XML");
>>>> out.close(); }
>>>> }
>>>> 
>>>> 
>>>> Thanks,
>>>> Jean-Charles Ferri?res
>>> _____________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
> Je cr?e ma bo?te mail www.laposte.net
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From jc.lucky at laposte.net  Wed Oct 27 13:34:22 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:34:22 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
	<AANLkTimHnp-p+rNCDyGc0nNDQJY1tpO03XBbD8GK4naf@mail.gmail.com>
	<21411489.155159.1288184635185.JavaMail.www@wwinf8222>
	<3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>
Message-ID: <6229150.91865.1288186462649.JavaMail.www@wwinf8218>


Thanks for your reply and indeed as mentioned at the bottom that is what I use to try to retrieve the maximum of information. However and that is my problem the methods described do not provide the required information.
For example getRankedDocRefs() provides authors and Journals but no TITLE
getFeaturesSet() only provides /organism, /mol_type and /db_xref
Thereby I was asking for help and suggestion fo how to fix this "problem".

Best,
Jean-Charles


> Message du 27/10/10 15:17
> De : "Richard Holland" 
> A : "jc.lucky" 
> Copie ? : "Scooter Willis" , "biojava-l lists open-bio org" 
> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>
> 
> Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().
> 
> This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2
> 
> cheers,
> Richard
> 
> On 27 Oct 2010, at 14:03, jc.lucky wrote:
> 
> > 
> > I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data.
> > My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future.
> > 
> > Thanks,
> > 
> > Jean-Charles
> > 
> > 
> > 
> >> Message du 27/10/10 12:41
> >> De : "Scooter Willis" 
> >> A : "jc.lucky" 
> >> Copie ? : "biojava-l lists open-bio org" 
> >> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file
> >> 
> >> Jean-Charles
> >> 
> >> I have it on my list to do a GenBank parser but haven't had the time. I
> >> can't promise anything in the next couple weeks. Can you send some details
> >> about what a typical use case is for your purpose? Are you trying to get the
> >> sequence data or are you more interested in the features?
> >> 
> >> Thanks
> >> 
> >> Scooter
> >> 
> >> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky wrote:
> >> 
> >>> 
> >>> I tried once again with the new version of BioJava but without succeding.
> >>> Any idea or suggestion?
> >>> 
> >>> Thanks in advance
> >>> Regards,
> >>> 
> >>> Jean-Charles Ferri?res
> >>> 
> >>> 
> >>>> Message du 22/10/10 10:11
> >>>> De : "jc.lucky"
> >>>> A : biojava-l at lists.open-bio.org
> >>>> Copie ? :
> >>>> Objet : [Biojava-l] Retrieve Information from GenBank file
> >>>> 
> >>>> 
> >>>> Hi
> >>>> 
> >>>> I'm trying to convert a GenBank file into a rdf file. The gene of
> >>> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945
> >>>> 
> >>>> With the below code I can read the GenBank file and I manage to retrieve
> >>> information and convert them in a rdf format. However I don't succeed in
> >>> retrieving some information such as Title, protein or product. According to
> >>> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is
> >>> possible to do so.
> >>>> Please help me find what I do wrong or what should be done to achieve my
> >>> goal.
> >>>> 
> >>>> //read the GeneBank File
> >>>> public static RichSequenceIterator readFile(String input,
> >>>> RichSequenceBuilderFactory seqFactory,
> >>>> Namespace ns)
> >>>> throws IOException, NoSuchElementException, BioException
> >>>> {
> >>>> ns = null;
> >>>> InputStream stream = new FileInputStream(input);
> >>>> BufferedReader rdfFile = new BufferedReader(new
> >>> InputStreamReader(stream));
> >>>> RichSequenceIterator seqs =
> >>> RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
> >>>> return seqs;
> >>>> }
> >>>> 
> >>>> //Retrieve information and convert them in rdf format
> >>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
> >>>> throws IOException, NoSuchElementException, BioException {
> >>>> //create model for the ontology
> >>>> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM,
> >>> null);
> >>>> OntClass parents;
> >>>> String URI = "http://pbr.wur.nl/#";
> >>>> 
> >>>> while(rsi.hasNext())
> >>>> {
> >>>> RichSequence seq = rsi.nextRichSequence();
> >>>> String id = seq.getName();
> >>>> parents = model.createClass(URI + id);
> >>>> Set author = seq.getRankedDocRefs();//code to clean up Set&convert
> >>> toString
> >>>> String definition = seq.getDescription(); //code to clean up String
> >>>> //Add to model
> >>>> parents.addProperty(DC.description, definition);
> >>>> parents.addProperty(DC.publisher, authors);
> >>>> parents.addComment(taxonomy, "EN");
> >>>> parents.addProperty(DC.type, organism);
> >>>> //print in rdf format
> >>>> model.write(out, "RDF/XML");
> >>>> out.close(); }
> >>>> }
> >>>> 
> >>>> 
> >>>> Thanks,
> >>>> Jean-Charles Ferri?res
> >>> _____________________________________________
> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> > 
> > Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
> > Je cr?e ma bo?te mail www.laposte.net
> > 
> > 
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> 
> 
> 

Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ?
Je cr?e ma bo?te mail www.laposte.net


From andreas at sdsc.edu  Thu Oct 28 00:47:50 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 27 Oct 2010 17:47:50 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
Message-ID: <AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>

> I assume AtomCache is a new class in BioJava3?

yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0

>
> I must give you my embarrassed apology...after a bunch of testing I
> finally figured out that I had misunderstood where the Parser's error
> handling returns control and started going after the wrong exceptions.
> ?It does looks like if setParseCAOnly is true, the reader excepts on
> chains with no CA's instead of just skipping them, though the other
> chains are still parsed into the structure.

This sounds like there might be  a problem with CA only.. do you have
an example ID? also: are you on biojava 1.7 or 3.0 ?

Andreas


>
> -da
>
> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> PDB files are better nowadays, due to remediation, however there are
>> still issues..
>>
>> it sounds like you just want to figure out how to do the try/catch
>> block properly. You could do something like that:
>>
>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>> ? ? ? ? ? ? ? ?AtomCache cache = new
>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>
>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>
>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>
>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>
>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>> e.getMessage());
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>> ? ? ? ? ? ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ? ?}
>>
>>
>>
>>
>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>> offense intended, by the way, with respect to PDB errors - obviously
>>> the PDB is an indispensable resource for all protein scientists.
>>>
>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>> and the inner iterates over the pieces from each. ?When
>>> StructureExceptions come out of my PDBFileReader object I want to
>>> continue the outer loop, moving on to the next set of files without
>>> executing any of the code that depends on correct StructureImpl
>>> objects from the reader (database updates, the inner loop).
>>> Since the reader's methods have their own try-catch blocks, a thrown
>>> StructureException is stopped there and never reaches my own error
>>> handling. ?I just need to know when those errors occur so I can skip
>>> those proteins - I am presuming that the correct entries will outweigh
>>> the problem ones by a significant factor and the overall data wont be
>>> seriously impacted.
>>>
>>> -da
>>>
>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Daniel,
>>>>
>>>> can you explain a bit more what you are doing, in particular what
>>>> errors you would like to deal with on your end? ?You should not need
>>>> to worry too much about exception handling. Are there any special
>>>> cases you are interested in? ?In this case we should support you with
>>>> a clean interface rather than exception handling from your end...
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>> Hi all,
>>>>> Let me first say thanks to all the BioJava community members for
>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>> too trivial.
>>>>>
>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>> As is commonly known, these are often rife with errors which can lead
>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>> in my code where the parser is called, so that I can branch to a
>>>>> continue statement and have my batch processing loops move on to the
>>>>> next file.
>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>> which properties will give the most general success information...and
>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>
>>>>> If there is some great way to check if an exception was caught down a
>>>>> series of nested method calls, please hit me over the head with it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -da
>>>>> _______________________________________________
>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>>
>>>>
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From dasarnow at gmail.com  Thu Oct 28 04:05:18 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 21:05:18 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To: <AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
Message-ID: <AANLkTikjOML9RtOJJj2pRH-URhC7WVJdVaTobY82qWJF@mail.gmail.com>

I'm using 1.7, partially because my distro had a package for it and
partially because I was initially using the online Javadoc a lot.
PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
chain F appears to parse correctly.

-da

org.biojava.bio.structure.StructureException: could not find chain A
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: could not find chain B
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with
chainId >A<
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with
chainId >B<
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
? ? ? ?at fragalign.pair.getStructs(pair.java:42)
? ? ? ?at fragalign.Main.main(Main.java:40)


On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>> I assume AtomCache is a new class in BioJava3?
>
> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>
>>
>> I must give you my embarrassed apology...after a bunch of testing I
>> finally figured out that I had misunderstood where the Parser's error
>> handling returns control and started going after the wrong exceptions.
>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>> chains with no CA's instead of just skipping them, though the other
>> chains are still parsed into the structure.
>
> This sounds like there might be ?a problem with CA only.. do you have
> an example ID? also: are you on biojava 1.7 or 3.0 ?
>
> Andreas
>
>
>
>>
>> -da
>>
>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> PDB files are better nowadays, due to remediation, however there are
>>> still issues..
>>>
>>> it sounds like you just want to figure out how to do the try/catch
>>> block properly. You could do something like that:
>>>
>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>
>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>
>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>
>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>
>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>> e.getMessage());
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>> ? ? ? ? ? ? ? ?}
>>>
>>>
>>>
>>>
>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>> the PDB is an indispensable resource for all protein scientists.
>>>>
>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>> and the inner iterates over the pieces from each. ?When
>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>> continue the outer loop, moving on to the next set of files without
>>>> executing any of the code that depends on correct StructureImpl
>>>> objects from the reader (database updates, the inner loop).
>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>> StructureException is stopped there and never reaches my own error
>>>> handling. ?I just need to know when those errors occur so I can skip
>>>> those proteins - I am presuming that the correct entries will outweigh
>>>> the problem ones by a significant factor and the overall data wont be
>>>> seriously impacted.
>>>>
>>>> -da
>>>>
>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> can you explain a bit more what you are doing, in particular what
>>>>> errors you would like to deal with on your end? ?You should not need
>>>>> to worry too much about exception handling. Are there any special
>>>>> cases you are interested in? ?In this case we should support you with
>>>>> a clean interface rather than exception handling from your end...
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>> Hi all,
>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>> too trivial.
>>>>>>
>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>> continue statement and have my batch processing loops move on to the
>>>>>> next file.
>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>> which properties will give the most general success information...and
>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>
>>>>>> If there is some great way to check if an exception was caught down a
>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -da
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From andreas at sdsc.edu  Thu Oct 28 17:28:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 10:28:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
Message-ID: <AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>

Hi Daniel,

I just checked, this is a bug which is already resolved in 3.0... If
it is an issue for you, you might want to upgrade... (should be very
easy, if you start using Maven ...)

Thanks,
Andreas

On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> I'm using 1.7, partially because my distro had a package for it and
> partially because I was initially using the online Javadoc a lot.
> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
> chain F appears to parse correctly.
>
> -da
>
> org.biojava.bio.structure.StructureException: could not find chain A
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
> org.biojava.bio.structure.StructureException: could not find chain B
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
> org.biojava.bio.structure.StructureException: did not find chain with
> chainId >A<
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
> org.biojava.bio.structure.StructureException: did not find chain with
> chainId >B<
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
> ? ? ? ?at fragalign.Main.main(Main.java:40)
>
>
> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> I assume AtomCache is a new class in BioJava3?
>>
>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>
>>>
>>> I must give you my embarrassed apology...after a bunch of testing I
>>> finally figured out that I had misunderstood where the Parser's error
>>> handling returns control and started going after the wrong exceptions.
>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>> chains with no CA's instead of just skipping them, though the other
>>> chains are still parsed into the structure.
>>
>> This sounds like there might be ?a problem with CA only.. do you have
>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>
>> Andreas
>>
>>
>>
>>>
>>> -da
>>>
>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Daniel,
>>>>
>>>> PDB files are better nowadays, due to remediation, however there are
>>>> still issues..
>>>>
>>>> it sounds like you just want to figure out how to do the try/catch
>>>> block properly. You could do something like that:
>>>>
>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>
>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>
>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>
>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>
>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>> e.getMessage());
>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>> ? ? ? ? ? ? ? ?}
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>
>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>> and the inner iterates over the pieces from each. ?When
>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>> continue the outer loop, moving on to the next set of files without
>>>>> executing any of the code that depends on correct StructureImpl
>>>>> objects from the reader (database updates, the inner loop).
>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>> StructureException is stopped there and never reaches my own error
>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>> the problem ones by a significant factor and the overall data wont be
>>>>> seriously impacted.
>>>>>
>>>>> -da
>>>>>
>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>> to worry too much about exception handling. Are there any special
>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>> a clean interface rather than exception handling from your end...
>>>>>>
>>>>>> Andreas
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>> too trivial.
>>>>>>>
>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>> next file.
>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>> which properties will give the most general success information...and
>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>
>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -da
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From vishalthapar at gmail.com  Thu Oct 28 17:40:49 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Thu, 28 Oct 2010 13:40:49 -0400
Subject: [Biojava-l] K-mers
Message-ID: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>

Hi All,

I had a quick question: Does Biojava have a method to generate k-mers or
K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
counts for every sequence in a fasta file. If something like this exists it
would save me some time to write the code.

Thanks,

Vishal


From jayunit100 at gmail.com  Thu Oct 28 19:43:17 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 15:43:17 -0400
Subject: [Biojava-l] biojava maven integration
Message-ID: <AANLkTinjaDZ_BGUXk3F=DbNbxppky8548zqm2d2PNvP8@mail.gmail.com>

Hi guys, I added the following to my pom file

  <dependency>
        <groupId>org.biojava</groupId>
        <artifactId>biojava</artifactId>
        <version>3.0-alpha2</version>
   </dependency>

<repository>
        <id>biojava-maven-repo</id>
        <name>BioJava repository</name>
        <url>http://www.biojava.org/download/maven/</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
        <releases>
            <enabled>true</enabled>
        </releases>
    </repository>
 <repository>

But to no avail.  Does anyone know how to add biojava3 to the libraries in a
maven managed application >?

Thanks.


From jayunit100 at gmail.com  Thu Oct 28 22:51:25 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 18:51:25 -0400
Subject: [Biojava-l] biojava maven integration
In-Reply-To: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
References: <AANLkTinjaDZ_BGUXk3F=DbNbxppky8548zqm2d2PNvP8@mail.gmail.com>
	<5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
Message-ID: <AANLkTi=FV0-Wh6Abh0DOGCWZfMicobGo9gjkcRMSJUBx@mail.gmail.com>

Does anybody have a maven POM example of how to integrate biojava into my
application ?
Thanks!

Im currently using biojava 1.7, and have put it in my own, local maven
repository.


On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy <andy.law at roslin.ed.ac.uk> wrote:

> Not 100% certain but I *think* you want to depend on biojava-core rather
> than biojava.
>
> Later,
>
> Andy
>
> On 28 Oct 2010, at 20:43, Jay Vyas wrote:
>
> > Hi guys, I added the following to my pom file
> >
> >  <dependency>
> >        <groupId>org.biojava</groupId>
> >        <artifactId>biojava</artifactId>
> >        <version>3.0-alpha2</version>
> >   </dependency>
> >
> > <repository>
> >        <id>biojava-maven-repo</id>
> >        <name>BioJava repository</name>
> >        <url>http://www.biojava.org/download/maven/</url>
> >        <snapshots>
> >            <enabled>true</enabled>
> >        </snapshots>
> >        <releases>
> >            <enabled>true</enabled>
> >        </releases>
> >    </repository>
> > <repository>
> >
> > But to no avail.  Does anyone know how to add biojava3 to the libraries
> in a
> > maven managed application >?
> >
> > Thanks.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


-- 
Jay Vyas
MMSB/UCHC


From dasarnow at gmail.com  Thu Oct 28 23:45:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:45:05 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
	<AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
Message-ID: <AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>

It's not a big deal - after all if you use CA only, chains with no
CA's aren't important, and the error messages aren't that long.  But
I'm going to switch anyway...
I'm getting the dreaded "can't read line length in file" error while
trying to checkout biojava-live/trunk, though.

-da

On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> I just checked, this is a bug which is already resolved in 3.0... If
> it is an issue for you, you might want to upgrade... (should be very
> easy, if you start using Maven ...)
>
> Thanks,
> Andreas
>
> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> I'm using 1.7, partially because my distro had a package for it and
>> partially because I was initially using the online Javadoc a lot.
>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
>> chain F appears to parse correctly.
>>
>> -da
>>
>> org.biojava.bio.structure.StructureException: could not find chain A
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: could not find chain B
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: did not find chain with
>> chainId >A<
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: did not find chain with
>> chainId >B<
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>
>>
>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> I assume AtomCache is a new class in BioJava3?
>>>
>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>
>>>>
>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>> finally figured out that I had misunderstood where the Parser's error
>>>> handling returns control and started going after the wrong exceptions.
>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>>> chains with no CA's instead of just skipping them, though the other
>>>> chains are still parsed into the structure.
>>>
>>> This sounds like there might be ?a problem with CA only.. do you have
>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>
>>> Andreas
>>>
>>>
>>>
>>>>
>>>> -da
>>>>
>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>> still issues..
>>>>>
>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>> block properly. You could do something like that:
>>>>>
>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>>
>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>
>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>>
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>>
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>>> e.getMessage());
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>> ? ? ? ? ? ? ? ?}
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>
>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>> and the inner iterates over the pieces from each. ?When
>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>> objects from the reader (database updates, the inner loop).
>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>> StructureException is stopped there and never reaches my own error
>>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>> the problem ones by a significant factor and the overall data wont be
>>>>>> seriously impacted.
>>>>>>
>>>>>> -da
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>
>>>>>>> Andreas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>> too trivial.
>>>>>>>>
>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>> next file.
>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>>> which properties will give the most general success information...and
>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>
>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -da
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From dasarnow at gmail.com  Thu Oct 28 23:51:25 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:51:25 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
	<AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
	<AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>
Message-ID: <AANLkTimzVwJLyzyTZvmf0=og+t7xyc2z4OKTf1RNN5Cy@mail.gmail.com>

Ahh, I suppose that is the "problem" referred to in the wiki?  I
checked out successfully from the repository on github.

-da

On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
> It's not a big deal - after all if you use CA only, chains with no
> CA's aren't important, and the error messages aren't that long. ?But
> I'm going to switch anyway...
> I'm getting the dreaded "can't read line length in file" error while
> trying to checkout biojava-live/trunk, though.
>
> -da
>
> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> I just checked, this is a bug which is already resolved in 3.0... If
>> it is an issue for you, you might want to upgrade... (should be very
>> easy, if you start using Maven ...)
>>
>> Thanks,
>> Andreas
>>
>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> I'm using 1.7, partially because my distro had a package for it and
>>> partially because I was initially using the online Javadoc a lot.
>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
>>> chain F appears to parse correctly.
>>>
>>> -da
>>>
>>> org.biojava.bio.structure.StructureException: could not find chain A
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: could not find chain B
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: did not find chain with
>>> chainId >A<
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: did not find chain with
>>> chainId >B<
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>
>>>
>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> I assume AtomCache is a new class in BioJava3?
>>>>
>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>
>>>>>
>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>> handling returns control and started going after the wrong exceptions.
>>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>>>> chains with no CA's instead of just skipping them, though the other
>>>>> chains are still parsed into the structure.
>>>>
>>>> This sounds like there might be ?a problem with CA only.. do you have
>>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>>>>
>>>>> -da
>>>>>
>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>> still issues..
>>>>>>
>>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>>> block properly. You could do something like that:
>>>>>>
>>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>>>
>>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>>
>>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>>>
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>>>
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>>>> e.getMessage());
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>> ? ? ? ? ? ? ? ?}
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>
>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>> and the inner iterates over the pieces from each. ?When
>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>> the problem ones by a significant factor and the overall data wont be
>>>>>>> seriously impacted.
>>>>>>>
>>>>>>> -da
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>> too trivial.
>>>>>>>>>
>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>> next file.
>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>
>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -da
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Dr. Andreas Prlic
>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>> University of California, San Diego
>>>> (+1) 858.246.0526
>>>> -----------------------------------------------------------------------
>>>>
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>


From andreas at sdsc.edu  Fri Oct 29 00:06:55 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 17:06:55 -0700
Subject: [Biojava-l] biojava maven integration
In-Reply-To: <AANLkTi=FV0-Wh6Abh0DOGCWZfMicobGo9gjkcRMSJUBx@mail.gmail.com>
References: <AANLkTinjaDZ_BGUXk3F=DbNbxppky8548zqm2d2PNvP8@mail.gmail.com>
	<5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
	<AANLkTi=FV0-Wh6Abh0DOGCWZfMicobGo9gjkcRMSJUBx@mail.gmail.com>
Message-ID: <AANLkTikzeXetdY_kf1P-GEQPueYj7MQfkwYWQ7OgH=q=@mail.gmail.com>

Hi Jay,

here is some UI code that is using biojava from Maven:

http://github.com/biojava/RCSB_SequenceViewer/blob/master/pom.xml

Andreas


On Thu, Oct 28, 2010 at 3:51 PM, Jay Vyas <jayunit100 at gmail.com> wrote:
> Does anybody have a maven POM example of how to integrate biojava into my
> application ?
> Thanks!
>
> Im currently using biojava 1.7, and have put it in my own, local maven
> repository.
>
>
>
>
> On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy <andy.law at roslin.ed.ac.uk> wrote:
>
>> Not 100% certain but I *think* you want to depend on biojava-core rather
>> than biojava.
>>
>> Later,
>>
>> Andy
>>
>> On 28 Oct 2010, at 20:43, Jay Vyas wrote:
>>
>> > Hi guys, I added the following to my pom file
>> >
>> > ?<dependency>
>> > ? ? ? ?<groupId>org.biojava</groupId>
>> > ? ? ? ?<artifactId>biojava</artifactId>
>> > ? ? ? ?<version>3.0-alpha2</version>
>> > ? </dependency>
>> >
>> > <repository>
>> > ? ? ? ?<id>biojava-maven-repo</id>
>> > ? ? ? ?<name>BioJava repository</name>
>> > ? ? ? ?<url>http://www.biojava.org/download/maven/</url>
>> > ? ? ? ?<snapshots>
>> > ? ? ? ? ? ?<enabled>true</enabled>
>> > ? ? ? ?</snapshots>
>> > ? ? ? ?<releases>
>> > ? ? ? ? ? ?<enabled>true</enabled>
>> > ? ? ? ?</releases>
>> > ? ?</repository>
>> > <repository>
>> >
>> > But to no avail. ?Does anyone know how to add biojava3 to the libraries
>> in a
>> > maven managed application >?
>> >
>> > Thanks.
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
>
> --
> Jay Vyas
> MMSB/UCHC
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Fri Oct 29 00:08:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 17:08:49 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with
	PDBFileReader
In-Reply-To: <AANLkTimzVwJLyzyTZvmf0=og+t7xyc2z4OKTf1RNN5Cy@mail.gmail.com>
References: <AANLkTik18=DfW-aTKcu4Jd=nBYSwjaHs2TsRYaiyG2j-@mail.gmail.com>
	<AANLkTine5b-TMTqgZYRb+gPL5wX_6dbS2exgVpMDxib-@mail.gmail.com>
	<AANLkTikd3WoczvgRA7J5n0LZbY+zmymOVO+ZrGnxk1pF@mail.gmail.com>
	<AANLkTi=CsnCY+4gAFMcjd6W-bBM7=+T0tXbXZmX5X+qs@mail.gmail.com>
	<AANLkTinL8KZ3Zs3uAjCf9+-YAnD1U_nNffwQhvMGim6j@mail.gmail.com>
	<AANLkTim9rVZ+iy4JtxUTLbj4eojeevCMz=+-TVatXPJW@mail.gmail.com>
	<AANLkTikw0XPOFAyABvrmX-6bPuKZS2GxADyD23ww4DZk@mail.gmail.com>
	<AANLkTi=A_g20YNwMBvU587HRB_Xe4vne6sw7hn7Xk5j1@mail.gmail.com>
	<AANLkTinWMK9nFCLovKnSGgQssj7-nT0dEOvMi1Cy486q@mail.gmail.com>
	<AANLkTimzVwJLyzyTZvmf0=og+t7xyc2z4OKTf1RNN5Cy@mail.gmail.com>
Message-ID: <AANLkTik5WLTFevA1iCBHPbLutcyA6TbxioaisKGAs-gY@mail.gmail.com>

good, I was just about to say that... ;-)

Andreas


On Thu, Oct 28, 2010 at 4:51 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Ahh, I suppose that is the "problem" referred to in the wiki? ?I
> checked out successfully from the repository on github.
>
> -da
>
> On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> It's not a big deal - after all if you use CA only, chains with no
>> CA's aren't important, and the error messages aren't that long. ?But
>> I'm going to switch anyway...
>> I'm getting the dreaded "can't read line length in file" error while
>> trying to checkout biojava-live/trunk, though.
>>
>> -da
>>
>> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> I just checked, this is a bug which is already resolved in 3.0... If
>>> it is an issue for you, you might want to upgrade... (should be very
>>> easy, if you start using Maven ...)
>>>
>>> Thanks,
>>> Andreas
>>>
>>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> I'm using 1.7, partially because my distro had a package for it and
>>>> partially because I was initially using the online Javadoc a lot.
>>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide
>>>> chain F appears to parse correctly.
>>>>
>>>> -da
>>>>
>>>> org.biojava.bio.structure.StructureException: could not find chain A
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: could not find chain B
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with
>>>> chainId >A<
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with
>>>> chainId >B<
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42)
>>>> ? ? ? ?at fragalign.Main.main(Main.java:40)
>>>>
>>>>
>>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> I assume AtomCache is a new class in BioJava3?
>>>>>
>>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>>
>>>>>>
>>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>>> handling returns control and started going after the wrong exceptions.
>>>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on
>>>>>> chains with no CA's instead of just skipping them, though the other
>>>>>> chains are still parsed into the structure.
>>>>>
>>>>> This sounds like there might be ?a problem with CA only.. do you have
>>>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -da
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>>> still issues..
>>>>>>>
>>>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>>>> block properly. You could do something like that:
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true;
>>>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new
>>>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try {
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID);
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) {
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID);
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue;
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s);
>>>>>>>
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened...
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " +
>>>>>>> e.getMessage());
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace();
>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?}
>>>>>>> ? ? ? ? ? ? ? ?}
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No
>>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>>
>>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the
>>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>>> and the inner iterates over the pieces from each. ?When
>>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>>> handling. ?I just need to know when those errors occur so I can skip
>>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>>> the problem ones by a significant factor and the overall data wont be
>>>>>>>> seriously impacted.
>>>>>>>>
>>>>>>>> -da
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>>> errors you would like to deal with on your end? ?You should not need
>>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>>> cases you are interested in? ?In this case we should support you with
>>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>>
>>>>>>>>> Andreas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>>> too trivial.
>>>>>>>>>>
>>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because
>>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up
>>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>>> next file.
>>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for
>>>>>>>>>> possession of the fields in question? ?In that case, I'm not sure
>>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>>
>>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -da
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From ayates at ebi.ac.uk  Fri Oct 29 08:12:09 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 09:12:09 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
Message-ID: <A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>

Hi Vishal,

As far as I am aware there is nothing which will generate them in BioJava at the moment. However it is possible to do it with BioJava3:

public static void main(String[] args) {
    DNASequence d = new DNASequence("ATGATC");
    System.out.println("Non-Overlap");
    nonOverlap(d);
    System.out.println("Overlap");
    overlap(d);
}

public static final int KMER = 3;

//Generate triplets overlapping
public static void overlap(Sequence<NucleotideCompound> d) {
    List<WindowedSequence<NucleotideCompound>> l =
            new ArrayList<WindowedSequence<NucleotideCompound>>();
    for(int i=1; i<=KMER; i++) {
        SequenceView<NucleotideCompound> sub = d.getSubSequence(
                i, d.getLength());
        WindowedSequence<NucleotideCompound> w =
            new WindowedSequence<NucleotideCompound>(sub, KMER);
        l.add(w);
    }

    //Will return ATG, ATC, TGA & GAT
    for(WindowedSequence<NucleotideCompound> w: l) {
        for(List<NucleotideCompound> subList: w) {
            System.out.println(subList);
        }
    }
}

//Generate triplet Compound lists non-overlapping
public static void nonOverlap(Sequence<NucleotideCompound> d) {
    WindowedSequence<NucleotideCompound> w = 
            new WindowedSequence<NucleotideCompound>(d, KMER);
    //Will return ATG & ATC
    for(List<NucleotideCompound> subList: w) {
        System.out.println(subList);
    }
}

The disadvantage of all of these solutions is that they generate lists of Compounds so kmer generation can/will be a memory intensive operation. This does mean it has to be since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA)

As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).

Hope this helps,

Andy

On 28 Oct 2010, at 18:40, Vishal Thapar wrote:

> Hi All,
> 
> I had a quick question: Does Biojava have a method to generate k-mers or
> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> counts for every sequence in a fasta file. If something like this exists it
> would save me some time to write the code.
> 
> Thanks,
> 
> Vishal
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Fri Oct 29 09:12:53 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 14:42:53 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
Message-ID: <AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>

Dear Friends,

Thanks to Vishal & Andy for this. I actually needed this code too..
Vishal, I think Andy's suggestions may be a good option to include in
BioJava 3. Would you like to add this to the BioJava 3.

Thanks again.

Regards,
Jitesh Dundas

On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava at
> the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>     DNASequence d = new DNASequence("ATGATC");
>     System.out.println("Non-Overlap");
>     nonOverlap(d);
>     System.out.println("Overlap");
>     overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>     List<WindowedSequence<NucleotideCompound>> l =
>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>     for(int i=1; i<=KMER; i++) {
>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                 i, d.getLength());
>         WindowedSequence<NucleotideCompound> w =
>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>         l.add(w);
>     }
>
>     //Will return ATG, ATC, TGA & GAT
>     for(WindowedSequence<NucleotideCompound> w: l) {
>         for(List<NucleotideCompound> subList: w) {
>             System.out.println(subList);
>         }
>     }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>     WindowedSequence<NucleotideCompound> w =
>             new WindowedSequence<NucleotideCompound>(d, KMER);
>     //Will return ATG & ATC
>     for(List<NucleotideCompound> subList: w) {
>         System.out.println(subList);
>     }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does mean it has to be since sub sequences are thin wrappers around an
> underlying sequence. Also the overlap solution is non-optimal since it
> iterates through each window rather than stepping through delegating onto
> each base in turn (hence why we get ATG & ATC before TGA)
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
>> Hi All,
>>
>> I had a quick question: Does Biojava have a method to generate k-mers or
>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>> counts for every sequence in a fasta file. If something like this exists
>> it
>> would save me some time to write the code.
>>
>> Thanks,
>>
>> Vishal
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From ayates at ebi.ac.uk  Fri Oct 29 09:20:36 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 10:20:36 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
Message-ID: <B903229A-1A5A-432C-9307-4D01ACD8DFB3@ebi.ac.uk>

Okay couple of points here:

1). Which biojava3 module? This sounds like something for the genomic module rather than core

2). It'll need some more work. I'm not happy about using the WindowedSequenceView in its current state. I think an alteration to avoid it making Lists would be a good idea (plus recent developments in the API as to its main use means this is a viable change). Also it should return the overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6

Comments?

Andy

On 29 Oct 2010, at 10:12, jitesh dundas wrote:

> Dear Friends,
> 
> Thanks to Vishal & Andy for this. I actually needed this code too..
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to the BioJava 3.
> 
> Thanks again.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Vishal,
>> 
>> As far as I am aware there is nothing which will generate them in BioJava at
>> the moment. However it is possible to do it with BioJava3:
>> 
>> public static void main(String[] args) {
>>    DNASequence d = new DNASequence("ATGATC");
>>    System.out.println("Non-Overlap");
>>    nonOverlap(d);
>>    System.out.println("Overlap");
>>    overlap(d);
>> }
>> 
>> public static final int KMER = 3;
>> 
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>    List<WindowedSequence<NucleotideCompound>> l =
>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>    for(int i=1; i<=KMER; i++) {
>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>                i, d.getLength());
>>        WindowedSequence<NucleotideCompound> w =
>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>        l.add(w);
>>    }
>> 
>>    //Will return ATG, ATC, TGA & GAT
>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>        for(List<NucleotideCompound> subList: w) {
>>            System.out.println(subList);
>>        }
>>    }
>> }
>> 
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>    WindowedSequence<NucleotideCompound> w =
>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>    //Will return ATG & ATC
>>    for(List<NucleotideCompound> subList: w) {
>>        System.out.println(subList);
>>    }
>> }
>> 
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds so kmer generation can/will be a memory intensive operation. This
>> does mean it has to be since sub sequences are thin wrappers around an
>> underlying sequence. Also the overlap solution is non-optimal since it
>> iterates through each window rather than stepping through delegating onto
>> each base in turn (hence why we get ATG & ATC before TGA)
>> 
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>> 
>> Hope this helps,
>> 
>> Andy
>> 
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>> 
>>> Hi All,
>>> 
>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>> counts for every sequence in a fasta file. If something like this exists
>>> it
>>> would save me some time to write the code.
>>> 
>>> Thanks,
>>> 
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Fri Oct 29 10:00:44 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:30:44 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
Message-ID: <AANLkTimLoXvOO1aHBrRno4=P7CJ923iGUoQ+558LoBWE@mail.gmail.com>

Dear Sir,

Is there any way to detect patterns in the recorded k-mers .

I have a large set of miRNAs (study for mutations and patgerns for
gastric cancer).I made a record of k-mers for each sequence but the
patterns that are generated are difficult to track.

Can BioJava do this point. Regular Expressions in Java maybe useful here..

Request expert advise  in this.Any other s/w that might be useful.

Thanks,
Jitesh Dundas

On 10/29/10, jitesh dundas <jbdundas at gmail.com> wrote:
> Dear Friends,
>
> Thanks to Vishal & Andy for this. I actually needed this code too..
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to the BioJava 3.
>
> Thanks again.
>
> Regards,
> Jitesh Dundas
>
> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Vishal,
>>
>> As far as I am aware there is nothing which will generate them in BioJava
>> at
>> the moment. However it is possible to do it with BioJava3:
>>
>> public static void main(String[] args) {
>>     DNASequence d = new DNASequence("ATGATC");
>>     System.out.println("Non-Overlap");
>>     nonOverlap(d);
>>     System.out.println("Overlap");
>>     overlap(d);
>> }
>>
>> public static final int KMER = 3;
>>
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>     List<WindowedSequence<NucleotideCompound>> l =
>>             new ArrayList<WindowedSequence<NucleotideCompound>>();
>>     for(int i=1; i<=KMER; i++) {
>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>                 i, d.getLength());
>>         WindowedSequence<NucleotideCompound> w =
>>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>>         l.add(w);
>>     }
>>
>>     //Will return ATG, ATC, TGA & GAT
>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>         for(List<NucleotideCompound> subList: w) {
>>             System.out.println(subList);
>>         }
>>     }
>> }
>>
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>     WindowedSequence<NucleotideCompound> w =
>>             new WindowedSequence<NucleotideCompound>(d, KMER);
>>     //Will return ATG & ATC
>>     for(List<NucleotideCompound> subList: w) {
>>         System.out.println(subList);
>>     }
>> }
>>
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds so kmer generation can/will be a memory intensive operation.
>> This
>> does mean it has to be since sub sequences are thin wrappers around an
>> underlying sequence. Also the overlap solution is non-optimal since it
>> iterates through each window rather than stepping through delegating onto
>> each base in turn (hence why we get ATG & ATC before TGA)
>>
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>>
>> Hope this helps,
>>
>> Andy
>>
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>
>>> Hi All,
>>>
>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>> counts for every sequence in a fasta file. If something like this exists
>>> it
>>> would save me some time to write the code.
>>>
>>> Thanks,
>>>
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>


From jbdundas at gmail.com  Fri Oct 29 10:04:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:34:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <B903229A-1A5A-432C-9307-4D01ACD8DFB3@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
	<B903229A-1A5A-432C-9307-4D01ACD8DFB3@ebi.ac.uk>
Message-ID: <AANLkTimktS_moccBSOXLKMDxYVe8Y91LrDOxndHWUP_P@mail.gmail.com>

You are right again my friend.Definitely that would hang up my machine
with the xml file parsing activity.

This is about sequence alignment and related modules.

I will look at this today and send a fix on that.Hope that you can help.

PS: what about pattern matching in sequences?interesting  to have in
biojava 3 ?

Regards,
JD

On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> Okay couple of points here:
>
> 1). Which biojava3 module? This sounds like something for the genomic module
> rather than core
>
> 2). It'll need some more work. I'm not happy about using the
> WindowedSequenceView in its current state. I think an alteration to avoid it
> making Lists would be a good idea (plus recent developments in the API as to
> its main use means this is a viable change). Also it should return the
> overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6
>
> Comments?
>
> Andy
>
> On 29 Oct 2010, at 10:12, jitesh dundas wrote:
>
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too..
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to the BioJava 3.
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at
>>> the moment. However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>>
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists
>>>> it
>>>> would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


From ayates at ebi.ac.uk  Fri Oct 29 10:09:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 11:09:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTimLoXvOO1aHBrRno4=P7CJ923iGUoQ+558LoBWE@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTinKApvuxy6i3JyZz9H2J6qTf1cDpTn3946BX86O@mail.gmail.com>
	<AANLkTimLoXvOO1aHBrRno4=P7CJ923iGUoQ+558LoBWE@mail.gmail.com>
Message-ID: <5832FAFE-FEC3-4A7C-9469-3C334551900B@ebi.ac.uk>

One of the disadvantages of the Sequence based system is that we have no support for searching in sequences with patterns like regular expressions. Whilst it's possible to convert a Sequence into a String & then perform the expression but that is a sub-optimal solution.

Looking at the Pattern code in Java6 it can take in a CharSequence which one could write an adaptor to make a Sequence act as a CharSequence for the matching procedure but really it looks like a lot of work.

As for a way of doing matching to sequence HMMER3 is awesome :)

Andy

On 29 Oct 2010, at 11:00, jitesh dundas wrote:

> Dear Sir,
> 
> Is there any way to detect patterns in the recorded k-mers .
> 
> I have a large set of miRNAs (study for mutations and patgerns for
> gastric cancer).I made a record of k-mers for each sequence but the
> patterns that are generated are difficult to track.
> 
> Can BioJava do this point. Regular Expressions in Java maybe useful here..
> 
> Request expert advise  in this.Any other s/w that might be useful.
> 
> Thanks,
> Jitesh Dundas
> 
> On 10/29/10, jitesh dundas <jbdundas at gmail.com> wrote:
>> Dear Friends,
>> 
>> Thanks to Vishal & Andy for this. I actually needed this code too..
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to the BioJava 3.
>> 
>> Thanks again.
>> 
>> Regards,
>> Jitesh Dundas
>> 
>> On 10/29/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Hi Vishal,
>>> 
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at
>>> the moment. However it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>    DNASequence d = new DNASequence("ATGATC");
>>>    System.out.println("Non-Overlap");
>>>    nonOverlap(d);
>>>    System.out.println("Overlap");
>>>    overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>    List<WindowedSequence<NucleotideCompound>> l =
>>>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>    for(int i=1; i<=KMER; i++) {
>>>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>                i, d.getLength());
>>>        WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>        l.add(w);
>>>    }
>>> 
>>>    //Will return ATG, ATC, TGA & GAT
>>>    for(WindowedSequence<NucleotideCompound> w: l) {
>>>        for(List<NucleotideCompound> subList: w) {
>>>            System.out.println(subList);
>>>        }
>>>    }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>    WindowedSequence<NucleotideCompound> w =
>>>            new WindowedSequence<NucleotideCompound>(d, KMER);
>>>    //Will return ATG & ATC
>>>    for(List<NucleotideCompound> subList: w) {
>>>        System.out.println(subList);
>>>    }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists
>>>> it
>>>> would save me some time to write the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jnarayan81 at gmail.com  Fri Oct 29 11:46:11 2010
From: jnarayan81 at gmail.com (jitendra narayan)
Date: Fri, 29 Oct 2010 17:16:11 +0530
Subject: [Biojava-l] New Biojava Logo
Message-ID: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>

Dear All
I have designed a n new biojava logo. Please see the detail of it:
http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
<http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your valuable
suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo


thanks

-- 
Jitendra Narayan
Bioinformatist
www.bioinformaticsonline.com


From genjasp at gmail.com  Fri Oct 29 13:05:57 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 29 Oct 2010 15:05:57 +0200
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
Message-ID: <AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>

Great Logo!!!

:D

2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> Dear All
> I have designed a n new biojava logo. Please see the detail of it:
> http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your valuable
> suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
>
>
> thanks
>
> --
> Jitendra Narayan
> Bioinformatist
> www.bioinformaticsonline.com
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Alessandro Cipriani
(+39) 3206009509
(+39) 3931311792
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz


From vishalthapar at gmail.com  Fri Oct 29 16:27:11 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Fri, 29 Oct 2010 12:27:11 -0400
Subject: [Biojava-l] K-mers
In-Reply-To: <A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
Message-ID: <AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>

Hi Andy,

This is good to have. I feel that including it as a part of core may not be
necessary but having it as part of Genomic module in biojava3 will be nice.
There is a project Bioinformatica
http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
does something similar although not exactly. It counts the k-mers in a
given fasta file but it does not count k-mers for each sequence within the
file, just all within a file. This is a good feature to have specially if
one is trying to find patterns within sequences which is what I am trying to
do. It would most certainly be helpful to have a k-mer counting algorithm
that counts k-mer frequency for each sequence. The way to go would be to use
suffix trees. Again I don't know if biojava has a suffix tree api or not
since I haven't used java in a while and am just switching back to it. A
paper on using suffix trees to generate genome wide k-mer frequencies is:
http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software
is tallymer). It would be some work to implement this in java as a module
for biojava3 but I can see that this will be helpful. Again, for small fasta
files, it might not be efficient to create a suffix tree but for bigger
files, I think that might be the way to go.

Thats just my two cents.What do you think?

-vishal

On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava
> at the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>    DNASequence d = new DNASequence("ATGATC");
>    System.out.println("Non-Overlap");
>    nonOverlap(d);
>    System.out.println("Overlap");
>    overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>    List<WindowedSequence<NucleotideCompound>> l =
>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>    for(int i=1; i<=KMER; i++) {
>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                i, d.getLength());
>        WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>        l.add(w);
>    }
>
>    //Will return ATG, ATC, TGA & GAT
>    for(WindowedSequence<NucleotideCompound> w: l) {
>        for(List<NucleotideCompound> subList: w) {
>            System.out.println(subList);
>        }
>    }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>    WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(d, KMER);
>    //Will return ATG & ATC
>    for(List<NucleotideCompound> subList: w) {
>        System.out.println(subList);
>    }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does mean it has to be since sub sequences are thin wrappers around an
> underlying sequence. Also the overlap solution is non-optimal since it
> iterates through each window rather than stepping through delegating onto
> each base in turn (hence why we get ATG & ATC before TGA)
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists
> it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


-- 
*Vishal Thapar, Ph.D.*
*Scientific informatics Analyst
Cold Spring Harbor Lab
Quick Bldg, Lowe Lab
1 Bungtown Road
Cold Spring Harbor, NY - 11724*


From phidias51 at gmail.com  Fri Oct 29 16:56:45 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 09:56:45 -0700
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
Message-ID: <AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>

It might be useful to make the K-mer storage mechanism pluggable.  This
would allow a developer to use anything from a simple MultiMap, to a NoSQL
key-value database to store K-mers.  You could plugin custom map
implementations to allow you to keep a count of the number of instances of
particular K-mers that were found.  It might also be useful to be able to do
set operations on those K-mer collections.  You could use it to determine
which K-mers were present in a pathogen and not in a host.
http://www.ncbi.nlm.nih.gov/pubmed/20428334
http://www.ncbi.nlm.nih.gov/pubmed/16403026

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com>wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as a part of core may not be
> necessary but having it as part of Genomic module in biojava3 will be nice.
> There is a project Bioinformatica
>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
> does something similar although not exactly. It counts the k-mers in a
> given fasta file but it does not count k-mers for each sequence within the
> file, just all within a file. This is a good feature to have specially if
> one is trying to find patterns within sequences which is what I am trying
> to
> do. It would most certainly be helpful to have a k-mer counting algorithm
> that counts k-mer frequency for each sequence. The way to go would be to
> use
> suffix trees. Again I don't know if biojava has a suffix tree api or not
> since I haven't used java in a while and am just switching back to it. A
> paper on using suffix trees to generate genome wide k-mer frequencies is:
> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> software
> is tallymer). It would be some work to implement this in java as a module
> for biojava3 but I can see that this will be helpful. Again, for small
> fasta
> files, it might not be efficient to create a suffix tree but for bigger
> files, I think that might be the way to go.
>
> Thats just my two cents.What do you think?
>
> -vishal
>
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>
> > Hi Vishal,
> >
> > As far as I am aware there is nothing which will generate them in BioJava
> > at the moment. However it is possible to do it with BioJava3:
> >
> > public static void main(String[] args) {
> >    DNASequence d = new DNASequence("ATGATC");
> >    System.out.println("Non-Overlap");
> >    nonOverlap(d);
> >    System.out.println("Overlap");
> >    overlap(d);
> > }
> >
> > public static final int KMER = 3;
> >
> > //Generate triplets overlapping
> > public static void overlap(Sequence<NucleotideCompound> d) {
> >    List<WindowedSequence<NucleotideCompound>> l =
> >            new ArrayList<WindowedSequence<NucleotideCompound>>();
> >    for(int i=1; i<=KMER; i++) {
> >        SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >                i, d.getLength());
> >        WindowedSequence<NucleotideCompound> w =
> >            new WindowedSequence<NucleotideCompound>(sub, KMER);
> >        l.add(w);
> >    }
> >
> >    //Will return ATG, ATC, TGA & GAT
> >    for(WindowedSequence<NucleotideCompound> w: l) {
> >        for(List<NucleotideCompound> subList: w) {
> >            System.out.println(subList);
> >        }
> >    }
> > }
> >
> > //Generate triplet Compound lists non-overlapping
> > public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >    WindowedSequence<NucleotideCompound> w =
> >            new WindowedSequence<NucleotideCompound>(d, KMER);
> >    //Will return ATG & ATC
> >    for(List<NucleotideCompound> subList: w) {
> >        System.out.println(subList);
> >    }
> > }
> >
> > The disadvantage of all of these solutions is that they generate lists of
> > Compounds so kmer generation can/will be a memory intensive operation.
> This
> > does mean it has to be since sub sequences are thin wrappers around an
> > underlying sequence. Also the overlap solution is non-optimal since it
> > iterates through each window rather than stepping through delegating onto
> > each base in turn (hence why we get ATG & ATC before TGA)
> >
> > As for unique k-mers that's something which would require a bit more
> > engineering & would be better suited to a solution built around a Trie
> > (prefix tree).
> >
> > Hope this helps,
> >
> > Andy
> >
> > On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >
> > > Hi All,
> > >
> > > I had a quick question: Does Biojava have a method to generate k-mers
> or
> > > K-mer counting in a given Fasta Sequence / File? Basically, I want
> k-mer
> > > counts for every sequence in a fasta file. If something like this
> exists
> > it
> > > would save me some time to write the code.
> > >
> > > Thanks,
> > >
> > > Vishal
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > --
> > Andrew Yates                   Ensembl Genomes Engineer
> > EMBL-EBI                       Tel: +44-(0)1223-492538
> > Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> > Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >
> >
> >
> >
> >
>
>
> --
> *Vishal Thapar, Ph.D.*
> *Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724*
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From ayates at ebi.ac.uk  Fri Oct 29 18:32:45 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:32:45 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
Message-ID: <C65577E4-26B7-4DE6-A5F2-B7E8994F2CF0@ebi.ac.uk>

Hi Vishal,

There's no suffix tree impl in BioJava but if you want to give it a shot then go for it :). I'm interested in how they work but no time to implement it. As for efficiency give it a shot & lets see what it does. 

Andy

On 29 Oct 2010, at 17:27, Vishal Thapar wrote:

> Hi Andy,
> 
> This is good to have. I feel that including it as a part of core may not be necessary but having it as part of Genomic module in biojava3 will be nice. There is a project Bioinformatica http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which does something similar although not exactly. It counts the k-mers in a given fasta file but it does not count k-mers for each sequence within the file, just all within a file. This is a good feature to have specially if one is trying to find patterns within sequences which is what I am trying to do. It would most certainly be helpful to have a k-mer counting algorithm that counts k-mer frequency for each sequence. The way to go would be to use suffix trees. Again I don't know if biojava has a suffix tree api or not since I haven't used java in a while and am just switching back to it. A paper on using suffix trees to generate genome wide k-mer frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software is tallymer). It would be some work to implement this in java as a module for biojava3 but I can see that this will be helpful. Again, for small fasta files, it might not be efficient to create a suffix tree but for bigger files, I think that might be the way to go.
> 
> Thats just my two cents.What do you think?
> 
> -vishal
> 
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Vishal,
> 
> As far as I am aware there is nothing which will generate them in BioJava at the moment. However it is possible to do it with BioJava3:
> 
> public static void main(String[] args) {
>    DNASequence d = new DNASequence("ATGATC");
>    System.out.println("Non-Overlap");
>    nonOverlap(d);
>    System.out.println("Overlap");
>    overlap(d);
> }
> 
> public static final int KMER = 3;
> 
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>    List<WindowedSequence<NucleotideCompound>> l =
>            new ArrayList<WindowedSequence<NucleotideCompound>>();
>    for(int i=1; i<=KMER; i++) {
>        SequenceView<NucleotideCompound> sub = d.getSubSequence(
>                i, d.getLength());
>        WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(sub, KMER);
>        l.add(w);
>    }
> 
>    //Will return ATG, ATC, TGA & GAT
>    for(WindowedSequence<NucleotideCompound> w: l) {
>        for(List<NucleotideCompound> subList: w) {
>            System.out.println(subList);
>        }
>    }
> }
> 
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>    WindowedSequence<NucleotideCompound> w =
>            new WindowedSequence<NucleotideCompound>(d, KMER);
>    //Will return ATG & ATC
>    for(List<NucleotideCompound> subList: w) {
>        System.out.println(subList);
>    }
> }
> 
> The disadvantage of all of these solutions is that they generate lists of Compounds so kmer generation can/will be a memory intensive operation. This does mean it has to be since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA)
> 
> As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).
> 
> Hope this helps,
> 
> Andy
> 
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> 
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> 
> 
> -- 
> Vishal Thapar, Ph.D.
> Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From ayates at ebi.ac.uk  Fri Oct 29 18:35:43 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:35:43 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
Message-ID: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>

So if it's a suffix tree that's quite a fixed data structure so the chances of developing a pluggable mechanism there would be hard. I think there also has to be a limit as to what we can sensibly do. If people want to contribute this kind of work though then it's all be very well received (with the corresponding test environment/cases of course).

Cheers,

Andy

On 29 Oct 2010, at 17:56, Mark Fortner wrote:

> It might be useful to make the K-mer storage mechanism pluggable.  This
> would allow a developer to use anything from a simple MultiMap, to a NoSQL
> key-value database to store K-mers.  You could plugin custom map
> implementations to allow you to keep a count of the number of instances of
> particular K-mers that were found.  It might also be useful to be able to do
> set operations on those K-mer collections.  You could use it to determine
> which K-mers were present in a pathogen and not in a host.
> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> http://www.ncbi.nlm.nih.gov/pubmed/16403026
> 
> Cheers,
> 
> Mark
> 
> card.ly: <http://card.ly/phidias51>
> 
> 
> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com>wrote:
> 
>> Hi Andy,
>> 
>> This is good to have. I feel that including it as a part of core may not be
>> necessary but having it as part of Genomic module in biojava3 will be nice.
>> There is a project Bioinformatica
>> 
>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>> does something similar although not exactly. It counts the k-mers in a
>> given fasta file but it does not count k-mers for each sequence within the
>> file, just all within a file. This is a good feature to have specially if
>> one is trying to find patterns within sequences which is what I am trying
>> to
>> do. It would most certainly be helpful to have a k-mer counting algorithm
>> that counts k-mer frequency for each sequence. The way to go would be to
>> use
>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>> since I haven't used java in a while and am just switching back to it. A
>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>> software
>> is tallymer). It would be some work to implement this in java as a module
>> for biojava3 but I can see that this will be helpful. Again, for small
>> fasta
>> files, it might not be efficient to create a suffix tree but for bigger
>> files, I think that might be the way to go.
>> 
>> Thats just my two cents.What do you think?
>> 
>> -vishal
>> 
>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> 
>>> Hi Vishal,
>>> 
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at the moment. However it is possible to do it with BioJava3:
>>> 
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>> 
>>> public static final int KMER = 3;
>>> 
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>               i, d.getLength());
>>>       WindowedSequence<NucleotideCompound> w =
>>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>       l.add(w);
>>>   }
>>> 
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>       for(List<NucleotideCompound> subList: w) {
>>>           System.out.println(subList);
>>>       }
>>>   }
>>> }
>>> 
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>           new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>   }
>>> }
>>> 
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>> 
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>> 
>>> Hope this helps,
>>> 
>>> Andy
>>> 
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I had a quick question: Does Biojava have a method to generate k-mers
>> or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>> k-mer
>>>> counts for every sequence in a fasta file. If something like this
>> exists
>>> it
>>>> would save me some time to write the code.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> --
>> *Vishal Thapar, Ph.D.*
>> *Scientific informatics Analyst
>> Cold Spring Harbor Lab
>> Quick Bldg, Lowe Lab
>> 1 Bungtown Road
>> Cold Spring Harbor, NY - 11724*
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jayunit100 at gmail.com  Fri Oct 29 18:40:46 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 14:40:46 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
Message-ID: <AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>

Hi guys : Im trying to break up a biojava project built on 1.7 into biojava
3, and am having to look up some modules etc...
Im having trouble finding biojava3 javadocs ?  Unfortunately, the
'googleable' java docs are all from 1.7 .....

Where is the formal/generated javadoc info for biojava3 ? is it online ?


From phidias51 at gmail.com  Fri Oct 29 18:48:53 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 11:48:53 -0700
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID: <AANLkTimiWU4PfB==xcVkgTo4TfEnNe5qoRuJUrvAUYVy@mail.gmail.com>

I was thinking more along the lines of using something that implements the
Map interface.  This would allow a developer to easily unit test the code
without having to load the data for a genome.  You would also be able to
provide different implementations to suit your needs.  If you wanted to use
a suffix tree as the underlying implementation, that would be OK, but you
would have other options as well.

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 11:35 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> So if it's a suffix tree that's quite a fixed data structure so the chances
> of developing a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work though then it's all be very well received
> (with the corresponding test environment/cases of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
> > It might be useful to make the K-mer storage mechanism pluggable.  This
> > would allow a developer to use anything from a simple MultiMap, to a
> NoSQL
> > key-value database to store K-mers.  You could plugin custom map
> > implementations to allow you to keep a count of the number of instances
> of
> > particular K-mers that were found.  It might also be useful to be able to
> do
> > set operations on those K-mer collections.  You could use it to determine
> > which K-mers were present in a pathogen and not in a host.
> > http://www.ncbi.nlm.nih.gov/pubmed/20428334
> > http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >
> > Cheers,
> >
> > Mark
> >
> > card.ly: <http://card.ly/phidias51>
> >
> >
> > On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar <vishalthapar at gmail.com
> >wrote:
> >
> >> Hi Andy,
> >>
> >> This is good to have. I feel that including it as a part of core may not
> be
> >> necessary but having it as part of Genomic module in biojava3 will be
> nice.
> >> There is a project Bioinformatica
> >>
> >>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
> >> does something similar although not exactly. It counts the k-mers in a
> >> given fasta file but it does not count k-mers for each sequence within
> the
> >> file, just all within a file. This is a good feature to have specially
> if
> >> one is trying to find patterns within sequences which is what I am
> trying
> >> to
> >> do. It would most certainly be helpful to have a k-mer counting
> algorithm
> >> that counts k-mer frequency for each sequence. The way to go would be to
> >> use
> >> suffix trees. Again I don't know if biojava has a suffix tree api or not
> >> since I haven't used java in a while and am just switching back to it. A
> >> paper on using suffix trees to generate genome wide k-mer frequencies
> is:
> >> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> >> software
> >> is tallymer). It would be some work to implement this in java as a
> module
> >> for biojava3 but I can see that this will be helpful. Again, for small
> >> fasta
> >> files, it might not be efficient to create a suffix tree but for bigger
> >> files, I think that might be the way to go.
> >>
> >> Thats just my two cents.What do you think?
> >>
> >> -vishal
> >>
> >> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As far as I am aware there is nothing which will generate them in
> BioJava
> >>> at the moment. However it is possible to do it with BioJava3:
> >>>
> >>> public static void main(String[] args) {
> >>>   DNASequence d = new DNASequence("ATGATC");
> >>>   System.out.println("Non-Overlap");
> >>>   nonOverlap(d);
> >>>   System.out.println("Overlap");
> >>>   overlap(d);
> >>> }
> >>>
> >>> public static final int KMER = 3;
> >>>
> >>> //Generate triplets overlapping
> >>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>   List<WindowedSequence<NucleotideCompound>> l =
> >>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>   for(int i=1; i<=KMER; i++) {
> >>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>               i, d.getLength());
> >>>       WindowedSequence<NucleotideCompound> w =
> >>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>       l.add(w);
> >>>   }
> >>>
> >>>   //Will return ATG, ATC, TGA & GAT
> >>>   for(WindowedSequence<NucleotideCompound> w: l) {
> >>>       for(List<NucleotideCompound> subList: w) {
> >>>           System.out.println(subList);
> >>>       }
> >>>   }
> >>> }
> >>>
> >>> //Generate triplet Compound lists non-overlapping
> >>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>   WindowedSequence<NucleotideCompound> w =
> >>>           new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>   //Will return ATG & ATC
> >>>   for(List<NucleotideCompound> subList: w) {
> >>>       System.out.println(subList);
> >>>   }
> >>> }
> >>>
> >>> The disadvantage of all of these solutions is that they generate lists
> of
> >>> Compounds so kmer generation can/will be a memory intensive operation.
> >> This
> >>> does mean it has to be since sub sequences are thin wrappers around an
> >>> underlying sequence. Also the overlap solution is non-optimal since it
> >>> iterates through each window rather than stepping through delegating
> onto
> >>> each base in turn (hence why we get ATG & ATC before TGA)
> >>>
> >>> As for unique k-mers that's something which would require a bit more
> >>> engineering & would be better suited to a solution built around a Trie
> >>> (prefix tree).
> >>>
> >>> Hope this helps,
> >>>
> >>> Andy
> >>>
> >>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I had a quick question: Does Biojava have a method to generate k-mers
> >> or
> >>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
> >> k-mer
> >>>> counts for every sequence in a fasta file. If something like this
> >> exists
> >>> it
> >>>> would save me some time to write the code.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Vishal
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>
> >>> --
> >>> Andrew Yates                   Ensembl Genomes Engineer
> >>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> *Vishal Thapar, Ph.D.*
> >> *Scientific informatics Analyst
> >> Cold Spring Harbor Lab
> >> Quick Bldg, Lowe Lab
> >> 1 Bungtown Road
> >> Cold Spring Harbor, NY - 11724*
> >> _______________________________________________
> >> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


From jbdundas at gmail.com  Fri Oct 29 18:50:11 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 00:20:11 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID: <AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>

I agree Andy. These have become standard functionalities that
scientists do these days. I am all for implementing that in BioJava3.
Java isn't that efficient for such functionalities so we will surely
need more effort compared to the same in Python/Perl.

Regards,
Jitesh Dundas

On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> So if it's a suffix tree that's quite a fixed data structure so the chances
> of developing a pluggable mechanism there would be hard. I think there also
> has to be a limit as to what we can sensibly do. If people want to
> contribute this kind of work though then it's all be very well received
> (with the corresponding test environment/cases of course).
>
> Cheers,
>
> Andy
>
> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>
>> It might be useful to make the K-mer storage mechanism pluggable.  This
>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>> key-value database to store K-mers.  You could plugin custom map
>> implementations to allow you to keep a count of the number of instances of
>> particular K-mers that were found.  It might also be useful to be able to
>> do
>> set operations on those K-mer collections.  You could use it to determine
>> which K-mers were present in a pathogen and not in a host.
>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>
>> Cheers,
>>
>> Mark
>>
>> card.ly: <http://card.ly/phidias51>
>>
>>
>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>> <vishalthapar at gmail.com>wrote:
>>
>>> Hi Andy,
>>>
>>> This is good to have. I feel that including it as a part of core may not
>>> be
>>> necessary but having it as part of Genomic module in biojava3 will be
>>> nice.
>>> There is a project Bioinformatica
>>>
>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>> does something similar although not exactly. It counts the k-mers in a
>>> given fasta file but it does not count k-mers for each sequence within
>>> the
>>> file, just all within a file. This is a good feature to have specially if
>>> one is trying to find patterns within sequences which is what I am trying
>>> to
>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>> that counts k-mer frequency for each sequence. The way to go would be to
>>> use
>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>> since I haven't used java in a while and am just switching back to it. A
>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>> software
>>> is tallymer). It would be some work to implement this in java as a module
>>> for biojava3 but I can see that this will be helpful. Again, for small
>>> fasta
>>> files, it might not be efficient to create a suffix tree but for bigger
>>> files, I think that might be the way to go.
>>>
>>> Thats just my two cents.What do you think?
>>>
>>> -vishal
>>>
>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> As far as I am aware there is nothing which will generate them in
>>>> BioJava
>>>> at the moment. However it is possible to do it with BioJava3:
>>>>
>>>> public static void main(String[] args) {
>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>   System.out.println("Non-Overlap");
>>>>   nonOverlap(d);
>>>>   System.out.println("Overlap");
>>>>   overlap(d);
>>>> }
>>>>
>>>> public static final int KMER = 3;
>>>>
>>>> //Generate triplets overlapping
>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>           new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>   for(int i=1; i<=KMER; i++) {
>>>>       SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>               i, d.getLength());
>>>>       WindowedSequence<NucleotideCompound> w =
>>>>           new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>       l.add(w);
>>>>   }
>>>>
>>>>   //Will return ATG, ATC, TGA & GAT
>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>       for(List<NucleotideCompound> subList: w) {
>>>>           System.out.println(subList);
>>>>       }
>>>>   }
>>>> }
>>>>
>>>> //Generate triplet Compound lists non-overlapping
>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>   WindowedSequence<NucleotideCompound> w =
>>>>           new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>   //Will return ATG & ATC
>>>>   for(List<NucleotideCompound> subList: w) {
>>>>       System.out.println(subList);
>>>>   }
>>>> }
>>>>
>>>> The disadvantage of all of these solutions is that they generate lists
>>>> of
>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>> iterates through each window rather than stepping through delegating
>>>> onto
>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>
>>>> As for unique k-mers that's something which would require a bit more
>>>> engineering & would be better suited to a solution built around a Trie
>>>> (prefix tree).
>>>>
>>>> Hope this helps,
>>>>
>>>> Andy
>>>>
>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>> or
>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>> k-mer
>>>>> counts for every sequence in a fasta file. If something like this
>>> exists
>>>> it
>>>>> would save me some time to write the code.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vishal
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>> --
>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Vishal Thapar, Ph.D.*
>>> *Scientific informatics Analyst
>>> Cold Spring Harbor Lab
>>> Quick Bldg, Lowe Lab
>>> 1 Bungtown Road
>>> Cold Spring Harbor, NY - 11724*
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From willishf at ufl.edu  Fri Oct 29 19:20:19 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 15:20:19 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
	<AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
Message-ID: <AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>

Jay

I don't think we have pushed the biojava3 docs up to a place where google
can find them. From the nightly build
http://www.biojava.org/download/maven/org/biojava/ you can find javadocs in
the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
into standalone jar files when possible but it is still a very cross
dependent code base. Then the newer modules labeled biojava3- are a clean
break from 1.7 so depending on what you are doing it may be easy/difficult
to start using the newer biojava3 code without lots of changes in your code.

Thanks

Scooter

On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas <jayunit100 at gmail.com> wrote:

> Hi guys : Im trying to break up a biojava project built on 1.7 into biojava
> 3, and am having to look up some modules etc...
> Im having trouble finding biojava3 javadocs ?  Unfortunately, the
> 'googleable' java docs are all from 1.7 .....
>
> Where is the formal/generated javadoc info for biojava3 ? is it online ?
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>


From markjschreiber at gmail.com  Fri Oct 29 19:25:12 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 29 Oct 2010 15:25:12 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
	<AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
	<AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>
Message-ID: <AANLkTi=dn1nx4sqs1wNdrOP3OYRXzdfd8sy7JPD+_UdD@mail.gmail.com>

It might pay to put the link to the docs on the top level page.

You may need to get an Admin to change the front page.

On Fri, Oct 29, 2010 at 3:20 PM, Scooter Willis <willishf at ufl.edu> wrote:

> Jay
>
> I don't think we have pushed the biojava3 docs up to a place where google
> can find them. From the nightly build
> http://www.biojava.org/download/maven/org/biojava/ you can find javadocs
> in
> the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
> into standalone jar files when possible but it is still a very cross
> dependent code base. Then the newer modules labeled biojava3- are a clean
> break from 1.7 so depending on what you are doing it may be easy/difficult
> to start using the newer biojava3 code without lots of changes in your
> code.
>
> Thanks
>
> Scooter
>
> On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas <jayunit100 at gmail.com> wrote:
>
> > Hi guys : Im trying to break up a biojava project built on 1.7 into
> biojava
> > 3, and am having to look up some modules etc...
> > Im having trouble finding biojava3 javadocs ?  Unfortunately, the
> > 'googleable' java docs are all from 1.7 .....
> >
> > Where is the formal/generated javadoc info for biojava3 ? is it online ?
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From ayates at ebi.ac.uk  Fri Oct 29 19:34:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 20:34:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
Message-ID: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>

So we've got some basic kmer work now in SVN. If you look in the class SequenceMixin there are two static methods there for generating the two types of k-mers. It's not developed with Map storage in mind & I'll leave the door open there for anyone else to come in & develop it. The k-mers are also not unique across the sequence but it's a start :)

Share & enjoy!

Andy

On 29 Oct 2010, at 19:50, jitesh dundas wrote:

> I agree Andy. These have become standard functionalities that
> scientists do these days. I am all for implementing that in BioJava3.
> Java isn't that efficient for such functionalities so we will surely
> need more effort compared to the same in Python/Perl.
> 
> Regards,
> Jitesh Dundas
> 
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So if it's a suffix tree that's quite a fixed data structure so the chances
>> of developing a pluggable mechanism there would be hard. I think there also
>> has to be a limit as to what we can sensibly do. If people want to
>> contribute this kind of work though then it's all be very well received
>> (with the corresponding test environment/cases of course).
>> 
>> Cheers,
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>> 
>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>>> key-value database to store K-mers.  You could plugin custom map
>>> implementations to allow you to keep a count of the number of instances of
>>> particular K-mers that were found.  It might also be useful to be able to
>>> do
>>> set operations on those K-mer collections.  You could use it to determine
>>> which K-mers were present in a pathogen and not in a host.
>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>> 
>>> Cheers,
>>> 
>>> Mark
>>> 
>>> card.ly: <http://card.ly/phidias51>
>>> 
>>> 
>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>> <vishalthapar at gmail.com>wrote:
>>> 
>>>> Hi Andy,
>>>> 
>>>> This is good to have. I feel that including it as a part of core may not
>>>> be
>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>> nice.
>>>> There is a project Bioinformatica
>>>> 
>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>> does something similar although not exactly. It counts the k-mers in a
>>>> given fasta file but it does not count k-mers for each sequence within
>>>> the
>>>> file, just all within a file. This is a good feature to have specially if
>>>> one is trying to find patterns within sequences which is what I am trying
>>>> to
>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>>> that counts k-mer frequency for each sequence. The way to go would be to
>>>> use
>>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>>> since I haven't used java in a while and am just switching back to it. A
>>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>> software
>>>> is tallymer). It would be some work to implement this in java as a module
>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>> fasta
>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>> files, I think that might be the way to go.
>>>> 
>>>> Thats just my two cents.What do you think?
>>>> 
>>>> -vishal
>>>> 
>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> 
>>>>> Hi Vishal,
>>>>> 
>>>>> As far as I am aware there is nothing which will generate them in
>>>>> BioJava
>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>> 
>>>>> public static void main(String[] args) {
>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>  System.out.println("Non-Overlap");
>>>>>  nonOverlap(d);
>>>>>  System.out.println("Overlap");
>>>>>  overlap(d);
>>>>> }
>>>>> 
>>>>> public static final int KMER = 3;
>>>>> 
>>>>> //Generate triplets overlapping
>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>              i, d.getLength());
>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>      l.add(w);
>>>>>  }
>>>>> 
>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>          System.out.println(subList);
>>>>>      }
>>>>>  }
>>>>> }
>>>>> 
>>>>> //Generate triplet Compound lists non-overlapping
>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>  //Will return ATG & ATC
>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>      System.out.println(subList);
>>>>>  }
>>>>> }
>>>>> 
>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>> of
>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>> This
>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>> iterates through each window rather than stepping through delegating
>>>>> onto
>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>> 
>>>>> As for unique k-mers that's something which would require a bit more
>>>>> engineering & would be better suited to a solution built around a Trie
>>>>> (prefix tree).
>>>>> 
>>>>> Hope this helps,
>>>>> 
>>>>> Andy
>>>>> 
>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>> or
>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>> k-mer
>>>>>> counts for every sequence in a fasta file. If something like this
>>>> exists
>>>>> it
>>>>>> would save me some time to write the code.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Vishal
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>>> --
>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Vishal Thapar, Ph.D.*
>>>> *Scientific informatics Analyst
>>>> Cold Spring Harbor Lab
>>>> Quick Bldg, Lowe Lab
>>>> 1 Bungtown Road
>>>> Cold Spring Harbor, NY - 11724*
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Fri Oct 29 19:43:38 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 01:13:38 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
Message-ID: <AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>

That is good news.Thanks for the directions Andy.

I have already started on this.Let me analyze and write the code now.

Maybe a next month deadline is not unreachable in this case.

Here we go!
JD

On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> So we've got some basic kmer work now in SVN. If you look in the class
> SequenceMixin there are two static methods there for generating the two
> types of k-mers. It's not developed with Map storage in mind & I'll leave
> the door open there for anyone else to come in & develop it. The k-mers are
> also not unique across the sequence but it's a start :)
>
> Share & enjoy!
>
> Andy
>
> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>
>> I agree Andy. These have become standard functionalities that
>> scientists do these days. I am all for implementing that in BioJava3.
>> Java isn't that efficient for such functionalities so we will surely
>> need more effort compared to the same in Python/Perl.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> So if it's a suffix tree that's quite a fixed data structure so the
>>> chances
>>> of developing a pluggable mechanism there would be hard. I think there
>>> also
>>> has to be a limit as to what we can sensibly do. If people want to
>>> contribute this kind of work though then it's all be very well received
>>> (with the corresponding test environment/cases of course).
>>>
>>> Cheers,
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>
>>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>> NoSQL
>>>> key-value database to store K-mers.  You could plugin custom map
>>>> implementations to allow you to keep a count of the number of instances
>>>> of
>>>> particular K-mers that were found.  It might also be useful to be able
>>>> to
>>>> do
>>>> set operations on those K-mer collections.  You could use it to
>>>> determine
>>>> which K-mers were present in a pathogen and not in a host.
>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>
>>>> Cheers,
>>>>
>>>> Mark
>>>>
>>>> card.ly: <http://card.ly/phidias51>
>>>>
>>>>
>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>> <vishalthapar at gmail.com>wrote:
>>>>
>>>>> Hi Andy,
>>>>>
>>>>> This is good to have. I feel that including it as a part of core may
>>>>> not
>>>>> be
>>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>>> nice.
>>>>> There is a project Bioinformatica
>>>>>
>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>>> does something similar although not exactly. It counts the k-mers in a
>>>>> given fasta file but it does not count k-mers for each sequence within
>>>>> the
>>>>> file, just all within a file. This is a good feature to have specially
>>>>> if
>>>>> one is trying to find patterns within sequences which is what I am
>>>>> trying
>>>>> to
>>>>> do. It would most certainly be helpful to have a k-mer counting
>>>>> algorithm
>>>>> that counts k-mer frequency for each sequence. The way to go would be
>>>>> to
>>>>> use
>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
>>>>> not
>>>>> since I haven't used java in a while and am just switching back to it.
>>>>> A
>>>>> paper on using suffix trees to generate genome wide k-mer frequencies
>>>>> is:
>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>>> software
>>>>> is tallymer). It would be some work to implement this in java as a
>>>>> module
>>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>>> fasta
>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>> files, I think that might be the way to go.
>>>>>
>>>>> Thats just my two cents.What do you think?
>>>>>
>>>>> -vishal
>>>>>
>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>> BioJava
>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>
>>>>>> public static void main(String[] args) {
>>>>>>  DNASequence d = new DNASequence("ATGATC");
>>>>>>  System.out.println("Non-Overlap");
>>>>>>  nonOverlap(d);
>>>>>>  System.out.println("Overlap");
>>>>>>  overlap(d);
>>>>>> }
>>>>>>
>>>>>> public static final int KMER = 3;
>>>>>>
>>>>>> //Generate triplets overlapping
>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>  List<WindowedSequence<NucleotideCompound>> l =
>>>>>>          new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>  for(int i=1; i<=KMER; i++) {
>>>>>>      SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>              i, d.getLength());
>>>>>>      WindowedSequence<NucleotideCompound> w =
>>>>>>          new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>      l.add(w);
>>>>>>  }
>>>>>>
>>>>>>  //Will return ATG, ATC, TGA & GAT
>>>>>>  for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>      for(List<NucleotideCompound> subList: w) {
>>>>>>          System.out.println(subList);
>>>>>>      }
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>  WindowedSequence<NucleotideCompound> w =
>>>>>>          new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>  //Will return ATG & ATC
>>>>>>  for(List<NucleotideCompound> subList: w) {
>>>>>>      System.out.println(subList);
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>> of
>>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>>> This
>>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>> iterates through each window rather than stepping through delegating
>>>>>> onto
>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>
>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>> (prefix tree).
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>> or
>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>> k-mer
>>>>>>> counts for every sequence in a fasta file. If something like this
>>>>> exists
>>>>>> it
>>>>>>> would save me some time to write the code.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Vishal
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>> --
>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Vishal Thapar, Ph.D.*
>>>>> *Scientific informatics Analyst
>>>>> Cold Spring Harbor Lab
>>>>> Quick Bldg, Lowe Lab
>>>>> 1 Bungtown Road
>>>>> Cold Spring Harbor, NY - 11724*
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


From jayunit100 at gmail.com  Fri Oct 29 21:39:34 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 17:39:34 -0400
Subject: [Biojava-l] JavaDocs and Backwards compatibility
Message-ID: <AANLkTin75ggrrpFNE7DhhgcYnxYd3yPEXjKnWPww2p2z@mail.gmail.com>

Thanks, I am now all up to date with biojava 3.0 and it really works well.

It really would be valuable to have some public biojava java docs !

This is because, for example, when I completely removed biojava 1.7, and
replaced it with biojava 3.0,  it was somewhat tedious to refactor/find old
classes under new package names, for example :

For example,

 org.biojava3.alignment.
SimpleSubstitutionMatrix;
 org.biojava3.alignment.template.SubstitutionMatrix;


From andreas at sdsc.edu  Fri Oct 29 21:59:23 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 29 Oct 2010 14:59:23 -0700
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To: <AANLkTi=dn1nx4sqs1wNdrOP3OYRXzdfd8sy7JPD+_UdD@mail.gmail.com>
References: <mailman.1519.1288310933.2958.biojava-l@lists.open-bio.org>
	<AANLkTi=Wv5Mv_DYz_MYX=G9cM27SpTmec5fsA+FJY3EL@mail.gmail.com>
	<AANLkTikFGuiRobDLHrn_CBkWyaT1Mweikeepyn9bSk3M@mail.gmail.com>
	<AANLkTi=dn1nx4sqs1wNdrOP3OYRXzdfd8sy7JPD+_UdD@mail.gmail.com>
Message-ID: <AANLkTimg8dFAXtWe1ZsiTeNLWQTk96LE8nfrpnxC3-Vn@mail.gmail.com>

Ideally I would like to see the automated build system also deploy the
latest javadocs on the website. I guess I should play around with the
maven site-plugin if it can do that ... or does anybody have a
recommendation for any other plugin?

Andreas

On Fri, Oct 29, 2010 at 12:25 PM, Mark Schreiber
<markjschreiber at gmail.com> wrote:
> It might pay to put the link to the docs on the top level page.
>
> You may need to get an Admin to change the front page.
>
> On Fri, Oct 29, 2010 at 3:20 PM, Scooter Willis <willishf at ufl.edu> wrote:
>
>> Jay
>>
>> I don't think we have pushed the biojava3 docs up to a place where google
>> can find them. From the nightly build
>> http://www.biojava.org/download/maven/org/biojava/ you can find javadocs
>> in
>> the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
>> into standalone jar files when possible but it is still a very cross
>> dependent code base. Then the newer modules labeled biojava3- are a clean
>> break from 1.7 so depending on what you are doing it may be easy/difficult
>> to start using the newer biojava3 code without lots of changes in your
>> code.
>>
>> Thanks
>>
>> Scooter
>>
>> On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas <jayunit100 at gmail.com> wrote:
>>
>> > Hi guys : Im trying to break up a biojava project built on 1.7 into
>> biojava
>> > 3, and am having to look up some modules etc...
>> > Im having trouble finding biojava3 javadocs ? ?Unfortunately, the
>> > 'googleable' java docs are all from 1.7 .....
>> >
>> > Where is the formal/generated javadoc info for biojava3 ? is it online ?
>> > _______________________________________________
>> > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>> >
>> _______________________________________________
>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From simon.rayner.cn at gmail.com  Fri Oct 29 23:38:13 2010
From: simon.rayner.cn at gmail.com (simon rayner)
Date: Sat, 30 Oct 2010 07:38:13 +0800
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
	<AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
Message-ID: <AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>

just a suggestion, but might beans falling out the cup suggest that biojava
is unstable?  just offering feedback, i still think it looks very slick!

On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani <genjasp at gmail.com>wrote:

> Great Logo!!!
>
> :D
>
> 2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> > Dear All
> > I have designed a n new biojava logo. Please see the detail of it:
> > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your
> valuable
> > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> >
> >
> > thanks
> >
> > --
> > Jitendra Narayan
> > Bioinformatist
> > www.bioinformaticsonline.com
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> Alessandro Cipriani
> (+39) 3206009509
> (+39) 3931311792
> http://www.cipriania.it
> skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com>
> msn:jaspzz
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Simon Rayner

State Key Laboratory of Virology
Wuhan Institute of Virology
Chinese Academy of Sciences
Wuhan, Hubei 430071
P.R.China

+86 (27) 87199895 (office)
+86 18627113001 (cell)


From phidias51 at gmail.com  Fri Oct 29 23:49:54 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 16:49:54 -0700
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
	<AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
	<AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>
Message-ID: <AANLkTi=S_5J+9OLH6g93U4i5MOKo0eP-7Q4vmx4nGJVK@mail.gmail.com>

The first logo looks nice; however, I don't see anything in it that connects
it to biology.  The second logo is too close to Oracle's logo, and I suspect
would require written permission from them in order to use it.

Cheers,

Mark

card.ly: <http://card.ly/phidias51>


On Fri, Oct 29, 2010 at 4:38 PM, simon rayner <simon.rayner.cn at gmail.com>wrote:

> just a suggestion, but might beans falling out the cup suggest that biojava
> is unstable?  just offering feedback, i still think it looks very slick!
>
> On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani <genjasp at gmail.com
> >wrote:
>
> > Great Logo!!!
> >
> > :D
> >
> > 2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> > > Dear All
> > > I have designed a n new biojava logo. Please see the detail of it:
> > > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > > <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your
> > valuable
> > > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> > >
> > >
> > > thanks
> > >
> > > --
> > > Jitendra Narayan
> > > Bioinformatist
> > > www.bioinformaticsonline.com
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> >
> >
> >
> > --
> > Alessandro Cipriani
> > (+39) 3206009509
> > (+39) 3931311792
> > http://www.cipriania.it
> > skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com> <
> skype%3Agenjasp at gmail.com <skype%253Agenjasp at gmail.com>>
> > msn:jaspzz
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> Simon Rayner
>
> State Key Laboratory of Virology
> Wuhan Institute of Virology
> Chinese Academy of Sciences
> Wuhan, Hubei 430071
> P.R.China
>
> +86 (27) 87199895 (office)
> +86 18627113001 (cell)
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From willishf at ufl.edu  Sat Oct 30 00:02:32 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 20:02:32 -0400
Subject: [Biojava-l] New Biojava Logo
In-Reply-To: <AANLkTi=S_5J+9OLH6g93U4i5MOKo0eP-7Q4vmx4nGJVK@mail.gmail.com>
References: <AANLkTikaqjFcb=Sq1jNw2meTjE8Y1n8xn9VZnzpyWQKV@mail.gmail.com>
	<AANLkTikK85HGhQwFDAgEc11muGBD0YAVbR1f7DQ-VxJ6@mail.gmail.com>
	<AANLkTikv=jYwQ7k+Ssd61KtZ4r5WTEa1E=k+YYbHriOs@mail.gmail.com>
	<AANLkTi=S_5J+9OLH6g93U4i5MOKo0eP-7Q4vmx4nGJVK@mail.gmail.com>
Message-ID: <AANLkTi=Rb8XAaNhT3bSkO6MNqAh8H2_tw39to-d3=15e@mail.gmail.com>

Jitendra

Could you morph from the coffee liquid to a DNA helix?

Scooter

On Fri, Oct 29, 2010 at 7:49 PM, Mark Fortner <phidias51 at gmail.com> wrote:

> The first logo looks nice; however, I don't see anything in it that
> connects
> it to biology.  The second logo is too close to Oracle's logo, and I
> suspect
> would require written permission from them in order to use it.
>
> Cheers,
>
> Mark
>
> card.ly: <http://card.ly/phidias51>
>
>
> On Fri, Oct 29, 2010 at 4:38 PM, simon rayner <simon.rayner.cn at gmail.com
> >wrote:
>
> > just a suggestion, but might beans falling out the cup suggest that
> biojava
> > is unstable?  just offering feedback, i still think it looks very slick!
> >
> > On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani <genjasp at gmail.com
> > >wrote:
> >
> > > Great Logo!!!
> > >
> > > :D
> > >
> > > 2010/10/29 jitendra narayan <jnarayan81 at gmail.com>:
> > > > Dear All
> > > > I have designed a n new biojava logo. Please see the detail of it:
> > > > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > > > <http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg>I need your
> > > valuable
> > > > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> > > >
> > > >
> > > > thanks
> > > >
> > > > --
> > > > Jitendra Narayan
> > > > Bioinformatist
> > > > www.bioinformaticsonline.com
> > > > _______________________________________________
> > > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > >
> > >
> > >
> > >
> > > --
> > > Alessandro Cipriani
> > > (+39) 3206009509
> > > (+39) 3931311792
> > > http://www.cipriania.it
> > > skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com> <
> skype%3Agenjasp at gmail.com <skype%253Agenjasp at gmail.com>> <
> > skype%3Agenjasp at gmail.com <skype%253Agenjasp at gmail.com> <
> skype%253Agenjasp at gmail.com <skype%25253Agenjasp at gmail.com>>>
> > > msn:jaspzz
> > >
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> >
> >
> >
> > --
> > Simon Rayner
> >
> > State Key Laboratory of Virology
> > Wuhan Institute of Virology
> > Chinese Academy of Sciences
> > Wuhan, Hubei 430071
> > P.R.China
> >
> > +86 (27) 87199895 (office)
> > +86 18627113001 (cell)
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>


From ayates at ebi.ac.uk  Sat Oct 30 09:20:30 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Sat, 30 Oct 2010 10:20:30 +0100
Subject: [Biojava-l] K-mers
In-Reply-To: <AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
	<AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
Message-ID: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>

You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or if Andreas can post up the continuous integration server address there will be a release tonight. 

Just goes to show you should always do more testing than you think :).

Andy

On 29 Oct 2010, at 20:43, jitesh dundas wrote:

> That is good news.Thanks for the directions Andy.
> 
> I have already started on this.Let me analyze and write the code now.
> 
> Maybe a next month deadline is not unreachable in this case.
> 
> Here we go!
> JD
> 
> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers are
>> also not unique across the sequence but it's a start :)
>> 
>> Share & enjoy!
>> 
>> Andy
>> 
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>> 
>>> I agree Andy. These have become standard functionalities that
>>> scientists do these days. I am all for implementing that in BioJava3.
>>> Java isn't that efficient for such functionalities so we will surely
>>> need more effort compared to the same in Python/Perl.
>>> 
>>> Regards,
>>> Jitesh Dundas
>>> 
>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> So if it's a suffix tree that's quite a fixed data structure so the
>>>> chances
>>>> of developing a pluggable mechanism there would be hard. I think there
>>>> also
>>>> has to be a limit as to what we can sensibly do. If people want to
>>>> contribute this kind of work though then it's all be very well received
>>>> (with the corresponding test environment/cases of course).
>>>> 
>>>> Cheers,
>>>> 
>>>> Andy
>>>> 
>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>> 
>>>>> It might be useful to make the K-mer storage mechanism pluggable.  This
>>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>>> NoSQL
>>>>> key-value database to store K-mers.  You could plugin custom map
>>>>> implementations to allow you to keep a count of the number of instances
>>>>> of
>>>>> particular K-mers that were found.  It might also be useful to be able
>>>>> to
>>>>> do
>>>>> set operations on those K-mer collections.  You could use it to
>>>>> determine
>>>>> which K-mers were present in a pathogen and not in a host.
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Mark
>>>>> 
>>>>> card.ly: <http://card.ly/phidias51>
>>>>> 
>>>>> 
>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>> <vishalthapar at gmail.com>wrote:
>>>>> 
>>>>>> Hi Andy,
>>>>>> 
>>>>>> This is good to have. I feel that including it as a part of core may
>>>>>> not
>>>>>> be
>>>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>>>> nice.
>>>>>> There is a project Bioinformatica
>>>>>> 
>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>>>> does something similar although not exactly. It counts the k-mers in a
>>>>>> given fasta file but it does not count k-mers for each sequence within
>>>>>> the
>>>>>> file, just all within a file. This is a good feature to have specially
>>>>>> if
>>>>>> one is trying to find patterns within sequences which is what I am
>>>>>> trying
>>>>>> to
>>>>>> do. It would most certainly be helpful to have a k-mer counting
>>>>>> algorithm
>>>>>> that counts k-mer frequency for each sequence. The way to go would be
>>>>>> to
>>>>>> use
>>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
>>>>>> not
>>>>>> since I haven't used java in a while and am just switching back to it.
>>>>>> A
>>>>>> paper on using suffix trees to generate genome wide k-mer frequencies
>>>>>> is:
>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>>>> software
>>>>>> is tallymer). It would be some work to implement this in java as a
>>>>>> module
>>>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>>>> fasta
>>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>>> files, I think that might be the way to go.
>>>>>> 
>>>>>> Thats just my two cents.What do you think?
>>>>>> 
>>>>>> -vishal
>>>>>> 
>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>> 
>>>>>>> Hi Vishal,
>>>>>>> 
>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>> BioJava
>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>> 
>>>>>>> public static void main(String[] args) {
>>>>>>> DNASequence d = new DNASequence("ATGATC");
>>>>>>> System.out.println("Non-Overlap");
>>>>>>> nonOverlap(d);
>>>>>>> System.out.println("Overlap");
>>>>>>> overlap(d);
>>>>>>> }
>>>>>>> 
>>>>>>> public static final int KMER = 3;
>>>>>>> 
>>>>>>> //Generate triplets overlapping
>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>> List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>> for(int i=1; i<=KMER; i++) {
>>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>             i, d.getLength());
>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>     l.add(w);
>>>>>>> }
>>>>>>> 
>>>>>>> //Will return ATG, ATC, TGA & GAT
>>>>>>> for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>>         System.out.println(subList);
>>>>>>>     }
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>> WindowedSequence<NucleotideCompound> w =
>>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>> //Will return ATG & ATC
>>>>>>> for(List<NucleotideCompound> subList: w) {
>>>>>>>     System.out.println(subList);
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>> of
>>>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>>>> This
>>>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>>> iterates through each window rather than stepping through delegating
>>>>>>> onto
>>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>> 
>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>> (prefix tree).
>>>>>>> 
>>>>>>> Hope this helps,
>>>>>>> 
>>>>>>> Andy
>>>>>>> 
>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>> 
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>>> or
>>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>>> k-mer
>>>>>>>> counts for every sequence in a fasta file. If something like this
>>>>>> exists
>>>>>>> it
>>>>>>>> would save me some time to write the code.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Vishal
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>> 
>>>>>>> --
>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> *Vishal Thapar, Ph.D.*
>>>>>> *Scientific informatics Analyst
>>>>>> Cold Spring Harbor Lab
>>>>>> Quick Bldg, Lowe Lab
>>>>>> 1 Bungtown Road
>>>>>> Cold Spring Harbor, NY - 11724*
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>> 
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>>>> --
>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>> 
>> 
>> 
>> 
>> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From jbdundas at gmail.com  Sat Oct 30 09:40:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 15:10:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
	<AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
	<1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID: <AANLkTikbG5xQG6uresAVQ4-QLVhUobQobEQ8ti8j=1nD@mail.gmail.com>

I got your point Andy. .Thanks.

On Sat, Oct 30, 2010 at 2:50 PM, Andy Yates <ayates at ebi.ac.uk> wrote:

> You should be aware I just found a bug in the code. This has been fixed but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or if Andreas can post up the continuous
> integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
> > That is good news.Thanks for the directions Andy.
> >
> > I have already started on this.Let me analyze and write the code now.
> >
> > Maybe a next month deadline is not unreachable in this case.
> >
> > Here we go!
> > JD
> >
> > On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> So we've got some basic kmer work now in SVN. If you look in the class
> >> SequenceMixin there are two static methods there for generating the two
> >> types of k-mers. It's not developed with Map storage in mind & I'll
> leave
> >> the door open there for anyone else to come in & develop it. The k-mers
> are
> >> also not unique across the sequence but it's a start :)
> >>
> >> Share & enjoy!
> >>
> >> Andy
> >>
> >> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
> >>
> >>> I agree Andy. These have become standard functionalities that
> >>> scientists do these days. I am all for implementing that in BioJava3.
> >>> Java isn't that efficient for such functionalities so we will surely
> >>> need more effort compared to the same in Python/Perl.
> >>>
> >>> Regards,
> >>> Jitesh Dundas
> >>>
> >>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> So if it's a suffix tree that's quite a fixed data structure so the
> >>>> chances
> >>>> of developing a pluggable mechanism there would be hard. I think there
> >>>> also
> >>>> has to be a limit as to what we can sensibly do. If people want to
> >>>> contribute this kind of work though then it's all be very well
> received
> >>>> (with the corresponding test environment/cases of course).
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Andy
> >>>>
> >>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
> >>>>
> >>>>> It might be useful to make the K-mer storage mechanism pluggable.
>  This
> >>>>> would allow a developer to use anything from a simple MultiMap, to a
> >>>>> NoSQL
> >>>>> key-value database to store K-mers.  You could plugin custom map
> >>>>> implementations to allow you to keep a count of the number of
> instances
> >>>>> of
> >>>>> particular K-mers that were found.  It might also be useful to be
> able
> >>>>> to
> >>>>> do
> >>>>> set operations on those K-mer collections.  You could use it to
> >>>>> determine
> >>>>> which K-mers were present in a pathogen and not in a host.
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> card.ly: <http://card.ly/phidias51>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
> >>>>> <vishalthapar at gmail.com>wrote:
> >>>>>
> >>>>>> Hi Andy,
> >>>>>>
> >>>>>> This is good to have. I feel that including it as a part of core may
> >>>>>> not
> >>>>>> be
> >>>>>> necessary but having it as part of Genomic module in biojava3 will
> be
> >>>>>> nice.
> >>>>>> There is a project Bioinformatica
> >>>>>>
> >>>>>>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
> >>>>>> does something similar although not exactly. It counts the k-mers in
> a
> >>>>>> given fasta file but it does not count k-mers for each sequence
> within
> >>>>>> the
> >>>>>> file, just all within a file. This is a good feature to have
> specially
> >>>>>> if
> >>>>>> one is trying to find patterns within sequences which is what I am
> >>>>>> trying
> >>>>>> to
> >>>>>> do. It would most certainly be helpful to have a k-mer counting
> >>>>>> algorithm
> >>>>>> that counts k-mer frequency for each sequence. The way to go would
> be
> >>>>>> to
> >>>>>> use
> >>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
> >>>>>> not
> >>>>>> since I haven't used java in a while and am just switching back to
> it.
> >>>>>> A
> >>>>>> paper on using suffix trees to generate genome wide k-mer
> frequencies
> >>>>>> is:
> >>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> >>>>>> software
> >>>>>> is tallymer). It would be some work to implement this in java as a
> >>>>>> module
> >>>>>> for biojava3 but I can see that this will be helpful. Again, for
> small
> >>>>>> fasta
> >>>>>> files, it might not be efficient to create a suffix tree but for
> bigger
> >>>>>> files, I think that might be the way to go.
> >>>>>>
> >>>>>> Thats just my two cents.What do you think?
> >>>>>>
> >>>>>> -vishal
> >>>>>>
> >>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk>
> wrote:
> >>>>>>
> >>>>>>> Hi Vishal,
> >>>>>>>
> >>>>>>> As far as I am aware there is nothing which will generate them in
> >>>>>>> BioJava
> >>>>>>> at the moment. However it is possible to do it with BioJava3:
> >>>>>>>
> >>>>>>> public static void main(String[] args) {
> >>>>>>> DNASequence d = new DNASequence("ATGATC");
> >>>>>>> System.out.println("Non-Overlap");
> >>>>>>> nonOverlap(d);
> >>>>>>> System.out.println("Overlap");
> >>>>>>> overlap(d);
> >>>>>>> }
> >>>>>>>
> >>>>>>> public static final int KMER = 3;
> >>>>>>>
> >>>>>>> //Generate triplets overlapping
> >>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>>>>> List<WindowedSequence<NucleotideCompound>> l =
> >>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>>>>> for(int i=1; i<=KMER; i++) {
> >>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>>>>>             i, d.getLength());
> >>>>>>>     WindowedSequence<NucleotideCompound> w =
> >>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>>>>>     l.add(w);
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Will return ATG, ATC, TGA & GAT
> >>>>>>> for(WindowedSequence<NucleotideCompound> w: l) {
> >>>>>>>     for(List<NucleotideCompound> subList: w) {
> >>>>>>>         System.out.println(subList);
> >>>>>>>     }
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Generate triplet Compound lists non-overlapping
> >>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>>>>> WindowedSequence<NucleotideCompound> w =
> >>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>>>>> //Will return ATG & ATC
> >>>>>>> for(List<NucleotideCompound> subList: w) {
> >>>>>>>     System.out.println(subList);
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The disadvantage of all of these solutions is that they generate
> lists
> >>>>>>> of
> >>>>>>> Compounds so kmer generation can/will be a memory intensive
> operation.
> >>>>>> This
> >>>>>>> does mean it has to be since sub sequences are thin wrappers around
> an
> >>>>>>> underlying sequence. Also the overlap solution is non-optimal since
> it
> >>>>>>> iterates through each window rather than stepping through
> delegating
> >>>>>>> onto
> >>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
> >>>>>>>
> >>>>>>> As for unique k-mers that's something which would require a bit
> more
> >>>>>>> engineering & would be better suited to a solution built around a
> Trie
> >>>>>>> (prefix tree).
> >>>>>>>
> >>>>>>> Hope this helps,
> >>>>>>>
> >>>>>>> Andy
> >>>>>>>
> >>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> I had a quick question: Does Biojava have a method to generate
> k-mers
> >>>>>> or
> >>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
> >>>>>> k-mer
> >>>>>>>> counts for every sequence in a fasta file. If something like this
> >>>>>> exists
> >>>>>>> it
> >>>>>>>> would save me some time to write the code.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Vishal
> >>>>>>>> _______________________________________________
> >>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>
> >>>>>>> --
> >>>>>>> Andrew Yates                   Ensembl Genomes Engineer
> >>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> *Vishal Thapar, Ph.D.*
> >>>>>> *Scientific informatics Analyst
> >>>>>> Cold Spring Harbor Lab
> >>>>>> Quick Bldg, Lowe Lab
> >>>>>> 1 Bungtown Road
> >>>>>> Cold Spring Harbor, NY - 11724*
> >>>>>> _______________________________________________
> >>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>
> >>>>> _______________________________________________
> >>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
> >>>> --
> >>>> Andrew Yates                   Ensembl Genomes Engineer
> >>>> EMBL-EBI                       Tel: +44-(0)1223-492538
> >>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
> >>
> >> --
> >> Andrew Yates                   Ensembl Genomes Engineer
> >> EMBL-EBI                       Tel: +44-(0)1223-492538
> >> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> >> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >>
> >>
> >>
> >>
> >>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


From andreas at sdsc.edu  Sat Oct 30 10:50:48 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sat, 30 Oct 2010 06:50:48 -0400
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <AANLkTikEawgnb5vedydwCazBoL=JQSLRPM7s02O-uqZh@mail.gmail.com>
	<A15AC3D6-FC5A-4EC4-8E96-8F675B281E3B@ebi.ac.uk>
	<AANLkTikcpqz6LJM3anG1RBZ4b0WPFC6Phyee2a3EoaGT@mail.gmail.com>
	<AANLkTikvqceacLQRxaXokkWsH2T0Yk-2oPCjat1BvQvt@mail.gmail.com>
	<06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
	<AANLkTinmcdw0csikLSBpBPrrhtDbHQea7iFLU6jEwn58@mail.gmail.com>
	<23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
	<AANLkTin1e4COTg8_50hW9xww1ym17xR6NZq4hLkyRBoG@mail.gmail.com>
	<1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID: <AANLkTik+WUeseWqDSnLkba6N+35xADYtnMQ9xVgGmDtp@mail.gmail.com>

just kicked off a new build.. alpha4 should be on the servers
shortly... you don't need cruisecontrol for a release. Anybody with an
ssh account on portal.open-bio (and set up ssh keys correctly) can do
mvn release:clean release:prepare release:perform

A

On Sat, Oct 30, 2010 at 5:20 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or if Andreas can post up the continuous integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
>> That is good news.Thanks for the directions Andy.
>>
>> I have already started on this.Let me analyze and write the code now.
>>
>> Maybe a next month deadline is not unreachable in this case.
>>
>> Here we go!
>> JD
>>
>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> So we've got some basic kmer work now in SVN. If you look in the class
>>> SequenceMixin there are two static methods there for generating the two
>>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>>> the door open there for anyone else to come in & develop it. The k-mers are
>>> also not unique across the sequence but it's a start :)
>>>
>>> Share & enjoy!
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>>
>>>> I agree Andy. These have become standard functionalities that
>>>> scientists do these days. I am all for implementing that in BioJava3.
>>>> Java isn't that efficient for such functionalities so we will surely
>>>> need more effort compared to the same in Python/Perl.
>>>>
>>>> Regards,
>>>> Jitesh Dundas
>>>>
>>>> On 10/30/10, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> So if it's a suffix tree that's quite a fixed data structure so the
>>>>> chances
>>>>> of developing a pluggable mechanism there would be hard. I think there
>>>>> also
>>>>> has to be a limit as to what we can sensibly do. If people want to
>>>>> contribute this kind of work though then it's all be very well received
>>>>> (with the corresponding test environment/cases of course).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Andy
>>>>>
>>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>>
>>>>>> It might be useful to make the K-mer storage mechanism pluggable. ?This
>>>>>> would allow a developer to use anything from a simple MultiMap, to a
>>>>>> NoSQL
>>>>>> key-value database to store K-mers. ?You could plugin custom map
>>>>>> implementations to allow you to keep a count of the number of instances
>>>>>> of
>>>>>> particular K-mers that were found. ?It might also be useful to be able
>>>>>> to
>>>>>> do
>>>>>> set operations on those K-mer collections. ?You could use it to
>>>>>> determine
>>>>>> which K-mers were present in a pathogen and not in a host.
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> card.ly: <http://card.ly/phidias51>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>>> <vishalthapar at gmail.com>wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> This is good to have. I feel that including it as a part of core may
>>>>>>> not
>>>>>>> be
>>>>>>> necessary but having it as part of Genomic module in biojava3 will be
>>>>>>> nice.
>>>>>>> There is a project Bioinformatica
>>>>>>>
>>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequencewhich
>>>>>>> does something similar although not exactly. It counts the k-mers in a
>>>>>>> given fasta file but it does not count k-mers for each sequence within
>>>>>>> the
>>>>>>> file, just all within a file. This is a good feature to have specially
>>>>>>> if
>>>>>>> one is trying to find patterns within sequences which is what I am
>>>>>>> trying
>>>>>>> to
>>>>>>> do. It would most certainly be helpful to have a k-mer counting
>>>>>>> algorithm
>>>>>>> that counts k-mer frequency for each sequence. The way to go would be
>>>>>>> to
>>>>>>> use
>>>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or
>>>>>>> not
>>>>>>> since I haven't used java in a while and am just switching back to it.
>>>>>>> A
>>>>>>> paper on using suffix trees to generate genome wide k-mer frequencies
>>>>>>> is:
>>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>>>>>>> software
>>>>>>> is tallymer). It would be some work to implement this in java as a
>>>>>>> module
>>>>>>> for biojava3 but I can see that this will be helpful. Again, for small
>>>>>>> fasta
>>>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>>>> files, I think that might be the way to go.
>>>>>>>
>>>>>>> Thats just my two cents.What do you think?
>>>>>>>
>>>>>>> -vishal
>>>>>>>
>>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>>> BioJava
>>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>>>
>>>>>>>> public static void main(String[] args) {
>>>>>>>> DNASequence d = new DNASequence("ATGATC");
>>>>>>>> System.out.println("Non-Overlap");
>>>>>>>> nonOverlap(d);
>>>>>>>> System.out.println("Overlap");
>>>>>>>> overlap(d);
>>>>>>>> }
>>>>>>>>
>>>>>>>> public static final int KMER = 3;
>>>>>>>>
>>>>>>>> //Generate triplets overlapping
>>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>> List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>> ? ? ? ? new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>> for(int i=1; i<=KMER; i++) {
>>>>>>>> ? ? SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>> ? ? ? ? ? ? i, d.getLength());
>>>>>>>> ? ? WindowedSequence<NucleotideCompound> w =
>>>>>>>> ? ? ? ? new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>> ? ? l.add(w);
>>>>>>>> }
>>>>>>>>
>>>>>>>> //Will return ATG, ATC, TGA & GAT
>>>>>>>> for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>> ? ? for(List<NucleotideCompound> subList: w) {
>>>>>>>> ? ? ? ? System.out.println(subList);
>>>>>>>> ? ? }
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>> WindowedSequence<NucleotideCompound> w =
>>>>>>>> ? ? ? ? new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>> //Will return ATG & ATC
>>>>>>>> for(List<NucleotideCompound> subList: w) {
>>>>>>>> ? ? System.out.println(subList);
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>>> of
>>>>>>>> Compounds so kmer generation can/will be a memory intensive operation.
>>>>>>> This
>>>>>>>> does mean it has to be since sub sequences are thin wrappers around an
>>>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>>>> iterates through each window rather than stepping through delegating
>>>>>>>> onto
>>>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>>>
>>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>>> (prefix tree).
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers
>>>>>>> or
>>>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want
>>>>>>> k-mer
>>>>>>>>> counts for every sequence in a fasta file. If something like this
>>>>>>> exists
>>>>>>>> it
>>>>>>>>> would save me some time to write the code.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Vishal
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
>>>>>>>> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
>>>>>>>> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
>>>>>>>> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Vishal Thapar, Ph.D.*
>>>>>>> *Scientific informatics Analyst
>>>>>>> Cold Spring Harbor Lab
>>>>>>> Quick Bldg, Lowe Lab
>>>>>>> 1 Bungtown Road
>>>>>>> Cold Spring Harbor, NY - 11724*
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>>> --
>>>>> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
>>>>> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
>>>>> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
>>>>> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>
>>> --
>>> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
>>> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>
> --
> Andrew Yates ? ? ? ? ? ? ? ? ? Ensembl Genomes Engineer
> EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus ? Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK ? ? ? ? http://www.ensemblgenomes.org/
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From dasarnow at gmail.com  Sun Oct 31 23:56:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 31 Oct 2010 16:56:05 -0700
Subject: [Biojava-l] Superimposing structure pieces
Message-ID: <AANLkTimubnTZ3qKFvVqhbNpFuV+EjVNBV1=4k+XMnQLj@mail.gmail.com>

I've been trying to pull out pieces of protein chains and superimpose
them...my current code (as generic-ified code snips below) works, but
I wonder if it couldn't be faster.
Has anyone worked on similar methods?  Any other advice?

Best regards everyone,
da

Getting residue CA's as Atom[]:

for (int i; i < length; i++) {
    someAtoms[i] = someChain.getSeqResGroup(start + i).getAtom("CA");
}

Superimposing/aligning:

SVDSuperimposer svds = new SVDSuperimposer(someAtoms1, someAtoms2);
Matrix rot = svds.getRotation();
Atom trans = svds.getTranslation();
for (int i = 0; i < length; i++) {
    Calc.rotate(someAtoms1[i], rot);
    Calc.shift(someAtoms1[i], trans);
}
SVDSuperimposer.getRmsd(someAtoms1, someAtoms2);