From andreas at sdsc.edu Fri Mar 5 11:56:40 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 5 Mar 2010 08:56:40 -0800
Subject: [Biojava-dev] Google summer of code
Message-ID: <59a41c431003050856v17c83b80sf1fb59f2587c9cd1@mail.gmail.com>

Hi,

The Open Bioinformatics Foundation (BioJava's mother organisation) is
preparing an application for the Google Summer of Code. If you are
interested in becoming a mentor for a BioJava related project, you can
join us in the application. If you are a student and are interested in a
project, please take a look at these pages:

http://www.open-bio.org/wiki/Google_Summer_of_Code

http://biojava.org/wiki/Google_Summer_of_Code

Andreas

From yogeshp08 at gmail.com Sat Mar 6 14:38:13 2010
From: yogeshp08 at gmail.com (Yogesh)
Date: Sat, 6 Mar 2010 14:38:13 -0500
Subject: [Biojava-dev] Modules + GSoC2010
Message-ID: <193861401003061138gbd0fa77t785eaa15a25a971c@mail.gmail.com>

Hello,

I am a graduate student in Bioinformatics. I am thrilled to know that OBF
is participating in GSoC2010. I also wish to participate in GSoC2010 for
the first time this year. I would like to apply for a project related to
BioJava. I am very comfortable with Java. Also, I use BioJava very often.

One of the projects from BioJava::Modules that I like and I think I can do
is: Support for SCOP file parsing. Can I have some help on how to go about
this project?

Another project that I would like to contribute to is: Develop a multiple
sequence alignment algorithm entirely written in Java. More info on this
will also help me decide on which project to apply for in GSoC2010.

Thank you.
Regards,
-Yogesh

From holland at eaglegenomics.com Mon Mar 15 06:34:14 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 15 Mar 2010 10:34:14 +0000
Subject: [Biojava-dev] Hackathon in Boston, July 2010
Message-ID: <5FC2D8EC-5408-4126-9A7D-CB6B3500B61C@eaglegenomics.com>

Hi all,

Following the successful hackathon in Cambridge earlier this year, it was
originally planned to hold a second one in Boston in conjunction with BOSC
in order to give those who couldn't make it to the UK a chance to get
involved. However, OBF have beaten us to it by organising a cross-project
CodeFest!

http://www.open-bio.org/wiki/Codefest_2010

It would be great for BioJava people to get involved with this
cross-project hackathon effort, and it saves organising one of our own! :)

All relevant info is on the web page linked to above, and if you have any
questions, ask Brad as detailed on the page.

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andreas at sdsc.edu Tue Mar 16 11:57:38 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 16 Mar 2010 08:57:38 -0700
Subject: [Biojava-dev] biojava 3 progress
Message-ID: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>

Hi,

ISMB/BOSC is coming up rapidly and we should start to prepare for the annual
BioJava release. As such it would be a good moment to discuss the current
status of the various new BioJava 3 modules.

The biojava-structure, biojava-structure-gui modules are essentially ready
for release and I started to update the Cookbook with the latest features
http://biojava.org/wiki/BioJava:CookBook:PDB:align

Some of the re-factored modules based on biojava 1.7 could be released
anytime soon as well. The documentation just needs to be updated to explain
where the functionality can be found now (e.g.
alignment module)

What about the new code that has been under development since the hackathon?
Is it getting release ready slowly? Any plans for documentation? What is
missing before we can make the first Biojava 3 release?

Andreas

From ayates at ebi.ac.uk Tue Mar 16 13:21:48 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 16 Mar 2010 17:21:48 +0000
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
Message-ID: <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>

It's getting ready very slowly. Currently we need:

* Locations correctly implemented
** There's no way of requesting subseqs from them at the moment
* Feature on sequences support
* Extra attributes which do not fit into top-level attributes
* Mapping between sequences/assemblies
* circular location support
** so no checks on start being less than end
* Documentation

Think that's it off the top of my head

Andy

--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From HWillis at scripps.edu Tue Mar 16 14:51:04 2010
From: HWillis at scripps.edu (Scooter Willis)
Date: Tue, 16 Mar 2010 14:51:04 -0400
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To: <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
 <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
Message-ID:

I am working on adding in additional features to the core module to round
things out and will be able to do docs/wiki examples. I will be working on
Features with the new sequence model and the ability to pull features from
uniprot based on uniprot id as an example. I will use uniprot XML as the
data model when figuring out the feature data model such that classes have
biology relevance instead of being completely abstract.

I will also see if I can do something with NCBI for genome sequence data
where you don't need to download the entire sequence but based on gff
annotations you can pull dna sequences for exons belonging to a particular
gene.

I will also plan on migrating the sequence alignment code as well.

I think the focus for this release should be on the modularization of the
modules and the maven integration. We also need to provide a repository for
those who are not going to use maven and need just the jar files. We can
then highlight the newer modules as a benefit of the modularization.

I am planning on attending ISMB/BOSC.

Do we want to put some deadlines in place with a mini-project plan?

Thanks

Scooter

From andreas at sdsc.edu Tue Mar 16 16:58:02 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 16 Mar 2010 13:58:02 -0700
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To:
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
 <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
Message-ID: <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com>

Ok, cool. Thanks for all this state-of-the-art pushing there... Which parts
do you think would be feasible to finish, if we would say we are planning a
release e.g. early May? We can have a follow-up to this release once the
next round of features have been added. Probably it makes sense to focus on
stabilizing what is currently there and documenting it, rather than trying
to be feature-complete. Critical features that are still missing should be
added of course...

Andreas

From ayates at ebi.ac.uk Wed Mar 17 11:28:33 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 17 Mar 2010 15:28:33 +0000
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To: <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com>
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
 <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
 <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com>
Message-ID: <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk>

I think features are possible & this is really the missing piece of the
puzzle with this project. How far on are you with them Scooter?

--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From HWillis at scripps.edu Wed Mar 17 11:52:01 2010
From: HWillis at scripps.edu (Scooter Willis)
Date: Wed, 17 Mar 2010 11:52:01 -0400
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To: <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk>
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
 <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
 <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com>
 <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk>
Message-ID:

Andy

Working on it at the moment. I am starting with some code I have been using
from JavaGene that has a fairly good handle of gff parsing and handling
negative strands. I am migrating to a new project called biojava3-genes
(local only at the moment) where code related to gff parsing and dealing
with various gene prediction program outputs can be used. I need to create a
training file for GlimmerHMM so the short term goal is to take an XML blast
output of predicted genes that match uniprot and then extract the exon
features from DNASequences with exon features added from a gff file. I will
then use these validated exon features to create the GlimmerHMM training
file. The complexity of exon features with negative strand and frame shifts
with the ability to splice together a coding sequence is probably the most
complicated feature example we will encounter. After I get through that I
will see what can be extended/refactored etc for other more generic
features.

I also have some code to gather genome characteristics (GC percent, avg gene
length, etc.) that can be included in the biojava3-genes module. I wanted to
see if you know how Average Number of Introns per gene is calculated when a
gene has no introns. Do you add a 0 to the average or only include genes
with at least one intron in the average?

Can you think of a better name for a package that deals with gff, gff3
parsing and utilities to work with various gene prediction inputs/outputs?

Scooter

From ayates at ebi.ac.uk Wed Mar 17 12:04:50 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 17 Mar 2010 16:04:50 +0000
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To:
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
 <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
 <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com>
 <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk>
Message-ID: <4A0FFCFF-9EAA-4B27-BF11-AFD6D4CFEAE0@ebi.ac.uk>

Hey mate,

Sounds good anything with good GFF support is something hard to come by :).
So you're going to get it working for the non-generic structures & then push
it out into the core modules if I'm reading what you said correctly?

Add 0 to the percentage & make sure the docs describe what it's doing. Even
if a gene has no introns it still affects the average of introns in a
genome :).

All I can think of is "biojava3-features". Not sure what "biojava3-genes"
says. Maybe it goes into an "io" package ... say one which goes with an
EMBL/Genbank/CHADO formatter maybe. Naming is a horrible thing to have to
do.

Andy

--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From HWillis at scripps.edu Wed Mar 17 12:09:29 2010
From: HWillis at scripps.edu (Scooter Willis)
Date: Wed, 17 Mar 2010 12:09:29 -0400
Subject: [Biojava-dev] biojava 3 progress
In-Reply-To: <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk>
References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com>
 <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk>
 <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com>
 <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk>
Message-ID:

Andy

Let me know if you have any major code changes for the core sequence
handling that have been or could be checked in. So far I haven't needed to
touch any of the core sequence code but want to avoid merging code if you
have made any significant changes. I should have code to check in today and
if we can't come up with a better name I will ask Andreas to create a
biojava3-genes module and I can then check that code in for your review.
The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this is that a TranscriptSequence is also a feature of a DNA sequence, and only when you want a stand-alone class with internal links back to the parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. You can ask for an ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence, the corresponding ExonFeatures, and some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where the end of one exon and the start of the next exon together code for a particular amino acid sequence position. We also need to add the ability to have tracks as a way to group features. This way you can export the features of a particular track as a GFF/GFF3 file for importing into various genome browsers. For example, you have one genome you are working on, with genes added in from three different gene prediction algorithms, each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via BLAST against UniProt and create another summary track of validated genes and non-validated genes. If the feature classes we put together can make this easy then I think we will have a solid design.
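The mapping from a protein position back to its three DNA positions could be sketched roughly as follows. This is only an illustration under stated assumptions: `CodonMapper` and `Exon` are hypothetical names, not existing BioJava classes, the exons are on the forward strand with 1-based inclusive coordinates, and the coding sequence is assumed to start at the first base of the first exon. Negative strands and frame shifts are not handled here.

```java
import java.util.Arrays;
import java.util.List;

public class CodonMapper {

    /** Hypothetical stand-in for an exon feature: 1-based, inclusive, forward strand. */
    static final class Exon {
        final int start;
        final int end;

        Exon(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start + 1;
        }
    }

    /** Maps a 0-based offset in the spliced coding sequence to a genomic position. */
    static int toGenomic(List<Exon> exons, int codingOffset) {
        int remaining = codingOffset;
        for (Exon exon : exons) {
            if (remaining < exon.length()) {
                return exon.start + remaining;
            }
            // Offset lies beyond this exon, so skip over it.
            remaining -= exon.length();
        }
        throw new IllegalArgumentException("Offset " + codingOffset + " is past the last exon");
    }

    /** Returns the three genomic positions coding for a 1-based amino acid position. */
    static int[] codonPositions(List<Exon> exons, int aminoAcidPosition) {
        if (aminoAcidPosition < 1) {
            throw new IllegalArgumentException("Amino acid positions are 1-based");
        }
        int first = (aminoAcidPosition - 1) * 3;
        return new int[] {
            toGenomic(exons, first),
            toGenomic(exons, first + 1),
            toGenomic(exons, first + 2)
        };
    }

    public static void main(String[] args) {
        // Exon 1 holds 4 bases, so the second codon spans the exon boundary.
        List<Exon> exons = Arrays.asList(new Exon(101, 104), new Exon(201, 210));
        System.out.println(Arrays.toString(codonPositions(exons, 2)));
    }
}
```

Each base is mapped individually, so a codon split across two exons needs no special case: the second amino acid here resolves to the last base of the first exon plus the first two bases of the second.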
Scooter From HWillis at scripps.edu Wed Mar 17 12:14:02 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 12:14:02 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <4A0FFCFF-9EAA-4B27-BF11-AFD6D4CFEAE0@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <4A0FFCFF-9EAA-4B27-BF11-AFD6D4CFEAE0@ebi.ac.uk> Message-ID: Andy I have two methods that calculate avg introns per gene both ways. Just wasn't sure what the standard is for reporting. I think features should be part of the core because it is abstract regardless of the source that generated the feature. For the code related to gene prediction work that probably should be in a different package because it is not general. Calling it biojava-geneprediction also doesn't work because it implies gene prediction. Scooter On Mar 17, 2010, at 12:04 PM, Andy Yates wrote: > Hey mate, > > Sounds good anything with good GFF support is something hard to come by :). So you're going to get it working for the non-generic structures & then push it out into the core modules if I'm reading what you said correctly? > > Add 0 to the percentage & make sure the docs describe what it's doing. Even if a gene has no introns it still affects the average of introns in a genome :). > > All I can think of is "biojava3-features". Not sure what "biojava3-genes" says. Maybe it goes into an "io" package ... say one which goes with an EMBL/Genbank/CHADO formatter maybe. Naming is a horrible thing to have to do. > > Andy > > On 17 Mar 2010, at 15:52, Scooter Willis wrote: > >> Andy >> >> Working on it at the moment. I am starting with some code I have been using from JavaGene that has a fairly good handle of gff parsing and handling negative strands. 
I am migrating to a new project called biojava3-genes(local only at the moment) where code related to gff parsing and dealing with various gene prediction program outputs can be used. I need to create a training file for GlimmerHMM so the short term goal is to take a XML blast output of predicted genes that match uniprot and then extract the exon features from DNASequences with exon features added from a gff file. I will then use these validated exon features to create the GlimmerHMM training file. The complexity of exon features with negative strand and frame shifts with the ability to splice together a coding sequence is probably the most complicated feature example we will encounter. After I get through that I will see what can be extended/refactored etc for other more generic features. >> >> I also have some code to gather genome characteristics GC percent, avg gene length, etc. that can be included in the biojava3-genes module. I wanted to see if you know how Average Number of Introns per gene is calculated when a gene has no introns. Do you add a 0 to the average or only include genes with at least one intron in the average? >> >> Can you think of a better name for a package that deals with gff,gff3 parsing and utilities to work with various gene prediction inputs/outputs? >> >> Scooter >> >> >> >> >> >> On Mar 17, 2010, at 11:28 AM, Andy Yates wrote: >> >>> I think features are possible & this is really the missing piece of the puzzle with this project. How far on are you with them Scooter? >>> >>> On 16 Mar 2010, at 20:58, Andreas Prlic wrote: >>> >>>> Ok, cool. Thanks for all this state-of-the-art pushing there... Which parts do you think would be feasible to finish, if we would say we are planning a release e.g. early May ? We can have a follow-up to this release once the next round of features have been added. Probably it makes sense to focus on stabilizing what is currently there and documenting it, rather than trying to be feature-complete. 
Critical features that are still missing should be added of course... >>>> >>>> Andreas >>>> >>>> On Tue, Mar 16, 2010 at 11:51 AM, Scooter Willis wrote: >>>> I am working on adding in additional features to the core module to round things out and will be able to do docs/wiki examples. I will be working on Features with the new sequence model and the ability to pull features from uniprot based on uniprot id as an example. I will use uniprot XML as the data model when figuring out the feature data model such that classes have biology relevance instead of being completely abstract. >>>> >>>> I will also see if I can do something with NCBI for genome sequence data where you don't need to download the entire sequence but based on gff annotations you can pull dna sequences for exons belonging to a particular gene. >>>> >>>> I will also plan on migrating the sequence alignment code as well. >>>> >>>> I think the focus for this release should be on the modularization of the modules and the maven integration. We also need to provide a repository for those who are not going to use maven and need just the jar files. We can then highlight the newer modules as a benefit of the modularization. >>>> >>>> I am planning on attending ISMB/BOSC. >>>> >>>> Do we want to put some deadlines in place with a mini-project plan? >>>> >>>> Thanks >>>> >>>> Scooter >>>> >>>> >>>> On Mar 16, 2010, at 1:21 PM, Andy Yates wrote: >>>> >>>>> It's getting ready very slowly. 
Currently we need: >>>>> >>>>> * Locations correctly implemented >>>>> ** There's no way of requesting subseqs from them atmo >>>>> * Feature on sequences support >>>>> * Extra attributes which do not fit into top-level attributes >>>>> * Mapping between sequences/assemblies >>>>> * circular location support >>>>> ** so no checks on start being less than end >>>>> * Documentation >>>>> >>>>> Think that's it off the top of my head >>>>> >>>>> Andy >>>>> >>>>> On 16 Mar 2010, at 15:57, Andreas Prlic wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> ISMB/BOSC is coming up rapidly and we should start to prepare for the annual >>>>>> BioJava release. As such it would be a good moment to discuss the current >>>>>> status of the various new BioJava 3 modules. >>>>>> >>>>>> The biojava-structure, biojava-structure-gui modules are essentially ready >>>>>> for release and I started to update the Cookbook with the latest features >>>>>> http://biojava.org/wiki/BioJava:CookBook:PDB:align >>>>>> >>>>>> Some of the re-factored modules based on biojava 1.7 could be released >>>>>> anytime soon as well. The documentation just needs to be updated to explain >>>>>> where the functionality can be found now (e.g. alignment module) >>>>>> >>>>>> What about the new code that has been under development since the hackathon? >>>>>> Is it getting release ready slowly? Any plans for documentation? What is >>>>>> missing before we can make the first Biojava 3 release? 
>>>>>> >>>>>> Andreas >>>>>> _______________________________________________ >>>>>> biojava-dev mailing list >>>>>> biojava-dev at lists.open-bio.org >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>>>> >>>>> -- >>>>> Andrew Yates Ensembl Genomes Engineer >>>>> EMBL-EBI Tel: +44-(0)1223-492538 >>>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >>>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> biojava-dev mailing list >>>>> biojava-dev at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>>> >>>> >>> >>> -- >>> Andrew Yates Ensembl Genomes Engineer >>> EMBL-EBI Tel: +44-(0)1223-492538 >>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >>> >>> >>> >>> >> > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > From andreas at sdsc.edu Wed Mar 17 13:46:19 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 17 Mar 2010 10:46:19 -0700 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> Message-ID: <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> I like biojava-feature as a module name for the GFF and features related code. (should we try to keep the module names singular?) Let me know if you want me to create the module for this... A On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: > Andy > > Let me know if you have any major code changes for the core sequencing > handling that have been or could be checked in. 
So far I haven't needed to > touch any of the core sequence code but want to avoid merging code if you > have made any significant changes. > > I should have code to check in today and if we can't come up with a better > name I will ask Andreas to create a biojava3-genes module and I can then > check that code in for your review. The current problem is that we have > ExonSequence extending DNASequence when it could also be described as a > feature. One way to look at this that a TranscriptSequence is also a feature > of a DNA sequence and only when you want to have a stand alone class with > internal links back to parent sequence do you return a TranscriptSequence. > The TranscriptFeature would have ExonFeature and IntronFeature as children. > You can ask for a ExonSequence based on the ExonFeature. Once you get a > ProteinSequence you should be able to reverse the process and get back the > TranscriptSequence and the corresponding ExonFeatures and some sort of > mapping from a protein sequence position back to the three DNA sequence > positions that coded for it. This would need to handle the case where you > have a the end of an exon and the start of the next exon coding for a > particular amino acid sequence position. > > We also need to add in the ability to have tracks as a way to group > features. This way you export features based on a particular track as a > GFF/GFF3 file for importing into various genome browsers. You have one > genome you are working on with genes added in from three different gene > prediction algorithms each organized by a track. You should then be able to > determine overlaps of genes that were predicted and validated via blast > against uniprot and create another summary track of validated genes and > non-validate genes. If the feature classes we put together can make this > easy then I think we will have a solid design. 
> > > Scooter > > From HWillis at scripps.edu Wed Mar 17 14:17:59 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 14:17:59 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> Message-ID: <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> Andreas The problem with putting feature classes in a separate module is that biojava-core sequences would then have a dependency on biojava-feature. A sequence needs to hold a collection of features so feature classes need to go in core. If features are created from gff the core module doesn't care where features come from. We could go with biojava-genomes and code related to dealing with genomes goes in that module. If you like biojava-genome or biojava-genomes go ahead and create it and email me so I can check it out. Thanks Scooter On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: I like biojava-feature as a module name for the GFF and features related code. (should we try to keep the module names singular?) Let me know if you want me to create the module for this... A On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis > wrote: Andy Let me know if you have any major code changes for the core sequencing handling that have been or could be checked in. So far I haven't needed to touch any of the core sequence code but want to avoid merging code if you have made any significant changes. I should have code to check in today and if we can't come up with a better name I will ask Andreas to create a biojava3-genes module and I can then check that code in for your review. 
The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this that a TranscriptSequence is also a feature of a DNA sequence and only when you want to have a stand alone class with internal links back to parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. You can ask for a ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence and the corresponding ExonFeatures and some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where you have a the end of an exon and the start of the next exon coding for a particular amino acid sequence position. We also need to add in the ability to have tracks as a way to group features. This way you export features based on a particular track as a GFF/GFF3 file for importing into various genome browsers. You have one genome you are working on with genes added in from three different gene prediction algorithms each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via blast against uniprot and create another summary track of validated genes and non-validate genes. If the feature classes we put together can make this easy then I think we will have a solid design. 
Scooter From ayates at ebi.ac.uk Wed Mar 17 15:24:13 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 17 Mar 2010 19:24:13 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> Message-ID: <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> biojava-genomes sounds good. I've done nothing since my last check-in of code which was all to do with locations so there should be no problem there :) On 17 Mar 2010, at 18:17, Scooter Willis wrote: > Andreas > > The problem with putting feature classes in a separate module is that biojava-core sequences would then have a dependency on biojava-feature. A sequence needs to hold a collection of features so feature classes need to go in core. If features are created from gff the core module doesn't care where features come from. > > We could go with biojava-genomes and code related to dealing with genomes goes in that module. If you like biojava-genome or biojava-genomes go ahead and create it and email me so I can check it out. > > Thanks > > Scooter > > > > On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: > >> I like biojava-feature as a module name for the GFF and features related code. (should we try to keep the module names singular?) Let me know if you want me to create the module for this... >> A >> >> On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: >> Andy >> >> Let me know if you have any major code changes for the core sequencing handling that have been or could be checked in. So far I haven't needed to touch any of the core sequence code but want to avoid merging code if you have made any significant changes. 
>> >> I should have code to check in today and if we can't come up with a better name I will ask Andreas to create a biojava3-genes module and I can then check that code in for your review. The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this that a TranscriptSequence is also a feature of a DNA sequence and only when you want to have a stand alone class with internal links back to parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. You can ask for a ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence and the corresponding ExonFeatures and some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where you have a the end of an exon and the start of the next exon coding for a particular amino acid sequence position. >> >> We also need to add in the ability to have tracks as a way to group features. This way you export features based on a particular track as a GFF/GFF3 file for importing into various genome browsers. You have one genome you are working on with genes added in from three different gene prediction algorithms each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via blast against uniprot and create another summary track of validated genes and non-validate genes. If the feature classes we put together can make this easy then I think we will have a solid design. 
>> >> >> Scooter >> >> > -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From HWillis at scripps.edu Wed Mar 17 15:58:42 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 15:58:42 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> Message-ID: <9F8616DE-710D-4971-8C63-52C5EB7789C2@scripps.edu> Andy Should we use this as our test case http://www.sequenceontology.org/gff3.shtml for a complex example of transcription? Scooter On Mar 17, 2010, at 3:24 PM, Andy Yates wrote: > biojava-genomes sounds good. > > I've done nothing since my last check-in of code which was all to do with locations so there should be no problem there :) > > On 17 Mar 2010, at 18:17, Scooter Willis wrote: > >> Andreas >> >> The problem with putting feature classes in a separate module is that biojava-core sequences would then have a dependency on biojava-feature. A sequence needs to hold a collection of features so feature classes need to go in core. If features are created from gff the core module doesn't care where features come from. >> >> We could go with biojava-genomes and code related to dealing with genomes goes in that module. If you like biojava-genome or biojava-genomes go ahead and create it and email me so I can check it out.
>> >> Thanks >> >> Scooter >> >> >> >> On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: >> >>> I like biojava-feature as a module name for the GFF and features related code. (should we try to keep the module names singular?) Let me know if you want me to create the module for this... >>> A >>> >>> On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: >>> Andy >>> >>> Let me know if you have any major code changes for the core sequencing handling that have been or could be checked in. So far I haven't needed to touch any of the core sequence code but want to avoid merging code if you have made any significant changes. >>> >>> I should have code to check in today and if we can't come up with a better name I will ask Andreas to create a biojava3-genes module and I can then check that code in for your review. The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this that a TranscriptSequence is also a feature of a DNA sequence and only when you want to have a stand alone class with internal links back to parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. You can ask for a ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence and the corresponding ExonFeatures and some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where you have a the end of an exon and the start of the next exon coding for a particular amino acid sequence position. >>> >>> We also need to add in the ability to have tracks as a way to group features. This way you export features based on a particular track as a GFF/GFF3 file for importing into various genome browsers. 
You have one genome you are working on with genes added in from three different gene prediction algorithms each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via blast against uniprot and create another summary track of validated genes and non-validate genes. If the feature classes we put together can make this easy then I think we will have a solid design. >>> >>> >>> Scooter >>> >>> >> > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > From ayates at ebi.ac.uk Wed Mar 17 16:01:04 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 17 Mar 2010 20:01:04 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <9F8616DE-710D-4971-8C63-52C5EB7789C2@scripps.edu> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> <9F8616DE-710D-4971-8C63-52C5EB7789C2@scripps.edu> Message-ID: <2A33D045-0AD9-4948-90D3-48636D074514@ebi.ac.uk> Perfect :). Nothing like using someone else's test case as ours Andy On 17 Mar 2010, at 19:58, Scooter Willis wrote: > Andy > > Should be use this as our test case http://www.sequenceontology.org/gff3.shtml for a complex example of transcription? > > Scooter > > On Mar 17, 2010, at 3:24 PM, Andy Yates wrote: > >> biojava-genomes sounds good. 
>> >> I've done nothing since my last check-in of code which was all to do with locations so there should be no problem there :) >> >> On 17 Mar 2010, at 18:17, Scooter Willis wrote: >> >>> Andreas >>> >>> The problem with putting feature classes in a separate module is that biojava-core sequences would then have a dependency on biojava-feature. A sequence needs to hold a collection of features so feature classes need to go in core. If features are created from gff the core module doesn't care where features come from. >>> >>> We could go with biojava-genomes and code related to dealing with genomes goes in that module. If you like biojava-genome or biojava-genomes go ahead and create it and email me so I can check it out. >>> >>> Thanks >>> >>> Scooter >>> >>> >>> >>> On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: >>> >>>> I like biojava-feature as a module name for the GFF and features related code. (should we try to keep the module names singular?) Let me know if you want me to create the module for this... >>>> A >>>> >>>> On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: >>>> Andy >>>> >>>> Let me know if you have any major code changes for the core sequencing handling that have been or could be checked in. So far I haven't needed to touch any of the core sequence code but want to avoid merging code if you have made any significant changes. >>>> >>>> I should have code to check in today and if we can't come up with a better name I will ask Andreas to create a biojava3-genes module and I can then check that code in for your review. The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this that a TranscriptSequence is also a feature of a DNA sequence and only when you want to have a stand alone class with internal links back to parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. 
You can ask for a ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence and the corresponding ExonFeatures and some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where you have a the end of an exon and the start of the next exon coding for a particular amino acid sequence position. >>>> >>>> We also need to add in the ability to have tracks as a way to group features. This way you export features based on a particular track as a GFF/GFF3 file for importing into various genome browsers. You have one genome you are working on with genes added in from three different gene prediction algorithms each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via blast against uniprot and create another summary track of validated genes and non-validate genes. If the feature classes we put together can make this easy then I think we will have a solid design. 
>>>> >>>> >>>> Scooter >>>> >>>> >>> >> >> -- >> Andrew Yates Ensembl Genomes Engineer >> EMBL-EBI Tel: +44-(0)1223-492538 >> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >> >> >> >> > -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From andreas at sdsc.edu Wed Mar 17 18:14:40 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 17 Mar 2010 15:14:40 -0700 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> Message-ID: <59a41c431003171514u1357ecf1ndab75fa4d461124e@mail.gmail.com> ok, a new module biojava3-genome is now in SVN... A On Wed, Mar 17, 2010 at 11:17 AM, Scooter Willis wrote: > Andreas > > The problem with putting feature classes in a separate module is that > biojava-core sequences would then have a dependency on biojava-feature. A > sequence needs to hold a collection of features so feature classes need to > go in core. If features are created from gff the core module doesn't care > where features come from. > > We could go with biojava-genomes and code related to dealing with genomes > goes in that module. If you like biojava-genome or biojava-genomes go ahead > and create it and email me so I can check it out. > > Thanks > > Scooter > > > > On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: > > I like biojava-feature as a module name for the GFF and features related > code. (should we try to keep the module names singular?) 
Let me know if you > want me to create the module for this... > A > > On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: > >> Andy >> >> Let me know if you have any major code changes for the core sequencing >> handling that have been or could be checked in. So far I haven't needed to >> touch any of the core sequence code but want to avoid merging code if you >> have made any significant changes. >> >> I should have code to check in today and if we can't come up with a better >> name I will ask Andreas to create a biojava3-genes module and I can then >> check that code in for your review. The current problem is that we have >> ExonSequence extending DNASequence when it could also be described as a >> feature. One way to look at this that a TranscriptSequence is also a feature >> of a DNA sequence and only when you want to have a stand alone class with >> internal links back to parent sequence do you return a TranscriptSequence. >> The TranscriptFeature would have ExonFeature and IntronFeature as children. >> You can ask for a ExonSequence based on the ExonFeature. Once you get a >> ProteinSequence you should be able to reverse the process and get back the >> TranscriptSequence and the corresponding ExonFeatures and some sort of >> mapping from a protein sequence position back to the three DNA sequence >> positions that coded for it. This would need to handle the case where you >> have a the end of an exon and the start of the next exon coding for a >> particular amino acid sequence position. >> >> We also need to add in the ability to have tracks as a way to group >> features. This way you export features based on a particular track as a >> GFF/GFF3 file for importing into various genome browsers. You have one >> genome you are working on with genes added in from three different gene >> prediction algorithms each organized by a track. 
You should then be able to >> determine overlaps of genes that were predicted and validated via blast >> against uniprot and create another summary track of validated genes and >> non-validate genes. If the feature classes we put together can make this >> easy then I think we will have a solid design. >> >> >> Scooter >> >> > > From heuermh at acm.org Wed Mar 17 23:28:23 2010 From: heuermh at acm.org (Michael Heuer) Date: Wed, 17 Mar 2010 22:28:23 -0500 (EST) Subject: [Biojava-dev] Hackathon in Boston, July 2010 In-Reply-To: <5FC2D8EC-5408-4126-9A7D-CB6B3500B61C@eaglegenomics.com> Message-ID: On Mon, 15 Mar 2010, Richard Holland wrote: > Hi all, > > Following the successful hackathon in Cambridge earlier this year, it was originally planned to hold a second one in Boston in conjunction with BOSC in order to give those who couldn't make it to the UK a chance to get involved. > > However, OBF have beaten us to it by organising a cross-project CodeFest! > > http://www.open-bio.org/wiki/Codefest_2010 > > It would be great for BioJava people to get involved with this cross-project hackathon effort, and it saves organising one of our own! :) Yep, I'm already signed up. Look forward to seeing some of you there. michael From andreas at sdsc.edu Thu Mar 18 16:36:38 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 18 Mar 2010 13:36:38 -0700 Subject: [Biojava-dev] Google summer of code Message-ID: <59a41c431003181336i33d388aak4b5a26e11ee4161b@mail.gmail.com> Hi, It seems our (the Open Bioinformatics Foundation's) Google Summer of Code application has been accepted. http://socghop.appspot.com/gsoc/program/accepted_orgs/google/gsoc2010 As such we are now looking for an interested and skilled student to work on the BioJava multiple sequence alignment project. Take a look at the project description, and if you think you are up for the challenge, send me an email with your application.
http://biojava.org/wiki/Google_Summer_of_Code Andreas From andreas at sdsc.edu Tue Mar 23 20:33:09 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 23 Mar 2010 17:33:09 -0700 Subject: [Biojava-dev] GSoC update Message-ID: <59a41c431003231733t1e259753k55fbe0a8bfb801a3@mail.gmail.com> Hi, A quick update regarding the current status of our Google Summer of Code project: Several students have already expressed their interest. In fact the response was so good that I believe BioJava should try to run more than just one project. In the meantime we added another "mentor proposed" project to our GSoC page: http://biojava.org/wiki/Google_Summer_of_Code . Identification and Classification of Posttranslational Modification of Proteins: Develop a Posttranslational Modification package for the BioJava project. In general Google strongly encourages student-proposed projects, since historically those are often the most successful GSoC projects. It is recommended that students contact us / possible mentors prior to their application so we can match up students with suitable mentors and projects and we can help in solidifying your project ideas. In principle any BioJava contributor is suitable as a mentor. Students can apply between March 22nd and April 9th via the Google web site.
http://socghop.appspot.com/ Andreas From biopython at maubp.freeserve.co.uk Wed Mar 24 10:51:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Mar 2010 14:51:46 +0000 Subject: [Biojava-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> Message-ID: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote: > > On Mar 24, 2010, at 9:08 AM, Peter wrote: > >> Hi, >> >> This is probably of interest to all the Bio* projects offering access >> to the NCBI Entrez utilities. See forwarded message below. >> >> I *think* the new guidelines basically say that the email & tool parameters are >> optional BUT if your IP address ever gets banned for excessive use you then >> have to register an email & tool combination. >> >> Regarding the email address, the NCBI say to use the email of the developer >> (not the end user). However, they do not distinguish between the developers >> of a library (like us), and the developers of an application or script using a >> library (who may also be the end user). >> >> Currently we (Biopython) and I think BioPerl ask developers using our libraries >> to populate the email address themselves. I *think* this is still the >> right action. >> >> Peter > > > Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I > think with the SOAP-based ones as well). We're providing a specific set of > tools for user to write up their own applications end applications. I can try > contacting them regarding this to get an official response to clarify this > somewhat. Please give the NCBI an email - you can CC me too if you like.
> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a > default, but always leave the email blank and issue a warning if it isn't > set. We could just as easily leave both blank and issue warnings for both. We currently leave out the email and set the tool parameter to "Biopython" by default but this can be overridden. Currently leaving out the email does cause Biopython to give a warning. Peter From cjfields at illinois.edu Wed Mar 24 10:37:13 2010 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 24 Mar 2010 09:37:13 -0500 Subject: [Biojava-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> Message-ID: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> On Mar 24, 2010, at 9:08 AM, Peter wrote: > Hi, > > This is probably of interest to all the Bio* projects offering access > to the NCBI > Entrez utilities. See forwarded message below. > > I *think* the new guidelines basically say that the email & tool parameters are > optional BUT if your IP address ever gets banned for excessive use you then > have to register an email & tool combination. > > Regarding the email address, the NCBI say to use the email of the developer > (not the end user). However, they do not distinguish between the developers > of a library (like us), and the developers of an application or script using a > library (who may also be the end user). > > Currently we (Biopython) and I think BioPerl ask developers using our libraries > to populate the email address themselves. I *think* this is still the > right action. > > Peter Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I think with the SOAP-based ones as well). We're providing a specific set of tools for users to write up their own end applications.
I can try contacting them regarding this to get an official response to clarify this somewhat. Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a default, but always leave the email blank and issue a warning if it isn't set. We could just as easily leave both blank and issue warnings for both. chris > ---------- Forwarded message ---------- > From: > Date: Wed, Mar 24, 2010 at 1:53 PM > Subject: [Utilities-announce] NCBI Revised E-utility Usage Policy > To: NLM/NCBI List utilities-announce > > > New E-utility documentation now on the NCBI Bookshelf > > The Entrez Programming Utilities (E-Utilities) Help documentation has > been added to the NCBI Bookshelf, and so is now fully integrated with > the Entrez search and retrieval system as a part of the Bookshelf > database. This help document has been divided into chapters for better > organization and includes several new sample Perl scripts. At present > this book covers the standard URL interface for the E-utilties; > material about the SOAP interface will be added soon and is still > available at the same URL: > http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html. > > > > Revised E-utility usage policy > > In December, 2009 NCBI announced a change to the usage policy for the > E-utilities that would require all requests to contain non-null values > for both the &email and &tool parameters. After several consultations > with our users and developers, we have decided to revise this policy > change, and the revised policy is described in detail at the following > link: > > http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen > > Please let us know if you have any questions or concerns about this > policy change. > > > > Thank you, > > The E-Utilities Team > > NIH/NLM/NCBI > > eutilities at ncbi.nlm.nih.gov. 
> > > > _______________________________________________ > Utilities-announce mailing list > http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at drycafe.net Wed Mar 24 11:27:37 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Wed, 24 Mar 2010 11:27:37 -0400 Subject: [Biojava-dev] [Open-bio-l] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> Message-ID: <5D427F97-706E-4F66-95BA-2B397520C4FA@drycafe.net> On Mar 24, 2010, at 10:51 AM, Peter wrote: > Please give the NCBI an email - you can CC me too if you like. Can't this be the developers' mailing list (or lists, the appropriate one for each toolkit)? We can even whitelist all NCBI sender addresses so they can easily email us if there are issues. 
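For the Java side of this thread, the convention described above (a default tool name that can be overridden, an email that the application developer is expected to supply, and a warning when it is missing) might look roughly like the sketch below. This is only an illustration under stated assumptions: the class name EUtilsUrlSketch, the method buildUrl and the default tool name "BioJava" are hypothetical and are not BioJava API, and the base URL is assumed to be the usual E-utilities esearch endpoint.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of BioJava: builds an E-utilities esearch URL
// following the "default tool, optional email, warn if missing" convention.
public class EUtilsUrlSketch {

    // Assumed endpoint; check the current NCBI documentation before relying on it.
    public static final String BASE =
            "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    // Assumed default, mirroring the 'BioPerl' / "Biopython" defaults in the thread.
    public static final String DEFAULT_TOOL = "BioJava";

    public static String buildUrl(String db, String term, String tool, String email) {
        StringBuilder url = new StringBuilder(BASE);
        url.append("?db=").append(encode(db));
        url.append("&term=").append(encode(term));
        // The tool name falls back to the library default but can be overridden.
        url.append("&tool=").append(encode(tool == null ? DEFAULT_TOOL : tool));
        if (email == null || email.isEmpty()) {
            // Leave the parameter out and warn, rather than inventing an address.
            System.err.println("Warning: no email set for E-utilities request");
        } else {
            url.append("&email=").append(encode(email));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("pubmed", "biojava", null, "dev@example.org"));
    }
}
```

Whether the email should default to a toolkit mailing list, as suggested here, or stay empty until the application developer sets it is exactly the open question in this thread; the sketch takes the second option only because that is what BioPerl and Biopython are described as doing.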
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From cjfields at illinois.edu Wed Mar 24 11:44:21 2010 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 24 Mar 2010 10:44:21 -0500 Subject: [Biojava-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> Message-ID: <338BDDD8-2A66-4086-BFB7-35EC8F8F0D66@illinois.edu> On Mar 24, 2010, at 9:51 AM, Peter wrote: > On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote: >> >> On Mar 24, 2010, at 9:08 AM, Peter wrote: >> >>> Hi, >>> >>> This is probably of interest to all the Bio* projects offering access >>> to the NCBI Entrez utilities. See forwarded message below. >>> >>> I *think* the new guidelines basically say that the email & tool parameters are >>> optional BUT if your IP address ever gets banned for excessive use you then >>> have to register an email & tool combination. >>> >>> Regarding the email address, the NCBI say to use the email of the developer >>> (not the end user). However, they do not distinguish between the developers >>> of a library (like us), and the developers of an application or script using a >>> library (who may also be the end user). >>> >>> Currently we (Biopython) and I think BioPerl ask developers using our libraries >>> to populate the email address themselves. I *think* this is still the >>> right action. >>> >>> Peter >> >> >> Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I >> think with the SOAP-based ones as well). We're providing a specific set of >> tools for user to write up their own applications end applications. 
I can try >> contacting them regarding this to get an official response to clarify this >> somewhat. > > Please give the NCBI an email - you can CC me too if you like. Sent, have cc'd the open-bio list. Don't want to cross-post this too much, so I think we should move the discussion there. >> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a >> default, but always leave the email blank and issue a warning if it isn't >> set. We could just as easily leave both blank and issue warnings for both. > > We currently leave out the email and set the tool parameter to "Biopython" > by default but this can be overridden. Currently leaving out the email does > cause Biopython to give a warning. > > Peter We follow the same, then (down to the warning). This is mentioned in my post to them, I'll wait to see what they say. My concern is the wording of the new rules. Each tool and email must be registered with them if an IP is blocked. Does this mean each tool is assigned one specific email? And an IP that is blocked can register it to be allowed back into the fold? With that in mind, should we register each of our toolkits with them? Probably not a bad thing (it might help us as devs to get an idea of use), but then if one user abuses the rules will their actions affect all toolkit users? Is this all done on a per-IP basis, per-toolkit basis, etc? Unfortunately, at least to me, none of this is made very clear, so I'm hoping there is some clarification from their end. chris From maj at fortinbras.us Wed Mar 24 12:37:56 2010 From: maj at fortinbras.us (Mark A. 
Jensen) Date: Wed, 24 Mar 2010 12:37:56 -0400 Subject: [Biojava-dev] [Bioperl-l] [Open-bio-l] Fwd: [Utilities-announce] NCBI RevisedE-utility Usage Policy In-Reply-To: <5D427F97-706E-4F66-95BA-2B397520C4FA@drycafe.net> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com><38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu><320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> <5D427F97-706E-4F66-95BA-2B397520C4FA@drycafe.net> Message-ID: I think this is a great idea--- MAJ ----- Original Message ----- From: "Hilmar Lapp" To: "Peter" Cc: ; "Biopython-Dev Mailing List" ; ; "bioperl-l list" ; "Chris Fields" ; Sent: Wednesday, March 24, 2010 11:27 AM Subject: Re: [Bioperl-l] [Open-bio-l] Fwd: [Utilities-announce] NCBI RevisedE-utility Usage Policy > > On Mar 24, 2010, at 10:51 AM, Peter wrote: > >> Please give the NCBI an email - you can CC me too if you like. > > > Can't this be the developers' mailing list (or lists, the appropriate one for > each toolkit)? We can even whitelist all NCBI sender addresses so they can > easily email us if there are issues. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From sheoran143 at gmail.com Wed Mar 24 21:19:29 2010 From: sheoran143 at gmail.com (Deepak Sheoran) Date: Wed, 24 Mar 2010 20:19:29 -0500 Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject :( Hibernate Exception and suggestion for change in BioSqlSchema) Message-ID: <4BAABA21.4000301@gmail.com> I am writing this email again, I didn't get any response weather this bugs are patched or are they lost some where on mailing list. I am not sure that's why I am writing this back. 
I don't know how to apply these patches myself, so I am counting on you guys to apply them and reply so I know it's fixed. Thanks Deepak Sheoran Hi In response to the bug fix suggested by Richard I have created some patches. We need to apply these to stop biojava from processing references from a genbank record in the wrong manner, which causes more hibernate exceptions. After applying the patch, the reference resolution code will test the pubmed or medline id, then if there is no match test author/title/location, then if there is still no match create a new reference. I even tested it with GenbankRelease 175 and I gained almost 3159 more records in my database. Can somebody please have a look at the second issue and fix it: " 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from). " Also I am planning on making a bridge between biosql databases loaded using bioperl and biojava. Here is some of my investigation; can you guys suggest some direction on it?
Have a look on attached files 1) Biojava_BioPerl_Diff.xls ==> it have view of tables where genbank record is stored in biosql instance by bioperl and biojava 2) GenbankRecord.doc ==> its word document having a genbank showing where its information goes in biosql using bioperl and biojava 3) BioSqlRichobjectBuilder.patch ==> patch needed for BioSqlRichObjectBuild.java class 4) GenBankFormat.patch ==> patch needed for GenBankFormat.java class Thanks Deepak Sheoran -------- Original Message -------- Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema Date: Tue, 9 Feb 2010 20:34:32 +1300 From: Richard Holland To: Deepak Sheoran CC: biojava-l at biojava.org Hi. It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text. However, in answer to your two questions: 1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records. I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March). 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from). 
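The three-step lookup described in point 1 (test ID, then author/title/location, then create) can be sketched as below. This is a minimal, self-contained illustration and not the actual patch: it uses an in-memory list instead of Hibernate and BioSQL, and the names DocRefResolverSketch, DocRef and resolve are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical in-memory stand-in for the reference lookup; the real code
// would query the BioSQL "reference" table through Hibernate instead.
public class DocRefResolverSketch {

    public static final class DocRef {
        public final String pubmedId; // may be null when the record has no ID
        public final String authors;
        public final String title;    // may be null
        public final String location;

        DocRef(String pubmedId, String authors, String title, String location) {
            this.pubmedId = pubmedId;
            this.authors = authors;
            this.title = title;
            this.location = location;
        }
    }

    private final List<DocRef> store = new ArrayList<>();

    public DocRef resolve(String pubmedId, String authors, String title, String location) {
        // Step 1: an existing reference with the same ID wins, even if the
        // author/title/location strings differ slightly.
        if (pubmedId != null) {
            for (DocRef ref : store) {
                if (pubmedId.equals(ref.pubmedId)) {
                    return ref;
                }
            }
        }
        // Step 2: fall back to an exact author/title/location match.
        for (DocRef ref : store) {
            if (Objects.equals(ref.authors, authors)
                    && Objects.equals(ref.title, title)
                    && Objects.equals(ref.location, location)) {
                return ref;
            }
        }
        // Step 3: nothing matched, so create a new reference.
        DocRef created = new DocRef(pubmedId, authors, title, location);
        store.add(created);
        return created;
    }

    public int size() {
        return store.size();
    }
}
```

With this ordering, two GenBank records that cite the same PubMed ID with slightly different location strings resolve to a single reference, so the unique constraint on dbxref_id is never asked to hold two rows for one ID.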
cheers, Richard On 9 Feb 2010, at 20:21, Deepak Sheoran wrote: > > Hi Richard > > Below is the email which I sent to Biojava-1 mailing list but it never get posted on the mailing list server neither do i got any response, so please have a look on this email and tell what can be the solution of the problem described in the message. > > > Thanks > Deepak Sheoran > -------- Original Message -------- > Subject: Hibernate Exception and suggestion for change in BioSqlSchema > Date: Wed, 03 Feb 2010 08:07:35 -0600 > From: Deepak Sheoran > To: biojava-l at lists.open-bio.org > > Hi guys, > > A couple of days back I was having some problem with hibernate exception but that exception got resolved and the reference to that email is:http://old.nabble.com/Hibernate-Exception-when-persisting-some-richsequence-object-to-biosql-schema-to27299245.html > On Richard suggestion in above link I am able to resolve some of issues but then, I got stuck in to some other error with hibernate and then decided to investigate the matter and below are some facts and information which I found and I guess it is going to affect all of us. > ? The "Reference" table in bioSql schema have unique constraint on "dbxref_id" column (CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id)). Which mean only one entry in reference table can use on dbxref_id. > This Works wells but in cases when you have little variation in value of following column "location", "title", "authors" and all these variation refers to same PUBMED_ID. Then we can't persist or create a richsequence object . > Now when you tie RichObjectFactory to a active hibernate session then the class "BioSqlRichObjectBuilder" have method called "buildObject(Class clazz, List paramsList) " which is responsible for looking up details of object in the database and if it find one then it will return that object, else it will try to persist the new object into the database. 
> But problem is with below part of that method: > ?..LineNumber: 114 > else if (SimpleDocRef.class.isAssignableFrom(clazz)) > { queryType = "DocRef"; > // convert List constructor to String representation for query > ourParamsList.set(0, DocRefAuthor.Tools.generateAuthorString((List)ourParamsList.get(0), true)); > if (ourParamsList.size()<3) { > queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title is null"; > } else { > queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title = ?"; > } > } > ..LineNubmer: 123 > Now when hibernate search the database, it won't find any other record in "reference" table because those two record are different in string comparison, so it will return a new object back to "GenbankFormat" to following piece of code > ?.LineNumber: 447 > else { > try { > CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)}); > RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount); > rlistener.getCurrentFeature().addRankedCrossRef(rcr); > } catch (ChangeVetoException e) { > throw new ParseException(e+", accession:"+accession); > } > } > ?..LineNumber:455 > Then we will add that object to rlistener. And move to next part of genbank record and then biojava search for a new crossref in database and it will try to persist the old one it get a hibernate exception regarding violation of "unique constraint on dbxref_id" column. > > The only way to get these record in database is: > ? The very easy solution and the way I did it for testing my theory is Change the bioSql schema so that it can allow many to one on relation between "reference" and "dbxref" table. Which even make sense because one paper can have many different variation of naming, and this change allow us to store that info too. But this is something BioSql people have decide and I don't know how to approach them. > ? 
Second solution is slightly difficult to implement, is to change the way "BioSqlRichObjectBuilder.buildObject(Class clazz,List paramsList)" make decision about weather a particular DocRef already exist in database or not. I am mean testing all possible string variations of authors, location, title of the docRef which we are searching. Which does have many complications and may slow down process of creating a richsequence object when link RichObjectFactory with a active hibernate session. > > Example:Below is a sample of what i have in my local biosql schema which has modification suggested by me. (dbxref_id column have Pubmed_id , I replaced the local dbxref_id which was present on this table in my database with pubmed_id stored in "dbxref" table, for easy reference with outside world in this email) > Reference_id > Dbxref_id > Location > Title > Authors > crc > 216 > 18554304 > FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008) > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. > 9E940E01F4BE3CD0 > 230 > 18554304 > FEMS Microbiol. Ecol. 66 (3), 528-536 (2008) > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. > D3BC0C17F3F786C9 > 415 > 16790744 > Infect. Immun. 74 (7), 3715-3726 (2006) > Intrastrain Heterogeneity of the mgpB Gene in Mycoplasma genitalium Is Extensive In Vitro and In Vivo and Suggests that Variation Is Generated via Recombination with Repetitive Chromosomal Sequences > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. 
> 60AEDFA0CEEACC38 > 969 > 16790744 > Infect. Immun. 74 (7), 3715-3726 (2006) > Intrastrain heterogeneity of the mgpB gene in mycoplasma genitalium is extensive in vitro and in vivo and suggests that variation is generated via recombination with repetitive chromosomal sequences > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. > 4B1232999F6E8130 > 929 > 8688087 > Science 273 (5278), 1058-1073 (1996) > Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J.-F., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.L., Geoghagen,N.S.M., Weidman,J.F., Fuhrmann,J.L., Presley,E.A., Nguyen,D., Utterback,T.R., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.P., Borodovsky,M., Klenk,H.-P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. > 3E79B40DD2AAA2B7 > 932 > 8688087 > Science 273 (5278), 1058-1073 (1996) > Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.D., Geoghagen,N.S., Weidman,J.F., Fuhrmann,J.L., Nguyen,D.T., Utterback,T., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.B., Borodovsky,M., Klenk,H.P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. > 094EB3384F8D6DE8 > 1426 > 10684935 > Nucleic Acids Res. 
28 (6), 1397-1406 (2000) > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 > Read,T.D., Brunham,R.C., Shen,C., Gill,S.R., Heidelberg,J.F., White,O., Hickey,E.K., Peterson,J., Umayam,L.A., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S.L., Eisen,J. and Fraser,C.M. > 357648D8FD8C6C8A > 1481 > 10684935 > Nucleic Acids Res. 28 (6), 1397-1406 (2000) > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 > Read,T., Brunham,R., Shen,C., Gill,S., Heidelberg,J., White,O., Hickey,E., Peterson,J., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S., Eisen,J. and Fraser,C. > 115411EB2DEE5654 > 1497 > 14689165 > Arch. Microbiol. 181 (2), 144-154 (2004) > The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. > 4D5D376EECCD186B > 1501 > 14689165 > Arch. Microbiol. 181 (2), 144-154 (2004) > The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., Del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. > 4D57954EECDED66B > 1556 > 18060065 > PLoS ONE 2 (12), E1271 (2007) > Analysis of the Neurotoxin Complex Genes in Clostridium botulinum A1-A4 and B1 Strains: BoNT/A3, /Ba4 and /B1 Clusters Are Located within Plasmids > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,A.C., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. 
> 698688FB6DB95247 > 1559 > 18060065 > PLoS ONE 2 (12), E1271 (2007) > Analysis of the neurotoxin complex genes in Clostridium botulinum A1-A4 and B1 strains: BoNT/A3, /Ba4 and /B1 clusters are located within plasmids > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,C.A., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. > E25E1BA99DB18F3D > > ? The second kind of error which I got was : org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature > ? Which means in richsequence object some feature have location object which have its feature set to null. > ? My Observation: > ? Usually occur when you try to persist a richsequence object to database, and occur to those features which have CompoundRichLocation usually "joins" and "complement" in cds region of a genbank record > ? After catching the hibernate exception I went through all the features and either biojava or hibernate changed the object type of a CompoundRichLocation to SimpleRichLocation and set the feature variable to null. > ? Below is the screen shot of one of my tests > ? Settings before trying to persits the richsequence object to database > > > ? > ? After trying to persits the richsequence object to database and got in hibernate exception catch > > ? > > ? So my question is why is this happening and how to stop or how to get these record into database, I have no clue why is this happening. > ? Some extra information to make things more clear to you guys. > ? Below are some Locus line from genbank record for which I know the error of location, I mean the cds region causing error, and array index in richsequence.feature arrayList object. > ? LOCUS AE001439 1643831 bp DNA circular BCT 19-JAN-2006 > ? richSequence.feature Index : 2540 and line number in the genbank record : 22115 > ? LOCUS CP001189 3887492 bp DNA circular BCT 16-OCT-2008 > ? richSequence.feature Index : 127 and line number in the genbank record : 2137 > ? 
LOCUS CP001292 328635 bp DNA circular BCT 17-DEC-2008 > ? richSequence.feature Index : 389 and line number in the genbank record : 3632 > ? LOCUS AM279694 238517 bp DNA linear BCT 23-OCT-2008 > ? richSequence.feature Index : 47 and line number in the genbank record : 4841 > ? LOCUS CR931663 18517 bp DNA linear BCT 18-SEP-2008 > ? richSequence.feature Index : 45 and line number in the genbank record : 442 > ? The complete exception msg : > org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature > at org.hibernate.engine.Nullability.checkNullability(Nullability.java:72) > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:290) > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > at 
org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > at 
org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > at org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > at org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > at org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535) > at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523) > at trashtesting.GenBankLoaderTesting.main(GenBankLoaderTesting.java:78) > > -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E:holland at eaglegenomics.com http://www.eaglegenomics.com/ -------------- next part -------------- A non-text attachment was scrubbed... Name: Biojava_BioPerl_diff.xls Type: application/vnd.ms-excel Size: 346624 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: BioSqlRichObjectBuilder.patch URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: GenbankFormat.patch URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: GenbankRecord.doc Type: application/msword Size: 59392 bytes Desc: not available URL:

From holland at eaglegenomics.com Thu Mar 25 12:27:17 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 25 Mar 2010 16:27:17 +0000
Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject :( Hibernate Exception and suggestion for change in BioSqlSchema)
In-Reply-To: <4BAABA21.4000301@gmail.com>
References: <4BAABA21.4000301@gmail.com>
Message-ID: <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>

Patched and in subversion on the head in the new Biojava 3 code. I modified the code slightly to simplify it. There were also parallel changes required over in SimpleDocRef itself to enable it to continue working without being connected to BioSQL.

On 25 Mar 2010, at 01:19, Deepak Sheoran wrote:

> I am writing this email again because I didn't get any response on whether these bugs are patched or whether the message got lost somewhere on the mailing list. I don't know how to apply the patches myself, so I am counting on you guys to apply them and reply so I know it's fixed.
>
> Thanks
> Deepak Sheoran
>
> Hi
> In response to the bug fix suggested by Richard I have created some patches. We need to apply these to stop biojava from processing references from a genbank record in the wrong manner, which causes more hibernate exceptions. After applying the patch, the reference resolution code will test the pubmed or medline id, then if no match test author/title/location, then if still no match create a new reference. I even tested it with GenbankRelease 175 and I gained almost 3159 more records in my database.
>
> Can somebody please have a look at the second issue and fix it:
> "
> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation.
Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
> "
>
> Also, I am planning to build a bridge between biosql databases loaded using bioperl and biojava. Here is some of my investigation; can you guys suggest some direction on it?
> Have a look at the attached files:
> 1) Biojava_BioPerl_diff.xls ==> a view of the tables where a genbank record is stored in a biosql instance by bioperl and by biojava
> 2) GenbankRecord.doc ==> a Word document with a genbank record showing where its information goes in biosql using bioperl and biojava
> 3) BioSqlRichObjectBuilder.patch ==> patch needed for the BioSqlRichObjectBuilder.java class
> 4) GenbankFormat.patch ==> patch needed for the GenbankFormat.java class
>
> Thanks
> Deepak Sheoran
>
> -------- Original Message --------
> Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema
> Date: Tue, 9 Feb 2010 20:34:32 +1300
> From: Richard Holland
> To: Deepak Sheoran
> CC: biojava-l at biojava.org
>
> Hi. It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text.
>
> However, in answer to your two questions:
>
> 1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records. I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March).
>
> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation.
Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
>
> cheers,
> Richard
>
> On 9 Feb 2010, at 20:21, Deepak Sheoran wrote:
>
> > Hi Richard
> >
> > Below is the email which I sent to the biojava-l mailing list, but it never got posted on the mailing list server, nor did I get any response. Please have a look at this email and tell me what the solution to the problem described in it could be.
> >
> > Thanks
> > Deepak Sheoran
> >
> > -------- Original Message --------
> > Subject: Hibernate Exception and suggestion for change in BioSqlSchema
> > Date: Wed, 03 Feb 2010 08:07:35 -0600
> > From: Deepak Sheoran
> > To: biojava-l at lists.open-bio.org
> >
> > Hi guys,
> >
> > A couple of days back I was having a problem with a hibernate exception, but that exception got resolved; the reference to that email is:
> > http://old.nabble.com/Hibernate-Exception-when-persisting-some-richsequence-object-to-biosql-schema-to27299245.html
> >
> > Following Richard's suggestion in the above link I was able to resolve some of the issues, but then I got stuck on another hibernate error. I decided to investigate the matter, and below are some facts which I found and which I guess are going to affect all of us.
> >
> > * The "reference" table in the BioSQL schema has a unique constraint on the "dbxref_id" column (CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id)), which means only one entry in the reference table can use a given dbxref_id.
> > This works well, except in cases where there is a little variation in the values of the columns "location", "title" and "authors" and all these variations refer to the same PUBMED_ID. Then we can't persist or create a richsequence object.
> > Now when you tie RichObjectFactory to an active hibernate session, the class "BioSqlRichObjectBuilder" has a method called "buildObject(Class clazz, List paramsList)" which is responsible for looking up the details of an object in the database; if it finds one it returns that object, else it tries to persist the new object into the database.
> > But the problem is with the part of that method below:
> > ... LineNumber: 114
> > else if (SimpleDocRef.class.isAssignableFrom(clazz)) {
> >     queryType = "DocRef";
> >     // convert List constructor to String representation for query
> >     ourParamsList.set(0, DocRefAuthor.Tools.generateAuthorString((List)ourParamsList.get(0), true));
> >     if (ourParamsList.size()<3) {
> >         queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title is null";
> >     } else {
> >         queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title = ?";
> >     }
> > }
> > ... LineNumber: 123
> > Now when hibernate searches the database, it won't find any other record in the "reference" table because the two records differ in a string comparison, so a new object is returned to "GenbankFormat", to the following piece of code:
> > ... LineNumber: 447
> > else {
> >     try {
> >         CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)});
> >         RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
> >         rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> >     } catch (ChangeVetoException e) {
> >         throw new ParseException(e+", accession:"+accession);
> >     }
> > }
> > ... LineNumber: 455
> > Then we add that object to rlistener and move on to the next part of the genbank record. When biojava then searches for a new crossref in the database and tries to persist the old one, it gets a hibernate exception about a violation of the unique constraint on the "dbxref_id" column.
> >
> > The only way to get these records into the database is:
> > *
The easy solution, and the way I tested my theory, is to change the BioSQL schema so that it allows a many-to-one relation between the "reference" and "dbxref" tables. This even makes sense, because one paper can have many different naming variations, and this change allows us to store that information too. But this is something the BioSQL people have to decide, and I don't know how to approach them.
> > * The second solution, which is slightly more difficult to implement, is to change the way "BioSqlRichObjectBuilder.buildObject(Class clazz, List paramsList)" decides whether a particular DocRef already exists in the database. I mean testing all possible string variations of the authors, location and title of the DocRef we are searching for. This has many complications and may slow down the creation of a richsequence object when RichObjectFactory is linked to an active hibernate session.
> >
> > Example: Below is a sample of what I have in my local biosql schema, which has the modification suggested by me. (The dbxref_id column shows the Pubmed_id: I replaced the local dbxref_id present in this table in my database with the pubmed_id stored in the "dbxref" table, for easy reference with the outside world in this email.)
> > Reference_id
> > Dbxref_id
> > Location
> > Title
> > Authors
> > crc
> > 216
> > 18554304
> > FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)
> > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model
> > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H.
> > 9E940E01F4BE3CD0
> > 230
> > 18554304
> > FEMS Microbiol. Ecol.
66 (3), 528-536 (2008) > > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model > > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. > > D3BC0C17F3F786C9 > > 415 > > 16790744 > > Infect. Immun. 74 (7), 3715-3726 (2006) > > Intrastrain Heterogeneity of the mgpB Gene in Mycoplasma genitalium Is Extensive In Vitro and In Vivo and Suggests that Variation Is Generated via Recombination with Repetitive Chromosomal Sequences > > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. > > 60AEDFA0CEEACC38 > > 969 > > 16790744 > > Infect. Immun. 74 (7), 3715-3726 (2006) > > Intrastrain heterogeneity of the mgpB gene in mycoplasma genitalium is extensive in vitro and in vivo and suggests that variation is generated via recombination with repetitive chromosomal sequences > > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. > > 4B1232999F6E8130 > > 929 > > 8688087 > > Science 273 (5278), 1058-1073 (1996) > > Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii > > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J.-F., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.L., Geoghagen,N.S.M., Weidman,J.F., Fuhrmann,J.L., Presley,E.A., Nguyen,D., Utterback,T.R., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.P., Borodovsky,M., Klenk,H.-P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. 
> > 3E79B40DD2AAA2B7 > > 932 > > 8688087 > > Science 273 (5278), 1058-1073 (1996) > > Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii > > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.D., Geoghagen,N.S., Weidman,J.F., Fuhrmann,J.L., Nguyen,D.T., Utterback,T., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.B., Borodovsky,M., Klenk,H.P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. > > 094EB3384F8D6DE8 > > 1426 > > 10684935 > > Nucleic Acids Res. 28 (6), 1397-1406 (2000) > > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 > > Read,T.D., Brunham,R.C., Shen,C., Gill,S.R., Heidelberg,J.F., White,O., Hickey,E.K., Peterson,J., Umayam,L.A., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S.L., Eisen,J. and Fraser,C.M. > > 357648D8FD8C6C8A > > 1481 > > 10684935 > > Nucleic Acids Res. 28 (6), 1397-1406 (2000) > > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 > > Read,T., Brunham,R., Shen,C., Gill,S., Heidelberg,J., White,O., Hickey,E., Peterson,J., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S., Eisen,J. and Fraser,C. > > 115411EB2DEE5654 > > 1497 > > 14689165 > > Arch. Microbiol. 
181 (2), 144-154 (2004)
> > The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner
> > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E.
> > 4D5D376EECCD186B
> > 1501
> > 14689165
> > Arch. Microbiol. 181 (2), 144-154 (2004)
> > The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner
> > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., Del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E.
> > 4D57954EECDED66B
> > 1556
> > 18060065
> > PLoS ONE 2 (12), E1271 (2007)
> > Analysis of the Neurotoxin Complex Genes in Clostridium botulinum A1-A4 and B1 Strains: BoNT/A3, /Ba4 and /B1 Clusters Are Located within Plasmids
> > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,A.C., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S.
> > 698688FB6DB95247
> > 1559
> > 18060065
> > PLoS ONE 2 (12), E1271 (2007)
> > Analysis of the neurotoxin complex genes in Clostridium botulinum A1-A4 and B1 strains: BoNT/A3, /Ba4 and /B1 clusters are located within plasmids
> > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,C.A., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S.
> > E25E1BA99DB18F3D
> >
> > * The second kind of error which I got was: org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature
> > * This means that in the richsequence object some feature has a location object whose feature is set to null.
> > * My observation:
> > * It usually occurs when you try to persist a richsequence object to the database, and it occurs for those features which have a CompoundRichLocation, usually "joins" and "complement" in the cds region of a genbank record
> > *
After catching the hibernate exception I went through all the features, and either biojava or hibernate had changed the object type of a CompoundRichLocation to SimpleRichLocation and set the feature variable to null.
> > * Below are the screenshots of one of my tests:
> > * Settings before trying to persist the richsequence object to the database
> >   [screenshot not preserved in the archive]
> > * After trying to persist the richsequence object to the database, inside the hibernate exception catch block
> >   [screenshot not preserved in the archive]
> > * So my question is: why is this happening, and how can I stop it or get these records into the database? I have no clue why this is happening.
> > * Some extra information to make things clearer to you guys.
> > * Below are some LOCUS lines from genbank records for which I know the location causing the error (I mean the cds region causing the error), together with the array index in the richSequence.feature ArrayList object.
> > * LOCUS AE001439 1643831 bp DNA circular BCT 19-JAN-2006
> >   - richSequence.feature index: 2540, line number in the genbank record: 22115
> > * LOCUS CP001189 3887492 bp DNA circular BCT 16-OCT-2008
> >   - richSequence.feature index: 127, line number in the genbank record: 2137
> > * LOCUS CP001292 328635 bp DNA circular BCT 17-DEC-2008
> >   - richSequence.feature index: 389, line number in the genbank record: 3632
> > * LOCUS AM279694 238517 bp DNA linear BCT 23-OCT-2008
> >   - richSequence.feature index: 47, line number in the genbank record: 4841
> > * LOCUS CR931663 18517 bp DNA linear BCT 18-SEP-2008
> >   - richSequence.feature index: 45, line number in the genbank record: 442
> > *
The complete exception msg : > > org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature > > at org.hibernate.engine.Nullability.checkNullability(Nullability.java:72) > > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:290) > > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > > at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > > at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > > at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > > at 
org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > > at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > > at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > > at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > at 
org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121)
> > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
> > at org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33)
> > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
> > at org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27)
> > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
> > at org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535)
> > at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523)
> > at trashtesting.GenBankLoaderTesting.main(GenBankLoaderTesting.java:78)
> >
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andreas at sdsc.edu Thu Mar 25 12:47:45 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 25 Mar 2010 09:47:45 -0700
Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject :( Hibernate Exception and suggestion for change in BioSqlSchema)
In-Reply-To: <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
References: <4BAABA21.4000301@gmail.com> <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
Message-ID: <59a41c431003250947g6ecd11cbw21c5be5858b9aa09@mail.gmail.com>

Excellent, thanks Richard and Deepak!

Andreas

On Thu, Mar 25, 2010 at 9:27 AM, Richard Holland wrote:

> Patched and in subversion on the head in the new Biojava 3 code.
I modified the code slightly to simplify it. There were also parallel changes required over in SimpleDocRef itself to enable it to continue working without being connected to BioSQL.
>
> On 25 Mar 2010, at 01:19, Deepak Sheoran wrote:
>
> > I am writing this email again because I didn't get any response on whether these bugs are patched or whether the message got lost somewhere on the mailing list. I don't know how to apply the patches myself, so I am counting on you guys to apply them and reply so I know it's fixed.
> >
> > Thanks
> > Deepak Sheoran
> >
> > Hi
> > In response to the bug fix suggested by Richard I have created some patches. We need to apply these to stop biojava from processing references from a genbank record in the wrong manner, which causes more hibernate exceptions. After applying the patch, the reference resolution code will test the pubmed or medline id, then if no match test author/title/location, then if still no match create a new reference. I even tested it with GenbankRelease 175 and I gained almost 3159 more records in my database.
> >
> > Can somebody please have a look at the second issue and fix it:
> > "
> > 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
> > "
> >
> > Also, I am planning to build a bridge between biosql databases loaded using bioperl and biojava. Here is some of my investigation; can you guys suggest some direction on it?
> > Have a look at the attached files:
> > 1) Biojava_BioPerl_diff.xls ==> a view of the tables where a genbank record is stored in a biosql instance by bioperl and by biojava
> > 2) GenbankRecord.doc ==> a Word document with a genbank record showing where its information goes in biosql using bioperl and biojava
> > 3) BioSqlRichObjectBuilder.patch ==> patch needed for the BioSqlRichObjectBuilder.java class
> > 4) GenbankFormat.patch ==> patch needed for the GenbankFormat.java class
> >
> > Thanks
> > Deepak Sheoran
> >
> > -------- Original Message --------
> > Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema
> > Date: Tue, 9 Feb 2010 20:34:32 +1300
> > From: Richard Holland
> > To: Deepak Sheoran
> > CC: biojava-l at biojava.org
> >
> > Hi. It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text.
> >
> > However, in answer to your two questions:
> >
> > 1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records. I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March).
> >
> > 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
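
[Editorial illustration] The three-step process described in point 1 (test ID, then author/title/location, then create) could be sketched roughly as follows. This is not the patch that was committed: ReferenceResolver, Ref and resolve are hypothetical names, and a plain in-memory list stands in for the BioSQL "reference" table and the hibernate session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical in-memory stand-in for the BioSQL "reference" table.
// None of these names come from BioJava; this is an illustration only.
class ReferenceResolver {

    /** A stored literature reference; pubmedId is null when the record carries no ID. */
    static final class Ref {
        final String pubmedId;
        final String authors;
        final String title;
        final String location;

        Ref(String pubmedId, String authors, String title, String location) {
            this.pubmedId = pubmedId;
            this.authors = authors;
            this.title = title;
            this.location = location;
        }
    }

    private final List<Ref> stored = new ArrayList<>();

    Ref resolve(String pubmedId, String authors, String title, String location) {
        // Step 1: match on the external ID, ignoring spelling differences elsewhere.
        if (pubmedId != null) {
            for (Ref r : stored) {
                if (pubmedId.equals(r.pubmedId)) {
                    return r;
                }
            }
        }
        // Step 2: fall back to an exact authors/title/location match.
        for (Ref r : stored) {
            if (Objects.equals(authors, r.authors)
                    && Objects.equals(title, r.title)
                    && Objects.equals(location, r.location)) {
                return r;
            }
        }
        // Step 3: nothing matched, so create and store a new reference.
        Ref created = new Ref(pubmedId, authors, title, location);
        stored.add(created);
        return created;
    }

    int size() {
        return stored.size();
    }

    public static void main(String[] args) {
        ReferenceResolver resolver = new ReferenceResolver();
        // Two location spellings reported in this thread for PubMed ID 18554304.
        Ref first = resolver.resolve("18554304", "authors", "title",
                "FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)");
        Ref second = resolver.resolve("18554304", "authors", "title",
                "FEMS Microbiol. Ecol. 66 (3), 528-536 (2008)");
        System.out.println("same record reused: " + (first == second));
        System.out.println("stored references: " + resolver.size());
    }
}
```

With the ID test first, the second spelling of the location resolves to the record already stored, so no second row competes for the same dbxref_id.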
> >
> > cheers,
> > Richard
> >
> > On 9 Feb 2010, at 20:21, Deepak Sheoran wrote:
> >
> > > Hi Richard
> > >
> > > Below is the email which I sent to the biojava-l mailing list, but it never got posted on the mailing list server, nor did I get any response. Please have a look at this email and tell me what the solution to the problem described in it could be.
> > >
> > > Thanks
> > > Deepak Sheoran
> > >
> > > -------- Original Message --------
> > > Subject: Hibernate Exception and suggestion for change in BioSqlSchema
> > > Date: Wed, 03 Feb 2010 08:07:35 -0600
> > > From: Deepak Sheoran
> > > To: biojava-l at lists.open-bio.org
> > >
> > > Hi guys,
> > >
> > > A couple of days back I was having a problem with a hibernate exception, but that exception got resolved; the reference to that email is:
> > > http://old.nabble.com/Hibernate-Exception-when-persisting-some-richsequence-object-to-biosql-schema-to27299245.html
> > >
> > > Following Richard's suggestion in the above link I was able to resolve some of the issues, but then I got stuck on another hibernate error. I decided to investigate the matter, and below are some facts which I found and which I guess are going to affect all of us.
> > >
> > > * The "reference" table in the BioSQL schema has a unique constraint on the "dbxref_id" column (CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id)), which means only one entry in the reference table can use a given dbxref_id.
> > > This works well, except in cases where there is a little variation in the values of the columns "location", "title" and "authors" and all these variations refer to the same PUBMED_ID. Then we can't persist or create a richsequence object.
> > > Now when you tie RichObjectFactory to an active hibernate session, the class "BioSqlRichObjectBuilder" has a method called "buildObject(Class clazz, List paramsList)" which is responsible for looking up the details of an object in the database; if it finds one it returns that object, else it tries to persist the new object into the database.
> > > But the problem is with the part of that method below:
> > > ... LineNumber: 114
> > > else if (SimpleDocRef.class.isAssignableFrom(clazz)) {
> > >     queryType = "DocRef";
> > >     // convert List constructor to String representation for query
> > >     ourParamsList.set(0, DocRefAuthor.Tools.generateAuthorString((List)ourParamsList.get(0), true));
> > >     if (ourParamsList.size()<3) {
> > >         queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title is null";
> > >     } else {
> > >         queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title = ?";
> > >     }
> > > }
> > > ... LineNumber: 123
> > > Now when hibernate searches the database, it won't find any other record in the "reference" table because the two records differ in a string comparison, so a new object is returned to "GenbankFormat", to the following piece of code:
> > > ... LineNumber: 447
> > > else {
> > >     try {
> > >         CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)});
> > >         RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
> > >         rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> > >     } catch (ChangeVetoException e) {
> > >         throw new ParseException(e+", accession:"+accession);
> > >     }
> > > }
> > > ... LineNumber: 455
> > > Then we add that object to rlistener and move on to the next part of the genbank record. When biojava then searches for a new crossref in the database and tries to persist the old one, it gets a hibernate exception about a violation of the unique constraint on the "dbxref_id" column.
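
[Editorial illustration] The failure described above can be reproduced in miniature. The sketch below is an assumption-laden model, not BioJava or hibernate code: ReferenceTable, existsExact and insert are hypothetical names, the lookup imitates the exact authors/location/title comparison of the quoted query, and insert imitates the UNIQUE (dbxref_id) constraint.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "reference" table; illustration only.
class UniqueDbxrefDemo {

    static final class ReferenceTable {
        // Rows indexed by the exact (authors, location, title) triple, as in the quoted query.
        private final Map<List<String>, Integer> byExactKey = new HashMap<>();
        // Models CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id).
        private final Set<Integer> usedDbxrefIds = new HashSet<>();

        boolean existsExact(String authors, String location, String title) {
            return byExactKey.containsKey(Arrays.asList(authors, location, title));
        }

        void insert(int dbxrefId, String authors, String location, String title) {
            if (!usedDbxrefIds.add(dbxrefId)) {
                throw new IllegalStateException("unique constraint violated: dbxref_id " + dbxrefId);
            }
            byExactKey.put(Arrays.asList(authors, location, title), dbxrefId);
        }
    }

    public static void main(String[] args) {
        ReferenceTable table = new ReferenceTable();
        int dbxrefId = 1; // arbitrary local id standing in for PubMed 18554304
        table.insert(dbxrefId, "authors", "FEMS Microbiol. Ecol. 66 (3), 528-536 (2008)", "title");

        // Same paper, different spelling of the location: the exact lookup misses it.
        String variant = "FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)";
        System.out.println("lookup found existing record: "
                + table.existsExact("authors", variant, "title"));

        // Because the lookup missed, the caller tries to insert a second row for the same dbxref_id.
        try {
            table.insert(dbxrefId, "authors", variant, "title");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The miss in the exact lookup followed by the rejected insert corresponds to the two stages in the message: a new DocRef is built, and persisting it trips the constraint.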
> > >
> > > The only way to get these records into the database is:
> > > * The easy solution, and the way I tested my theory, is to change the BioSQL schema so that it allows a many-to-one relation between the "reference" and "dbxref" tables. This even makes sense, because one paper can have many different naming variations, and this change allows us to store that information too. But this is something the BioSQL people have to decide, and I don't know how to approach them.
> > > * The second solution, which is slightly more difficult to implement, is to change the way "BioSqlRichObjectBuilder.buildObject(Class clazz, List paramsList)" decides whether a particular DocRef already exists in the database. I mean testing all possible string variations of the authors, location and title of the DocRef we are searching for. This has many complications and may slow down the creation of a richsequence object when RichObjectFactory is linked to an active hibernate session.
> > >
> > > Example: Below is a sample of what I have in my local biosql schema, which has the modification suggested by me. (The dbxref_id column shows the Pubmed_id: I replaced the local dbxref_id present in this table in my database with the pubmed_id stored in the "dbxref" table, for easy reference with the outside world in this email.)
> > > Reference_id
> > > Dbxref_id
> > > Location
> > > Title
> > > Authors
> > > crc
> > > 216
> > > 18554304
> > > FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)
> > > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model
> > > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H.
> > > 9E940E01F4BE3CD0
> > > 230
> > > 18554304
> > > FEMS Microbiol. Ecol.
66 (3), 528-536 (2008) > > > Isolation of lactate-utilizing butyrate-producing bacteria from human > feces and in vivo administration of Anaerostipes caccae strain L2 and > galacto-oligosaccharides in a rat model > > > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., > Nomoto,K., Ito,M. and Sawada,H. > > > D3BC0C17F3F786C9 > > > 415 > > > 16790744 > > > Infect. Immun. 74 (7), 3715-3726 (2006) > > > Intrastrain Heterogeneity of the mgpB Gene in Mycoplasma genitalium Is > Extensive In Vitro and In Vivo and Suggests that Variation Is Generated via > Recombination with Repetitive Chromosomal Sequences > > > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and > Totten,P.A. > > > 60AEDFA0CEEACC38 > > > 969 > > > 16790744 > > > Infect. Immun. 74 (7), 3715-3726 (2006) > > > Intrastrain heterogeneity of the mgpB gene in mycoplasma genitalium is > extensive in vitro and in vivo and suggests that variation is generated via > recombination with repetitive chromosomal sequences > > > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and > Totten,P.A. > > > 4B1232999F6E8130 > > > 929 > > > 8688087 > > > Science 273 (5278), 1058-1073 (1996) > > > Complete genome sequence of the methanogenic archaeon, Methanococcus > jannaschii > > > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., > Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., > Kerlavage,A.R., Dougherty,B.A., Tomb,J.-F., Adams,M.D., Reich,C.I., > Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., > Scott,J.L., Geoghagen,N.S.M., Weidman,J.F., Fuhrmann,J.L., Presley,E.A., > Nguyen,D., Utterback,T.R., Kelley,J.M., Peterson,J.D., Sadow,P.W., > Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.P., > Borodovsky,M., Klenk,H.-P., Fraser,C.M., Smith,H.O., Woese,C.R. and > Venter,J.C. 
> > > 3E79B40DD2AAA2B7 > > > 932 > > > 8688087 > > > Science 273 (5278), 1058-1073 (1996) > > > Complete genome sequence of the methanogenic archaeon, Methanococcus > jannaschii > > > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., > Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., > Kerlavage,A.R., Dougherty,B.A., Tomb,J., Adams,M.D., Reich,C.I., > Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., > Scott,J.D., Geoghagen,N.S., Weidman,J.F., Fuhrmann,J.L., Nguyen,D.T., > Utterback,T., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., > Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.B., Borodovsky,M., > Klenk,H.P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. > > > 094EB3384F8D6DE8 > > > 1426 > > > 10684935 > > > Nucleic Acids Res. 28 (6), 1397-1406 (2000) > > > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae > AR39 > > > Read,T.D., Brunham,R.C., Shen,C., Gill,S.R., Heidelberg,J.F., White,O., > Hickey,E.K., Peterson,J., Umayam,L.A., Utterback,T., Berry,K., Bass,S., > Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., > Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S.L., Eisen,J. and > Fraser,C.M. > > > 357648D8FD8C6C8A > > > 1481 > > > 10684935 > > > Nucleic Acids Res. 28 (6), 1397-1406 (2000) > > > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae > AR39 > > > Read,T., Brunham,R., Shen,C., Gill,S., Heidelberg,J., White,O., > Hickey,E., Peterson,J., Utterback,T., Berry,K., Bass,S., Linher,K., > Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., > DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S., Eisen,J. and Fraser,C. > > > 115411EB2DEE5654 > > > 1497 > > > 14689165 > > > Arch. Microbiol. 
181 (2), 144-154 (2004) > > > The effect of FITA mutations on the symbiotic properties of > Sinorhizobium fredii varies in a chromosomal-background-dependent manner > > > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., > del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. > and Ruiz-Sainz,J.E. > > > 4D5D376EECCD186B > > > 1501 > > > 14689165 > > > Arch. Microbiol. 181 (2), 144-154 (2004) > > > The effect of FITA mutations on the symbiotic properties of > Sinorhizobium fredii varies in a chromosomal-background-dependent manner > > > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., > Del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. > and Ruiz-Sainz,J.E. > > > 4D57954EECDED66B > > > 1556 > > > 18060065 > > > PLoS ONE 2 (12), E1271 (2007) > > > Analysis of the Neurotoxin Complex Genes in Clostridium botulinum A1-A4 > and B1 Strains: BoNT/A3, /Ba4 and /B1 Clusters Are Located within Plasmids > > > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,A.C., Bruce,D.C., > Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. > > > 698688FB6DB95247 > > > 1559 > > > 18060065 > > > PLoS ONE 2 (12), E1271 (2007) > > > Analysis of the neurotoxin complex genes in Clostridium botulinum A1-A4 > and B1 strains: BoNT/A3, /Ba4 and /B1 clusters are located within plasmids > > > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,C.A., Bruce,D.C., > Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. > > > E25E1BA99DB18F3D > > > > > > ? The second kind of error which I got was : > org.hibernate.PropertyValueException: not-null property references a null or > transient value: Location.feature > > > ? Which means in richsequence object some feature have > location object which have its feature set to null. > > > ? My Observation: > > > ? 
Usually occurs when you try to persist a RichSequence object to the database, and affects those features which have a CompoundRichLocation, usually "joins" and "complement" in the CDS region of a GenBank record.
> > > - After catching the Hibernate exception I went through all the features, and either BioJava or Hibernate had changed the object type of a CompoundRichLocation to SimpleRichLocation and set the feature variable to null.
> > > - Below are screenshots from one of my tests:
> > >   - Settings before trying to persist the RichSequence object to the database
> > >   - After trying to persist the RichSequence object to the database, inside the Hibernate exception catch block
> > > - So my question is: why is this happening, and how can I stop it or otherwise get these records into the database? I have no clue why this is happening.
> > > - Some extra information to make things clearer:
> > > - Below are some LOCUS lines from GenBank records for which I know the location error (that is, the CDS region causing the error), together with the array index in the richSequence.feature ArrayList:
> > >   LOCUS AE001439 1643831 bp DNA circular BCT 19-JAN-2006
> > >   richSequence.feature index: 2540, line number in the GenBank record: 22115
> > >   LOCUS CP001189 3887492 bp DNA circular BCT 16-OCT-2008
> > >   richSequence.feature index: 127, line number in the GenBank record: 2137
> > >   LOCUS CP001292 328635 bp DNA circular BCT 17-DEC-2008
> > >   richSequence.feature index: 389, line number in the GenBank record: 3632
> > >   LOCUS AM279694 238517 bp DNA linear BCT 23-OCT-2008
> > >   richSequence.feature index: 47, line number in the GenBank record: 4841
> > >   LOCUS CR931663 18517 bp DNA linear BCT 18-SEP-2008
> > >   richSequence.feature index: 45, line number in the GenBank record: 442
The complete exception msg : > > > org.hibernate.PropertyValueException: not-null property references a > null or transient value: Location.feature > > > at > org.hibernate.engine.Nullability.checkNullability(Nullability.java:72) > > > at > org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:290) > > > at > org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > > at > org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > > at > org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > > > at > org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > > > at > org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > > > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > > > at > org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > > > at > org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > > at > org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > > > at > org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > > > at > org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > > > at > org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > > > at > 
org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > > > at > org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > > > at > org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > > at > org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > > at > org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > > > at > org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > > > at > org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > > > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > > > at > org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > > > at > org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > > at > org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > > > at > org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > > > at > org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > > > at > org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > > > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > > > at > org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > > > at > 
org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > > > at > org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > > at > org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > > > at > org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > > > at > org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27) > > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > > at > org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535) > > > at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523) > > > at > trashtesting.GenBankLoaderTesting.main(GenBankLoaderTesting.java:78) > > > > > > > > > > -- > > Richard Holland, BSc MBCS > > Operations and Delivery Director, Eagle Genomics Ltd > > T: +44 (0)1223 654481 ext 3 | E: > > holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > > > > > > > > > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > From deepak.sheoran at orionbiosciences.com Thu Mar 25 14:46:57 2010 From: deepak.sheoran at orionbiosciences.com (Deepak Sheoran) Date: Thu, 25 Mar 2010 13:46:57 -0500 Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject : ( Hibernate Exception and suggestion for change in BioSqlSchema) In-Reply-To: <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com> 
References: <4BAABA21.4000301@gmail.com> <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
Message-ID: <4BABAFA1.6090806@orionbiosciences.com>

That is the reason why I was getting an error when I was creating a RichSequence object without any active session to BioSQL. I didn't have a clue that I had created one more bug by fixing one; thanks for noticing that and fixing it.

I am wondering whether we should propose BioPerl, BioJava and BioSQL compatibility as one of the Google Summer of Code projects. I have a vision for this, but don't know the right way to begin. It could help people who want to use BioJava but can't because they are afraid to lose their Perl code, which is heavily dependent on the Perl way of loading the schema. Alternatively, we could come up with a hybrid approach that takes the good parts from both languages.

Deepak Sheoran

On 3/25/2010 11:27 AM, Richard Holland wrote:
> Patched and in subversion on the head in the new Biojava 3 code. I modified the code slightly to simplify it. There were also parallel changes required over in SimpleDocRef itself to enable it to continue working without being connected to BioSQL.
>
> On 25 Mar 2010, at 01:19, Deepak Sheoran wrote:
>
>> I am writing this email again because I didn't get any response about whether these bugs have been patched or whether the patches got lost somewhere on the mailing list. I am not sure, which is why I am writing again. I don't know how to apply these patches, so I am counting on you guys to apply them and reply to me so I know it's fixed.
>>
>> Thanks
>> Deepak Sheoran
>>
>> Hi
>> In response to the bug fix suggested by Richard I have created some patches. We need to apply these to stop BioJava from processing the references of a GenBank record in a wrong manner, which causes more Hibernate exceptions. After applying the patches, the reference resolution code will test the PubMed or MEDLINE ID, then if there is no match test author/title/location, then if there is still no match create a new reference.
I even tested it with GenBank release 175 and gained almost 3159 more records in my database.
>>
>> Can somebody please have a look at the second issue and fix it:
>> "
>> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
>> "
>>
>> Also, I am planning on making a bridge between BioSQL databases loaded using BioPerl and BioJava. Here is some of my investigation; can you guys suggest some direction on it?
>> Have a look at the attached files:
>> 1) Biojava_BioPerl_Diff.xls ==> a view of the tables where a GenBank record is stored in a BioSQL instance by BioPerl and by BioJava
>> 2) GenbankRecord.doc ==> a Word document containing a GenBank record, showing where its information goes in BioSQL using BioPerl and BioJava
>> 3) BioSqlRichobjectBuilder.patch ==> patch needed for the BioSqlRichObjectBuilder.java class
>> 4) GenBankFormat.patch ==> patch needed for the GenBankFormat.java class
>>
>> Thanks
>> Deepak Sheoran
>>
>> -------- Original Message --------
>> Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema
>> Date: Tue, 9 Feb 2010 20:34:32 +1300
>> From: Richard Holland
>> To: Deepak Sheoran
>> CC: biojava-l at biojava.org
>>
>> Hi. It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text.
>>
>> However, in answer to your two questions:
>>
>> 1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records.
I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March).
>>
>> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
>>
>> cheers,
>> Richard
>>
>> On 9 Feb 2010, at 20:21, Deepak Sheoran wrote:
>>
>>> Hi Richard
>>>
>>> Below is the email which I sent to the biojava-l mailing list, but it never got posted on the mailing list server, nor did I get any response. Please have a look at this email and tell me what the solution could be to the problem described in the message.
>>>
>>> Thanks
>>> Deepak Sheoran
>>> -------- Original Message --------
>>> Subject: Hibernate Exception and suggestion for change in BioSqlSchema
>>> Date: Wed, 03 Feb 2010 08:07:35 -0600
>>> From: Deepak Sheoran
>>> To: biojava-l at lists.open-bio.org
>>>
>>> Hi guys,
>>>
>>> A couple of days back I was having a problem with a Hibernate exception, but that exception got resolved; the reference to that email is:
>>> http://old.nabble.com/Hibernate-Exception-when-persisting-some-richsequence-object-to-biosql-schema-to27299245.html
>>> Following Richard's suggestion in the above link I was able to resolve some of the issues, but then I got stuck on another Hibernate error and decided to investigate the matter. Below are some facts and information which I found, and I guess it is going to affect all of us.
>>> - The "reference" table in the BioSQL schema has a unique constraint on the "dbxref_id" column (CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id)), which means only one entry in the reference table can use a given dbxref_id.
>>> This works well, except in cases where there are small variations in the values of the "location", "title" and "authors" columns and all these variations refer to the same PUBMED_ID. Then we can't persist or create a RichSequence object.
>>> Now, when you tie RichObjectFactory to an active Hibernate session, the class "BioSqlRichObjectBuilder" has a method called "buildObject(Class clazz, List paramsList)" which is responsible for looking up the details of an object in the database. If it finds one it returns that object; otherwise it tries to persist the new object into the database.
>>> But the problem is with the following part of that method (lines 114-123):
>>>     else if (SimpleDocRef.class.isAssignableFrom(clazz)) {
>>>         queryType = "DocRef";
>>>         // convert List constructor to String representation for query
>>>         ourParamsList.set(0, DocRefAuthor.Tools.generateAuthorString((List)ourParamsList.get(0), true));
>>>         if (ourParamsList.size()<3) {
>>>             queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title is null";
>>>         } else {
>>>             queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title = ?";
>>>         }
>>>     }
>>> Now, when Hibernate searches the database, it won't find the existing record in the "reference" table because the two records differ in a string comparison, so it returns a new object to the following piece of code in "GenbankFormat" (lines 447-455):
>>>     else {
>>>         try {
>>>             CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)});
>>>             RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
>>>             rlistener.getCurrentFeature().addRankedCrossRef(rcr);
>>>         } catch (ChangeVetoException e) {
>>>             throw new ParseException(e+", accession:"+accession);
>>>         }
>>>     }
>>> Then we add that object to rlistener and move on to the next part of the GenBank record. When BioJava then searches the database for a new crossref, it tries to persist the earlier object and gets a Hibernate exception about violating the unique constraint on the "dbxref_id" column.
>>>
>>> The ways I can see to get these records into the database are:
>>> - The very easy solution, and the way I did it to test my theory, is to change the BioSQL schema so that it allows a many-to-one relation between the "reference" and "dbxref" tables. This even makes sense, because one paper can appear under many different naming variations, and this change allows us to store that information too. But this is something the BioSQL people have to decide, and I don't know how to approach them.
>>> - The second solution, slightly more difficult to implement, is to change the way "BioSqlRichObjectBuilder.buildObject(Class clazz,List paramsList)" decides whether a particular DocRef already exists in the database or not. I mean testing all possible string variations of the authors, location and title of the DocRef we are searching for. This has many complications and may slow down the process of creating a RichSequence object when RichObjectFactory is linked with an active Hibernate session.
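An alternative to testing every string variation is the three-step lookup suggested by Richard in this thread: match on the ID first, then on author/title/location, and only then create a new reference. The following is a minimal sketch of that order of checks, assuming a hypothetical in-memory store; DocRefResolver, DocRef and resolve are illustrative names and not the BioJava or Hibernate API, and the author and title strings are abbreviated placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class DocRefResolver {
    /** Hypothetical, simplified reference record. */
    static final class DocRef {
        final Integer pubmedId; // null when the record carries no PubMed/MEDLINE ID
        final String authors;
        final String location;
        final String title;     // may be null

        DocRef(Integer pubmedId, String authors, String location, String title) {
            this.pubmedId = pubmedId;
            this.authors = authors;
            this.location = location;
            this.title = title;
        }
    }

    // Stand-in for the rows already persisted in the database.
    private final List<DocRef> stored = new ArrayList<>();

    DocRef resolve(Integer pubmedId, String authors, String location, String title) {
        // Step 1: an existing record with the same ID wins, whatever its spelling.
        if (pubmedId != null) {
            for (DocRef ref : stored) {
                if (pubmedId.equals(ref.pubmedId)) {
                    return ref;
                }
            }
        }
        // Step 2: fall back to the exact author/location/title comparison.
        for (DocRef ref : stored) {
            if (ref.authors.equals(authors)
                    && ref.location.equals(location)
                    && Objects.equals(ref.title, title)) {
                return ref;
            }
        }
        // Step 3: nothing matched, so create and remember a new reference.
        DocRef created = new DocRef(pubmedId, authors, location, title);
        stored.add(created);
        return created;
    }

    int size() {
        return stored.size();
    }

    public static void main(String[] args) {
        DocRefResolver resolver = new DocRefResolver();
        DocRef a = resolver.resolve(18554304, "Sato,T. et al.",
                "FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)", "placeholder title");
        DocRef b = resolver.resolve(18554304, "Sato,T. et al.",
                "FEMS Microbiol. Ecol. 66 (3), 528-536 (2008)", "placeholder title");
        System.out.println("same reference reused: " + (a == b));
    }
}
```

With the ID checked first, the two spellings of the FEMS Microbiol. Ecol. location resolve to one stored reference, so no second row competes for the same dbxref_id. The trade-off in this sketch is that the first spelling seen is the one kept.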
>>> >>> Example:Below is a sample of what i have in my local biosql schema which has modification suggested by me. (dbxref_id column have Pubmed_id , I replaced the local dbxref_id which was present on this table in my database with pubmed_id stored in "dbxref" table, for easy reference with outside world in this email) >>> Reference_id >>> Dbxref_id >>> Location >>> Title >>> Authors >>> crc >>> 216 >>> 18554304 >>> FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008) >>> Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model >>> Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. >>> 9E940E01F4BE3CD0 >>> 230 >>> 18554304 >>> FEMS Microbiol. Ecol. 66 (3), 528-536 (2008) >>> Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model >>> Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. >>> D3BC0C17F3F786C9 >>> 415 >>> 16790744 >>> Infect. Immun. 74 (7), 3715-3726 (2006) >>> Intrastrain Heterogeneity of the mgpB Gene in Mycoplasma genitalium Is Extensive In Vitro and In Vivo and Suggests that Variation Is Generated via Recombination with Repetitive Chromosomal Sequences >>> Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. >>> 60AEDFA0CEEACC38 >>> 969 >>> 16790744 >>> Infect. Immun. 74 (7), 3715-3726 (2006) >>> Intrastrain heterogeneity of the mgpB gene in mycoplasma genitalium is extensive in vitro and in vivo and suggests that variation is generated via recombination with repetitive chromosomal sequences >>> Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. 
>>> 4B1232999F6E8130 >>> 929 >>> 8688087 >>> Science 273 (5278), 1058-1073 (1996) >>> Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii >>> Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J.-F., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.L., Geoghagen,N.S.M., Weidman,J.F., Fuhrmann,J.L., Presley,E.A., Nguyen,D., Utterback,T.R., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.P., Borodovsky,M., Klenk,H.-P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. >>> 3E79B40DD2AAA2B7 >>> 932 >>> 8688087 >>> Science 273 (5278), 1058-1073 (1996) >>> Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii >>> Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.D., Geoghagen,N.S., Weidman,J.F., Fuhrmann,J.L., Nguyen,D.T., Utterback,T., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.B., Borodovsky,M., Klenk,H.P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. >>> 094EB3384F8D6DE8 >>> 1426 >>> 10684935 >>> Nucleic Acids Res. 28 (6), 1397-1406 (2000) >>> Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 >>> Read,T.D., Brunham,R.C., Shen,C., Gill,S.R., Heidelberg,J.F., White,O., Hickey,E.K., Peterson,J., Umayam,L.A., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S.L., Eisen,J. and Fraser,C.M. >>> 357648D8FD8C6C8A >>> 1481 >>> 10684935 >>> Nucleic Acids Res. 
28 (6), 1397-1406 (2000) >>> Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 >>> Read,T., Brunham,R., Shen,C., Gill,S., Heidelberg,J., White,O., Hickey,E., Peterson,J., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S., Eisen,J. and Fraser,C. >>> 115411EB2DEE5654 >>> 1497 >>> 14689165 >>> Arch. Microbiol. 181 (2), 144-154 (2004) >>> The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner >>> Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. >>> 4D5D376EECCD186B >>> 1501 >>> 14689165 >>> Arch. Microbiol. 181 (2), 144-154 (2004) >>> The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner >>> Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., Del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. >>> 4D57954EECDED66B >>> 1556 >>> 18060065 >>> PLoS ONE 2 (12), E1271 (2007) >>> Analysis of the Neurotoxin Complex Genes in Clostridium botulinum A1-A4 and B1 Strains: BoNT/A3, /Ba4 and /B1 Clusters Are Located within Plasmids >>> Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,A.C., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. >>> 698688FB6DB95247 >>> 1559 >>> 18060065 >>> PLoS ONE 2 (12), E1271 (2007) >>> Analysis of the neurotoxin complex genes in Clostridium botulinum A1-A4 and B1 strains: BoNT/A3, /Ba4 and /B1 clusters are located within plasmids >>> Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,C.A., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. >>> E25E1BA99DB18F3D >>> >>> ? 
The second kind of error which I got was: org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature
>>> - This means that, in the RichSequence object, some feature has a location object whose feature is set to null.
>>> - My observations:
>>> - It usually occurs when you try to persist a RichSequence object to the database, and affects those features which have a CompoundRichLocation, usually "joins" and "complement" in the CDS region of a GenBank record.
>>> - After catching the Hibernate exception I went through all the features, and either BioJava or Hibernate had changed the object type of a CompoundRichLocation to SimpleRichLocation and set the feature variable to null.
>>> - Below are screenshots from one of my tests:
>>>   - Settings before trying to persist the RichSequence object to the database
>>>   - After trying to persist the RichSequence object to the database, inside the Hibernate exception catch block
>>> - So my question is: why is this happening, and how can I stop it or otherwise get these records into the database? I have no clue why this is happening.
>>> - Some extra information to make things clearer:
>>> - Below are some LOCUS lines from GenBank records for which I know the location error (that is, the CDS region causing the error), together with the array index in the richSequence.feature ArrayList:
>>>   LOCUS AE001439 1643831 bp DNA circular BCT 19-JAN-2006
>>>   richSequence.feature index: 2540, line number in the GenBank record: 22115
>>>   LOCUS CP001189 3887492 bp DNA circular BCT 16-OCT-2008
>>>   richSequence.feature index: 127, line number in the GenBank record: 2137
>>>   LOCUS CP001292 328635 bp DNA circular BCT 17-DEC-2008
>>>   richSequence.feature index: 389, line number in the GenBank record: 3632
>>>   LOCUS AM279694 238517 bp DNA linear BCT 23-OCT-2008
>>>   richSequence.feature index: 47, line number in the GenBank record: 4841
LOCUS CR931663 18517 bp DNA linear BCT 18-SEP-2008 >>> ? richSequence.feature Index : 45 and line number in the genbank record : 442 >>> ? The complete exception msg : >>> org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature >>> at org.hibernate.engine.Nullability.checkNullability(Nullability.java:72) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:290) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) >>> at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) >>> at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) >>> at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) >>> at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) >>> at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) >>> at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at 
org.hibernate.engine.Cascade.cascade(Cascade.java:130) >>> at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) >>> at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) >>> at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) >>> at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) >>> at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) >>> at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) >>> at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at org.hibernate.engine.Cascade.cascade(Cascade.java:130) >>> at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) >>> at 
org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) >>> at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) >>> at org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) >>> at org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) >>> at org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535) >>> at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523) >>> at trashtesting.GenBankLoaderTesting.main(GenBankLoaderTesting.java:78) >>> >>> >>> >> -- >> Richard Holland, BSc MBCS >> Operations and Delivery Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: >> holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> >> >> >> >> > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > From biopython at maubp.freeserve.co.uk Thu Mar 25 18:16:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Mar 2010 22:16:55 +0000 Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject : ( Hibernate Exception and suggestion for change in BioSqlSchema) In-Reply-To: <4BABAFA1.6090806@orionbiosciences.com> References: <4BAABA21.4000301@gmail.com> 
	<4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com> <4BABAFA1.6090806@orionbiosciences.com>
Message-ID: <320fb6e01003251516w2977ab2h9869342f94576287@mail.gmail.com>

On Thu, Mar 25, 2010 at 6:46 PM, Deepak Sheoran wrote:
>
> That is the reason why I was getting an error when I was creating a RichSequence
> object without any active session to BioSQL. I didn't have a clue that I
> created one more bug by fixing one; thanks for noticing that and fixing
> that.
>
> I am thinking we should use BioPerl-BioJava and BioSQL compatibility as
> one of the Google Summer of Code projects. I have a vision for this, but don't
> know the right way to begin. This can help people who want to use BioJava
> but can't because they are afraid to lose their Perl code, which is heavily
> dependent on the Perl way of loading the schema. Or come up with a hybrid way
> which takes the good from both languages.
>
> Deepak Sheoran

That is an interesting idea for GSoC, I wonder if we at Biopython
should do the same. I know of a few things where we differ from
BioPerl's BioSQL support (e.g. SwissProt comment lines).

[I take it we agree that bioperl-db is the de facto reference
implementation for mapping GenBank etc into BioSQL?]

Peter

From bugzilla-daemon at portal.open-bio.org Fri Mar 26 02:14:17 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 02:14:17 -0400
Subject: [Biojava-dev] [Bug 3035] New: ParseException thrown when parsing PDB file.
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

           Summary: ParseException thrown when parsing PDB file.
           Product: BioJava
           Version: unspecified
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: structure
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: nakagawa-hiroyuki at mki.co.jp

When reading a PDB file using org.biojava.bio.structure.io.PDBFileReader on a non-English platform, java.text.ParseException is thrown.
java.text.ParseException: Unparseable date: "26-DEC-97"
        at java.text.DateFormat.parse(Unknown Source)
        at org.biojava.bio.structure.io.PDBFileParser.pdb_HEADER_Handler(PDBFileParser.java:433)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2067)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructure(PDBFileReader.java:486)
        at org.biojava.bio.structure.io.PDBFileReader.getStructure(PDBFileReader.java:466)
        at Test.main(Test.java:9)

To reproduce this symptom:
1. Set your operating system's default locale to a non-English one (e.g. Japanese).
2. Then run the test code described below.
Or simply run the test code with the option "-Duser.language=ja":

> java -Duser.language=ja Test

----Begin Test.java ----
import org.biojava.bio.structure.io.PDBFileReader;
import org.biojava.bio.structure.Structure;

public class Test {
    public static void main(String[] args) {
        String filename = "1a2b.pdb";
        PDBFileReader pdbreader = new PDBFileReader();
        try {
            Structure structure = pdbreader.getStructure(filename);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
----End Test.java ----

The cause is that java.text.SimpleDateFormat can't parse the PDB-style "dd-MMM-yy" date format in some non-English locales.
I attached a patch to correct this problem.

---- Begin PDBFileParser.java.diff ----
*** .\biojava-1.7.1\src\org\biojava\bio\structure\io\PDBFileParser.java.orig 2010-01-24 22:35:24.000000000 +0900
--- .\biojava-1.7.1\src\org\biojava\bio\structure\io\PDBFileParser.java 2010-03-19 11:34:28.571551900 +0900
***************
*** 271,277 ****
  current_compound = new Compound();
  dbrefs = new ArrayList();
! dateFormat = new SimpleDateFormat("dd-MMM-yy");
  atomCount = 0;
  atomOverflow = false;
--- 271,277 ----
  current_compound = new Compound();
  dbrefs = new ArrayList();
! dateFormat = new SimpleDateFormat("dd-MMM-yy", java.util.Locale.ENGLISH);
  atomCount = 0;
  atomOverflow = false;
---- End PDBFileParser.java.diff ----

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Mar 26 02:18:26 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 02:18:26 -0400
Subject: [Biojava-dev] [Bug 3035] ParseException thrown when parsing PDB file.
In-Reply-To:
Message-ID: <201003260618.o2Q6IQEV023480@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

------- Comment #1 from nakagawa-hiroyuki at mki.co.jp 2010-03-26 02:18 EST -------
Created an attachment (id=1467)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1467&action=view)
A patch to correct this problem

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Mar 26 12:25:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 12:25:14 -0400
Subject: [Biojava-dev] [Bug 3035] ParseException thrown when parsing PDB file.
In-Reply-To:
Message-ID: <201003261625.o2QGPEVe012950@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

andreas at sdsc.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
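
For anyone who wants to see the effect of the patch for bug 3035 outside of BioJava, here is a minimal standalone sketch. The class name PdbDateLocaleDemo and the helper parsePdbDate are made up for illustration and are not part of BioJava; only the "dd-MMM-yy" pattern and the java.util.Locale.ENGLISH argument come from the patch. It pins the month abbreviations to English instead of relying on the platform default locale. Whether the Japanese-locale call is rejected depends on the locale data shipped with the JDK, so the sketch only reports what happens.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class PdbDateLocaleDemo {

    // Hypothetical helper (not a BioJava method): parses a PDB HEADER date
    // such as "26-DEC-97" using the month names of the given locale.
    // The checked ParseException is wrapped so callers need no throws clause.
    static Date parsePdbDate(String text, Locale locale) {
        try {
            return new SimpleDateFormat("dd-MMM-yy", locale).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Month abbreviations pinned to English, as in the patch.
        Calendar cal = new GregorianCalendar();
        cal.setTime(parsePdbDate("26-DEC-97", Locale.ENGLISH));
        System.out.println("English locale: " + cal.get(Calendar.YEAR) + "-"
                + (cal.get(Calendar.MONTH) + 1) + "-"
                + cal.get(Calendar.DAY_OF_MONTH));

        // Same input with a locale whose short month names are not English.
        // The outcome depends on the JDK's locale data, so just report it.
        try {
            parsePdbDate("26-DEC-97", Locale.JAPANESE);
            System.out.println("Japanese locale: parsed");
        } catch (IllegalArgumentException e) {
            System.out.println("Japanese locale: " + e.getMessage());
        }
    }
}
```

Passing the locale explicitly keeps the parser independent of -Duser.language and of the operating system settings, which is why the one-line change in the patch is enough.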
From bugzilla-daemon at portal.open-bio.org Fri Mar 26 12:27:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 12:27:56 -0400
Subject: [Biojava-dev] [Bug 3035] ParseException thrown when parsing PDB file.
In-Reply-To:
Message-ID: <201003261627.o2QGRu2r013123@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

andreas at sdsc.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from andreas at sdsc.edu 2010-03-26 12:27 EST -------
applied user provided patch, problem should be fixed.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From andreas at sdsc.edu Sun Mar 28 22:02:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 28 Mar 2010 19:02:49 -0700
Subject: [Biojava-dev] Biojava3 structure
In-Reply-To:
References:
Message-ID: <59a41c431003281902ic2c5ed3h4a2383899f465a8@mail.gmail.com>

Hi Scooter,

at the present the structure modules depend on the alignment module and on the (old) core module. This is for aligning ATOM and SEQRES residues in the PDB files, and for the Smith Waterman alignment based 3D structure superposition.

If we target a release of biojava 3 in about a month, I don't think it will be possible to break this out, mainly because the alignment module is still based on the biojava 1 code base. Overall I think that the core module probably should still be part of the BioJava 3 release. Any opinions on that?

Andreas

On Sun, Mar 28, 2010 at 3:06 PM, Scooter Willis wrote:
> Andreas
>
> I needed to do some work with a PDB file so started to use the structure
> library. It looks like it depends on all the old biojava code.
Mainly the > structure exceptions that extend bioexception is the first thing tripping me > up. Should the biojava3-structure module have any external dependencies or > am I working with the wrong package? > > Thanks > > Scooter From andreas at sdsc.edu Fri Mar 5 16:56:40 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Fri, 5 Mar 2010 08:56:40 -0800 Subject: [Biojava-dev] Google summer of code Message-ID: <59a41c431003050856v17c83b80sf1fb59f2587c9cd1@mail.gmail.com> Hi, The Open Bioinformatics Foundation (BioJava's mother organisation) is preparing an application for the Google Summer of Code. If you are interested in becoming a mentor for a BioJava related project, you can join us in the application. If you are a student and are interested in a project, please take a look at these pages: http://www.open-bio.org/wiki/Google_Summer_of_Code http://biojava.org/wiki/Google_Summer_of_Code Andreas From yogeshp08 at gmail.com Sat Mar 6 19:38:13 2010 From: yogeshp08 at gmail.com (Yogesh) Date: Sat, 6 Mar 2010 14:38:13 -0500 Subject: [Biojava-dev] Modules + GSoC2010 Message-ID: <193861401003061138gbd0fa77t785eaa15a25a971c@mail.gmail.com> Hello, I am a Graduate student in Bioinformatics. I am thrilled to know that OBF is particiapting in GSoC2010 I also wish to participate in GSoC2010 for the first time this year. I will like to apply for a project related to BioJava. I am very comfortable with Java. Also, I use BioJava very often. One of the projects from BioJava::Modules that I like and I think I can do is: Support for SCOP file parsing. Can I have some help on how to go about this project? Another project that I would like to contribute to is: Develop a multiple sequence alignment algorithm entirely written in Java More info on this will also help me decide on which project to apply for in GSoC2010. Thank you. 
Regards, -Yogesh From holland at eaglegenomics.com Mon Mar 15 10:34:14 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 15 Mar 2010 10:34:14 +0000 Subject: [Biojava-dev] Hackathon in Boston, July 2010 Message-ID: <5FC2D8EC-5408-4126-9A7D-CB6B3500B61C@eaglegenomics.com> Hi all, Following the successful hackathon in Cambridge earlier this year, it was originally planned to hold a second one in Boston in conjunction with BOSC in order to give those who couldn't make it to the UK a chance to get involved. However, OBF have beaten us to it by organising a cross-project CodeFest! http://www.open-bio.org/wiki/Codefest_2010 It would be great for BioJava people to get involved with this cross-project hackathon effort, and it saves organising one of our own! :) All relevant info is on the web page linked to above, and if you have any questions, ask Brad as detailed on the page. cheers, Richard -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Tue Mar 16 15:57:38 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 16 Mar 2010 08:57:38 -0700 Subject: [Biojava-dev] biojava 3 progress Message-ID: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> Hi, ISMB/BOSC is coming up rapidly and we should start to prepare for the annual BioJava release. As such it would be a good moment to discuss the current status of the various new BioJava 3 modules. The biojava-structure, biojava-structure-gui modules are essentially ready for release and I started to update the Cookbook with the latest features http://biojava.org/wiki/BioJava:CookBook:PDB:align Some of the re-factored modules based on biojava 1.7 could be released anytime soon as well. The documentation just needs to be updated to explain where the functionality can be found now (e.g. 
alignment module) What about the new code that has been under development since the hackathon? Is it getting release ready slowly? Any plans for documentation? What is missing before we can make the first Biojava 3 release? Andreas From ayates at ebi.ac.uk Tue Mar 16 17:21:48 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 16 Mar 2010 17:21:48 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> Message-ID: <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> It's getting ready very slowly. Currently we need: * Locations correctly implemented ** There's no way of requesting subseqs from them atmo * Feature on sequences support * Extra attributes which do not fit into top-level attributes * Mapping between sequences/assemblies * circular location support ** so no checks on start being less than end * Documentation Think that's it off the top of my head Andy On 16 Mar 2010, at 15:57, Andreas Prlic wrote: > Hi, > > ISMB/BOSC is coming up rapidly and we should start to prepare for the annual > BioJava release. As such it would be a good moment to discuss the current > status of the various new BioJava 3 modules. > > The biojava-structure, biojava-structure-gui modules are essentially ready > for release and I started to update the Cookbook with the latest features > http://biojava.org/wiki/BioJava:CookBook:PDB:align > > Some of the re-factored modules based on biojava 1.7 could be released > anytime soon as well. The documentation just needs to be updated to explain > where the functionality can be found now (e.g. alignment module) > > What about the new code that has been under development since the hackathon? > Is it getting release ready slowly? Any plans for documentation? What is > missing before we can make the first Biojava 3 release? 
> > Andreas > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From HWillis at scripps.edu Tue Mar 16 18:51:04 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Tue, 16 Mar 2010 14:51:04 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> Message-ID: I am working on adding in additional features to the core module to round things out and will be able to do docs/wiki examples. I will be working on Features with the new sequence model and the ability to pull features from uniprot based on uniprot id as an example. I will use uniprot XML as the data model when figuring out the feature data model such that classes have biology relevance instead of being completely abstract. I will also see if I can do something with NCBI for genome sequence data where you don't need to download the entire sequence but based on gff annotations you can pull dna sequences for exons belonging to a particular gene. I will also plan on migrating the sequence alignment code as well. I think the focus for this release should be on the modularization of the modules and the maven integration. We also need to provide a repository for those who are not going to use maven and need just the jar files. We can then highlight the newer modules as a benefit of the modularization. I am planning on attending ISMB/BOSC. Do we want to put some deadlines in place with a mini-project plan? Thanks Scooter On Mar 16, 2010, at 1:21 PM, Andy Yates wrote: > It's getting ready very slowly. 
Currently we need: > > * Locations correctly implemented > ** There's no way of requesting subseqs from them atmo > * Feature on sequences support > * Extra attributes which do not fit into top-level attributes > * Mapping between sequences/assemblies > * circular location support > ** so no checks on start being less than end > * Documentation > > Think that's it off the top of my head > > Andy > > On 16 Mar 2010, at 15:57, Andreas Prlic wrote: > >> Hi, >> >> ISMB/BOSC is coming up rapidly and we should start to prepare for the annual >> BioJava release. As such it would be a good moment to discuss the current >> status of the various new BioJava 3 modules. >> >> The biojava-structure, biojava-structure-gui modules are essentially ready >> for release and I started to update the Cookbook with the latest features >> http://biojava.org/wiki/BioJava:CookBook:PDB:align >> >> Some of the re-factored modules based on biojava 1.7 could be released >> anytime soon as well. The documentation just needs to be updated to explain >> where the functionality can be found now (e.g. alignment module) >> >> What about the new code that has been under development since the hackathon? >> Is it getting release ready slowly? Any plans for documentation? What is >> missing before we can make the first Biojava 3 release? 
>> >> Andreas >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev From andreas at sdsc.edu Tue Mar 16 20:58:02 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 16 Mar 2010 13:58:02 -0700 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> Message-ID: <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> Ok, cool. Thanks for all this state-of-the-art pushing there... Which parts do you think would be feasible to finish, if we would say we are planning a release e.g. early May ? We can have a follow-up to this release once the next round of features have been added. Probably it makes sense to focus on stabilizing what is currently there and documenting it, rather than trying to be feature-complete. Critical features that are still missing should be added of course... Andreas On Tue, Mar 16, 2010 at 11:51 AM, Scooter Willis wrote: > I am working on adding in additional features to the core module to round > things out and will be able to do docs/wiki examples. I will be working on > Features with the new sequence model and the ability to pull features from > uniprot based on uniprot id as an example. I will use uniprot XML as the > data model when figuring out the feature data model such that classes have > biology relevance instead of being completely abstract. 
> > I will also see if I can do something with NCBI for genome sequence data > where you don't need to download the entire sequence but based on gff > annotations you can pull dna sequences for exons belonging to a particular > gene. > > I will also plan on migrating the sequence alignment code as well. > > I think the focus for this release should be on the modularization of the > modules and the maven integration. We also need to provide a repository for > those who are not going to use maven and need just the jar files. We can > then highlight the newer modules as a benefit of the modularization. > > I am planning on attending ISMB/BOSC. > > Do we want to put some deadlines in place with a mini-project plan? > > Thanks > > Scooter > > > On Mar 16, 2010, at 1:21 PM, Andy Yates wrote: > > > It's getting ready very slowly. Currently we need: > > > > * Locations correctly implemented > > ** There's no way of requesting subseqs from them atmo > > * Feature on sequences support > > * Extra attributes which do not fit into top-level attributes > > * Mapping between sequences/assemblies > > * circular location support > > ** so no checks on start being less than end > > * Documentation > > > > Think that's it off the top of my head > > > > Andy > > > > On 16 Mar 2010, at 15:57, Andreas Prlic wrote: > > > >> Hi, > >> > >> ISMB/BOSC is coming up rapidly and we should start to prepare for the > annual > >> BioJava release. As such it would be a good moment to discuss the > current > >> status of the various new BioJava 3 modules. > >> > >> The biojava-structure, biojava-structure-gui modules are essentially > ready > >> for release and I started to update the Cookbook with the latest > features > >> http://biojava.org/wiki/BioJava:CookBook:PDB:align > >> > >> Some of the re-factored modules based on biojava 1.7 could be released > >> anytime soon as well. The documentation just needs to be updated to > explain > >> where the functionality can be found now (e.g. 
alignment module) > >> > >> What about the new code that has been under development since the > hackathon? > >> Is it getting release ready slowly? Any plans for documentation? What is > >> missing before we can make the first Biojava 3 release? > >> > >> Andreas > >> _______________________________________________ > >> biojava-dev mailing list > >> biojava-dev at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > -- > > Andrew Yates Ensembl Genomes Engineer > > EMBL-EBI Tel: +44-(0)1223-492538 > > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > > > > > > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > From ayates at ebi.ac.uk Wed Mar 17 15:28:33 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 17 Mar 2010 15:28:33 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> Message-ID: <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> I think features are possible & this is really the missing piece of the puzzle with this project. How far on are you with them Scooter? On 16 Mar 2010, at 20:58, Andreas Prlic wrote: > Ok, cool. Thanks for all this state-of-the-art pushing there... Which parts do you think would be feasible to finish, if we would say we are planning a release e.g. early May ? We can have a follow-up to this release once the next round of features have been added. Probably it makes sense to focus on stabilizing what is currently there and documenting it, rather than trying to be feature-complete. Critical features that are still missing should be added of course... 
> > Andreas > > On Tue, Mar 16, 2010 at 11:51 AM, Scooter Willis wrote: > I am working on adding in additional features to the core module to round things out and will be able to do docs/wiki examples. I will be working on Features with the new sequence model and the ability to pull features from uniprot based on uniprot id as an example. I will use uniprot XML as the data model when figuring out the feature data model such that classes have biology relevance instead of being completely abstract. > > I will also see if I can do something with NCBI for genome sequence data where you don't need to download the entire sequence but based on gff annotations you can pull dna sequences for exons belonging to a particular gene. > > I will also plan on migrating the sequence alignment code as well. > > I think the focus for this release should be on the modularization of the modules and the maven integration. We also need to provide a repository for those who are not going to use maven and need just the jar files. We can then highlight the newer modules as a benefit of the modularization. > > I am planning on attending ISMB/BOSC. > > Do we want to put some deadlines in place with a mini-project plan? > > Thanks > > Scooter > > > On Mar 16, 2010, at 1:21 PM, Andy Yates wrote: > > > It's getting ready very slowly. Currently we need: > > > > * Locations correctly implemented > > ** There's no way of requesting subseqs from them atmo > > * Feature on sequences support > > * Extra attributes which do not fit into top-level attributes > > * Mapping between sequences/assemblies > > * circular location support > > ** so no checks on start being less than end > > * Documentation > > > > Think that's it off the top of my head > > > > Andy > > > > On 16 Mar 2010, at 15:57, Andreas Prlic wrote: > > > >> Hi, > >> > >> ISMB/BOSC is coming up rapidly and we should start to prepare for the annual > >> BioJava release. 
As such it would be a good moment to discuss the current > >> status of the various new BioJava 3 modules. > >> > >> The biojava-structure, biojava-structure-gui modules are essentially ready > >> for release and I started to update the Cookbook with the latest features > >> http://biojava.org/wiki/BioJava:CookBook:PDB:align > >> > >> Some of the re-factored modules based on biojava 1.7 could be released > >> anytime soon as well. The documentation just needs to be updated to explain > >> where the functionality can be found now (e.g. alignment module) > >> > >> What about the new code that has been under development since the hackathon? > >> Is it getting release ready slowly? Any plans for documentation? What is > >> missing before we can make the first Biojava 3 release? > >> > >> Andreas > >> _______________________________________________ > >> biojava-dev mailing list > >> biojava-dev at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > -- > > Andrew Yates Ensembl Genomes Engineer > > EMBL-EBI Tel: +44-(0)1223-492538 > > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > > > > > > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From HWillis at scripps.edu Wed Mar 17 15:52:01 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 11:52:01 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> 
<4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> Message-ID:

Andy

Working on it at the moment. I am starting with some code I have been using from JavaGene that has a fairly good handle on GFF parsing and on handling negative strands. I am migrating it to a new project called biojava3-genes (local only at the moment) that will hold code related to GFF parsing and to dealing with the output of various gene prediction programs. I need to create a training file for GlimmerHMM, so the short-term goal is to take an XML BLAST output of predicted genes that match UniProt and then extract the exon features from DNASequences with exon features added from a GFF file. I will then use these validated exon features to create the GlimmerHMM training file. Exon features with negative strands and frame shifts, together with the ability to splice together a coding sequence, are probably the most complicated feature example we will encounter. After I get through that I will see what can be extended/refactored etc. for other more generic features.

I also have some code to gather genome characteristics (GC percent, average gene length, etc.) that can be included in the biojava3-genes module. I wanted to see if you know how the average number of introns per gene is calculated when a gene has no introns. Do you add a 0 to the average or only include genes with at least one intron in the average?

Can you think of a better name for a package that deals with GFF/GFF3 parsing and utilities to work with various gene prediction inputs/outputs?

Scooter

From ayates at ebi.ac.uk Wed Mar 17 16:04:50 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 17 Mar 2010 16:04:50 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> Message-ID: <4A0FFCFF-9EAA-4B27-BF11-AFD6D4CFEAE0@ebi.ac.uk>

Hey mate,

Sounds good; anything with good GFF support is hard to come by :). So you're going to get it working for the non-generic structures & then push it out into the core modules, if I'm reading what you said correctly?

Add 0 to the average & make sure the docs describe what it's doing. Even if a gene has no introns it still affects the average number of introns in a genome :).

All I can think of is "biojava3-features". Not sure what "biojava3-genes" says. Maybe it goes into an "io" package, say one which goes with an EMBL/GenBank/CHADO formatter. Naming is a horrible thing to have to do.

Andy

-- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From HWillis at scripps.edu Wed Mar 17 16:09:29 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 12:09:29 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> Message-ID:

Andy

Let me know if you have any major code changes for the core sequence handling that have been or could be checked in. So far I haven't needed to touch any of the core sequence code, but I want to avoid merging code if you have made any significant changes.

I should have code to check in today, and if we can't come up with a better name I will ask Andreas to create a biojava3-genes module and I can then check that code in for your review.
The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this is that a TranscriptSequence is also a feature of a DNA sequence, and only when you want a standalone class with internal links back to the parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. You can ask for an ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence and the corresponding ExonFeatures, along with some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where the end of one exon and the start of the next exon together code for a particular amino acid sequence position.

We also need to add the ability to have tracks as a way to group features. This way you can export the features of a particular track as a GFF/GFF3 file for importing into various genome browsers. For example, you have one genome you are working on with genes added in from three different gene prediction algorithms, each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via BLAST against UniProt and create another summary track of validated and non-validated genes. If the feature classes we put together can make this easy then I think we will have a solid design.
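
As a rough illustration of the protein-to-DNA mapping described above, here is a minimal sketch in Java. It is not BioJava code: `CodonMapper`, `Exon` and `codonPositions` are hypothetical names made up for this example. It assumes 1-based inclusive coordinates on the parent sequence, exons listed in transcription order, and a coding sequence that starts at the first exon base; frame shifts and a non-zero starting phase are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these names are part of BioJava. */
final class CodonMapper {

    /** An exon with inclusive, 1-based coordinates on the parent DNA sequence. */
    static final class Exon {
        final int start;
        final int end;

        Exon(int start, int end) {
            if (start > end) {
                throw new IllegalArgumentException("start must not exceed end");
            }
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the three parent-sequence positions that code for the given
     * 1-based amino acid position. Exons must be in transcription order:
     * ascending for the forward strand, descending for the reverse strand.
     */
    static int[] codonPositions(List<Exon> exons, boolean reverseStrand, int aminoAcidPos) {
        // Flatten the exons into the spliced coding sequence:
        // element i is the parent-sequence position of coding base i.
        List<Integer> cds = new ArrayList<>();
        for (Exon e : exons) {
            if (reverseStrand) {
                for (int p = e.end; p >= e.start; p--) {
                    cds.add(p);
                }
            } else {
                for (int p = e.start; p <= e.end; p++) {
                    cds.add(p);
                }
            }
        }
        int first = (aminoAcidPos - 1) * 3; // 0-based offset of the codon's first base
        if (aminoAcidPos < 1 || first + 2 >= cds.size()) {
            throw new IndexOutOfBoundsException("no complete codon for position " + aminoAcidPos);
        }
        return new int[] { cds.get(first), cds.get(first + 1), cds.get(first + 2) };
    }
}
```

With exons 1..4 and 10..14 on the forward strand, amino acid 2 maps to positions 4, 10 and 11, which is the split-codon case: flattening the exons first means a codon that straddles an exon boundary needs no special handling. A real implementation would probably walk the exon list rather than materialise every position, and would have to account for phase.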
Scooter

From HWillis at scripps.edu Wed Mar 17 16:14:02 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 12:14:02 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <4A0FFCFF-9EAA-4B27-BF11-AFD6D4CFEAE0@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <4A0FFCFF-9EAA-4B27-BF11-AFD6D4CFEAE0@ebi.ac.uk> Message-ID:

Andy

I have two methods that calculate the average introns per gene, one for each approach. I just wasn't sure what the standard is for reporting.

I think features should be part of the core because a feature is abstract regardless of the source that generated it. The code related to gene prediction work should probably be in a different package because it is not general. Calling it biojava-geneprediction also doesn't work because it implies gene prediction.

Scooter

From andreas at sdsc.edu Wed Mar 17 17:46:19 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 17 Mar 2010 10:46:19 -0700 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> Message-ID: <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com>

I like biojava-feature as a module name for the GFF and feature-related code. (Should we try to keep the module names singular?) Let me know if you want me to create the module for this...

A

From HWillis at scripps.edu Wed Mar 17 18:17:59 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 14:17:59 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> Message-ID: <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu>

Andreas

The problem with putting feature classes in a separate module is that biojava-core sequences would then have a dependency on biojava-feature. A sequence needs to hold a collection of features, so the feature classes need to go in core. If features are created from GFF, the core module doesn't care where the features come from.

We could go with biojava-genomes, and code related to dealing with genomes goes in that module. If you like biojava-genome or biojava-genomes, go ahead and create it and email me so I can check it out.

Thanks
Scooter

From ayates at ebi.ac.uk Wed Mar 17 19:24:13 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 17 Mar 2010 19:24:13 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> Message-ID: <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk>

biojava-genomes sounds good.

I've done nothing since my last check-in of code, which was all to do with locations, so there should be no problem there :)

-- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From HWillis at scripps.edu Wed Mar 17 19:58:42 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Wed, 17 Mar 2010 15:58:42 -0400 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> Message-ID: <9F8616DE-710D-4971-8C63-52C5EB7789C2@scripps.edu>

Andy

Should we use this as our test case http://www.sequenceontology.org/gff3.shtml for a complex example of transcription?

Scooter

From ayates at ebi.ac.uk Wed Mar 17 20:01:04 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 17 Mar 2010 20:01:04 +0000 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <9F8616DE-710D-4971-8C63-52C5EB7789C2@scripps.edu> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> <1077DC26-42AB-4E41-BFA3-DEFD769F4C61@ebi.ac.uk> <9F8616DE-710D-4971-8C63-52C5EB7789C2@scripps.edu> Message-ID: <2A33D045-0AD9-4948-90D3-48636D074514@ebi.ac.uk>

Perfect :). Nothing like using someone else's test case as ours

Andy

On 17 Mar 2010, at 19:58, Scooter Willis wrote: > Andy > > Should be use this as our test case http://www.sequenceontology.org/gff3.shtml for a complex example of transcription? > > Scooter > > On Mar 17, 2010, at 3:24 PM, Andy Yates wrote: > >> biojava-genomes sounds good.
>> >> I've done nothing since my last check-in of code which was all to do with locations so there should be no problem there :) >> >> On 17 Mar 2010, at 18:17, Scooter Willis wrote: >> >>> Andreas >>> >>> The problem with putting feature classes in a separate module is that biojava-core sequences would then have a dependency on biojava-feature. A sequence needs to hold a collection of features so feature classes need to go in core. If features are created from gff the core module doesn't care where features come from. >>> >>> We could go with biojava-genomes and code related to dealing with genomes goes in that module. If you like biojava-genome or biojava-genomes go ahead and create it and email me so I can check it out. >>> >>> Thanks >>> >>> Scooter >>> >>> >>> >>> On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: >>> >>>> I like biojava-feature as a module name for the GFF and features related code. (should we try to keep the module names singular?) Let me know if you want me to create the module for this... >>>> A >>>> >>>> On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: >>>> Andy >>>> >>>> Let me know if you have any major code changes for the core sequencing handling that have been or could be checked in. So far I haven't needed to touch any of the core sequence code but want to avoid merging code if you have made any significant changes. >>>> >>>> I should have code to check in today and if we can't come up with a better name I will ask Andreas to create a biojava3-genes module and I can then check that code in for your review. The current problem is that we have ExonSequence extending DNASequence when it could also be described as a feature. One way to look at this that a TranscriptSequence is also a feature of a DNA sequence and only when you want to have a stand alone class with internal links back to parent sequence do you return a TranscriptSequence. The TranscriptFeature would have ExonFeature and IntronFeature as children. 
You can ask for a ExonSequence based on the ExonFeature. Once you get a ProteinSequence you should be able to reverse the process and get back the TranscriptSequence and the corresponding ExonFeatures and some sort of mapping from a protein sequence position back to the three DNA sequence positions that coded for it. This would need to handle the case where you have a the end of an exon and the start of the next exon coding for a particular amino acid sequence position. >>>> >>>> We also need to add in the ability to have tracks as a way to group features. This way you export features based on a particular track as a GFF/GFF3 file for importing into various genome browsers. You have one genome you are working on with genes added in from three different gene prediction algorithms each organized by a track. You should then be able to determine overlaps of genes that were predicted and validated via blast against uniprot and create another summary track of validated genes and non-validate genes. If the feature classes we put together can make this easy then I think we will have a solid design. 
>>>> >>>> >>>> Scooter >>>> >>>> >>> >> >> -- >> Andrew Yates Ensembl Genomes Engineer >> EMBL-EBI Tel: +44-(0)1223-492538 >> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >> >> >> >> > -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From andreas at sdsc.edu Wed Mar 17 22:14:40 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 17 Mar 2010 15:14:40 -0700 Subject: [Biojava-dev] biojava 3 progress In-Reply-To: <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> References: <59a41c431003160857s5fb8f4f8i89f410a1adfbca85@mail.gmail.com> <81FA76CF-D4F6-44A5-A92F-C92D48BC7F8C@ebi.ac.uk> <59a41c431003161358h45d55b36w73050c8d5a883c98@mail.gmail.com> <4A9A2D02-6E24-468B-9EC3-D58BE335406F@ebi.ac.uk> <59a41c431003171046u57ef0d00vd4452074fc922b1@mail.gmail.com> <5C3EFA6A-68FF-4FF9-B92F-861E4E88B41C@scripps.edu> Message-ID: <59a41c431003171514u1357ecf1ndab75fa4d461124e@mail.gmail.com> ok, a new module biojava3-genome is now in SVN... A On Wed, Mar 17, 2010 at 11:17 AM, Scooter Willis wrote: > Andreas > > The problem with putting feature classes in a separate module is that > biojava-core sequences would then have a dependency on biojava-feature. A > sequence needs to hold a collection of features so feature classes need to > go in core. If features are created from gff the core module doesn't care > where features come from. > > We could go with biojava-genomes and code related to dealing with genomes > goes in that module. If you like biojava-genome or biojava-genomes go ahead > and create it and email me so I can check it out. > > Thanks > > Scooter > > > > On Mar 17, 2010, at 1:46 PM, Andreas Prlic wrote: > > I like biojava-feature as a module name for the GFF and features related > code. (should we try to keep the module names singular?) 
Let me know if you > want me to create the module for this... > A > > On Wed, Mar 17, 2010 at 9:09 AM, Scooter Willis wrote: > >> Andy >> >> Let me know if you have any major code changes for the core sequencing >> handling that have been or could be checked in. So far I haven't needed to >> touch any of the core sequence code but want to avoid merging code if you >> have made any significant changes. >> >> I should have code to check in today and if we can't come up with a better >> name I will ask Andreas to create a biojava3-genes module and I can then >> check that code in for your review. The current problem is that we have >> ExonSequence extending DNASequence when it could also be described as a >> feature. One way to look at this that a TranscriptSequence is also a feature >> of a DNA sequence and only when you want to have a stand alone class with >> internal links back to parent sequence do you return a TranscriptSequence. >> The TranscriptFeature would have ExonFeature and IntronFeature as children. >> You can ask for a ExonSequence based on the ExonFeature. Once you get a >> ProteinSequence you should be able to reverse the process and get back the >> TranscriptSequence and the corresponding ExonFeatures and some sort of >> mapping from a protein sequence position back to the three DNA sequence >> positions that coded for it. This would need to handle the case where you >> have a the end of an exon and the start of the next exon coding for a >> particular amino acid sequence position. >> >> We also need to add in the ability to have tracks as a way to group >> features. This way you export features based on a particular track as a >> GFF/GFF3 file for importing into various genome browsers. You have one >> genome you are working on with genes added in from three different gene >> prediction algorithms each organized by a track. 
You should then be able to >> determine overlaps of genes that were predicted and validated via blast >> against uniprot and create another summary track of validated genes and >> non-validate genes. If the feature classes we put together can make this >> easy then I think we will have a solid design. >> >> >> Scooter >> >> > > From heuermh at acm.org Thu Mar 18 03:28:23 2010 From: heuermh at acm.org (Michael Heuer) Date: Wed, 17 Mar 2010 22:28:23 -0500 (EST) Subject: [Biojava-dev] Hackathon in Boston, July 2010 In-Reply-To: <5FC2D8EC-5408-4126-9A7D-CB6B3500B61C@eaglegenomics.com> Message-ID: On Mon, 15 Mar 2010, Richard Holland wrote: > Hi all, > > Following the successful hackathon in Cambridge earlier this year, it was originally planned to hold a second one in Boston in conjunction with BOSC in order to give those who couldn't make it to the UK a chance to get involved. > > However, OBF have beaten us to it by organising a cross-project CodeFest! > > http://www.open-bio.org/wiki/Codefest_2010 > > It would be great for BioJava people to get involved with this cross-project hackathon effort, and it saves organising one of our own! :) Yep, I'm already signed up. Look forward to seeing some of you there. michael From andreas at sdsc.edu Thu Mar 18 20:36:38 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 18 Mar 2010 13:36:38 -0700 Subject: [Biojava-dev] Google summer of code Message-ID: <59a41c431003181336i33d388aak4b5a26e11ee4161b@mail.gmail.com> Hi, It seems our (the Open Bioinformatics Foundation's) Google Summer of Code application has been accepted. http://socghop.appspot.com/gsoc/program/accepted_orgs/google/gsoc2010 As such we are now looking for an interested and skilled student to work on the BioJava multiple sequence alignment project. Take a look at the project description, and if you think you are up for the challenge, send me an email with your application.
http://biojava.org/wiki/Google_Summer_of_Code Andreas From andreas at sdsc.edu Wed Mar 24 00:33:09 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 23 Mar 2010 17:33:09 -0700 Subject: [Biojava-dev] GSoC update Message-ID: <59a41c431003231733t1e259753k55fbe0a8bfb801a3@mail.gmail.com> Hi, A quick update regarding the current status of our Google Summer of Code project: Several students have already expressed their interest. In fact the response was so good that I believe BioJava should try to run more than just one project. In the meantime we added another "mentor proposed" project to our GSoC page: http://biojava.org/wiki/Google_Summer_of_Code . Identification and Classification of Posttranslational Modification of Proteins: Develop a Posttranslational Modification package for the BioJava project. In general Google strongly encourages student-proposed projects, since historically those are often the most successful GSoC projects. It is recommended that students contact us / possible mentors prior to their application so we can match up students with suitable mentors and projects and help solidify their project ideas. In principle any BioJava contributor is suitable as a mentor. Students can apply between March 22nd and April 9th via the Google web site.
http://socghop.appspot.com/ Andreas From biopython at maubp.freeserve.co.uk Wed Mar 24 14:51:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Mar 2010 14:51:46 +0000 Subject: [Biojava-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> Message-ID: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote: > > On Mar 24, 2010, at 9:08 AM, Peter wrote: > >> Hi, >> >> This is probably of interest to all the Bio* projects offering access >> to the NCBI Entrez utilities. See forwarded message below. >> >> I *think* the new guidelines basically say that the email & tool parameters are >> optional BUT if your IP address ever gets banned for excessive use you then >> have to register an email & tool combination. >> >> Regarding the email address, the NCBI say to use the email of the developer >> (not the end user). However, they do not distinguish between the developers >> of a library (like us), and the developers of an application or script using a >> library (who may also be the end user). >> >> Currently we (Biopython) and I think BioPerl ask developers using our libraries >> to populate the email address themselves. I *think* this is still the >> right action. >> >> Peter > > > Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I > think with the SOAP-based ones as well). We're providing a specific set of > tools for user to write up their own applications end applications. I can try > contacting them regarding this to get an official response to clarify this > somewhat. Please give the NCBI an email - you can CC me too if you like.
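For the Java side, the behaviour described in this thread (a default tool name that can be overridden, plus a warning when no email is set) might look roughly like the sketch below. It only builds an ESearch request URL and makes no network call. EutilsUrlSketch, buildSearchUrl and the default tool value "biojava" are assumptions made for illustration, not an existing BioJava API; the db, term, tool and email query parameters are the ones NCBI documents for the E-utilities, and the https base URL is assumed to be the current form of the address.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical sketch; not part of any Bio* toolkit. */
class EutilsUrlSketch {

    // ESearch endpoint of the NCBI E-utilities.
    static final String ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    // Assumed library default, analogous to the toolkit defaults mentioned in the thread.
    static final String DEFAULT_TOOL = "biojava";

    /**
     * Builds an ESearch URL. A null or empty tool falls back to the default;
     * a null or empty email is left out and a warning is printed instead.
     */
    static String buildSearchUrl(String db, String term, String tool, String email) {
        String effectiveTool = (tool == null || tool.isEmpty()) ? DEFAULT_TOOL : tool;
        StringBuilder url = new StringBuilder(ESEARCH);
        url.append("?db=").append(encode(db));
        url.append("&term=").append(encode(term));
        url.append("&tool=").append(encode(effectiveTool));
        if (email == null || email.isEmpty()) {
            System.err.println("Warning: no email parameter set for NCBI E-utilities request");
        } else {
            url.append("&email=").append(encode(email));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The application developer supplies their own contact address.
        System.out.println(buildSearchUrl("pubmed", "biojava", null, "dev@example.org"));
    }
}
```

Leaving the email for the application developer to fill in keeps the library from registering one address on behalf of every user, which is the concern raised above.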
> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a > default, but always leave the email blank and issue a warning if it isn't > set. We could just as easily leave both blank and issue warnings for both. We currently leave out the email and set the tool parameter to "Biopython" by default but this can be overridden. Currently leaving out the email does cause Biopython to give a warning. Peter From cjfields at illinois.edu Wed Mar 24 14:37:13 2010 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 24 Mar 2010 09:37:13 -0500 Subject: [Biojava-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> Message-ID: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> On Mar 24, 2010, at 9:08 AM, Peter wrote: > Hi, > > This is probably of interest to all the Bio* projects offering access > to the NCBI > Entrez utilities. See forwarded message below. > > I *think* the new guidelines basically say that the email & tool parameters are > optional BUT if your IP address ever gets banned for excessive use you then > have to register an email & tool combination. > > Regarding the email address, the NCBI say to use the email of the developer > (not the end user). However, they do not distinguish between the developers > of a library (like us), and the developers of an application or script using a > library (who may also be the end user). > > Currently we (Biopython) and I think BioPerl ask developers using our libraries > to populate the email address themselves. I *think* this is still the > right action. > > Peter Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I think with the SOAP-based ones as well). We're providing a specific set of tools for user to write up their own applications end applications.
I can try contacting them regarding this to get an official response to clarify this somewhat. Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a default, but always leave the email blank and issue a warning if it isn't set. We could just as easily leave both blank and issue warnings for both. chris > ---------- Forwarded message ---------- > From: > Date: Wed, Mar 24, 2010 at 1:53 PM > Subject: [Utilities-announce] NCBI Revised E-utility Usage Policy > To: NLM/NCBI List utilities-announce > > > New E-utility documentation now on the NCBI Bookshelf > > The Entrez Programming Utilities (E-Utilities) Help documentation has > been added to the NCBI Bookshelf, and so is now fully integrated with > the Entrez search and retrieval system as a part of the Bookshelf > database. This help document has been divided into chapters for better > organization and includes several new sample Perl scripts. At present > this book covers the standard URL interface for the E-utilties; > material about the SOAP interface will be added soon and is still > available at the same URL: > http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html. > > > > Revised E-utility usage policy > > In December, 2009 NCBI announced a change to the usage policy for the > E-utilities that would require all requests to contain non-null values > for both the &email and &tool parameters. After several consultations > with our users and developers, we have decided to revise this policy > change, and the revised policy is described in detail at the following > link: > > http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen > > Please let us know if you have any questions or concerns about this > policy change. > > > > Thank you, > > The E-Utilities Team > > NIH/NLM/NCBI > > eutilities at ncbi.nlm.nih.gov. 
> > > > _______________________________________________ > Utilities-announce mailing list > http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at drycafe.net Wed Mar 24 15:27:37 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Wed, 24 Mar 2010 11:27:37 -0400 Subject: [Biojava-dev] [Open-bio-l] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> Message-ID: <5D427F97-706E-4F66-95BA-2B397520C4FA@drycafe.net> On Mar 24, 2010, at 10:51 AM, Peter wrote: > Please give the NCBI an email - you can CC me too if you like. Can't this be the developers' mailing list (or lists, the appropriate one for each toolkit)? We can even whitelist all NCBI sender addresses so they can easily email us if there are issues. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From cjfields at illinois.edu Wed Mar 24 15:44:21 2010 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 24 Mar 2010 10:44:21 -0500 Subject: [Biojava-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy In-Reply-To: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com> <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu> <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> Message-ID: <338BDDD8-2A66-4086-BFB7-35EC8F8F0D66@illinois.edu> On Mar 24, 2010, at 9:51 AM, Peter wrote: > On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote: >> >> On Mar 24, 2010, at 9:08 AM, Peter wrote: >> >>> Hi, >>> >>> This is probably of interest to all the Bio* projects offering access >>> to the NCBI Entrez utilities. See forwarded message below. >>> >>> I *think* the new guidelines basically say that the email & tool parameters are >>> optional BUT if your IP address ever gets banned for excessive use you then >>> have to register an email & tool combination. >>> >>> Regarding the email address, the NCBI say to use the email of the developer >>> (not the end user). However, they do not distinguish between the developers >>> of a library (like us), and the developers of an application or script using a >>> library (who may also be the end user). >>> >>> Currently we (Biopython) and I think BioPerl ask developers using our libraries >>> to populate the email address themselves. I *think* this is still the >>> right action. >>> >>> Peter >> >> >> Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I >> think with the SOAP-based ones as well). We're providing a specific set of >> tools for user to write up their own applications end applications. 
I can try >> contacting them regarding this to get an official response to clarify this >> somewhat. > > Please give the NCBI an email - you can CC me too if you like. Sent, have cc'd the open-bio list. Don't want to cross-post this too much, so I think we should move the discussion there. >> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a >> default, but always leave the email blank and issue a warning if it isn't >> set. We could just as easily leave both blank and issue warnings for both. > > We currently leave out the email and set the tool parameter to "Biopython" > by default but this can be overridden. Currently leaving out the email does > cause Biopython to give a warning. > > Peter We follow the same, then (down to the warning). This is mentioned in my post to them, I'll wait to see what they say. My concern is the wording of the new rules. Each tool and email must be registered with them if an IP is blocked. Does this mean each tool is assigned one specific email? And an IP that is blocked can register it to be allowed back into the fold? With that in mind, should we register each of our toolkits with them? Probably not a bad thing (it might help us as devs to get an idea of use), but then if one user abuses the rules will their actions affect all toolkit users? Is this all done on a per-IP basis, per-toolkit basis, etc? Unfortunately, at least to me, none of this is made very clear, so I'm hoping there is some clarification from their end. chris From maj at fortinbras.us Wed Mar 24 16:37:56 2010 From: maj at fortinbras.us (Mark A. 
Jensen) Date: Wed, 24 Mar 2010 12:37:56 -0400 Subject: [Biojava-dev] [Bioperl-l] [Open-bio-l] Fwd: [Utilities-announce] NCBI RevisedE-utility Usage Policy In-Reply-To: <5D427F97-706E-4F66-95BA-2B397520C4FA@drycafe.net> References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com><38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu><320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com> <5D427F97-706E-4F66-95BA-2B397520C4FA@drycafe.net> Message-ID: I think this is a great idea--- MAJ ----- Original Message ----- From: "Hilmar Lapp" To: "Peter" Cc: ; "Biopython-Dev Mailing List" ; ; "bioperl-l list" ; "Chris Fields" ; Sent: Wednesday, March 24, 2010 11:27 AM Subject: Re: [Bioperl-l] [Open-bio-l] Fwd: [Utilities-announce] NCBI RevisedE-utility Usage Policy > > On Mar 24, 2010, at 10:51 AM, Peter wrote: > >> Please give the NCBI an email - you can CC me too if you like. > > > Can't this be the developers' mailing list (or lists, the appropriate one for > each toolkit)? We can even whitelist all NCBI sender addresses so they can > easily email us if there are issues. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From sheoran143 at gmail.com Thu Mar 25 01:19:29 2010 From: sheoran143 at gmail.com (Deepak Sheoran) Date: Wed, 24 Mar 2010 20:19:29 -0500 Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject :( Hibernate Exception and suggestion for change in BioSqlSchema) Message-ID: <4BAABA21.4000301@gmail.com> I am writing this email again, I didn't get any response weather this bugs are patched or are they lost some where on mailing list. I am not sure that's why I am writing this back. 
I don't know how to apply these patches, so I am counting on you guys to apply them and reply to me so I know it's fixed. Thanks Deepak Sheoran Hi In response to the bug fix suggested by Richard I have created some patches. We need to apply these to stop BioJava from processing the references of a GenBank record incorrectly, which causes more Hibernate exceptions. After applying the patches, the reference resolution code will test the PubMed or MEDLINE ID, then if there is no match test author/title/location, then if there is still no match create a new reference. I even tested it with GenBank release 175 and I gained almost 3159 more records in my database. Can somebody please have a look at the second issue and fix it: " 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from). " Also I am planning on making a bridge between BioSQL databases loaded using BioPerl and BioJava. Here is some of my investigation; can you guys suggest some direction on it?
Have a look at the attached files:

1) Biojava_BioPerl_Diff.xls ==> a view of the tables where a GenBank record is stored in a BioSQL instance by BioPerl and BioJava
2) GenbankRecord.doc ==> a Word document with a GenBank record showing where its information goes in BioSQL using BioPerl and BioJava
3) BioSqlRichobjectBuilder.patch ==> patch needed for the BioSqlRichObjectBuilder.java class
4) GenBankFormat.patch ==> patch needed for the GenBankFormat.java class

Thanks
Deepak Sheoran

-------- Original Message --------
Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema
Date: Tue, 9 Feb 2010 20:34:32 +1300
From: Richard Holland
To: Deepak Sheoran
CC: biojava-l at biojava.org

Hi.

It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text.

However, in answer to your two questions:

1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records. I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March).

2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
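The three-step process from point 1 can be illustrated with a small stand-alone sketch. It is not the attached patch and it does not use the BioJava or Hibernate APIs: DocRefResolverSketch, its DocRef class and the in-memory list standing in for the "reference" table are all hypothetical, and the authors and title strings in main are abbreviated placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of the three-step lookup; names are illustrative only. */
class DocRefResolverSketch {

    static class DocRef {
        final String pubmedId; // null when the record carries no PUBMED/MEDLINE id
        final String authors;
        final String location;
        final String title;    // may be null

        DocRef(String pubmedId, String authors, String location, String title) {
            this.pubmedId = pubmedId;
            this.authors = authors;
            this.location = location;
            this.title = title;
        }
    }

    // Stands in for the "reference" table.
    private final List<DocRef> store = new ArrayList<>();

    DocRef resolve(String pubmedId, String authors, String location, String title) {
        // Step 1: test the external id, ignoring spelling differences in the other fields.
        if (pubmedId != null) {
            for (DocRef ref : store) {
                if (pubmedId.equals(ref.pubmedId)) {
                    return ref;
                }
            }
        }
        // Step 2: no id match, so test authors/location/title.
        for (DocRef ref : store) {
            if (Objects.equals(ref.authors, authors)
                    && Objects.equals(ref.location, location)
                    && Objects.equals(ref.title, title)) {
                return ref;
            }
        }
        // Step 3: still no match, so create a new reference.
        DocRef created = new DocRef(pubmedId, authors, location, title);
        store.add(created);
        return created;
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        DocRefResolverSketch resolver = new DocRefResolverSketch();
        DocRef first = resolver.resolve("18554304", "Sato,T. et al.",
                "FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)", "placeholder title");
        DocRef second = resolver.resolve("18554304", "Sato,T. et al.",
                "FEMS Microbiol. Ecol. 66 (3), 528-536 (2008)", "placeholder title");
        System.out.println("same reference: " + (first == second) + ", stored: " + resolver.size());
    }
}
```

Because the id is tested first, two spellings of the same location resolve to a single stored reference, which is what keeps a unique constraint on the id column from being violated.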
cheers, Richard On 9 Feb 2010, at 20:21, Deepak Sheoran wrote: > > Hi Richard > > Below is the email which I sent to Biojava-1 mailing list but it never get posted on the mailing list server neither do i got any response, so please have a look on this email and tell what can be the solution of the problem described in the message. > > > Thanks > Deepak Sheoran > -------- Original Message -------- > Subject: Hibernate Exception and suggestion for change in BioSqlSchema > Date: Wed, 03 Feb 2010 08:07:35 -0600 > From: Deepak Sheoran > To: biojava-l at lists.open-bio.org > > Hi guys, > > A couple of days back I was having some problem with hibernate exception but that exception got resolved and the reference to that email is:http://old.nabble.com/Hibernate-Exception-when-persisting-some-richsequence-object-to-biosql-schema-to27299245.html > On Richard suggestion in above link I am able to resolve some of issues but then, I got stuck in to some other error with hibernate and then decided to investigate the matter and below are some facts and information which I found and I guess it is going to affect all of us. > ? The "Reference" table in bioSql schema have unique constraint on "dbxref_id" column (CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id)). Which mean only one entry in reference table can use on dbxref_id. > This Works wells but in cases when you have little variation in value of following column "location", "title", "authors" and all these variation refers to same PUBMED_ID. Then we can't persist or create a richsequence object . > Now when you tie RichObjectFactory to a active hibernate session then the class "BioSqlRichObjectBuilder" have method called "buildObject(Class clazz, List paramsList) " which is responsible for looking up details of object in the database and if it find one then it will return that object, else it will try to persist the new object into the database. 
> But the problem is with the part of that method below:
> ... LineNumber: 114
> else if (SimpleDocRef.class.isAssignableFrom(clazz)) {
>     queryType = "DocRef";
>     // convert List constructor to String representation for query
>     ourParamsList.set(0, DocRefAuthor.Tools.generateAuthorString((List)ourParamsList.get(0), true));
>     if (ourParamsList.size()<3) {
>         queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title is null";
>     } else {
>         queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title = ?";
>     }
> }
> ... LineNumber: 123
> Now when Hibernate searches the database, it won't find any other record in the "reference" table because the two records differ in string comparison, so it returns a new object to "GenbankFormat", to the following piece of code:
> ... LineNumber: 447
> else {
>     try {
>         CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)});
>         RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
>         rlistener.getCurrentFeature().addRankedCrossRef(rcr);
>     } catch (ChangeVetoException e) {
>         throw new ParseException(e+", accession:"+accession);
>     }
> }
> ... LineNumber: 455
> Then we add that object to rlistener and move on to the next part of the GenBank record. BioJava then searches for a new crossref in the database, tries to persist the old one, and gets a Hibernate exception about violating the unique constraint on the "dbxref_id" column.
>
> The only ways to get these records into the database are:
> * The very easy solution, and the way I did it for testing my theory, is to change the BioSQL schema so that it allows a many-to-one relation between the "reference" and "dbxref" tables. This even makes sense, because one paper can have many different naming variations, and this change allows us to store that info too. But this is something the BioSQL people have to decide and I don't know how to approach them.
> *
Second solution is slightly difficult to implement, is to change the way "BioSqlRichObjectBuilder.buildObject(Class clazz,List paramsList)" make decision about weather a particular DocRef already exist in database or not. I am mean testing all possible string variations of authors, location, title of the docRef which we are searching. Which does have many complications and may slow down process of creating a richsequence object when link RichObjectFactory with a active hibernate session. > > Example:Below is a sample of what i have in my local biosql schema which has modification suggested by me. (dbxref_id column have Pubmed_id , I replaced the local dbxref_id which was present on this table in my database with pubmed_id stored in "dbxref" table, for easy reference with outside world in this email) > Reference_id > Dbxref_id > Location > Title > Authors > crc > 216 > 18554304 > FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008) > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. > 9E940E01F4BE3CD0 > 230 > 18554304 > FEMS Microbiol. Ecol. 66 (3), 528-536 (2008) > Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model > Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. > D3BC0C17F3F786C9 > 415 > 16790744 > Infect. Immun. 74 (7), 3715-3726 (2006) > Intrastrain Heterogeneity of the mgpB Gene in Mycoplasma genitalium Is Extensive In Vitro and In Vivo and Suggests that Variation Is Generated via Recombination with Repetitive Chromosomal Sequences > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. 
> 60AEDFA0CEEACC38 > 969 > 16790744 > Infect. Immun. 74 (7), 3715-3726 (2006) > Intrastrain heterogeneity of the mgpB gene in mycoplasma genitalium is extensive in vitro and in vivo and suggests that variation is generated via recombination with repetitive chromosomal sequences > Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. > 4B1232999F6E8130 > 929 > 8688087 > Science 273 (5278), 1058-1073 (1996) > Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J.-F., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.L., Geoghagen,N.S.M., Weidman,J.F., Fuhrmann,J.L., Presley,E.A., Nguyen,D., Utterback,T.R., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.P., Borodovsky,M., Klenk,H.-P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. > 3E79B40DD2AAA2B7 > 932 > 8688087 > Science 273 (5278), 1058-1073 (1996) > Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii > Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.D., Geoghagen,N.S., Weidman,J.F., Fuhrmann,J.L., Nguyen,D.T., Utterback,T., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.B., Borodovsky,M., Klenk,H.P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. > 094EB3384F8D6DE8 > 1426 > 10684935 > Nucleic Acids Res. 
28 (6), 1397-1406 (2000) > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 > Read,T.D., Brunham,R.C., Shen,C., Gill,S.R., Heidelberg,J.F., White,O., Hickey,E.K., Peterson,J., Umayam,L.A., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S.L., Eisen,J. and Fraser,C.M. > 357648D8FD8C6C8A > 1481 > 10684935 > Nucleic Acids Res. 28 (6), 1397-1406 (2000) > Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 > Read,T., Brunham,R., Shen,C., Gill,S., Heidelberg,J., White,O., Hickey,E., Peterson,J., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S., Eisen,J. and Fraser,C. > 115411EB2DEE5654 > 1497 > 14689165 > Arch. Microbiol. 181 (2), 144-154 (2004) > The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. > 4D5D376EECCD186B > 1501 > 14689165 > Arch. Microbiol. 181 (2), 144-154 (2004) > The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner > Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., Del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. > 4D57954EECDED66B > 1556 > 18060065 > PLoS ONE 2 (12), E1271 (2007) > Analysis of the Neurotoxin Complex Genes in Clostridium botulinum A1-A4 and B1 Strains: BoNT/A3, /Ba4 and /B1 Clusters Are Located within Plasmids > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,A.C., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. 
> 698688FB6DB95247 > 1559 > 18060065 > PLoS ONE 2 (12), E1271 (2007) > Analysis of the neurotoxin complex genes in Clostridium botulinum A1-A4 and B1 strains: BoNT/A3, /Ba4 and /B1 clusters are located within plasmids > Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,C.A., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. > E25E1BA99DB18F3D > > ? The second kind of error which I got was : org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature > ? Which means in richsequence object some feature have location object which have its feature set to null. > ? My Observation: > ? Usually occur when you try to persist a richsequence object to database, and occur to those features which have CompoundRichLocation usually "joins" and "complement" in cds region of a genbank record > ? After catching the hibernate exception I went through all the features and either biojava or hibernate changed the object type of a CompoundRichLocation to SimpleRichLocation and set the feature variable to null. > ? Below is the screen shot of one of my tests > ? Settings before trying to persits the richsequence object to database > > > ? > ? After trying to persits the richsequence object to database and got in hibernate exception catch > > ? > > ? So my question is why is this happening and how to stop or how to get these record into database, I have no clue why is this happening. > ? Some extra information to make things more clear to you guys. > ? Below are some Locus line from genbank record for which I know the error of location, I mean the cds region causing error, and array index in richsequence.feature arrayList object. > ? LOCUS AE001439 1643831 bp DNA circular BCT 19-JAN-2006 > ? richSequence.feature Index : 2540 and line number in the genbank record : 22115 > ? LOCUS CP001189 3887492 bp DNA circular BCT 16-OCT-2008 > ? richSequence.feature Index : 127 and line number in the genbank record : 2137 > ? 
LOCUS CP001292 328635 bp DNA circular BCT 17-DEC-2008 > ? richSequence.feature Index : 389 and line number in the genbank record : 3632 > ? LOCUS AM279694 238517 bp DNA linear BCT 23-OCT-2008 > ? richSequence.feature Index : 47 and line number in the genbank record : 4841 > ? LOCUS CR931663 18517 bp DNA linear BCT 18-SEP-2008 > ? richSequence.feature Index : 45 and line number in the genbank record : 442 > ? The complete exception msg : > org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature > at org.hibernate.engine.Nullability.checkNullability(Nullability.java:72) > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:290) > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > at 
org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) > at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > at 
org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) > at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) > at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) > at org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) > at org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27) > at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > at org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535) > at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523) > at trashtesting.GenBankLoaderTesting.main(GenBankLoaderTesting.java:78) > > -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E:holland at eaglegenomics.com http://www.eaglegenomics.com/ -------------- next part -------------- A non-text attachment was scrubbed... Name: Biojava_BioPerl_diff.xls Type: application/vnd.ms-excel Size: 346624 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: BioSqlRichObjectBuilder.patch URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: GenbankFormat.patch URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: GenbankRecord.doc
Type: application/msword
Size: 59392 bytes
Desc: not available
URL:

From holland at eaglegenomics.com Thu Mar 25 16:27:17 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 25 Mar 2010 16:27:17 +0000
Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject :( Hibernate Exception and suggestion for change in BioSqlSchema)
In-Reply-To: <4BAABA21.4000301@gmail.com>
References: <4BAABA21.4000301@gmail.com>
Message-ID: <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>

Patched and in subversion on the head in the new Biojava 3 code. I modified the code slightly to simplify it. There were also parallel changes required over in SimpleDocRef itself to enable it to continue working without being connected to BioSQL.

On 25 Mar 2010, at 01:19, Deepak Sheoran wrote:

> I am writing this email again because I didn't get any response about whether these bugs are patched or whether they got lost somewhere on the mailing list. I am not sure, which is why I am writing again. I don't know how to apply these patches, so I am counting on you guys to apply them and reply back to me so I know it's fixed.
>
>
>
> Thanks
> Deepak Sheoran
>
>
> Hi
> In response to the bug fix suggested by Richard I have created some patches. We need to apply these to stop BioJava from processing the references of a GenBank record in the wrong manner, which causes more Hibernate exceptions. After applying the patches, the reference resolution code will first test the PubMed or MEDLINE ID; if there is no match it will test author/title/location; and if there is still no match it will create a new reference. I even tested it with GenBank release 175 and I gained almost 3159 more records in my database.
>
> Can somebody please have a look at the second issue and fix it:
> "
> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
> "
>
> Also, I am planning on making a bridge between BioSQL databases loaded using BioPerl and BioJava. Here is some of my investigation; can you guys suggest some direction on it?
> Have a look at the attached files:
> 1) Biojava_BioPerl_Diff.xls ==> a view of the tables where a GenBank record is stored in a BioSQL instance by BioPerl and by BioJava
> 2) GenbankRecord.doc ==> a Word document with a GenBank record showing where its information goes in BioSQL using BioPerl and BioJava
> 3) BioSqlRichobjectBuilder.patch ==> patch needed for the BioSqlRichObjectBuilder.java class
> 4) GenBankFormat.patch ==> patch needed for the GenBankFormat.java class
>
>
> Thanks
> Deepak Sheoran
>
>
>
> -------- Original Message --------
> Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema
> Date: Tue, 9 Feb 2010 20:34:32 +1300
> From: Richard Holland
> To: Deepak Sheoran
> CC: biojava-l at biojava.org
>
> Hi. It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text.
>
> However, in answer to your two questions:
>
> 1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records. I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March).
>
> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation.
> Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
>
> cheers,
> Richard
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andreas at sdsc.edu Thu Mar 25 16:47:45 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 25 Mar 2010 09:47:45 -0700
Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject :( Hibernate Exception and suggestion for change in BioSqlSchema)
In-Reply-To: <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
References: <4BAABA21.4000301@gmail.com> <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
Message-ID: <59a41c431003250947g6ecd11cbw21c5be5858b9aa09@mail.gmail.com>

Excellent, thanks Richard and Deepak!

Andreas

On Thu, Mar 25, 2010 at 9:27 AM, Richard Holland wrote:
> Patched and in subversion on the head in the new Biojava 3 code.
> I modified the code slightly to simplify it. There were also parallel changes required over in SimpleDocRef itself to enable it to continue working without being connected to BioSQL.
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

From deepak.sheoran at orionbiosciences.com Thu Mar 25 18:46:57 2010
From: deepak.sheoran at orionbiosciences.com (Deepak Sheoran)
Date: Thu, 25 Mar 2010 13:46:57 -0500
Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject : ( Hibernate Exception and suggestion for change in BioSqlSchema)
In-Reply-To: <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
References: <4BAABA21.4000301@gmail.com> <4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com>
Message-ID: <4BABAFA1.6090806@orionbiosciences.com>

That explains why I was getting an error when I created a RichSequence object without an active session to BioSQL. I had no idea that I had introduced a new bug while fixing the old one; thanks for noticing and fixing it.

I am wondering whether BioPerl/BioJava compatibility on BioSQL should be proposed as one of the Google Summer of Code projects. I have a vision for this, but I don't know the right way to begin. It could help people who want to use BioJava but can't, because they are afraid of losing their Perl code, which depends heavily on the BioPerl way of loading the schema. Alternatively, we could come up with a hybrid approach that takes the best of both languages.

Deepak Sheoran

On 3/25/2010 11:27 AM, Richard Holland wrote:
> Patched and in subversion on the head in the new Biojava 3 code. I modified the code slightly to simplify it. There were also parallel changes required over in SimpleDocRef itself to enable it to continue working without being connected to BioSQL.
>
> On 25 Mar 2010, at 01:19, Deepak Sheoran wrote:
>
>> I am writing this email again because I didn't get any response on whether these bugs have been patched or whether the patches got lost somewhere on the mailing list. I don't know how to apply the patches myself, so I am counting on you to apply them and reply so that I know the bug is fixed.
>>
>> Thanks
>> Deepak Sheoran
>>
>> Hi
>> In response to the bug fix suggested by Richard I have created some patches. We need to apply them to stop BioJava from processing the references in a GenBank record incorrectly, which causes further Hibernate exceptions. After applying the patches, the reference resolution code will test the PubMed or MEDLINE id first; if there is no match it will test author/title/location; and if there is still no match it will create a new reference.
>> I even tested it with GenBank release 175 and gained almost 3159 more records in my database.
>>
>> Can somebody please have a look at the second issue and fix it?
>> "
>> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
>> "
>>
>> I am also planning to build a bridge between BioSQL databases loaded using BioPerl and BioJava. Here is some of my investigation; can you suggest some direction on it?
>> Have a look at the attached files:
>> 1) Biojava_BioPerl_Diff.xls ==> a view of the tables where a GenBank record is stored in a BioSQL instance by BioPerl and by BioJava
>> 2) GenbankRecord.doc ==> a Word document containing a GenBank record, showing where its information goes in BioSQL using BioPerl and BioJava
>> 3) BioSqlRichobjectBuilder.patch ==> patch needed for the BioSqlRichObjectBuilder.java class
>> 4) GenBankFormat.patch ==> patch needed for the GenBankFormat.java class
>>
>> Thanks
>> Deepak Sheoran
>>
>> -------- Original Message --------
>> Subject: Re: Hibernate Exception and suggestion for change in BioSqlSchema
>> Date: Tue, 9 Feb 2010 20:34:32 +1300
>> From: Richard Holland
>> To: Deepak Sheoran
>> CC: biojava-l at biojava.org
>>
>> Hi. It's possible that your original email didn't make it to the list because it is HTML format, and the list only accepts plain text.
>>
>> However, in answer to your two questions:
>>
>> 1. The code that does the resolution of references might be better if it looks up existing IDs rather than using author, title, location to identify existing records. I would suggest modifying it to a three-step process - test ID, then if no match then test author/title/location, then if still no match create a new reference. Could someone do that? (I'm unable to do anything until late March).
>>
>> 2. I think that's a bug (compound locations with null features) but not sure why. Could be that the process of constructing a CompoundRichLocation is somehow losing the feature reference from the original SimpleRichLocation. Again I can't investigate until March - can someone else take a look at the code? (A good starting point would be to look at how a CompoundRichLocation decides to select the feature from the SimpleRichLocations it is made up from).
>>
>> cheers,
>> Richard
>>
>> On 9 Feb 2010, at 20:21, Deepak Sheoran wrote:
>>
>>> Hi Richard
>>>
>>> Below is the email which I sent to the biojava-l mailing list. It was never posted on the list server and I got no response, so please have a look at it and tell me what the solution to the problem described in it could be.
>>>
>>> Thanks
>>> Deepak Sheoran
>>>
>>> -------- Original Message --------
>>> Subject: Hibernate Exception and suggestion for change in BioSqlSchema
>>> Date: Wed, 03 Feb 2010 08:07:35 -0600
>>> From: Deepak Sheoran
>>> To: biojava-l at lists.open-bio.org
>>>
>>> Hi guys,
>>>
>>> A couple of days back I had a problem with a Hibernate exception, which got resolved; the reference to that email is:
>>> http://old.nabble.com/Hibernate-Exception-when-persisting-some-richsequence-object-to-biosql-schema-to27299245.html
>>>
>>> Following Richard's suggestion in the above link I was able to resolve some of the issues, but then I got stuck on another Hibernate error and decided to investigate. Below are the facts I found; I guess they are going to affect all of us.
The "Reference" table in the BioSQL schema has a unique constraint on the "dbxref_id" column (CONSTRAINT reference_dbxref_id_key UNIQUE (dbxref_id)), which means only one entry in the reference table can use a given dbxref_id.
>>> This works well, except when there are small variations in the values of the "location", "title" and "authors" columns and all of these variations refer to the same PUBMED_ID. In that case we can't persist or create a RichSequence object.
>>> When you tie RichObjectFactory to an active Hibernate session, the class "BioSqlRichObjectBuilder" has a method called "buildObject(Class clazz, List paramsList)" which is responsible for looking up the details of an object in the database. If it finds one it returns that object; otherwise it tries to persist the new object into the database.
>>> The problem is with the part of that method below:
>>> ... LineNumber: 114
>>>     else if (SimpleDocRef.class.isAssignableFrom(clazz))
>>>     { queryType = "DocRef";
>>>         // convert List constructor to String representation for query
>>>         ourParamsList.set(0, DocRefAuthor.Tools.generateAuthorString((List)ourParamsList.get(0), true));
>>>         if (ourParamsList.size()<3) {
>>>             queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title is null";
>>>         } else {
>>>             queryText = "from DocRef as cr where cr.authors = ? and cr.location = ? and cr.title = ?";
>>>         }
>>>     }
>>> ... LineNumber: 123
>>> Now when Hibernate searches the database, it won't find the existing record in the "reference" table because the two records differ in a string comparison, so it returns a new object to "GenbankFormat", in the following piece of code:
>>> ... LineNumber: 447
>>>     else {
>>>         try {
>>>             CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)});
>>>             RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
>>>             rlistener.getCurrentFeature().addRankedCrossRef(rcr);
>>>         } catch (ChangeVetoException e) {
>>>             throw new ParseException(e+", accession:"+accession);
>>>         }
>>>     }
>>> ... LineNumber: 455
>>> Then we add that object to rlistener and move on to the next part of the GenBank record. When BioJava then searches for a new crossref in the database, it tries to persist the old one and gets a Hibernate exception about violating the unique constraint on the "dbxref_id" column.
>>>
>>> The only ways to get these records into the database are:
>>> * The very easy solution, and the way I did it to test my theory, is to change the BioSQL schema so that it allows a many-to-one relation between the "reference" and "dbxref" tables. This even makes sense, because one paper can have many different naming variations, and this change lets us store that information too. But this is something the BioSQL people have to decide, and I don't know how to approach them.
>>> * The second solution, which is slightly more difficult to implement, is to change the way "BioSqlRichObjectBuilder.buildObject(Class clazz, List paramsList)" decides whether a particular DocRef already exists in the database. I mean testing all possible string variations of the authors, location and title of the DocRef we are searching for. This has many complications and may slow down the creation of a RichSequence object when RichObjectFactory is linked to an active Hibernate session.
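The lookup order Richard suggests in his quoted reply (test the ID first, then author/title/location, then create a new reference) might be sketched roughly as follows. This is only an illustration over an in-memory list, not the BioSqlRichObjectBuilder code: `Ref`, `ReferenceResolver` and `resolve` are hypothetical names, the title string is a placeholder, and a real fix would presumably run Hibernate queries against the "reference" and "dbxref" tables instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for one row of the BioSQL "reference" table. */
final class Ref {
    final Integer dbxrefId;  // may be null when the record carries no cross-reference
    final String authors;
    final String location;
    final String title;      // may be null

    Ref(Integer dbxrefId, String authors, String location, String title) {
        this.dbxrefId = dbxrefId;
        this.authors = authors;
        this.location = location;
        this.title = title;
    }
}

/** Hypothetical resolver: ID first, then authors/location/title, then create. */
final class ReferenceResolver {
    private final List<Ref> stored = new ArrayList<>();

    Ref resolve(Integer dbxrefId, String authors, String location, String title) {
        // Step 1: match on the cross-reference ID, ignoring textual variations.
        if (dbxrefId != null) {
            for (Ref r : stored) {
                if (dbxrefId.equals(r.dbxrefId)) {
                    return r;
                }
            }
        }
        // Step 2: fall back to an exact authors/location/title comparison.
        for (Ref r : stored) {
            if (Objects.equals(authors, r.authors)
                    && Objects.equals(location, r.location)
                    && Objects.equals(title, r.title)) {
                return r;
            }
        }
        // Step 3: nothing matched, so create and remember a new reference.
        Ref created = new Ref(dbxrefId, authors, location, title);
        stored.add(created);
        return created;
    }

    int size() {
        return stored.size();
    }

    public static void main(String[] args) {
        ReferenceResolver resolver = new ReferenceResolver();
        String authors = "Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., "
                + "Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H.";
        String title = "(title omitted for brevity)"; // placeholder, not the real title
        // Two location spellings reported in this thread for PubMed ID 18554304.
        Ref first = resolver.resolve(18554304, authors,
                "FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008)", title);
        Ref second = resolver.resolve(18554304, authors,
                "FEMS Microbiol. Ecol. 66 (3), 528-536 (2008)", title);
        System.out.println("same reference reused: " + (first == second));
    }
}
```

With this ordering the two spellings of the same paper resolve to one stored reference, so the unique constraint on dbxref_id would not be hit. The price is that the textual variation of the second record is not stored.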
>>> >>> Example:Below is a sample of what i have in my local biosql schema which has modification suggested by me. (dbxref_id column have Pubmed_id , I replaced the local dbxref_id which was present on this table in my database with pubmed_id stored in "dbxref" table, for easy reference with outside world in this email) >>> Reference_id >>> Dbxref_id >>> Location >>> Title >>> Authors >>> crc >>> 216 >>> 18554304 >>> FEMS Microbiol. Ecol. 66 (3THEMATIC ISSUE: GUT MICROBIOLOGY), 528-536 (2008) >>> Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model >>> Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. >>> 9E940E01F4BE3CD0 >>> 230 >>> 18554304 >>> FEMS Microbiol. Ecol. 66 (3), 528-536 (2008) >>> Isolation of lactate-utilizing butyrate-producing bacteria from human feces and in vivo administration of Anaerostipes caccae strain L2 and galacto-oligosaccharides in a rat model >>> Sato,T., Matsumoto,K., Okumura,T., Yokoi,W., Naito,E., Yoshida,Y., Nomoto,K., Ito,M. and Sawada,H. >>> D3BC0C17F3F786C9 >>> 415 >>> 16790744 >>> Infect. Immun. 74 (7), 3715-3726 (2006) >>> Intrastrain Heterogeneity of the mgpB Gene in Mycoplasma genitalium Is Extensive In Vitro and In Vivo and Suggests that Variation Is Generated via Recombination with Repetitive Chromosomal Sequences >>> Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. >>> 60AEDFA0CEEACC38 >>> 969 >>> 16790744 >>> Infect. Immun. 74 (7), 3715-3726 (2006) >>> Intrastrain heterogeneity of the mgpB gene in mycoplasma genitalium is extensive in vitro and in vivo and suggests that variation is generated via recombination with repetitive chromosomal sequences >>> Iverson-Cabral,S.L., Astete,S.G., Cohen,C.R., Rocha,E.P. and Totten,P.A. 
>>> 4B1232999F6E8130 >>> 929 >>> 8688087 >>> Science 273 (5278), 1058-1073 (1996) >>> Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii >>> Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J.-F., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.L., Geoghagen,N.S.M., Weidman,J.F., Fuhrmann,J.L., Presley,E.A., Nguyen,D., Utterback,T.R., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.P., Borodovsky,M., Klenk,H.-P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. >>> 3E79B40DD2AAA2B7 >>> 932 >>> 8688087 >>> Science 273 (5278), 1058-1073 (1996) >>> Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii >>> Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D., Kerlavage,A.R., Dougherty,B.A., Tomb,J., Adams,M.D., Reich,C.I., Overbeek,R., Kirkness,E.F., Weinstock,K.G., Merrick,J.M., Glodek,A., Scott,J.D., Geoghagen,N.S., Weidman,J.F., Fuhrmann,J.L., Nguyen,D.T., Utterback,T., Kelley,J.M., Peterson,J.D., Sadow,P.W., Hanna,M.C., Cotton,M.D., Hurst,M.A., Roberts,K.M., Kaine,B.B., Borodovsky,M., Klenk,H.P., Fraser,C.M., Smith,H.O., Woese,C.R. and Venter,J.C. >>> 094EB3384F8D6DE8 >>> 1426 >>> 10684935 >>> Nucleic Acids Res. 28 (6), 1397-1406 (2000) >>> Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 >>> Read,T.D., Brunham,R.C., Shen,C., Gill,S.R., Heidelberg,J.F., White,O., Hickey,E.K., Peterson,J., Umayam,L.A., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S.L., Eisen,J. and Fraser,C.M. >>> 357648D8FD8C6C8A >>> 1481 >>> 10684935 >>> Nucleic Acids Res. 
28 (6), 1397-1406 (2000) >>> Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 >>> Read,T., Brunham,R., Shen,C., Gill,S., Heidelberg,J., White,O., Hickey,E., Peterson,J., Utterback,T., Berry,K., Bass,S., Linher,K., Weidman,J., Khouri,H., Craven,B., Bowman,C., Dodson,R., Gwinn,M., Nelson,W., DeBoy,R., Kolonay,J., McClarty,G., Salzberg,S., Eisen,J. and Fraser,C. >>> 115411EB2DEE5654 >>> 1497 >>> 14689165 >>> Arch. Microbiol. 181 (2), 144-154 (2004) >>> The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner >>> Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. >>> 4D5D376EECCD186B >>> 1501 >>> 14689165 >>> Arch. Microbiol. 181 (2), 144-154 (2004) >>> The effect of FITA mutations on the symbiotic properties of Sinorhizobium fredii varies in a chromosomal-background-dependent manner >>> Vinardell,J.M., Lopez-Baena,F.J., Hidalgo,A., Ollero,F.J., Bellogin,R., Del Rosario Espuny,M., Temprano,F., Romero,F., Krishnan,H.B., Pueppke,S.G. and Ruiz-Sainz,J.E. >>> 4D57954EECDED66B >>> 1556 >>> 18060065 >>> PLoS ONE 2 (12), E1271 (2007) >>> Analysis of the Neurotoxin Complex Genes in Clostridium botulinum A1-A4 and B1 Strains: BoNT/A3, /Ba4 and /B1 Clusters Are Located within Plasmids >>> Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,A.C., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. >>> 698688FB6DB95247 >>> 1559 >>> 18060065 >>> PLoS ONE 2 (12), E1271 (2007) >>> Analysis of the neurotoxin complex genes in Clostridium botulinum A1-A4 and B1 strains: BoNT/A3, /Ba4 and /B1 clusters are located within plasmids >>> Smith,T.J., Hill,K.K., Foley,B.T., Detter,J.C., Munk,C.A., Bruce,D.C., Doggett,N.A., Smith,L.A., Marks,J.D., Xie,G. and Brettin,T.S. >>> E25E1BA99DB18F3D >>> >>> ? 
The second kind of error which I got was : org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature >>> ? Which means in richsequence object some feature have location object which have its feature set to null. >>> ? My Observation: >>> ? Usually occur when you try to persist a richsequence object to database, and occur to those features which have CompoundRichLocation usually "joins" and "complement" in cds region of a genbank record >>> ? After catching the hibernate exception I went through all the features and either biojava or hibernate changed the object type of a CompoundRichLocation to SimpleRichLocation and set the feature variable to null. >>> ? Below is the screen shot of one of my tests >>> ? Settings before trying to persits the richsequence object to database >>> >>> >>> ? >>> ? After trying to persits the richsequence object to database and got in hibernate exception catch >>> >>> ? >>> >>> ? So my question is why is this happening and how to stop or how to get these record into database, I have no clue why is this happening. >>> ? Some extra information to make things more clear to you guys. >>> ? Below are some Locus line from genbank record for which I know the error of location, I mean the cds region causing error, and array index in richsequence.feature arrayList object. >>> ? LOCUS AE001439 1643831 bp DNA circular BCT 19-JAN-2006 >>> ? richSequence.feature Index : 2540 and line number in the genbank record : 22115 >>> ? LOCUS CP001189 3887492 bp DNA circular BCT 16-OCT-2008 >>> ? richSequence.feature Index : 127 and line number in the genbank record : 2137 >>> ? LOCUS CP001292 328635 bp DNA circular BCT 17-DEC-2008 >>> ? richSequence.feature Index : 389 and line number in the genbank record : 3632 >>> ? LOCUS AM279694 238517 bp DNA linear BCT 23-OCT-2008 >>> ? richSequence.feature Index : 47 and line number in the genbank record : 4841 >>> ? 
LOCUS CR931663 18517 bp DNA linear BCT 18-SEP-2008 >>> ? richSequence.feature Index : 45 and line number in the genbank record : 442 >>> ? The complete exception msg : >>> org.hibernate.PropertyValueException: not-null property references a null or transient value: Location.feature >>> at org.hibernate.engine.Nullability.checkNullability(Nullability.java:72) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:290) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) >>> at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) >>> at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) >>> at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) >>> at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) >>> at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) >>> at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at 
org.hibernate.engine.Cascade.cascade(Cascade.java:130) >>> at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) >>> at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) >>> at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) >>> at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) >>> at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) >>> at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at org.hibernate.engine.Cascade.cascadeCollectionElements(Cascade.java:296) >>> at org.hibernate.engine.Cascade.cascadeCollection(Cascade.java:242) >>> at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:219) >>> at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) >>> at org.hibernate.engine.Cascade.cascade(Cascade.java:130) >>> at org.hibernate.event.def.AbstractSaveEventListener.cascadeAfterSave(AbstractSaveEventListener.java:456) >>> at 
org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:334) >>> at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) >>> at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:121) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) >>> at org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) >>> at org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27) >>> at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) >>> at org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535) >>> at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523) >>> at trashtesting.GenBankLoaderTesting.main(GenBankLoaderTesting.java:78) >>> >>> >>> >> -- >> Richard Holland, BSc MBCS >> Operations and Delivery Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: >> holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> >> >> >> >> > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > From biopython at maubp.freeserve.co.uk Thu Mar 25 22:16:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Mar 2010 22:16:55 +0000 Subject: [Biojava-dev] Bug fix for Biojava in regard to email with subject : ( Hibernate Exception and suggestion for change in BioSqlSchema) In-Reply-To: <4BABAFA1.6090806@orionbiosciences.com> References: <4BAABA21.4000301@gmail.com> 
<4FAB0AC5-3D97-4FD8-8A7E-81D1D6381D39@eaglegenomics.com> <4BABAFA1.6090806@orionbiosciences.com>
Message-ID: <320fb6e01003251516w2977ab2h9869342f94576287@mail.gmail.com>

On Thu, Mar 25, 2010 at 6:46 PM, Deepak Sheoran wrote:
>
> That is the reason why I was getting an error when I was creating a RichSequence
> object without any active session to BioSQL. I didn't have a clue that I had
> created one more bug by fixing one; thanks for noticing that and fixing
> it.
>
> I am thinking we should use BioPerl-BioJava and BioSQL compatibility as
> one of the Google Summer of Code projects. I have a vision for this, but don't
> know the right way to begin. This can help people who want to use BioJava
> but can't because they are afraid to lose their Perl code, which is heavily
> dependent on the Perl way of loading the schema. Or we could come up with a hybrid
> way which takes the good from both languages.
>
> Deepak Sheoran

That is an interesting idea for GSoC, I wonder if we at Biopython should
do the same. I know of a few things where we differ from BioPerl's BioSQL
support (e.g. SwissProt comment lines).

[I take it we agree that bioperl-db is the de facto reference implementation
for mapping GenBank etc into BioSQL?]

Peter

From bugzilla-daemon at portal.open-bio.org Fri Mar 26 06:14:17 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 02:14:17 -0400
Subject: [Biojava-dev] [Bug 3035] New: ParseException thrown when parsing PDB file.
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

           Summary: ParseException thrown when parsing PDB file.
           Product: BioJava
           Version: unspecified
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: structure
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: nakagawa-hiroyuki at mki.co.jp

When reading a PDB file using org.biojava.bio.structure.io.PDBFileReader on a
non-English platform, a java.text.ParseException is thrown.
java.text.ParseException: Unparseable date: "26-DEC-97"
        at java.text.DateFormat.parse(Unknown Source)
        at org.biojava.bio.structure.io.PDBFileParser.pdb_HEADER_Handler(PDBFileParser.java:433)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2067)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructure(PDBFileReader.java:486)
        at org.biojava.bio.structure.io.PDBFileReader.getStructure(PDBFileReader.java:466)
        at Test.main(Test.java:9)

To reproduce this symptom:
1. Set your operating system's default locale to a non-English one (e.g. Japanese).
2. Then run the test code described below.
Or simply run the test code with the option "-Duser.language=ja":

> java -Duser.language=ja Test

----Begin Test.java ----
import org.biojava.bio.structure.io.PDBFileReader;
import org.biojava.bio.structure.Structure;

public class Test {
    public static void main(String[] args) {
        String filename = "1a2b.pdb" ;
        PDBFileReader pdbreader = new PDBFileReader();
        try{
            Structure structure = pdbreader.getStructure(filename);
        } catch (Exception e){
            e.printStackTrace();
        }
    }
}
----End Test.java ----

The cause is that java.text.SimpleDateFormat can't parse the PDB-style
"dd-MMM-yy" date format in some non-English locales.
I attached a patch to correct this problem.

---- Begin PDBFileParser.java.diff ----
*** .\biojava-1.7.1\src\org\biojava\bio\structure\io\PDBFileParser.java.orig 2010-01-24 22:35:24.000000000 +0900
--- .\biojava-1.7.1\src\org\biojava\bio\structure\io\PDBFileParser.java 2010-03-19 11:34:28.571551900 +0900
***************
*** 271,277 ****
  current_compound = new Compound();
  dbrefs = new ArrayList();
! dateFormat = new SimpleDateFormat("dd-MMM-yy");
  atomCount = 0;
  atomOverflow = false;
--- 271,277 ----
  current_compound = new Compound();
  dbrefs = new ArrayList();
! dateFormat = new SimpleDateFormat("dd-MMM-yy", java.util.Locale.ENGLISH);
  atomCount = 0;
  atomOverflow = false;
---- End PDBFileParser.java.diff ----

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Mar 26 06:18:26 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 02:18:26 -0400
Subject: [Biojava-dev] [Bug 3035] ParseException thrown when parsing PDB file.
In-Reply-To:
Message-ID: <201003260618.o2Q6IQEV023480@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

------- Comment #1 from nakagawa-hiroyuki at mki.co.jp 2010-03-26 02:18 EST -------
Created an attachment (id=1467)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1467&action=view)
A patch to correct this problem

From bugzilla-daemon at portal.open-bio.org Fri Mar 26 16:25:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 12:25:14 -0400
Subject: [Biojava-dev] [Bug 3035] ParseException thrown when parsing PDB file.
In-Reply-To:
Message-ID: <201003261625.o2QGPEVe012950@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

andreas at sdsc.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
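The locale dependence behind bug 3035 can be tried outside BioJava with a few lines of plain JDK code. The sketch below is not taken from PDBFileParser: the class `PdbDateDemo` and its two helper methods are hypothetical names used only for illustration, while `SimpleDateFormat` and `Locale` are the standard java.text and java.util classes that the patch relies on.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Hypothetical demo class; not part of BioJava. */
class PdbDateDemo {

    /** Parses a PDB HEADER-style date such as "26-DEC-97" using English month names. */
    static Date parsePdbDate(String text) {
        try {
            // Fixing the locale makes "MMM" match English month abbreviations
            // regardless of the platform default (e.g. -Duser.language=ja).
            return new SimpleDateFormat("dd-MMM-yy", Locale.ENGLISH).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a PDB-style date: " + text, e);
        }
    }

    /** Reports whether the month names of the given locale can parse the text. */
    static boolean parsesIn(Locale locale, String text) {
        try {
            new SimpleDateFormat("dd-MMM-yy", locale).parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePdbDate("26-DEC-97"));
        System.out.println("ENGLISH:  " + parsesIn(Locale.ENGLISH, "26-DEC-97"));
        System.out.println("JAPANESE: " + parsesIn(Locale.JAPANESE, "26-DEC-97"));
    }
}
```

The Japanese case is the one the reporter saw fail. How other locales behave depends on the month names in the JDK's locale data, which is why pinning `Locale.ENGLISH` for a fixed file format looks like the safer choice.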
From bugzilla-daemon at portal.open-bio.org Fri Mar 26 16:27:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 12:27:56 -0400
Subject: [Biojava-dev] [Bug 3035] ParseException thrown when parsing PDB file.
In-Reply-To:
Message-ID: <201003261627.o2QGRu2r013123@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3035

andreas at sdsc.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from andreas at sdsc.edu 2010-03-26 12:27 EST -------
Applied the user-provided patch; the problem should be fixed.

From andreas at sdsc.edu Mon Mar 29 02:02:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 28 Mar 2010 19:02:49 -0700
Subject: [Biojava-dev] Biojava3 structure
In-Reply-To:
References:
Message-ID: <59a41c431003281902ic2c5ed3h4a2383899f465a8@mail.gmail.com>

Hi Scooter,

At present the structure modules depend on the alignment module and on the
(old) core module. This is for aligning ATOM and SEQRES residues in the PDB
files, and for the Smith-Waterman alignment based 3D structure superposition.

If we target a release of BioJava 3 in about a month, I don't think it will be
possible to break this out, mainly because the alignment module is still based
on the BioJava 1 code base. Overall I think that the core module probably
should still be part of the BioJava 3 release. Any opinions on that?

Andreas

On Sun, Mar 28, 2010 at 3:06 PM, Scooter Willis wrote:
> Andreas
>
> I needed to do some work with a PDB file so started to use the structure
> library. It looks like it depends on all the old biojava code. Mainly the
> structure exceptions that extend BioException are the first thing tripping me
> up. Should the biojava3-structure module have any external dependencies or
> am I working with the wrong package?
>
> Thanks
>
> Scooter