From markjschreiber at gmail.com Wed Oct 1 02:07:51 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 1 Oct 2008 14:07:51 +0800
Subject: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result
In-Reply-To:
References:
Message-ID: <93b45ca50809302307t19a652a4v4a61eeceec07aa62@mail.gmail.com>

Actually, if it is an OS-specific carriage return then there is still
a minor issue. We should really try and code stuff so that it can
handle files that originate from any major OS.

- Mark

On Wed, Oct 1, 2008 at 12:31 AM, Richard Holland wrote:
> Sounds like it _might_ be something to do with the carriage return
> itself. Is the blast file generated on the same OS that you're running
> your analysis on? (e.g. you might run Blast on a Linux box, but
> attempt to parse the file on a Windows box?). If the two OSes are
> different, this might point to it - as Linux won't necessarily
> understand the Windows linebreaks, or vice versa, and might
> misinterpret them. When you copy the portion of the file to a new file
> on the OS you're running the analysis on, it will substitute its own
> local linebreaks and thus mask the problem.
>
> So the first thing I'd check is what the two OSes involved are. If
> they're different, try running your analysis program on the same OS as
> the Blast output was generated on. If that does fix it, then try
> putting your Blast files through dos2unix or something similar to
> convert the linebreaks before running your analysis program.
>
> If they're the same OS, then we still have a problem!
>
> cheers,
> Richard
>
> 2008/9/30 David Toomey :
> > Hi
> >
> > I am parsing a blast result and I am getting a
> > StringIndexOutOfBoundsException.
> > The stack trace is
> >
> >     at java.lang.String.substring(String.java:1938)
> >     at java.lang.String.substring(String.java:1905)
> >     at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parseLine(BlastLikeAlignmentSAXParser.java:291)
> >     at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parse(BlastLikeAlignmentSAXParser.java:116)
> >     at org.biojava.bio.program.sax.HitSectionSAXParser.outputHSPInfo(HitSectionSAXParser.java:517)
> >     at org.biojava.bio.program.sax.HitSectionSAXParser.firstHSPEvent(HitSectionSAXParser.java:287)
> >     at org.biojava.bio.program.sax.HitSectionSAXParser.interpret(HitSectionSAXParser.java:251)
> >     at org.biojava.bio.program.sax.HitSectionSAXParser.parse(HitSectionSAXParser.java:117)
> >     at org.biojava.bio.program.sax.BlastSAXParser.hitsSectionReached(BlastSAXParser.java:634)
> >     at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:341)
> >     at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:168)
> >     at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:314)
> >     at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:276)
> >     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:163)
> >     at ie.rcsi.blast.StandardParser.parse(StandardParser.java:65)
> >     at ie.rcsi.blast.BlastParser.parse(BlastParser.java:44)
> >     at ie.rcsi.blast.Main.main(Main.java:30)
> >
> > I have updated BlastLikeAlignmentSAXParser to output some debug info and
> > narrowed down the line causing the problem to the following line
> >
> > 2,4-cyclodiphosphate synthase OS=Plasmodium falciparum (isolate 3D7)
> > GN=ISPF
> >
> > If I remove the carriage return and put it on a single line then everything
> > works fine.
> > Strangely, if I copy this entry and put it in a file on its own
> > it also parses correctly, even with the carriage return!!!
> >
> > Has anyone seen this before or does anyone have a suggestion on what I might
> > do to fix it? I can send the complete blast result if it would help. I have
> > tried using blast 2.2.18 and 2.2.17 and the problem is the same.
> >
> > Cheers
> >
> > Dave
> >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

From dtoomey at rcsi.ie Wed Oct 1 04:40:44 2008
From: dtoomey at rcsi.ie (David Toomey)
Date: Wed, 1 Oct 2008 09:40:44 +0100
Subject: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result
References:
Message-ID:

They are on the same OS. For all my tests I have run the blast search
and parsing on the same OS. This has mostly been Windows but I have
also tried the whole thing on Linux and I get the same problem.

I have done some more testing and I don't think the carriage return is
the problem. What I have found is that if the second line is shorter
than 11 characters the error is thrown. If I add 4 spaces in front of
the 'GN=ISPF' on the second line then it is parsed correctly, like
this.

2,4-cyclodiphosphate synthase OS=Plasmodium falciparum (isolate 3D7)
    GN=ISPF

I haven't figured out why it parses correctly when it is the only
entry in the file, even without the spaces. So maybe I am still
missing something.
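
The observation above (failure only when the continuation line has
fewer than 11 characters) is what an unguarded fixed-offset substring
call would produce. A minimal sketch, assuming that is the cause; the
class name, the offset of 11 and both helper methods are illustrative
and are not taken from BlastLikeAlignmentSAXParser:

```java
class ShortLineSlice {
    // Hypothetical fixed offset, chosen only because the post reports
    // failures for continuation lines shorter than 11 characters.
    static final int OFFSET = 11;

    // Unguarded slice: throws StringIndexOutOfBoundsException when the
    // line is shorter than OFFSET.
    static String sliceUnguarded(String line) {
        return line.substring(OFFSET);
    }

    // Guarded slice: returns an empty string for short lines instead of throwing.
    static String sliceGuarded(String line) {
        return line.length() >= OFFSET ? line.substring(OFFSET) : "";
    }

    public static void main(String[] args) {
        String shortLine = "GN=ISPF";       // 7 characters
        String paddedLine = "    GN=ISPF";  // 11 characters once 4 spaces are added

        System.out.println("guarded, short line:    [" + sliceGuarded(shortLine) + "]");
        System.out.println("unguarded, padded line: [" + sliceUnguarded(paddedLine) + "]");
        try {
            sliceUnguarded(shortLine);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded, short line:  " + e.getClass().getSimpleName());
        }
    }
}
```

If parseLine does something like the unguarded version, a length check
before slicing would avoid the exception, although it would not
explain why the same entry parses when it is alone in the file.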

Cheers,

Dave

-----Original Message-----
From: dicknetherlands at gmail.com [mailto:dicknetherlands at gmail.com] On Behalf Of Richard Holland
Sent: 30 September 2008 17:31
To: David Toomey
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result

Sounds like it _might_ be something to do with the carriage return
itself. Is the blast file generated on the same OS that you're running
your analysis on? (e.g. you might run Blast on a Linux box, but
attempt to parse the file on a Windows box?). If the two OSes are
different, this might point to it - as Linux won't necessarily
understand the Windows linebreaks, or vice versa, and might
misinterpret them. When you copy the portion of the file to a new file
on the OS you're running the analysis on, it will substitute its own
local linebreaks and thus mask the problem.

So the first thing I'd check is what the two OSes involved are. If
they're different, try running your analysis program on the same OS as
the Blast output was generated on. If that does fix it, then try
putting your Blast files through dos2unix or something similar to
convert the linebreaks before running your analysis program.

If they're the same OS, then we still have a problem!

cheers,
Richard

From holland at eaglegenomics.com Wed Oct 1 05:37:59 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 1 Oct 2008 10:37:59 +0100
Subject: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result
In-Reply-To:
References:
Message-ID:

Thanks for the extra info.

2008/10/1 David Toomey :
> They are on the same OS. For all my tests I have run the blast search and
> parsing on the same OS. This has mostly been Windows but I have also tried
> the whole thing on Linux and I get the same problem.
> I have done some more testing and I don't think the carriage return is the
> problem.
> What I have found is that if the second line is shorter than 11 characters the
> error is thrown.
> If I add 4 spaces in front of the 'GN=ISPF' on the second
> line then it is parsed correctly, like this.
>
> 2,4-cyclodiphosphate synthase OS=Plasmodium falciparum (isolate 3D7)
>     GN=ISPF
>
> I haven't figured out why it parses correctly when it is the only entry in
> the file, even without the spaces. So maybe I am still missing something.
>
> Cheers,
>
> Dave
>
> -----Original Message-----
> From: dicknetherlands at gmail.com [mailto:dicknetherlands at gmail.com] On Behalf Of Richard Holland
> Sent: 30 September 2008 17:31
> To: David Toomey
> Cc: biojava-l at lists.open-bio.org
> Subject: Re: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result
>
> Sounds like it _might_ be something to do with the carriage return
> itself. Is the blast file generated on the same OS that you're running
> your analysis on? (e.g. you might run Blast on a Linux box, but
> attempt to parse the file on a Windows box?). If the two OSes are
> different, this might point to it - as Linux won't necessarily
> understand the Windows linebreaks, or vice versa, and might
> misinterpret them. When you copy the portion of the file to a new file
> on the OS you're running the analysis on, it will substitute its own
> local linebreaks and thus mask the problem.
>
> So the first thing I'd check is what the two OSes involved are. If
> they're different, try running your analysis program on the same OS as
> the Blast output was generated on. If that does fix it, then try
> putting your Blast files through dos2unix or something similar to
> convert the linebreaks before running your analysis program.
>
> If they're the same OS, then we still have a problem!
>
> cheers,
> Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From pzgyuanf at gmail.com Wed Oct 1 06:52:25 2008
From: pzgyuanf at gmail.com (pprun)
Date: Wed, 01 Oct 2008 18:52:25 +0800
Subject: [Biojava-l] BufferedOutputStream to RichSequence.IOTools.writeXXX() method needs to flush manually
Message-ID:

Hi,
I don't know whether this is a feature or a bug.
If a BufferedOutputStream is passed to the method
RichSequence.IOTools.writeGenbank(OutputStream os, Sequence seq, Namespace ns),
I need to flush it manually at the end with BufferedOutputStream.flush().

Otherwise, the output content will be truncated.

Is this the expected behavior?

Thanks,
- Pprun

From holland at eaglegenomics.com Wed Oct 1 09:36:59 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 1 Oct 2008 14:36:59 +0100
Subject: [Biojava-l] BufferedOutputStream to RichSequence.IOTools.writeXXX() method needs to flush manually
In-Reply-To:
References:
Message-ID:

The IOTools interfaces accept OutputStream instances, not
BufferedOutputStream instances. flush() is not a requirement on
OutputStream and so BJX does not call it.

cheers,
Richard

2008/10/1 pprun :
> Hi,
> I don't know whether this is a feature or a bug.
> If a BufferedOutputStream is passed to the method
> RichSequence.IOTools.writeGenbank(OutputStream os, Sequence seq,
> Namespace ns),
> I need to flush it manually at the end with BufferedOutputStream.flush().
>
> Otherwise, the output content will be truncated.
>
> Is this the expected behavior?
>
> Thanks,
> - Pprun

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From markjschreiber at gmail.com Wed Oct 1 20:46:03 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 2 Oct 2008 08:46:03 +0800
Subject: [Biojava-l] BufferedOutputStream to RichSequence.IOTools.writeXXX() method needs to flush manually
In-Reply-To:
References:
Message-ID: <93b45ca50810011746y7d4f49biffd5c2e483c86bd1@mail.gmail.com>

As a general rule it is best if BioJava doesn't handle the flushing
and closing of OutputStreams. This is because you may want to keep
using the stream and control its behaviour. An interesting example is
if you pass System.out to a method that closes the stream. Probably
not what you want.

Having said that, maybe we should add a javadoc to say that
BufferedOutputStreams need to be flushed (and possibly closed).

- Mark

On Wed, Oct 1, 2008 at 9:36 PM, Richard Holland wrote:
> The IOTools interfaces accept OutputStream instances, not
> BufferedOutputStream instances. flush() is not a requirement on
> OutputStream and so BJX does not call it.
>
> cheers,
> Richard
>
> 2008/10/1 pprun :
>> Hi,
>> I don't know whether this is a feature or a bug.
>> If a BufferedOutputStream is passed to the method
>> RichSequence.IOTools.writeGenbank(OutputStream os, Sequence seq,
>> Namespace ns),
>> I need to flush it manually at the end with BufferedOutputStream.flush().
>>
>> Otherwise, the output content will be truncated.
>>
>> Is this the expected behavior?
>>
>> Thanks,
>> - Pprun
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

From gabrielle_doan at gmx.net Tue Oct 7 10:26:44 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Tue, 07 Oct 2008 16:26:44 +0200
Subject: [Biojava-l] Getting a part of a sequence
Message-ID: <48EB71A4.70409@gmx.net>

Hi all,
I have a BioSQL database which contains all human chromosomes. My
intention is to get the information about a particular gene. How can I
get a part of a particular chromosome with all associated features? At
the moment I use the following code to create my new sequence:

    RichSequence subSeq = RichSequence.Tools.subSequence(parent,
            position[0], position[1], ns, geneName, parent.getAccession(),
            parent.getIdentifier(), parent.getVersion() + 1,
            (Double) (parent.getVersion() + 1.0));

Here is how I get the parent sequence:

    public static RichSequence getChromosome(String chrNo) {
        Transaction tx = session.beginTransaction();
        RichSequence ret = null;

        String query;

        try {
            if (chrNo.equals("MT")) {
                query = "from BioEntry as be where be.description like '%:num%'";
                query = query.replaceAll(":num", "mitochondrion");
            } else {
                query = "from BioEntry as be where be.description like '%hromosome :num%'";
                query = query.replaceAll(":num", chrNo);
            }

            Query q = session.createQuery(query);

            ret = (RichSequence) q.list().get(0);
            tx.commit();
        } catch (Exception e) {
            tx.rollback();
            e.printStackTrace();
        }
        return ret;
    }

I always have to load the whole chromosome to get a part of it, so it
takes a very long time and I get a
lot of unused information (waste of memory). I also tried to use
ThinRichSequence instead of RichSequence, but I didn't notice any
difference.
Can you give me a hint how to accelerate the code?
I am grateful for any hints.

cheers,
Gabrielle

From holland at eaglegenomics.com Tue Oct 7 19:05:54 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 8 Oct 2008 00:05:54 +0100
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To: <48EB71A4.70409@gmx.net>
References: <48EB71A4.70409@gmx.net>
Message-ID:

Hello.

Your code is pretty good already - but you're right, it will load the
whole chromosome into memory before you can chop out the interesting
bit you actually need.

As you observed, by using ThinRichSequence in your query it will load
only the initial shell of a sequence object to start with, but the
moment you try and sub-sequence it, it will immediately load the whole
sequence data into memory in order to perform the operation.

If you only want the sequence data, as a string, you can do this by
specifying the sequence attribute in the query and bypassing the
sequence object entirely:

    select rs.stringSequence from Sequence as rs where rs.description
    like '%hromosome :num%'

This will return a String instead of a RichSequence object. You can
use HQL operators to perform substrings etc. on the string inside the
query itself - see
http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html
, particularly section 14.9.

If you only want the features, you can do this by using the
BioSQLFeatureFilter technique. In particular you will want the
BySequenceName filter, the And filter, and the OverlapsRichLocation
filter. You construct a filter then pass it to the filter() method in
BioSQLRichSequenceDB. The database will return to you all the
RichFeature objects that match your criteria.
Note that it searches the whole database so you really must use a
BySequenceName filter at the very least in order to make the results
useful!

However, you can't use HQL to construct a complete slice of a sequence
directly in the database before returning it to the program for use as
a ready-made RichSequence object. This would require Hibernate to know
what a BioJava sub-sequence object is and how it behaves in relation
to an 'unsliced' one, which is beyond the scope of its job as a
persistence framework.

cheers,
Richard

2008/10/7 Gabrielle Doan :
> Hi all,
> I have a BioSQL database which contains all human chromosomes. My intention
> is to get the information about a particular gene. How can I get a part of a
> particular chromosome with all associated features? At the moment I use
> the following code to create my new sequence:
>
>     RichSequence subSeq = RichSequence.Tools.subSequence(parent,
>             position[0], position[1], ns, geneName, parent.getAccession(),
>             parent.getIdentifier(), parent.getVersion() + 1,
>             (Double) (parent.getVersion() + 1.0));
>
> Here is how I get the parent sequence:
>
>     public static RichSequence getChromosome(String chrNo) {
>         Transaction tx = session.beginTransaction();
>         RichSequence ret = null;
>
>         String query;
>
>         try {
>             if (chrNo.equals("MT")) {
>                 query = "from BioEntry as be where be.description like '%:num%'";
>                 query = query.replaceAll(":num", "mitochondrion");
>             } else {
>                 query = "from BioEntry as be where be.description like '%hromosome :num%'";
>                 query = query.replaceAll(":num", chrNo);
>             }
>
>             Query q = session.createQuery(query);
>
>             ret = (RichSequence) q.list().get(0);
>             tx.commit();
>         } catch (Exception e) {
>             tx.rollback();
>             e.printStackTrace();
>         }
>         return ret;
>     }
>
> I always have to load the whole chromosome to get a part of it, so it takes
> a very long time and I get a lot of unused information (waste of memory).
> I also tried to use ThinRichSequence instead of RichSequence, but I
> didn't notice any difference.
> Can you give me a hint how to accelerate the code?
> I am grateful for any hints.
>
> cheers,
> Gabrielle

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From koen.bruynseels at cropdesign.com Tue Oct 7 20:02:18 2008
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Wed, 8 Oct 2008 02:02:18 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 04/10/2008 and will not return
until 09/10/2008. I will respond to your message when I return.

From gabrielle_doan at gmx.net Thu Oct 9 08:22:01 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Thu, 09 Oct 2008 14:22:01 +0200
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To:
References: <48EB71A4.70409@gmx.net>
Message-ID: <48EDF769.8050901@gmx.net>

Hi Richard,

thanks a lot for your mail. I have successfully retrieved the
subsequence of a sequence as a String.
And now I am trying to get the features for a particular range with
the following code:

    public FeatureHolder filterFeature(String name, int startpos, int endpos) {
        RichLocation rl = new SimpleRichLocation(new SimplePosition(startpos),
                new SimplePosition(endpos), 0);
        BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And(
                new BioSQLFeatureFilter.BySequenceName(name),
                new BioSQLFeatureFilter.OverlapsRichLocation(rl));
        return filter(filter);
    }

Unfortunately I received these errors:

Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143)
    at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151)
    at org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599)
    at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138)
    ... 3 more
Caused by: org.hibernate.PropertyAccessException: Exception occurred inside setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet
    at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65)
    at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
    at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
    at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571)
    at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133)
    at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
    at org.hibernate.loader.Loader.doQuery(Loader.java:729)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
    at org.hibernate.loader.Loader.doList(Loader.java:2213)
    at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104)
    at org.hibernate.loader.Loader.list(Loader.java:2099)
    at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94)
    at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569)
    at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283)
    ... 8 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
    ... 21 more
Caused by: java.lang.NullPointerException
    at org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103)
    at org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323)
    at org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451)
    at org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469)
    at org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363)
    at org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181)
    ... 26 more

Why do I get these errors?
BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter.
How can I find out the sequence name? Is it the value "name" in the
table "Bioentry"? As the built-in subSequence method takes a long time
I intend to get the subsequence as a String myself and add the
features to it. What do you think about this?

I'm grateful for any hints.

cheers,
Gabrielle

Richard Holland wrote:
> Hello.
>
> Your code is pretty good already - but you're right, it will load the
> whole chromosome into memory before you can chop out the interesting
> bit you actually need.
>
> As you observed, by using ThinRichSequence in your query it will load
> only the initial shell of a sequence object to start with, but the
> moment you try and sub-sequence it, it will immediately load the whole
> sequence data into memory in order to perform the operation.
>
> If you only want the sequence data, as a string, you can do this by
> specifying the sequence attribute in the query and bypassing the
> sequence object entirely:
>
>     select rs.stringSequence from Sequence as rs where rs.description
>     like '%hromosome :num%'
>
> This will return a String instead of a RichSequence object. You can
> use HQL operators to perform substrings etc. on the string inside the
> query itself - see
> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html
> , particularly section 14.9.
>
> If you only want the features, you can do this by using the
> BioSQLFeatureFilter technique. In particular you will want the
> BySequenceName filter, the And filter, and the OverlapsRichLocation
> filter. You construct a filter then pass it to the filter() method in
> BioSQLRichSequenceDB. The database will return to you all the
> RichFeature objects that match your criteria. Note that it searches
> the whole database so you really must use a BySequenceName filter at
> the very least in order to make the results useful!
>
> However, you can't use HQL to construct a complete slice of a sequence
> directly in the database before returning it to the program for use as
> a ready-made RichSequence object. This would require Hibernate to know
> what a BioJava sub-sequence object is and how it behaves in relation
> to an 'unsliced' one, which is beyond the scope of its job as a
> persistence framework.
>
> cheers,
> Richard
>
> 2008/10/7 Gabrielle Doan :
>> Hi all,
>> I have a BioSQL database which contains all human chromosomes. My intention
>> is to get the information about a particular gene. How can I get a part of a
>> particular chromosome with all associated features?
>> At the moment I use the
>> following code to create my new sequence:
>>
>>     RichSequence subSeq = RichSequence.Tools.subSequence(parent,
>>             position[0], position[1], ns, geneName, parent.getAccession(),
>>             parent.getIdentifier(), parent.getVersion() + 1,
>>             (Double) (parent.getVersion() + 1.0));
>>
>> Here is how I get the parent sequence:
>>
>>     public static RichSequence getChromosome(String chrNo) {
>>         Transaction tx = session.beginTransaction();
>>         RichSequence ret = null;
>>
>>         String query;
>>
>>         try {
>>             if (chrNo.equals("MT")) {
>>                 query = "from BioEntry as be where be.description like '%:num%'";
>>                 query = query.replaceAll(":num", "mitochondrion");
>>             } else {
>>                 query = "from BioEntry as be where be.description like '%hromosome :num%'";
>>                 query = query.replaceAll(":num", chrNo);
>>             }
>>
>>             Query q = session.createQuery(query);
>>
>>             ret = (RichSequence) q.list().get(0);
>>             tx.commit();
>>         } catch (Exception e) {
>>             tx.rollback();
>>             e.printStackTrace();
>>         }
>>         return ret;
>>     }
>>
>> I always have to load the whole chromosome to get a part of it, so it takes
>> a very long time and I get a lot of unused information (waste of memory). I
>> also tried to use ThinRichSequence instead of RichSequence, but I didn't
>> notice any difference.
>> Can you give me a hint how to accelerate the code?
>> I am grateful for any hints.
>>
>> cheers,
>> Gabrielle

From holland at eaglegenomics.com Fri Oct 10 10:30:03 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 10 Oct 2008 15:30:03 +0100
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To: <48EDF769.8050901@gmx.net>
References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net>
Message-ID:

This looks like a bug in BJX.
I have just committed what I think is a fix to the head of
Subversion. Can you check out the latest source, compile it, and try
your program again?

cheers,
Richard

2008/10/9 Gabrielle Doan
> Hi Richard,
>
> thanks a lot for your mail. I have successfully retrieved the subsequence
> of a sequence as a String. And now I am trying to get the features for a
> particular range with the following code:
>
>     public FeatureHolder filterFeature(String name, int startpos, int endpos) {
>         RichLocation rl = new SimpleRichLocation(new SimplePosition(startpos),
>                 new SimplePosition(endpos), 0);
>         BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And(
>                 new BioSQLFeatureFilter.BySequenceName(name),
>                 new BioSQLFeatureFilter.OverlapsRichLocation(rl));
>         return filter(filter);
>     }
>
> Unfortunately I received these errors:
>
> Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
>     at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143)
>     at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151)
>     at org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599)
>     at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138)
>     ... 3 more
> Caused by: org.hibernate.PropertyAccessException: Exception occurred inside setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet
>     at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65)
>     at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
>     at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
>     at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571)
>     at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133)
>     at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
>     at org.hibernate.loader.Loader.doQuery(Loader.java:729)
>     at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
>     at org.hibernate.loader.Loader.doList(Loader.java:2213)
>     at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104)
>     at org.hibernate.loader.Loader.list(Loader.java:2099)
>     at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94)
>     at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569)
>     at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283)
>     ... 8 more
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
>     ... 21 more
> Caused by: java.lang.NullPointerException
>     at org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103)
>     at org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323)
>     at org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451)
>     at org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469)
>     at org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363)
>     at org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181)
>     ... 26 more
>
> Why do I get these errors?
> BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. How
> can I find out the sequence name? Is it the value "name" in the table
> "Bioentry"? As the built-in subSequence method takes a long time I intend to
> get the subsequence as a String myself and add the features to it. What
> do you think about this?
>
> I'm grateful for any hints.
> cheers,
>
> Gabrielle
>
> Richard Holland wrote:
>> Hello.
>>
>> Your code is pretty good already - but you're right, it will load the
>> whole chromosome into memory before you can chop out the interesting
>> bit you actually need.
>>
>> As you observed, by using ThinRichSequence in your query it will load
>> only the initial shell of a sequence object to start with, but the
>> moment you try and sub-sequence it, it will immediately load the whole
>> sequence data into memory in order to perform the operation.
>>
>> If you only want the sequence data, as a string, you can do this by
>> specifying the sequence attribute in the query and bypassing the
>> sequence object entirely:
>>
>>     select rs.stringSequence from Sequence as rs where rs.description
>>     like '%hromosome :num%'
>>
>> This will return a String instead of a RichSequence object. You can
>> use HQL operators to perform substrings etc.
on the string inside the >> query itself - see >> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >> , particularly section 14.9. >> >> If you only want the features, you can do this by using the >> BioSQLFeatureFilter technique. In particular you will want the >> BySequenceName filter, the And filter, and the OverlapsRichLocation >> filter. You construct a filter then pass it to the filter() method in >> BioSQLRichSequenceDB. The database will return to you all the >> RichFeature objects that match your criteria. Note that it searches >> the whole database so you really must use a BySequenceName filter at >> the very least in order to make the results useful! >> >> However, you can't use HQL to construct a complete slice of a sequence >> directly in the database before returning it to the program for use as >> a ready-made RichSequence object. This would require Hibernate to know >> what a BioJava sub-sequence object is and how it behaves in relation >> to an 'unsliced' one, which is beyond the scope of it's job as a >> persistence framework. >> >> cheers, >> Richard >> >> >> >> 2008/10/7 Gabrielle Doan : >> >>> Hi all, >>> I have a BioSQL database which contains all human chromosomes. My >>> intention >>> is to get the information about a particular gene. How can I get a part >>> of a >>> particular chromosome with all associated features? 
At the moment I use >>> following code to create my new sequence: >>> >>> >>> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >>> position[0], position[1], ns, geneName, parent.getAccession(), >>> parent.getIdentifier(), parent.getVersion() + 1, >>> (Double) (parent.getVersion() + 1.0)); >>> <\code> >>> >>> Here is the part how I get the parent sequence: >>> >>> public static RichSequence getChromosome(String chrNo) { >>> Transaction tx = session.beginTransaction(); >>> RichSequence ret = null; >>> >>> String query; >>> >>> try { >>> if (chrNo.equals("MT")) { >>> query = "from BioEntry as be where >>> be.description like '%:num%'"; >>> query = query.replaceAll(":num", >>> "mitochondrion"); >>> } else { >>> query = "from BioEntry as be where >>> be.description like '%hromosome :num%'"; >>> query = query.replaceAll(":num", chrNo); >>> } >>> >>> Query q = session.createQuery(query); >>> >>> ret = (RichSequence) q.list().get(0); >>> tx.commit(); >>> } catch (Exception e) { >>> tx.rollback(); >>> e.printStackTrace(); >>> } >>> return ret; >>> } >>> <\code> >>> >>> I always have to load the whole chromsome to get a part of it, so it >>> takes >>> very long time and I get a lot of unused information (waste of memory). I >>> also tried to use ThinRichSequence<\code> instead of >>> RichSequence<\code>, but thereby I didn't notice any difference. >>> Can you give me a hint how to accelerate the code? >>> I am grateful for any hits. 
>>>
>>> cheers,
>>> Gabrielle
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From gabrielle_doan at gmx.net Tue Oct 14 07:18:20 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Tue, 14 Oct 2008 13:18:20 +0200
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To:
References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net>
Message-ID: <48F47FFC.4090607@gmx.net>

Hi Richard,
I have checked out the latest source and tried my code again. It still
didn't work and I received the following new errors:

Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143)
    at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151)
    at org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:612)
    at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138)
    ... 3 more
Caused by: org.hibernate.PropertyAccessException: Exception occurred inside setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet
    at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65)
    at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
    at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
    at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571)
    at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133)
    at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
    at org.hibernate.loader.Loader.doQuery(Loader.java:729)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
    at org.hibernate.loader.Loader.doList(Loader.java:2213)
    at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104)
    at org.hibernate.loader.Loader.list(Loader.java:2099)
    at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94)
    at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569)
    at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283)
    ... 8 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
    ... 21 more
Caused by: java.lang.NullPointerException
    at org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103)
    at org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323)
    at org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451)
    at org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469)
    at org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363)
    at org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181)
    ... 25 more
<\message>

I think BioSQLFeatureFilter.OverlapsRichLocation(rl) <\code> causes the
problem I have. Can you help me to solve this problem?

I'm grateful for any hints.
cheers,

Gabrielle

Richard Holland wrote:
> This looks like a bug in BJX. I have just committed a fix that I think will
> fix it to the head of subversion. Can you check out the latest source,
> compile it, and try your program again?
>
> cheers,
> Richard
>
> 2008/10/9 Gabrielle Doan
>
>> Hi Richard,
>>
>> thanks a lot for your mail. I have successfully retrieved the subsequence
>> of a sequence as a String.
And now I try to get the features for a >> particular range with following code: >> >> >> public FeatureHolder filterFeature(String name, int startpos, int >> endpos) { >> RichLocation rl = new SimpleRichLocation(new >> SimplePosition(startpos), >> new SimplePosition(endpos), 0); >> BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( >> new >> BioSQLFeatureFilter.BySequenceName(name), >> new >> BioSQLFeatureFilter.OverlapsRichLocation(rl)); >> return filter(filter); >> } >> <\code> >> >> Fortunately I received these errors: >> >> Exception in thread "main" java.lang.RuntimeException: >> java.lang.reflect.InvocationTargetException >> at >> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >> at >> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >> at >> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) >> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >> Caused by: java.lang.reflect.InvocationTargetException >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at >> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >> ... 
3 more >> Caused by: org.hibernate.PropertyAccessException: Exception occurred inside >> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >> at >> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >> at >> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >> at >> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >> at >> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >> at >> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >> at >> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >> at >> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >> at org.hibernate.loader.Loader.doList(Loader.java:2213) >> at >> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >> at org.hibernate.loader.Loader.list(Loader.java:2099) >> at >> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >> ... 8 more >> Caused by: java.lang.reflect.InvocationTargetException >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at >> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >> ... 
21 more >> Caused by: java.lang.NullPointerException >> at >> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >> at >> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >> at >> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >> at >> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >> at >> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >> at >> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >> ... 26 more >> <\message> >> >> Why do I get these errors? >> BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. How >> can I find out the sequence name? Is it the value "name" in the table >> "Bioentry"? As the build-in subSequence method takes a long time I intend to >> get the subsequence as a String by myself and add the features to it. What >> do you think about this? >> >> I'm grateful for any hints. >> cheers, >> >> Gabrielle >> >> >> >> Richard Holland schrieb: >> >> Hello. >>> Your code is pretty good already - but you're right, it will load the >>> whole chromosome into memory before you can chop out the interesting >>> bit you actually need. >>> >>> As you observed, by using ThinRichSequence in your query it will load >>> only the initial shell of a sequence object to start with, but the >>> moment you try and sub-sequence it, it will immediately load the whole >>> sequence data into memory in order to perform the operation. >>> >>> If you only want the sequence data, as a string, you can do this by >>> specifying the sequence attribute in the query and bypassing the >>> sequence object entirely: >>> >>> select rs.stringSequence from Sequence as rs where rs.description >>> like '%hromosome :num% >>> >>> This will return a String instead of a RichSequence object. You can >>> use HQL operators to perform substrings etc. 
on the string inside the >>> query itself - see >>> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >>> , particularly section 14.9. >>> >>> If you only want the features, you can do this by using the >>> BioSQLFeatureFilter technique. In particular you will want the >>> BySequenceName filter, the And filter, and the OverlapsRichLocation >>> filter. You construct a filter then pass it to the filter() method in >>> BioSQLRichSequenceDB. The database will return to you all the >>> RichFeature objects that match your criteria. Note that it searches >>> the whole database so you really must use a BySequenceName filter at >>> the very least in order to make the results useful! >>> >>> However, you can't use HQL to construct a complete slice of a sequence >>> directly in the database before returning it to the program for use as >>> a ready-made RichSequence object. This would require Hibernate to know >>> what a BioJava sub-sequence object is and how it behaves in relation >>> to an 'unsliced' one, which is beyond the scope of it's job as a >>> persistence framework. >>> >>> cheers, >>> Richard >>> >>> >>> >>> 2008/10/7 Gabrielle Doan : >>> >>>> Hi all, >>>> I have a BioSQL database which contains all human chromosomes. My >>>> intention >>>> is to get the information about a particular gene. How can I get a part >>>> of a >>>> particular chromosome with all associated features? 
At the moment I use >>>> following code to create my new sequence: >>>> >>>> >>>> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >>>> position[0], position[1], ns, geneName, parent.getAccession(), >>>> parent.getIdentifier(), parent.getVersion() + 1, >>>> (Double) (parent.getVersion() + 1.0)); >>>> <\code> >>>> >>>> Here is the part how I get the parent sequence: >>>> >>>> public static RichSequence getChromosome(String chrNo) { >>>> Transaction tx = session.beginTransaction(); >>>> RichSequence ret = null; >>>> >>>> String query; >>>> >>>> try { >>>> if (chrNo.equals("MT")) { >>>> query = "from BioEntry as be where >>>> be.description like '%:num%'"; >>>> query = query.replaceAll(":num", >>>> "mitochondrion"); >>>> } else { >>>> query = "from BioEntry as be where >>>> be.description like '%hromosome :num%'"; >>>> query = query.replaceAll(":num", chrNo); >>>> } >>>> >>>> Query q = session.createQuery(query); >>>> >>>> ret = (RichSequence) q.list().get(0); >>>> tx.commit(); >>>> } catch (Exception e) { >>>> tx.rollback(); >>>> e.printStackTrace(); >>>> } >>>> return ret; >>>> } >>>> <\code> >>>> >>>> I always have to load the whole chromsome to get a part of it, so it >>>> takes >>>> very long time and I get a lot of unused information (waste of memory). I >>>> also tried to use ThinRichSequence<\code> instead of >>>> RichSequence<\code>, but thereby I didn't notice any difference. >>>> Can you give me a hint how to accelerate the code? >>>> I am grateful for any hits. 
>>>>
>>>> cheers,
>>>> Gabrielle
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>

From holland at eaglegenomics.com Tue Oct 14 11:23:10 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 14 Oct 2008 16:23:10 +0100
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To: <48F47FFC.4090607@gmx.net>
References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net> <48F47FFC.4090607@gmx.net>
Message-ID:

Something's broken!

At least from your stack trace I can see exactly what's going on. The set
of locations is being loaded for the feature, but Hibernate is not calling
the setMin()/setMax() methods in each location before inserting them into
the set. When they get added to the set of locations for the feature, they
therefore get added with null for min and max. At any point when these
locations are used, for instance when they are merged by the feature
location setter, or anywhere else, you'll get NullPointerExceptions.

This is despite the fact that the HBM XML files are explicitly telling it
_not_ to lazy-load them. Also this only happens when loading Features, and
not when loading Sequence objects.

I honestly don't know!

What I suggest is that you create a temporary database with only one record
in it, and run your test program against that to see what happens. If it
still breaks, raise a bug on BugZilla and post the Genbank dump of the
database to BugZilla along with your program code and the full stacktrace.
Someone with a bit more Hibernate knowledge than me might then be able to
help out.

cheers,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From charles at imbusch.net Tue Oct 14 17:03:04 2008
From: charles at imbusch.net (Charles Imbusch)
Date: Tue, 14 Oct 2008 23:03:04 +0200
Subject: [Biojava-l] parsing tblastn results
Message-ID: <48F50908.5060307@imbusch.net>

Hello,

for a project I want to parse a tblastn result with BioJava. I used the code
on http://biojava.org/wiki/BioJava:CookBook:Blast:Parser as it is and I get an
error message as follows:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -3
    at java.lang.String.substring(String.java:1938)
    at java.lang.String.substring(String.java:1905)
    at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parseLine(BlastLikeAlignmentSAXParser.java:289)
    at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parse(BlastLikeAlignmentSAXParser.java:115)
    at org.biojava.bio.program.sax.HitSectionSAXParser.outputHSPInfo(HitSectionSAXParser.java:514)
    at org.biojava.bio.program.sax.HitSectionSAXParser.firstHSPEvent(HitSectionSAXParser.java:287)
    at org.biojava.bio.program.sax.HitSectionSAXParser.interpret(HitSectionSAXParser.java:251)
    at org.biojava.bio.program.sax.HitSectionSAXParser.parse(HitSectionSAXParser.java:118)
    at org.biojava.bio.program.sax.BlastSAXParser.hitsSectionReached(BlastSAXParser.java:635)
    at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:337)
    at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164)
    at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:313)
    at
org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:276) at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:162) at BlastEcho.echo(BlastEcho.java:29) at BlastEcho.main(BlastEcho.java:75) I uploaded the Blast output file I want to parse here: http://charles.imbusch.net/tmp/blastresult.txt Any answer is appreciated. Cheers, Charles From ayates at ebi.ac.uk Wed Oct 15 04:07:35 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 15 Oct 2008 09:07:35 +0100 Subject: [Biojava-l] ANN: EBI Course - Programmatic access in Java: webservices & work flows Message-ID: <48F5A4C7.7010304@ebi.ac.uk> Hi everyone, Posting this here as it may be of interest to some people. The EBI is holding a course in accessing a large number of its resources from Java programs. The course will run from the 24th - 27th November being held on-site at the Hinxton Genome Campus. Resources being covered will include: * Ontology Lookup Service - Offers access to multiple ontologies through a common interface * PICR - A tool for going between identifier spaces for proteins) * UniProt * IntAct * ChEBI * BioMart * Integr8 * CiteXplore * And many many more :) If you are interested in any of these resources then please go to http://www.ebi.ac.uk/training/handson/course_081124_javawebservices.html . The course will cost you ?75 for the 3 days. All the best, Andy Yates From holland at eaglegenomics.com Wed Oct 15 04:13:18 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 15 Oct 2008 09:13:18 +0100 Subject: [Biojava-l] parsing tblastn results In-Reply-To: <48F50908.5060307@imbusch.net> References: <48F50908.5060307@imbusch.net> Message-ID: I've raised a bug report for you. Hopefully someone will take a look at it soon: http://bugzilla.open-bio.org/show_bug.cgi?id=2617 cheers, Richard 2008/10/14 Charles Imbusch > Hello, > > for a project I want to parse a tblastn result with BioJava. 
> > Cheers, > Charles > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From dtoomey at rcsi.ie Wed Oct 15 05:46:58 2008 From: dtoomey at rcsi.ie (David Toomey) Date: Wed, 15 Oct 2008 10:46:58 +0100 Subject: [Biojava-l] parsing tblastn results References: <48F50908.5060307@imbusch.net> Message-ID: Hi Richard This looks suspiciously like a bug I raised a couple of weeks ago. I was parsing blastp results but the stack trace is the same. http://bugzilla.open-bio.org/show_bug.cgi?id=2603 Charles, I have updated the original bug with a hack which at least allows you to parse the result and get an output. You just need to recompile the source code with the modified 'BlastLikeAlignmentSAXParser.java. Not ideal but at least you will be able to run your code until the source is fixed. Cheers Dave -----Original Message----- From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Richard Holland Sent: 15 October 2008 09:13 To: Charles Imbusch Cc: biojava-l at biojava.org Subject: Re: [Biojava-l] parsing tblastn results I've raised a bug report for you. Hopefully someone will take a look at it soon: http://bugzilla.open-bio.org/show_bug.cgi?id=2617 cheers, Richard 2008/10/14 Charles Imbusch > Hello, > > for a project I want to parse a tblastn result with BioJava. 
> > Cheers, > Charles > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From gabrielle_doan at gmx.net Wed Oct 15 09:15:39 2008 From: gabrielle_doan at gmx.net (Gabrielle Doan) Date: Wed, 15 Oct 2008 15:15:39 +0200 Subject: [Biojava-l] Getting a part of a sequence In-Reply-To: <381a3e850810142152p4e0a0c2ds80a74570b44f2be0@mail.gmail.com> References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net> <48F47FFC.4090607@gmx.net> <381a3e850810140928p4af06cf4r3dfd08908efd42f6@mail.gmail.com> <48F4C99E.6070007@gmx.net> <381a3e850810142152p4e0a0c2ds80a74570b44f2be0@mail.gmail.com> Message-ID: <48F5ECFB.6040703@gmx.net> Hi Augusto, I've inserted your files into BJX. Unfortunately it hasn't solved my problems. Maybe Richard has another idea how to handle it. Best regards, Gabrielle Augusto Fernandes Vellozo schrieb: > Hi Gabrielle, > Please, let me know if the results ares ok or not. > I remember, when I made the corrections, I didn't see the case with > circularLength, because for my use case it doesn't matter and because > i don't understand exactly what is this. Take care, if you have this > use case. > > Cheers, > > Augusto > > 2008/10/14 Gabrielle Doan : >> Hi Augusto, >> >> thank you so much. I hope this will be the solution to my problem. >> >> cheers, >> Gabrielle >> >> Augusto Fernandes Vellozo schrieb: >>> Hi Gabrielle, >>> I had some problems with the class Location and i modified some >>> classes in my machine. I've already written to Richard. >>> The classes modified are attached. >>> These could help you. 
>>> >>> Good luck, >>> >>> Augusto >>> >>> 2008/10/14 Gabrielle Doan : >>>> Hi Richard, >>>> I have checked out the latest source and tried my code again. It still >>>> didn't work and I received following new errors: >>>> >>>> >>>> Exception in thread "main" java.lang.RuntimeException: >>>> java.lang.reflect.InvocationTargetException >>>> at >>>> >>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >>>> at >>>> >>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >>>> at >>>> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:612) >>>> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >>>> Caused by: java.lang.reflect.InvocationTargetException >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> >>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>> at >>>> >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>> at >>>> >>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >>>> ... 
3 more >>>> Caused by: org.hibernate.PropertyAccessException: Exception occurred >>>> inside >>>> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >>>> at >>>> >>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >>>> at >>>> >>>> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >>>> at >>>> >>>> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >>>> at >>>> >>>> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >>>> at >>>> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >>>> at >>>> >>>> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >>>> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >>>> at >>>> >>>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >>>> at org.hibernate.loader.Loader.doList(Loader.java:2213) >>>> at >>>> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >>>> at org.hibernate.loader.Loader.list(Loader.java:2099) >>>> at >>>> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >>>> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >>>> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >>>> ... 8 more >>>> Caused by: java.lang.reflect.InvocationTargetException >>>> at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) >>>> at >>>> >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>> at >>>> >>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >>>> ... 
21 more >>>> Caused by: java.lang.NullPointerException >>>> at >>>> >>>> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >>>> at >>>> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >>>> ... 25 more >>>> <\message> >>>> >>>> I think BioSQLFeatureFilter.OverlapsRichLocation(rl) <\code> >>>> causes >>>> the problem I have. Can you help me to solve this problem? >>>> >>>> I'm grateful for any hints. >>>> cheers, >>>> >>>> Gabrielle >>>> >>>> >>>> >>>> Richard Holland schrieb: >>>>> This looks like a bug in BJX. I have just committed a fix that I think >>>>> will >>>>> fix it to the head of subversion. Can you check out the latest source, >>>>> compile it, and try your program again? >>>>> >>>>> cheers, >>>>> Richard >>>>> >>>>> 2008/10/9 Gabrielle Doan >>>>> >>>>>> Hi Richard, >>>>>> >>>>>> thanks a lot for your mail. I have successfully retrieved the >>>>>> subsequence >>>>>> of a sequence as a String. 
And now I try to get the features for a >>>>>> particular range with following code: >>>>>> >>>>>> >>>>>> public FeatureHolder filterFeature(String name, int startpos, int >>>>>> endpos) { >>>>>> RichLocation rl = new SimpleRichLocation(new >>>>>> SimplePosition(startpos), >>>>>> new SimplePosition(endpos), 0); >>>>>> BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( >>>>>> new >>>>>> BioSQLFeatureFilter.BySequenceName(name), >>>>>> new >>>>>> BioSQLFeatureFilter.OverlapsRichLocation(rl)); >>>>>> return filter(filter); >>>>>> } >>>>>> <\code> >>>>>> >>>>>> Fortunately I received these errors: >>>>>> >>>>>> Exception in thread "main" java.lang.RuntimeException: >>>>>> java.lang.reflect.InvocationTargetException >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >>>>>> at >>>>>> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) >>>>>> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >>>>>> Caused by: java.lang.reflect.InvocationTargetException >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >>>>>> ... 
3 more >>>>>> Caused by: org.hibernate.PropertyAccessException: Exception occurred >>>>>> inside >>>>>> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >>>>>> at >>>>>> >>>>>> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >>>>>> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >>>>>> at org.hibernate.loader.Loader.doList(Loader.java:2213) >>>>>> at >>>>>> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >>>>>> at org.hibernate.loader.Loader.list(Loader.java:2099) >>>>>> at >>>>>> >>>>>> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >>>>>> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >>>>>> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >>>>>> ... 
8 more >>>>>> Caused by: java.lang.reflect.InvocationTargetException >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >>>>>> ... 21 more >>>>>> Caused by: java.lang.NullPointerException >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >>>>>> at >>>>>> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >>>>>> ... 26 more >>>>>> <\message> >>>>>> >>>>>> Why do I get these errors? >>>>>> BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. >>>>>> How >>>>>> can I find out the sequence name? Is it the value "name" in the table >>>>>> "Bioentry"? As the build-in subSequence method takes a long time I >>>>>> intend >>>>>> to >>>>>> get the subsequence as a String by myself and add the features to it. >>>>>> What >>>>>> do you think about this? >>>>>> >>>>>> I'm grateful for any hints. >>>>>> cheers, >>>>>> >>>>>> Gabrielle >>>>>> >>>>>> >>>>>> >>>>>> Richard Holland schrieb: >>>>>> >>>>>> Hello. 
>>>>>>> Your code is pretty good already - but you're right, it will load the >>>>>>> whole chromosome into memory before you can chop out the interesting >>>>>>> bit you actually need. >>>>>>> >>>>>>> As you observed, by using ThinRichSequence in your query it will load >>>>>>> only the initial shell of a sequence object to start with, but the >>>>>>> moment you try and sub-sequence it, it will immediately load the whole >>>>>>> sequence data into memory in order to perform the operation. >>>>>>> >>>>>>> If you only want the sequence data, as a string, you can do this by >>>>>>> specifying the sequence attribute in the query and bypassing the >>>>>>> sequence object entirely: >>>>>>> >>>>>>> select rs.stringSequence from Sequence as rs where rs.description >>>>>>> like '%hromosome :num% >>>>>>> >>>>>>> This will return a String instead of a RichSequence object. You can >>>>>>> use HQL operators to perform substrings etc. on the string inside the >>>>>>> query itself - see >>>>>>> >>>>>>> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >>>>>>> , particularly section 14.9. >>>>>>> >>>>>>> If you only want the features, you can do this by using the >>>>>>> BioSQLFeatureFilter technique. In particular you will want the >>>>>>> BySequenceName filter, the And filter, and the OverlapsRichLocation >>>>>>> filter. You construct a filter then pass it to the filter() method in >>>>>>> BioSQLRichSequenceDB. The database will return to you all the >>>>>>> RichFeature objects that match your criteria. Note that it searches >>>>>>> the whole database so you really must use a BySequenceName filter at >>>>>>> the very least in order to make the results useful! >>>>>>> >>>>>>> However, you can't use HQL to construct a complete slice of a sequence >>>>>>> directly in the database before returning it to the program for use as >>>>>>> a ready-made RichSequence object. 
This would require Hibernate to know >>>>>>> what a BioJava sub-sequence object is and how it behaves in relation >>>>>>> to an 'unsliced' one, which is beyond the scope of it's job as a >>>>>>> persistence framework. >>>>>>> >>>>>>> cheers, >>>>>>> Richard >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2008/10/7 Gabrielle Doan : >>>>>>> >>>>>>>> Hi all, >>>>>>>> I have a BioSQL database which contains all human chromosomes. My >>>>>>>> intention >>>>>>>> is to get the information about a particular gene. How can I get a >>>>>>>> part >>>>>>>> of a >>>>>>>> particular chromosome with all associated features? At the moment I >>>>>>>> use >>>>>>>> following code to create my new sequence: >>>>>>>> >>>>>>>> >>>>>>>> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >>>>>>>> position[0], position[1], ns, geneName, parent.getAccession(), >>>>>>>> parent.getIdentifier(), parent.getVersion() + 1, >>>>>>>> (Double) (parent.getVersion() + 1.0)); >>>>>>>> <\code> >>>>>>>> >>>>>>>> Here is the part how I get the parent sequence: >>>>>>>> >>>>>>>> public static RichSequence getChromosome(String chrNo) { >>>>>>>> Transaction tx = session.beginTransaction(); >>>>>>>> RichSequence ret = null; >>>>>>>> >>>>>>>> String query; >>>>>>>> >>>>>>>> try { >>>>>>>> if (chrNo.equals("MT")) { >>>>>>>> query = "from BioEntry as be where >>>>>>>> be.description like '%:num%'"; >>>>>>>> query = query.replaceAll(":num", >>>>>>>> "mitochondrion"); >>>>>>>> } else { >>>>>>>> query = "from BioEntry as be where >>>>>>>> be.description like '%hromosome :num%'"; >>>>>>>> query = query.replaceAll(":num", chrNo); >>>>>>>> } >>>>>>>> >>>>>>>> Query q = session.createQuery(query); >>>>>>>> >>>>>>>> ret = (RichSequence) q.list().get(0); >>>>>>>> tx.commit(); >>>>>>>> } catch (Exception e) { >>>>>>>> tx.rollback(); >>>>>>>> e.printStackTrace(); >>>>>>>> } >>>>>>>> return ret; >>>>>>>> } >>>>>>>> <\code> >>>>>>>> >>>>>>>> I always have to load the whole chromsome to get a part of it, so it >>>>>>>> takes 
>>>>>>>> very long time and I get a lot of unused information (waste of >>>>>>>> memory). >>>>>>>> I >>>>>>>> also tried to use ThinRichSequence<\code> instead of >>>>>>>> RichSequence<\code>, but thereby I didn't notice any >>>>>>>> difference. >>>>>>>> Can you give me a hint how to accelerate the code? >>>>>>>> I am grateful for any hits. >>>>>>>> >>>>>>>> cheers, >>>>>>>> Gabrielle >>>>>>>> _______________________________________________ >>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>> >>>>>>>> >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >>> >>> >> > > > From holland at eaglegenomics.com Sun Oct 19 20:18:29 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 20 Oct 2008 01:18:29 +0100 Subject: [Biojava-l] BioJava 3 Begins - Volunteers please! Message-ID: Hi all, I've just committed some new code to the biojava3 branch of the biojava-live subversion repository. It's the foundations of a brand new alphabet+symbol set of classes, and an example of how to use them to represent DNA. You'll notice that the new code is very lightweight and allows for a lot more flexibility than the old code - for instance, the concept of Alphabet has changed radically. It also makes much more extensive use of the Collections API. I haven't got any test cases or usage examples yet but give me a shout if you don't understand the code and I'll explain how it works. (Hint: SymbolFormat is there to convert Strings into SymbolList objects, and vice versa). So, now we want some volunteers! We're starting from scratch here so there's a lot of work to do. The whole of BioJava needs 'translating' into BJ3, whether it be copy-and-paste existing classes and modify them to suit the new style, or write completely new ones to provide equivalent functionality. 
I'll post an example of how to do file parsing soon, probably starting with FASTA. In the meantime, a good place to start would be for people to design object models to represent their favourite data types (e.g. Genbank, or microarray data). Utility classes to manipulate those objects would be great too. The object models need to be normalised as much as possible - e.g. if your data has a lot of comments, and the order of those comments is important, then give your object model a collection of comment objects. The object model for each data type should be completely independent and use basic data types wherever possible (e.g. store sequences as strings, don't attempt to parse them into anything fancy like SymbolLists). The closer the object model is to the original data format, the better. There's going to be clever tricks when it comes to converting data between different object models (e.g. Genbank to INSDSeq), which I will explain later when I put the file parsing examples up. You'll notice how the biojava3 branch uses Maven instead of Ant. This is because we want to make it as modular as possible, so if you want to write microarray stuff, create a new microarray sub-project (as per the dna example that's already there). This way if someone only wants the microarray bit of BJ3, they only need install the appropriate JAR file and can ignore the rest. (The 'core' module is for stuff that is so generic it could be used anywhere, or is used in every single other module.) If coding isn't your cup of tea, then we would very much welcome testers (particularly those who enjoy writing test cases!), documenters (particularly code commenters), translators (for internationalisation of the code), and of course all those who wish to contribute ideas and suggestions no matter how off-the-wall they might be. In particular if you'd like to take charge of an area of the development process, e.g. Documentation Chief, or Protein Champion, then that would be much appreciated. 
I'm very much looking forward to working with everyone on this. Good luck, and happy coding! cheers, Richard PS. Please don't forget to attach the appropriate licence to your code. You can copy-and-paste it from the existing classes I just committed this evening. PPS. For those who are worried about backwards compatibility - this was discussed on the lists a while back and it was made clear that BJ3 is a clean break. However, the existing code will continue to be maintained and bugfixed for a couple of years so you don't have to upgrade if you don't want to - it just won't have any new features developed for it. This is largely because it'll probably take just that long to write all the new BJ3 code. When we do decide to desupport the existing BJ code, plenty of notice will be given (i.e. years as opposed to months). -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Mon Oct 20 13:52:08 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 20 Oct 2008 18:52:08 +0100 Subject: [Biojava-l] File parsing in BJ3 Message-ID: (From now on I will only be posting these development messages to biojava-dev, which is the intended purpose of that list. Those of you who wish to keep track of things but are currently only subscribed to biojava-l should also subscribe to biojava-dev in order to keep up to date.) As promised, I've committed a new package in the biojava-core module that should help understand how to do file parsing and conversion and writing in the new BJ3 modules. Here's an example of how to use it to write a Genbank parser (note no parsers actually exist yet!): 1. Design yourself a Genbank class which implements the interface Thing and can fully represent all the data that might possibly occur inside a Genbank file. 2. 
Write an interface called GenbankReceiver, which extends ThingReceiver and defines all the methods you might need in order to construct a Genbank object in an asynchronous fashion.
3. Write a GenbankBuilder class which implements GenbankReceiver and ThingBuilder. Its job is to receive data via method calls, use that data to construct a Genbank object, then provide that object on demand.
4. Write a GenbankWriter class which implements GenbankReceiver and ThingWriter. Its job is similar to GenbankBuilder, but instead of constructing new Genbank objects, it writes Genbank records to file that reflect the data it receives.
5. Write a GenbankReader class which implements ThingReader. It can read Genbank files and output the data to the methods of the ThingReceiver provided to it, which in this case could be anything which implements the interface GenbankReceiver.
6. Write a GenbankEmitter class which implements ThingEmitter. It takes a Genbank object and will fire off data from it to the provided ThingReceiver (a GenbankReceiver instance) as if the Genbank object were being read from a file or some other source.

That's it! OK so it's a minimum of 6 classes instead of the original 1 or 2, but the additional steps are necessary for flexibility in converting between formats.

Now to use it (you'll probably want a GenbankTools class to wrap these steps up for user-friendliness, including various options for opening files, etc.):
1. To read a file - instantiate ThingParser with your GenbankReader as the reader, and GenbankBuilder as the receiver. Use the iterator methods on ThingParser to get the objects out.
2. To write a file - instantiate ThingParser with a GenbankEmitter wrapping your Genbank object, and a GenbankWriter as the receiver. Use the parseAll() method on the ThingParser to dump the whole lot to your chosen output.

The clever bit comes when you want to convert between files. Imagine you've done all the above for Genbank, and you've also done it for FASTA.
How to convert between them? What you need to do is this:
1. Implement all the classes for both Genbank and FASTA.
2. Write a GenbankFASTAConverter class that implements ThingConverter and GenbankReceiver, and will internally convert the data received and pass it on out to the receiver provided, which will be a FASTAReceiver instance.
3. Write a FASTAGenbankConverter class that operates in exactly the opposite way, implementing ThingConverter and FASTAReceiver.

Then to convert you use ThingParser again:
1. From FASTA file to Genbank object: Instantiate ThingParser with a FASTAReader reader, a GenbankBuilder receiver, and add a FASTAGenbankConverter instance to the converter chain. Use the iterator to get your Genbank objects out of your FASTA file.
2. From FASTA file to Genbank file: Same as option 1, but provide a GenbankWriter as the receiver instead and use parseAll() instead of the iterator methods.
3. From FASTA object to Genbank object: Same as option 1, but provide a FASTAEmitter wrapping your FASTA object as the reader instead.
4. From FASTA object to Genbank file: Same as option 1, but swap both the reader and the receiver as per options 2 and 3.
5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions of FASTA and Genbank, and use GenbankFASTAConverter instead.

One last and very important feature of this approach is that if you discover that nobody has written the appropriate converter for your chosen pair of formats A and C, but converters do exist to map A to some other format B and that other format B on to C, then you can just put the two converters A-B and B-C into the ThingParser chain and it'll work perfectly.

Enjoy!
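For readers who want something concrete, the reader / converter / builder chain described above might be sketched roughly as follows. This is only an illustration under stated assumptions: FastaReceiver, GenbankReceiver, Genbank, GenbankBuilder and FastaGenbankConverter here are hypothetical stand-ins written for this sketch, not the actual classes committed to the biojava3 branch, and the real Thing* interfaces may well have different methods. The ThingParser plumbing is replaced by a direct method call.

```java
import java.util.ArrayList;
import java.util.List;

public class ReceiverChainSketch {

    // Hypothetical event interfaces; the real BJ3 receivers may look quite different.
    public interface FastaReceiver {
        void startRecord();
        void header(String id, String description);
        void sequence(String residues);
        void endRecord();
    }

    public interface GenbankReceiver {
        void startRecord();
        void locus(String name, int length);
        void definition(String text);
        void origin(String residues);
        void endRecord();
    }

    /** Plain data object kept close to the file format (sequence stored as a String). */
    public static final class Genbank {
        public String locus;
        public int length;
        public String definition;
        public String origin;
    }

    /** Builder: receives events via method calls and assembles Genbank objects. */
    public static final class GenbankBuilder implements GenbankReceiver {
        private final List<Genbank> built = new ArrayList<>();
        private Genbank current;
        public void startRecord() { current = new Genbank(); }
        public void locus(String name, int length) { current.locus = name; current.length = length; }
        public void definition(String text) { current.definition = text; }
        public void origin(String residues) { current.origin = residues; }
        public void endRecord() { built.add(current); current = null; }
        public List<Genbank> results() { return built; }
    }

    /** Converter: acts as a FastaReceiver and forwards translated events to a GenbankReceiver. */
    public static final class FastaGenbankConverter implements FastaReceiver {
        private final GenbankReceiver out;
        private String id;
        private String description;
        private final StringBuilder residues = new StringBuilder();
        public FastaGenbankConverter(GenbankReceiver out) { this.out = out; }
        public void startRecord() { id = null; description = null; residues.setLength(0); out.startRecord(); }
        public void header(String id, String description) { this.id = id; this.description = description; }
        public void sequence(String part) { residues.append(part); }
        public void endRecord() {
            // The length is only known once every sequence line has been received.
            out.locus(id, residues.length());
            out.definition(description);
            out.origin(residues.toString());
            out.endRecord();
        }
    }

    /** Reader: splits FASTA text into lines (accepting LF, CRLF or CR) and fires events. */
    public static void readFasta(String text, FastaReceiver receiver) {
        boolean open = false;
        for (String raw : text.split("\\r?\\n|\\r")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                if (open) receiver.endRecord();
                receiver.startRecord();
                open = true;
                String[] parts = line.substring(1).split("\\s+", 2);
                receiver.header(parts[0], parts.length > 1 ? parts[1] : "");
            } else if (open) {
                receiver.sequence(line);
            }
        }
        if (open) receiver.endRecord();
    }

    /** FASTA text to Genbank objects: reader -> converter -> builder. */
    public static List<Genbank> fastaToGenbank(String fastaText) {
        GenbankBuilder builder = new GenbankBuilder();
        readFasta(fastaText, new FastaGenbankConverter(builder));
        return builder.results();
    }

    public static void main(String[] args) {
        for (Genbank g : fastaToGenbank(">seq1 test sequence\nACGT\nTTGA\n")) {
            System.out.println(g.locus + " " + g.length + " " + g.definition + " " + g.origin);
        }
    }
}
```

In this sketch, swapping GenbankBuilder for a class that writes each event to a file would give the "FASTA file to Genbank file" case without touching the reader or the converter, which is the flexibility the extra classes are meant to buy.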
cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From markjschreiber at gmail.com Mon Oct 20 22:54:27 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 21 Oct 2008 10:54:27 +0800 Subject: [Biojava-l] Biojava / BioSQL entity beans Message-ID: <93b45ca50810201954k44ab0f65xb94a0214d8eb4e13@mail.gmail.com> Hi - Richard has kindly uploaded some JPA Entity beans that map to the BioSQL database schema as a BioSQL module for BJ3. These entity beans were generated as part of the Tokyo webservices workshop. As Entities they are useful as POJOs as well as for data transfer via JPA or JAXB, and can be used in EJB containers or a plain old JVM. They have no biological smarts and the intention was/is that these will be provided by wrapping them in Bio-aware (and more thread safe) wrappers that implement interfaces from other BJ3 modules. In essence it is a persistence layer. The following is copied verbatim from the package-info.java and gives you some idea of how I intend the package to be used (obviously some of this is still to come). There is also some discussion of some of the gotchas that might trip you up when playing with object relational persistence. BTW the naming convention is to call something FooEntity. Where BioSQL requires a compound primary key this is implemented as an Embeddable object called FooEntityPK which is the key for FooEntity. The other thing you may see is FooEntityUK which is the same concept but represents some of the cases where BioSQL tables don't have a primary key (even a compound one) but implicitly they do because all the fields have the SQL unique restriction. In these cases JPA still requires an Embeddable key to track updates. As far as Java is concerned they are the same as a FooEntityPK but I used a different name to make the distinction. The annotations provide mapping to tables from a Derby database.
This is the reference Java in-memory DB which can run from any JVM and is also found in Glassfish. The mappings will likely also work with MySQL. For Oracle (and possibly others) you would need to override the @GeneratedValue strategy for generating primary keys. I believe this can be done with external XML config files. You may also wish to override the default eager loading and cascade annotations depending on your JPA persistence method and preferences. This has been lightly tested using Glassfish, Derby and Toplink essentials and is a work in progress but seems to work OK. Best regards, - Mark /** * The package contains Entity representations of BioJava classes. * The purpose of these entities is to allow simple serialization of BioJava data * using binary serialization for protocols that require this (eg RPC between * Java application servers) as well as persistence mechanisms that require bean * like objects such as the Java Persistence API (JPA) or the * Java Architecture for XML Binding (JAXB). For this reason all objects in this package * should provide a parameterless public constructor and public get/set methods * for relevant fields. *

* Given the public nature of the constructors and the setters in these beans * these classes are not intended for direct use in general programming when * using the BioJava v3 API. This is because it is possible to leave a bean in * an inconsistent state, and the beans are not thread safe unless synchronization is * controlled externally (via synchronized blocks or via an application container). *

* The Entities are intended to back other objects that a * programmer will interact with directly. For example Foo.class will be backed * by FooEntity.class. Generally interaction with Foo.class is to be preferred and * will often be more sensible as the entities typically provide no 'biological * behaviour'. Relevant behaviour should be provided by the wrapping class. It is best * to think of Foo as a view onto the data that is held in the * FooEntity. A good example is the sophisticated Symbol * behaviour that can represent biological logic about IUPAC ambiguity symbols. * For example a 'w' in a Biosequence represents an ambiguity between 'a' and 't', * whereas a 'w' in BiosequenceEntity is simply a 'w' and nothing else. *

* The wrapper entity pattern is intended to allow for a lot of the advanced * behaviour in the original BioJava while also allowing use of modern transport * and persistence packages. This is achieved by persisting and transporting the * entity without the wrapper and re-wrapping it at the other end. *
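A minimal sketch of the wrapper/entity split described above, under stated assumptions: the class shapes, the `basesAt` and `unwrap` methods, and the partial IUPAC table (only a, c, g, t and w) are illustrative and are not taken from the BJ3 code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical entity: a plain bean, no biological behaviour.
class BiosequenceEntity {
    private String seq;

    public BiosequenceEntity() { }

    public String getSeq() { return seq; }
    public void setSeq(String seq) { this.seq = seq; }
}

// Hypothetical wrapper: a view onto the entity that adds IUPAC logic.
class Biosequence {
    // Deliberately partial table, just enough to show the 'w' case.
    private static final Map<Character, Set<Character>> IUPAC =
            new HashMap<Character, Set<Character>>();
    static {
        IUPAC.put('a', bases('a'));
        IUPAC.put('c', bases('c'));
        IUPAC.put('g', bases('g'));
        IUPAC.put('t', bases('t'));
        IUPAC.put('w', bases('a', 't')); // 'w' = weak = a or t
    }

    private static Set<Character> bases(Character... cs) {
        return Collections.unmodifiableSet(
                new LinkedHashSet<Character>(Arrays.asList(cs)));
    }

    private final BiosequenceEntity entity;

    Biosequence(BiosequenceEntity entity) { this.entity = entity; }

    // The bases a symbol may stand for; the entity itself only knows the raw char.
    Set<Character> basesAt(int index) {
        char c = Character.toLowerCase(entity.getSeq().charAt(index));
        Set<Character> bases = IUPAC.get(c);
        if (bases == null) {
            throw new IllegalArgumentException("Unsupported symbol: " + c);
        }
        return bases;
    }

    // Hand back the bare entity for persistence or transport.
    BiosequenceEntity unwrap() { return entity; }
}
```

The entity is what would be persisted or sent over the wire; the receiving side would wrap it again with `new Biosequence(entity)`.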

* Currently BioJava v3 uses annotated @Id fields to define * equals(Object o). Consistent definition is critical to how * the object will behave when persisted to a database. In the case of: *

 * Foo f = ... initialize
 * Foo fo = ... initialize
 * boolean b = f.equals(fo);
 * 
* b would be true if both objects share the same value * (or embeddable object) in the field that represents the primary key in the * database even if other fields differ. This is desirable because * two entities representing the same DB record may be retrieved from two different * sessions. Additionally these are the identity fields, so logically, they should map to * the concept of identity. Finally, searching a collection is made very simple * without requiring an iterator: *
 * Integer id = //code to initialize
 * collection.contains(new Foo(id));
 * 
* By default BioJava v3 entities use only the primary key field for equality. * If either record has null as the primary key value it is never equal * to another. When implementing equals(Object o) it is not advisable to perform * the test this.getClass() == o.getClass() because of the possibility of proxy * classes used in JPA. Keying equality on the PK can, however, lead to an issue with the * hashCode() method. Consider the following code: *
 * Foo foo = new Foo(); // no primary key yet
 * HashSet<Foo> set = new HashSet<Foo>();
 * set.add(foo);
 * // code here to persist Foo and consequently generate its PK
 * boolean b = set.contains(foo);
 * 
* Because only the PK is used for equality, the PK is also used to compute the hash code. * This means that b is probably going to be false because * foo was stored in a hash bucket chosen from the old hash code, which * is now different, even though the set actually does contain a reference to foo. * Although this is a potential deficiency it is unlikely to be a major problem for * BioJava v3 developers because using entity backed objects is preferred to direct * interaction with entities. If you need to use entities directly then use hashed * collections with caution. * *
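The pitfall above can be reproduced with a small stand-alone sketch. `FooEntity` here is a hypothetical bean written for illustration (it is not the generated BioSQL entity), and `setId(42)` stands in for the persistence provider assigning the key.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical entity whose identity is its primary key only.
class FooEntity {
    private Integer id; // null until persisted

    public FooEntity() { }

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        // instanceof rather than getClass(), so proxy subclasses still match
        if (!(o instanceof FooEntity)) return false;
        FooEntity other = (FooEntity) o;
        // a null key is never equal to another record
        return id != null && id.equals(other.getId());
    }

    @Override
    public int hashCode() {
        // the hash follows the key, so it changes once the key is assigned
        return id == null ? 0 : id.hashCode();
    }
}

class HashPitfallDemo {
    // Reports whether the set still finds the entity after its key is assigned.
    static boolean foundAfterKeyAssigned() {
        Set<FooEntity> set = new HashSet<FooEntity>();
        FooEntity foo = new FooEntity();
        set.add(foo);   // stored under the hash of a null key
        foo.setId(42);  // stand-in for the PK being generated on persist
        return set.contains(foo);
    }
}
```

One way around this, if entities must live in hashed collections, is to add them only after the key has been assigned, or to rebuild the collection afterwards so that every element is rehashed.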

A wrapper class can either delegate its equals call to the underlying * entity or do something that is more biologically sensible * (as PK values are typically not exposed in the wrapper). It is probably more * sensible for a wrapper to define its own equals (and hashCode) * implementations due to the limitations of the default @Id based system * described above, especially the potential hashCode problems. * * For example FooSequence.class might want to base * equality on the exact match of the DNA sequence it holds even though * FooSequenceEntity.class may only use the PK field. Whether or not delegation * is used, the choice should be clearly documented. *

* *

* @author Mark Schreiber */ package org.biojava.biosql.entity; From markjschreiber at gmail.com Mon Oct 20 23:16:51 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 21 Oct 2008 11:16:51 +0800 Subject: [Biojava-l] File parsing in BJ3 In-Reply-To: References: Message-ID: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> So if I want to build a BioSQL loader from Genbank then would the classes (or their wrappers) in the BioSQL Entity package need to implement Thing? Would Maven have an issue with that or would it just create a dependency on core? (You can tell I've never used Maven, right?) From a design point of view should Thing be an interface or an Annotation? The reason I ask is that it doesn't define any methods so it is more of a tag than an interface. Anyway, my understanding is that I would use a Genbank parser (or write one). I would then write an EntityReceiver interface (probably more than one, given the number of entities in BioSQL) and implement an EntityBuilder (again possibly more than one) that implements EntityReceiver and builds Entity beans from the messages it receives. In this case I probably wouldn't provide a writer as JPA would be writing the beans to the database. Would this be how you imagine it? - Mark On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland wrote: > (From now on I will only be posting these development messages to > biojava-dev, which is the intended purpose of that list. Those of you who > wish to keep track of things but are currently only subscribed to biojava-l > should also subscribe to biojava-dev in order to keep up to date.) > > As promised, I've committed a new package in the biojava-core module that > should help understand how to do file parsing and conversion and writing in > the new BJ3 modules. Here's an example of how to use it to write a Genbank > parser (note no parsers actually exist yet!): > > 1.
Design yourself a Genbank class which implements the interface Thing and > can fully represent all the data that might possibly occur inside a Genbank > file. > > 2. Write an interface called GenbankReceiver, which extends ThingReceiver > and defines all the methods you might need in order to construct a Genbank > object in an asynchronous fashion. > > 3. Write a GenbankBuilder class which implements GenbankReceiver and > ThingBuilder. Its job is to receive data via method calls, use that data to > construct a Genbank object, then provide that object on demand. > > 4. Write a GenbankWriter class which implements GenbankReceiver and > ThingWriter. Its job is similar to GenbankBuilder, but instead of > constructing new Genbank objects, it writes Genbank records to file that > reflect the data it receives. > > 5. Write a GenbankReader class which implements ThingReader. It can read > Genbank files and output the data to the methods of the ThingReceiver > provided to it, which in this case could be anything which implements the > interface GenbankReceiver. > > 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a > Genbank object and will fire off data from it to the provided ThingReceiver > (a GenbankReceiver instance) as if the Genbank object was being read from a > file or some other source. > > That's it! OK so it's a minimum of 6 classes instead of the original 1 or 2, > but the additional steps are necessary for flexibility in converting between > formats. > > Now to use it (you'll probably want a GenbankTools class to wrap these steps > up for user-friendliness, including various options for opening files, > etc.): > > 1. To read a file - instantiate ThingParser with your GenbankReader as the > reader, and GenbankBuilder as the receiver. Use the iterator methods on > ThingParser to get the objects out. > > 2. To write a file - instantiate ThingParser with a GenbankEmitter wrapping > your Genbank object, and a GenbankWriter as the receiver.
Use the parseAll() > method on the ThingParser to dump the whole lot to your chosen output. > > The clever bit comes when you want to convert between files. Imagine you've > done all the above for Genbank, and you've also done it for FASTA. How to > convert between them? What you need to do is this: > > 1. Implement all the classes for both Genbank and FASTA. > > 2. Write a GenbankFASTAConverter class that implements ThingConverter > and GenbankReceiver, and will internally convert the data received and pass > it on out to the receiver provided, which will be a FASTAReceiver instance. > > 3. Write a FASTAGenbankConverter class that operates in exactly the opposite > way, implementing ThingConverter and FASTAReceiver. > > Then to convert you use ThingParser again: > > 1. From FASTA file to Genbank object: Instantiate ThingParser with a > FASTAReader reader, a GenbankBuilder receiver, and add a > FASTAGenbankConverter instance to the converter chain. Use the iterator to > get your Genbank objects out of your FASTA file. > > 2. From FASTA file to Genbank file: Same as option 1, but provide a > GenbankWriter instead and use parseAll() instead of the iterator methods. > > 3. From FASTA object to Genbank object: Same as option 1, but provide a > FASTAEmitter wrapping your FASTA object as the reader instead. > > 4. From FASTA object to Genbank file: Same as option 1, but swap both the > reader and the receiver as per options 2 and 3. > > 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions > of FASTA and Genbank, and use GenbankFASTAConverter instead. > > One last and very important feature of this approach is that if you discover > that nobody has written the appropriate converter for your chosen pair of > formats A and C, but converters do exist to map A to some other format B and > that other format B on to C, then you can just put the two converters A-B and > B-C into the ThingParser chain and it'll work perfectly. > > Enjoy!
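To make the receiver/builder/converter chain described above more concrete, here is a minimal sketch. Only the names Thing, ThingReceiver and the Genbank/FASTA class roles come from the message; every method signature (startThing, finishThing, locus, origin, header, sequence, getBuilt) is an assumption made for illustration, and the real interfaces in biojava-core may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Marker interface, as described in the message.
interface Thing { }

// Assumed shape: start/finish events bracket each record.
interface ThingReceiver {
    void startThing();
    void finishThing();
}

// Hypothetical event sets for the two formats.
interface GenbankReceiver extends ThingReceiver {
    void locus(String name);
    void origin(String sequenceLine);
}

interface FastaReceiver extends ThingReceiver {
    void header(String header);
    void sequence(String chunk);
}

// A deliberately tiny FASTA object model.
class Fasta implements Thing {
    final String header;
    final String sequence;

    Fasta(String header, String sequence) {
        this.header = header;
        this.sequence = sequence;
    }
}

// Builder: receives events and assembles Fasta objects on demand.
class FastaBuilder implements FastaReceiver {
    private final List<Fasta> built = new ArrayList<Fasta>();
    private String header;
    private StringBuilder seq = new StringBuilder();

    public void startThing() { header = null; seq = new StringBuilder(); }
    public void header(String h) { header = h; }
    public void sequence(String chunk) { seq.append(chunk); }
    public void finishThing() { built.add(new Fasta(header, seq.toString())); }

    List<Fasta> getBuilt() { return built; }
}

// Converter: looks like a GenbankReceiver upstream, drives a FastaReceiver downstream.
class GenbankFastaConverter implements GenbankReceiver {
    private final FastaReceiver downstream;

    GenbankFastaConverter(FastaReceiver downstream) { this.downstream = downstream; }

    public void startThing() { downstream.startThing(); }
    public void locus(String name) { downstream.header(name); }
    public void origin(String sequenceLine) {
        // ORIGIN lines carry position numbers and spaces; keep letters only
        downstream.sequence(sequenceLine.replaceAll("[^A-Za-z]", ""));
    }
    public void finishThing() { downstream.finishThing(); }
}
```

A reader or emitter would then call `startThing()`, `locus(...)`, `origin(...)` and `finishThing()` on the converter, and the `FastaBuilder` at the end of the chain would hold the resulting `Fasta` objects. Chaining A-B and B-C converters works the same way, because each converter only sees the receiver interface of the next link.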
> > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From andreas at sdsc.edu Mon Oct 20 23:17:28 2008 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 20 Oct 2008 20:17:28 -0700 Subject: [Biojava-l] [Biojava-dev] BioJava 3 Begins - Volunteers please! In-Reply-To: References: Message-ID: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> Hi, Couple of thoughts regarding biojava v3: License: Since it seems we will end up copying code from biojava 1.6 to biojava 3.0, we need to keep the license the same (LGPL 2.1). I.e. people should still use the same biojava license headers when committing new files and all code will be considered to be LGPL, if no header is present. Do NOT commit code under other licenses. Installation: We need some installation instructions on the wiki site, e.g. how to get the maven setup running. What are the code conventions for the new version? Blast: the Blast parsing modules are among the most frequently used ones in biojava 1.6. To make people use biojava v3 it will be crucial to have a port of them to the new version. Does anybody want to take care of that? Automated builds: is it interesting to have automated builds set up for the new version at this stage, or should we wait until a more mature stage? I could easily add another auto-build similar to the one for biojava 1.6 at http://www.spice-3d.org/cruise/ Andreas On Sun, Oct 19, 2008 at 5:18 PM, Richard Holland wrote: > Hi all, > > I've just committed some new code to the biojava3 branch of the biojava-live > subversion repository. It's the foundations of a brand new alphabet+symbol > set of classes, and an example of how to use them to represent DNA. 
You'll > notice that the new code is very lightweight and allows for a lot more > flexibility than the old code - for instance, the concept of Alphabet has > changed radically. It also makes much more extensive use of the Collections > API. > > I haven't got any test cases or usage examples yet but give me a shout if > you don't understand the code and I'll explain how it works. (Hint: > SymbolFormat is there to convert Strings into SymbolList objects, and vice > versa). > > So, now we want some volunteers! We're starting from scratch here so there's > a lot of work to do. The whole of BioJava needs 'translating' into BJ3, > whether it be copy-and-paste existing classes and modify them to suit the > new style, or write completely new ones to provide equivalent functionality. > > > I'll post an example of how to do file parsing soon, probably starting with > FASTA. In the meantime, a good place to start would be for people to design > object models to represent their favourite data types (e.g. Genbank, or > microarray data). Utility classes to manipulate those objects would be great > too. > > The object models need to be normalised as much as possible - e.g. if your > data has a lot of comments, and the order of those comments is important, > then give your object model a collection of comment objects. The object > model for each data type should be completely independent and use basic data > types wherever possible (e.g. store sequences as strings, don't attempt to > parse them into anything fancy like SymbolLists). The closer the object > model is to the original data format, the better. There's going to be clever > tricks when it comes to converting data between different object models > (e.g. Genbank to INSDSeq), which I will explain later when I put the file > parsing examples up. > > You'll notice how the biojava3 branch uses Maven instead of Ant. 
This is > because we want to make it as modular as possible, so if you want to write > microarray stuff, create a new microarray sub-project (as per the dna > example that's already there). This way if someone only wants the microarray > bit of BJ3, they only need install the appropriate JAR file and can ignore > the rest. (The 'core' module is for stuff that is so generic it could be > used anywhere, or is used in every single other module.) > > If coding isn't your cup of tea, then we would very much welcome testers > (particularly those who enjoy writing test cases!), documenters > (particularly code commenters), translators (for internationalisation of the > code), and of course all those who wish to contribute ideas and suggestions > no matter how off-the-wall they might be. In particular if you'd like to > take charge of an area of the development process, e.g. Documentation Chief, > or Protein Champion, then that would be much appreciated. > > I'm very much looking forward to working with everyone on this. Good luck, > and happy coding! > > cheers, > Richard > > PS. Please don't forget to attach the appropriate licence to your code. You > can copy-and-paste it from the existing classes I just committed this > evening. > > PPS. For those who are worried about backwards compatibility - this was > discussed on the lists a while back and it was made clear that BJ3 is a > clean break. However, the existing code will continue to be maintained and > bugfixed for a couple of years so you don't have to upgrade if you don't > want to - it just won't have any new features developed for it. This is > largely because it'll probably take just that long to write all the new BJ3 > code. When we do decide to desupport the existing BJ code, plenty of notice > will be given (i.e. years as opposed to months). 
> > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From markjschreiber at gmail.com Tue Oct 21 01:41:28 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 21 Oct 2008 13:41:28 +0800 Subject: [Biojava-l] Logging in BJ3 Message-ID: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> Hi - I would like to strongly advocate the liberal and extensive use of Logging in BioJava3. The lack of this plagued us (me at least) during bug fixes in previous versions of BioJava. The default Java logging API is very flexible and easily meets our needs. It's also not too much effort for developers to put in place (you know you use System.out.println() all over the place anyway). The following is an example snippet using logging that would certainly help debugging. With the standard logging setup only the severe statement would appear on the terminal. We could also provide config files that show lower levels of logging so that people can easily generate detailed logs to accompany bug reports. If we want to be really tricky we could even use a MemoryHandler, which keeps a rotating buffer of log records that can be pushed out along with a stack trace, so you could just submit the stack trace and the activity log all in one go and we can get an idea of what was going on at the time. The example below also shows what to do to avoid a major performance hit during logging. The marked "expensive logging operation" pretends to get config information by getting it from a database. One might expect this to take time while the db connects etc. and it could produce quite a long String of information. 
To save time when logging is not set to the CONFIG level the if statement is able to skip this costly step.
I know from experience we will definitely get the most value from this in the IO parsers and ThingBuilders. Any thoughts? - Mark

    // needs java.util.logging.Level and java.util.logging.Logger
    private Logger logger = Logger.getLogger("org.biojava.MyClass");

    public Object generateObject(String argument){
        // pass the class name; ""+getClass() would give "class org.biojava.MyClass"
        logger.entering(getClass().getName(), "generateObject", argument);

        //expensive logging operation: only build the message if CONFIG is enabled
        if (logger.isLoggable( Level.CONFIG )) {
            logger.config("DB config: "+ getDBConfigInfo());
        }

        Object obj = null;
        try{
            //do some stuff
            logger.fine("doing stuff");
            obj = new Object();
        }catch(Exception ex){
            logger.severe("Failed to do stuff");
            logger.throwing(getClass().getName(), "generateObject", ex);
        }

        logger.exiting(getClass().getName(), "generateObject", obj);
        return obj;
    }

From holland at eaglegenomics.com Tue Oct 21 04:34:46 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 21 Oct 2008 09:34:46 +0100 Subject: [Biojava-l] File parsing in BJ3 In-Reply-To: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> References: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> Message-ID: Spot on. Annotation/interface.... I think Annotation is probably better as you suggest, but I'd have to look into that. Not sure how it works with collections and generics. If it does turn out to be a better bet, I'll change it over. With the BioSQL dependencies, take a look at the pom.xml file inside the biojava-dna module. It declares a dependency on biojava-core. If you want to add dependencies to external JARs, take a look at biojava-biosql's pom.xml to see how it depends on javax.persistence. (The easiest way to add these is via an IDE such as NetBeans, which is what I'm using at the moment). cheers, Richard 2008/10/21 Mark Schreiber > So if I want to build a BioSQL loader from Genbank then would the > classes (or their wrappers) in the BioSQL Entity package need to > implement Thing? Would maven have an issue with that or would it just > create a dependency on core? (you can tell I've never used Maven > right).
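As an aside, the rotating-buffer idea can be sketched with `java.util.logging.MemoryHandler` from the JDK. This is an illustration under assumptions rather than a proposed BioJava setup: the logger name, the buffer size of 3 and the `CollectingHandler` target (a stand-in for a file or console handler) are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.MemoryHandler;

// Hypothetical target that simply remembers what is pushed to it.
class CollectingHandler extends Handler {
    final List<String> messages = new ArrayList<String>();

    @Override
    public void publish(LogRecord record) {
        messages.add(record.getLevel().getName() + ": " + record.getMessage());
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }
}

class RingBufferLoggingDemo {
    static List<String> run() {
        Logger logger = Logger.getLogger("org.biojava.example.ringbuffer");
        logger.setUseParentHandlers(false); // keep the demo off the console
        logger.setLevel(Level.ALL);         // let FINE records reach the handler

        CollectingHandler target = new CollectingHandler();
        // Keep the last 3 records; push them to the target when a SEVERE arrives.
        MemoryHandler memory = new MemoryHandler(target, 3, Level.SEVERE);
        memory.setLevel(Level.ALL);
        logger.addHandler(memory);

        logger.fine("step 1");
        logger.fine("step 2");
        logger.fine("step 3");
        logger.fine("step 4");
        // Nothing has reached the target yet; the buffer just rotates.
        logger.severe("Failed to do stuff");

        logger.removeHandler(memory);
        return target.messages;
    }
}
```

The cheap FINE records cost little while everything works, and only the most recent ones are written out when something goes wrong, which is the activity log a bug report would want alongside the stack trace.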
> > From a design point of view should Thing be an interface or an > Annotation? The reason I ask is that it doesn't define any methods so > it is more of a tag than an interface. > > Anyway, my understanding is that I would use a Genbank parser (or > write one). Write a EntityReceiver interface (probably more than one > given the number of entities in BioSQL, implement a EntityBuilder > (again possibly more than one) that implements EntityReceiver and > builds Entity beans from messages it receives. In this case I probably > wouldn't provide a writer as JPA would be writing the beans to the > database. Would this be how you imagine it? > > - Mark > > > On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland > wrote: > > (From now on I will only be posting these development messages to > > biojava-dev, which is the intended purpose of that list. Those of you who > > wish to keep track of things but are currently only subscribed to > biojava-l > > should also subscribe to biojava-dev in order to keep up to date.) > > > > As promised, I've committed a new package in the biojava-core module that > > should help understand how to do file parsing and conversion and writing > in > > the new BJ3 modules. Here's an example of how to use it to write a > Genbank > > parser (note no parsers actually exist yet!): > > > > 1. Design yourself a Genbank class which implements the interface Thing > and > > can fully represent all the data that might possibly occur inside a > Genbank > > file. > > > > 2. Write an interface called GenbankReceiver, which extends ThingReceiver > > and defines all the methods you might need in order to construct a > Genbank > > object in an asynchronous fashion. > > > > 3. Write a GenbankBuilder class which implements GenbankReceiver and > > ThingBuilder. It's job is to receive data via method calls, use that data > to > > construct a Genbank object, then provide that object on demand. > > > > 4. 
Write a GenbankWriter class which implements GenbankReceiver and > > ThingWriter. It's job is similar to GenbankBuilder, but instead of > > constructing new Genbank objects, it writes Genbank records to file that > > reflect the data it receives. > > > > 5. Write a GenbankReader class which implements ThingReader. It can read > > GenbankFiles and output the data to the methods of the ThingReceiver > > provided to it, which in this case could be anything which implements the > > interface GenbankReceiver. > > > > 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a > > Genbank object and will fire off data from it to the provided > ThingReceiver > > (a GenbankReceiver instance) as if the Genbank object was being read from > a > > file or some other source. > > > > That's it! OK so it's a minimum of 6 classes instead of the original 1 or > 2, > > but the additional steps are necessary for flexibility in converting > between > > formats. > > > > Now to use it (you'll probably want a GenbankTools class to wrap these > steps > > up for user-friendliness, including various options for opening files, > > etc.): > > > > 1. To read a file - instantiate ThingParser with your GenbankReader as > the > > reader, and GenbankBuilder as the receiver. Use the iterator methods on > > ThingParser to get the objects out. > > > > 2. To write a file - instantiate ThingParser with a GenbankEmitter > wrapping > > your Genbank object, and a GenbankWriter as the receiver. Use the > parseAll() > > method on the ThingParser to dump the whole lot to your chosen output. > > > > The clever bit comes when you want to convert between files. Imagine > you've > > done all the above for Genbank, and you've also done it for FASTA. How to > > convert between them? What you need to do is this: > > > > 1. Implement all the classes for both Genbank and FASTA. > > > > 2. 
Write a GenbankFASTAConverter class that implements > ThingConverter > > and GenbankReceiver, and will internally convert the data received and > pass > > it on out to the receiver provided, which will be a FASTAReceiver > instance. > > > > 3. Write a FASTAGenbankConverter class that operates in exactly the > opposite > > way, implementing ThingConverter and FASTAReceiver. > > > > Then to convert you use ThingParser again: > > > > 1. From FASTA file to Genbank object: Instantiate ThingParser with a > > FASTAReader reader, a GenbankBuilder receiver, and add a > > FASTAGenbankConverter instance to the converter chain. Use the iterator > to > > get your Genbank objects out of your FASTA file. > > > > 2. From FASTA file to Genbank file: Same as option 1, but provide a > > GenbankWriter instead and use parseAll() instead of the iterator methos. > > > > 3. From FASTA object to Genbank object: Same as option 1, but provide a > > FASTAEmitter wrapping your FASTA object as the reader instead. > > > > 4. From FASTA object to Genbank file: Same as option 1, but swap both the > > reader and the receiver as per options 2 and 3. > > > > 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all > mentions > > of FASTA and Genbank, and use GenbankFASTAConverter instead. > > > > One last and very important feature of this approach is that if you > discover > > that nobody has written the appropriate converter for your chosen pair of > > formats A and C, but converters do exist to map A to some other format B > and > > that other format B on to C, then you can just put the two converts A-B > and > > B-C into the ThingParser chain and it'll work perfectly. > > > > Enjoy! 
> > > > cheers, > > Richard > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > M: +44 7500 438846 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From ayates at ebi.ac.uk Tue Oct 21 04:40:48 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 09:40:48 +0100 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> Message-ID: <48FD9590.5010704@ebi.ac.uk> Hi, A logging framework is a priority to start baking into the new API now. As Mark has mentioned logging frameworks are very flexible things but it's not until you start using them that you get a real feel for how easy & extensible they are. The JDK logger has some good integration with MessageFormat & localization. I'm not completely taken with how it does the checks for log levels (log.isDebugEnabled() just seems easier than log.isLoggable(Level.FINEST)) & how you grab a logger (I'd prefer something like Logger.getLogger(this.getClass())) but that's just nit-picking. I'll be happy to go with whatever people are most comfortable with & we should attempt to use as many of the core Java classes as possible. Andy Mark Schreiber wrote: > Hi - > > I would like to strongly advocate the liberal and extensive use of > Logging in BioJava3. The lack of this plagued us (me at least) during > bug fixes in previous versions of BioJava. The default Java logging > API is very flexible and easily meets our needs.
It's also not too > much effort for developers to put in place (you know you use > System.println() all over the place anyway). > > The following is an example snippet using logging that would certainly > help debugging. With the standard logging setup only the severe > statement would appear on the terminal. We could also provide config > files that show lower levels of logging so that people can easily > generate detailed logs to accompany bug reports. If we want to be > really tricky we could even use a MemoryLogger that has a rotating > buffer of log statements that could spit out with a stack trace so you > could just submit the stack trace and the activity log all in one go > and we can get an idea of what was going on at the time. > > The example below also shows what to do to avoid a major performance > hit during logging. The marked "expensive logging operation" pretends > to get config information by getting it from a database. One might > expect this to take time while the db connects etc and could produce > quite a long String of information. To save time when logging is not > set to the CONFIG level the if statement is able to skip this costly > step. > > I know from experience we will definitely get the most value from this > in the IO parsers and ThingBuilders. > > Any thoughts? 
> > - Mark > > > > private Logger logger = Logger.getLogger("org.biojava.MyClass"); > > public Object generateObject(String argument){ > logger.entering(""+getClass(), "generateObject", argument); > > //expensive logging operation > if (logger.isLoggable( Level.CONFIG )) { > logger.config("DB config: "+ getDBConfigInfo()); > } > > Object obj = null; > try{ > > //do some stuff > logger.fine("doing stuff"); > obj = new Object(); > > }catch(Exception ex){ > logger.severe("Failed to do stuff"); > logger.throwing(""+getClass(), "generateObject", ex); > } > > logger.exiting(""+getClass(), "generateObject", obj); > return obj; > } > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From ayates at ebi.ac.uk Tue Oct 21 04:49:47 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 09:49:47 +0100 Subject: [Biojava-l] File parsing in BJ3 In-Reply-To: References: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> Message-ID: <48FD97AB.70503@ebi.ac.uk> Depends on what you want to program. If you want to have a collection of objects which are Things & perform a common action on them then annotations are not the way forward. If you want to have some kind of meta-programming occurring & need a class to be multiple things then annotations are right. There is currently no way to enforce compile time dependencies on annotations & my thinking is that this is right. Annotations should be meta data or provide a way to alter a class in a non-invasive way (think Web Service annotations creating WS Servers & Clients without any alteration of the class). Andy Richard Holland wrote: > Spot on. > > Annotation/interface.... i think Annotation is probably better as you > suggest, but I'd have to look into that. Not sure how it works with > collections and generics. If it does turn out to be a better bet, I'll > change it over. 
> > With the BioSQL dependencies, take a look at the pom.xml file inside the > biojava-dna module. It declares a dependency on biojava-core. If you want to > add dependencies to external JARs, take a look at biojava-biosql's pom.xml > to see how it depends on javax.persistence. (The easiest way to add these is > via an IDE such as NetBeans, which is what I'm using at the moment). > > cheers, > Richard > > 2008/10/21 Mark Schreiber > >> So if I want to build a BioSQL loader from Genbank then would the >> classes (or there wrappers) in the BioSQL Entity package need to >> implement Thing? Would maven have an issue with that or would it just >> create a dependency on core? (you can tell I've never used Maven >> right). >> >> From a design point of view should Thing be an interface or an >> Annotation? The reason I ask is that it doesn't define any methods so >> it is more of a tag than an interface. >> >> Anyway, my understanding is that I would use a Genbank parser (or >> write one). Write a EntityReceiver interface (probably more than one >> given the number of entities in BioSQL, implement a EntityBuilder >> (again possibly more than one) that implements EntityReceiver and >> builds Entity beans from messages it receives. In this case I probably >> wouldn't provide a writer as JPA would be writing the beans to the >> database. Would this be how you imagine it? >> >> - Mark >> >> >> On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland >> wrote: >>> (From now on I will only be posting these development messages to >>> biojava-dev, which is the intended purpose of that list. Those of you who >>> wish to keep track of things but are currently only subscribed to >> biojava-l >>> should also subscribe to biojava-dev in order to keep up to date.) >>> >>> As promised, I've committed a new package in the biojava-core module that >>> should help understand how to do file parsing and conversion and writing >> in >>> the new BJ3 modules. 
>>> Here's an example of how to use it to write a Genbank parser (note no
>>> parsers actually exist yet!):
>>>
>>> 1. Design yourself a Genbank class which implements the interface Thing
>>> and can fully represent all the data that might possibly occur inside a
>>> Genbank file.
>>>
>>> 2. Write an interface called GenbankReceiver, which extends ThingReceiver
>>> and defines all the methods you might need in order to construct a
>>> Genbank object in an asynchronous fashion.
>>>
>>> 3. Write a GenbankBuilder class which implements GenbankReceiver and
>>> ThingBuilder. Its job is to receive data via method calls, use that data
>>> to construct a Genbank object, then provide that object on demand.
>>>
>>> 4. Write a GenbankWriter class which implements GenbankReceiver and
>>> ThingWriter. Its job is similar to GenbankBuilder, but instead of
>>> constructing new Genbank objects, it writes Genbank records to file that
>>> reflect the data it receives.
>>>
>>> 5. Write a GenbankReader class which implements ThingReader. It can read
>>> Genbank files and output the data to the methods of the ThingReceiver
>>> provided to it, which in this case could be anything which implements the
>>> interface GenbankReceiver.
>>>
>>> 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a
>>> Genbank object and will fire off data from it to the provided
>>> ThingReceiver (a GenbankReceiver instance) as if the Genbank object was
>>> being read from a file or some other source.
>>>
>>> That's it! OK so it's a minimum of 6 classes instead of the original 1 or
>>> 2, but the additional steps are necessary for flexibility in converting
>>> between formats.
>>>
>>> Now to use it (you'll probably want a GenbankTools class to wrap these
>>> steps up for user-friendliness, including various options for opening
>>> files, etc.):
>>>
>>> 1.
>>> To read a file - instantiate ThingParser with your GenbankReader as the
>>> reader, and GenbankBuilder as the receiver. Use the iterator methods on
>>> ThingParser to get the objects out.
>>>
>>> 2. To write a file - instantiate ThingParser with a GenbankEmitter
>>> wrapping your Genbank object, and a GenbankWriter as the receiver. Use
>>> the parseAll() method on the ThingParser to dump the whole lot to your
>>> chosen output.
>>>
>>> The clever bit comes when you want to convert between files. Imagine
>>> you've done all the above for Genbank, and you've also done it for FASTA.
>>> How to convert between them? What you need to do is this:
>>>
>>> 1. Implement all the classes for both Genbank and FASTA.
>>>
>>> 2. Write a GenbankFASTAConverter class that implements ThingConverter
>>> and GenbankReceiver, and will internally convert the data received and
>>> pass it on out to the receiver provided, which will be a FASTAReceiver
>>> instance.
>>>
>>> 3. Write a FASTAGenbankConverter class that operates in exactly the
>>> opposite way, implementing ThingConverter and FASTAReceiver.
>>>
>>> Then to convert you use ThingParser again:
>>>
>>> 1. From FASTA file to Genbank object: Instantiate ThingParser with a
>>> FASTAReader reader, a GenbankBuilder receiver, and add a
>>> FASTAGenbankConverter instance to the converter chain. Use the iterator
>>> to get your Genbank objects out of your FASTA file.
>>>
>>> 2. From FASTA file to Genbank file: Same as option 1, but provide a
>>> GenbankWriter instead and use parseAll() instead of the iterator methods.
>>>
>>> 3. From FASTA object to Genbank object: Same as option 1, but provide a
>>> FASTAEmitter wrapping your FASTA object as the reader instead.
>>>
>>> 4. From FASTA object to Genbank file: Same as option 1, but swap both the
>>> reader and the receiver as per options 2 and 3.
>>>
>>> 5/6/7/8.
>>> From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions
>>> of FASTA and Genbank, and use GenbankFASTAConverter instead.
>>>
>>> One last and very important feature of this approach is that if you
>>> discover that nobody has written the appropriate converter for your
>>> chosen pair of formats A and C, but converters do exist to map A to some
>>> other format B and that other format B on to C, then you can just put the
>>> two converters A-B and B-C into the ThingParser chain and it'll work
>>> perfectly.
>>>
>>> Enjoy!
>>>
>>> cheers,
>>> Richard
>>>
>>> --
>>> Richard Holland, BSc MBCS
>>> Finance Director, Eagle Genomics Ltd
>>> M: +44 7500 438846 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
>

From holland at eaglegenomics.com Tue Oct 21 05:06:41 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 21 Oct 2008 10:06:41 +0100
Subject: [Biojava-l] [Biojava-dev] BioJava 3 Begins - Volunteers please!
In-Reply-To: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com>
References: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com>
Message-ID:

> License: Since it seems we will end up copying code from biojava 1.6
> to biojava 3.0, we need to keep the license the same (LGPL 2.1). I.e.
> people should still use the same biojava license headers when
> committing new files and all code will be considered to be LGPL, if no
> header is present. Do NOT commit code under other licenses.
>
> Installation: We need some installation instructions on the wiki site,
> e.g. how to get the maven setup running. What are the code
> conventions for the new version?

Not sure where best to put it in the Wiki, but I agree it needs to go there somewhere.
Installation is a one-liner from within the top level of the project:

mvn install

This compiles and installs the JARs into your local Maven repository, and
also downloads and installs any external dependencies. Then you can add the
installed modules as dependencies in your own Maven projects.

If you need to write a launcher script for your project, or you want to use
the JAR files outside Maven, you can use the command below to generate the
CLASSPATH. It only includes external dependencies - you'll also need to add
to it the individual JAR files from inside the various target/ folders that
Maven built for you:

mvn dependency:build-classpath

Code conventions are simple:

1. I'm not fussed about the specific formatter people use in each module, as
long as the code is all formatted using some kind of consistent method. I
personally just use the default settings from Format code in NetBeans.

2. Use 'this' wherever possible, and for static references, use the
classname prefix (e.g. MyClass.staticField). I hate having to try and work
out in my head which references are going where, and which are static and
which are not!

3. Comment every single method, even if it's private. This helps in
understanding the flow of your code. Also comment liberally inside methods
if they are longer than just a few lines (i.e. if you can't fit the entire
method within the code panel in NetBeans, it's going to need internal
comments).

4. When writing getters/setters, follow the JavaBeans conventions so that
automated frameworks like Spring can easily pick them up and work with them.

5. Please write tests for your code using JUnit conventions, inside the
test/ folder of each module. I know I haven't done this myself yet, but I'm
going to!

> Blast: the Blast parsing modules are among the most frequently used
> ones in biojava 1.6. To make people use biojava v3 it will be crucial
> to have a port of them to the new version. Does anybody want to take
> care of that?

I'll second that.
Blast is vital. We'd really appreciate a volunteer, please! > > Automated builds: is it interesting to have automated builds set up > for the new version at this stage, or should we wait until a more > mature stage? I could easily add another auto-build similar to the one > for biojava 1.6 at http://www.spice-3d.org/cruise/ You could do, although I don't think they'd be much use yet. But why not start early then we won't forget to do it later. Richard > > Andreas > > On Sun, Oct 19, 2008 at 5:18 PM, Richard Holland > wrote: > > Hi all, > > > > I've just committed some new code to the biojava3 branch of the > biojava-live > > subversion repository. It's the foundations of a brand new > alphabet+symbol > > set of classes, and an example of how to use them to represent DNA. > You'll > > notice that the new code is very lightweight and allows for a lot more > > flexibility than the old code - for instance, the concept of Alphabet has > > changed radically. It also makes much more extensive use of the > Collections > > API. > > > > I haven't got any test cases or usage examples yet but give me a shout if > > you don't understand the code and I'll explain how it works. (Hint: > > SymbolFormat is there to convert Strings into SymbolList objects, and > vice > > versa). > > > > So, now we want some volunteers! We're starting from scratch here so > there's > > a lot of work to do. The whole of BioJava needs 'translating' into BJ3, > > whether it be copy-and-paste existing classes and modify them to suit the > > new style, or write completely new ones to provide equivalent > functionality. > > > > > > I'll post an example of how to do file parsing soon, probably starting > with > > FASTA. In the meantime, a good place to start would be for people to > design > > object models to represent their favourite data types (e.g. Genbank, or > > microarray data). Utility classes to manipulate those objects would be > great > > too. 
> > > > The object models need to be normalised as much as possible - e.g. if > your > > data has a lot of comments, and the order of those comments is important, > > then give your object model a collection of comment objects. The object > > model for each data type should be completely independent and use basic > data > > types wherever possible (e.g. store sequences as strings, don't attempt > to > > parse them into anything fancy like SymbolLists). The closer the object > > model is to the original data format, the better. There's going to be > clever > > tricks when it comes to converting data between different object models > > (e.g. Genbank to INSDSeq), which I will explain later when I put the file > > parsing examples up. > > > > You'll notice how the biojava3 branch uses Maven instead of Ant. This is > > because we want to make it as modular as possible, so if you want to > write > > microarray stuff, create a new microarray sub-project (as per the dna > > example that's already there). This way if someone only wants the > microarray > > bit of BJ3, they only need install the appropriate JAR file and can > ignore > > the rest. (The 'core' module is for stuff that is so generic it could be > > used anywhere, or is used in every single other module.) > > > > If coding isn't your cup of tea, then we would very much welcome testers > > (particularly those who enjoy writing test cases!), documenters > > (particularly code commenters), translators (for internationalisation of > the > > code), and of course all those who wish to contribute ideas and > suggestions > > no matter how off-the-wall they might be. In particular if you'd like to > > take charge of an area of the development process, e.g. Documentation > Chief, > > or Protein Champion, then that would be much appreciated. > > > > I'm very much looking forward to working with everyone on this. Good > luck, > > and happy coding! > > > > cheers, > > Richard > > > > PS. 
Please don't forget to attach the appropriate licence to your code. > You > > can copy-and-paste it from the existing classes I just committed this > > evening. > > > > PPS. For those who are worried about backwards compatibility - this was > > discussed on the lists a while back and it was made clear that BJ3 is a > > clean break. However, the existing code will continue to be maintained > and > > bugfixed for a couple of years so you don't have to upgrade if you don't > > want to - it just won't have any new features developed for it. This is > > largely because it'll probably take just that long to write all the new > BJ3 > > code. When we do decide to desupport the existing BJ code, plenty of > notice > > will be given (i.e. years as opposed to months). > > > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > M: +44 7500 438846 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From benn at mpi-cbg.de Tue Oct 21 05:00:44 2008 From: benn at mpi-cbg.de (Neil Benn) Date: Tue, 21 Oct 2008 11:00:44 +0200 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> Message-ID: <48FD9A3C.20904@mpi-cbg.de> Hello, I'm not sure if I should comment as I have no time to contribute LOC but I thought I may as well ;). Mark Schreiber wrote: > Hi - > > I would like to strongly advocate the liberal and extensive use of > Logging in BioJava3. The lack of this plagued us (me at least) during > bug fixes in previous versions of BioJava. 
> The default Java logging API is very flexible and easily meets our needs.
> It's also not too much effort for developers to put in place (you know you
> use System.out.println() all over the place anyway).
>
Hmm, that is true, but for total completeness you can use commons-logging,
which is very easy to use and much more flexible, as it can encapsulate
other logging mechanisms (including the JDK 1.4 logging framework). To use
it you simply declare a new logger as follows:

private static final Log logger = LogFactory.getLog(<class here>);

The rest of it works pretty much the same as below. If you dovetail
commons-logging with log4j then you'll cover the most common case of
logging used in other frameworks; the config files to set up log4j (XML
and properties files) are well documented all over the web.

>
> I know from experience we will definitely get the most value from this
> in the IO parsers and ThingBuilders.
>
> Any thoughts?
>
+1
> - Mark
>
> private Logger logger = Logger.getLogger("org.biojava.MyClass");
>
> public Object generateObject(String argument){
>     logger.entering(""+getClass(), "generateObject", argument);
>
>     //expensive logging operation
>     if (logger.isLoggable( Level.CONFIG )) {
>         logger.config("DB config: "+ getDBConfigInfo());
>     }
>
>     Object obj = null;
>     try{
>
>         //do some stuff
>         logger.fine("doing stuff");
>         obj = new Object();
>
>     }catch(Exception ex){
>         logger.severe("Failed to do stuff");
>         logger.throwing(""+getClass(), "generateObject", ex);
>     }
>
>     logger.exiting(""+getClass(), "generateObject", obj);
>     return obj;
> }
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From markjschreiber at gmail.com Tue Oct 21 05:18:41 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 17:18:41 +0800
Subject: [Biojava-l] Logging in BJ3
In-Reply-To:
References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com>
Message-ID: <93b45ca50810210218n1e2ac06bma211f1541b8be3bb@mail.gmail.com> For the Entity classes my original thinking was to implement an EJB3 interceptor which logs all method calls. This would be preferable to putting logging statements in all the classes but I don't know if such an interceptor will work outside of a container. Does anyone know if JPA can use an interceptor outside of a container? Logging for the actual persistence would be via the persistence provider (Hibernate, Toplink etc). - Mark On Tue, Oct 21, 2008 at 5:08 PM, Richard Holland wrote: > Excellent idea. I'll integrate it into ThingParser as an example > > 2008/10/21 Mark Schreiber >> >> Hi - >> >> I would like to strongly advocate the liberal and extensive use of >> Logging in BioJava3. The lack of this plagued us (me at least) during >> bug fixes in previous versions of BioJava. The default Java logging >> API is very flexible and easily meets our needs. It's also not too >> much effort for developers to put in place (you know you use >> System.println() all over the place anyway). >> >> The following is an example snippet using logging that would certainly >> help debugging. With the standard logging setup only the severe >> statement would appear on the terminal. We could also provide config >> files that show lower levels of logging so that people can easily >> generate detailed logs to accompany bug reports. If we want to be >> really tricky we could even use a MemoryLogger that has a rotating >> buffer of log statements that could spit out with a stack trace so you >> could just submit the stack trace and the activity log all in one go >> and we can get an idea of what was going on at the time. >> >> The example below also shows what to do to avoid a major performance >> hit during logging. The marked "expensive logging operation" pretends >> to get config information by getting it from a database. 
One might >> expect this to take time while the db connects etc and could produce >> quite a long String of information. To save time when logging is not >> set to the CONFIG level the if statement is able to skip this costly >> step. >> >> I know from experience we will definitely get the most value from this >> in the IO parsers and ThingBuilders. >> >> Any thoughts? >> >> - Mark >> >> >> >> private Logger logger = Logger.getLogger("org.biojava.MyClass"); >> >> public Object generateObject(String argument){ >> logger.entering(""+getClass(), "generateObject", argument); >> >> //expensive logging operation >> if (logger.isLoggable( Level.CONFIG )) { >> logger.config("DB config: "+ getDBConfigInfo()); >> } >> >> Object obj = null; >> try{ >> >> //do some stuff >> logger.fine("doing stuff"); >> obj = new Object(); >> >> }catch(Exception ex){ >> logger.severe("Failed to do stuff"); >> logger.throwing(""+getClass(), "generateObject", ex); >> } >> >> logger.exiting(""+getClass(), "generateObject", obj); >> return obj; >> } >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From ayates at ebi.ac.uk Tue Oct 21 05:21:26 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 10:21:26 +0100 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <48FD9A3C.20904@mpi-cbg.de> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> <48FD9A3C.20904@mpi-cbg.de> Message-ID: <48FD9F16.2000405@ebi.ac.uk> Hi Neil, That's okay the more people take an interest in this the better it will be. We did discuss this quite a bit ago at a biojava meeting & the general consensus was bridges can be manually written between the logging frameworks as and when they are required. 
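A hand-written bridge of the kind mentioned above could be as small as one java.util.logging Handler that forwards each record to the other framework. The following is a minimal sketch under stated assumptions: ForeignLog is a hypothetical stand-in for whatever framework is being bridged to (log4j, commons-logging, and so on), and BridgeHandler and BridgeDemo are invented names, not existing BioJava classes. Only the java.util.logging calls are real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical stand-in for the target logging framework.
interface ForeignLog {
    void write(String levelName, String message);
}

// Forwards every JDK LogRecord it is given to the foreign framework.
final class BridgeHandler extends Handler {
    private final ForeignLog target;

    BridgeHandler(ForeignLog target) {
        this.target = target;
        setLevel(Level.ALL); // let the Logger's own level do the filtering
    }

    @Override
    public void publish(LogRecord record) {
        if (isLoggable(record)) {
            target.write(record.getLevel().getName(), record.getMessage());
        }
    }

    @Override
    public void flush() {
        // nothing is buffered in this sketch
    }

    @Override
    public void close() {
        // no resources to release in this sketch
    }
}

final class BridgeDemo {
    // Logs two messages through the bridge and returns what the target saw.
    static List<String> run() {
        final List<String> seen = new ArrayList<String>();
        Logger logger = Logger.getLogger("org.biojava.bridge.demo");
        logger.setUseParentHandlers(false); // keep the console quiet
        logger.setLevel(Level.ALL);
        Handler bridge = new BridgeHandler(new ForeignLog() {
            public void write(String levelName, String message) {
                seen.add(levelName + ": " + message);
            }
        });
        logger.addHandler(bridge);
        logger.severe("Failed to do stuff");
        logger.fine("doing stuff");
        logger.removeHandler(bridge);
        return seen;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

Library code would keep calling the JDK Logger as in the snippet quoted in this thread; only the application that wants a different back end installs the handler.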
Also using the JDK logger reduces our external dependencies.

However I do like the logging facades & am in favour of them. Especially
SLF4J, which does the same thing as commons-logging but relies on the
existence of SLF4J adaptors rather than on the raw logging framework, as
commons-logging does. It also has links to a lot more logging frameworks
including simple-log (https://simple-log.dev.java.net/) & logback
(http://logback.qos.ch/).

There are just so many options here that it's hard to gauge what is the best
thing to do. Do we buy into a single framework & use all of its features
(JDK logger has nice things for logging entering & exiting methods along
with locale ResourceBundles) or go for a common denominator? It's not an
easy decision to make ........

Andy

Neil Benn wrote:
> Hello,
>
> I'm not sure if I should comment as I have no time to
> contribute LOC but I thought I may as well ;).
>
> Mark Schreiber wrote:
>> Hi -
>>
>> I would like to strongly advocate the liberal and extensive use of
>> Logging in BioJava3. The lack of this plagued us (me at least) during
>> bug fixes in previous versions of BioJava. The default Java logging
>> API is very flexible and easily meets our needs. It's also not too
>> much effort for developers to put in place (you know you use
>> System.out.println() all over the place anyway).
>>
> Hmm, that is true, but for total completeness you can use
> commons-logging, which is very easy to use and much more flexible, as it
> can encapsulate other logging mechanisms (including the JDK 1.4 logging
> framework). To use it you simply declare a new logger as follows:
>
> private static final Log logger = LogFactory.getLog(<class here>);
>
> The rest of it works pretty much the same as below. If you dovetail
> commons-logging with log4j then you'll cover the most common case of
> logging used in other frameworks; the config files to set up log4j (XML
> and properties files) are well documented all over the web.
>> >> >> I know from experience we will definitely get the most value from this >> in the IO parsers and ThingBuilders. >> >> Any thoughts? >> > +1 >> - Mark >> >> >> >> private Logger logger = Logger.getLogger("org.biojava.MyClass"); >> >> public Object generateObject(String argument){ >> logger.entering(""+getClass(), "generateObject", argument); >> >> //expensive logging operation >> if (logger.isLoggable( Level.CONFIG )) { >> logger.config("DB config: "+ getDBConfigInfo()); >> } >> >> Object obj = null; >> try{ >> >> //do some stuff >> logger.fine("doing stuff"); >> obj = new Object(); >> >> }catch(Exception ex){ >> logger.severe("Failed to do stuff"); >> logger.throwing(""+getClass(), "generateObject", ex); >> } >> >> logger.exiting(""+getClass(), "generateObject", obj); >> return obj; >> } >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From ayates at ebi.ac.uk Tue Oct 21 05:23:35 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 10:23:35 +0100 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <93b45ca50810210218n1e2ac06bma211f1541b8be3bb@mail.gmail.com> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> <93b45ca50810210218n1e2ac06bma211f1541b8be3bb@mail.gmail.com> Message-ID: <48FD9F97.8010705@ebi.ac.uk> As far as I was aware JPA has no concept of EJB3 interceptors. If you want that kind of thing I think you would have to start using AOP or proxy objects. Andy Mark Schreiber wrote: > For the Entity classes my original thinking was to implement an EJB3 > interceptor which logs all method calls. 
This would be preferable to > putting logging statements in all the classes but I don't know if such > an interceptor will work outside of a container. Does anyone know if > JPA can use an interceptor outside of a container? > > Logging for the actual persistence would be via the persistence > provider (Hibernate, Toplink etc). > > - Mark > > On Tue, Oct 21, 2008 at 5:08 PM, Richard Holland > wrote: >> Excellent idea. I'll integrate it into ThingParser as an example >> >> 2008/10/21 Mark Schreiber >>> Hi - >>> >>> I would like to strongly advocate the liberal and extensive use of >>> Logging in BioJava3. The lack of this plagued us (me at least) during >>> bug fixes in previous versions of BioJava. The default Java logging >>> API is very flexible and easily meets our needs. It's also not too >>> much effort for developers to put in place (you know you use >>> System.println() all over the place anyway). >>> >>> The following is an example snippet using logging that would certainly >>> help debugging. With the standard logging setup only the severe >>> statement would appear on the terminal. We could also provide config >>> files that show lower levels of logging so that people can easily >>> generate detailed logs to accompany bug reports. If we want to be >>> really tricky we could even use a MemoryLogger that has a rotating >>> buffer of log statements that could spit out with a stack trace so you >>> could just submit the stack trace and the activity log all in one go >>> and we can get an idea of what was going on at the time. >>> >>> The example below also shows what to do to avoid a major performance >>> hit during logging. The marked "expensive logging operation" pretends >>> to get config information by getting it from a database. One might >>> expect this to take time while the db connects etc and could produce >>> quite a long String of information. 
To save time when logging is not >>> set to the CONFIG level the if statement is able to skip this costly >>> step. >>> >>> I know from experience we will definitely get the most value from this >>> in the IO parsers and ThingBuilders. >>> >>> Any thoughts? >>> >>> - Mark >>> >>> >>> >>> private Logger logger = Logger.getLogger("org.biojava.MyClass"); >>> >>> public Object generateObject(String argument){ >>> logger.entering(""+getClass(), "generateObject", argument); >>> >>> //expensive logging operation >>> if (logger.isLoggable( Level.CONFIG )) { >>> logger.config("DB config: "+ getDBConfigInfo()); >>> } >>> >>> Object obj = null; >>> try{ >>> >>> //do some stuff >>> logger.fine("doing stuff"); >>> obj = new Object(); >>> >>> }catch(Exception ex){ >>> logger.severe("Failed to do stuff"); >>> logger.throwing(""+getClass(), "generateObject", ex); >>> } >>> >>> logger.exiting(""+getClass(), "generateObject", obj); >>> return obj; >>> } >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> >> -- >> Richard Holland, BSc MBCS >> Finance Director, Eagle Genomics Ltd >> M: +44 7500 438846 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From markjschreiber at gmail.com Tue Oct 21 05:26:41 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 21 Oct 2008 17:26:41 +0800 Subject: [Biojava-l] [Biojava-dev] BioJava 3 Begins - Volunteers please! In-Reply-To: References: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> Message-ID: <93b45ca50810210226t79cfbcbfhcadaedcfe8735676@mail.gmail.com> >> Blast: the Blast parsing modules are among the most frequently used >> ones in biojava 1.6. 
>> To make people use biojava v3 it will be crucial
>> to have a port of them to the new version. Does anybody want to take
>> care of that?
>
> I'll second that. Blast is vital. We'd really appreciate a volunteer,
> please!
>

BlastXML output would certainly be the easiest place to start. I also
think with the new Thing/ThingBuilder framework it will be possible to
develop all manner of parsers for the vagaries of Blast text output
that come with each new release of Blast. Possible but maybe not a
good idea. I don't think that output was ever supposed to be machine
readable. The table formatted output (-m8 I think) would be a better
option.

Given the DTD it should be possible to do a quick JAXB binding. How
would that work in the Thing/ThingBuilder paradigm?

- Mark

From markjschreiber at gmail.com Tue Oct 21 06:35:14 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 18:35:14 +0800
Subject: [Biojava-l] File parsing in BJ3
In-Reply-To: <48FD97AB.70503@ebi.ac.uk>
References: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> <48FD97AB.70503@ebi.ac.uk>
Message-ID: <93b45ca50810210335j5ef4a206y545e5a1869cedc03@mail.gmail.com>

Is there any need for Thing at all? Can't a builder be typed to produce
something that extends Object? If Thing provides no behaviour contract
or meta-information then why does it exist?

- Mark

On Tue, Oct 21, 2008 at 4:49 PM, Andy Yates wrote:
> Depends on what you want to program. If you want to have a collection of
> objects which are Things & perform a common action on them then
> annotations are not the way forward.
>
> If you want to have some kind of meta-programming occurring & need a
> class to be multiple things then annotations are right. There is
> currently no way to enforce compile time dependencies on annotations &
> my thinking is that this is right.
Annotations should be meta data or > provide a way to alter a class in a non-invasive way (think Web Service > annotations creating WS Servers & Clients without any alteration of the > class). > > Andy > > Richard Holland wrote: >> Spot on. >> >> Annotation/interface.... i think Annotation is probably better as you >> suggest, but I'd have to look into that. Not sure how it works with >> collections and generics. If it does turn out to be a better bet, I'll >> change it over. >> >> With the BioSQL dependencies, take a look at the pom.xml file inside the >> biojava-dna module. It declares a dependency on biojava-core. If you want to >> add dependencies to external JARs, take a look at biojava-biosql's pom.xml >> to see how it depends on javax.persistence. (The easiest way to add these is >> via an IDE such as NetBeans, which is what I'm using at the moment). >> >> cheers, >> Richard >> >> 2008/10/21 Mark Schreiber >> >>> So if I want to build a BioSQL loader from Genbank then would the >>> classes (or there wrappers) in the BioSQL Entity package need to >>> implement Thing? Would maven have an issue with that or would it just >>> create a dependency on core? (you can tell I've never used Maven >>> right). >>> >>> From a design point of view should Thing be an interface or an >>> Annotation? The reason I ask is that it doesn't define any methods so >>> it is more of a tag than an interface. >>> >>> Anyway, my understanding is that I would use a Genbank parser (or >>> write one). Write a EntityReceiver interface (probably more than one >>> given the number of entities in BioSQL, implement a EntityBuilder >>> (again possibly more than one) that implements EntityReceiver and >>> builds Entity beans from messages it receives. In this case I probably >>> wouldn't provide a writer as JPA would be writing the beans to the >>> database. Would this be how you imagine it? 
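The arrangement described in the paragraph above (a receiver interface, plus a builder that implements it and assembles beans from the messages it receives) might look roughly like the following. This is a minimal sketch under loud assumptions: every name and method signature here (EntityReceiver, SequenceEntity, EntityBuilder, BuilderDemo) is hypothetical and invented for illustration. The real ThingReceiver and ThingBuilder interfaces in biojava-core are not shown in this thread and are deliberately not used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical receiver: one callback per piece of data a parser finds.
interface EntityReceiver {
    void startEntity();
    void name(String name);
    void sequence(String residues);
    void endEntity();
}

// Hypothetical bean; in the BioSQL case this would be a JPA entity class.
final class SequenceEntity {
    private String name;
    private String residues;

    public String getName() { return this.name; }
    public void setName(String name) { this.name = name; }
    public String getResidues() { return this.residues; }
    public void setResidues(String residues) { this.residues = residues; }
}

// Builder: receives messages, assembles beans, provides them on demand.
final class EntityBuilder implements EntityReceiver {
    private final List<SequenceEntity> finished = new ArrayList<SequenceEntity>();
    private SequenceEntity current;

    public void startEntity() { this.current = new SequenceEntity(); }
    public void name(String name) { this.current.setName(name); }
    public void sequence(String residues) { this.current.setResidues(residues); }
    public void endEntity() { this.finished.add(this.current); this.current = null; }

    public List<SequenceEntity> getEntities() { return this.finished; }
}

final class BuilderDemo {
    // Stand-in for a Genbank parser: fires messages at whatever receiver it is given.
    static void emitExample(EntityReceiver receiver) {
        receiver.startEntity();
        receiver.name("example");
        receiver.sequence("ACGT");
        receiver.endEntity();
    }

    public static void main(String[] args) {
        EntityBuilder builder = new EntityBuilder();
        emitExample(builder);
        SequenceEntity first = builder.getEntities().get(0);
        System.out.println(first.getName() + " " + first.getResidues());
    }
}
```

Because the parser only sees the receiver interface, the same stream of messages could equally be sent to a writer or a converter; the beans collected by the builder would then be handed to JPA for persistence, so no separate writer class is needed in that case.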
>>> >>> - Mark >>> >>> >>> On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland >>> wrote: >>>> (From now on I will only be posting these development messages to >>>> biojava-dev, which is the intended purpose of that list. Those of you who >>>> wish to keep track of things but are currently only subscribed to >>> biojava-l >>>> should also subscribe to biojava-dev in order to keep up to date.) >>>> >>>> As promised, I've committed a new package in the biojava-core module that >>>> should help understand how to do file parsing and conversion and writing >>> in >>>> the new BJ3 modules. Here's an example of how to use it to write a >>> Genbank >>>> parser (note no parsers actually exist yet!): >>>> >>>> 1. Design yourself a Genbank class which implements the interface Thing >>> and >>>> can fully represent all the data that might possibly occur inside a >>> Genbank >>>> file. >>>> >>>> 2. Write an interface called GenbankReceiver, which extends ThingReceiver >>>> and defines all the methods you might need in order to construct a >>> Genbank >>>> object in an asynchronous fashion. >>>> >>>> 3. Write a GenbankBuilder class which implements GenbankReceiver and >>>> ThingBuilder. It's job is to receive data via method calls, use that data >>> to >>>> construct a Genbank object, then provide that object on demand. >>>> >>>> 4. Write a GenbankWriter class which implements GenbankReceiver and >>>> ThingWriter. It's job is similar to GenbankBuilder, but instead of >>>> constructing new Genbank objects, it writes Genbank records to file that >>>> reflect the data it receives. >>>> >>>> 5. Write a GenbankReader class which implements ThingReader. It can read >>>> GenbankFiles and output the data to the methods of the ThingReceiver >>>> provided to it, which in this case could be anything which implements the >>>> interface GenbankReceiver. >>>> >>>> 6. Write a GenbankEmitter class which implements ThingEmitter. 
It takes a >>>> Genbank object and will fire off data from it to the provided >>> ThingReceiver >>>> (a GenbankReceiver instance) as if the Genbank object was being read from >>> a >>>> file or some other source. >>>> >>>> That's it! OK so it's a minimum of 6 classes instead of the original 1 or >>> 2, >>>> but the additional steps are necessary for flexibility in converting >>> between >>>> formats. >>>> >>>> Now to use it (you'll probably want a GenbankTools class to wrap these >>> steps >>>> up for user-friendliness, including various options for opening files, >>>> etc.): >>>> >>>> 1. To read a file - instantiate ThingParser with your GenbankReader as >>> the >>>> reader, and GenbankBuilder as the receiver. Use the iterator methods on >>>> ThingParser to get the objects out. >>>> >>>> 2. To write a file - instantiate ThingParser with a GenbankEmitter >>> wrapping >>>> your Genbank object, and a GenbankWriter as the receiver. Use the >>> parseAll() >>>> method on the ThingParser to dump the whole lot to your chosen output. >>>> >>>> The clever bit comes when you want to convert between files. Imagine >>> you've >>>> done all the above for Genbank, and you've also done it for FASTA. How to >>>> convert between them? What you need to do is this: >>>> >>>> 1. Implement all the classes for both Genbank and FASTA. >>>> >>>> 2. Write a GenbankFASTAConverter class that implements >>> ThingConverter >>>> and GenbankReceiver, and will internally convert the data received and >>> pass >>>> it on out to the receiver provided, which will be a FASTAReceiver >>> instance. >>>> 3. Write a FASTAGenbankConverter class that operates in exactly the >>> opposite >>>> way, implementing ThingConverter and FASTAReceiver. >>>> >>>> Then to convert you use ThingParser again: >>>> >>>> 1. From FASTA file to Genbank object: Instantiate ThingParser with a >>>> FASTAReader reader, a GenbankBuilder receiver, and add a >>>> FASTAGenbankConverter instance to the converter chain. 
Use the iterator >>> to >>>> get your Genbank objects out of your FASTA file. >>>> >>>> 2. From FASTA file to Genbank file: Same as option 1, but provide a >>>> GenbankWriter instead and use parseAll() instead of the iterator methos. >>>> >>>> 3. From FASTA object to Genbank object: Same as option 1, but provide a >>>> FASTAEmitter wrapping your FASTA object as the reader instead. >>>> >>>> 4. From FASTA object to Genbank file: Same as option 1, but swap both the >>>> reader and the receiver as per options 2 and 3. >>>> >>>> 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all >>> mentions >>>> of FASTA and Genbank, and use GenbankFASTAConverter instead. >>>> >>>> One last and very important feature of this approach is that if you >>> discover >>>> that nobody has written the appropriate converter for your chosen pair of >>>> formats A and C, but converters do exist to map A to some other format B >>> and >>>> that other format B on to C, then you can just put the two converts A-B >>> and >>>> B-C into the ThingParser chain and it'll work perfectly. >>>> >>>> Enjoy! >>>> >>>> cheers, >>>> Richard >>>> >>>> -- >>>> Richard Holland, BSc MBCS >>>> Finance Director, Eagle Genomics Ltd >>>> M: +44 7500 438846 | E: holland at eaglegenomics.com >>>> http://www.eaglegenomics.com/ >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >> >> >> > From augustovmail-java at yahoo.com.br Tue Oct 21 07:45:41 2008 From: augustovmail-java at yahoo.com.br (Augusto Fernandes Vellozo) Date: Tue, 21 Oct 2008 13:45:41 +0200 Subject: [Biojava-l] SimpleRichAnnotation In-Reply-To: <381a3e850810210421u54058163ncf347b57394af1b2@mail.gmail.com> References: <381a3e850810210421u54058163ncf347b57394af1b2@mail.gmail.com> Message-ID: <381a3e850810210445sc801d40ja36655349b5920b9@mail.gmail.com> Hi everyone, I am having problems with the class SimpleRichAnnotation. 
I have a term t from ontology o, and I put a note n (with the term t) into a
SimpleRichAnnotation object a. However, when I call a.getProperties(t) it does
not return the note n. Looking at the BioJava code, I saw that getProperties
imports the term t into the default ontology before doing the search, and
because of this it does not return the correct note.

Does anyone know why this method changes the ontology?

Thanks,

--
Augusto F. Vellozo

From charles at imbusch.net Tue Oct 21 10:00:45 2008
From: charles at imbusch.net (Charles Imbusch)
Date: Tue, 21 Oct 2008 16:00:45 +0200
Subject: [Biojava-l] parsing tblastn results
In-Reply-To:
References: <48F50908.5060307@imbusch.net>
Message-ID: <48FDE08D.8000300@imbusch.net>

Thank you David and Richard for the quick replies.

I downloaded two files from
http://bugzilla.open-bio.org/show_bug.cgi?id=2603
and tried to apply the patches. I suppose that's the way to get the modified
BlastSAXParser.java.

charlie at custodian:~/biojava-live_1.6$ patch -p0 < BlastSAXParser.java.patch
(Stripping trailing CRs from patch.)
patching file src/org/biojava/bio/program/sax/BlastSAXParser.java
Hunk #1 FAILED at 60.
Hunk #2 FAILED at 631.
Hunk #3 FAILED at 643.
Hunk #4 FAILED at 650.
4 out of 4 hunks FAILED -- saving rejects to file src/org/biojava/bio/program/sax/BlastSAXParser.java.rej

and similarly for the other file:

charlie at custodian:~/biojava-live_1.6$ patch -p0 < HitSectionSAXParser.java.patch
(Stripping trailing CRs from patch.)
patching file src/org/biojava/bio/program/sax/HitSectionSAXParser.java
Hunk #1 FAILED at 41.
Hunk #2 FAILED at 65.
Hunk #3 FAILED at 96.
Hunk #4 FAILED at 515.
Hunk #5 FAILED at 524.
5 out of 5 hunks FAILED -- saving rejects to file src/org/biojava/bio/program/sax/HitSectionSAXParser.java.rej

Obviously something went wrong, but I couldn't figure out what. I uploaded the
.rej files to http://charles.imbusch.net/tmp/

Any hint is appreciated.

cheers,
Charles

From crackeur at comcast.net Tue Oct 21 22:21:57 2008
From: crackeur at comcast.net (jimmy Zhang)
Date: Tue, 21 Oct 2008 19:21:57 -0700
Subject: [Biojava-l] [ANN] VTD-XML extended edition released
References: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> <93b45ca50810210226t79cfbcbfhcadaedcfe8735676@mail.gmail.com>
Message-ID: <009401c933ec$f572a700$0402a8c0@your55e5f9e3d2>

The Java version of extended VTD-XML has been released and is available for
download. This version supports file sizes up to 256 GB and memory-mapped
capabilities. The updated documentation is also available for download.

In short, you can run full XPath queries on documents that are larger than the
memory available on your machine.

Special thanks to Duane May, who provided valuable suggestions and input and
helped refine the VTD specs to make this happen.

To download the package and the documentation, go to

https://sourceforge.net/project/downloading.php?group_id=110612&use_mirror=&filename=vtd-xml_2.4_doc.zip&64621261
https://sourceforge.net/project/downloading.php?group_id=110612&use_mirror=&filename=ximpleware_extended_2.4.zip&99532507

From pzgyuanf at gmail.com Sat Oct 25 20:57:16 2008
From: pzgyuanf at gmail.com (pprun)
Date: Sun, 26 Oct 2008 08:57:16 +0800
Subject: [Biojava-l] Test failed for Alphabet.getSymbolMatchType method
Message-ID:

Hi,

The current implementation uses the same condition, equalsIgnoreCase, for both
EXACT_STRING_MATCH and MIXED_CASE_MATCH, so the MIXED_CASE_MATCH branch can
never be reached:

    public SymbolMatchType getSymbolMatchType(Symbol a, Symbol b) {
        ...
        if (a.toString().equalsIgnoreCase(b.toString())) {
            return SymbolMatchType.EXACT_STRING_MATCH;
        }
        if (a.toString().equalsIgnoreCase(b.toString())) {
            return SymbolMatchType.MIXED_CASE_MATCH;
        }
        ...

String.equals should be used for EXACT_STRING_MATCH:

    public SymbolMatchType getSymbolMatchType(Symbol a, Symbol b) {
        ...
        if (a.toString().equals(b.toString())) {
            return SymbolMatchType.EXACT_STRING_MATCH;
        }
        if (a.toString().equalsIgnoreCase(b.toString())) {
            return SymbolMatchType.MIXED_CASE_MATCH;
        }
        ...

The test case used to identify the above bug is:

    /*
     * BioJava development code
     *
     * This code may be freely distributed and modified under the
     * terms of the GNU Lesser General Public Licence. This should
     * be distributed with the code. If you do not have a copy,
     * see:
     *
     * http://www.gnu.org/copyleft/lesser.html
     *
     * Copyright for this code is held jointly by the individual
     * authors. These should be listed in @author doc comments.
     *
     * For more information on the BioJava project and its aims,
     * or to join the biojava-l mailing list, visit the home page
     * at:
     *
     * http://www.biojava.org/
     *
     */
    package org.biojava.core.symbol;

    import org.junit.After;
    import org.junit.AfterClass;
    import org.junit.Before;
    import org.junit.BeforeClass;
    import org.junit.Test;
    import static org.junit.Assert.*;

    /**
     *
     * @author pprun
     */
    public class AlphabetTest {

        public AlphabetTest() {
        }

        @BeforeClass
        public static void setUpClass() throws Exception {
        }

        @AfterClass
        public static void tearDownClass() throws Exception {
        }

        @Before
        public void setUp() {
        }

        @After
        public void tearDown() {
        }

        /**
         * Test of getSymbolMatchType method, of class Alphabet.
         */
        @Test
        public void testGetSymbolMatchType() {
            System.out.println("getSymbolMatchType");
            Alphabet testAlphabet = new Alphabet("testGetSymbolMatchType");

            // 1. exact match
            Symbol a = Symbol.get("ATGC");
            Symbol b = Symbol.get("ATGC");
            SymbolMatchType expResult = SymbolMatchType.EXACT_MATCH;
            SymbolMatchType result = testAlphabet.getSymbolMatchType(a, b);
            assertEquals(expResult, result);

            // 2. mixed case match
            a = Symbol.get("ATGC");
            b = Symbol.get("aTGC");
            expResult = SymbolMatchType.MIXED_CASE_MATCH;
            result = testAlphabet.getSymbolMatchType(a, b);
            assertEquals(expResult, result);
        }
    }

BTW, how can I get the dev/test role?
Then I can contribute to development or testing for BJ3 (I'm still a beginner
in the bio field).

Thanks,
Pprun

From gabrielle_doan at gmx.net Mon Oct 27 08:57:03 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Mon, 27 Oct 2008 13:57:03 +0100
Subject: [Biojava-l] differences between read in sequence and stored sequence in database
Message-ID: <4905BA9F.1060400@gmx.net>

Hi all,

I have a BioSQL database which contains all human chromosomes. For my recent
project I have to query for a part of a sequence. As far as I know, I can get
the whole sequence from the entry Biosequence.Seq in the BioSQL schema. So I
made this query:

    SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;

But this query did not yield the desired string, because the length of this
biosequence is only 100,000,020 bp. I am very confused about why I get such a
discrepancy. I added all chromosomes to the database with the built-in BioJava
method addRichSequence(RichSequence seq). From my raw data I know that this
sequence should have a length of 140,279,252 bp. So where is the remaining
part of my sequence? I have observed these discrepancies on all chromosomes
which are longer than 100,000,020 bp.

Here is an extract of my database:

    bioentry_id  description                                                          length
    2            Homo sapiens mitochondrion, complete genome.                         16571
    3            Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
    4            Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
    5            Homo sapiens chromosome 22, reference assembly, complete sequence.   49691432
    6            Homo sapiens chromosome 21, reference assembly, complete sequence.   46944323
    7            Homo sapiens chromosome 20, reference assembly, complete sequence.   25960004
    8            Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
    9            Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020

Sequences smaller than 100,000,020 bp are correctly stored under
Biosequence.seq.
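
A quick arithmetic check on the 100,000,020 figure may help narrow this down.
The sketch below assumes the chromosomes were loaded from GenBank flat files in
the usual layout (60 bases per sequence line, each line starting with the
coordinate of its first base). That layout is an assumption about the input,
and the class and method names are made up for illustration; none of this is
BioJava API.

```java
// Sketch only: where does a truncated length fall in a GenBank-style file?
final class TruncationCheck {

    // Assumption: standard GenBank layout, 60 bases per sequence line.
    static final int BASES_PER_LINE = 60;

    // True if the stored length ends exactly at the end of a sequence line.
    static boolean endsOnLineBoundary(long storedLength) {
        return storedLength % BASES_PER_LINE == 0;
    }

    // Coordinate printed at the start of the last line that was stored
    // (only meaningful when the length ends on a line boundary).
    static long lastStoredLineStart(long storedLength) {
        return storedLength - BASES_PER_LINE + 1;
    }

    // Coordinate printed at the start of the first line that is missing.
    static long firstMissingLineStart(long storedLength) {
        return storedLength + 1;
    }

    // Number of decimal digits in a coordinate.
    static int digits(long coordinate) {
        return Long.toString(coordinate).length();
    }
}
```

If that assumption holds, the stored sequence ends exactly at a line boundary,
and the first missing line is the first one whose coordinate needs nine digits
instead of eight. That would point at the step that read the file rather than
at the database column, although it does not prove it.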

I am grateful for any hints that explain the behaviour of my database.

Cheers,

Gabrielle

From gabrielle_doan at gmx.net Tue Oct 28 10:26:47 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Tue, 28 Oct 2008 15:26:47 +0100
Subject: [Biojava-l] differences between read in sequence and stored sequence in database]
Message-ID: <49072127.7010304@gmx.net>

Hi all,

concerning the problem described below, I have found out that this problem
also occurred in BioRuby and was fixed in 2004. See:
http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby

Unfortunately I'm clueless about BioRuby. Does anybody recognize this problem
or understand how it was solved in BioRuby?

I am grateful for any hints.

Cheers,

Gabrielle

-------- Original Message --------
Subject: [Biojava-l] differences between read in sequence and stored sequence in database
Date: Mon, 27 Oct 2008 13:57:03 +0100
From: Gabrielle Doan
To: biojava-l at biojava.org

Hi all,

I have a BioSQL database which contains all human chromosomes. For my recent
project I have to query for a part of a sequence. As far as I know, I can get
the whole sequence from the entry Biosequence.Seq in the BioSQL schema. So I
made this query:

    SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;

But this query did not yield the desired string, because the length of this
biosequence is only 100,000,020 bp. I am very confused about why I get such a
discrepancy. I added all chromosomes to the database with the built-in BioJava
method addRichSequence(RichSequence seq). From my raw data I know that this
sequence should have a length of 140,279,252 bp. So where is the remaining
part of my sequence? I have observed these discrepancies on all chromosomes
which are longer than 100,000,020 bp.

Here is an extract of my database:

    bioentry_id  description                                                          length
    2            Homo sapiens mitochondrion, complete genome.                         16571
    3            Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
    4            Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
    5            Homo sapiens chromosome 22, reference assembly, complete sequence.   49691432
    6            Homo sapiens chromosome 21, reference assembly, complete sequence.   46944323
    7            Homo sapiens chromosome 20, reference assembly, complete sequence.   25960004
    8            Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
    9            Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020

Sequences smaller than 100,000,020 bp are correctly stored under
Biosequence.seq.

I am grateful for any hints that explain the behaviour of my database.

Cheers,

Gabrielle

From dtoomey at rcsi.ie Wed Oct 29 06:45:45 2008
From: dtoomey at rcsi.ie (David Toomey)
Date: Wed, 29 Oct 2008 10:45:45 +0000
Subject: [Biojava-l] How to get full query description from blast result
Message-ID:

Hi

I am parsing blast results and I need to get the complete query description
line, but I can only work out how to get the first part of the line. So for
example, in the blast result query

    Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium
    falciparum (isolate 3D7) GN=ABRA

I need to get all of the description above, but I can only seem to retrieve
the first part, 'sp|Q8I5D2|ABRA_PLAF7', which I get from the queryId property
of the annotation.

Can anyone point me in the right direction for retrieving the complete query
description?

Thanks

Dave

From holland at eaglegenomics.com Thu Oct 30 10:07:42 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 30 Oct 2008 14:07:42 +0000
Subject: [Biojava-l] differences between read in sequence and stored sequence in database]
In-Reply-To: <49072127.7010304@gmx.net>
References: <49072127.7010304@gmx.net>
Message-ID:

Hello. Sorry for the delayed reply - I've been away on business all week.

The similar Ruby issue (and solution) is discussed here:

http://portal.open-bio.org/pipermail/bioruby/2004-March.txt

How did you parse the files in the first place? Did you use the new GenBank
parsers (BJX), or the older ones? This will help indicate where the problem
lies - the data will have been truncated at the point it was parsed from file,
so the data in your database will reflect this and you'll have to reload it
once the appropriate parser has been fixed.

If it was the newer BJX parser, then the problem most probably lies in this
regex from org.biojavax.bio.seq.io.GenbankFormat, which can probably be fixed
in a similar manner to the Ruby equivalent discussed in the posting above:

    protected static final Pattern sectp = Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

Could someone volunteer to develop and test a fix? If you come up with
something, please commit it to the SVN trunk.

cheers,
Richard

2008/10/28 Gabrielle Doan :
> Hi all,
> concerning the problem as described below I have found out that this problem
> also occurred in BioRuby and was fixed in 2004.
> See:
> http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
> Unfortunately I'm clueless about BioRuby. Does anybody recognize this
> problem or understand how it was solved in BioRuby?
>
> I am grateful for any hints.
>
> Cheers,
>
> Gabrielle
>
>
> -------- Original Message --------
> Subject: [Biojava-l] differences between read in sequence and stored
> sequence in database
> Date: Mon, 27 Oct 2008 13:57:03 +0100
> From: Gabrielle Doan
> To: biojava-l at biojava.org
>
> Hi all,
>
> I have a BioSQL database which contains all human chromosomes. For my
> recent project I have to query for a part of a sequence.
> As far as I know I can get the whole sequence from the entry
> Biosequence.Seq in the BioSQL schema.
So I've made this query: > > SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs; > > But this query hasn't yield the desired string, because the length of > this biosequence is only 100,000,020 bp. I am very confused why I get > such a discrepancy. I have added all chromosomes with the build in > method in BioJava addRichSequence(RichSequence seq) to the database. > From my raw data I know that this sequence should have a length of > 140,279,252 bp. So where is the remaining part of my sequence? I have > observed these discrepancies on all chromsomes which are longer than > 100,000,020 bp. > > Here is an abstract of my database: > bioentry_id description length > 2 Homo sapiens mitochondrion, complete genome. 16571 > 3 Homo sapiens chromosome Y, reference assembly, complete sequence. > 57772954 > 4 Homo sapiens chromosome X, reference assembly, complete sequence. > 100000020 > 5 Homo sapiens chromosome 22, reference assembly, complete sequence. > 49691432 > 6 Homo sapiens chromosome 21, reference assembly, complete sequence. > 46944323 > 7 Homo sapiens chromosome 20, reference assembly, complete sequence. > 25960004 > 8 Homo sapiens chromosome 9, reference assembly, complete sequence. > 100000020 > 9 Homo sapiens chromosome 7, reference assembly, complete sequence. > 100000020 > > Sequences smaller than 100,000,020 bp are correctly stored under > Biosequence.seq. > > I am grateful for any hints, which explain the behaviour of my database. 
> > Cheers, > > Gabrielle > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Thu Oct 30 10:10:12 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 30 Oct 2008 14:10:12 +0000 Subject: [Biojava-l] How to get full query description from blast result In-Reply-To: References: Message-ID: Good question! Can someone who knows a lot about the blast parser internals provide David with an answer to his question? cheers, Richard 2008/10/29 David Toomey : > Hi > > I am parsing blast results and I need to get the complete query description line but I can only work out how to get the first part of the line. So for example in the blast result query > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > falciparum (isolate 3D7) GN=ABRA > > I need to get all of the description above but I can only seem to retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the queryId property of the annotation > > Can anyone point me in the right direction for retrieving the complete query description? 
> > Thanks > > Dave > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From markjschreiber at gmail.com Fri Oct 31 03:26:35 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 31 Oct 2008 15:26:35 +0800 Subject: [Biojava-l] differences between read in sequence and stored sequence in database In-Reply-To: <4905BA9F.1060400@gmx.net> References: <4905BA9F.1060400@gmx.net> Message-ID: <93b45ca50810310026o6ee35a61sf2815c3547e1e679@mail.gmail.com> Could this be a database implementation issue? Is there a limit on how long a field can be in your DB? - Mark On Mon, Oct 27, 2008 at 8:57 PM, Gabrielle Doan wrote: > > Hi all, > > I have a BioSQL database which contains all human chromsomes. For my recent project I have to query for a part of a sequence. > As far as I know I can get the whole sequence from the entry Biosequence.Seq in the BioSQL schema. So I've made this query: > > SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs; > > But this query hasn't yield the desired string, because the length of this biosequence is only 100,000,020 bp. I am very confused why I get such a discrepancy. I have added all chromosomes with the build in method in BioJava addRichSequence(RichSequence seq) to the database. From my raw data I know that this sequence should have a length of 140,279,252 bp. So where is the remaining part of my sequence? I have observed these discrepancies on all chromsomes which are longer than 100,000,020 bp. > > Here is an abstract of my database: > bioentry_id description length > 2 Homo sapiens mitochondrion, complete genome. 16571 > 3 Homo sapiens chromosome Y, reference assembly, complete sequence. 
57772954 > 4 Homo sapiens chromosome X, reference assembly, complete sequence. 100000020 > 5 Homo sapiens chromosome 22, reference assembly, complete sequence. 49691432 > 6 Homo sapiens chromosome 21, reference assembly, complete sequence. 46944323 > 7 Homo sapiens chromosome 20, reference assembly, complete sequence. 25960004 > 8 Homo sapiens chromosome 9, reference assembly, complete sequence. 100000020 > 9 Homo sapiens chromosome 7, reference assembly, complete sequence. 100000020 > > Sequences smaller than 100,000,020 bp are correctly stored under Biosequence.seq. > > I am grateful for any hints, which explain the behaviour of my database. > > Cheers, > > Gabrielle > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From markjschreiber at gmail.com Fri Oct 31 04:00:35 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 31 Oct 2008 16:00:35 +0800 Subject: [Biojava-l] How to get full query description from blast result In-Reply-To: References: Message-ID: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com> Hi - If you use the BlastEcho program on the cookbook pages you can find out if and how the information is being parsed and where it goes. It is possible it is not parsed. In this case you could add a feature request. - Mark On Thu, Oct 30, 2008 at 10:10 PM, Richard Holland wrote: > > Good question! > > Can someone who knows a lot about the blast parser internals provide > David with an answer to his question? > > cheers, > Richard > > 2008/10/29 David Toomey : > > Hi > > > > I am parsing blast results and I need to get the complete query description line but I can only work out how to get the first part of the line. 
So for example in the blast result query > > > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > > falciparum (isolate 3D7) GN=ABRA > > > > I need to get all of the description above but I can only seem to retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the queryId property of the annotation > > > > Can anyone point me in the right direction for retrieving the complete query description? > > > > Thanks > > > > Dave > > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From community at struck.lu Fri Oct 31 06:05:00 2008 From: community at struck.lu (community at struck.lu) Date: Fri, 31 Oct 2008 11:05:00 +0100 Subject: [Biojava-l] SCF: support for ambiguities Message-ID: Hello, I am using the SCF class in the context of HIV-1 population sequencing. In this context we do have sometimes ambiguous base calls. To support them I extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. 

Therefore I simply added the following code to the "decode" function:

#########################
public Symbol decode(byte call) throws IllegalSymbolException {

    //get the DNA Alphabet
    Alphabet dna = DNATools.getDNA();

    char c = (char) call;
    switch (c) {
        case 'a':
        case 'A':
            return DNATools.a();
        case 'c':
        case 'C':
            return DNATools.c();
        case 'g':
        case 'G':
            return DNATools.g();
        case 't':
        case 'T':
            return DNATools.t();
        case 'n':
        case 'N':
            return DNATools.n();
        case '-':
            return DNATools.getDNA().getGapSymbol();
        case 'w':
        case 'W':
            //make the 'W' symbol
            Set symbolsThatMakeW = new HashSet();
            symbolsThatMakeW.add(DNATools.a());
            symbolsThatMakeW.add(DNATools.t());
            Symbol w = dna.getAmbiguity(symbolsThatMakeW);
            return w;
        case 's':
        case 'S':
            //make the 'S' symbol
            Set symbolsThatMakeS = new HashSet();
            symbolsThatMakeS.add(DNATools.c());
            symbolsThatMakeS.add(DNATools.g());
            Symbol s = dna.getAmbiguity(symbolsThatMakeS);
            return s;
        ... (and so on)
#########################

Is this the right way to do it? And if so, how can this code be submitted to
the official biojava source code?

Best regards,
Daniel Struck

_________________________________________________________
Mail sent using root eSolutions Webmailer - www.root.lu

From dtoomey at rcsi.ie Fri Oct 31 08:07:19 2008
From: dtoomey at rcsi.ie (David Toomey)
Date: Fri, 31 Oct 2008 12:07:19 +0000
Subject: [Biojava-l] How to get full query description from blast result
In-Reply-To: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
References: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
Message-ID:

Hi Mark

I tried that and it appears that it is not being parsed. Only the portion of
the line up to the first space is returned as queryId. The rest of the line is
not returned.

Could this be added to the blast parser?
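
To make the request concrete, here is a minimal sketch of the splitting being
asked for: the part of a "Query=" line up to the first whitespace is taken as
the id, and everything after it as the description. It is plain string
handling, not the BioJava parser; the class name QueryLine and the method
split are hypothetical, and it assumes the caller has already joined any
wrapped continuation lines of the query header into one string. It does not
try to recognise or remove the query length information.

```java
// Sketch only: split a BLAST "Query=" header into id and description.
final class QueryLine {

    // Returns a two-element array: { id, description }.
    // The description is empty if the line holds nothing but the id.
    static String[] split(String header) {
        String body = header.trim();
        if (body.startsWith("Query=")) {
            body = body.substring("Query=".length()).trim();
        }
        // Collapse runs of whitespace, including line breaks left over
        // from a description that was wrapped onto a second line.
        body = body.replaceAll("\\s+", " ");

        int firstSpace = body.indexOf(' ');
        if (firstSpace < 0) {
            return new String[] { body, "" };
        }
        return new String[] {
            body.substring(0, firstSpace),   // what queryId holds today
            body.substring(firstSpace + 1)   // the part currently discarded
        };
    }
}
```

Applied to the example query, the first element would be the id that queryId
already returns and the second would be the remaining description text.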
Cheers Dave -----Original Message----- From: Mark Schreiber [mailto:markjschreiber at gmail.com] Sent: 31 October 2008 08:01 To: holland at eaglegenomics.com Cc: David Toomey; biojava-l at biojava.org Subject: Re: [Biojava-l] How to get full query description from blast result Hi - If you use the BlastEcho program on the cookbook pages you can find out if and how the information is being parsed and where it goes. It is possible it is not parsed. In this case you could add a feature request. - Mark On Thu, Oct 30, 2008 at 10:10 PM, Richard Holland wrote: > > Good question! > > Can someone who knows a lot about the blast parser internals provide > David with an answer to his question? > > cheers, > Richard > > 2008/10/29 David Toomey : > > Hi > > > > I am parsing blast results and I need to get the complete query description line but I can only work out how to get the first part of the line. So for example in the blast result query > > > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > > falciparum (isolate 3D7) GN=ABRA > > > > I need to get all of the description above but I can only seem to retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the queryId property of the annotation > > > > Can anyone point me in the right direction for retrieving the complete query description? 
> >
> > Thanks
> >
> > Dave
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

From simon.foote at nrc-cnrc.gc.ca Fri Oct 31 07:56:30 2008
From: simon.foote at nrc-cnrc.gc.ca (Simon Foote)
Date: Fri, 31 Oct 2008 07:56:30 -0400
Subject: [Biojava-l] How to get full query description from blast result
In-Reply-To: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
References: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
Message-ID: <490AF26E.7000604@nrc-cnrc.gc.ca>

Mark is right.

A quick look at the code shows that for the query line, it extracts everything
up to the first whitespace and puts that into the queryId, and everything else
is discarded.

To get the full description, some additional code is needed to populate a
queryDescription with everything from the query line up to the query length
information, which is contained in parentheses.

Simon

Bioinformatics Specialist
Institute for Biological Sciences | Institut des sciences biologiques
National Research Council of Canada | Conseil national de recherches Canada
Ottawa, Canada K1A 0R6
Telephone | Téléphone 613-990-3600 / Facsimile | Télécopieur 613-990-9092
Government of Canada | Gouvernement du Canada

Mark Schreiber wrote:
>
> Hi -
>
> If you use the BlastEcho program on the cookbook pages you can find
> out if and how the information is being parsed and where it goes.
>
> It is possible it is not parsed. In this case you could add a feature
> request.
> > - Mark > > On Thu, Oct 30, 2008 at 10:10 PM, Richard Holland > wrote: > > > > Good question! > > > > Can someone who knows a lot about the blast parser internals provide > > David with an answer to his question? > > > > cheers, > > Richard > > > > 2008/10/29 David Toomey : > > > Hi > > > > > > I am parsing blast results and I need to get the complete query > description line but I can only work out how to get the first part of > the line. So for example in the blast result query > > > > > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > > > falciparum (isolate 3D7) GN=ABRA > > > > > > I need to get all of the description above but I can only seem to > retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the > queryId property of the annotation > > > > > > Can anyone point me in the right direction for retrieving the > complete query description? > > > > > > Thanks > > > > > > Dave > > > > > > > > > _______________________________________________ > > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > > > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > M: +44 7500 438846 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From benb at fruitfly.org Fri Oct 31 09:38:32 2008 From: benb at fruitfly.org (Ben Berman) Date: Fri, 31 Oct 2008 06:38:32 -0700 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: Is there a reason why IUPAC ambiguity codes have never been added to DNATools? Would it hurt the performance of symbol lookups? 
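
On the lookup cost, a minimal sketch of one way the two-base IUPAC codes from
the decode() extension could be kept in a table that is built once when the
class loads, so that decoding a call is a single map lookup. It uses plain
Java characters rather than BioJava Symbols so that it stands alone; the class
name IupacTwoBase and the method basesFor are hypothetical, and in the SCF code
the map values would presumably be the ambiguity Symbols rather than sets of
characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: two-base IUPAC ambiguity codes in a table built once.
final class IupacTwoBase {

    private static final Map<Character, Set<Character>> CODES = new HashMap<>();

    static {
        register('R', 'A', 'G'); // purine
        register('Y', 'C', 'T'); // pyrimidine
        register('S', 'C', 'G'); // strong
        register('W', 'A', 'T'); // weak
        register('K', 'G', 'T'); // keto
        register('M', 'A', 'C'); // amino
    }

    private static void register(char code, char first, char second) {
        Set<Character> bases = new TreeSet<>();
        bases.add(first);
        bases.add(second);
        CODES.put(code, Collections.unmodifiableSet(bases));
    }

    // Returns the shared, read-only set of bases for a two-base code.
    // Lower-case calls are accepted; anything else is rejected.
    static Set<Character> basesFor(byte call) {
        char code = Character.toUpperCase((char) call);
        Set<Character> bases = CODES.get(code);
        if (bases == null) {
            throw new IllegalArgumentException("Not a two-base IUPAC code: " + code);
        }
        return bases;
    }
}
```

Because the sets are created in the static initialiser and handed out
read-only, repeated calls for the same code return the same object and nothing
is allocated per base call.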
On Oct 31, 2008, at 3:05 AM, community at struck.lu wrote: > Hello, > > > I am using the SCF class in the context of HIV-1 population > sequencing. In > this context we do have sometimes ambiguous base calls. To support > them I > extended the SCF class to allow for IUPAC ambiguities up to 2 > nucleotides. > > Therefore I simply added the following code to the "decode" function: > > ######################### > public Symbol decode(byte call) throws IllegalSymbolException { > > //get the DNA Alphabet > Alphabet dna = DNATools.getDNA(); > > char c = (char) call; > switch (c) { > case 'a': > case 'A': > return DNATools.a(); > case 'c': > case 'C': > return DNATools.c(); > case 'g': > case 'G': > return DNATools.g(); > case 't': > case 'T': > return DNATools.t(); > case 'n': > case 'N': > return DNATools.n(); > case '-': > return DNATools.getDNA().getGapSymbol(); > case 'w': > case 'W': > //make the 'W' symbol > Set symbolsThatMakeW = new HashSet(); > symbolsThatMakeW.add(DNATools.a()); > symbolsThatMakeW.add(DNATools.t()); > Symbol w = dna.getAmbiguity(symbolsThatMakeW); > return w; > case 's': > case 'S': > //make the 'S' symbol > Set symbolsThatMakeS = new HashSet(); > symbolsThatMakeS.add(DNATools.c()); > symbolsThatMakeS.add(DNATools.g()); > Symbol s = dna.getAmbiguity(symbolsThatMakeS); > return s; > ... (and so on) > ######################### > > Is this the right way to do it? And if so, how can this code be > submitted to > the official biojava source code? > > > Best regards, > Daniel Struck > _________________________________________________________ > Mail sent using root eSolutions Webmailer - www.root.lu > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > ---- Ben Berman, PhD Research Associate, USC Epigenome Center Harlyne J. Norris Research Tower 1450 Biggy St. 
Room #G511, MC 9601 Los Angeles, CA 90033 From holland at eaglegenomics.com Fri Oct 31 09:56:54 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 31 Oct 2008 13:56:54 +0000 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: It is the correct method, yes. However your code constructs a new hash set every time it does the check for W or S etc.. It would be much more efficient to create class-static references to the ambiguity symbols you need, instead of (re)creating them every time they're encountered. A class-static gap symbol reference would also be good in this situation. cheers, Richard 2008/10/31 community at struck.lu : > Hello, > > > I am using the SCF class in the context of HIV-1 population sequencing. In > this context we do have sometimes ambiguous base calls. To support them I > extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. > > Therefore I simply added the following code to the "decode" function: > > ######################### > public Symbol decode(byte call) throws IllegalSymbolException { > > //get the DNA Alphabet > Alphabet dna = DNATools.getDNA(); > > char c = (char) call; > switch (c) { > case 'a': > case 'A': > return DNATools.a(); > case 'c': > case 'C': > return DNATools.c(); > case 'g': > case 'G': > return DNATools.g(); > case 't': > case 'T': > return DNATools.t(); > case 'n': > case 'N': > return DNATools.n(); > case '-': > return DNATools.getDNA().getGapSymbol(); > case 'w': > case 'W': > //make the 'W' symbol > Set symbolsThatMakeW = new HashSet(); > symbolsThatMakeW.add(DNATools.a()); > symbolsThatMakeW.add(DNATools.t()); > Symbol w = dna.getAmbiguity(symbolsThatMakeW); > return w; > case 's': > case 'S': > //make the 'S' symbol > Set symbolsThatMakeS = new HashSet(); > symbolsThatMakeS.add(DNATools.c()); > symbolsThatMakeS.add(DNATools.g()); > Symbol s = dna.getAmbiguity(symbolsThatMakeS); > return s; > ... 
(and so on) > ######################### > > Is this the right way to do it? And if so, how can this code be submitted to > the official biojava source code? > > > Best regards, > Daniel Struck > _________________________________________________________ > Mail sent using root eSolutions Webmailer - www.root.lu > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Fri Oct 31 10:40:10 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 31 Oct 2008 14:40:10 +0000 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: It would be fine to add them there too. You'd still need to modify the SCF parser though in order for it to be able to know about them. cheers, Richard 2008/10/31 Ben Berman : > > Is there a reason why IUPAC ambiguity codes have never been added to > DNATools? Would it hurt the performance of symbol lookups? > > > On Oct 31, 2008, at 3:05 AM, community at struck.lu wrote: > >> Hello, >> >> >> I am using the SCF class in the context of HIV-1 population sequencing. In >> this context we do have sometimes ambiguous base calls. To support them I >> extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. 
>> >> Therefore I simply added the following code to the "decode" function: >> >> ######################### >> public Symbol decode(byte call) throws IllegalSymbolException { >> >> //get the DNA Alphabet >> Alphabet dna = DNATools.getDNA(); >> >> char c = (char) call; >> switch (c) { >> case 'a': >> case 'A': >> return DNATools.a(); >> case 'c': >> case 'C': >> return DNATools.c(); >> case 'g': >> case 'G': >> return DNATools.g(); >> case 't': >> case 'T': >> return DNATools.t(); >> case 'n': >> case 'N': >> return DNATools.n(); >> case '-': >> return DNATools.getDNA().getGapSymbol(); >> case 'w': >> case 'W': >> //make the 'W' symbol >> Set symbolsThatMakeW = new HashSet(); >> symbolsThatMakeW.add(DNATools.a()); >> symbolsThatMakeW.add(DNATools.t()); >> Symbol w = dna.getAmbiguity(symbolsThatMakeW); >> return w; >> case 's': >> case 'S': >> //make the 'S' symbol >> Set symbolsThatMakeS = new HashSet(); >> symbolsThatMakeS.add(DNATools.c()); >> symbolsThatMakeS.add(DNATools.g()); >> Symbol s = dna.getAmbiguity(symbolsThatMakeS); >> return s; >> ... (and so on) >> ######################### >> >> Is this the right way to do it? And if so, how can this code be submitted >> to >> the official biojava source code? >> >> >> Best regards, >> Daniel Struck >> _________________________________________________________ >> Mail sent using root eSolutions Webmailer - www.root.lu >> >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > ---- > Ben Berman, PhD > Research Associate, USC Epigenome Center > Harlyne J. Norris Research Tower > 1450 Biggy St. 
> Room #G511, MC 9601 > Los Angeles, CA 90033 > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From community at struck.lu Fri Oct 31 12:06:45 2008 From: community at struck.lu (community at struck.lu) Date: Fri, 31 Oct 2008 17:06:45 +0100 Subject: [Biojava-l] SCF: support for ambiguities Message-ID: True. It was a first quick and dirty hack to get the rest of my project going. I think adding support for the IUPAC ambiguities to DNATools would be the most appropriate solution. The SCF class can then easily be adapted. Are there any plans to do so? If not, I could give it a try and submit a patch for DNATools and SCF. Greetings, Daniel "Richard Holland" wrote: > It is the correct method, yes. > > However your code constructs a new hash set every time it does the > check for W or S etc. It would be much more efficient to create > class-static references to the ambiguity symbols you need, instead of > (re)creating them every time they're encountered. A class-static gap > symbol reference would also be good in this situation. > > cheers, > Richard > > > > 2008/10/31 community at struck.lu : > > Hello, > > > > > > I am using the SCF class in the context of HIV-1 population sequencing. In > > this context we do have sometimes ambiguous base calls. To support them I > > extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides.
> > > > Therefore I simply added the following code to the "decode" function: > > > > ######################### > > public Symbol decode(byte call) throws IllegalSymbolException { > > > > //get the DNA Alphabet > > Alphabet dna = DNATools.getDNA(); > > > > char c = (char) call; > > switch (c) { > > case 'a': > > case 'A': > > return DNATools.a(); > > case 'c': > > case 'C': > > return DNATools.c(); > > case 'g': > > case 'G': > > return DNATools.g(); > > case 't': > > case 'T': > > return DNATools.t(); > > case 'n': > > case 'N': > > return DNATools.n(); > > case '-': > > return DNATools.getDNA().getGapSymbol(); > > case 'w': > > case 'W': > > //make the 'W' symbol > > Set symbolsThatMakeW = new HashSet(); > > symbolsThatMakeW.add(DNATools.a()); > > symbolsThatMakeW.add(DNATools.t()); > > Symbol w = dna.getAmbiguity(symbolsThatMakeW); > > return w; > > case 's': > > case 'S': > > //make the 'S' symbol > > Set symbolsThatMakeS = new HashSet(); > > symbolsThatMakeS.add(DNATools.c()); > > symbolsThatMakeS.add(DNATools.g()); > > Symbol s = dna.getAmbiguity(symbolsThatMakeS); > > return s; > > ... (and so on) > > ######################### > > > > Is this the right way to do it? And if so, how can this code be submitted to > > the official biojava source code? 
> > > > > > Best regards, > > Daniel Struck > > _________________________________________________________ > > Mail sent using root eSolutions Webmailer - www.root.lu > > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > _________________________________________________________ Mail sent using root eSolutions Webmailer - www.root.lu From holland at eaglegenomics.com Fri Oct 31 12:14:30 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 31 Oct 2008 16:14:30 +0000 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: A patch would be much appreciated! cheers, Richard 2008/10/31 community at struck.lu : > True. It was a first quick and dirty hack to get the rest of my project going. > > I think adding support for the IUPAC ambiguities to DNATools would be the most > appropriate solution. The SCF class can then easily be adapted. > > Are there any plans to do so? > If not, I could give it a try and submit a patch for DNATools and SCF. > > Greetings, > Daniel > > "Richard Holland" wrote: > >> It is the correct method, yes. >> >> However your code constructs a new hash set every time it does the >> check for W or S etc. It would be much more efficient to create >> class-static references to the ambiguity symbols you need, instead of >> (re)creating them every time they're encountered. A class-static gap >> symbol reference would also be good in this situation. >> >> cheers, >> Richard >> >> >> >> 2008/10/31 community at struck.lu : >> > Hello, >> > >> > >> > I am using the SCF class in the context of HIV-1 population sequencing. In >> > this context we do have sometimes ambiguous base calls. To support them I >> > extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides.
>> > >> > Therefore I simply added the following code to the "decode" function: >> > >> > ######################### >> > public Symbol decode(byte call) throws IllegalSymbolException { >> > >> > //get the DNA Alphabet >> > Alphabet dna = DNATools.getDNA(); >> > >> > char c = (char) call; >> > switch (c) { >> > case 'a': >> > case 'A': >> > return DNATools.a(); >> > case 'c': >> > case 'C': >> > return DNATools.c(); >> > case 'g': >> > case 'G': >> > return DNATools.g(); >> > case 't': >> > case 'T': >> > return DNATools.t(); >> > case 'n': >> > case 'N': >> > return DNATools.n(); >> > case '-': >> > return DNATools.getDNA().getGapSymbol(); >> > case 'w': >> > case 'W': >> > //make the 'W' symbol >> > Set symbolsThatMakeW = new HashSet(); >> > symbolsThatMakeW.add(DNATools.a()); >> > symbolsThatMakeW.add(DNATools.t()); >> > Symbol w = dna.getAmbiguity(symbolsThatMakeW); >> > return w; >> > case 's': >> > case 'S': >> > //make the 'S' symbol >> > Set symbolsThatMakeS = new HashSet(); >> > symbolsThatMakeS.add(DNATools.c()); >> > symbolsThatMakeS.add(DNATools.g()); >> > Symbol s = dna.getAmbiguity(symbolsThatMakeS); >> > return s; >> > ... (and so on) >> > ######################### >> > >> > Is this the right way to do it? And if so, how can this code be submitted > to >> > the official biojava source code? 
>> > >> > >> > Best regards, >> > Daniel Struck >> > _________________________________________________________ >> > Mail sent using root eSolutions Webmailer - www.root.lu >> > >> > >> > _______________________________________________ >> > Biojava-l mailing list - Biojava-l at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biojava-l >> > >> >> > > > _________________________________________________________ > Mail sent using root eSolutions Webmailer - www.root.lu > > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From gabrielle_doan at gmx.net Fri Oct 31 11:09:56 2008 From: gabrielle_doan at gmx.net (Gabrielle Doan) Date: Fri, 31 Oct 2008 15:09:56 -0000 Subject: [Biojava-l] differences between read in sequence and stored sequence in database] In-Reply-To: References: <49072127.7010304@gmx.net> Message-ID: <490B1FB3.7010607@gmx.net> Hi all, I've changed the regular expression in org.biojavax.bio.seq.io.GenbankFormat from protected static final Pattern sectp = Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$"); <\code> to protected static final Pattern sectp = Pattern.compile("^(\\s{0,8}([A-Za-z]+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$"); <\code> as in BioRuby (http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb.diff?r1=0.24&r2=0.25&cvsroot=bioruby). But then features like D-loop can't be detected, so this is not the solution to my problem. The reason for the truncation is readSection(BufferedReader br) in org.biojavax.bio.seq.io.GenbankFormat.
if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0)) { // dump out last part of section section.add(new String[]{currKey,currVal.toString()}); br.reset(); done = true; <\snip> The condition in the if-clause will ignore lines which don't begin with a whitespace, so this line will be read 99999961 cccgcccaca cccctcggcc ctgccctctg gccatacagg ttctcggtgg tgttgaagag <\snip> and this line won't be read: 100000021 gtcctcgggc tccggcttgg tgctcacgca cacaggaaag tcagcttctc ctgggagggc <\snip> If you change the if-statement to this: String firstSecKey = section.size() == 0 ? "" : ((String[])section.get(0))[0]; if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0 && ( !firstSecKey.equals(START_SEQUENCE_TAG) || line.startsWith(END_SEQUENCE_TAG)))) <\snip> You can add the whole sequence without truncation to the database. I have attached GenbankFormat.java in this mail. Can anybody check the method for me and commit it? Since I'm not a BioJava specialist. Cheers, Gabrielle Richard Holland schrieb: > Hello. > > Sorry for the delayed reply - I've been away on business all week. > > The similar Ruby issue (and solution) is discussed here: > > http://portal.open-bio.org/pipermail/bioruby/2004-March.txt > > How did you parse the files in the first place? Did you use the new > GenBank parsers (BJX), or the older ones? This will help indicate > where the problem lies - the data will have been truncated at the > point it was parsed from file, so the data in your database will > reflect this and you'll have to reload it once the appropriate parser > has been fixed. 
> > If it was the newer BJX parser, then the problem most probably lies in > this regex from org.biojavax.bio.seq.io.GenbankFormat, which can > probably be fixed in a similar manner to the Ruby equivalent discussed > in the posting above: > > protected static final Pattern sectp = > Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$"); > > Could someone volunteer to develop and test a fix? If you come up with > something, please commit it to the SVN trunk. > > cheers, > Richard > > > 2008/10/28 Gabrielle Doan : >> Hi all, >> Concerning the problem described below, I have found out that this problem >> also occurred in BioRuby and was fixed in 2004. >> See: >> http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby >> Unfortunately I'm clueless about BioRuby. Does anybody recognize this >> problem or understand how it was solved in BioRuby? >> >> I am grateful for any hints. >> >> Cheers, >> >> Gabrielle >> >> >> -------- Original Message -------- >> Subject: [Biojava-l] differences between read in sequence and stored >> sequence in database >> Date: Mon, 27 Oct 2008 13:57:03 +0100 >> From: Gabrielle Doan >> To: biojava-l at biojava.org >> >> Hi all, >> >> I have a BioSQL database which contains all human chromosomes. For my >> recent project I have to query for a part of a sequence. >> As far as I know I can get the whole sequence from the entry >> Biosequence.Seq in the BioSQL schema. So I've made this query: >> >> SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs; >> >> But this query didn't yield the desired string, because the length of >> this biosequence is only 100,000,020 bp. I am very confused why I get >> such a discrepancy. I have added all chromosomes to the database with the built-in >> BioJava method addRichSequence(RichSequence seq). >> From my raw data I know that this sequence should have a length of >> 140,279,252 bp.
So where is the remaining part of my sequence? I have >> observed these discrepancies on all chromsomes which are longer than >> 100,000,020 bp. >> >> Here is an abstract of my database: >> bioentry_id description length >> 2 Homo sapiens mitochondrion, complete genome. 16571 >> 3 Homo sapiens chromosome Y, reference assembly, complete sequence. >> 57772954 >> 4 Homo sapiens chromosome X, reference assembly, complete sequence. >> 100000020 >> 5 Homo sapiens chromosome 22, reference assembly, complete sequence. >> 49691432 >> 6 Homo sapiens chromosome 21, reference assembly, complete sequence. >> 46944323 >> 7 Homo sapiens chromosome 20, reference assembly, complete sequence. >> 25960004 >> 8 Homo sapiens chromosome 9, reference assembly, complete sequence. >> 100000020 >> 9 Homo sapiens chromosome 7, reference assembly, complete sequence. >> 100000020 >> >> Sequences smaller than 100,000,020 bp are correctly stored under >> Biosequence.seq. >> >> I am grateful for any hints, which explain the behaviour of my database. >> >> Cheers, >> >> Gabrielle >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: GenbankFormat.java Type: text/x-java Size: 48624 bytes Desc: not available URL: From markjschreiber at gmail.com Wed Oct 1 06:07:51 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 1 Oct 2008 14:07:51 +0800 Subject: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result In-Reply-To: References: Message-ID: <93b45ca50809302307t19a652a4v4a61eeceec07aa62@mail.gmail.com> Actually, if it is an OS specific carriage return then there is still a minor issue. 
We should really try and code stuff so that it can handle files that originate from any major OS. - Mark On Wed, Oct 1, 2008 at 12:31 AM, Richard Holland wrote: > > Sounds like it _might_ be something to do with the carriage return > itself. Is the blast file generated on the same OS that you're running > your analysis on? (e.g. you might run Blast on a Linux box, but > attempt to parse the file on a Windows box?). If the two OSes are > different, this might point to it - as Linux won't necessarily > understand the Windows linebreaks, or vice versa, and might > misinterpret them. When you copy the portion of the file to a new file > on the OS you're running the analysis on, it will substitute its own > local linebreaks and thus mask the problem. > > So the first thing I'd check is to what the two OSes involved are. If > they're different, try running your analysis program on the same OS as > the Blast output was generated on. If that does fix it, then try > putting your Blast files through dos2unix or something similar to > convert the linebreaks before running your analysis program. > > If they're the same OS, then we still have a problem! > > cheers, > Richard > > 2008/9/30 David Toomey : > > Hi > > > > > > > > I am parsing a blast result and I am getting a > > StringIndexOutOfBoundsException. 
The stack trace is > > > > > > > > at java.lang.String.substring(String.java:1938) > > > > at java.lang.String.substring(String.java:1905) > > > > at > > org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parseLine(BlastLikeA > > lignmentSAXParser.java:291) > > > > at > > org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parse(BlastLikeAlign > > mentSAXParser.java:116) > > > > at > > org.biojava.bio.program.sax.HitSectionSAXParser.outputHSPInfo(HitSectionSAXP > > arser.java:517) > > > > at > > org.biojava.bio.program.sax.HitSectionSAXParser.firstHSPEvent(HitSectionSAXP > > arser.java:287) > > > > at > > org.biojava.bio.program.sax.HitSectionSAXParser.interpret(HitSectionSAXParse > > r.java:251) > > > > at > > org.biojava.bio.program.sax.HitSectionSAXParser.parse(HitSectionSAXParser.ja > > va:117) > > > > at > > org.biojava.bio.program.sax.BlastSAXParser.hitsSectionReached(BlastSAXParser > > .java:634) > > > > at > > org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:341 > > ) > > > > at > > org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:168) > > > > at > > org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXPars > > er.java:314) > > > > at > > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser. > > java:276) > > > > at > > org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java > > :163) > > > > at ie.rcsi.blast.StandardParser.parse(StandardParser.java:65) > > > > at ie.rcsi.blast.BlastParser.parse(BlastParser.java:44) > > > > at ie.rcsi.blast.Main.main(Main.java:30) > > > > > > > > I have updated BlastLikeAlignmentSAXParser to output some debug info and > > narrowed down the line causing the problem to the following line > > > > > > > > 2,4-cyclodiphosphate synthase OS=Plasmodium falciparum (isolate 3D7) > > > > GN=ISPF > > > > > > > > If I remove the carriage return and put it on a single line then everything > > works fine. 
Strangely if I copy this entry and put it in a file on its own > > it also parses correctly, even with the carriage return!!! > > > > > > > > Has anyone seen this before or does anyone have a suggestion on what I might > > do to fix it? I can send the complete blast result if it would help. I have > > tried using blast 2.2.18 and 2.2.17 and the problem is the same. > > > > > > > > Cheers > > > > > > > > Dave > > > > > > > > > > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From dtoomey at rcsi.ie Wed Oct 1 08:40:44 2008 From: dtoomey at rcsi.ie (David Toomey) Date: Wed, 1 Oct 2008 09:40:44 +0100 Subject: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result References: Message-ID: They are on the same OS. For all my tests I have run the blast search and parsing on the same OS. This has mostly been Windows but I have also tried the whole thing on Linux and I get the same problem. I have done some more testing and I don't think the carriage return is the problem. What I have found is that if the second line is less than 11 characters long, the error is thrown. If I add 4 spaces in front of the 'GN=ISPF' on the second line then it is parsed correctly, like this. 2,4-cyclodiphosphate synthase OS=Plasmodium falciparum (isolate 3D7) GN=ISPF I haven't figured out why it parses correctly when it is the only entry in the file, even without the spaces. So maybe I am still missing something.
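The 11-character threshold would be consistent with a fixed-offset substring call on the continuation line, although the code at BlastLikeAlignmentSAXParser.java:291 is not shown in this thread, so that is an assumption. A minimal sketch of the failure and of one possible guard follows. The class ContinuationLine, its methods, and the offset constant are hypothetical, and the offset of 11 is taken only from the observation above.

```java
// Hypothetical illustration, NOT the BioJava parser code: it assumes the
// parser skips a fixed 11-character indent on continuation lines.
public class ContinuationLine {

    // Assumed indent width, inferred from the "less than 11 characters" report.
    static final int ASSUMED_OFFSET = 11;

    // Mirrors the suspected behaviour: throws StringIndexOutOfBoundsException
    // when the line is shorter than the offset.
    static String unguarded(String line) {
        return line.substring(ASSUMED_OFFSET);
    }

    // One possible guard: a short line is kept whole instead of being sliced.
    static String guarded(String line) {
        if (line.length() < ASSUMED_OFFSET) {
            return line.trim();
        }
        return line.substring(ASSUMED_OFFSET).trim();
    }

    public static void main(String[] args) {
        String shortLine = "GN=ISPF"; // 7 characters, as in the failing entry
        try {
            unguarded(shortLine);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded substring failed on a "
                    + shortLine.length() + "-character line");
        }
        System.out.println("guarded result: " + guarded(shortLine));
    }
}
```

Under the same assumption, padding the line with 4 spaces brings it to exactly 11 characters, which is the shortest length for which substring(11) no longer throws. Whether a real fix should keep, trim, or re-join the short line depends on what the parser does with the result.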
Cheers, Dave -----Original Message----- From: dicknetherlands at gmail.com [mailto:dicknetherlands at gmail.com] On Behalf Of Richard Holland Sent: 30 September 2008 17:31 To: David Toomey Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result Sounds like it _might_ be something to do with the carriage return itself. Is the blast file generated on the same OS that you're running your analysis on? (e.g. you might run Blast on a Linux box, but attempt to parse the file on a Windows box?). If the two OSes are different, this might point to it - as Linux won't necessarily understand the Windows linebreaks, or vice versa, and might misinterpret them. When you copy the portion of the file to a new file on the OS you're running the analysis on, it will substitute its own local linebreaks and thus mask the problem. So the first thing I'd check is to what the two OSes involved are. If they're different, try running your analysis program on the same OS as the Blast output was generated on. If that does fix it, then try putting your Blast files through dos2unix or something similar to convert the linebreaks before running your analysis program. If they're the same OS, then we still have a problem! cheers, Richard From holland at eaglegenomics.com Wed Oct 1 09:37:59 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 1 Oct 2008 10:37:59 +0100 Subject: [Biojava-l] StringIndexOutOfBoundsException while parsing blast result In-Reply-To: References: Message-ID: Thanks for the extra info. 2008/10/1 David Toomey : > They are on the same OS. For all my tests I have run the blast search and > parsing on the same OS. This has mostly been windows but I have also tried > the whole thing on Linux and I get the same problem. > I have done some more testing and I don't think the carriage return is the > problem. > What I have found is that if the second line is less than 11 characters the > error is thrown. 
If I add 4 spaces in front of the 'GN=ISPF' on the second > line then it is parsed correctly, like this. > > 2,4-cyclodiphosphate synthase OS=Plasmodium falciparum (isolate 3D7) > GN=ISPF > > I haven't figured out why it parses correctly when it is the only entry in > the file, even without the spaces. So maybe I am still missing something. > > Cheers, > > Dave > > -----Original Message----- > From: dicknetherlands at gmail.com [mailto:dicknetherlands at gmail.com] On Behalf > Of Richard Holland > Sent: 30 September 2008 17:31 > To: David Toomey > Cc: biojava-l at lists.open-bio.org > Subject: Re: [Biojava-l] StringIndexOutOfBoundsException while parsing blast > result > > Sounds like it _might_ be something to do with the carriage return > itself. Is the blast file generated on the same OS that you're running > your analysis on? (e.g. you might run Blast on a Linux box, but > attempt to parse the file on a Windows box?). If the two OSes are > different, this might point to it - as Linux won't necessarily > understand the Windows linebreaks, or vice versa, and might > misinterpret them. When you copy the portion of the file to a new file > on the OS you're running the analysis on, it will substitute its own > local linebreaks and thus mask the problem. > > So the first thing I'd check is to what the two OSes involved are. If > they're different, try running your analysis program on the same OS as > the Blast output was generated on. If that does fix it, then try > putting your Blast files through dos2unix or something similar to > convert the linebreaks before running your analysis program. > > If they're the same OS, then we still have a problem! 
> > cheers, > Richard > > > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From pzgyuanf at gmail.com Wed Oct 1 10:52:25 2008 From: pzgyuanf at gmail.com (pprun) Date: Wed, 01 Oct 2008 18:52:25 +0800 Subject: [Biojava-l] BufferedOutputStream to RichSequence.IOTools.writeXXX() method needs to flush manually Message-ID: Hi, I don't know this is a feature or a bug, If a BufferedOutputStream was passed to method RichSequence.IOTools.writeGenbank(OutputStream os, Sequence seq, Namespace ns), at the end, I need to manually flush it - BufferedOutputStream.flush() Otherwise, the output content will be truncated. Is this the expected behavior? Thanks, - Pprun From holland at eaglegenomics.com Wed Oct 1 13:36:59 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 1 Oct 2008 14:36:59 +0100 Subject: [Biojava-l] BufferedOutputStream to RichSequence.IOTools.writeXXX() method needs to flush manually In-Reply-To: References: Message-ID: The IOTools interfaces accept OutputStream instances, not BufferedOutputStream instances. flush() is not a requirement on OutputStream and so BJX does not call it. cheers, Richard 2008/10/1 pprun : > Hi, > I don't know this is a feature or a bug, > If a BufferedOutputStream was passed to method > RichSequence.IOTools.writeGenbank(OutputStream os, Sequence seq, > Namespace ns), > at the end, I need to manually flush it - BufferedOutputStream.flush() > > Otherwise, the output content will be truncated. > > Is this the expected behavior? 
> > Thanks, > - Pprun > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From markjschreiber at gmail.com Thu Oct 2 00:46:03 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 2 Oct 2008 08:46:03 +0800 Subject: [Biojava-l] BufferedOutputStream to RichSequence.IOTools.writeXXX() method needs to flush manually In-Reply-To: References: Message-ID: <93b45ca50810011746y7d4f49biffd5c2e483c86bd1@mail.gmail.com> As a general rule it is best if BioJava doesn't handle the flushing and closing of OutputStreams. This is because you may want to keep using the stream and control its behaviour. An interesting example is if you pass System.out to a method that closes the stream. Probably not what you want. Having said that, maybe we should add a javadoc to say that BufferedOutputStreams need to be flushed (and possibly closed). - Mark On Wed, Oct 1, 2008 at 9:36 PM, Richard Holland wrote: > The IOTools interfaces accept OutputStream instances, not > BufferedOutputStream instances. flush() is not a requirement on > OutputStream and so BJX does not call it. > > cheers, > Richard > > 2008/10/1 pprun : >> Hi, >> I don't know this is a feature or a bug, >> If a BufferedOutputStream was passed to method >> RichSequence.IOTools.writeGenbank(OutputStream os, Sequence seq, >> Namespace ns), >> at the end, I need to manually flush it - BufferedOutputStream.flush() >> >> Otherwise, the output content will be truncated. >> >> Is this the expected behavior?
>> >> Thanks, >> - Pprun >> >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From gabrielle_doan at gmx.net Tue Oct 7 14:26:44 2008 From: gabrielle_doan at gmx.net (Gabrielle Doan) Date: Tue, 07 Oct 2008 16:26:44 +0200 Subject: [Biojava-l] Getting a part of a sequence Message-ID: <48EB71A4.70409@gmx.net> Hi all, I have a BioSQL database which contains all human chromosomes. My intention is to get the information about a particular gene. How can I get a part of a particular chromosome with all associated features? At the moment I use the following code to create my new sequence: <code> RichSequence subSeq = RichSequence.Tools.subSequence(parent, position[0], position[1], ns, geneName, parent.getAccession(), parent.getIdentifier(), parent.getVersion() + 1, (Double) (parent.getVersion() + 1.0)); <\code> Here is how I get the parent sequence: <code> public static RichSequence getChromosome(String chrNo) { Transaction tx = session.beginTransaction(); RichSequence ret = null; String query; try { if (chrNo.equals("MT")) { query = "from BioEntry as be where be.description like '%:num%'"; query = query.replaceAll(":num", "mitochondrion"); } else { query = "from BioEntry as be where be.description like '%hromosome :num%'"; query = query.replaceAll(":num", chrNo); } Query q = session.createQuery(query); ret = (RichSequence) q.list().get(0); tx.commit(); } catch (Exception e) { tx.rollback(); e.printStackTrace(); } return ret; } <\code> I always have to load the whole chromosome to get a part of it, so it takes a very long time and I get a
lot of unused information (waste of memory). I also tried to use <code>ThinRichSequence<\code> instead of <code>RichSequence<\code>, but I didn't notice any difference. Can you give me a hint on how to speed up the code? I am grateful for any hints. cheers, Gabrielle From holland at eaglegenomics.com Tue Oct 7 23:05:54 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 8 Oct 2008 00:05:54 +0100 Subject: [Biojava-l] Getting a part of a sequence In-Reply-To: <48EB71A4.70409@gmx.net> References: <48EB71A4.70409@gmx.net> Message-ID: Hello. Your code is pretty good already - but you're right, it will load the whole chromosome into memory before you can chop out the interesting bit you actually need. As you observed, by using ThinRichSequence in your query it will load only the initial shell of a sequence object to start with, but the moment you try and sub-sequence it, it will immediately load the whole sequence data into memory in order to perform the operation. If you only want the sequence data, as a string, you can do this by specifying the sequence attribute in the query and bypassing the sequence object entirely: select rs.stringSequence from Sequence as rs where rs.description like '%hromosome :num%' This will return a String instead of a RichSequence object. You can use HQL operators to perform substrings etc. on the string inside the query itself - see http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html , particularly section 14.9. If you only want the features, you can do this by using the BioSQLFeatureFilter technique. In particular you will want the BySequenceName filter, the And filter, and the OverlapsRichLocation filter. You construct a filter then pass it to the filter() method in BioSQLRichSequenceDB. The database will return to you all the RichFeature objects that match your criteria.
Note that it searches the whole database so you really must use a BySequenceName filter at the very least in order to make the results useful! However, you can't use HQL to construct a complete slice of a sequence directly in the database before returning it to the program for use as a ready-made RichSequence object. This would require Hibernate to know what a BioJava sub-sequence object is and how it behaves in relation to an 'unsliced' one, which is beyond the scope of its job as a persistence framework. cheers, Richard 2008/10/7 Gabrielle Doan : > Hi all, > I have a BioSQL database which contains all human chromosomes. My intention > is to get the information about a particular gene. How can I get a part of a > particular chromosome with all associated features? At the moment I use > following code to create my new sequence: > > > RichSequence subSeq = RichSequence.Tools.subSequence(parent, > position[0], position[1], ns, geneName, parent.getAccession(), > parent.getIdentifier(), parent.getVersion() + 1, > (Double) (parent.getVersion() + 1.0)); > <\code> > > Here is the part how I get the parent sequence: > > public static RichSequence getChromosome(String chrNo) { > Transaction tx = session.beginTransaction(); > RichSequence ret = null; > > String query; > > try { > if (chrNo.equals("MT")) { > query = "from BioEntry as be where > be.description like '%:num%'"; > query = query.replaceAll(":num", > "mitochondrion"); > } else { > query = "from BioEntry as be where > be.description like '%hromosome :num%'"; > query = query.replaceAll(":num", chrNo); > } > > Query q = session.createQuery(query); > > ret = (RichSequence) q.list().get(0); > tx.commit(); > } catch (Exception e) { > tx.rollback(); > e.printStackTrace(); > } > return ret; > } > <\code> > > I always have to load the whole chromsome to get a part of it, so it takes > very long time and I get a lot of unused information (waste of memory).
I > also tried to use ThinRichSequence<\code> instead of > RichSequence<\code>, but thereby I didn't notice any difference. > Can you give me a hint how to accelerate the code? > I am grateful for any hits. > > cheers, > Gabrielle > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From koen.bruynseels at cropdesign.com Wed Oct 8 00:02:18 2008 From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com) Date: Wed, 8 Oct 2008 02:02:18 +0200 Subject: [Biojava-l] Koen Bruynseels is out of the office. Message-ID: I will be out of the office starting 04/10/2008 and will not return until 09/10/2008. I will respond to your message when I return. From gabrielle_doan at gmx.net Thu Oct 9 12:22:01 2008 From: gabrielle_doan at gmx.net (Gabrielle Doan) Date: Thu, 09 Oct 2008 14:22:01 +0200 Subject: [Biojava-l] Getting a part of a sequence In-Reply-To: References: <48EB71A4.70409@gmx.net> Message-ID: <48EDF769.8050901@gmx.net> Hi Richard, thanks a lot for your mail. I have successfully retrieved the subsequence of a sequence as a String. 
Now I am trying to get the features for a particular range with the following code: <code> public FeatureHolder filterFeature(String name, int startpos, int endpos) { RichLocation rl = new SimpleRichLocation(new SimplePosition(startpos), new SimplePosition(endpos), 0); BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( new BioSQLFeatureFilter.BySequenceName(name), new BioSQLFeatureFilter.OverlapsRichLocation(rl)); return filter(filter); } <\code> Unfortunately I received these errors: <message> Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) at org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) ...
3 more Caused by: org.hibernate.PropertyAccessException: Exception occurred inside setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) at org.hibernate.loader.Loader.doQuery(Loader.java:729) at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) at org.hibernate.loader.Loader.doList(Loader.java:2213) at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) at org.hibernate.loader.Loader.list(Loader.java:2099) at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) ... 8 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) ... 
21 more Caused by: java.lang.NullPointerException at org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) at org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) at org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) at org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) at org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) at org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) ... 26 more <\message> Why do I get these errors? BioSQLFeatureFilter.BySequenceName(name) needs a seqName as a parameter. How can I find out the sequence name? Is it the value "name" in the table "Bioentry"? As the built-in subSequence method takes a long time, I intend to get the subsequence as a String myself and add the features to it. What do you think about this? I'm grateful for any hints. cheers, Gabrielle Richard Holland wrote: > Hello. > > Your code is pretty good already - but you're right, it will load the > whole chromosome into memory before you can chop out the interesting > bit you actually need. > > As you observed, by using ThinRichSequence in your query it will load > only the initial shell of a sequence object to start with, but the > moment you try and sub-sequence it, it will immediately load the whole > sequence data into memory in order to perform the operation. > > If you only want the sequence data, as a string, you can do this by > specifying the sequence attribute in the query and bypassing the > sequence object entirely: > > select rs.stringSequence from Sequence as rs where rs.description > like '%hromosome :num% > > This will return a String instead of a RichSequence object. You can > use HQL operators to perform substrings etc. on the string inside the > query itself - see > http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html > , particularly section 14.9.
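The substring-in-the-query idea above can be sketched as follows. This is an illustration under assumptions, not something tested against a BioSQL database: the entity name Sequence and the property stringSequence are taken from Richard's message, SliceQuerySketch and its members are hypothetical names, and whether substring() works on the sequence column depends on the underlying database. The helper only builds the HQL text and converts 1-based inclusive coordinates into the start and length arguments that substring() expects; binding the values with Hibernate's Query.setParameter(name, value) is left to the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SliceQuerySketch {

    // HQL text with named parameters. Entity and property names are
    // assumptions taken from the reply quoted above, not verified against
    // the BioJavaX Hibernate mappings.
    static final String HQL =
            "select substring(rs.stringSequence, :start, :len) "
          + "from Sequence as rs where rs.description like :pattern";

    // Converts 1-based inclusive coordinates into the values for the
    // named parameters used in HQL.
    static Map<String, Object> params(String chrNo, int start, int end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException(
                    "expected 1 <= start <= end, got " + start + ".." + end);
        }
        Map<String, Object> p = new LinkedHashMap<>();
        p.put("start", start);
        p.put("len", end - start + 1);  // inclusive range, so add one
        p.put("pattern", "%hromosome " + chrNo + "%");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(HQL);
        System.out.println(params("7", 1000, 1999));
    }
}
```

Passing the chromosome number as a named parameter, rather than pasting it into the query string with replaceAll, keeps the query text constant and leaves quoting to Hibernate.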
> > If you only want the features, you can do this by using the > BioSQLFeatureFilter technique. In particular you will want the > BySequenceName filter, the And filter, and the OverlapsRichLocation > filter. You construct a filter then pass it to the filter() method in > BioSQLRichSequenceDB. The database will return to you all the > RichFeature objects that match your criteria. Note that it searches > the whole database so you really must use a BySequenceName filter at > the very least in order to make the results useful! > > However, you can't use HQL to construct a complete slice of a sequence > directly in the database before returning it to the program for use as > a ready-made RichSequence object. This would require Hibernate to know > what a BioJava sub-sequence object is and how it behaves in relation > to an 'unsliced' one, which is beyond the scope of it's job as a > persistence framework. > > cheers, > Richard > > > > 2008/10/7 Gabrielle Doan : >> Hi all, >> I have a BioSQL database which contains all human chromosomes. My intention >> is to get the information about a particular gene. How can I get a part of a >> particular chromosome with all associated features? 
At the moment I use >> following code to create my new sequence: >> >> >> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >> position[0], position[1], ns, geneName, parent.getAccession(), >> parent.getIdentifier(), parent.getVersion() + 1, >> (Double) (parent.getVersion() + 1.0)); >> <\code> >> >> Here is the part how I get the parent sequence: >> >> public static RichSequence getChromosome(String chrNo) { >> Transaction tx = session.beginTransaction(); >> RichSequence ret = null; >> >> String query; >> >> try { >> if (chrNo.equals("MT")) { >> query = "from BioEntry as be where >> be.description like '%:num%'"; >> query = query.replaceAll(":num", >> "mitochondrion"); >> } else { >> query = "from BioEntry as be where >> be.description like '%hromosome :num%'"; >> query = query.replaceAll(":num", chrNo); >> } >> >> Query q = session.createQuery(query); >> >> ret = (RichSequence) q.list().get(0); >> tx.commit(); >> } catch (Exception e) { >> tx.rollback(); >> e.printStackTrace(); >> } >> return ret; >> } >> <\code> >> >> I always have to load the whole chromsome to get a part of it, so it takes >> very long time and I get a lot of unused information (waste of memory). I >> also tried to use ThinRichSequence<\code> instead of >> RichSequence<\code>, but thereby I didn't notice any difference. >> Can you give me a hint how to accelerate the code? >> I am grateful for any hits. >> >> cheers, >> Gabrielle >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > > From holland at eaglegenomics.com Fri Oct 10 14:30:03 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 10 Oct 2008 15:30:03 +0100 Subject: [Biojava-l] Getting a part of a sequence In-Reply-To: <48EDF769.8050901@gmx.net> References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net> Message-ID: This looks like a bug in BJX. 
I have just committed what I think is a fix for it to the head of Subversion. Can you check out the latest source, compile it, and try your program again? cheers, Richard 2008/10/9 Gabrielle Doan > Hi Richard, > > thanks a lot for your mail. I have successfully retrieved the subsequence > of a sequence as a String. And now I try to get the features for a > particular range with following code: > > > public FeatureHolder filterFeature(String name, int startpos, int > endpos) { > RichLocation rl = new SimpleRichLocation(new > SimplePosition(startpos), > new SimplePosition(endpos), 0); > BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( > new > BioSQLFeatureFilter.BySequenceName(name), > new > BioSQLFeatureFilter.OverlapsRichLocation(rl)); > return filter(filter); > } > <\code> > > Fortunately I received these errors: > > Exception in thread "main" java.lang.RuntimeException: > java.lang.reflect.InvocationTargetException > at > org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) > at > org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) > at > org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) > at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) > ...
3 more > Caused by: org.hibernate.PropertyAccessException: Exception occurred inside > setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet > at > org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) > at > org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) > at > org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) > at > org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) > at > org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) > at > org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) > at org.hibernate.loader.Loader.doQuery(Loader.java:729) > at > org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) > at org.hibernate.loader.Loader.doList(Loader.java:2213) > at > org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) > at org.hibernate.loader.Loader.list(Loader.java:2099) > at > org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) > at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) > at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) > ... 8 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) > ... 
21 more > Caused by: java.lang.NullPointerException > at > org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) > at > org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) > at > org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) > at > org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) > at > org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) > at > org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) > ... 26 more > <\message> > > Why do I get these errors? > BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. How > can I find out the sequence name? Is it the value "name" in the table > "Bioentry"? As the build-in subSequence method takes a long time I intend to > get the subsequence as a String by myself and add the features to it. What > do you think about this? > > I'm grateful for any hints. > cheers, > > Gabrielle > > > > Richard Holland schrieb: > > Hello. >> >> Your code is pretty good already - but you're right, it will load the >> whole chromosome into memory before you can chop out the interesting >> bit you actually need. >> >> As you observed, by using ThinRichSequence in your query it will load >> only the initial shell of a sequence object to start with, but the >> moment you try and sub-sequence it, it will immediately load the whole >> sequence data into memory in order to perform the operation. >> >> If you only want the sequence data, as a string, you can do this by >> specifying the sequence attribute in the query and bypassing the >> sequence object entirely: >> >> select rs.stringSequence from Sequence as rs where rs.description >> like '%hromosome :num% >> >> This will return a String instead of a RichSequence object. You can >> use HQL operators to perform substrings etc. 
on the string inside the >> query itself - see >> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >> , particularly section 14.9. >> >> If you only want the features, you can do this by using the >> BioSQLFeatureFilter technique. In particular you will want the >> BySequenceName filter, the And filter, and the OverlapsRichLocation >> filter. You construct a filter then pass it to the filter() method in >> BioSQLRichSequenceDB. The database will return to you all the >> RichFeature objects that match your criteria. Note that it searches >> the whole database so you really must use a BySequenceName filter at >> the very least in order to make the results useful! >> >> However, you can't use HQL to construct a complete slice of a sequence >> directly in the database before returning it to the program for use as >> a ready-made RichSequence object. This would require Hibernate to know >> what a BioJava sub-sequence object is and how it behaves in relation >> to an 'unsliced' one, which is beyond the scope of it's job as a >> persistence framework. >> >> cheers, >> Richard >> >> >> >> 2008/10/7 Gabrielle Doan : >> >>> Hi all, >>> I have a BioSQL database which contains all human chromosomes. My >>> intention >>> is to get the information about a particular gene. How can I get a part >>> of a >>> particular chromosome with all associated features? 
At the moment I use >>> following code to create my new sequence: >>> >>> >>> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >>> position[0], position[1], ns, geneName, parent.getAccession(), >>> parent.getIdentifier(), parent.getVersion() + 1, >>> (Double) (parent.getVersion() + 1.0)); >>> <\code> >>> >>> Here is the part how I get the parent sequence: >>> >>> public static RichSequence getChromosome(String chrNo) { >>> Transaction tx = session.beginTransaction(); >>> RichSequence ret = null; >>> >>> String query; >>> >>> try { >>> if (chrNo.equals("MT")) { >>> query = "from BioEntry as be where >>> be.description like '%:num%'"; >>> query = query.replaceAll(":num", >>> "mitochondrion"); >>> } else { >>> query = "from BioEntry as be where >>> be.description like '%hromosome :num%'"; >>> query = query.replaceAll(":num", chrNo); >>> } >>> >>> Query q = session.createQuery(query); >>> >>> ret = (RichSequence) q.list().get(0); >>> tx.commit(); >>> } catch (Exception e) { >>> tx.rollback(); >>> e.printStackTrace(); >>> } >>> return ret; >>> } >>> <\code> >>> >>> I always have to load the whole chromsome to get a part of it, so it >>> takes >>> very long time and I get a lot of unused information (waste of memory). I >>> also tried to use ThinRichSequence<\code> instead of >>> RichSequence<\code>, but thereby I didn't notice any difference. >>> Can you give me a hint how to accelerate the code? >>> I am grateful for any hits. 
>>> >>> cheers, >>> Gabrielle >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >>> >> >> >> > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From gabrielle_doan at gmx.net Tue Oct 14 11:18:20 2008 From: gabrielle_doan at gmx.net (Gabrielle Doan) Date: Tue, 14 Oct 2008 13:18:20 +0200 Subject: [Biojava-l] Getting a part of a sequence In-Reply-To: References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net> Message-ID: <48F47FFC.4090607@gmx.net> Hi Richard, I have checked out the latest source and tried my code again. It still didn't work and I received the following new errors: <message> Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) at org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:612) at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) ...
3 more Caused by: org.hibernate.PropertyAccessException: Exception occurred inside setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) at org.hibernate.loader.Loader.doQuery(Loader.java:729) at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) at org.hibernate.loader.Loader.doList(Loader.java:2213) at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) at org.hibernate.loader.Loader.list(Loader.java:2099) at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) ... 8 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) ... 
21 more Caused by: java.lang.NullPointerException at org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) at org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) at org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) at org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) at org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) at org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) ... 25 more <\message> I think <code>BioSQLFeatureFilter.OverlapsRichLocation(rl)<\code> causes the problem I have. Can you help me solve this problem? I'm grateful for any hints. cheers, Gabrielle Richard Holland wrote: > This looks like a bug in BJX. I have just committed a fix that I think will > fix it to the head of subversion. Can you check out the latest source, > compile it, and try your program again? > > cheers, > Richard > > 2008/10/9 Gabrielle Doan > >> Hi Richard, >> >> thanks a lot for your mail. I have successfully retrieved the subsequence >> of a sequence as a String.
And now I try to get the features for a >> particular range with following code: >> >> >> public FeatureHolder filterFeature(String name, int startpos, int >> endpos) { >> RichLocation rl = new SimpleRichLocation(new >> SimplePosition(startpos), >> new SimplePosition(endpos), 0); >> BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( >> new >> BioSQLFeatureFilter.BySequenceName(name), >> new >> BioSQLFeatureFilter.OverlapsRichLocation(rl)); >> return filter(filter); >> } >> <\code> >> >> Fortunately I received these errors: >> >> Exception in thread "main" java.lang.RuntimeException: >> java.lang.reflect.InvocationTargetException >> at >> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >> at >> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >> at >> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) >> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >> Caused by: java.lang.reflect.InvocationTargetException >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at >> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >> ... 
3 more >> Caused by: org.hibernate.PropertyAccessException: Exception occurred inside >> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >> at >> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >> at >> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >> at >> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >> at >> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >> at >> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >> at >> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >> at >> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >> at org.hibernate.loader.Loader.doList(Loader.java:2213) >> at >> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >> at org.hibernate.loader.Loader.list(Loader.java:2099) >> at >> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >> ... 8 more >> Caused by: java.lang.reflect.InvocationTargetException >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at >> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >> ... 
21 more >> Caused by: java.lang.NullPointerException >> at >> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >> at >> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >> at >> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >> at >> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >> at >> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >> at >> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >> ... 26 more >> <\message> >> >> Why do I get these errors? >> BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. How >> can I find out the sequence name? Is it the value "name" in the table >> "Bioentry"? As the build-in subSequence method takes a long time I intend to >> get the subsequence as a String by myself and add the features to it. What >> do you think about this? >> >> I'm grateful for any hints. >> cheers, >> >> Gabrielle >> >> >> >> Richard Holland schrieb: >> >> Hello. >>> Your code is pretty good already - but you're right, it will load the >>> whole chromosome into memory before you can chop out the interesting >>> bit you actually need. >>> >>> As you observed, by using ThinRichSequence in your query it will load >>> only the initial shell of a sequence object to start with, but the >>> moment you try and sub-sequence it, it will immediately load the whole >>> sequence data into memory in order to perform the operation. >>> >>> If you only want the sequence data, as a string, you can do this by >>> specifying the sequence attribute in the query and bypassing the >>> sequence object entirely: >>> >>> select rs.stringSequence from Sequence as rs where rs.description >>> like '%hromosome :num% >>> >>> This will return a String instead of a RichSequence object. You can >>> use HQL operators to perform substrings etc. 
on the string inside the >>> query itself - see >>> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >>> , particularly section 14.9. >>> >>> If you only want the features, you can do this by using the >>> BioSQLFeatureFilter technique. In particular you will want the >>> BySequenceName filter, the And filter, and the OverlapsRichLocation >>> filter. You construct a filter then pass it to the filter() method in >>> BioSQLRichSequenceDB. The database will return to you all the >>> RichFeature objects that match your criteria. Note that it searches >>> the whole database so you really must use a BySequenceName filter at >>> the very least in order to make the results useful! >>> >>> However, you can't use HQL to construct a complete slice of a sequence >>> directly in the database before returning it to the program for use as >>> a ready-made RichSequence object. This would require Hibernate to know >>> what a BioJava sub-sequence object is and how it behaves in relation >>> to an 'unsliced' one, which is beyond the scope of it's job as a >>> persistence framework. >>> >>> cheers, >>> Richard >>> >>> >>> >>> 2008/10/7 Gabrielle Doan : >>> >>>> Hi all, >>>> I have a BioSQL database which contains all human chromosomes. My >>>> intention >>>> is to get the information about a particular gene. How can I get a part >>>> of a >>>> particular chromosome with all associated features? 
At the moment I use
>>>> following code to create my new sequence:
>>>>
>>>>
>>>> RichSequence subSeq = RichSequence.Tools.subSequence(parent,
>>>>         position[0], position[1], ns, geneName, parent.getAccession(),
>>>>         parent.getIdentifier(), parent.getVersion() + 1,
>>>>         (Double) (parent.getVersion() + 1.0));
>>>> <\code>
>>>>
>>>> Here is the part how I get the parent sequence:
>>>>
>>>> public static RichSequence getChromosome(String chrNo) {
>>>>     Transaction tx = session.beginTransaction();
>>>>     RichSequence ret = null;
>>>>
>>>>     String query;
>>>>
>>>>     try {
>>>>         if (chrNo.equals("MT")) {
>>>>             query = "from BioEntry as be where be.description like '%:num%'";
>>>>             query = query.replaceAll(":num", "mitochondrion");
>>>>         } else {
>>>>             query = "from BioEntry as be where be.description like '%hromosome :num%'";
>>>>             query = query.replaceAll(":num", chrNo);
>>>>         }
>>>>
>>>>         Query q = session.createQuery(query);
>>>>
>>>>         ret = (RichSequence) q.list().get(0);
>>>>         tx.commit();
>>>>     } catch (Exception e) {
>>>>         tx.rollback();
>>>>         e.printStackTrace();
>>>>     }
>>>>     return ret;
>>>> }
>>>> <\code>
>>>>
>>>> I always have to load the whole chromosome to get a part of it, so it takes
>>>> very long time and I get a lot of unused information (waste of memory). I
>>>> also tried to use ThinRichSequence<\code> instead of
>>>> RichSequence<\code>, but thereby I didn't notice any difference.
>>>> Can you give me a hint how to accelerate the code?
>>>> I am grateful for any hints.
>>>>
>>>> cheers,
>>>> Gabrielle
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>>
>>>
>>>
>
>

From holland at eaglegenomics.com Tue Oct 14 15:23:10 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 14 Oct 2008 16:23:10 +0100
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To: <48F47FFC.4090607@gmx.net>
References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net> <48F47FFC.4090607@gmx.net>
Message-ID:

Something's broken! At least from your stack trace I can see exactly what's going on.

The set of locations is being loaded for the feature, but Hibernate is not calling the setMin()/setMax() methods in each location before inserting them into the set. When they get added to the set of locations for the feature, they therefore get added with null for min and max. At any point when these locations are used, for instance when they are merged by the feature location setter, or anywhere else, you'll get NullPointerExceptions.

This is despite the fact that the HBM XML files are explicitly telling it _not_ to lazy-load them. Also this only happens when loading Features, and not when loading Sequence objects. As to why this happens, I honestly don't know!

What I suggest is that you create a temporary database with only one record in it, and run your test program against that to see what happens. If it still breaks, raise a bug on Bugzilla and post the GenBank dump of the database to Bugzilla along with your program code and the full stack trace. Someone with a bit more Hibernate knowledge than me might then be able to help out.

cheers,
Richard

2008/10/14 Gabrielle Doan
> Hi Richard,
> I have checked out the latest source and tried my code again.
It still > didn't work and I received following new errors: > > > Exception in thread "main" java.lang.RuntimeException: > java.lang.reflect.InvocationTargetException > at > org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) > at > org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) > at > org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:612) > at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) > ... 3 more > Caused by: org.hibernate.PropertyAccessException: Exception occurred inside > setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet > at > org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) > at > org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) > at > org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) > at > org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) > at > org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) > at > org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) > at org.hibernate.loader.Loader.doQuery(Loader.java:729) > at > org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) > at org.hibernate.loader.Loader.doList(Loader.java:2213) > at > org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) > at 
org.hibernate.loader.Loader.list(Loader.java:2099) > at > org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) > at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) > at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) > ... 8 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) > ... 21 more > Caused by: java.lang.NullPointerException > at > org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) > at > org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) > at > org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) > at > org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) > at > org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) > at > org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) > ... 25 more > <\message> > > I think BioSQLFeatureFilter.OverlapsRichLocation(rl) <\code> causes > the problem I have. Can you help me to solve this problem? > > I'm grateful for any hints. > cheers, > > Gabrielle > > > > Richard Holland schrieb: > >> This looks like a bug in BJX. I have just committed a fix that I think >> will >> fix it to the head of subversion. Can you check out the latest source, >> compile it, and try your program again? >> >> cheers, >> Richard >> >> 2008/10/9 Gabrielle Doan >> >> Hi Richard, >>> >>> thanks a lot for your mail. I have successfully retrieved the subsequence >>> of a sequence as a String. 
And now I try to get the features for a >>> particular range with following code: >>> >>> >>> public FeatureHolder filterFeature(String name, int startpos, int >>> endpos) { >>> RichLocation rl = new SimpleRichLocation(new >>> SimplePosition(startpos), >>> new SimplePosition(endpos), 0); >>> BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( >>> new >>> BioSQLFeatureFilter.BySequenceName(name), >>> new >>> BioSQLFeatureFilter.OverlapsRichLocation(rl)); >>> return filter(filter); >>> } >>> <\code> >>> >>> Fortunately I received these errors: >>> >>> Exception in thread "main" java.lang.RuntimeException: >>> java.lang.reflect.InvocationTargetException >>> at >>> >>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >>> at >>> >>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >>> at >>> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) >>> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >>> Caused by: java.lang.reflect.InvocationTargetException >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>> at >>> >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>> at java.lang.reflect.Method.invoke(Method.java:597) >>> at >>> >>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >>> ... 
3 more >>> Caused by: org.hibernate.PropertyAccessException: Exception occurred >>> inside >>> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >>> at >>> >>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >>> at >>> >>> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >>> at >>> >>> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >>> at >>> >>> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >>> at >>> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >>> at >>> >>> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >>> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >>> at >>> >>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >>> at org.hibernate.loader.Loader.doList(Loader.java:2213) >>> at >>> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >>> at org.hibernate.loader.Loader.list(Loader.java:2099) >>> at >>> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >>> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >>> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >>> ... 8 more >>> Caused by: java.lang.reflect.InvocationTargetException >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>> at >>> >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>> at java.lang.reflect.Method.invoke(Method.java:597) >>> at >>> >>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >>> ... 
21 more >>> Caused by: java.lang.NullPointerException >>> at >>> >>> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >>> at >>> >>> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >>> at >>> >>> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >>> at >>> >>> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >>> at >>> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >>> at >>> >>> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >>> ... 26 more >>> <\message> >>> >>> Why do I get these errors? >>> BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. >>> How >>> can I find out the sequence name? Is it the value "name" in the table >>> "Bioentry"? As the build-in subSequence method takes a long time I intend >>> to >>> get the subsequence as a String by myself and add the features to it. >>> What >>> do you think about this? >>> >>> I'm grateful for any hints. >>> cheers, >>> >>> Gabrielle >>> >>> >>> >>> Richard Holland schrieb: >>> >>> Hello. >>> >>>> Your code is pretty good already - but you're right, it will load the >>>> whole chromosome into memory before you can chop out the interesting >>>> bit you actually need. >>>> >>>> As you observed, by using ThinRichSequence in your query it will load >>>> only the initial shell of a sequence object to start with, but the >>>> moment you try and sub-sequence it, it will immediately load the whole >>>> sequence data into memory in order to perform the operation. >>>> >>>> If you only want the sequence data, as a string, you can do this by >>>> specifying the sequence attribute in the query and bypassing the >>>> sequence object entirely: >>>> >>>> select rs.stringSequence from Sequence as rs where rs.description >>>> like '%hromosome :num% >>>> >>>> This will return a String instead of a RichSequence object. 
You can >>>> use HQL operators to perform substrings etc. on the string inside the >>>> query itself - see >>>> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >>>> , particularly section 14.9. >>>> >>>> If you only want the features, you can do this by using the >>>> BioSQLFeatureFilter technique. In particular you will want the >>>> BySequenceName filter, the And filter, and the OverlapsRichLocation >>>> filter. You construct a filter then pass it to the filter() method in >>>> BioSQLRichSequenceDB. The database will return to you all the >>>> RichFeature objects that match your criteria. Note that it searches >>>> the whole database so you really must use a BySequenceName filter at >>>> the very least in order to make the results useful! >>>> >>>> However, you can't use HQL to construct a complete slice of a sequence >>>> directly in the database before returning it to the program for use as >>>> a ready-made RichSequence object. This would require Hibernate to know >>>> what a BioJava sub-sequence object is and how it behaves in relation >>>> to an 'unsliced' one, which is beyond the scope of it's job as a >>>> persistence framework. >>>> >>>> cheers, >>>> Richard >>>> >>>> >>>> >>>> 2008/10/7 Gabrielle Doan : >>>> >>>> Hi all, >>>>> I have a BioSQL database which contains all human chromosomes. My >>>>> intention >>>>> is to get the information about a particular gene. How can I get a part >>>>> of a >>>>> particular chromosome with all associated features? 
At the moment I use >>>>> following code to create my new sequence: >>>>> >>>>> >>>>> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >>>>> position[0], position[1], ns, geneName, parent.getAccession(), >>>>> parent.getIdentifier(), parent.getVersion() + 1, >>>>> (Double) (parent.getVersion() + 1.0)); >>>>> <\code> >>>>> >>>>> Here is the part how I get the parent sequence: >>>>> >>>>> public static RichSequence getChromosome(String chrNo) { >>>>> Transaction tx = session.beginTransaction(); >>>>> RichSequence ret = null; >>>>> >>>>> String query; >>>>> >>>>> try { >>>>> if (chrNo.equals("MT")) { >>>>> query = "from BioEntry as be where >>>>> be.description like '%:num%'"; >>>>> query = query.replaceAll(":num", >>>>> "mitochondrion"); >>>>> } else { >>>>> query = "from BioEntry as be where >>>>> be.description like '%hromosome :num%'"; >>>>> query = query.replaceAll(":num", chrNo); >>>>> } >>>>> >>>>> Query q = session.createQuery(query); >>>>> >>>>> ret = (RichSequence) q.list().get(0); >>>>> tx.commit(); >>>>> } catch (Exception e) { >>>>> tx.rollback(); >>>>> e.printStackTrace(); >>>>> } >>>>> return ret; >>>>> } >>>>> <\code> >>>>> >>>>> I always have to load the whole chromsome to get a part of it, so it >>>>> takes >>>>> very long time and I get a lot of unused information (waste of memory). >>>>> I >>>>> also tried to use ThinRichSequence<\code> instead of >>>>> RichSequence<\code>, but thereby I didn't notice any difference. >>>>> Can you give me a hint how to accelerate the code? >>>>> I am grateful for any hits. 
>>>>> >>>>> cheers, >>>>> Gabrielle >>>>> _______________________________________________ >>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>> >>>>> >>>>> >>>> >>>> >> >> > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From charles at imbusch.net Tue Oct 14 21:03:04 2008 From: charles at imbusch.net (Charles Imbusch) Date: Tue, 14 Oct 2008 23:03:04 +0200 Subject: [Biojava-l] parsing tblastn results Message-ID: <48F50908.5060307@imbusch.net> Hello, for a project I want to parse a tblastn result with BioJava. I used the code on http://biojava.org/wiki/BioJava:CookBook:Blast:Parser as it is and I get an error message as follows: Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -3 at java.lang.String.substring(String.java:1938) at java.lang.String.substring(String.java:1905) at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parseLine(BlastLikeAlignmentSAXParser.java:289) at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parse(BlastLikeAlignmentSAXParser.java:115) at org.biojava.bio.program.sax.HitSectionSAXParser.outputHSPInfo(HitSectionSAXParser.java:514) at org.biojava.bio.program.sax.HitSectionSAXParser.firstHSPEvent(HitSectionSAXParser.java:287) at org.biojava.bio.program.sax.HitSectionSAXParser.interpret(HitSectionSAXParser.java:251) at org.biojava.bio.program.sax.HitSectionSAXParser.parse(HitSectionSAXParser.java:118) at org.biojava.bio.program.sax.BlastSAXParser.hitsSectionReached(BlastSAXParser.java:635) at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:337) at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164) at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:313) at 
org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:276)
    at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:162)
    at BlastEcho.echo(BlastEcho.java:29)
    at BlastEcho.main(BlastEcho.java:75)

I uploaded the Blast output file I want to parse here:
http://charles.imbusch.net/tmp/blastresult.txt

Any answer is appreciated.

Cheers,
Charles

From ayates at ebi.ac.uk Wed Oct 15 08:07:35 2008
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 15 Oct 2008 09:07:35 +0100
Subject: [Biojava-l] ANN: EBI Course - Programmatic access in Java: webservices & work flows
Message-ID: <48F5A4C7.7010304@ebi.ac.uk>

Hi everyone,

Posting this here as it may be of interest to some people. The EBI is holding a course in accessing a large number of its resources from Java programs. The course will run from the 24th - 27th November being held on-site at the Hinxton Genome Campus. Resources being covered will include:

* Ontology Lookup Service - Offers access to multiple ontologies through a common interface
* PICR - A tool for going between identifier spaces for proteins
* UniProt
* IntAct
* ChEBI
* BioMart
* Integr8
* CiteXplore
* And many many more :)

If you are interested in any of these resources then please go to http://www.ebi.ac.uk/training/handson/course_081124_javawebservices.html . The course will cost you £75 for the 3 days.

All the best,
Andy Yates

From holland at eaglegenomics.com Wed Oct 15 08:13:18 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 15 Oct 2008 09:13:18 +0100
Subject: [Biojava-l] parsing tblastn results
In-Reply-To: <48F50908.5060307@imbusch.net>
References: <48F50908.5060307@imbusch.net>
Message-ID:

I've raised a bug report for you. Hopefully someone will take a look at it soon:

http://bugzilla.open-bio.org/show_bug.cgi?id=2617

cheers,
Richard

2008/10/14 Charles Imbusch
> Hello,
>
> for a project I want to parse a tblastn result with BioJava.
I used the code
> on http://biojava.org/wiki/BioJava:CookBook:Blast:Parser as it is and I get an
> error message as follows:
>
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -3
>     at java.lang.String.substring(String.java:1938)
>     at java.lang.String.substring(String.java:1905)
>     at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parseLine(BlastLikeAlignmentSAXParser.java:289)
>     at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parse(BlastLikeAlignmentSAXParser.java:115)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.outputHSPInfo(HitSectionSAXParser.java:514)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.firstHSPEvent(HitSectionSAXParser.java:287)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.interpret(HitSectionSAXParser.java:251)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.parse(HitSectionSAXParser.java:118)
>     at org.biojava.bio.program.sax.BlastSAXParser.hitsSectionReached(BlastSAXParser.java:635)
>     at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:337)
>     at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164)
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:313)
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:276)
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:162)
>     at BlastEcho.echo(BlastEcho.java:29)
>     at BlastEcho.main(BlastEcho.java:75)
>
> I uploaded the Blast output file I want to parse here:
> http://charles.imbusch.net/tmp/blastresult.txt
>
> Any answer is appreciated.
>
> Cheers,
> Charles
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From dtoomey at rcsi.ie Wed Oct 15 09:46:58 2008
From: dtoomey at rcsi.ie (David Toomey)
Date: Wed, 15 Oct 2008 10:46:58 +0100
Subject: [Biojava-l] parsing tblastn results
References: <48F50908.5060307@imbusch.net>
Message-ID:

Hi Richard

This looks suspiciously like a bug I raised a couple of weeks ago. I was parsing blastp results but the stack trace is the same.

http://bugzilla.open-bio.org/show_bug.cgi?id=2603

Charles, I have updated the original bug with a hack which at least allows you to parse the result and get an output. You just need to recompile the source code with the modified BlastLikeAlignmentSAXParser.java. Not ideal, but at least you will be able to run your code until the source is fixed.

Cheers
Dave

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Richard Holland
Sent: 15 October 2008 09:13
To: Charles Imbusch
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] parsing tblastn results

I've raised a bug report for you. Hopefully someone will take a look at it soon:

http://bugzilla.open-bio.org/show_bug.cgi?id=2617

cheers,
Richard

2008/10/14 Charles Imbusch
> Hello,
>
> for a project I want to parse a tblastn result with BioJava.
I used the code
> on http://biojava.org/wiki/BioJava:CookBook:Blast:Parser as it is and I get an
> error message as follows:
>
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -3
>     at java.lang.String.substring(String.java:1938)
>     at java.lang.String.substring(String.java:1905)
>     at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parseLine(BlastLikeAlignmentSAXParser.java:289)
>     at org.biojava.bio.program.sax.BlastLikeAlignmentSAXParser.parse(BlastLikeAlignmentSAXParser.java:115)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.outputHSPInfo(HitSectionSAXParser.java:514)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.firstHSPEvent(HitSectionSAXParser.java:287)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.interpret(HitSectionSAXParser.java:251)
>     at org.biojava.bio.program.sax.HitSectionSAXParser.parse(HitSectionSAXParser.java:118)
>     at org.biojava.bio.program.sax.BlastSAXParser.hitsSectionReached(BlastSAXParser.java:635)
>     at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:337)
>     at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164)
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:313)
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:276)
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:162)
>     at BlastEcho.echo(BlastEcho.java:29)
>     at BlastEcho.main(BlastEcho.java:75)
>
> I uploaded the Blast output file I want to parse here:
> http://charles.imbusch.net/tmp/blastresult.txt
>
> Any answer is appreciated.
>
> Cheers,
> Charles
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From gabrielle_doan at gmx.net Wed Oct 15 13:15:39 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Wed, 15 Oct 2008 15:15:39 +0200
Subject: [Biojava-l] Getting a part of a sequence
In-Reply-To: <381a3e850810142152p4e0a0c2ds80a74570b44f2be0@mail.gmail.com>
References: <48EB71A4.70409@gmx.net> <48EDF769.8050901@gmx.net> <48F47FFC.4090607@gmx.net> <381a3e850810140928p4af06cf4r3dfd08908efd42f6@mail.gmail.com> <48F4C99E.6070007@gmx.net> <381a3e850810142152p4e0a0c2ds80a74570b44f2be0@mail.gmail.com>
Message-ID: <48F5ECFB.6040703@gmx.net>

Hi Augusto,

I've inserted your files into BJX. Unfortunately it hasn't solved my problems. Maybe Richard has another idea how to handle it.

Best regards,
Gabrielle

Augusto Fernandes Vellozo wrote:
> Hi Gabrielle,
> Please, let me know if the results are ok or not.
> I remember, when I made the corrections, I didn't see the case with
> circularLength, because for my use case it doesn't matter and because
> I don't understand exactly what this is. Take care, if you have this
> use case.
>
> Cheers,
>
> Augusto
>
> 2008/10/14 Gabrielle Doan :
>> Hi Augusto,
>>
>> thank you so much. I hope this will be the solution to my problem.
>>
>> cheers,
>> Gabrielle
>>
>> Augusto Fernandes Vellozo wrote:
>>> Hi Gabrielle,
>>> I had some problems with the class Location and I modified some
>>> classes in my machine. I've already written to Richard.
>>> The classes modified are attached.
>>> These could help you.
>>> >>> Good luck, >>> >>> Augusto >>> >>> 2008/10/14 Gabrielle Doan : >>>> Hi Richard, >>>> I have checked out the latest source and tried my code again. It still >>>> didn't work and I received following new errors: >>>> >>>> >>>> Exception in thread "main" java.lang.RuntimeException: >>>> java.lang.reflect.InvocationTargetException >>>> at >>>> >>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >>>> at >>>> >>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >>>> at >>>> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:612) >>>> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >>>> Caused by: java.lang.reflect.InvocationTargetException >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> >>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>> at >>>> >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>> at >>>> >>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >>>> ... 
3 more >>>> Caused by: org.hibernate.PropertyAccessException: Exception occurred >>>> inside >>>> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >>>> at >>>> >>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >>>> at >>>> >>>> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >>>> at >>>> >>>> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >>>> at >>>> >>>> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >>>> at >>>> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >>>> at >>>> >>>> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >>>> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >>>> at >>>> >>>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >>>> at org.hibernate.loader.Loader.doList(Loader.java:2213) >>>> at >>>> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >>>> at org.hibernate.loader.Loader.list(Loader.java:2099) >>>> at >>>> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >>>> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >>>> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >>>> ... 8 more >>>> Caused by: java.lang.reflect.InvocationTargetException >>>> at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) >>>> at >>>> >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>> at >>>> >>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >>>> ... 
21 more >>>> Caused by: java.lang.NullPointerException >>>> at >>>> >>>> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >>>> at >>>> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >>>> at >>>> >>>> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >>>> ... 25 more >>>> <\message> >>>> >>>> I think BioSQLFeatureFilter.OverlapsRichLocation(rl) <\code> >>>> causes >>>> the problem I have. Can you help me to solve this problem? >>>> >>>> I'm grateful for any hints. >>>> cheers, >>>> >>>> Gabrielle >>>> >>>> >>>> >>>> Richard Holland schrieb: >>>>> This looks like a bug in BJX. I have just committed a fix that I think >>>>> will >>>>> fix it to the head of subversion. Can you check out the latest source, >>>>> compile it, and try your program again? >>>>> >>>>> cheers, >>>>> Richard >>>>> >>>>> 2008/10/9 Gabrielle Doan >>>>> >>>>>> Hi Richard, >>>>>> >>>>>> thanks a lot for your mail. I have successfully retrieved the >>>>>> subsequence >>>>>> of a sequence as a String. 
And now I try to get the features for a >>>>>> particular range with following code: >>>>>> >>>>>> >>>>>> public FeatureHolder filterFeature(String name, int startpos, int >>>>>> endpos) { >>>>>> RichLocation rl = new SimpleRichLocation(new >>>>>> SimplePosition(startpos), >>>>>> new SimplePosition(endpos), 0); >>>>>> BioSQLFeatureFilter filter = new BioSQLFeatureFilter.And( >>>>>> new >>>>>> BioSQLFeatureFilter.BySequenceName(name), >>>>>> new >>>>>> BioSQLFeatureFilter.OverlapsRichLocation(rl)); >>>>>> return filter(filter); >>>>>> } >>>>>> <\code> >>>>>> >>>>>> Fortunately I received these errors: >>>>>> >>>>>> Exception in thread "main" java.lang.RuntimeException: >>>>>> java.lang.reflect.InvocationTargetException >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:143) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.filter(BioSQLRichSequenceDB.java:151) >>>>>> at >>>>>> org.sequence_viewer.db.HBioSQLDB.filterFeature(HBioSQLDB.java:599) >>>>>> at org.sequence_viewer.db.AbfragenTest.main(AbfragenTest.java:56) >>>>>> Caused by: java.lang.reflect.InvocationTargetException >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.db.biosql.BioSQLRichSequenceDB.processFeatureFilter(BioSQLRichSequenceDB.java:138) >>>>>> ... 
3 more >>>>>> Caused by: org.hibernate.PropertyAccessException: Exception occurred >>>>>> inside >>>>>> setter of org.biojavax.bio.seq.SimpleRichFeature.locationSet >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:65) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3571) >>>>>> at >>>>>> >>>>>> org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:133) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854) >>>>>> at org.hibernate.loader.Loader.doQuery(Loader.java:729) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) >>>>>> at org.hibernate.loader.Loader.doList(Loader.java:2213) >>>>>> at >>>>>> org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) >>>>>> at org.hibernate.loader.Loader.list(Loader.java:2099) >>>>>> at >>>>>> >>>>>> org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94) >>>>>> at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569) >>>>>> at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283) >>>>>> ... 
8 more >>>>>> Caused by: java.lang.reflect.InvocationTargetException >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>>>> at >>>>>> >>>>>> >>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>> at >>>>>> >>>>>> >>>>>> org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42) >>>>>> ... 21 more >>>>>> Caused by: java.lang.NullPointerException >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.PositionResolver$AverageResolver.getMin(PositionResolver.java:103) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichLocation.getMin(SimpleRichLocation.java:323) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichLocation.overlaps(SimpleRichLocation.java:451) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichLocation.union(SimpleRichLocation.java:469) >>>>>> at >>>>>> org.biojavax.bio.seq.RichLocation$Tools.merge(RichLocation.java:363) >>>>>> at >>>>>> >>>>>> >>>>>> org.biojavax.bio.seq.SimpleRichFeature.setLocationSet(SimpleRichFeature.java:181) >>>>>> ... 26 more >>>>>> <\message> >>>>>> >>>>>> Why do I get these errors? >>>>>> BioSQLFeatureFilter.BySequenceName(name) needs a seqName as parameter. >>>>>> How >>>>>> can I find out the sequence name? Is it the value "name" in the table >>>>>> "Bioentry"? As the build-in subSequence method takes a long time I >>>>>> intend >>>>>> to >>>>>> get the subsequence as a String by myself and add the features to it. >>>>>> What >>>>>> do you think about this? >>>>>> >>>>>> I'm grateful for any hints. >>>>>> cheers, >>>>>> >>>>>> Gabrielle >>>>>> >>>>>> >>>>>> >>>>>> Richard Holland schrieb: >>>>>> >>>>>> Hello. 
>>>>>>> Your code is pretty good already - but you're right, it will load the >>>>>>> whole chromosome into memory before you can chop out the interesting >>>>>>> bit you actually need. >>>>>>> >>>>>>> As you observed, by using ThinRichSequence in your query it will load >>>>>>> only the initial shell of a sequence object to start with, but the >>>>>>> moment you try and sub-sequence it, it will immediately load the whole >>>>>>> sequence data into memory in order to perform the operation. >>>>>>> >>>>>>> If you only want the sequence data, as a string, you can do this by >>>>>>> specifying the sequence attribute in the query and bypassing the >>>>>>> sequence object entirely: >>>>>>> >>>>>>> select rs.stringSequence from Sequence as rs where rs.description >>>>>>> like '%hromosome :num% >>>>>>> >>>>>>> This will return a String instead of a RichSequence object. You can >>>>>>> use HQL operators to perform substrings etc. on the string inside the >>>>>>> query itself - see >>>>>>> >>>>>>> http://docs.huihoo.com/hibernate/hibernate-reference-3.2.1/queryhql.html >>>>>>> , particularly section 14.9. >>>>>>> >>>>>>> If you only want the features, you can do this by using the >>>>>>> BioSQLFeatureFilter technique. In particular you will want the >>>>>>> BySequenceName filter, the And filter, and the OverlapsRichLocation >>>>>>> filter. You construct a filter then pass it to the filter() method in >>>>>>> BioSQLRichSequenceDB. The database will return to you all the >>>>>>> RichFeature objects that match your criteria. Note that it searches >>>>>>> the whole database so you really must use a BySequenceName filter at >>>>>>> the very least in order to make the results useful! >>>>>>> >>>>>>> However, you can't use HQL to construct a complete slice of a sequence >>>>>>> directly in the database before returning it to the program for use as >>>>>>> a ready-made RichSequence object. 
This would require Hibernate to know >>>>>>> what a BioJava sub-sequence object is and how it behaves in relation >>>>>>> to an 'unsliced' one, which is beyond the scope of it's job as a >>>>>>> persistence framework. >>>>>>> >>>>>>> cheers, >>>>>>> Richard >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2008/10/7 Gabrielle Doan : >>>>>>> >>>>>>>> Hi all, >>>>>>>> I have a BioSQL database which contains all human chromosomes. My >>>>>>>> intention >>>>>>>> is to get the information about a particular gene. How can I get a >>>>>>>> part >>>>>>>> of a >>>>>>>> particular chromosome with all associated features? At the moment I >>>>>>>> use >>>>>>>> following code to create my new sequence: >>>>>>>> >>>>>>>> >>>>>>>> RichSequence subSeq = RichSequence.Tools.subSequence(parent, >>>>>>>> position[0], position[1], ns, geneName, parent.getAccession(), >>>>>>>> parent.getIdentifier(), parent.getVersion() + 1, >>>>>>>> (Double) (parent.getVersion() + 1.0)); >>>>>>>> <\code> >>>>>>>> >>>>>>>> Here is the part how I get the parent sequence: >>>>>>>> >>>>>>>> public static RichSequence getChromosome(String chrNo) { >>>>>>>> Transaction tx = session.beginTransaction(); >>>>>>>> RichSequence ret = null; >>>>>>>> >>>>>>>> String query; >>>>>>>> >>>>>>>> try { >>>>>>>> if (chrNo.equals("MT")) { >>>>>>>> query = "from BioEntry as be where >>>>>>>> be.description like '%:num%'"; >>>>>>>> query = query.replaceAll(":num", >>>>>>>> "mitochondrion"); >>>>>>>> } else { >>>>>>>> query = "from BioEntry as be where >>>>>>>> be.description like '%hromosome :num%'"; >>>>>>>> query = query.replaceAll(":num", chrNo); >>>>>>>> } >>>>>>>> >>>>>>>> Query q = session.createQuery(query); >>>>>>>> >>>>>>>> ret = (RichSequence) q.list().get(0); >>>>>>>> tx.commit(); >>>>>>>> } catch (Exception e) { >>>>>>>> tx.rollback(); >>>>>>>> e.printStackTrace(); >>>>>>>> } >>>>>>>> return ret; >>>>>>>> } >>>>>>>> <\code> >>>>>>>> >>>>>>>> I always have to load the whole chromsome to get a part of it, so it >>>>>>>> takes 
>>>>>>>> very long time and I get a lot of unused information (waste of >>>>>>>> memory). >>>>>>>> I >>>>>>>> also tried to use ThinRichSequence<\code> instead of >>>>>>>> RichSequence<\code>, but thereby I didn't notice any >>>>>>>> difference. >>>>>>>> Can you give me a hint how to accelerate the code? >>>>>>>> I am grateful for any hits. >>>>>>>> >>>>>>>> cheers, >>>>>>>> Gabrielle >>>>>>>> _______________________________________________ >>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>> >>>>>>>> >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >>> >>> >> > > > From holland at eaglegenomics.com Mon Oct 20 00:18:29 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 20 Oct 2008 01:18:29 +0100 Subject: [Biojava-l] BioJava 3 Begins - Volunteers please! Message-ID: Hi all, I've just committed some new code to the biojava3 branch of the biojava-live subversion repository. It's the foundations of a brand new alphabet+symbol set of classes, and an example of how to use them to represent DNA. You'll notice that the new code is very lightweight and allows for a lot more flexibility than the old code - for instance, the concept of Alphabet has changed radically. It also makes much more extensive use of the Collections API. I haven't got any test cases or usage examples yet but give me a shout if you don't understand the code and I'll explain how it works. (Hint: SymbolFormat is there to convert Strings into SymbolList objects, and vice versa). So, now we want some volunteers! We're starting from scratch here so there's a lot of work to do. The whole of BioJava needs 'translating' into BJ3, whether it be copy-and-paste existing classes and modify them to suit the new style, or write completely new ones to provide equivalent functionality. 
I'll post an example of how to do file parsing soon, probably starting with FASTA. In the meantime, a good place to start would be for people to design object models to represent their favourite data types (e.g. Genbank, or microarray data). Utility classes to manipulate those objects would be great too. The object models need to be normalised as much as possible - e.g. if your data has a lot of comments, and the order of those comments is important, then give your object model a collection of comment objects. The object model for each data type should be completely independent and use basic data types wherever possible (e.g. store sequences as strings, don't attempt to parse them into anything fancy like SymbolLists). The closer the object model is to the original data format, the better. There's going to be clever tricks when it comes to converting data between different object models (e.g. Genbank to INSDSeq), which I will explain later when I put the file parsing examples up. You'll notice how the biojava3 branch uses Maven instead of Ant. This is because we want to make it as modular as possible, so if you want to write microarray stuff, create a new microarray sub-project (as per the dna example that's already there). This way if someone only wants the microarray bit of BJ3, they only need install the appropriate JAR file and can ignore the rest. (The 'core' module is for stuff that is so generic it could be used anywhere, or is used in every single other module.) If coding isn't your cup of tea, then we would very much welcome testers (particularly those who enjoy writing test cases!), documenters (particularly code commenters), translators (for internationalisation of the code), and of course all those who wish to contribute ideas and suggestions no matter how off-the-wall they might be. In particular if you'd like to take charge of an area of the development process, e.g. Documentation Chief, or Protein Champion, then that would be much appreciated. 
I'm very much looking forward to working with everyone on this. Good luck, and happy coding! cheers, Richard PS. Please don't forget to attach the appropriate licence to your code. You can copy-and-paste it from the existing classes I just committed this evening. PPS. For those who are worried about backwards compatibility - this was discussed on the lists a while back and it was made clear that BJ3 is a clean break. However, the existing code will continue to be maintained and bugfixed for a couple of years so you don't have to upgrade if you don't want to - it just won't have any new features developed for it. This is largely because it'll probably take just that long to write all the new BJ3 code. When we do decide to desupport the existing BJ code, plenty of notice will be given (i.e. years as opposed to months). -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Mon Oct 20 17:52:08 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 20 Oct 2008 18:52:08 +0100 Subject: [Biojava-l] File parsing in BJ3 Message-ID: (From now on I will only be posting these development messages to biojava-dev, which is the intended purpose of that list. Those of you who wish to keep track of things but are currently only subscribed to biojava-l should also subscribe to biojava-dev in order to keep up to date.) As promised, I've committed a new package in the biojava-core module that should help understand how to do file parsing and conversion and writing in the new BJ3 modules. Here's an example of how to use it to write a Genbank parser (note no parsers actually exist yet!): 1. Design yourself a Genbank class which implements the interface Thing and can fully represent all the data that might possibly occur inside a Genbank file. 2. 
Write an interface called GenbankReceiver, which extends ThingReceiver and defines all the methods you might need in order to construct a Genbank object in an asynchronous fashion.

3. Write a GenbankBuilder class which implements GenbankReceiver and ThingBuilder. Its job is to receive data via method calls, use that data to construct a Genbank object, then provide that object on demand.

4. Write a GenbankWriter class which implements GenbankReceiver and ThingWriter. Its job is similar to GenbankBuilder, but instead of constructing new Genbank objects, it writes Genbank records to file that reflect the data it receives.

5. Write a GenbankReader class which implements ThingReader. It can read Genbank files and output the data to the methods of the ThingReceiver provided to it, which in this case could be anything which implements the interface GenbankReceiver.

6. Write a GenbankEmitter class which implements ThingEmitter. It takes a Genbank object and will fire off data from it to the provided ThingReceiver (a GenbankReceiver instance) as if the Genbank object was being read from a file or some other source.

That's it! OK so it's a minimum of 6 classes instead of the original 1 or 2, but the additional steps are necessary for flexibility in converting between formats.

Now to use it (you'll probably want a GenbankTools class to wrap these steps up for user-friendliness, including various options for opening files, etc.):

1. To read a file - instantiate ThingParser with your GenbankReader as the reader, and GenbankBuilder as the receiver. Use the iterator methods on ThingParser to get the objects out.

2. To write a file - instantiate ThingParser with a GenbankEmitter wrapping your Genbank object, and a GenbankWriter as the receiver. Use the parseAll() method on the ThingParser to dump the whole lot to your chosen output.

The clever bit comes when you want to convert between files. Imagine you've done all the above for Genbank, and you've also done it for FASTA.
How to convert between them? What you need to do is this:

1. Implement all the classes for both Genbank and FASTA.

2. Write a GenbankFASTAConverter class that implements ThingConverter and GenbankReceiver, and will internally convert the data received and pass it on out to the receiver provided, which will be a FASTAReceiver instance.

3. Write a FASTAGenbankConverter class that operates in exactly the opposite way, implementing ThingConverter and FASTAReceiver.

Then to convert you use ThingParser again:

1. From FASTA file to Genbank object: Instantiate ThingParser with a FASTAReader reader, a GenbankBuilder receiver, and add a FASTAGenbankConverter instance to the converter chain. Use the iterator to get your Genbank objects out of your FASTA file.

2. From FASTA file to Genbank file: Same as option 1, but provide a GenbankWriter instead and use parseAll() instead of the iterator methods.

3. From FASTA object to Genbank object: Same as option 1, but provide a FASTAEmitter wrapping your FASTA object as the reader instead.

4. From FASTA object to Genbank file: Same as option 1, but swap both the reader and the receiver as per options 2 and 3.

5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions of FASTA and Genbank, and use GenbankFASTAConverter instead.

One last and very important feature of this approach is that if you discover that nobody has written the appropriate converter for your chosen pair of formats A and C, but converters do exist to map A to some other format B and that other format B on to C, then you can just put the two converters A-B and B-C into the ThingParser chain and it'll work perfectly.

Enjoy!
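To make the reader / converter / builder chain above a little more concrete, here is a minimal sketch of the FASTA file to Genbank object case. It is not the committed biojava-core code: the interfaces are cut-down stand-ins, every method name on them (startThing, header, locus, origin and so on) is an assumption made for illustration, and ThingParser is replaced by a direct call so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the BJ3 interfaces named above; the real
// signatures in biojava-core may well differ.
interface Thing {}

interface ThingReceiver {
    void startThing();
    void endThing();
}

// Format-specific receivers declare the events a reader may fire.
interface FASTAReceiver extends ThingReceiver {
    void header(String header);
    void sequence(String residues);
}

interface GenbankReceiver extends ThingReceiver {
    void locus(String name);
    void origin(String residues);
}

// Plain data object kept close to the file format (sequence as a String).
final class Genbank implements Thing {
    String locus;
    String origin;
}

// Builder: turns receiver events into Genbank objects and hands them out.
final class GenbankBuilder implements GenbankReceiver {
    private final List<Genbank> built = new ArrayList<>();
    private Genbank current;

    public void startThing() { current = new Genbank(); }
    public void locus(String name) { current.locus = name; }
    public void origin(String residues) { current.origin = residues; }
    public void endThing() { built.add(current); current = null; }

    List<Genbank> getThings() { return built; }
}

// Converter: receives FASTA events and forwards equivalent Genbank events.
final class FASTAGenbankConverter implements FASTAReceiver {
    private final GenbankReceiver out;

    FASTAGenbankConverter(GenbankReceiver out) { this.out = out; }

    public void startThing() { out.startThing(); }
    // Use the first whitespace-delimited token of the header as the locus name.
    public void header(String header) { out.locus(header.split("\\s+")[0]); }
    // Genbank ORIGIN blocks are conventionally lower case.
    public void sequence(String residues) { out.origin(residues.toLowerCase()); }
    public void endThing() { out.endThing(); }
}

// Reader: parses FASTA text and fires events at whatever receiver it is given.
final class FASTAReader {
    static void read(String text, FASTAReceiver receiver) {
        boolean open = false;
        StringBuilder seq = new StringBuilder();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                // A new header closes the previous record, if any.
                if (open) { receiver.sequence(seq.toString()); receiver.endThing(); }
                receiver.startThing();
                receiver.header(line.substring(1));
                seq.setLength(0);
                open = true;
            } else {
                seq.append(line);
            }
        }
        if (open) { receiver.sequence(seq.toString()); receiver.endThing(); }
    }
}

final class ConversionSketch {
    // Reader -> converter -> builder, i.e. "FASTA file to Genbank object".
    static List<Genbank> fastaToGenbank(String fasta) {
        GenbankBuilder builder = new GenbankBuilder();
        FASTAReader.read(fasta, new FASTAGenbankConverter(builder));
        return builder.getThings();
    }

    public static void main(String[] args) {
        for (Genbank g : fastaToGenbank(">seq1 test\nACGT\nTT\n>seq2\nGG\n")) {
            System.out.println(g.locus + " " + g.origin);
        }
    }
}
```

Because the reader only knows about FASTAReceiver, swapping the builder for a writer, or putting a second converter behind the first, should not require touching the reader at all, which is the point of splitting things into this many classes.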
cheers,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From markjschreiber at gmail.com Tue Oct 21 02:54:27 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 10:54:27 +0800
Subject: [Biojava-l] Biojava / BioSQL entity beans
Message-ID: <93b45ca50810201954k44ab0f65xb94a0214d8eb4e13@mail.gmail.com>

Hi -

Richard has kindly uploaded some JPA Entity beans that map to the BioSQL database schema as a BioSQL module for BJ3.

These entity beans were generated as part of the Tokyo webservices workshop. As Entities they are useful as POJOs as well as for data transfer via JPA and JAXB, and can be used in EJB containers or a plain old JVM. They have no biological smarts and the intention was/is that these will be provided by wrapping them in Bio-aware (and more thread safe) wrappers that implement interfaces from other BJ3 modules. In essence it is a persistence layer.

The following is copied verbatim from the package-info.java and gives you some idea of how I intend the package to be used (obviously some of this is still to come). There is also some discussion of some of the gotchas that might trip you up when playing with object relational persistence.

BTW the naming convention is to call something FooEntity. Where BioSQL requires a compound primary key this is implemented as an Embeddable object called FooEntityPK which is the key for FooEntity. The other thing you may see is FooEntityUK which is the same concept but represents some of the cases where BioSQL tables don't have a primary key (even a compound one) but implicitly they do because all the fields have the SQL unique restriction. In these cases JPA still requires an Embeddable key to track updates. As far as Java is concerned they are the same as a FooEntityPK but I used a different name to make the distinction.

The annotations provide mapping to tables from a Derby database.
This is the reference Java in-memory DB which can run from any JVM and is also found in GlassFish. The mappings will likely also work with MySQL. For Oracle (and possibly others) you would need to override the @GeneratedValue strategy for generating primary keys. I believe this can be done with external XML config files. You may also wish to override the default eager loading and cascade annotations depending on your JPA persistence method and preferences.

This has been lightly tested using GlassFish, Derby and TopLink Essentials and is a work in progress but seems to work OK.

Best regards,

- Mark

/**
 * The package contains Entity representations of BioJava classes.
 * The purpose of these entities is to allow simple serialization of BioJava data
 * using binary serialization for protocols that require this (e.g. RPC between
 * Java application servers) as well as persistence mechanisms that require bean
 * like objects such as the Java Persistence API (JPA) or the
 * Java Architecture for XML Binding (JAXB). For this reason all objects in this package
 * should provide a parameterless public constructor and public get/set methods
 * for relevant fields.
 *

 * Given the public nature of the constructors and the setters in these beans
 * these classes are not intended for direct use in general programming when
 * using the BioJava v3 API. This is because it is possible to leave the bean in
 * an inconsistent state and they are not thread safe unless synchronization
 * is controlled externally (via synchronization blocks or via an application container).
 *

 * The Entities are intended to back other objects that a
 * programmer will interact with directly. For example Foo.class will be backed
 * by FooEntity.class. Generally interaction with Foo.class is to be preferred and
 * will often be more sensible as the entities typically provide no 'biological
 * behaviour'. Relevant behaviour should be provided by the wrapping class. It is best
 * to think of Foo as a view onto the data that is held in the
 * FooEntity. A good example is the sophisticated Symbol
 * behaviour that can represent biological logic about IUPAC ambiguity symbols.
 * For example a 'w' in a Biosequence represents an ambiguity between 'a' and 't',
 * whereas a 'w' in BiosequenceEntity is simply a 'w' and nothing else.
 *

 * The wrapper entity pattern is intended to allow for a lot of the advanced
 * behaviour in the original BioJava while also allowing use of modern transport
 * and persistence packages. This is achieved by persisting and transporting the
 * entity without the wrapper and re-wrapping it at the other end.
 *

 * Currently BioJava v3 uses annotated @Id fields to define
 * equals(Object o). Consistent definition is critical to how
 * the object will behave when persisted to a database. In the case of:
 *

 * Foo f = ... initialize
 * Foo fo = ... initialize
 * boolean b = f.equals(fo);
 * 
 * b would be true if both objects share the same value
 * (or embeddable object) in the field that represents the primary key in the
 * database even if all other fields differ. This is desirable because
 * two entities representing the same DB record may be retrieved from two different
 * sessions. Additionally these are the identity fields, so logically, they should map to
 * the concept of identity. Finally, searching a collection is made very simple
 * without requiring an iterator:
 *
 * Integer id = //code to initialize
 * collection.contains(new Foo(id));
 * 
 * By default BioJava v3 entities use only the primary key field for equality.
 * If either record has null as the primary key value it is never equal
 * to another. When implementing equals(Object o) it is not advisable to perform
 * the test this.getClass() == o.getClass() because of the possibility of proxy
 * classes used in JPA. This can, however, lead to an issue with the
 * hashCode() method. Consider the following code:
 *
 * Foo foo = new Foo(); // no primary key
 * HashSet set = new HashSet();
 * set.add(foo);
 * // code here to persist Foo and consequently generate its PK
 * boolean b = set.contains(foo);
 * 
 * Because only the PK is used for equality, the PK is also used in the hash code.
 * This means that b is probably going to be false because
 * it would have been stored in a hash bucket using the old hash code that will
 * now be different even though the set actually does contain a pointer to foo.
 * Although a potential deficiency it is unlikely to be a major problem for
 * BioJava v3 developers because using entity backed objects is preferred to direct
 * interaction with entities. If you need to use entities directly then use hashed
 * collections with caution.
 *
 *
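The hash bucket problem described above can be reproduced without any persistence provider. The following is a minimal sketch under stated assumptions: FooEntity here is a hypothetical stand-in (not one of the BioSQL entities), its identity is the primary key alone, and a plain setter call stands in for JPA assigning the generated key.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical entity: identity is the primary key only, and an entity
// with a null key is equal to nothing except itself.
final class FooEntity {
    private Integer id; // assigned when the row is persisted

    Integer getId() { return id; }
    void setId(Integer id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        // instanceof rather than getClass() so that proxy subclasses still match
        if (!(o instanceof FooEntity)) return false;
        FooEntity other = (FooEntity) o;
        return id != null && id.equals(other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(id); // 0 while id is null
    }
}

final class HashPitfallSketch {
    // Returns { contains-before-persist, contains-after-persist }.
    static boolean[] demo() {
        FooEntity foo = new FooEntity();
        Set<FooEntity> set = new HashSet<>();
        set.add(foo);                      // stored under the hash of a null id
        boolean before = set.contains(foo);
        foo.setId(42);                     // stands in for the provider generating the PK
        boolean after = set.contains(foo); // looked up under the hash of 42
        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("before persist: " + r[0] + ", after persist: " + r[1]);
    }
}
```

One way round this, if entities really must live in hashed collections, is to add them only after the key has been assigned; whether that is practical depends on how the surrounding code obtains its entities.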

 * Wrapper classes can either delegate their equals call to the underlying
 * entity or do something that is more biologically sensible
 * (as PK values are typically not exposed in the wrapper). It is probably more
 * sensible for a wrapper to define its own equals (and hashCode)
 * implementations due to the limitations of the default @Id based system
 * described above, especially the potential hashCode problems.
 *
 * For example FooSequence.class might want to base
 * equality on the exact match of the DNA sequence it holds even though
 * FooSequenceEntity.class may only use the PK field. Whether or not delegation
 * is used, it should be clearly documented.
 *

* *

 * @author Mark Schreiber
 */
package org.biojava.biosql.entity;

From markjschreiber at gmail.com Tue Oct 21 03:16:51 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 11:16:51 +0800
Subject: [Biojava-l] File parsing in BJ3
In-Reply-To:
References:
Message-ID: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com>

So if I want to build a BioSQL loader from Genbank then would the classes (or their wrappers) in the BioSQL Entity package need to implement Thing? Would Maven have an issue with that or would it just create a dependency on core? (You can tell I've never used Maven, right?)

From a design point of view should Thing be an interface or an Annotation? The reason I ask is that it doesn't define any methods so it is more of a tag than an interface.

Anyway, my understanding is that I would use a Genbank parser (or write one), write an EntityReceiver interface (probably more than one given the number of entities in BioSQL), and implement an EntityBuilder (again possibly more than one) that implements EntityReceiver and builds Entity beans from the messages it receives. In this case I probably wouldn't provide a writer as JPA would be writing the beans to the database.

Would this be how you imagine it?

- Mark

On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland wrote:
> (From now on I will only be posting these development messages to
> biojava-dev, which is the intended purpose of that list. Those of you who
> wish to keep track of things but are currently only subscribed to biojava-l
> should also subscribe to biojava-dev in order to keep up to date.)
>
> As promised, I've committed a new package in the biojava-core module that
> should help understand how to do file parsing and conversion and writing in
> the new BJ3 modules. Here's an example of how to use it to write a Genbank
> parser (note no parsers actually exist yet!):
>
> 1.
Design yourself a Genbank class which implements the interface Thing and > can fully represent all the data that might possibly occur inside a Genbank > file. > > 2. Write an interface called GenbankReceiver, which extends ThingReceiver > and defines all the methods you might need in order to construct a Genbank > object in an asynchronous fashion. > > 3. Write a GenbankBuilder class which implements GenbankReceiver and > ThingBuilder. It's job is to receive data via method calls, use that data to > construct a Genbank object, then provide that object on demand. > > 4. Write a GenbankWriter class which implements GenbankReceiver and > ThingWriter. It's job is similar to GenbankBuilder, but instead of > constructing new Genbank objects, it writes Genbank records to file that > reflect the data it receives. > > 5. Write a GenbankReader class which implements ThingReader. It can read > GenbankFiles and output the data to the methods of the ThingReceiver > provided to it, which in this case could be anything which implements the > interface GenbankReceiver. > > 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a > Genbank object and will fire off data from it to the provided ThingReceiver > (a GenbankReceiver instance) as if the Genbank object was being read from a > file or some other source. > > That's it! OK so it's a minimum of 6 classes instead of the original 1 or 2, > but the additional steps are necessary for flexibility in converting between > formats. > > Now to use it (you'll probably want a GenbankTools class to wrap these steps > up for user-friendliness, including various options for opening files, > etc.): > > 1. To read a file - instantiate ThingParser with your GenbankReader as the > reader, and GenbankBuilder as the receiver. Use the iterator methods on > ThingParser to get the objects out. > > 2. To write a file - instantiate ThingParser with a GenbankEmitter wrapping > your Genbank object, and a GenbankWriter as the receiver. 
Use the parseAll() > method on the ThingParser to dump the whole lot to your chosen output. > > The clever bit comes when you want to convert between files. Imagine you've > done all the above for Genbank, and you've also done it for FASTA. How to > convert between them? What you need to do is this: > > 1. Implement all the classes for both Genbank and FASTA. > > 2. Write a GenbankFASTAConverter class that implements ThingConverter > and GenbankReceiver, and will internally convert the data received and pass > it on out to the receiver provided, which will be a FASTAReceiver instance. > > 3. Write a FASTAGenbankConverter class that operates in exactly the opposite > way, implementing ThingConverter and FASTAReceiver. > > Then to convert you use ThingParser again: > > 1. From FASTA file to Genbank object: Instantiate ThingParser with a > FASTAReader reader, a GenbankBuilder receiver, and add a > FASTAGenbankConverter instance to the converter chain. Use the iterator to > get your Genbank objects out of your FASTA file. > > 2. From FASTA file to Genbank file: Same as option 1, but provide a > GenbankWriter instead and use parseAll() instead of the iterator methos. > > 3. From FASTA object to Genbank object: Same as option 1, but provide a > FASTAEmitter wrapping your FASTA object as the reader instead. > > 4. From FASTA object to Genbank file: Same as option 1, but swap both the > reader and the receiver as per options 2 and 3. > > 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions > of FASTA and Genbank, and use GenbankFASTAConverter instead. > > One last and very important feature of this approach is that if you discover > that nobody has written the appropriate converter for your chosen pair of > formats A and C, but converters do exist to map A to some other format B and > that other format B on to C, then you can just put the two converts A-B and > B-C into the ThingParser chain and it'll work perfectly. > > Enjoy! 
> > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From andreas at sdsc.edu Tue Oct 21 03:17:28 2008 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 20 Oct 2008 20:17:28 -0700 Subject: [Biojava-l] [Biojava-dev] BioJava 3 Begins - Volunteers please! In-Reply-To: References: Message-ID: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> Hi, Couple of thoughts regarding biojava v3: License: Since it seems we will end up copying code from biojava 1.6 to biojava 3.0, we need to keep the license the same (LGPL 2.1). I.e. people should still use the same biojava license headers when committing new files and all code will be considered to be LGPL, if no header is present. Do NOT commit code under other licenses. Installation: We need some installation instructions on the wiki site, e.g. how to get the maven setup running. What are the code conventions for the new version? Blast: the Blast parsing modules are among the most frequently used ones in biojava 1.6. To make people use biojava v3 it will be crucial to have a port of them to the new version. Does anybody want to take care of that? Automated builds: is it interesting to have automated builds set up for the new version at this stage, or should we wait until a more mature stage? I could easily add another auto-build similar to the one for biojava 1.6 at http://www.spice-3d.org/cruise/ Andreas On Sun, Oct 19, 2008 at 5:18 PM, Richard Holland wrote: > Hi all, > > I've just committed some new code to the biojava3 branch of the biojava-live > subversion repository. It's the foundations of a brand new alphabet+symbol > set of classes, and an example of how to use them to represent DNA. 
You'll > notice that the new code is very lightweight and allows for a lot more > flexibility than the old code - for instance, the concept of Alphabet has > changed radically. It also makes much more extensive use of the Collections > API. > > I haven't got any test cases or usage examples yet but give me a shout if > you don't understand the code and I'll explain how it works. (Hint: > SymbolFormat is there to convert Strings into SymbolList objects, and vice > versa). > > So, now we want some volunteers! We're starting from scratch here so there's > a lot of work to do. The whole of BioJava needs 'translating' into BJ3, > whether it be copy-and-paste existing classes and modify them to suit the > new style, or write completely new ones to provide equivalent functionality. > > > I'll post an example of how to do file parsing soon, probably starting with > FASTA. In the meantime, a good place to start would be for people to design > object models to represent their favourite data types (e.g. Genbank, or > microarray data). Utility classes to manipulate those objects would be great > too. > > The object models need to be normalised as much as possible - e.g. if your > data has a lot of comments, and the order of those comments is important, > then give your object model a collection of comment objects. The object > model for each data type should be completely independent and use basic data > types wherever possible (e.g. store sequences as strings, don't attempt to > parse them into anything fancy like SymbolLists). The closer the object > model is to the original data format, the better. There's going to be clever > tricks when it comes to converting data between different object models > (e.g. Genbank to INSDSeq), which I will explain later when I put the file > parsing examples up. > > You'll notice how the biojava3 branch uses Maven instead of Ant. 
This is > because we want to make it as modular as possible, so if you want to write > microarray stuff, create a new microarray sub-project (as per the dna > example that's already there). This way if someone only wants the microarray > bit of BJ3, they only need install the appropriate JAR file and can ignore > the rest. (The 'core' module is for stuff that is so generic it could be > used anywhere, or is used in every single other module.) > > If coding isn't your cup of tea, then we would very much welcome testers > (particularly those who enjoy writing test cases!), documenters > (particularly code commenters), translators (for internationalisation of the > code), and of course all those who wish to contribute ideas and suggestions > no matter how off-the-wall they might be. In particular if you'd like to > take charge of an area of the development process, e.g. Documentation Chief, > or Protein Champion, then that would be much appreciated. > > I'm very much looking forward to working with everyone on this. Good luck, > and happy coding! > > cheers, > Richard > > PS. Please don't forget to attach the appropriate licence to your code. You > can copy-and-paste it from the existing classes I just committed this > evening. > > PPS. For those who are worried about backwards compatibility - this was > discussed on the lists a while back and it was made clear that BJ3 is a > clean break. However, the existing code will continue to be maintained and > bugfixed for a couple of years so you don't have to upgrade if you don't > want to - it just won't have any new features developed for it. This is > largely because it'll probably take just that long to write all the new BJ3 > code. When we do decide to desupport the existing BJ code, plenty of notice > will be given (i.e. years as opposed to months). 
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From markjschreiber at gmail.com Tue Oct 21 05:41:28 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 13:41:28 +0800
Subject: [Biojava-l] Logging in BJ3
Message-ID: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com>

Hi -

I would like to strongly advocate the liberal and extensive use of
logging in BioJava3. The lack of this plagued us (me at least) during
bug fixes in previous versions of BioJava. The default Java logging
API is very flexible and easily meets our needs. It's also not too
much effort for developers to put in place (you know you use
System.out.println() all over the place anyway).

The following is an example snippet using logging that would certainly
help debugging. With the standard logging setup only the severe
statement would appear on the terminal. We could also provide config
files that show lower levels of logging so that people can easily
generate detailed logs to accompany bug reports. If we want to be
really tricky we could even use a MemoryHandler, which keeps a rotating
buffer of log statements that could be spat out along with a stack
trace, so you could submit the stack trace and the activity log all in
one go and we can get an idea of what was going on at the time.

The example below also shows what to do to avoid a major performance
hit during logging. The marked "expensive logging operation" pretends
to get config information by getting it from a database. One might
expect this to take time while the db connects etc., and it could
produce quite a long String of information. To save time when logging
is not set to the CONFIG level, the if statement skips this costly
step.
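
[Editorial sketch] The rotating-buffer idea mentioned above corresponds to java.util.logging.MemoryHandler in the JDK. The following is a minimal sketch of how it might be wired up, not a proposal for the actual BioJava setup: the logger name, the buffer size of 3 and the CollectingHandler target are all assumptions made for the illustration (a real configuration would more likely push to a FileHandler or ConsoleHandler).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.MemoryHandler;

class RollingLogSketch {

    /** Illustrative target that simply remembers whatever is pushed to it. */
    static class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            this.messages.add(record.getLevel() + ": " + record.getMessage());
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Logs four FINE records and one SEVERE record; returns what reached the target. */
    static List<String> demo() {
        // Hypothetical logger name, chosen only for this sketch.
        Logger logger = Logger.getLogger("org.biojava.sketch.RollingLogSketch");
        logger.setUseParentHandlers(false); // keep the demo off the console
        logger.setLevel(Level.ALL);         // let low-level records reach the handler

        CollectingHandler target = new CollectingHandler();
        // Keep only the 3 most recent records; push them to the target
        // as soon as a record at SEVERE (or above) arrives.
        MemoryHandler buffer = new MemoryHandler(target, 3, Level.SEVERE);
        buffer.setLevel(Level.ALL);
        logger.addHandler(buffer);

        logger.fine("step 1");
        logger.fine("step 2");
        logger.fine("step 3");
        logger.fine("step 4");                // overwrites "step 1" in the ring buffer
        logger.severe("Failed to do stuff");  // triggers the push

        logger.removeHandler(buffer);
        return target.messages;
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

Nothing is written while only FINE records arrive; the SEVERE record flushes itself together with the two most recent FINE records, which is the "activity log alongside the failure" effect described in the message.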
I know from experience we will definitely get the most value from this
in the IO parsers and ThingBuilders.

Any thoughts?

- Mark


import java.util.logging.Level;
import java.util.logging.Logger;

private Logger logger = Logger.getLogger("org.biojava.MyClass");

public Object generateObject(String argument) {
    // Trace method entry (logged at FINER).
    logger.entering(getClass().getName(), "generateObject", argument);

    // Expensive logging operation: only build the message
    // if CONFIG-level output is actually enabled.
    if (logger.isLoggable(Level.CONFIG)) {
        logger.config("DB config: " + getDBConfigInfo());
    }

    Object obj = null;
    try {
        // do some stuff
        logger.fine("doing stuff");
        obj = new Object();
    } catch (Exception ex) {
        logger.severe("Failed to do stuff");
        // Record the exception itself (logged at FINER).
        logger.throwing(getClass().getName(), "generateObject", ex);
    }

    // Trace method exit together with the result.
    logger.exiting(getClass().getName(), "generateObject", obj);
    return obj;
}

From holland at eaglegenomics.com Tue Oct 21 08:34:46 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 21 Oct 2008 09:34:46 +0100
Subject: [Biojava-l] File parsing in BJ3
In-Reply-To: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com>
References: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com>
Message-ID:

Spot on.

Annotation/interface... I think an annotation is probably better, as you
suggest, but I'd have to look into that. Not sure how it works with
collections and generics. If it does turn out to be a better bet, I'll
change it over.

With the BioSQL dependencies, take a look at the pom.xml file inside the
biojava-dna module. It declares a dependency on biojava-core. If you want to
add dependencies to external JARs, take a look at biojava-biosql's pom.xml
to see how it depends on javax.persistence. (The easiest way to add these is
via an IDE such as NetBeans, which is what I'm using at the moment.)

cheers,
Richard

2008/10/21 Mark Schreiber

> So if I want to build a BioSQL loader from Genbank then would the
> classes (or their wrappers) in the BioSQL Entity package need to
> implement Thing? Would maven have an issue with that or would it just
> create a dependency on core? (you can tell I've never used Maven
> right).
> > From a design point of view should Thing be an interface or an > Annotation? The reason I ask is that it doesn't define any methods so > it is more of a tag than an interface. > > Anyway, my understanding is that I would use a Genbank parser (or > write one). Write a EntityReceiver interface (probably more than one > given the number of entities in BioSQL, implement a EntityBuilder > (again possibly more than one) that implements EntityReceiver and > builds Entity beans from messages it receives. In this case I probably > wouldn't provide a writer as JPA would be writing the beans to the > database. Would this be how you imagine it? > > - Mark > > > On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland > wrote: > > (From now on I will only be posting these development messages to > > biojava-dev, which is the intended purpose of that list. Those of you who > > wish to keep track of things but are currently only subscribed to > biojava-l > > should also subscribe to biojava-dev in order to keep up to date.) > > > > As promised, I've committed a new package in the biojava-core module that > > should help understand how to do file parsing and conversion and writing > in > > the new BJ3 modules. Here's an example of how to use it to write a > Genbank > > parser (note no parsers actually exist yet!): > > > > 1. Design yourself a Genbank class which implements the interface Thing > and > > can fully represent all the data that might possibly occur inside a > Genbank > > file. > > > > 2. Write an interface called GenbankReceiver, which extends ThingReceiver > > and defines all the methods you might need in order to construct a > Genbank > > object in an asynchronous fashion. > > > > 3. Write a GenbankBuilder class which implements GenbankReceiver and > > ThingBuilder. It's job is to receive data via method calls, use that data > to > > construct a Genbank object, then provide that object on demand. > > > > 4. 
Write a GenbankWriter class which implements GenbankReceiver and > > ThingWriter. It's job is similar to GenbankBuilder, but instead of > > constructing new Genbank objects, it writes Genbank records to file that > > reflect the data it receives. > > > > 5. Write a GenbankReader class which implements ThingReader. It can read > > GenbankFiles and output the data to the methods of the ThingReceiver > > provided to it, which in this case could be anything which implements the > > interface GenbankReceiver. > > > > 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a > > Genbank object and will fire off data from it to the provided > ThingReceiver > > (a GenbankReceiver instance) as if the Genbank object was being read from > a > > file or some other source. > > > > That's it! OK so it's a minimum of 6 classes instead of the original 1 or > 2, > > but the additional steps are necessary for flexibility in converting > between > > formats. > > > > Now to use it (you'll probably want a GenbankTools class to wrap these > steps > > up for user-friendliness, including various options for opening files, > > etc.): > > > > 1. To read a file - instantiate ThingParser with your GenbankReader as > the > > reader, and GenbankBuilder as the receiver. Use the iterator methods on > > ThingParser to get the objects out. > > > > 2. To write a file - instantiate ThingParser with a GenbankEmitter > wrapping > > your Genbank object, and a GenbankWriter as the receiver. Use the > parseAll() > > method on the ThingParser to dump the whole lot to your chosen output. > > > > The clever bit comes when you want to convert between files. Imagine > you've > > done all the above for Genbank, and you've also done it for FASTA. How to > > convert between them? What you need to do is this: > > > > 1. Implement all the classes for both Genbank and FASTA. > > > > 2. 
Write a GenbankFASTAConverter class that implements > ThingConverter > > and GenbankReceiver, and will internally convert the data received and > pass > > it on out to the receiver provided, which will be a FASTAReceiver > instance. > > > > 3. Write a FASTAGenbankConverter class that operates in exactly the > opposite > > way, implementing ThingConverter and FASTAReceiver. > > > > Then to convert you use ThingParser again: > > > > 1. From FASTA file to Genbank object: Instantiate ThingParser with a > > FASTAReader reader, a GenbankBuilder receiver, and add a > > FASTAGenbankConverter instance to the converter chain. Use the iterator > to > > get your Genbank objects out of your FASTA file. > > > > 2. From FASTA file to Genbank file: Same as option 1, but provide a > > GenbankWriter instead and use parseAll() instead of the iterator methos. > > > > 3. From FASTA object to Genbank object: Same as option 1, but provide a > > FASTAEmitter wrapping your FASTA object as the reader instead. > > > > 4. From FASTA object to Genbank file: Same as option 1, but swap both the > > reader and the receiver as per options 2 and 3. > > > > 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all > mentions > > of FASTA and Genbank, and use GenbankFASTAConverter instead. > > > > One last and very important feature of this approach is that if you > discover > > that nobody has written the appropriate converter for your chosen pair of > > formats A and C, but converters do exist to map A to some other format B > and > > that other format B on to C, then you can just put the two converts A-B > and > > B-C into the ThingParser chain and it'll work perfectly. > > > > Enjoy! 
> > > > cheers, > > Richard > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > M: +44 7500 438846 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From ayates at ebi.ac.uk Tue Oct 21 08:40:48 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 09:40:48 +0100 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> Message-ID: <48FD9590.5010704@ebi.ac.uk> Hi, A logging framework is a priority to start baking into the new API now. As Mark has mentioned logging frameworks are very flexible things but it's not until you start using them do you get a real feel about how easy & extensible they are. The JDK logger has some good integration with MessageFormat & localization. I'm not completely taken with how it does the checks for log levels (log.isDebugEnabled() just seems easier that log.isLoggable(Level.FINEST)) & how you grab a logger ( I'd prefer something like Logger.getLogger(this.getClass()) ) but that's just nit-picking. I'll be happy to go with whatever people are most comfortable with & we should attempt to use as many of the core Java classes as possible. Andy Mark Schreiber wrote: > Hi - > > I would like to strongly advocate the liberal and extensive use of > Logging in BioJava3. The lack of this plagued us (me at least) during > bug fixes in previous versions of BioJava. The default Java logging > API is very flexible and easily meets our needs. 
It's also not too > much effort for developers to put in place (you know you use > System.println() all over the place anyway). > > The following is an example snippet using logging that would certainly > help debugging. With the standard logging setup only the severe > statement would appear on the terminal. We could also provide config > files that show lower levels of logging so that people can easily > generate detailed logs to accompany bug reports. If we want to be > really tricky we could even use a MemoryLogger that has a rotating > buffer of log statements that could spit out with a stack trace so you > could just submit the stack trace and the activity log all in one go > and we can get an idea of what was going on at the time. > > The example below also shows what to do to avoid a major performance > hit during logging. The marked "expensive logging operation" pretends > to get config information by getting it from a database. One might > expect this to take time while the db connects etc and could produce > quite a long String of information. To save time when logging is not > set to the CONFIG level the if statement is able to skip this costly > step. > > I know from experience we will definitely get the most value from this > in the IO parsers and ThingBuilders. > > Any thoughts? 
> > - Mark > > > > private Logger logger = Logger.getLogger("org.biojava.MyClass"); > > public Object generateObject(String argument){ > logger.entering(""+getClass(), "generateObject", argument); > > //expensive logging operation > if (logger.isLoggable( Level.CONFIG )) { > logger.config("DB config: "+ getDBConfigInfo()); > } > > Object obj = null; > try{ > > //do some stuff > logger.fine("doing stuff"); > obj = new Object(); > > }catch(Exception ex){ > logger.severe("Failed to do stuff"); > logger.throwing(""+getClass(), "generateObject", ex); > } > > logger.exiting(""+getClass(), "generateObject", obj); > return obj; > } > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From ayates at ebi.ac.uk Tue Oct 21 08:49:47 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 09:49:47 +0100 Subject: [Biojava-l] File parsing in BJ3 In-Reply-To: References: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> Message-ID: <48FD97AB.70503@ebi.ac.uk> Depends on what you want to program. If you want to have a collection of objects which are Things & perform a common action on them then annotations are not the way forward. If you want to have some kind of meta-programming occurring & need a class to be multiple things then annotations are right. There is currently no way to enforce compile time dependencies on annotations & my thinking is that this is right. Annotations should be meta data or provide a way to alter a class in a non-invasive way (think Web Service annotations creating WS Servers & Clients without any alteration of the class). Andy Richard Holland wrote: > Spot on. > > Annotation/interface.... i think Annotation is probably better as you > suggest, but I'd have to look into that. Not sure how it works with > collections and generics. If it does turn out to be a better bet, I'll > change it over. 
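
[Editorial sketch] To make the interface-versus-annotation trade-off discussed in this thread concrete, here is a small comparison. It is a sketch only: Thing here is a local stand-in for the marker interface under discussion, and ThingTag, Genbank and Fasta are hypothetical names invented for the illustration. It shows the point raised about collections: a marker interface lets the compiler type a collection, while an annotation can only be checked at run time.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MarkerVersusAnnotationSketch {

    /** Option A: an empty marker interface (local stand-in, not the BJ3 type). */
    interface Thing { }

    /** Option B: the same tag carried as runtime metadata (hypothetical name). */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ThingTag { }

    /** Tagged via the marker interface. */
    static final class Genbank implements Thing { }

    /** Tagged via the annotation only. */
    @ThingTag
    static final class Fasta { }

    /** With the interface, the compiler rejects a list of anything that is not a Thing. */
    static int countThings(List<? extends Thing> things) {
        return things.size();
    }

    /** With the annotation, the list is untyped and the check happens at run time. */
    static List<Object> keepTagged(List<?> candidates) {
        List<Object> tagged = new ArrayList<>();
        for (Object candidate : candidates) {
            if (candidate.getClass().isAnnotationPresent(ThingTag.class)) {
                tagged.add(candidate);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        System.out.println(countThings(Arrays.asList(new Genbank(), new Genbank())));
        System.out.println(keepTagged(Arrays.asList(new Fasta(), "not a thing", new Genbank())).size());
    }
}
```

Passing a list of Strings to countThings would not compile, whereas keepTagged accepts anything and silently filters, which is the compile-time enforcement gap noted above.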
> > With the BioSQL dependencies, take a look at the pom.xml file inside the > biojava-dna module. It declares a dependency on biojava-core. If you want to > add dependencies to external JARs, take a look at biojava-biosql's pom.xml > to see how it depends on javax.persistence. (The easiest way to add these is > via an IDE such as NetBeans, which is what I'm using at the moment). > > cheers, > Richard > > 2008/10/21 Mark Schreiber > >> So if I want to build a BioSQL loader from Genbank then would the >> classes (or there wrappers) in the BioSQL Entity package need to >> implement Thing? Would maven have an issue with that or would it just >> create a dependency on core? (you can tell I've never used Maven >> right). >> >> From a design point of view should Thing be an interface or an >> Annotation? The reason I ask is that it doesn't define any methods so >> it is more of a tag than an interface. >> >> Anyway, my understanding is that I would use a Genbank parser (or >> write one). Write a EntityReceiver interface (probably more than one >> given the number of entities in BioSQL, implement a EntityBuilder >> (again possibly more than one) that implements EntityReceiver and >> builds Entity beans from messages it receives. In this case I probably >> wouldn't provide a writer as JPA would be writing the beans to the >> database. Would this be how you imagine it? >> >> - Mark >> >> >> On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland >> wrote: >>> (From now on I will only be posting these development messages to >>> biojava-dev, which is the intended purpose of that list. Those of you who >>> wish to keep track of things but are currently only subscribed to >> biojava-l >>> should also subscribe to biojava-dev in order to keep up to date.) >>> >>> As promised, I've committed a new package in the biojava-core module that >>> should help understand how to do file parsing and conversion and writing >> in >>> the new BJ3 modules. 
Here's an example of how to use it to write a >> Genbank >>> parser (note no parsers actually exist yet!): >>> >>> 1. Design yourself a Genbank class which implements the interface Thing >> and >>> can fully represent all the data that might possibly occur inside a >> Genbank >>> file. >>> >>> 2. Write an interface called GenbankReceiver, which extends ThingReceiver >>> and defines all the methods you might need in order to construct a >> Genbank >>> object in an asynchronous fashion. >>> >>> 3. Write a GenbankBuilder class which implements GenbankReceiver and >>> ThingBuilder. It's job is to receive data via method calls, use that data >> to >>> construct a Genbank object, then provide that object on demand. >>> >>> 4. Write a GenbankWriter class which implements GenbankReceiver and >>> ThingWriter. It's job is similar to GenbankBuilder, but instead of >>> constructing new Genbank objects, it writes Genbank records to file that >>> reflect the data it receives. >>> >>> 5. Write a GenbankReader class which implements ThingReader. It can read >>> GenbankFiles and output the data to the methods of the ThingReceiver >>> provided to it, which in this case could be anything which implements the >>> interface GenbankReceiver. >>> >>> 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a >>> Genbank object and will fire off data from it to the provided >> ThingReceiver >>> (a GenbankReceiver instance) as if the Genbank object was being read from >> a >>> file or some other source. >>> >>> That's it! OK so it's a minimum of 6 classes instead of the original 1 or >> 2, >>> but the additional steps are necessary for flexibility in converting >> between >>> formats. >>> >>> Now to use it (you'll probably want a GenbankTools class to wrap these >> steps >>> up for user-friendliness, including various options for opening files, >>> etc.): >>> >>> 1. 
To read a file - instantiate ThingParser with your GenbankReader as >> the >>> reader, and GenbankBuilder as the receiver. Use the iterator methods on >>> ThingParser to get the objects out. >>> >>> 2. To write a file - instantiate ThingParser with a GenbankEmitter >> wrapping >>> your Genbank object, and a GenbankWriter as the receiver. Use the >> parseAll() >>> method on the ThingParser to dump the whole lot to your chosen output. >>> >>> The clever bit comes when you want to convert between files. Imagine >> you've >>> done all the above for Genbank, and you've also done it for FASTA. How to >>> convert between them? What you need to do is this: >>> >>> 1. Implement all the classes for both Genbank and FASTA. >>> >>> 2. Write a GenbankFASTAConverter class that implements >> ThingConverter >>> and GenbankReceiver, and will internally convert the data received and >> pass >>> it on out to the receiver provided, which will be a FASTAReceiver >> instance. >>> 3. Write a FASTAGenbankConverter class that operates in exactly the >> opposite >>> way, implementing ThingConverter and FASTAReceiver. >>> >>> Then to convert you use ThingParser again: >>> >>> 1. From FASTA file to Genbank object: Instantiate ThingParser with a >>> FASTAReader reader, a GenbankBuilder receiver, and add a >>> FASTAGenbankConverter instance to the converter chain. Use the iterator >> to >>> get your Genbank objects out of your FASTA file. >>> >>> 2. From FASTA file to Genbank file: Same as option 1, but provide a >>> GenbankWriter instead and use parseAll() instead of the iterator methos. >>> >>> 3. From FASTA object to Genbank object: Same as option 1, but provide a >>> FASTAEmitter wrapping your FASTA object as the reader instead. >>> >>> 4. From FASTA object to Genbank file: Same as option 1, but swap both the >>> reader and the receiver as per options 2 and 3. >>> >>> 5/6/7/8. 
From Genbank * to FASTA * - same as 1,2,3,4 but swap all >> mentions >>> of FASTA and Genbank, and use GenbankFASTAConverter instead. >>> >>> One last and very important feature of this approach is that if you >> discover >>> that nobody has written the appropriate converter for your chosen pair of >>> formats A and C, but converters do exist to map A to some other format B >> and >>> that other format B on to C, then you can just put the two converts A-B >> and >>> B-C into the ThingParser chain and it'll work perfectly. >>> >>> Enjoy! >>> >>> cheers, >>> Richard >>> >>> -- >>> Richard Holland, BSc MBCS >>> Finance Director, Eagle Genomics Ltd >>> M: +44 7500 438846 | E: holland at eaglegenomics.com >>> http://www.eaglegenomics.com/ >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> > > > From holland at eaglegenomics.com Tue Oct 21 09:06:41 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 21 Oct 2008 10:06:41 +0100 Subject: [Biojava-l] [Biojava-dev] BioJava 3 Begins - Volunteers please! In-Reply-To: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> References: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> Message-ID: > > > License: Since it seems we will end up copying code from biojava 1.6 > to biojava 3.0, we need to keep the license the same (LGPL 2.1). I.e. > people should still use the same biojava license headers when > committing new files and all code will be considered to be LGPL, if no > header is present. Do NOT commit code under other licenses. > > Installation: We need some installation instructions on the wiki site, > e.g. how to get the maven setup running. What are the code > conventions for the new version? Not sure where best to put it in the Wiki, but I agree it needs to go there somewhere. 
Installation is a one-liner from within the top level of the project:

    mvn install

This compiles and installs the JARs into your local Maven repository, and
also downloads and installs any external dependencies. Then you can add the
installed modules as dependencies in your own Maven projects.

If you need to write a launcher script for your project, or you want to use
the JAR files outside Maven, you can use this command to generate the
CLASSPATH for use outside Maven. This only includes external dependencies -
you'll also need to add to it the individual JAR files from inside the
various target/ folders that Maven built for you:

    mvn dependency:build-classpath

Code conventions are simple:

1. I'm not fussed about the specific formatter people use in each module, as
long as the code is all formatted using some kind of consistent method. I
personally just use the default settings from Format code in NetBeans.

2. Use 'this' wherever possible, and for static references, use the
classname prefix (e.g. MyClass.staticField). I hate having to try and work
out in my head which references are going where, and which are static and
which are not!

3. Comment every single method, even if it's private. This helps understand
the flow of your code. Also comment liberally inside methods if they are
longer than just a few lines (i.e. if you can't fit the entire method within
the code panel in NetBeans, it's going to need internal comments).

4. When writing getters/setters, follow the Java beans conventions so that
automated frameworks like Spring can easily pick them up and work with them.

5. Please write tests for your code using JUnit conventions, inside the
test/ folder of each module. I know I haven't done this myself yet, but I'm
going to!

>
>
> Blast: the Blast parsing modules are among the most frequently used
> ones in biojava 1.6. To make people use biojava v3 it will be crucial
> to have a port of them to the new version. Does anybody want to take
> care of that?

I'll second that.
Blast is vital. We'd really appreciate a volunteer, please! > > Automated builds: is it interesting to have automated builds set up > for the new version at this stage, or should we wait until a more > mature stage? I could easily add another auto-build similar to the one > for biojava 1.6 at http://www.spice-3d.org/cruise/ You could do, although I don't think they'd be much use yet. But why not start early then we won't forget to do it later. Richard > > Andreas > > On Sun, Oct 19, 2008 at 5:18 PM, Richard Holland > wrote: > > Hi all, > > > > I've just committed some new code to the biojava3 branch of the > biojava-live > > subversion repository. It's the foundations of a brand new > alphabet+symbol > > set of classes, and an example of how to use them to represent DNA. > You'll > > notice that the new code is very lightweight and allows for a lot more > > flexibility than the old code - for instance, the concept of Alphabet has > > changed radically. It also makes much more extensive use of the > Collections > > API. > > > > I haven't got any test cases or usage examples yet but give me a shout if > > you don't understand the code and I'll explain how it works. (Hint: > > SymbolFormat is there to convert Strings into SymbolList objects, and > vice > > versa). > > > > So, now we want some volunteers! We're starting from scratch here so > there's > > a lot of work to do. The whole of BioJava needs 'translating' into BJ3, > > whether it be copy-and-paste existing classes and modify them to suit the > > new style, or write completely new ones to provide equivalent > functionality. > > > > > > I'll post an example of how to do file parsing soon, probably starting > with > > FASTA. In the meantime, a good place to start would be for people to > design > > object models to represent their favourite data types (e.g. Genbank, or > > microarray data). Utility classes to manipulate those objects would be > great > > too. 
> > > > The object models need to be normalised as much as possible - e.g. if > your > > data has a lot of comments, and the order of those comments is important, > > then give your object model a collection of comment objects. The object > > model for each data type should be completely independent and use basic > data > > types wherever possible (e.g. store sequences as strings, don't attempt > to > > parse them into anything fancy like SymbolLists). The closer the object > > model is to the original data format, the better. There's going to be > clever > > tricks when it comes to converting data between different object models > > (e.g. Genbank to INSDSeq), which I will explain later when I put the file > > parsing examples up. > > > > You'll notice how the biojava3 branch uses Maven instead of Ant. This is > > because we want to make it as modular as possible, so if you want to > write > > microarray stuff, create a new microarray sub-project (as per the dna > > example that's already there). This way if someone only wants the > microarray > > bit of BJ3, they only need install the appropriate JAR file and can > ignore > > the rest. (The 'core' module is for stuff that is so generic it could be > > used anywhere, or is used in every single other module.) > > > > If coding isn't your cup of tea, then we would very much welcome testers > > (particularly those who enjoy writing test cases!), documenters > > (particularly code commenters), translators (for internationalisation of > the > > code), and of course all those who wish to contribute ideas and > suggestions > > no matter how off-the-wall they might be. In particular if you'd like to > > take charge of an area of the development process, e.g. Documentation > Chief, > > or Protein Champion, then that would be much appreciated. > > > > I'm very much looking forward to working with everyone on this. Good > luck, > > and happy coding! > > > > cheers, > > Richard > > > > PS. 
Please don't forget to attach the appropriate licence to your code. > You > > can copy-and-paste it from the existing classes I just committed this > > evening. > > > > PPS. For those who are worried about backwards compatibility - this was > > discussed on the lists a while back and it was made clear that BJ3 is a > > clean break. However, the existing code will continue to be maintained > and > > bugfixed for a couple of years so you don't have to upgrade if you don't > > want to - it just won't have any new features developed for it. This is > > largely because it'll probably take just that long to write all the new > BJ3 > > code. When we do decide to desupport the existing BJ code, plenty of > notice > > will be given (i.e. years as opposed to months). > > > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > M: +44 7500 438846 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From benn at mpi-cbg.de Tue Oct 21 09:00:44 2008 From: benn at mpi-cbg.de (Neil Benn) Date: Tue, 21 Oct 2008 11:00:44 +0200 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> Message-ID: <48FD9A3C.20904@mpi-cbg.de> Hello, I'm not sure if I should comment as I have no time to contribute LOC but I thought I may as well ;). Mark Schreiber wrote: > Hi - > > I would like to strongly advocate the liberal and extensive use of > Logging in BioJava3. The lack of this plagued us (me at least) during > bug fixes in previous versions of BioJava. 
The default Java logging
> API is very flexible and easily meets our needs. It's also not too
> much effort for developers to put in place (you know you use
> System.println() all over the place anyway).
>

Hmm, that is true, but for completeness you can also use
commons-logging, which is very easy to use and much more flexible as it
can encapsulate other logging mechanisms (including the JDK 1.4 logging
framework). To use it you simply declare a new logger as follows:

    private static final Log logger = LogFactory.getLog();

The rest of it works pretty much the same as below. If you dovetail
commons-logging with log4j then you'll cover the most common case of
logging used in other frameworks; the config files to set up log4j (XML
and properties files) are well documented all over the web.

>
>
> I know from experience we will definitely get the most value from this
> in the IO parsers and ThingBuilders.
>
> Any thoughts?
>

+1

> - Mark
>
>
>
> private Logger logger = Logger.getLogger("org.biojava.MyClass");
>
> public Object generateObject(String argument){
>   logger.entering(""+getClass(), "generateObject", argument);
>
>   //expensive logging operation
>   if (logger.isLoggable( Level.CONFIG )) {
>     logger.config("DB config: "+ getDBConfigInfo());
>   }
>
>   Object obj = null;
>   try{
>
>     //do some stuff
>     logger.fine("doing stuff");
>     obj = new Object();
>
>   }catch(Exception ex){
>     logger.severe("Failed to do stuff");
>     logger.throwing(""+getClass(), "generateObject", ex);
>   }
>
>   logger.exiting(""+getClass(), "generateObject", obj);
>   return obj;
> }
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From markjschreiber at gmail.com Tue Oct 21 09:18:41 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 17:18:41 +0800
Subject: [Biojava-l] Logging in BJ3
In-Reply-To:
References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com>
Message-ID: <93b45ca50810210218n1e2ac06bma211f1541b8be3bb@mail.gmail.com> For the Entity classes my original thinking was to implement an EJB3 interceptor which logs all method calls. This would be preferable to putting logging statements in all the classes but I don't know if such an interceptor will work outside of a container. Does anyone know if JPA can use an interceptor outside of a container? Logging for the actual persistence would be via the persistence provider (Hibernate, Toplink etc). - Mark On Tue, Oct 21, 2008 at 5:08 PM, Richard Holland wrote: > Excellent idea. I'll integrate it into ThingParser as an example > > 2008/10/21 Mark Schreiber >> >> Hi - >> >> I would like to strongly advocate the liberal and extensive use of >> Logging in BioJava3. The lack of this plagued us (me at least) during >> bug fixes in previous versions of BioJava. The default Java logging >> API is very flexible and easily meets our needs. It's also not too >> much effort for developers to put in place (you know you use >> System.println() all over the place anyway). >> >> The following is an example snippet using logging that would certainly >> help debugging. With the standard logging setup only the severe >> statement would appear on the terminal. We could also provide config >> files that show lower levels of logging so that people can easily >> generate detailed logs to accompany bug reports. If we want to be >> really tricky we could even use a MemoryLogger that has a rotating >> buffer of log statements that could spit out with a stack trace so you >> could just submit the stack trace and the activity log all in one go >> and we can get an idea of what was going on at the time. >> >> The example below also shows what to do to avoid a major performance >> hit during logging. The marked "expensive logging operation" pretends >> to get config information by getting it from a database. 
One might >> expect this to take time while the db connects etc and could produce >> quite a long String of information. To save time when logging is not >> set to the CONFIG level the if statement is able to skip this costly >> step. >> >> I know from experience we will definitely get the most value from this >> in the IO parsers and ThingBuilders. >> >> Any thoughts? >> >> - Mark >> >> >> >> private Logger logger = Logger.getLogger("org.biojava.MyClass"); >> >> public Object generateObject(String argument){ >> logger.entering(""+getClass(), "generateObject", argument); >> >> //expensive logging operation >> if (logger.isLoggable( Level.CONFIG )) { >> logger.config("DB config: "+ getDBConfigInfo()); >> } >> >> Object obj = null; >> try{ >> >> //do some stuff >> logger.fine("doing stuff"); >> obj = new Object(); >> >> }catch(Exception ex){ >> logger.severe("Failed to do stuff"); >> logger.throwing(""+getClass(), "generateObject", ex); >> } >> >> logger.exiting(""+getClass(), "generateObject", obj); >> return obj; >> } >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From ayates at ebi.ac.uk Tue Oct 21 09:21:26 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 10:21:26 +0100 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <48FD9A3C.20904@mpi-cbg.de> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> <48FD9A3C.20904@mpi-cbg.de> Message-ID: <48FD9F16.2000405@ebi.ac.uk> Hi Neil, That's okay the more people take an interest in this the better it will be. We did discuss this quite a bit ago at a biojava meeting & the general consensus was bridges can be manually written between the logging frameworks as and when they are required. 
Also using the JDK logger reduces our external dependencies. However I do like the logging facades & am in favour of them. Especially SLF4J which does the same thing as commons-logging but relies on the existence of SLF4J adaptors not the raw logging framework which commons-logging does. It also has links to a lot more logging frameworks including simple-log (https://simple-log.dev.java.net/) & logback (http://logback.qos.ch/). There's just so many options here it's hard to gauge what is the best thing to do. Do we buy into a single framework & use all of its features (JDK logger has nice things for logging entering & exiting methods along with locale ResourceBundles) or go for a common denominator. It's not an easy decision to make ........ Andy Neil Benn wrote: > Hello, > > I'm not sure if I should comment as I have no time to > contribute LOC but I thought I may as well ;). > > Mark Schreiber wrote: >> Hi - >> >> I would like to strongly advocate the liberal and extensive use of >> Logging in BioJava3. The lack of this plagued us (me at least) during >> bug fixes in previous versions of BioJava. The default Java logging >> API is very flexible and easily meets our needs. It's also not too >> much effort for developers to put in place (you know you use >> System.println() all over the place anyway). >> > Hmm, that is true but for total completeness you can use > commons-logging, that is very easy to use and much more flexible as it > can encapsulate other logging mechanisms (including JDK1.4 logging > framework). To use it you simply declare a new logger as follows: > > private static final Log logger = LogFactory.getLog( here>); > > The rest of it works pretty much the same as below- if you dovetail > commons-logging with log4j then you'll cover the most common case of > logging used in other frameworks - the config files to setup log4j (XML > and preperties fiels) are well documented all over the web. 
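
For comparison, the guard pattern from the snippet quoted in this thread can be sketched with nothing but the JDK's java.util.logging, so no facade or external JAR is involved. This is only an illustration: the class name LoggingGuardSketch, the describeConfig() helper and the call counter are made up for the example and are not part of BioJava.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only: the names in this class are illustrative, not BioJava API. */
final class LoggingGuardSketch {
    private static final Logger LOGGER =
            Logger.getLogger(LoggingGuardSketch.class.getName());

    /** Counts how often the costly message was actually built. */
    static int expensiveCalls = 0;

    /** Hypothetical stand-in for something slow, e.g. asking a database for its settings. */
    static String describeConfig() {
        expensiveCalls++;
        return "driver=example; pool=4";
    }

    /** Lets a caller raise or lower the level of this sketch's logger. */
    static void setLevel(Level level) {
        LOGGER.setLevel(level);
    }

    static Object generateObject(String argument) {
        final String cls = LoggingGuardSketch.class.getName();
        LOGGER.entering(cls, "generateObject", argument);

        // The guard: the message is only built when CONFIG records are wanted.
        if (LOGGER.isLoggable(Level.CONFIG)) {
            LOGGER.config("DB config: " + describeConfig());
        }

        Object obj = null;
        try {
            LOGGER.fine("doing stuff");
            obj = new Object();
        } catch (RuntimeException ex) {
            // Passing the exception keeps its stack trace in the log record.
            LOGGER.log(Level.SEVERE, "Failed to do stuff", ex);
        }

        LOGGER.exiting(cls, "generateObject", obj);
        return obj;
    }
}
```

With the JDK's default configuration the console handler only prints INFO and above, so the entering, config and fine records stay invisible until a configuration file or a handler lowers the threshold.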
>> >> >> I know from experience we will definitely get the most value from this >> in the IO parsers and ThingBuilders. >> >> Any thoughts? >> > +1 >> - Mark >> >> >> >> private Logger logger = Logger.getLogger("org.biojava.MyClass"); >> >> public Object generateObject(String argument){ >> logger.entering(""+getClass(), "generateObject", argument); >> >> //expensive logging operation >> if (logger.isLoggable( Level.CONFIG )) { >> logger.config("DB config: "+ getDBConfigInfo()); >> } >> >> Object obj = null; >> try{ >> >> //do some stuff >> logger.fine("doing stuff"); >> obj = new Object(); >> >> }catch(Exception ex){ >> logger.severe("Failed to do stuff"); >> logger.throwing(""+getClass(), "generateObject", ex); >> } >> >> logger.exiting(""+getClass(), "generateObject", obj); >> return obj; >> } >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From ayates at ebi.ac.uk Tue Oct 21 09:23:35 2008 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 21 Oct 2008 10:23:35 +0100 Subject: [Biojava-l] Logging in BJ3 In-Reply-To: <93b45ca50810210218n1e2ac06bma211f1541b8be3bb@mail.gmail.com> References: <93b45ca50810202241i767e2a56w2d9b7ede0f895431@mail.gmail.com> <93b45ca50810210218n1e2ac06bma211f1541b8be3bb@mail.gmail.com> Message-ID: <48FD9F97.8010705@ebi.ac.uk> As far as I was aware JPA has no concept of EJB3 interceptors. If you want that kind of thing I think you would have to start using AOP or proxy objects. Andy Mark Schreiber wrote: > For the Entity classes my original thinking was to implement an EJB3 > interceptor which logs all method calls. 
This would be preferable to > putting logging statements in all the classes but I don't know if such > an interceptor will work outside of a container. Does anyone know if > JPA can use an interceptor outside of a container? > > Logging for the actual persistence would be via the persistence > provider (Hibernate, Toplink etc). > > - Mark > > On Tue, Oct 21, 2008 at 5:08 PM, Richard Holland > wrote: >> Excellent idea. I'll integrate it into ThingParser as an example >> >> 2008/10/21 Mark Schreiber >>> Hi - >>> >>> I would like to strongly advocate the liberal and extensive use of >>> Logging in BioJava3. The lack of this plagued us (me at least) during >>> bug fixes in previous versions of BioJava. The default Java logging >>> API is very flexible and easily meets our needs. It's also not too >>> much effort for developers to put in place (you know you use >>> System.println() all over the place anyway). >>> >>> The following is an example snippet using logging that would certainly >>> help debugging. With the standard logging setup only the severe >>> statement would appear on the terminal. We could also provide config >>> files that show lower levels of logging so that people can easily >>> generate detailed logs to accompany bug reports. If we want to be >>> really tricky we could even use a MemoryLogger that has a rotating >>> buffer of log statements that could spit out with a stack trace so you >>> could just submit the stack trace and the activity log all in one go >>> and we can get an idea of what was going on at the time. >>> >>> The example below also shows what to do to avoid a major performance >>> hit during logging. The marked "expensive logging operation" pretends >>> to get config information by getting it from a database. One might >>> expect this to take time while the db connects etc and could produce >>> quite a long String of information. 
To save time when logging is not >>> set to the CONFIG level the if statement is able to skip this costly >>> step. >>> >>> I know from experience we will definitely get the most value from this >>> in the IO parsers and ThingBuilders. >>> >>> Any thoughts? >>> >>> - Mark >>> >>> >>> >>> private Logger logger = Logger.getLogger("org.biojava.MyClass"); >>> >>> public Object generateObject(String argument){ >>> logger.entering(""+getClass(), "generateObject", argument); >>> >>> //expensive logging operation >>> if (logger.isLoggable( Level.CONFIG )) { >>> logger.config("DB config: "+ getDBConfigInfo()); >>> } >>> >>> Object obj = null; >>> try{ >>> >>> //do some stuff >>> logger.fine("doing stuff"); >>> obj = new Object(); >>> >>> }catch(Exception ex){ >>> logger.severe("Failed to do stuff"); >>> logger.throwing(""+getClass(), "generateObject", ex); >>> } >>> >>> logger.exiting(""+getClass(), "generateObject", obj); >>> return obj; >>> } >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> >> -- >> Richard Holland, BSc MBCS >> Finance Director, Eagle Genomics Ltd >> M: +44 7500 438846 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From markjschreiber at gmail.com Tue Oct 21 09:26:41 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 21 Oct 2008 17:26:41 +0800 Subject: [Biojava-l] [Biojava-dev] BioJava 3 Begins - Volunteers please! In-Reply-To: References: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> Message-ID: <93b45ca50810210226t79cfbcbfhcadaedcfe8735676@mail.gmail.com> >> Blast: the Blast parsing modules are among the most frequently used >> ones in biojava 1.6. 
>> To make people use biojava v3 it will be crucial
>> to have a port of them to the new version. Does anybody want to take
>> care of that?
>
>
> I'll second that. Blast is vital. We'd really appreciate a volunteer,
> please!
>

BlastXML output would certainly be the easiest place to start. I also
think with the new Thing/ThingBuilder framework it will be possible to
develop all manner of parsers for the vagaries of Blast text output
that come with each new release of Blast. Possible but maybe not a good
idea. I don't think that output was ever supposed to be machine
readable. The table formatted output (-m8 I think) would be a better
option.

Given the DTD it should be possible to do a quick JAXB binding. How
would that work in the Thing/ThingBuilder paradigm?

- Mark

From markjschreiber at gmail.com Tue Oct 21 10:35:14 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 21 Oct 2008 18:35:14 +0800
Subject: [Biojava-l] File parsing in BJ3
In-Reply-To: <48FD97AB.70503@ebi.ac.uk>
References: <93b45ca50810202016j13a2a2a9y78a2992e543d6f5a@mail.gmail.com> <48FD97AB.70503@ebi.ac.uk>
Message-ID: <93b45ca50810210335j5ef4a206y545e5a1869cedc03@mail.gmail.com>

Is there any need for Thing at all? Can't a builder be typed to produce
something that extends Object? If Thing provides no behaviour contract
or meta-information then why does it exist?

- Mark

On Tue, Oct 21, 2008 at 4:49 PM, Andy Yates wrote:
> Depends on what you want to program. If you want to have a collection of
> objects which are Things & perform a common action on them then
> annotations are not the way forward.
>
> If you want to have some kind of meta-programming occurring & need a
> class to be multiple things then annotations are right. There is
> currently no way to enforce compile time dependencies on annotations &
> my thinking is that this is right.
Annotations should be meta data or > provide a way to alter a class in a non-invasive way (think Web Service > annotations creating WS Servers & Clients without any alteration of the > class). > > Andy > > Richard Holland wrote: >> Spot on. >> >> Annotation/interface.... i think Annotation is probably better as you >> suggest, but I'd have to look into that. Not sure how it works with >> collections and generics. If it does turn out to be a better bet, I'll >> change it over. >> >> With the BioSQL dependencies, take a look at the pom.xml file inside the >> biojava-dna module. It declares a dependency on biojava-core. If you want to >> add dependencies to external JARs, take a look at biojava-biosql's pom.xml >> to see how it depends on javax.persistence. (The easiest way to add these is >> via an IDE such as NetBeans, which is what I'm using at the moment). >> >> cheers, >> Richard >> >> 2008/10/21 Mark Schreiber >> >>> So if I want to build a BioSQL loader from Genbank then would the >>> classes (or there wrappers) in the BioSQL Entity package need to >>> implement Thing? Would maven have an issue with that or would it just >>> create a dependency on core? (you can tell I've never used Maven >>> right). >>> >>> From a design point of view should Thing be an interface or an >>> Annotation? The reason I ask is that it doesn't define any methods so >>> it is more of a tag than an interface. >>> >>> Anyway, my understanding is that I would use a Genbank parser (or >>> write one). Write a EntityReceiver interface (probably more than one >>> given the number of entities in BioSQL, implement a EntityBuilder >>> (again possibly more than one) that implements EntityReceiver and >>> builds Entity beans from messages it receives. In this case I probably >>> wouldn't provide a writer as JPA would be writing the beans to the >>> database. Would this be how you imagine it? 
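
The reader/receiver/builder split discussed in this thread can be sketched roughly as follows. Everything here is an assumption made for illustration: the interfaces are hypothetical stand-ins that only borrow the names used in the thread (the real biojava-core signatures are not shown in these messages and may well differ), FASTA is used instead of Genbank or BioSQL entities to keep it short, and the reader handles a single record.

```java
import java.util.List;

// Hypothetical stand-ins for the interfaces named in the thread; the real
// biojava-core versions may have different methods and signatures.
interface Thing {}

interface ThingReceiver {
    void startThing();
    void finishThing();
}

interface ThingBuilder<T extends Thing> extends ThingReceiver {
    T getFinishedThing();
}

/** Plain data holder that stays close to the file format. */
final class FastaRecord implements Thing {
    final String header;
    final String sequence;

    FastaRecord(String header, String sequence) {
        this.header = header;
        this.sequence = sequence;
    }
}

/** The events a FASTA source can emit. */
interface FastaReceiver extends ThingReceiver {
    void setHeader(String header);
    void appendSequence(String chunk);
}

/** Receiver that assembles an object from the events it is sent. */
final class FastaBuilder implements FastaReceiver, ThingBuilder<FastaRecord> {
    private String header;
    private StringBuilder sequence;
    private FastaRecord finished;

    public void startThing() {
        header = "";
        sequence = new StringBuilder();
        finished = null;
    }
    public void setHeader(String header) { this.header = header; }
    public void appendSequence(String chunk) { sequence.append(chunk); }
    public void finishThing() { finished = new FastaRecord(header, sequence.toString()); }
    public FastaRecord getFinishedThing() { return finished; }
}

/** Receiver that writes text instead of building an object. */
final class FastaWriter implements FastaReceiver {
    private final StringBuilder out = new StringBuilder();

    public void startThing() {}
    public void setHeader(String header) { out.append('>').append(header).append('\n'); }
    public void appendSequence(String chunk) { out.append(chunk).append('\n'); }
    public void finishThing() {}
    String written() { return out.toString(); }
}

/** Source of events: walks the lines of one record and reports what it sees. */
final class FastaReader {
    static void read(List<String> lines, FastaReceiver receiver) {
        receiver.startThing();
        for (String line : lines) {
            if (line.startsWith(">")) {
                receiver.setHeader(line.substring(1).trim());
            } else if (!line.trim().isEmpty()) {
                receiver.appendSequence(line.trim());
            }
        }
        receiver.finishThing();
    }
}
```

The point of the split is that the reader never knows what happens to the events: handing it a FastaBuilder yields an object, while handing it a FastaWriter yields text. A receiver that builds persistence entities would presumably slot in the same way.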
>>> >>> - Mark >>> >>> >>> On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland >>> wrote: >>>> (From now on I will only be posting these development messages to >>>> biojava-dev, which is the intended purpose of that list. Those of you who >>>> wish to keep track of things but are currently only subscribed to >>> biojava-l >>>> should also subscribe to biojava-dev in order to keep up to date.) >>>> >>>> As promised, I've committed a new package in the biojava-core module that >>>> should help understand how to do file parsing and conversion and writing >>> in >>>> the new BJ3 modules. Here's an example of how to use it to write a >>> Genbank >>>> parser (note no parsers actually exist yet!): >>>> >>>> 1. Design yourself a Genbank class which implements the interface Thing >>> and >>>> can fully represent all the data that might possibly occur inside a >>> Genbank >>>> file. >>>> >>>> 2. Write an interface called GenbankReceiver, which extends ThingReceiver >>>> and defines all the methods you might need in order to construct a >>> Genbank >>>> object in an asynchronous fashion. >>>> >>>> 3. Write a GenbankBuilder class which implements GenbankReceiver and >>>> ThingBuilder. It's job is to receive data via method calls, use that data >>> to >>>> construct a Genbank object, then provide that object on demand. >>>> >>>> 4. Write a GenbankWriter class which implements GenbankReceiver and >>>> ThingWriter. It's job is similar to GenbankBuilder, but instead of >>>> constructing new Genbank objects, it writes Genbank records to file that >>>> reflect the data it receives. >>>> >>>> 5. Write a GenbankReader class which implements ThingReader. It can read >>>> GenbankFiles and output the data to the methods of the ThingReceiver >>>> provided to it, which in this case could be anything which implements the >>>> interface GenbankReceiver. >>>> >>>> 6. Write a GenbankEmitter class which implements ThingEmitter. 
It takes a >>>> Genbank object and will fire off data from it to the provided >>> ThingReceiver >>>> (a GenbankReceiver instance) as if the Genbank object was being read from >>> a >>>> file or some other source. >>>> >>>> That's it! OK so it's a minimum of 6 classes instead of the original 1 or >>> 2, >>>> but the additional steps are necessary for flexibility in converting >>> between >>>> formats. >>>> >>>> Now to use it (you'll probably want a GenbankTools class to wrap these >>> steps >>>> up for user-friendliness, including various options for opening files, >>>> etc.): >>>> >>>> 1. To read a file - instantiate ThingParser with your GenbankReader as >>> the >>>> reader, and GenbankBuilder as the receiver. Use the iterator methods on >>>> ThingParser to get the objects out. >>>> >>>> 2. To write a file - instantiate ThingParser with a GenbankEmitter >>> wrapping >>>> your Genbank object, and a GenbankWriter as the receiver. Use the >>> parseAll() >>>> method on the ThingParser to dump the whole lot to your chosen output. >>>> >>>> The clever bit comes when you want to convert between files. Imagine >>> you've >>>> done all the above for Genbank, and you've also done it for FASTA. How to >>>> convert between them? What you need to do is this: >>>> >>>> 1. Implement all the classes for both Genbank and FASTA. >>>> >>>> 2. Write a GenbankFASTAConverter class that implements >>> ThingConverter >>>> and GenbankReceiver, and will internally convert the data received and >>> pass >>>> it on out to the receiver provided, which will be a FASTAReceiver >>> instance. >>>> 3. Write a FASTAGenbankConverter class that operates in exactly the >>> opposite >>>> way, implementing ThingConverter and FASTAReceiver. >>>> >>>> Then to convert you use ThingParser again: >>>> >>>> 1. From FASTA file to Genbank object: Instantiate ThingParser with a >>>> FASTAReader reader, a GenbankBuilder receiver, and add a >>>> FASTAGenbankConverter instance to the converter chain. 
Use the iterator >>> to >>>> get your Genbank objects out of your FASTA file. >>>> >>>> 2. From FASTA file to Genbank file: Same as option 1, but provide a >>>> GenbankWriter instead and use parseAll() instead of the iterator methos. >>>> >>>> 3. From FASTA object to Genbank object: Same as option 1, but provide a >>>> FASTAEmitter wrapping your FASTA object as the reader instead. >>>> >>>> 4. From FASTA object to Genbank file: Same as option 1, but swap both the >>>> reader and the receiver as per options 2 and 3. >>>> >>>> 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all >>> mentions >>>> of FASTA and Genbank, and use GenbankFASTAConverter instead. >>>> >>>> One last and very important feature of this approach is that if you >>> discover >>>> that nobody has written the appropriate converter for your chosen pair of >>>> formats A and C, but converters do exist to map A to some other format B >>> and >>>> that other format B on to C, then you can just put the two converts A-B >>> and >>>> B-C into the ThingParser chain and it'll work perfectly. >>>> >>>> Enjoy! >>>> >>>> cheers, >>>> Richard >>>> >>>> -- >>>> Richard Holland, BSc MBCS >>>> Finance Director, Eagle Genomics Ltd >>>> M: +44 7500 438846 | E: holland at eaglegenomics.com >>>> http://www.eaglegenomics.com/ >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >> >> >> > From augustovmail-java at yahoo.com.br Tue Oct 21 11:45:41 2008 From: augustovmail-java at yahoo.com.br (Augusto Fernandes Vellozo) Date: Tue, 21 Oct 2008 13:45:41 +0200 Subject: [Biojava-l] SimpleRichAnnotation In-Reply-To: <381a3e850810210421u54058163ncf347b57394af1b2@mail.gmail.com> References: <381a3e850810210421u54058163ncf347b57394af1b2@mail.gmail.com> Message-ID: <381a3e850810210445sc801d40ja36655349b5920b9@mail.gmail.com> Hi everyone, I am having problems with the class SimpleRichAnnotation. 
I have a term t from ontology o, and I added a note n (with the term t)
to a SimpleRichAnnotation object a. But when I call the method
a.getProperties(t), it does not return the note n. Looking at the
BioJava code, I saw that getProperties imports the term t into the
default ontology before doing the search, which is why it does not
return the correct note. Does anyone know why this method changes the
ontology?

Thanks,

--
Augusto F. Vellozo

From charles at imbusch.net Tue Oct 21 14:00:45 2008
From: charles at imbusch.net (Charles Imbusch)
Date: Tue, 21 Oct 2008 16:00:45 +0200
Subject: [Biojava-l] parsing tblastn results
In-Reply-To:
References: <48F50908.5060307@imbusch.net>
Message-ID: <48FDE08D.8000300@imbusch.net>

Thank you David and Richard for the quick replies.

I downloaded two files from
http://bugzilla.open-bio.org/show_bug.cgi?id=2603 and tried to apply
the patches. I suppose that's the way to get the modified
BlastSAXParser.java.

charlie at custodian:~/biojava-live_1.6$ patch -p0 < BlastSAXParser.java.patch
(Stripping trailing CRs from patch.)
patching file src/org/biojava/bio/program/sax/BlastSAXParser.java
Hunk #1 FAILED at 60.
Hunk #2 FAILED at 631.
Hunk #3 FAILED at 643.
Hunk #4 FAILED at 650.
4 out of 4 hunks FAILED -- saving rejects to file src/org/biojava/bio/program/sax/BlastSAXParser.java.rej

and similar for the other file

charlie at custodian:~/biojava-live_1.6$ patch -p0 < HitSectionSAXParser.java.patch
(Stripping trailing CRs from patch.)
patching file src/org/biojava/bio/program/sax/HitSectionSAXParser.java
Hunk #1 FAILED at 41.
Hunk #2 FAILED at 65.
Hunk #3 FAILED at 96.
Hunk #4 FAILED at 515.
Hunk #5 FAILED at 524.
5 out of 5 hunks FAILED -- saving rejects to file src/org/biojava/bio/program/sax/HitSectionSAXParser.java.rej

Obviously something went wrong, but I couldn't figure out what. I
uploaded the rej files to http://charles.imbusch.net/tmp/

Any hint is appreciated.

cheers,
Charles

From crackeur at comcast.net Wed Oct 22 02:21:57 2008
From: crackeur at comcast.net (jimmy Zhang)
Date: Tue, 21 Oct 2008 19:21:57 -0700
Subject: [Biojava-l] [ANN] VTD-XML extended edition released
References: <59a41c430810202017n226327cahefe0ed7e5f6a8df2@mail.gmail.com> <93b45ca50810210226t79cfbcbfhcadaedcfe8735676@mail.gmail.com>
Message-ID: <009401c933ec$f572a700$0402a8c0@your55e5f9e3d2>

The Java version of extended VTD-XML is released and available for
download. This version supports 256 GB max file sizes and memory mapped
capabilities. The updated documentation is also available for download.
In short, you can basically do full XPath query on documents that are
bigger than memory space available on your machine.

A special thanks to Duane May who provided valuable suggestions and
inputs and helped refine the VTD specs to make this happen.

To download the package and the documentation, go to

https://sourceforge.net/project/downloading.php?group_id=110612&use_mirror=&filename=vtd-xml_2.4_doc.zip&64621261

https://sourceforge.net/project/downloading.php?group_id=110612&use_mirror=&filename=ximpleware_extended_2.4.zip&99532507

From pzgyuanf at gmail.com Sun Oct 26 00:57:16 2008
From: pzgyuanf at gmail.com (pprun)
Date: Sun, 26 Oct 2008 08:57:16 +0800
Subject: [Biojava-l] Test failed for Alphabet.getSymbolMatchType method
Message-ID:

Hi,

The current implementation uses the same condition equalsIgnoreCase for
EXACT_STRING_MATCH and MIXED_CASE_MATCH

    public SymbolMatchType getSymbolMatchType(Symbol a, Symbol b) {
        ...
        if (a.toString().equalsIgnoreCase(b.toString())) {
            return SymbolMatchType.EXACT_STRING_MATCH;
        }
        if (a.toString().equalsIgnoreCase(b.toString())) {
            return SymbolMatchType.MIXED_CASE_MATCH;
        }
        ...

String.equals should be used for EXACT_STRING_MATCH:

    public SymbolMatchType getSymbolMatchType(Symbol a, Symbol b) {
        ...
        if (a.toString().equals(b.toString())) {
            return SymbolMatchType.EXACT_STRING_MATCH;
        }
        if (a.toString().equalsIgnoreCase(b.toString())) {
            return SymbolMatchType.MIXED_CASE_MATCH;
        }
        ...

The test case used to identify the above bug is:

    /*
     * BioJava development code
     *
     * This code may be freely distributed and modified under the
     * terms of the GNU Lesser General Public Licence. This should
     * be distributed with the code. If you do not have a copy,
     * see:
     *
     * http://www.gnu.org/copyleft/lesser.html
     *
     * Copyright for this code is held jointly by the individual
     * authors. These should be listed in @author doc comments.
     *
     * For more information on the BioJava project and its aims,
     * or to join the biojava-l mailing list, visit the home page
     * at:
     *
     * http://www.biojava.org/
     *
     */
    package org.biojava.core.symbol;

    import org.junit.After;
    import org.junit.AfterClass;
    import org.junit.Before;
    import org.junit.BeforeClass;
    import org.junit.Test;
    import static org.junit.Assert.*;

    /**
     *
     * @author pprun
     */
    public class AlphabetTest {

        public AlphabetTest() {
        }

        @BeforeClass
        public static void setUpClass() throws Exception {
        }

        @AfterClass
        public static void tearDownClass() throws Exception {
        }

        @Before
        public void setUp() {
        }

        @After
        public void tearDown() {
        }

        /**
         * Test of getSymbolMatchType method, of class Alphabet.
         */
        @Test
        public void testGetSymbolMatchType() {
            System.out.println("getSymbolMatchType");
            Alphabet testAlphabet = new Alphabet("testGetSymbolMatchType");

            // 1. exact match
            Symbol a = Symbol.get("ATGC");
            Symbol b = Symbol.get("ATGC");
            SymbolMatchType expResult = SymbolMatchType.EXACT_MATCH;
            SymbolMatchType result = testAlphabet.getSymbolMatchType(a, b);
            assertEquals(expResult, result);

            // 2. mixed case match
            a = Symbol.get("ATGC");
            b = Symbol.get("aTGC");
            expResult = SymbolMatchType.MIXED_CASE_MATCH;
            result = testAlphabet.getSymbolMatchType(a, b);
            assertEquals(expResult, result);
        }
    }

BTW, how can I get the dev/test role?
Then I can contribute to the development or testing of BJ3 (I'm still
a beginner in the bio field).

Thanks,
Pprun

From gabrielle_doan at gmx.net Mon Oct 27 12:57:03 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Mon, 27 Oct 2008 13:57:03 +0100
Subject: [Biojava-l] differences between read in sequence and stored sequence in database
Message-ID: <4905BA9F.1060400@gmx.net>

Hi all,

I have a BioSQL database which contains all human chromosomes. For my
recent project I have to query for a part of a sequence.
As far as I know I can get the whole sequence from the entry
Biosequence.Seq in the BioSQL schema. So I've made this query:

SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;

But this query didn't yield the desired string, because the length of
this biosequence is only 100,000,020 bp. I am very confused about why I
get such a discrepancy. I have added all chromosomes to the database
with the built-in BioJava method addRichSequence(RichSequence seq).
From my raw data I know that this sequence should have a length of
140,279,252 bp. So where is the remaining part of my sequence?
I have observed these discrepancies on all chromosomes which are longer
than 100,000,020 bp. Here is an excerpt from my database:

bioentry_id  description                                                          length
2            Homo sapiens mitochondrion, complete genome.                         16571
3            Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
4            Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
5            Homo sapiens chromosome 22, reference assembly, complete sequence.   49691432
6            Homo sapiens chromosome 21, reference assembly, complete sequence.   46944323
7            Homo sapiens chromosome 20, reference assembly, complete sequence.   25960004
8            Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
9            Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020

Sequences smaller than 100,000,020 bp are correctly stored under
Biosequence.seq.
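
A side note on the query above, offered as a sketch and unrelated to the truncation itself: in MySQL, for example, SUBSTRING(str, pos, len) takes a start position and a length, not a start and an end. The message does not say which database engine is used or whether the two numbers are meant as start and end coordinates, so both are assumptions here, and the helper class SubstringArgs is made up for the illustration.

```java
/**
 * Sketch: converts 1-based, inclusive start/end coordinates into the
 * (position, length) pair that SUBSTRING(str, pos, len) expects.
 * The class name and methods are illustrative only.
 */
final class SubstringArgs {
    final long position;
    final long length;

    private SubstringArgs(long position, long length) {
        this.position = position;
        this.length = length;
    }

    static SubstringArgs fromRange(long start, long end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("expected 1 <= start <= end");
        }
        // Inclusive range, hence the +1.
        return new SubstringArgs(start, end - start + 1);
    }

    /** Renders the call for a given column expression, e.g. "bs.seq". */
    String toSql(String column) {
        return "SUBSTRING(" + column + ", " + position + ", " + length + ")";
    }
}
```

Under those assumptions the range 131615042..131626262 would be requested as a position and a length, for instance via SubstringArgs.fromRange(131615042L, 131626262L).toSql("bs.seq").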

I am grateful for any hints that explain the behaviour of my database.

Cheers,

Gabrielle

From gabrielle_doan at gmx.net Tue Oct 28 14:26:47 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Tue, 28 Oct 2008 15:26:47 +0100
Subject: [Biojava-l] differences between read in sequence and stored sequence in database]
Message-ID: <49072127.7010304@gmx.net>

Hi all,
concerning the problem described below, I have found out that it also
occurred in BioRuby and was fixed in 2004.
See:
http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
Unfortunately I'm clueless about BioRuby. Does anybody recognize this
problem or understand how it was solved in BioRuby?

I am grateful for any hints.

Cheers,

Gabrielle

-------- Original Message --------
Subject: [Biojava-l] differences between read in sequence and stored sequence in database
Date: Mon, 27 Oct 2008 13:57:03 +0100
From: Gabrielle Doan
To: biojava-l at biojava.org

Hi all,

I have a BioSQL database which contains all human chromosomes. For my
recent project I have to query for a part of a sequence.
As far as I know I can get the whole sequence from the entry
Biosequence.Seq in the BioSQL schema. So I've made this query:

SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;

But this query didn't yield the desired string, because the length of
this biosequence is only 100,000,020 bp. I am very confused about why I
get such a discrepancy. I have added all chromosomes to the database
with the built-in BioJava method addRichSequence(RichSequence seq).
From my raw data I know that this sequence should have a length of
140,279,252 bp. So where is the remaining part of my sequence?
I have observed these discrepancies on all chromosomes which are longer
than 100,000,020 bp. Here is an excerpt from my database:

bioentry_id  description                                                          length
2            Homo sapiens mitochondrion, complete genome.                         16571
3            Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
4            Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
5            Homo sapiens chromosome 22, reference assembly, complete sequence.   49691432
6            Homo sapiens chromosome 21, reference assembly, complete sequence.   46944323
7            Homo sapiens chromosome 20, reference assembly, complete sequence.   25960004
8            Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
9            Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020

Sequences smaller than 100,000,020 bp are correctly stored under
Biosequence.seq.

I am grateful for any hints that explain the behaviour of my database.

Cheers,

Gabrielle
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From dtoomey at rcsi.ie Wed Oct 29 10:45:45 2008
From: dtoomey at rcsi.ie (David Toomey)
Date: Wed, 29 Oct 2008 10:45:45 +0000
Subject: [Biojava-l] How to get full query description from blast result
Message-ID:

Hi

I am parsing blast results and I need to get the complete query
description line, but I can only work out how to get the first part of
the line. So for example in the blast result query

Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium falciparum (isolate 3D7) GN=ABRA

I need to get all of the description above, but I can only seem to
retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7', which I get from the
queryId property of the annotation.

Can anyone point me in the right direction for retrieving the complete
query description?

Thanks

Dave

From holland at eaglegenomics.com Thu Oct 30 14:07:42 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 30 Oct 2008 14:07:42 +0000
Subject: [Biojava-l] differences between read in sequence and stored sequence in database]
In-Reply-To: <49072127.7010304@gmx.net>
References: <49072127.7010304@gmx.net>
Message-ID:

Hello. Sorry for the delayed reply - I've been away on business all week.
The similar Ruby issue (and solution) is discussed here:

http://portal.open-bio.org/pipermail/bioruby/2004-March.txt

How did you parse the files in the first place? Did you use the new GenBank parsers (BJX), or the older ones? This will help indicate where the problem lies - the data will have been truncated at the point it was parsed from file, so the data in your database will reflect this and you'll have to reload it once the appropriate parser has been fixed.

If it was the newer BJX parser, then the problem most probably lies in this regex from org.biojavax.bio.seq.io.GenbankFormat, which can probably be fixed in a similar manner to the Ruby equivalent discussed in the posting above:

    protected static final Pattern sectp = Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

Could someone volunteer to develop and test a fix? If you come up with something, please commit it to the SVN trunk.

cheers,
Richard

2008/10/28 Gabrielle Doan:
> Hi all,
> concerning the problem described below, I have found out that this problem
> also occurred in BioRuby and was fixed in 2004.
> See:
> http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
> Unfortunately I'm clueless about BioRuby. Does anybody recognize this
> problem or understand how it was solved in BioRuby?
>
> I am grateful for any hints.
>
> Cheers,
>
> Gabrielle
>
>
> -------- Original Message --------
> Subject: [Biojava-l] differences between read in sequence and stored
> sequence in database
> Date: Mon, 27 Oct 2008 13:57:03 +0100
> From: Gabrielle Doan
> To: biojava-l at biojava.org
>
> Hi all,
>
> I have a BioSQL database which contains all human chromosomes. For my
> recent project I have to query for a part of a sequence.
> As far as I know I can get the whole sequence from the entry
> Biosequence.Seq in the BioSQL schema.
So I've made this query: > > SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs; > > But this query hasn't yield the desired string, because the length of > this biosequence is only 100,000,020 bp. I am very confused why I get > such a discrepancy. I have added all chromosomes with the build in > method in BioJava addRichSequence(RichSequence seq) to the database. > From my raw data I know that this sequence should have a length of > 140,279,252 bp. So where is the remaining part of my sequence? I have > observed these discrepancies on all chromsomes which are longer than > 100,000,020 bp. > > Here is an abstract of my database: > bioentry_id description length > 2 Homo sapiens mitochondrion, complete genome. 16571 > 3 Homo sapiens chromosome Y, reference assembly, complete sequence. > 57772954 > 4 Homo sapiens chromosome X, reference assembly, complete sequence. > 100000020 > 5 Homo sapiens chromosome 22, reference assembly, complete sequence. > 49691432 > 6 Homo sapiens chromosome 21, reference assembly, complete sequence. > 46944323 > 7 Homo sapiens chromosome 20, reference assembly, complete sequence. > 25960004 > 8 Homo sapiens chromosome 9, reference assembly, complete sequence. > 100000020 > 9 Homo sapiens chromosome 7, reference assembly, complete sequence. > 100000020 > > Sequences smaller than 100,000,020 bp are correctly stored under > Biosequence.seq. > > I am grateful for any hints, which explain the behaviour of my database. 
> > Cheers, > > Gabrielle > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Thu Oct 30 14:10:12 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 30 Oct 2008 14:10:12 +0000 Subject: [Biojava-l] How to get full query description from blast result In-Reply-To: References: Message-ID: Good question! Can someone who knows a lot about the blast parser internals provide David with an answer to his question? cheers, Richard 2008/10/29 David Toomey : > Hi > > I am parsing blast results and I need to get the complete query description line but I can only work out how to get the first part of the line. So for example in the blast result query > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > falciparum (isolate 3D7) GN=ABRA > > I need to get all of the description above but I can only seem to retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the queryId property of the annotation > > Can anyone point me in the right direction for retrieving the complete query description? 
> > Thanks > > Dave > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From markjschreiber at gmail.com Fri Oct 31 07:26:35 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 31 Oct 2008 15:26:35 +0800 Subject: [Biojava-l] differences between read in sequence and stored sequence in database In-Reply-To: <4905BA9F.1060400@gmx.net> References: <4905BA9F.1060400@gmx.net> Message-ID: <93b45ca50810310026o6ee35a61sf2815c3547e1e679@mail.gmail.com> Could this be a database implementation issue? Is there a limit on how long a field can be in your DB? - Mark On Mon, Oct 27, 2008 at 8:57 PM, Gabrielle Doan wrote: > > Hi all, > > I have a BioSQL database which contains all human chromsomes. For my recent project I have to query for a part of a sequence. > As far as I know I can get the whole sequence from the entry Biosequence.Seq in the BioSQL schema. So I've made this query: > > SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs; > > But this query hasn't yield the desired string, because the length of this biosequence is only 100,000,020 bp. I am very confused why I get such a discrepancy. I have added all chromosomes with the build in method in BioJava addRichSequence(RichSequence seq) to the database. From my raw data I know that this sequence should have a length of 140,279,252 bp. So where is the remaining part of my sequence? I have observed these discrepancies on all chromsomes which are longer than 100,000,020 bp. > > Here is an abstract of my database: > bioentry_id description length > 2 Homo sapiens mitochondrion, complete genome. 16571 > 3 Homo sapiens chromosome Y, reference assembly, complete sequence. 
57772954 > 4 Homo sapiens chromosome X, reference assembly, complete sequence. 100000020 > 5 Homo sapiens chromosome 22, reference assembly, complete sequence. 49691432 > 6 Homo sapiens chromosome 21, reference assembly, complete sequence. 46944323 > 7 Homo sapiens chromosome 20, reference assembly, complete sequence. 25960004 > 8 Homo sapiens chromosome 9, reference assembly, complete sequence. 100000020 > 9 Homo sapiens chromosome 7, reference assembly, complete sequence. 100000020 > > Sequences smaller than 100,000,020 bp are correctly stored under Biosequence.seq. > > I am grateful for any hints, which explain the behaviour of my database. > > Cheers, > > Gabrielle > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From markjschreiber at gmail.com Fri Oct 31 08:00:35 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 31 Oct 2008 16:00:35 +0800 Subject: [Biojava-l] How to get full query description from blast result In-Reply-To: References: Message-ID: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com> Hi - If you use the BlastEcho program on the cookbook pages you can find out if and how the information is being parsed and where it goes. It is possible it is not parsed. In this case you could add a feature request. - Mark On Thu, Oct 30, 2008 at 10:10 PM, Richard Holland wrote: > > Good question! > > Can someone who knows a lot about the blast parser internals provide > David with an answer to his question? > > cheers, > Richard > > 2008/10/29 David Toomey : > > Hi > > > > I am parsing blast results and I need to get the complete query description line but I can only work out how to get the first part of the line. 
So for example in the blast result query > > > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > > falciparum (isolate 3D7) GN=ABRA > > > > I need to get all of the description above but I can only seem to retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the queryId property of the annotation > > > > Can anyone point me in the right direction for retrieving the complete query description? > > > > Thanks > > > > Dave > > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From community at struck.lu Fri Oct 31 10:05:00 2008 From: community at struck.lu (community at struck.lu) Date: Fri, 31 Oct 2008 11:05:00 +0100 Subject: [Biojava-l] SCF: support for ambiguities Message-ID: Hello, I am using the SCF class in the context of HIV-1 population sequencing. In this context we do have sometimes ambiguous base calls. To support them I extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. 
Therefore I simply added the following code to the "decode" function:

#########################
public Symbol decode(byte call) throws IllegalSymbolException {

    //get the DNA Alphabet
    Alphabet dna = DNATools.getDNA();

    char c = (char) call;
    switch (c) {
        case 'a':
        case 'A':
            return DNATools.a();
        case 'c':
        case 'C':
            return DNATools.c();
        case 'g':
        case 'G':
            return DNATools.g();
        case 't':
        case 'T':
            return DNATools.t();
        case 'n':
        case 'N':
            return DNATools.n();
        case '-':
            return DNATools.getDNA().getGapSymbol();
        case 'w':
        case 'W':
            //make the 'W' symbol
            Set symbolsThatMakeW = new HashSet();
            symbolsThatMakeW.add(DNATools.a());
            symbolsThatMakeW.add(DNATools.t());
            Symbol w = dna.getAmbiguity(symbolsThatMakeW);
            return w;
        case 's':
        case 'S':
            //make the 'S' symbol
            Set symbolsThatMakeS = new HashSet();
            symbolsThatMakeS.add(DNATools.c());
            symbolsThatMakeS.add(DNATools.g());
            Symbol s = dna.getAmbiguity(symbolsThatMakeS);
            return s;
... (and so on)
#########################

Is this the right way to do it? And if so, how can this code be submitted to the official biojava source code?


Best regards,
Daniel Struck
_________________________________________________________
Mail sent using root eSolutions Webmailer - www.root.lu

From dtoomey at rcsi.ie Fri Oct 31 12:07:19 2008
From: dtoomey at rcsi.ie (David Toomey)
Date: Fri, 31 Oct 2008 12:07:19 +0000
Subject: [Biojava-l] How to get full query description from blast result
In-Reply-To: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
References: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
Message-ID:

Hi Mark

I tried that and it appears that it is not being parsed. Only the portion of the line up to the first space is returned as queryId. The rest of the line is not returned.

Could this be added to the blast parser?
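
To make the request concrete, here is a minimal sketch in plain Java of the split being asked for. It does not use or modify the BioJava parser classes; the class QueryLineSplit and the method splitQueryLine are made-up names for illustration only, and the sketch assumes the wrapped "Query=" line has already been joined into a single string.

```java
class QueryLineSplit {

    /**
     * Splits a BLAST "Query=" line into { id, description }.
     * Hypothetical helper for illustration, not part of BioJava.
     */
    static String[] splitQueryLine(String line) {
        String rest = line.trim();
        if (rest.startsWith("Query=")) {
            rest = rest.substring("Query=".length()).trim();
        }
        // Limit of 2: split only at the first run of whitespace, so the
        // first element is the id and the second is everything after it.
        String[] parts = rest.split("\\s+", 2);
        String id = parts[0];
        String description = parts.length > 1 ? parts[1].trim() : "";
        return new String[] { id, description };
    }

    public static void main(String[] args) {
        String line = "Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen "
                + "OS=Plasmodium falciparum (isolate 3D7) GN=ABRA";
        String[] parts = splitQueryLine(line);
        System.out.println("id:          " + parts[0]);
        System.out.println("description: " + parts[1]);
    }
}
```

The first element corresponds to what is currently returned as queryId; the second is the part that is being dropped.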
Cheers Dave -----Original Message----- From: Mark Schreiber [mailto:markjschreiber at gmail.com] Sent: 31 October 2008 08:01 To: holland at eaglegenomics.com Cc: David Toomey; biojava-l at biojava.org Subject: Re: [Biojava-l] How to get full query description from blast result Hi - If you use the BlastEcho program on the cookbook pages you can find out if and how the information is being parsed and where it goes. It is possible it is not parsed. In this case you could add a feature request. - Mark On Thu, Oct 30, 2008 at 10:10 PM, Richard Holland wrote: > > Good question! > > Can someone who knows a lot about the blast parser internals provide > David with an answer to his question? > > cheers, > Richard > > 2008/10/29 David Toomey : > > Hi > > > > I am parsing blast results and I need to get the complete query description line but I can only work out how to get the first part of the line. So for example in the blast result query > > > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > > falciparum (isolate 3D7) GN=ABRA > > > > I need to get all of the description above but I can only seem to retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the queryId property of the annotation > > > > Can anyone point me in the right direction for retrieving the complete query description? 
> >
> > Thanks
> >
> > Dave
> >
> >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From simon.foote at nrc-cnrc.gc.ca Fri Oct 31 11:56:30 2008
From: simon.foote at nrc-cnrc.gc.ca (Simon Foote)
Date: Fri, 31 Oct 2008 07:56:30 -0400
Subject: [Biojava-l] How to get full query description from blast result
In-Reply-To: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
References: <93b45ca50810310100w5e922161iaf79469050afbc3c@mail.gmail.com>
Message-ID: <490AF26E.7000604@nrc-cnrc.gc.ca>

Mark is right.

A quick look at the code shows that for the query line, it extracts everything up to the first whitespace and puts that into the queryId, and everything else is discarded.
To get the full description, some additional code is needed to populate a queryDescription with everything from the query line up to the query length information, which is contained in parentheses.

Simon

Bioinformatics Specialist
Institute for Biological Sciences | Institut des sciences biologiques
National Research Council of Canada | Conseil national de recherches Canada
Ottawa, Canada K1A 0R6
Telephone | Téléphone 613-990-3600 / Facsimile | Télécopieur 613-990-9092
Government of Canada | Gouvernement du Canada

Mark Schreiber wrote:
>
> Hi -
>
> If you use the BlastEcho program on the cookbook pages you can find
> out if and how the information is being parsed and where it goes.
>
> It is possible it is not parsed. In this case you could add a feature
> request.
> > - Mark > > On Thu, Oct 30, 2008 at 10:10 PM, Richard Holland > wrote: > > > > Good question! > > > > Can someone who knows a lot about the blast parser internals provide > > David with an answer to his question? > > > > cheers, > > Richard > > > > 2008/10/29 David Toomey : > > > Hi > > > > > > I am parsing blast results and I need to get the complete query > description line but I can only work out how to get the first part of > the line. So for example in the blast result query > > > > > > Query= sp|Q8I5D2|ABRA_PLAF7 101 kDa malaria antigen OS=Plasmodium > > > falciparum (isolate 3D7) GN=ABRA > > > > > > I need to get all of the description above but I can only seem to > retrieve the first part 'sp|Q8I5D2|ABRA_PLAF7' which I get from the > queryId property of the annotation > > > > > > Can anyone point me in the right direction for retrieving the > complete query description? > > > > > > Thanks > > > > > > Dave > > > > > > > > > _______________________________________________ > > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > > > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > M: +44 7500 438846 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From benb at fruitfly.org Fri Oct 31 13:38:32 2008 From: benb at fruitfly.org (Ben Berman) Date: Fri, 31 Oct 2008 06:38:32 -0700 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: Is there a reason why IUPAC ambiguity codes have never been added to DNATools? Would it hurt the performance of symbol lookups? 
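
On the lookup side of that question, one option is a table of ambiguity codes that is built once when the class loads, so that decoding a base call is a single map lookup. The sketch below is plain Java and models bases as characters rather than BioJava Symbol objects; the class IupacTable and its methods are hypothetical names, not part of DNATools or SCF, and only the six two-base IUPAC codes are included.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IupacTable {

    // Built once at class-load time; never rebuilt per base call.
    private static final Map<Character, Set<Character>> TWO_BASE_CODES = build();

    private static Map<Character, Set<Character>> build() {
        Map<Character, Set<Character>> m = new HashMap<Character, Set<Character>>();
        put(m, 'R', "AG");
        put(m, 'Y', "CT");
        put(m, 'S', "CG");
        put(m, 'W', "AT");
        put(m, 'K', "GT");
        put(m, 'M', "AC");
        return Collections.unmodifiableMap(m);
    }

    private static void put(Map<Character, Set<Character>> m, char code, String bases) {
        Set<Character> s = new TreeSet<Character>();
        for (char b : bases.toCharArray()) {
            s.add(b);
        }
        m.put(code, Collections.unmodifiableSet(s));
    }

    /** Returns the bases for a two-base code (case-insensitive), or null if the call is not one. */
    static Set<Character> basesFor(byte call) {
        char c = Character.toUpperCase((char) call);
        return TWO_BASE_CODES.get(c);
    }

    public static void main(String[] args) {
        System.out.println("w -> " + basesFor((byte) 'w'));
        System.out.println("S -> " + basesFor((byte) 'S'));
    }
}
```

Whether this has any measurable effect inside DNATools itself would need to be tested; the sketch only shows the build-once, look-up-many pattern.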
On Oct 31, 2008, at 3:05 AM, community at struck.lu wrote: > Hello, > > > I am using the SCF class in the context of HIV-1 population > sequencing. In > this context we do have sometimes ambiguous base calls. To support > them I > extended the SCF class to allow for IUPAC ambiguities up to 2 > nucleotides. > > Therefore I simply added the following code to the "decode" function: > > ######################### > public Symbol decode(byte call) throws IllegalSymbolException { > > //get the DNA Alphabet > Alphabet dna = DNATools.getDNA(); > > char c = (char) call; > switch (c) { > case 'a': > case 'A': > return DNATools.a(); > case 'c': > case 'C': > return DNATools.c(); > case 'g': > case 'G': > return DNATools.g(); > case 't': > case 'T': > return DNATools.t(); > case 'n': > case 'N': > return DNATools.n(); > case '-': > return DNATools.getDNA().getGapSymbol(); > case 'w': > case 'W': > //make the 'W' symbol > Set symbolsThatMakeW = new HashSet(); > symbolsThatMakeW.add(DNATools.a()); > symbolsThatMakeW.add(DNATools.t()); > Symbol w = dna.getAmbiguity(symbolsThatMakeW); > return w; > case 's': > case 'S': > //make the 'S' symbol > Set symbolsThatMakeS = new HashSet(); > symbolsThatMakeS.add(DNATools.c()); > symbolsThatMakeS.add(DNATools.g()); > Symbol s = dna.getAmbiguity(symbolsThatMakeS); > return s; > ... (and so on) > ######################### > > Is this the right way to do it? And if so, how can this code be > submitted to > the official biojava source code? > > > Best regards, > Daniel Struck > _________________________________________________________ > Mail sent using root eSolutions Webmailer - www.root.lu > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > ---- Ben Berman, PhD Research Associate, USC Epigenome Center Harlyne J. Norris Research Tower 1450 Biggy St. 
Room #G511, MC 9601 Los Angeles, CA 90033 From holland at eaglegenomics.com Fri Oct 31 13:56:54 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 31 Oct 2008 13:56:54 +0000 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: It is the correct method, yes. However your code constructs a new hash set every time it does the check for W or S etc.. It would be much more efficient to create class-static references to the ambiguity symbols you need, instead of (re)creating them every time they're encountered. A class-static gap symbol reference would also be good in this situation. cheers, Richard 2008/10/31 community at struck.lu : > Hello, > > > I am using the SCF class in the context of HIV-1 population sequencing. In > this context we do have sometimes ambiguous base calls. To support them I > extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. > > Therefore I simply added the following code to the "decode" function: > > ######################### > public Symbol decode(byte call) throws IllegalSymbolException { > > //get the DNA Alphabet > Alphabet dna = DNATools.getDNA(); > > char c = (char) call; > switch (c) { > case 'a': > case 'A': > return DNATools.a(); > case 'c': > case 'C': > return DNATools.c(); > case 'g': > case 'G': > return DNATools.g(); > case 't': > case 'T': > return DNATools.t(); > case 'n': > case 'N': > return DNATools.n(); > case '-': > return DNATools.getDNA().getGapSymbol(); > case 'w': > case 'W': > //make the 'W' symbol > Set symbolsThatMakeW = new HashSet(); > symbolsThatMakeW.add(DNATools.a()); > symbolsThatMakeW.add(DNATools.t()); > Symbol w = dna.getAmbiguity(symbolsThatMakeW); > return w; > case 's': > case 'S': > //make the 'S' symbol > Set symbolsThatMakeS = new HashSet(); > symbolsThatMakeS.add(DNATools.c()); > symbolsThatMakeS.add(DNATools.g()); > Symbol s = dna.getAmbiguity(symbolsThatMakeS); > return s; > ... 
(and so on) > ######################### > > Is this the right way to do it? And if so, how can this code be submitted to > the official biojava source code? > > > Best regards, > Daniel Struck > _________________________________________________________ > Mail sent using root eSolutions Webmailer - www.root.lu > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Fri Oct 31 14:40:10 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 31 Oct 2008 14:40:10 +0000 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: It would be fine to add them there too. You'd still need to modify the SCF parser though in order for it to be able to know about them. cheers, Richard 2008/10/31 Ben Berman : > > Is there a reason why IUPAC ambiguity codes have never been added to > DNATools? Would it hurt the performance of symbol lookups? > > > On Oct 31, 2008, at 3:05 AM, community at struck.lu wrote: > >> Hello, >> >> >> I am using the SCF class in the context of HIV-1 population sequencing. In >> this context we do have sometimes ambiguous base calls. To support them I >> extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. 
>> >> Therefore I simply added the following code to the "decode" function: >> >> ######################### >> public Symbol decode(byte call) throws IllegalSymbolException { >> >> //get the DNA Alphabet >> Alphabet dna = DNATools.getDNA(); >> >> char c = (char) call; >> switch (c) { >> case 'a': >> case 'A': >> return DNATools.a(); >> case 'c': >> case 'C': >> return DNATools.c(); >> case 'g': >> case 'G': >> return DNATools.g(); >> case 't': >> case 'T': >> return DNATools.t(); >> case 'n': >> case 'N': >> return DNATools.n(); >> case '-': >> return DNATools.getDNA().getGapSymbol(); >> case 'w': >> case 'W': >> //make the 'W' symbol >> Set symbolsThatMakeW = new HashSet(); >> symbolsThatMakeW.add(DNATools.a()); >> symbolsThatMakeW.add(DNATools.t()); >> Symbol w = dna.getAmbiguity(symbolsThatMakeW); >> return w; >> case 's': >> case 'S': >> //make the 'S' symbol >> Set symbolsThatMakeS = new HashSet(); >> symbolsThatMakeS.add(DNATools.c()); >> symbolsThatMakeS.add(DNATools.g()); >> Symbol s = dna.getAmbiguity(symbolsThatMakeS); >> return s; >> ... (and so on) >> ######################### >> >> Is this the right way to do it? And if so, how can this code be submitted >> to >> the official biojava source code? >> >> >> Best regards, >> Daniel Struck >> _________________________________________________________ >> Mail sent using root eSolutions Webmailer - www.root.lu >> >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > ---- > Ben Berman, PhD > Research Associate, USC Epigenome Center > Harlyne J. Norris Research Tower > 1450 Biggy St. 
> Room #G511, MC 9601
> Los Angeles, CA 90033
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From community at struck.lu Fri Oct 31 16:06:45 2008
From: community at struck.lu (community at struck.lu)
Date: Fri, 31 Oct 2008 17:06:45 +0100
Subject: [Biojava-l] SCF: support for ambiguities
Message-ID:

True. It was a first quick and dirty hack to get the rest of my project going.

I think adding support for the IUPAC ambiguities to DNATools would be the most appropriate solution. The SCF class can then easily be adapted.

Are there any plans to do so?
If not, I could give it a try and submit a patch for DNATools and SCF.

Greetings,
Daniel

"Richard Holland" wrote:

> It is the correct method, yes.
>
> However your code constructs a new hash set every time it does the
> check for W or S etc.. It would be much more efficient to create
> class-static references to the ambiguity symbols you need, instead of
> (re)creating them every time they're encountered. A class-static gap
> symbol reference would also be good in this situation.
>
> cheers,
> Richard
>
>
>
> 2008/10/31 community at struck.lu :
> > Hello,
> >
> >
> > I am using the SCF class in the context of HIV-1 population sequencing. In
> > this context we do have sometimes ambiguous base calls. To support them I
> > extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides.
> > > > Therefore I simply added the following code to the "decode" function: > > > > ######################### > > public Symbol decode(byte call) throws IllegalSymbolException { > > > > //get the DNA Alphabet > > Alphabet dna = DNATools.getDNA(); > > > > char c = (char) call; > > switch (c) { > > case 'a': > > case 'A': > > return DNATools.a(); > > case 'c': > > case 'C': > > return DNATools.c(); > > case 'g': > > case 'G': > > return DNATools.g(); > > case 't': > > case 'T': > > return DNATools.t(); > > case 'n': > > case 'N': > > return DNATools.n(); > > case '-': > > return DNATools.getDNA().getGapSymbol(); > > case 'w': > > case 'W': > > //make the 'W' symbol > > Set symbolsThatMakeW = new HashSet(); > > symbolsThatMakeW.add(DNATools.a()); > > symbolsThatMakeW.add(DNATools.t()); > > Symbol w = dna.getAmbiguity(symbolsThatMakeW); > > return w; > > case 's': > > case 'S': > > //make the 'S' symbol > > Set symbolsThatMakeS = new HashSet(); > > symbolsThatMakeS.add(DNATools.c()); > > symbolsThatMakeS.add(DNATools.g()); > > Symbol s = dna.getAmbiguity(symbolsThatMakeS); > > return s; > > ... (and so on) > > ######################### > > > > Is this the right way to do it? And if so, how can this code be submitted to > > the official biojava source code? 
> > > > > > Best regards, > > Daniel Struck > > _________________________________________________________ > > Mail sent using root eSolutions Webmailer - www.root.lu > > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > _________________________________________________________ Mail sent using root eSolutions Webmailer - www.root.lu From holland at eaglegenomics.com Fri Oct 31 16:14:30 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 31 Oct 2008 16:14:30 +0000 Subject: [Biojava-l] SCF: support for ambiguities In-Reply-To: References: Message-ID: A patch would be much appreciated! cheers, Richard 2008/10/31 community at struck.lu : > True. It was a first quick and dirty hack to get the rest of my project going. > > I think adding support of the IUPAC ambiguities to DNATools would be the most > approbate solution. The SCF class can then easily be adapted. > > Are there any plans to do so? > If not, I could give it a try and submit a patch for DNATools and SCF. > > Greetings, > Daniel > > "Richard Holland" wrote: > >> It is the correct method, yes. >> >> However your code constructs a new hash set every time it does the >> check for W or S etc.. It would be much more efficient to create >> class-static references to the ambiguity symbols you need, instead of >> (re)creating them every time they're encountered. A class-static gap >> symbol reference would also be good in this situation. >> >> cheers, >> Richard >> >> >> >> 2008/10/31 community at struck.lu : >> > Hello, >> > >> > >> > I am using the SCF class in the context of HIV-1 population sequencing. In >> > this context we do have sometimes ambiguous base calls. To support them I >> > extended the SCF class to allow for IUPAC ambiguities up to 2 nucleotides. 
>> > >> > Therefore I simply added the following code to the "decode" function: >> > >> > ######################### >> > public Symbol decode(byte call) throws IllegalSymbolException { >> > >> > //get the DNA Alphabet >> > Alphabet dna = DNATools.getDNA(); >> > >> > char c = (char) call; >> > switch (c) { >> > case 'a': >> > case 'A': >> > return DNATools.a(); >> > case 'c': >> > case 'C': >> > return DNATools.c(); >> > case 'g': >> > case 'G': >> > return DNATools.g(); >> > case 't': >> > case 'T': >> > return DNATools.t(); >> > case 'n': >> > case 'N': >> > return DNATools.n(); >> > case '-': >> > return DNATools.getDNA().getGapSymbol(); >> > case 'w': >> > case 'W': >> > //make the 'W' symbol >> > Set symbolsThatMakeW = new HashSet(); >> > symbolsThatMakeW.add(DNATools.a()); >> > symbolsThatMakeW.add(DNATools.t()); >> > Symbol w = dna.getAmbiguity(symbolsThatMakeW); >> > return w; >> > case 's': >> > case 'S': >> > //make the 'S' symbol >> > Set symbolsThatMakeS = new HashSet(); >> > symbolsThatMakeS.add(DNATools.c()); >> > symbolsThatMakeS.add(DNATools.g()); >> > Symbol s = dna.getAmbiguity(symbolsThatMakeS); >> > return s; >> > ... (and so on) >> > ######################### >> > >> > Is this the right way to do it? And if so, how can this code be submitted > to >> > the official biojava source code? 
>> >
>> >
>> > Best regards,
>> > Daniel Struck
>> >
>>
>
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From gabrielle_doan at gmx.net Fri Oct 31 15:09:56 2008
From: gabrielle_doan at gmx.net (Gabrielle Doan)
Date: Fri, 31 Oct 2008 15:09:56 -0000
Subject: [Biojava-l] differences between read in sequence and stored sequence in database]
In-Reply-To:
References: <49072127.7010304@gmx.net>
Message-ID: <490B1FB3.7010607@gmx.net>

Hi all,

I've changed the regular expression in org.biojavax.bio.seq.io.GenbankFormat from

protected static final Pattern sectp = Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
<\code>

to

protected static final Pattern sectp = Pattern.compile("^(\\s{0,8}([A-Za-z]+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
<\code>

like in BioRuby
(http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb.diff?r1=0.24&r2=0.25&cvsroot=bioruby).
But then features like D-loop can't be detected, because [A-Za-z]+ does not
match the hyphen in the key. So this is not the solution to my problem.

The reason for the truncation is readSection(BufferedReader br) in
org.biojavax.bio.seq.io.GenbankFormat.

if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0)) {
    // dump out last part of section
    section.add(new String[]{currKey,currVal.toString()});
    br.reset();
    done = true;
<\snip>

The condition in the if-clause treats any line after the first that doesn't
begin with whitespace as the end of the section, so this line will be read

 99999961 cccgcccaca cccctcggcc ctgccctctg gccatacagg ttctcggtgg tgttgaagag
<\snip>

and this line won't be read:

100000021 gtcctcgggc tccggcttgg tgctcacgca cacaggaaag tcagcttctc ctgggagggc
<\snip>

If you change the if-statement to this:

String firstSecKey = section.size() == 0 ? "" : ((String[])section.get(0))[0];
if (line==null || line.length()==0 ||
    (!line.startsWith(" ") && linecount++>0 &&
     ( !firstSecKey.equals(START_SEQUENCE_TAG) || line.startsWith(END_SEQUENCE_TAG))))
<\snip>

you can add the whole sequence to the database without truncation.

I have attached GenbankFormat.java to this mail. Can anybody check the method
for me and commit it? I'm not a BioJava specialist.

Cheers,

Gabrielle

Richard Holland wrote:
> Hello.
>
> Sorry for the delayed reply - I've been away on business all week.
>
> The similar Ruby issue (and solution) is discussed here:
>
> http://portal.open-bio.org/pipermail/bioruby/2004-March.txt
>
> How did you parse the files in the first place? Did you use the new
> GenBank parsers (BJX), or the older ones? This will help indicate
> where the problem lies - the data will have been truncated at the
> point it was parsed from file, so the data in your database will
> reflect this and you'll have to reload it once the appropriate parser
> has been fixed.
>
> If it was the newer BJX parser, then the problem most probably lies in
> this regex from org.biojavax.bio.seq.io.GenbankFormat, which can
> probably be fixed in a similar manner to the Ruby equivalent discussed
> in the posting above:
>
> protected static final Pattern sectp =
> Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
>
> Could someone volunteer to develop and test a fix? If you come up with
> something, please commit it to the SVN trunk.
>
> cheers,
> Richard
>
>
> 2008/10/28 Gabrielle Doan :
>> Hi all,
>> concerning the problem described below, I have found out that this problem
>> also occurred in BioRuby and was fixed in 2004.
>> See:
>> http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
>> Unfortunately I'm clueless about BioRuby. Does anybody recognize this
>> problem or understand how it was solved in BioRuby?
>>
>> I am grateful for any hints.
>>
>> Cheers,
>>
>> Gabrielle
>>
>>
>> -------- Original Message --------
>> Subject: [Biojava-l] differences between read in sequence and stored
>> sequence in database
>> Date: Mon, 27 Oct 2008 13:57:03 +0100
>> From: Gabrielle Doan
>> To: biojava-l at biojava.org
>>
>> Hi all,
>>
>> I have a BioSQL database which contains all human chromosomes. For my
>> current project I have to query for a part of a sequence.
>> As far as I know I can get the whole sequence from the entry
>> Biosequence.Seq in the BioSQL schema. So I've made this query:
>>
>> SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;
>>
>> But this query didn't yield the desired string, because the length of
>> this biosequence is only 100,000,020 bp. I am very confused why I get
>> such a discrepancy. I have added all chromosomes to the database with the
>> built-in BioJava method addRichSequence(RichSequence seq).
>> From my raw data I know that this sequence should have a length of
>> 140,279,252 bp.
>> So where is the remaining part of my sequence? I have
>> observed these discrepancies on all chromosomes which are longer than
>> 100,000,020 bp.
>>
>> Here is an excerpt of my database:
>> bioentry_id  description                                                          length
>> 2            Homo sapiens mitochondrion, complete genome.                         16571
>> 3            Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
>> 4            Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
>> 5            Homo sapiens chromosome 22, reference assembly, complete sequence.   49691432
>> 6            Homo sapiens chromosome 21, reference assembly, complete sequence.   46944323
>> 7            Homo sapiens chromosome 20, reference assembly, complete sequence.   25960004
>> 8            Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
>> 9            Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020
>>
>> Sequences smaller than 100,000,020 bp are correctly stored under
>> Biosequence.seq.
>>
>> I am grateful for any hints that explain the behaviour of my database.
>>
>> Cheers,
>>
>> Gabrielle
>>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GenbankFormat.java
Type: text/x-java
Size: 48624 bytes
Desc: not available
URL:
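
The truncation described in this thread can be reproduced without BioJava. The following is only a minimal sketch under stated assumptions, not the GenbankFormat source: the class name SectionEndDemo, its method names, and the values given to START_SEQUENCE_TAG ("ORIGIN") and END_SEQUENCE_TAG ("//") are hypothetical, and the linecount bookkeeping is simplified to a plain line index. It relies on GenBank sequence lines right-justifying the base number in nine columns, so the line numbered 99999961 begins with a space while the line numbered 100000021 does not. The last line the original condition accepts therefore covers bases 99,999,961 to 100,000,020, which matches the stored length reported above.

```java
public class SectionEndDemo {
    // Assumed values: the thread names these constants but does not show their contents.
    static final String START_SEQUENCE_TAG = "ORIGIN";
    static final String END_SEQUENCE_TAG = "//";

    // Sample section: header, an 8-digit line (leading space), a 9-digit line, end tag.
    static final String[] SAMPLE = {
        "ORIGIN",
        " 99999961 cccgcccaca cccctcggcc ctgccctctg gccatacagg ttctcggtgg tgttgaagag",
        "100000021 gtcctcgggc tccggcttgg tgctcacgca cacaggaaag tcagcttctc ctgggagggc",
        "//"
    };

    /** Original test: any line after the first without a leading space ends the section. */
    static boolean endsSectionOriginal(String line, int lineIndex) {
        return line == null || line.length() == 0
                || (!line.startsWith(" ") && lineIndex > 0);
    }

    /** Patched test: inside the sequence section, only the end tag ends the section. */
    static boolean endsSectionPatched(String line, int lineIndex, String firstKey) {
        return line == null || line.length() == 0
                || (!line.startsWith(" ") && lineIndex > 0
                    && (!firstKey.equals(START_SEQUENCE_TAG)
                        || line.startsWith(END_SEQUENCE_TAG)));
    }

    /** Counts the lines kept in the section (header included) before it is closed. */
    static int linesRead(String[] lines, boolean patched) {
        // The key of the section is the first token of its first line.
        String firstKey = lines[0].trim().split("\\s+")[0];
        int kept = 0;
        for (int i = 0; i < lines.length; i++) {
            boolean end = patched
                    ? endsSectionPatched(lines[i], i, firstKey)
                    : endsSectionOriginal(lines[i], i);
            if (end) {
                break;
            }
            kept++;
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("original condition keeps " + linesRead(SAMPLE, false) + " lines");
        System.out.println("patched condition keeps " + linesRead(SAMPLE, true) + " lines");
    }
}
```

With the sample lines, the original condition closes the section at the 100000021 line, while the patched condition keeps reading until the end tag. Whether the patched condition is the right fix for the real parser still needs checking against the attached GenbankFormat.java.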