From deniz.koellhofer at cambia.org Wed Sep 1 02:21:52 2010 From: deniz.koellhofer at cambia.org (Deniz Koellhofer) Date: Wed, 1 Sep 2010 16:21:52 +1000 Subject: [Biojava-dev] biojava3 BLAST parser In-Reply-To: <47FD5948-0439-45C6-A1AB-22E7CC8D17A6@scripps.edu> References:

<47FD5948-0439-45C6-A1AB-22E7CC8D17A6@scripps.edu> Message-ID: Hi Scooter, I'm currently parsing the BLAST results into plain data containers, but wouldn't mind integrating it more with existing BioJava3 modules if I find some time. Pretty busy at the moment, but I will let you guys know if I get any further. Cheers, Deniz On Wed, Sep 1, 2010 at 12:11 AM, Scooter Willis wrote: > Deniz > > It would be great to formalize the XML blast results as Java classes. Do > you have any interest in taking on the project? > > Capturing the blast alignment using the new alignment classes would be a > very nice feature. I like using XPATH as the query language to select for > hits of interest which should allow for a SAX based approach to minimize the > impact of very large XML files. XPATH and SAX does appear to have some > constraints ( > http://stackoverflow.com/questions/1863250/is-it-there-any-xpath-processor-for-sax-model > ) > > Probably makes sense to have a Blast module that would depend on core and > alignment. > > Thanks > > Scooter > > > > On Aug 31, 2010, at 8:49 AM, Deniz Koellhofer wrote: > > *Hi Scooter,* > * > * > *Thanks for the reply. I guess the BlastXMLQuery is a good example to show > how to quickly extract information from a BLAST result. * > * > * > *But in my opinion biojava3 should alo have a Blast parser that generates > java beans containing the complete Blast result set - similar to what > biojava1.7.1 was doing. So yeah, I'm after translating the XML elements to > Java classes.* > * > * > *Would something like that fit into one of the biojava3 modules? homology, > I/O?* > * > * > *Thanks,* > *Deniz* > * > * > On Tue, Aug 31, 2010 at 8:43 PM, Scooter Willis wrote: > >> Deniz >> >> Can you provide some requirements regarding parsing the Blast XML. I tend >> to use XPATH and the DOM object to get to the data elements of interest so >> you already have the ability to load the Blast XML and work with the data. 
>> The difficulty of "parsing" is not an issue with XML. The BlastXMLQuery is >> an example of searching the Blast XML to get results. Are you wanting the >> XML elements translated to Java classes? >> >> Thanks >> >> Scooter >> >> On Aug 31, 2010, at 2:46 AM, Deniz Koellhofer wrote: >> >> > Hi, >> > >> > I wanted to find out the current state of blast parsing efforts in >> biojava3 >> > - especially for ncbi blastxml output? >> > >> > I had a quick look and found some DOM based code fragments >> > in org.biojava3.genome.query.BlastXMLQuery. Is there already anybody >> working >> > on a more comprehensive SAX parser? >> > >> > The biojava1.7.1 blastxml parser seems to work fine, however some of the >> > tags in NCBI-BLASTN 2.2.23+ output like Hsp_midline, BlastOutput_param >> don't >> > seem to get parsed properly >> > in org.biojava.bio.program.sax.blastxml.BlastXMLParserFacade. >> > >> > Cheers, >> > Deniz >> > >> > -- >> > Deniz Koellhofer >> > Cambia >> > Initiative for Open Innovation (IOI) >> > Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia >> > _______________________________________________ >> > biojava-dev mailing list >> > biojava-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biojava-dev >> >> > > > -- > -- > Deniz Koellhofer > Cambia > Initiative for Open Innovation (IOI) > Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia > > > From bugzilla-daemon at portal.open-bio.org Wed Sep 1 13:59:08 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Sep 2010 13:59:08 -0400 Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader In-Reply-To: Message-ID: <201009011759.o81Hx83i005446@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3132 darnells at dnastar.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | ------- Comment 
#2 from darnells at dnastar.com 2010-09-01 13:59 EST ------- I have applied the patch Amr sent over the weekend and tested it. It's looking pretty good. I tested structures 3CYT, 4HHB, 3AFA, 1MR1, and 2OHX and the following functions work as advertised: Structure: getSites() PDBSite: getSiteID(), getResidues() PDBSite.Residue: getResidueName(), getResidueNumber() I didn't directly test the set* methods, assuming they are working fine since the get* methods work. I noticed that PDBSite implements PDBRecord (similarly as SSBond). The method PDBSite.toPDB() ends in error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) at java.lang.StringBuffer.append(StringBuffer.java:224) at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:39) at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:25) at com.dnastar.structureTest.Cookbook.printSiteIDs(Cookbook.java:36) at com.dnastar.structureTest.Cookbook.main(Cookbook.java:24) That should be corrected so that it is consistent with the SSBond class. I also agree with Jules Jacobsen's comments (via email): "We still need to hook-up the REMARK 800 parser and make some unit tests for this." -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jacobsen at ebi.ac.uk Thu Sep 2 06:27:10 2010 From: jacobsen at ebi.ac.uk (Jules Jacobsen) Date: Thu, 02 Sep 2010 11:27:10 +0100 Subject: [Biojava-dev] Bug 3132: "SITE records in PDBFileReader"??? In-Reply-To: References: <4C73A9A8.9010708@ebi.ac.uk> <4C7667D8.9030000@ebi.ac.uk>

Message-ID: <4C7F7BFE.6090101@ebi.ac.uk> Hi Guys, OK, please don't hack about with this - I'm adding it into the main branch of biojava 3, but things are changing a little. PDBSite has been renamed to Site ('PDB' is a redundant) The Residue inner class in PDBSite I have removed so as to use the Group interface which we have already defined. As Steve says, Site.toPDB() throws an OutOfMemoryError - I'm fixing this and adding tests to Site, the PDBFileParser and other classes which have been touched. Also I'm adding the Remark800 parser and minor cosmetic tweaks for consistent coding style. This should be good to go for Mon/Tues next week. Jules On 01/09/2010 18:57, Steve Darnell wrote: > Amr, > > I have applied the patch you sent over the weekend and tested it. It's > looking pretty good. I tested structures 3CYT, 4HHB, 3AFA, 1MR1, and > 2OHX and the following functions work as advertised: > > Structure: getSites() > PDBSite: getSiteID(), getResidues() > PDBSite.Residue: getResidueName(), getResidueNumber() > > I didn't directly test the set* methods, assuming they are working fine > since the get* methods work. > > I noticed that PDBSite implements PDBRecord (similarly as SSBond). The > method PDBSite.toPDB() ends in error: > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:2882) > at > java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.jav > a:100) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) > at java.lang.StringBuffer.append(StringBuffer.java:224) > at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:39) > at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:25) > at > com.dnastar.structureTest.Cookbook.printSiteIDs(Cookbook.java:36) > at com.dnastar.structureTest.Cookbook.main(Cookbook.java:24) > > That should be corrected so that it is consistent with the SSBond class. > I'll update the bug in Bugzilla with this information. 
> > Regards, > Steve > > CC: Jules Jacobsen (also looking at SITE patch) > > -----Original Message----- > From: Steve Darnell > Sent: Tuesday, August 31, 2010 5:52 PM > To: 'Amr AL-Hossary' > Subject: RE: Bug 3132: "SITE records in PDBFileReader"??? > > Amr, > >> The configuration is related to the PROJECT PROPERTIES, not to Your > system. > > My project/application is using JDK 1.6.0_20 (Windows/Mac). > > >> so all you need is SITE_IDENTIFIER, you don't need EVIDENCE_CODE, right > ? > > I need the SITE_DESCRIPTION from REMARK 800. SITE_IDENTIFIER, which is > the same as the siteID from the SITE record, is not meaningful. > Correct, I do not need the EVIDENCE_CODE. > > >> Sorry? Isn't it already there in the PDBSite bean? >> or what is the difference between site name& site ID? > > For example, PDBID: 3CYT > > SITE, siteID or REMARK 800, SITE_IDENTIFIER: "AC1" > REMARK 800, SITE_DESCRIPTION: "BINDING SITE FOR RESIDUE HEM O 104" > > The description is meaningful, the identifier is not. > > >> please recheck the date. > > My mistake. Wednesday, September 8th, two days after Labor Day (Monday, > September 6th). > > I hope I addressed your questions. Let me know if I can clarify > anything else. > > Regards, > Steve From bugzilla-daemon at portal.open-bio.org Thu Sep 2 13:21:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Sep 2010 13:21:39 -0400 Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader In-Reply-To: Message-ID: <201009021721.o82HLdKj024110@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3132 ------- Comment #3 from jacobsen at ebi.ac.uk 2010-09-02 13:21 EST ------- SITE parser now up and running in DEV - should be available in automated build - access Site objects from Structure.getSites() - Output available in standard String and also proper PDB format (with padded whitespace!) - java.lang.OutOfMemoryError fixed. 
- tests added to PDBFileParserTest and SiteTest

REMARK 800 is being added too.

From bugzilla-daemon at portal.open-bio.org Thu Sep 2 17:32:50 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Sep 2010 17:32:50 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009022132.o82LWoP7005561@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #4 from darnells at dnastar.com 2010-09-02 17:32 EST -------
Using revision 8221 (2010/09/02 17:11:55), I experienced two unit test failures in org.biojava.bio.structure.SiteTest during the build process. Both testToPDB_0args() and testToPDB_StringBuffer() failed because the expected result did not have whitespace padding the line lengths to 80 characters.

In SiteTest.java, I replaced lines 118-119 and 134-135 with the following:

    String expResult =
        "SITE 1 AC1 6 ARG H 221A LYS H 224 HOH H 403 HOH H 460 " + newline +
        "SITE 2 AC1 6 HOH H 464 HOH H 497 " + newline;

I can successfully build afterwards.
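For illustration, the 80-column padding that the failing SiteTest expectation depends on can be sketched as follows. This is a minimal standalone sketch, not the code in Site.toPDB(): the class and method names are hypothetical, and the column spacing of the sample SITE record is only illustrative.

```java
final class PdbLinePaddingSketch {

    /** Fixed width of a record line in the PDB flat-file format. */
    static final int PDB_LINE_WIDTH = 80;

    /**
     * Pads a record with trailing spaces up to 80 columns.
     * Hypothetical helper; not part of the BioJava API.
     */
    static String padToPdbWidth(String record) {
        if (record.length() > PDB_LINE_WIDTH) {
            throw new IllegalArgumentException(
                    "record is longer than " + PDB_LINE_WIDTH + " columns");
        }
        StringBuilder sb = new StringBuilder(record);
        while (sb.length() < PDB_LINE_WIDTH) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative SITE record; the exact spacing is an assumption.
        String padded = padToPdbWidth("SITE     2 AC1  6 HOH H 464  HOH H 497");
        System.out.println("[" + padded + "]");
        System.out.println(padded.length()); // prints 80
    }
}
```

A test that compares against an unpadded literal will fail on output like this even though the record content is identical, which matches the behaviour described in comment #4.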
From bugzilla-daemon at portal.open-bio.org Fri Sep 3 04:48:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Sep 2010 04:48:01 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009030848.o838m1qv018822@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #5 from jacobsen at ebi.ac.uk 2010-09-03 04:48 EST -------
Sorry Steve - I forgot to comment out the wrong bit of the test - the results are correct, but the test was wrong! This is fixed now.

From bugzilla-daemon at portal.open-bio.org Fri Sep 3 06:43:08 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Sep 2010 06:43:08 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009031043.o83Ah8rE008398@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #6 from jacobsen at ebi.ac.uk 2010-09-03 06:43 EST -------
Updated SiteTest and PDBFileParserTest to include REMARK 800 parsing and functionality - new methods added to Site to get/set and print its related REMARK 800 section.
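As a rough illustration of what pairing a SITE identifier with its REMARK 800 description involves, here is a minimal standalone sketch. It is not the BioJava parser: the class and method names are hypothetical, the sample lines are modelled on the 3CYT example quoted earlier in this thread, the EVIDENCE_CODE value is made up, and descriptions that wrap across several lines are deliberately ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Remark800Sketch {

    private static final String PREFIX = "REMARK 800";
    private static final String ID_KEY = "SITE_IDENTIFIER:";
    private static final String DESC_KEY = "SITE_DESCRIPTION:";

    /**
     * Maps each SITE_IDENTIFIER to the SITE_DESCRIPTION that follows it.
     * Hypothetical helper; continuation lines are not handled.
     */
    static Map<String, String> siteDescriptions(String[] lines) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        String currentId = null;
        for (String line : lines) {
            if (!line.startsWith(PREFIX)) {
                continue; // not a REMARK 800 line
            }
            String body = line.substring(PREFIX.length()).trim();
            if (body.startsWith(ID_KEY)) {
                currentId = body.substring(ID_KEY.length()).trim();
            } else if (body.startsWith(DESC_KEY) && currentId != null) {
                result.put(currentId, body.substring(DESC_KEY.length()).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "REMARK 800",
            "REMARK 800 SITE",
            "REMARK 800 SITE_IDENTIFIER: AC1",
            "REMARK 800 EVIDENCE_CODE: SOFTWARE", // made-up value
            "REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE HEM O 104"
        };
        System.out.println(siteDescriptions(lines).get("AC1"));
    }
}
```

Keying the map on the identifier mirrors the point made earlier in the thread: the identifier (e.g. "AC1") links the SITE record to its REMARK 800 entry, while the description carries the meaningful text.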
From bugzilla-daemon at portal.open-bio.org Fri Sep 3 15:55:12 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Sep 2010 15:55:12 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009031955.o83JtCUJ023861@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #7 from darnells at dnastar.com 2010-09-03 15:55 EST -------
There is a defect in PDBFileParser.java regarding sites.

Line 2044:

    String pdbCode = groupString.substring(4, 10).trim();

This assigns pdbCode to "$CHAIN $RESNUM$INSCODE" (e.g. "H 221A"), where the Group interface documentation states it should just be $RESNUM$INSCODE (e.g. "221A").

Also, setParent() is never called to assign a chain object. Currently, to get the chain id for a site residue, you need to use the first character of pdbCode. Is it possible to link the site residue groups to their correct chain? This is somewhat analogous to SSBond; is a similar approach appropriate?

From bugzilla-daemon at portal.open-bio.org Mon Sep 6 04:06:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 6 Sep 2010 04:06:01 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009060806.o86861xr002482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #8 from amr_alhossary at hotmail.com 2010-09-06 04:06 EST -------
Actually, Dr. Andreas's idea was to use the PDBResidueNumber bean, and that's what I did in my implementation. Anyway, we can update everything to use the new bean, but first we need to set clear definitions of terms like group, residue, HetAtom, etc. and the relations between them.
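The substring issue raised in comment #7 can be reproduced in isolation. The sketch below is only an assumption about the shape of groupString (residue name, chain id, residue number and insertion code in a 10-character token such as "ARG H 221A"); the class and method names are hypothetical and this is not the fix that went into PDBFileParser.

```java
final class SiteResidueTokenSketch {

    /** Behaviour reported in comment #7: chain id and residue number end up together. */
    static String reportedPdbCode(String groupString) {
        return groupString.substring(4, 10).trim();
    }

    /** Chain id taken from its own column (index 4 in the assumed layout). */
    static String chainId(String groupString) {
        return groupString.substring(4, 5);
    }

    /** Residue number plus insertion code only, e.g. "221A". */
    static String residueNumber(String groupString) {
        return groupString.substring(5, 10).trim();
    }

    public static void main(String[] args) {
        String groupString = "ARG H 221A"; // assumed 10-character token layout
        System.out.println(reportedPdbCode(groupString)); // prints H 221A
        System.out.println(chainId(groupString));         // prints H
        System.out.println(residueNumber(groupString));   // prints 221A
    }
}
```

Splitting the two fields this way would let the chain id be used to look up the parent chain separately, rather than being recovered from the first character of pdbCode.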
From deniz.koellhofer at cambia.org Tue Sep 21 18:59:27 2010
From: deniz.koellhofer at cambia.org (Deniz Koellhofer)
Date: Wed, 22 Sep 2010 08:59:27 +1000
Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein
Message-ID:

Hi,

I'm trying to parse EMBL formatted files with RichSequence.IOTools.readEMBLProtein() - but the pattern for the ID lines doesn't match.

Looks like the parser utilises the EMBLFormat class with the following ID pattern:

    protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

The ID lines in my files (retrieved from EMBL-EBI) look like

    ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

Looks like the pattern is specifically written for DNA/RNA and should look more like:

    protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

Or am I using the wrong RichSequence.IOTools function?
Cheers, Deniz -- Deniz Koellhofer Cambia Initiative for Open Innovation (IOI) Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia From deniz.koellhofer at cambia.org Wed Sep 22 19:10:58 2010 From: deniz.koellhofer at cambia.org (Deniz Koellhofer) Date: Thu, 23 Sep 2010 09:10:58 +1000 Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein In-Reply-To: <20100922211815.12439.qmail@mxw1102.verio-web.com> References: <20100922211815.12439.qmail@mxw1102.verio-web.com> Message-ID: Hi George, This entry is from the embl patent protein database: ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz Have you used the RichSequence.IOTools successfully for parsing EMBL protein files before? I assume this should always fail due to the "BP" in the regex? Deniz On Thu, Sep 23, 2010 at 7:18 AM, George Waldon wrote: > Hi Deniz: > > I have a quick question that may be obvious, but which database do you get > those protein files from? > > Thank you, > > George > > On Tue, Sep 21, 2010 at 3:59 PM, Deniz Koellhofer < > deniz.koellhofer at cambia.org> wrote: > > Hi, > > I'm trying to parse EMBL formatted files > with RichSequence.IOTools.readEMBLProtein() - but the pattern for the ID > lines don't match. 
> > Looks like the parser utilises the EMBLFormat class with the following > ID > pattern: > > *protected** **static** **final** Pattern **lp** = Pattern.compile(** > > "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$" > **);* > > The ID lines in my files (retrieved from EMBL-EBI) look like *ID > A00197; > SV 1; linear; protein; PRT; SYN; 602 AA.* > > Looks like the pattern is specifically written for dna/rna and should > more > look like: > > *protected** **static** **final** Pattern **lp** = Pattern.compile(** > > "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+); > **\\s+(\\d+)\\s+(BP|AA)\\.$"**);* > > Or am I using he wrong RichSequence.IOTools function? > > Cheers, > > Deniz > -- > Deniz Koellhofer > Cambia > Initiative for Open Innovation (IOI) > Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > -- Deniz Koellhofer Cambia Initiative for Open Innovation (IOI) Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia From gwaldon at geneinfinity.org Thu Sep 23 00:23:48 2010 From: gwaldon at geneinfinity.org (George Waldon) Date: Thu, 23 Sep 2010 00:23:48 -0400 Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein Message-ID: <20100923042349.77746.qmail@mxw1102.verio-web.com> Curious, this seems to be the only place to find this type of files. Not really an official format, a little bit like GenPept. Your fix should probably work. Can you fill a bug on bugzilla (http://bugzilla.open-bio.org/)? 
Best, George On Wed, Sep 22, 2010 at 4:10 PM, Deniz Koellhofer wrote: Hi George, This entry is from the embl patent protein database: ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz Have you used the RichSequence.IOTools successfully for parsing EMBL protein files before? I assume this should always fail due to the "BP" in the regex? Deniz On Thu, Sep 23, 2010 at 7:18 AM, George Waldon wrote: Hi Deniz: I have a quick question that may be obvious, but which database do you get those protein files from? Thank you, George On Tue, Sep 21, 2010 at 3:59 PM, Deniz Koellhofer wrote: Hi, I'm trying to parse EMBL formatted files with RichSequence.IOTools.readEMBLProtein() - but the pattern for the ID lines don't match. Looks like the parser utilises the EMBLFormat class with the following ID pattern: *protected** **static** **final** Pattern **lp** = Pattern.compile(** "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$" **);* The ID lines in my files (retrieved from EMBL-EBI) look like *ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.* Looks like the pattern is specifically written for dna/rna and should more look like: *protected** **static** **final** Pattern **lp** = Pattern.compile(** "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+); **\\s+(\\d+)\\s+(BP|AA)\\.$"**);* Or am I using he wrong RichSequence.IOTools function? 
Cheers, Deniz -- Deniz Koellhofer Cambia Initiative for Open Innovation (IOI) Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia _______________________________________________ biojava-dev mailing list biojava-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-dev From bugzilla-daemon at portal.open-bio.org Thu Sep 23 00:35:54 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Sep 2010 00:35:54 -0400 Subject: [Biojava-dev] [Bug 3137] New: RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries. Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3137 Summary: RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries. Product: BioJava Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: seq.io AssignedTo: biojava-dev at biojava.org ReportedBy: dkoellhofer at gmail.com Hi, I'm trying to parse EMBL formatted files with RichSequence.IOTools.readEMBLProtein() - but the pattern for the ID lines don't match. Looks like the parser utilises the EMBLFormat class with the following ID pattern: protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$"); The ID lines in my files (retrieved from EMBL-EBI) look like ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA. 
Looks like the pattern is specifically written for dna/rna and should more look like: protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$"); The failing protein sequences come from the embl patent protein database: ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz Cheers, Deniz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Sep 23 05:10:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Sep 2010 10:10:56 +0100 Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein In-Reply-To: <20100923042349.77746.qmail@mxw1102.verio-web.com> References: <20100923042349.77746.qmail@mxw1102.verio-web.com> Message-ID: On Thu, Sep 23, 2010 at 5:23 AM, George Waldon wrote: > > Curious, this seems to be the only place to find this type of files. > Not really an official format, a little bit like GenPept. Your fix > should probably work. Can you fill a bug on bugzilla > (http://bugzilla.open-bio.org/)? > > Best, > George Bug filed: http://bugzilla.open-bio.org/show_bug.cgi?id=3137 Interestingly Biopython doesn't support Protein EMBL files either (I didn't know they existed), I wonder if BioPerl does? Peter From biopython at maubp.freeserve.co.uk Thu Sep 23 06:51:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Sep 2010 11:51:35 +0100 Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein In-Reply-To: References: <20100923042349.77746.qmail@mxw1102.verio-web.com> Message-ID: Hi BioJava team et al, Thanks for the indirect alert that protein EMBL files exist, e.g. 
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz I've just updated Biopython to support them. Chris - I've CC'd you in case you want to look at this for BioPerl. Peter On Thu, Sep 23, 2010 at 10:10 AM, Peter wrote: > On Thu, Sep 23, 2010 at 5:23 AM, George Waldon wrote: >> >> Curious, this seems to be the only place to find this type of files. >> Not really an official format, a little bit like GenPept. Your fix >> should probably work. Can you fill a bug on bugzilla >> (http://bugzilla.open-bio.org/)? >> >> Best, >> George > > Bug filed: http://bugzilla.open-bio.org/show_bug.cgi?id=3137 > > Interestingly Biopython doesn't support Protein EMBL files either > (I didn't know they existed), I wonder if BioPerl does? > > Peter From bugzilla-daemon at portal.open-bio.org Thu Sep 23 18:45:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Sep 2010 18:45:44 -0400 Subject: [Biojava-dev] [Bug 3138] New: SeqRes2AtomAligner misaligns Atom groups at N-terminus Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3138 Summary: SeqRes2AtomAligner misaligns Atom groups at N-terminus Product: BioJava Version: live (CVS source) Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: structure AssignedTo: biojava-dev at biojava.org ReportedBy: darnells at dnastar.com The SeqRes2AtomAligner misaligns the N-terminus residues for 1BKV (Collagen) and 3NLC (VP0956). Both structures have SEQRES residues without corresponding ATOM residues at the N-terminus. In the case of 3NLC, the observed mapping is (* = no ATOM group): SEQRES MGHHHHHHSHMIRINE ATOM M**********IRINE and the correct mapping should be: SEQRES MGHHHHHHSHMIRINE ATOM **********MIRINE Activating the Chemical Component Dictionary does not help the alignment. Is it significant that for 1BKV and 3NLC the first SEQRES group and the first ATOM group are the same residue (nearly the same for 1BKV, PRO ~ HYP)? 
Is this a situation that is difficult for the Needleman-Wunsch algorithm used by SeqRes2AtomAligner?

From bugzilla-daemon at portal.open-bio.org Thu Sep 23 18:50:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Sep 2010 18:50:11 -0400
Subject: [Biojava-dev] [Bug 3138] SeqRes2AtomAligner misaligns Atom groups at N-terminus
In-Reply-To:
Message-ID: <201009232250.o8NMoBr0029097@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3138

ap3 at sanger.ac.uk changed:

    What       |Removed                    |Added
    ----------------------------------------------------------------
    AssignedTo |biojava-dev at biojava.org |andreas at sdsc.edu

From bugzilla-daemon at portal.open-bio.org Fri Sep 24 01:45:17 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Sep 2010 01:45:17 -0400
Subject: [Biojava-dev] [Bug 3137] RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries.
In-Reply-To:
Message-ID: <201009240545.o8O5jHw9007788@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3137

gwaldon at geneinfinity.org changed:

    What       |Removed |Added
    ----------------------------------------------------------------
    Status     |NEW     |RESOLVED
    Resolution |        |FIXED

------- Comment #1 from gwaldon at geneinfinity.org 2010-09-24 01:45 EST -------
Committed new pattern and added a test. Note that protein sequences from KIPO have a non-regular ID line that prevents parsing.
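The effect of the pattern change reported in bug 3137 can be checked in isolation with java.util.regex. The sketch below copies the two patterns quoted in the bug report into a hypothetical standalone class; it assumes that the leading "ID" tag has already been stripped before the pattern is applied, and the nucleotide line is only an illustrative example of a "BP" entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmblIdLineSketch {

    /** ID-line pattern as quoted in the report; accepts only "BP". */
    static final Pattern OLD_LP = Pattern.compile(
            "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
            + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    /** Proposed pattern; accepts "BP" or "AA" as the length unit. */
    static final Pattern NEW_LP = Pattern.compile(
            "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
            + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    public static void main(String[] args) {
        // Assumption: the "ID" tag is removed before matching.
        String protein = "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";
        String nucleotide = "X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.";

        System.out.println(OLD_LP.matcher(protein).matches());    // prints false
        System.out.println(NEW_LP.matcher(protein).matches());    // prints true
        System.out.println(NEW_LP.matcher(nucleotide).matches()); // prints true

        Matcher m = NEW_LP.matcher(protein);
        if (m.matches()) {
            // Group 7 is the sequence length, group 8 the unit.
            System.out.println(m.group(7) + " " + m.group(8)); // prints 602 AA
        }
    }
}
```

Note that the proposed pattern adds a capturing group for the unit, so any code that reads groups by index after the length would need to take the extra group into account.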
From gindin at gmail.com Sun Sep 26 19:05:50 2010
From: gindin at gmail.com (Yevgeniy Gindin)
Date: Sun, 26 Sep 2010 19:05:50 -0400
Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties
Message-ID:

There seems to be a bug in org.biojavax.bio.alignment.blast.RemoteQBlastAlignmentProperties.setBlastProgram(String).

The method throws an exception when a valid program name is given. I believe that this is due to the fact that two String objects are tested with the "==" operator rather than equals(). To illustrate:

    String a = new String("a");
    String b = new String("a");
    System.out.println(a == b);

The above will print false.

--
Yevgeniy

From pedros at berkeley.edu Mon Sep 27 15:55:05 2010
From: pedros at berkeley.edu (Pedro Silva)
Date: Mon, 27 Sep 2010 12:55:05 -0700
Subject: [Biojava-dev] BioJava 3 current status
Message-ID: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>

On 30 Jun 2010, at 12:18, Andreas Prlic wrote:
> About BioJava 3: This has made great progress over the last weeks and
> a lot of new functionality has been committed to SVN. To make this
> release ready there are now two new tools:
>
> * There is now a BioJava Maven repository, which is hosting SNAPSHOT
> builds from the current SVN.

Hi Andreas,

A couple of questions about Biojava 3, with which I'm trying to interoperate from Clojure:

1. You mention SVN above, but I also see an active git repository at github. Is that only a mirror, and active development occurs at biojava.org's svn repo?

2. Are all non biojava3-prefixed modules in the Maven repository not suitable to use with Biojava 3? Eg., biosql, das, etc.

3. Will the Maven repository eventually also host releases, not just snapshots? How far along would a release be?
To provide some context, I've started coding on 'Bioclojure', and my strategy is to position my efforts downstream from Biojava (3). To that end, I'm trying to figure out the best way to pull development snapshots and stable releases from you guys, as well as your current state of affairs in the porting of biojava1/biojavax to biojava3.

Thanks for your help.

Pedro

--
Pedro Silva
UC Berkeley - PMB

From andreas at sdsc.edu Mon Sep 27 16:40:30 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 27 Sep 2010 13:40:30 -0700
Subject: [Biojava-dev] BioJava 3 current status
In-Reply-To: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
References: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
Message-ID:

Hi Pedro,

> 1. You mention SVN above, but I also see an active git repository at github.
>    Is that only a mirror, and active development occurs at biojava.org's svn repo?

Yes, git is only a mirror. The main development happens in an ssh protected SVN, which is replicated onto the anonymous SVN and git servers (within one hour after a commit).

> 2. Are all non biojava3-prefixed modules in the Maven repository not suitable to use with Biojava 3?
>    Eg., biosql, das, etc.

In an ideal world we would have a clear separation of biojava 3 and biojava 1. However, the way things are at present there is some mix and there is still legacy code in SVN, e.g. the module "core" is essentially a lot of biojava1.7. It is up to the module maintainers to upgrade their code. We will need to decide soon whether we want to drop all legacy code from SVN and only support biojava3 based code from that point on. (We need to do a dependency analysis of how many modules still depend on the old core.)

> 3. Will the Maven repository eventually also host releases, not just snapshots?

yes

>    How far along would a release be?

We have not set a date at present. I will write a proposal about the next steps to the mailing list soon and then this is up for discussion.
Having said this, several people are using the current SVN checkouts in production environments and what seems mostly missing is lots of documentation... > To provide some context, I've started coding on 'Bioclojure', and my strategy is to position my efforts downstream from Biojava (3). Excellent! Are you planning to add additional functionality or is it mainly a wrapper for biojava? > To that end, I'm trying to figure out the best way to pull development snapshots and stable releases from you guys, as well as > your current state of affairs in the porting of biojava1/biojavax to biojava3. The automated build system pushes any successful build to http://www.biojava.org/download/maven/ This means all libraries that are available from there are guaranteed to pass the junit tests and compile correctly. Andreas From pedros at berkeley.edu Mon Sep 27 19:13:17 2010 From: pedros at berkeley.edu (Pedro Silva) Date: Mon, 27 Sep 2010 16:13:17 -0700 Subject: [Biojava-dev] BioJava 3 current status In-Reply-To: References: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu> Message-ID: <1285629197.25539.38.camel@dev.dzlab.pmb.berkeley.edu> On Mon, 2010-09-27 at 13:40 -0700, Andreas Prlic wrote: > Excellent! Are you planning to add additional functionality or is it > mainly a wrapper for biojava? Well, it started as native Clojure all the way, then I figured it would be a wrapper around Biojava, *then* I figured, Clojure-Java interoperability is so good, I don't need any wrapper at all. That leaves adding additional functionality, plus wrapping the more verbose and/or common Biojava idioms. Thanks for your answers again. I appreciate it. 
Pedro -- Pedro Silva UC Berkeley - PMB From gwaldon at geneinfinity.org Mon Sep 27 23:34:01 2010 From: gwaldon at geneinfinity.org (George Waldon) Date: Mon, 27 Sep 2010 23:34:01 -0400 Subject: [Biojava-dev] BioJava 3 current status Message-ID: <20100928033401.79626.qmail@mxw1102.verio-web.com> On Mon, Sep 27, 2010 at 1:40 PM, Andreas Prlic wrote: >In an ideal world we would have a clear separation of biojava 3 and >biojava 1. However the way things are at the present there is some >mix and there is still legacy code in SVN. e.g. the module "core" is >essentially a lot of biojava1.7. It is up to the module maintainers to >upgrade their code. We will need to take a decision soon if we want to >drop all legacy code from SVN and only support biojava3 based code >from that point on. ( need to do a dependency analysis how many >modules are still depending on the old core). I see a few (rather lonely) BioException in the biojava3-ws module and a "biojava1-core" dependency in org.biojava3.core.sequence.transcription.DefaultRNAProteinTranscription in the sequence-core module. I also see that many of the classes of the sequence-core module are duplicated in the biojava3-core module. Is-there a meaning for this duplication? George From andreas at sdsc.edu Tue Sep 28 13:17:37 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 28 Sep 2010 10:17:37 -0700 Subject: [Biojava-dev] BioJava 3 current status In-Reply-To: <20100928033401.79626.qmail@mxw1102.verio-web.com> References: <20100928033401.79626.qmail@mxw1102.verio-web.com> Message-ID: > > I see a few (rather lonely) BioException in the biojava3-ws module and a "biojava1-core" dependency in org.biojava3.core.sequence.transcription.DefaultRNAProteinTranscription in the sequence-core module. I also see that many of the classes of the sequence-core module are duplicated in the biojava3-core module. Is-there a meaning for this duplication? Andy, Scooter, is this a leftover from the initial days? 
Can I delete the sequence module? Is this all biojava3-core now? On a related matter, is anybody working on genebank and embl file parsing for biojava3? There were some feature requests for this and would be nice if this could be done in biojava3 as well... Andreas -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From sylvain.foisy at inflammgen.org Tue Sep 28 09:54:43 2010 From: sylvain.foisy at inflammgen.org (Sylvain Foisy) Date: Tue, 28 Sep 2010 09:54:43 -0400 Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin) Message-ID: Hi, Ok, it should be fixed in SVN. I actually replaced the == by a call to the Arrays.binarySearch(Object[] obj, Object obj) method, which makes traversing the array obsolete. Thanks for pointing it ;-) Best regards Sylvain From HWillis at scripps.edu Tue Sep 28 13:30:50 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Tue, 28 Sep 2010 13:30:50 -0400 Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin) Message-ID: Andreas Yes you can delete the sequence modules. I will add the parsers to the list and will use that to make sure core sequence module properly models the data elements. Thanks Scooter ----- Reply message ----- From: "Sylvain Foisy" Date: Tue, Sep 28, 2010 1:18 pm Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin) To: "biojava-dev at lists.open-bio.org" Hi, Ok, it should be fixed in SVN. I actually replaced the == by a call to the Arrays.binarySearch(Object[] obj, Object obj) method, which makes traversing the array obsolete. 
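To make the "==" versus equals() point in this thread concrete, here is a minimal sketch. It is not the BioJava code: the class name, the method names and the array of program names are assumptions made up for the illustration. One caveat that seems worth keeping in mind is that Arrays.binarySearch only gives a defined result on a sorted array, so a fix along those lines relies on the names being kept in sorted order.

```java
import java.util.Arrays;

// Hypothetical stand-in for a class that validates a BLAST program name.
class ProgramNameCheck {

    // Assumed list of valid names, kept sorted so binarySearch is well defined.
    private static final String[] PROGRAMS =
            {"blastn", "blastp", "blastx", "tblastn", "tblastx"};

    // The bug pattern: "==" compares object references, not characters.
    static boolean isValidByIdentity(String name) {
        for (String p : PROGRAMS) {
            if (p == name) {
                return true;
            }
        }
        return false;
    }

    // Value comparison; List.contains calls equals() on each element.
    static boolean isValidByEquals(String name) {
        return Arrays.asList(PROGRAMS).contains(name);
    }

    // Value comparison via compareTo(); requires PROGRAMS to be sorted.
    static boolean isValidByBinarySearch(String name) {
        return Arrays.binarySearch(PROGRAMS, name) >= 0;
    }

    public static void main(String[] args) {
        // new String(...) guarantees a distinct object, as in the example above.
        String name = new String("blastp");
        System.out.println("identity:     " + isValidByIdentity(name));
        System.out.println("equals:       " + isValidByEquals(name));
        System.out.println("binarySearch: " + isValidByBinarySearch(name));
    }
}
```

A string literal passed by the caller may happen to pass the identity check because literals are interned, which is one reason this kind of bug can go unnoticed in simple tests.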
Thanks for pointing it out ;-)

Best regards

Sylvain
_______________________________________________
biojava-dev mailing list
biojava-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-dev

From bugzilla-daemon at portal.open-bio.org Wed Sep 29 00:48:46 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Sep 2010 00:48:46 -0400
Subject: [Biojava-dev] [Bug 3140] New: Required Correction in GenbankLocationParser class
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3140

Summary: Required Correction in GenbankLocationParser class
Product: BioJava
Version: unspecified
Platform: PC
OS/Version: Windows
Status: NEW
Severity: normal
Priority: P2
Component: seq.io
AssignedTo: biojava-dev at biojava.org
ReportedBy: gwaldon at geneinfinity.org

Reported on behalf of Deepak Sheoran:

There is a problem with the GenbankLocationParser class: it does not process the GenBank record with accession M32882. The LocationParser class fails at the following lines in the GenBank record:

gene join((8298.8300)..10206,1..855)
/gene="bcn"
mRNA join((8298.8300)..10206,1..855)
/gene="bcn"
/note="alternative transcript"

The exception stack trace is as follows:

Could not understand position: 10206,1..855
org.biojava.bio.seq.io.ParseException: Could not understand position: 10206,1..855
    at org.biojavax.bio.seq.io.GenbankLocationParser.parsePosition(GenbankLocationParser.java:285)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parsePosition(GenbankLocationParser.java:285)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parseLocString(GenbankLocationParser.java:277)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parseLocString(GenbankLocationParser.java:244)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parseLocation(GenbankLocationParser.java:131)

I investigated this and found the defect in the regular expression named "gp" in the GenbankLocationParser class. The error can be fixed by applying the attached patch.
For testing, I have created a method which shows that the parser can now understand all the possible combinations of locations. The test class is also attached, so that you can test before and after applying my patch.

From bugzilla-daemon at portal.open-bio.org Wed Sep 29 00:52:33 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Sep 2010 00:52:33 -0400
Subject: [Biojava-dev] [Bug 3140] Required Correction in GenbankLocationParser class
In-Reply-To:
Message-ID: <201009290452.o8T4qXkg023352@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3140

gwaldon at geneinfinity.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from gwaldon at geneinfinity.org 2010-09-29 00:52 EST -------
Applied proposed patch and test case.

Index: GenbankLocationParser.java
===================================================================
--- GenbankLocationParser.java (revision 8212)
+++ GenbankLocationParser.java (working copy)
@@ -133,7 +133,7 @@
 // O beautiful regex, we worship you. (:-)
 // this matches grouped locations
- private static Pattern gp = Pattern.compile("^([^\$\$:]*?:)?(complement|join|order)?\$*{0,1}(.*?)\$*{0,1}$");
+ private static Pattern gp = Pattern.compile("^([^\$\$:]*?:)?(complement|join|order)?\${0,1}(.*?\$*{0,1})$");
 // this matches range locations
 private static Pattern rp = Pattern.compile("^\$*(.*?)\$*(\\.\\.\$*(.*)\$*)?$");
 // this matches accession/version pairs
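For readers following bug 3140, the location string from the report nests one pair of parentheses inside another, which is awkward to take apart with a single regular expression. The sketch below is only an illustration of the structure of such a string; it is not the applied patch and not how GenbankLocationParser works. The class and method names are hypothetical, and the code assumes the input is either of the form operator(arguments) or a plain range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, written only to illustrate nested location strings.
class LocationSplitSketch {

    // Returns the operator name before the first '(' (e.g. "join"), or "" if none.
    static String operator(String location) {
        int open = location.indexOf('(');
        return (open > 0 && location.endsWith(")")) ? location.substring(0, open) : "";
    }

    // Splits the operator's arguments on commas that are outside any parentheses.
    static List<String> arguments(String location) {
        String inner = location;
        int open = location.indexOf('(');
        if (open > 0 && location.endsWith(")")) {
            // Drop "operator(" at the front and the matching ")" at the end.
            inner = location.substring(open + 1, location.length() - 1);
        }
        List<String> parts = new ArrayList<String>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < inner.length(); i++) {
            char c = inner.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(inner.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(inner.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String loc = "join((8298.8300)..10206,1..855)";
        System.out.println(operator(loc) + " -> " + arguments(loc));
    }
}
```

For the string in the report this yields the two sub-locations "(8298.8300)..10206" and "1..855", rather than the fragment "10206,1..855" seen in the exception message.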
From chapmanb at 50mail.com Wed Sep 29 07:49:24 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 29 Sep 2010 07:49:24 -0400 Subject: [Biojava-dev] BioJava 3 current status In-Reply-To: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu> References: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu> Message-ID: <20100929114924.GC8837@sobchak.mgh.harvard.edu> Pedro; > To provide some context, I've started coding on 'Bioclojure', and my > strategy is to position my efforts downstream from Biojava (3). That's awesome. I've been teaching myself Clojure to be able to interoperate with Biojava and the GATK/Picard toolkits. Great to hear you are tackling this. Have you also seen Jan Aerts work? http://github.com/jandot/bioclojure Brad From andreas at sdsc.edu Wed Sep 29 22:17:45 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 29 Sep 2010 19:17:45 -0700 Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin) In-Reply-To: References:

Message-ID: Hi Scooter, I just deleted the sequence modules in SVN. If it later turns out that there is some code that was forgotten to be moved to biojava3-core, there is still a tag in svn to show the status before it got deleted... Andreas On Tue, Sep 28, 2010 at 10:30 AM, Scooter Willis wrote: > Andreas > > Yes you can delete the sequence modules. I will add the parsers to the list and will use that to make sure core sequence module properly models the data elements. > > Thanks > > Scooter > > ----- Reply message ----- > From: "Sylvain Foisy" > Date: Tue, Sep 28, 2010 1:18 pm > Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin) > To: "biojava-dev at lists.open-bio.org" > > Hi, > > Ok, it should be fixed in SVN. I actually replaced the == by a call to the > Arrays.binarySearch(Object[] obj, Object obj) method, which makes traversing > the array obsolete. > > Thanks for pointing it ;-) > > Best regards > > Sylvain > > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From deniz.koellhofer at cambia.org Wed Sep 1 06:21:52 2010 From: deniz.koellhofer at cambia.org (Deniz Koellhofer) Date: Wed, 1 Sep 2010 16:21:52 +1000 Subject: [Biojava-dev] biojava3 BLAST parser In-Reply-To: <47FD5948-0439-45C6-A1AB-22E7CC8D17A6@scripps.edu> References:

<47FD5948-0439-45C6-A1AB-22E7CC8D17A6@scripps.edu> Message-ID: Hi Scooter, I'm currently parsing the BLAST results into plain data containers, but wouldn't mind integrating it more with existing BioJava3 modules if I find some time. Pretty busy at the moment, but I will let you guys know if I get any further. Cheers, Deniz On Wed, Sep 1, 2010 at 12:11 AM, Scooter Willis wrote: > Deniz > > It would be great to formalize the XML blast results as Java classes. Do > you have any interest in taking on the project? > > Capturing the blast alignment using the new alignment classes would be a > very nice feature. I like using XPATH as the query language to select for > hits of interest which should allow for a SAX based approach to minimize the > impact of very large XML files. XPATH and SAX does appear to have some > constraints ( > http://stackoverflow.com/questions/1863250/is-it-there-any-xpath-processor-for-sax-model > ) > > Probably makes sense to have a Blast module that would depend on core and > alignment. > > Thanks > > Scooter > > > > On Aug 31, 2010, at 8:49 AM, Deniz Koellhofer wrote: > > *Hi Scooter,* > * > * > *Thanks for the reply. I guess the BlastXMLQuery is a good example to show > how to quickly extract information from a BLAST result. * > * > * > *But in my opinion biojava3 should alo have a Blast parser that generates > java beans containing the complete Blast result set - similar to what > biojava1.7.1 was doing. So yeah, I'm after translating the XML elements to > Java classes.* > * > * > *Would something like that fit into one of the biojava3 modules? homology, > I/O?* > * > * > *Thanks,* > *Deniz* > * > * > On Tue, Aug 31, 2010 at 8:43 PM, Scooter Willis wrote: > >> Deniz >> >> Can you provide some requirements regarding parsing the Blast XML. I tend >> to use XPATH and the DOM object to get to the data elements of interest so >> you already have the ability to load the Blast XML and work with the data. 
>> The difficulty of "parsing" is not an issue with XML. The BlastXMLQuery is >> an example of searching the Blast XML to get results. Are you wanting the >> XML elements translated to Java classes? >> >> Thanks >> >> Scooter >> >> On Aug 31, 2010, at 2:46 AM, Deniz Koellhofer wrote: >> >> > Hi, >> > >> > I wanted to find out the current state of blast parsing efforts in >> biojava3 >> > - especially for ncbi blastxml output? >> > >> > I had a quick look and found some DOM based code fragments >> > in org.biojava3.genome.query.BlastXMLQuery. Is there already anybody >> working >> > on a more comprehensive SAX parser? >> > >> > The biojava1.7.1 blastxml parser seems to work fine, however some of the >> > tags in NCBI-BLASTN 2.2.23+ output like Hsp_midline, BlastOutput_param >> don't >> > seem to get parsed properly >> > in org.biojava.bio.program.sax.blastxml.BlastXMLParserFacade. >> > >> > Cheers, >> > Deniz >> > >> > -- >> > Deniz Koellhofer >> > Cambia >> > Initiative for Open Innovation (IOI) >> > Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia >> > _______________________________________________ >> > biojava-dev mailing list >> > biojava-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biojava-dev >> >> > > > -- > -- > Deniz Koellhofer > Cambia > Initiative for Open Innovation (IOI) > Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia > > > From bugzilla-daemon at portal.open-bio.org Wed Sep 1 17:59:08 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Sep 2010 13:59:08 -0400 Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader In-Reply-To: Message-ID: <201009011759.o81Hx83i005446@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3132 darnells at dnastar.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | ------- Comment 
#2 from darnells at dnastar.com 2010-09-01 13:59 EST ------- I have applied the patch Amr sent over the weekend and tested it. It's looking pretty good. I tested structures 3CYT, 4HHB, 3AFA, 1MR1, and 2OHX and the following functions work as advertised: Structure: getSites() PDBSite: getSiteID(), getResidues() PDBSite.Residue: getResidueName(), getResidueNumber() I didn't directly test the set* methods, assuming they are working fine since the get* methods work. I noticed that PDBSite implements PDBRecord (similarly as SSBond). The method PDBSite.toPDB() ends in error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) at java.lang.StringBuffer.append(StringBuffer.java:224) at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:39) at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:25) at com.dnastar.structureTest.Cookbook.printSiteIDs(Cookbook.java:36) at com.dnastar.structureTest.Cookbook.main(Cookbook.java:24) That should be corrected so that it is consistent with the SSBond class. I also agree with Jules Jacobsen's comments (via email): "We still need to hook-up the REMARK 800 parser and make some unit tests for this." -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jacobsen at ebi.ac.uk Thu Sep 2 10:27:10 2010 From: jacobsen at ebi.ac.uk (Jules Jacobsen) Date: Thu, 02 Sep 2010 11:27:10 +0100 Subject: [Biojava-dev] Bug 3132: "SITE records in PDBFileReader"??? In-Reply-To: References: <4C73A9A8.9010708@ebi.ac.uk> <4C7667D8.9030000@ebi.ac.uk>

Message-ID: <4C7F7BFE.6090101@ebi.ac.uk> Hi Guys, OK, please don't hack about with this - I'm adding it into the main branch of biojava 3, but things are changing a little. PDBSite has been renamed to Site ('PDB' is a redundant) The Residue inner class in PDBSite I have removed so as to use the Group interface which we have already defined. As Steve says, Site.toPDB() throws an OutOfMemoryError - I'm fixing this and adding tests to Site, the PDBFileParser and other classes which have been touched. Also I'm adding the Remark800 parser and minor cosmetic tweaks for consistent coding style. This should be good to go for Mon/Tues next week. Jules On 01/09/2010 18:57, Steve Darnell wrote: > Amr, > > I have applied the patch you sent over the weekend and tested it. It's > looking pretty good. I tested structures 3CYT, 4HHB, 3AFA, 1MR1, and > 2OHX and the following functions work as advertised: > > Structure: getSites() > PDBSite: getSiteID(), getResidues() > PDBSite.Residue: getResidueName(), getResidueNumber() > > I didn't directly test the set* methods, assuming they are working fine > since the get* methods work. > > I noticed that PDBSite implements PDBRecord (similarly as SSBond). The > method PDBSite.toPDB() ends in error: > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:2882) > at > java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.jav > a:100) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) > at java.lang.StringBuffer.append(StringBuffer.java:224) > at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:39) > at org.biojava.bio.structure.PDBSite.toPDB(PDBSite.java:25) > at > com.dnastar.structureTest.Cookbook.printSiteIDs(Cookbook.java:36) > at com.dnastar.structureTest.Cookbook.main(Cookbook.java:24) > > That should be corrected so that it is consistent with the SSBond class. > I'll update the bug in Bugzilla with this information. 
> > Regards, > Steve > > CC: Jules Jacobsen (also looking at SITE patch) > > -----Original Message----- > From: Steve Darnell > Sent: Tuesday, August 31, 2010 5:52 PM > To: 'Amr AL-Hossary' > Subject: RE: Bug 3132: "SITE records in PDBFileReader"??? > > Amr, > >> The configuration is related to the PROJECT PROPERTIES, not to Your > system. > > My project/application is using JDK 1.6.0_20 (Windows/Mac). > > >> so all you need is SITE_IDENTIFIER, you don't need EVIDENCE_CODE, right > ? > > I need the SITE_DESCRIPTION from REMARK 800. SITE_IDENTIFIER, which is > the same as the siteID from the SITE record, is not meaningful. > Correct, I do not need the EVIDENCE_CODE. > > >> Sorry? Isn't it already there in the PDBSite bean? >> or what is the difference between site name& site ID? > > For example, PDBID: 3CYT > > SITE, siteID or REMARK 800, SITE_IDENTIFIER: "AC1" > REMARK 800, SITE_DESCRIPTION: "BINDING SITE FOR RESIDUE HEM O 104" > > The description is meaningful, the identifier is not. > > >> please recheck the date. > > My mistake. Wednesday, September 8th, two days after Labor Day (Monday, > September 6th). > > I hope I addressed your questions. Let me know if I can clarify > anything else. > > Regards, > Steve From bugzilla-daemon at portal.open-bio.org Thu Sep 2 17:21:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Sep 2010 13:21:39 -0400 Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader In-Reply-To: Message-ID: <201009021721.o82HLdKj024110@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3132 ------- Comment #3 from jacobsen at ebi.ac.uk 2010-09-02 13:21 EST ------- SITE parser now up and running in DEV - should be available in automated build - access Site objects from Structure.getSites() - Output available in standard String and also proper PDB format (with padded whitespace!) - java.lang.OutOfMemoryError fixed. 
- tests added to PDBFileParserTest and SiteTest

REMARK 800 is being added too.

From bugzilla-daemon at portal.open-bio.org Thu Sep 2 21:32:50 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Sep 2010 17:32:50 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009022132.o82LWoP7005561@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #4 from darnells at dnastar.com 2010-09-02 17:32 EST -------
Using revision 8221 (2010/09/02 17:11:55), I experienced two unit test failures in org.biojava.bio.structure.SiteTest during the build process. Both testToPDB_0args() and testToPDB_StringBuffer() failed because the expected result did not have whitespace padding the line lengths to 80 characters.

In SiteTest.java, I replaced lines 118-119 and 134-135 with the following:

    String expResult =
        "SITE 1 AC1 6 ARG H 221A LYS H 224 HOH H 403 HOH H 460 " + newline +
        "SITE 2 AC1 6 HOH H 464 HOH H 497 " + newline;

I can successfully build afterwards.
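As a small aside on the padding issue in comment #4: PDB records are fixed-width lines, so a test that compares generated output against a literal has to agree on trailing whitespace. The following is a minimal sketch of padding a record to 80 columns. The class and method names are hypothetical and not part of BioJava, and the spacing inside the sample record is only illustrative.

```java
// Hypothetical helper showing one way to pad a record to the PDB line width.
class PdbLinePaddingSketch {

    static final int PDB_LINE_WIDTH = 80;

    // Left-justifies the record and pads it with spaces up to 80 characters.
    // Records that are already 80 characters or longer are returned unchanged.
    static String padToWidth(String record) {
        return String.format("%-" + PDB_LINE_WIDTH + "s", record);
    }

    public static void main(String[] args) {
        // Column spacing of this sample is an assumption made for the example.
        String padded = padToWidth("SITE     2 AC1  6 HOH H 464  HOH H 497");
        System.out.println("[" + padded + "] length=" + padded.length());
    }
}
```

Building the expected string in a test through the same kind of helper avoids hand-counting trailing spaces in string literals.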
From bugzilla-daemon at portal.open-bio.org Fri Sep 3 08:48:01 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 3 Sep 2010 04:48:01 -0400 Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader In-Reply-To: Message-ID: <201009030848.o838m1qv018822@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3132 ------- Comment #5 from jacobsen at ebi.ac.uk 2010-09-03 04:48 EST ------- Sorry Steve - I forgot to comment out the wrong bit of the test - the results are correct, but the test was wrong! This is fixed now. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Sep 3 10:43:08 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 3 Sep 2010 06:43:08 -0400 Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader In-Reply-To: Message-ID: <201009031043.o83Ah8rE008398@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3132 ------- Comment #6 from jacobsen at ebi.ac.uk 2010-09-03 06:43 EST ------- Updated SiteTest and PDBFileParserTest to include REMARK 800 parsing and functionality - new methods added to Site to get/set and print its related REMARK 800 section. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Sep 3 19:55:12 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Sep 2010 15:55:12 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009031955.o83JtCUJ023861@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #7 from darnells at dnastar.com 2010-09-03 15:55 EST -------
There is a defect in PDBFileParser.java regarding sites.

Line 2044:

    String pdbCode = groupString.substring(4, 10).trim();

This assigns pdbCode to "$CHAIN $RESNUM$INSCODE" (e.g. "H 221A"), where the Group interface documentation states it should just be $RESNUM$INSCODE (e.g. "221A").

Also, setParent() is never called, so the group is never assigned to a chain object. Currently, to get the chain id for a site residue, you need to use the first character of pdbCode. Is it possible to link the site residue groups to their correct chain? This is somewhat analogous to SSBond; is a similar approach appropriate?

From bugzilla-daemon at portal.open-bio.org Mon Sep 6 08:06:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 6 Sep 2010 04:06:01 -0400
Subject: [Biojava-dev] [Bug 3132] SITE records in PDBFileReader
In-Reply-To:
Message-ID: <201009060806.o86861xr002482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3132

------- Comment #8 from amr_alhossary at hotmail.com 2010-09-06 04:06 EST -------
Actually, Dr. Andreas's idea was to use the PDBResidueNumber bean, and that's what I did in my implementation. Anyway, we can update everything to use the new bean, but we first need to set clear definitions of terms like group, residue, HetAtom, etc. and the relations between them.
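To illustrate the substring issue raised in comment #7, here is a minimal sketch of taking a 10-character SITE residue field apart into residue name, chain id and residue number. It assumes the field is laid out as a three-letter name, a blank, a one-letter chain id, a four-column number and a one-column insertion code, which is consistent with substring(4, 10) yielding "H 221A" as described above. The class and method names are hypothetical and this is not the PDBFileParser code.

```java
// Hypothetical helper; the field layout is an assumption stated in the lead-in.
class SiteResidueFieldSketch {

    // Columns 0-2: residue name, e.g. "ARG".
    static String residueName(String field) {
        return field.substring(0, 3).trim();
    }

    // Column 4: chain id, e.g. "H".
    static String chainId(String field) {
        return field.substring(4, 5);
    }

    // Columns 5-9: residue number plus optional insertion code, e.g. "221A".
    static String residueNumber(String field) {
        return field.substring(5, 10).trim();
    }

    public static void main(String[] args) {
        String field = "ARG H 221A";
        System.out.println(residueName(field) + " / " + chainId(field)
                + " / " + residueNumber(field));
    }
}
```

Keeping the chain id separate from the residue number is what would allow a site residue to be looked up in, or attached to, its parent chain.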
From deniz.koellhofer at cambia.org Tue Sep 21 22:59:27 2010
From: deniz.koellhofer at cambia.org (Deniz Koellhofer)
Date: Wed, 22 Sep 2010 08:59:27 +1000
Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein
Message-ID:

Hi,

I'm trying to parse EMBL formatted files with RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID lines doesn't match.

It looks like the parser uses the EMBLFormat class with the following ID pattern:

protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

The ID lines in my files (retrieved from EMBL-EBI) look like

ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

It looks like the pattern is written specifically for DNA/RNA and should look more like:

protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

Or am I using the wrong RichSequence.IOTools function?
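The two patterns quoted in this message can be compared directly. The sketch below compiles both and applies them to the protein ID line from the message and to a nucleotide-style ID line. It assumes the pattern is applied to the text after the leading "ID" tag, which is a guess based on the pattern starting with the accession; the class name and the nucleotide sample line are illustrative only.

```java
import java.util.regex.Pattern;

// Stand-alone comparison of the two ID-line patterns quoted in the message.
class EmblIdLineSketch {

    // Pattern as quoted from EMBLFormat: only "BP" is accepted as length unit.
    static final Pattern BP_ONLY = Pattern.compile(
            "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
            + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Proposed variant: "BP" or "AA" is accepted as length unit.
    static final Pattern BP_OR_AA = Pattern.compile(
            "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
            + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    public static void main(String[] args) {
        // Assumption: the "ID" tag has already been removed from the line.
        String protein = "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";
        String nucleotide = "X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.";

        System.out.println("BP only,  protein:    " + BP_ONLY.matcher(protein).matches());
        System.out.println("BP only,  nucleotide: " + BP_ONLY.matcher(nucleotide).matches());
        System.out.println("BP or AA, protein:    " + BP_OR_AA.matcher(protein).matches());
        System.out.println("BP or AA, nucleotide: " + BP_OR_AA.matcher(nucleotide).matches());
    }
}
```

Note that the variant adds a capturing group for the unit. Since it is the last group, the numbering of the existing groups does not shift, but any code that checks the group count would see one more group.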
Cheers, Deniz -- Deniz Koellhofer Cambia Initiative for Open Innovation (IOI) Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia From deniz.koellhofer at cambia.org Wed Sep 22 23:10:58 2010 From: deniz.koellhofer at cambia.org (Deniz Koellhofer) Date: Thu, 23 Sep 2010 09:10:58 +1000 Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein In-Reply-To: <20100922211815.12439.qmail@mxw1102.verio-web.com> References: <20100922211815.12439.qmail@mxw1102.verio-web.com> Message-ID: Hi George, This entry is from the embl patent protein database: ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz Have you used the RichSequence.IOTools successfully for parsing EMBL protein files before? I assume this should always fail due to the "BP" in the regex? Deniz On Thu, Sep 23, 2010 at 7:18 AM, George Waldon wrote: > Hi Deniz: > > I have a quick question that may be obvious, but which database do you get > those protein files from? > > Thank you, > > George > > On Tue, Sep 21, 2010 at 3:59 PM, Deniz Koellhofer < > deniz.koellhofer at cambia.org> wrote: > > Hi, > > I'm trying to parse EMBL formatted files > with RichSequence.IOTools.readEMBLProtein() - but the pattern for the ID > lines don't match. 
> > Looks like the parser utilises the EMBLFormat class with the following > ID > pattern: > > *protected** **static** **final** Pattern **lp** = Pattern.compile(** > > "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$" > **);* > > The ID lines in my files (retrieved from EMBL-EBI) look like *ID > A00197; > SV 1; linear; protein; PRT; SYN; 602 AA.* > > Looks like the pattern is specifically written for dna/rna and should > more > look like: > > *protected** **static** **final** Pattern **lp** = Pattern.compile(** > > "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+); > **\\s+(\\d+)\\s+(BP|AA)\\.$"**);* > > Or am I using he wrong RichSequence.IOTools function? > > Cheers, > > Deniz > -- > Deniz Koellhofer > Cambia > Initiative for Open Innovation (IOI) > Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > -- Deniz Koellhofer Cambia Initiative for Open Innovation (IOI) Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia From gwaldon at geneinfinity.org Thu Sep 23 04:23:48 2010 From: gwaldon at geneinfinity.org (George Waldon) Date: Thu, 23 Sep 2010 00:23:48 -0400 Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein Message-ID: <20100923042349.77746.qmail@mxw1102.verio-web.com> Curious, this seems to be the only place to find this type of files. Not really an official format, a little bit like GenPept. Your fix should probably work. Can you fill a bug on bugzilla (http://bugzilla.open-bio.org/)? 
Best,
George

On Wed, Sep 22, 2010 at 4:10 PM, Deniz Koellhofer wrote:

Hi George,

This entry is from the embl patent protein database:
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz

Have you used the RichSequence.IOTools successfully for parsing EMBL protein files before? I assume this should always fail due to the "BP" in the regex?

Deniz

On Thu, Sep 23, 2010 at 7:18 AM, George Waldon wrote:

Hi Deniz:

I have a quick question that may be obvious, but which database do you get those protein files from?

Thank you,
George

On Tue, Sep 21, 2010 at 3:59 PM, Deniz Koellhofer wrote:

Hi,

I'm trying to parse EMBL formatted files with RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID lines doesn't match.

It looks like the parser uses the EMBLFormat class with the following ID pattern:

protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

The ID lines in my files (retrieved from EMBL-EBI) look like

ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

It looks like the pattern is written specifically for DNA/RNA and should look more like:

protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

Or am I using the wrong RichSequence.IOTools function?

Cheers,
Deniz

--
Deniz Koellhofer
Cambia
Initiative for Open Innovation (IOI)
Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia

From bugzilla-daemon at portal.open-bio.org Thu Sep 23 04:35:54 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Sep 2010 00:35:54 -0400
Subject: [Biojava-dev] [Bug 3137] New: RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries.
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3137

           Summary: RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries.
           Product: BioJava
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: seq.io
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: dkoellhofer at gmail.com

Hi,

I'm trying to parse EMBL formatted files with RichSequence.IOTools.readEMBLProtein() - but the pattern for the ID lines doesn't match.

Looks like the parser utilises the EMBLFormat class with the following ID pattern:

protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

The ID lines in my files (retrieved from EMBL-EBI) look like
ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

Looks like the pattern is specifically written for DNA/RNA and should look more like:

protected static final Pattern lp = Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

The failing protein sequences come from the EMBL patent protein database:
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz

Cheers,
Deniz

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Thu Sep 23 09:10:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Sep 2010 10:10:56 +0100
Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein
In-Reply-To: <20100923042349.77746.qmail@mxw1102.verio-web.com>
References: <20100923042349.77746.qmail@mxw1102.verio-web.com>
Message-ID:

On Thu, Sep 23, 2010 at 5:23 AM, George Waldon wrote:
>
> Curious, this seems to be the only place to find this type of file.
> Not really an official format, a little bit like GenPept. Your fix
> should probably work. Can you file a bug on bugzilla
> (http://bugzilla.open-bio.org/)?
>
> Best,
> George

Bug filed: http://bugzilla.open-bio.org/show_bug.cgi?id=3137

Interestingly Biopython doesn't support Protein EMBL files either (I didn't know they existed), I wonder if BioPerl does?

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 23 10:51:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Sep 2010 11:51:35 +0100
Subject: [Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein
In-Reply-To:
References: <20100923042349.77746.qmail@mxw1102.verio-web.com>
Message-ID:

Hi BioJava team et al,

Thanks for the indirect alert that protein EMBL files exist, e.g.
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz

I've just updated Biopython to support them.

Chris - I've CC'd you in case you want to look at this for BioPerl.

Peter

On Thu, Sep 23, 2010 at 10:10 AM, Peter wrote:
> On Thu, Sep 23, 2010 at 5:23 AM, George Waldon wrote:
>>
>> Curious, this seems to be the only place to find this type of file.
>> Not really an official format, a little bit like GenPept. Your fix
>> should probably work. Can you file a bug on bugzilla
>> (http://bugzilla.open-bio.org/)?
>>
>> Best,
>> George
>
> Bug filed: http://bugzilla.open-bio.org/show_bug.cgi?id=3137
>
> Interestingly Biopython doesn't support Protein EMBL files either
> (I didn't know they existed), I wonder if BioPerl does?
>
> Peter

From bugzilla-daemon at portal.open-bio.org Thu Sep 23 22:45:44 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Sep 2010 18:45:44 -0400
Subject: [Biojava-dev] [Bug 3138] New: SeqRes2AtomAligner misaligns Atom groups at N-terminus
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3138

           Summary: SeqRes2AtomAligner misaligns Atom groups at N-terminus
           Product: BioJava
           Version: live (CVS source)
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: structure
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: darnells at dnastar.com

The SeqRes2AtomAligner misaligns the N-terminus residues for 1BKV (Collagen) and 3NLC (VP0956). Both structures have SEQRES residues without corresponding ATOM residues at the N-terminus. In the case of 3NLC, the observed mapping is (* = no ATOM group):

SEQRES MGHHHHHHSHMIRINE
ATOM   M**********IRINE

and the correct mapping should be:

SEQRES MGHHHHHHSHMIRINE
ATOM   **********MIRINE

Activating the Chemical Component Dictionary does not help the alignment.

Is it significant that for 1BKV and 3NLC the first SEQRES group and the first ATOM group are the same residue (nearly the same for 1BKV, PRO ~ HYP)?
Is this a situation that is difficult for the Needleman-Wunsch algorithm used by SeqRes2AtomAligner?

From bugzilla-daemon at portal.open-bio.org Thu Sep 23 22:50:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Sep 2010 18:50:11 -0400
Subject: [Biojava-dev] [Bug 3138] SeqRes2AtomAligner misaligns Atom groups at N-terminus
In-Reply-To:
Message-ID: <201009232250.o8NMoBr0029097@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3138

ap3 at sanger.ac.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|biojava-dev at biojava.org   |andreas at sdsc.edu

From bugzilla-daemon at portal.open-bio.org Fri Sep 24 05:45:17 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Sep 2010 01:45:17 -0400
Subject: [Biojava-dev] [Bug 3137] RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries.
In-Reply-To:
Message-ID: <201009240545.o8O5jHw9007788@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3137

gwaldon at geneinfinity.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from gwaldon at geneinfinity.org 2010-09-24 01:45 EST -------
Committed new pattern and added a test. Note that protein sequences from KIPO have a non-regular ID line that prevents parsing.
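
The pattern change reported in bug 3137 can be checked in isolation with a minimal sketch like the one below. The two regular expressions are copied from the bug report; the class name, the field names and the stripIdTag helper are hypothetical and only exist for this illustration. The sketch assumes the "ID" tag is removed before the pattern is applied, which is what the leading ^(\\S+); in the pattern suggests, but the thread does not confirm how EMBLFormat does this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmblIdLineDemo {
    // Pattern as quoted in the bug report: the length unit must be "BP".
    public static final Pattern BP_ONLY = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Pattern as proposed in the bug report: the length unit may be "BP" or "AA".
    public static final Pattern BP_OR_AA = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    // Hypothetical helper (not BioJava API): drop the leading "ID" tag and
    // the whitespace after it, leaving the part the pattern is applied to.
    public static String stripIdTag(String line) {
        return line.replaceFirst("^ID\\s+", "");
    }

    public static void main(String[] args) {
        // ID line taken from the bug report.
        String body = stripIdTag("ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.");

        System.out.println("BP-only pattern matches: " + BP_ONLY.matcher(body).matches());

        Matcher m = BP_OR_AA.matcher(body);
        if (m.matches()) {
            // Group 1 is the accession, group 7 the length, group 8 the unit.
            System.out.println("accession=" + m.group(1)
                + " length=" + m.group(7) + " unit=" + m.group(8));
        }
    }
}
```

Note that the extra (BP|AA) group adds an eighth capture group at the end, so the numbering of the existing seven groups is unchanged.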

From gindin at gmail.com Sun Sep 26 23:05:50 2010
From: gindin at gmail.com (Yevgeniy Gindin)
Date: Sun, 26 Sep 2010 19:05:50 -0400
Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties
Message-ID:

There seems to be a bug in
org.biojavax.bio.alignment.blast.RemoteQBlastAlignmentProperties.setBlastProgram(String)

The method throws an exception when a valid program name is given. I believe that this is due to the fact that two String objects are tested with the "==" operator rather than equals(). To illustrate:

String a = new String("a");
String b = new String("a");
System.out.println(a == b);

The above will print false.

--
Yevgeniy

From pedros at berkeley.edu Mon Sep 27 19:55:05 2010
From: pedros at berkeley.edu (Pedro Silva)
Date: Mon, 27 Sep 2010 12:55:05 -0700
Subject: [Biojava-dev] BioJava 3 current status
Message-ID: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>

On 30 Jun 2010, at 12:18, Andreas Prlic wrote:
> About BioJava 3: This has made great progress over the last weeks and
> a lot of new functionality has been committed to SVN. To make this
> release ready there are now two new tools:
>
> * There is now a BioJava Maven repository, which is hosting SNAPSHOT
> builds from the current SVN.

Hi Andreas,

A couple of questions about Biojava 3, with which I'm trying to interoperate from Clojure:

1. You mention SVN above, but I also see an active git repository at github. Is that only a mirror, and active development occurs at biojava.org's svn repo?

2. Are all non biojava3-prefixed modules in the Maven repository not suitable to use with Biojava 3? E.g., biosql, das, etc.

3. Will the Maven repository eventually also host releases, not just snapshots? How far along would a release be?

To provide some context, I've started coding on 'Bioclojure', and my strategy is to position my efforts downstream from Biojava (3). To that end, I'm trying to figure out the best way to pull development snapshots and stable releases from you guys, as well as your current state of affairs in the porting of biojava1/biojavax to biojava3.

Thanks for your help.

Pedro

--
Pedro Silva
UC Berkeley - PMB

From andreas at sdsc.edu Mon Sep 27 20:40:30 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 27 Sep 2010 13:40:30 -0700
Subject: [Biojava-dev] BioJava 3 current status
In-Reply-To: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
References: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
Message-ID:

Hi Pedro,

> 1. You mention SVN above, but I also see an active git repository at github.
>    Is that only a mirror, and active development occurs at biojava.org's svn repo?

Yes, git is only a mirror. The main development happens in an ssh protected SVN, which is replicated onto the anonymous SVN and git servers (within one hour after a commit).

> 2. Are all non biojava3-prefixed modules in the Maven repository not suitable to use with Biojava 3?
>    Eg., biosql, das, etc.

In an ideal world we would have a clear separation of biojava 3 and biojava 1. However the way things are at the present there is some mix and there is still legacy code in SVN. e.g. the module "core" is essentially a lot of biojava1.7. It is up to the module maintainers to upgrade their code. We will need to take a decision soon if we want to drop all legacy code from SVN and only support biojava3 based code from that point on. ( need to do a dependency analysis how many modules are still depending on the old core).

> 3. Will the Maven repository eventually also host releases, not just snapshots?

yes

>    How far along would a release be?

We have not set a date at the present. I will write a proposal about the next steps to the mailing list soon and then this is up for discussion.
Having said this, several people are using the current SVN checkouts in production environments and what seems mostly missing is lots of documentation...

> To provide some context, I've started coding on 'Bioclojure', and my strategy is to position my efforts downstream from Biojava (3).

Excellent! Are you planning to add additional functionality or is it mainly a wrapper for biojava?

> To that end, I'm trying to figure out the best way to pull development snapshots and stable releases from you guys, as well as
> your current state of affairs in the porting of biojava1/biojavax to biojava3.

The automated build system pushes any successful build to
http://www.biojava.org/download/maven/
This means all libraries that are available from there are guaranteed to pass the junit tests and compile correctly.

Andreas

From pedros at berkeley.edu Mon Sep 27 23:13:17 2010
From: pedros at berkeley.edu (Pedro Silva)
Date: Mon, 27 Sep 2010 16:13:17 -0700
Subject: [Biojava-dev] BioJava 3 current status
In-Reply-To:
References: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
Message-ID: <1285629197.25539.38.camel@dev.dzlab.pmb.berkeley.edu>

On Mon, 2010-09-27 at 13:40 -0700, Andreas Prlic wrote:
> Excellent! Are you planning to add additional functionality or is it
> mainly a wrapper for biojava?

Well, it started as native Clojure all the way, then I figured it would be a wrapper around Biojava, *then* I figured, Clojure-Java interoperability is so good, I don't need any wrapper at all. That leaves adding additional functionality, plus wrapping the more verbose and/or common Biojava idioms.

Thanks for your answers again. I appreciate it.

Pedro

--
Pedro Silva
UC Berkeley - PMB

From gwaldon at geneinfinity.org Tue Sep 28 03:34:01 2010
From: gwaldon at geneinfinity.org (George Waldon)
Date: Mon, 27 Sep 2010 23:34:01 -0400
Subject: [Biojava-dev] BioJava 3 current status
Message-ID: <20100928033401.79626.qmail@mxw1102.verio-web.com>

On Mon, Sep 27, 2010 at 1:40 PM, Andreas Prlic wrote:
>In an ideal world we would have a clear separation of biojava 3 and
>biojava 1. However the way things are at the present there is some
>mix and there is still legacy code in SVN. e.g. the module "core" is
>essentially a lot of biojava1.7. It is up to the module maintainers to
>upgrade their code. We will need to take a decision soon if we want to
>drop all legacy code from SVN and only support biojava3 based code
>from that point on. ( need to do a dependency analysis how many
>modules are still depending on the old core).

I see a few (rather lonely) BioException in the biojava3-ws module and a "biojava1-core" dependency in org.biojava3.core.sequence.transcription.DefaultRNAProteinTranscription in the sequence-core module. I also see that many of the classes of the sequence-core module are duplicated in the biojava3-core module. Is there a reason for this duplication?

George

From andreas at sdsc.edu Tue Sep 28 17:17:37 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 28 Sep 2010 10:17:37 -0700
Subject: [Biojava-dev] BioJava 3 current status
In-Reply-To: <20100928033401.79626.qmail@mxw1102.verio-web.com>
References: <20100928033401.79626.qmail@mxw1102.verio-web.com>
Message-ID:

> I see a few (rather lonely) BioException in the biojava3-ws module and a "biojava1-core" dependency in org.biojava3.core.sequence.transcription.DefaultRNAProteinTranscription in the sequence-core module. I also see that many of the classes of the sequence-core module are duplicated in the biojava3-core module. Is there a reason for this duplication?

Andy, Scooter, is this a leftover from the initial days?
Can I delete the sequence module? Is this all biojava3-core now?

On a related matter, is anybody working on GenBank and EMBL file parsing for biojava3? There were some feature requests for this and it would be nice if this could be done in biojava3 as well...

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From sylvain.foisy at inflammgen.org Tue Sep 28 13:54:43 2010
From: sylvain.foisy at inflammgen.org (Sylvain Foisy)
Date: Tue, 28 Sep 2010 09:54:43 -0400
Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin)
Message-ID:

Hi,

Ok, it should be fixed in SVN. I actually replaced the == by a call to the Arrays.binarySearch(Object[] obj, Object obj) method, which makes traversing the array obsolete.

Thanks for pointing it out ;-)

Best regards

Sylvain

From HWillis at scripps.edu Tue Sep 28 17:30:50 2010
From: HWillis at scripps.edu (Scooter Willis)
Date: Tue, 28 Sep 2010 13:30:50 -0400
Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin)
Message-ID:

Andreas

Yes you can delete the sequence modules. I will add the parsers to the list and will use that to make sure core sequence module properly models the data elements.

Thanks

Scooter

----- Reply message -----
From: "Sylvain Foisy"
Date: Tue, Sep 28, 2010 1:18 pm
Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin)
To: "biojava-dev at lists.open-bio.org"

Hi,

Ok, it should be fixed in SVN. I actually replaced the == by a call to the Arrays.binarySearch(Object[] obj, Object obj) method, which makes traversing the array obsolete.
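
As a rough illustration of the == versus equals() issue and of the binarySearch replacement described above, here is a minimal sketch. The class name, the method and the list of program names are hypothetical and are not taken from RemoteQBlastAlignmentProperties. One point worth keeping in mind: java.util.Arrays.binarySearch only gives a defined result when the array is sorted in the natural order of its elements, so the sketch sorts its copy of the list once up front.

```java
import java.util.Arrays;

public class ProgramNameCheck {
    // Hypothetical list of accepted program names, used for illustration only.
    private static final String[] PROGRAMS =
        {"tblastx", "blastp", "blastn", "tblastn", "blastx"};

    static {
        // binarySearch requires a sorted array; its result is undefined otherwise.
        Arrays.sort(PROGRAMS);
    }

    // Returns true if name is one of the accepted program names.
    public static boolean isValidProgram(String name) {
        // Guard against null, which binarySearch would fail on when comparing.
        return name != null && Arrays.binarySearch(PROGRAMS, name) >= 0;
    }

    public static void main(String[] args) {
        String fromUser = new String("blastp"); // a distinct String object

        // Reference comparison: two different objects, so this is false.
        System.out.println(fromUser == "blastp");
        // Content comparison: same characters, so this is true.
        System.out.println(fromUser.equals("blastp"));
        // binarySearch compares with compareTo, so object identity does not matter.
        System.out.println(isValidProgram(fromUser));
    }
}
```

For a handful of names a plain loop using equals(), or a java.util.Set with contains(), would work equally well and does not depend on the array being sorted.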

Thanks for pointing it out ;-)

Best regards

Sylvain

From bugzilla-daemon at portal.open-bio.org Wed Sep 29 04:48:46 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Sep 2010 00:48:46 -0400
Subject: [Biojava-dev] [Bug 3140] New: Required Correction in GenbankLocationParser class
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3140

           Summary: Required Correction in GenbankLocationParser class
           Product: BioJava
           Version: unspecified
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: seq.io
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: gwaldon at geneinfinity.org

Reported on behalf of Deepak Sheoran:

There is a problem with the GenbankLocationParser class: it does not process the GenBank record with accession M32882. The parser fails at the following lines in the GenBank record:

     gene            join((8298.8300)..10206,1..855)
                     /gene="bcn"
     mRNA            join((8298.8300)..10206,1..855)
                     /gene="bcn"
                     /note="alternative transcript"

The exception stack trace is as follows:

Could not understand position: 10206,1..855
org.biojava.bio.seq.io.ParseException: Could not understand position: 10206,1..855
    at org.biojavax.bio.seq.io.GenbankLocationParser.parsePosition(GenbankLocationParser.java:285)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parsePosition(GenbankLocationParser.java:285)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parseLocString(GenbankLocationParser.java:277)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parseLocString(GenbankLocationParser.java:244)
    at org.biojavax.bio.seq.io.GenbankLocationParser.parseLocation(GenbankLocationParser.java:131)

I investigated this and found the defect in the regular expression named "gp" in the GenbankLocationParser class. The error can be fixed by applying the attached patch.
For testing I have created a method which shows that the parser can now understand all the possible combinations of locations. This test class is also attached so that you can test my patch before and after its application.

From bugzilla-daemon at portal.open-bio.org Wed Sep 29 04:52:33 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Sep 2010 00:52:33 -0400
Subject: [Biojava-dev] [Bug 3140] Required Correction in GenbankLocationParser class
In-Reply-To:
Message-ID: <201009290452.o8T4qXkg023352@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3140

gwaldon at geneinfinity.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from gwaldon at geneinfinity.org 2010-09-29 00:52 EST -------
Applied proposed patch and test case.

Index: GenbankLocationParser.java
===================================================================
--- GenbankLocationParser.java (revision 8212)
+++ GenbankLocationParser.java (working copy)
@@ -133,7 +133,7 @@
     // O beautiful regex, we worship you. (:-)
     // this matches grouped locations
-    private static Pattern gp = Pattern.compile("^([^\$\$:]*?:)?(complement|join|order)?\$*{0,1}(.*?)\$*{0,1}$");
+    private static Pattern gp = Pattern.compile("^([^\$\$:]*?:)?(complement|join|order)?\${0,1}(.*?\$*{0,1})$");
     // this matches range locations
     private static Pattern rp = Pattern.compile("^\$*(.*?)\$*(\\.\\.\$*(.*)\$*)?$");
     // this matches accession/version pairs

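The failing location from bug 3140, join((8298.8300)..10206,1..855), is awkward because parentheses appear at two levels: around the join operands and around the first position. The sketch below is not the committed patch (which adjusts the "gp" regular expression) and does not use any BioJava class. It is a hypothetical, regex-free illustration of the underlying idea: remove only the operator's own outer parentheses, then split on commas that are not nested inside further parentheses. All class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class LocationSplitDemo {
    // Hypothetical helper: true for the grouping operators named in the "gp" pattern.
    private static boolean isOperator(String s) {
        return s.equals("join") || s.equals("order") || s.equals("complement");
    }

    // Returns the text inside the operator's outer parentheses,
    // e.g. join(A,B) gives A,B. Anything else is returned unchanged.
    public static String innerOf(String loc) {
        int open = loc.indexOf('(');
        if (open > 0 && loc.endsWith(")") && isOperator(loc.substring(0, open))) {
            return loc.substring(open + 1, loc.length() - 1);
        }
        return loc;
    }

    // Splits on commas at parenthesis depth 0 only, so a comma inside a
    // nested group would stay with its group.
    public static List<String> splitTopLevel(String s) {
        List<String> parts = new ArrayList<String>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String loc = "join((8298.8300)..10206,1..855)";
        for (String part : splitTopLevel(innerOf(loc))) {
            System.out.println(part);
        }
    }
}
```

With the location from the bug report this yields the two members (8298.8300)..10206 and 1..855, rather than the fragment 10206,1..855 seen in the stack trace.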
From chapmanb at 50mail.com Wed Sep 29 11:49:24 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 29 Sep 2010 07:49:24 -0400
Subject: [Biojava-dev] BioJava 3 current status
In-Reply-To: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
References: <1285617305.25539.15.camel@dev.dzlab.pmb.berkeley.edu>
Message-ID: <20100929114924.GC8837@sobchak.mgh.harvard.edu>

Pedro;

> To provide some context, I've started coding on 'Bioclojure', and my
> strategy is to position my efforts downstream from Biojava (3).

That's awesome. I've been teaching myself Clojure to be able to interoperate with Biojava and the GATK/Picard toolkits. Great to hear you are tackling this. Have you also seen Jan Aerts' work?

http://github.com/jandot/bioclojure

Brad

From andreas at sdsc.edu Thu Sep 30 02:17:45 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 29 Sep 2010 19:17:45 -0700
Subject: [Biojava-dev] Bug in RemoteQBlastAlignmentProperties (Yevgeniy Gindin)
In-Reply-To:
References: