From phidias51 at gmail.com Sat Dec 6 11:34:39 2008
From: phidias51 at gmail.com (Mark Fortner)
Date: Sat, 6 Dec 2008 08:34:39 -0800
Subject: [Biojava-dev] File Validator
Message-ID: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com>

I've noticed that a lot of the email on the mailing list from users tends to revolve around the inability to parse a file of a given file type. In most of the cases it turns out that the file either does not conform to the standard, or the data in the file apparently violates XML rules of well-formedness.

It occurred to me that we might put a page in the Cookbook that describes basic troubleshooting techniques. Richard's past emails definitely contain a lot of useful information and could be used as a basis for the page.

I also wondered if there were any plans in BioJava3 to include some sort of file validator (either as an integral part of the parsing framework or as a separate utility that could be run against any problematic file)? In most cases, the user simply wants to know what part of the file is broken so that they can fix the file and carry on (or notify the data provider of the problem and have them address the issue).

Regards,

Mark Fortner

From markjschreiber at gmail.com Sat Dec 6 20:03:28 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sun, 7 Dec 2008 09:03:28 +0800
Subject: [Biojava-dev] File Validator
In-Reply-To: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com>
References: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com>
Message-ID: <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com>

I would agree that a file validator would be excellent although sometimes hard to write. The problem is mainly with the flat file formats. When we wrote the biojavax parsers we tried to make them conform to the descriptions given by NCBI etc. The problem is that they don't always conform to this.
I think a possible problem with NCBI is that all their flat files are produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be validated quite easily. The flatfiles are produced by a transformation of the XML so they aren't always going to match the description. Finally other people produce 'Genbank' and 'EMBL' files that are really just a similar format but not the real thing.

One of the most troublesome formats is FASTA. Not because it is difficult but because people try to code all manner of metadata into the header without any convention existing.

Overall I would say whenever possible parse XML; this should be the safest bet, although not always possible.

- Mark

On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner wrote:
> I've noticed that a lot of the email on the mailing list from users tends to
> revolve around the inability to parse a file of a given file type. In most
> of the cases it turns out that the file either does not conform to the
> standard, or the data in the file apparently violates XML rules of
> well-formedness.
>
> It occurred to me that we might put a page in the Cookbook that describes
> basic troubleshooting techniques. Richard's past emails definitely contain a
> lot of useful information and could be used as a basis for the page.
>
> I also wondered if there were any plans in BioJava3 to include some sort of
> file validator (either as an integral part of the parsing framework or as a
> separate utility that could be run against any problematic file)? In most
> cases, the user simply wants to know what part of the file is broken so that
> they can fix the file and carry on (or notify the data provider of the
> problem and have them address the issue).
>
> Regards,
>
> Mark Fortner
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From holland at eaglegenomics.com Sun Dec 7 14:03:10 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 07 Dec 2008 19:03:10 +0000
Subject: [Biojava-dev] File Validator
In-Reply-To: <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com>
References: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com>
Message-ID: <493C1DEE.1010105@eaglegenomics.com>

I like the idea of a validator. It should probably just be the standard parser run with some kind of a 'report errors but carry on parsing anyway' flag set (which currently doesn't exist). After all the standard parser is conforming to the published format so should be able to spot most errors.

Such a flag does not yet exist, but yes it would be nice to incorporate it in future versions.

cheers,
Richard

Mark Schreiber wrote:
> I would agree that a file validator would be excellent although
> sometimes hard to write. The problem is mainly with the flat file
> formats. When we wrote the biojavax parsers we tried to make them
> conform to the descriptions given by NCBI etc. The problem is that
> they don't always conform to this.
>
> I think a possible problem with NCBI is that all their flat files are
> produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be
> validated quite easily. The flatfiles are produced by a transformation
> of the XML so they aren't always going to match the description.
> Finally other people produce 'Genbank' and 'EMBL' files that are
> really just a similar format but not the real thing.
>
> One of the most troublesome formats is FASTA. Not because it is
> difficult but because people try to code all manner of metadata into
> the header without any convention existing.
>
> Overall I would say whenever possible parse XML; this should be the
> safest bet, although not always possible.
>
> - Mark
>
> On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner wrote:
>> I've noticed that a lot of the email on the mailing list from users tends to
>> revolve around the inability to parse a file of a given file type. In most
>> of the cases it turns out that the file either does not conform to the
>> standard, or the data in the file apparently violates XML rules of
>> well-formedness.
>>
>> It occurred to me that we might put a page in the Cookbook that describes
>> basic troubleshooting techniques. Richard's past emails definitely contain a
>> lot of useful information and could be used as a basis for the page.
>>
>> I also wondered if there were any plans in BioJava3 to include some sort of
>> file validator (either as an integral part of the parsing framework or as a
>> separate utility that could be run against any problematic file)? In most
>> cases, the user simply wants to know what part of the file is broken so that
>> they can fix the file and carry on (or notify the data provider of the
>> problem and have them address the issue).
>>
>> Regards,
>>
>> Mark Fortner
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From markjschreiber at gmail.com Sun Dec 7 22:57:40 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 8 Dec 2008 11:57:40 +0800
Subject: [Biojava-dev] File Validator
In-Reply-To: <493C1DEE.1010105@eaglegenomics.com>
References: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com> <493C1DEE.1010105@eaglegenomics.com>
Message-ID: <93b45ca50812071957u345f37dbp6f31537ed7074654@mail.gmail.com>

A way to skip bad files in a stream would be great for general purpose use as well. Currently it's a pain to parse lots of files only to fail three quarters of the way through.

- Mark

On Mon, Dec 8, 2008 at 3:03 AM, Richard Holland wrote:
> I like the idea of a validator. It should probably just be the standard
> parser run with some kind of a 'report errors but carry on parsing
> anyway' flag set (which currently doesn't exist). After all the standard
> parser is conforming to the published format so should be able to spot
> most errors.
>
> Such a flag does not yet exist, but yes it would be nice to incorporate
> it in future versions.
>
> cheers,
> Richard
>
> Mark Schreiber wrote:
>> I would agree that a file validator would be excellent although
>> sometimes hard to write. The problem is mainly with the flat file
>> formats. When we wrote the biojavax parsers we tried to make them
>> conform to the descriptions given by NCBI etc. The problem is that
>> they don't always conform to this.
>>
>> I think a possible problem with NCBI is that all their flat files are
>> produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be
>> validated quite easily. The flatfiles are produced by a transformation
>> of the XML so they aren't always going to match the description.
>> Finally other people produce 'Genbank' and 'EMBL' files that are
>> really just a similar format but not the real thing.
>>
>> One of the most troublesome formats is FASTA. Not because it is
>> difficult but because people try to code all manner of metadata into
>> the header without any convention existing.
>>
>> Overall I would say whenever possible parse XML; this should be the
>> safest bet, although not always possible.
>>
>> - Mark
>>
>> On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner wrote:
>>> I've noticed that a lot of the email on the mailing list from users tends to
>>> revolve around the inability to parse a file of a given file type. In most
>>> of the cases it turns out that the file either does not conform to the
>>> standard, or the data in the file apparently violates XML rules of
>>> well-formedness.
>>>
>>> It occurred to me that we might put a page in the Cookbook that describes
>>> basic troubleshooting techniques. Richard's past emails definitely contain a
>>> lot of useful information and could be used as a basis for the page.
>>>
>>> I also wondered if there were any plans in BioJava3 to include some sort of
>>> file validator (either as an integral part of the parsing framework or as a
>>> separate utility that could be run against any problematic file)? In most
>>> cases, the user simply wants to know what part of the file is broken so that
>>> they can fix the file and carry on (or notify the data provider of the
>>> problem and have them address the issue).
>>>
>>> Regards,
>>>
>>> Mark Fortner
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

From simpleyrx at 163.com Wed Dec 10 01:35:06 2008
From: simpleyrx at 163.com (simpleyrx)
Date: Wed, 10 Dec 2008 14:35:06 +0800 (CST)
Subject: [Biojava-dev] can biojava calculate ssea(secondary structure elements alignment) ?
Message-ID: <9641867.357891228890906675.JavaMail.coremail@bj163app105.163.com>

Dear experts,

I wonder whether BioJava can calculate SSEA (protein secondary structure element alignment)?

--
Renxiang Yan

From holland at eaglegenomics.com Fri Dec 19 06:28:17 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 19 Dec 2008 11:28:17 +0000
Subject: [Biojava-dev] Biojava3 updates
Message-ID: <494B8551.7050406@eaglegenomics.com>

It seems I forgot to commit my FASTA parser code last time round. I've just committed it now, along with a new class called ThingParserFactory to make file reading/writing much easier.
See the updated docs here for a how-to:
http://www.biojava.org/wiki/BioJava3:HowTo#FASTA

cheers,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Fri Dec 19 05:25:55 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 19 Dec 2008 10:25:55 +0000
Subject: [Biojava-dev] Annotations and Hibernate IDs
Message-ID: <494B76B3.7040408@eaglegenomics.com>

Hi all,

I've just made a commit to the trunk of BioJavaX which resolves the following points:

1. Deprecated get/setProperty in RichAnnotation (hopefully no more confusion - people should use get/setNote[Set] instead).

2. Updated Rich* classes to explicitly specify RichAnnotation instead of Annotation (this means getAnnotation now returns RichAnnotation, not plain old Annotation, which helps with point 1 above).

3. Made all IDs on BioSQL-Rich* classes publicly get/settable. Use with caution! This allows you to identify individual database records from within your code.

cheers,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Fri Dec 19 07:01:14 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 19 Dec 2008 12:01:14 +0000
Subject: [Biojava-dev] Please help - BugZilla
Message-ID: <494B8D0A.4090104@eaglegenomics.com>

Hi all.

I'd like to make a plea for help! There are about 16 reported bugs still open in BugZilla which have been there for quite some time.

http://bugzilla.open-bio.org/buglist.cgi?product=BioJava&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED

I would really appreciate it if a few people could "Adopt A Bug" and take a serious look at seeing if they can fix it. Just one bug per person would make all the difference. It would hugely help the project, and I would be eternally grateful.
You can let me know if you've adopted a bug by assigning it to yourself in BugZilla. Currently only one of the 16 is actually assigned to anyone (thanks Andreas!), but I'm hoping that maybe someone out there will have a few moments to spare over the forthcoming holiday season and might fancy a challenge.

Remember, a bug is for life, not just for Christmas!

cheers,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From bugzilla-daemon at portal.open-bio.org Fri Dec 19 13:05:23 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 19 Dec 2008 13:05:23 -0500
Subject: [Biojava-dev] [Bug 2602] ParseException thrown when parsing Genbank file.
In-Reply-To:
Message-ID: <200812191805.mBJI5N01013362@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2602

me at hongyu.org changed:

      What |Removed                   |Added
----------------------------------------------------------------------------
AssignedTo |biojava-dev at biojava.org |me at hongyu.org

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
You are the assignee for the bug, or are watching the assignee.
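The 'report errors but carry on parsing anyway' behaviour proposed in the File Validator thread, which is also relevant to parse failures such as Bug 2602, can be sketched roughly as follows. This is only an illustration under stated assumptions: the class LenientFastaCheck and its validate method are hypothetical names and are not part of BioJava or BioJavaX, and the checks cover only a small subset of what a real FASTA validator would need. The point is the control flow: each problem is recorded with its line number and scanning continues, rather than stopping at the first error.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lenient FASTA checker for illustration; not a BioJava class. */
final class LenientFastaCheck {

    /** Scans FASTA text and returns one message per problem instead of throwing on the first. */
    static List<String> validate(String text) {
        List<String> problems = new ArrayList<>();
        String[] lines = text.split("\\r?\\n");
        boolean inRecord = false;    // true once a '>' header has been seen
        boolean hasResidues = false; // true once the current record has sequence data
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            if (line.trim().isEmpty()) {
                continue; // blank lines are tolerated in this sketch
            }
            if (line.startsWith(">")) {
                if (inRecord && !hasResidues) {
                    problems.add("line " + lineNo + ": previous record has no sequence");
                }
                if (line.length() == 1) {
                    problems.add("line " + lineNo + ": empty header");
                }
                inRecord = true;
                hasResidues = false;
            } else if (!inRecord) {
                problems.add("line " + lineNo + ": sequence data before first '>' header");
            } else {
                // Assumed alphabet: letters plus '*' and '-'; real formats may allow more.
                if (!line.trim().matches("[A-Za-z*-]+")) {
                    problems.add("line " + lineNo + ": unexpected character in sequence");
                }
                hasResidues = true;
            }
        }
        if (inRecord && !hasResidues) {
            problems.add("line " + lineNo + ": last record has no sequence");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Three deliberate problems, on lines 3, 4 and 5.
        String text = ">seq1 ok\nACGT\n>\nAC1T\n>seq3\n";
        for (String problem : validate(text)) {
            System.out.println(problem);
        }
    }
}
```

Returning a list of messages keeps the caller in control: a validator utility could print them all, while a batch loader could use an empty list as the signal that a record is safe to parse.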
From bugzilla-daemon at portal.open-bio.org Fri Dec 19 16:02:52 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 19 Dec 2008 16:02:52 -0500
Subject: [Biojava-dev] [Bug 2716] New: Retrieve Partial CDS/Gene Information
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2716

   Summary: Retrieve Partial CDS/Gene Information
   Product: BioJava
   Version: live (CVS source)
  Platform: Macintosh
OS/Version: Mac OS
    Status: NEW
  Severity: enhancement
  Priority: P2
 Component: DB / BioSQL
AssignedTo: biojava-dev at biojava.org
ReportedBy: bandarus at niaid.nih.gov
        CC: gopalanv at niaid.nih.gov

Overview: Add an enhancement to biojavax to support retrieval of partial gene or partial CDS information for a partial gene/CDS feature of a database record.

Actual Result: The RichSequence holds this information initially, but it is lost when the sequence is retrieved (after being saved in a MySQL database using the BioSQL schema).

Expected Result: To be able to retrieve the partial gene/CDS feature location.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From simpleyrx at 163.com Mon Dec 22 01:23:12 2008
From: simpleyrx at 163.com (simpleyrx)
Date: Mon, 22 Dec 2008 14:23:12 +0800 (CST)
Subject: [Biojava-dev] Could sb give me a copy of source code for profile profile alignment ?
Message-ID: <1895183.477771229926992135.JavaMail.coremail@app157.163.com>

Dear experts,

Could somebody give me a copy of the source code for profile-profile alignment? I need it very much.
--
Renxiang Yan

From holland at eaglegenomics.com Mon Dec 22 07:11:50 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 22 Dec 2008 12:11:50 +0000
Subject: [Biojava-dev] BlastXML parsers
Message-ID: <494F8406.2070901@eaglegenomics.com>

Mark Schreiber has kindly written some BlastXML parsers which can be found in the biojava-blastxml module in the BioJava3 repository. If you run your blasts with XML output, they will be able to fully parse every kind of blast output supported by NCBI blast.

cheers,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From armita_sh at yahoo.com Wed Dec 24 12:02:11 2008
From: armita_sh at yahoo.com (Armita Sheari)
Date: Wed, 24 Dec 2008 09:02:11 -0800 (PST)
Subject: [Biojava-dev] How can I calculate the RMSD of Global Alignment?
Message-ID: <537200.67656.qm@web51412.mail.re2.yahoo.com>

Hi everyone,

I want to calculate the RMSD relevant to the structural global alignment of two proteins. I have used the align method of the StructurePairAligner class with default parameters, and then calculated the RMSD using the getRmsd method of the AlternativeAlignment class. But it seems something is wrong.

I think I should change some parameters of the align method, which are defined in the StrucAligParameters class. Unfortunately, I couldn't find any documentation that describes the parameters (neither in the API docs nor in the source code).

I would be thankful if you could take a look at my code and let me know your opinion about which parameter(s) I should change.

StructurePairAligner structurePairAligner = new StructurePairAligner();
structurePairAligner.align(structure1, structure2);
AlternativeAlignment[] alternativeAlignment = structurePairAligner.getAlignments();
ClusterAltAligs.cluster(alternativeAlignment);

double minRmsd = 1000;
double rmsd = 0;
for(int i = 0; i < alternativeAlignment.length; i++)
{
    rmsd = alternativeAlignment[i].getRmsd();
    if(rmsd < minRmsd) minRmsd = rmsd;
    rmsd = 0;
}
return minRmsd;

Thanks,
Armitash

From armita_sh at yahoo.com Mon Dec 29 07:44:16 2008
From: armita_sh at yahoo.com (Armita Sheari)
Date: Mon, 29 Dec 2008 04:44:16 -0800 (PST)
Subject: [Biojava-dev] How can I calculate the RMSD of Global Alignment?
In-Reply-To: <59a41c430812250420j35947f87i6589735ff4c4273c@mail.gmail.com>
Message-ID: <577293.91936.qm@web51403.mail.re2.yahoo.com>

Dear Andreas,

Thanks for your answer. I ran the example you linked. As far as I can tell, all the RMSD values calculated in the example relate to local alignments, whereas I need the RMSD of a global alignment (without any gaps), calculated over all alpha carbons of the proteins.

Which parameters should I change in the align method to get a global alignment with no gaps?

Thanks again,
ArmitaSh

--- On Thu, 12/25/08, Andreas Prlic wrote:

From: Andreas Prlic
Subject: Re: [Biojava-dev] How can I calculate the RMSD of Global Alignment?
To: armita_sh at yahoo.com
Cc: "Biojava", "biojava-dev"
Date: Thursday, December 25, 2008, 7:20 PM

Hi Armita,

I agree the missing documentation for all the alignment parameters is a problem...

It is not exactly clear from your mail what problem you encountered. The code that you sent in principle looks fine. I suspect you want to select the "best" of the alternative alignments? In that case you can simply take the first, since the alignments come out sorted. An example can be run from here:
http://www.biojava.org/download/performance/biojava-structure-example1.jnlp

The alternative alignments are sorted according to their number of structurally equivalent residues.
The rmsd in the different alternative solutions is always kept under a certain threshold (one of the parameters), since one of the strategies is to try to keep the rmsd constant, while maximizing the number of structurally equivalent residues. Similar results are clustered together in the same cluster. In the example shown above this makes it possible to find the multiple matches between the four chains of hemoglobin and myoglobin.

Andreas

On Wed, Dec 24, 2008 at 6:02 PM, Armita Sheari wrote:
> Hi everyone,
>
> I want to calculate the RMSD relevant to the Structural Global Alignment of two proteins. I have used the align method of the StructurePairAligner class with default parameters, and then I have calculated the RMSD using the getRmsd method of the AlternativeAlignment class. But it seems something is wrong.
> I think I should change some parameters of align method which are defined in StrucAligParameters class. Unfortunately, I couldn't find any documentation that describes the parameters (not in the api nor in the source code).
>
> I would be thankful if you take a look at my code and let me know your opinion about which parameter(s) I should change.
> > StructurePairAligner structurePairAligner = new StructurePairAligner(); > structurePairAligner.align(structure1, structure2); > AlternativeAlignment[] alternativeAlignment = structurePairAligner.getAlignments(); > ClusterAltAligs.cluster(alternativeAlignment); > > double minRmsd = 1000; > double rmsd = 0; > for(int i = 0; i < alternativeAlignment.length; i++) > { > rmsd = alternativeAlignment[i].getRmsd(); > if(rmsd < minRmsd) minRmsd = rmsd; > rmsd = 0; > } > return minRmsd; > > Thanks, > Armitash > > > > > > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From phidias51 at gmail.com Sat Dec 6 16:34:39 2008 From: phidias51 at gmail.com (Mark Fortner) Date: Sat, 6 Dec 2008 08:34:39 -0800 Subject: [Biojava-dev] File Validator Message-ID: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> I've noticed that a lot of the email on the mailing list from users tends to revolve around the inability to parse a file of a given file type. In most of the cases it turns out that the file either does not conform to the standard, or the data in the file apparently violates XML rules of well-formedness. It occurred to me that we might put a page in the Cookbook that describes basic troubleshooting techniques. Richards past emails definitely contain a lot of useful information and could be used as a basis for the page. I also wondered if there were any plans in BioJava3 to include some sort of file validator (either as an integral part of the parsing framework or as a separate utility that could be run against any problematic file)? In most cases, the user simply wants to know what part of the file is broken so that they can fix the file and carry on (or notify the data provider of the problem and have them address the issue). 
Regards, Mark Fortner From markjschreiber at gmail.com Sun Dec 7 01:03:28 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sun, 7 Dec 2008 09:03:28 +0800 Subject: [Biojava-dev] File Validator In-Reply-To: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> References: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> Message-ID: <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com> I would agree that a file validator would be excellent although sometimes hard to write. The problem is mainly with the flat file formats. When we wrote the biojavax parsers we tried to make them conform to the descriptions given by NCBI etc. The problem is that they don't always conform to this. I think a possible problem with NCBI is that all their flat files are produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be validated quite easily. The flatfiles are produced by a transformation of the XML so they aren't always going to match the description. Finally other people produce 'Genbank' and 'EMBL' files that are really just a similar format but not the real thing. One of the most troublesome formats is FASTA. Not because it is difficult but because people try to code all manner of metadata into the header without any convention existing. Overall I would say whenever possible parse XML this should be the safest bet, although not always possible. - Mark On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner wrote: > I've noticed that a lot of the email on the mailing list from users tends to > revolve around the inability to parse a file of a given file type. In most > of the cases it turns out that the file either does not conform to the > standard, or the data in the file apparently violates XML rules of > well-formedness. > > It occurred to me that we might put a page in the Cookbook that describes > basic troubleshooting techniques. 
Richards past emails definitely contain a > lot of useful information and could be used as a basis for the page. > > I also wondered if there were any plans in BioJava3 to include some sort of > file validator (either as an integral part of the parsing framework or as a > separate utility that could be run against any problematic file)? In most > cases, the user simply wants to know what part of the file is broken so that > they can fix the file and carry on (or notify the data provider of the > problem and have them address the issue). > > Regards, > > Mark Fortner > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From holland at eaglegenomics.com Sun Dec 7 19:03:10 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Sun, 07 Dec 2008 19:03:10 +0000 Subject: [Biojava-dev] File Validator In-Reply-To: <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com> References: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com> Message-ID: <493C1DEE.1010105@eaglegenomics.com> I like the idea of a validator. It should probably just be the standard parser run with some kind of a 'report errors but carry on parsing anyway' flag set (which currently doesn't exist). After all the standard parser is conforming to the published format so should be able to spot most errors. Such a flag does not yet exist, but yes it would be nice to incorporate it in future versions. cheers, Richard Mark Schreiber wrote: > I would agree that a file validator would be excellent although > sometimes hard to write. The problem is mainly with the flat file > formats. When we wrote the biojavax parsers we tried to make them > conform to the descriptions given by NCBI etc. The problem is that > they don't always conform to this. 
> > I think a possible problem with NCBI is that all their flat files are > produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be > validated quite easily. The flatfiles are produced by a transformation > of the XML so they aren't always going to match the description. > Finally other people produce 'Genbank' and 'EMBL' files that are > really just a similar format but not the real thing. > > One of the most troublesome formats is FASTA. Not because it is > difficult but because people try to code all manner of metadata into > the header without any convention existing. > > Overall I would say whenever possible parse XML this should be the > safest bet, although not always possible. > > - Mark > > On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner wrote: >> I've noticed that a lot of the email on the mailing list from users tends to >> revolve around the inability to parse a file of a given file type. In most >> of the cases it turns out that the file either does not conform to the >> standard, or the data in the file apparently violates XML rules of >> well-formedness. >> >> It occurred to me that we might put a page in the Cookbook that describes >> basic troubleshooting techniques. Richards past emails definitely contain a >> lot of useful information and could be used as a basis for the page. >> >> I also wondered if there were any plans in BioJava3 to include some sort of >> file validator (either as an integral part of the parsing framework or as a >> separate utility that could be run against any problematic file)? In most >> cases, the user simply wants to know what part of the file is broken so that >> they can fix the file and carry on (or notify the data provider of the >> problem and have them address the issue). 
>> >> Regards, >> >> Mark Fortner >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From markjschreiber at gmail.com Mon Dec 8 03:57:40 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Mon, 8 Dec 2008 11:57:40 +0800 Subject: [Biojava-dev] File Validator In-Reply-To: <493C1DEE.1010105@eaglegenomics.com> References: <6e1d61f50812060834p4457a3a8m4aad0783ab46afcf@mail.gmail.com> <93b45ca50812061703v5bbc9575p5bb5795a5791ec1@mail.gmail.com> <493C1DEE.1010105@eaglegenomics.com> Message-ID: <93b45ca50812071957u345f37dbp6f31537ed7074654@mail.gmail.com> A way to skip bad files in a stream would be great for general purpose use as well. Currently it's a pain to parse lots of files only to fail three quaters of the way through. - Mark On Mon, Dec 8, 2008 at 3:03 AM, Richard Holland wrote: > I like the idea of a validator. It should probably just be the standard > parser run with some kind of a 'report errors but carry on parsing > anyway' flag set (which currently doesn't exist). After all the standard > parser is conforming to the published format so should be able to spot > most errors. > > Such a flag does not yet exist, but yes it would be nice to incorporate > it in future versions. > > cheers, > Richard > > Mark Schreiber wrote: >> I would agree that a file validator would be excellent although >> sometimes hard to write. The problem is mainly with the flat file >> formats. When we wrote the biojavax parsers we tried to make them >> conform to the descriptions given by NCBI etc. 
The problem is that >> they don't always conform to this. >> >> I think a possible problem with NCBI is that all their flat files are >> produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be >> validated quite easily. The flatfiles are produced by a transformation >> of the XML so they aren't always going to match the description. >> Finally other people produce 'Genbank' and 'EMBL' files that are >> really just a similar format but not the real thing. >> >> One of the most troublesome formats is FASTA. Not because it is >> difficult but because people try to code all manner of metadata into >> the header without any convention existing. >> >> Overall I would say whenever possible parse XML this should be the >> safest bet, although not always possible. >> >> - Mark >> >> On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner wrote: >>> I've noticed that a lot of the email on the mailing list from users tends to >>> revolve around the inability to parse a file of a given file type. In most >>> of the cases it turns out that the file either does not conform to the >>> standard, or the data in the file apparently violates XML rules of >>> well-formedness. >>> >>> It occurred to me that we might put a page in the Cookbook that describes >>> basic troubleshooting techniques. Richards past emails definitely contain a >>> lot of useful information and could be used as a basis for the page. >>> >>> I also wondered if there were any plans in BioJava3 to include some sort of >>> file validator (either as an integral part of the parsing framework or as a >>> separate utility that could be run against any problematic file)? In most >>> cases, the user simply wants to know what part of the file is broken so that >>> they can fix the file and carry on (or notify the data provider of the >>> problem and have them address the issue). 
>>> >>> Regards, >>> >>> Mark Fortner >>> _______________________________________________ >>> biojava-dev mailing list >>> biojava-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>> >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > M: +44 7500 438846 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From simpleyrx at 163.com Wed Dec 10 06:35:06 2008 From: simpleyrx at 163.com (simpleyrx) Date: Wed, 10 Dec 2008 14:35:06 +0800 (CST) Subject: [Biojava-dev] can biojava calculate ssea(secondary structure elements alignment) ? Message-ID: <9641867.357891228890906675.JavaMail.coremail@bj163app105.163.com> Dear experts, I wonder whether biojava can calculate ssea (protein secondary structure elements alignment)? -- Renxiang Yan From holland at eaglegenomics.com Fri Dec 19 11:28:17 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 19 Dec 2008 11:28:17 +0000 Subject: [Biojava-dev] Biojava3 updates Message-ID: <494B8551.7050406@eaglegenomics.com> It seems I forgot to commit my FASTA parser code last time round. I've just committed it now, along with a new class called ThingParserFactory to make file reading/writing much easier.
See the updated docs here for a how-to: http://www.biojava.org/wiki/BioJava3:HowTo#FASTA cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Fri Dec 19 10:25:55 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 19 Dec 2008 10:25:55 +0000 Subject: [Biojava-dev] Annotations and Hibernate IDs Message-ID: <494B76B3.7040408@eaglegenomics.com> Hi all, I've just made a commit to the trunk of BioJavaX which resolves the following points: 1. Deprecated get/setProperty in RichAnnotation (hopefully no more confusion - people should use get/setNote[Set] instead). 2. Updated Rich* classes to explicitly specify RichAnnotation instead of Annotation (means getAnnotation returns RichAnnotation now, not plain old Annotation. This helps with point 1 above.). 3. Made all IDs on BioSQL-Rich* classes publicly get/settable. Use with caution! This allows you to identify individual database records from within your code. cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Fri Dec 19 12:01:14 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 19 Dec 2008 12:01:14 +0000 Subject: [Biojava-dev] Please help - BugZilla Message-ID: <494B8D0A.4090104@eaglegenomics.com> Hi all. I'd like to make a plea for help! There's about 16 reported bugs still open in BugZilla which have been there for quite some time. http://bugzilla.open-bio.org/buglist.cgi?product=BioJava&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED I would really appreciate it if a few people could "Adopt A Bug" and take a serious look at seeing if they can fix it. Just one bug per person would make all the difference. It would hugely help the project, and I would be eternally grateful. 
You can let me know if you've adopted a bug by assigning it to yourself in BugZilla. Currently only one of the 16 is actually assigned to anyone (thanks Andreas!), but I'm hoping that maybe someone out there will have a few moments to spare over the forthcoming holiday season and might fancy a challenge. Remember, a bug is for life, not just for Christmas! cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From bugzilla-daemon at portal.open-bio.org Fri Dec 19 18:05:23 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Dec 2008 13:05:23 -0500 Subject: [Biojava-dev] [Bug 2602] ParseException thrown when parsing Genbank file. In-Reply-To: Message-ID: <200812191805.mBJI5N01013362@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2602 me at hongyu.org changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|biojava-dev at biojava.org |me at hongyu.org -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Dec 19 21:02:52 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Dec 2008 16:02:52 -0500 Subject: [Biojava-dev] [Bug 2716] New: Retrieve Partial CDS/Gene Information Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2716 Summary: Retrieve Partial CDS/Gene Information Product: BioJava Version: live (CVS source) Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: DB / BioSQL AssignedTo: biojava-dev at biojava.org ReportedBy: bandarus at niaid.nih.gov CC: gopalanv at niaid.nih.gov Overview: Add an enhancement to biojavax to support retrieval of partial gene or partial CDS information for a partial gene/CDS feature of a database record. Actual Result: RichSequence holds this information initially, but it is lost when the sequence is retrieved (saved in a MySQL database using the BioSQL schema). Expected Result: To be able to retrieve the partial gene/CDS feature location. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From simpleyrx at 163.com Mon Dec 22 06:23:12 2008 From: simpleyrx at 163.com (simpleyrx) Date: Mon, 22 Dec 2008 14:23:12 +0800 (CST) Subject: [Biojava-dev] Could sb give me a copy of source code for profile profile alignment ? Message-ID: <1895183.477771229926992135.JavaMail.coremail@app157.163.com> Dear experts, Could somebody give me a copy of the source code for profile-profile alignment? I need it very much.
-- Renxiang Yan From holland at eaglegenomics.com Mon Dec 22 12:11:50 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 22 Dec 2008 12:11:50 +0000 Subject: [Biojava-dev] BlastXML parsers Message-ID: <494F8406.2070901@eaglegenomics.com> Mark Schreiber has kindly written some BlastXML parsers which can be found in the biojava-blastxml module in the BioJava3 repository. If you run your blasts with XML output, it will be able to fully parse every kind of blast output supported by NCBI blast. cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From armita_sh at yahoo.com Wed Dec 24 17:02:11 2008 From: armita_sh at yahoo.com (Armita Sheari) Date: Wed, 24 Dec 2008 09:02:11 -0800 (PST) Subject: [Biojava-dev] How can I calculate the RMSD of Global Alignment? Message-ID: <537200.67656.qm@web51412.mail.re2.yahoo.com> Hi everyone, I want to calculate the RMSD relevant to the Structural Global Alignment of two proteins. I have used the align method of the StructurePairAligner class with default parameters, and then I have calculated the RMSD using the getRmsd method of the AlternativeAlignment class. But it seems something is wrong. I think I should change some parameters of the align method which are defined in the StrucAligParameters class. Unfortunately, I couldn't find any documentation that describes the parameters (neither in the API nor in the source code). I would be thankful if you could take a look at my code and let me know your opinion about which parameter(s) I should change. StructurePairAligner structurePairAligner = new StructurePairAligner(); structurePairAligner.align(structure1, structure2); AlternativeAlignment[] alternativeAlignment = structurePairAligner.getAlignments(); ClusterAltAligs.cluster(alternativeAlignment); double minRmsd = 1000; double rmsd = 0; for(int i = 0; i < alternativeAlignment.length; i++) {
rmsd = alternativeAlignment[i].getRmsd(); if(rmsd < minRmsd) minRmsd = rmsd; rmsd = 0; } return minRmsd; Thanks, Armitash From armita_sh at yahoo.com Mon Dec 29 12:44:16 2008 From: armita_sh at yahoo.com (Armita Sheari) Date: Mon, 29 Dec 2008 04:44:16 -0800 (PST) Subject: [Biojava-dev] How can I calculate the RMSD of Global Alignment? In-Reply-To: <59a41c430812250420j35947f87i6589735ff4c4273c@mail.gmail.com> Message-ID: <577293.91936.qm@web51403.mail.re2.yahoo.com> Dear Andreas, Thanks for your answer. I ran the example you linked. As far as I can tell, all the RMSD values calculated in the example relate to local alignments, whereas I need the RMSD of a global alignment (without any gaps), calculated over all alpha carbons of the proteins. Which parameters should I change in the align method to obtain a global alignment with no gaps? Thanks again, ArmitaSh --- On Thu, 12/25/08, Andreas Prlic wrote: From: Andreas Prlic Subject: Re: [Biojava-dev] How can I calculate the RMSD of Global Alignment? To: armita_sh at yahoo.com Cc: "Biojava" , "biojava-dev" Date: Thursday, December 25, 2008, 7:20 PM Hi Armita, I agree the missing documentation for all the alignment parameters is a problem... It is not exactly clear from your mail what is the problem you encountered. The code that you sent in principle looks fine. I suspect you want to select the "best" of the alternative alignments? In that case you can simply take the first, since the alignments come out sorted. An example can be run from here: http://www.biojava.org/download/performance/biojava-structure-example1.jnlp The alternative alignments are sorted according to their number of structurally equivalent residues.
The rmsd in the different alternative solutions is always kept under a certain threshold (one of the parameters), since one of the strategies is to try to keep the rmsd constant, while maximizing the number of structurally equivalent residues. Similar results are clustered together in the same cluster. In the example shown above this allows to find the multiple matches between the four chains of hemoglobin and myoglobin. Andreas On Wed, Dec 24, 2008 at 6:02 PM, Armita Sheari wrote: > Hi everyone, > > I want to calculate the RMSD relevent to the Structural Global Alignment of two proteins. I have used the align method of the StructurePairAligner class with default parameters, and then I have calculated the RMSD using the getRmsd method of the AlternativeAlignment class. But it seems something is wrong. > I think I should change some parameters of align method which are defined in StrucAligParameters class. Unfortunately, I couldn't find any documentation that describes the parameters (not in the api nor in the source code). > > I would be thankful if you take a look at my code and let me know your opinion about which parameter(s) I should change. > > StructurePairAligner structurePairAligner = new StructurePairAligner(); > structurePairAligner.align(structure1, structure2); > AlternativeAlignment[] alternativeAlignment = structurePairAligner.getAlignments(); > ClusterAltAligs.cluster(alternativeAlignment); > > double minRmsd = 1000; > double rmsd = 0; > for(int i = 0; i < alternativeAlignment.length; i++) > { > rmsd = alternativeAlignment[i].getRmsd(); > if(rmsd < minRmsd) minRmsd = rmsd; > rmsd = 0; > } > return minRmsd; > > Thanks, > Armitash > > > > > > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev >
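For reference, the quantity asked about in this thread (an RMSD over all alpha carbons, paired one-to-one with no gaps) can be written down without any alignment parameters at all. What follows is only a minimal sketch in plain Java and does not use the BioJava API: the class name GaplessRmsd and the method rmsd are hypothetical, the coordinates are made-up test values, and it assumes both chains have the same number of C-alpha atoms, that atom i of one chain is matched with atom i of the other, and that the two coordinate sets have already been superimposed in a common frame. It does not compute the optimal superposition itself.

```java
public class GaplessRmsd {

    /**
     * Root-mean-square deviation over paired 3D coordinates.
     * a[i] is matched with b[i]; no gaps, no re-ordering, no superposition.
     * (Hypothetical helper for illustration, not part of BioJava.)
     */
    public static double rmsd(double[][] a, double[][] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException(
                "coordinate sets must be non-empty and of equal length");
        }
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            // squared Euclidean distance between the i-th pair of atoms
            for (int k = 0; k < 3; k++) {
                double d = a[i][k] - b[i][k];
                sumSq += d * d;
            }
        }
        return Math.sqrt(sumSq / a.length);
    }

    public static void main(String[] args) {
        // Made-up C-alpha coordinates: the second set is the first shifted by 1 along z.
        double[][] ca1 = { {0.0, 0.0, 0.0}, {3.8, 0.0, 0.0}, {7.6, 0.0, 0.0} };
        double[][] ca2 = { {0.0, 0.0, 1.0}, {3.8, 0.0, 1.0}, {7.6, 0.0, 1.0} };
        System.out.println(rmsd(ca1, ca2)); // prints 1.0
    }
}
```

Because every atom contributes to the sum, a value computed this way will generally be larger than the thresholded RMSD values reported for the alternative alignments described above, which only cover the structurally equivalent residues.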