From simon.rayner.cn at gmail.com Wed Apr 1 00:39:42 2009
From: simon.rayner.cn at gmail.com (simon rayner)
Date: Tue, 31 Mar 2009 23:39:42 -0500
Subject: [Biojava-l] demos in version biojava 1.6
Message-ID: <616a29410903312139j47d757fdq98e34e7a2282ade0@mail.gmail.com>

did i lose the demo files somewhere in version 1.6.1? (I found them okay in 1.6)

I downloaded the full version via
http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar
and unjarred it

xx at yyyyyy:~/downloads/biojava-1.6.1-all$ *ls -all*
total 817
drwxr-x--- 11 sr sr    672 2009-01-27 22:25 .
drwxr-x---  6 sr sr    616 2009-01-27 21:05 ..
drwxr-xr-x  4 sr sr    392 2009-01-27 22:26 ant-build
-rw-r--r--  1 sr sr  27035 2008-10-26 21:09 build.xml
-rw-r--r--  1 sr sr  93463 2008-10-26 21:13 bytecode.jar
-rw-r--r--  1 sr sr  30117 2008-10-26 21:13 commons-cli.jar
-rw-r--r--  1 sr sr 165119 2008-10-26 21:13 commons-collections-2.1.jar
-rw-r--r--  1 sr sr 100776 2008-10-26 21:13 commons-dbcp-1.1.jar
-rw-r--r--  1 sr sr  39523 2008-10-26 21:13 commons-pool-1.1.jar
drwxr-x---  6 sr sr    144 2008-10-26 21:13 doc
-rw-r--r--  1 sr sr 166303 2008-10-26 21:13 jgrapht-jdk1.5.jar
-rw-r--r--  1 sr sr 161477 2008-10-26 21:13 junit-4.4.jar
-rw-r--r--  1 sr sr  25091 2008-10-26 21:09 LICENSE
drwxr-x---  2 sr sr    136 2008-10-26 21:09 manifest
drwxr-x---  2 sr sr     80 2008-10-26 21:13 META-INF
-rw-r--r--  1 sr sr   3056 2008-10-26 21:09 README
-rw-r--r--  1 sr sr   2541 2008-10-26 21:09 README.biosql
drwxr-xr-x  3 sr sr     72 2009-01-27 22:25 reports
drwxr-x---  5 sr sr    120 2008-10-26 21:09 resources
drwxr-x---  2 sr sr    176 2008-10-26 21:13 selfSignedCertificate
drwxr-x---  3 sr sr     72 2008-10-26 21:07 src
drwxr-x---  4 sr sr     96 2008-10-26 21:09 tests

xx at yyyyyy:~/downloads/biojava-1.6.1-all$ *ant -version*
Apache Ant version 1.7.0 compiled on April 29 2008
xx at yyyyyy:~/downloads/biojava-1.6.1-all$
xx at yyyyyy:~/downloads/biojava-1.6.1-all$* ant compile-demos*
Buildfile: build.xml

init:
     [echo] Building biojava-live
     [echo] Java Home: /usr/lib/jvm/java-6-sun-1.6.0.12/jre
     [echo] JUnit present: true
     [echo] JUnit supported by Ant: true
     [echo] HSQLDB driver present: ${sqlDriver.hsqldb}
     [echo] XSLT support: true

prepare:

prepare-demos:
    [mkdir] Created dir: /home/sr/downloads/biojava-1.6.1-all/ant-build/classes/demos
    [mkdir] Created dir: /home/sr/downloads/biojava-1.6.1-all/ant-build/docs/demos

prepare-biojava:

compile-biojava:

package-biojava:

compile-demos:

BUILD FAILED
/home/--/downloads/biojava-1.6.1-all/build.xml:283: srcdir "/home/--/downloads/biojava-1.6.1-all/demos" does not exist!

Total time: 1 second
xx at yyyyyy:~/downloads/biojava-1.6.1-all$
xx at yyyyyy:~/downloads/biojava-1.6.1-all$ *find ./ -name TestEmbl**
./doc/demos/seq/TestEmbl2.html
./doc/demos/seq/class-use/TestEmbl2.html
./doc/demos/seq/class-use/TestEmbl.html
./doc/demos/seq/TestEmbl.html
xx at yyyyyy:~/downloads/biojava-1.6.1-all$

am i doing something stupid here?

From markjschreiber at gmail.com Thu Apr 2 20:07:58 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 3 Apr 2009 08:07:58 +0800
Subject: [Biojava-l] [Biojava-dev] How to convert a multiple alignment to a PSSM matrix ?
In-Reply-To: <93b45ca50904021707h786aeac9sd12f3cd592303981@mail.gmail.com>
References: <11965100.460701238686263309.JavaMail.coremail@bj163app72.163.com> <93b45ca50904021707h786aeac9sd12f3cd592303981@mail.gmail.com>
Message-ID: <93b45ca50904021707m1cfd068fv5766e4531bab7991@mail.gmail.com>

There is a class called a WeightMatrix. I think there is an example on the cookbook.

On 2 Apr 2009, 11:47 PM, "simpleyrx" wrote:

Hi, friends,
I have a question: How to convert a multiple alignment to a PSSM matrix ? Is there any code in Biojava implement the function ? Or is there other source code have the function ?
Student _______________________________________________ biojava-dev mailing list biojava-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-dev From andreas at sdsc.edu Tue Apr 7 01:50:50 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 6 Apr 2009 22:50:50 -0700 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49D9C3CC.7010000@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> Message-ID: <59a41c430904062250l25a1105dw505ce3edc2651184@mail.gmail.com> Hi Richard, Thanks for the nomination. In short I am intending to do the following things over the next couple of months: * Release biojava 1.7 - don't forget, code freeze will be on Wed. April 8th. Please commit your final changes for this release in the next couple of days, or let me know if you need more time asap. * Keep maintaining biojava nightly builds at http://www.spice-3d.org/cruise/ * Organize a biojava user meeting around BOSC / ISMB 2009 * After the biojava 1.7 release I want to have a discussion how to continue with the code base and what to change for the next major release. * I will actively seek and invite new contributors and package maintainers. * My main focus are the further development of the protein structure related modules. As such I will need YOUR help for maintaining blast, sequence and any of the other frequently used modules. Andreas On Mon, Apr 6, 2009 at 1:56 AM, Richard Holland wrote: > Hi all. > > There were no nominations for the BioJava leadership role by the end of > last week, so I would like to nominate Andreas Prlic to take over the > role as BioJava coordinator/project manager. Andreas has agreed to be > nominated. > > If there are no objections lodged on this list by next Monday (13th > April), I'll hand over to Andreas by the end of next week. 
> > thanks, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From holland at eaglegenomics.com Tue Apr 7 06:09:25 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 07 Apr 2009 11:09:25 +0100 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <59a41c430904062250l25a1105dw505ce3edc2651184@mail.gmail.com> References: <49D9C3CC.7010000@eaglegenomics.com> <59a41c430904062250l25a1105dw505ce3edc2651184@mail.gmail.com> Message-ID: <49DB2655.1080500@eaglegenomics.com> I'd be happy to maintain the parts you request. Andreas Prlic wrote: > Hi Richard, > > Thanks for the nomination. In short I am intending to do the following > things over the next couple of months: > > * Release biojava 1.7 - don't forget, code freeze will be on Wed. > April 8th. Please commit your final changes for this release in the > next couple of days, or let me know if you need more time asap. > > * Keep maintaining biojava nightly builds at http://www.spice-3d.org/cruise/ > > * Organize a biojava user meeting around BOSC / ISMB 2009 > > * After the biojava 1.7 release I want to have a discussion how to > continue with the code base and what to change for the next major > release. > > * I will actively seek and invite new contributors and package maintainers. > > * My main focus are the further development of the protein structure > related modules. As such I will need YOUR help for maintaining blast, > sequence and any of the other frequently used modules. > > Andreas > > > > > On Mon, Apr 6, 2009 at 1:56 AM, Richard Holland > wrote: >> Hi all. 
>> >> There were no nominations for the BioJava leadership role by the end of >> last week, so I would like to nominate Andreas Prlic to take over the >> role as BioJava coordinator/project manager. Andreas has agreed to be >> nominated. >> >> If there are no objections lodged on this list by next Monday (13th >> April), I'll hand over to Andreas by the end of next week. >> >> thanks, >> Richard >> >> -- >> Richard Holland, BSc MBCS >> Finance Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From tallpaulinjax at yahoo.com Wed Apr 8 15:52:57 2009 From: tallpaulinjax at yahoo.com (tallpaulinjax at yahoo.com) Date: Wed, 8 Apr 2009 12:52:57 -0700 (PDT) Subject: [Biojava-l] MMCIF parser? Message-ID: <54544.49687.qm@web30702.mail.mud.yahoo.com> Hi, The JAVADOCS located here with a build date of today indicate BioJava supports MMCIF file parsing: http://www.spice-3d.org/public-files/javadoc/biojava/overview-summary.html As does the Cookbook page here: http://biojava.org/wiki/BioJava:CookBook:PDB:mmcif Yet the (official ?) Javadocs here don't indicate support: http://www.biojava.org/docs/api16/index.html And when I searched in the source code I can not find any mention of the *mmcif*.java files, and the class files don't seem to exist either. To make sure I had the latest build, I re-downloaded the JAR file located here and checked it as well: http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar Is MMCIF support from an older version of BioJava that has been deprecated, or from a new version yet to be released? Thanks, 
Paul From andreas at sdsc.edu Wed Apr 8 16:10:01 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 8 Apr 2009 13:10:01 -0700 Subject: [Biojava-l] MMCIF parser? In-Reply-To: <54544.49687.qm@web30702.mail.mud.yahoo.com> References: <54544.49687.qm@web30702.mail.mud.yahoo.com> Message-ID: <59a41c430904081310y6127448ft9af65a0c79be0bac@mail.gmail.com> Hi Paul, The mmcif functionality is new and will be part of the 1.7 release that will go out next week. In the meanwhile you can use the nightly build .jars from http://www.spice-3d.org/cruise/ ... Andreas On Wed, Apr 8, 2009 at 12:52 PM, wrote: > > Hi, > > The JAVADOCS located here with a build date of today indicate BioJava supports MMCIF file parsing: > http://www.spice-3d.org/public-files/javadoc/biojava/overview-summary.html > As does the Cookbook page here: > http://biojava.org/wiki/BioJava:CookBook:PDB:mmcif > > Yet the (official ?) Javadocs here don't indicate support: > http://www.biojava.org/docs/api16/index.html > > And when I searched in the source code I can not find any mention of the *mmcif*.java files, and the class files don't seem to exist either. To make sure I had the latest build, I re-downloaded the JAR file located here and checked it as well: > http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar > > Is MMCIF support from an older version of BioJava that has been deprecated, or from a new version yet to be released? > > Thanks, > > Paul > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From tallpaulinjax at yahoo.com Wed Apr 8 16:38:19 2009 From: tallpaulinjax at yahoo.com (Paul B) Date: Wed, 8 Apr 2009 13:38:19 -0700 (PDT) Subject: [Biojava-l] MMCIF parser? Message-ID: <75942.7684.qm@web30707.mail.mud.yahoo.com> Thanks, Andreas! --- On Wed, 4/8/09, Andreas Prlic wrote: From: Andreas Prlic Subject: Re: [Biojava-l] MMCIF parser? 
To: tallpaulinjax at yahoo.com Cc: "biojava-l at biojava.org" Date: Wednesday, April 8, 2009, 4:10 PM Hi Paul, The mmcif functionality is new and will be part of the 1.7 release that will go out next week. In the meanwhile you can use the nightly build .jars from http://www.spice-3d.org/cruise/ ... Andreas On Wed, Apr 8, 2009 at 12:52 PM, wrote: > > Hi, > > The JAVADOCS located here with a build date of today indicate BioJava supports MMCIF file parsing: > http://www.spice-3d.org/public-files/javadoc/biojava/overview-summary.html > As does the Cookbook page here: > http://biojava.org/wiki/BioJava:CookBook:PDB:mmcif > > Yet the (official ?) Javadocs here don't indicate support: > http://www.biojava.org/docs/api16/index.html > > And when I searched in the source code I can not find any mention of the *mmcif*.java files, and the class files don't seem to exist either. To make sure I had the latest build, I re-downloaded the JAR file located here and checked it as well: > http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar > > Is MMCIF support from an older version of BioJava that has been deprecated, or from a new version yet to be released? > > Thanks, > > Paul > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From andreas at sdsc.edu Sat Apr 11 15:05:44 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Sat, 11 Apr 2009 12:05:44 -0700 Subject: [Biojava-l] BOSC abstract Message-ID: <59a41c430904111205p624821aarf6124d0db2bb7eb9@mail.gmail.com> Hi, Submision deadline for the BOSC abstract is on Monday. The current version is available at http://biojava.org/wiki/BOSC2009_Presentation#BioJava_2009:__an_Open-Source_Framework_for_Bioinformatics. 
If you are one of the co-authors, can you please make sure I got your affiliation right? Also if you have any additions or corrections to the abstract, please feel free to edit. If I missed anybody who should be co-author, please edit as well... Andreas From holland at eaglegenomics.com Sun Apr 12 06:49:54 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Sun, 12 Apr 2009 11:49:54 +0100 Subject: [Biojava-l] [Biojava-dev] BOSC abstract In-Reply-To: <59a41c430904111205p624821aarf6124d0db2bb7eb9@mail.gmail.com> References: <59a41c430904111205p624821aarf6124d0db2bb7eb9@mail.gmail.com> Message-ID: <49E1C752.6080702@eaglegenomics.com> looks good! Andreas Prlic wrote: > Hi, > > Submision deadline for the BOSC abstract is on Monday. The current > version is available at > http://biojava.org/wiki/BOSC2009_Presentation#BioJava_2009:__an_Open-Source_Framework_for_Bioinformatics. > If you are one of the co-authors, can you please make sure I got your > affiliation right? Also if you have any additions or corrections to > the abstract, please feel free to edit. If I missed anybody who should > be co-author, please edit as well... > > Andreas > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Sun Apr 12 22:47:26 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 12 Apr 2009 19:47:26 -0700 Subject: [Biojava-l] BioJava 1.7 released Message-ID: <59a41c430904121947v67c7a7f9v1a236d3ad695760f@mail.gmail.com> Biojava 1.7 has been released and is available from http://biojava.org/wiki/BioJava:Download BioJava is a mature open-source project that provides a framework for processing of biological data. 
BioJava contains powerful analysis and statistical routines, tools for parsing common file formats, and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. Besides numerous bug fixes and stability improvements, a lot of development has been going on in the protein structure modules. BioJava now provides a framework for parsing mmCif files. The parsing of PDB header information has been improved and a new tool to read the Chemical component dictionary is in place. Biojava 1.7 offers more functionality and stability over the previous official releases. We highly recommend you to upgrade as soon as possible. Thanks to all contributors for making this release possible. Happy Biojava-ing, Andreas From holland at eaglegenomics.com Tue Apr 14 04:33:05 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 14 Apr 2009 09:33:05 +0100 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49D9C3CC.7010000@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> Message-ID: <49E44A41.2010703@eaglegenomics.com> Hello again. Well, nobody objected, and several people supported the idea, so I would now like to formally hand over control of the BioJava project to Andreas Prlic with immediate effect. It's been good fun working with the project over the last 5 years, and although I'll no longer be in charge, I will still remain on the mailing lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. I'll also continue to attend BOSC, including this year in Stockholm, so I'm looking forward to meeting up with everyone there for a beer or two. Thanks for the help and support everyone's given, and I'm sure you'll join me in wishing Andreas the best of luck with the project. He'll be an excellent leader and with him in charge I believe the project will go from strength to strength. 
cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From willishf at ufl.edu Tue Apr 14 11:02:10 2009 From: willishf at ufl.edu (Scooter Willis) Date: Tue, 14 Apr 2009 11:02:10 -0400 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49E44A41.2010703@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> Message-ID: <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> Andreas Congrats on taking on the responsibility of steering BioJava in a positive direction. I needed the ability to generate phylogenetic trees from aligned sequence data and found that the work was started in a google summer project but looking at the code it wasn't finished and appeared to focus only on loading trees not creating them. I ended up taking the tree generation code out of jalview and removing as much jalview dependencies as possible and have it as a nice tight collection of classes. My assumption without any deep legal review is that because jalview is open source that the code can be used and contributed to another open source project like BioJava. I will also plan on contributing the code changes back to Jalview. One of the challenges I ran into with the JalView code is performance for building the tree when using 1800+ sequences(takes a very very long time) so I am doing some code optimization and finishing up testing on a fairly significant performance speedup doing Neighbor_Join with a slightly different approach that makes it N2 instead of N3. I have a couple things to fix in tree joinging code and then will compare results for the quality of the tree compared to the original distance matrix. I should know more this week. 
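For readers who have not met the algorithm discussed above, the following is a minimal sketch of textbook neighbor joining (Saitou and Nei) in plain Java. It is the straightforward cubic-time version, so it is neither the Jalview code nor the faster variant mentioned above. The class name NeighborJoinSketch and the method build are made up for illustration and are not part of BioJava or Jalview. The sketch assumes a symmetric distance matrix with a zero diagonal, and it roots the resulting Newick string arbitrarily at the midpoint of the last remaining edge.

```java
import java.util.Locale;

/** Textbook neighbor joining; hypothetical helper, not BioJava or Jalview API. */
class NeighborJoinSketch {

    /** Builds a Newick string from taxon names and a symmetric distance matrix. */
    static String build(String[] names, double[][] dist) {
        int m = names.length;                        // number of active nodes
        if (m < 2 || dist.length != m) {
            throw new IllegalArgumentException("need >= 2 taxa and a square matrix");
        }
        String[] labels = names.clone();             // Newick fragment per active node
        double[][] d = new double[m][];
        for (int i = 0; i < m; i++) d[i] = dist[i].clone();

        while (m > 2) {
            // Row sums over the active nodes (diagonal assumed to be zero).
            double[] r = new double[m];
            for (int i = 0; i < m; i++)
                for (int k = 0; k < m; k++) r[i] += d[i][k];

            // Pick the pair minimising Q(i,j) = (m-2)*d(i,j) - r[i] - r[j].
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < m; i++)
                for (int j = i + 1; j < m; j++) {
                    double q = (m - 2) * d[i][j] - r[i] - r[j];
                    if (q < best) { best = q; bi = i; bj = j; }
                }

            // Branch lengths from the two joined nodes to their new parent.
            double dij = d[bi][bj];
            double li = 0.5 * dij + (r[bi] - r[bj]) / (2.0 * (m - 2));
            double lj = dij - li;

            // The parent takes over slot bi; compute its distance to every other node.
            for (int k = 0; k < m; k++) {
                if (k == bi || k == bj) continue;
                double duk = 0.5 * (d[bi][k] + d[bj][k] - dij);
                d[bi][k] = duk;
                d[k][bi] = duk;
            }
            labels[bi] = String.format(Locale.ROOT, "(%s:%.3f,%s:%.3f)",
                    labels[bi], li, labels[bj], lj);

            // Remove slot bj by moving the last active node into it.
            int last = m - 1;
            if (bj != last) {
                for (int k = 0; k < m; k++) {
                    d[bj][k] = d[last][k];
                    d[k][bj] = d[k][last];
                }
                d[bj][bj] = 0.0;
                labels[bj] = labels[last];
            }
            m--;
        }

        // Two nodes remain; split the final edge evenly (arbitrary rooting).
        double half = d[0][1] / 2.0;
        return String.format(Locale.ROOT, "(%s:%.3f,%s:%.3f);",
                labels[0], half, labels[1], half);
    }
}
```

Calling NeighborJoinSketch.build(names, matrix) with the taxon names and a pairwise distance matrix returns one Newick string. Each pass over the loop rescans every active pair, which is where the cubic cost for large inputs comes from.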
I think I remember a BioJava discussion about trying to seperate parts and pieces to that if you try and use a particular feature set of BioJava you are not forced into absorbing the entire BioJava collection of Jars. In my case I would want a biojava-phylogenetic.jar that has all things related to tree creation and/or tree viewing etc. If the common data format for handling sequences is RichSequence or Sequence then I would expect to have one other Jar requirement of biojava-core.jar. Not sure if any work has been done to refactor the BioJava code base into multiple jar files in the same way apache does its jars for great java code geared to a specific problem domain. Let me know what I can do to assist moving forward. Thanks Scooter Willis On Tue, Apr 14, 2009 at 4:33 AM, Richard Holland wrote: > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect. > > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength. 
> > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > 
From andreas at sdsc.edu Tue Apr 14 12:45:21 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 14 Apr 2009 09:45:21 -0700 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49E44A41.2010703@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> Message-ID: <59a41c430904140945t30911e33idc8ea095c52e59e3@mail.gmail.com> Hi Richard, Again, thanks for your BioJava contributions in the last years and great to have you still around. I am looking forward to the next year of BioJava development. Our google analytics stats reveal that we have an ever growing user base and it will be a challenge to continue developing BioJava further and add new and useful features. Of course this is a task that I can't do alone and I will need the help of everybody who wants to write documentation, submit bug fixes or wants to become maintainer of one of the modules. With BioJava 1.7 being out it is now a good time to start a discussion at how to improve the code base for the next version. We also have BOSC coming up in June and it will provide a good opportunity for people to meet in person. 
Hope to see you (Richard, and everybody else!) in Sweden, otherwise we will keep talking via the lists. Andreas On Tue, Apr 14, 2009 at 1:33 AM, Richard Holland wrote: > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect. > > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength. > > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From holland at eaglegenomics.com Tue Apr 14 11:07:15 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 14 Apr 2009 16:07:15 +0100 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> Message-ID: <49E4A6A3.6030005@eaglegenomics.com> The plan was to create separate new BJ3 jars for task-specific code, exactly as you suggest for phylogenetics. I'd support a biojava-phylo jar of some kind, and I agree it would probably depend on the BJ3 biojava-core module for sequence handling. 
The existing BJ code was not originally going to be refactored into separate jars, unless Andreas has other plans! Scooter Willis wrote: > Andreas > > Congrats on taking on the responsibility of steering BioJava in a > positive direction. > > I needed the ability to generate phylogenetic trees from aligned > sequence data and found that the work was started in a google summer > project but looking at the code it wasn't finished and appeared to focus > only on loading trees not creating them. I ended up taking the tree > generation code out of jalview and removing as much jalview dependencies > as possible and have it as a nice tight collection of classes. My > assumption without any deep legal review is that because jalview is open > source that the code can be used and contributed to another open source > project like BioJava. I will also plan on contributing the code changes > back to Jalview. > > One of the challenges I ran into with the JalView code is performance > for building the tree when using 1800+ sequences(takes a very very long > time) so I am doing some code optimization and finishing up testing on a > fairly significant performance speedup doing Neighbor_Join with a > slightly different approach that makes it N2 instead of N3. I have a > couple things to fix in tree joinging code and then will compare results > for the quality of the tree compared to the original distance matrix. I > should know more this week. > > I think I remember a BioJava discussion about trying to seperate parts > and pieces to that if you try and use a particular feature set of > BioJava you are not forced into absorbing the entire BioJava collection > of Jars. In my case I would want a biojava-phylogenetic.jar that has all > things related to tree creation and/or tree viewing etc. If the common > data format for handling sequences is RichSequence or Sequence then I > would expect to have one other Jar requirement of biojava-core.jar. 
Not > sure if any work has been done to refactor the BioJava code base into > multiple jar files in the same way apache does its jars for great java > code geared to a specific problem domain. > > Let me know what I can do to assist moving forward. > > Thanks > > Scooter Willis > > On Tue, Apr 14, 2009 at 4:33 AM, Richard Holland > > wrote: > > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect. > > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength. 
> > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Wed Apr 15 01:40:24 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 14 Apr 2009 22:40:24 -0700 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> Message-ID: <59a41c430904142240u5a9812eejf11023f69baa27b7@mail.gmail.com> Hi Scooter, Thanks for volunteering. I like the idea of modularizing BioJava in the next version. The details of how to do this still need to be discussed on the dev mailing list. If you want to be involved into anything phylo related then this is great and any contribution will be welcome. About merging in 3rd party code that is under a different license: For this you need to get permission from the original copyright owners or rewrite the code .... Andreas On Tue, Apr 14, 2009 at 8:02 AM, Scooter Willis wrote: > Andreas > > Congrats on taking on the responsibility of steering BioJava in a positive > direction. > > I needed the ability to generate phylogenetic trees from aligned sequence > data and found that the work was started in a google summer project but > looking at the code it wasn't finished and appeared to focus only on loading > trees not creating them. 
I ended up taking the tree generation code out of > jalview and removing as much jalview dependencies as possible and have it as > a nice tight collection of classes. My assumption without any deep legal > review is that because jalview is open source that the code can be used and > contributed to another open source project like BioJava. I will also plan on > contributing the code changes back to Jalview. > > One of the challenges I ran into with the JalView code is performance for > building the tree when using 1800+ sequences(takes a very very long time) so > I am doing some code optimization and finishing up testing on a fairly > significant performance speedup doing Neighbor_Join with a slightly > different approach that makes it N2 instead of N3. I have a couple things to > fix in tree joinging code and then will compare results for the quality of > the tree compared to the original distance matrix. I should know more this > week. > > I think I remember a BioJava discussion about trying to seperate parts and > pieces to that if you try and use a particular feature set of BioJava you > are not forced into absorbing the entire BioJava collection of Jars. In my > case I would want a biojava-phylogenetic.jar that has all things related to > tree creation and/or tree viewing etc. If the common data format for > handling sequences is RichSequence or Sequence then I would expect to have > one other Jar requirement of biojava-core.jar. Not sure if any work has been > done to refactor the BioJava code base into multiple jar files in the same > way apache does its jars for great java code geared to a specific problem > domain. > > Let me know what I can do to assist moving forward. > > Thanks > > Scooter Willis > > On Tue, Apr 14, 2009 at 4:33 AM, Richard Holland > wrote: > >> Hello again. 
>> >> Well, nobody objected, and several people supported the idea, so I would >> now like to formally hand over control of the BioJava project to Andreas >> Prlic with immediate effect. >> >> It's been good fun working with the project over the last 5 years, and >> although I'll no longer be in charge, I will still remain on the mailing >> lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. >> >> I'll also continue to attend BOSC, including this year in Stockholm, so >> I'm looking forward to meeting up with everyone there for a beer or two. >> >> Thanks for the help and support everyone's given, and I'm sure you'll >> join me in wishing Andreas the best of luck with the project. He'll be >> an excellent leader and with him in charge I believe the project will go >> from strength to strength. >> >> cheers, >> Richard >> >> -- >> Richard Holland, BSc MBCS >> Finance Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> _______________________________________________ >> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From jolyon.holdstock at ogt.co.uk Fri Apr 17 06:27:16 2009 From: jolyon.holdstock at ogt.co.uk (Jolyon Holdstock) Date: Fri, 17 Apr 2009 11:27:16 +0100 Subject: [Biojava-l] User interface example for Cookbook Message-ID: <588D0DD225D05746B5D8CAE1BE971F3F026D7EAC@EUCLID.internal.ogtip.com> Hi, I've been meaning to generate an updated example of code for displaying a sequence (with some additional functionality) for the cookbook and finally got off my backside to do it. Code is below; I hope it's of use - feel free to point out errors, improvements etc... 
Cheers,

Jolyon

//Code starts ------------------------------------------------------------------------

//Java libraries
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.util.*;
import javax.swing.*;

//BioJava libraries
import org.biojava.bio.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.gui.sequence.*;

//BioJava extension libraries
import org.biojavax.*;
import org.biojavax.ontology.*;
import org.biojavax.bio.seq.*;

public class DisplaySequenceFile extends JFrame implements SequenceViewerMotionListener {

    private TranslatedSequencePanel tsp = new TranslatedSequencePanel();
    private MultiLineRenderer mlr = new MultiLineRenderer();
    private RulerRenderer rr = new RulerRenderer();
    private SequenceRenderer seqR = new SymbolSequenceRenderer();
    private FeatureBlockSequenceRenderer fbsr;
    private RichSequence richSeq;
    private Container con;
    private JPanel controlPanel;
    private JButton mvLeft, mvRight, zoomIn, zoomOut;
    private double sequenceScale = 0.05;
    private int windowWidth = 1200;
    private int windowHeight = 200;

    public DisplaySequenceFile(String fileName){
        //Load the sequence file
        try {
            richSeq = RichSequence.IOTools.readEMBLDNA(
                new BufferedReader(new FileReader(new File(fileName))), null).nextRichSequence();
        } catch (BioException bioe){
            System.err.println("Not an EMBL sequence: " + bioe);
        } catch(FileNotFoundException fnfe){
            System.err.println("FileNotFoundException: " + fnfe);
        } catch (IOException ioe){
            System.err.println("IOException: " + ioe);
        }

        //Define the appearance of the rendered Features
        BasicFeatureRenderer bfr = new BasicFeatureRenderer();
        GradientPaint gradient = new GradientPaint(0, 10, Color.RED, 0, 0, Color.white, true);
        bfr.setFill(gradient);
        bfr.setOutline(Color.RED);

        //Form a bridge between Sequence rendering and Feature rendering
        fbsr = new FeatureBlockSequenceRenderer(bfr);
        fbsr.setCollapsing(false);

        //Filter for CDS features on the forward strand
        SequenceRenderer fwd_sr = new FilteringRenderer(fbsr,
            new FeatureFilter.And(new FeatureFilter.ByType("CDS"),
                new FeatureFilter.StrandFilter(StrandedFeature.POSITIVE)), true);

        //Filter for CDS features on the reverse strand
        SequenceRenderer rev_sr = new FilteringRenderer(fbsr,
            new FeatureFilter.And(new FeatureFilter.ByType("CDS"),
                new FeatureFilter.StrandFilter(StrandedFeature.NEGATIVE)), true);

        //Add the renderers to the MultiLineRenderer
        mlr.addRenderer(fwd_sr);
        mlr.addRenderer(rr);
        mlr.addRenderer(rev_sr);
        mlr.addRenderer(seqR);

        //Set the sequence renderer for the TranslatedSequencePanel
        tsp.setRenderer(mlr);
        //Set the sequence to render
        tsp.setSequence(richSeq);
        //Set the position of the displayed sequence
        tsp.setSymbolTranslation(1);
        //Set the scale as pixels per Symbol.
        tsp.setScale(sequenceScale);
        //Add a sequence viewer motion listener to the TranslatedSequencePanel
        tsp.addSequenceViewerMotionListener(this);

        //Generate the control panel
        controlPanel = new JPanel();
        controlPanel.setBackground(Color.lightGray);

        //Move along the sequence towards 5' end
        mvLeft = new JButton("<<");
        mvLeft.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                int rightSide = tsp.getRange().getMax();
                int leftSide = tsp.getRange().getMin();
                int newStartPoint = leftSide - (rightSide - leftSide);
                if (newStartPoint < 1){
                    newStartPoint = 1;
                }
                tsp.setSymbolTranslation(newStartPoint);
            }
        });

        //Move along the sequence towards 3' end
        mvRight = new JButton(">>");
        mvRight.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                int rightSide = tsp.getRange().getMax();
                int leftSide = tsp.getRange().getMin();
                int screenWidth = rightSide - leftSide;
                if ((rightSide + screenWidth) >= richSeq.length()){
                    tsp.setSymbolTranslation(richSeq.length() - screenWidth);
                } else {
                    tsp.setSymbolTranslation(rightSide);
                }
            }
        });

        //Increase sequence scale
        zoomIn = new JButton("+");
        zoomIn.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                sequenceScale = sequenceScale * 2;
                //At a sequence scale of 12 the bases are rendered, so there is
                //no need to zoom in further; disable the button.
                if (sequenceScale > 12){
                    sequenceScale = 12;
                    zoomIn.setEnabled(false);
                }
                tsp.setScale(sequenceScale);
            }
        });

        //Reduce sequence scale
        zoomOut = new JButton("-");
        zoomOut.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                sequenceScale = sequenceScale / 2;
                //If the sequence scale is below 12 then enable the zoomIn button
                if (sequenceScale < 12){
                    zoomIn.setEnabled(true);
                }
                //If the scale allows more than the sequence to be displayed
                //display the whole sequence
                if (sequenceScale < ((double)tsp.getWidth()/(double)richSeq.length())){
                    sequenceScale = (double)tsp.getWidth()/(double)richSeq.length();
                    tsp.setSymbolTranslation(1);
                }
                tsp.setScale(sequenceScale);
                //If the new scale coupled with the current SymbolTranslation means the
                //displayed sequence can't fill the TranslatedSequencePanel then reset
                //the SymbolTranslation
                if(tsp.getRange().getMax() >= richSeq.length()){
                    int tmp = (int)((double)tsp.getWidth()/sequenceScale);
                    tsp.setSymbolTranslation(richSeq.length() - tmp);
                }
            }
        });

        controlPanel.add(mvLeft);
        controlPanel.add(mvRight);
        controlPanel.add(zoomIn);
        controlPanel.add(zoomOut);

        con = getContentPane();
        con.setLayout(new BorderLayout());
        con.add(controlPanel, BorderLayout.NORTH);
        con.add(tsp, BorderLayout.CENTER);
        setLocation(50,50);
        setSize(windowWidth,windowHeight);
        setVisible(true);
        setResizable(false);
    }

    /**
     * Detect mouse dragged events
     * @param sve
     */
    public void mouseDragged(SequenceViewerEvent sve) {
    }

    /**
     * Detect mouse movement events
     * If the mouse moves over a CDS feature create a tooltip text stating
     * the name of the gene associated with the CDS feature.
     * @param sve
     */
    public void mouseMoved(SequenceViewerEvent sve) {
        //Manage the tooltip
        ToolTipManager ttm = ToolTipManager.sharedInstance();
        ttm.setDismissDelay(2000);
        //If the mouse has moved over a SimpleFeatureHolder
        if (sve.getTarget() instanceof SimpleFeatureHolder){
            ComparableTerm gene = RichObjectFactory.getDefaultOntology().getOrCreateTerm("gene");
            SimpleFeatureHolder sfh = (SimpleFeatureHolder)sve.getTarget();
            FeatureHolder fh = sfh.filter(new FeatureFilter.ByType("CDS"));
            Iterator i = fh.features();
            while(i.hasNext()){
                //Explicit casts, because the iterators are declared without type parameters
                RichFeature rf = (RichFeature) i.next();
                RichAnnotation anno = (RichAnnotation) rf.getAnnotation();
                Set annotationNotes = anno.getNoteSet();
                for (Iterator it = annotationNotes.iterator(); it.hasNext();) {
                    Note note = (Note) it.next();
                    if (note.getTerm().equals(gene)) {
                        tsp.setToolTipText("Gene: " + note.getValue());
                    }
                }
            }
        } else {
            //Remove the tooltip
            ttm.setDismissDelay(10);
        }
    }

    /**
     * Main method
     * @param args
     */
    public static void main(String args []){
        if (args.length == 1){
            new DisplaySequenceFile(args[0]);
        } else {
            System.out.println("Usage: java DisplaySequenceFile <EMBL file>");
            System.exit(1);
        }
    }
}

//Code ends ------------------------------------------------------------------------

Dr. Jolyon Holdstock
Senior Computational Biologist,
Oxford Gene Technology,
Begbroke Science Park, Sandy Lane, Yarnton, Oxford, OX5 1PF, UK.
T: +44 (0)1865 856 852
F: +44 (0)1865 842 116
E: jolyon.holdstock at ogt.co.uk
W: www.ogt.co.uk

Looking to outsource your microarray studies? Look no further.

Scientific pedigree delivering high quality microarray results to you:
* Service capacity >1000 samples per week
* Rigorous QC from sample to result
* Applications available include aCGH, CNV, methylation studies and miRNA

Oxford Gene Technology (Operations) Ltd.
Registered in England No: 03845432 Begbroke Science Park Sandy Lane Yarnton Oxford OX5 1PF Confidentiality Notice: The contents of this email from Oxford Gene Technology are confidential and intended solely for the person to whom it is addressed. It may contain privileged and confidential information. If you are not the intended recipient you must not read, copy, distribute, discuss or take any action in reliance on it. If you have received this email in error please advise the sender so that we can arrange for proper delivery. Then please delete the message from your inbox. Thank you. From hkayabilisim at gmail.com Mon Apr 20 10:14:03 2009 From: hkayabilisim at gmail.com (=?ISO-8859-1?Q?H=FCseyin_Kaya?=) Date: Mon, 20 Apr 2009 17:14:03 +0300 Subject: [Biojava-l] A problem in reading remote AB1 files Message-ID: Hi BioJava Community, I have a little problem in reading AB1 files residing on a remote webserver. I have two different AB1 files; ok.ab1 and failed.ab1. They were both generated from the same sequencer. 
Here is a quick summary:

Reading ok.ab1 from a local directory is OK
Reading failed.ab1 from a local directory is OK
Reading ok.ab1 from a remote webserver is OK
Reading failed.ab1 from a remote webserver is not OK

The exception is:

java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.biojava.utils.io.CachingInputStream.read(CachingInputStream.java:101)
    at java.io.DataInputStream.readFully(Unknown Source)
    at java.io.DataInputStream.readFully(Unknown Source)
    at org.biojava.bio.program.abi.ABIFParser$DataStream.readFully(ABIFParser.java:376)
    at org.biojava.bio.program.abi.ABIFParser.readDataRecords(ABIFParser.java:129)
    at org.biojava.bio.program.abi.ABIFParser.<init>(ABIFParser.java:100)
    at org.biojava.bio.program.abi.ABIFParser.<init>(ABIFParser.java:89)
    at org.biojava.bio.program.abi.ABIFChromatogram$Parser.<init>(ABIFChromatogram.java:117)
    at org.biojava.bio.program.abi.ABIFChromatogram.load(ABIFChromatogram.java:101)
    at org.biojava.bio.program.abi.ABIFChromatogram.create(ABIFChromatogram.java:89)
    at org.biojava.bio.chromatogram.ChromatogramFactory.create(ChromatogramFactory.java:119)
    at BugTest.readChromatogram(BugTest.java:34)
    at BugTest.main(BugTest.java:22)

I will be glad if you help me in resolving this problem.
Sincerely
Huseyin Kaya

BugTest.java

import java.net.URL;
import org.biojava.bio.chromatogram.AbstractChromatogram;
import org.biojava.bio.chromatogram.ChromatogramFactory;

public class BugTest {
    public static final String URL_REMOTE_FAILED = "http://dna.iontek.com.tr/files/failed.ab1";
    public static final String URL_REMOTE_OK = "http://dna.iontek.com.tr/files/ok.ab1";
    public static final String URL_LOCAL_FAILED = "file:///C:/failed.ab1";
    public static final String URL_LOCAL_OK = "file:///C:/ok.ab1";

    public static void main(String[] args) throws Exception {
        readChromatogram("Reading ok.ab1 from local directory", URL_LOCAL_OK);
        readChromatogram("Reading failed.ab1 from local directory", URL_LOCAL_FAILED);
        readChromatogram("Reading ok.ab1 from webserver ", URL_REMOTE_OK);
        // This is the call that fails
        readChromatogram("Reading failed.ab1 from webserver ", URL_REMOTE_FAILED);
    }

    private static void readChromatogram(String message, String urlstr) {
        System.out.println(message + "[" + urlstr + "]");
        AbstractChromatogram ch = null;
        URL url;
        try {
            url = new URL(urlstr);
            ch = (AbstractChromatogram) ChromatogramFactory.create(url.openStream());
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (ch != null)
            System.out.println("Trace length is " + ch.getTraceLength());
    }
}

From jp at javaclass.co.uk Mon Apr 20 10:56:00 2009
From: jp at javaclass.co.uk (JP)
Date: Mon, 20 Apr 2009 15:56:00 +0100
Subject: [Biojava-l] BioJava Question - Evolutionary Rate Calculation
Message-ID: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com>

Hi there at Biojava,

I have a number of orthologue protein (sequences) from different species - I would like to calculate the distance (score) between each of these (programmatically) based on the sequence.

Can anyone suggest a way to do this (I'd rather use an existing bit of software than having to reinvent the wheel)? Surely this software must exist... (hopefully in BioJava).
Many Thanks
Jean-Paul Ebejer, Malta

From cif077 at gmail.com Mon Apr 20 19:11:48 2009
From: cif077 at gmail.com (Ian Yi-Feng Chang)
Date: Tue, 21 Apr 2009 07:11:48 +0800
Subject: [Biojava-l] Editing a RichSequence
In-Reply-To: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com>
References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com>
Message-ID: <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com>

Dear All,

I have a problem editing a RichSequence, and I get this exception:

Exception in thread "main" org.biojava.utils.ChangeVetoException: AbstractSymbolList is immutable
    at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113)
    at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31)
    at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163)
    at gizmo.tools.GBKCurator.main(GBKCurator.java:176)

I traced this problem in this mailing list and found the latest thread, dated Wed Feb 20 21:33:39 EST 2008.

However, I still have no idea how to

Here is the solution (from the JavaDoc)

SimpleRichSequenceBuilderFactory

public SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold)

Creates a new instance of SimpleRichSequenceBuilderFactory that uses a specified factory for SymbolLists longer than a specified length. Before that a SimpleSymbolListFactory is used.

Parameters:
fact - the factory to use when building the SymbolList.
threshold - the threshold to exceed before using this factory

However, could you please help to demonstrate how to use this solution to edit a RichSequence?

Thank you so much.
ian chang

From andreas at sdsc.edu Mon Apr 20 19:26:07 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 20 Apr 2009 16:26:07 -0700
Subject: [Biojava-l] BioJava Question - Evolutionary Rate Calculation
In-Reply-To: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com>
References: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com>
Message-ID: <59a41c430904201626y28a2f6d4m34cab6e9a34ad767@mail.gmail.com>

Hi Jean-Paul,

You can use BioJava to calculate pairwise alignments between your sequences:

http://biojava.org/wiki/BioJava:CookBook:DP:PairWise2

Andreas

On Mon, Apr 20, 2009 at 7:56 AM, JP wrote:
> Hi there at Biojava,
>
> I have a number of orthologue protein (sequences) from different species - I
> would like to calculate the distance (score) between each of these
> (programatically) based on the sequence.
>
> Can anyone suggest a way to do this (I'd rather use an existing bit of
> software than having to reinvent the wheel) ? Surely this software must
> exist...(hopefully in biojava).
>
> Many Thanks
> Jean-Paul Ebejer, Malta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From cif077 at gmail.com Tue Apr 21 04:59:06 2009
From: cif077 at gmail.com (Ian Yi-Feng Chang)
Date: Tue, 21 Apr 2009 16:59:06 +0800
Subject: [Biojava-l] Cannot edit RichSequence larger than 16Kbp
Message-ID: <720d02c10904210159n57c44e02n2f63856cd706c7e5@mail.gmail.com>

Dear All,

I am trying to edit the sequence in a GenBank flat file while preserving all of its Annotations and Features, so I use the following code to edit the RichSequence. However, the code only works for sequences shorter than 16 Kbp. Is anything wrong in my code?

Thanks for your help.
import java.io.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
import org.biojavax.*;
import org.biojavax.bio.seq.*;
import org.biojavax.bio.seq.RichSequence.*;
import org.biojavax.bio.seq.io.*;

public class RichSequenceTest {
    public static void main(String[] args) throws Exception {
        SimpleRichSequenceBuilderFactory srsbf =
            new SimpleRichSequenceBuilderFactory(new SimpleSymbolListFactory(), 100000);
        RichSequence seq = IOTools.readGenbank(
            new BufferedReader(new FileReader("/data/gbk/NC_011995q.gbk")),
            IOTools.getDNAParser(),
            srsbf,
            RichObjectFactory.getDefaultNamespace()
        ).nextRichSequence();
        Edit ed = new Edit(3, 2, DNATools.createDNA("aatagaa"));
        seq.edit(ed);
        System.out.println(seq.seqString());
    }
}

Exception in thread "main" org.biojava.utils.ChangeVetoException: AbstractSymbolList is immutable
    at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113)
    at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31)
    at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163)
    at gizmo.test.RichSequenceTest.main(RichSequenceTest.java:20

From andreas at sdsc.edu Tue Apr 21 13:18:38 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 21 Apr 2009 10:18:38 -0700
Subject: [Biojava-l] BioJava Question - Evolutionary Rate Calculation
In-Reply-To: <4adc29060904210319g7a391c21n867e893ae0598800@mail.gmail.com>
References: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com> <59a41c430904201626y28a2f6d4m34cab6e9a34ad767@mail.gmail.com> <4adc29060904210319g7a391c21n867e893ae0598800@mail.gmail.com>
Message-ID: <59a41c430904211018k4fb8fd64vdf0796274389996@mail.gmail.com>

A simple measure of distance is, for example, the number of amino acid differences between two sequences divided by the number of aligned amino acids. You can use the pairwise alignments to derive that number. There are also plenty of other ways to calculate evolutionary distances, and many papers have been written on this topic...
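The simple measure described above (differences divided by aligned positions) can be sketched in plain Java, without any BioJava classes. This is only an illustration under stated assumptions: the class name PDistance and the two example sequences are made up, the inputs are assumed to be two rows of an existing pairwise alignment with '-' as the gap character, and columns containing a gap are skipped rather than counted as differences (other conventions exist).

```java
public class PDistance {

    /**
     * Fraction of aligned (gap-free) columns at which the two sequences differ.
     * Both strings must be rows of the same alignment, so they have equal length.
     * Assumption of this sketch: a column with '-' in either row is ignored.
     */
    public static double pDistance(String alignedA, String alignedB) {
        if (alignedA.length() != alignedB.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length");
        }
        int aligned = 0;      // columns where both sequences have a residue
        int differences = 0;  // aligned columns where the residues differ
        for (int i = 0; i < alignedA.length(); i++) {
            char a = Character.toUpperCase(alignedA.charAt(i));
            char b = Character.toUpperCase(alignedB.charAt(i));
            if (a == '-' || b == '-') {
                continue; // skip gap columns
            }
            aligned++;
            if (a != b) {
                differences++;
            }
        }
        if (aligned == 0) {
            throw new IllegalArgumentException("No aligned positions to compare");
        }
        return (double) differences / aligned;
    }

    public static void main(String[] args) {
        // Hypothetical aligned fragments: 8 gap-free columns, 2 mismatches
        System.out.println(pDistance("MKT-AYIAK", "MKTWAYLAR")); // prints 0.25
    }
}
```

Applying this to every pair of orthologues gives a distance matrix. The raw proportion underestimates the distance for divergent sequences, which is one reason the corrected measures mentioned above exist.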
Andreas

On Tue, Apr 21, 2009 at 3:19 AM, JP wrote:
> Thanks Andreas - Would you say that evolutionary distance is the same as
> pairwise alignment ?
>
> Many thanks
> JP
>
> On Tue, Apr 21, 2009 at 12:26 AM, Andreas Prlic wrote:
>>
>> Hi Jean-Paul,
>>
>> You can use BioJava to calculate pairwise alignments between your
>> sequences:
>>
>> http://biojava.org/wiki/BioJava:CookBook:DP:PairWise2
>>
>> Andreas
>>
>>
>> On Mon, Apr 20, 2009 at 7:56 AM, JP wrote:
>> > Hi there at Biojava,
>> >
>> > I have a number of orthologue protein (sequences) from different species
>> > - I
>> > would like to calculate the distance (score) between each of these
>> > (programatically) based on the sequence.
>> >
>> > Can anyone suggest a way to do this (I'd rather use an existing bit of
>> > software than having to reinvent the wheel) ? Surely this software must
>> > exist...(hopefully in biojava).
>> >
>> > Many Thanks
>> > Jean-Paul Ebejer, Malta
>> > _______________________________________________
>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>
>

From holland at eaglegenomics.com Wed Apr 22 11:45:15 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 22 Apr 2009 16:45:15 +0100
Subject: [Biojava-l] Editing a RichSequence
In-Reply-To: <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com>
References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com>
Message-ID: <49EF3B8B.5090509@eaglegenomics.com>

The problem lies in SimpleRichSequenceBuilder:

    public void addSymbols(Alphabet alpha, Symbol[] syms, int start, int length)
            throws IllegalAlphabetException {
        if (this.symbols==null) {
            if (threshold<=0) {
                this.symbols = new ChunkedSymbolListFactory(this.factory);
            } else {
                this.symbols = new ChunkedSymbolListFactory(this.factory,threshold);
            }
        }
        this.symbols.addSymbols(alpha, syms, start, length);
    }

The references to ChunkedSymbolListFactory are causing the problem. ChunkedSymbolListFactory is supposed to perform the threshold checking/factory selection. However it is also applying a further layer of abstraction which forces all symbol lists for sequences over 16k (1<<14) long to be ChunkedSymbolLists, regardless of the factory specified - the factory only specifies what the constituent sequences are within the ChunkedSymbolList. ChunkedSymbolList is immutable so will not allow edits even if its constituents are mutable. However if your sequence is less than 16k long, it behaves properly and you will get the type of sequence you asked for (SimpleSymbolList below the threshold, whatever you specify above it - SimpleSymbolList also happens to be the only SymbolList implementation in BioJava that is actually mutable at present.)

As the older thread describes, ChunkedSymbolList and its Factory are very embedded into the core of BioJava and are hard to change - it could break all kinds of things. Therefore the only real solution for now is to temporarily modify your local copy so that inside ChunkedSymbolList, you change the CHUNK_SIZE to something much larger than 1<<14.

thanks,
Richard

Ian Yi-Feng Chang wrote:
> Dear All,
> I've a problem while editing a richsequence.
> and got this exception: > Exception in thread "main" org.biojava.utils.ChangeVetoException: > AbstractSymbolList is immutable > at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113) > > at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31) > at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163) > at gizmo.tools.GBKCurator.main(GBKCurator.java:176) > > I trace this problem in this mailing list and find a latest thread > in** *Wed Feb 20 21:33:39 EST 2008* > > However, I still have no idea how to > > Here is the solution (from the JavaDoc) > > > SimpleRichSequenceBuilderFactory public > SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold) > Creates a new instance of SimpleRichSequenceBuilderFactory that uses > a specified factory for SymbolLists longer than a specified length. > Before that a SimpleSymbolListFacotry is used. > > Parameters: > fact - the factory to use when building the > SymbolList.threshold - the threshold to exceed before using this factory > > However, could you please help to demonstrate how to use this solution > to edit a richsequence? > > Thank you so much. 
> > ian chang > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Wed Apr 22 11:50:17 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Apr 2009 16:50:17 +0100 Subject: [Biojava-l] Editing a RichSequence In-Reply-To: <49EF3B8B.5090509@eaglegenomics.com> References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com> <49EF3B8B.5090509@eaglegenomics.com> Message-ID: <49EF3CB9.3070909@eaglegenomics.com> I forgot to mention - ChunkedSymbolListFactory is currently the only SymbolListFactory implementation in BioJava which can accept 'streamed' data rather than taking the whole sequence at once. So, the other alternative to changing CHUNK_SIZE is to create a new SymbolListFactory implementation which can accept 'streamed' data and use it to replace the reference to ChunkedSymbolListFactory in SimpleRichSequenceBuilder. Richard. Richard Holland wrote: > The problem lies in SimpleRichSequenceBuilder: > > public void addSymbols(Alphabet alpha, Symbol[] syms, int start, int > length) throws IllegalAlphabetException { > if (this.symbols==null) { > if (threshold<=0) { > this.symbols = new ChunkedSymbolListFactory(this.factory); > } else { > this.symbols = new > ChunkedSymbolListFactory(this.factory,threshold); > } > } > this.symbols.addSymbols(alpha, syms, start, length); > } > > The references to ChunkedSymbolListFactory are causing the problem. > ChunkedSymbolListFactory is supposed to perform the threshold > checking/factory selection. 
However it is also applying a further layer > of abstraction which forces all symbol lists for sequences over 16k > (1<<14) long to be ChunkedSymbolLists, regardless of the factory > specified - the factory only specifies what the constituent sequences > are within the ChunkedSymbolList. ChunkedSymbolList is immutable so will > not allow edits even if its constituents are mutable. However if your > sequence is less than 16k long, it behaves properly and you will get the > type of sequence you asked for (SimpleSymbolList below the threshold, > whatever you specify above it - SimpleSymbolList also happens to be the > only SymbolList implementation in BioJava that is actually mutable at > present.) > > As the older thread describes, ChunkedSymbolList and its Factory are > very embedded into the core of BioJava and are hard to change - it could > break all kinds of things. Therefore the only real solution for now is > to temporarily modify your local copy so that inside ChunkedSymbolList, > you change the CHUNK_SIZE to something much larger than 1<<14. > > thanks, > Richard > > Ian Yi-Feng Chang wrote: >> Dear All, >> I've a problem while editing a richsequence. 
>> and got this exception: >> Exception in thread "main" org.biojava.utils.ChangeVetoException: >> AbstractSymbolList is immutable >> at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113) >> >> at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31) >> at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163) >> at gizmo.tools.GBKCurator.main(GBKCurator.java:176) >> >> I trace this problem in this mailing list and find a latest thread >> in** *Wed Feb 20 21:33:39 EST 2008* >> >> However, I still have no idea how to >> >> Here is the solution (from the JavaDoc) >> >> >> SimpleRichSequenceBuilderFactory public >> SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold) >> Creates a new instance of SimpleRichSequenceBuilderFactory that uses >> a specified factory for SymbolLists longer than a specified length. >> Before that a SimpleSymbolListFacotry is used. >> >> Parameters: >> fact - the factory to use when building the >> SymbolList.threshold - the threshold to exceed before using this factory >> >> However, could you please help to demonstrate how to use this solution >> to edit a richsequence? >> >> Thank you so much. 
>> >> ian chang >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From cif077 at gmail.com Wed Apr 22 21:39:14 2009 From: cif077 at gmail.com (Ian Yi-Feng Chang) Date: Thu, 23 Apr 2009 09:39:14 +0800 Subject: [Biojava-l] Editing a RichSequence In-Reply-To: <49EF3CB9.3070909@eaglegenomics.com> References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com> <49EF3B8B.5090509@eaglegenomics.com> <49EF3CB9.3070909@eaglegenomics.com> Message-ID: <720d02c10904221839s74f8c403j67c8ad3c56bda963@mail.gmail.com> Thanks for your detail explanation. I got it now. On Wed, Apr 22, 2009 at 11:50 PM, Richard Holland wrote: > I forgot to mention - ChunkedSymbolListFactory is currently the only > SymbolListFactory implementation in BioJava which can accept 'streamed' > data rather than taking the whole sequence at once. So, the other > alternative to changing CHUNK_SIZE is to create a new SymbolListFactory > implementation which can accept 'streamed' data and use it to replace > the reference to ChunkedSymbolListFactory in SimpleRichSequenceBuilder. > > Richard. > > Richard Holland wrote: > > The problem lies in SimpleRichSequenceBuilder: > > > > public void addSymbols(Alphabet alpha, Symbol[] syms, int start, int > > length) throws IllegalAlphabetException { > > if (this.symbols==null) { > > if (threshold<=0) { > > this.symbols = new > ChunkedSymbolListFactory(this.factory); > > } else { > > this.symbols = new > > ChunkedSymbolListFactory(this.factory,threshold); > > } > > } > > this.symbols.addSymbols(alpha, syms, start, length); > > } > > > > The references to ChunkedSymbolListFactory are causing the problem. 
> > ChunkedSymbolListFactory is supposed to perform the threshold
> > checking/factory selection. However it is also applying a further layer
> > of abstraction which forces all symbol lists for sequences over 16k
> > (1<<14) long to be ChunkedSymbolLists, regardless of the factory
> > specified - the factory only specifies what the constituent sequences
> > are within the ChunkedSymbolList. ChunkedSymbolList is immutable so will
> > not allow edits even if its constituents are mutable. However if your
> > sequence is less than 16k long, it behaves properly and you will get the
> > type of sequence you asked for (SimpleSymbolList below the threshold,
> > whatever you specify above it - SimpleSymbolList also happens to be the
> > only SymbolList implementation in BioJava that is actually mutable at
> > present.)
> >
> > As the older thread describes, ChunkedSymbolList and its Factory are
> > very embedded into the core of BioJava and are hard to change - it could
> > break all kinds of things. Therefore the only real solution for now is
> > to temporarily modify your local copy so that inside ChunkedSymbolList,
> > you change the CHUNK_SIZE to something much larger than 1<<14.
> >
> > thanks,
> > Richard
> >
> > Ian Yi-Feng Chang wrote:
> >> Dear All,
> >> I've a problem while editing a richsequence.
> >> and got this exception:
> >> Exception in thread "main" org.biojava.utils.ChangeVetoException:
> >> AbstractSymbolList is immutable
> >>     at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113)
> >>     at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31)
> >>     at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163)
> >>     at gizmo.tools.GBKCurator.main(GBKCurator.java:176)
> >>
> >> I trace this problem in this mailing list and find a latest thread
> >> in Wed Feb 20 21:33:39 EST 2008
> >>
> >> However, I still have no idea how to
> >>
> >> Here is the solution (from the JavaDoc)
> >>
> >> SimpleRichSequenceBuilderFactory
> >>
> >> public SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold)
> >> Creates a new instance of SimpleRichSequenceBuilderFactory that uses
> >> a specified factory for SymbolLists longer than a specified length.
> >> Before that a SimpleSymbolListFactory is used.
> >>
> >> Parameters:
> >> fact - the factory to use when building the SymbolList.
> >> threshold - the threshold to exceed before using this factory
> >>
> >> However, could you please help to demonstrate how to use this solution
> >> to edit a richsequence?
> >>
> >> Thank you so much.
> >>
> >> ian chang
> >> _______________________________________________
> >> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
> >
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

From paolo.romano at istge.it Tue Apr 28 09:21:44 2009
From: paolo.romano at istge.it (Paolo Romano)
Date: Tue, 28 Apr 2009 15:21:44 +0200
Subject: [Biojava-l] NETTAB 2009: Deadline postponed to May 4, 2009, for Oral communications
Message-ID: <200904281335.n3SDZsrd018188@ibm43p.biotech.ist.unige.it>

Due to many requests for a new deadline for submission of contributions for oral communications, the related deadline has been postponed to:

Monday May 4, 2009, at 12.00 (noon), EST (GMT+1).

=====

Last Call for Oral communications

NETTAB 2009 Workshop on
"Technologies, Tools and Applications for Collaborative and Social Bioinformatics Research and Development"
with a Special Session on:
"Methods and Tools for RNA Structure and Functional Analysis"

June 10-13, 2009
Department of Computer Science, University of Catania, Italy
http://www.nettab.org/2009/

Deadline approaching:
May 4, 2009: Oral communication submission

Contributions must be short papers of around THREE A4 pages or 12.000 characters long.
Submit through the EasyChair system at: http://www.easychair.org/conferences/?conf=nettab2009 .
See web site for details.

Motivation

The most recent developments of collaborative development tools are impressive. Researchers can now collaboratively develop software (open source systems), discuss and compare development strategies (social networks), write documents (google docs, wiki systems), build knowledge bases. So, it may now be the time for presenting current technologies, tools and applications for collaborative work and for discussing perspectives of their utilization in support of Bioinformatics.
For these reasons, NETTAB 2009 will be devoted to "Technologies, Tools and Applications for Collaborative and Social Bioinformatics Research and Development".

The RNA community is also taking advantage of collaborative research tools such as wikis and other virtual environments. The RNA WikiProject now contains over 600 articles describing families of noncoding RNAs based on the Rfam database, and invites the community to update, edit, and correct those articles. Therefore, the NETTAB 2009 special session will focus on collaborative research projects, computational methods and tools for the analysis of RNA structures and functions, with a special emphasis on ncRNAs.

Invited Speakers (more to be announced)

# Alex Bateman
  Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
# Doron Betel
  MSKCC - Computational Biology Center
  New York, USA
# Tim Clark
  Director of Informatics, MassGeneral Institute for Neurodegenerative Disease
  Neurology Research Department, Massachusetts General Hospital, Boston, USA
# Duncan Hull
  School of Chemistry, University of Manchester, Manchester, UK
# Gabriel Valiente
  Technical University of Catalonia, Department of Software, Barcelona, Spain
# Debora Marks
  Systems Biology Department, Harvard Medical School, Boston, USA

Topics

- Collaborative Web sites (bioinformatics.org, biojava, bioperl, )
- Communities of Practices (CoPs)
    Scientific practices in scientific communities
    Automatic detection / gathering / modelling of scientific practices
    Implementations of CoPs
- Social networking (myExperiment, Annotea, myScience)
    Social Bookmarking
    Semantic Document Markup
    Relationships mining from literature
- Open Source development
    Sharing of data models, libraries, interfaces
- Social software for collaborative documentation development
    Wikis, blogs, google docs
    Knowledge Wikis
    Social-software-mediated collaborative scientific research
    Social-software-mediated collaborative tools' development
    Knowledge base collaborative development
    Ontologies collaborative development
- Education and training tools
    E-learning
    Virtual environments

Methods and Tools for RNA Structure and Functional Analysis

- RNA structure prediction
- Collaborative studies of RNAs
- ncRNAs functional analysis and classification
- miRNAs and networks
- Genome-wide functional studies
- Identification of ncRNAs
- Databases of ncRNAs and miRNA targets
- miRNA targets prediction
- Synthetic miRNA and siRNA design
- Gene expression analysis
- Analysis of viral RNAs
- RNAi therapeutics
- Identification of ncRNAs biomarkers
- RNA-protein interaction prediction

Deadlines

Contributions for both oral communications and posters must be short papers of around THREE A4 pages or 12.000 characters long. They must be submitted through the EasyChair system at: http://www.easychair.org/conferences/?conf=nettab2009 .

- May 4, 2009: Oral communication submission
- May 15, 2009: Posters submission
- May 17, 2009: Early registration
- June 10-13, 2009: Tutorials and Workshop

Calls for SPECIAL ISSUES

We plan to launch Calls for Special Issues on the themes of the workshop in peer-reviewed journals with an associated impact factor, around July, for submission in September 2009.

Best regards.

Paolo Romano
on behalf of NETTAB 2009 Chairs

NETTAB '09 - Ninth International Workshop on Network Tools and Applications in Biology
10-13 June 2009, Catania, Italy
http://www.nettab.org/2009/

Paolo Romano (paolo.romano at istge.it)
Bioinformatics
National Cancer Research Institute (IST)

From jp at javaclass.co.uk Tue Apr 28 10:01:04 2009
From: jp at javaclass.co.uk (JP)
Date: Tue, 28 Apr 2009 15:01:04 +0100
Subject: [Biojava-l] FASTA parsing bug ?
Message-ID: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>

Hi all at BioJava,

I am trying to parse several FASTA files using the following code:

    fr = new FileReader(fastaProteinFileName);
    br = new BufferedReader(fr);

    RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
    while (protIter.hasNext()) {
        BioEntry bioEntry = protIter.nextBioEntry();
        System.out.println(fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
    }

At particular points in my FASTA file I get the following exception:

    14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from biojava library)
    org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
        at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
        at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
        at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
        at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
        at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
    Caused by: java.io.IOException: Mark invalid
        at java.io.BufferedReader.reset(Unknown Source)
        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 5 more

Interestingly, if I delete the descriptive portion of the header line (from type=protein... till the end of the line ...Dgri;)

    >FBpp0145468 type=protein;
    loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
    ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
    dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
    MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
    species=Dgri;

it works - but I have a number of these exceptions (and I do not want to edit the original data). Mind you, I have longer headers in my file which are parsed OK (strange!).

Any ideas, anyone? Alternatively, is there a better way to get ONE SINGLE sequence from the whole FASTA file, given that I have the accession id (FBpp0145468)?

Many thanks,
JP

From jp at javaclass.co.uk Tue Apr 28 10:59:40 2009
From: jp at javaclass.co.uk (JP)
Date: Tue, 28 Apr 2009 15:59:40 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <49F71258.3060103@eaglegenomics.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com>
Message-ID: <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com>

Thanks Richard for your prompt reply.

I will not attach the FASTA file I am parsing (12MB); it is dgri-all-translation-r1.3.fasta from the FlyBase project.

If the file had any extra newlines I would see them when I loaded it in a text editor - no?
I implemented the whole thing without using BioJava (for this part):

    fr = new FileReader(fastaProteinFileName);
    br = new BufferedReader(fr);
    String fastaLine;
    String startAccession = '>' + accessionId.trim();
    String fastaEntry = "";
    boolean record = false;
    while ((fastaLine = br.readLine()) != null) {
        fastaLine = fastaLine.trim() + '\n';
        if (fastaLine.startsWith(startAccession)) {
            record = true;
        } else if (record && fastaLine.startsWith(">")) {
            record = false;
            break;
        }
        if (record) {
            fastaEntry += fastaLine;
        }
    }

Note that I do not use a regex, since I would need to read the whole file and then run the regex over it; this way, if the record is the first one, I read only that one.

Cheers
JP

From holland at eaglegenomics.com Tue Apr 28 11:21:25 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 28 Apr 2009 16:21:25 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com>
Message-ID: <49F71EF5.90702@eaglegenomics.com>

You're right, doesn't look like newlines.

The "Mark invalid" happens when the parser looks too far ahead in the file attempting to seek out the next valid sequence to parse. I'm not sure why this is happening.
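For anyone following along: "Mark invalid" is the message java.io.BufferedReader.reset() gives when a mark set with mark(readAheadLimit) has been dropped. Below is a minimal, BioJava-free sketch of that behaviour. It is not the FastaFormat code; the class name MarkLimitDemo, the helper resetSurvives, the 40-character dummy header and the buffer sizes are all assumptions picked for illustration. The Javadoc only promises that reset "may fail" once the limit is passed. In the OpenJDK implementation, as far as I understand it, the mark is only discarded when the reader has to refill its internal buffer after the limit has been exceeded, which may be why some long header lines trip the parser and others do not.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of BioJava.
class MarkLimitDemo {

    /**
     * Marks the start of a 40-character line with the given read-ahead
     * limit, reads the whole line, then tries to reset to the mark.
     * Returns true if reset() succeeded, false if the mark was invalid.
     */
    static boolean resetSurvives(int bufferSize, int readAheadLimit) {
        // Build a dummy 40-character "header" line plus a short sequence line.
        char[] pad = new char[39];
        Arrays.fill(pad, 'x');
        String text = ">" + new String(pad) + "\nACGT\n";

        try (BufferedReader reader =
                 new BufferedReader(new StringReader(text), bufferSize)) {
            reader.mark(readAheadLimit);
            reader.readLine();            // reads far more than readAheadLimit
            try {
                reader.reset();           // may throw IOException("Mark invalid")
                return true;
            } catch (IOException markInvalid) {
                return false;
            }
        } catch (IOException e) {
            // Not expected with an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Small buffer: the line spans several refills, so the mark is dropped.
        System.out.println("16-char buffer:   " + resetSurvives(16, 5));
        // Large buffer: the whole line fits in one fill, so the mark is kept.
        System.out.println("8192-char buffer: " + resetSurvives(8192, 5));
    }
}
```

On the OpenJDK builds I am aware of, the first call reports false and the second true, even though both read the same 40 characters past a limit of 5. If that holds for the parser too, the failure would depend on where a long header happens to fall relative to the reader's buffer boundary, not only on the header's length.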
I don't have the time to test right now, but if you could post a link to where someone could download the same FASTA file you're using, it would make it possible for someone else to investigate in more detail.

thanks,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Tue Apr 28 10:27:36 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 28 Apr 2009 15:27:36 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
Message-ID: <49F71258.3060103@eaglegenomics.com>

The "Mark invalid" exception indicates that the parser has gone too far ahead in the file looking for a valid header.
I'm not sure why, but looking at your original query, there may be extra newlines embedded in your FASTA header line? That would definitely confuse it.

The parser is not currently able to pull out just one sequence - in effect this is a search facility, which it doesn't have. :(

thanks,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jogoodma at indiana.edu Tue Apr 28 23:08:43 2009
From: jogoodma at indiana.edu (Josh Goodman)
Date: Tue, 28 Apr 2009 23:08:43 -0400 (EDT)
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <49F71EF5.90702@eaglegenomics.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com>
Message-ID:

Hi Richard and JP,

I think I can be of some help, as I'm the FlyBase developer responsible for generating these troublesome FASTA files :-). The cause of this problem appears to be the description line length for the record FBpp0145470.

The trouble lies in org.biojavax.bio.seq.io.FastaFormat, in the while loop at line 196.
BioJava correctly reads in FBpp0145468 but throws an error when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469, but when BioJava reaches the end of the sequence it reads in the header for the next record (FBpp0145470). It then tries to reset the BufferedReader to the start of FBpp0145470, but that is where the exception is thrown, because line 197 sets the read-ahead limit to 500 characters and the reader.readLine() call exceeds that limit.

What isn't obvious to me is why other large definition lines that precede that line don't throw the same error (e.g. FBpp0157909). I guess the javadoc on BufferedReader.mark() does say "may fail" but I assumed it would be more predictable than that.

The file in question can be downloaded from
ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz

If there is interest in a solution that doesn't involve simply upping the read-ahead limit, I can put a patch file together in the next day or so.

Cheers,
Josh

From jp at javaclass.co.uk Wed Apr 29 03:13:02 2009
From: jp at javaclass.co.uk (JP)
Date: Wed, 29 Apr 2009 08:13:02 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To:
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com>
Message-ID: <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com>

This is why we all love the internet and the community. What is the chance of this happening? You are speaking about World Peace, and Kofi Annan butts in. :)

I found that strange also (that there are larger headers preceding the troublesome one). Maybe (and this is a long shot) there is some buffer which gets filled at that particular record or point in the file? (Does the error move to a different record if we delete a couple of initial FASTA entries?)

Mind you, this is NOT the only FlyBase FASTA file I get errors with (the same happens with the dpse one, v2.3, and I am sure there are others).

I am interested in the solution, and so are a ton of other people who use BioJava and particularly verbose FASTA files.

I love FlyBase and BioJava.
JP

On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman wrote:
>
> Hi Richard and JP,
>
> I think I can be of some help as I'm the FlyBase developer responsible for
> generating these troublesome FASTA files :-). The cause of this problem
> appears to be the description line length for the record FBpp0145470.
>
> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> at line 196. Biojava correctly reads in FBpp0145468 but throws an error
> when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
> but when biojava reaches the end of the sequence it reads in the header
> for the next record (FBpp0145470). It then tries to reset the
> BufferedReader to the start of FBpp0145470 but that is where the exception
> is thrown because line 197 sets the read ahead limit to 500 characters and
> the reader.readLine() command exceeds that limit.
> > What isn't obvious to me is why other large definition lines that precede > that line don't throw the same error (e.g. FBpp0157909). I guess the > javadoc on BufferedReader.mark() does say "may fail" but I assumed it > would be more predictable than that. > > The file in question can be downloaded from > > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz > . > > If there is interest in a solution that doesn't involve simply upping the > read ahead limit I can put a patch file together in the next day or so. > > Cheers, > Josh > > On Tue, 28 Apr 2009, Richard Holland wrote: > > > You're right, doesn't look like newlines. > > > > The "Mark invalid" happens when the parser looks too far ahead in the > > file attempting to seek out the next valid sequence to parse. I'm not > > sure why this is happening. > > > > I don't have the time to test right now but if you could post the link > > to where someone could download the same FASTA as you're using, then it > > would make it possible for someone else to investigate in more detail. > > > > thanks, > > Richard > > > > JP wrote: > > > Thanks Richard for your prompt reply. > > > > > > I will not attach the fasta file I am parsing (12MB) its > > > dgri-all-translation-r1.3.fasta from the flybase project. > > > > > > If the file had any extra new lines I would see them when I loaded it > in > > > a text editor - no ? 
> > > > > > I implemented the whole thing without using Biojava (for this part) > > > > > > fr = new FileReader(fastaProteinFileName); > > > br = new BufferedReader(fr); > > > String fastaLine; > > > String startAccession = '>' + accessionId.trim(); > > > String fastaEntry = ""; > > > boolean record = false; > > > while ((fastaLine = br.readLine()) != null) { > > > fastaLine = fastaLine.trim() + '\n'; > > > if (fastaLine.startsWith(startAccession)) { > > > record = true; > > > } else if (record && fastaLine.startsWith(">")) { > > > record = false; > > > break; > > > } > > > if (record) { > > > fastaEntry += fastaLine; > > > } > > > } > > > > > > > > > Notice - I do not use regex - since I'd need to read the whole file and > > > then regex upon it (if the record is the first one - I just read that > one). > > > > > > Cheers > > > JP > > > > > > > > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland > > > > wrote: > > > > > > The "Mark invalid" exception is indicating that the parser has gone > too > > > far ahead in the file looking for a valid header. I'm not sure why > but > > > looking at your original query, there may be extra newlines > embedded > > > into your FASTA header line? That would definitely confuse it. > > > > > > The parser is not able to currently pull out just one sequence - in > > > effect this is a search facility, which it doesn't have. 
:( > > > > > > thanks, > > > Richard > > > > > > JP wrote: > > > > Hi all at BioJava, > > > > > > > > I am trying to parse several FASTA files using the following > code: > > > > > > > > fr = new FileReader(fastaProteinFileName); > > > >> br = new BufferedReader(fr); > > > >> > > > >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, > null); > > > >> while (protIter.hasNext()) { > > > >> BioEntry bioEntry = protIter.nextBioEntry(); > > > >> System.out.println (fastaProteinFileName + " == " + > > > accessionId + " = > > > >> " + bioEntry.getAccession()); > > > >> } > > > > > > > > > > > > At particular points in my fasta file - I get the following > exception: > > > > > > > > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception > (from > > > >> biojava library) > > > >> org.biojava.bio.BioException: Could not read sequence > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113) > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.core.OrthologueFinder.(OrthologueFinder.java:51) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60) > > > >> Caused by: java.io.IOException: Mark invalid > > > >> at java.io.BufferedReader.reset(Unknown Source) > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202) > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110) > > > >> ... 
5 more > > > > > > > > > > > > Interestingly if I delete the header portion of the header line > (from > > > > type=protein... till the end of the line ...Dgri;) > > > > > > > >> FBpp0145468 type=protein; > > > >> > > > > loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); > > > >> ID=FBpp0145468; name=Dgri\GH11562-PA; > parent=FBgn0119042,FBtr0146976; > > > >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; > > > >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; > > > >> species=Dgri; > > > >> > > > > > > > > It works - but I have a number of these exceptions (and I do not > > > want to > > > > edit the original data). Mind you I have longer headers in my > > > file which > > > > are parsed OK (strange!). > > > > > > > > Any ideas anyone ? Alternatively - is there a better way how to > > > get ONE > > > > SINGLE sequence from the whole fasta file give that I have the > > > accession id > > > > (FBpp0145468) ? 
> > > > > > > > Many Thanks > > > > JP > > > > _______________________________________________ > > > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > > > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > > > > > -- > > > Richard Holland, BSc MBCS > > > Finance Director, Eagle Genomics Ltd > > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > > > > > http://www.eaglegenomics.com/ > > > > > > > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > >

From markjschreiber at gmail.com Wed Apr 29 04:31:00 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 29 Apr 2009 16:31:00 +0800
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com>
Message-ID: <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com>

People who know me will know I am not a big fan of FASTA format. Sure it was useful in the days of FORTRAN but we really need to move on. I'm not sure the people who started the format foresaw the kind of abuse that the "format" would get.

What I would much prefer is something that looks like the BioEntry table of BioSQL plus the BioSequence information in some kind of (dare I say it) XML format. It would certainly be tidier and vastly more machine readable; for a start, not all the metadata would need to be on the description line in no specific order. I think by limiting it to those two tables you get most of the key metadata without all the cruft that comes with more extensive XML formats. It would be a bit less user friendly for people pasting sequences into webforms (although I think FASTA is fine for that) but much better for data distribution, webservices, machine processing etc.

Anyhow, that's enough venting. I don't want to start some kind of holy war or anything...

- Mark

ps. Sorry for getting off topic.

From holland at eaglegenomics.com Wed Apr 29 05:49:58 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 29 Apr 2009 10:49:58 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To:
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com>
Message-ID: <49F822C6.1020809@eaglegenomics.com>

I'd love to see a proper solution to this that doesn't involve upping the read-ahead limit. I was aware that it might be the issue, but had no idea why it was not failing for other similarly long sequences.

I look forward to seeing your suggested fix!

thanks,
Richard

Josh Goodman wrote:
> If there is interest in a solution that doesn't involve simply upping the
> read ahead limit I can put a patch file together in the next day or so.

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From markjschreiber at gmail.com Wed Apr 29 10:33:27 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 29 Apr 2009 22:33:27 +0800
Subject: [Biojava-l] FASTA
parsing bug ?
In-Reply-To: <93b45ca50904290726n4149ce7bhb5e9e82467982fb@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com> <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> <93b45ca50904290726n4149ce7bhb5e9e82467982fb@mail.gmail.com>
Message-ID: <93b45ca50904290733k68afb5b0na661588b4f09d804@mail.gmail.com>

I can understand a bench scientist wanting FASTA, but a computational biologist? They should be ashamed! With some of the friendly XPath implementations in common scripting languages there really is no excuse. It's easier to parse XML than FASTA in Groovy, Perl, Python and Ruby. Probably Java and C as well.

The state of bioinformatics data formats is cringeworthy. Let's try and enter the 21st century!

OK I'm ranting again. Maybe I'll go join twitter.

- Mark

On 29 Apr 2009, 10:04 PM, "Josh Goodman" wrote:

Hi Mark,

I couldn't agree with you more, which is why we also provide this data in GFF and Chado XML formats, Chado PostgreSQL dumps, and a public read-only Chado database. However, no matter how much we try to encourage use of the other formats, users still flock to the good old FASTA files. There are a variety of reasons, but the most common case involves bench scientists and/or programmers who run at the sight of anything more complex than a FASTA file.

I've toyed with the idea of reducing the data we cram into the headers to gently try to encourage use of the other, more sensible formats. However, at the end of the day we (FlyBase) serve at the behest of our user community and this is what they want to see.

Cheers,
Josh

On Wed, 29 Apr 2009, Mark Schreiber wrote:
> People who know me will know I am not a big fan of F...
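
[Editorial sketch, not part of the original posts.] The failure Josh describes in this thread (reset() being called after a readLine() that ran past the read-ahead limit given to mark()) can be reproduced in isolation. The snippet below is only a rough illustration under stated assumptions: the class name, the method name and the header text are made up, and the tiny buffer size and limit stand in for the 500-character limit mentioned for FastaFormat. It also hints at a possible reason why some long headers survive: with OpenJDK's BufferedReader, the mark seems to be invalidated only when the internal buffer has to be refilled, so a long line that is already fully buffered can still be reset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetDemo {

    /**
     * Marks the reader with the given read-ahead limit, reads one line
     * (which may be longer than the limit), then attempts reset().
     * Returns true if reset() succeeded, false if it threw IOException
     * (the "Mark invalid" case from the stack trace in this thread).
     */
    public static boolean resetSucceeds(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text), bufferSize)) {
            reader.mark(readAheadLimit);
            reader.readLine();
            try {
                reader.reset();
                return true;
            } catch (IOException markInvalid) {
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up header line, much longer than the 8-character limit used below.
        String text = ">FBpp0000000 type=protein; loc=made_up:1..99;\nMKV\n";

        // Small buffer: reading the line forces a refill past the limit,
        // so the mark is dropped and reset() fails.
        System.out.println(resetSucceeds(text, 16, 8));

        // Large buffer: the whole line is already buffered, no refill happens,
        // and reset() works even though the limit was exceeded.
        System.out.println(resetSucceeds(text, 8192, 8));
    }
}
```

If that reading of BufferedReader is right, whether a given record fails would depend on where its header happens to fall relative to the reader's internal buffer, not only on the header's length.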
From SMarkel at accelrys.com Wed Apr 29 15:53:10 2009 From: SMarkel at accelrys.com (Scott Markel) Date: Wed, 29 Apr 2009 15:53:10 -0400 Subject: [Biojava-l] FASTA parsing bug ? In-Reply-To: References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A010637@exch1-hi.accelrys.net> A quick note in order to add one more data point. While looking at this it should be kept in mind that NCBI's nonredundant database FASTA files (nr.fa and nt.fa) use ctrl-A characters to concatenate multiple descriptions. These concatenated descriptions can be thousands of characters long. I've got one that I use as a test case that has 378,260 characters (5204 concatenated descriptions). It's a 98 residue sequence for "NADH dehydrogenase subunit 4L". I'm not saying it's right, just that cases like this do exist. Scott Scott Markel, Ph.D. Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (SciTegic R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Vice President, Board of Directors: International Society for Computational Biology Co-chair: ISCB Publications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics > -----Original Message----- > From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l- > bounces at lists.open-bio.org] On Behalf Of Josh Goodman > Sent: Tuesday, 28 April 2009 8:09 PM > To: Richard Holland > Cc: biojava-l at biojava.org > Subject: Re: [Biojava-l] FASTA parsing bug ? > > > Hi Richard and JP, > > I think I can be of some help as I'm the FlyBase developer responsible for > generating these troublesome FASTA files :-). 
The cause of this problem > appears to be the description line length for the record FBpp0145470. > > The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop > at line 196. Biojava correctly reads in FBpp0145468 but throws an error > when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469 > but when biojava reaches the end of the sequence it reads in the header > for the next record (FBpp0145470). It then tries to reset the > BufferedReader to the start of FBpp0145470 but that is where the exception > is thrown because line 197 sets the read ahead limit to 500 characters and > the reader.readLine() command exceeds that limit. > > What isn't obvious to me is why other large definition lines that precede > that line don't throw the same error (e.g. FBpp0157909). I guess the > javadoc on BufferedReader.mark() does say "may fail" but I assumed it > would be more predictable than that. > > The file in question can be downloaded from > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fas > ta/dgri-all-translation-r1.3.fasta.gz. > > If there is interest in a solution that doesn't involve simply upping the > read ahead limit I can put a patch file together in the next day or so. > > Cheers, > Josh > > On Tue, 28 Apr 2009, Richard Holland wrote: > > > You're right, doesn't look like newlines. > > > > The "Mark invalid" happens when the parser looks too far ahead in the > > file attempting to seek out the next valid sequence to parse. I'm not > > sure why this is happening. > > > > I don't have the time to test right now but if you could post the link > > to where someone could download the same FASTA as you're using, then it > > would make it possible for someone else to investigate in more detail. > > > > thanks, > > Richard > > > > JP wrote: > > > Thanks Richard for your prompt reply. > > > > > > I will not attach the fasta file I am parsing (12MB) its > > > dgri-all-translation-r1.3.fasta from the flybase project. 
> > > > > > If the file had any extra new lines I would see them when I loaded it > in > > > a text editor - no ? > > > > > > I implemented the whole thing without using Biojava (for this part) > > > > > > fr = new FileReader(fastaProteinFileName); > > > br = new BufferedReader(fr); > > > String fastaLine; > > > String startAccession = '>' + accessionId.trim(); > > > String fastaEntry = ""; > > > boolean record = false; > > > while ((fastaLine = br.readLine()) != null) { > > > fastaLine = fastaLine.trim() + '\n'; > > > if (fastaLine.startsWith(startAccession)) { > > > record = true; > > > } else if (record && fastaLine.startsWith(">")) { > > > record = false; > > > break; > > > } > > > if (record) { > > > fastaEntry += fastaLine; > > > } > > > } > > > > > > > > > Notice - I do not use regex - since I'd need to read the whole file > and > > > then regex upon it (if the record is the first one - I just read that > one). > > > > > > Cheers > > > JP > > > > > > > > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland > > > > wrote: > > > > > > The "Mark invalid" exception is indicating that the parser has > gone too > > > far ahead in the file looking for a valid header. I'm not sure why > but > > > looking at your original query, there may be extra newlines > embedded > > > into your FASTA header line? That would definitely confuse it. > > > > > > The parser is not able to currently pull out just one sequence - > in > > > effect this is a search facility, which it doesn't have. 
:( > > > > > > thanks, > > > Richard > > > > > > JP wrote: > > > > Hi all at BioJava, > > > > > > > > I am trying to parse several FASTA files using the following > code: > > > > > > > > fr = new FileReader(fastaProteinFileName); > > > >> br = new BufferedReader(fr); > > > >> > > > >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, > null); > > > >> while (protIter.hasNext()) { > > > >> BioEntry bioEntry = protIter.nextBioEntry(); > > > >> System.out.println (fastaProteinFileName + " == " + > > > accessionId + " = > > > >> " + bioEntry.getAccession()); > > > >> } > > > > > > > > > > > > At particular points in my fasta file - I get the following > exception: > > > > > > > > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception > (from > > > >> biojava library) > > > >> org.biojava.bio.BioException: Could not read sequence > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader > .java:113) > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.jav > a:99) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFr > omFASTAFile(FastaFileProcessing.java:60) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(Ortholog > ueFinder.java:64) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.core.OrthologueFinder.(OrthologueFinder. > java:51) > > > >> at > > > >> > > > > edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(Ortholo > gueFinderLauncher.java:60) > > > >> Caused by: java.io.IOException: Mark invalid > > > >> at java.io.BufferedReader.reset(Unknown Source) > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202) > > > >> at > > > >> > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader > .java:110) > > > >> ... 
5 more > > > > > > > > > > > > Interestingly if I delete the header portion of the header line > (from > > > > type=protein... till the end of the line ...Dgri;) > > > > > > > >> FBpp0145468 type=protein; > > > >> > > > > loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13 > 220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331.. > 13226463,13226531..13226658); > > > >> ID=FBpp0145468; name=Dgri\GH11562-PA; > parent=FBgn0119042,FBtr0146976; > > > >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; > > > >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; > > > >> species=Dgri; > > > >> > > > > > > > > It works - but I have a number of these exceptions (and I do not > > > want to > > > > edit the original data). Mind you I have longer headers in my > > > file which > > > > are parsed OK (strange!). > > > > > > > > Any ideas anyone ? Alternatively - is there a better way how to > > > get ONE > > > > SINGLE sequence from the whole fasta file give that I have the > > > accession id > > > > (FBpp0145468) ? 
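For the question of getting ONE single record by accession, a small streaming reader is enough. The following is a sketch along the lines of the hand-written loop quoted earlier in this thread, not a BioJava API; the class name FastaRecordFinder and its methods are hypothetical. Unlike a plain startsWith() test it requires the accession to be followed by whitespace or the end of the line, so that FBpp0145468 does not also match a longer ID with the same prefix. This assumes the ID is separated from the rest of the header by whitespace, as in the FlyBase example.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Iterator;

public class FastaRecordFinder {

    /**
     * Returns the header and sequence lines of the first record whose header
     * starts with '>' + accessionId, or null if there is no such record.
     * Reading stops at the next header, so the rest of the file is not read.
     * The caller remains responsible for closing the Reader.
     */
    public static String findRecord(Reader source, String accessionId) {
        String wanted = ">" + accessionId.trim();
        StringBuilder entry = new StringBuilder();
        boolean recording = false;

        // lines() wraps read errors in UncheckedIOException
        Iterator<String> lines = new BufferedReader(source).lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.startsWith(">")) {
                if (recording) {
                    break;  // reached the header of the following record
                }
                recording = matchesAccession(line, wanted);
            }
            if (recording) {
                entry.append(line).append('\n');
            }
        }
        return entry.length() == 0 ? null : entry.toString();
    }

    /** True if the header is exactly the wanted ID or the ID followed by whitespace. */
    static boolean matchesAccession(String header, String wanted) {
        if (!header.startsWith(wanted)) {
            return false;
        }
        return header.length() == wanted.length()
                || Character.isWhitespace(header.charAt(wanted.length()));
    }
}
```

It could be called with something like findRecord(new FileReader(fastaProteinFileName), "FBpp0145468"). Because it never calls mark() or reset(), the length of a header line does not matter.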
> > > > > > > > Many Thanks > > > > JP > > > > _______________________________________________ > > > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > > > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > > > > > -- > > > Richard Holland, BSc MBCS > > > Finance Director, Eagle Genomics Ltd > > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > > > > > http://www.eaglegenomics.com/ > > > > > > > > > > -- > > Richard Holland, BSc MBCS > > Finance Director, Eagle Genomics Ltd > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From markjschreiber at gmail.com Wed Apr 29 23:01:20 2009 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 30 Apr 2009 11:01:20 +0800 Subject: [Biojava-l] FASTA parsing bug ? In-Reply-To: <616a29410904291528i1f2a4aag34988a7d036bcbe4@mail.gmail.com> References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com> <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> <93b45ca50904290726n4149ce7bhb5e9e82467982fb@mail.gmail.com> <93b45ca50904290733k68afb5b0na661588b4f09d804@mail.gmail.com> <616a29410904291528i1f2a4aag34988a7d036bcbe4@mail.gmail.com> Message-ID: <93b45ca50904292001n7f24947bh6f57dfb07eb73641@mail.gmail.com> A minimal XML equivalent to Fasta would look like this: ACGTGCACGCTGCACGT I think a biologist could handle that and it is much easier to parse than FASTA because it is well formed. 
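As a rough illustration of how little code such a minimal XML layout needs, here is a sketch that turns it back into FASTA with the DOM parser that ships with the JDK. The element and attribute names used (sequences, sequence, id) are placeholders chosen for this example, not a published schema, and MinimalSeqXml is a made-up class name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MinimalSeqXml {

    /**
     * Converts a document of the assumed form
     *   <sequences><sequence id="...">RESIDUES</sequence>...</sequences>
     * into FASTA text, one record per <sequence> element.
     */
    public static String toFasta(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Parser configuration, I/O and well-formedness errors all end up here.
            throw new IllegalArgumentException("Could not parse sequence XML", e);
        }

        NodeList records = doc.getElementsByTagName("sequence");
        StringBuilder fasta = new StringBuilder();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            fasta.append('>').append(record.getAttribute("id")).append('\n')
                 .append(record.getTextContent().trim()).append('\n');
        }
        return fasta.toString();
    }
}
```

Because the metadata sits in named attributes or elements rather than in one free-text description line, further fields could be added without changing how the ID and the residues are read.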
You don't even need to use an XML parser. You could even convert this to FASTA using a text editor with a few find and replace expressions. Possibly this would be easier to handle even for someone who can't program at all? Of course you could make it a lot more sophisticated but then you are approximating GenbankXML or something similar. Remember BioJava has FASTA parsers made by experienced programmers and over 10 years of testing and bug fixes and people still manage to break them. This indicates to me that the FASTA format is bad and should be voted off the island.

- Mark

On Thu, Apr 30, 2009 at 6:28 AM, simon rayner wrote:
> don't forget that a lot of the people doing bioinformatics are biologists
> with no formal training. They want to get the job done in the easiest
> possible way and aren't really concerned about the details. If you want
> people to switch to XML for example, the whole concept needs to be made more
> accessible. I'm still struggling to get my students to adopt XML.
>
> It seems that more basic tutorials would be useful - but in a less formal
> style that would be easier for newcomers to follow. Is there any feelings
> about trying to develop this side of the Biojava project? I thought about
> trying to add some stuff, but my java programming is embarrassingly poor and
> i thought i would be laughed off the website.
>
> Simon
>
> On Wed, Apr 29, 2009 at 10:33 PM, Mark Schreiber wrote:
>>
>> I can understand a bench scientist wanting FASTA but a computational
>> biologist. They should be ashamed! With some of the friendly XPath
>> implementations in common scripting languages there really is no excuse.
>> It's easier to parse XML than FASTA in Groovy, Perl, Python and Ruby.
>> Probably Java and C as well.
>>
>> The state of bioinformatics data formats is cringe worthy. Let's try and
>> enter the 21st century!
>>
>> OK I'm ranting again. Maybe I'll go join twitter.
>>
>> - Mark
>>
>> On 29 Apr 2009, 10:04 PM, "Josh Goodman" wrote:
>>
>>
>> Hi Mark,
>>
>> I couldn't agree with you more, which is why we also provide this data in
>> GFF and Chado XML formats, Chado PostgreSQL dumps, and a public read only
>> Chado database. However, no matter how much we try to encourage use of
>> the other formats users still flock to the good old FASTA files. There
>> are a variety of reasons but the most common case involves bench
>> scientists and/or programmers who run at the sight of anything more
>> complex than a FASTA file.
>>
>> I've toyed with the idea of reducing the data we cram into the headers to
>> gently try to encourage use of the other more sensible formats. However,
>> at the end of the day we (FlyBase) serve at the behest of our user
>> community and this is what they want to see.
>>
>> Cheers,
>> Josh
>>
>> On Wed, 29 Apr 2009, Mark Schreiber wrote: > People who know me will know
>> I am not a big fan of F...
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
> --
> Simon Rayner
>
> State Key Laboratory of Virology
> Wuhan Institute of Virology
> Chinese Academy of Sciences
> Wuhan, Hubei 430071
> P.R.China
>
> +86 (27) 87199895 (office)
> +86 15972923715 (cell)
>
>

From jogoodma at indiana.edu Wed Apr 29 10:04:42 2009
From: jogoodma at indiana.edu (Josh Goodman)
Date: Wed, 29 Apr 2009 14:04:42 -0000
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com> <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> Message-ID: Hi Mark, I couldn't agree with you more, which is why we also provide this data in GFF and Chado XML formats, Chado PostgreSQL dumps, and a public read only Chado database. However, no matter how much we try to encourage use of the other formats users still flock to the good old FASTA files. There are a variety of reasons but the most common case involves bench scientists and/or programmers who run at the sight of anything more complex than a FASTA file. I've toyed with the idea of reducing the data we cram into the headers to gently try to encourage use of the other more sensible formats. However, at the end of the day we (FlyBase) serve at the behest of our user community and this is what they want to see. Cheers, Josh On Wed, 29 Apr 2009, Mark Schreiber wrote: > People who know me will know I am not a big fan of FASTA format. Sure > it was useful in the days of FORTRAN but we really need to move on. > I'm not sure the people who started the format foresaw the kind of > abuse that the "format" would get. > > What I would much prefer is something that looks like the BioEntry > table of BioSQL plus the BioSequence information in somekind of (dare > I say it) XML format. It would certainly be tidier and vastly more > machine readable, for a start not all the metadata would need to be on > the description line in no specific order. I think by limiting it to > those two tables you get most of the key metadata without all the > cruft that comes with more extensive XML formats. 
>
> It would be a bit less user friendly for people pasting sequences into
> webforms (although I think FASTA is fine for that) but much better for
> data distribution, webservices, machine processing etc.
>
> Anyhow, that's enough venting. I don't wan't to start somekind of holy
> war or anything...
>
> - Mark
>
> ps. Sorry for getting off topic.
>
> On Wed, Apr 29, 2009 at 3:13 PM, JP wrote:
> >
> > This is why we all love the internet and the community.
> > What is the chance of this happening ? You are speaking about World Peace,
> > and Kofi Annan butts in. :)
> >
> > I found that strange also (that there are larger headers preceding the
> > troublesome one). Maybe (and this is a long shot) there is some buffer
> > which gets filled at that particular record or point in file ? (Does the
> > error move record if we delete a couple of initial Fasta entries ?)
> >
> > Mind you this is NOT the only flybase fasta file I get errors with (same
> > happens with dpse one v2.3 - and I am sure there are others).
> >
> > I am interested in the solution, so are a ton of other people who use
> > biojava and particularly verbose fasta files.
> >
> > I love flybase and biojava
> > JP
> >
> > On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman wrote:
> >
> > >
> > > Hi Richard and JP,
> > >
> > > I think I can be of some help as I'm the FlyBase developer responsible for
> > > generating these troublesome FASTA files :-). The cause of this problem
> > > appears to be the description line length for the record FBpp0145470.
> > >
> > > The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> > > at line 196. Biojava correctly reads in FBpp0145468 but throws an error
> > > when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
> > > but when biojava reaches the end of the sequence it reads in the header
> > > for the next record (FBpp0145470). It then tries to reset the
> > > BufferedReader to the start of FBpp0145470 but that is where the exception
> > > is thrown because line 197 sets the read ahead limit to 500 characters and
> > > the reader.readLine() command exceeds that limit.
> > >
> > > What isn't obvious to me is why other large definition lines that precede
> > > that line don't throw the same error (e.g. FBpp0157909). I guess the
> > > javadoc on BufferedReader.mark() does say "may fail" but I assumed it
> > > would be more predictable than that.
> > >
> > > The file in question can be downloaded from
> > >
> > > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz
> > > .
> > >
> > > If there is interest in a solution that doesn't involve simply upping the
> > > read ahead limit I can put a patch file together in the next day or so.
> > >
> > > Cheers,
> > > Josh
> > >
> > > On Tue, 28 Apr 2009, Richard Holland wrote:
> > >
> > > > You're right, doesn't look like newlines.
> > > >
> > > > The "Mark invalid" happens when the parser looks too far ahead in the
> > > > file attempting to seek out the next valid sequence to parse. I'm not
> > > > sure why this is happening.
> > > >
> > > > I don't have the time to test right now but if you could post the link
> > > > to where someone could download the same FASTA as you're using, then it
> > > > would make it possible for someone else to investigate in more detail..
> > > >
> > > > thanks,
> > > > Richard
> > > >
> > > > JP wrote:
> > > > > Thanks Richard for your prompt reply.
> > > > >
> > > > > I will not attach the fasta file I am parsing (12MB) its
> > > > > dgri-all-translation-r1.3.fasta from the flybase project.
> > > > >
> > > > > If the file had any extra new lines I would see them when I loaded it in
> > > > > a text editor - no ?
> > > > >
> > > > > I implemented the whole thing without using Biojava (for this part)
> > > > >
> > > > >     fr = new FileReader(fastaProteinFileName);
> > > > >     br = new BufferedReader(fr);
> > > > >     String fastaLine;
> > > > >     String startAccession = '>' + accessionId.trim();
> > > > >     String fastaEntry = "";
> > > > >     boolean record = false;
> > > > >     while ((fastaLine = br.readLine()) != null) {
> > > > >         fastaLine = fastaLine.trim() + '\n';
> > > > >         if (fastaLine.startsWith(startAccession)) {
> > > > >             record = true;
> > > > >         } else if (record && fastaLine.startsWith(">")) {
> > > > >             record = false;
> > > > >             break;
> > > > >         }
> > > > >         if (record) {
> > > > >             fastaEntry += fastaLine;
> > > > >         }
> > > > >     }
> > > > >
> > > > >
> > > > > Notice - I do not use regex - since I'd need to read the whole file and
> > > > > then regex upon it (if the record is the first one - I just read that one).
> > > > >
> > > > > Cheers
> > > > > JP
> > > > >
> > > > >
> > > > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland wrote:
> > > > >
> > > > >     The "Mark invalid" exception is indicating that the parser has gone too
> > > > >     far ahead in the file looking for a valid header. I'm not sure why but
> > > > >     looking at your original query, there may be extra newlines embedded
> > > > >     into your FASTA header line? That would definitely confuse it.
> > > > >
> > > > >     The parser is not able to currently pull out just one sequence - in
> > > > >     effect this is a search facility, which it doesn't have. :(
> > > > >
> > > > >     thanks,
> > > > >     Richard
> > > > >
> > > > >     JP wrote:
> > > > >     > Hi all at BioJava,
> > > > >     >
> > > > >     > I am trying to parse several FASTA files using the following code:
> > > > >     >
> > > > >     > fr = new FileReader(fastaProteinFileName);
> > > > >     >> br = new BufferedReader(fr);
> > > > >     >>
> > > > >     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> > > > >     >> while (protIter.hasNext()) {
> > > > >     >>      BioEntry bioEntry = protIter.nextBioEntry();
> > > > >     >>      System.out.println (fastaProteinFileName + " == " + accessionId + " =
> > > > >     >> " + bioEntry.getAccession());
> > > > >     >> }
> > > > >     >
> > > > >     >
> > > > >     > At particular points in my fasta file - I get the following exception:
> > > > >     >
> > > > >     > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from
> > > > >     >> biojava library)
> > > > >     >> org.biojava.bio.BioException: Could not read sequence
> > > > >     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> > > > >     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> > > > >     >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> > > > >     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> > > > >     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> > > > >     >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> > > > >     >> Caused by: java.io.IOException: Mark invalid
> > > > >     >>     at java.io.BufferedReader.reset(Unknown Source)
> > > > >     >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> > > > >     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > > > >     >>     ... 5 more
> > > > >     >
> > > > >     >
> > > > >     > Interestingly if I delete the header portion of the header line (from
> > > > >     > type=protein... till the end of the line ...Dgri;)
> > > > >     >
> > > > >     >> FBpp0145468 type=protein;
> > > > >     >> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
> > > > >     >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
> > > > >     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
> > > > >     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
> > > > >     >> species=Dgri;
> > > > >     >>
> > > > >     >
> > > > >     > It works - but I have a number of these exceptions (and I do not want to
> > > > >     > edit the original data). Mind you I have longer headers in my file which
> > > > >     > are parsed OK (strange!).
> > > > >     >
> > > > >     > Any ideas anyone ? Alternatively - is there a better way how to get ONE
> > > > >     > SINGLE sequence from the whole fasta file give that I have the accession id
> > > > >     > (FBpp0145468) ?
> > > > >     >
> > > > >     > Many Thanks
> > > > >     > JP
> > > > >     > _______________________________________________
> > > > >     > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > > >     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > > >     >
> > > > >
> > > > >     --
> > > > >     Richard Holland, BSc MBCS
> > > > >     Finance Director, Eagle Genomics Ltd
> > > > >     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > > > >     http://www.eaglegenomics.com/
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Holland, BSc MBCS
> > > > Finance Director, Eagle Genomics Ltd
> > > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > > > http://www.eaglegenomics.com/
> > > > _______________________________________________
> > > > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > >
> > >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From andreas at sdsc.edu Tue Apr 7 05:50:50 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 6 Apr 2009 22:50:50 -0700
Subject: [Biojava-l] [Biojava-dev] Leadership
In-Reply-To: <49D9C3CC.7010000@eaglegenomics.com>
References: <49D9C3CC.7010000@eaglegenomics.com>
Message-ID: <59a41c430904062250l25a1105dw505ce3edc2651184@mail.gmail.com>

Hi Richard,

Thanks for the nomination. In short I am intending to do the following things over the next couple of months:

* Release biojava 1.7 - don't forget, code freeze will be on Wed. April 8th.
Please commit your final changes for this release in the next couple of days, or let me know if you need more time asap. * Keep maintaining biojava nightly builds at http://www.spice-3d.org/cruise/ * Organize a biojava user meeting around BOSC / ISMB 2009 * After the biojava 1.7 release I want to have a discussion how to continue with the code base and what to change for the next major release. * I will actively seek and invite new contributors and package maintainers. * My main focus are the further development of the protein structure related modules. As such I will need YOUR help for maintaining blast, sequence and any of the other frequently used modules. Andreas On Mon, Apr 6, 2009 at 1:56 AM, Richard Holland wrote: > Hi all. > > There were no nominations for the BioJava leadership role by the end of > last week, so I would like to nominate Andreas Prlic to take over the > role as BioJava coordinator/project manager. Andreas has agreed to be > nominated. > > If there are no objections lodged on this list by next Monday (13th > April), I'll hand over to Andreas by the end of next week. > > thanks, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From holland at eaglegenomics.com Tue Apr 7 10:09:25 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 07 Apr 2009 11:09:25 +0100 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <59a41c430904062250l25a1105dw505ce3edc2651184@mail.gmail.com> References: <49D9C3CC.7010000@eaglegenomics.com> <59a41c430904062250l25a1105dw505ce3edc2651184@mail.gmail.com> Message-ID: <49DB2655.1080500@eaglegenomics.com> I'd be happy to maintain the parts you request. 
Andreas Prlic wrote: > Hi Richard, > > Thanks for the nomination. In short I am intending to do the following > things over the next couple of months: > > * Release biojava 1.7 - don't forget, code freeze will be on Wed. > April 8th. Please commit your final changes for this release in the > next couple of days, or let me know if you need more time asap. > > * Keep maintaining biojava nightly builds at http://www.spice-3d.org/cruise/ > > * Organize a biojava user meeting around BOSC / ISMB 2009 > > * After the biojava 1.7 release I want to have a discussion how to > continue with the code base and what to change for the next major > release. > > * I will actively seek and invite new contributors and package maintainers. > > * My main focus are the further development of the protein structure > related modules. As such I will need YOUR help for maintaining blast, > sequence and any of the other frequently used modules. > > Andreas > > > > > On Mon, Apr 6, 2009 at 1:56 AM, Richard Holland > wrote: >> Hi all. >> >> There were no nominations for the BioJava leadership role by the end of >> last week, so I would like to nominate Andreas Prlic to take over the >> role as BioJava coordinator/project manager. Andreas has agreed to be >> nominated. >> >> If there are no objections lodged on this list by next Monday (13th >> April), I'll hand over to Andreas by the end of next week. 
>>
>> thanks,
>> Richard
>>
>> --
>> Richard Holland, BSc MBCS
>> Finance Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From tallpaulinjax at yahoo.com Wed Apr 8 19:52:57 2009
From: tallpaulinjax at yahoo.com (tallpaulinjax at yahoo.com)
Date: Wed, 8 Apr 2009 12:52:57 -0700 (PDT)
Subject: [Biojava-l] MMCIF parser?
Message-ID: <54544.49687.qm@web30702.mail.mud.yahoo.com>

Hi,

The JAVADOCS located here with a build date of today indicate BioJava supports MMCIF file parsing:
http://www.spice-3d.org/public-files/javadoc/biojava/overview-summary.html
As does the Cookbook page here:
http://biojava.org/wiki/BioJava:CookBook:PDB:mmcif

Yet the (official ?) Javadocs here don't indicate support:
http://www.biojava.org/docs/api16/index.html

And when I searched in the source code I can not find any mention of the *mmcif*.java files, and the class files don't seem to exist either. To make sure I had the latest build, I re-downloaded the JAR file located here and checked it as well:
http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar

Is MMCIF support from an older version of BioJava that has been deprecated, or from a new version yet to be released?

Thanks,

Paul

From andreas at sdsc.edu Wed Apr 8 20:10:01 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 8 Apr 2009 13:10:01 -0700
Subject: [Biojava-l] MMCIF parser?
In-Reply-To: <54544.49687.qm@web30702.mail.mud.yahoo.com>
References: <54544.49687.qm@web30702.mail.mud.yahoo.com>
Message-ID: <59a41c430904081310y6127448ft9af65a0c79be0bac@mail.gmail.com>

Hi Paul,

The mmcif functionality is new and will be part of the 1.7 release that will go out next week. In the meanwhile you can use the nightly build .jars from http://www.spice-3d.org/cruise/ ...

Andreas

On Wed, Apr 8, 2009 at 12:52 PM, wrote:
>
> Hi,
>
> The JAVADOCS located here with a build date of today indicate BioJava supports MMCIF file parsing:
> http://www.spice-3d.org/public-files/javadoc/biojava/overview-summary.html
> As does the Cookbook page here:
> http://biojava.org/wiki/BioJava:CookBook:PDB:mmcif
>
> Yet the (official ?) Javadocs here don't indicate support:
> http://www.biojava.org/docs/api16/index.html
>
> And when I searched in the source code I can not find any mention of the *mmcif*.java files, and the class files don't seem to exist either. To make sure I had the latest build, I re-downloaded the JAR file located here and checked it as well:
> http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar
>
> Is MMCIF support from an older version of BioJava that has been deprecated, or from a new version yet to be released?
>
> Thanks,
>
> Paul
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From tallpaulinjax at yahoo.com Wed Apr 8 20:38:19 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Wed, 8 Apr 2009 13:38:19 -0700 (PDT)
Subject: [Biojava-l] MMCIF parser?
Message-ID: <75942.7684.qm@web30707.mail.mud.yahoo.com>

Thanks, Andreas!

--- On Wed, 4/8/09, Andreas Prlic wrote:

From: Andreas Prlic
Subject: Re: [Biojava-l] MMCIF parser?
To: tallpaulinjax at yahoo.com
Cc: "biojava-l at biojava.org"
Date: Wednesday, April 8, 2009, 4:10 PM

Hi Paul,

The mmcif functionality is new and will be part of the 1.7 release that will go out next week. In the meanwhile you can use the nightly build .jars from http://www.spice-3d.org/cruise/ ...

Andreas

On Wed, Apr 8, 2009 at 12:52 PM, wrote:
>
> Hi,
>
> The JAVADOCS located here with a build date of today indicate BioJava supports MMCIF file parsing:
> http://www.spice-3d.org/public-files/javadoc/biojava/overview-summary.html
> As does the Cookbook page here:
> http://biojava.org/wiki/BioJava:CookBook:PDB:mmcif
>
> Yet the (official ?) Javadocs here don't indicate support:
> http://www.biojava.org/docs/api16/index.html
>
> And when I searched in the source code I can not find any mention of the *mmcif*.java files, and the class files don't seem to exist either. To make sure I had the latest build, I re-downloaded the JAR file located here and checked it as well:
> http://www.biojava.org/download/bj16/all/biojava-1.6.1-all.jar
>
> Is MMCIF support from an older version of BioJava that has been deprecated, or from a new version yet to be released?
>
> Thanks,
>
> Paul
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From andreas at sdsc.edu Sat Apr 11 19:05:44 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sat, 11 Apr 2009 12:05:44 -0700
Subject: [Biojava-l] BOSC abstract
Message-ID: <59a41c430904111205p624821aarf6124d0db2bb7eb9@mail.gmail.com>

Hi,

Submission deadline for the BOSC abstract is on Monday. The current version is available at http://biojava.org/wiki/BOSC2009_Presentation#BioJava_2009:__an_Open-Source_Framework_for_Bioinformatics. If you are one of the co-authors, can you please make sure I got your affiliation right? Also if you have any additions or corrections to the abstract, please feel free to edit. If I missed anybody who should be co-author, please edit as well...
Andreas From holland at eaglegenomics.com Sun Apr 12 10:49:54 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Sun, 12 Apr 2009 11:49:54 +0100 Subject: [Biojava-l] [Biojava-dev] BOSC abstract In-Reply-To: <59a41c430904111205p624821aarf6124d0db2bb7eb9@mail.gmail.com> References: <59a41c430904111205p624821aarf6124d0db2bb7eb9@mail.gmail.com> Message-ID: <49E1C752.6080702@eaglegenomics.com> looks good! Andreas Prlic wrote: > Hi, > > Submision deadline for the BOSC abstract is on Monday. The current > version is available at > http://biojava.org/wiki/BOSC2009_Presentation#BioJava_2009:__an_Open-Source_Framework_for_Bioinformatics. > If you are one of the co-authors, can you please make sure I got your > affiliation right? Also if you have any additions or corrections to > the abstract, please feel free to edit. If I missed anybody who should > be co-author, please edit as well... > > Andreas > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Mon Apr 13 02:47:26 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 12 Apr 2009 19:47:26 -0700 Subject: [Biojava-l] BioJava 1.7 released Message-ID: <59a41c430904121947v67c7a7f9v1a236d3ad695760f@mail.gmail.com> Biojava 1.7 has been released and is available from http://biojava.org/wiki/BioJava:Download BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats, and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. 
Besides numerous bug fixes and stability improvements, a lot of development has been going on in the protein structure modules. BioJava now provides a framework for parsing mmCif files. The parsing of PDB header information has been improved and a new tool to read the Chemical component dictionary is in place. Biojava 1.7 offers more functionality and stability over the previous official releases. We highly recommend you to upgrade as soon as possible. Thanks to all contributors for making this release possible. Happy Biojava-ing, Andreas From holland at eaglegenomics.com Tue Apr 14 08:33:05 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 14 Apr 2009 09:33:05 +0100 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49D9C3CC.7010000@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> Message-ID: <49E44A41.2010703@eaglegenomics.com> Hello again. Well, nobody objected, and several people supported the idea, so I would now like to formally hand over control of the BioJava project to Andreas Prlic with immediate effect. It's been good fun working with the project over the last 5 years, and although I'll no longer be in charge, I will still remain on the mailing lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. I'll also continue to attend BOSC, including this year in Stockholm, so I'm looking forward to meeting up with everyone there for a beer or two. Thanks for the help and support everyone's given, and I'm sure you'll join me in wishing Andreas the best of luck with the project. He'll be an excellent leader and with him in charge I believe the project will go from strength to strength. 
cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From willishf at ufl.edu Tue Apr 14 15:02:10 2009 From: willishf at ufl.edu (Scooter Willis) Date: Tue, 14 Apr 2009 11:02:10 -0400 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49E44A41.2010703@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> Message-ID: <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> Andreas Congrats on taking on the responsibility of steering BioJava in a positive direction. I needed the ability to generate phylogenetic trees from aligned sequence data and found that the work was started in a Google summer project, but looking at the code it wasn't finished and appeared to focus only on loading trees, not creating them. I ended up taking the tree generation code out of Jalview and removing as many Jalview dependencies as possible, and have it as a nice tight collection of classes. My assumption, without any deep legal review, is that because Jalview is open source the code can be used and contributed to another open source project like BioJava. I also plan on contributing the code changes back to Jalview. One of the challenges I ran into with the Jalview code is performance when building the tree from 1800+ sequences (it takes a very, very long time), so I am doing some code optimization and finishing up testing on a fairly significant speedup of Neighbor_Join, using a slightly different approach that makes it O(N^2) instead of O(N^3). I have a couple of things to fix in the tree joining code and then will compare results for the quality of the tree compared to the original distance matrix. I should know more this week.
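For readers wondering where the cubic cost mentioned above comes from, here is a minimal sketch of the pair-selection step of textbook Neighbor-Joining. It is not the Jalview code and not the faster variant described in this message; the class and method names are hypothetical and chosen only for illustration. Each call scans every pair of taxa once (quadratic work), and a full tree build repeats that selection roughly once per taxon, which is what gives the classic algorithm its cubic running time.

```java
/** Hypothetical illustration only; not taken from Jalview or BioJava. */
public class NeighborJoinStep {

    /** Row sums of a symmetric distance matrix. */
    public static double[] rowSums(double[][] d) {
        int n = d.length;
        double[] sums = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sums[i] += d[i][j];
            }
        }
        return sums;
    }

    /** Q(i,j) = (n - 2) * d(i,j) - sum_k d(i,k) - sum_k d(j,k). */
    public static double q(double[][] d, double[] sums, int i, int j) {
        return (d.length - 2) * d[i][j] - sums[i] - sums[j];
    }

    /** Returns {i, j} with i < j minimizing Q; the first minimal pair wins on ties. */
    public static int[] pickPair(double[][] d) {
        double[] sums = rowSums(d);
        int bestI = -1;
        int bestJ = -1;
        double best = Double.POSITIVE_INFINITY;
        // One pass over all pairs: quadratic in the number of taxa.
        for (int i = 0; i < d.length; i++) {
            for (int j = i + 1; j < d.length; j++) {
                double value = q(d, sums, i, j);
                if (value < best) {
                    best = value;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] { bestI, bestJ };
    }

    public static void main(String[] args) {
        // Small made-up distance matrix for five taxa.
        double[][] d = {
            { 0, 5, 9, 9, 8 },
            { 5, 0, 10, 10, 9 },
            { 9, 10, 0, 8, 7 },
            { 9, 10, 8, 0, 3 },
            { 8, 9, 7, 3, 0 }
        };
        int[] pair = pickPair(d);
        System.out.println("join taxa " + pair[0] + " and " + pair[1]);
    }
}
```

For the made-up matrix in `main`, the first two taxa have the lowest Q value (-50) and would be joined first. A complete implementation would then replace the pair with a new node, update the distances, and repeat until the tree is resolved.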
I think I remember a BioJava discussion about trying to separate parts and pieces so that if you try and use a particular feature set of BioJava you are not forced into absorbing the entire BioJava collection of Jars. In my case I would want a biojava-phylogenetic.jar that has all things related to tree creation and/or tree viewing etc. If the common data format for handling sequences is RichSequence or Sequence then I would expect to have one other Jar requirement of biojava-core.jar. Not sure if any work has been done to refactor the BioJava code base into multiple jar files in the same way Apache does its jars for great Java code geared to a specific problem domain. Let me know what I can do to assist moving forward. Thanks Scooter Willis On Tue, Apr 14, 2009 at 4:33 AM, Richard Holland wrote: > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect. > > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength.
> > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From andreas.prlic at gmail.com Tue Apr 14 16:22:27 2009 From: andreas.prlic at gmail.com (Andreas Prlic) Date: Tue, 14 Apr 2009 09:22:27 -0700 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49E44A41.2010703@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> Message-ID: <59a41c430904140922m79f6fdetb0e286e8548a37fe@mail.gmail.com> Hi Richard, Again, thanks for your BioJava contributions in the last years and great to have you still around. I am looking forward to the next year of BioJava development. Our Google Analytics stats reveal that we have an ever growing user base and it will be a challenge to continue developing BioJava further and add new and useful features. Of course this is a task that I can't do alone and I will need the help of everybody who wants to write documentation, submit bug fixes or wants to become maintainer of one of the modules. With BioJava 1.7 being out it is now a good time to start a discussion on how to improve the code base for the next version. We also have BOSC coming up in June and it will provide a good opportunity for people to meet in person. Hope to see you (Richard, and everybody else!) in Sweden, otherwise we will keep talking via the lists. Andreas On Tue, Apr 14, 2009 at 1:33 AM, Richard Holland wrote: > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect.
> > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength. > > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From andreas at sdsc.edu Tue Apr 14 16:45:21 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 14 Apr 2009 09:45:21 -0700 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <49E44A41.2010703@eaglegenomics.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> Message-ID: <59a41c430904140945t30911e33idc8ea095c52e59e3@mail.gmail.com> Hi Richard, Again, thanks for your BioJava contributions in the last years and great to have you still around. I am looking forward to the next year of BioJava development. Our google analytics stats reveal that we have an ever growing user base and it will be a challenge to continue developing BioJava further and add new and useful features. Of course this is a task that I can't do alone and I will need the help of everybody who wants to write documentation, submit bug fixes or wants to become maintainer of one of the modules. With BioJava 1.7 being out it is now a good time to start a discussion at how to improve the code base for the next version. We also have BOSC coming up in June and it will provide a good opportunity for people to meet in person. 
Hope to see you (Richard, and everybody else!) in Sweden, otherwise we will keep talking via the lists. Andreas On Tue, Apr 14, 2009 at 1:33 AM, Richard Holland wrote: > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect. > > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength. > > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From holland at eaglegenomics.com Tue Apr 14 15:07:15 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 14 Apr 2009 16:07:15 +0100 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> Message-ID: <49E4A6A3.6030005@eaglegenomics.com> The plan was to create separate new BJ3 jars for task-specific code, exactly as you suggest for phylogenetics. I'd support a biojava-phylo jar of some kind, and I agree it would probably depend on the BJ3 biojava-core module for sequence handling. 
The existing BJ code was not originally going to be refactored into separate jars, unless Andreas has other plans! Scooter Willis wrote: > Andreas > > Congrats on taking on the responsibility of steering BioJava in a > positive direction. > > I needed the ability to generate phylogenetic trees from aligned > sequence data and found that the work was started in a google summer > project but looking at the code it wasn't finished and appeared to focus > only on loading trees not creating them. I ended up taking the tree > generation code out of jalview and removing as much jalview dependencies > as possible and have it as a nice tight collection of classes. My > assumption without any deep legal review is that because jalview is open > source that the code can be used and contributed to another open source > project like BioJava. I will also plan on contributing the code changes > back to Jalview. > > One of the challenges I ran into with the JalView code is performance > for building the tree when using 1800+ sequences(takes a very very long > time) so I am doing some code optimization and finishing up testing on a > fairly significant performance speedup doing Neighbor_Join with a > slightly different approach that makes it N2 instead of N3. I have a > couple things to fix in tree joinging code and then will compare results > for the quality of the tree compared to the original distance matrix. I > should know more this week. > > I think I remember a BioJava discussion about trying to seperate parts > and pieces to that if you try and use a particular feature set of > BioJava you are not forced into absorbing the entire BioJava collection > of Jars. In my case I would want a biojava-phylogenetic.jar that has all > things related to tree creation and/or tree viewing etc. If the common > data format for handling sequences is RichSequence or Sequence then I > would expect to have one other Jar requirement of biojava-core.jar. 
Not > sure if any work has been done to refactor the BioJava code base into > multiple jar files in the same way apache does its jars for great java > code geared to a specific problem domain. > > Let me know what I can do to assist moving forward. > > Thanks > > Scooter Willis > > On Tue, Apr 14, 2009 at 4:33 AM, Richard Holland > > wrote: > > Hello again. > > Well, nobody objected, and several people supported the idea, so I would > now like to formally hand over control of the BioJava project to Andreas > Prlic with immediate effect. > > It's been good fun working with the project over the last 5 years, and > although I'll no longer be in charge, I will still remain on the mailing > lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. > > I'll also continue to attend BOSC, including this year in Stockholm, so > I'm looking forward to meeting up with everyone there for a beer or two. > > Thanks for the help and support everyone's given, and I'm sure you'll > join me in wishing Andreas the best of luck with the project. He'll be > an excellent leader and with him in charge I believe the project will go > from strength to strength. 
> > cheers, > Richard > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Wed Apr 15 05:40:24 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 14 Apr 2009 22:40:24 -0700 Subject: [Biojava-l] [Biojava-dev] Leadership In-Reply-To: <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> References: <49D9C3CC.7010000@eaglegenomics.com> <49E44A41.2010703@eaglegenomics.com> <7ceb4beb0904140802g39aa0d0bo7e536fb45523bb78@mail.gmail.com> Message-ID: <59a41c430904142240u5a9812eejf11023f69baa27b7@mail.gmail.com> Hi Scooter, Thanks for volunteering. I like the idea of modularizing BioJava in the next version. The details of how to do this still need to be discussed on the dev mailing list. If you want to be involved into anything phylo related then this is great and any contribution will be welcome. About merging in 3rd party code that is under a different license: For this you need to get permission from the original copyright owners or rewrite the code .... Andreas On Tue, Apr 14, 2009 at 8:02 AM, Scooter Willis wrote: > Andreas > > Congrats on taking on the responsibility of steering BioJava in a positive > direction. > > I needed the ability to generate phylogenetic trees from aligned sequence > data and found that the work was started in a google summer project but > looking at the code it wasn't finished and appeared to focus only on loading > trees not creating them. 
I ended up taking the tree generation code out of > jalview and removing as much jalview dependencies as possible and have it as > a nice tight collection of classes. My assumption without any deep legal > review is that because jalview is open source that the code can be used and > contributed to another open source project like BioJava. I will also plan on > contributing the code changes back to Jalview. > > One of the challenges I ran into with the JalView code is performance for > building the tree when using 1800+ sequences(takes a very very long time) so > I am doing some code optimization and finishing up testing on a fairly > significant performance speedup doing Neighbor_Join with a slightly > different approach that makes it N2 instead of N3. I have a couple things to > fix in tree joinging code and then will compare results for the quality of > the tree compared to the original distance matrix. I should know more this > week. > > I think I remember a BioJava discussion about trying to seperate parts and > pieces to that if you try and use a particular feature set of BioJava you > are not forced into absorbing the entire BioJava collection of Jars. In my > case I would want a biojava-phylogenetic.jar that has all things related to > tree creation and/or tree viewing etc. If the common data format for > handling sequences is RichSequence or Sequence then I would expect to have > one other Jar requirement of biojava-core.jar. Not sure if any work has been > done to refactor the BioJava code base into multiple jar files in the same > way apache does its jars for great java code geared to a specific problem > domain. > > Let me know what I can do to assist moving forward. > > Thanks > > Scooter Willis > > On Tue, Apr 14, 2009 at 4:33 AM, Richard Holland > wrote: > >> Hello again. 
>> >> Well, nobody objected, and several people supported the idea, so I would >> now like to formally hand over control of the BioJava project to Andreas >> Prlic with immediate effect. >> >> It's been good fun working with the project over the last 5 years, and >> although I'll no longer be in charge, I will still remain on the mailing >> lists and contribute code/ideas/bugfixes/etc. whenever I get the chance. >> >> I'll also continue to attend BOSC, including this year in Stockholm, so >> I'm looking forward to meeting up with everyone there for a beer or two. >> >> Thanks for the help and support everyone's given, and I'm sure you'll >> join me in wishing Andreas the best of luck with the project. He'll be >> an excellent leader and with him in charge I believe the project will go >> from strength to strength. >> >> cheers, >> Richard >> >> -- >> Richard Holland, BSc MBCS >> Finance Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From jolyon.holdstock at ogt.co.uk Fri Apr 17 10:27:16 2009 From: jolyon.holdstock at ogt.co.uk (Jolyon Holdstock) Date: Fri, 17 Apr 2009 11:27:16 +0100 Subject: [Biojava-l] User interface example for Cookbook Message-ID: <588D0DD225D05746B5D8CAE1BE971F3F026D7EAC@EUCLID.internal.ogtip.com> Hi, I've been meaning to generate an updated example of code for displaying a sequence (with some additional functionality) for the cookbook and finally got off my backside to do it. Code is below; I hope it's of use - feel free to point out errors, improvements etc...
Cheers, Jolyon

//Code starts -------------------------------------------------------------------------------------------
//Java libraries
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.util.*;
import javax.swing.*;
//BioJava libraries
import org.biojava.bio.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.gui.sequence.*;
//BioJava extension libraries
import org.biojavax.*;
import org.biojavax.ontology.*;
import org.biojavax.bio.seq.*;

public class DisplaySequenceFile extends JFrame implements SequenceViewerMotionListener {
    private TranslatedSequencePanel tsp = new TranslatedSequencePanel();
    private MultiLineRenderer mlr = new MultiLineRenderer();
    private RulerRenderer rr = new RulerRenderer();
    private SequenceRenderer seqR = new SymbolSequenceRenderer();
    private FeatureBlockSequenceRenderer fbsr;
    private RichSequence richSeq;
    private Container con;
    private JPanel controlPanel;
    private JButton mvLeft, mvRight, zoomIn, zoomOut;
    private double sequenceScale = 0.05;
    private int windowWidth = 1200;
    private int windowHeight = 200;

    public DisplaySequenceFile(String fileName){
        //Load the sequence file
        try {
            richSeq = RichSequence.IOTools.readEMBLDNA(new BufferedReader(new FileReader(new File(fileName))), null).nextRichSequence();
        } catch (BioException bioe){
            System.err.println("Not an EMBL sequence" + bioe);
        } catch(FileNotFoundException fnfe){
            System.err.println("FileNotFoundException: " + fnfe);
        } catch (IOException ioe){
            System.err.println("IOException: " + ioe);
        }

        //Define the appearance of the rendered Features
        BasicFeatureRenderer bfr = new BasicFeatureRenderer();
        GradientPaint gradient = new GradientPaint(0, 10, Color.RED, 0, 0, Color.white, true);
        bfr.setFill(gradient);
        bfr.setOutline(Color.RED);

        //Form a bridge between Sequence rendering and Feature rendering
        fbsr = new FeatureBlockSequenceRenderer(bfr);
        fbsr.setCollapsing(false);

        //Filter for CDS features on the forward strand
        SequenceRenderer fwd_sr = new FilteringRenderer(fbsr,
            new FeatureFilter.And(new FeatureFilter.ByType("CDS"),
                new FeatureFilter.StrandFilter(StrandedFeature.POSITIVE)), true);
        //Filter for CDS features on the reverse strand
        SequenceRenderer rev_sr = new FilteringRenderer(fbsr,
            new FeatureFilter.And(new FeatureFilter.ByType("CDS"),
                new FeatureFilter.StrandFilter(StrandedFeature.NEGATIVE)), true);

        //Add the renderers to the MultiLineRenderer
        mlr.addRenderer(fwd_sr);
        mlr.addRenderer(rr);
        mlr.addRenderer(rev_sr);
        mlr.addRenderer(seqR);

        //Set the sequence renderer for the TranslatedSequencePanel
        tsp.setRenderer(mlr);
        //Set the sequence to render
        tsp.setSequence(richSeq);
        //Set the position of the displayed sequence
        tsp.setSymbolTranslation(1);
        //Set the scale as pixels per Symbol.
        tsp.setScale(sequenceScale);
        //Add a sequence viewer motion listener to the TranslatedSequencePanel
        tsp.addSequenceViewerMotionListener(this);

        //Generate the control panel
        controlPanel = new JPanel();
        controlPanel.setBackground(Color.lightGray);

        //Move along the sequence towards 5' end
        mvLeft = new JButton("<<");
        mvLeft.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                int rightSide = tsp.getRange().getMax();
                int leftSide = tsp.getRange().getMin();
                int newStartPoint = leftSide - (rightSide - leftSide);
                if (newStartPoint < 1){
                    newStartPoint = 1;
                }
                tsp.setSymbolTranslation(newStartPoint);
            }
        });

        //Move along the sequence towards 3' end
        mvRight = new JButton(">>");
        mvRight.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                int rightSide = tsp.getRange().getMax();
                int leftSide = tsp.getRange().getMin();
                int screenWidth = rightSide - leftSide;
                if ((rightSide + screenWidth) >= richSeq.length()){
                    tsp.setSymbolTranslation(richSeq.length() - screenWidth);
                } else {
                    tsp.setSymbolTranslation(rightSide);
                }
            }
        });

        //Increase sequence scale
        zoomIn = new JButton("+");
        zoomIn.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                sequenceScale = sequenceScale * 2;
                //If sequence scale = 12 the bases are rendered;
                //no need to zoom in further so disable the button.
                if (sequenceScale > 12){
                    sequenceScale = 12;
                    zoomIn.setEnabled(false);
                }
                tsp.setScale(sequenceScale);
            }
        });

        //Reduce sequence scale
        zoomOut = new JButton("-");
        zoomOut.addActionListener(new ActionListener(){
            public void actionPerformed(ActionEvent ae){
                sequenceScale = sequenceScale / 2;
                //If sequence scale is below 12 then enable the zoomIn button
                if (sequenceScale < 12){
                    zoomIn.setEnabled(true);
                }
                //If the scale allows more than the sequence to be displayed
                //display the whole sequence
                if (sequenceScale < ((double)tsp.getWidth()/(double)richSeq.length())){
                    sequenceScale = (double)tsp.getWidth()/(double)richSeq.length();
                    tsp.setSymbolTranslation(1);
                }
                tsp.setScale(sequenceScale);
                //If the new scale coupled with the current SymbolTranslation means the
                //displayed sequence can't fill the TranslatedSequencePanel then reset the SymbolTranslation
                if(tsp.getRange().getMax() >= richSeq.length()){
                    int tmp = (int)((double)tsp.getWidth()/sequenceScale);
                    tsp.setSymbolTranslation(richSeq.length() - tmp);
                }
            }
        });

        controlPanel.add(mvLeft);
        controlPanel.add(mvRight);
        controlPanel.add(zoomIn);
        controlPanel.add(zoomOut);

        con = getContentPane();
        con.setLayout(new BorderLayout());
        con.add(controlPanel, BorderLayout.NORTH);
        con.add(tsp, BorderLayout.CENTER);
        setLocation(50,50);
        setSize(windowWidth,windowHeight);
        setVisible(true);
        setResizable(false);
    }

    /**
     * Detect mouse dragged events
     * @param sve
     */
    public void mouseDragged(SequenceViewerEvent sve) {
    }

    /**
     * Detect mouse movement events
     * If the mouse moves over a CDS feature create a tooltiptext stating
     * the name of the gene associated with the CDS feature.
     * @param sve
     */
    public void mouseMoved(SequenceViewerEvent sve) {
        //Manage the tooltip
        ToolTipManager ttm = ToolTipManager.sharedInstance();
        ttm.setDismissDelay(2000);
        //If the mouse has moved over a SimpleFeatureHolder
        if (sve.getTarget() instanceof SimpleFeatureHolder){
            ComparableTerm gene = RichObjectFactory.getDefaultOntology().getOrCreateTerm("gene");
            SimpleFeatureHolder sfh = (SimpleFeatureHolder)sve.getTarget();
            FeatureHolder fh = sfh.filter(new FeatureFilter.ByType("CDS"));
            //Explicit casts are used because the iterator type parameters were lost from the posted code
            Iterator i = fh.features();
            while(i.hasNext()){
                RichFeature rf = (RichFeature) i.next();
                RichAnnotation anno = (RichAnnotation) rf.getAnnotation();
                Set annotationNotes = anno.getNoteSet();
                for (Iterator it = annotationNotes.iterator(); it.hasNext();) {
                    Note note = (Note) it.next();
                    if (note.getTerm().equals(gene)) {
                        tsp.setToolTipText("Gene: " + note.getValue());
                    }
                }
            }
        } else {
            //Remove the tooltip
            ttm.setDismissDelay(10);
        }
    }

    /**
     * Main method
     * @param args
     */
    public static void main(String args []){
        if (args.length == 1){
            new DisplaySequenceFile(args[0]);
        } else {
            System.out.println("Usage: java DisplaySequenceFile <EMBL file>");
            System.exit(1);
        }
    }
}
//Code ends -------------------------------------------------------------------------------------------

Dr. Jolyon Holdstock Senior Computational Biologist, Oxford Gene Technology, Begbroke Science Park, Sandy Lane, Yarnton, Oxford, OX5 1PF, UK. T: +44 (0)1865 856 852 F: +44 (0)1865 842 116 E: jolyon.holdstock at ogt.co.uk W: www.ogt.co.uk

From hkayabilisim at gmail.com Mon Apr 20 14:14:03 2009 From: hkayabilisim at gmail.com (Hüseyin Kaya) Date: Mon, 20 Apr 2009 17:14:03 +0300 Subject: [Biojava-l] A problem in reading remote AB1 files Message-ID: Hi BioJava Community, I have a little problem in reading AB1 files residing on a remote webserver. I have two different AB1 files; ok.ab1 and failed.ab1. They were both generated from the same sequencer.
Here is a quick summary:
Reading ok.ab1 from a local directory is OK
Reading failed.ab1 from a local directory is OK
Reading ok.ab1 from a remote webserver is OK
Reading failed.ab1 from a remote webserver is not OK

The exception is:
java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.biojava.utils.io.CachingInputStream.read(CachingInputStream.java:101)
    at java.io.DataInputStream.readFully(Unknown Source)
    at java.io.DataInputStream.readFully(Unknown Source)
    at org.biojava.bio.program.abi.ABIFParser$DataStream.readFully(ABIFParser.java:376)
    at org.biojava.bio.program.abi.ABIFParser.readDataRecords(ABIFParser.java:129)
    at org.biojava.bio.program.abi.ABIFParser.<init>(ABIFParser.java:100)
    at org.biojava.bio.program.abi.ABIFParser.<init>(ABIFParser.java:89)
    at org.biojava.bio.program.abi.ABIFChromatogram$Parser.<init>(ABIFChromatogram.java:117)
    at org.biojava.bio.program.abi.ABIFChromatogram.load(ABIFChromatogram.java:101)
    at org.biojava.bio.program.abi.ABIFChromatogram.create(ABIFChromatogram.java:89)
    at org.biojava.bio.chromatogram.ChromatogramFactory.create(ChromatogramFactory.java:119)
    at BugTest.readChromatogram(BugTest.java:34)
    at BugTest.main(BugTest.java:22)

I would be glad of any help in resolving this problem.
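One workaround that might be worth trying for the remote case, offered only as an untested sketch: network streams often return fewer bytes per read than requested, unlike local files, so copying the whole remote file into memory first makes the parser see the same kind of stream in both cases. The class and method names below are hypothetical, and whether this avoids the exception in the stack trace above is an assumption, not something confirmed on this list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/** Hypothetical helper; not part of BioJava. */
public class BufferedFetch {

    /** Reads the stream to its end, tolerating short reads, and returns all bytes. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns an in-memory stream holding a full copy of the source, then closes the source. */
    public static InputStream buffer(InputStream in) throws IOException {
        try {
            return new ByteArrayInputStream(readAll(in));
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the URL of the remote file.
        URL url = new URL(args[0]);
        InputStream local = buffer(url.openStream());
        System.out.println("buffered " + local.available() + " bytes");
        // 'local' could now be passed to the ChromatogramFactory.create call
        // that appears in the stack trace, in place of url.openStream().
    }
}
```

The trade-off is memory: the whole file is held in a byte array, which should be acceptable for trace files of ordinary size.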
Sincerely Huseyin Kaya

BugTest.java

import java.net.URL;
import org.biojava.bio.chromatogram.AbstractChromatogram;
import org.biojava.bio.chromatogram.ChromatogramFactory;

public class BugTest {
    public static final String URL_REMOTE_FAILED = "http://dna.iontek.com.tr/files/failed.ab1";
    public static final String URL_REMOTE_OK = "http://dna.iontek.com.tr/files/ok.ab1";
    public static final String URL_LOCAL_FAILED = "file:///C:/failed.ab1";
    public static final String URL_LOCAL_OK = "file:///C:/ok.ab1";

    public static void main(String[] args) throws Exception {
        readChromatogram("Reading ok.ab1 from local directory", URL_LOCAL_OK);
        readChromatogram("Reading failed.ab1 from local directory", URL_LOCAL_FAILED);
        readChromatogram("Reading ok.ab1 from webserver ", URL_REMOTE_OK);
        // This one fails
        readChromatogram("Reading failed.ab1 from webserver ", URL_REMOTE_FAILED);
    }

    private static void readChromatogram(String message, String urlstr) {
        System.out.println(message + "[" + urlstr + "]");
        AbstractChromatogram ch = null;
        URL url;
        try {
            url = new URL(urlstr);
            ch = (AbstractChromatogram) ChromatogramFactory.create(url.openStream());
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (ch != null)
            System.out.println("Trace length is " + ch.getTraceLength());
    }
}

From jp at javaclass.co.uk Mon Apr 20 14:56:00 2009 From: jp at javaclass.co.uk (JP) Date: Mon, 20 Apr 2009 15:56:00 +0100 Subject: [Biojava-l] BioJava Question - Evolutionary Rate Calculation Message-ID: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com> Hi there at Biojava, I have a number of orthologue protein (sequences) from different species - I would like to calculate the distance (score) between each of these (programmatically) based on the sequence. Can anyone suggest a way to do this (I'd rather use an existing bit of software than having to reinvent the wheel) ? Surely this software must exist...(hopefully in biojava).
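To make the question concrete, here is a minimal sketch of one very simple sequence-based distance, the p-distance (the fraction of compared alignment columns that differ). It is plain Java rather than a BioJava API, the class and method names are hypothetical, and it assumes the sequences have already been aligned to equal length with '-' as the gap character. It is a crude measure and not an evolutionary rate estimate; corrected distances or substitution-matrix scores would need more than this.

```java
/** Hypothetical illustration; not a BioJava class. */
public class PDistance {

    /**
     * Fraction of compared columns in which the two aligned sequences differ.
     * Columns with a gap ('-') in either sequence are skipped.
     */
    public static double pDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must be aligned to equal length");
        }
        int compared = 0;
        int different = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = Character.toUpperCase(a.charAt(i));
            char y = Character.toUpperCase(b.charAt(i));
            if (x == '-' || y == '-') {
                continue;
            }
            compared++;
            if (x != y) {
                different++;
            }
        }
        if (compared == 0) {
            throw new IllegalArgumentException("no ungapped columns to compare");
        }
        return (double) different / compared;
    }

    /** Symmetric matrix of pairwise p-distances for a set of aligned sequences. */
    public static double[][] matrix(String[] aligned) {
        int n = aligned.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                d[i][j] = pDistance(aligned[i], aligned[j]);
                d[j][i] = d[i][j];
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Made-up aligned fragments, for illustration only.
        String[] aligned = { "MKTAYIAK", "MKTAHIAK", "MK-AYIAR" };
        double[][] d = matrix(aligned);
        for (int i = 0; i < d.length; i++) {
            for (int j = 0; j < d.length; j++) {
                System.out.print(d[i][j] + "\t");
            }
            System.out.println();
        }
    }
}
```

The resulting matrix is the kind of input a tree-building method such as neighbor joining expects.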
Many Thanks Jean-Paul Ebejer, Malta From cif077 at gmail.com Mon Apr 20 23:11:48 2009 From: cif077 at gmail.com (Ian Yi-Feng Chang) Date: Tue, 21 Apr 2009 07:11:48 +0800 Subject: [Biojava-l] Editing a RichSequence In-Reply-To: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> Message-ID: <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com> Dear All, I've a problem while editing a richsequence. and got this exception: Exception in thread "main" org.biojava.utils.ChangeVetoException: AbstractSymbolList is immutable at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113) at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31) at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163) at gizmo.tools.GBKCurator.main(GBKCurator.java:176) I trace this problem in this mailing list and find a latest thread in** *Wed Feb 20 21:33:39 EST 2008* However, I still have no idea how to Here is the solution (from the JavaDoc) SimpleRichSequenceBuilderFactory public SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold) Creates a new instance of SimpleRichSequenceBuilderFactory that uses a specified factory for SymbolLists longer than a specified length. Before that a SimpleSymbolListFacotry is used. Parameters: fact - the factory to use when building the SymbolList.threshold - the threshold to exceed before using this factory However, could you please help to demonstrate how to use this solution to edit a richsequence? Thank you so much. 
ian chang

From andreas at sdsc.edu Mon Apr 20 23:26:07 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 20 Apr 2009 16:26:07 -0700
Subject: [Biojava-l] BioJava Question - Evolutionary Rate Calculation
In-Reply-To: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com>
References: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com>
Message-ID: <59a41c430904201626y28a2f6d4m34cab6e9a34ad767@mail.gmail.com>

Hi Jean-Paul,

You can use BioJava to calculate pairwise alignments between your sequences:

http://biojava.org/wiki/BioJava:CookBook:DP:PairWise2

Andreas

On Mon, Apr 20, 2009 at 7:56 AM, JP wrote:
> Hi there at Biojava,
>
> I have a number of orthologue protein (sequences) from different species - I
> would like to calculate the distance (score) between each of these
> (programatically) based on the sequence.
>
> Can anyone suggest a way to do this (I'd rather use an existing bit of
> software than having to reinvent the wheel) ? Surely this software must
> exist...(hopefully in biojava).
>
> Many Thanks
> Jean-Paul Ebejer, Malta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From cif077 at gmail.com Tue Apr 21 08:59:06 2009
From: cif077 at gmail.com (Ian Yi-Feng Chang)
Date: Tue, 21 Apr 2009 16:59:06 +0800
Subject: [Biojava-l] Cannot edit RichSequence larger than 16Kbp
Message-ID: <720d02c10904210159n57c44e02n2f63856cd706c7e5@mail.gmail.com>

Dear All,

I try to edit the sequence in GenBank flat file and try to preserve all Annotations and Features of it. Therefore I use following code to edit RichSequence. However, the following code only works with sequence length < 16Kbp. Is anything wrong in my code? Thanks for your help.
import java.io.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
import org.biojavax.*;
import org.biojavax.bio.seq.*;
import org.biojavax.bio.seq.RichSequence.*;
import org.biojavax.bio.seq.io.*;

public class RichSequenceTest {
    public static void main(String[] args) throws Exception {
        SimpleRichSequenceBuilderFactory srsbf =
            new SimpleRichSequenceBuilderFactory(new SimpleSymbolListFactory(), 100000);
        RichSequence seq = IOTools.readGenbank(
            new BufferedReader(new FileReader("/data/gbk/NC_011995q.gbk")),
            IOTools.getDNAParser(),
            srsbf,
            RichObjectFactory.getDefaultNamespace()
        ).nextRichSequence();
        Edit ed = new Edit(3, 2, DNATools.createDNA("aatagaa"));
        seq.edit(ed);
        System.out.println(seq.seqString());
    }
}

Exception in thread "main" org.biojava.utils.ChangeVetoException: AbstractSymbolList is immutable
    at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113)
    at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31)
    at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163)
    at gizmo.test.RichSequenceTest.main(RichSequenceTest.java:20)

From andreas at sdsc.edu Tue Apr 21 17:18:38 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 21 Apr 2009 10:18:38 -0700
Subject: [Biojava-l] BioJava Question - Evolutionary Rate Calculation
In-Reply-To: <4adc29060904210319g7a391c21n867e893ae0598800@mail.gmail.com>
References: <4adc29060904200756v6ad5d22dgfecbcf7d5e012b6c@mail.gmail.com> <59a41c430904201626y28a2f6d4m34cab6e9a34ad767@mail.gmail.com> <4adc29060904210319g7a391c21n867e893ae0598800@mail.gmail.com>
Message-ID: <59a41c430904211018k4fb8fd64vdf0796274389996@mail.gmail.com>

A simple measure of distance is e.g. the number of amino acid differences between two sequences divided by the number of aligned amino acids. You can use the pairwise alignments to derive that number. There are also plenty of other ways to calculate evolutionary distances and many papers have been written on this topic...
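As a rough illustration of the measure described above (differences divided by aligned positions), here is a minimal sketch in plain Java. It assumes the two sequences have already been aligned to equal length, with '-' marking gaps. The class and method names (PDistance, pDistance) and the example strings are made up for this sketch and are not part of BioJava.

```java
public class PDistance {

    /**
     * Fraction of aligned (gap-free) columns in which the two residues differ.
     * Both strings are assumed to be rows of the same pairwise alignment,
     * so they must have equal length; '-' marks a gap.
     */
    public static double pDistance(String alignedA, String alignedB) {
        if (alignedA.length() != alignedB.length()) {
            throw new IllegalArgumentException("aligned sequences must have equal length");
        }
        int aligned = 0;      // columns where both sequences have a residue
        int differences = 0;  // aligned columns where the residues differ
        for (int i = 0; i < alignedA.length(); i++) {
            char a = Character.toUpperCase(alignedA.charAt(i));
            char b = Character.toUpperCase(alignedB.charAt(i));
            if (a == '-' || b == '-') {
                continue; // gap columns are not counted as aligned
            }
            aligned++;
            if (a != b) {
                differences++;
            }
        }
        if (aligned == 0) {
            throw new IllegalArgumentException("no aligned residues");
        }
        return (double) differences / aligned;
    }

    public static void main(String[] args) {
        // 8 aligned columns, 2 mismatches (I/L and K/R)
        System.out.println(pDistance("MKT-AYIAK", "MKTLAYLAR")); // prints 0.25
    }
}
```

This uncorrected proportion is only the simplest option; it ignores multiple substitutions at the same site, which is one reason other distance measures exist.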
Andreas

On Tue, Apr 21, 2009 at 3:19 AM, JP wrote:
> Thanks Andreas - Would you say that evolutionary distance is the same as
> pairwise alignment ?
>
> Many thanks
> JP
>
> On Tue, Apr 21, 2009 at 12:26 AM, Andreas Prlic wrote:
>>
>> Hi Jean-Paul,
>>
>> You can use BioJava to calculate pairwise alignments between your
>> sequences:
>>
>> http://biojava.org/wiki/BioJava:CookBook:DP:PairWise2
>>
>> Andreas
>>
>>
>> On Mon, Apr 20, 2009 at 7:56 AM, JP wrote:
>> > Hi there at Biojava,
>> >
>> > I have a number of orthologue protein (sequences) from different species - I
>> > would like to calculate the distance (score) between each of these
>> > (programatically) based on the sequence.
>> >
>> > Can anyone suggest a way to do this (I'd rather use an existing bit of
>> > software than having to reinvent the wheel) ? Surely this software must
>> > exist...(hopefully in biojava).
>> >
>> > Many Thanks
>> > Jean-Paul Ebejer, Malta
>> > _______________________________________________
>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>
>

From holland at eaglegenomics.com Wed Apr 22 15:45:15 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 22 Apr 2009 16:45:15 +0100
Subject: [Biojava-l] Editing a RichSequence
In-Reply-To: <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com>
References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com>
Message-ID: <49EF3B8B.5090509@eaglegenomics.com>

The problem lies in SimpleRichSequenceBuilder:

    public void addSymbols(Alphabet alpha, Symbol[] syms, int start, int length) throws IllegalAlphabetException {
        if (this.symbols==null) {
            if (threshold<=0) {
                this.symbols = new ChunkedSymbolListFactory(this.factory);
            } else {
                this.symbols = new ChunkedSymbolListFactory(this.factory,threshold);
            }
        }
        this.symbols.addSymbols(alpha, syms, start, length);
    }

The references to
ChunkedSymbolListFactory are causing the problem. ChunkedSymbolListFactory is supposed to perform the threshold checking/factory selection. However it is also applying a further layer of abstraction which forces all symbol lists for sequences over 16k (1<<14) long to be ChunkedSymbolLists, regardless of the factory specified - the factory only specifies what the constituent sequences are within the ChunkedSymbolList. ChunkedSymbolList is immutable so will not allow edits even if its constituents are mutable. However if your sequence is less than 16k long, it behaves properly and you will get the type of sequence you asked for (SimpleSymbolList below the threshold, whatever you specify above it - SimpleSymbolList also happens to be the only SymbolList implementation in BioJava that is actually mutable at present.) As the older thread describes, ChunkedSymbolList and its Factory are very embedded into the core of BioJava and are hard to change - it could break all kinds of things. Therefore the only real solution for now is to temporarily modify your local copy so that inside ChunkedSymbolList, you change the CHUNK_SIZE to something much larger than 1<<14. thanks, Richard Ian Yi-Feng Chang wrote: > Dear All, > I've a problem while editing a richsequence. 
> and got this exception: > Exception in thread "main" org.biojava.utils.ChangeVetoException: > AbstractSymbolList is immutable > at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113) > > at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31) > at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163) > at gizmo.tools.GBKCurator.main(GBKCurator.java:176) > > I trace this problem in this mailing list and find a latest thread > in** *Wed Feb 20 21:33:39 EST 2008* > > However, I still have no idea how to > > Here is the solution (from the JavaDoc) > > > SimpleRichSequenceBuilderFactory public > SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold) > Creates a new instance of SimpleRichSequenceBuilderFactory that uses > a specified factory for SymbolLists longer than a specified length. > Before that a SimpleSymbolListFacotry is used. > > Parameters: > fact - the factory to use when building the > SymbolList.threshold - the threshold to exceed before using this factory > > However, could you please help to demonstrate how to use this solution > to edit a richsequence? > > Thank you so much. 
> > ian chang > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Wed Apr 22 15:50:17 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Apr 2009 16:50:17 +0100 Subject: [Biojava-l] Editing a RichSequence In-Reply-To: <49EF3B8B.5090509@eaglegenomics.com> References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com> <49EF3B8B.5090509@eaglegenomics.com> Message-ID: <49EF3CB9.3070909@eaglegenomics.com> I forgot to mention - ChunkedSymbolListFactory is currently the only SymbolListFactory implementation in BioJava which can accept 'streamed' data rather than taking the whole sequence at once. So, the other alternative to changing CHUNK_SIZE is to create a new SymbolListFactory implementation which can accept 'streamed' data and use it to replace the reference to ChunkedSymbolListFactory in SimpleRichSequenceBuilder. Richard. Richard Holland wrote: > The problem lies in SimpleRichSequenceBuilder: > > public void addSymbols(Alphabet alpha, Symbol[] syms, int start, int > length) throws IllegalAlphabetException { > if (this.symbols==null) { > if (threshold<=0) { > this.symbols = new ChunkedSymbolListFactory(this.factory); > } else { > this.symbols = new > ChunkedSymbolListFactory(this.factory,threshold); > } > } > this.symbols.addSymbols(alpha, syms, start, length); > } > > The references to ChunkedSymbolListFactory are causing the problem. > ChunkedSymbolListFactory is supposed to perform the threshold > checking/factory selection. 
However it is also applying a further layer > of abstraction which forces all symbol lists for sequences over 16k > (1<<14) long to be ChunkedSymbolLists, regardless of the factory > specified - the factory only specifies what the constituent sequences > are within the ChunkedSymbolList. ChunkedSymbolList is immutable so will > not allow edits even if its constituents are mutable. However if your > sequence is less than 16k long, it behaves properly and you will get the > type of sequence you asked for (SimpleSymbolList below the threshold, > whatever you specify above it - SimpleSymbolList also happens to be the > only SymbolList implementation in BioJava that is actually mutable at > present.) > > As the older thread describes, ChunkedSymbolList and its Factory are > very embedded into the core of BioJava and are hard to change - it could > break all kinds of things. Therefore the only real solution for now is > to temporarily modify your local copy so that inside ChunkedSymbolList, > you change the CHUNK_SIZE to something much larger than 1<<14. > > thanks, > Richard > > Ian Yi-Feng Chang wrote: >> Dear All, >> I've a problem while editing a richsequence. 
>> and got this exception: >> Exception in thread "main" org.biojava.utils.ChangeVetoException: >> AbstractSymbolList is immutable >> at org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113) >> >> at org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31) >> at org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163) >> at gizmo.tools.GBKCurator.main(GBKCurator.java:176) >> >> I trace this problem in this mailing list and find a latest thread >> in** *Wed Feb 20 21:33:39 EST 2008* >> >> However, I still have no idea how to >> >> Here is the solution (from the JavaDoc) >> >> >> SimpleRichSequenceBuilderFactory public >> SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold) >> Creates a new instance of SimpleRichSequenceBuilderFactory that uses >> a specified factory for SymbolLists longer than a specified length. >> Before that a SimpleSymbolListFacotry is used. >> >> Parameters: >> fact - the factory to use when building the >> SymbolList.threshold - the threshold to exceed before using this factory >> >> However, could you please help to demonstrate how to use this solution >> to edit a richsequence? >> >> Thank you so much. 
>> >> ian chang >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From cif077 at gmail.com Thu Apr 23 01:39:14 2009 From: cif077 at gmail.com (Ian Yi-Feng Chang) Date: Thu, 23 Apr 2009 09:39:14 +0800 Subject: [Biojava-l] Editing a RichSequence In-Reply-To: <49EF3CB9.3070909@eaglegenomics.com> References: <720d02c10904200122v501581e4v25abdc93c20365d9@mail.gmail.com> <720d02c10904201611g4c94458cy92e61c6cc1075ac3@mail.gmail.com> <49EF3B8B.5090509@eaglegenomics.com> <49EF3CB9.3070909@eaglegenomics.com> Message-ID: <720d02c10904221839s74f8c403j67c8ad3c56bda963@mail.gmail.com> Thanks for your detail explanation. I got it now. On Wed, Apr 22, 2009 at 11:50 PM, Richard Holland wrote: > I forgot to mention - ChunkedSymbolListFactory is currently the only > SymbolListFactory implementation in BioJava which can accept 'streamed' > data rather than taking the whole sequence at once. So, the other > alternative to changing CHUNK_SIZE is to create a new SymbolListFactory > implementation which can accept 'streamed' data and use it to replace > the reference to ChunkedSymbolListFactory in SimpleRichSequenceBuilder. > > Richard. > > Richard Holland wrote: > > The problem lies in SimpleRichSequenceBuilder: > > > > public void addSymbols(Alphabet alpha, Symbol[] syms, int start, int > > length) throws IllegalAlphabetException { > > if (this.symbols==null) { > > if (threshold<=0) { > > this.symbols = new > ChunkedSymbolListFactory(this.factory); > > } else { > > this.symbols = new > > ChunkedSymbolListFactory(this.factory,threshold); > > } > > } > > this.symbols.addSymbols(alpha, syms, start, length); > > } > > > > The references to ChunkedSymbolListFactory are causing the problem. 
> > ChunkedSymbolListFactory is supposed to perform the threshold > > checking/factory selection. However it is also applying a further layer > > of abstraction which forces all symbol lists for sequences over 16k > > (1<<14) long to be ChunkedSymbolLists, regardless of the factory > > specified - the factory only specifies what the constituent sequences > > are within the ChunkedSymbolList. ChunkedSymbolList is immutable so will > > not allow edits even if its constituents are mutable. However if your > > sequence is less than 16k long, it behaves properly and you will get the > > type of sequence you asked for (SimpleSymbolList below the threshold, > > whatever you specify above it - SimpleSymbolList also happens to be the > > only SymbolList implementation in BioJava that is actually mutable at > > present.) > > > > As the older thread describes, ChunkedSymbolList and its Factory are > > very embedded into the core of BioJava and are hard to change - it could > > break all kinds of things. Therefore the only real solution for now is > > to temporarily modify your local copy so that inside ChunkedSymbolList, > > you change the CHUNK_SIZE to something much larger than 1<<14. > > > > thanks, > > Richard > > > > Ian Yi-Feng Chang wrote: > >> Dear All, > >> I've a problem while editing a richsequence. 
> >> and got this exception: > >> Exception in thread "main" org.biojava.utils.ChangeVetoException: > >> AbstractSymbolList is immutable > >> at > org.biojava.bio.symbol.AbstractSymbolList.edit(AbstractSymbolList.java:113) > >> > >> at > org.biojavax.bio.seq.DummyRichSequenceHandler.edit(DummyRichSequenceHandler.java:31) > >> at > org.biojavax.bio.seq.ThinRichSequence.edit(ThinRichSequence.java:163) > >> at gizmo.tools.GBKCurator.main(GBKCurator.java:176) > >> > >> I trace this problem in this mailing list and find a latest thread > >> in** *Wed Feb 20 21:33:39 EST 2008* > >> > >> However, I still have no idea how to > >> > >> Here is the solution (from the JavaDoc) > >> > >> > >> SimpleRichSequenceBuilderFactory public > >> SimpleRichSequenceBuilderFactory(SymbolListFactory fact, int threshold) > >> Creates a new instance of SimpleRichSequenceBuilderFactory that uses > >> a specified factory for SymbolLists longer than a specified length. > >> Before that a SimpleSymbolListFacotry is used. > >> > >> Parameters: > >> fact - the factory to use when building the > >> SymbolList.threshold - the threshold to exceed before using this factory > >> > >> However, could you please help to demonstrate how to use this solution > >> to edit a richsequence? > >> > >> Thank you so much. 
> >> > >> ian chang > >> _______________________________________________ > >> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-l > >> > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From paolo.romano at istge.it Tue Apr 28 13:21:44 2009 From: paolo.romano at istge.it (Paolo Romano) Date: Tue, 28 Apr 2009 15:21:44 +0200 Subject: [Biojava-l] NETTAB 2009: Deadline postponed to May 4, 2009, for Oral communications Message-ID: <200904281335.n3SDZsrd018188@ibm43p.biotech.ist.unige.it> Due to many requests for a new deadline for submission of contributions for oral communications, the related deadline has been postponed to: Monday May 4, 2009, at 12.00 (noon), EST (GMT+1). ===== Last Call for Oral communications NETTAB 2009 Workshop on "Technologies, Tools and Applications for Collaborative and Social Bioinformatics Research and Development" with a Special Session on: "Methods and Tools for RNA Structure and Functional Analysis" June 10-13, 2009 Department of Computer Science, University of Catania, Italy http://www.nettab.org/2009/ Deadline approaching: May 4, 2009: Oral communication submission Contributions must be short papers of around THREE A4 pages or 12.000 characters long. Submit through the EasyChair system at: http://www.easychair.org/conferences/?conf=nettab2009 . See web site for details. Motivation The most recent developments of collaborative development tools are impressive. Researchers can now collaboratively develop software (open source systems), discuss and compare development strategies (social networks), write documents (google docs, wiki systems), build knowledge bases. So, it may now be the time for presenting current technologies, tools and applications for collaborative work and for discussing perspectives of their utilization in support of Bioinformatics. 
For these reasons, NETTAB 2009 will be devoted to "Technologies, Tools and Applications for Collaborative and Social Bioinformatics Research and Development".

The RNA community is also taking advantage of collaborative research tools such as Wikis and other virtual environments. The RNA WikiProject now contains over 600 articles describing families of noncoding RNAs based on the Rfam database, and invites the community to update, edit, and correct those articles. Therefore, the NETTAB 2009 special session will focus on collaborative research projects, computational methods and tools for the analysis of RNA structures and functions, with a special emphasis on ncRNAs.

Invited Speakers (more to be announced)

# Alex Bateman
  Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
# Doron Betel
  MSKCC - Computational Biology Center, New York, USA
# Tim Clark
  Director of Informatics, MassGeneral Institute for Neurodegenerative Disease, Neurology Research Department, Massachusetts General Hospital, Boston, USA
# Duncan Hull
  School of Chemistry, University of Manchester, Manchester, UK
# Debora Marks
  Systems Biology Department, Harvard Medical School, Boston, USA
# Gabriel Valiente
  Technical University of Catalonia, Department of Software, Barcelona, Spain

Topics

- Collaborative Web sites (bioinformatics.org, biojava, bioperl, )
- Communities of Practices (CoPs)
    Scientific practices in scientific communities
    Automatic detection / gathering / modelling of scientific practices
    Implementations of CoPs
- Social networking (myExperiment, Annotea, myScience)
    Social Bookmarking
    Semantic Document Markup
    Relationships mining from literature
- Open Source development
    Sharing of data models, libraries, interfaces
- Social software for collaborative documentation development
    Wikis, blogs, google docs
    Knowledge Wikis
    Social-software-mediated collaborative scientific research
Social-software-mediated collaborative tools' development Knowledge base collaborative development Ontologies collaborative development - Education and training tools E-learning Virtual environments Methods and Tools for RNA Structure and Functional Analysis - RNA structure prediction - Collaborative studies of RNAs - ncRNAs functional analysis and classification - miRNAs and networks - Genome-wide functional studies - Identification of ncRNAs - Databases of ncRNAs and miRNA targets - miRNA targets prediction - Synthetic miRNA and siRNA design - Gene expression analysis - Analysis of viral RNAs - RNAi therapeutics - Identification of ncRNAs biomarkers - RNA-protein interaction prediction Deadlines Contributions for both oral communications and posters must be short papers of around THREE A4 pages or 12.000 characters long. They must be submitted through the EasyChair system at: http://www.easychair.org/conferences/?conf=nettab2009 . - May 4, 2009: Oral communication submission - May 15, 2009: Posters submission - May 17, 2009: Early registration - June 10-13, 2009: Tutorials and Workshop Calls for SPECIAL ISSUES We plan to launch Calls for Special Issues on the themes of the workshop in peer-review journals with associated Impact factor around July for submission in September 2009. Best regards. Paolo Romano on behalf of NETTAB 2009 Chairs NETTAB '09 - Ninth International Workshop on Network Tools and Applications in Biology 10-13 June 2009, Catania, Italy http://www.nettab.org/2009/ Paolo Romano (paolo.romano at istge.it) Bioinformatics National Cancer Research Institute (IST) From jp at javaclass.co.uk Tue Apr 28 14:01:04 2009 From: jp at javaclass.co.uk (JP) Date: Tue, 28 Apr 2009 15:01:04 +0100 Subject: [Biojava-l] FASTA parsing bug ? 
Message-ID: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> Hi all at BioJava, I am trying to parse several FASTA files using the following code: fr = new FileReader(fastaProteinFileName); > br = new BufferedReader(fr); > > RichSequenceIterator protIter = IOTools.readFastaProtein(br, null); > while (protIter.hasNext()) { > BioEntry bioEntry = protIter.nextBioEntry(); > System.out.println (fastaProteinFileName + " == " + accessionId + " = > " + bioEntry.getAccession()); > } At particular points in my fasta file - I get the following exception: 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from > biojava library) > org.biojava.bio.BioException: Could not read sequence > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113) > at > org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99) > at > edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60) > at > edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64) > at > edu.imperial.msc.orthologue.core.OrthologueFinder.(OrthologueFinder.java:51) > at > edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60) > Caused by: java.io.IOException: Mark invalid > at java.io.BufferedReader.reset(Unknown Source) > at > org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202) > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110) > ... 5 more Interestingly if I delete the header portion of the header line (from type=protein... 
till the end of the line ...Dgri;) >FBpp0145468 type=protein; > loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); > ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; > dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; > MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; > species=Dgri; > It works - but I have a number of these exceptions (and I do not want to edit the original data). Mind you I have longer headers in my file which are parsed OK (strange!). Any ideas anyone ? Alternatively - is there a better way how to get ONE SINGLE sequence from the whole fasta file give that I have the accession id (FBpp0145468) ? Many Thanks JP From jp at javaclass.co.uk Tue Apr 28 14:59:40 2009 From: jp at javaclass.co.uk (JP) Date: Tue, 28 Apr 2009 15:59:40 +0100 Subject: [Biojava-l] FASTA parsing bug ? In-Reply-To: <49F71258.3060103@eaglegenomics.com> References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> Message-ID: <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> Thanks Richard for your prompt reply. I will not attach the fasta file I am parsing (12MB) its dgri-all-translation-r1.3.fasta from the flybase project. If the file had any extra new lines I would see them when I loaded it in a text editor - no ? 
I implemented the whole thing without using Biojava (for this part)

    fr = new FileReader(fastaProteinFileName);
    br = new BufferedReader(fr);
    String fastaLine;
    String startAccession = '>' + accessionId.trim();
    String fastaEntry = "";
    boolean record = false;
    while ((fastaLine = br.readLine()) != null) {
        fastaLine = fastaLine.trim() + '\n';
        if (fastaLine.startsWith(startAccession)) {
            record = true;
        } else if (record && fastaLine.startsWith(">")) {
            record = false;
            break;
        }
        if (record) {
            fastaEntry += fastaLine;
        }
    }

Notice - I do not use regex - since I'd need to read the whole file and then regex upon it (if the record is the first one - I just read that one).

Cheers
JP

On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland wrote:
> The "Mark invalid" exception is indicating that the parser has gone too
> far ahead in the file looking for a valid header. I'm not sure why but
> looking at your original query, there may be extra newlines embedded
> into your FASTA header line? That would definitely confuse it.
>
> The parser is not able to currently pull out just one sequence - in
> effect this is a search facility, which it doesn't have.
:( > > thanks, > Richard > > JP wrote: > > Hi all at BioJava, > > > > I am trying to parse several FASTA files using the following code: > > > > fr = new FileReader(fastaProteinFileName); > >> br = new BufferedReader(fr); > >> > >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null); > >> while (protIter.hasNext()) { > >> BioEntry bioEntry = protIter.nextBioEntry(); > >> System.out.println (fastaProteinFileName + " == " + accessionId + " > = > >> " + bioEntry.getAccession()); > >> } > > > > > > At particular points in my fasta file - I get the following exception: > > > > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from > >> biojava library) > >> org.biojava.bio.BioException: Could not read sequence > >> at > >> > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113) > >> at > >> > org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99) > >> at > >> > edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60) > >> at > >> > edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64) > >> at > >> > edu.imperial.msc.orthologue.core.OrthologueFinder.(OrthologueFinder.java:51) > >> at > >> > edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60) > >> Caused by: java.io.IOException: Mark invalid > >> at java.io.BufferedReader.reset(Unknown Source) > >> at > >> > org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202) > >> at > >> > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110) > >> ... 5 more > > > > > > Interestingly if I delete the header portion of the header line (from > > type=protein... 
till the end of the line ...Dgri;) > > > >> FBpp0145468 type=protein; > >> > loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); > >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; > >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; > >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; > >> species=Dgri; > >> > > > > It works - but I have a number of these exceptions (and I do not want to > > edit the original data). Mind you I have longer headers in my file which > > are parsed OK (strange!). > > > > Any ideas anyone ? Alternatively - is there a better way how to get ONE > > SINGLE sequence from the whole fasta file give that I have the accession > id > > (FBpp0145468) ? > > > > Many Thanks > > JP > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > Richard Holland, BSc MBCS > Finance Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > From holland at eaglegenomics.com Tue Apr 28 15:21:25 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Tue, 28 Apr 2009 16:21:25 +0100 Subject: [Biojava-l] FASTA parsing bug ? In-Reply-To: <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> Message-ID: <49F71EF5.90702@eaglegenomics.com> You're right, doesn't look like newlines. The "Mark invalid" happens when the parser looks too far ahead in the file attempting to seek out the next valid sequence to parse. I'm not sure why this is happening. 
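For readers unfamiliar with where a "Mark invalid" message comes from, the following is a minimal sketch of the general java.io.BufferedReader mechanism only. It is not a reproduction of what FastaFormat does internally, and the buffer size, read-ahead limits, sample text, and all names (MarkInvalidDemo, sampleFasta, markReadReset) are assumptions chosen to make the effect visible: a reader is marked with a read-ahead limit, more characters than that limit are consumed, and reset() then fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidDemo {

    /** Builds a tiny FASTA-like text whose header line has headerLength filler characters. */
    static String sampleFasta(int headerLength) {
        StringBuilder sb = new StringBuilder(">id1 ");
        for (int i = 0; i < headerLength; i++) {
            sb.append('x');
        }
        sb.append("\nMKT\n");
        return sb.toString();
    }

    /**
     * Marks the reader, reads one line, then tries to rewind to the mark.
     * Returns "reset ok" on success, otherwise the IOException message.
     */
    static String markReadReset(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader br = new BufferedReader(new StringReader(text), bufferSize)) {
            br.mark(readAheadLimit); // remember the current position
            br.readLine();           // consume the (long) header line
            br.reset();              // try to go back to the mark
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String text = sampleFasta(200);
        // The header is far longer than the 8-character read-ahead limit,
        // so the mark is discarded while the line is being read.
        System.out.println(markReadReset(text, 16, 8));
        // A read-ahead limit larger than the header keeps the mark valid.
        System.out.println(markReadReset(text, 16, 1024));
    }
}
```

With these assumed values the first call reports "Mark invalid" and the second "reset ok". Whether the header length is what triggers the failure in this particular file is not established by this sketch.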

I don't have the time to test right now but if you could post the link
to where someone could download the same FASTA as you're using, then it
would make it possible for someone else to investigate in more detail.

thanks,
Richard

JP wrote:
> Thanks Richard for your prompt reply.
>
> I will not attach the fasta file I am parsing (12MB) its
> dgri-all-translation-r1.3.fasta from the flybase project.
>
> If the file had any extra new lines I would see them when I loaded it in
> a text editor - no ?
>
> I implemented the whole thing without using Biojava (for this part)
>
>     fr = new FileReader(fastaProteinFileName);
>     br = new BufferedReader(fr);
>     String fastaLine;
>     String startAccession = '>' + accessionId.trim();
>     String fastaEntry = "";
>     boolean record = false;
>     while ((fastaLine = br.readLine()) != null) {
>         fastaLine = fastaLine.trim() + '\n';
>         if (fastaLine.startsWith(startAccession)) {
>             record = true;
>         } else if (record && fastaLine.startsWith(">")) {
>             record = false;
>             break;
>         }
>         if (record) {
>             fastaEntry += fastaLine;
>         }
>     }
>
> Notice - I do not use regex - since I'd need to read the whole file and
> then regex upon it (if the record is the first one - I just read that one).
>
> Cheers
> JP

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Tue Apr 28 14:27:36 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 28 Apr 2009 15:27:36 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
Message-ID: <49F71258.3060103@eaglegenomics.com>

The "Mark invalid" exception is indicating that the parser has gone too
far ahead in the file looking for a valid header.
I'm not sure why but
looking at your original query, there may be extra newlines embedded
into your FASTA header line? That would definitely confuse it.

The parser is not able to currently pull out just one sequence - in
effect this is a search facility, which it doesn't have. :(

thanks,
Richard

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jogoodma at indiana.edu Wed Apr 29 03:08:43 2009
From: jogoodma at indiana.edu (Josh Goodman)
Date: Tue, 28 Apr 2009 23:08:43 -0400 (EDT)
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <49F71EF5.90702@eaglegenomics.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
 <49F71258.3060103@eaglegenomics.com>
 <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com>
 <49F71EF5.90702@eaglegenomics.com>
Message-ID:

Hi Richard and JP,

I think I can be of some help as I'm the FlyBase developer responsible for
generating these troublesome FASTA files :-). The cause of this problem
appears to be the description line length for the record FBpp0145470.

The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
at line 196.
Biojava correctly reads in FBpp0145468 but throws an error
when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
but when biojava reaches the end of the sequence it reads in the header
for the next record (FBpp0145470). It then tries to reset the
BufferedReader to the start of FBpp0145470 but that is where the exception
is thrown because line 197 sets the read ahead limit to 500 characters and
the reader.readLine() command exceeds that limit.

What isn't obvious to me is why other large definition lines that precede
that line don't throw the same error (e.g. FBpp0157909). I guess the
javadoc on BufferedReader.mark() does say "may fail" but I assumed it
would be more predictable than that.

The file in question can be downloaded from
ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.

If there is interest in a solution that doesn't involve simply upping the
read ahead limit I can put a patch file together in the next day or so.

Cheers,
Josh

From jp at javaclass.co.uk Wed Apr 29 07:13:02 2009
From: jp at javaclass.co.uk (JP)
Date: Wed, 29 Apr 2009 08:13:02 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To:
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
 <49F71258.3060103@eaglegenomics.com>
 <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com>
 <49F71EF5.90702@eaglegenomics.com>
Message-ID: <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com>

This is why we all love the internet and the community.
What is the chance of this happening ? You are speaking about World Peace,
and Kofi Annan butts in. :)

I found that strange also (that there are larger headers preceding the
troublesome one). Maybe (and this is a long shot) there is some buffer
which gets filled at that particular record or point in file ? (Does the
error move record if we delete a couple of initial Fasta entries ?)

Mind you this is NOT the only flybase fasta file I get errors with (same
happens with dpse one v2.3 - and I am sure there are others).

I am interested in the solution, so are a ton of other people who use
biojava and particularly verbose fasta files.

I love flybase and biojava
JP

From markjschreiber at gmail.com Wed Apr 29 08:31:00 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 29 Apr 2009 16:31:00 +0800
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com>
 <49F71258.3060103@eaglegenomics.com>
 <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com>
 <49F71EF5.90702@eaglegenomics.com>
 <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com>
Message-ID: <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com>

People who know me will know I am not a big fan of FASTA format. Sure
it was useful in the days of FORTRAN but we really need to move on.
I'm not sure the people who started the format foresaw the kind of
abuse that the "format" would get.

What I would much prefer is something that looks like the BioEntry
table of BioSQL plus the BioSequence information in some kind of (dare
I say it) XML format. It would certainly be tidier and vastly more
machine readable, for a start not all the metadata would need to be on
the description line in no specific order.
I think by limiting it to those two tables you get most of the key metadata without all the cruft that comes with more extensive XML formats. It would be a bit less user friendly for people pasting sequences into webforms (although I think FASTA is fine for that) but much better for data distribution, webservices, machine processing etc. Anyhow, that's enough venting. I don't wan't to start somekind of holy war or anything... - Mark ps. Sorry for getting off topic. On Wed, Apr 29, 2009 at 3:13 PM, JP wrote: > > This is why we all love the internet and the community. > What is the chance of this happening ? ?You are speaking about World Peace, > and Kofi Annan butts in. :) > > I found that strange also (that there are larger headers preceding the > troublesome one). ?Maybe (and this is a long shot) there is some buffer > which gets filled at that particular record or point in file ? ?(Does the > error move record if we delete a couple of initial Fasta entries ?) > > Mind you this is NOT the only flybase fasta file I get errors with (same > happens with dpse one v2.3 - and I am sure there are others). > > I am interested in the solution, so are a ton of other people who use > biojava and particularly verbose fasta files. > > I love flybase and biojava > JP > > On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman wrote: > > > > > Hi Richard and JP, > > > > I think I can be of some help as I'm the FlyBase developer responsible for > > generating these troublesome FASTA files :-). ?The cause of this problem > > appears to be the description line length for the record FBpp0145470. > > > > The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop > > at line 196. ?Biojava correctly reads in FBpp0145468 but throws an error > > when trying to parse FBpp0145469. ?There is nothing wrong in FBpp0145469 > > but when biojava reaches the end of the sequence it reads in the header > > for the next record (FBpp0145470). 
?It then tries to reset the > > BufferedReader to the start of FBpp0145470 but that is where the exception > > is thrown because line 197 sets the read ahead limit to 500 characters and > > the reader.readLine() command exceeds that limit. > > > > What isn't obvious to me is why other large definition lines that precede > > that line don't throw the same error (e.g. FBpp0157909). ?I guess the > > javadoc on BufferedReader.mark() does say "may fail" but I assumed it > > would be more predictable than that. > > > > The file in question can be downloaded from > > > > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz > > . > > > > If there is interest in a solution that doesn't involve simply upping the > > read ahead limit I can put a patch file together in the next day or so. > > > > Cheers, > > Josh > > > > On Tue, 28 Apr 2009, Richard Holland wrote: > > > > > You're right, doesn't look like newlines. > > > > > > The "Mark invalid" happens when the parser looks too far ahead in the > > > file attempting to seek out the next valid sequence to parse. I'm not > > > sure why this is happening. > > > > > > I don't have the time to test right now but if you could post the link > > > to where someone could download the same FASTA as you're using, then it > > > would make it possible for someone else to investigate in more detail. > > > > > > thanks, > > > Richard > > > > > > JP wrote: > > > > Thanks Richard for your prompt reply. > > > > > > > > I will not attach the fasta file I am parsing (12MB) its > > > > dgri-all-translation-r1.3.fasta from the flybase project. > > > > > > > > If the file had any extra new lines I would see them when I loaded it > > in > > > > a text editor - no ? > > > > > > > > I implemented the whole thing without using Biojava (for this part) > > > > > > > > ? ? fr = new FileReader(fastaProteinFileName); > > > > ? ? br = new BufferedReader(fr); > > > > ? ? String fastaLine; > > > > ? ? 
From holland at eaglegenomics.com Wed Apr 29 09:49:58 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 29 Apr 2009 10:49:58 +0100
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To:
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com>
Message-ID: <49F822C6.1020809@eaglegenomics.com>

I'd love to see a proper solution to this that doesn't involve upping the read-ahead limit. I was aware that it might be the issue, but had no idea why it was not failing for other similar long sequences.

I look forward to seeing your suggested fix!

thanks,
Richard

Josh Goodman wrote:
> Hi Richard and JP,
>
> I think I can be of some help as I'm the FlyBase developer responsible for
> generating these troublesome FASTA files :-). The cause of this problem
> appears to be the description line length for the record FBpp0145470.
>
> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> at line 196. Biojava correctly reads in FBpp0145468 but throws an error
> when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
> but when biojava reaches the end of the sequence it reads in the header
> for the next record (FBpp0145470).
It then tries to reset the > BufferedReader to the start of FBpp0145470 but that is where the exception > is thrown because line 197 sets the read ahead limit to 500 characters and > the reader.readLine() command exceeds that limit. > > What isn't obvious to me is why other large definition lines that precede > that line don't throw the same error (e.g. FBpp0157909). I guess the > javadoc on BufferedReader.mark() does say "may fail" but I assumed it > would be more predictable than that. > > The file in question can be downloaded from > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz. > > If there is interest in a solution that doesn't involve simply upping the > read ahead limit I can put a patch file together in the next day or so. > > Cheers, > Josh > > On Tue, 28 Apr 2009, Richard Holland wrote: > >> You're right, doesn't look like newlines. >> >> The "Mark invalid" happens when the parser looks too far ahead in the >> file attempting to seek out the next valid sequence to parse. I'm not >> sure why this is happening. >> >> I don't have the time to test right now but if you could post the link >> to where someone could download the same FASTA as you're using, then it >> would make it possible for someone else to investigate in more detail. >> >> thanks, >> Richard >> >> JP wrote: >>> Thanks Richard for your prompt reply. >>> >>> I will not attach the fasta file I am parsing (12MB) its >>> dgri-all-translation-r1.3.fasta from the flybase project. >>> >>> If the file had any extra new lines I would see them when I loaded it in >>> a text editor - no ? 
>>> >>> I implemented the whole thing without using Biojava (for this part) >>> >>> fr = new FileReader(fastaProteinFileName); >>> br = new BufferedReader(fr); >>> String fastaLine; >>> String startAccession = '>' + accessionId.trim(); >>> String fastaEntry = ""; >>> boolean record = false; >>> while ((fastaLine = br.readLine()) != null) { >>> fastaLine = fastaLine.trim() + '\n'; >>> if (fastaLine.startsWith(startAccession)) { >>> record = true; >>> } else if (record && fastaLine.startsWith(">")) { >>> record = false; >>> break; >>> } >>> if (record) { >>> fastaEntry += fastaLine; >>> } >>> } >>> >>> >>> Notice - I do not use regex - since I'd need to read the whole file and >>> then regex upon it (if the record is the first one - I just read that one). >>> >>> Cheers >>> JP >>> >>> >>> On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland >>> > wrote: >>> >>> The "Mark invalid" exception is indicating that the parser has gone too >>> far ahead in the file looking for a valid header. I'm not sure why but >>> looking at your original query, there may be extra newlines embedded >>> into your FASTA header line? That would definitely confuse it. >>> >>> The parser is not able to currently pull out just one sequence - in >>> effect this is a search facility, which it doesn't have. 
:( >>> >>> thanks, >>> Richard >>> >>> JP wrote: >>> > Hi all at BioJava, >>> > >>> > I am trying to parse several FASTA files using the following code: >>> > >>> > fr = new FileReader(fastaProteinFileName); >>> >> br = new BufferedReader(fr); >>> >> >>> >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null); >>> >> while (protIter.hasNext()) { >>> >> BioEntry bioEntry = protIter.nextBioEntry(); >>> >> System.out.println (fastaProteinFileName + " == " + >>> accessionId + " = >>> >> " + bioEntry.getAccession()); >>> >> } >>> > >>> > >>> > At particular points in my fasta file - I get the following exception: >>> > >>> > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from >>> >> biojava library) >>> >> org.biojava.bio.BioException: Could not read sequence >>> >> at >>> >> >>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113) >>> >> at >>> >> >>> org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99) >>> >> at >>> >> >>> edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60) >>> >> at >>> >> >>> edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64) >>> >> at >>> >> >>> edu.imperial.msc.orthologue.core.OrthologueFinder.(OrthologueFinder.java:51) >>> >> at >>> >> >>> edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60) >>> >> Caused by: java.io.IOException: Mark invalid >>> >> at java.io.BufferedReader.reset(Unknown Source) >>> >> at >>> >> >>> org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202) >>> >> at >>> >> >>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110) >>> >> ... 5 more >>> > >>> > >>> > Interestingly if I delete the header portion of the header line (from >>> > type=protein... 
till the end of the line ...Dgri;) >>> > >>> >> FBpp0145468 type=protein; >>> >> >>> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); >>> >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; >>> >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; >>> >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; >>> >> species=Dgri; >>> >> >>> > >>> > It works - but I have a number of these exceptions (and I do not >>> want to >>> > edit the original data). Mind you I have longer headers in my >>> file which >>> > are parsed OK (strange!). >>> > >>> > Any ideas anyone ? Alternatively - is there a better way how to >>> get ONE >>> > SINGLE sequence from the whole fasta file give that I have the >>> accession id >>> > (FBpp0145468) ? >>> > >>> > Many Thanks >>> > JP >>> > _______________________________________________ >>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> >>> > http://lists.open-bio.org/mailman/listinfo/biojava-l >>> > >>> >>> -- >>> Richard Holland, BSc MBCS >>> Finance Director, Eagle Genomics Ltd >>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >>> >>> http://www.eaglegenomics.com/ >>> >>> >> -- >> Richard Holland, BSc MBCS >> Finance Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From markjschreiber at gmail.com Wed Apr 29 14:33:27 2009 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 29 Apr 2009 22:33:27 +0800 Subject: [Biojava-l] FASTA 
parsing bug ? In-Reply-To: <93b45ca50904290726n4149ce7bhb5e9e82467982fb@mail.gmail.com> References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com> <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> <93b45ca50904290726n4149ce7bhb5e9e82467982fb@mail.gmail.com> Message-ID: <93b45ca50904290733k68afb5b0na661588b4f09d804@mail.gmail.com> I can understand a bench scientist wanting FASTA but a computational biologist. They should be ashamed! With some of the friendly XPath implementations in common scripting languages there really is no excuse. It's easier to parse XML than FASTA in Groovy, Perl, Python and Ruby. Probably Java and C as well. The state of bioinformatics data formats is cringe worthy. Let's try and enter the 21st century! OK I'm ranting again. Maybe I'll go join twitter. - Mark On 29 Apr 2009, 10:04 PM, "Josh Goodman" wrote: Hi Mark, I couldn't agree with you more, which is why we also provide this data in GFF and Chado XML formats, Chado PostgreSQL dumps, and a public read only Chado database. However, no matter how much we try to encourage use of the other formats users still flock to the good old FASTA files. There are a variety of reasons but the most common case involves bench scientists and/or programmers who run at the sight of anything more complex than a FASTA file. I've toyed with the idea of reducing the data we cram into the headers to gently try to encourage use of the other more sensible formats. However, at the end of the day we (FlyBase) serve at the behest of our user community and this is what they want to see. Cheers, Josh On Wed, 29 Apr 2009, Mark Schreiber wrote: > People who know me will know I am not a big fan of F... 
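
A note on the "Mark invalid" diagnosis discussed in this thread: the behaviour can be reproduced outside BioJava with a plain BufferedReader. The sketch below is not the BioJava code; the names MarkResetDemo, resetSucceeds and recordWithHeader are made up for illustration. It assumes an OpenJDK-style BufferedReader, where the read-ahead limit appears to be checked only when the internal buffer has to be refilled. If that assumption holds, it would explain why some long header lines get through and others do not.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetDemo {

    /**
     * Marks the reader with the given read-ahead limit, reads one line,
     * then tries to reset. Returns true if reset() succeeds and false
     * if it throws an IOException (for example "Mark invalid").
     */
    static boolean resetSucceeds(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader br = new BufferedReader(new StringReader(text), bufferSize)) {
            br.mark(readAheadLimit);
            br.readLine();
            try {
                br.reset();
                return true;
            } catch (IOException markInvalid) {
                return false;
            }
        } catch (IOException e) {
            // Not expected for an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a FASTA-like record whose header line has headerLength filler characters. */
    static String recordWithHeader(int headerLength) {
        StringBuilder sb = new StringBuilder(">");
        for (int i = 0; i < headerLength; i++) {
            sb.append('x');
        }
        return sb.append("\nACGT\n").toString();
    }

    public static void main(String[] args) {
        String record = recordWithHeader(1000);
        // Same header, same limit of 500; only the reader's buffer size differs.
        System.out.println("small buffer: " + resetSucceeds(record, 16, 500));
        System.out.println("large buffer: " + resetSucceeds(record, 8192, 500));
    }
}
```

With a 1000-character header and a limit of 500, the small-buffer call should fail to reset while the large-buffer call should succeed. In other words, whether a given long header breaks a mark()/reset() based parser may depend on where the header falls relative to the reader's buffer, not only on its length.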
From SMarkel at accelrys.com Wed Apr 29 19:53:10 2009
From: SMarkel at accelrys.com (Scott Markel)
Date: Wed, 29 Apr 2009 15:53:10 -0400
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To:
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com>
Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A010637@exch1-hi.accelrys.net>

A quick note in order to add one more data point.

While looking at this it should be kept in mind that NCBI's nonredundant database FASTA files (nr.fa and nt.fa) use ctrl-A characters to concatenate multiple descriptions. These concatenated descriptions can be thousands of characters long. I've got one that I use as a test case that has 378,260 characters (5204 concatenated descriptions). It's a 98 residue sequence for "NADH dehydrogenase subunit 4L". I'm not saying it's right, just that cases like this do exist.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email: smarkel at accelrys.com
Accelrys (SciTegic R&D)             mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice: +1 858 799 5603
San Diego, CA 92121                 fax: +1 858 799 5222
USA                                 web: http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Vice President, Board of Directors: International Society for Computational Biology
Co-chair: ISCB Publications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics
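
On the question raised in this thread of pulling ONE record out of a FASTA file by accession: JP's readLine() loop can be tightened a little. The following is a sketch along the same lines, not a BioJava facility; FastaRecordFinder, findRecord and headerId are hypothetical names. It assumes the accession is the first whitespace-delimited token after '>' (as in the FlyBase headers quoted in this thread), compares it exactly rather than by prefix, and collects lines in a StringBuilder instead of concatenating Strings. Because it never calls mark() or reset(), the length of a header line does not matter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class FastaRecordFinder {

    /**
     * Streams through FASTA text and returns the record (header plus sequence
     * lines, newline-terminated) whose accession equals accessionId.
     * Returns null if no such record is found.
     */
    static String findRecord(Reader in, String accessionId) {
        String wanted = accessionId.trim();
        StringBuilder entry = new StringBuilder();
        boolean recording = false;
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (recording) {
                        break; // reached the next record's header, so we are done
                    }
                    recording = headerId(line).equals(wanted);
                }
                if (recording) {
                    entry.append(line.trim()).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entry.length() == 0 ? null : entry.toString();
    }

    /** Returns the first whitespace-delimited token after the leading '>'. */
    static String headerId(String headerLine) {
        String rest = headerLine.substring(1).trim();
        int end = 0;
        while (end < rest.length() && !Character.isWhitespace(rest.charAt(end))) {
            end++;
        }
        return rest.substring(0, end);
    }
}
```

A caller would wrap the file in a FileReader, for example findRecord(new FileReader(fastaProteinFileName), accessionId). Like the original loop, this stops reading as soon as the wanted record is complete, so a record near the start of the file is cheap to fetch.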
From markjschreiber at gmail.com Thu Apr 30 03:01:20 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 30 Apr 2009 11:01:20 +0800
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <616a29410904291528i1f2a4aag34988a7d036bcbe4@mail.gmail.com>
References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com> <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> <93b45ca50904290726n4149ce7bhb5e9e82467982fb@mail.gmail.com> <93b45ca50904290733k68afb5b0na661588b4f09d804@mail.gmail.com> <616a29410904291528i1f2a4aag34988a7d036bcbe4@mail.gmail.com>
Message-ID: <93b45ca50904292001n7f24947bh6f57dfb07eb73641@mail.gmail.com>

A minimal XML equivalent to Fasta would look like this:

ACGTGCACGCTGCACGT

I think a biologist could handle that and it is much easier to parse than FASTA because it is well formed.
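
The markup of the example above did not survive in this archive copy, so the exact tags are unknown. Purely as an illustration of the point being made, the sketch below assumes a hypothetical element of the form <sequence id="...">RESIDUES</sequence> (both the element name and the attribute name are invented here) and converts it to FASTA with the JDK's built-in DOM parser. MinimalSeqXml and toFasta are likewise illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class MinimalSeqXml {

    /**
     * Converts one hypothetical <sequence id="...">RESIDUES</sequence>
     * element into a two-line FASTA record.
     */
    static String toFasta(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String id = root.getAttribute("id");           // assumed attribute name
            String residues = root.getTextContent().trim();
            return ">" + id + "\n" + residues + "\n";
        } catch (Exception e) {
            // The parser rejects input that is not well formed.
            throw new IllegalArgumentException("Not well-formed sequence XML", e);
        }
    }
}
```

The parser itself rejects input that is not well formed, which is the property being argued for; a hand-written FASTA reader has to make its own decisions about odd headers and stray line breaks.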
You don't even need to use an XML parser. You could even convert this to FASTA using a text editor with a few find and replace expressions. Possibly this would be easier to handle even for someone who can't program at all?

Of course you could make it a lot more sophisticated but then you are approximating GenbankXML or something similar.

Remember BioJava has FASTA parsers made by experienced programmers and over 10 years of testing and bug fixes and people still manage to break them. This indicates to me that the FASTA format is bad and should be voted off the island.

- Mark

On Thu, Apr 30, 2009 at 6:28 AM, simon rayner wrote:
> don't forget that a lot of the people doing bioinformatics are biologists
> with no formal training. They want to get the job done in the easiest
> possible way and aren't really concerned about the details. If you want
> people to switch to XML for example, the whole concept needs to be made more
> accessible. I'm still struggling to get my students to adopt XML.
>
> It seems that more basic tutorials would be useful - but in a less formal
> style that would be easier for newcomers to follow. Are there any feelings
> about trying to develop this side of the Biojava project? I thought about
> trying to add some stuff, but my java programming is embarrassingly poor and
> I thought I would be laughed off the website.
>
> Simon
>
> --
> Simon Rayner
>
> State Key Laboratory of Virology
> Wuhan Institute of Virology
> Chinese Academy of Sciences
> Wuhan, Hubei 430071
> P.R.China
>
> +86 (27) 87199895 (office)
> +86 15972923715 (cell)

From jogoodma at indiana.edu Wed Apr 29 14:04:42 2009
From: jogoodma at indiana.edu (Josh Goodman)
Date: Wed, 29 Apr 2009 14:04:42 -0000
Subject: [Biojava-l] FASTA parsing bug ?
In-Reply-To: <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> References: <4adc29060904280701q5d3dc760mb018f6b38a9e056f@mail.gmail.com> <49F71258.3060103@eaglegenomics.com> <4adc29060904280759q4b27a4eembd28974d46199532@mail.gmail.com> <49F71EF5.90702@eaglegenomics.com> <4adc29060904290013u62e5a4b0y4fefa93865e9a3ae@mail.gmail.com> <93b45ca50904290131m78b10b2av224c66413313966a@mail.gmail.com> Message-ID: Hi Mark, I couldn't agree with you more, which is why we also provide this data in GFF and Chado XML formats, Chado PostgreSQL dumps, and a public read only Chado database. However, no matter how much we try to encourage use of the other formats users still flock to the good old FASTA files. There are a variety of reasons but the most common case involves bench scientists and/or programmers who run at the sight of anything more complex than a FASTA file. I've toyed with the idea of reducing the data we cram into the headers to gently try to encourage use of the other more sensible formats. However, at the end of the day we (FlyBase) serve at the behest of our user community and this is what they want to see. Cheers, Josh On Wed, 29 Apr 2009, Mark Schreiber wrote: > People who know me will know I am not a big fan of FASTA format. Sure > it was useful in the days of FORTRAN but we really need to move on. > I'm not sure the people who started the format foresaw the kind of > abuse that the "format" would get. > > What I would much prefer is something that looks like the BioEntry > table of BioSQL plus the BioSequence information in somekind of (dare > I say it) XML format. It would certainly be tidier and vastly more > machine readable, for a start not all the metadata would need to be on > the description line in no specific order. I think by limiting it to > those two tables you get most of the key metadata without all the > cruft that comes with more extensive XML formats. 
> > It would be a bit less user friendly for people pasting sequences into > webforms (although I think FASTA is fine for that) but much better for > data distribution, webservices, machine processing etc. > > Anyhow, that's enough venting. I don't wan't to start somekind of holy > war or anything... > > - Mark > > ps. Sorry for getting off topic. > > On Wed, Apr 29, 2009 at 3:13 PM, JP wrote: > > > > This is why we all love the internet and the community. > > What is the chance of this happening ? ?You are speaking about World Peace, > > and Kofi Annan butts in. :) > > > > I found that strange also (that there are larger headers preceding the > > troublesome one). ?Maybe (and this is a long shot) there is some buffer > > which gets filled at that particular record or point in file ? ?(Does the > > error move record if we delete a couple of initial Fasta entries ?) > > > > Mind you this is NOT the only flybase fasta file I get errors with (same > > happens with dpse one v2.3 - and I am sure there are others). > > > > I am interested in the solution, so are a ton of other people who use > > biojava and particularly verbose fasta files. > > > > I love flybase and biojava > > JP > > > > On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman wrote: > > > > > > > > Hi Richard and JP, > > > > > > I think I can be of some help as I'm the FlyBase developer responsible for > > > generating these troublesome FASTA files :-). ?The cause of this problem > > > appears to be the description line length for the record FBpp0145470. > > > > > > The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop > > > at line 196. ?Biojava correctly reads in FBpp0145468 but throws an error > > > when trying to parse FBpp0145469. ?There is nothing wrong in FBpp0145469 > > > but when biojava reaches the end of the sequence it reads in the header > > > for the next record (FBpp0145470). 
?It then tries to reset the > > > BufferedReader to the start of FBpp0145470 but that is where the exception > > > is thrown because line 197 sets the read ahead limit to 500 characters and > > > the reader.readLine() command exceeds that limit. > > > > > > What isn't obvious to me is why other large definition lines that precede > > > that line don't throw the same error (e.g. FBpp0157909). ?I guess the > > > javadoc on BufferedReader.mark() does say "may fail" but I assumed it > > > would be more predictable than that. > > > > > > The file in question can be downloaded from > > > > > > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz > > > . > > > > > > If there is interest in a solution that doesn't involve simply upping the > > > read ahead limit I can put a patch file together in the next day or so. > > > > > > Cheers, > > > Josh > > > > > > On Tue, 28 Apr 2009, Richard Holland wrote: > > > > > > > You're right, doesn't look like newlines. > > > > > > > > The "Mark invalid" happens when the parser looks too far ahead in the > > > > file attempting to seek out the next valid sequence to parse. I'm not > > > > sure why this is happening. > > > > > > > > I don't have the time to test right now but if you could post the link > > > > to where someone could download the same FASTA as you're using, then it > > > > would make it possible for someone else to investigate in more detail.. > > > > > > > > thanks, > > > > Richard > > > > > > > > JP wrote: > > > > > Thanks Richard for your prompt reply. > > > > > > > > > > I will not attach the fasta file I am parsing (12MB) its > > > > > dgri-all-translation-r1.3.fasta from the flybase project. > > > > > > > > > > If the file had any extra new lines I would see them when I loaded it > > > in > > > > > a text editor - no ? > > > > > > > > > > I implemented the whole thing without using Biojava (for this part) > > > > > > > > > > ? ? 
fr = new FileReader(fastaProteinFileName); > > > > > ? ? br = new BufferedReader(fr); > > > > > ? ? String fastaLine; > > > > > ? ? String startAccession = '>' + accessionId.trim(); > > > > > ? ? String fastaEntry = ""; > > > > > ? ? boolean record = false; > > > > > ? ? while ((fastaLine = br.readLine()) != null) { > > > > > ? ? ? ? fastaLine = fastaLine.trim() + '\n'; > > > > > ? ? ? ? if (fastaLine.startsWith(startAccession)) { > > > > > ? ? ? ? ? ? record = true; > > > > > ? ? ? ? } else if (record && fastaLine.startsWith(">")) { > > > > > ? ? ? ? ? ? record = false; > > > > > ? ? ? ? ? ? break; > > > > > ? ? ? ? } > > > > > ? ? ? ? if (record) { > > > > > ? ? ? ? ? ? fastaEntry += fastaLine; > > > > > ? ? ? ? } > > > > > ? ? } > > > > > > > > > > > > > > > Notice - I do not use regex - since I'd need to read the whole file and > > > > > then regex upon it (if the record is the first one - I just read that > > > one). > > > > > > > > > > Cheers > > > > > JP > > > > > > > > > > > > > > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland > > > > > > wrote: > > > > > > > > > > ? ? The "Mark invalid" exception is indicating that the parser has gone > > > too > > > > > ? ? far ahead in the file looking for a valid header. I'm not sure why > > > but > > > > > ? ? looking at your original query, there may be extra newlines > > > embedded > > > > > ? ? into your FASTA header line? That would definitely confuse it. > > > > > > > > > > ? ? The parser is not able to currently pull out just one sequence - in > > > > > ? ? effect this is a search facility, which it doesn't have. :( > > > > > > > > > > ? ? thanks, > > > > > ? ? Richard > > > > > > > > > > ? ? JP wrote: > > > > > ? ? > Hi all at BioJava, > > > > > ? ? > > > > > > ? ? > I am trying to parse several FASTA files using the following > > > code: > > > > > ? ? > > > > > > ? ? > fr = new FileReader(fastaProteinFileName); > > > > > ? ? >> br = new BufferedReader(fr); > > > > > ? ? >> > > > > > ? ? 
> > > > >     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> > > > >     >> while (protIter.hasNext()) {
> > > > >     >>      BioEntry bioEntry = protIter.nextBioEntry();
> > > > >     >>      System.out.println (fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
> > > > >     >> }
> > > > >     >
> > > > >     >
> > > > >     > At particular points in my fasta file - I get the following exception:
> > > > >     >
> > > > >     > 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from
> > > > >     >> biojava library)
> > > > >     >> org.biojava.bio.BioException: Could not read sequence
> > > > >     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> > > > >     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> > > > >     >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> > > > >     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> > > > >     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.(OrthologueFinder.java:51)
> > > > >     >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> > > > >     >> Caused by: java.io.IOException: Mark invalid
> > > > >     >>     at java.io.BufferedReader.reset(Unknown Source)
> > > > >     >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> > > > >     >>     at
> > > > >     >>     org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > > > >     >>     ... 5 more
> > > > >     >
> > > > >     >
> > > > >     > Interestingly if I delete the header portion of the header line (from
> > > > >     > type=protein... till the end of the line ...Dgri;)
> > > > >     >
> > > > >     >> FBpp0145468 type=protein;
> > > > >     >> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
> > > > >     >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
> > > > >     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
> > > > >     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
> > > > >     >> species=Dgri;
> > > > >     >>
> > > > >     >
> > > > >     > It works - but I have a number of these exceptions (and I do not want to
> > > > >     > edit the original data).  Mind you I have longer headers in my file which
> > > > >     > are parsed OK (strange!).
> > > > >     >
> > > > >     > Any ideas anyone ?  Alternatively - is there a better way how to get ONE
> > > > >     > SINGLE sequence from the whole fasta file give that I have the accession id
> > > > >     > (FBpp0145468) ?
> > > > >     >
> > > > >     > Many Thanks
> > > > >     > JP
> > > > >     > _______________________________________________
> > > > >     > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > > >     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > > >     >
> > > > >
> > > > >     --
> > > > >     Richard Holland, BSc MBCS
> > > > >     Finance Director, Eagle Genomics Ltd
> > > > >     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > > > >     http://www.eaglegenomics.com/
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Holland, BSc MBCS
> > > > Finance Director, Eagle Genomics Ltd
> > > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > > > http://www.eaglegenomics.com/
> > > > _______________________________________________
> > > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > >
> > >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
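
On the question quoted above of why some long definition lines survive a 500-character read-ahead limit while others throw "Mark invalid": a possible explanation, going by how the OpenJDK BufferedReader behaves as far as I know, is that the mark is only checked when the internal buffer has to be refilled. A long line that happens to sit entirely inside the current buffer can still be reset, while the same line straddling a buffer boundary cannot. The sketch below is not BioJava code. The class name MarkResetDemo, the method resetSurvives and all the sizes are made up for illustration, and the behaviour is an implementation detail rather than something the javadoc promises.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class, not part of BioJava.
class MarkResetDemo {

    // Builds a line of n 'A' characters followed by a newline.
    static String lineOf(int n) {
        StringBuilder sb = new StringBuilder(n + 1);
        for (int i = 0; i < n; i++) {
            sb.append('A');
        }
        return sb.append('\n').toString();
    }

    /**
     * Reads and discards a first line of prefixLength characters (skipped when
     * prefixLength is 0), then marks the reader, reads a second line of
     * lineLength characters and tries to reset.
     * Returns true if reset() worked, false if it threw an IOException.
     */
    static boolean resetSurvives(int bufferSize, int readAheadLimit,
                                 int prefixLength, int lineLength) throws IOException {
        String data = (prefixLength > 0 ? lineOf(prefixLength) : "") + lineOf(lineLength);
        BufferedReader reader = new BufferedReader(new StringReader(data), bufferSize);
        if (prefixLength > 0) {
            reader.readLine();           // move the read position further into the buffer
        }
        reader.mark(readAheadLimit);     // same pattern as mark(500) followed by readLine()
        reader.readLine();
        try {
            reader.reset();
            return true;
        } catch (IOException e) {        // "Mark invalid" on the JDKs I have looked at
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // A 2000-character line at the very start of an 8192-character buffer.
        System.out.println("line at buffer start:      " + resetSurvives(8192, 500, 0, 2000));
        // The same 2000-character line, but starting 7001 characters into the buffer.
        System.out.println("line straddling a refill:  " + resetSurvives(8192, 500, 7000, 2000));
        // A line longer than the whole buffer.
        System.out.println("line longer than buffer:   " + resetSurvives(8192, 500, 0, 10000));
    }
}
```

If that is what is going on, the failure depends on where a header falls relative to the reader's buffer and not only on its length. That would fit the observation that longer headers elsewhere in the same file parse without trouble.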