From holland at eaglegenomics.com Mon Sep 14 04:44:57 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 14 Sep 2009 09:44:57 +0100
Subject: [Biojava-l] BioJava Hackathon in January
Message-ID: <844D84C0-3363-4A3E-84B5-472C1E296B0E@eaglegenomics.com>

Hi all,

The January hackathon is now confirmed. It will take place from January 18th-22nd 2010, at the Wellcome Trust Genome Campus (at Hinxton, near Cambridge). Exact location on the campus will be determined nearer the time as it will depend on final numbers attending.

If you will definitely be attending, _please email me_ to let me know as soon as possible so that I can ensure we get a big enough room. This is also important because unless I know you're coming, I won't be able to arrange you a security pass to get onto the campus.

No funding is available at present so attendance will be at your own cost (or your employer's if you can persuade them!). If this changes I will let you know but don't hold your hopes too high!

There is no accommodation near the campus itself but there is a free bus shuttle linking it to central Cambridge where you can find hotels and hostels of all kinds and for all budgets.
The purpose of the hackathon is to make progress towards achieving the goals listed on this wiki page, which Andreas Prlic is maintaining:

http://biojava.org/wiki/BioJava:Modules

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andreas at sdsc.edu Mon Sep 14 11:33:27 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 14 Sep 2009 08:33:27 -0700
Subject: [Biojava-l] [Biojava-dev] BioJava Hackathon in January
In-Reply-To: <844D84C0-3363-4A3E-84B5-472C1E296B0E@eaglegenomics.com>
References: <844D84C0-3363-4A3E-84B5-472C1E296B0E@eaglegenomics.com>
Message-ID: <59a41c430909140833i2827cd30g5ac83f134a1b5f21@mail.gmail.com>

Excellent, thanks for organizing this, Richard! Looking forward to meeting you all in Cambridge!

Andreas

From dence at lsu.edu Mon Sep 14 16:25:20 2009
From: dence at lsu.edu (Daniel D Ence)
Date: Mon, 14 Sep 2009 15:25:20 -0500
Subject: [Biojava-l] Snow Leopard Check Out Problems
Message-ID: <1D5C148F9259BC47BC3CBD2F76ABA2050219A12F@email002.lsu.edu>

Hi,

I just discovered the BioJava project today when I searched to see if anyone else had written a Nexus file parser in java. I was excited to discover that this is part of what you guys are doing. I am trying to get biojava1.7-all.jar for some phylogenetics software I am writing. If I understood correctly, I download the "all" jar file and extract it using jar ("jar xf file.jar"). I am doing this from a bash terminal on MAC-OS X 10.6 on an Intel Core Duo chip.

However, when doing this command, "jar xf file.jar", I get the following error:

java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
        at java.util.zip.ZipInputStream.read(ZipInputStream.java:146)
        at sun.tools.jar.Main.extractFile(Main.java:849)
        at sun.tools.jar.Main.extract(Main.java:776)
        at sun.tools.jar.Main.run(Main.java:211)
        at sun.tools.jar.Main.main(Main.java:1044)

A folder titled "biojava-1.7" is created which contains two jar files, "apps-1.7.jar" and "biojava-1.7.jar" and a folder "apps".
The apps folder contains the following nested folders org->biojava->app. App contains two java files, BioFlatIndex.java and BioGetSeq.java.

It would appear that the EOF exception interrupts the extraction and I only get those two files.

I don't know if this is a jar issue on snow leopard or what. I guess it is possible that I am the first person to download a new copy of biojava since the Snow leopard release less than three weeks ago. The next thing I will try is to get the source code and the supporting jar files listed on the download pages and just compile it myself.

Any help you could offer on this would be appreciated.

Thanks,

Daniel Ence

From andreas at sdsc.edu Mon Sep 14 16:52:13 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 14 Sep 2009 13:52:13 -0700
Subject: [Biojava-l] Snow Leopard Check Out Problems
In-Reply-To: <1D5C148F9259BC47BC3CBD2F76ABA2050219A12F@email002.lsu.edu>
References: <1D5C148F9259BC47BC3CBD2F76ABA2050219A12F@email002.lsu.edu>
Message-ID: <59a41c430909141352m330d98fco33ce19bfa7e4f9e4@mail.gmail.com>

Hi Daniel,

I am also on Snow Leopard and can't confirm this problem. When I do a jar xvf biojava-1.7-all.jar it all extracts fine. Are you sure your download has fully completed before you expand the file?

Another thought: in order for gzip and tar to get installed on my system I had to manually install some optional packages from the Snow Leopard installation disk. Perhaps that is related somehow?

Andreas

From dence at lsu.edu Mon Sep 14 18:11:35 2009
From: dence at lsu.edu (Daniel D Ence)
Date: Mon, 14 Sep 2009 17:11:35 -0500
Subject: [Biojava-l] Snow Leopard Check Out Problems
References: <1D5C148F9259BC47BC3CBD2F76ABA2050219A12F@email002.lsu.edu> <59a41c430909141352m330d98fco33ce19bfa7e4f9e4@mail.gmail.com>
Message-ID: <1D5C148F9259BC47BC3CBD2F76ABA2050219A131@email002.lsu.edu>

I tried it again, after verifying that gzip and tar were installed on the system, and it worked this time. Thanks for the suggestions though.

Daniel Ence

From markjschreiber at gmail.com Sat Sep 19 12:19:52 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sun, 20 Sep 2009 00:19:52 +0800
Subject: [Biojava-l] Fwd: [Wg-phyloinformatics] bioinform article "Biomanycores"
In-Reply-To: <3031CF22FD1649048ECD7BA60485CC32@NewLife>
References: <3031CF22FD1649048ECD7BA60485CC32@NewLife>
Message-ID: <93b45ca50909190919g332033b5k4ef84866f3c85909@mail.gmail.com>

This may be of interest, especially as there is a biojava interface to this technology.

Mark

---------- Forwarded message ----------
From: "Mark A. Jensen"
Date: Sep 19, 2009 11:02 AM
Subject: [Wg-phyloinformatics] bioinform article "Biomanycores"
To: "Phyloinformatics Group"

Thought this might be of interest -- cheers, Mark

------------------------------
*Biomanycores Repository Seeks to Bridge Bioinformatics Software and GPU Computing*

September 18, 2009
By Vivien Marx

*As interest* in the use of graphics processing units for accelerating bioinformatics tasks continues to rise, a team of French researchers is looking to make it easier for users to find and use software that has been optimized for GPU-enabled high-performance computing.

The researchers, members of a bioinformatics group called Sequoia at Lille University, have created a repository for bioinformatics parallel code written in the Open Computing Language, OpenCL, or Nvidia's Compute Unified Device Architecture, CUDA.

The team launched the repository, called Biomanycores, in early 2009. It currently includes GPU-enabled implementations of three bioinformatics packages: SWcuda, a version of the Smith-Waterman algorithm; pknotsRG, an algorithm for predicting pseudoknots in an RNA sequence; and cudaPWM, an algorithm for scanning a position weight matrix against a DNA sequence.

Biomanycores' hosts are hoping that other developers will upload their software to the repository. In order to improve interoperability, all the software on the site so far has interfaces with BioPerl, BioJava, and BioPython.

"We are also looking for help to develop interfaces to Bio* frameworks," Jean-Stéphane Varré, a Sequoia group member and associate professor at INRIA Lille, told *BioInform*.

GPU computing offers considerable speedups for bioinformatics computing, generally between 3-fold and 50-fold, Varré said. For example, the SWcuda Smith-Waterman implementation in the repository, developed by researchers at the University of Padova, is about 30 times faster than Smith-Waterman on a CPU, he said. "This implies that you can make use of exact search instead of approximate search for comparing sequences." The cudaPWM package, meantime, runs 77 times faster on an Nvidia GTX 280 as compared to a CPU.

*Joiners Welcome*

The Lille team has been working with GPUs since 2007, and this year began a "partnership" with Nvidia, said Mathieu Giraud, another researcher within the Sequoia group who is affiliated with the French national research institute CNRS. He did not offer further details of the nature of the partnership.

When exploring ways to disseminate algorithms, "we decided the good way was to create a repository" for scientists to upload their own software or to download software from others, Giraud said. Biomanycores started with two in-house algorithms and the publicly available CUDA implementation of the Smith-Waterman algorithm, which was completed by the team at the University of Padova.

"We plan to continue the integration of new applications in the months to come," Giraud said, adding that likely applications will be in the areas of short-read analysis, RNA secondary structure prediction, and phylogeny.

Second-generation sequencers and other high-throughput technologies are driving "exponential" data growth that is outpacing Moore's law, Varré said. While some biology labs have set up powerful computers or even small grids, installation and maintenance of IT systems is difficult without onsite staff, he said.

"Since parallel computing using many-core processors could be an answer, with Biomanycores we would like to help the community to access the latest methods developed for such processors," Varré said.

Varré added that algorithm developers are not "the primary users" of Biomanycores. As an example of how it might be used in practice, he said that he and his colleagues are working with biologists to develop an analysis pipeline in the field of comparative genomics for non-coding sequence analysis.

"Using GPUs via Biomanycores in this pipeline will help avoid some hurdles," he said. "We think that it is the spirit of Biomanycores: providing very efficient implementations for the most complex or repetitive tasks."

But not everyone agrees that such a resource would be of immediate use for end-user biologists. "Libraries would help to accelerate the implementation process," Bertil Schmidt, a computer scientist at Nanyang Technological University in Singapore, told *BioInform* via e-mail. "However, Biomanycores interfaces are still in early stage of development."

Schmidt developed CudaSW++ to optimize Smith-Waterman sequence database searches on CUDA-enabled GPUs; it was published in *BMC Research Notes* in May. Schmidt's group is designing "efficient CUDA solutions" for a variety of bioinformatics applications and has worked on multiple sequence alignment, Smith-Waterman database scanning, motif finding, and molecular dynamics. "This work is important since it provides around one order-of-magnitude speedup on a single standard GPU," he said.

Schmidt said that he is planning on releasing CUDA-BLASTP and a CUDA-based assembly tool for Illumina short-read data soon. "We probably will publish those on sourceforge.net," he said in response to a question about whether he will deposit it on Biomanycores.

*Be Parallel*

Another aim of Biomanycores is to offer recommendations for scientists working with GPU computing. For example, developers should remember not to benchmark algorithms and hardware together, "a common pitfall in parallelism," Giraud said.

For some algorithms, there are "standard metrics," Varré said. For example, Smith-Waterman speed can be measured in MCUPS, millions of cells updated per second. This allows researchers "to compare algorithms across different implementations and hardware."

Good benchmarks should run the same algorithm on different processors, presenting and explaining the bottlenecks in each architecture, Giraud said. "It is also important to describe the nature of the speedups: is it against the same algorithm in a dummy one-core CPU version, or against an optimized CPU version that includes multithreading and [streaming single instruction, multiple data extensions] instructions?"

Although the bar for users in this area is still high, the Biomanycores team believes it is worth the effort. "A key point is how you access your memory. You usually have fast, but small, caches, and large, but slow, global memories," Giraud said.
With some processors, the cache is even explicitly managed, which means researchers must design algorithms that work on small subsets of data.

Varré said that researchers must choose programming models carefully. Current GPU architectures are mostly SIMD, or single instruction, multiple data, processors. "To be fully efficient, you should understand this model and fit your computations to it," he said.

*Going Bio*

Biomanycores offers a BioJava API, but scientists need the Java build tool Apache Ant as well as a number of CUDA programs before they can run Biomanycores interfaces. Users need to install "at least one of the three CUDA programs available in Biomanycores," said Stéphane Janot, another researcher in the Sequoia group. "That is not always easy, and will be one point we want to improve" to allow a better integration between BioJava and the CUDA programs. The team has similar plans for the BioPerl and BioPython APIs.

CUDA is nVidia's proprietary GPU programming model, and the programming language is "C with CUDA extensions," Andrew Humber, nVidia's senior PR manager for CUDA products told *BioInform* via e-mail.

There are many players in the GPU market, notably nVidia, AMD and others. The industry is moving toward adoption of the Open Computing Language, or OpenCL, a standard for "heterogeneous parallel programming" managed by an industry consortium called the Khronos Compute Working Group.

OpenCL is "largely based" on Nvidia's "own C language implementation, although it's slightly higher level," Humber said, adding that the company chairs the working group that helped ratify and release the OpenCL standard. "Snow Leopard is the first shipping OS to support OpenCL, but we also have drivers for Windows and Linux with developers today," he said. OpenCL runs on Nvidia's CUDA-enabled GPUs.

CUDA, in his view, offers flexibility because it allows the use of "other languages such as Fortran, which still has a great deal of traction in the high-performance computing space" along with Java and Python.

OpenCL stands to offer researchers flexibility in implementing their applications, Varré said. "Indeed, with OpenCL one will be able to develop parallel algorithms and then compile them on several platforms: from multi-core processors, the ones everyone has in their own personal computer today, to many-cores processors such as GPU or, perhaps, the upcoming Intel Larrabee," Varré said.

OpenCL "promises to provide code portability," said Nanyang's Schmidt. "However, it still remains to be seen whether OpenCL implementations are also performance portable; [in that] the same code delivers high performance on a variety of parallel architectures."

Giraud sees it as positive news for the field that the first OpenCL compilers are now available. "Nvidia has one available for developers, Apple has embedded a compiler in OS X Snow Leopard, and AMD/ATI is also releasing a compiler," he said. "In high-performance computing, 2010 should be the year of OpenCL."

_______________________________________________
Wg-phyloinformatics mailing list
Wg-phyloinformatics at nescent.org
https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics

From d.johnson at reading.ac.uk Wed Sep 23 06:24:47 2009
From: d.johnson at reading.ac.uk (David Johnson)
Date: Wed, 23 Sep 2009 11:24:47 +0100
Subject: [Biojava-l] ParseException when using interleaved Nexus file
In-Reply-To:
References: <30109C99-0ADF-461D-AB8A-B9800C4111D1@eaglegenomics.com> <32FD8C34-2F30-4159-A9FB-D809784B7ED0@eaglegenomics.com>
Message-ID:

Hi Richard,

Forgot to say after your last mail (ages ago now), thanks for all your help! The stuff I'm using the Nexus parser in works great now.

Cheers,
-David

2009/8/11 Richard Holland :
> It should already be on CruiseControl.
>
> Standards in bioinformatics are a pain - people write them to describe the
> format of files their software outputs, then the very same people then
> produce files that break those standards without any additional
> documentation or explanation. (Genbank are one of the biggest offenders!) It
> makes it very hard to write parsers, because if you stick to the official
> spec there will always be files that don't work yet people insist are still
> valid, yet without prior documented evidence of invalid files that are
> considered to be valid, it is impossible to write a parser to cater for
> them. :)
>
> cheers,
> Richard
>
> On 11 Aug 2009, at 11:12, David Johnson wrote:
>
>> Hi Richard,
>>
>> OK that's good to know... I suppose that's the problem with specifications
>> - people don't always follow them!
>>
>> But I get the impression either some people think that using
>> interleave=yes/no is standard practice, or it could be being generated by
>> some other phylo software (e.g. maybe PAUP or some other tools).
>>
>> I had a talk with my supervisor and he actually can't find the specific
>> programs that have been putting that in, but looking at a range of his Nexus
>> files, there's quite a few that seem to use put in the yes/no bits, some
>> files he received from other researchers.
>>
>> Are the modifications available in the latest automated build (on
>> CruiseControl)?
>>
>> Cheers,
>> -David
>>
>> 2009/8/11 Richard Holland
>> I've found the problem - "interleave=yes" is not valid according to the
>> official NEXUS format spec which the parser was written against. (Maddison
>> et al., 1997)
>>
>> Interleaved file are supposed to only include the word "interleave" - it
>> takes no parameters. Non-interleaved files shouldn't mention it at all.
>>
>> I've modified the parser to tolerate this but I'd be interested to know
>> where the invalid token came from - was it added manually, or by an existing
>> piece of publically available software?
>>
>> The modification has been made in the trunk of the biojava-live subversion
>> repository.
>>
>> cheers,
>> Richard
>>
>>

From andreas.prlic at gmail.com Wed Sep 30 23:48:11 2009
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 30 Sep 2009 20:48:11 -0700
Subject: [Biojava-l] [Biojava-dev] what is the default parameters of
In-Reply-To: <27727343.508701254286606200.JavaMail.coremail@bj163app55.163.com>
References: <27727343.508701254286606200.JavaMail.coremail@bj163app55.163.com>
Message-ID: <59a41c430909302048g14712d53tefdef1f1d64edfb3@mail.gmail.com>

Hi,

That depends on several things, e.g. on the substitution matrix being used, and also on what you want to align.

Here is a site that suggests a combination of substitution matrix and gap penalties (BLOSUM62 matrix with a gap opening penalty of 10 and a gap extension penalty of 0.5):

http://hydra.icgeb.trieste.it/benchmark_previous/index.php?experiment=13

To also advertise some related, but ancient work of myself ;-) see:

http://peds.oxfordjournals.org/cgi/content/full/13/8/545

Andreas

2009/9/29 simpleyrx :
>
> Dear experts,
>
> NeedlemanWunsch's constructor in biojava is as below:
>
> public NeedlemanWunsch(short match,
>                        short replace,
>                        short insert,
>                        short delete,
>                        short gapExtend,
>                        SubstitutionMatrix subMat)
>
> I would like to know what the default values to use with NeedlemanWunsch are, i.e. how to select the match, replace, insert, delete, gapExtend and subMat parameters?
>
> --
> Renxiang Yan
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
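
For readers following this thread, here is a minimal sketch of how a substitution score and separate gap-open and gap-extend penalties combine into a Needleman-Wunsch style global alignment score. It is not BioJava code and says nothing about the defaults or the cost convention of BioJava's NeedlemanWunsch class. The class name GlobalAlignSketch, the score method, the flat match/mismatch scores (+5/-4, standing in for a real matrix such as BLOSUM62) and the integer penalties (10 to open, 1 to extend) are all assumptions chosen for illustration.

```java
/**
 * Illustrative only: global alignment score with affine gap penalties
 * (the three-matrix Gotoh variant of Needleman-Wunsch). Not BioJava code.
 */
final class GlobalAlignSketch {

    // Large negative sentinel; halved so subtracting penalties cannot overflow.
    private static final int NEG = Integer.MIN_VALUE / 2;

    private static int max3(int x, int y, int z) {
        return Math.max(x, Math.max(y, z));
    }

    /**
     * Returns the best global alignment score of a and b.
     * A gap of length k costs gapOpen + (k - 1) * gapExtend.
     * match and mismatch are hypothetical flat scores that stand in
     * for a lookup in a substitution matrix.
     */
    static int score(String a, String b, int match, int mismatch,
                     int gapOpen, int gapExtend) {
        int n = a.length();
        int m = b.length();
        int[][] M = new int[n + 1][m + 1]; // last column pairs a[i-1] with b[j-1]
        int[][] X = new int[n + 1][m + 1]; // last column pairs a[i-1] with a gap
        int[][] Y = new int[n + 1][m + 1]; // last column pairs b[j-1] with a gap

        // Borders: M[0][0] stays 0; a prefix aligned to nothing is one long gap.
        X[0][0] = NEG;
        Y[0][0] = NEG;
        for (int i = 1; i <= n; i++) {
            M[i][0] = NEG;
            Y[i][0] = NEG;
            X[i][0] = -(gapOpen + (i - 1) * gapExtend);
        }
        for (int j = 1; j <= m; j++) {
            M[0][j] = NEG;
            X[0][j] = NEG;
            Y[0][j] = -(gapOpen + (j - 1) * gapExtend);
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                M[i][j] = max3(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s;
                // Extending an existing gap is cheaper than opening a new one.
                X[i][j] = max3(M[i - 1][j] - gapOpen, X[i - 1][j] - gapExtend, Y[i - 1][j] - gapOpen);
                Y[i][j] = max3(M[i][j - 1] - gapOpen, Y[i][j - 1] - gapExtend, X[i][j - 1] - gapOpen);
            }
        }
        return max3(M[n][m], X[n][m], Y[n][m]);
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "ACGT", 5, -4, 10, 1));
        System.out.println(score("ACGT", "ACT", 5, -4, 10, 1));
        System.out.println(score("AAAA", "A", 5, -4, 10, 1));
    }
}
```

The point of the sketch is only that the "right" penalties depend on the scale of the substitution scores: with these made-up numbers one gap costs as much as two matches earn, so raising the match score or lowering gapOpen makes gapped alignments win more often. That is why a matrix and its gap penalties are usually quoted together, as in the BLOSUM62 / 10 / 0.5 suggestion above.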