From markjschreiber at gmail.com  Mon Dec  3 22:07:32 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 3 Dec 2007 22:07:32 -0500
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
    <474D7B3B.8030807@ebi.ac.uk>
    <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
    <474FD737.9080801@ebi.ac.uk>
    <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>
Message-ID: <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com>

The only major advantage to using the JDK DOM/SAX is that everyone has
them (no new JARs required) and they will never go away. However I
can see there is a strong case for something else like XOM, the Apache
alternatives, Saxon etc. In fact these projects often feature bleeding
edge technologies before they appear in the JDK.

To prevent an explosion of JARs I think we should agree on a small
number of XML options. As Mark mentions, a good interface design makes
the user code completely independent of the XML parser that is used.
This makes it much easier to change what is used under the hood if
something better comes along or if one of our project dependencies
stops being developed.

This has actually happened before in biojava. We used to rely on
Xerces or something similar but once SAX and DOM appeared in the JDK
we swapped out Xerces without too much impact. Good unit tests help
to make sure everything still works.

The occasional problem with NCBI XML is probably the best argument to
delve into the dark world of ASN.1

- Mark (Classic Mark, not New Mark)

On Nov 30, 2007 1:30 PM, Mark Fortner wrote:
> There's a potential gotcha involved with XPath parsing. If you use the
> current implementation that ships with the Java 5 & 6 JDKs, it performs a
> DOM parse on the whole document, even if you pass it a specific starting
> node in the document.
> I stumbled across this one the hard way when using
> the hybrid approach that you mention. This may be solved with another XPath
> implementation such as Saxon.
>
> One other problem I've noticed is that the NCBI XML doesn't always parse.
> I've reported this to them, and they've promised to address this. It usually
> occurs when submitters put non-escaped characters into text fields such as
> author lists in PubMed. NCBI doesn't always use CDATA blocks around text and
> as soon as the parser hits one of these characters it throws an exception.
>
> I've also noticed a tendency (in other code bases) for developers to use
> several different parsers; usually, whatever parser they're most familiar
> with. The problem with this is that they often introduce parser-specific
> code into the code base, so you end up with numerous dependencies for
> different parsers, and a potential configuration problem if you're passing
> the XML parser as a run-time configuration parameter. The most frequent
> external parsers I've seen used are JDOM and DOM4J. The usual way to get
> around this is to write to an interface, but that will require some
> additional vigilance.
>
> Just a few things to watch out for as we move forward.
>
> Mark (the other one) :-)
>
>
> On Nov 30, 2007 1:26 AM, Andy Yates wrote:
>
> > I think I've seen XPath hanging around in other people's code in a 1.5
> > code-base (in fact one of the guys I work with). I've used Java's DOM
> > before & it really isn't very nice & quite verbose. I'd prefer if there
> > was a better alternative/wrapper around the XML parsers just to cut down
> > on code chatter.
> >
> > Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these
> > Java tools & I think I've gone cross-eyed with the sheer number of
> > acronyms! You've gotta love something which seems to add a letter to ER
> > & that's a new acronym (e.g. BER, DER, PER and XER).
> > Does anyone on the
> > list know of an ASN.1 parser for Java that's good and should we support
> > it (considering NCBI generate their DTD & XML from the ASN.1
> > representation).
> >
> > Andy
> >
> > Mark Schreiber wrote:
> > > Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> > > not XQuery although XPath is probably more important for this use.
> > >
> > > The DOM model is a direct implementation of the W3C standard which
> > > makes it a little awkward from a Java point of view but it is usable.
> > >
> > > Java 6 has StAX (the other one).
> > >
> > > There are a few Java APIs for parsing ASN.1, mostly developed for the
> > > telco industry. I've never really looked into which is best (anyone
> > > experienced with this?) but we could probably use one to work directly
> > > off NCBI ASN.1
> > >
> > > - Mark
> > >
> > > On Nov 28, 2007 10:29 PM, Andy Yates wrote:
> > >> Hi Mark,
> > >>
> > >> Okay that sounds like a perfectly sensible way to deal with this. Is
> > >> this kind of parsing model supported in Java5? I only ask as I've not
> > >> done a lot of XML parsing with Java5; more with things like XOM (which I
> > >> think offers a DOM only representation but I'm probably wrong).
> > >>
> > >> That's good. There's not a huge point to have a format & a DTD/XSD and
> > >> then have your files not conform to it.
> > >>
> > >> I was thinking the exact same thing about ASN.1 (well that & it looks
> > >> bleeding horrible to parse but that is an un-educated look at the format
> > >> which I'm sure is as parsable as JSON & the like).
> > >>
> > >> When it comes to flat file parsers I would be happier to provide
> > >> implementations of the more common formats where a viable alternative is
> > >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
> > >> similar output to the above have a chance to write their own
> > >> parsers/formatters.
> > >> This is very similar to the current situation but we
> > >> just need to remove dependencies on statically located data structures
> > >> (don't get rid of them completely just give users an option to not use
> > >> them).
> > >>
> > >> I'm not sure how much automatically generated parsers would help us. I
> > >> guess it depends on the data model(s) we use if they are auto-parser
> > >> friendly (which normally means POJO/JavaBean conventions including the
> > >> no-args constructor).
> > >>
> > >> Cool I don't want to exclude flat file parsers completely (if only
> > >> because my group has an interest in BioJava being able to read & write
> > >> flat files) :)
> > >>
> > >> They decided to have HUPO-PSI Format instead :)
> > >>
> > >> Andy
> > >>
> > >>
> > >> Mark Schreiber wrote:
> > >>> Hi -
> > >>>
> > >>> I think in most cases huge XML files in bioinformatics result from a
> > >>> single XML containing multiple repetitive elements. E.g. a BLAST XML
> > >>> output with several hits or a GenBankXML with many Sequences. A nice
> > >>> approach I have seen for dealing with these is to use SAX to read over
> > >>> the file and every time it comes to an element it delegates to a DOM
> > >>> object. You then parse the bits of the DOM you want with XPath or
> > >>> convert to objects or something and then when you are finished with
> > >>> that entry everything gets garbage collected and the SAX parser moves
> > >>> to the next element and repeats the whole process. This is a hybrid
> > >>> of event based parsing and object-model based parsing which could let
> > >>> you efficiently deal with huge files.
> > >>>
> > >>> I think the BLAST XML has improved substantially, at least in terms of
> > >>> validating against its own DTD. The DTD itself may not be the best
> > >>> design but that is always a matter of taste and if you are using XPath
> > >>> to get the relevant bits you don't need to make a SAX parser jump
> > >>> through hoops to get them.
> > >>>
> > >>> I agree we will have to keep flat file parsers but we should strongly
> > >>> encourage the use of XML where possible. It is simply easier to deal
> > >>> with. Most biological flat-files were designed for Fortran and mainly
> > >>> for human consumption. There is no obvious validation mechanism.
> > >>> Notably everything in NCBI is derived from ASN.1; what you see in the
> > >>> flatfile is produced from there. I tend to think this means that the
> > >>> ASN.1 is the holy gospel and what you get in the flat file is some
> > >>> translation. Ideally NCBI files should be parsed from the ASN.1 where
> > >>> you can guarantee validation; the more practical alternative is to use
> > >>> the XML which you can at least validate against a DTD.
> > >>>
> > >>> With XML we (Biojava) can say if it validates we will parse it and if
> > >>> it doesn't we may not. With flat files there are so many dodgy
> > >>> variants we cannot say anything. Because XML DTDs (or XSDs) have
> > >>> versions it also makes it much easier to have parsers for different
> > >>> versions and the parsing machinery can figure out which is needed.
> > >>> With flat files it is anyone's guess what version you are dealing with.
> > >>>
> > >>> Finally parsers can be auto-generated for XML if you have the DTD or
> > >>> XSD. This often doesn't give you an ideal parser but it can be a
> > >>> useful starting point for rapid development.
> > >>>
> > >>> For Biojava v 3 I think we should concentrate on XML parsers first and
> > >>> flat files second. If only Fasta had an XML format
> > >>>
> > >>> - Mark
> > >>>
> > >>> On Nov 27, 2007 11:16 PM, Andy Yates wrote:
> > >>>> I was always under the impression that blast's XML output was nearly as
> > >>>> hard to parse as the flat file format but I do agree that if we can use
> > >>>> XML whenever we can it would make writing parsers a lot easier
> > >>>> (especially if there are SAX based XPath libraries available).
> > >>>> Actually
> > >>>> this brings up a good question about development of this type of parser.
> > >>>> The majority of XPath supporting libraries are DOM based which will mean
> > >>>> large memory usage in some situations but overall providing an easier
> > >>>> coding experience (and hopefully reduce our chances of creating bugs).
> > >>>> Or should we code to the edge cases of someone trying to parse a 1GB
> > >>>> XML? Personally I'd favour the former.
> > >>>>
> > >>>> Going back to the original topic there are going to be situations where
> > >>>> people want the flat file parsers/writers & I think it's a valid point
> > >>>> to say this is where BioJava is meant to come in & help a developer.
> > >>>> After all XML is a computer science problem whereas parsing an EMBL flat
> > >>>> file or blast output is a bioinformatics problem.
> > >>>>
> > >>>> Andy
> > >>>>
> > >>>>
> > >>>> Mark Schreiber wrote:
> > >>>>> For a long time now my feeling has been that we should *only* support
> > >>>>> the XML version of blast output. The other formats are too brittle to
> > >>>>> be easy to parse. I also feel similarly about Genbank, EMBL, etc. That
> > >>>>> may be an extreme view but the power of generic XML parsers and things
> > >>>>> like XPath etc really make these formats look very attractive.
> > >>>>>
> > >>>>> - Mark
> > >>>>>
> > >>>>>
> > >>>>> On Nov 27, 2007 7:47 PM, Andy Yates wrote:
> > >>>>>> I think Groovy have adopted a similar system recently & have guidelines
> > >>>>>> for how each module should behave (dependencies, build system etc). This
> > >>>>>> enforces the idea that a module whilst not part of the core project must
> > >>>>>> behave in the same manner the core does. I do like the idea that we can
> > >>>>>> have a core biojava & things get added around it & it might encourage
> > >>>>>> other users to start developing their own modules for any
> > >>>>>> formats/purpose they want.
> > >>>>>>
> > >>>>>> Richard Holland wrote:
> > >>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> > >>>>>>> Hash: SHA1
> > >>>>>>>
> > >>>>>>>> What format options are there from blast? Just thinking if it supports
> > >>>>>>>> CIGAR or something like that are we better providing a parser for that
> > >>>>>>>> format & saying that we do not support the traditional blast output?
> > >>>>>>>> That said it doesn't help us when that format changes so maybe what is
> > >>>>>>>> needed is a way to push out parser changes without requiring a full
> > >>>>>>>> biojava release (v3 discussion) ...
> > >>>>>>> Exactly! So the modular idea would work nicely here - we could have a
> > >>>>>>> blast module and only update that single module (which would be its own
> > >>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
> > >>>>>>> would no longer happen, except maybe for some kind of core BioJava
> > >>>>>>> module. Everything would be done in terms of individual module+JAR
> > >>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> > >>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> > >>>>>>>
> > >>>>>>> cheers,
> > >>>>>>> Richard
> > >>>>>> _______________________________________________
> > >>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l

From ayates at ebi.ac.uk  Tue Dec  4 04:12:51 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 04 Dec 2007 09:12:51 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
    <474D7B3B.8030807@ebi.ac.uk>
    <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
    <474FD737.9080801@ebi.ac.uk>
    <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>
    <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com>
Message-ID: <47551A13.8000407@ebi.ac.uk>

I think avoiding the jar explosion is a very good idea. I think every
JAR choice should go through a process of issue/vote, which makes it a
bit harder to introduce a new JAR without others knowing what it is,
why the submitter has chosen it & why it is better than the
alternatives; the justification really could be as simple as "I've used
this one & its API is easier to understand".

Same thing is seen in all libraries. Just looking at the Spring
synchronized collection factories you can see it testing for Java
versions & class existence to know what type of synchronized collection
it can create. Also XML APIs are one of the worst for jar dependency
hell since everyone has their favourite parser (just try running a
program in ant without forking & using two XML APIs ... it's fun).

Using XPath & a generic retrieval system could give us this flexibility
we all seem to be wanting. It more depends on is there a good enough
XPath implementation that can handle the XML files we'll be pushing
through it (why is it I think the answer is no).

Hmmm it does but how many bioinformaticians use the ASN.1 syntax though
compared to flat file & XML? I'm guessing that flat file is the winner
here with XML & ASN.1 coming in reasonably equal*. If this is true then
yes I'd be more tempted to write an ASN.1 parser & then support XML.

Andy (not a Mark in the slightest)

* Please note that this is a finger in the air guess with no actual
statistical backing one way or another :).

From smh1008 at cam.ac.uk  Tue Dec  4 04:45:16 2007
From: smh1008 at cam.ac.uk (David Huen)
Date: 04 Dec 2007 09:45:16 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <47551A13.8000407@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
    <474D7B3B.8030807@ebi.ac.uk>
    <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
    <474FD737.9080801@ebi.ac.uk>
    <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>
    <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com>
    <47551A13.8000407@ebi.ac.uk>
Message-ID:

On Dec 4 2007, Andy Yates wrote:

>be wanting. It more depends on is there a good enough XPath
>implementation that can handle the XML files we'll be pushing through it
>(why is it I think the answer is no).
>
And why is it I think you are right? :-)

Some of the XML files used by bioinformaticians can be horrendously
large and at least some of the XML packages do appear to behave like
they bring the whole file into a memory representation before allowing
you to work on it. I think memory use and performance was a major factor
in the current BJ implementations adopting an event-based model even
though it's more difficult to use usually.

Regards,
DH

--
David Huen
Dept of Genetics
University of Cambridge
CB2 3EH
U.K.

From ayates at ebi.ac.uk  Tue Dec  4 05:42:19 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 04 Dec 2007 10:42:19 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To:
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
    <474D7B3B.8030807@ebi.ac.uk>
    <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
    <474FD737.9080801@ebi.ac.uk>
    <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>
    <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com>
    <47551A13.8000407@ebi.ac.uk>
Message-ID: <47552F0B.1080302@ebi.ac.uk>

David Huen wrote:
> On Dec 4 2007, Andy Yates wrote:
>
>
>> be wanting. It more depends on is there a good enough XPath
>> implementation that can handle the XML files we'll be pushing through
>> it (why is it I think the answer is no).
>>
> And why is it I think you are right? :-)

Lol :)

>
> Some of the XML files used by bioinformaticians can be horrendously
> large and at least some of the XML packages do appear to behave like
> they bring the whole file into a memory representation before allowing
> you to work on it. I think memory use and performance was a major factor
> in the current BJ implementations adopting an event-based model even
> though it's more difficult to use usually.
>

It is one of my biggest concerns that a huge DOM model + BioJava model is
going to take up a lot of memory. However if a SAX, StAX (either one) or
DOM based parser is hidden behind a good enough interface hopefully the
implementation used can be up to the user. That said these goals may be
too different & distant for us to be able to do it.
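
To make the "hidden behind a good enough interface" idea concrete, here
is a rough sketch rather than a proposed BioJava API. All names in it
(XmlQuery, JdkDomQuery, SequenceNames) and the Seq/name element names
are hypothetical. It assumes only the DOM and XPath classes that ship
with the JDK; another backend (SAX, StAX, or a third-party library)
could implement the same interface without callers changing.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical parser-neutral contract: no XML library types leak out. */
interface XmlQuery {
    /** Returns the text content of every node matched by the XPath expression. */
    List<String> values(Reader xml, String path);
}

/** One possible backend, built only on the JDK's DOM and XPath classes. */
class JdkDomQuery implements XmlQuery {
    public List<String> values(Reader xml, String path) {
        try {
            // Whole-document DOM parse: simple, but memory grows with file size.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(xml));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(path, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception ex) {
            throw new IllegalStateException("XML query failed", ex);
        }
    }
}

/** Caller code depends on XmlQuery only, so the backend can be swapped. */
class SequenceNames {
    static List<String> names(XmlQuery query, Reader xml) {
        return query.values(xml, "//Seq/name");
    }
}
```

A caller would write something like
SequenceNames.names(new JdkDomQuery(), reader); swapping in a streaming
backend would then be a one-line change at the call site, which is the
kind of flexibility discussed above.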

Andy

From crackeur at comcast.net  Thu Dec  6 04:46:25 2007
From: crackeur at comcast.net (jimmy Zhang)
Date: Thu, 6 Dec 2007 01:46:25 -0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com><474D7B3B.8030807@ebi.ac.uk>
    <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <004901c837ec$de4766d0$0402a8c0@your55e5f9e3d2>

VTD-XML should also be worth mentioning
http://vtd-xml.sf.net

----- Original Message -----
From: "Mark Schreiber"
To: "Andy Yates"
Cc: "biojava-1 mailing list"
Sent: Thursday, November 29, 2007 6:28 PM
Subject: Re: [Biojava-l] SAX, DOM, XPath and Flat files

> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a Java point of view but it is usable.
>
> Java 6 has StAX (the other one).
>
> There are a few Java APIs for parsing ASN.1, mostly developed for the
> telco industry. I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
>
> - Mark
>
> On Nov 28, 2007 10:29 PM, Andy Yates wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5? I only ask as I've not
>> done a lot of XML parsing with Java5; more with things like XOM (which I
>> think offers a DOM only representation but I'm probably wrong).
>>
>> That's good. There's not a huge point to have a format & a DTD/XSD and
>> then have your files not conform to it.
>>
>> I was thinking the exact same thing about ASN.1 (well that & it looks
>> bleeding horrible to parse but that is an un-educated look at the format
>> which I'm sure is as parsable as JSON & the like).
>> >> When it comes to flat file parsers I would be happier to provide >> implementations of the more common formats where a viable alternative is >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide >> similar output to the above have a chance to write their own >> parsers/formatters. This is very similar to the current situation but we >> just need to remove dependencies on statically located data structures >> (don't get rid of them completely just give users an option to not use >> them). >> >> I'm not sure how much automatically generated parsers would help us. I >> guess it depends on the data model(s) we use if they are auto-parser >> friendly (which normally means POJO/JavaBean conventions including the >> no-args constructor). >> >> Cool I don't want to exclude flat file parsers completely (if only >> because my group has an interest in BioJava being able to read & write >> flat files) :) >> >> They decided to have HUPO-PSI Format instead :) >> >> Andy >> >> >> Mark Schreiber wrote: >> > Hi - >> > >> > I think in most cases huge XML files in bioinformatics result from a >> > single XML containing multiple repetitive elements. Eg a BLAST XML >> > output with several hits or a GenBankXML with many Sequences. A nice >> > approach I have seen for dealing with these is to use SAX to read over >> > the file and every time it comes to an element it delegates to a DOM >> > object. You then parse the bits of the DOM you want with XPath or >> > convert to objects or something and then when you are finished with >> > that entry everything gets garbage collected and the SAX parser moves >> > to the next element and repeats the whole process. This is a hybrid >> > of event based parsing and object-model based parsing which could let >> > you efficiently deal with huge files. >> > >> > I think the BLAST XML has improved substantially, at least in terms of >> > validating against it's own DTD. 
The DTD itself may not be the best >> > design but that is always a matter of taste and if you are using XPath >> > to get the relevant bits you don't need to make a SAX parser jump >> > through hoops to get them. >> > >> > I agree we will have to keep flat file parsers but we should strongly >> > encourage the use of XML where possible. It is simply easier to deal >> > with. Most biological flat-files were designed for Fortran and mainly >> > for human consumption. There is no obvious validation mechanism. >> > Notably everything in NCBI is derived from ASN.1, what you see in the >> > flatfile is produced from there. I tend to think this means that the >> > ASN.1 is the holy gospel and what you get in the flat file is some >> > translation. Ideally NCBI files should be parsed from the ASN.1 where >> > you can guarantee validation, the more practical alternative is to use >> > the XML which you can at least validate against a DTD. >> > >> > With XML we (Biojava) can say if it validates we will parse it and if >> > it doesn't we may not. With flat files there are so many dodgey >> > variants we cannot say anything. Because XML dtds (or xsd's) have >> > versions it also makes it much easier to have parsers for different >> > versions and the parsing machinery can figure out which is needed. >> > With flat files it is anyones guess what version you are dealing with. >> > >> > Finally parsers can be auto-generated for XML if you have the DTD or >> > XSD. This often doesn't give you an ideal parser but it can be a >> > useful starting point for rapid development. >> > >> > For Biojava v 3 I think we should concentrate on XML parsers first and >> > flat files second. 
if only Fasta had an XML format >> > >> > - Mark >> > >> > On Nov 27, 2007 11:16 PM, Andy Yates wrote: >> >> I was always under the impression that blast's XML output was nearly >> >> as >> >> hard to parse as the flat file format but I do agree that if we can >> >> use >> >> XML whenever we can it would make writing parsers a lot easier >> >> (especially if there are SAX based XPath libraries available). >> >> Actually >> >> this brings up a good question about development of this type of >> >> parser. >> >> The majority of XPath supporting libraries are DOM based which will >> >> mean >> >> large memory usage in some situations but overall providing an easier >> >> coding experience (and hopefully reduce our chances of creating bugs). >> >> Or should we code to the edge cases of someone trying to parse a 1GB >> >> XML? Personally I'd favour the former. >> >> >> >> Going back to the original topic there are going to be situations >> >> where >> >> people want the flat file parsers/writers & I think it's a valid point >> >> to say this is where BioJava is meant to come in & help a developer. >> >> Afterall XML is a computer science problem where as parsing an EMBL >> >> flat >> >> file or blast output is a bioinformatics problem. >> >> >> >> Andy >> >> >> >> >> >> Mark Schreiber wrote: >> >>> For a long time now my feeling has been that we should *only* support >> >>> the XML version of blast output. The other formats are too brittle >> >>> to >> >>> be easy to parse. I also feel similarly about Genbank, EMBL, etc >> >>> that >> >>> may be an extreme view but the power of generic XML parsers and >> >>> things >> >>> like XPath etc really make these formats look very attractive. >> >>> >> >>> - Mark >> >>> >> >>> >> >>> On Nov 27, 2007 7:47 PM, Andy Yates wrote: >> >>>> I think Groovy have adopted a similar system recently & have >> >>>> guidelines >> >>>> for how each module should behave (dependencies, build system etc). 
>> >>>> This >> >>>> enforces the idea that a module whilst not part of the core project >> >>>> must >> >>>> behave in the same manner the core does. I do like the idea that we >> >>>> can >> >>>> have a core biojava & things get added around it & it might >> >>>> encourage >> >>>> other users to start developing their own modules for any >> >>>> formats/purpose they want. >> >>>> >> >>>> Richard Holland wrote: >> >>>>> -----BEGIN PGP SIGNED MESSAGE----- >> >>>>> Hash: SHA1 >> >>>>> >> >>>>>> What format options are there from blast? Just thinking if it >> >>>>>> supports >> >>>>>> CIGAR or something like that are we better providing a parser for >> >>>>>> that >> >>>>>> format & saying that we do not support the traditional blast >> >>>>>> output? >> >>>>>> That said it doesn't help is when that format changes so maybe >> >>>>>> what is >> >>>>>> needed is a way to push out parser changes without requiring a >> >>>>>> full >> >>>>>> biojava release (v3 discussion) ... >> >>>>> Exactly! So the modular idea would work nicely here - we could have >> >>>>> a >> >>>>> blast module and only update that single module (which would be its >> >>>>> own >> >>>>> JAR) whenever the format changes. In a way, BioJava releases as >> >>>>> such >> >>>>> would no longer happen, except maybe for some kind of core BioJava >> >>>>> module. Everything would be done in terms of individual module+JAR >> >>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, >> >>>>> one >> >>>>> for Phylogenetic tools, one for translation/transcription, etc. >> >>>>> etc. 
>> >>>>> >> >>>>> cheers, >> >>>>> Richard >> >>>> _______________________________________________ >> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >>>> >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From ap3 at sanger.ac.uk Thu Dec 6 05:33:17 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Thu, 6 Dec 2007 10:33:17 +0000 Subject: [Biojava-l] status update SVN migration Message-ID: Hi, a quick status update of the CVS to SVN migration for BioJava: George Hartzell created the first SVN dumps of the CVS repository. I am running tests on these to make sure the whole repository has been exported correctly. For details please see here: http://biojava.org/wiki/CVS_to_SVN_Migration A few minor problems have been found during testing. As soon as these have been resolved we will be ready to make the final migration. In order to speed the migration process up, please commit any uncommitted changes to CVS in the next couple of days. Once the tests are finished I will send another notification email which will declare a CVS freeze a few days after. After this freeze CVS will remain frozen forever and all new development should happen in SVN. There will also be a new BioJava release at that point. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
From ap3 at sanger.ac.uk Mon Dec 10 08:26:13 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 10 Dec 2007 13:26:13 +0000 Subject: [Biojava-l] SVN migration: declaring CVS freeze Message-ID: <099BDD0E-DA16-43C7-88CE-A47CA810D1EE@sanger.ac.uk> Hi, for the SVN migration please commit any remaining code to CVS in the next few days. On Wednesday, December 12th, 18:00 GMT the BioJava CVS will be frozen. In the following days the repository will be migrated to Subversion (SVN). From then on all future development will happen in the new SVN repository. All code (+ history) will be available via SVN. I will send a confirmation email when the new SVN repository becomes accessible. Detailed instructions on how to check out and commit code will be sent out at that stage as well. For more details see: http://biojava.org/wiki/CVS_to_SVN_Migration Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From markjschreiber at gmail.com Wed Dec 12 04:34:28 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 12 Dec 2007 17:34:28 +0800 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: References: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> Message-ID: <93b45ca50712120134w65cfad5dwbaeec2ea19a5a3b4@mail.gmail.com> Not a bad suggestion. I wasn't aware of how many formats are now in OWL. It would give us a pretty rapid route to pathway and microarray object models as well. Jena seems to be a nice package to build upon.
Unfortunately due to the 'striped' rather than nested nature of ontologies we can forget the SAX vs DOM argument. The Jena OntModel is all going into memory. Looks like informatics is going to be the memory pig of the next decade. - Mark > May I kindly suggest skipping all of this talk about XML and have us > jump straight to OWL? ;) > > > http://dev.isb-sib.ch/projects/uniprot-rdf/ > > michael > > From phossein at umd.edu Sun Dec 16 15:22:15 2007 From: phossein at umd.edu (Parsa Hosseini) Date: Sun, 16 Dec 2007 15:22:15 -0500 (EST) Subject: [Biojava-l] Connecting to TAIR Message-ID: <20071216152215.ACH56650@po5.mail.umd.edu> Hi, I'm interested in contributing to the Biojava project. More specifically, I'm curious about making a package where you can query the TAIR (Arabidopsis.org) website. Would this be of use or benefit to anyone other than myself? Thanks, Parsa phossein at umd.edu From markjschreiber at gmail.com Mon Dec 17 02:13:41 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Mon, 17 Dec 2007 02:13:41 -0500 Subject: [Biojava-l] Connecting to TAIR In-Reply-To: <93b45ca50712161918w5d1e9c05t690f2d68b0fe9c@mail.gmail.com> References: <20071216152215.ACH56650@po5.mail.umd.edu> <93b45ca50712161918w5d1e9c05t690f2d68b0fe9c@mail.gmail.com> Message-ID: <93b45ca50712162313m7f9f39b7ha2d70b3306468b27@mail.gmail.com> Hi - I think this would be of great use to BioJava, especially as TAIR is (I think) built on Ensembl. There was a suggestion that biojava take over the Java bindings to Ensembl (Ensj). If you are interested in this then let us know and someone can point you in the right direction. - Mark On Dec 16, 2007 3:22 PM, Parsa Hosseini wrote: > Hi, > > I'm interested in contributing to the Biojava project. > More specifically, I'm curious about making a package where you can query the TAIR (Arabidopsis.org) website. > Would this be of use or benefit to anyone other than myself?
> > Thanks, > > > > Parsa > phossein at umd.edu > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l >

From phossein at umd.edu Tue Dec 18 01:39:21 2007
From: phossein at umd.edu (Parsa Hosseini)
Date: Tue, 18 Dec 2007 01:39:21 -0500 (EST)
Subject: [Biojava-l] Problem finding Restriction Enzyme sites
Message-ID: <20071218013921.ACI39718@po5.mail.umd.edu>

Hi,

I've been using the 'molbio' package to do some Restriction enzyme analysis. I've come across errors when I want to find the corresponding recognition site. I'd be very thankful if I could be lead down the path to find were I am going wrong.

Thank you in advance,

Parsa
phossein at umd.edu

------------------------

try {
    SymbolList siteOfEnzyme = DNATools.createDNA("GAATC");
    Sequence sequence = DNATools.createDNASequence("AAAAGAATCTTC", "mySequence");
    RestrictionEnzyme re1 = new RestrictionEnzyme("ECOR1", siteOfEnzyme, 0, 0);
    SimpleThreadPool tr = new SimpleThreadPool();
    RestrictionMapper reMapper = new RestrictionMapper(tr);
    reMapper.addEnzyme(re1);
    System.out.println(reMapper.annotate(sequence));
}
catch (IllegalSymbolException e) {
}
catch (IllegalAlphabetException e) {
}

Exception in thread "Thread-1" org.biojava.bio.BioRuntimeException: Failed to complete search for ECOR1 GAATC (0/0)
    at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:136)
    at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295)
Caused by: java.lang.IllegalArgumentException: RestrictionEnzyme 'ECOR1' is not registered. No precompiled Pattern is available
    at org.biojava.bio.molbio.RestrictionEnzymeManager.getPatterns(RestrictionEnzymeManager.java:280)
    at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:77)
    ...
1 more From holland at ebi.ac.uk Tue Dec 18 03:56:36 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 18 Dec 2007 08:56:36 +0000 Subject: [Biojava-l] Problem finding Restriction Enzyme sites In-Reply-To: <20071218013921.ACI39718@po5.mail.umd.edu> References: <20071218013921.ACI39718@po5.mail.umd.edu> Message-ID: <47678B44.1040004@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello. Your code is throwing an exception saying that it doesn't know what the definition for ECOR1 is. This is because the molbio package doesn't contain any restriction enzyme definitions by default, and you haven't loaded any into it yet (at least, not in the code fragment you quote). Before you can use the methods in this package, you need to have made a call to RestrictionEnzymeManager.loadEnzymeFile() and pass it a reference to a file containing some restriction enzyme definitions in REBASE format #31. Once you've used this method to load a REBASE file containing a definition for the ECOR1 enzyme, you'll be able to use the code. cheers, Richard Parsa Hosseini wrote: > Hi, > > I've been using the 'molbio' package to do some Restriction enzyme analysis. I've come across errors when I want to find the corresponding recognition site. I'd be very thankful if I could be lead down the path to find were I am going wrong. 
> > Thank you in advance, > > Parsa > > phossein at umd.edu > > ------------------------ > > try { > > SymbolList siteOfEnzyme = DNATools.createDNA("GAATC"); > Sequence sequence = DNATools.createDNASequence("AAAAGAATCTTC", "mySequence"); > RestrictionEnzyme re1 = new RestrictionEnzyme("ECOR1", siteOfEnzyme, 0, 0); > SimpleThreadPool tr = new SimpleThreadPool(); > RestrictionMapper reMapper = new RestrictionMapper(tr); > reMapper.addEnzyme(re1); System.out.println(reMapper.annotate(sequence)); > } > catch (IllegalSymbolException e) { > > } > catch (IllegalAlphabetException e) { > > } > > Exception in thread "Thread-1" org.biojava.bio.BioRuntimeException: Failed to complete search for ECOR1 GAATC (0/0) at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:136) at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295) > Caused by: java.lang.IllegalArgumentException: RestrictionEnzyme 'ECOR1' is not registered. No precompiled Pattern is available > at org.biojava.bio.molbio.RestrictionEnzymeManager.getPatterns(RestrictionEnzymeManager.java:280) > at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:77) > ... 1 more > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. 
+44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHZ4tD4C5LeMEKA/QRApOrAKCYZ9rXbPzyKuAV4DrwaSL424s/wACfSIDC K6TFXwVdQSrV4bTG4DGKYhI= =Nc4w -----END PGP SIGNATURE----- From phossein at umd.edu Wed Dec 26 22:46:14 2007 From: phossein at umd.edu (Parsa Hosseini) Date: Wed, 26 Dec 2007 22:46:14 -0500 (EST) Subject: [Biojava-l] New tutorials Message-ID: <20071226224614.ACM82423@po5.mail.umd.edu> Hi all, I'm in the process of writing some new tutorials. Maybe they can go in the Tutorial section, or the 'BioJava in Anger' section. Either way, I'm interested in writing on the following topics: *Downloading GI's from NCBI *Pairwise alignment using global and local means *Restriction Enzyme analysis .... and if there is anything else, please let me know. I'm going to throw in some graphics and nice illustrations using some corel or photoshop too. If there are any other things to add, please feel free to let me know. Parsa Hosseini phossein at umd.edu From markjschreiber at gmail.com Thu Dec 27 01:29:16 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 27 Dec 2007 14:29:16 +0800 Subject: [Biojava-l] New tutorials In-Reply-To: <20071226224614.ACM82423@po5.mail.umd.edu> References: <20071226224614.ACM82423@po5.mail.umd.edu> Message-ID: <93b45ca50712262229u760f1e48rf0811de60527994a@mail.gmail.com> Great! The more the better! I have been thinking for a while that there could also be another section of examples in the cookbook for examples that don't make much use of biojava but are biorelated and use java. - Mark On Dec 27, 2007 11:46 AM, Parsa Hosseini wrote: > Hi all, > > I'm in the process of writing some new tutorials. Maybe they can go in the Tutorial section, or the 'BioJava in Anger' section. 
Either way, I'm interested in writing on the following topics: > > *Downloading GI's from NCBI > *Pairwise alignment using global and local means > *Restriction Enzyme analysis > > .... and if there is anything else, please let me know. > I'm going to throw in some graphics and nice illustrations using some corel or photoshop too. > If there are any other things to add, please feel free to let me know. > > Parsa Hosseini > phossein at umd.edu > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From david.bourgais at bioxpr.be Thu Dec 27 09:11:33 2007 From: david.bourgais at bioxpr.be (David Bourgais) Date: Thu, 27 Dec 2007 15:11:33 +0100 Subject: [Biojava-l] About ABITrace class Message-ID: <1198764693.20435.5.camel@bioxpr-04.ct.fundp.ac.be> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From david.bourgais at bioxpr.be Fri Dec 28 03:54:30 2007 From: david.bourgais at bioxpr.be (David Bourgais) Date: Fri, 28 Dec 2007 09:54:30 +0100 Subject: [Biojava-l] About ABITrace class Message-ID: <1198832070.29755.0.camel@bioxpr-04.ct.fundp.ac.be> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From markjschreiber at gmail.com Fri Dec 28 22:02:26 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 29 Dec 2007 11:02:26 +0800 Subject: [Biojava-l] Applet not able to find DNATools class. In-Reply-To: <474A8A1C.4020901@ebi.ac.uk> References: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu> <474A8A1C.4020901@ebi.ac.uk> Message-ID: <93b45ca50712281902v700366f7s3b00595efa681108@mail.gmail.com> Hi - This has come up several times on the mailing lists. You could probably find a resolution in the archives. 
- Mark On Nov 26, 2007 4:55 PM, Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Sounds like either a classpath problem (in which case check your > classpath to ensure all parts of biojava are definitely on it) or a > broken biojava.jar (in which case you need to recompile/redownload it). > > cheers, > Richard > > Abhinav Ram Karhu wrote: > > Hello all, > > I am having an error while loading the applet. > > > > I am getting the following stack trace. > > > > java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools > > at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161) > > at Trace.init(Trace.java:161) > > at sun.applet.AppletPanel.run(Unknown Source) > > at java.lang.Thread.run(Unknown Source) > > > > I have the directory structure in which I am having my class files , the php page and the biojava jar files together in one folder. > > > > I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java. > > > > My applet code in the php page looks like this: > > > > > > > > Please suggest if I am missing something. > > > > Thanks in advance. > > > > Abhinav > > > > > > > > - -- > Richard Holland (BioMart) > EMBL EBI, Wellcome Trust Genome Campus, > Hinxton, Cambridgeshire CB10 1SD, UK > Tel. 
+44 (0)1223 494416 > > http://www.biomart.org/ > http://www.biojava.org/ > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHSoob4C5LeMEKA/QRAsfkAJ9SlwIzDulzSDQpAfgh0alISRsplACcDqUx > uyQUEmRFEWTdnEHsm7k2lg0= > =SWHu > -----END PGP SIGNATURE----- > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From russ at kepler-eng.com Sun Dec 30 12:04:05 2007 From: russ at kepler-eng.com (Russ Kepler) Date: Sun, 30 Dec 2007 10:04:05 -0700 Subject: [Biojava-l] About ABITrace class In-Reply-To: <1198832070.29755.0.camel@bioxpr-04.ct.fundp.ac.be> References: <1198832070.29755.0.camel@bioxpr-04.ct.fundp.ac.be> Message-ID: <200712301004.05398.russ@kepler-eng.com> On Friday 28 December 2007 01:54:30 David Bourgais wrote: > Hello BioJava users > > I am writing a little program using BioJava 1.5 and JFreeChart 1.0.8. > My aim is to display with JFreeChart a chromatogram by reading an ab1 > file with the BioJava help. > Okay, I can display my chromatogram. But, my chromatogram size is 13301 > (using getTrace(AtomicSymbol base)) when my sequence size is 1110 bp > (according to the the BufferedImage produced with BioJava). > Why this difference ? How can I correct my program in order to see > correctly my chromatogram ? If I understand you correctly you're looking at the trace data (multiple points per peak) and comparing to the base calls (one call per peak). In my experience folks interested in the chromatogram would prefer to see an image with the chromatogram spaced properly, those primarily interested in the base calls prefer to have the chromatogram scaled around a fixed width for the basecall. There's code to do either in BioJava, but the former might not be setup to generate a BufferedImage (if you can't find it I might be able to help some). 
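To put numbers on the difference described above, here is a small sketch in plain Java. It does not call any BioJava code; the class TraceScale and its two methods are hypothetical helpers, and the figures are just the ones quoted in the question (13301 trace points, 1110 called bases). It shows roughly how many trace samples sit under each called base, and how a trace sample index maps to an x pixel when the whole trace is squeezed into a chart of a given width.

```java
/** Hypothetical helper, not part of BioJava: relates trace samples to base calls and pixels. */
class TraceScale {

    /** Average number of trace samples per called base (several points per peak). */
    static double samplesPerBase(int traceLength, int sequenceLength) {
        if (sequenceLength <= 0) {
            throw new IllegalArgumentException("sequenceLength must be positive");
        }
        return (double) traceLength / sequenceLength;
    }

    /** X pixel of a trace sample when the full trace is drawn across imageWidth pixels. */
    static int sampleToPixel(int sampleIndex, int traceLength, int imageWidth) {
        if (traceLength <= 0) {
            throw new IllegalArgumentException("traceLength must be positive");
        }
        // long arithmetic avoids int overflow for long traces and wide images
        return (int) ((long) sampleIndex * imageWidth / traceLength);
    }
}
```

With the numbers from the question, samplesPerBase(13301, 1110) comes out at about 12, so a trace roughly twelve times longer than the sequence is what one would expect rather than a sign of a bug. To label bases on a JFreeChart plot of the raw trace, the x axis would need to be in trace samples, with each base label placed at the sample index of its peak.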
From ap3 at sanger.ac.uk Sun Dec 30 13:43:36 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Sun, 30 Dec 2007 18:43:36 +0000 Subject: [Biojava-l] biojava 1.6 release candidate 1 Message-ID: <9BA76ACB-33B8-4E84-96F5-04981DE3CD62@sanger.ac.uk> Hi, I prepared release candidate 1 for the next biojava release for download at http://www.biojava.org/download/bj16/rc1/ This release candidate is built from the new svn repository, as a test for it. It contains numerous bug fixes, improvements in the protein structure modules and better documentation. This release will be the first one where biojava will be using Java 1.5. If you are still using the old Java 1.4, please consider an upgrade, or continue using the previous biojava 1.5 release. If all is fine with it, I will make this the next biojava release. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
From markjschreiber at gmail.com Tue Dec 4 03:07:32 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Mon, 3 Dec 2007 22:07:32 -0500 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com> References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> <474FD737.9080801@ebi.ac.uk> <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com> Message-ID: <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com> The only major advantage to using the JDK DOM/SAX is that everyone has them (no new JARs required) and they will never go away. However I can see there is a strong case for something else like XOM or Apache alternatives Saxon etc. Infact these projects often feature bleeding edge technologies before they appear in the JDK. To prevent an explosion of JARs I think we should agree on a small few XML options. As Mark mentions a good interface design makes the user code completely independent of the XML parser that is used. This makes it much easier to change what is used under the hood if something better comes along or if one of our project dependencies stops being developed. This has actually happened before in biojava. We used to rely on Xerces or something similar but once SAX and DOM appeared in the JDK we swapped out Xerces without too much impact. Good unit tests help to make sure everything still works. The occasional problem with NCBI XML is probably the best argument to delve into the dark world of ASN.1 - Mark (Classic Mark, not New Mark) On Nov 30, 2007 1:30 PM, Mark Fortner wrote: > There's a potential gotcha involved with XPath parsing. If you use the > current implementation that ships with the Java 5 & 6 JDKs, it performs a > DOM parse on the whole document, even if you pass it a specific starting > node in the document. 
I stumbled across this one the hard way when using > the hybrid approach that you mention. This may be solved with another XPath > implementation such as Saxon. > > One other problem I've noticed is that the NCBI XML doesn't always parse. > I've reported this to them, and they've promised to address this. It usually > occurs when submitters put non-escaped characters into text fields such as > author lists in PubMed. NCBI doesn't always use CDATA blocks around text and > as soon as the parser hits one of these characters it throws an exception. > > I've also noticed a tendency (in other code bases) for developers to use > several different parsers; usually, whatever parser they're most familiar > with. The problem with this is that they often introduce parser-specific > code into the code base, so you end up with numerous dependencies for > different parsers, and a potential configuration problem if you're passing > the XML parser as a run-time configuration parameter. The most frequent > external parsers I've seen used are JDOM and DOM4J. The usual way to get > around this is to write to an interface, but that will require some > additional vigilance. > > Just a few things to watch out for as we move forward. > > Mark (the other one) :-) > > > On Nov 30, 2007 1:26 AM, Andy Yates wrote: > > > I think I've seen XPath hanging around in other people's code in a 1.5 > > code-base (in fact one of the guys I work with). I've used Java's DOM > > before & it really isn't very nice & quite verbose. I'd prefer if there > > was a better alternative/wrapper around the XML parsers just to cut down > > on code chatter. > > > > Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these > > Java tools & I think I've gone cross-eyed with the sheer number of > > acronyms! You've gotta love something which seems to add a letter to ER > > & that's a new acronym (e.g. BER, DER, PER and XER). 
Does anyone on the > > list know of a ASN.1 parser for Java that's good and should we support > > it (considering NCBI generate their DTD & XML from the ASN.1 > > representation). > > > > Andy > > > > Mark Schreiber wrote: > > > Java 5 SDK has both SAX and DOM as standard. I think it has XPath but > > > not XQuery although XPath is probably more important for this use. > > > > > > The DOM model is a direct implementation of the W3C standard which > > > makes it a little awkward from a java point of view but it is usable. > > > > > > Java 6 has StAX (the other one). > > > > > > There are a few java API's for parsing ASN.1 mostly developed for the > > > telco industry, I've never really looked into which is best (anyone > > > experienced with this?) but we could probably use one to work directly > > > off NCBI ASN.1 > > > > > > - Mark > > > > > > On Nov 28, 2007 10:29 PM, Andy Yates wrote: > > >> Hi Mark, > > >> > > >> Okay that sounds like a perfectly sensible way to deal with this. Is > > >> this kind of parsing model supported in Java5? I only ask as I've not > > >> done a lot of XML parsing with Java5; more with things like XOM (which > > I > > >> think offers a DOM only representation but I'm probably wrong). > > >> > > >> That's good. There's not a huge point to have a format & a DTD/XSD and > > >> then have your files not conform to it. > > >> > > >> I was thinking the exact same thing about ASN.1 (well that & it looks > > >> bleeding horrible to parse but that is an un-educated look at the > > format > > >> which I'm sure is a parsable as JSON & the alike). > > >> > > >> When it comes to flat file parsers I would be happier to provide > > >> implementations of the more common formats where a viable alternative > > is > > >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which > > provide > > >> similar output to the above have a chance to write their own > > >> parsers/formatters. 
This is very similar to the current situation but > > we > > >> just need to remove dependencies on statically located data structures > > >> (don't get rid of them completely just give users an option to not use > > >> them). > > >> > > >> I'm not sure how much automatically generated parsers would help us. I > > >> guess it depends on the data model(s) we use if they are auto-parser > > >> friendly (which normally means POJO/JavaBean conventions including the > > >> no-args constructor). > > >> > > >> Cool I don't want to exclude flat file parsers completely (if only > > >> because my group has an interest in BioJava being able to read & write > > >> flat files) :) > > >> > > >> They decided to have HUPO-PSI Format instead :) > > >> > > >> Andy > > >> > > >> > > >> Mark Schreiber wrote: > > >>> Hi - > > >>> > > >>> I think in most cases huge XML files in bioinformatics result from a > > >>> single XML containing multiple repetitive elements. Eg a BLAST XML > > >>> output with several hits or a GenBankXML with many Sequences. A nice > > >>> approach I have seen for dealing with these is to use SAX to read over > > >>> the file and every time it comes to an element it delegates to a DOM > > >>> object. You then parse the bits of the DOM you want with XPath or > > >>> convert to objects or something and then when you are finished with > > >>> that entry everything gets garbage collected and the SAX parser moves > > >>> to the next element and repeats the whole process. This is a hybrid > > >>> of event based parsing and object-model based parsing which could let > > >>> you efficiently deal with huge files. > > >>> > > >>> I think the BLAST XML has improved substantially, at least in terms of > > >>> validating against it's own DTD. The DTD itself may not be the best > > >>> design but that is always a matter of taste and if you are using XPath > > >>> to get the relevant bits you don't need to make a SAX parser jump > > >>> through hoops to get them. 
> > >>> > > >>> I agree we will have to keep flat file parsers but we should strongly > > >>> encourage the use of XML where possible. It is simply easier to deal > > >>> with. Most biological flat files were designed for Fortran and mainly > > >>> for human consumption. There is no obvious validation mechanism. > > >>> Notably everything in NCBI is derived from ASN.1; what you see in the > > >>> flat file is produced from there. I tend to think this means that the > > >>> ASN.1 is the holy gospel and what you get in the flat file is some > > >>> translation. Ideally NCBI files should be parsed from the ASN.1 where > > >>> you can guarantee validation; the more practical alternative is to use > > >>> the XML which you can at least validate against a DTD. > > >>> > > >>> With XML we (Biojava) can say if it validates we will parse it and if > > >>> it doesn't we may not. With flat files there are so many dodgy > > >>> variants we cannot say anything. Because XML DTDs (or XSDs) have > > >>> versions it also makes it much easier to have parsers for different > > >>> versions and the parsing machinery can figure out which is needed. > > >>> With flat files it is anyone's guess what version you are dealing with. > > >>> > > >>> Finally parsers can be auto-generated for XML if you have the DTD or > > >>> XSD. This often doesn't give you an ideal parser but it can be a > > >>> useful starting point for rapid development. > > >>> > > >>> For Biojava v 3 I think we should concentrate on XML parsers first and > > >>> flat files second. If only FASTA had an XML format. > > >>> > > >>> - Mark > > >>> > > >>> On Nov 27, 2007 11:16 PM, Andy Yates wrote: > > >>>> I was always under the impression that blast's XML output was nearly as > > >>>> hard to parse as the flat file format but I do agree that if we can use > > >>>> XML whenever we can it would make writing parsers a lot easier > > >>>> (especially if there are SAX based XPath libraries available).
> > >>>> Actually this brings up a good question about development of this type of parser. > > >>>> The majority of XPath supporting libraries are DOM based which will mean > > >>>> large memory usage in some situations but overall providing an easier > > >>>> coding experience (and hopefully reduce our chances of creating bugs). > > >>>> Or should we code to the edge cases of someone trying to parse a 1 GB > > >>>> XML? Personally I'd favour the former. > > >>>> > > >>>> Going back to the original topic there are going to be situations where > > >>>> people want the flat file parsers/writers & I think it's a valid point > > >>>> to say this is where BioJava is meant to come in & help a developer. > > >>>> After all XML is a computer science problem whereas parsing an EMBL flat > > >>>> file or blast output is a bioinformatics problem. > > >>>> > > >>>> Andy > > >>>> > > >>>> > > >>>> Mark Schreiber wrote: > > >>>>> For a long time now my feeling has been that we should *only* support > > >>>>> the XML version of blast output. The other formats are too brittle to > > >>>>> be easy to parse. I also feel similarly about Genbank, EMBL, etc.; that > > >>>>> may be an extreme view but the power of generic XML parsers and things > > >>>>> like XPath etc. really make these formats look very attractive. > > >>>>> > > >>>>> - Mark > > >>>>> > > >>>>> > > >>>>> On Nov 27, 2007 7:47 PM, Andy Yates wrote: > > >>>>>> I think Groovy have adopted a similar system recently & have guidelines > > >>>>>> for how each module should behave (dependencies, build system etc). This > > >>>>>> enforces the idea that a module whilst not part of the core project must > > >>>>>> behave in the same manner the core does. I do like the idea that we can > > >>>>>> have a core biojava & things get added around it & it might encourage > > >>>>>> other users to start developing their own modules for any > > >>>>>> formats/purpose they want.
> > >>>>>> > > >>>>>> Richard Holland wrote: > > >>>>>>>> What format options are there from blast? Just thinking if it supports > > >>>>>>>> CIGAR or something like that are we better providing a parser for that > > >>>>>>>> format & saying that we do not support the traditional blast output? > > >>>>>>>> That said it doesn't help us when that format changes so maybe what is > > >>>>>>>> needed is a way to push out parser changes without requiring a full > > >>>>>>>> biojava release (v3 discussion) ... > > >>>>>>> Exactly! So the modular idea would work nicely here - we could have a > > >>>>>>> blast module and only update that single module (which would be its own > > >>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such > > >>>>>>> would no longer happen, except maybe for some kind of core BioJava > > >>>>>>> module. Everything would be done in terms of individual module+JAR > > >>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one > > >>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> > >>>>>>> > > >>>>>>> cheers, > > >>>>>>> Richard > > >>>>>> _______________________________________________ > > >>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > > >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
From ayates at ebi.ac.uk Tue Dec 4 09:12:51 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 04 Dec 2007 09:12:51 +0000 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com> References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> <474FD737.9080801@ebi.ac.uk> <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com> <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com> Message-ID: <47551A13.8000407@ebi.ac.uk> I think avoiding the JAR explosion is a very good idea. I think it helps if every JAR choice has to go through a process of issue/vote, which makes it a bit harder to introduce a new JAR without others knowing what it is, why the submitter has chosen it & why it is better than the alternatives; this really could be as simple as "I've used this one & its API is easier to understand". Same thing is seen in all libraries. Just looking at the Spring synchronized collection factories you can see it testing for Java versions & class existence to know what type of synchronized collection it can create. Also XML APIs are among the worst for JAR dependency hell since everyone has their favourite parser (just try running a program in Ant without forking & using two XML APIs ... it's fun).
Using XPath & a generic retrieval system could give us this flexibility we all seem to be wanting. It more depends on is there a good enough XPath implementation that can handle the XML files we'll be pushing through it (why is it I think the answer is no). Hmmm it does but how many bioinformaticians use the ASN.1 syntax though compared to flat file & XML? I'm guessing that flat file is the winner here with XML & ASN.1 coming in reasonably equal*. If this is true then yes I'd be more tempted to write a ASN.1 parser & then support XML. Andy (not a Mark in the slightest) * Please note that this is a finger in the air guess with no actual statistical backing one way or another :).
From smh1008 at cam.ac.uk Tue Dec 4 09:45:16 2007 From: smh1008 at cam.ac.uk (David Huen) Date: 04 Dec 2007 09:45:16 +0000 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: <47551A13.8000407@ebi.ac.uk> References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> <474FD737.9080801@ebi.ac.uk> <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com> <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com> <47551A13.8000407@ebi.ac.uk> Message-ID: On Dec 4 2007, Andy Yates wrote: >be wanting. It more depends on is there a good enough XPath >implementation that can handle the XML files we'll be pushing through it >(why is it I think the answer is no). > And why is it I think you are right? :-) Some of the XML files used by bioinformaticians can be horrendously large and at least some of the XML packages do appear to behave like they bring the whole file into a memory representation before allowing you to work on it. I think memory use and performance was a major factor in the current BJ implementations adopting an event-based model even though it's more difficult to use usually.
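For illustration only (this is not BioJava code): a minimal sketch of the event-based style described above, using the StAX pull API that Java 6 ships with. The class name `StreamingCount` and the element name `Hit` are made up for the example. The point is that only the current event is held at any time, so memory use does not grow with the size of the file.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example class, not part of BioJava. */
final class StreamingCount {

    /** Counts start tags called {@code name} without building a tree. */
    static int count(Reader xml, String name) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int seen = 0;
            while (reader.hasNext()) {
                // next() advances to the following event; earlier events are not retained
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    seen++;
                }
            }
            reader.close();
            return seen;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read the XML stream", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Hits><Hit/><Hit/><Other/></Hits>";
        System.out.println(count(new StringReader(xml), "Hit")); // prints 2
    }
}
```

The cost is the one noted above: the calling code has to track where it is in the document itself, which is why this style is usually harder to use than a tree.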
Regards, DH -- David Huen Dept of Genetics University of Cambridge CB2 3EH U.K. From ayates at ebi.ac.uk Tue Dec 4 10:42:19 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 04 Dec 2007 10:42:19 +0000 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> <474FD737.9080801@ebi.ac.uk> <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com> <93b45ca50712031907j10325c12jcff4984315f6c330@mail.gmail.com> <47551A13.8000407@ebi.ac.uk> Message-ID: <47552F0B.1080302@ebi.ac.uk> David Huen wrote: > On Dec 4 2007, Andy Yates wrote: > > >> be wanting. It more depends on is there a good enough XPath >> implementation that can handle the XML files we'll be pushing through >> it (why is it I think the answer is no). >> > And why is it I think you are right? :-) Lol :) > > Some of the XML files used by bioinformaticians can be horrendously > large and at least some of the XML packages do appear to behave like > they bring the whole file into a memory representation before allowing > you to work on it. I think memory use and performance was a major factor > in the current BJ implementations adopting an event-based model even > though it's more difficult to use usually. > It is one of my biggest concerns that a huge DOM model + BioJava model is going to take up a lot of memory. However, if a SAX, StAX (either one) or DOM based parser is hidden behind a good enough interface, hopefully the implementation used can be up to the user. That said, these goals may be too different & distant for us to be able to do it.
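A minimal sketch of how the hybrid approach discussed in this thread (SAX walks the file, each record becomes a small throwaway DOM, XPath picks values out of it) might sit behind a small interface. It is not BioJava code: `RecordListener`, `HybridReader` and `HybridDemo` are hypothetical names, and the `Hit` / `Hit_id` element names are used only as an example of a repeated record. It uses only the parsers bundled with the JDK and is not namespace aware.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical callback: user code sees one record at a time, never the parser. */
interface RecordListener {
    void record(Element record) throws Exception;
}

/** Hypothetical sketch: SAX reads the file, each record is rebuilt as a small DOM. */
final class HybridReader extends DefaultHandler {
    private final String recordName;
    private final RecordListener listener;
    private final DocumentBuilder builder;
    private final Deque<Node> open = new ArrayDeque<>(); // elements open inside the current record
    private Document doc;                                 // DOM for the current record only

    private HybridReader(String recordName, RecordListener listener, DocumentBuilder builder) {
        this.recordName = recordName;
        this.listener = listener;
        this.builder = builder;
    }

    static void read(Reader xml, String recordName, RecordListener listener) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(xml), new HybridReader(recordName, listener, builder));
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (open.isEmpty()) {
            if (!qName.equals(recordName)) {
                return; // outside any record: nothing is stored
            }
            doc = builder.newDocument(); // fresh, empty DOM for this record
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        Node parent = open.isEmpty() ? doc : open.peek();
        parent.appendChild(element);
        open.push(element);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (open.isEmpty()) {
            return;
        }
        open.pop();
        if (open.isEmpty()) { // the record element itself has just closed
            try {
                listener.record(doc.getDocumentElement());
            } catch (Exception e) {
                throw new SAXException(e);
            }
            doc = null; // drop the reference so the record can be garbage collected
        }
    }
}

final class HybridDemo {
    /** Collects the text of the Hit_id child of every Hit, one Hit at a time. */
    static List<String> hitIds(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> ids = new ArrayList<>();
        try {
            HybridReader.read(new StringReader(xml), "Hit",
                    hit -> ids.add(xpath.evaluate("Hit_id", hit)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return ids;
    }
}
```

Because each record gets its own `Document`, an XPath implementation that works at document level only ever sees that one record, and calling code depends on `RecordListener` rather than on SAX or DOM classes, so the parsing strategy underneath could be swapped.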
Andy From crackeur at comcast.net Thu Dec 6 09:46:25 2007 From: crackeur at comcast.net (jimmy Zhang) Date: Thu, 6 Dec 2007 01:46:25 -0800 Subject: [Biojava-l] SAX, DOM, XPath and Flat files References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com><474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> Message-ID: <004901c837ec$de4766d0$0402a8c0@your55e5f9e3d2> VTD-XML should also be worth mentioning http://vtd-xml.sf.net ----- Original Message ----- From: "Mark Schreiber" To: "Andy Yates" Cc: "biojava-1 mailing list" Sent: Thursday, November 29, 2007 6:28 PM Subject: Re: [Biojava-l] SAX, DOM, XPath and Flat files > Java 5 SDK has both SAX and DOM as standard. I think it has XPath but > not XQuery although XPath is probably more important for this use. > > The DOM model is a direct implementation of the W3C standard which > makes it a little awkward from a java point of view but it is usable. > > Java 6 has StAX (the other one). > > There are a few java API's for parsing ASN.1 mostly developed for the > telco industry, I've never really looked into which is best (anyone > experienced with this?) but we could probably use one to work directly > off NCBI ASN.1 > > - Mark > > On Nov 28, 2007 10:29 PM, Andy Yates wrote: >> Hi Mark, >> >> Okay that sounds like a perfectly sensible way to deal with this. Is >> this kind of parsing model supported in Java5? I only ask as I've not >> done a lot of XML parsing with Java5; more with things like XOM (which I >> think offers a DOM only representation but I'm probably wrong). >> >> That's good. There's not a huge point to have a format & a DTD/XSD and >> then have your files not conform to it. >> >> I was thinking the exact same thing about ASN.1 (well that & it looks >> bleeding horrible to parse but that is an un-educated look at the format >> which I'm sure is a parsable as JSON & the alike). 
>> >> When it comes to flat file parsers I would be happier to provide >> implementations of the more common formats where a viable alternative is >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide >> similar output to the above have a chance to write their own >> parsers/formatters. This is very similar to the current situation but we >> just need to remove dependencies on statically located data structures >> (don't get rid of them completely just give users an option to not use >> them). >> >> I'm not sure how much automatically generated parsers would help us. I >> guess it depends on the data model(s) we use if they are auto-parser >> friendly (which normally means POJO/JavaBean conventions including the >> no-args constructor). >> >> Cool I don't want to exclude flat file parsers completely (if only >> because my group has an interest in BioJava being able to read & write >> flat files) :) >> >> They decided to have HUPO-PSI Format instead :) >> >> Andy >> >> >> Mark Schreiber wrote: >> > Hi - >> > >> > I think in most cases huge XML files in bioinformatics result from a >> > single XML containing multiple repetitive elements. Eg a BLAST XML >> > output with several hits or a GenBankXML with many Sequences. A nice >> > approach I have seen for dealing with these is to use SAX to read over >> > the file and every time it comes to an element it delegates to a DOM >> > object. You then parse the bits of the DOM you want with XPath or >> > convert to objects or something and then when you are finished with >> > that entry everything gets garbage collected and the SAX parser moves >> > to the next element and repeats the whole process. This is a hybrid >> > of event based parsing and object-model based parsing which could let >> > you efficiently deal with huge files. >> > >> > I think the BLAST XML has improved substantially, at least in terms of >> > validating against it's own DTD. 
The DTD itself may not be the best >> > design but that is always a matter of taste and if you are using XPath >> > to get the relevant bits you don't need to make a SAX parser jump >> > through hoops to get them. >> > >> > I agree we will have to keep flat file parsers but we should strongly >> > encourage the use of XML where possible. It is simply easier to deal >> > with. Most biological flat-files were designed for Fortran and mainly >> > for human consumption. There is no obvious validation mechanism. >> > Notably everything in NCBI is derived from ASN.1, what you see in the >> > flatfile is produced from there. I tend to think this means that the >> > ASN.1 is the holy gospel and what you get in the flat file is some >> > translation. Ideally NCBI files should be parsed from the ASN.1 where >> > you can guarantee validation, the more practical alternative is to use >> > the XML which you can at least validate against a DTD. >> > >> > With XML we (Biojava) can say if it validates we will parse it and if >> > it doesn't we may not. With flat files there are so many dodgey >> > variants we cannot say anything. Because XML dtds (or xsd's) have >> > versions it also makes it much easier to have parsers for different >> > versions and the parsing machinery can figure out which is needed. >> > With flat files it is anyones guess what version you are dealing with. >> > >> > Finally parsers can be auto-generated for XML if you have the DTD or >> > XSD. This often doesn't give you an ideal parser but it can be a >> > useful starting point for rapid development. >> > >> > For Biojava v 3 I think we should concentrate on XML parsers first and >> > flat files second. 
if only Fasta had an XML format >> > >> > - Mark >> > >> > On Nov 27, 2007 11:16 PM, Andy Yates wrote: >> >> I was always under the impression that blast's XML output was nearly >> >> as >> >> hard to parse as the flat file format but I do agree that if we can >> >> use >> >> XML whenever we can it would make writing parsers a lot easier >> >> (especially if there are SAX based XPath libraries available). >> >> Actually >> >> this brings up a good question about development of this type of >> >> parser. >> >> The majority of XPath supporting libraries are DOM based which will >> >> mean >> >> large memory usage in some situations but overall providing an easier >> >> coding experience (and hopefully reduce our chances of creating bugs). >> >> Or should we code to the edge cases of someone trying to parse a 1GB >> >> XML? Personally I'd favour the former. >> >> >> >> Going back to the original topic there are going to be situations >> >> where >> >> people want the flat file parsers/writers & I think it's a valid point >> >> to say this is where BioJava is meant to come in & help a developer. >> >> Afterall XML is a computer science problem where as parsing an EMBL >> >> flat >> >> file or blast output is a bioinformatics problem. >> >> >> >> Andy >> >> >> >> >> >> Mark Schreiber wrote: >> >>> For a long time now my feeling has been that we should *only* support >> >>> the XML version of blast output. The other formats are too brittle >> >>> to >> >>> be easy to parse. I also feel similarly about Genbank, EMBL, etc >> >>> that >> >>> may be an extreme view but the power of generic XML parsers and >> >>> things >> >>> like XPath etc really make these formats look very attractive. >> >>> >> >>> - Mark >> >>> >> >>> >> >>> On Nov 27, 2007 7:47 PM, Andy Yates wrote: >> >>>> I think Groovy have adopted a similar system recently & have >> >>>> guidelines >> >>>> for how each module should behave (dependencies, build system etc). 
>> >>>> This >> >>>> enforces the idea that a module whilst not part of the core project >> >>>> must >> >>>> behave in the same manner the core does. I do like the idea that we >> >>>> can >> >>>> have a core biojava & things get added around it & it might >> >>>> encourage >> >>>> other users to start developing their own modules for any >> >>>> formats/purpose they want. >> >>>> >> >>>> Richard Holland wrote: >> >>>>> -----BEGIN PGP SIGNED MESSAGE----- >> >>>>> Hash: SHA1 >> >>>>> >> >>>>>> What format options are there from blast? Just thinking if it >> >>>>>> supports >> >>>>>> CIGAR or something like that are we better providing a parser for >> >>>>>> that >> >>>>>> format & saying that we do not support the traditional blast >> >>>>>> output? >> >>>>>> That said it doesn't help is when that format changes so maybe >> >>>>>> what is >> >>>>>> needed is a way to push out parser changes without requiring a >> >>>>>> full >> >>>>>> biojava release (v3 discussion) ... >> >>>>> Exactly! So the modular idea would work nicely here - we could have >> >>>>> a >> >>>>> blast module and only update that single module (which would be its >> >>>>> own >> >>>>> JAR) whenever the format changes. In a way, BioJava releases as >> >>>>> such >> >>>>> would no longer happen, except maybe for some kind of core BioJava >> >>>>> module. Everything would be done in terms of individual module+JAR >> >>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, >> >>>>> one >> >>>>> for Phylogenetic tools, one for translation/transcription, etc. >> >>>>> etc. 
>> >>>>>
>> >>>>> cheers,
>> >>>>> Richard
>> >>>> _______________________________________________
>> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >>>>
>> >
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ap3 at sanger.ac.uk Thu Dec 6 10:33:17 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 6 Dec 2007 10:33:17 +0000
Subject: [Biojava-l] status update SVN migration
Message-ID:

Hi,

a quick status update of the CVS to SVN migration for BioJava:

George Hartzell created the first svn dumps for the CVS repository. I am
running tests on these to make sure the whole repository has been
exported correctly. For details please see here:
http://biojava.org/wiki/CVS_to_SVN_Migration

A few minor problems have been found during this. As soon as these have
been resolved we will be ready to make the final migration.

In order to speed the migration process up, please commit any
uncommitted changes to CVS in the next couple of days. Once the tests
are finished I will send another notification email which will declare
a CVS freeze a few days after. After this freeze CVS will remain frozen
forever and all new development should happen in SVN. There will also
be a new BioJava release at that point.

Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891
-----------------------------------------------------------------------

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From ap3 at sanger.ac.uk Mon Dec 10 13:26:13 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 10 Dec 2007 13:26:13 +0000
Subject: [Biojava-l] SVN migration: declaring CVS freeze
Message-ID: <099BDD0E-DA16-43C7-88CE-A47CA810D1EE@sanger.ac.uk>

Hi,

for the SVN migration please commit any remaining code to CVS in the
next few days. On Wednesday, December 12th, 18:00 GMT the BioJava CVS
will be frozen.

In the following days the repository will be migrated to Subversion
(SVN). From then on all future development will be happening in the new
SVN repository. All code (+ history) will be available via SVN.

I will send a confirmation email when the new SVN repository becomes
accessible. Detailed instructions on how to check out and commit code
will be sent out at that stage as well.

For more details see:
http://biojava.org/wiki/CVS_to_SVN_Migration

Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891
-----------------------------------------------------------------------

From markjschreiber at gmail.com Wed Dec 12 09:34:28 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 12 Dec 2007 17:34:28 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To:
References: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <93b45ca50712120134w65cfad5dwbaeec2ea19a5a3b4@mail.gmail.com>

Not a bad suggestion. I wasn't aware of how many formats are now in
OWL. It would give us a pretty rapid route to pathway and microarray
object models as well. Jena seems to be a nice package to build upon.
Unfortunately due to the 'striped' rather than nested nature of
ontologies we can forget the SAX vs DOM argument. The Jena OntModel is
all going into memory. Looks like informatics is going to be the memory
pig of the next decade.

- Mark

> May I kindly suggest skipping all of this talk about XML and have us
> jump straight to OWL? ;)
>
>
> http://dev.isb-sib.ch/projects/uniprot-rdf/
>
> michael
>
>

From phossein at umd.edu Sun Dec 16 20:22:15 2007
From: phossein at umd.edu (Parsa Hosseini)
Date: Sun, 16 Dec 2007 15:22:15 -0500 (EST)
Subject: [Biojava-l] Connecting to TAIR
Message-ID: <20071216152215.ACH56650@po5.mail.umd.edu>

Hi,

I'm interested in contributing to the Biojava project.
More specifically, I'm curious about making a package where you can query the TAIR (Arabidopsis.org) website.
Would this be of use or benefit to anyone other than myself?

Thanks,

Parsa
phossein at umd.edu

From markjschreiber at gmail.com Mon Dec 17 07:13:41 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 17 Dec 2007 02:13:41 -0500
Subject: [Biojava-l] Connecting to TAIR
In-Reply-To: <93b45ca50712161918w5d1e9c05t690f2d68b0fe9c@mail.gmail.com>
References: <20071216152215.ACH56650@po5.mail.umd.edu> <93b45ca50712161918w5d1e9c05t690f2d68b0fe9c@mail.gmail.com>
Message-ID: <93b45ca50712162313m7f9f39b7ha2d70b3306468b27@mail.gmail.com>

Hi -

I think this would be of great use to BioJava, especially as TAIR is (I
think) built on Ensembl. There was a suggestion that biojava take over
the Java bindings to Ensembl (Ensj). If you are interested in this then
let us know and someone can point you in the right direction.

- Mark

On Dec 16, 2007 3:22 PM, Parsa Hosseini wrote:
> Hi,
>
> I'm interested in contributing to the Biojava project.
> More specifically, I'm curious about making a package where you can query the TAIR (Arabidopsis.org) website.
> Would this be of use or benefit to anyone other than myself?
>
> Thanks,
>
>
>
> Parsa
> phossein at umd.edu
>

From phossein at umd.edu Tue Dec 18 06:39:21 2007
From: phossein at umd.edu (Parsa Hosseini)
Date: Tue, 18 Dec 2007 01:39:21 -0500 (EST)
Subject: [Biojava-l] Problem finding Restriction Enzyme sites
Message-ID: <20071218013921.ACI39718@po5.mail.umd.edu>

Hi,

I've been using the 'molbio' package to do some Restriction enzyme analysis. I've come across errors when I want to find the corresponding recognition site. I'd be very thankful if I could be led down the path to find where I am going wrong.

Thank you in advance,

Parsa

phossein at umd.edu

------------------------

try {

    SymbolList siteOfEnzyme = DNATools.createDNA("GAATC");
    Sequence sequence = DNATools.createDNASequence("AAAAGAATCTTC", "mySequence");
    RestrictionEnzyme re1 = new RestrictionEnzyme("ECOR1", siteOfEnzyme, 0, 0);
    SimpleThreadPool tr = new SimpleThreadPool();
    RestrictionMapper reMapper = new RestrictionMapper(tr);
    reMapper.addEnzyme(re1);
    System.out.println(reMapper.annotate(sequence));
}
catch (IllegalSymbolException e) {

}
catch (IllegalAlphabetException e) {

}

Exception in thread "Thread-1" org.biojava.bio.BioRuntimeException: Failed to complete search for ECOR1 GAATC (0/0)
        at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:136)
        at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295)
Caused by: java.lang.IllegalArgumentException: RestrictionEnzyme 'ECOR1' is not registered. No precompiled Pattern is available
        at org.biojava.bio.molbio.RestrictionEnzymeManager.getPatterns(RestrictionEnzymeManager.java:280)
        at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:77)
        ...
1 more

From holland at ebi.ac.uk Tue Dec 18 08:56:36 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 18 Dec 2007 08:56:36 +0000
Subject: [Biojava-l] Problem finding Restriction Enzyme sites
In-Reply-To: <20071218013921.ACI39718@po5.mail.umd.edu>
References: <20071218013921.ACI39718@po5.mail.umd.edu>
Message-ID: <47678B44.1040004@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello.

Your code is throwing an exception saying that it doesn't know what the
definition for ECOR1 is. This is because the molbio package doesn't
contain any restriction enzyme definitions by default, and you haven't
loaded any into it yet (at least, not in the code fragment you quote).

Before you can use the methods in this package, you need to have made a
call to RestrictionEnzymeManager.loadEnzymeFile() and pass it a
reference to a file containing some restriction enzyme definitions in
REBASE format #31.

Once you've used this method to load a REBASE file containing a
definition for the ECOR1 enzyme, you'll be able to use the code.

cheers,
Richard

Parsa Hosseini wrote:
> Hi,
>
> I've been using the 'molbio' package to do some Restriction enzyme analysis. I've come across errors when I want to find the corresponding recognition site. I'd be very thankful if I could be led down the path to find where I am going wrong.
>
> Thank you in advance,
>
> Parsa
>
> phossein at umd.edu
>
> ------------------------
>
> try {
>
>     SymbolList siteOfEnzyme = DNATools.createDNA("GAATC");
>     Sequence sequence = DNATools.createDNASequence("AAAAGAATCTTC", "mySequence");
>     RestrictionEnzyme re1 = new RestrictionEnzyme("ECOR1", siteOfEnzyme, 0, 0);
>     SimpleThreadPool tr = new SimpleThreadPool();
>     RestrictionMapper reMapper = new RestrictionMapper(tr);
>     reMapper.addEnzyme(re1);
>     System.out.println(reMapper.annotate(sequence));
> }
> catch (IllegalSymbolException e) {
>
> }
> catch (IllegalAlphabetException e) {
>
> }
>
> Exception in thread "Thread-1" org.biojava.bio.BioRuntimeException: Failed to complete search for ECOR1 GAATC (0/0)
>         at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:136)
>         at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295)
> Caused by: java.lang.IllegalArgumentException: RestrictionEnzyme 'ECOR1' is not registered. No precompiled Pattern is available
>         at org.biojava.bio.molbio.RestrictionEnzymeManager.getPatterns(RestrictionEnzymeManager.java:280)
>         at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:77)
>         ... 1 more
>

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel.
+44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHZ4tD4C5LeMEKA/QRApOrAKCYZ9rXbPzyKuAV4DrwaSL424s/wACfSIDC
K6TFXwVdQSrV4bTG4DGKYhI=
=Nc4w
-----END PGP SIGNATURE-----

From phossein at umd.edu Thu Dec 27 03:46:14 2007
From: phossein at umd.edu (Parsa Hosseini)
Date: Wed, 26 Dec 2007 22:46:14 -0500 (EST)
Subject: [Biojava-l] New tutorials
Message-ID: <20071226224614.ACM82423@po5.mail.umd.edu>

Hi all,

I'm in the process of writing some new tutorials. Maybe they can go in the Tutorial section, or the 'BioJava in Anger' section. Either way, I'm interested in writing on the following topics:

*Downloading GI's from NCBI
*Pairwise alignment using global and local means
*Restriction Enzyme analysis

.... and if there is anything else, please let me know.
I'm going to throw in some graphics and nice illustrations using some corel or photoshop too.
If there are any other things to add, please feel free to let me know.

Parsa Hosseini
phossein at umd.edu

From markjschreiber at gmail.com Thu Dec 27 06:29:16 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 27 Dec 2007 14:29:16 +0800
Subject: [Biojava-l] New tutorials
In-Reply-To: <20071226224614.ACM82423@po5.mail.umd.edu>
References: <20071226224614.ACM82423@po5.mail.umd.edu>
Message-ID: <93b45ca50712262229u760f1e48rf0811de60527994a@mail.gmail.com>

Great! The more the better!

I have been thinking for a while that there could also be another
section of examples in the cookbook for examples that don't make much
use of biojava but are biorelated and use java.

- Mark

On Dec 27, 2007 11:46 AM, Parsa Hosseini wrote:
> Hi all,
>
> I'm in the process of writing some new tutorials. Maybe they can go in the Tutorial section, or the 'BioJava in Anger' section.
Either way, I'm interested in writing on the following topics:
>
> *Downloading GI's from NCBI
> *Pairwise alignment using global and local means
> *Restriction Enzyme analysis
>
> .... and if there is anything else, please let me know.
> I'm going to throw in some graphics and nice illustrations using some corel or photoshop too.
> If there are any other things to add, please feel free to let me know.
>
> Parsa Hosseini
> phossein at umd.edu
>

From david.bourgais at bioxpr.be Thu Dec 27 14:11:33 2007
From: david.bourgais at bioxpr.be (David Bourgais)
Date: Thu, 27 Dec 2007 15:11:33 +0100
Subject: [Biojava-l] About ABITrace class
Message-ID: <1198764693.20435.5.camel@bioxpr-04.ct.fundp.ac.be>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From david.bourgais at bioxpr.be Fri Dec 28 08:54:30 2007
From: david.bourgais at bioxpr.be (David Bourgais)
Date: Fri, 28 Dec 2007 09:54:30 +0100
Subject: [Biojava-l] About ABITrace class
Message-ID: <1198832070.29755.0.camel@bioxpr-04.ct.fundp.ac.be>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From markjschreiber at gmail.com Sat Dec 29 03:02:26 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 29 Dec 2007 11:02:26 +0800
Subject: [Biojava-l] Applet not able to find DNATools class.
In-Reply-To: <474A8A1C.4020901@ebi.ac.uk>
References: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu> <474A8A1C.4020901@ebi.ac.uk>
Message-ID: <93b45ca50712281902v700366f7s3b00595efa681108@mail.gmail.com>

Hi -

This has come up several times on the mailing lists. You could probably
find a resolution in the archives.

- Mark

On Nov 26, 2007 4:55 PM, Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Sounds like either a classpath problem (in which case check your
> classpath to ensure all parts of biojava are definitely on it) or a
> broken biojava.jar (in which case you need to recompile/redownload it).
>
> cheers,
> Richard
>
> Abhinav Ram Karhu wrote:
> > Hello all,
> > I am having an error while loading the applet.
> >
> > I am getting the following stack trace.
> >
> > java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
> >         at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
> >         at Trace.init(Trace.java:161)
> >         at sun.applet.AppletPanel.run(Unknown Source)
> >         at java.lang.Thread.run(Unknown Source)
> >
> > I have the directory structure in which I am having my class files, the php page and the biojava jar files together in one folder.
> >
> > I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java.
> >
> > My applet code in the php page looks like this:
> >
> >
> >
> > Please suggest if I am missing something.
> >
> > Thanks in advance.
> >
> > Abhinav
> >
> >
> >
>
> - --
> Richard Holland (BioMart)
> EMBL EBI, Wellcome Trust Genome Campus,
> Hinxton, Cambridgeshire CB10 1SD, UK
> Tel.
+44 (0)1223 494416
>
> http://www.biomart.org/
> http://www.biojava.org/
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHSoob4C5LeMEKA/QRAsfkAJ9SlwIzDulzSDQpAfgh0alISRsplACcDqUx
> uyQUEmRFEWTdnEHsm7k2lg0=
> =SWHu
> -----END PGP SIGNATURE-----
>

From russ at kepler-eng.com Sun Dec 30 17:04:05 2007
From: russ at kepler-eng.com (Russ Kepler)
Date: Sun, 30 Dec 2007 10:04:05 -0700
Subject: [Biojava-l] About ABITrace class
In-Reply-To: <1198832070.29755.0.camel@bioxpr-04.ct.fundp.ac.be>
References: <1198832070.29755.0.camel@bioxpr-04.ct.fundp.ac.be>
Message-ID: <200712301004.05398.russ@kepler-eng.com>

On Friday 28 December 2007 01:54:30 David Bourgais wrote:
> Hello BioJava users
>
> I am writing a little program using BioJava 1.5 and JFreeChart 1.0.8.
> My aim is to display with JFreeChart a chromatogram by reading an ab1
> file with the BioJava help.
> Okay, I can display my chromatogram. But, my chromatogram size is 13301
> (using getTrace(AtomicSymbol base)) when my sequence size is 1110 bp
> (according to the BufferedImage produced with BioJava).
> Why this difference? How can I correct my program in order to see
> correctly my chromatogram?

If I understand you correctly you're looking at the trace data (multiple
points per peak) and comparing to the base calls (one call per peak).

In my experience folks interested in the chromatogram would prefer to
see an image with the chromatogram spaced properly, while those
primarily interested in the base calls prefer to have the chromatogram
scaled around a fixed width for the basecall. There's code to do either
in BioJava, but the former might not be set up to generate a
BufferedImage (if you can't find it I might be able to help some).
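
The difference between trace samples and base calls described in the reply above can be made concrete with a little arithmetic. The sketch below is self-contained and does not call BioJava: the class TraceScaling, both of its methods, and the peak positions used in main are hypothetical names and values made up for illustration. Only the two sizes (13301 trace points, 1110 bases) come from the question. It assumes the peak positions are strictly ascending sample indices.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of BioJava. */
public class TraceScaling {

    /** Average number of trace samples recorded per called base. */
    public static double samplesPerBase(int traceLength, int baseCount) {
        if (baseCount <= 0) {
            throw new IllegalArgumentException("baseCount must be positive");
        }
        return (double) traceLength / baseCount;
    }

    /**
     * Maps a trace sample index to an x coordinate so that the peak of
     * base call i lands at i * pixelsPerBase (the "fixed width per base
     * call" view). Samples between two peaks are interpolated linearly;
     * samples outside the called region are clamped.
     *
     * @param peaks strictly ascending sample indices of the called peaks
     */
    public static double sampleToX(int[] peaks, int sample, double pixelsPerBase) {
        if (peaks.length == 0) {
            throw new IllegalArgumentException("peaks must not be empty");
        }
        int idx = Arrays.binarySearch(peaks, sample);
        if (idx >= 0) {
            return idx * pixelsPerBase;                 // sample sits exactly on a peak
        }
        int right = -idx - 1;                           // index of first peak after the sample
        if (right == 0) {
            return 0.0;                                 // before the first peak: clamp
        }
        if (right == peaks.length) {
            return (peaks.length - 1) * pixelsPerBase;  // after the last peak: clamp
        }
        int left = right - 1;
        double fraction = (double) (sample - peaks[left]) / (peaks[right] - peaks[left]);
        return (left + fraction) * pixelsPerBase;
    }

    public static void main(String[] args) {
        // Sizes reported in the thread: roughly 12 trace samples per base.
        System.out.println(samplesPerBase(13301, 1110));

        // Made-up peak positions for four base calls.
        int[] peaks = {5, 17, 30, 41};
        for (int sample = 0; sample <= 45; sample += 5) {
            System.out.println(sample + " -> " + sampleToX(peaks, sample, 10.0));
        }
    }
}
```

Plotting intensities against the raw sample index keeps the chromatogram's natural spacing, while plotting them against the value returned by sampleToX gives every base call the same width; which of the two is wanted depends on whether the reader cares more about peak shape or about the calls.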

From ap3 at sanger.ac.uk Sun Dec 30 18:43:36 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun, 30 Dec 2007 18:43:36 +0000
Subject: [Biojava-l] biojava 1.6 release candidate 1
Message-ID: <9BA76ACB-33B8-4E84-96F5-04981DE3CD62@sanger.ac.uk>

Hi,

I prepared release candidate 1 for the next biojava release for
download at http://www.biojava.org/download/bj16/rc1/

This release candidate is built from the new svn repository, as a test
for it. It contains numerous bug fixes, improvements in the protein
structure modules and better documentation. This release will be the
first one where biojava will be using Java 1.5. If you are still using
the old Java 1.4, please consider an upgrade, or continue using the
previous biojava 1.5 release.

If all is fine with it, I will make this the next biojava release.

Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891
-----------------------------------------------------------------------