From ap3 at sanger.ac.uk Sun Sep 2 06:02:47 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun, 2 Sep 2007 11:02:47 +0100
Subject: [Biojava-dev] cruisecontrol
In-Reply-To: <93b45ca50708281716x3c2cadc9me9cb0b00b52647a2@mail.gmail.com>
References: <93b45ca50708281716x3c2cadc9me9cb0b00b52647a2@mail.gmail.com>
Message-ID:

Hi,

CruiseControl is now running at

http://www.spice-3d.org/cruise/

It does the following:
 * triggers a new build 20 minutes after a new commit to CVS
 * runs the JUnit tests
 * builds the javadocs
 * provides the latest biojava.jar for download
 * sends a notification email to this list if something goes wrong
   (and only then)

This basically works with a chain of Ant scripts that are triggered by
CruiseControl, so it is easy to add other functionality, exchange CVS
for Subversion, etc.

Andreas

On 29 Aug 2007, at 01:16, Mark Schreiber wrote:

> Sounds good.
>
> Thomas had a script running off his home machine a while ago for
> nightly builds which I have missed since he stopped running it.
>
> Notifications of failed tests would be good too.
>
> - Mark
>
> On 8/28/07, Andreas Prlic wrote:
>> Hi biojava - devs,
>>
>> would you be interested in getting CruiseControl running for BioJava?
>>
>> It would allow us to
>>
>> * provide nightly builds of biojava,
>> * run unit tests at regular intervals,
>> * get a notification email sent to biojava-dev if the CVS does not
>>   build
>>
>> http://cruisecontrol.sourceforge.net/
>>
>> If there is interest for this I will set it up,
>>
>> Andreas
>>
>> -----------------------------------------------------------------------
>>
>> Andreas Prlic      Wellcome Trust Sanger Institute
>>                    Hinxton, Cambridge CB10 1SA, UK
>>                    +44 (0) 1223 49 6891
>>
>> -----------------------------------------------------------------------
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev

From holland at ebi.ac.uk Mon Sep 3 03:42:10 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 03 Sep 2007 08:42:10 +0100
Subject: [Biojava-dev] cruisecontrol
In-Reply-To:
References: <93b45ca50708281716x3c2cadc9me9cb0b00b52647a2@mail.gmail.com>
Message-ID: <46DBBAD2.3040900@ebi.ac.uk>

That's really cool. Is there any way of integrating it into the open-bio
servers that the rest of BioJava lives on?

From ap3 at sanger.ac.uk Mon Sep 3 09:15:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 3 Sep 2007 14:15:35 +0100
Subject: [Biojava-dev] cruisecontrol
In-Reply-To: <46DBBAD2.3040900@ebi.ac.uk>
References: <93b45ca50708281716x3c2cadc9me9cb0b00b52647a2@mail.gmail.com> <46DBBAD2.3040900@ebi.ac.uk>
Message-ID: <948C3625-6C4B-4D9B-A739-980579CC12D9@sanger.ac.uk>

> is there any way of integrating it into the open-bio servers that the
> rest of BioJava lives on?

The simplest way is to link to it. Another possibility is to run it
directly on the open-bio servers, but I do not have any admin
permissions to set this up.

Andreas

From bugzilla-daemon at portal.open-bio.org Thu Sep 6 10:11:26 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 Sep 2007 10:11:26 -0400
Subject: [Biojava-dev] [Bug 2330] DP/ Profile HMM bug
In-Reply-To:
Message-ID: <200709061411.l86EBQbm011955@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2330

mark.schreiber at novartis.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from mark.schreiber at novartis.com 2007-09-06 10:11 EST -------
Added change listener to LinearAlphabetIndex so that it rebuilds as
Symbols are added and removed from the Alphabet. Interestingly,
SimpleAlphabet was not emitting a ChangeEvent when removing Symbols;
this is fixed now.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sat Sep 8 19:02:38 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 8 Sep 2007 19:02:38 -0400
Subject: [Biojava-dev] [Bug 2359] New: SingleDP deserialization fails
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2359

           Summary: SingleDP deserialization fails
           Product: BioJava
           Version: 1.5
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: dist/dp
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: daniel.rohrbach at web.de

It isn't possible to load an instance of SingleDP via serialization.
SingleDP, which is serializable, inherits from DP, which is not
serializable. Loading the Object works well.

I used biojava 1.5, latest build 9/8/07 2:22 AM, but the same exception
occurs in all 1.5.

The reason for the bug is that SingleDP extends DP, which is not
serializable. In that case it must implement a no-args constructor, but
it doesn't! Because of that, the same should occur for PairwiseDP.

The stack trace:

java.io.InvalidClassException: org.biojava.bio.dp.onehead.SingleDP; org.biojava.bio.dp.onehead.SingleDP; no valid constructor
        at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:713)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1733)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at biojavabugs.SingleDBBug.main(SingleDBBug.java:92)
Caused by: java.io.InvalidClassException: org.biojava.bio.dp.onehead.SingleDP; no valid constructor
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:471)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:310)

And the code I used to cause the bug:

    //create the HMM
    ProfileHMM hmm = new ProfileHMM(ProteinTools.getAlphabet(), 12,
            DistributionFactory.DEFAULT, DistributionFactory.DEFAULT,
            "biojava profile hmm");
    //create a SingleDP object which we want to save and load
    DP dp = (SingleDP) DPFactory.DEFAULT.createDP(hmm);
    try {
        //
        // saving
        //
        // the filename
        File load = new File("/home/dani/Desktop/dp");
        FilePermission fp = new FilePermission(load.getAbsolutePath(), "write");
        if(!load.createNewFile()) {
            throw new IOException("file '" + load.getAbsolutePath()
                    + "' could not be created!");
        }
        FileOutputStream fos = new FileOutputStream(load);
        ObjectOutputStream oos = new ObjectOutputStream(fos);
        //store object to disk
        oos.writeObject(dp);
        oos.close();
        //
        // loading
        //
        // try to load the SingleDP object
        fp = new FilePermission(load.getAbsolutePath(), "read");
        FileInputStream fis = new FileInputStream(load);
        ObjectInputStream ois = new ObjectInputStream(fis);
        Vector v = new Vector();
        //here is where the EXCEPTION occurs
        Object o = ois.readObject();
        v.add(o);
        System.out.println("loaded Object!");
        // System.out.println(obj.toString());
    } catch (ClassNotFoundException ex) {
        ex.printStackTrace();
    } catch (IOException ex) {
        ex.printStackTrace();
    }

From bugzilla-daemon at portal.open-bio.org Sat Sep 8 19:20:12 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 8 Sep 2007 19:20:12 -0400
Subject: [Biojava-dev] [Bug 2360] New: saving of ProfileHmm cause NullPointerException
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2360

           Summary: saving of ProfileHmm cause NullPointerException
           Product: BioJava
           Version: 1.5
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: dist/dp
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: daniel.rohrbach at web.de

Saving an untrained ProfileHMM via serialization causes a
NullPointerException. After training the model, saving works well.

I used biojava 1.5, latest build 9/8/07 2:22 AM.

The stack trace:

Exception in thread "main" java.lang.NullPointerException
        at org.biojava.bio.dist.SimpleDistribution.writeObject(SimpleDistribution.java:79)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
        at java.util.HashSet.writeObject(HashSet.java:267)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
        at biojavabugs.SingleDBBug.main(SingleDBBug.java:66)

And the code I used to create the exception:

    //create the HMM
    ProfileHMM hmm = new ProfileHMM(ProteinTools.getAlphabet(), 12,
            DistributionFactory.DEFAULT, DistributionFactory.DEFAULT,
            "biojava profile hmm");
    // the filename
    File load = new File("/home/dani/Desktop/hmm");
    FilePermission fp = new FilePermission(load.getAbsolutePath(), "write");
    if(!load.createNewFile()) {
        throw new IOException("file '" + load.getAbsolutePath()
                + "' could not be created!");
    }
    FileOutputStream fos = new FileOutputStream(load);
    ObjectOutputStream oos = new ObjectOutputStream(fos);
    //store object to disk
    //here comes the exception
    oos.writeObject(hmm);
    oos.close();
    //create a SingleDP object which we want to save and load
    DP dp = (SingleDP) DPFactory.DEFAULT.createDP(hmm);
    PairwiseDP pdp = new PairwiseDP(hmm, new DPInterpreter.Maker());

From jflatow at northwestern.edu Fri Sep 14 19:15:55 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Fri, 14 Sep 2007 18:15:55 -0500
Subject: [Biojava-dev] New Developer
Message-ID: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu>

Hi all,

I am interested in getting involved with the BioJava community. I have
just joined a Bioinformatics Core, and we use a lot of R/BioConductor
right now; however, we have some new projects that we would like to
begin working on in BioJava. I have checked out the source and compiled
it, but I noticed that the README was slightly out of sync with the
project (the build targets listed are not the same as those in
build.xml). I made the updates to the README and would have liked to
commit them, but as I am sure you are all aware, I do not have write
access. I cannot yet say how much or little I would be able to
contribute to the project, or even in what areas; however, I think it
could be beneficial to the community if I were permitted to make changes
like this. I am sure synchronizing documentation is the last thing on
your minds, and often it is hard to see that instructions might be
unclear when you have been working on a project for a long time. I also
think it could be a good opportunity to get to know the community and
perhaps ease my way into becoming a more active developer.
Please let me know what you think!

Thanks!
Jared

From markjschreiber at gmail.com Sat Sep 15 04:54:56 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 15 Sep 2007 16:54:56 +0800
Subject: [Biojava-dev] New Developer
In-Reply-To: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu>
References: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu>
Message-ID: <93b45ca50709150154y437eaa91h68c422fad235565b@mail.gmail.com>

Hi Jared -

It's great to have people checking and reporting these things. A CVS
account can be arranged if you make regular contributions; in the
meantime you can email a patch or fix to the dev list. Because the
open-bio lists block attachments (and even HTML email) to prevent
spamming, the easiest way to submit a fix is as text in the body of your
email. One of the core developers can then check it in.

Additionally, if you notice any problems with the documentation at
www.biojava.org, please feel free to correct the wiki.

Finally, if you notice bugs, please report them to the bugzilla site
linked from the main page of biojava.org so that we can track them. Even
better if you can submit a possible fix at the same time, so we can make
the change and create a test to prevent it from re-emerging later.

Thanks for your help. Contributions are always appreciated.

- Mark

From jflatow at northwestern.edu Sat Sep 15 14:53:03 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Sat, 15 Sep 2007 13:53:03 -0500
Subject: [Biojava-dev] New Developer
In-Reply-To: <93dd9ad00709141844j36433c26t6f9e4e47d6cac989@mail.gmail.com>
References: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu> <93dd9ad00709141844j36433c26t6f9e4e47d6cac989@mail.gmail.com>
Message-ID:

Sounds like a great suggestion to me! This would be much more convenient
for me than having to email text (which kind of defeats the purpose of
the version control system in my opinion, and will likely clutter the
list). Whatever y'all decide will be fine, but I like David's idea!

Best Regards,
Jared

On Sep 14, 2007, at 8:44 PM, David Barbosa Feitosa wrote:

> Maybe you can create a branch for this, and ask somebody to inspect
> the changes.
> After inspection, somebody with write access can merge your changes
> with the main development branch.
> Only a suggestion :-)

From holland at ebi.ac.uk Wed Sep 19 06:31:06 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 19 Sep 2007 11:31:06 +0100
Subject: [Biojava-dev] The future of BioJava
Message-ID: <46F0FA6A.1030404@ebi.ac.uk>

Hi all.

We are considering moving on to start work on BioJava3, which will
resolve many of the issues of usability and maintainability that plague
the existing versions of BioJava.

I have set up a wiki page containing a preliminary outline of
intentions:

http://biojava.org/wiki/BioJava3_Proposal

Please could as many people as possible update this page with your
comments, suggestions, ideas and changes.
We want to know what technologies or design patterns you feel would be
suitable for various parts, how the new code should be structured and
organised, what degree of modularisation would be appropriate, and where
the line between biological problems and more generalised Java or
programming problems should be drawn. Basically we want comments on
anything you can think of.

You should make your comments by directly modifying the page. Please do
be constructive - if something's a bad idea, we want to know about it,
but we'd appreciate it if you could also suggest a better alternative.
We're open to all suggestions and will consider everything.

We aim to use this page to flesh out a detailed plan for what should
happen next. I will act as moderator and use the contents of the final
page as the basis of a detailed plan of action early next year.

cheers,
Richard

PS. This is sent to biojava-dev. I'll send it also to biojava-l when
there are more details and clearer intentions, meaning users are less
likely to get scared. From biojava-l I'll ask for features that users
would like to see. This is likely to be around November time.

From ap3 at sanger.ac.uk Wed Sep 19 13:33:57 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Wed, 19 Sep 2007 18:33:57 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F0FA6A.1030404@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk>
Message-ID: <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>

Hi,

A question related to the discussion of how to design a future BioJava
is which parts of BioJava are being actively used, and how to improve
these.

So what are the most frequently used bits of BioJava? One way to look at
this is to go to the web stats and see how many hits we have got on our
documentation web pages.

In an ideal world BioJava would be so simple to use that nobody needs to
read any documentation. Unfortunately we are far away from this, so
looking at these stats gives an impression of

 * topics / functionality which are of particular interest to the
   community
 * topics / functionality which might not be straightforward to use,
   so that there are many hits on these pages.

A look at the web stats from the last couple of months gives these top
10 most frequently accessed Cookbook pages. The list is ordered by
number of page views:

  1. /wiki/BioJava:Cookbook:Alphabets
  2. /wiki/BioJava:CookBook:Blast:Parser
  3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta
  4. /wiki/BioJava:Cookbook:SeqIO:ReadGES
  5. /wiki/BioJava:CookBook:DP:PairWise2
  6. /wiki/BioJava:CookBook:PDB:read
  7. /wiki/BioJava:Cookbook:Sequence
  8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta
  9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI
 10. /wiki/BioJava:CookBook:Fasta:Parse

I would group these pages into 2 groups:
 A) How to work with core concepts of BioJava
 B) How to use a functionality of BioJava to achieve a certain goal

The "conceptual" pages (A) I would identify as
 * How to get an Alphabet
 * How to make a Sequence Object from a String or make a Sequence Object
   back into a String

The "functionality" pages (B) I would summarize as
 * How to parse a Blast output
 * How to read sequences from a Fasta file
 * How to read a GenBank, SwissProt or EMBL file
 * How to generate a global or local alignment with the Needleman-Wunsch
   or the Smith-Waterman algorithm
 * How to read a protein structure - PDB file
 * How to export a sequence to fasta
 * How to view a sequence in a gui
 * How to parse a Fasta database search output file

As a conclusion I would suggest that BioJava should have the goal to
provide easy access to the core "functionalities" (group B).
I believe that we should try to keep the "concepts" that are being used
to achieve these functionalities as simple as possible. In this sense, I
feel that we have too many hits on the group A pages.

Andreas

From holland at ebi.ac.uk Thu Sep 20 03:57:49 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 20 Sep 2007 08:57:49 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
Message-ID: <46F227FD.6020807@ebi.ac.uk>

I totally agree.

Can you post a short summary of this to the Wiki page?

Not all aspects of BioJava are documented, leading people either to give
up, consult the JavaDocs online, or post a message to biojava-l or
biojava-dev.

Is it possible to get similar stats to the ones you have calculated for
the JavaDoc pages on our website?

Also, is it possible to build some kind of index over the mailing list
archives to pull out the most frequently used terms?

cheers,
Richard

From ayates at ebi.ac.uk Thu Sep 20 04:55:13 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 20 Sep 2007 09:55:13 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F227FD.6020807@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk>
Message-ID: <46F23571.4050908@ebi.ac.uk>

Hi,

I would say yes to this as well. It is very important to know what green
people are attempting to do with BioJava rather than us assuming that we
know :). There are parts of BioJava where the code is not flexible
enough for other people who want to use the code base, and other areas
where it is too flexible.
I've talked to quite a few people over the years who have used biojava for simple & complex applications and they all seem to come back round to a few key problems: * Sequence & SymbolLists are strange and why can't I use a String - All of this makes a lot more sense if you know about the flyweight pattern; if not it just seems very strange. * I have a format that's EMBL like. Can I parse it using Biojava? * How do I read in a FASTA file? * How can I get X from this chromatogram & can I parse my specific trace format into a BioJava object? As Andreas said it's the occurrence of the category A problems that are the most worrying. In terms of sequences I think I can see why people have a problem with it. Just if we take this as an example: I have my DNA sequence in a String I can substring it, perform a regular expression over it, replace sections, pad it out, format it & so on. If I have a Sequence object I can perform most of these actions but the interface to them seems unintuitive. Things like calling seqString() to get the String back out from a sequence rather than calling toString(). Also lets say I want to use a sequence as a key in a hash map or ask if two sequences are equal (using the old sequence objects) ... at the moment I'd have to convert Sequence -> String to perform the comparison (and that doesn't include checking a Sequence for alphabet equality). I know this sounds like nit-picking & for people who have used biojava extensively a lot of this makes sense. For someone new to the project it seems like we've done something just for the sake of it and we need to get rid of that feeling which I'm sure will happen if we address the category A problem. The rest will fall into place :) Andy Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I totally agree. > > Can you post a short summary of this to the Wiki page? 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFG8if94C5LeMEKA/QRAkZ7AJ0a2xaU717XFfrX4eCc/wmPN/OL2ACfZMHi
> U21o+ZfVD5XOqT1mR7STp6Q=
> =dct8
> -----END PGP SIGNATURE-----

From holland at ebi.ac.uk Thu Sep 20 05:04:53 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 20 Sep 2007 10:04:53 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F23571.4050908@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk>
 <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
 <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk>
Message-ID: <46F237B5.107@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is one of my main bugbears too. I've never quite understood why we
can't just use Strings, and resort to SymbolLists only when more
advanced manipulation is required (e.g. quality scores for each base).
After all, a String is a memory word overhead (32- or 64-bits) plus
16-bits (unicode) per character, but most SymbolList implementations are
a memory word overhead plus an additional entire memory word per Symbol,
each word being a pointer to the memory location where the Symbol
singleton lives. So SymbolLists actually use more memory than Strings,
not less.

(This is not true for CompressedSymbolList which represents sequences as
a sequence of bits, grouped into groups large enough to uniquely
identify any single symbol in the alphabet - e.g. 2 bits for DNA).

As you say, most users just want to read a sequence, sublist it, maybe
reverse comp it or run some simple search over it. This can all easily
be achieved straight from String format.

The other 'category A' problems are equally important. Could you add a
section to the Wiki about these and the 'category B' problems?
Then we can use this as a priority use-case list when it comes to actual
development.

cheers,
Richard

Andy Yates wrote:
> Hi,
>
> I would say yes to this as well. It is very important to know what green
> people are attempting to do with BioJava rather than us assuming that we
> know :). There are parts in BioJava where the flexibility of the code is
> not sufficient for other people who want to use the code base & in other
> areas too flexible.
>
> I've talked to quite a few people over the years who have used biojava
> for simple & complex applications and they all seem to come back round
> to a few key problems:
>
> * Sequence & SymbolLists are strange and why can't I use a String - All
> of this makes a lot more sense if you know about the flyweight pattern;
> if not it just seems very strange.
>
> * I have a format that's EMBL like. Can I parse it using Biojava?
>
> * How do I read in a FASTA file?
>
> * How can I get X from this chromatogram & can I parse my specific trace
> format into a BioJava object?
>
> As Andreas said it's the occurrence of the category A problems that are
> the most worrying. In terms of sequences I think I can see why people
> have a problem with it.
>
> Just if we take this as an example:
>
> I have my DNA sequence in a String I can substring it, perform a regular
> expression over it, replace sections, pad it out, format it & so on. If
> I have a Sequence object I can perform most of these actions but the
> interface to them seems unintuitive. Things like calling seqString() to
> get the String back out from a sequence rather than calling toString().
> Also let's say I want to use a sequence as a key in a hash map or ask if
> two sequences are equal (using the old sequence objects) ... at the
> moment I'd have to convert Sequence -> String to perform the comparison
> (and that doesn't include checking a Sequence for alphabet equality).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB
8RPZSfbrr9Nfbk3AlqqAet8=
=K3qH
-----END PGP SIGNATURE-----

From markjschreiber at gmail.com Thu Sep 20 05:28:14 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 20 Sep 2007 17:28:14 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F237B5.107@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk>
 <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
 <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk>
 <46F237B5.107@ebi.ac.uk>
Message-ID: <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>

The main value of the Symbol representation comes in when you do
Distributions and DP, which is really why Matthew and Thomas developed
it. Quite probably why they developed biojava at all. If you are just
pushing data around, which seems to be most applications, then Strings
are better.

I have previously proposed separating the Symbol, Alphabet, DP and
Dist from the rest of the packages because they have value well beyond
biology, but an equal argument would be that most bio stuff doesn't
need this level of analysis.
If you only want to convert EMBL to Fasta or read a BLAST result you
don't need it.

For those who want to read in EMBL and compute some Distribution or
run a Hidden Markov Model then I would propose the conversion of
Stringy sequences to SymbolLists at the point when it is needed not at
the point when you read them in. Given that almost all I/O of
sequence starts and ends as a String the point where you convert to
Symbols doesn't matter much. The only question is do you need to
convert to Symbols for the analysis you are doing?

(Sorry for not putting this on the wiki, I'll do it later).

- Mark

On 9/20/07, Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is one of my main bugbears too. I've never quite understood why we
> can't just use Strings, and resort to SymbolLists only when more
> advanced manipulation is required (e.g. quality scores for each base).
> After all, a String is a memory word overhead (32- or 64-bits) plus
> 16-bits (unicode) per character, but most SymbolList implementations are
> a memory word overhead plus an additional entire memory word per Symbol,
> each word being a pointer to the memory location where the Symbol
> singleton lives. So SymbolLists actually use more memory than Strings,
> not less.
>
> (This is not true for CompressedSymbolList which represents sequences as
> a sequence of bits, grouped into groups large enough to uniquely
> identify any single symbol in the alphabet - e.g. 2 bits for DNA).
>
> As you say, most users just want to read a sequence, sublist it, maybe
> reverse comp it or run some simple search over it. This can all easily
> be achieved straight from String format.
>
> The other 'category A' problems are equally important. Could you add a
> section to the Wiki about these and the 'category B' problems? Then we
> can use this as a priority use-case list when it comes to actual
> development.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB
> 8RPZSfbrr9Nfbk3AlqqAet8=
> =K3qH
> -----END PGP SIGNATURE-----

From holland at ebi.ac.uk Thu Sep 20 05:32:01 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 20 Sep 2007 10:32:01 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
References: <46F0FA6A.1030404@ebi.ac.uk>
 <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
 <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk>
 <46F237B5.107@ebi.ac.uk>
 <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
Message-ID: <46F23E11.1000601@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Agreed. What we need is to use Strings by default, and allow conversion
to SymbolLists for advanced manipulation (DPs, etc.).
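
As a minimal sketch of what "Strings by default" could look like, the
plain-Java example below does the everyday operations mentioned in this
thread (substring, reverse complement, using a sequence as a map key)
without any Symbol machinery. The class name StringSequenceSketch and the
reverseComplement helper are hypothetical and exist only for this
illustration; they are not part of the BioJava API.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a DNA sequence handled as a plain String. Not BioJava API. */
public class StringSequenceSketch {

    /** Hypothetical helper: reverse complement of an unambiguous DNA string (A, C, G, T). */
    public static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        // Walk the input backwards and append the complement of each base.
        for (int i = dna.length() - 1; i >= 0; i--) {
            char c = Character.toUpperCase(dna.charAt(i));
            switch (c) {
                case 'A': out.append('T'); break;
                case 'T': out.append('A'); break;
                case 'G': out.append('C'); break;
                case 'C': out.append('G'); break;
                default:
                    throw new IllegalArgumentException("Not a DNA base: " + c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dna = "ATGCCGTT";

        // substring() uses 0-based, end-exclusive indices,
        // unlike the 1-based coordinates common in biology.
        String firstCodon = dna.substring(0, 3);

        // Strings already implement equals() and hashCode(),
        // so they can be used directly as hash map keys.
        Map<String, String> names = new HashMap<>();
        names.put(dna, "example sequence");

        System.out.println(firstCodon);
        System.out.println(reverseComplement(dna));
        System.out.println(names.get("ATGCCGTT"));
    }
}
```

Conversion to a SymbolList would then be a separate, explicit step taken
only by code that needs Distributions or DP.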
Not only does this simplify stuff but it also speeds up simple tasks as
it removes the need for conversion and iteration of the lists.

cheers,
Richard

Mark Schreiber wrote:
> The main value of the Symbol representation comes in when you do
> Distributions and DP which is really why Matthew and Thomas developed
> it. Quite probably why they developed biojava at all. If you are just
> pushing data around which seems to be most applications then Strings
> are better.
>
> I have previously proposed separating the Symbol, Alphabet, DP and
> Dist from the rest of the packages because they have value well beyond
> biology but an equal argument would be that most bio stuff doesn't
> need this level of analysis. If you only want to convert EMBL to Fasta
> or read a BLAST result you don't need it.
>
> For those who want to read in EMBL and compute some Distribution or
> run a Hidden Markov Model then I would propose the conversion of
> Stringy sequences to SymbolLists at the point when it is needed not at
> the point when you read them in. Given that almost all I/O of
> sequence starts and ends as a String the point where you convert to
> Symbols doesn't matter much. The only question is do you need to
> convert to Symbols for the analysis you are doing?
>
> (Sorry for not putting this on the wiki, I'll do it later).
>
> - Mark
>
> On 9/20/07, Richard Holland wrote:
> This is one of my main bugbears too. I've never quite understood why we
> can't just use Strings, and resort to SymbolLists only when more
> advanced manipulation is required (e.g. quality scores for each base).
> After all, a String is a memory word overhead (32- or 64-bits) plus
> 16-bits (unicode) per character, but most SymbolList implementations are
> a memory word overhead plus an additional entire memory word per Symbol,
> each word being a pointer to the memory location where the Symbol
> singleton lives. So SymbolLists actually use more memory than Strings,
> not less.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG8j4Q4C5LeMEKA/QRAiCCAJ9V09vR55BsKuF2rDjvLs3l5cnWKACeN43x
BOF0kkjVytLsvCE/4jkWrGg=
=Pfrz
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Thu Sep 20 06:54:31 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 20 Sep 2007 11:54:31 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
References: <46F0FA6A.1030404@ebi.ac.uk>
 <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
 <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk>
 <46F237B5.107@ebi.ac.uk>
 <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
Message-ID: <46F25167.9080307@ebi.ac.uk>

I think my EMBL point was more about groups like mine which distribute
data in an EMBL format but do not follow the EMBL rules 100% about what
elements can follow other elements. Customization is very important to
us, which at the moment means there is a biojava src checkout here which
gets edited accordingly. Not the most useful/nice solution but it works
& is something I've had to do before when I was working with
chromatograms.

Most of the work I've done with Biojava sequences was just to push in a
DNA sequence, rev comp it and push it back out. Even then that got
dropped as someone in-house made their own version which kept it all in
Strings. That said, it should have been used more since it was a DNA
alignment/sequencing project & all positions work WRT index 1 (you don't
want to know how many times I typed in -1 in that project ...
and the number of bugs it caused). Anyway I guess what I'm getting round to saying in a very bad way is that there are places where I should have used the sequence representations from biojava but the initial hump/learning curve of what they are, how to use them & why to use them was too large and I have too little time. I'm sure there are many other people in the community who have this same problem and I'm sure they'll be hurting because of it as much as I did (and if anyone from that group is reading this email I do apologize ... again). Andy Mark Schreiber wrote: > The main value of the Symbol representation comes in when you do > Distributions and DP which is really why Matthew and Thomas developed > it. Quite probably why they developed biojava at all. If you are just > pushing data around which seems to be most applications then Strings > are better. > > I have previously proposed separating the Symbol, Alphabet, DP and > Dist from the rest of the packages because they have value well beyond > biology but an equal argument would be that most bio stuff doesn't > need this level of analysis. If you only want to convert EMBL to Fasta > or read a BLAST result you don't need it. > > For those who want to read in EMBL and compute some Distribution or > run a Hidden Markov Model then I would propose the conversion of > Stringy sequences to SymbolLists at the point when it is needed not at > the point when you read them in. Given that almost all I/O of > sequence starts and ends as a String the point where you convert to > Symbols doesn't matter much. The only question is do you need to > convert to Symbols for the analysis you are doing? > > (Sorry for not putting this on the wiki, I'll do it later). > > - Mark > > On 9/20/07, Richard Holland wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> This is one of my main bugbears too.
I've never quite understood why we >> can't just use Strings, and resort to SymbolLists only when more >> advanced manipulation is required (e.g. quality scores for each base). >> After all, a String is a memory word overhead (32- or 64-bits) plus >> 16-bits (unicode) per character, but most SymbolList implementations are >> a memory word overhead plus an additional entire memory word per Symbol, >> each word being a pointer to the memory location where the Symbol >> singleton lives. So SymbolLists actually use more memory than Strings, >> not less. >> >> (This is not true for CompressedSymbolList which represents sequences as >> a sequence of bits, grouped into groups large enough to uniquely >> identify any single symbol in the alphabet - e.g. 2 bits for DNA). >> >> As you say, most users just want to read a sequence, sublist it, maybe >> reverse comp it or run some simple search over it. This can all easily >> be achieved straight from String format. >> >> The other 'category A' problems are equally important. Could you add a >> section to the Wiki about these and the 'category B' problems? Then we >> can use this as a priority use-case list when it comes to actual >> development. >> >> cheers, >> Richard >> >> >> Andy Yates wrote: >>> Hi, >>> >>> I would say yes to this as well. It is very important to know what green >>> people are attempting to do with BioJava rather than us assuming that we >>> know :). There are parts in BioJava where the flexibility of the code is >>> not sufficient for other people who want to use the code base & in other >>> areas too flexible. >>> >>> I've talked to quite a few people over the years who have used biojava >>> for simple & complex applications and they all seem to come back round >>> to a few key problems: >>> >>> * Sequence & SymbolLists are strange and why can't I use a String - All >>> of this makes a lot more sense if you know about the flyweight pattern; >>> if not it just seems very strange. 
>>> >>> * I have a format that's EMBL like. Can I parse it using Biojava? >>> >>> * How do I read in a FASTA file? >>> >>> * How can I get X from this chromatogram & can I parse my specific trace >>> format into a BioJava object? >>> >>> As Andreas said it's the occurrence of the category A problems that are >>> the most worrying. In terms of sequences I think I can see why people >>> have a problem with it. >>> >>> Just if we take this as an example: >>> >>> I have my DNA sequence in a String I can substring it, perform a regular >>> expression over it, replace sections, pad it out, format it & so on. If >>> I have a Sequence object I can perform most of these actions but the >>> interface to them seems unintuitive. Things like calling seqString() to >>> get the String back out from a sequence rather than calling toString(). >>> Also lets say I want to use a sequence as a key in a hash map or ask if >>> two sequences are equal (using the old sequence objects) ... at the >>> moment I'd have to convert Sequence -> String to perform the comparison >>> (and that doesn't include checking a Sequence for alphabet equality). >>> >>> I know this sounds like nit-picking & for people who have used biojava >>> extensively a lot of this makes sense. For someone new to the project it >>> seems like we've done something just for the sake of it and we need to >>> get rid of that feeling which I'm sure will happen if we address the >>> category A problem. The rest will fall into place :) >>> >>> Andy >>> >>> Richard Holland wrote: >>> I totally agree. >>> >>> Can you post a short summary of this to the Wiki page? >>> >>> Not all aspects of BioJava are documented, leading people either to give >>> up, consult the JavaDocs online, or post a message to biojava-l or >>> biojava-dev. >>> >>> Is it possible to get similar stats to the ones you have calculated for >>> the JavaDoc pages on our website? 
>>> >>> Also, is it possible to build some kind of index over the mailing list >>> archives to pull out the most frequently used terms? >>> >>> cheers, >>> Richard >>> >>> Andreas Prlic wrote: >>>>>> Hi, >>>>>> >>>>>> A question related to the discussion of how to design a future BioJava >>>>>> is to have a look >>>>>> at which parts of BioJava are being actively used and how to improve >>>>>> these. >>>>>> >>>>>> So what are the most frequently used bits of BioJava? One way to look at >>>>>> this is to go to the >>>>>> web-stats and see how many hits we have got on our documentation web >>>>>> pages. >>>>>> >>>>>> In an ideal world BioJava would be so simple to use, that nobody needs >>>>>> to read any docu. >>>>>> Unfortunately we are far away from this, so actually looking at these >>>>>> stats gives an impression >>>>>> on >>>>>> >>>>>> * topics / functionality which are of particular interest to the >>>>>> community >>>>>> * topics / functionality which might not be straightforward to use, >>>>>> therefore there are many hits on these pages. >>>>>> >>>>>> A look at the webstats from the last couple of months gives these top 10 >>>>>> Cookbook pages that >>>>>> have been accessed frequently. This list is ordered by nr. of pageviews >>>>>> >>>>>> 1. /wiki/BioJava:Cookbook:Alphabets >>>>>> 2. /wiki/BioJava:CookBook:Blast:Parser >>>>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta >>>>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES >>>>>> 5. /wiki/BioJava:CookBook:DP:PairWise2 >>>>>> 6. /wiki/BioJava:CookBook:PDB:read >>>>>> 7. /wiki/BioJava:Cookbook:Sequence >>>>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta >>>>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI >>>>>> 10. /wiki/BioJava:CookBook:Fasta:Parse >>>>>> >>>>>> I would group these pages into 2 groups. 
>>>>>> A) How to work with core concepts of BioJava >>>>>> B) How to use a functionality of BioJava to achieve a certain goal >>>>>> >>>>>> The "conceptual" pages (A) I would identify as >>>>>> * How to get an Alphabet >>>>>> * How to make a Sequence Object from a String or make a Sequence Object >>>>>> back into a String >>>>>> >>>>>> The "functionality" pages (B) I would summarize as >>>>>> * How to parse a Blast output >>>>>> * How to read sequences from a Fasta file >>>>>> * How to read a GenBank, SwissProt or EMBL file >>>>>> * How to generate a global or local alignment with the Needleman-Wunsch- >>>>>> or the Smith-Waterman-algorithm >>>>>> * How to read a protein structure - PDB file >>>>>> * How to export a sequence to fasta >>>>>> * How to view a sequence in a gui >>>>>> * How to parse a Fasta database search output file >>>>>> >>>>>> >>>>>> As a conclusion I would suggest that BioJava should have the goal to >>>>>> provide easy access to the >>>>>> core "functionalities" (group B). I believe that we should try to keep >>>>>> the "concepts" that are being used to >>>>>> achieve these functionalities as simple as possible. In this sense, I >>>>>> feel that we have too many hits on the group A pages. >>>>>> >>>>>> Andreas >>>>>> >>>>>> ----------------------------------------------------------------------- >>>>>> >>>>>> Andreas Prlic Wellcome Trust Sanger Institute >>>>>> Hinxton, Cambridge CB10 1SA, UK >>>>>> +44 (0) 1223 49 6891 >>>>>> >>>>>> ----------------------------------------------------------------------- >>>>>> >>>>>> >>>>>> >>>>>> --The Wellcome Trust Sanger Institute is operated by Genome >>>>>> ResearchLimited, a charity registered in England with number 1021457 and >>>>>> acompany registered in England with number 2742969, whose >>>>>> registeredoffice is 215 Euston Road, London, NW1 2BE. 
>> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2.2 (GNU/Linux) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >> >> iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB >> 8RPZSfbrr9Nfbk3AlqqAet8= >> =K3qH >> -----END PGP SIGNATURE----- >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> From ayates at ebi.ac.uk Thu Sep 20 06:55:13 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Thu, 20 Sep 2007 11:55:13 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F237B5.107@ebi.ac.uk> References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk> <46F237B5.107@ebi.ac.uk> Message-ID: <46F25191.70109@ebi.ac.uk> Ok I'll add them in. Can you remember if I've actually got a wiki account? Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > This is one of my main bugbears too. I've never quite understood why we > can't just use Strings, and resort to SymbolLists only when more > advanced manipulation is required (e.g. quality scores for each base). > After all, a String is a memory word overhead (32- or 64-bits) plus > 16-bits (unicode) per character, but most SymbolList implementations are > a memory word overhead plus an additional entire memory word per Symbol, > each word being a pointer to the memory location where the Symbol > singleton lives. So SymbolLists actually use more memory than Strings, > not less. > > (This is not true for CompressedSymbolList which represents sequences as > a sequence of bits, grouped into groups large enough to uniquely > identify any single symbol in the alphabet - e.g. 
2 bits for DNA). > > As you say, most users just want to read a sequence, sublist it, maybe > reverse comp it or run some simple search over it. This can all easily > be achieved straight from String format. > > The other 'category A' problems are equally important. Could you add a > section to the Wiki about these and the 'category B' problems? Then we > can use this as a priority use-case list when it comes to actual > development. > > cheers, > Richard > > > Andy Yates wrote: >> Hi, >> >> I would say yes to this as well. It is very important to know what green >> people are attempting to do with BioJava rather than us assuming that we >> know :). There are parts in BioJava where the flexibility of the code is >> not sufficient for other people who want to use the code base & in other >> areas too flexible. >> >> I've talked to quite a few people over the years who have used biojava >> for simple & complex applications and they all seem to come back round >> to a few key problems: >> >> * Sequence & SymbolLists are strange and why can't I use a String - All >> of this makes a lot more sense if you know about the flyweight pattern; >> if not it just seems very strange. >> >> * I have a format that's EMBL like. Can I parse it using Biojava? >> >> * How do I read in a FASTA file? >> >> * How can I get X from this chromatogram & can I parse my specific trace >> format into a BioJava object? >> >> As Andreas said it's the occurrence of the category A problems that are >> the most worrying. In terms of sequences I think I can see why people >> have a problem with it. >> >> Just if we take this as an example: >> >> I have my DNA sequence in a String I can substring it, perform a regular >> expression over it, replace sections, pad it out, format it & so on. If >> I have a Sequence object I can perform most of these actions but the >> interface to them seems unintuitive. 
Things like calling seqString() to >> get the String back out from a sequence rather than calling toString(). >> Also lets say I want to use a sequence as a key in a hash map or ask if >> two sequences are equal (using the old sequence objects) ... at the >> moment I'd have to convert Sequence -> String to perform the comparison >> (and that doesn't include checking a Sequence for alphabet equality). >> >> I know this sounds like nit-picking & for people who have used biojava >> extensively a lot of this makes sense. For someone new to the project it >> seems like we've done something just for the sake of it and we need to >> get rid of that feeling which I'm sure will happen if we address the >> category A problem. The rest will fall into place :) >> >> Andy >> >> Richard Holland wrote: >> I totally agree. >> >> Can you post a short summary of this to the Wiki page? >> >> Not all aspects of BioJava are documented, leading people either to give >> up, consult the JavaDocs online, or post a message to biojava-l or >> biojava-dev. >> >> Is it possible to get similar stats to the ones you have calculated for >> the JavaDoc pages on our website? >> >> Also, is it possible to build some kind of index over the mailing list >> archives to pull out the most frequently used terms? >> >> cheers, >> Richard >> >> Andreas Prlic wrote: >>>>> Hi, >>>>> >>>>> A question related to the discussion of how to design a future BioJava >>>>> is to have a look >>>>> at which parts of BioJava are being actively used and how to improve >>>>> these. >>>>> >>>>> So what are the most frequently used bits of BioJava? One way to look at >>>>> this is to go to the >>>>> web-stats and see how many hits we have got on our documentation web >>>>> pages. >>>>> >>>>> In an ideal world BioJava would be so simple to use, that nobody needs >>>>> to read any docu. 
>>>>> Unfortunately we are far away from this, so actually looking at these >>>>> stats gives an impression >>>>> on >>>>> >>>>> * topics / functionality which are of particular interest to the >>>>> community >>>>> * topics / functionality which might not be straightforward to use, >>>>> therefore there are many hits on these pages. >>>>> >>>>> A look at the webstats from the last couple of months gives these top 10 >>>>> Cookbook pages that >>>>> have been accessed frequently. This list is ordered by nr. of pageviews >>>>> >>>>> 1. /wiki/BioJava:Cookbook:Alphabets >>>>> 2. /wiki/BioJava:CookBook:Blast:Parser >>>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta >>>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES >>>>> 5. /wiki/BioJava:CookBook:DP:PairWise2 >>>>> 6. /wiki/BioJava:CookBook:PDB:read >>>>> 7. /wiki/BioJava:Cookbook:Sequence >>>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta >>>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI >>>>> 10. /wiki/BioJava:CookBook:Fasta:Parse >>>>> >>>>> I would group these pages into 2 groups. 
>>>>> A) How to work with core concepts of BioJava >>>>> B) How to use a functionality of BioJava to achieve a certain goal >>>>> >>>>> The "conceptual" pages (A) I would identify as >>>>> * How to get an Alphabet >>>>> * How to make a Sequence Object from a String or make a Sequence Object >>>>> back into a String >>>>> >>>>> The "functionality" pages (B) I would summarize as >>>>> * How to parse a Blast output >>>>> * How to read sequences from a Fasta file >>>>> * How to read a GenBank, SwissProt or EMBL file >>>>> * How to generate a global or local alignment with the Needleman-Wunsch- >>>>> or the Smith-Waterman-algorithm >>>>> * How to read a protein structure - PDB file >>>>> * How to export a sequence to fasta >>>>> * How to view a sequence in a gui >>>>> * How to parse a Fasta database search output file >>>>> >>>>> >>>>> As a conclusion I would suggest that BioJava should have the goal to >>>>> provide easy access to the >>>>> core "functionalities" (group B). I believe that we should try to keep >>>>> the "concepts" that are being used to >>>>> achieve these functionalities as simple as possible. In this sense, I >>>>> feel that we have too many hits on the group A pages. >>>>> >>>>> Andreas >>>>> >>>>> ----------------------------------------------------------------------- >>>>> >>>>> Andreas Prlic Wellcome Trust Sanger Institute >>>>> Hinxton, Cambridge CB10 1SA, UK >>>>> +44 (0) 1223 49 6891 >>>>> >>>>> ----------------------------------------------------------------------- >>>>> >>>>> >>>>> >>>>> --The Wellcome Trust Sanger Institute is operated by Genome >>>>> ResearchLimited, a charity registered in England with number 1021457 and >>>>> acompany registered in England with number 2742969, whose >>>>> registeredoffice is 215 Euston Road, London, NW1 2BE. 
> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB > 8RPZSfbrr9Nfbk3AlqqAet8= > =K3qH > -----END PGP SIGNATURE----- From gwaldon at geneinfinity.org Fri Sep 21 02:53:12 2007 From: gwaldon at geneinfinity.org (george waldon) Date: Thu, 20 Sep 2007 23:53:12 -0700 Subject: [Biojava-dev] The future of BioJava Message-ID: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> Hello, All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails. I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps towards understanding Symbol and cross-alphabets (?) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList). I am surprised by all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to do most basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older ones) actually get their answer in the Cookbook. Richard wrote: >It is suggested that development stops on the existing Biojava(?) Well, I don't think the license can let you do that :-) Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite.
I can see some of the initial steps on the roadmap: - Switch to Subversion repository - Change of the build process compatible with creation of modules - Improving testing frame (mentioned several times) - Creation of white papers for coding practices, build releases, (others?) Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?) Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package. Hope it helps, George From markjschreiber at gmail.com Fri Sep 21 03:24:23 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 21 Sep 2007 15:24:23 +0800 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> Message-ID: <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> Hello - Just to clarify my opinion on Strings vs Symbols. I generally prefer Symbols and SymbolLists to Strings because SymbolLists are smart and Strings are dumb. Classic case is ambiguity symbols like 'W'. BioJava knows that, in the context of DNA, this is A or T. However, I think it would be vastly simpler if there were simpler getters and setters for SymbolLists that exposed Strings in a friendlier manner.
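To make the ambiguity point concrete, here is a minimal sketch in plain Java. It is not BioJava code: the class name IupacDna and its methods compatible() and matches() are made up for this illustration, and the only thing taken as given is the standard IUPAC nucleotide code table (W = A or T, R = A or G, and so on). It shows what a plain String comparison misses and what an ambiguity-aware comparison has to do.

```java
import java.util.Map;

/** Hypothetical helper, not part of BioJava: IUPAC-aware comparison of DNA strings. */
final class IupacDna {

    // IUPAC nucleotide codes mapped to the concrete bases each one stands for.
    private static final Map<Character, String> CODES = Map.ofEntries(
        Map.entry('A', "A"), Map.entry('C', "C"),
        Map.entry('G', "G"), Map.entry('T', "T"),
        Map.entry('R', "AG"), Map.entry('Y', "CT"),
        Map.entry('S', "CG"), Map.entry('W', "AT"),
        Map.entry('K', "GT"), Map.entry('M', "AC"),
        Map.entry('B', "CGT"), Map.entry('D', "AGT"),
        Map.entry('H', "ACT"), Map.entry('V', "ACG"),
        Map.entry('N', "ACGT"));

    /** True if the two codes share at least one concrete base. */
    static boolean compatible(char x, char y) {
        String a = CODES.get(Character.toUpperCase(x));
        String b = CODES.get(Character.toUpperCase(y));
        if (a == null || b == null) {
            throw new IllegalArgumentException(
                "Not an IUPAC DNA code: " + (a == null ? x : y));
        }
        for (int i = 0; i < a.length(); i++) {
            if (b.indexOf(a.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Position-by-position comparison that treats ambiguity codes as sets of bases. */
    static boolean matches(String s1, String s2) {
        if (s1.length() != s2.length()) {
            return false;
        }
        for (int i = 0; i < s1.length(); i++) {
            if (!compatible(s1.charAt(i), s2.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("AWG".equals("ATG"));   // false: String knows nothing about 'W'
        System.out.println(matches("AWG", "ATG")); // true:  W covers T
        System.out.println(matches("AWG", "ACG")); // false: W does not cover C
    }
}
```

The sketch keeps sequences as Strings and only consults the code table when a comparison is requested, which is one way to read the "friendlier getters and setters" idea; a real SymbolList carries this knowledge in its Alphabet instead.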
I also think there is a case for SymbolLists that are backed by Strings (more likely a char[]) instead of Symbol arrays and only do the needed conversion when required (i.e., when the user calls symbolAt()). These would be ideal for the case where someone is converting GenBank to Fasta and there is no need to go through the Symbol parsing. Finally, I think SymbolLists (or whatever they get called) should implement more of the methods found in String to make them look more like Strings. Ideally we should think about implementing some of the methods that Groovy likes to use for operator overloading. If we do this it would be possible to concatenate two sequences in Groovy by doing this (I may have the syntax wrong).

    Seq3 = Seq1 + Seq2

The other issue with SymbolLists is that they are not intuitive to construct because they are not so bean like. This is not just a problem for newbies but also a major hindrance to the use of JEE, Spring, JAXB and other important frameworks. It should be possible to do this:

    SymbolList sl = new SymbolList();
    sl.setName("AB123456");
    sl.setSequence(seqString);

The final hindrance to the use of JEE is serialization. If we keep Symbols flyweight (singleton) we need to make this bullet proof from the start. It is also practically impossible to make something a bean and make it a Singleton, so some careful thought is required. If we keep symbols behind the scenes they may not need to be so bean like. - Mark On 9/21/07, george waldon wrote: > Hello, > > All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails. > > I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps towards understanding Symbol and cross-alphabets (?) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g.
Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList). > > I am surprised by all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to do most basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older ones) actually get their answer in the Cookbook. > > Richard wrote: > >It is suggested that development stops on the existing Biojava(?) > Well, I don't think the license can let you do that :-) > Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite. I can see some of the initial steps on the roadmap: > - Switch to Subversion repository > - Change of the build process compatible with creation of modules > - Improving testing frame (mentioned several times) > - Creation of white papers for coding practices, build releases, (others?) > > Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?) > > Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package.
> > Hope it helps, > George > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From holland at ebi.ac.uk Fri Sep 21 03:54:51 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 21 Sep 2007 08:54:51 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> Message-ID: <46F378CB.2030903@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi George. By 'stop development' I really meant just that active development efforts would be focused on the new codebase rather than modifying the existing one (except of course for fixing bugs, which is always important and we wouldn't stop doing that until the new codebase was well established as an alternative). I agree that modifying the existing codebase would improve many of the problems currently experienced with it - code abstraction being just one of them. BioJavaX was an attempt at doing this. The big stumbling block was interfaces - users do not expect interfaces to change as it breaks all code that already uses that interface. They also do not expect the defined behaviour of methods in interfaces to change - which meant, for instance, that I had real problems trying to get RichFeature/RichLocation and RichLocation/Location to match up as some parts of Feature and Location conflicted with the more realistic requirements of their Rich* equivalents (e.g. circularity). If you change interfaces, you might as well start from scratch in terms of the effect it has on end-user's code. Also, if we start from scratch, it allows us to build up from the very basics the kind of robustness and flexibility we need throughout the system. 
As mentioned in the original posting the existing system is heavily sequence-focused, meaning that even the simple task of scanning a set of features cannot be done without also loading the associated sequences because the two are so closely integrated. We need to make it much more flexible and I think new code would give us a better opportunity to do so without being tied into complying with existing interfaces or behaviour expectations. Having said that, I do expect large parts of the new codebase to be only slightly modified copies of the original code, particularly regarding recent developments such as genetic algorithms and phylogenetics. It would be silly to write such logic all over again where the code is relatively self-contained. cheers, Richard george waldon wrote: > Hello, > > All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails. > > I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps towards understanding Symbol and cross-alphabets (?) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList). > > I am surprised by all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to do most basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older ones) actually get their answer in the Cookbook. > > Richard wrote: >> It is suggested that development stops on the existing Biojava(?) > Well, I don't think the license can let you do that :-) > Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite.
I can see some of the initial steps on the roadmap: > - Switch to Subversion repository > - Change of the build process compatible with creation of modules > - Improving testing frame (mentioned several times) > - Creation of white papers for coding practices, build releases, (others?) > > Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?) > > Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package.
> > Hope it helps, > George > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG83jK4C5LeMEKA/QRAtOFAJsF9YNdgdsOm1KY65GyRehsO1ElYwCfeUfi yXWTMXSzn3mXZqXXo9999rw= =WbAQ -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Sep 21 04:07:51 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 21 Sep 2007 09:07:51 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> Message-ID: <46F37BD7.5020402@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I like that idea of having SymbolLists backed by different things. I'd suggest that by default, all sequences read from file should be String-backed SymbolLists, and that they are not broken down into Symbols until first requested to do so by code that needs to know the actual Symbols (e.g. code that cares about ambiguity symbols). The same applies in reverse - lists constructed from symbols should not be converted to strings until needed. Something like this:

    SymbolList sl = new SymbolList();
    sl.setString("AGCGGACT");       // Changes the string, and clears any cached
                                    // conversion of it.
    String seq = sl.getString();    // Dumps the string. If not already converted
                                    // to a string, does the conversion and
                                    // caches it first.
    char base = sl.charAt(5);       // 1-indexed single-base string. This would
                                    // likely delegate to String.charAt() and only
                                    // works for single-character alphabets. Not
                                    // to be used in any other circumstances.
    sl.set/getAlphabet()....        // Use these to set the alphabet before
                                    // using set/getSymbols()/symbolAt().
    sl.setSymbols(new List(....));  // Uses the list to update the cached symbols
                                    // and clear the cached string.
    List syms = sl.getSymbols();    // Converts if not already converted, caches
                                    // the conversion, and returns it.
    Symbol sym = sl.symbolAt(5);    // 1-indexed fully flexible symbol finder.

toString() would delegate to getString(), as would hashCode(), equals(), and compareTo(). We could provide additional equals()-style methods for testing equality whilst taking into account ambiguities. cheers, Richard Mark Schreiber wrote: > Hello - > > Just to clarify my opinion on Strings vs Symbols. > > I generally prefer Symbols and SymbolLists to Strings because > SymbolLists are smart and Strings are dumb. Classic case is ambiguity > symbols like 'W'. BioJava knows that, in the context of DNA, this is A or T. > However, I think it would be vastly simpler if there were simpler > getters and setters for SymbolLists that exposed Strings in a > friendlier manner. > > I also think there is a case for SymbolLists that are backed by > Strings (more likely a char[]) instead of Symbol arrays and only do > the needed conversion when required (i.e., when the user calls > symbolAt()). These would be ideal for the case where someone is > converting GenBank to Fasta and there is no need to go through the > Symbol parsing. > > Finally, I think SymbolLists (or whatever they get called) should > implement more of the methods found in String to make them look more > like Strings. Ideally we should think about implementing some of the > methods that Groovy likes to use for operator overloading. If we do > this it would be possible to concatenate two sequences in Groovy by > doing this (I may have the syntax wrong). > > Seq3 = Seq1 + Seq2 > > The other issue with SymbolLists is that they are not intuitive to > construct because they are not so bean like. This is not just a > problem for newbies but also a major hindrance to the use of JEE, > Spring, JAXB and other important frameworks. It should be possible to
It should be possible to > do this: > > SymbolList sl = new SymbolList(); > sl.setName("AB123456"); > sl.setSequence(seqString); > > The final hinderance to the use of JEE is serialization. If we keep > Symbols flyweight (singleton) we need to make this bullet proof from > the start. It is also practicaly impossible to make something a bean > and make it a Singleton, some careful thought is required. If we keep > symbols behind the scenes they may not need to be so bean like. > > - Mark > > On 9/21/07, george waldon wrote: >> Hello, >> >> All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails. >> >> I noticed that the tutorial has seriously improved ? thanks for the work. I remember my initial steps going to understanding Symbol and cross-alphabets (?) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenizarion("token") or SymbolTokenization.tokenizeSymbolList(SymbolList). >> >> I am surprised by the all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to make most of all basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older one) actually get their answer in the Cookbook. >> >> Richard wrote: >>> It is suggested that development stops on the existing Biojava(?) >> Well, I don't think the license can let you do that :-) >> Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite. 
I can see some of the initial steps on the roadmap: >> - Switch to Subversion repository >> - Change of the build process compatible with creation of modules >> - Improving testing frame (mentioned several times) >> - Creation of white papers for coding practices, build releases, (others?) >> >> Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference ? building modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API change for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?) >> >> Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package. 
>> >> Hope it helps, >> George >> >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG83vW4C5LeMEKA/QRAgvcAKCbjSMERdawCoeeEA/Cg+c/z/DqsgCeImE/ QfSYrzx1TUHVscTXCs2vAoY= =x+Su -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Sep 21 04:47:55 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 21 Sep 2007 09:47:55 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F37BD7.5020402@ebi.ac.uk> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> Message-ID: <46F3853B.7070701@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Also could we make SymbolList implement List? The iterator() method would then do the cached conversion if required before returning an Iterator over the symbols. That would make it very pluggable. We'd need it to have a settable flag indicating whether the user wants 1-indexed or 0-indexed access (the default being 1-indexed as this is the most common biological use). Only downside is that List uses generics and so SymbolList must too - meaning that SymbolList must always be declared as SymbolList (or some subclass of Symbol). But that's also an upside - you could subclass Symbol into DNASymbol, RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the symbol and need not be specified separately: SymbolList dna = new SymbolList(); dna.add(RNAAlphabet.Q); // Throws standard List exception! SymbolList> = new ....; // Cool! 
Also cool is that you could do this: public SymbolList translate(SymbolList dna); // Also cool! cheers, Richard Richard Holland wrote: > I like that idea of having SymbolLists backed by different things. I'd > suggest that by default, all sequences read from file should be > String-backed SymbolLists, and that they are not broken down into > Symbols until first requested to do so by code that needs to know the > actual Symbols (e.g. code that cares about ambiguity symbols). The same > applies in reverse - lists constructed from symbols should not be > converted to strings until needed. > > Something like this: > > SymbolList sl = new SymbolList(); > sl.setString("AGCGGACT"); > // Changes the string, and clears any cached > // conversion of it. > String seq = sl.getString(); > // Dumps the string. If not already converted > // to a string, does the conversion and > // caches it first. > char base = sl.charAt(5); > // 1-indexed single-base string. This would > // likely delegate to String.charAt() and only > // works for single-character alphabets. Not > // to be used in any other cirumstances. > sl.set/getAlphabet().... > // Use these to set the alphabet before > // using set/getSymbols()/symbolAt(). > sl.setSymbols(new List(....)); > // Uses the list to update the cached symbols > // and clear the cached string. > List syms = sl.getSymbols(); > // Converts if not already converted, caches > // the conversion, and returns it. > Symbol sym = sl.symbolAt(5); > // 1-indexed fully flexible symbol finder. > > toString() would delegate to getString(), as would hashCode(), equals(), > and compareTo(). We could provide additional equals()-style methods for > testing equality whilst taking into account ambiguities. > > cheers, > Richard > > > Mark Schreiber wrote: >>> Hello - >>> >>> Just to clarify my opinion on Strings vs Symbols. >>> >>> I generally prefer Symbols and SymbolLists to Strings cause >>> SymbolLists are smart and Strings are dumb. 
Classic case is ambiguity >>> symbols like 'W'. BioJava knows, in the context of DNA this is A or T. >>> However, I think it would be vastly simpler if there where simpler >>> getters and setters for SymbolLists that exposed Strings in a >>> friendlier manner. >>> >>> I also think there is a case for SymbolLists that are backed by >>> Strings (more likely a char[]) instead of Symbol arrays and only do >>> the needed conversion when required (ie, when the user calls >>> SymbolAt(). These would be ideal for the case where someone is >>> converting GenBank to Fasta and there is no need to go through the >>> Symbol parsing. >>> >>> Finally, I think SymbolLists (or whatever they get called) should >>> implement more of the methods found in String to make them look more >>> like Strings. Ideally we should think about implementing some of the >>> methods that Groovy likes to use for operator overloading. If we do >>> this is would be possible to concatenate two sequences in groovy by >>> doing this (I may have the syntax wrong). >>> >>> Seq3 = Seq1 + Seq2 >>> >>> The other issue with SymbolLists is that they are not intuitive to >>> construct because they are not so bean like. This is not just a >>> problem for newbies but also a major hinderance to the use of JEE, >>> Spring, JAXB and other important frameworks. It should be possible to >>> do this: >>> >>> SymbolList sl = new SymbolList(); >>> sl.setName("AB123456"); >>> sl.setSequence(seqString); >>> >>> The final hinderance to the use of JEE is serialization. If we keep >>> Symbols flyweight (singleton) we need to make this bullet proof from >>> the start. It is also practicaly impossible to make something a bean >>> and make it a Singleton, some careful thought is required. If we keep >>> symbols behind the scenes they may not need to be so bean like. >>> >>> - Mark >>> >>> On 9/21/07, george waldon wrote: >>>> Hello, >>>> >>>> All this is very exciting. I would certainly contribute to something like that. 
A few remarks that come to my mind while reading all these emails. >>>> >>>> I noticed that the tutorial has seriously improved  thanks for the work. I remember my initial steps going to understanding Symbol and cross-alphabets (&) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenizarion("token") or SymbolTokenization.tokenizeSymbolList(SymbolList). >>>> >>>> I am surprised by the all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to make most of all basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older one) actually get their answer in the Cookbook. >>>> >>>> Richard wrote: >>>>> It is suggested that development stops on the existing Biojava(&) >>>> Well, I don't think the license can let you do that :-) >>>> Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite. I can see some of the initial steps on the roadmap: >>>> - Switch to Subversion repository >>>> - Change of the build process compatible with creation of modules >>>> - Improving testing frame (mentioned several times) >>>> - Creation of white papers for coding practices, build releases, (others?) >>>> >>>> Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference  building modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API change for developers who will use the newer version. 
I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?) >>>> >>>> Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package. >>>> >>>> Hope it helps, >>>> George >>>> >>>> _______________________________________________ >>>> biojava-dev mailing list >>>> biojava-dev at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>>> >>> _______________________________________________ >>> biojava-dev mailing list >>> biojava-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>> _______________________________________________ biojava-dev mailing list biojava-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-dev -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG84U64C5LeMEKA/QRAtyfAJ9PAsFu3+zjUhP3Xcs5imojL/cb/wCfRX8V eOMOo3pCl71dPhZMyYlBBE4= =NByU -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Fri Sep 21 05:20:18 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Fri, 21 Sep 2007 10:20:18 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> Message-ID: <46F38CD2.6000005@ebi.ac.uk> > > Finally, I think SymbolLists (or whatever they get called) should > implement more of the methods found in String to make them look more > like Strings. Ideally we should think about implementing some of the > methods that Groovy likes to use for operator overloading. 
If we do
> this is would be possible to concatenate two sequences in groovy by
> doing this (I may have the syntax wrong).
>
> Seq3 = Seq1 + Seq2

Yup that seems about right. It's one of the nice things about groovy that you can overload the operators and create something which approaches an in-language DSL (can't really call it a true DSL since it's constrained by the Groovy language). But anyway you can start mucking around with the operators to get things like:

fasta = new Fasta('id','AAAAAA')
fasta_output = new FastaWriter('some_location');
fasta_output << fasta

Assuming that the Fasta class would represent a Fasta record & the FastaWriter is just that, you can begin to write some very nice & tight code which just looks nice to use :).

>
> The other issue with SymbolLists is that they are not intuitive to
> construct because they are not so bean like. This is not just a
> problem for newbies but also a major hinderance to the use of JEE,
> Spring, JAXB and other important frameworks. It should be possible to
> do this:
>
> SymbolList sl = new SymbolList();
> sl.setName("AB123456");
> sl.setSequence(seqString);

Yup I'll agree with that.

>
> The final hinderance to the use of JEE is serialization. If we keep
> Symbols flyweight (singleton) we need to make this bullet proof from
> the start. It is also practicaly impossible to make something a bean
> and make it a Singleton, some careful thought is required. If we keep
> symbols behind the scenes they may not need to be so bean like.

I think we may need a bit of both. I would suggest something like an interface which backs onto Symbol. Then collections of symbols are actually enums e.g.
public interface Symbol {
    String toString();
}

public enum DNA implements Symbol, java.io.Serializable {
    A, C, G, T;

    public String toString() {
        return this.name().toLowerCase();
    }

    // Map a deserialized instance back onto the statically available constant.
    private Object readResolve() throws java.io.ObjectStreamException {
        DNA symbol = null;
        for (DNA dna : values()) {
            if (dna.toString().equals(this.toString())) {
                symbol = dna;
                break;
            }
        }
        return symbol;
    }
}

The readResolve needs to go in here to make sure this is bulletproof against serialization. Otherwise we end up in a situation where you can serialize an enum, deserialize it & then find that the deserialized enum is not equal (using ==) to the statically available enum. In my experience enums are a very nice way of working with static constants. However they are very hard to extend, so they're fine for known constants like DNA (don't think we're going to stumble onto a new nucleotide), but the Symbol interface does mean that people can extend the symbol concept if need be.

From ayates at ebi.ac.uk Fri Sep 21 05:26:49 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 21 Sep 2007 10:26:49 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F3853B.7070701@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk>
Message-ID: <46F38E59.5030005@ebi.ac.uk>

>
> Also could we make SymbolList implement List? The iterator() method
> would then do the cached conversion if required before returning an
> Iterator over the symbols. That would make it very pluggable.
> We'd need it to have a settable flag indicating whether the user wants
> 1-indexed or 0-indexed access (the default being 1-indexed as this is
> the most common biological use).

The important part of the 1.5 API is Iterable, which can be applied to any class. It just means the class will return an iterator & can be used in the new foreach loop construct.
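
To make the Iterable point concrete, here is a minimal sketch. BaseList and countOf are hypothetical names used only for illustration, not BioJava classes, and the sketch assumes a String-backed sequence of single-character residues on Java 5 or later.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical String-backed sequence; illustration only, not the BioJava SymbolList. */
final class BaseList implements Iterable<Character> {
    private final String residues;

    BaseList(String residues) {
        this.residues = residues;
    }

    /** Implementing Iterable is all the foreach loop needs. */
    public Iterator<Character> iterator() {
        return new Iterator<Character>() {
            private int pos = 0;

            public boolean hasNext() {
                return pos < residues.length();
            }

            public Character next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return residues.charAt(pos++);
            }

            public void remove() {
                throw new UnsupportedOperationException("read-only sequence");
            }
        };
    }

    /** Counts occurrences of one residue in anything that is Iterable. */
    static int countOf(Iterable<Character> symbols, char wanted) {
        int count = 0;
        for (Character c : symbols) {
            if (c.charValue() == wanted) {
                count++;
            }
        }
        return count;
    }
}
```

Nothing here requires the full List contract, so a sequence class could offer foreach support without committing to List's 0-indexed get().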
>
> Only downside is that List uses generics and so SymbolList must too -
> meaning that SymbolList must always be declared as SymbolList
> (or some subclass of Symbol).
>
> But that's also an upside - you could subclass Symbol into DNASymbol,
> RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the
> symbol and need not be specified separately:
>
> SymbolList dna = new SymbolList();
> dna.add(RNAAlphabet.Q); // Throws standard List exception!
>
> SymbolList> = new ....; // Cool!
>
> Also cool is that you could do this:
>
> public SymbolList translate(SymbolList dna);
> // Also cool!

Two problems with using generics that I've encountered:

1). getSomething(SymbolList<X> list); & getSomething(SymbolList<Y> list); are, as far as the compiler is concerned, the same method. After type erasure both take in an instance of plain SymbolList, so they cannot be overloads of each other (remember that generics are a last-minute bolt-on to the JDK 1.5 API and it really shows).

2). It is impossible to find out the type of a generic at runtime, i.e.

public <T> void doSomething(T genericObject) {
    if (T.equals(String.class)) { // will not compile: T is not an object you can call methods on
        //Do something
    }
}

This T type is ... well magical. It exists but it doesn't.

Anyway just be careful with generics. They save a lot of time & effort, but get too involved (or think they can solve everything, like I did for a short period) and they're going to burn you badly or drive you mad for 1/2 a day wondering why javac claims something you've written is bogus.
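
The first problem shows up at runtime as well. A small sketch, with ErasureDemo and sameRuntimeClass as made-up names for illustration: two lists declared with different type arguments report exactly the same runtime class, which is why two overloads that differ only in the type argument cannot coexist.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: type arguments are erased, so they leave no trace in the runtime class. */
final class ErasureDemo {

    /** Returns true when both lists share one runtime class, whatever they were declared as. */
    static boolean sameRuntimeClass(List<?> a, List<?> b) {
        return a.getClass() == b.getClass();
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<String>();
        List<Integer> numbers = new ArrayList<Integer>();
        // Both are plain java.util.ArrayList once compiled.
        System.out.println(sameRuntimeClass(strings, numbers));
    }
}
```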
Andy

From holland at ebi.ac.uk Fri Sep 21 05:56:07 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 21 Sep 2007 10:56:07 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F38E59.5030005@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk> <46F38E59.5030005@ebi.ac.uk>
Message-ID: <46F39537.6040002@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

For our purposes we'd use them to restrict valid input to a method - so that if the user tries to write a program which passes a List of one symbol type to a method which only takes a List of another, it'll throw a wobbly at compile time. The methods would include the type in their signature and therefore only accept lists of that type.

We'd obviously have to be careful with method naming as you point out - i.e. no overloading methods with different generic types of the same parameters.

The bit about not being able to work out what T is is a pain indeed, but I don't think we'd need to use that. In most cases it can be solved hackily - whenever Sun produces a proper way of doing it we'd use it of course. (Hacky solution: if passed a List, do an instanceof check on list.iterator().next() to find out the type of the first item in the list, thus implying the type of everything else in the list, assuming the list is not empty).

cheers,
Richard

Andy Yates wrote:
>> Also could we make SymbolList implement List? The iterator() method
>> would then do the cached conversion if required before returning an
>> Iterator over the symbols. That would make it very pluggable.
>> We'd need it to have a settable flag indicating whether the user wants
>> 1-indexed or 0-indexed access (the default being 1-indexed as this is
>> the most common biological use).
>
> The important part of the 1.5 api is Iterable which can be applied to
> any class.
It just means it will return an iterator & can be used in the > new foreach loop construct. > >> Only downside is that List uses generics and so SymbolList must too - >> meaning that SymbolList must always be declared as SymbolList >> (or some subclass of Symbol). >> >> But that's also an upside - you could subclass Symbol into DNASymbol, >> RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the >> symbol and need not be specified separately: >> >> SymbolList dna = new SymbolList(); >> dna.add(RNAAlphabet.Q); // Throws standard List exception! >> >> SymbolList> = new ....; // Cool! >> >> Also cool is that you could do this: >> >> public SymbolList translate(SymbolList dna); >> // Also cool! > > Two problems with using generics that I've encountered: > > 1). getSomething(SymbolList list); & > getSomething(SymbolList list); as far as the compiler is > concerned are the same method. Both take in an instance of SymbolList > (remember that generics are a list minute bolt-on to the JDK 1.5 API and > it really shows). > > 2). It is impossible to infer the type of a generic i.e. > > public void doSomething(T genericObject) { > if(T.equals(String.class)) { > //Do something > } > } > > This T type is ... well magical. It exists but it doesn't. > > Anyway just be careful with generics. They save a lot of time & effort > but get too involved (or think they can solve everything like I did for > a short period) they're going to burn you badly or drive you mad for 1/2 > a day wondering why javac claims something you've written is bogus. 
>
> Andy
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG85U34C5LeMEKA/QRAl3pAKCFC6HBv5iXmGVKpuTwJQiwWuoMmwCdG/g2
ILxIABP6me8pfY995/e6A5M=
=a+oW
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Fri Sep 21 06:24:21 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 21 Sep 2007 11:24:21 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F39537.6040002@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk> <46F38E59.5030005@ebi.ac.uk> <46F39537.6040002@ebi.ac.uk>
Message-ID: <46F39BD5.3030305@ebi.ac.uk>

That's fair enough & spot on what generics are intended for. Also means if someone wanted to allow any symbols in it's easy enough to use:

void doSomething(List<? extends Symbol> symbols);

I've only ever needed to know what T was once & the easiest way around it is, as you've said, to check the first element of an input collection, or to take in a class & use generics to enforce the correct class type i.e.

<T> T getGenericType(Class<T> clazz);

String output = getGenericType(String.class); //This is ok
String output = getGenericType(Long.class); //This won't compile

I'd say so long as 'dodgy' things are refactored out to helper classes then they can change as & when better solutions come along. A good example of this is Spring's synchronized map builder, which at runtime will attempt to figure out what is available on the classpath and then return a map which will provide the best synchronized performance (for example if it's a pre 1.5 jdk it'll return a normal synchronized map, otherwise it'll use ConcurrentHashMap).
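
Both workarounds fit in a small helper class of the kind described above. This is only a sketch under assumptions: TypeTokens, firstOfType and elementTypeOf are invented names, and the first-element check is only reliable when the list is non-empty and really does hold a single type.

```java
import java.util.List;

/** Hypothetical helper keeping the 'dodgy' generic tricks in one place. */
final class TypeTokens {

    /**
     * Class-token approach: the caller hands in the Class object, so T is
     * available at runtime and the compiler checks the return type.
     * Returns null when no element of that type is present.
     */
    static <T> T firstOfType(List<?> items, Class<T> type) {
        for (Object item : items) {
            if (type.isInstance(item)) {
                return type.cast(item);
            }
        }
        return null;
    }

    /**
     * First-element approach: guess the element type from the first item.
     * Returns null for an empty list or a null first element.
     */
    static Class<?> elementTypeOf(List<?> items) {
        if (items.isEmpty() || items.get(0) == null) {
            return null;
        }
        return items.get(0).getClass();
    }
}
```

With the class token, assigning the result of firstOfType(items, Long.class) to a String variable is rejected by javac, which matches the "won't compile" case in the example above.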
Andy Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > For our purposes we'd use them to restrict valid input to a method - so > that if the user tries to write a program which passes in > List to a method which only takes List it'll throw > a wobbly at compile time. The methods would include the type in their > signature and therefore only accept lists of that type. > > We'd obviously have to be careful with method naming as you point out - > i.e. no overloading methods with different generic types of the same > parameters. > > The bit about not being able to work out what T is is a pain indeed, but > I don't think we'd need to use that. In most cases it can be solved > hackily - whenever Sun produces a proper way of doing it we'd use it of > course. (Hacky solution: If passed List, then do an instanceof > list.iterate().next() to find out type of first item in list, thus > implying the type of everything else in the list, assuming list is not > empty). > > cheers, > Richard > > > Andy Yates wrote: >>> Also could we make SymbolList implement List? The iterator() method >>> would then do the cached conversion if required before returning an >>> Iterator over the symbols. That would make it very pluggable. >>> We'd need it to have a settable flag indicating whether the user wants >>> 1-indexed or 0-indexed access (the default being 1-indexed as this is >>> the most common biological use). >> The important part of the 1.5 api is Iterable which can be applied to >> any class. It just means it will return an iterator & can be used in the >> new foreach loop construct. >> >>> Only downside is that List uses generics and so SymbolList must too - >>> meaning that SymbolList must always be declared as SymbolList >>> (or some subclass of Symbol). >>> >>> But that's also an upside - you could subclass Symbol into DNASymbol, >>> RNASymbol, etc. etc. 
- meaning that an alphabet is tied directly to the >>> symbol and need not be specified separately: >>> >>> SymbolList dna = new SymbolList(); >>> dna.add(RNAAlphabet.Q); // Throws standard List exception! >>> >>> SymbolList> = new ....; // Cool! >>> >>> Also cool is that you could do this: >>> >>> public SymbolList translate(SymbolList dna); >>> // Also cool! >> Two problems with using generics that I've encountered: >> >> 1). getSomething(SymbolList list); & >> getSomething(SymbolList list); as far as the compiler is >> concerned are the same method. Both take in an instance of SymbolList >> (remember that generics are a list minute bolt-on to the JDK 1.5 API and >> it really shows). >> >> 2). It is impossible to infer the type of a generic i.e. >> >> public void doSomething(T genericObject) { >> if(T.equals(String.class)) { >> //Do something >> } >> } >> >> This T type is ... well magical. It exists but it doesn't. >> >> Anyway just be careful with generics. They save a lot of time & effort >> but get too involved (or think they can solve everything like I did for >> a short period) they're going to burn you badly or drive you mad for 1/2 >> a day wondering why javac claims something you've written is bogus. 
>> >> Andy >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFG85U34C5LeMEKA/QRAl3pAKCFC6HBv5iXmGVKpuTwJQiwWuoMmwCdG/g2 > ILxIABP6me8pfY995/e6A5M= > =a+oW > -----END PGP SIGNATURE----- From heuermh at acm.org Sat Sep 22 01:29:22 2007 From: heuermh at acm.org (Michael Heuer) Date: Sat, 22 Sep 2007 01:29:22 -0400 (EDT) Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F3853B.7070701@ebi.ac.uk> Message-ID: I honestly haven't looked at it in a couple of years, but there is a proposal of mine for static generic symbols/symbol lists at > http://www3.shore.net/~heuermh/static-alphabet-generics.tar.gz Probably not useful or correct in its current state (I never did fully understand gap symbols) but it might be useful from a discussion standpoint. michael Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Also could we make SymbolList implement List? The iterator() method > would then do the cached conversion if required before returning an > Iterator over the symbols. That would make it very pluggable. > We'd need it to have a settable flag indicating whether the user wants > 1-indexed or 0-indexed access (the default being 1-indexed as this is > the most common biological use). > > Only downside is that List uses generics and so SymbolList must too - > meaning that SymbolList must always be declared as SymbolList > (or some subclass of Symbol). > > But that's also an upside - you could subclass Symbol into DNASymbol, > RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the > symbol and need not be specified separately: > > SymbolList dna = new SymbolList(); > dna.add(RNAAlphabet.Q); // Throws standard List exception! 
> > SymbolList> = new ....; // Cool! > > Also cool is that you could do this: > > public SymbolList translate(SymbolList dna); > // Also cool! > > cheers, > Richard > > Richard Holland wrote: > > I like that idea of having SymbolLists backed by different things. I'd > > suggest that by default, all sequences read from file should be > > String-backed SymbolLists, and that they are not broken down into > > Symbols until first requested to do so by code that needs to know the > > actual Symbols (e.g. code that cares about ambiguity symbols). The same > > applies in reverse - lists constructed from symbols should not be > > converted to strings until needed. > > > > Something like this: > > > > SymbolList sl = new SymbolList(); > > sl.setString("AGCGGACT"); > > // Changes the string, and clears any cached > > // conversion of it. > > String seq = sl.getString(); > > // Dumps the string. If not already converted > > // to a string, does the conversion and > > // caches it first. > > char base = sl.charAt(5); > > // 1-indexed single-base string. This would > > // likely delegate to String.charAt() and only > > // works for single-character alphabets. Not > > // to be used in any other cirumstances. > > sl.set/getAlphabet().... > > // Use these to set the alphabet before > > // using set/getSymbols()/symbolAt(). > > sl.setSymbols(new List(....)); > > // Uses the list to update the cached symbols > > // and clear the cached string. > > List syms = sl.getSymbols(); > > // Converts if not already converted, caches > > // the conversion, and returns it. > > Symbol sym = sl.symbolAt(5); > > // 1-indexed fully flexible symbol finder. > > > > toString() would delegate to getString(), as would hashCode(), equals(), > > and compareTo(). We could provide additional equals()-style methods for > > testing equality whilst taking into account ambiguities. 
> > > > cheers, > > Richard > > > > > > Mark Schreiber wrote: > >>> Hello - > >>> > >>> Just to clarify my opinion on Strings vs Symbols. > >>> > >>> I generally prefer Symbols and SymbolLists to Strings because > >>> SymbolLists are smart and Strings are dumb. Classic case is ambiguity > >>> symbols like 'W'. BioJava knows, in the context of DNA this is A or T. > >>> However, I think it would be vastly simpler if there were simpler > >>> getters and setters for SymbolLists that exposed Strings in a > >>> friendlier manner. > >>> > >>> I also think there is a case for SymbolLists that are backed by > >>> Strings (more likely a char[]) instead of Symbol arrays and only do > >>> the needed conversion when required (ie, when the user calls > >>> symbolAt()). These would be ideal for the case where someone is > >>> converting GenBank to Fasta and there is no need to go through the > >>> Symbol parsing. > >>> > >>> Finally, I think SymbolLists (or whatever they get called) should > >>> implement more of the methods found in String to make them look more > >>> like Strings. Ideally we should think about implementing some of the > >>> methods that Groovy likes to use for operator overloading. If we do > >>> this it would be possible to concatenate two sequences in Groovy by > >>> doing this (I may have the syntax wrong). > >>> > >>> Seq3 = Seq1 + Seq2 > >>> > >>> The other issue with SymbolLists is that they are not intuitive to > >>> construct because they are not so bean like. This is not just a > >>> problem for newbies but also a major hindrance to the use of JEE, > >>> Spring, JAXB and other important frameworks. It should be possible to > >>> do this: > >>> > >>> SymbolList sl = new SymbolList(); > >>> sl.setName("AB123456"); > >>> sl.setSequence(seqString); > >>> > >>> The final hindrance to the use of JEE is serialization. If we keep > >>> Symbols flyweight (singleton) we need to make this bullet proof from > >>> the start.
It is also practically impossible to make something a bean > >>> and make it a Singleton; some careful thought is required. If we keep > >>> symbols behind the scenes they may not need to be so bean like. > >>> > >>> - Mark > >>> > >>> On 9/21/07, george waldon wrote: > >>>> Hello, > >>>> > >>>> All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails. > >>>> > >>>> I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps toward understanding Symbol and cross-alphabets (...) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList). > >>>> > >>>> I am surprised by all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to perform most basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older ones) actually get their answer in the Cookbook. > >>>> > >>>> Richard wrote: > >>>>> It is suggested that development stops on the existing Biojava(...) > >>>> Well, I don't think the license can let you do that :-) > >>>> Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus a hazardous code rewrite. I can see some of the initial steps on the roadmap: > >>>> - Switch to Subversion repository > >>>> - Change of the build process compatible with creation of modules > >>>> - Improving testing frame (mentioned several times) > >>>> - Creation of white papers for coding practices, build releases, (others?) > >>>> > >>>> Then maybe the proper work of restructuring Biojava may start.
We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?) > >>>> > >>>> Anyway, all these are simple ideas. I am not an expert in the build process, but I can help with improving javadocs, writing examples and test cases. I also have a fair knowledge of the molecular biology package. > >>>> > >>>> Hope it helps, > >>>> George > >>>> > >>>> _______________________________________________ > >>>> biojava-dev mailing list > >>>> biojava-dev at lists.open-bio.org > >>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev > >>>> > >>> _______________________________________________ > >>> biojava-dev mailing list > >>> biojava-dev at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev > >>> > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFG84U64C5LeMEKA/QRAtyfAJ9PAsFu3+zjUhP3Xcs5imojL/cb/wCfRX8V > eOMOo3pCl71dPhZMyYlBBE4= > =NByU > -----END PGP SIGNATURE----- > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From heuermh at acm.org Sat Sep 22 01:35:37 2007 From: heuermh at acm.org (Michael Heuer) Date: Sat, 22 Sep 2007 01:35:37 -0400 (EDT) Subject: [Biojava-dev] The
future of BioJava In-Reply-To: Message-ID: Oh and don't forget to review Matthew's bjv2 rework of symbols and symbol lists in full generics regalia. michael From markjschreiber at gmail.com Sat Sep 22 07:31:29 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 22 Sep 2007 19:31:29 +0800 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F38E59.5030005@ebi.ac.uk> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk> <46F38E59.5030005@ebi.ac.uk> Message-ID: <93b45ca50709220431s5229a703w10beef514d3c2d99@mail.gmail.com> > 2). It is impossible to infer the type of a generic i.e. > > public void doSomething(T genericObject) { > if(T.equals(String.class)) { > //Do something > } > } > > This T type is ... well magical. It exists but it doesn't. > T only exists at compile time. It doesn't exist at all in the JVM. This is because Java generics are implemented using 'erasure'. It is a bit of a drawback but no getting around it. From markjschreiber at gmail.com Sat Sep 22 07:33:12 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 22 Sep 2007 19:33:12 +0800 Subject: [Biojava-dev] The future of BioJava In-Reply-To: References: Message-ID: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> bjv2 can be found at http://www.derkholm.net/svn/repos/bjv2 - Mark From gwaldon at geneinfinity.org Sat Sep 22 12:03:15 2007 From: gwaldon at geneinfinity.org (george waldon) Date: Sat, 22 Sep 2007 09:03:15 -0700 Subject: [Biojava-dev] The future of BioJava Message-ID: <20070922160315.33233.qmail@mmm1924.dulles19-verio.com> Thank you Mark for making the point so clearly. I could see String being used internally in SymbolList, but still what is really the point of rewriting a logic that is already present in the current code? Again, rewrite appears easier.
There is nothing that prevent us to write with the current code: SymbolList sl = new DNASymbolList(); sl.setName("AB123456"); sl.setSequence("aAaA-aAaA"); //a polyadenine Ok, setName and setSequence are not part of the current SymbolList interface but we can have SymbolListEx to complete it, or we can have a new SymbolList in the biojavax domain or the bj3 domain, or even we can create SymbolString (//cool!) in the current biojava domain. The point is that we can have both old and new interfaces coexisting in the same overall project and swap from one to another module per module and nothing is ever broken. - George From gwaldon at geneinfinity.org Sat Sep 22 12:35:22 2007 From: gwaldon at geneinfinity.org (george waldon) Date: Sat, 22 Sep 2007 09:35:22 -0700 Subject: [Biojava-dev] The future of BioJava Message-ID: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> Richard, You cannot kill biojava and it is not vista; you cannot force people to use it. I have a project with hundreds of classes using biojava and working without a glitch and the choice of either keeping with it or switching to a bj3 in the middle of a rewrite of around 1500 classes that may take months or years to complete. I may just never switch to the new biojava. Most likely, a lot of people are going to be in a similar situation and most likely bj3 will also have to have support old biojava classes - great! I agree that you cannot change interface but you can deprecate them and toss them after one release cycle or put them into a deprecated module that is not included in releases. The question becomes: what are the fundamental problems of biojava that truly justify a rewrite from the ground?
Certainly, need for a new symbol model could be one; maintenance and testing are not; modular structure is not; and use of generics is not - they do not break old code. George > -----Original Message----- > From: Richard Holland > To: george waldon > Cc: biojava-dev at biojava.org > Sent: 9/21/2007 12:54 AM > Subject: Re: [Biojava-dev] The future of BioJava > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi George. > > By 'stop development' I really meant just that active development > efforts would be focused on the new codebase rather than modifying the > existing one (except of course for fixing bugs, which is always > important and we wouldn't stop doing that until the new codebase was > well established as an alternative). > > I agree that modifying the existing codebase would improve many of the > problems currently experienced with it - code abstraction being just one > of them. BioJavaX was an attempt at doing this. The big stumbling block > was interfaces - users do not expect interfaces to change as it breaks > all code that already uses that interface. They also do not expect the > defined behaviour of methods in interfaces to change - which meant, for > instance, that I had real problems trying to get > RichFeature/RichLocation and RichLocation/Location to match up as some > parts of Feature and Location conflicted with the more realistic > requirements of their Rich* equivalents (e.g. circularity). > > If you change interfaces, you might as well start from scratch in terms > of the effect it has on end-user's code. Also, if we start from scratch, > it allows us to build up from the very basics the kind of robustness and > flexibility we need throughout the system. As mentioned in the original > posting the existing system is heavily sequence-focused, meaning that > even the simple task of scanning a set of features cannot be done > without also loading the associated sequences because the two are so > closely integrated. 
We need to make it much more flexible and I think > new code would give us a better opportunity to do so without being tied > into complying with existing interfaces or behaviour expectations. > > Having said that, I do expect large parts of the new codebase to be only > slightly modified copies of the original code, particularly regarding > recent developments such as genetic algorithms and phylogenetics. It > would be silly to write such logic all over again where the code is > relatively self-contained. > > cheers, > Richard > > > > george waldon wrote: > > Hello, > > > > All this is very exciting. I would certainly contribute to something > like that. A few remarks that come to my mind while reading all these > emails. > > > > I noticed that the tutorial has seriously improved - thanks for the > work. I remember my initial steps going to understanding Symbol and > cross-alphabets (...) Still, from time to time, I have difficulties with > basic things that are not intuitive to me such as "token", e.g. > Alphabet.getTokenization("token") or > SymbolTokenization.tokenizeSymbolList(SymbolList). > > > > I am surprised by all the requests to use String instead of > SymbolList. The CookBook tells precisely, and with code examples, how to > make most of all basic operations. Maybe someone could illustrate the > new kind of code versus the old one? I bet many newbies (and older ones) > actually get their answer in the Cookbook. > > > > Richard wrote: > >> It is suggested that development stops on the existing Biojava(...) > > Well, I don't think the license can let you do that :-) > > Writing new code might be easier but certainly making old code better > will improve the level of code abstraction. Therefore I am promoting > improving existing Biojava code versus hazardous code rewrite.
I can see > some of the initial steps on the roadmap: > > - Switch to Subversion repository > > - Change of the build process compatible with creation of modules > > - Improving testing frame (mentioned several times) > > - Creation of white papers for coding practices, build releases, > (others?) > > > > Then maybe the proper work of restructuring Biojava may start. We can > either divide the existing mammoth into multiple modules at first or - > my preference - building modules one by one by selectively picking > classes. This way it will be easy to find out classes that can be > deprecated (by lack of users) and we can even have a deprecated module > at the end. Some coupling may need to loosen up. We will also need a > list of API change for developers who will use the newer version. I am > sure that the kind of data structures proposed by Richard could find > their place as well as some of the proposed patterns (beans, others?) > > > > Anyway, all these are simple ideas. I am not an expert in build > process, but I can help with improving javadocs, writing examples and > test cases. I have also a fair knowledge of the molecular biology > package.
> > > > Hope it helps, > > George > > > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFG83jK4C5LeMEKA/QRAtOFAJsF9YNdgdsOm1KY65GyRehsO1ElYwCfeUfi > yXWTMXSzn3mXZqXXo9999rw= > =WbAQ > -----END PGP SIGNATURE----- From phidias51 at gmail.com Sat Sep 22 14:42:50 2007 From: phidias51 at gmail.com (Mark Fortner) Date: Sat, 22 Sep 2007 11:42:50 -0700 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> Message-ID: <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> Richard & Andy, 1. I like the idea of making readers more pluggable, and Dozer definitely looks interesting. Is this going to be supported via the Service Provider Interface approach (used by Taverna and other projects)? 2. Andy brought up the point of people who create non-standard variations of EMBL-formatted files. I was wondering if these files were created in programming languages other than Java? If so, would those users be willing to use a Jython, JRuby, or a Perl-like scripting language like Sleep,? This would allow them to use biojava as a library, and still use a scripting language whose syntax they were familiar with. They would also be producing files in a more standardized format. This might cut down on the number of parsing mistakes caused by "unsupported" file variations. You can go to http://scripting.dev.java.net for more information on the scripting languages that the Java VM supports. 3. Was there any reason why non-standard files were being created? Perhaps some use-case not being covered? 4. 
If BioJava is split up into a variety of smaller JARs, how would you insure that the users had all of the JARs that they needed? Would an installer be provided to allow users to select groups of JARs? There are a number of open source installers that would make this process easier. Using Maven is suitable if you're a developer, if you're a scripter it's a little more difficult to deal with. 5. Are there any thoughts about using a templating system like Velocity, FreeMarker or JST? This would make it easier to insure that files were produced in a standard fashion. It would also make it easier to maintain support for writing files in different file formats. 6. When it comes to unit testing and continuous building, is the bio*.org server going to handle that automated build & burn, or is someone in the group going to have to do it? I think the inability to have the build setup on the server had us stymied before. 7. Now that Java also includes the Derby database, and the Java Persistence API (JPA), has anyone considered migrating the BioSQL support from Hibernate to JPA, and using Derby as the default database? This would make it a little easier to maintain and would minimize the setup work that a new user would have to do. 8. Richard, you mention in the "Reasoning" section that "users have moved on". What types of use-cases beyond basic sequence analysis, should BioJava support? Would support for more of lab-related processes expand the user base and number of committers? Would support for parsing different types of instrument files be a useful addition? I could imagine use cases where users would like to be able to parse an Affy file and fetch probe information, gene information, and perhaps pathway data. 9. Are there any thoughts about using annotations (perhaps in combination with ontologies) to handle semantic validation of arguments? 
For example, you might have an annotation like @id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"} indicating that the attribute or method argument is a LocusLink id. Thanks for kick-starting this discussion? Regards, Mark Fortner From holland at ebi.ac.uk Sun Sep 23 07:16:14 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Sun, 23 Sep 2007 12:16:14 +0100 (BST) Subject: [Biojava-dev] The future of BioJava In-Reply-To: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> References: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> Message-ID: <51328.80.42.55.181.1190546174.squirrel@webmail.ebi.ac.uk> Understood. I was thinking of including a 'compatibility mode' module in BJ3 which provides all the existing BJ2 interfaces and maps them to the new ones. This way we have the best of both worlds - existing projects would replace the BJ2 jars with the new BJ3 jars on the classpath plus the compatibility jar and wouldn't need to change any code at all. Existing import statements would then pick up the compatibility mappings instead of the original classes. Anyhow your comments will definitely be considered. This is very early discussion after all - the point being to gather opinion and ideas to see if its even worth making a change, let alone what kind of change that would be. cheers, Richard PS. The most fundamental problem is that some of the existing interfaces are broken. They enforce situations which are not biologically logical - e.g. the feature and location interfaces have got strand mixed up. You can't fix this without altering the interfaces - and to alter the interfaces requires people to change existing code. If they're going to change existing code, why not make a clean sweep of it. Even deprecating for one release then removing in a subsequent one will still require you to change the 1500+ classes you mention, which is only delaying the problem. PPS. 
I will compile a comprehensive list of things I think are broken/wrong so that people can discuss specifically what should be done about them - whether they be rewrite or modification. I do want this to be a democratic process and if the majority of people don't want a particular plan of action to happen, then it won't. On Sat, September 22, 2007 5:35 pm, george waldon wrote: > Richard, > > You cannot kill biojava and it is not vista; you cannot force people to > use it. I have a project with hundreds of classes using biojava and > working without a glitch and the choice of either keeping with it or > switching to a bj3 in the middle of a rewrite of around 1500 classes that > may take months or years to complete. I may just never switch to the new > biojava. Most likely, a lot of people are going to be in a similar > situation and most likely bj3 will also have to have support old biojava > classes - great! > > I agree that you cannot change interface but you can deprecate them and > toss them after one release cycle or put them into a deprecated module > that is not included in releases. > > The question becomes: what are the fundamental problems of biojava that > truly justify a rewrite from the ground? Certainly, need for a new symbol > model could be one; maintenance and testing are not; modular structure is > not; and use of generics is not - they do not break old code. > > George > > >> -----Original Message----- >> From: Richard Holland >> To: george waldon >> Cc: biojava-dev at biojava.org >> Sent: 9/21/2007 12:54 AM >> Subject: Re: [Biojava-dev] The future of BioJava >> >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hi George. >> >> By 'stop development' I really meant just that active development >> efforts would be focused on the new codebase rather than modifying the >> existing one (except of course for fixing bugs, which is always >> important and we wouldn't stop doing that until the new codebase was >> well established as an alternative). 
>> >> I agree that modifying the existing codebase would improve many of the >> problems currently experienced with it - code abstraction being just one >> of them. BioJavaX was an attempt at doing this. The big stumbling block >> was interfaces - users do not expect interfaces to change as it breaks >> all code that already uses that interface. They also do not expect the >> defined behaviour of methods in interfaces to change - which meant, for >> instance, that I had real problems trying to get >> RichFeature/RichLocation and RichLocation/Location to match up as some >> parts of Feature and Location conflicted with the more realistic >> requirements of their Rich* equivalents (e.g. circularity). >> >> If you change interfaces, you might as well start from scratch in terms >> of the effect it has on end-user's code. Also, if we start from scratch, >> it allows us to build up from the very basics the kind of robustness and >> flexibility we need throughout the system. As mentioned in the original >> posting the existing system is heavily sequence-focused, meaning that >> even the simple task of scanning a set of features cannot be done >> without also loading the associated sequences because the two are so >> closely integrated. We need to make it much more flexible and I think >> new code would give us a better opportunity to do so without being tied >> into complying with existing interfaces or behaviour expectations. >> >> Having said that, I do expect large parts of the new codebase to be only >> slightly modified copies of the original code, particularly regarding >> recent developments such as genetic algorithms and phylogenetics. It >> would be silly to write such logic all over again where the code is >> relatively self-contained. >> >> cheers, >> Richard >> >> >> >> george waldon wrote: >> > Hello, >> > >> > All this is very exciting. I would certainly contribute to something >> like that. 
A few remarks that come to my mind while reading all these >> emails. >> > >> > I noticed that the tutorial has seriously improved - thanks for the >> work. I remember my initial steps going to understanding Symbol and >> cross-alphabets (...) Still, from time to time, I have difficulties >> with >> basic things that are not intuitive to me such as "token", e.g. >> Alphabet.getTokenization("token") or >> SymbolTokenization.tokenizeSymbolList(SymbolList). >> > >> > I am surprised by all the requests to use String instead of >> SymbolList. The CookBook tells precisely, and with code examples, how to >> make most of all basic operations. Maybe someone could illustrate the >> new kind of code versus the old one? I bet many newbies (and older ones) >> actually get their answer in the Cookbook. >> > >> > Richard wrote: >> >> It is suggested that development stops on the existing Biojava(...) >> > Well, I don't think the license can let you do that :-) >> > Writing new code might be easier but certainly making old code better >> will improve the level of code abstraction. Therefore I am promoting >> improving existing Biojava code versus hazardous code rewrite. I can see >> some of the initial steps on the roadmap: >> > - Switch to Subversion repository >> > - Change of the build process compatible with creation of modules >> > - Improving testing frame (mentioned several times) >> > - Creation of white papers for coding practices, build releases, >> (others?) >> > >> > Then maybe the proper work of restructuring Biojava may start. We can >> either divide the existing mammoth into multiple modules at first or - >> my preference - building modules one by one by selectively picking >> classes. This way it will be easy to find out classes that can be >> deprecated (by lack of users) and we can even have a deprecated module >> at the end. Some coupling may need to loosen up.
We will also need a >> list of API change for developers who will use the newer version. I am >> sure that the kind of data structures proposed by Richard could find >> their place as well as some of the proposed patterns (beans, others?) >> > >> > Anyway, all these are simple ideas. I am not an expert in build >> process, but I can help with improving javadocs, writing examples and >> test cases. I have also a fair knowledge of the molecular biology >> package. >> > >> > Hope it helps, >> > George >> > >> > _______________________________________________ >> > biojava-dev mailing list >> > biojava-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2.2 (GNU/Linux) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >> >> iD8DBQFG83jK4C5LeMEKA/QRAtOFAJsF9YNdgdsOm1KY65GyRehsO1ElYwCfeUfi >> yXWTMXSzn3mXZqXXo9999rw= >> =WbAQ >> -----END PGP SIGNATURE----- > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- Richard Holland BioMart (http://www.biomart.org/) EMBL-EBI Hinxton, Cambridgeshire CB10 1SD, UK From markjschreiber at gmail.com Sun Sep 23 07:48:03 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sun, 23 Sep 2007 19:48:03 +0800 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> References: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> Message-ID: <93b45ca50709230448w4e9dd109wbe647aa4520a60f3@mail.gmail.com> > You cannot kill biojava and it is not vista; you cannot force people to use it. I have a project with hundreds of classes using biojava and working without a glitch and the choice of either keeping with it or switching to a bj3 in the middle of a rewrite of around 1500 classes that may take months or years to complete. 
I may just never switch to the new biojava. Most likely, a lot of people are going to be in a similar situation and most likely bj3 will also have to have support old biojava classes - great! > If a new version of biojava came out that was not compatible there would be nothing to stop you from keeping the old bj1.5 jars in your lib directory and letting your application work off them. I've never been a fan of global class paths. Additionally if you don't need to switch there would be little point. There is little point in switching from bj1.4 to 1.5 unless you need some new feature or bug fix. Bj1.4 and 1.5 were also the first to even attempt backwards compatibility so I'm not totally convinced we need to support old classes and interfaces. I would hope the motivation for a switch would be a cleaner and easier to use code base. You're correct, we can't force people to use it. > I agree that you cannot change interface but you can deprecate them and toss them after one release cycle or put them into a deprecated module that is not included in releases. > Dropping a deprecated interface is about the same as changing it in terms of backwards compatibility. Although you do have the advantage of a little more warning. > The question becomes: what are the fundamental problems of biojava that truly justify a rewrite from the ground? Certainly, need for a new symbol model could be one; maintenance and testing are not; modular structure is not; and use of generics is not - they do not break old code. > > George > I would say the main argument is improving ease of use and getting away from the singleton symbol model. With generics I am interested to know what happens when you take an interface that requires List and change it to require List such as. public void foo(List l); to public void foo(List l) Due to erasure this may not break, old code would still run, however old code would probably no longer compile.
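A minimal sketch of the erasure point, assuming a hypothetical class ErasureDemo and using String as a stand-in element type (the type parameters seem to have been lost from the archived text above). Reflection shows that the parameter type the JVM links against is still plain List after the signature is genericised, which is why already-compiled callers would keep running. Whether old source recompiles cleanly is a separate question that this sketch does not settle.

```java
import java.lang.reflect.Method;
import java.util.List;

// Hypothetical demo class; nothing here is BioJava API.
final class ErasureDemo {

    // Stand-in for a method whose parameter used to be a raw List.
    // String is an assumed element type, chosen only for illustration.
    public static int foo(List<String> l) {
        return l.size();
    }

    // Looks foo up by its erased signature, foo(List), as the JVM linker would.
    private static Method lookupFoo() {
        try {
            return ErasureDemo.class.getMethod("foo", List.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // The parameter type after erasure: the type argument is gone.
    static String erasedParameterType() {
        return lookupFoo().getParameterTypes()[0].getName();
    }

    // The generic signature is still recorded as metadata for compilers and tools.
    static String genericParameterType() {
        return lookupFoo().getGenericParameterTypes()[0].getTypeName();
    }

    public static void main(String[] args) {
        System.out.println("erased:  " + erasedParameterType());
        System.out.println("generic: " + genericParameterType());
    }
}
```

The lookup by List.class succeeding is the relevant observation: a caller compiled against the pre-generics signature refers to the same erased method.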
- Mark From markjschreiber at gmail.com Sun Sep 23 08:06:21 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sun, 23 Sep 2007 20:06:21 +0800 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> Message-ID: <93b45ca50709230506j2cf48857h2f52b69b7a1e0594@mail.gmail.com> > 1. I like the idea of making readers more pluggable, and Dozer > definitely looks interesting. Is this going to be supported via the Service > Provider Interface approach (used by Taverna and other projects)? > An SPI interface would be a great addition. I believe taverna's is quite a nice feature. It would be good to have. > 2. Andy brought up the point of people who create non-standard > variations of EMBL-formatted files. I was wondering if these files were > created in programming languages other than Java? If so, would those users > be willing to use a Jython, JRuby, or a Perl-like scripting language like > Sleep,? This would allow them to use biojava as a library, and still use a > scripting language whose syntax they were familiar with. They would also be > producing files in a more standardized format. This might cut down on the > number of parsing mistakes caused by "unsupported" file variations. You can > go to http://scripting.dev.java.net for more information on the > scripting languages that the Java VM supports. > I think if we designed it right you could do a lot with Groovy with the added benefit of very java like syntax. Richard and I did discuss the possibility of having all I/O file processing written in Groovy and compiled to classes. > 3. Was there any reason why non-standard files were being created? > Perhaps some use-case not being covered? > Non standard GenBank type files are made by VectorNTI. Also formats change over the years. 
I think this recently happened with EMBL format. Unfortunately flatfiles unlike XML do not have versioning or need to validate against a definition. > 4. If BioJava is split up into a variety of smaller JARs, how would > you insure that the users had all of the JARs that they needed? Would an > installer be provided to allow users to select groups of JARs? There are a > number of open source installers that would make this process easier. Using > Maven is suitable if you're a developer, if you're a scripter it's a little > more difficult to deal with. > Many projects are distributed as multiple jars (eg hibernate). Typically the user would download the core bundle and put them in a lib folder. Additional jars could be downloaded for extra activities. > > 6. When it comes to unit testing and continuous building, is the > bio*.org server going to handle that automated build & burn, or is someone > in the group going to have to do it? I think the inability to have the > build setup on the server had us stymied before. The open-bio servers are a natural choice but I think a discussion of the pros and cons of others is a good idea. > > 7. Now that Java also includes the Derby database, and the Java > Persistence API (JPA), has anyone considered migrating the BioSQL support > from Hibernate to JPA, and using Derby as the default database? This would > make it a little easier to maintain and would minimize the setup work that a > new user would have to do. > I agree on this. This is also a good argument for making our classes more bean like so they can be easily turned into enterprise beans. A nice part of JPA is that you can use hibernate to do the persistence. Having the Derby database built in offers other interesting possibilities as well. > 8. Richard, you mention in the "Reasoning" section that "users have > moved on". What types of use-cases beyond basic sequence analysis, should > BioJava support? 
Would support for more of lab-related processes expand the > user base and number of committers? Would support for parsing different > types of instrument files be a useful addition? I could imagine use cases > where users would like to be able to parse an Affy file and fetch probe > information, gene information, and perhaps pathway data. > > 9. Are there any thoughts about using annotations (perhaps in > combination with ontologies) to handle semantic validation of arguments? > For example, you might have an annotation like > > @id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"} > > indicating that the attribute or method argument is a LocusLink id. > I think this is an excellent example of how we can use Annotations. It would allow quite a bit of flexibility for integration tasks. - Mark Schreiber > Mark Fortner > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From bugzilla-daemon at portal.open-bio.org Mon Sep 24 01:43:12 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Sep 2007 01:43:12 -0400 Subject: [Biojava-dev] [Bug 2371] New: ChromatogramFactory.create fails on Windows Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2371 Summary: ChromatogramFactory.create fails on Windows Product: BioJava Version: 1.5 Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: bio AssignedTo: biojava-dev at biojava.org ReportedBy: hkaya at be.itu.edu.tr The following small code perfectly runs on Linux but it fails on Windows. I'm using biojava 1.5. 1. 
TraceTest.java

    import java.io.File;
    import org.biojava.bio.chromatogram.Chromatogram;
    import org.biojava.bio.chromatogram.ChromatogramFactory;

    public class TraceTest {
        public static void main(String[] args) {
            try {
                Chromatogram trace = ChromatogramFactory.create(new File("test.scf"));
                System.out.println("Success!");
            } catch (Exception e) {
                System.out.println("Failed!");
                e.printStackTrace();
            }
        }
    }

2. Test chromatogram file : http://istanbul.be.itu.edu.tr/~huseyin/test.scf

3. Exception thrown in Windows XP/Vista:

    Desktop> java -classpath biojava-1.5.jar;. TraceTest
    Exception in thread "main" org.biojava.bio.BioError: Unable to initialize DNATools
        at org.biojava.bio.seq.DNATools.(DNATools.java:117)
        at org.biojava.bio.program.scf.SCF$V3Parser.parseSamples(SCF.java:560)
        at org.biojava.bio.program.scf.SCF$Parser.parse(SCF.java:350)
        at org.biojava.bio.program.scf.SCF$ParserFactory.parse(SCF.java:206)
        at org.biojava.bio.program.scf.SCF.load(SCF.java:149)
        at org.biojava.bio.program.scf.SCF.load(SCF.java:141)
        at org.biojava.bio.program.scf.SCF.create(SCF.java:126)
        at org.biojava.bio.chromatogram.ChromatogramFactory.create(ChromatogramFactory.java:75)
        at TraceTest.main(TraceTest.java:8)
    Caused by: org.biojava.bio.BioError: Unable to initialize RNATools
        at org.biojava.bio.seq.RNATools.(RNATools.java:126)
        at org.biojava.bio.seq.DNATools.(DNATools.java:110)
        ... 8 more
    Caused by: org.biojava.bio.BioError: Couldn't parse TranslationTables.xml
        at org.biojava.bio.seq.RNATools.loadGeneticCodes(RNATools.java:529)
        at org.biojava.bio.seq.RNATools.(RNATools.java:124)
        ... 9 more
    Caused by: org.biojava.bio.symbol.IllegalSymbolException: Token `his' does not appear as a named symbol in alphabet `PROTEIN-TERM'
        at org.biojava.bio.seq.io.NameTokenization.parseToken(NameTokenization.java:110)
        at org.biojava.bio.seq.RNATools.loadGeneticCodes(RNATools.java:520)
        ...
10 more -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From holland at ebi.ac.uk Mon Sep 24 03:42:34 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 24 Sep 2007 08:42:34 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <93b45ca50709230448w4e9dd109wbe647aa4520a60f3@mail.gmail.com> References: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> <93b45ca50709230448w4e9dd109wbe647aa4520a60f3@mail.gmail.com> Message-ID: <46F76A6A.8010401@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Mark - I'll use your comments as the basis of a list of things that are broken and need fixing, which I mentioned over the weekend that I would set up. This list will expand over time I'm sure. > With generics I am interested to know what happens when you take an > interface that requires List and change it to require List > such as. > > public void foo(List l); > > to > > public void foo(List l) > > Due to erasure this may not break, old code would still run, however > old code would probably no longer compile. Old code would still compile and run just fine, with warnings at compile time indicating that a genericised collection has been used without specifying the generic type. However, you wouldn't be able to pass in a List to a method which accepts a genericised List without casting it, as that is a compile-time error. So your foo() example above would not work. 
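The two cases can be checked with a small sketch. The names RawListCallDemo, callWithRawList and callWithMatchingList are hypothetical, foo only mirrors the name used in the thread, and String stands in for the element type, which appears to have been lost from the archived text. Under those assumptions, a raw List is accepted with an unchecked warning, while a list declared with a different type argument is rejected by the compiler (shown as a comment so that the file still compiles).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; nothing here is BioJava API.
final class RawListCallDemo {

    // Stand-in for the genericised method. String is an assumed element type.
    static int foo(List<String> l) {
        return l.size();
    }

    // Pre-generics style caller: compiles, but javac reports unchecked
    // warnings, which are suppressed here to keep the output quiet.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static int callWithRawList() {
        List raw = new ArrayList();
        raw.add("ACGT");
        raw.add("TTGA");
        return foo(raw); // unchecked conversion from List to List<String>
    }

    // Caller that declares the matching type argument: no warning at all.
    static int callWithMatchingList() {
        List<String> typed = new ArrayList<String>();
        typed.add("ACGT");
        return foo(typed);
    }

    // A caller with a different type argument is a compile-time error:
    //
    //     List<Object> objects = new ArrayList<Object>();
    //     foo(objects); // error: List<Object> cannot be converted to List<String>

    public static void main(String[] args) {
        System.out.println(callWithRawList());
        System.out.println(callWithMatchingList());
    }
}
```

So source that used raw collections keeps compiling with warnings, and only callers that already declared an incompatible type argument need to change.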
cheers, Richard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG92pq4C5LeMEKA/QRAr4UAJ9FbCVFdE4enQrbFclNZx36RQaCBwCeM07d E+oDZ3+smexvGFWAA0eHeFM= =rFu5 -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Mon Sep 24 04:26:27 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Mon, 24 Sep 2007 09:26:27 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> Message-ID: <46F774B3.4010209@ebi.ac.uk> Hi Mark, > > 2. Andy brought up the point of people who create non-standard > variations of EMBL-formatted files. I was wondering if these files were > created in programming languages other than Java? If so, would those users > be willing to use a Jython, JRuby, or a Perl-like scripting language like > Sleep,? This would allow them to use biojava as a library, and still use a > scripting language whose syntax they were familiar with. They would also be > producing files in a more standardized format. This might cut down on the > number of parsing mistakes caused by "unsupported" file variations. You can > go to http://scripting.dev.java.net for more information on the > scripting languages that the Java VM supports. > > 3. Was there any reason why non-standard files were being created? > Perhaps some use-case not being covered? These files are not being created by accident just the groups that are producing them have different requirements wrt the data they release. So they want to produce EMBL flat files which have the look/markup of an EMBL record yet do not follow the same rules as EMBL. A good example (if memory serves me correctly) is UniProtKB. The specification of UniProtKB records are different to EMBL yet both output files of a similar markup. 
So it's not so much biojava not supporting a use-case, or a group producing flat files with a custom writer; they just have a different requirement. Getting BioJava to support them all is just a non-starter considering the number of projects available. A better way is just to let people plug into the grammar/objects for parsing these file formats & then groups can choose to release their parsing code or not. > 4. If BioJava is split up into a variety of smaller JARs, how would > you insure that the users had all of the JARs that they needed? Would an > installer be provided to allow users to select groups of JARs? There are a > number of open source installers that would make this process easier. Using > Maven is suitable if you're a developer, if you're a scripter it's a little > more difficult to deal with. Yes that's very true. We're encountering similar problems in our group where we have a set of people working on new maven projects & older projects still using Ant. Our solution at the moment is producing maven assemblies which cover different use cases & end users choose which one suits their needs the most. If we're talking about scripters though then it's probably easier to have it written on the wiki with a 'first steps' in the major JVM scripting languages (I'm thinking Groovy, JavaScript & JRuby should cover the bases). > 5. Are there any thoughts about using a templating system like > Velocity, FreeMarker or JST? This would make it easier to insure that files > were produced in a standard fashion. It would also make it easier to > maintain support for writing files in different file formats. I'd prefer to use StringTemplate (just because it's a push based templating system not a pull like Velocity) but yeah I can see it being very useful. > 6. When it comes to unit testing and continuous building, is the > bio*.org server going to handle that automated build & burn, or is someone > in the group going to have to do it?
I think the inability to have the > build setup on the server had us stymied before. I think that Andreas is in a better position to answer this one maybe but I'm guessing we can schedule the builds on a time basis along with building on each commit into the repository. > 7. Now that Java also includes the Derby database, and the Java > Persistence API (JPA), has anyone considered migrating the BioSQL support > from Hibernate to JPA, and using Derby as the default database? This would > make it a little easier to maintain and would minimize the setup work that a > new user would have to do. Hibernate supports JPA so the switch shouldn't be hard to do if needed. That said Hibernate is still the 'market leader' when it comes to Java based persistence so I'm not to worried about this. > > 8. Richard, you mention in the "Reasoning" section that "users have > moved on". What types of use-cases beyond basic sequence analysis, should > BioJava support? Would support for more of lab-related processes expand the > user base and number of committers? Would support for parsing different > types of instrument files be a useful addition? I could imagine use cases > where users would like to be able to parse an Affy file and fetch probe > information, gene information, and perhaps pathway data. I'm already aware of people doing the Affy parsing themselves (I was involved with writing the parsers for their XDA data format ... bloody unsigned big endian ints) but the code was never incorporated into biojava because the group wasn't 100% comfortable about releasing the code. But yes there are a lot of other use cases out there that I'm sure we're unaware of. Our only choice is to see if we can get people to contribute ideas to this stage of development & give people the opportunity to contribute code as & when it's required. > > 9. Are there any thoughts about using annotations (perhaps in > combination with ontologies) to handle semantic validation of arguments? 
> For example, you might have an annotation like
>
> @id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"}
>
> indicating that the attribute or method argument is a LocusLink id.
>

That's quite an interesting idea. Not sure where else to introduce them, if they are required, but it's a good idea :)

Andy

From ap3 at sanger.ac.uk Mon Sep 24 05:39:17 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 24 Sep 2007 10:39:17 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F774B3.4010209@ebi.ac.uk>
References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> <46F774B3.4010209@ebi.ac.uk>
Message-ID:

>> 6. When it comes to unit testing and continuous building, is the
>> bio*.org server going to handle that automated build & burn, or
>> is someone
>> in the group going to have to do it? I think the inability to
>> have the
>> build setup on the server had us stymied before.
>
> I think that Andreas is in a better position to answer this one maybe
> but I'm guessing we can schedule the builds on a time basis along with
> building on each commit into the repository.

Just to clarify regarding the continuous builds:

There is an auto-build running for BioJava at http://www.spice-3d.org/cruise/

A build is triggered ~20 minutes after a CVS commit as well as regularly every night.

The following things happen every build:
* all junit tests are run (see http://www.spice-3d.org/cruise/buildresults/biojava-live?tab=testResults )
* the latest javadoc is created ( http://www.spice-3d.org/public-files/javadoc/biojava/ )
* provide a biojava.jar file for download
* provide a biojava-src.jar (source code bundle).

If the build fails (i.e. somebody committed broken code to CVS) this mailing list will be notified.
Actually this is untested so far since CVS has been fine in the last few weeks :-) I set this up on my own machine, since I don't have admin rights on open-bio, but if people would prefer to have it running from there, it should be fairly simple to move the setup. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From bugzilla-daemon at portal.open-bio.org Thu Sep 27 02:16:50 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 27 Sep 2007 02:16:50 -0400 Subject: [Biojava-dev] [Bug 2359] SingleDP deserialization fails In-Reply-To: Message-ID: <200709270616.l8R6Gos7012751@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2359 mark.schreiber at novartis.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mark.schreiber at novartis.com 2007-09-27 02:16 EST ------- Bug is fixed. Unit tests committed. Also fixed similar problem with PairDP -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From markjschreiber at gmail.com Thu Sep 27 05:40:53 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 27 Sep 2007 17:40:53 +0800 Subject: [Biojava-dev] Taglets Message-ID: <93b45ca50709270240x7383243u34d84083200bbf20@mail.gmail.com> Would anyone object if I removed the taglets from biojava? 
They are not widely used in the javadocs (to say the least) and they seem to require unstable imports of com.sun packages which means they break whenever you change JDK. - Mark From holland at ebi.ac.uk Thu Sep 27 06:03:09 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 27 Sep 2007 11:03:09 +0100 Subject: [Biojava-dev] Taglets In-Reply-To: <93b45ca50709270240x7383243u34d84083200bbf20@mail.gmail.com> References: <93b45ca50709270240x7383243u34d84083200bbf20@mail.gmail.com> Message-ID: <46FB7FDD.2030903@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 fine by me. anything unstable or non-standard is not good news. Mark Schreiber wrote: > Would anyone object if I removed the taglets from biojava? > > They are not widely used in the javadocs (to say the least) and they > seem to require unstable imports of com.sun packages which means they > break whenever you change JDK. > > - Mark > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG+3/d4C5LeMEKA/QRAmgxAJ4yn14N/ViH7lamr3fPD4YhdVci2gCfSvtj INXwKGRUM0Kw8Nw8Jtd+8YQ= =EJ7R -----END PGP SIGNATURE----- From pranav.waila at gmail.com Thu Sep 27 03:05:46 2007 From: pranav.waila at gmail.com (pranav waila) Date: Thu, 27 Sep 2007 07:05:46 -0000 Subject: [Biojava-dev] help (F1 F1) regarding biojava Message-ID: <5c85b5bb0709270005m366227acx65ca6151e870ce79@mail.gmail.com> package org.biojava.bio.seq.db; import org.biojava.bio.seq.*; import org.biojava.bio.seq.io.*; import org.biojava.bio.*; import org.biojava.bio.seq.db.*; import org.biojava.bio.seq.io.*; import java.net.*; class test{ public static SequenceFormat sf; public static void main(String args[]){ System.getProperties().put("proxySet","true"); System.getProperties().put("proxyPort","3128"); 
System.getProperties().put("proxyHost","172.16.0.6"); Sequence s; sf=new FastaFormat(); //sf=new GenbankXmlFormat(); NCBISequenceDB ncbi = new NCBISequenceDB( NCBISequenceDB.DB_PROTEIN,sf);//new FastaFormat()); // GenbankSequenceDB gdb=new GenbankSequenceDB(); //ncbi.setSequenceFormat(FASTA); try{ // Sequence sequenceFromGenbank = ncbi.getSequence("P10659"); // System.out.println(sequenceFromGenbank.getName()); // older code s=ncbi.getSequence("P10659"); //s=gdb.getAddress();//getSequence("P10659");//3789789"); System.out.print("check"); System.out.println(ncbi.getSequence("190786")); // } catch(Exception e){ e.printStackTrace(); //System.out.println("protien name is : zinteminia"); } } } This is my code for fetching the sequence from NCBI but it is giving somany exceptions. can u provide me some code to do so.. the errors are as follows : Bio java exception could not read sequence CAN U PLEASE HELP ME. waiting for reply PRANAV WAILA
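Editorial note on the NCBI fetch code posted earlier in this archive by Pranav Waila: one thing that may be worth checking, without claiming it explains the reported exception, is the set of proxy property names. The names the JDK documents for an HTTP proxy are `http.proxyHost` and `http.proxyPort`. The sketch below only sets and echoes those two properties; the class and method names (`ProxyConfigSketch`, `configureHttpProxy`) are hypothetical, and the host and port are simply the values from the posted code.

```java
public class ProxyConfigSketch {

    /** Sets the JDK's documented HTTP proxy system properties for this JVM. */
    static void configureHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Host and port copied from the posted code; adjust for your own network.
        configureHttpProxy("172.16.0.6", 3128);

        // Echo the values back so it is clear what the JVM will see.
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

It is probably safest to set these properties before the first network connection is opened, for example at the very top of `main`, or on the command line with `-Dhttp.proxyHost=... -Dhttp.proxyPort=...`.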
From holland at ebi.ac.uk Mon Sep 3 07:42:10 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 03 Sep 2007 08:42:10 +0100 Subject: [Biojava-dev] cruisecontrol In-Reply-To: References: <93b45ca50708281716x3c2cadc9me9cb0b00b52647a2@mail.gmail.com> Message-ID: <46DBBAD2.3040900@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 that's really cool. is there any way of integrating it into the open-bio servers that the rest of BioJava lives on? Andreas Prlic wrote: > Hi, > > CruiseControl is now running at > > http://www.spice-3d.org/cruise/ > > it does: > * trigger a new build 20 minutes after a new commit to CVS > * run the junit tests > * build the javadocs > * provide the latest biojava.jar for download > * send a notification email if something goes wrong (and only then) > to this list > > This basically works with a chain of ant scripts that are triggered > by CruiseControl, > so it is easy to add other functionality / exchange CVS with > subversion, etc. > > Andreas > > > > On 29 Aug 2007, at 01:16, Mark Schreiber wrote: > >> Sounds good. >> >> Thomas had a script running off his home machine a while ago for >> nightly builds which I have missed since he stopped running it. >> >> Notifications of failed tests would be good too. >> >> - Mark >> >> On 8/28/07, Andreas Prlic wrote: >>> Hi biojava - devs, >>> >>> would you be interested in getting CruiseControl running for BioJava? 
>>> >>> It would allow us to >>> >>> * provide nightly builds of biojava, >>> * run unit test in regular intervals, >>> * get a notification email sent to biojava-dev if the CVS does not >>> build >>> >>> http://cruisecontrol.sourceforge.net/ >>> >>> If there is interest for this I will set it up, >>> >>> Andreas >>> >>> --------------------------------------------------------------------- >>> -- >>> >>> Andreas Prlic Wellcome Trust Sanger Institute >>> Hinxton, Cambridge CB10 1SA, UK >>> +44 (0) 1223 49 6891 >>> >>> --------------------------------------------------------------------- >>> -- >>> >>> >>> >>> -- >>> The Wellcome Trust Sanger Institute is operated by Genome Research >>> Limited, a charity registered in England with number 1021457 and a >>> company registered in England with number 2742969, whose registered >>> office is 215 Euston Road, London, NW1 2BE. >>> _______________________________________________ >>> biojava-dev mailing list >>> biojava-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>> > > ----------------------------------------------------------------------- > > Andreas Prlic Wellcome Trust Sanger Institute > Hinxton, Cambridge CB10 1SA, UK > +44 (0) 1223 49 6891 > > ----------------------------------------------------------------------- > > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG27rS4C5LeMEKA/QRAvpaAJ0f7bC3OeMoqGUGPiQ2zX9YTfq/2ACcCKWu qo+/SvcrG0a5Ycf9H1XmSsY= =CR4f -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Mon Sep 3 13:15:35 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 3 Sep 2007 14:15:35 +0100 Subject: [Biojava-dev] cruisecontrol In-Reply-To: <46DBBAD2.3040900@ebi.ac.uk> References: <93b45ca50708281716x3c2cadc9me9cb0b00b52647a2@mail.gmail.com> <46DBBAD2.3040900@ebi.ac.uk> Message-ID: <948C3625-6C4B-4D9B-A739-980579CC12D9@sanger.ac.uk> > is there any way of integrating it into 
the open-bio servers that the
> rest of BioJava lives on?

The simplest way is to link to it. Another possibility is to run it directly on the open-bio servers, but I do not have any admin permissions to set this up.

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

-----------------------------------------------------------------------

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From bugzilla-daemon at portal.open-bio.org Thu Sep 6 14:11:26 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 Sep 2007 10:11:26 -0400
Subject: [Biojava-dev] [Bug 2330] DP/ Profile HMM bug
In-Reply-To:
Message-ID: <200709061411.l86EBQbm011955@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2330

mark.schreiber at novartis.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from mark.schreiber at novartis.com 2007-09-06 10:11 EST -------
Added change listener to LinearAlphabetIndex so that it rebuilds as Symbols are added and removed from the Alphabet.

Interestingly, SimpleAlphabet was not emitting a ChangeEvent when removing Symbols; this is fixed now.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
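Editorial sketch of the pattern described in comment #1 on Bug 2330: an index registers itself as a listener on a mutable alphabet and rebuilds whenever symbols are added or removed. This is only an illustration under stated assumptions. `ToyAlphabet`, `ToyIndex` and `AlphabetListener` are hypothetical classes written for this note; they are not BioJava's `SimpleAlphabet`, `LinearAlphabetIndex` or `ChangeEvent` types, and the real fix may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AlphabetIndexSketch {

    /** Hypothetical listener interface; not BioJava's ChangeListener. */
    interface AlphabetListener {
        void alphabetChanged(ToyAlphabet source);
    }

    /** Hypothetical mutable alphabet that notifies listeners on add AND remove. */
    static class ToyAlphabet {
        private final Set<String> symbols = new LinkedHashSet<String>();
        private final List<AlphabetListener> listeners = new ArrayList<AlphabetListener>();

        void addListener(AlphabetListener listener) {
            listeners.add(listener);
        }

        void addSymbol(String symbol) {
            if (symbols.add(symbol)) {
                fireChanged();
            }
        }

        void removeSymbol(String symbol) {
            // Forgetting this notification is the kind of omission the comment describes.
            if (symbols.remove(symbol)) {
                fireChanged();
            }
        }

        /** Returns a snapshot of the symbols in insertion order. */
        List<String> symbols() {
            return new ArrayList<String>(symbols);
        }

        private void fireChanged() {
            for (AlphabetListener listener : listeners) {
                listener.alphabetChanged(this);
            }
        }
    }

    /** Hypothetical index from symbol to position, rebuilt on every change. */
    static class ToyIndex implements AlphabetListener {
        private List<String> ordered;

        ToyIndex(ToyAlphabet alphabet) {
            ordered = alphabet.symbols();
            alphabet.addListener(this);
        }

        @Override
        public void alphabetChanged(ToyAlphabet source) {
            ordered = source.symbols();
        }

        /** Returns the position of the symbol, or -1 if it is not indexed. */
        int indexOf(String symbol) {
            return ordered.indexOf(symbol);
        }

        int size() {
            return ordered.size();
        }
    }

    public static void main(String[] args) {
        ToyAlphabet alphabet = new ToyAlphabet();
        ToyIndex index = new ToyIndex(alphabet);
        alphabet.addSymbol("a");
        alphabet.addSymbol("c");
        alphabet.addSymbol("g");
        System.out.println(index.indexOf("g")); // prints 2
        alphabet.removeSymbol("c");
        System.out.println(index.indexOf("g")); // prints 1
        System.out.println(index.indexOf("c")); // prints -1
    }
}
```

Rebuilding the whole index on every change keeps the sketch short; an index over a large alphabet would more likely update incrementally.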
From bugzilla-daemon at portal.open-bio.org Sat Sep 8 23:02:38 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 8 Sep 2007 19:02:38 -0400
Subject: [Biojava-dev] [Bug 2359] New: SingleDP deserialization fails
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2359

           Summary: SingleDP deserialization fails
           Product: BioJava
           Version: 1.5
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: dist/dp
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: daniel.rohrbach at web.de

It isn't possible to load an instance of SingleDP via serialization. SingleDP, which is serializable, inherits from DP, which is not serializable. Saving the object works well.

I used biojava 1.5 latest build 9/8/07 2:22 AM but the same exception occurs in all 1.5 builds.

The reason for the bug is that SingleDP extends DP, which is not serializable. In that case DP must provide an accessible no-args constructor, but it doesn't! Because of that the same should occur for PairwiseDP.

the stack trace:

java.io.InvalidClassException: org.biojava.bio.dp.onehead.SingleDP; org.biojava.bio.dp.onehead.SingleDP; no valid constructor
        at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:713)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1733)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at biojavabugs.SingleDBBug.main(SingleDBBug.java:92)
Caused by: java.io.InvalidClassException: org.biojava.bio.dp.onehead.SingleDP; no valid constructor
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:471)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:310)

and the code I used to cause the bug:

//create the HMM
ProfileHMM hmm = new ProfileHMM(ProteinTools.getAlphabet(), 12, DistributionFactory.DEFAULT, DistributionFactory.DEFAULT, "biojava profile hmm");
//create an SingleDP object which we want to save and load
DP dp =
(SingleDP) DPFactory.DEFAULT.createDP(hmm); try { // // saving // // the filename File load = new File("/home/dani/Desktop/dp"); FilePermission fp = new FilePermission(load.getAbsolutePath(), "write"); if(!load.createNewFile()) { throw new IOException("file '" + load.getAbsolutePath() + "' could not be created!"); } FileOutputStream fos = new FileOutputStream(load); ObjectOutputStream oos = new ObjectOutputStream(fos); //store object to disk oos.writeObject(dp); oos.close(); // // loading // // try to load the SingleDP object fp = new FilePermission( load.getAbsolutePath(), "read"); FileInputStream fis = new FileInputStream(load); ObjectInputStream ois = new ObjectInputStream(fis); Vector v = new Vector(); //here is where the EXCEPTION occurs Object o = ois.readObject(); v.add(o); System.out.println("loaded Object!"); // System.out.println(obj.toString()); } catch (ClassNotFoundException ex) { ex.printStackTrace(); } catch (IOException ex) { ex.printStackTrace(); } -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Sep 8 23:20:12 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 8 Sep 2007 19:20:12 -0400 Subject: [Biojava-dev] [Bug 2360] New: saving of ProfileHmm cause NullPointerException Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2360 Summary: saving of ProfileHmm cause NullPointerException Product: BioJava Version: 1.5 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: dist/dp AssignedTo: biojava-dev at biojava.org ReportedBy: daniel.rohrbach at web.de saving an untrained ProfileHMM via serialization cause a NullPointerException. After training the model saving works well. 
I used biojava 1.5 latest build 9/8/07 2:22 AM the stack trace: Exception in thread "main" java.lang.NullPointerException at org.biojava.bio.dist.SimpleDistribution.writeObject(SimpleDistribution.java:79) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326) at java.util.HashSet.writeObject(HashSet.java:267) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474) at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326) at biojavabugs.SingleDBBug.main(SingleDBBug.java:66) and the code I used to create the exception //create the HMM ProfileHMM hmm = new ProfileHMM(ProteinTools.getAlphabet(), 12, DistributionFactory.DEFAULT, DistributionFactory.DEFAULT, "biojava profile hmm"); // the filename File load = new File("/home/dani/Desktop/hmm"); FilePermission fp = new FilePermission(load.getAbsolutePath(), "write"); if(!load.createNewFile()) { throw new IOException("file '" + load.getAbsolutePath() + "' could not be created!"); } FileOutputStream fos = new FileOutputStream(load); ObjectOutputStream oos = new ObjectOutputStream(fos); //store object to disk //here comes the exception oos.writeObject(hmm); oos.close(); //create an SingleDP object which we want to save and load DP dp = (SingleDP) DPFactory.DEFAULT.createDP(hmm); PairwiseDP pdp = new PairwiseDP(hmm, new DPInterpreter.Maker()); -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jflatow at northwestern.edu Fri Sep 14 23:15:55 2007 From: jflatow at northwestern.edu (Jared Flatow) Date: Fri, 14 Sep 2007 18:15:55 -0500 Subject: [Biojava-dev] New Developer Message-ID: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu> Hi all, I am interested in getting involved with the BioJava community. 
I have just joined a Bioinformatics Core, and we use a lot of R/ BioConductor right now, however we have some new projects that we would like to begin working in BioJava. I have checked out the source and compiled, however I noticed that the README was slightly out of sync with the project (the build targets listed are not the same as the build.xml). I made the updates to the README and would have liked to commit them, but as I am sure you are all aware, I do not have write access. I cannot yet say how much or little I would be able to contribute to the project, or even in what areas, however I think it could be beneficial to the community if I were permitted to make changes like this. I am sure synchronizing documentation is the last thing on your minds, and often it is hard to see that instructions might be unclear when you have been working on a project for a long time. I also think it could be a good opportunity to get to know the community and perhaps ease my way into becoming a more active developer. Please let me know what you think! Thanks! Jared From markjschreiber at gmail.com Sat Sep 15 08:54:56 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 15 Sep 2007 16:54:56 +0800 Subject: [Biojava-dev] New Developer In-Reply-To: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu> References: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu> Message-ID: <93b45ca50709150154y437eaa91h68c422fad235565b@mail.gmail.com> Hi Jared - It's great to have people checking and reporting these things. A CVS account can be arranged if you make regular contributions however in the meantime you can email a patch or fix to the dev list. Because the open-bio lists block attachments (and even HTML email) to prevent spamming the easiest way to submit a fix is as text in the body of your email. One of the core developers can then check it in. Additionally if you notice any problems with the documentation at www.biojava.org please feel free to correct the wiki. 
Finally if you notice bugs please report them to the bugzilla site linked from the main page of biojava.org so that we can track them. Even better if you can submit a possible fix at the same time so we can make the change and create a test to prevent it from re-emerging later.

Thanks for your help. Contributions are always appreciated.

- Mark

On 9/15/07, Jared Flatow wrote:
> Hi all,
>
> I am interested in getting involved with the BioJava community. I
> have just joined a Bioinformatics Core, and we use a lot of R/
> BioConductor right now, however we have some new projects that we
> would like to begin working in BioJava. I have checked out the source
> and compiled, however I noticed that the README was slightly out of
> sync with the project (the build targets listed are not the same as
> the build.xml). I made the updates to the README and would have liked
> to commit them, but as I am sure you are all aware, I do not have
> write access. I cannot yet say how much or little I would be able to
> contribute to the project, or even in what areas, however I think it
> could be beneficial to the community if I were permitted to make
> changes like this. I am sure synchronizing documentation is the last
> thing on your minds, and often it is hard to see that instructions
> might be unclear when you have been working on a project for a long
> time. I also think it could be a good opportunity to get to know the
> community and perhaps ease my way into becoming a more active
> developer. Please let me know what you think!
>
> Thanks!
> Jared
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From jflatow at northwestern.edu Sat Sep 15 18:53:03 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Sat, 15 Sep 2007 13:53:03 -0500
Subject: [Biojava-dev] New Developer
In-Reply-To: <93dd9ad00709141844j36433c26t6f9e4e47d6cac989@mail.gmail.com>
References: <83B15D7C-32E9-4F89-991A-F61076F90EC6@northwestern.edu> <93dd9ad00709141844j36433c26t6f9e4e47d6cac989@mail.gmail.com>
Message-ID:

Sounds like a great suggestion to me! This would be much more convenient for me than having to email text (which kind of defeats the purpose of the version control system in my opinion, and will likely clutter the list). Whatever y'all decide will be fine, but I like David's idea!

Best Regards,
Jared

On Sep 14, 2007, at 8:44 PM, David Barbosa Feitosa wrote:

> Maybe you can create a branch for this, and ask somebody to inspect
> the changes.
> After inspection, somebody with write access can merge your changes
> with the main development branch.
> Only a suggestion :-)
>
> 2007/9/14, Jared Flatow :
> Hi all,
>
> I am interested in getting involved with the BioJava community. I
> have just joined a Bioinformatics Core, and we use a lot of R/
> BioConductor right now, however we have some new projects that we
> would like to begin working in BioJava. I have checked out the source
> and compiled, however I noticed that the README was slightly out of
> sync with the project (the build targets listed are not the same as
> the build.xml). I made the updates to the README and would have liked
> to commit them, but as I am sure you are all aware, I do not have
> write access. I cannot yet say how much or little I would be able to
> contribute to the project, or even in what areas, however I think it
> could be beneficial to the community if I were permitted to make
> changes like this.
I am sure synchronizing documentation is the last > thing on your minds, and often it is hard to see that instructions > might be unclear when you have been working on a project for a long > time. I also think it could be a good opportunity to get to know the > community and perhaps ease my way into becoming a more active > developer. Please let me know what you think! > > Thanks! > Jared > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From holland at ebi.ac.uk Wed Sep 19 10:31:06 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 19 Sep 2007 11:31:06 +0100 Subject: [Biojava-dev] The future of BioJava Message-ID: <46F0FA6A.1030404@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi all. We are considering moving on to start work on BioJava3, which will resolve many of the issues of usability and maintainability that plague the existing versions of BioJava. I have set up a wiki page containing a preliminary outline of intentions: http://biojava.org/wiki/BioJava3_Proposal Please could as many people as possible update this page with your comments, suggestions, ideas and changes. We want to know what technologies or design patterns you feel would be suitable for various parts, how the new code should be structured and organised, what degree of modularisation would be appropriate, and where the line between biological problems and more generalised Java or programming problems should be drawn. Basically we want comments on anything you can think of. You should make your comments by directly modifying the page. Please do be constructive - if something's a bad idea, we want to know about it, but we'd appreciate it if you could also suggest a better alternative. We're open to all suggestions and will consider everything. We aim to use this page to flesh out a detailed plan for what should happen next. 
I will act as moderator and use the contents of the final page as the basis of a detailed plan of action early next year. cheers, Richard PS. This is sent to biojava-dev. I'll send it also to biojava-l when there are more details and clearer intentions, meaning users are less likely to get scared. From biojava-l I'll ask for features that users would like to see. This is likely to be around November time. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG8Ppq4C5LeMEKA/QRAu8+AJ0dgQYOsOUdgqs9My3RkIFn9FzaVQCeJJ84 aytR4wDyRwhICKPn60CI0gw= =hDJF -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Wed Sep 19 17:33:57 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Wed, 19 Sep 2007 18:33:57 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F0FA6A.1030404@ebi.ac.uk> References: <46F0FA6A.1030404@ebi.ac.uk> Message-ID: <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> Hi, A question related to the discussion of how to design a future BioJava is to have a look at which parts of BioJava are being actively used and how to improve these. So what are the most frequently used bits of BioJava? One way to look at this is to go to the web-stats and see how many hits we have got on our documentation web pages. In an ideal world BioJava would be so simple to use, that nobody needs to read any docu. Unfortunately we are far away from this, so actually looking at these stats gives an impression on * topics / functionality which are of particular interest to the community * topics / functionality which might not be straightforward to use, therefore there are many hits on these pages. A look at the webstats from the last couple of months gives these top 10 Cookbook pages that have been accessed frequently. This list is ordered by nr. of pageviews 1. /wiki/BioJava:Cookbook:Alphabets 2. /wiki/BioJava:CookBook:Blast:Parser 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta 4. 
/wiki/BioJava:Cookbook:SeqIO:ReadGES 5. /wiki/BioJava:CookBook:DP:PairWise2 6. /wiki/BioJava:CookBook:PDB:read 7. /wiki/BioJava:Cookbook:Sequence 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI 10. /wiki/BioJava:CookBook:Fasta:Parse I would group these pages into 2 groups. A) How to work with core concepts of BioJava B) How to use a functionality of BioJava to achieve a certain goal The "conceptual" pages (A) I would identify as * How to get an Alphabet * How to make a Sequence Object from a String or make a Sequence Object back into a String The "functionality" pages (B) I would summarize as * How to parse a Blast output * How to read sequences from a Fasta file * How to read a GenBank, SwissProt or EMBL file * How to generate a global or local alignment with the Needleman- Wunsch- or the Smith-Waterman-algorithm * How to read a protein structure - PDB file * How to export a sequence to fasta * How to view a sequence in a gui * How to parse a Fasta database search output file As a conclusion I would suggest that BioJava should have the goal to provide easy access to the core "functionalities" (group B). I believe that we should try to keep the "concepts" that are being used to achieve these functionalities as simple as possible. In this sense, I feel that we have too many hits on the group A pages. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
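Editorial sketch for the "How to read sequences from a Fasta file" item in the list above, to give a sense of how small the core task is. It is plain Java and deliberately does not use BioJava's API; the class `FastaSketch` and method `readFasta` are hypothetical names made up for this note. It assumes a simple FASTA layout: a header line starting with `>` followed by one or more residue lines, with no validation of the residues against an alphabet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class FastaSketch {

    /**
     * Reads FASTA records into a map from header text (without the leading '>')
     * to the concatenated residue lines, preserving file order.
     */
    static Map<String, String> readFasta(Reader in) throws IOException {
        Map<String, String> records = new LinkedHashMap<String, String>();
        BufferedReader reader = new BufferedReader(in);
        String header = null;
        StringBuilder residues = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    records.put(header, residues.toString());
                }
                header = line.substring(1).trim();
                residues.setLength(0);
            } else if (header != null) {
                residues.append(line);
            }
        }
        if (header != null) {
            records.put(header, residues.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">seq1 toy example\nACGT\nTTGA\n>seq2\nGGCC\n";
        for (Map.Entry<String, String> e : readFasta(new StringReader(fasta)).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints:
        // seq1 toy example -> ACGTTTGA
        // seq2 -> GGCC
    }
}
```

A library version would normally also check the residues against an alphabet and return typed sequence objects, which is where the "concept" pages in group A come in.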
From holland at ebi.ac.uk Thu Sep 20 07:57:49 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 20 Sep 2007 08:57:49 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk>
Message-ID: <46F227FD.6020807@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I totally agree.

Can you post a short summary of this to the Wiki page?

Not all aspects of BioJava are documented, leading people either to give
up, consult the JavaDocs online, or post a message to biojava-l or
biojava-dev.

Is it possible to get similar stats to the ones you have calculated for
the JavaDoc pages on our website?

Also, is it possible to build some kind of index over the mailing list
archives to pull out the most frequently used terms?

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG8if94C5LeMEKA/QRAkZ7AJ0a2xaU717XFfrX4eCc/wmPN/OL2ACfZMHi
U21o+ZfVD5XOqT1mR7STp6Q=
=dct8
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Thu Sep 20 08:55:13 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 20 Sep 2007 09:55:13 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F227FD.6020807@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk>
Message-ID: <46F23571.4050908@ebi.ac.uk>

Hi,

I would say yes to this as well. It is very important to know what green
people are attempting to do with BioJava rather than us assuming that we
know :). There are parts in BioJava where the flexibility of the code is
not sufficient for other people who want to use the code base & in other
areas too flexible.

I've talked to quite a few people over the years who have used biojava
for simple & complex applications and they all seem to come back round
to a few key problems:

* Sequence & SymbolLists are strange and why can't I use a String - All
  of this makes a lot more sense if you know about the flyweight pattern;
  if not it just seems very strange.

* I have a format that's EMBL like. Can I parse it using Biojava?

* How do I read in a FASTA file?

* How can I get X from this chromatogram & can I parse my specific trace
  format into a BioJava object?

As Andreas said it's the occurrence of the category A problems that are
the most worrying. In terms of sequences I think I can see why people
have a problem with it.

Just if we take this as an example:

I have my DNA sequence in a String I can substring it, perform a regular
expression over it, replace sections, pad it out, format it & so on. If
I have a Sequence object I can perform most of these actions but the
interface to them seems unintuitive. Things like calling seqString() to
get the String back out from a sequence rather than calling toString().
Also lets say I want to use a sequence as a key in a hash map or ask if
two sequences are equal (using the old sequence objects) ... at the
moment I'd have to convert Sequence -> String to perform the comparison
(and that doesn't include checking a Sequence for alphabet equality).

I know this sounds like nit-picking & for people who have used biojava
extensively a lot of this makes sense. For someone new to the project it
seems like we've done something just for the sake of it and we need to
get rid of that feeling which I'm sure will happen if we address the
category A problem. The rest will fall into place :)

Andy

From holland at ebi.ac.uk Thu Sep 20 09:04:53 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 20 Sep 2007 10:04:53 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F23571.4050908@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk>
Message-ID: <46F237B5.107@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is one of my main bugbears too. I've never quite understood why we
can't just use Strings, and resort to SymbolLists only when more
advanced manipulation is required (e.g. quality scores for each base).
After all, a String is a memory word overhead (32- or 64-bits) plus
16-bits (unicode) per character, but most SymbolList implementations are
a memory word overhead plus an additional entire memory word per Symbol,
each word being a pointer to the memory location where the Symbol
singleton lives. So SymbolLists actually use more memory than Strings,
not less.

(This is not true for CompressedSymbolList which represents sequences as
a sequence of bits, grouped into groups large enough to uniquely
identify any single symbol in the alphabet - e.g. 2 bits for DNA).

As you say, most users just want to read a sequence, sublist it, maybe
reverse comp it or run some simple search over it. This can all easily
be achieved straight from String format.

The other 'category A' problems are equally important. Could you add a
section to the Wiki about these and the 'category B' problems? Then we
can use this as a priority use-case list when it comes to actual
development.

cheers,
Richard
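To make the memory comparison earlier in this message concrete: taking
the figures given there and ignoring per-object overhead, one million
bases need roughly 2 MB as 16-bit chars, 4 MB or 8 MB as one 32- or
64-bit reference per symbol, and about 250 KB at 2 bits per base. The
sketch below shows one way such 2-bit packing could look in plain Java.
TwoBitDna is a hypothetical class written for this illustration and is
not the BioJava implementation; it assumes an unambiguous four-letter
DNA alphabet.

```java
/**
 * Hypothetical sketch (not BioJava code): unambiguous DNA stored at
 * 2 bits per base, i.e. four bases per byte.
 */
final class TwoBitDna {
    // The 2-bit code of a base is its index in this string.
    private static final String ALPHABET = "ACGT";

    private final byte[] packed; // four bases per byte, lowest bits first
    private final int length;    // number of bases actually stored

    TwoBitDna(String dna) {
        length = dna.length();
        packed = new byte[(length + 3) / 4]; // round up to whole bytes
        for (int i = 0; i < length; i++) {
            int code = ALPHABET.indexOf(dna.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException(
                        "Not A, C, G or T at position " + i + ": " + dna.charAt(i));
            }
            packed[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
    }

    /** Decodes a single base without unpacking the whole sequence. */
    char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int code = (packed[index / 4] >> ((index % 4) * 2)) & 0x3;
        return ALPHABET.charAt(code);
    }

    int length() {
        return length;
    }

    /** Bytes used for the bases themselves (object overhead not counted). */
    int payloadBytes() {
        return packed.length;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

The trade-off is visible in charAt(): every access costs a shift and a
mask, and ambiguity codes such as N no longer fit, which is one reason a
plain String may still be the more convenient default.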
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB
8RPZSfbrr9Nfbk3AlqqAet8=
=K3qH
-----END PGP SIGNATURE-----

From markjschreiber at gmail.com Thu Sep 20 09:28:14 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 20 Sep 2007 17:28:14 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F237B5.107@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk> <46F237B5.107@ebi.ac.uk>
Message-ID: <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>

The main value of the Symbol representation comes in when you do
Distributions and DP, which is really why Matthew and Thomas developed
it. Quite probably why they developed biojava at all. If you are just
pushing data around, which seems to be most applications, then Strings
are better.

I have previously proposed separating the Symbol, Alphabet, DP and
Dist from the rest of the packages because they have value well beyond
biology, but an equal argument would be that most bio stuff doesn't
need this level of analysis. If you only want to convert EMBL to Fasta
or read a BLAST result you don't need it.

For those who want to read in EMBL and compute some Distribution or
run a Hidden Markov Model, I would propose the conversion of
Stringy sequences to SymbolLists at the point when it is needed, not at
the point when you read them in. Given that almost all I/O of
sequence starts and ends as a String, the point where you convert to
Symbols doesn't matter much. The only question is: do you need to
convert to Symbols for the analysis you are doing?
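A minimal sketch of the "convert only when needed" idea in the paragraph
above, in plain Java. LazyDnaSequence and all of its methods are
hypothetical names made up for illustration; none of this is existing
BioJava API. The sequence stays a String for I/O, substring and map-key
use, and is only encoded into symbol codes when an analysis such as a
base distribution asks for them:

```java
import java.util.Arrays;

/**
 * Hypothetical sketch (not BioJava API): a DNA sequence that is kept as
 * a plain String and only converted to symbol codes on first demand.
 */
final class LazyDnaSequence {
    // The symbol code of a base is its index in this string.
    private static final String ALPHABET = "ACGT";

    private final String residues; // cheap default representation
    private byte[] codes;          // built on first use, then cached

    LazyDnaSequence(String residues) {
        this.residues = residues;
    }

    /** String-level operations need no conversion at all. */
    LazyDnaSequence subSequence(int beginIndex, int endIndex) {
        return new LazyDnaSequence(residues.substring(beginIndex, endIndex));
    }

    /** Encodes on the first call only; rejects anything but A, C, G, T. */
    byte[] symbolCodes() {
        if (codes == null) {
            byte[] encoded = new byte[residues.length()];
            for (int i = 0; i < encoded.length; i++) {
                int code = ALPHABET.indexOf(residues.charAt(i));
                if (code < 0) {
                    throw new IllegalArgumentException(
                            "Not a DNA symbol at position " + i + ": " + residues.charAt(i));
                }
                encoded[i] = (byte) code;
            }
            codes = encoded;
        }
        return codes.clone(); // defensive copy so callers cannot alter the cache
    }

    /** True once the symbol codes have been built. */
    boolean isEncoded() {
        return codes != null;
    }

    /** An analysis that does need symbols: relative frequency of A, C, G, T. */
    double[] baseFrequencies() {
        byte[] c = symbolCodes();
        double[] freq = new double[ALPHABET.length()];
        for (byte code : c) {
            freq[code]++;
        }
        for (int i = 0; i < freq.length; i++) {
            freq[i] /= Math.max(1, c.length); // empty sequence gives all zeros
        }
        return freq;
    }

    @Override
    public String toString() {
        return residues;
    }

    // Equality and hashing delegate to the String, so sequences work as map keys.
    @Override
    public boolean equals(Object other) {
        return other instanceof LazyDnaSequence
                && residues.equals(((LazyDnaSequence) other).residues);
    }

    @Override
    public int hashCode() {
        return residues.hashCode();
    }

    public static void main(String[] args) {
        LazyDnaSequence seq = new LazyDnaSequence("AACG");
        System.out.println(seq.subSequence(1, 3)); // no encoding happens here
        System.out.println(Arrays.toString(seq.baseFrequencies()));
    }
}
```

With this shape, reading and writing never pay for the encoding; only
callers of baseFrequencies() (or anything else that needs symbols) do.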
(Sorry for not putting this on the wiki, I'll do it later).

- Mark

From holland at ebi.ac.uk Thu Sep 20 09:32:01 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 20 Sep 2007 10:32:01 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk> <46F237B5.107@ebi.ac.uk> <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
Message-ID: <46F23E11.1000601@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Agreed. What we need is to use Strings by default, and allow conversion
to SymbolLists for advanced manipulation (DPs, etc.). Not only does this
simplify stuff but it also speeds up simple tasks as it removes the need
for conversion and iteration of the lists.

cheers,
Richard

Mark Schreiber wrote:
> The main value of the Symbol representation comes in when you do
> Distributions and DP which is really why Matthew and Thomas developed
> it. Quite probably why they developed biojava at all. If you are just
> pushing data around which seems to be most applications then Strings
> are better.
> > I have previously proposed seperating the Symbol, Alphabet, DP and > Dist from the rest of the packages because they have value well beyond > biology but an equal argument would be that most bio stuff doens't > need this level of analysis. If you only want to convert EMBL to Fasta > or read a BLAST result you don't need it. > > For those who want to read in EMBL and compute some Distribution or > run a Hidden Markov Model then I would propose the conversion of > Stringy sequences to SymbolLists at the point when it is needed not at > the point when you read them in. Given that almost all I/O of > sequence starts and ends as a String the point where you convert to > Symbols doesn't matter much. The only question is do you need to > convert to Symbols for the analysis you are doing? > > (Sorry for not putting this on the wiki, I'll do it later). > > - Mark > > On 9/20/07, Richard Holland wrote: > This is one of my main bugbears too. I've never quite understood why we > can't just use Strings, and resort to SymbolLists only when more > advanced manipulation is required (e.g. quality scores for each base). > After all, a String is a memory word overhead (32- or 64-bits) plus > 16-bits (unicode) per character, but most SymbolList implementations are > a memory word overhead plus an additional entire memory word per Symbol, > each word being a pointer to the memory location where the Symbol > singleton lives. So SymbolLists actually use more memory than Strings, > not less. > > (This is not true for CompressedSymbolList which represents sequences as > a sequence of bits, grouped into groups large enough to uniquely > identify any single symbol in the alphabet - e.g. 2 bits for DNA). > > As you say, most users just want to read a sequence, sublist it, maybe > reverse comp it or run some simple search over it. This can all easily > be achieved straight from String format. > > The other 'category A' problems are equally important. 
Could you add a > section to the Wiki about these and the 'category B' problems? Then we > can use this as a priority use-case list when it comes to actual > development. > > cheers, > Richard > > > Andy Yates wrote: >>>> Hi, >>>> >>>> I would say yes to this as well. It is very important to know what green >>>> people are attempting to do with BioJava rather than us assuming that we >>>> know :). There are parts in BioJava where the flexibility of the code is >>>> not sufficient for other people who want to use the code base & in other >>>> areas too flexible. >>>> >>>> I've talked to quite a few people over the years who have used biojava >>>> for simple & complex applications and they all seem to come back round >>>> to a few key problems: >>>> >>>> * Sequence & SymbolLists are strange and why can't I use a String - All >>>> of this makes a lot more sense if you know about the flyweight pattern; >>>> if not it just seems very strange. >>>> >>>> * I have a format that's EMBL like. Can I parse it using Biojava? >>>> >>>> * How do I read in a FASTA file? >>>> >>>> * How can I get X from this chromatogram & can I parse my specific trace >>>> format into a BioJava object? >>>> >>>> As Andreas said it's the occurrence of the category A problems that are >>>> the most worrying. In terms of sequences I think I can see why people >>>> have a problem with it. >>>> >>>> Just if we take this as an example: >>>> >>>> I have my DNA sequence in a String I can substring it, perform a regular >>>> expression over it, replace sections, pad it out, format it & so on. If >>>> I have a Sequence object I can perform most of these actions but the >>>> interface to them seems unintuitive. Things like calling seqString() to >>>> get the String back out from a sequence rather than calling toString(). >>>> Also lets say I want to use a sequence as a key in a hash map or ask if >>>> two sequences are equal (using the old sequence objects) ... 
at the >>>> moment I'd have to convert Sequence -> String to perform the comparison >>>> (and that doesn't include checking a Sequence for alphabet equality). >>>> >>>> I know this sounds like nit-picking & for people who have used biojava >>>> extensively a lot of this makes sense. For someone new to the project it >>>> seems like we've done something just for the sake of it and we need to >>>> get rid of that feeling which I'm sure will happen if we address the >>>> category A problem. The rest will fall into place :) >>>> >>>> Andy >>>> >>>> Richard Holland wrote: >>>> I totally agree. >>>> >>>> Can you post a short summary of this to the Wiki page? >>>> >>>> Not all aspects of BioJava are documented, leading people either to give >>>> up, consult the JavaDocs online, or post a message to biojava-l or >>>> biojava-dev. >>>> >>>> Is it possible to get similar stats to the ones you have calculated for >>>> the JavaDoc pages on our website? >>>> >>>> Also, is it possible to build some kind of index over the mailing list >>>> archives to pull out the most frequently used terms? >>>> >>>> cheers, >>>> Richard >>>> >>>> Andreas Prlic wrote: >>>>>>> Hi, >>>>>>> >>>>>>> A question related to the discussion of how to design a future BioJava >>>>>>> is to have a look >>>>>>> at which parts of BioJava are being actively used and how to improve >>>>>>> these. >>>>>>> >>>>>>> So what are the most frequently used bits of BioJava? One way to look at >>>>>>> this is to go to the >>>>>>> web-stats and see how many hits we have got on our documentation web >>>>>>> pages. >>>>>>> >>>>>>> In an ideal world BioJava would be so simple to use, that nobody needs >>>>>>> to read any docu. 
>>>>>>> Unfortunately we are far away from this, so actually looking at these >>>>>>> stats gives an impression >>>>>>> on >>>>>>> >>>>>>> * topics / functionality which are of particular interest to the >>>>>>> community >>>>>>> * topics / functionality which might not be straightforward to use, >>>>>>> therefore there are many hits on these pages. >>>>>>> >>>>>>> A look at the webstats from the last couple of months gives these top 10 >>>>>>> Cookbook pages that >>>>>>> have been accessed frequently. This list is ordered by nr. of pageviews >>>>>>> >>>>>>> 1. /wiki/BioJava:Cookbook:Alphabets >>>>>>> 2. /wiki/BioJava:CookBook:Blast:Parser >>>>>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta >>>>>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES >>>>>>> 5. /wiki/BioJava:CookBook:DP:PairWise2 >>>>>>> 6. /wiki/BioJava:CookBook:PDB:read >>>>>>> 7. /wiki/BioJava:Cookbook:Sequence >>>>>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta >>>>>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI >>>>>>> 10. /wiki/BioJava:CookBook:Fasta:Parse >>>>>>> >>>>>>> I would group these pages into 2 groups. 
>>>>>>> A) How to work with core concepts of BioJava >>>>>>> B) How to use a functionality of BioJava to achieve a certain goal >>>>>>> >>>>>>> The "conceptual" pages (A) I would identify as >>>>>>> * How to get an Alphabet >>>>>>> * How to make a Sequence Object from a String or make a Sequence Object >>>>>>> back into a String >>>>>>> >>>>>>> The "functionality" pages (B) I would summarize as >>>>>>> * How to parse a Blast output >>>>>>> * How to read sequences from a Fasta file >>>>>>> * How to read a GenBank, SwissProt or EMBL file >>>>>>> * How to generate a global or local alignment with the Needleman-Wunsch- >>>>>>> or the Smith-Waterman-algorithm >>>>>>> * How to read a protein structure - PDB file >>>>>>> * How to export a sequence to fasta >>>>>>> * How to view a sequence in a gui >>>>>>> * How to parse a Fasta database search output file >>>>>>> >>>>>>> >>>>>>> As a conclusion I would suggest that BioJava should have the goal to >>>>>>> provide easy access to the >>>>>>> core "functionalities" (group B). I believe that we should try to keep >>>>>>> the "concepts" that are being used to >>>>>>> achieve these functionalities as simple as possible. In this sense, I >>>>>>> feel that we have too many hits on the group A pages. >>>>>>> >>>>>>> Andreas >>>>>>> >>>>>>> ----------------------------------------------------------------------- >>>>>>> >>>>>>> Andreas Prlic Wellcome Trust Sanger Institute >>>>>>> Hinxton, Cambridge CB10 1SA, UK >>>>>>> +44 (0) 1223 49 6891 >>>>>>> >>>>>>> ----------------------------------------------------------------------- >>>>>>> >>>>>>> >>>>>>> >>>>>>> --The Wellcome Trust Sanger Institute is operated by Genome >>>>>>> ResearchLimited, a charity registered in England with number 1021457 and >>>>>>> acompany registered in England with number 2742969, whose >>>>>>> registeredoffice is 215 Euston Road, London, NW1 2BE. 
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
_______________________________________________
biojava-dev mailing list
biojava-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG8j4Q4C5LeMEKA/QRAiCCAJ9V09vR55BsKuF2rDjvLs3l5cnWKACeN43x
BOF0kkjVytLsvCE/4jkWrGg=
=Pfrz
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Thu Sep 20 10:54:31 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 20 Sep 2007 11:54:31 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk> <46F237B5.107@ebi.ac.uk> <93b45ca50709200228i6d5af5a1la7d6b686886aa984@mail.gmail.com>
Message-ID: <46F25167.9080307@ebi.ac.uk>

I think my EMBL point was more about groups like mine which distribute data in an EMBL format but do not follow the EMBL rules 100% about what elements can follow other elements. Customization is very important to us, which at the moment means there is a biojava src checkout here which gets edited accordingly. Not the most useful/nice solution but it works & is something I've had to do before when I was working with chromatograms.

Most of the work I've done with Biojava sequences was just to push in a DNA sequence, rev comp it and push it back out. Even then that got dropped as someone in-house made their own version which kept it all in Strings. That said it should have been used more since it was a DNA alignment/sequencing project & all positions work WRT index 1 (you don't want to know how many times I typed in -1 in that project ... and the number of bugs it caused).

Anyway I guess what I'm getting round to saying in a very bad way is that there are places where I should have used the sequence representations from biojava but the initial hump/learning curve of what they are, how to use them & why to use them was too large and I had too little time. I'm sure there are many other people in the community who have this same problem and I'm sure they'll be hurting because of it as much as I did (and if anyone from that group is reading this email I do apologize ... again).

Andy

Mark Schreiber wrote:
> The main value of the Symbol representation comes in when you do
> Distributions and DP which is really why Matthew and Thomas developed
> it. Quite probably why they developed biojava at all. If you are just
> pushing data around which seems to be most applications then Strings
> are better.
>
> I have previously proposed separating the Symbol, Alphabet, DP and
> Dist from the rest of the packages because they have value well beyond
> biology but an equal argument would be that most bio stuff doesn't
> need this level of analysis. If you only want to convert EMBL to Fasta
> or read a BLAST result you don't need it.
>
> For those who want to read in EMBL and compute some Distribution or
> run a Hidden Markov Model then I would propose the conversion of
> Stringy sequences to SymbolLists at the point when it is needed not at
> the point when you read them in. Given that almost all I/O of
> sequence starts and ends as a String the point where you convert to
> Symbols doesn't matter much. The only question is do you need to
> convert to Symbols for the analysis you are doing?
>
> (Sorry for not putting this on the wiki, I'll do it later).
>
> - Mark
>
> On 9/20/07, Richard Holland wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> This is one of my main bugbears too.
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB
>> 8RPZSfbrr9Nfbk3AlqqAet8=
>> =K3qH
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>

From ayates at ebi.ac.uk Thu Sep 20 10:55:13 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 20 Sep 2007 11:55:13 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F237B5.107@ebi.ac.uk>
References: <46F0FA6A.1030404@ebi.ac.uk> <92BF0823-DE13-4522-B567-ED7DE522949D@sanger.ac.uk> <46F227FD.6020807@ebi.ac.uk> <46F23571.4050908@ebi.ac.uk> <46F237B5.107@ebi.ac.uk>
Message-ID: <46F25191.70109@ebi.ac.uk>

Ok I'll add them in. Can you remember if I've actually got a wiki account?

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is one of my main bugbears too. I've never quite understood why we
> can't just use Strings, and resort to SymbolLists only when more
> advanced manipulation is required (e.g. quality scores for each base).
> After all, a String is a memory word overhead (32- or 64-bits) plus
> 16-bits (unicode) per character, but most SymbolList implementations are
> a memory word overhead plus an additional entire memory word per Symbol,
> each word being a pointer to the memory location where the Symbol
> singleton lives. So SymbolLists actually use more memory than Strings,
> not less.
>
> (This is not true for CompressedSymbolList which represents sequences as
> a sequence of bits, grouped into groups large enough to uniquely
> identify any single symbol in the alphabet - e.g. 2 bits for DNA).
>
> As you say, most users just want to read a sequence, sublist it, maybe
> reverse comp it or run some simple search over it. This can all easily
> be achieved straight from String format.
>
> The other 'category A' problems are equally important. Could you add a
> section to the Wiki about these and the 'category B' problems? Then we
> can use this as a priority use-case list when it comes to actual
> development.
>
> cheers,
> Richard
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB
> 8RPZSfbrr9Nfbk3AlqqAet8=
> =K3qH
> -----END PGP SIGNATURE-----

From gwaldon at geneinfinity.org Fri Sep 21 06:53:12 2007
From: gwaldon at geneinfinity.org (george waldon)
Date: Thu, 20 Sep 2007 23:53:12 -0700
Subject: [Biojava-dev] The future of BioJava
Message-ID: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com>

Hello,

All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails.

I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps towards understanding Symbol and cross-alphabets (?) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList).

I am surprised by all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to do most of the basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older ones) actually get their answer in the Cookbook.

Richard wrote:
>It is suggested that development stops on the existing Biojava(?)
Well, I don't think the license can let you do that :-)
Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite.
I can see some of the initial steps on the roadmap:
- Switch to Subversion repository
- Change of the build process to be compatible with the creation of modules
- Improving the testing framework (mentioned several times)
- Creation of white papers for coding practices, build releases, (others?)

Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?)

Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package.

Hope it helps,
George

From markjschreiber at gmail.com Fri Sep 21 07:24:23 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 21 Sep 2007 15:24:23 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com>
Message-ID: <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com>

Hello -

Just to clarify my opinion on Strings vs Symbols. I generally prefer Symbols and SymbolLists to Strings because SymbolLists are smart and Strings are dumb. The classic case is ambiguity symbols like 'W'. BioJava knows that, in the context of DNA, this is A or T. However, I think it would be vastly simpler if there were simpler getters and setters for SymbolLists that exposed Strings in a friendlier manner.
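
To illustrate the two points above (ambiguity symbols such as 'W', and String-friendly getters), here is a minimal Java sketch. It is not BioJava code: the class DnaString and its methods getSequence, matchesAt and isAmbiguousAt are hypothetical names invented for this example, and positions are assumed to be 1-based. Only the IUPAC nucleotide codes in the lookup table are standard.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical sketch (not part of BioJava): a DNA sequence kept as a plain
 * String that can still answer "which bases does this symbol stand for?".
 */
final class DnaString {

    /** IUPAC nucleotide codes mapped to the unambiguous bases they match. */
    private static final Map<Character, String> IUPAC = new HashMap<Character, String>();
    static {
        IUPAC.put('A', "A");   IUPAC.put('C', "C");
        IUPAC.put('G', "G");   IUPAC.put('T', "T");
        IUPAC.put('R', "AG");  IUPAC.put('Y', "CT");
        IUPAC.put('S', "CG");  IUPAC.put('W', "AT");
        IUPAC.put('K', "GT");  IUPAC.put('M', "AC");
        IUPAC.put('B', "CGT"); IUPAC.put('D', "AGT");
        IUPAC.put('H', "ACT"); IUPAC.put('V', "ACG");
        IUPAC.put('N', "ACGT");
    }

    private final String residues;

    /** Validates the text once, then keeps it as an upper-case String. */
    DnaString(String residues) {
        String upper = residues.toUpperCase(Locale.ROOT);
        for (int i = 0; i < upper.length(); i++) {
            if (!IUPAC.containsKey(upper.charAt(i))) {
                throw new IllegalArgumentException(
                        "Not a DNA symbol at position " + (i + 1) + ": " + upper.charAt(i));
            }
        }
        this.residues = upper;
    }

    /** String-friendly getter: the sequence as upper-case text. */
    String getSequence() {
        return residues;
    }

    /** Bases matched by the symbol at a 1-based position, e.g. "AT" for 'W'. */
    String matchesAt(int position) {
        return IUPAC.get(residues.charAt(position - 1));
    }

    /** True when the symbol at a 1-based position matches more than one base. */
    boolean isAmbiguousAt(int position) {
        return matchesAt(position).length() > 1;
    }

    public static void main(String[] args) {
        DnaString seq = new DnaString("acwt");
        System.out.println(seq.getSequence());    // ACWT
        System.out.println(seq.matchesAt(3));     // AT
        System.out.println(seq.isAmbiguousAt(3)); // true
    }
}
```

In this sketch the String stays the primary representation and the symbol knowledge is a lookup applied only when asked for, which is one possible reading of the "friendlier getters" idea rather than a proposal for the actual API.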

I also think there is a case for SymbolLists that are backed by Strings (more likely a char[]) instead of Symbol arrays and only do the needed conversion when required (i.e., when the user calls symbolAt()). These would be ideal for the case where someone is converting GenBank to Fasta and there is no need to go through the Symbol parsing.

Finally, I think SymbolLists (or whatever they get called) should implement more of the methods found in String to make them look more like Strings. Ideally we should think about implementing some of the methods that Groovy likes to use for operator overloading. If we do this it would be possible to concatenate two sequences in Groovy by doing this (I may have the syntax wrong):

    Seq3 = Seq1 + Seq2

The other issue with SymbolLists is that they are not intuitive to construct because they are not so bean-like. This is not just a problem for newbies but also a major hindrance to the use of JEE, Spring, JAXB and other important frameworks. It should be possible to do this:

    SymbolList sl = new SymbolList();
    sl.setName("AB123456");
    sl.setSequence(seqString);

The final hindrance to the use of JEE is serialization. If we keep Symbols flyweight (singleton) we need to make this bulletproof from the start. It is also practically impossible to make something a bean and make it a Singleton, so some careful thought is required. If we keep symbols behind the scenes they may not need to be so bean-like.

- Mark

On 9/21/07, george waldon wrote:
> Hello,
>
> All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails.
>
> I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps towards understanding Symbol and cross-alphabets (?) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g.
Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList).
>
> I am surprised by all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to do most of the basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older ones) actually get their answer in the Cookbook.
>
> Richard wrote:
> >It is suggested that development stops on the existing Biojava(?)
> Well, I don't think the license can let you do that :-)
> Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite. I can see some of the initial steps on the roadmap:
> - Switch to Subversion repository
> - Change of the build process to be compatible with the creation of modules
> - Improving the testing framework (mentioned several times)
> - Creation of white papers for coding practices, build releases, (others?)
>
> Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?)
>
> Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package.
> > Hope it helps, > George > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From holland at ebi.ac.uk Fri Sep 21 07:54:51 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 21 Sep 2007 08:54:51 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> Message-ID: <46F378CB.2030903@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi George. By 'stop development' I really meant just that active development efforts would be focused on the new codebase rather than modifying the existing one (except of course for fixing bugs, which is always important and we wouldn't stop doing that until the new codebase was well established as an alternative). I agree that modifying the existing codebase would improve many of the problems currently experienced with it - code abstraction being just one of them. BioJavaX was an attempt at doing this. The big stumbling block was interfaces - users do not expect interfaces to change as it breaks all code that already uses that interface. They also do not expect the defined behaviour of methods in interfaces to change - which meant, for instance, that I had real problems trying to get RichFeature/RichLocation and RichLocation/Location to match up as some parts of Feature and Location conflicted with the more realistic requirements of their Rich* equivalents (e.g. circularity). If you change interfaces, you might as well start from scratch in terms of the effect it has on end-user's code. Also, if we start from scratch, it allows us to build up from the very basics the kind of robustness and flexibility we need throughout the system. 
As mentioned in the original posting the existing system is heavily
sequence-focused, meaning that even the simple task of scanning a set
of features cannot be done without also loading the associated
sequences because the two are so closely integrated. We need to make
it much more flexible and I think new code would give us a better
opportunity to do so without being tied into complying with existing
interfaces or behaviour expectations.

Having said that, I do expect large parts of the new codebase to be
only slightly modified copies of the original code, particularly
regarding recent developments such as genetic algorithms and
phylogenetics. It would be silly to write such logic all over again
where the code is relatively self-contained.

cheers,
Richard

From holland at ebi.ac.uk Fri Sep 21 08:07:51 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 21 Sep 2007 09:07:51 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com>
Message-ID: <46F37BD7.5020402@ebi.ac.uk>

I like that idea of having SymbolLists backed by different things. I'd
suggest that by default, all sequences read from file should be
String-backed SymbolLists, and that they are not broken down into
Symbols until first requested to do so by code that needs to know the
actual Symbols (e.g. code that cares about ambiguity symbols). The same
applies in reverse - lists constructed from symbols should not be
converted to strings until needed.

Something like this:

    SymbolList sl = new SymbolList();
    sl.setString("AGCGGACT");
        // Changes the string, and clears any cached
        // conversion of it.
    String seq = sl.getString();
        // Dumps the string. If not already converted
        // to a string, does the conversion and
        // caches it first.
    char base = sl.charAt(5);
        // 1-indexed single-base string. This would
        // likely delegate to String.charAt() and only
        // works for single-character alphabets. Not
        // to be used in any other circumstances.
    sl.set/getAlphabet()....
        // Use these to set the alphabet before
        // using set/getSymbols()/symbolAt().
    sl.setSymbols(new List(....));
        // Uses the list to update the cached symbols
        // and clear the cached string.
    List syms = sl.getSymbols();
        // Converts if not already converted, caches
        // the conversion, and returns it.
    Symbol sym = sl.symbolAt(5);
        // 1-indexed fully flexible symbol finder.

toString() would delegate to getString(), as would hashCode(), equals(),
and compareTo(). We could provide additional equals()-style methods for
testing equality whilst taking into account ambiguities.

cheers,
Richard

From holland at ebi.ac.uk Fri Sep 21 08:47:55 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 21 Sep 2007 09:47:55 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F37BD7.5020402@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk>
Message-ID: <46F3853B.7070701@ebi.ac.uk>

Also could we make SymbolList implement List? The iterator() method
would then do the cached conversion if required before returning an
Iterator over the symbols. That would make it very pluggable.
We'd need it to have a settable flag indicating whether the user wants
1-indexed or 0-indexed access (the default being 1-indexed as this is
the most common biological use).

Only downside is that List uses generics and so SymbolList must too -
meaning that SymbolList must always be declared as SymbolList<Symbol>
(or some subclass of Symbol).

But that's also an upside - you could subclass Symbol into DNASymbol,
RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the
symbol and need not be specified separately:

    SymbolList<DNASymbol> dna = new SymbolList<DNASymbol>();
    dna.add(RNAAlphabet.Q); // Throws standard List exception!

    SymbolList> = new ....; // Cool!
Also cool is that you could do this:

    public SymbolList translate(SymbolList dna);
        // Also cool!

cheers,
Richard

From ayates at ebi.ac.uk Fri Sep 21 09:20:18 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 21 Sep 2007 10:20:18 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com>
Message-ID: <46F38CD2.6000005@ebi.ac.uk>

> Finally, I think SymbolLists (or whatever they get called) should
> implement more of the methods found in String to make them look more
> like Strings. Ideally we should think about implementing some of the
> methods that Groovy likes to use for operator overloading. If we do
> this it would be possible to concatenate two sequences in Groovy by
> doing this (I may have the syntax wrong).
>
> Seq3 = Seq1 + Seq2

Yup that seems about right. It's one of the nice things about Groovy
that you can overload the operators and create something which
approaches an in-language DSL (can't really call it a true DSL since
it's constrained by the Groovy language). But anyway you can start
mucking around with the operators to get things like:

    fasta = new Fasta('id','AAAAAA')
    fasta_output = new FastaWriter('some_location');
    fasta_output << fasta

Assuming that the Fasta class would represent a Fasta record & the
FastaWriter is just that, you can begin to write some very nice & tight
code which just looks nice to use :).

> The other issue with SymbolLists is that they are not intuitive to
> construct because they are not so bean-like. This is not just a
> problem for newbies but also a major hindrance to the use of JEE,
> Spring, JAXB and other important frameworks. It should be possible to
> do this:
>
> SymbolList sl = new SymbolList();
> sl.setName("AB123456");
> sl.setSequence(seqString);

Yup I'll agree with that.

> The final hindrance to the use of JEE is serialization. If we keep
> Symbols flyweight (singleton) we need to make this bulletproof from
> the start. It is also practically impossible to make something a bean
> and make it a Singleton; some careful thought is required. If we keep
> symbols behind the scenes they may not need to be so bean-like.

I think we may need a bit of both. I would suggest something like an
interface which backs onto Symbol. Then collections of symbols are
actually enums e.g.
    public interface Symbol {
      String toString();
    }

    public enum DNA implements Symbol, java.io.Serializable {
      A, C, G, T;

      public String toString() {
        return this.name().toLowerCase();
      }

      private Object readResolve() throws java.io.ObjectStreamException {
        DNA symbol = null;
        for (DNA dna : values()) {
          if (dna.toString().equals(this.toString())) {
            symbol = dna;
            break;
          }
        }
        return symbol;
      }
    }

The read resolve needs to go in here to make sure this is bulletproof
to serialization. Otherwise we end up in a situation where you can
serialize an enum, deserialize it & then end up where the deserialized
enum is not equal (using ==) to the statically available enum.

From what I've done previously, Enums are a very nice way of working
with static constants. However they are very hard to extend so they're
fine for known constants like DNA (don't think we're going to stumble
onto a new nucleotide) but the symbol interface does mean that people
can extend the symbol concept if need be.

From ayates at ebi.ac.uk Fri Sep 21 09:26:49 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 21 Sep 2007 10:26:49 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F3853B.7070701@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk>
Message-ID: <46F38E59.5030005@ebi.ac.uk>

> Also could we make SymbolList implement List? The iterator() method
> would then do the cached conversion if required before returning an
> Iterator over the symbols. That would make it very pluggable.
> We'd need it to have a settable flag indicating whether the user wants
> 1-indexed or 0-indexed access (the default being 1-indexed as this is
> the most common biological use).

The important part of the 1.5 API is Iterable, which can be applied to
any class. It just means it will return an iterator & can be used in
the new foreach loop construct.
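
The Iterable point can be sketched as follows. This is only an
illustration under stated assumptions: CharSymbolList is a hypothetical
String-backed class invented for the example, not a BioJava type, and it
yields plain characters rather than Symbols.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a sequence class; not part of BioJava. */
final class CharSymbolList implements Iterable<Character> {
    private final String residues;

    CharSymbolList(String residues) {
        this.residues = residues;
    }

    /** Returning an Iterator is all that the for-each loop requires. */
    @Override
    public Iterator<Character> iterator() {
        return new Iterator<Character>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < residues.length();
            }

            @Override
            public Character next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return residues.charAt(next++);
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException("read-only");
            }
        };
    }

    /** Counts occurrences of one residue using the for-each loop. */
    static int count(CharSymbolList list, char wanted) {
        int n = 0;
        for (char c : list) { // Character is unboxed to char here
            if (c == wanted) {
                n++;
            }
        }
        return n;
    }
}
```

Nothing here depends on implementing the full List interface, which is
the attraction: a class only has to supply iterator() to take part in
the for-each loop.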
> Only downside is that List uses generics and so SymbolList must too -
> meaning that SymbolList must always be declared as SymbolList<Symbol>
> (or some subclass of Symbol).
>
> But that's also an upside - you could subclass Symbol into DNASymbol,
> RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the
> symbol and need not be specified separately:
>
> SymbolList<DNASymbol> dna = new SymbolList<DNASymbol>();
> dna.add(RNAAlphabet.Q); // Throws standard List exception!
>
> SymbolList> = new ....; // Cool!
>
> Also cool is that you could do this:
>
> public SymbolList translate(SymbolList dna);
> // Also cool!

Two problems with using generics that I've encountered:

1). getSomething(SymbolList<DNASymbol> list); &
getSomething(SymbolList<RNASymbol> list); as far as the compiler is
concerned are the same method. Both take in an instance of SymbolList
(remember that generics are a last-minute bolt-on to the JDK 1.5 API and
it really shows).

2). It is impossible to infer the type of a generic i.e.

    public <T> void doSomething(T genericObject) {
      if(T.equals(String.class)) {
        //Do something
      }
    }

This T type is ... well magical. It exists but it doesn't.

Anyway just be careful with generics. They save a lot of time & effort
but if you get too involved (or think they can solve everything like I
did for a short period) they're going to burn you badly or drive you mad
for 1/2 a day wondering why javac claims something you've written is
bogus.
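
Both problems come from type erasure, and a minimal sketch can show it.
ErasureDemo and its method names are made up for the illustration, and
String and Integer stand in for the symbol types discussed in the
thread.

```java
import java.util.ArrayList;
import java.util.List;

final class ErasureDemo {
    // Declaring both of these overloads in one class is rejected by
    // javac, because after erasure each has the signature describe(List):
    //
    //   static String describe(List<String> list)  { ... }
    //   static String describe(List<Integer> list) { ... }

    /** Shows that the type argument is not part of the runtime class. */
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<String>();
        List<Integer> numbers = new ArrayList<Integer>();
        // Both objects are plain ArrayList instances at runtime.
        return strings.getClass() == numbers.getClass();
    }
}
```

sameRuntimeClass() returns true, which is why a method cannot ask at
runtime what T was and why the two overloads collide.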
Andy

From holland at ebi.ac.uk Fri Sep 21 09:56:07 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 21 Sep 2007 10:56:07 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F38E59.5030005@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk> <46F38E59.5030005@ebi.ac.uk>
Message-ID: <46F39537.6040002@ebi.ac.uk>

For our purposes we'd use them to restrict valid input to a method - so
that if the user tries to write a program which passes in
List<DNASymbol> to a method which only takes List<RNASymbol> it'll throw
a wobbly at compile time. The methods would include the type in their
signature and therefore only accept lists of that type.

We'd obviously have to be careful with method naming as you point out -
i.e. no overloading methods with different generic types of the same
parameters.

The bit about not being able to work out what T is is a pain indeed, but
I don't think we'd need to use that. In most cases it can be solved
hackily - whenever Sun produces a proper way of doing it we'd use it of
course. (Hacky solution: If passed List<T>, then do an instanceof on
list.iterator().next() to find out the type of the first item in the
list, thus implying the type of everything else in the list, assuming
the list is not empty).

cheers,
Richard

From ayates at ebi.ac.uk Fri Sep 21 10:24:21 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 21 Sep 2007 11:24:21 +0100
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F39537.6040002@ebi.ac.uk>
References: <20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk> <46F38E59.5030005@ebi.ac.uk> <46F39537.6040002@ebi.ac.uk>
Message-ID: <46F39BD5.3030305@ebi.ac.uk>

That's fair enough & spot on what generics are intended for. Also means
if someone wanted to allow any symbols in it's easy enough to use:

    void doSomething(List<? extends Symbol> symbols);

I've only ever needed to know what T was once & the easiest way around
it is, as you've said, to check the first element of an input collection
or to take in a class & use generics to enforce the correct class type
i.e.

    <T> T getGenericType(Class<T> clazz);

    String output = getGenericType(String.class); //This is ok
    String output = getGenericType(Long.class); //This won't compile

I'd say so long as 'dodgy' things are refactored out to helper classes
then they can change as & when better solutions come along. A good
example of this is Spring's synchronized map builder which at runtime
will attempt to figure out what is available on the classpath and then
return a map which will provide the best synchronized performance (for
example if it's a pre 1.5 jdk it'll return a normal synchronized map
otherwise it'll use ConcurrentHashMap).
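
The class-token approach described above can be sketched like this.
TypeTokens and firstOfType are hypothetical names chosen for the
example; only Class.isInstance() and Class.cast() are standard JDK
calls.

```java
import java.util.List;

final class TypeTokens {
    /**
     * Returns the first element that is an instance of the requested
     * type, or null if there is none. The Class<T> argument carries the
     * type information that erasure removes from T itself.
     */
    static <T> T firstOfType(Class<T> type, List<?> items) {
        for (Object item : items) {
            if (type.isInstance(item)) {
                return type.cast(item); // checked cast, no compiler warning
            }
        }
        return null;
    }
}
```

With a mixed list, String s = TypeTokens.firstOfType(String.class, items)
compiles, whereas assigning the result of
TypeTokens.firstOfType(Long.class, items) to a String is rejected by the
compiler, which is the compile-time check the class token buys.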
Andy Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > For our purposes we'd use them to restrict valid input to a method - so > that if the user tries to write a program which passes in > List to a method which only takes List it'll throw > a wobbly at compile time. The methods would include the type in their > signature and therefore only accept lists of that type. > > We'd obviously have to be careful with method naming as you point out - > i.e. no overloading methods with different generic types of the same > parameters. > > The bit about not being able to work out what T is is a pain indeed, but > I don't think we'd need to use that. In most cases it can be solved > hackily - whenever Sun produces a proper way of doing it we'd use it of > course. (Hacky solution: If passed List, then do an instanceof > list.iterate().next() to find out type of first item in list, thus > implying the type of everything else in the list, assuming list is not > empty). > > cheers, > Richard > > > Andy Yates wrote: >>> Also could we make SymbolList implement List? The iterator() method >>> would then do the cached conversion if required before returning an >>> Iterator over the symbols. That would make it very pluggable. >>> We'd need it to have a settable flag indicating whether the user wants >>> 1-indexed or 0-indexed access (the default being 1-indexed as this is >>> the most common biological use). >> The important part of the 1.5 api is Iterable which can be applied to >> any class. It just means it will return an iterator & can be used in the >> new foreach loop construct. >> >>> Only downside is that List uses generics and so SymbolList must too - >>> meaning that SymbolList must always be declared as SymbolList >>> (or some subclass of Symbol). >>> >>> But that's also an upside - you could subclass Symbol into DNASymbol, >>> RNASymbol, etc. etc. 
- meaning that an alphabet is tied directly to the >>> symbol and need not be specified separately: >>> >>> SymbolList dna = new SymbolList(); >>> dna.add(RNAAlphabet.Q); // Throws standard List exception! >>> >>> SymbolList> = new ....; // Cool! >>> >>> Also cool is that you could do this: >>> >>> public SymbolList translate(SymbolList dna); >>> // Also cool! >> Two problems with using generics that I've encountered: >> >> 1). getSomething(SymbolList list); & >> getSomething(SymbolList list); as far as the compiler is >> concerned are the same method. Both take in an instance of SymbolList >> (remember that generics are a list minute bolt-on to the JDK 1.5 API and >> it really shows). >> >> 2). It is impossible to infer the type of a generic i.e. >> >> public void doSomething(T genericObject) { >> if(T.equals(String.class)) { >> //Do something >> } >> } >> >> This T type is ... well magical. It exists but it doesn't. >> >> Anyway just be careful with generics. They save a lot of time & effort >> but get too involved (or think they can solve everything like I did for >> a short period) they're going to burn you badly or drive you mad for 1/2 >> a day wondering why javac claims something you've written is bogus. 
>> >> Andy >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFG85U34C5LeMEKA/QRAl3pAKCFC6HBv5iXmGVKpuTwJQiwWuoMmwCdG/g2 > ILxIABP6me8pfY995/e6A5M= > =a+oW > -----END PGP SIGNATURE----- From heuermh at acm.org Sat Sep 22 05:29:22 2007 From: heuermh at acm.org (Michael Heuer) Date: Sat, 22 Sep 2007 01:29:22 -0400 (EDT) Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F3853B.7070701@ebi.ac.uk> Message-ID: I honestly haven't looked at it in a couple of years, but there is a proposal of mine for static generic symbols/symbol lists at > http://www3.shore.net/~heuermh/static-alphabet-generics.tar.gz Probably not useful or correct in its current state (I never did fully understand gap symbols) but it might be useful from a discussion standpoint. michael Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Also could we make SymbolList implement List? The iterator() method > would then do the cached conversion if required before returning an > Iterator over the symbols. That would make it very pluggable. > We'd need it to have a settable flag indicating whether the user wants > 1-indexed or 0-indexed access (the default being 1-indexed as this is > the most common biological use). > > Only downside is that List uses generics and so SymbolList must too - > meaning that SymbolList must always be declared as SymbolList > (or some subclass of Symbol). > > But that's also an upside - you could subclass Symbol into DNASymbol, > RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the > symbol and need not be specified separately: > > SymbolList dna = new SymbolList(); > dna.add(RNAAlphabet.Q); // Throws standard List exception! 
> > SymbolList> = new ....; // Cool! > > Also cool is that you could do this: > > public SymbolList translate(SymbolList dna); > // Also cool! > > cheers, > Richard > > Richard Holland wrote: > > I like that idea of having SymbolLists backed by different things. I'd > > suggest that by default, all sequences read from file should be > > String-backed SymbolLists, and that they are not broken down into > > Symbols until first requested to do so by code that needs to know the > > actual Symbols (e.g. code that cares about ambiguity symbols). The same > > applies in reverse - lists constructed from symbols should not be > > converted to strings until needed. > > > > Something like this: > > > > SymbolList sl = new SymbolList(); > > sl.setString("AGCGGACT"); > > // Changes the string, and clears any cached > > // conversion of it. > > String seq = sl.getString(); > > // Dumps the string. If not already converted > > // to a string, does the conversion and > > // caches it first. > > char base = sl.charAt(5); > > // 1-indexed single-base string. This would > > // likely delegate to String.charAt() and only > > // works for single-character alphabets. Not > > // to be used in any other cirumstances. > > sl.set/getAlphabet().... > > // Use these to set the alphabet before > > // using set/getSymbols()/symbolAt(). > > sl.setSymbols(new List(....)); > > // Uses the list to update the cached symbols > > // and clear the cached string. > > List syms = sl.getSymbols(); > > // Converts if not already converted, caches > > // the conversion, and returns it. > > Symbol sym = sl.symbolAt(5); > > // 1-indexed fully flexible symbol finder. > > > > toString() would delegate to getString(), as would hashCode(), equals(), > > and compareTo(). We could provide additional equals()-style methods for > > testing equality whilst taking into account ambiguities. 
> > > > cheers, > > Richard > > > > > > Mark Schreiber wrote: > >>> Hello - > >>> > >>> Just to clarify my opinion on Strings vs Symbols. > >>> > >>> I generally prefer Symbols and SymbolLists to Strings cause > >>> SymbolLists are smart and Strings are dumb. Classic case is ambiguity > >>> symbols like 'W'. BioJava knows, in the context of DNA this is A or T. > >>> However, I think it would be vastly simpler if there where simpler > >>> getters and setters for SymbolLists that exposed Strings in a > >>> friendlier manner. > >>> > >>> I also think there is a case for SymbolLists that are backed by > >>> Strings (more likely a char[]) instead of Symbol arrays and only do > >>> the needed conversion when required (ie, when the user calls > >>> SymbolAt(). These would be ideal for the case where someone is > >>> converting GenBank to Fasta and there is no need to go through the > >>> Symbol parsing. > >>> > >>> Finally, I think SymbolLists (or whatever they get called) should > >>> implement more of the methods found in String to make them look more > >>> like Strings. Ideally we should think about implementing some of the > >>> methods that Groovy likes to use for operator overloading. If we do > >>> this is would be possible to concatenate two sequences in groovy by > >>> doing this (I may have the syntax wrong). > >>> > >>> Seq3 = Seq1 + Seq2 > >>> > >>> The other issue with SymbolLists is that they are not intuitive to > >>> construct because they are not so bean like. This is not just a > >>> problem for newbies but also a major hinderance to the use of JEE, > >>> Spring, JAXB and other important frameworks. It should be possible to > >>> do this: > >>> > >>> SymbolList sl = new SymbolList(); > >>> sl.setName("AB123456"); > >>> sl.setSequence(seqString); > >>> > >>> The final hinderance to the use of JEE is serialization. If we keep > >>> Symbols flyweight (singleton) we need to make this bullet proof from > >>> the start. 
It is also practically impossible to make something a bean
> >>> and make it a Singleton, some careful thought is required. If we keep
> >>> symbols behind the scenes they may not need to be so bean like.
> >>>
> >>> - Mark
> >>>
> >>> On 9/21/07, george waldon wrote:
> >>>> Hello,
> >>>>
> >>>> All this is very exciting. I would certainly contribute to something like that. A few remarks that come to my mind while reading all these emails.
> >>>>
> >>>> I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps going to understanding Symbol and cross-alphabets (...) Still, from time to time, I have difficulties with basic things that are not intuitive to me such as "token", e.g. Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList).
> >>>>
> >>>> I am surprised by the all the requests to use String instead of SymbolList. The CookBook tells precisely, and with code examples, how to make most of all basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older one) actually get their answer in the Cookbook.
> >>>>
> >>>> Richard wrote:
> >>>>> It is suggested that development stops on the existing Biojava(...)
> >>>> Well, I don't think the license can let you do that :-)
> >>>> Writing new code might be easier but certainly making old code better will improve the level of code abstraction. Therefore I am promoting improving existing Biojava code versus hazardous code rewrite. I can see some of the initial steps on the roadmap:
> >>>> - Switch to Subversion repository
> >>>> - Change of the build process compatible with creation of modules
> >>>> - Improving testing frame (mentioned several times)
> >>>> - Creation of white papers for coding practices, build releases, (others?)
> >>>>
> >>>> Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - building modules one by one by selectively picking classes. This way it will be easy to find out classes that can be deprecated (by lack of users) and we can even have a deprecated module at the end. Some coupling may need to loosen up. We will also need a list of API change for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place as well as some of the proposed patterns (beans, others?)
> >>>>
> >>>> Anyway, all these are simple ideas. I am not an expert in build process, but I can help with improving javadocs, writing examples and test cases. I have also a fair knowledge of the molecular biology package.
> >>>>
> >>>> Hope it helps,
> >>>> George
> >>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFG84U64C5LeMEKA/QRAtyfAJ9PAsFu3+zjUhP3Xcs5imojL/cb/wCfRX8V
> eOMOo3pCl71dPhZMyYlBBE4=
> =NByU
> -----END PGP SIGNATURE-----
>

From heuermh at acm.org Sat Sep 22 05:35:37 2007
From: heuermh at acm.org (Michael Heuer)
Date: Sat, 22 Sep 2007 01:35:37 -0400 (EDT)
Subject: [Biojava-dev] The
future of BioJava
In-Reply-To:
Message-ID:

Oh and don't forget to review Matthew's bjv2 rework of symbols and symbol lists in full generics regalia.

   michael

From markjschreiber at gmail.com Sat Sep 22 11:31:29 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 22 Sep 2007 19:31:29 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <46F38E59.5030005@ebi.ac.uk>
References:
<20070921065312.48537.qmail@mmm1924.dulles19-verio.com> <93b45ca50709210024r2d0da4dbn7e2eca822f692fea@mail.gmail.com> <46F37BD7.5020402@ebi.ac.uk> <46F3853B.7070701@ebi.ac.uk> <46F38E59.5030005@ebi.ac.uk>
Message-ID: <93b45ca50709220431s5229a703w10beef514d3c2d99@mail.gmail.com>

> 2). It is impossible to infer the type of a generic i.e.
>
> public <T> void doSomething(T genericObject) {
>   if(T.equals(String.class)) {
>     //Do something
>   }
> }
>
> This T type is ... well magical. It exists but it doesn't.
>

T only exists at compile time. It doesn't exist at all in the JVM. This is because Java generics are implemented using 'erasure'. It is a bit of a drawback, but there is no getting around it.

From markjschreiber at gmail.com Sat Sep 22 11:33:12 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 22 Sep 2007 19:33:12 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To:
References:
Message-ID: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com>

bjv2 can be found at http://www.derkholm.net/svn/repos/bjv2

- Mark

On 9/22/07, Michael Heuer wrote:
> Oh and don't forget to review Matthew's bjv2 rework of symbols and symbol
> lists in full generics regalia.
>
> michael

From gwaldon at geneinfinity.org Sat Sep 22 16:03:15 2007
From: gwaldon at geneinfinity.org (george waldon)
Date: Sat, 22 Sep 2007 09:03:15 -0700
Subject: [Biojava-dev] The future of BioJava
Message-ID: <20070922160315.33233.qmail@mmm1924.dulles19-verio.com>

Thank you Mark for making the point so clearly. I could see String being used internally in SymbolList, but still, what is really the point of rewriting logic that is already present in the current code? Again, a rewrite appears easier.
Nothing prevents us from writing this on top of the current code:

SymbolList sl = new DNASymbolList();
sl.setName("AB123456");
sl.setSequence("aAaA-aAaA"); // a polyadenine

OK, setName and setSequence are not part of the current SymbolList interface, but we can add a SymbolListEx to complete it, or we can have a new SymbolList in the biojavax domain or the bj3 domain, or we can even create SymbolString (//cool!) in the current biojava domain. The point is that old and new interfaces can coexist in the same overall project, and we can swap from one to the other module by module without ever breaking anything.

- George

> -----Original Message-----
> From: Mark Schreiber [mailto:markjschreiber at gmail.com]
> Sent: Friday, September 21, 2007 12:24 AM
> To: george waldon
> Cc: biojava-dev at biojava.org
> Subject: Re: [Biojava-dev] The future of BioJava
>
> Hello -
>
> Just to clarify my opinion on Strings vs Symbols.
>
> I generally prefer Symbols and SymbolLists to Strings because
> SymbolLists are smart and Strings are dumb. Classic case is ambiguity
> symbols like 'W'. BioJava knows that, in the context of DNA, this is A or T.
> However, I think it would be vastly simpler if there were simpler
> getters and setters for SymbolLists that exposed Strings in a
> friendlier manner.
>
> I also think there is a case for SymbolLists that are backed by
> Strings (more likely a char[]) instead of Symbol arrays and only do
> the needed conversion when required (i.e. when the user calls
> symbolAt()). These would be ideal for the case where someone is
> converting GenBank to Fasta and there is no need to go through the
> Symbol parsing.
>
> Finally, I think SymbolLists (or whatever they get called) should
> implement more of the methods found in String to make them look more
> like Strings. Ideally we should think about implementing some of the
> methods that Groovy likes to use for operator overloading. If we do
> this it would be possible to concatenate two sequences in Groovy by
> doing this (I may have the syntax wrong).
>
> Seq3 = Seq1 + Seq2
>
> The other issue with SymbolLists is that they are not intuitive to
> construct because they are not so bean like. This is not just a
> problem for newbies but also a major hindrance to the use of JEE,
> Spring, JAXB and other important frameworks. It should be possible to
> do this:
>
> SymbolList sl = new SymbolList();
> sl.setName("AB123456");
> sl.setSequence(seqString);
>
> The final hindrance to the use of JEE is serialization. If we keep
> Symbols flyweight (singleton) we need to make this bullet proof from
> the start. It is also practically impossible to make something a bean
> and make it a Singleton, so some careful thought is required. If we keep
> symbols behind the scenes they may not need to be so bean like.
>
> - Mark
>

From gwaldon at geneinfinity.org Sat Sep 22 16:35:22 2007
From: gwaldon at geneinfinity.org (george waldon)
Date: Sat, 22 Sep 2007 09:35:22 -0700
Subject: [Biojava-dev] The future of BioJava
Message-ID: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com>

Richard,

You cannot kill biojava and it is not vista; you cannot force people to use it. I have a project with hundreds of classes using biojava and working without a glitch and the choice of either keeping with it or switching to a bj3 in the middle of a rewrite of around 1500 classes that may take months or years to complete. I may just never switch to the new biojava. Most likely, a lot of people are going to be in a similar situation and most likely bj3 will also have to support old biojava classes - great!

I agree that you cannot change interfaces, but you can deprecate them and toss them after one release cycle or put them into a deprecated module that is not included in releases.

The question becomes: what are the fundamental problems of biojava that truly justify a rewrite from the ground up?
Certainly, need for a new symbol model could be one; maintenance and testing are not; modular structure is not; and use of generics is not - they do not break old code.

George

> -----Original Message-----
> From: Richard Holland
> To: george waldon
> Cc: biojava-dev at biojava.org
> Sent: 9/21/2007 12:54 AM
> Subject: Re: [Biojava-dev] The future of BioJava
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi George.
>
> By 'stop development' I really meant just that active development
> efforts would be focused on the new codebase rather than modifying the
> existing one (except of course for fixing bugs, which is always
> important and we wouldn't stop doing that until the new codebase was
> well established as an alternative).
>
> I agree that modifying the existing codebase would improve many of the
> problems currently experienced with it - code abstraction being just one
> of them. BioJavaX was an attempt at doing this. The big stumbling block
> was interfaces - users do not expect interfaces to change as it breaks
> all code that already uses that interface. They also do not expect the
> defined behaviour of methods in interfaces to change - which meant, for
> instance, that I had real problems trying to get
> RichFeature/RichLocation and RichLocation/Location to match up as some
> parts of Feature and Location conflicted with the more realistic
> requirements of their Rich* equivalents (e.g. circularity).
>
> If you change interfaces, you might as well start from scratch in terms
> of the effect it has on end-user's code. Also, if we start from scratch,
> it allows us to build up from the very basics the kind of robustness and
> flexibility we need throughout the system. As mentioned in the original
> posting the existing system is heavily sequence-focused, meaning that
> even the simple task of scanning a set of features cannot be done
> without also loading the associated sequences because the two are so
> closely integrated. We need to make it much more flexible and I think
> new code would give us a better opportunity to do so without being tied
> into complying with existing interfaces or behaviour expectations.
>
> Having said that, I do expect large parts of the new codebase to be only
> slightly modified copies of the original code, particularly regarding
> recent developments such as genetic algorithms and phylogenetics. It
> would be silly to write such logic all over again where the code is
> relatively self-contained.
>
> cheers,
> Richard
>
>
>
> george waldon wrote:
> > Hello,
> >
> > All this is very exciting. I would certainly contribute to something
> > like that. A few remarks that come to my mind while reading all these
> > emails.
> >
> > I noticed that the tutorial has seriously improved - thanks for the
> > work. I remember my initial steps going to understanding Symbol and
> > cross-alphabets (...) Still, from time to time, I have difficulties with
> > basic things that are not intuitive to me such as "token", e.g.
> > Alphabet.getTokenization("token") or
> > SymbolTokenization.tokenizeSymbolList(SymbolList).
> >
> > I am surprised by all the requests to use String instead of
> > SymbolList. The CookBook tells precisely, and with code examples, how to
> > make most of all basic operations. Maybe someone could illustrate the
> > new kind of code versus the old one? I bet many newbies (and older ones)
> > actually get their answer in the Cookbook.
> >
> > Richard wrote:
> >> It is suggested that development stops on the existing Biojava(...)
> > Well, I don't think the license can let you do that :-)
> > Writing new code might be easier but certainly making old code better
> > will improve the level of code abstraction. Therefore I am promoting
> > improving existing Biojava code versus hazardous code rewrite. I can see
> > some of the initial steps on the roadmap:
> > - Switch to Subversion repository
> > - Change of the build process compatible with creation of modules
> > - Improving testing frame (mentioned several times)
> > - Creation of white papers for coding practices, build releases,
> >   (others?)
> >
> > Then maybe the proper work of restructuring Biojava may start. We can
> > either divide the existing mammoth into multiple modules at first or -
> > my preference - building modules one by one by selectively picking
> > classes. This way it will be easy to find out classes that can be
> > deprecated (by lack of users) and we can even have a deprecated module
> > at the end. Some coupling may need to loosen up. We will also need a
> > list of API changes for developers who will use the newer version. I am
> > sure that the kind of data structures proposed by Richard could find
> > their place as well as some of the proposed patterns (beans, others?)
> >
> > Anyway, all these are simple ideas. I am not an expert in build
> > process, but I can help with improving javadocs, writing examples and
> > test cases. I have also a fair knowledge of the molecular biology
> > package.
> >
> > Hope it helps,
> > George
> >
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFG83jK4C5LeMEKA/QRAtOFAJsF9YNdgdsOm1KY65GyRehsO1ElYwCfeUfi
> yXWTMXSzn3mXZqXXo9999rw=
> =WbAQ
> -----END PGP SIGNATURE-----

From phidias51 at gmail.com Sat Sep 22 18:42:50 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Sat, 22 Sep 2007 11:42:50 -0700
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com>
References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com>
Message-ID: <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com>

Richard & Andy,

1. I like the idea of making readers more pluggable, and Dozer definitely looks interesting. Is this going to be supported via the Service Provider Interface approach (used by Taverna and other projects)?

2. Andy brought up the point of people who create non-standard variations of EMBL-formatted files. I was wondering if these files were created in programming languages other than Java? If so, would those users be willing to use Jython, JRuby, or a Perl-like scripting language like Sleep? This would allow them to use biojava as a library, and still use a scripting language whose syntax they were familiar with. They would also be producing files in a more standardized format. This might cut down on the number of parsing mistakes caused by "unsupported" file variations. You can go to http://scripting.dev.java.net for more information on the scripting languages that the Java VM supports.

3. Was there any reason why non-standard files were being created? Perhaps some use-case not being covered?

4. If BioJava is split up into a variety of smaller JARs, how would you ensure that the users had all of the JARs that they needed? Would an installer be provided to allow users to select groups of JARs? There are a number of open source installers that would make this process easier. Using Maven is suitable if you're a developer; if you're a scripter it's a little more difficult to deal with.

5. Are there any thoughts about using a templating system like Velocity, FreeMarker or JST? This would make it easier to ensure that files were produced in a standard fashion. It would also make it easier to maintain support for writing files in different file formats.

6. When it comes to unit testing and continuous building, is the bio*.org server going to handle that automated build & burn, or is someone in the group going to have to do it? I think the inability to have the build set up on the server had us stymied before.

7. Now that Java also includes the Derby database, and the Java Persistence API (JPA), has anyone considered migrating the BioSQL support from Hibernate to JPA, and using Derby as the default database? This would make it a little easier to maintain and would minimize the setup work that a new user would have to do.

8. Richard, you mention in the "Reasoning" section that "users have moved on". What types of use-cases, beyond basic sequence analysis, should BioJava support? Would support for more lab-related processes expand the user base and number of committers? Would support for parsing different types of instrument files be a useful addition? I could imagine use cases where users would like to be able to parse an Affy file and fetch probe information, gene information, and perhaps pathway data.

9. Are there any thoughts about using annotations (perhaps in combination with ontologies) to handle semantic validation of arguments? For example, you might have an annotation like

@id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"}

indicating that the attribute or method argument is a LocusLink id.

Thanks for kick-starting this discussion.

Regards,

Mark Fortner

From holland at ebi.ac.uk Sun Sep 23 11:16:14 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Sun, 23 Sep 2007 12:16:14 +0100 (BST)
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com>
References: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com>
Message-ID: <51328.80.42.55.181.1190546174.squirrel@webmail.ebi.ac.uk>

Understood. I was thinking of including a 'compatibility mode' module in BJ3 which provides all the existing BJ2 interfaces and maps them to the new ones. This way we have the best of both worlds - existing projects would replace the BJ2 jars with the new BJ3 jars on the classpath plus the compatibility jar and wouldn't need to change any code at all. Existing import statements would then pick up the compatibility mappings instead of the original classes.

Anyhow your comments will definitely be considered. This is very early discussion after all - the point being to gather opinion and ideas to see if it's even worth making a change, let alone what kind of change that would be.

cheers,
Richard

PS. The most fundamental problem is that some of the existing interfaces are broken. They enforce situations which are not biologically logical - e.g. the feature and location interfaces have got strand mixed up. You can't fix this without altering the interfaces - and to alter the interfaces requires people to change existing code. If they're going to change existing code, why not make a clean sweep of it? Even deprecating for one release then removing in a subsequent one will still require you to change the 1500+ classes you mention, which is only delaying the problem.

PPS.
I will compile a comprehensive list of things I think are broken/wrong so that people can discuss specifically what should be done about them - whether they be rewrite or modification. I do want this to be a democratic process and if the majority of people don't want a particular plan of action to happen, then it won't.

--
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK

From markjschreiber at gmail.com Sun Sep 23 11:48:03 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sun, 23 Sep 2007 19:48:03 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com>
References: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com>
Message-ID: <93b45ca50709230448w4e9dd109wbe647aa4520a60f3@mail.gmail.com>

> You cannot kill biojava and it is not vista; you cannot force people to use it. I have a project with hundreds of classes using biojava and working without a glitch and the choice of either keeping with it or switching to a bj3 in the middle of a rewrite of around 1500 classes that may take months or years to complete.
> I may just never switch to the new biojava. Most likely, a lot of people are going to be in a similar situation and most likely bj3 will also have to support old biojava classes - great!
>

If a new version of biojava came out that was not compatible, there would be nothing to stop you from keeping the old bj1.5 jars in your lib directory and letting your application work off them. I've never been a fan of global class paths. Additionally, if you don't need to switch there would be little point. There is little point in switching from bj1.4 to 1.5 unless you need some new feature or bug fix. Bj1.4 and 1.5 were also the first to even attempt backwards compatibility, so I'm not totally convinced we need to support old classes and interfaces.

I would hope the motivation for a switch would be a cleaner and easier to use code base. You're correct, we can't force people to use it.

> I agree that you cannot change interfaces, but you can deprecate them and toss them after one release cycle or put them into a deprecated module that is not included in releases.
>

Dropping a deprecated interface is about the same as changing it in terms of backwards compatibility, although you do have the advantage of a little more warning.

> The question becomes: what are the fundamental problems of biojava that truly justify a rewrite from the ground up? Certainly, need for a new symbol model could be one; maintenance and testing are not; modular structure is not; and use of generics is not - they do not break old code.
>
> George
>

I would say the main argument is improving ease of use and getting away from the singleton symbol model.

With generics I am interested to know what happens when you take an interface that requires List and change it to require List such as.

public void foo(List l);

to

public void foo(List l)

Due to erasure this may not break, old code would still run, however old code would probably no longer compile.

- Mark

From markjschreiber at gmail.com Sun Sep 23 12:06:21 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sun, 23 Sep 2007 20:06:21 +0800
Subject: [Biojava-dev] The future of BioJava
In-Reply-To: <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com>
References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com>
Message-ID: <93b45ca50709230506j2cf48857h2f52b69b7a1e0594@mail.gmail.com>

> 1. I like the idea of making readers more pluggable, and Dozer
> definitely looks interesting. Is this going to be supported via the Service
> Provider Interface approach (used by Taverna and other projects)?
>

An SPI interface would be a great addition. I believe Taverna's is quite a nice feature. It would be good to have.

> 2. Andy brought up the point of people who create non-standard
> variations of EMBL-formatted files. I was wondering if these files were
> created in programming languages other than Java? If so, would those users
> be willing to use Jython, JRuby, or a Perl-like scripting language like
> Sleep? This would allow them to use biojava as a library, and still use a
> scripting language whose syntax they were familiar with. They would also be
> producing files in a more standardized format. This might cut down on the
> number of parsing mistakes caused by "unsupported" file variations. You can
> go to http://scripting.dev.java.net for more information on the
> scripting languages that the Java VM supports.
>

I think if we designed it right you could do a lot with Groovy, with the added benefit of very Java-like syntax. Richard and I did discuss the possibility of having all I/O file processing written in Groovy and compiled to classes.

> 3. Was there any reason why non-standard files were being created?
> Perhaps some use-case not being covered?
>

Non-standard GenBank-type files are made by VectorNTI. Also, formats change over the years. I think this recently happened with EMBL format. Unfortunately flatfiles, unlike XML, do not have versioning or need to validate against a definition.

> 4. If BioJava is split up into a variety of smaller JARs, how would
> you ensure that the users had all of the JARs that they needed? Would an
> installer be provided to allow users to select groups of JARs? There are a
> number of open source installers that would make this process easier. Using
> Maven is suitable if you're a developer; if you're a scripter it's a little
> more difficult to deal with.
>

Many projects are distributed as multiple jars (e.g. Hibernate). Typically the user would download the core bundle and put the jars in a lib folder. Additional jars could be downloaded for extra activities.

>
> 6. When it comes to unit testing and continuous building, is the
> bio*.org server going to handle that automated build & burn, or is someone
> in the group going to have to do it? I think the inability to have the
> build set up on the server had us stymied before.

The open-bio servers are a natural choice but I think a discussion of the pros and cons of others is a good idea.

>
> 7. Now that Java also includes the Derby database, and the Java
> Persistence API (JPA), has anyone considered migrating the BioSQL support
> from Hibernate to JPA, and using Derby as the default database? This would
> make it a little easier to maintain and would minimize the setup work that a
> new user would have to do.
>

I agree on this. This is also a good argument for making our classes more bean like so they can be easily turned into enterprise beans. A nice part of JPA is that you can use Hibernate to do the persistence. Having the Derby database built in offers other interesting possibilities as well.

> 8. Richard, you mention in the "Reasoning" section that "users have
> moved on". What types of use-cases, beyond basic sequence analysis, should
> BioJava support?
Would support for more of lab-related processes expand the > user base and number of committers? Would support for parsing different > types of instrument files be a useful addition? I could imagine use cases > where users would like to be able to parse an Affy file and fetch probe > information, gene information, and perhaps pathway data. > > 9. Are there any thoughts about using annotations (perhaps in > combination with ontologies) to handle semantic validation of arguments? > For example, you might have an annotation like > > @id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"} > > indicating that the attribute or method argument is a LocusLink id. > I think this is an excellent example of how we can use Annotations. It would allow quite a bit of flexibility for integration tasks. - Mark Schreiber > Mark Fortner > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From bugzilla-daemon at portal.open-bio.org Mon Sep 24 05:43:12 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Sep 2007 01:43:12 -0400 Subject: [Biojava-dev] [Bug 2371] New: ChromatogramFactory.create fails on Windows Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2371 Summary: ChromatogramFactory.create fails on Windows Product: BioJava Version: 1.5 Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: bio AssignedTo: biojava-dev at biojava.org ReportedBy: hkaya at be.itu.edu.tr The following small code perfectly runs on Linux but it fails on Windows. I'm using biojava 1.5. 1. 
TraceTest.java import java.io.File; import org.biojava.bio.chromatogram.Chromatogram; import org.biojava.bio.chromatogram.ChromatogramFactory; public class TraceTest { public static void main(String[] args) { try { Chromatogram trace = ChromatogramFactory.create(new File("test.scf")); System.out.println("Success!"); } catch (Exception e){ System.out.println("Failed!"); e.printStackTrace(); } } } 2. Test chromatogram file : http://istanbul.be.itu.edu.tr/~huseyin/test.scf 3. Exception thrown in Windows XP/Vista: Desktop> java -classpath biojava-1.5.jar;. TraceTest Exception in thread "main" org.biojava.bio.BioError: Unable to initialize DNATools at org.biojava.bio.seq.DNATools.(DNATools.java:117) at org.biojava.bio.program.scf.SCF$V3Parser.parseSamples(SCF.java:560) at org.biojava.bio.program.scf.SCF$Parser.parse(SCF.java:350) at org.biojava.bio.program.scf.SCF$ParserFactory.parse(SCF.java:206) at org.biojava.bio.program.scf.SCF.load(SCF.java:149) at org.biojava.bio.program.scf.SCF.load(SCF.java:141) at org.biojava.bio.program.scf.SCF.create(SCF.java:126) at org.biojava.bio.chromatogram.ChromatogramFactory.create(ChromatogramFactory.java:75) at TraceTest.main(TraceTest.java:8) Caused by: org.biojava.bio.BioError: Unable to initialize RNATools at org.biojava.bio.seq.RNATools.(RNATools.java:126) at org.biojava.bio.seq.DNATools.(DNATools.java:110) ... 8 more Caused by: org.biojava.bio.BioError: Couldn't parse TranslationTables.xml at org.biojava.bio.seq.RNATools.loadGeneticCodes(RNATools.java:529) at org.biojava.bio.seq.RNATools.(RNATools.java:124) ... 9 more Caused by: org.biojava.bio.symbol.IllegalSymbolException: Token `his' does not appear as a named symbol in alphabet `PROTEIN-TERM' at org.biojava.bio.seq.io.NameTokenization.parseToken(NameTokenization.java:110) at org.biojava.bio.seq.RNATools.loadGeneticCodes(RNATools.java:520) ... 
10 more -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From holland at ebi.ac.uk Mon Sep 24 07:42:34 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 24 Sep 2007 08:42:34 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <93b45ca50709230448w4e9dd109wbe647aa4520a60f3@mail.gmail.com> References: <20070922163522.41990.qmail@mmm1924.dulles19-verio.com> <93b45ca50709230448w4e9dd109wbe647aa4520a60f3@mail.gmail.com> Message-ID: <46F76A6A.8010401@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Mark - I'll use your comments as the basis of a list of things that are broken and need fixing, which I mentioned over the weekend that I would set up. This list will expand over time I'm sure. > With generics I am interested to know what happens when you take an > interface that requires List and change it to require List > such as. > > public void foo(List l); > > to > > public void foo(List l) > > Due to erasure this may not break, old code would still run, however > old code would probably no longer compile. Old code would still compile and run just fine, with warnings at compile time indicating that a genericised collection has been used without specifying the generic type. However, you wouldn't be able to pass in a List to a method which accepts a genericised List without casting it, as that is a compile-time error. So your foo() example above would not work. 
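The two cases discussed above can be sketched as follows. This is only an illustration: the type parameters in the quoted foo() signatures did not survive the archive, so String is assumed as the element type, and the class and method names (GenericsMigration, totalLength, callFromLegacyCode) are made up for the example. With those assumptions, a raw List passed to a genericised parameter compiles with an unchecked warning, while a List with a different type argument is rejected by the compiler.

```java
import java.util.ArrayList;
import java.util.List;

public class GenericsMigration {

    // The "after" signature: the parameter was a raw List before the change.
    // String is an assumed element type, chosen only for this sketch.
    public static int totalLength(List<String> items) {
        int total = 0;
        for (String s : items) {
            total += s.length();
        }
        return total;
    }

    // Simulates old, pre-generics caller code that still uses a raw List.
    // Without the annotation javac reports unchecked warnings here,
    // but the code still compiles and runs.
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static int callFromLegacyCode() {
        List legacy = new ArrayList();
        legacy.add("ACGT");
        legacy.add("TT");
        return totalLength(legacy); // unchecked conversion, not an error
    }

    public static void main(String[] args) {
        System.out.println(callFromLegacyCode());

        List<Object> objects = new ArrayList<Object>();
        objects.add("ACGT");
        // The next line does not compile if uncommented:
        // List<Object> is not a subtype of List<String>.
        // totalLength(objects);
    }
}
```

Note that the raw-List call is only checked at run time: if the legacy list held something other than String, the loop in totalLength would fail with a ClassCastException.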
cheers, Richard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG92pq4C5LeMEKA/QRAr4UAJ9FbCVFdE4enQrbFclNZx36RQaCBwCeM07d E+oDZ3+smexvGFWAA0eHeFM= =rFu5 -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Mon Sep 24 08:26:27 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Mon, 24 Sep 2007 09:26:27 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> Message-ID: <46F774B3.4010209@ebi.ac.uk> Hi Mark, > > 2. Andy brought up the point of people who create non-standard > variations of EMBL-formatted files. I was wondering if these files were > created in programming languages other than Java? If so, would those users > be willing to use a Jython, JRuby, or a Perl-like scripting language like > Sleep,? This would allow them to use biojava as a library, and still use a > scripting language whose syntax they were familiar with. They would also be > producing files in a more standardized format. This might cut down on the > number of parsing mistakes caused by "unsupported" file variations. You can > go to http://scripting.dev.java.net for more information on the > scripting languages that the Java VM supports. > > 3. Was there any reason why non-standard files were being created? > Perhaps some use-case not being covered? These files are not being created by accident just the groups that are producing them have different requirements wrt the data they release. So they want to produce EMBL flat files which have the look/markup of an EMBL record yet do not follow the same rules as EMBL. A good example (if memory serves me correctly) is UniProtKB. The specification of UniProtKB records are different to EMBL yet both output files of a similar markup. 
So it's not so much biojava not supporting a use-case, or a group producing flat files with a custom writer; they just have a different requirement. Getting BioJava to support them all is just a non-starter considering the number of projects available. A better way is just to let people plug into the grammar/objects for parsing these file formats & then groups can choose to release their parsing code or not. > 4. If BioJava is split up into a variety of smaller JARs, how would > you insure that the users had all of the JARs that they needed? Would an > installer be provided to allow users to select groups of JARs? There are a > number of open source installers that would make this process easier. Using > Maven is suitable if you're a developer, if you're a scripter it's a little > more difficult to deal with. Yes that's very true. We're encountering similar problems in our group where we have a set of people working on new Maven projects & older projects still using Ant. Our solution at the moment is producing Maven assemblies which cover different use cases & end users choose which one suits their needs the most. If we're talking about scripters though then it's probably easier to have it written on the wiki with a 'first steps' in the major JVM scripting languages (I'm thinking Groovy, JavaScript & JRuby should cover the bases). > 5. Are there any thoughts about using a templating system like > Velocity, FreeMarker or JST? This would make it easier to insure that files > were produced in a standard fashion. It would also make it easier to > maintain support for writing files in different file formats. I'd prefer to use StringTemplate (just because it's a push-based templating system, not a pull-based one like Velocity) but yeah I can see it being very useful. > 6. When it comes to unit testing and continuous building, is the > bio*.org server going to handle that automated build & burn, or is someone > in the group going to have to do it? 
I think the inability to have the > build setup on the server had us stymied before. I think that Andreas is in a better position to answer this one maybe but I'm guessing we can schedule the builds on a time basis along with building on each commit into the repository. > 7. Now that Java also includes the Derby database, and the Java > Persistence API (JPA), has anyone considered migrating the BioSQL support > from Hibernate to JPA, and using Derby as the default database? This would > make it a little easier to maintain and would minimize the setup work that a > new user would have to do. Hibernate supports JPA so the switch shouldn't be hard to do if needed. That said, Hibernate is still the 'market leader' when it comes to Java-based persistence so I'm not too worried about this. > > 8. Richard, you mention in the "Reasoning" section that "users have > moved on". What types of use-cases beyond basic sequence analysis, should > BioJava support? Would support for more of lab-related processes expand the > user base and number of committers? Would support for parsing different > types of instrument files be a useful addition? I could imagine use cases > where users would like to be able to parse an Affy file and fetch probe > information, gene information, and perhaps pathway data. I'm already aware of people doing the Affy parsing themselves (I was involved with writing the parsers for their XDA data format ... bloody unsigned big endian ints) but the code was never incorporated into biojava because the group wasn't 100% comfortable about releasing the code. But yes there are a lot of other use cases out there that I'm sure we're unaware of. Our only choice is to see if we can get people to contribute ideas to this stage of development & give people the opportunity to contribute code as & when it's required. > > 9. Are there any thoughts about using annotations (perhaps in > combination with ontologies) to handle semantic validation of arguments? 
> For example, you might have an annotation like > > @id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"} > > indicating that the attribute or method argument is a LocusLink id. > That's quite an interesting idea. Not sure where else to introduce them, if they are required, but it's a good idea :) Andy From ap3 at sanger.ac.uk Mon Sep 24 09:39:17 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 24 Sep 2007 10:39:17 +0100 Subject: [Biojava-dev] The future of BioJava In-Reply-To: <46F774B3.4010209@ebi.ac.uk> References: <93b45ca50709220433k6dd2311al2f16a103ef32550f@mail.gmail.com> <6e1d61f50709221142m6d65c8easd992b3983b57c9e2@mail.gmail.com> <46F774B3.4010209@ebi.ac.uk> Message-ID: >> 6. When it comes to unit testing and continuous building, is the >> bio*.org server going to handle that automated build & burn, or >> is someone >> in the group going to have to do it? I think the inability to >> have the >> build setup on the server had us stymied before. > > I think that Andreas is in a better position to answer this one maybe > but I'm guessing we can schedule the builds on a time basis along with > building on each commit into the repository. Just to clarify regarding the continuous builds: There is an auto-build running for BioJava at http://www.spice-3d.org/cruise/ A build is triggered ~40 minutes after a CVS commit as well as regularly every night. The following things happen every build: * all junit tests are run (see http://www.spice-3d.org/cruise/buildresults/biojava-live?tab=testResults ) * the latest javadoc is created ( http://www.spice-3d.org/public-files/javadoc/biojava/ ) * a biojava.jar file is provided for download * a biojava-src.jar (source code bundle) is provided. If the build fails (i.e. somebody committed broken code to CVS) this mailing list will be notified. 
Actually this is untested so far since CVS has been fine in the last few weeks :-) I set this up on my own machine, since I don't have admin rights on open-bio, but if people would prefer to have it running from there, it should be fairly simple to move the setup. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From bugzilla-daemon at portal.open-bio.org Thu Sep 27 06:16:50 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 27 Sep 2007 02:16:50 -0400 Subject: [Biojava-dev] [Bug 2359] SingleDP deserialization fails In-Reply-To: Message-ID: <200709270616.l8R6Gos7012751@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2359 mark.schreiber at novartis.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mark.schreiber at novartis.com 2007-09-27 02:16 EST ------- Bug is fixed. Unit tests committed. Also fixed similar problem with PairDP -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From markjschreiber at gmail.com Thu Sep 27 09:40:53 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 27 Sep 2007 17:40:53 +0800 Subject: [Biojava-dev] Taglets Message-ID: <93b45ca50709270240x7383243u34d84083200bbf20@mail.gmail.com> Would anyone object if I removed the taglets from biojava? 
They are not widely used in the javadocs (to say the least) and they seem to require unstable imports of com.sun packages which means they break whenever you change JDK. - Mark From holland at ebi.ac.uk Thu Sep 27 10:03:09 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 27 Sep 2007 11:03:09 +0100 Subject: [Biojava-dev] Taglets In-Reply-To: <93b45ca50709270240x7383243u34d84083200bbf20@mail.gmail.com> References: <93b45ca50709270240x7383243u34d84083200bbf20@mail.gmail.com> Message-ID: <46FB7FDD.2030903@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 fine by me. anything unstable or non-standard is not good news. Mark Schreiber wrote: > Would anyone object if I removed the taglets from biojava? > > They are not widely used in the javadocs (to say the least) and they > seem to require unstable imports of com.sun packages which means they > break whenever you change JDK. > > - Mark > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG+3/d4C5LeMEKA/QRAmgxAJ4yn14N/ViH7lamr3fPD4YhdVci2gCfSvtj INXwKGRUM0Kw8Nw8Jtd+8YQ= =EJ7R -----END PGP SIGNATURE----- From pranav.waila at gmail.com Thu Sep 27 07:05:46 2007 From: pranav.waila at gmail.com (pranav waila) Date: Thu, 27 Sep 2007 07:05:46 -0000 Subject: [Biojava-dev] help (F1 F1) regarding biojava Message-ID: <5c85b5bb0709270005m366227acx65ca6151e870ce79@mail.gmail.com>

package org.biojava.bio.seq.db;

import org.biojava.bio.seq.*;
import org.biojava.bio.seq.io.*;
import org.biojava.bio.*;
import org.biojava.bio.seq.db.*;
import org.biojava.bio.seq.io.*;
import java.net.*;

class test{
    public static SequenceFormat sf;
    public static void main(String args[]){
        System.getProperties().put("proxySet","true");
        System.getProperties().put("proxyPort","3128");
        System.getProperties().put("proxyHost","172.16.0.6");
        Sequence s;
        sf=new FastaFormat();
        //sf=new GenbankXmlFormat();
        NCBISequenceDB ncbi = new NCBISequenceDB( NCBISequenceDB.DB_PROTEIN,sf);//new FastaFormat());
        // GenbankSequenceDB gdb=new GenbankSequenceDB();
        //ncbi.setSequenceFormat(FASTA);
        try{
            // Sequence sequenceFromGenbank = ncbi.getSequence("P10659");
            // System.out.println(sequenceFromGenbank.getName());
            // older code
            s=ncbi.getSequence("P10659");
            //s=gdb.getAddress();//getSequence("P10659");//3789789");
            System.out.print("check");
            System.out.println(ncbi.getSequence("190786"));
            //
        }
        catch(Exception e){
            e.printStackTrace();
            //System.out.println("protien name is : zinteminia");
        }
    }
}

This is my code for fetching the sequence from NCBI, but it is giving so many exceptions. Can you provide me some code to do so? The errors are as follows:

Bio java exception could not read sequence

Can you please help me? Waiting for reply,
Pranav Waila
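One thing that may be worth checking in the code above is the proxy configuration. The JDK networking documentation lists http.proxyHost and http.proxyPort as the system properties for an HTTP proxy; proxySet is not among the documented properties, and whether the unprefixed proxyHost / proxyPort names are honoured depends on the JRE. A minimal sketch, assuming the same proxy host and port as in the posted code (the class and method names are hypothetical, and it is not confirmed that the proxy is the cause of the reported exception):

```java
public class ProxySettings {

    // Sets the documented JVM-wide HTTP proxy properties.
    // Must be called before the first HTTP connection is opened.
    public static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Host and port taken from the posted code; adjust for your network.
        useHttpProxy("172.16.0.6", 3128);
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

The same values can also be passed on the command line with -Dhttp.proxyHost=... and -Dhttp.proxyPort=..., which avoids changing the code.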