From alex at coolest.com Thu Nov 1 04:20:26 2007
From: alex at coolest.com (dasoudesu)
Date: Thu, 1 Nov 2007 01:20:26 -0700 (PDT)
Subject: [Biojava-l] [ann] Informal Text-mining & Java Meetup in Tokyo
Message-ID: <13524848.post@talk.nabble.com>


Just wanted to announce a mini-event:

Informal Text-mining & Java Meetup in Tokyo
http://curehunter.com/public/events.do

Come have a casual drink with some similarly minded devs interested in new
tech. (We like: Text-mining, Natural Language Processing, Java, C#, Python,
Flex, Dojo, Lucene...)

Time/location:
November 29th 2007, Thursday 8pm-10pm
Amarcord in Hatsudai (near Shinjuku), Tokyo
http://way.sub.jp/amarcord/access.php
2000-3000yen for food/drinks

If you can attend, please confirm by emailing: events at curehunter com

We will do a short demo of CureHunter and talk about some of the tech we
used. After that we will have a projector available if anyone else would
like to present for 5-15 min on stuff they are working on. (the location is
best equipped for drinking, however)

Hope to meet a few Java people from around Tokyo.

Best Regards,
Alex

---
http://curehunter.com - http://popjisyo.com - http://winstone.sf.net
--
View this message in context: http://www.nabble.com/-ann--Informal-Text-mining---Java-Meetup-in-Tokyo-tf4729944.html#a13524848
Sent from the BioJava mailing list archive at Nabble.com.


From ap3 at sanger.ac.uk Thu Nov 1 12:59:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 1 Nov 2007 16:59:35 +0000
Subject: [Biojava-l] Biojava migrating to Subversion
Message-ID: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>

Hi all,

Over the next weeks (until Christmas) BioJava will finally move the
version control system from CVS to Subversion (svn). This is happening
in parallel to the other open-bio projects.

We will ensure that nothing gets lost during this migration. This means
that all Biojava modules, branches, tags and the history of the files
will be imported into the new repository.

Over the next weeks we will:

A) Test the migration procedure to ensure nothing gets lost.
B) Declare a CVS freeze at some point, giving all developers enough time
   to commit the latest code to CVS.
C) Perform the final svn migration after the freeze. At this point we
   will also do a quick BioJava release (version 1.5.1).
D) From that moment on, do all future BioJava development via svn;
   CVS will remain frozen.

Detailed instructions for how to check out and commit code using svn
will be announced closer to the migration date. We will keep you
informed about the details as they develop. There is also a wiki page
which provides documentation for this:

http://biojava.org/wiki/CVS_to_SVN_Migration

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

-----------------------------------------------------------------------


--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.


From abhi232 at cc.gatech.edu Mon Nov 5 12:59:15 2007
From: abhi232 at cc.gatech.edu (abhi232 at cc.gatech.edu)
Date: Mon, 5 Nov 2007 12:59:15 -0500 (EST)
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
Message-ID: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>

Hi all,
I have a byte array holding the data from an .ab1 file. The BioJava
library provides a class called ABITrace which takes as input either a
byte[] array, a File or a URL. If I use the latter two (the File or the
URL) the program works, but if I pass the byte array to the constructor I
get a java.lang.ArrayIndexOutOfBoundsException. Is there a problem with
the ABITrace class, or how can I work around this error?
I am printing the length of the byte array and it comes to 144930. Can
that cause a problem in my code?

Thanks in advance.
Abhinav


From holland at ebi.ac.uk Tue Nov 6 05:15:43 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 10:15:43 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
Message-ID: <47303ECF.4020806@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I suspect the byte array itself may contain inaccurate data.

Internally, both the URL and File constructors read the data into a byte
array and then pass it to the same method as is used by the byte[]
constructor.

So, something must be different between the byte array you have, and the
byte array obtained by reading the file in.

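One way to find where two such arrays diverge is to compare them byte by
byte. The following is only a minimal sketch, not BioJava code: the class
ByteArrayDiff and the method firstMismatch are names invented for
illustration, and the two arrays in main are stand-in data rather than
real trace contents.

```java
// Hypothetical helper, not part of BioJava: reports where two byte arrays diverge.
class ByteArrayDiff {

    /**
     * Returns the index of the first differing byte, the shorter length if
     * one array is a prefix of the other, or -1 if the arrays are identical.
     */
    static int firstMismatch(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice these would be the array read from the
        // file and the array built by your own code.
        byte[] fromFile = {'A', 'B', 'I', 'F', 0, 101};
        byte[] fromCode = {'A', 'B', 'I', 'F', 0, 63};

        int at = firstMismatch(fromFile, fromCode);
        if (at < 0) {
            System.out.println("Arrays are identical");
        } else {
            System.out.println("First difference at index " + at);
        }
    }
}
```

With real data, a mismatch at or near index 0, or a large difference in
length, would suggest the bytes were altered before they reached the
constructor rather than by ABITrace itself.
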
The File constructor uses the following code to read the file:

    byte[] bytes = null;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    FileInputStream fis = new FileInputStream(ABIFile);
    BufferedInputStream bis = new BufferedInputStream(fis);
    int b;
    while ((b = bis.read()) >= 0)
    {
        baos.write(b);
    }
    bis.close(); fis.close(); baos.close();
    bytes = baos.toByteArray();

If the above code produces different results to your byte array when
reading data from the same file as your code, then something has gone
wrong with the construction of your byte array.

Lastly, a full stack trace would help us pinpoint the line that is
breaking, and hopefully provide a hint as to what is wrong with the
contents of the byte array. If you could provide one that would be very
helpful.

cheers,
Richard


abhi232 at cc.gatech.edu wrote:
> Hi all,
> I am having a byte array which is having the data from an .ab1 file.The
> biojava library provides a class called as ABITrace which takes as input
> either a byte[] array , a file or a url.If i use the later parameters (the
> file or the url )the program works but if I pass the byte array to the
> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
> problem with the ABITrace class or how can I bypass this particular error.
> I am printing the length of the byte array and it comes to 144930...Can
> that cause a problem in my code?
>
> Thanks in advance.
> Abhinav
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMD7P4C5LeMEKA/QRAmGIAJ9a/V6nZqMROz3H4u69ECQ+9iTgMgCeNZvr
oe52S3khmTvi5BFCL1W4KHM=
=5JAO
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk Tue Nov 6 11:53:54 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 16:53:54 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730A6F1.9050407@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
Message-ID: <47309C22.10803@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think that either the file is at fault, or the method you are using to
read the file into Java is at fault.

Could you provide us with the complete piece of code you are using from
the point where you read the file into the array through to the point
where you generate the output you quoted? (Not as an attachment as the
mailing list will strip those - simply paste it into the message body
instead).

cheers,
Richard


abhinav wrote:
> Yes I looked at the file ABITrace and found out that the first three
> characters must be ABI or the 128-130 characters must be ABI.But I
> cannot find that in the file that I am having.Also If this is not the
> case then there should be an illegal format exception whereas I am
> arrayIndexOutOfBound Exception which is also weird.
> I am getting the following stack trace.
> The bytes that i want are:0
> The bytes that i want are:11
> The bytes that i want are:0
> The size of the byte array generated is:144930
> Byte array also recieved
> java.lang.ArrayIndexOutOfBoundsException: 128
>     at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
>     at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
>     at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
>     at Trace.init(Trace.java:138)
>     at sun.applet.AppletPanel.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> The bytes I want are the first three bytes that I want to check if my
> file is ABI or not.I checked the isABI function as well it returns true
> or false value and not arrayIndexOutOfBouond . Also the number 128 does
> it hve any significance in this case?
> Thanks in advance
> Abhinav
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG
T1Z8W0JhCe7+hx5rbKLGqVk=
=qNcr
-----END PGP SIGNATURE-----


From abhi232 at cc.gatech.edu Tue Nov 6 13:03:02 2007
From: abhi232 at cc.gatech.edu (abhinav)
Date: Tue, 06 Nov 2007 12:03:02 -0600
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <47309C22.10803@ebi.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk>
Message-ID: <4730AC56.9060808@cc.gatech.edu>

Richard Holland wrote:
> I think that either the file is at fault, or the method you are using to
> read the file into Java is at fault.
>
> Could you provide us with the complete piece of code you are using from
> the point where you read the file into the array through to the point
> where you generate the output you quoted? (Not as an attachment as the
> mailing list will strip those - simply paste it into the message body
> instead).
>
> cheers,
> Richard

OK, yes, here is the code that I am using. I establish a connection with a
PHP page, which in turn reads the file and prints the content back to me.
I am using DataOutputStream for sending data and BufferedReader for
receiving it. I then read the data into a String and convert it to a
byte[] array. This is the code where the connection is established and
the data is read and displayed:

    private HttpURLConnection httpConn;
    private DataOutputStream out;
    private DataInputStream temp_stream;
    private BufferedReader in;
    private BufferedInputStream in_buff_stream;
    private String str ;
    private byte[] bytearray;
    Chromatogram abif_chromatogram;

    /** Creates a new instance of testPost */
    public testPost()
    {
        httpConn = null;
        str = new String("");
        bytearray = new byte[144930];
    }

    public byte[] create_and_write_Connection(String url,String data_request)
    {
        try
        {
            URL conn_url = new URL(url);
            httpConn = (HttpURLConnection)conn_url.openConnection();
            httpConn.setDoOutput(true);
            httpConn.setDoInput(true);
            httpConn.setRequestMethod("POST");
            out=new DataOutputStream(httpConn.getOutputStream());
            out.writeBytes(data_request);
            out.flush();
            System.out.println("Connection established successfully and data written");
            InputStreamReader in_stream = new InputStreamReader(httpConn.getInputStream());
            System.out.println("The character encoding used is:"+ in_stream.getEncoding());
            in = new BufferedReader(in_stream);
            System.out.println("Data acceptance started");
            while(in.readLine()!=null)
            {
                str += in.readLine();
            }
            System.out.println("The string to be returned is:"+str);
            bytearray = str.getBytes("ISO8859-1");
            String temp_string = new String(bytearray,"windows-1252");
            System.out.println("The encoded string is as follows:"+ temp_string);
            System.out.println("The size of byte array inside testpost is:"+ Array.getLength(bytearray));
            for(int i = 0 ; i < 3 ; i ++)
                System.out.println("The bytes that i want are:"+ bytearray[i]);
            return bytearray;
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
        return bytearray;
    }

Please guide me on this point.
Thanks,
Abhinav


From holland at ebi.ac.uk Tue Nov 6 12:05:12 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 17:05:12 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk> <4730AC56.9060808@cc.gatech.edu>
Message-ID: <47309EC8.2070904@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The String is where you're going wrong. ABI files are not Stringifyable;
they are binary data. Converting them to a String will corrupt them.

cheers,
Richard


abhinav wrote:
> Ok Yes here is the code that i am using .I establish a connection with a
> php page which in turn reads the file and prints the content back to
> me.I am using DataOutputStream for sending data and BufferedReader for
> taking in the data.Then I am reading the data into a string and
> converting it to byte[] array . this the code where the connection is
> estableshed and the data is taken and displayed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMJ7I4C5LeMEKA/QRAupLAJ9YDoGohk5uZSNYZnRRMJ5WeNDpGgCfdCyg
+Z/gXBbPmrG3SuQlfeHuD3A=
=akSf
-----END PGP SIGNATURE-----


From abhi232 at cc.gatech.edu Tue Nov 6 12:40:01 2007
From: abhi232 at cc.gatech.edu (abhinav)
Date: Tue, 06 Nov 2007 11:40:01 -0600
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <47303ECF.4020806@ebi.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk>
Message-ID: <4730A6F1.9050407@cc.gatech.edu>

Richard Holland wrote:
> Lastly, a full stack trace would help us pinpoint the line that is
> breaking, and hopefully provide a hint as to what is wrong with the
> contents of the byte array. If you could provide one that would be very
> helpful.

Yes, I looked at the ABITrace source and found that either the first three
characters or characters 128-130 must be ABI, but I cannot find that in
the file I have. Also, if this is not the case there should be an illegal
format exception, whereas I am getting an ArrayIndexOutOfBoundsException,
which is also weird.
I am getting the following stack trace.
The bytes that i want are:0
The bytes that i want are:11
The bytes that i want are:0
The size of the byte array generated is:144930
Byte array also recieved
java.lang.ArrayIndexOutOfBoundsException: 128
    at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
    at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
    at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
    at Trace.init(Trace.java:138)
    at sun.applet.AppletPanel.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

The bytes I print are the first three bytes, which I want to check to see
whether my file is ABI or not. I checked the isABI function as well: it
returns a true or false value and not an ArrayIndexOutOfBoundsException.
Also, does the number 128 have any significance in this case?
Thanks in advance
Abhinav


From walsh at andrew.cmu.edu Tue Nov 6 12:23:36 2007
From: walsh at andrew.cmu.edu (Andrew Walsh)
Date: Tue, 06 Nov 2007 12:23:36 -0500
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk> <4730AC56.9060808@cc.gatech.edu>
Message-ID: <4730A318.8010406@andrew.cmu.edu>

You also appear to be losing every other line with the following code:

    while(in.readLine()!=null)
    {
        str += in.readLine();
    }

Every time the while statement checks its condition, a line is read from
the input stream. That line is never stored. Then, if the condition is
met, another line is read and that line is added to your String.

-Andy

abhinav wrote:
> Ok Yes here is the code that i am using .I establish a connection with a
> php page which in turn reads the file and prints the content back to
> me.I am using DataOutputStream for sending data and BufferedReader for
> taking in the data.Then I am reading the data into a string and
> converting it to byte[] array . this the code where the connection is
> estableshed and the data is taken and displayed.
> > > > private HttpURLConnection httpConn; > private DataOutputStream out; > private DataInputStream temp_stream; > private BufferedReader in; > private BufferedInputStream in_buff_stream; > private String str ; > private byte[] bytearray; > Chromatogram abif_chromatogram; > > /** Creates a new instance of testPost */ > public testPost() > { > > httpConn = null; > str = new String(""); > bytearray = new byte[144930]; > > } > public byte[] create_and_write_Connection(String url,String > data_request) > { > try > { > URL conn_url = new URL(url); > httpConn = (HttpURLConnection)conn_url.openConnection(); > httpConn.setDoOutput(true); > httpConn.setDoInput(true); > httpConn.setRequestMethod("POST"); > out=new DataOutputStream(httpConn.getOutputStream()); > out.writeBytes(data_request); > out.flush(); > System.out.println("Connection established successfully and > data written"); > InputStreamReader in_stream = new > InputStreamReader(httpConn.getInputStream()); > > System.out.println("The character encoding used is:"+ > in_stream.getEncoding()); > in = new BufferedReader(in_stream); > > > System.out.println("Data acceptance started"); > > > while(in.readLine()!=null) > { > str += in.readLine(); > } > System.out.println("The string to be returned is:"+str); > bytearray = str.getBytes("ISO8859-1"); > String temp_string = new String(bytearray,"windows-1252"); > System.out.println("The encoded string is as follows:"+ > temp_string); > System.out.println("The size of byte array inside testpost > is:"+ Array.getLength(bytearray)); > for(int i = 0 ; i < 3 ; i ++) > System.out.println("The bytes that i want are:"+ > bytearray[i]); > return bytearray; > } > catch(Exception e) > { > e.printStackTrace(); > } > return bytearray; > } > Please guide me on this point > Thanks > Abhinav > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From holland at ebi.ac.uk Thu 
Nov 8 08:53:09 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 13:53:09 +0000 Subject: [Biojava-l] BioJava 3 Proposals Message-ID: <473314C5.8070207@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Dear BioJava users, The BioJava developers are considering options for the future development of the BioJava toolkit. We consider that it needs improvement in a few major areas to make it easier to use and understand, and also faster and more scalable. The options are to either rewrite large parts of the existing code, working within the existing interfaces and paradigms, or to develop a new set of BioJava packages from the ground up in order to take advantage of lessons learned from the design patterns of the existing code. The BioJava developers have spent the last couple of months discussing ideas and proposals related to these options on a Wiki page, and would now like to open this discussion to all users of BioJava and the bioinformatics community in general. We would like to invite anyone who has any ideas or suggestions to contribute these to the Wiki page, and/or to comment on the ideas and suggestions that have already been posted there. Here is a link to the Wiki page, and also a link to the associated Talk page where much of the discussion has taken place so far: http://biojava.org/wiki/BioJava3_Proposal http://biojava.org/wiki/Talk:BioJava3_Proposal It is our intention to leave the discussion open until mid-January 2008 when we will summarise it and use it as the basis of a plan of action. We will then distribute the summary and the action plan via the BioJava website. We look forward to hearing your comments and ideas. Please do remember to make them directly to the Wiki page so that they are preserved in context, making it easier for us to summarise them later! cheers, Richard (on behalf of all BioJava developers) PS. Just to reassure you, this is NOT a plan to drop the existing codebase. 
It will continue to exist, but the outcome of these discussions will determine whether we will continue to develop and support it or start afresh with a clean slate and a new codebase. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMxTE4C5LeMEKA/QRAlGSAJwKzO0oAe3T2e8ibcG8uRReOVfh7wCdGlwn JkcVzA55Ye32o8Ry48LO+04= =oaaC -----END PGP SIGNATURE----- From holland at ebi.ac.uk Thu Nov 8 08:58:23 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 13:58:23 +0000 Subject: [Biojava-l] Biojava wiki Message-ID: <473315FF.70506@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 what's happened to the biojava wiki today? i get errors from all pages, including the front page, indicating zero-sized replies. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62 6KkoyXhahrWkZo2OWyCL+Uk= =1jK7 -----END PGP SIGNATURE----- From phidias51 at gmail.com Thu Nov 8 10:39:29 2007 From: phidias51 at gmail.com (Mark Fortner) Date: Thu, 8 Nov 2007 07:39:29 -0800 Subject: [Biojava-l] Biojava wiki In-Reply-To: <473315FF.70506@ebi.ac.uk> References: <473315FF.70506@ebi.ac.uk> Message-ID: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com> Richard, That's odd. It comes up fine for me. BTW, in your proposal you mentioned that people had "moved on". I was wondering what types of tasks they had moved on to, and what should be included in the Proposal to insure that BioJava stays relevant to them? Regards, Mark On Nov 8, 2007 5:58 AM, Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > what's happened to the biojava wiki today? i get errors from all pages, > including the front page, indicating zero-sized replies. 
> -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62 > 6KkoyXhahrWkZo2OWyCL+Uk= > =1jK7 > -----END PGP SIGNATURE----- > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From hlapp at gmx.net Thu Nov 8 10:53:03 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 10:53:03 -0500 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <762277.43372.qm@web26507.mail.ukl.yahoo.com> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> Message-ID: Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we explicitly lowercase the value found for alphabet, and the comment says why: # Note: Biojava uses upper-case terms for alphabet, so we # need to change to all-lower in case the sequence was # manipulated by Biojava. $obj->alphabet(lc($rows->[3])) if $rows->[3]; However, when inserting sequences, we leave the value as is in BioPerl (which is lowercase), leading to a potential problem for Biojava upon retrieval. Do the Biojava folks deal with that? Should this may harmonized across the board? -hilmar On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: > Dear Peter, > > All the alphabet are "DNA" (upper case) in my database. The > sequences are taken from NCBI by a BioJava application. > Thus is should be that BioJava inserts the records with "DNA". Thus > no potential "hidden bug" in BioPython. > > Maybe a point to share with the Open-Bio committee. > > Eric > > ----- Original Message ---- > From: Peter > To: Eric Gibert > Cc: biopython at lists.open-bio.org > Sent:
Thursday, 8 November 2007, 19:40:00 > Subject: Re: [BioPython] small "bug" correction in package BioSql > > Eric Gibert wrote: >> Dear all, >> >> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >> function: >> >> ... >> >> please note my correction: force moltype to be turn in lower case as >> my database has upper case value! this raises the "Unknown moltype" >> error. > > Hi Eric, I've made your suggested change in CVS, > biopython/BioSQL/BioSeq.py revision 1.13, thank you. > > I would encourage you to investigate why some of the "alphabet" fields > in the biosequence table are in upper case. There could be a bug > elsewhere which is writing these entries with the wrong alphabet. Is > this affecting all entries, or just some? > > Peter > > > > > > > > > ______________________________________________________________________ > _______ > Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers > Yahoo! Mail > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From holland at ebi.ac.uk Thu Nov 8 11:17:25 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 16:17:25 +0000 Subject: [Biojava-l] Biojava wiki In-Reply-To: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com> References: <473315FF.70506@ebi.ac.uk> <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com> Message-ID: <47333695.40808@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > BTW, in your proposal you mentioned that people had "moved on". I was > wondering what types of tasks they had moved on to, and what should be > included in the Proposal to insure that BioJava stays relevant to them? Good point.
From what we can tell, people are not so sequence-focused any more but are more interested in features, alignments, population data, etc. - more 'metadata' so to speak. We do need some mechanism to ensure that we are correct in this thinking, and that future shifts in direction are catered for in this design phase. Could you add a note to the wiki with your points, and/or any ideas you may have about ensuring these requirements are met? cheers, Richard > Regards, > > Mark > > On Nov 8, 2007 5:58 AM, Richard Holland wrote: > > what's happened to the biojava wiki today? i get errors from all pages, > including the front page, indicating zero-sized replies. _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMzaV4C5LeMEKA/QRAoPUAJ0TQ+xFF1J3EtZgHmvYj2HH41koCgCeLYm0 D5Z7SJDWjvJ9rbCrS+RTEeI= =XhE1 -----END PGP SIGNATURE----- From holland at ebi.ac.uk Thu Nov 8 11:18:46 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 16:18:46 +0000 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> Message-ID: <473336E6.6000100@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 we do need a consensus here. I'm happy to go with whatever value is chosen, as the BioJava code can easily be modified to suit. cheers, Richard Hilmar Lapp wrote: > Indeed Biojava uses uppercase for alphabet. 
In Bioperl-db, we > explicitly lowercase the value found for alphabet, and the comment > says why: > > # Note: Biojava uses upper-case terms for alphabet, so we > # need to change to all-lower in case the sequence was > # manipulated by Biojava. > $obj->alphabet(lc($rows->[3])) if $rows->[3]; > > However, when inserting sequences, we leave the value as is in > BioPerl (which is lowercase), leading to a potential problem for > Biojava upon retrieval. Do the Biojava folks deal with that? Should > this may harmonized across the board? > > -hilmar > > On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: > >> Dear Peter, >> >> All the alphabet are "DNA" (upper case) in my database. The >> sequences are taken from NCBI by a BioJava application. >> Thus is should be that BioJava inserts the records with "DNA". Thus >> no potential "hidden bug" in BioPython. >> >> Maybe a point to share with the Open-Bio committee. >> >> Eric >> >> ----- Original Message ---- >> From: Peter >> To: Eric Gibert >> Cc: biopython at lists.open-bio.org >> Sent: Thursday, 8 November 2007, 19:40:00 >> Subject: Re: [BioPython] small "bug" correction in package BioSql >> >> Eric Gibert wrote: >>> Dear all, >>> >>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>> function: >>> >>> ... >>> >>> please note my correction: force moltype to be turn in lower case as >>> my database has upper case value! this raises the "Unknown moltype" >>> error. >> Hi Eric, I've made your suggested change in CVS, >> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >> >> I would encourage you to investigate why some of the "alphabet" fields >> in the biosequence table are in upper case. There could be a bug >> elsewhere which is writing these entries with the wrong alphabet. Is >> this affecting all entries, or just some? >> >> Peter >> >> >> >> >> >> >> >> >> ______________________________________________________________________ >> _______ >> Ne gardez plus qu'une seule adresse mail !
Copiez vos mails vers >> Yahoo! Mail >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3 9x+CUHig3GfBCZ56rDb1ZG4= =OJyB -----END PGP SIGNATURE----- From hlapp at gmx.net Thu Nov 8 15:28:19 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 15:28:19 -0500 Subject: [Biojava-l] [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> Message-ID: Maybe we need to hold some mini-hackathon to make the different toolkits compatible in how they map annotation to the schema. Obviously I don't know whether you have the latest Biojava setup here, but I'll just comment how BioPerl/Bioperl-db would map this: 'ORIGIN' - if I'm not mistaken this is only a token that introduces the actual sequence. I'm not sure what Biojava is storing as value here. 'DIVISION' - this maps to column division in table bioentry (though I agree that if perfectly following the weak typing principle this should be tag/value association, but at present it's still an actual column) 'genbank_accessions' - secondary accession numbers indeed go into the qualifier value table. The primary accession maps to column accession in table bioentry 'TITLE' - this is part of a publication reference, and should map to column title in table reference (which it does in bioperl-db) 'cross_references' - not sure where these would be coming from in GenBank format; for EMBL this will map to the dbxref table 'data_file_division' - not sure what this is (same as DIVISION?) 
'VERSION' - in BioPerl we parse this apart into a version for the accession (which is column version in table bioentry) and the GI number, which maps to column identifier in table bioentry 'references' - these map to table reference (and bioentry_reference for association with the bioentry) 'KEYWORDS' - indeed these map to bioentry_qualifier_value 'GI' - maps to column identifier in table bioentry 'SIZE' - not sure what size that is. If it is the length of the sequence, it should (and in BioPerl/bioperl-db does) map to column length in table biosequence 'DEFINITION' - maps to column description in table bioentry 'REFERENCE' - should be the same as for 'references' 'MDAT' - not sure what this is 'ORGANISM' - this is the organism and maps to the table taxon (and taxon_name), with a foreign key in bioentry pointing to the taxon 'JOURNAL' - this is part of a reference, see 'references' 'ACCESSION' - the primary accession, maps to column accession in table bioentry 'LOCUS' - in the file itself this is an entire line consisting of multiple fields; BioPerl/bioperl-db maps the locus name (the first token after the literal token LOCUS) to column name in table bioentry 'SOURCE' - this is the organism, see 'ORGANISM' 'PUBMED' - this is part of a literature reference, and maps to a foreign key in the reference table (reference.dbxref) to a dbxref entry with PUBMED or PMID as the database and the pubmed ID as the accession 'AUTHORS' - part of a literature reference, maps to column authors in table reference 'TYPE' - not sure what this is. If it's the alphabet, it maps to table biosequence, column alphabet 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value, though there have been plans to make it a column in table biosequence. Note that this could in fact be the way Biojava stores it too, but upon retrieval represents it in the way you are seeing it. 
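The mapping listed above can be summarised as a small lookup table. The following is only a sketch in plain Java covering the entries that are stated without reservation in the list; the class name, the method name and the "table.column" string notation are illustrative assumptions and do not come from BioPerl, BioJava or Biopython. Tags whose meaning is uncertain in the list (for example 'MDAT') are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class GenBankToBioSql {
    /** GenBank annotation tag -> BioSQL "table.column", as described in the list above. */
    private static final Map<String, String> COLUMN_FOR_TAG = new LinkedHashMap<>();
    static {
        COLUMN_FOR_TAG.put("LOCUS", "bioentry.name");
        COLUMN_FOR_TAG.put("ACCESSION", "bioentry.accession");
        COLUMN_FOR_TAG.put("GI", "bioentry.identifier");
        COLUMN_FOR_TAG.put("DIVISION", "bioentry.division");
        COLUMN_FOR_TAG.put("DEFINITION", "bioentry.description");
        COLUMN_FOR_TAG.put("TITLE", "reference.title");
        COLUMN_FOR_TAG.put("AUTHORS", "reference.authors");
    }

    /** Returns the target column for a tag, or empty if the tag is not covered by this sketch. */
    public static Optional<String> targetFor(String tag) {
        return Optional.ofNullable(COLUMN_FOR_TAG.get(tag));
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"DEFINITION", "GI", "MDAT"}) {
            System.out.println(tag + " -> " + targetFor(tag).orElse("(not covered here)"));
        }
    }
}
```

A table like this says nothing about values that need splitting (such as 'VERSION') or that go into bioentry_qualifier_value; those would need their own handling.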
Hth, -hilmar On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote: > Dear all, > > When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted > previously by my BioJava application, I have: > > print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys() > > Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION', > 'genbank_accessions', 'TITLE', 'cross_references', > 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI', > 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL', > 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE', > 'CIRCULAR'] > > but a freshly inserted BioSeq by BioPython 1.44 only gives me: > Debug on Seq: EF631597.1 = ['cross_references', 'dates', > 'references', 'gi', 'data_file_division'] > > > Once I look in the table bioentry_qualifier_value > > * 20 records for a Sequence imported by BioJava > * 1 only for a Sequence inserted by BioPython: the date which > should be inserted by "_load_bioentry_date" in BioSQL/Loader.py > > Quite a few annotations missing, no? > > Any idea? > > Eric > > > > > > ______________________________________________________________________ > _______ > Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers > Yahoo! 
Mail > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Nov 8 15:30:29 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 15:30:29 -0500 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <473336E6.6000100@ebi.ac.uk> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> Message-ID: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> It seems BioPerl and Biopython both want (and have traditionally used) lowercase - do you mind going with that for Biojava as well, or alternatively, simply map upon insert/update and retrieve? -hilmar On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >> explicitly lowercase the value found for alphabet, and the comment >> says why: >> >> # Note: Biojava uses upper-case terms for alphabet, so we >> # need to change to all-lower in case the sequence was >> # manipulated by Biojava. >> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >> >> However, when inserting sequences, we leave the value as is in >> BioPerl (which is lowercase), leading to a potential problem for >> Biojava upon retrieval. Do the Biojava folks deal with that? Should >> this may harmonized across the board? >> >> -hilmar >> >> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >> >>> Dear Peter, >>> >>> All the alphabet are "DNA" (upper case) in my database. 
The >>> sequences are taken from NCBI by a BioJava application. >>> Thus is should be that BioJava inserts the records with "DNA". Thus >>> no potential "hidden bug" in BioPython. >>> >>> Maybe a point to share with the Open-Bio committee. >>> >>> Eric >>> >>> ----- Original Message ---- >>> From: Peter >>> To: Eric Gibert >>> Cc: biopython at lists.open-bio.org >>> Sent: Thursday, 8 November 2007, 19:40:00 >>> Subject: Re: [BioPython] small "bug" correction in package BioSql >>> >>> Eric Gibert wrote: >>>> Dear all, >>>> >>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>>> function: >>>> >>>> ... >>>> >>>> please note my correction: force moltype to be turn in lower >>>> case as >>>> my database has upper case value! this raises the "Unknown moltype" >>>> error. >>> Hi Eric, I've made your suggested change in CVS, >>> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >>> >>> I would encourage you to investigate why some of the "alphabet" >>> fields >>> in the biosequence table are in upper case. There could be a bug >>> elsewhere which is writing these entries with the wrong >>> alphabet. Is >>> this affecting all entries, or just some? >>> >>> Peter >>> >>> >>> >>> >>> >>> >>> >>> >>> ____________________________________________________________________ >>> __ >>> _______ >>> Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers >>> Yahoo!
Mail >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3 > 9x+CUHig3GfBCZ56rDb1ZG4= > =OJyB > -----END PGP SIGNATURE----- -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From holland at ebi.ac.uk Fri Nov 9 03:39:01 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 09 Nov 2007 08:39:01 +0000 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> Message-ID: <47341CA5.9080509@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 i'll see what i can do. Hilmar Lapp wrote: > It seems BioPerl and Biopython both want (and have traditionally used) > lowercase - do you mind going with that for Biojava as well, or > alternatively, simply map upon insert/update and retrieve? > > -hilmar > > On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >>>> explicitly lowercase the value found for alphabet, and the comment >>>> says why: >>>> >>>> # Note: Biojava uses upper-case terms for alphabet, so we >>>> # need to change to all-lower in case the sequence was >>>> # manipulated by Biojava. 
>>>> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >>>> >>>> However, when inserting sequences, we leave the value as is in >>>> BioPerl (which is lowercase), leading to a potential problem for >>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should >>>> this may harmonized across the board? >>>> >>>> -hilmar >>>> >>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >>>> >>>>> Dear Peter, >>>>> >>>>> All the alphabet are "DNA" (upper case) in my database. The >>>>> sequences are taken from NCBI by a BioJava application. >>>>> Thus is should be that BioJava inserts the records with "DNA". Thus >>>>> no potential "hidden bug" in BioPython. >>>>> >>>>> Maybe a point to share with the Open-Bio committee. >>>>> >>>>> Eric >>>>> >>>>> ----- Original Message ---- >>>>> From: Peter >>>>> To: Eric Gibert >>>>> Cc: biopython at lists.open-bio.org >>>>> Sent: Thursday, 8 November 2007, 19:40:00 >>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql >>>>> >>>>> Eric Gibert wrote: >>>>>> Dear all, >>>>>> >>>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>>>>> function: >>>>>> >>>>>> ... >>>>>> >>>>>> please note my correction: force moltype to be turn in lower case as >>>>>> my database has upper case value! this raises the "Unknown moltype" >>>>>> error. >>>>> Hi Eric, I've made your suggested change in CVS, >>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >>>>> >>>>> I would encourage you to investigate why some of the "alphabet" fields >>>>> in the biosequence table are in upper case. There could be a bug >>>>> elsewhere which is writing these entries with the wrong alphabet. Is >>>>> this affecting all entries, or just some? >>>>> >>>>> Peter >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ______________________________________________________________________ >>>>> _______ >>>>> Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers >>>>> Yahoo!
Mail >>>>> _______________________________________________ >>>>> BioPython mailing list - BioPython at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biopython >>>> > --=========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHNByl4C5LeMEKA/QRAmCzAJ9fxSm8l5YAEHAUe2hH+Gwc1Xe5IwCfcMf6 c9sy8lASDV069FQJ79Geemw= =RHM1 -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Nov 9 07:42:38 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 09 Nov 2007 12:42:38 +0000 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> Message-ID: <473455BE.6040807@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I did a bit of poking around in our code and internally BioJava represents all the default alphabet names (Protein, DNA, etc.) in upper case. It also allows for mixed case alphabet names. It's not quite as easy as I thought to change these to lower case as they are often referenced by text name, meaning other people's code might break if I change them. Also, as it allows for mixed-case alphabet names, I can't do a toUpper/toLower fudge on persistence to BioSQL, as I wouldn't necessarily get out what I put in! So, I think I'll add this as a point on the recently announced BioJava 3 proposal, that BioSQL interaction must be compliant with standards laid down by the BioSQL project, and that our code will be able to cope with this internally. That brings us back to BioSQL standards - the idea of a mini-hackathon to solve this once and for all is a very good one. 
Our previous attempts between BioPerl and BioJava in Singapore were good, but still there are niggles as seen in this thread of discussion. It seems that a schema on it's own just isn't enough to make the various projects play nicely, and instructions are needed on exactly how to use that schema if they are truly all going to be able to use it without caring who or what wrote the data that is being read. cheers, Richard Hilmar Lapp wrote: > It seems BioPerl and Biopython both want (and have traditionally used) > lowercase - do you mind going with that for Biojava as well, or > alternatively, simply map upon insert/update and retrieve? > > -hilmar > > On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >>>> explicitly lowercase the value found for alphabet, and the comment >>>> says why: >>>> >>>> # Note: Biojava uses upper-case terms for alphabet, so we >>>> # need to change to all-lower in case the sequence was >>>> # manipulated by Biojava. >>>> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >>>> >>>> However, when inserting sequences, we leave the value as is in >>>> BioPerl (which is lowercase), leading to a potential problem for >>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should >>>> this may harmonized across the board? >>>> >>>> -hilmar >>>> >>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >>>> >>>>> Dear Peter, >>>>> >>>>> All the alphabet are "DNA" (upper case) in my database. The >>>>> sequences are taken from NCBI by a BioJava application. >>>>> Thus is should be that BioJava inserts the records with "DNA". Thus >>>>> no potential "hidden bug" in BioPython. >>>>> >>>>> Maybe a point to share with the Open-Bio committee. 
>>>>>
>>>>> Eric
>>>>>
>>>>> ----- Original Message ----
>>>>> From: Peter
>>>>> To: Eric Gibert
>>>>> Cc: biopython at lists.open-bio.org
>>>>> Sent: Thursday, 8 November 2007, 19:40:00
>>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>>>>
>>>>> Eric Gibert wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the
>>>>>> function:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Please note my correction: force moltype to be turned into lower
>>>>>> case, as my database has upper-case values, which raise the
>>>>>> "Unknown moltype" error.
>>>>> Hi Eric, I've made your suggested change in CVS,
>>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>>>>
>>>>> I would encourage you to investigate why some of the "alphabet" fields
>>>>> in the biosequence table are in upper case. There could be a bug
>>>>> elsewhere which is writing these entries with the wrong alphabet. Is
>>>>> this affecting all entries, or just some?
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> ______________________________________________________________________
>>>>> _______
>>>>> Keep just one email address! Copy your mail to
>>>>> Yahoo!
Mail >>>>> _______________________________________________ >>>>> BioPython mailing list - BioPython at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biopython >>>> > --=========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHNFW84C5LeMEKA/QRApBiAJ41WqCDKOJhee5NxIsquYaR/ImBRgCfb7zM LX75HHvCUC/v4n3okmUQ+ME= =d6QO -----END PGP SIGNATURE----- From email2ants at gmail.com Fri Nov 9 12:55:36 2007 From: email2ants at gmail.com (Anthony Underwood) Date: Fri, 9 Nov 2007 17:55:36 +0000 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) Message-ID: Hi All, I've generated an alignment and I am retrieving positions within the alignment using Symbol base = alignment.symbolAt(label, i); I am trying to get whether the base at this position is G, A, T or C However when I use base.getName() it returns strings such as "thymine" The documentation states that the method getToken should also be available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html Is there a simple way of retrieving a one letter textual representation of the symbol? Many thanks Anthony From zagato.gekko at gmail.com Fri Nov 9 13:48:02 2007 From: zagato.gekko at gmail.com (Zagato) Date: Fri, 9 Nov 2007 13:48:02 -0500 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) In-Reply-To: References: Message-ID: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com> Try with: String s = alignment.symbolListForLabel( label ).subStr( i, i+1 ); Bye... 
Alan Jairo Acosta
Cali - Colombia

On Nov 9, 2007 12:55 PM, Anthony Underwood wrote:

> Hi All,
>
> I've generated an alignment and I am retrieving positions within the
> alignment using
>
> Symbol base = alignment.symbolAt(label, i);
>
> I am trying to get whether the base at this position is G, A, T or C
>
> However when I use base.getName() it returns strings such as "thymine"
>
> The documentation states that the method getToken should also be
> available, but this returns method undefined.
> http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
>
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
>
>
> Many thanks
>
>
> Anthony
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Farewell.
http://www.youtube.com/zagatogekko
ruby << __EOF__
puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse
__EOF__

From gwaldon at geneinfinity.org Fri Nov 9 13:45:10 2007
From: gwaldon at geneinfinity.org (George Waldon)
Date: Fri, 09 Nov 2007 10:45:10 -0800
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
Message-ID: <20071109184510.80580.qmail@mmm1924.dulles19-verio.com>

Tokens are associated with alphabets.

Get the tokenization from the alphabet using:

  SymbolTokenization tokenization = alphabet.getTokenization("token");

Get the token from the tokenization using:

  String token = tokenization.tokenizeSymbol(symbol);

Also, check the tutorial and the cookbook on the BioJava web site at www.biojava.org, which are often more informative than the javadoc.

Frankly speaking, I agree with you and we should have a method like

  String token = symbol.getToken(alphabet, "token");

to do these operations simply and without losing our hair!

Best of luck,

George

> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
> bounces at lists.open-bio.org] On Behalf Of Anthony Underwood
> Sent: Friday, November 09, 2007 9:56 AM
> To: BioJava
> Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
> > Hi All, > > I've generated an alignment and I am retrieving positions within the > alignment using > > Symbol base = alignment.symbolAt(label, i); > > I am trying to get whether the base at this position is G, A, T or C > > However when I use base.getName() it returns strings such as "thymine" > > The documentation states that the method getToken should also be > available, but this returns method undefined. > http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html > > Is there a simple way of retrieving a one letter textual > representation of the symbol? > > > Many thanks > > > Anthony > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists From email2ants at gmail.com Fri Nov 9 18:23:01 2007 From: email2ants at gmail.com (Anthony Underwood) Date: Fri, 9 Nov 2007 23:23:01 +0000 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) In-Reply-To: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com> References: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com> Message-ID: <70FC5536-E1B3-41C7-92BC-0B43A0E11E09@gmail.com> Hi Alan, Thanks for the suggestion. That was my first thought, but then I was thinking for amino acids this wouldn't work. I would have to use a hashmap to convert the amino acid to the appropriate single letter code. Hi George, I'll try your suggestion. As you say I think this is too much for something that should be a one liner. Thanks for your advice. Get the tokenization from the alphabet using: SymbolTokenization = Alphabet.getTokenization("token"); Get the token from the tokenization using: String = SymbolTokenization.tokenizeSymbol(Symbol); Thanks to both of you Anthony On 9 Nov 2007, at 18:48, Zagato wrote: > Try with: > String s = alignment.symbolListForLabel( label ).subStr( i, i+1 ); > > Bye... 
> > Alan Jairo Acosta > Cali - Colombia > > On Nov 9, 2007 12:55 PM, Anthony Underwood < email2ants at gmail.com> > wrote: > Hi All, > > I've generated an alignment and I am retrieving positions within the > alignment using > > Symbol base = alignment.symbolAt(label, i); > > I am trying to get whether the base at this position is G, A, T or C > > However when I use base.getName() it returns strings such as "thymine" > > The documentation states that the method getToken should also be > available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html > > Is there a simple way of retrieving a one letter textual > representation of the symbol? > > > Many thanks > > > Anthony > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > Farewell. > http://www.youtube.com/zagatogekko > ruby << __EOF__ > puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse > __EOF__ From hlapp at gmx.net Sat Nov 10 15:38:17 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 10 Nov 2007 15:38:17 -0500 Subject: [Biojava-l] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> Message-ID: <5DDEBCDE-C8DA-4B2C-86F4-47FDB82CADAC@gmx.net> Just a few comments below, specifically where no rows would in fact be what I expect: On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote: > [...] > -------- For you information, I went thru the tables of my BioSQL > database: > [...] > 1) table bioentry: all column populated except for 'taxon_id' which > is NULL > (maybe I need an extra call for populating the 'taxon' table before?) 
Bioperl-db will try to look up (or create if necessary) the taxon from the taxon information attached to the sequence, but for BioPerl we actually recommend pre-loading the database with the NCBI taxonomy, which can be comfortably done with the script load_ncbi_taxonomy.pl that comes with BioSQL.

>
> 2) table bioentry_dbxref: no data inserted (always empty, even with
> BioJava)

This would mean that the sequence(s) have no dbxrefs. Note that for GenBank sequences that would be expected, since unfortunately, and unlike EMBL format, GenBank puts the dbxrefs into the feature table.

> 3) table bioentry_qualifier_value:
>
> One entry only, for the 'term_id' = 149, rank = 1, and value = '07-JUL-2005'
> or other 'DD-MMM-YYYY' dates (see my remarks below)

Below you say that your term table is empty, so I don't know why you can have a value here at all.

> [...]
> 5) table bioentry_relationships: no entry found (always empty, even with
> BioJava)

If you load sequences, they won't have direct relationships to other sequences (except dbxrefs, but those are rather 'pointers' and are stored in their own table). In Bioperl-db, this table is used only if you load sequence clusters through Bio::Cluster objects (such as UniGene).

> [...]
> 7) table comment: no entry found (always empty, even with BioJava)

Again, this is expected with GenBank. AFAIK genbank format doesn't allow for comments at the level of the sequence. You would (i.e., should) find entries here if you load UniProt entries.

> 8) table dbxref: some records are generated, for dbname 'PUBMED' and 'Taxon'
> with the correct value

Taxon obviously isn't really a dbxref, but rather a taxon (and hence should go into that table).

> [...]
> 9) table dbxref_qualifier_value: (always empty, even with BioJava)

That's almost expected. There are rather few cases where dbxrefs have additional attributes that the language can parse out from a source (and then map to the schema).

> [...]
> 10) table location: all locations loaded correctly, note that 'term_id' and
> 'dbxref_id' remain NULL for these seq but I have value for other seq.

Theoretically, the term_id should point to the term giving the type of the location. If you (or Biopython) are only dealing with simple ('normal') locations, then it's not needed. The dbxref_id gives the reference to the remote sequence if the location for a feature refers to a different sequence than the feature itself does (so-called 'remote locations'). If the sequences you loaded don't have such locations (or if Biopython doesn't handle such locations), then this would be expected to be empty.

> 11) table location_qualifier_value: always empty, even with BioJava

This is expected if Biopython doesn't support fuzzy locations, or if none of the feature locations that you loaded are fuzzy.

> [...]
> 13) Table reference: entries correct, note 'dbxref_id' remains NULL for
> these seq but I have value for other seq.

It should point to the pubmed ID for the reference but only if there was one.

> 14) table seqfeature: entries are there (same as in table 'location').
> FYI: 'display_name' is always NULL.

GenBank doesn't give names to features (and I think EMBL doesn't either), so this is expected.

> 15) table seqfeature_dbxref: always empty, even with BioJava

That's likely more to do with your language object model than with anything else. dbxref annotation for features is in tag/value pairs, just as any other, so your language (Biopython in this case) will have to do a lot of interpretation to tease out the semantics behind each tag name and based on that decide what to do with the value. Indeed, by default we don't even do this in BioPerl.

> [...]
> 17) table seqfeature_relationship: always empty, even with BioJava

GenBank (and EMBL) feature tables are flat, not hierarchical, so this is expected.

> 18) table taxon: always empty, even with BioJava)

This is where the organism should go.
> 19) table taxon_name: I have one but not from this test (I tried to > tinker a > little bit with taxon but stopped) That's odd that you can have an entry in taxon_name w/o a corresponding one in taxon. Do you have foreign key checks disabled? > 20) table term: always empty, even with BioJava That's strange, since you say you do have rows in bioentry_qualifier_value, which has an enforced foreign key to term. Did you disable the foreign key checks? > 21) table term_dbxref: always empty, even with BioJava That's expected unless you loaded an ontology whose terms have dbxrefs, and your language object model supports that. > [...] > 23) table term_synonym: always empty, even with BioJava Same as for 21). Your terms would have to have synonyms, and your language object model would have to support those, before you could expect to get anything in here. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From shirleyc at cis.upenn.edu Tue Nov 13 13:45:59 2007 From: shirleyc at cis.upenn.edu (Shirley Cohen) Date: Tue, 13 Nov 2007 13:45:59 -0500 Subject: [Biojava-l] maximum parsimony search Message-ID: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu> Hi BioJava People, I'm looking for existing code that implements a maximum parsimony search in Java. Does BioJava have this functionality? If so, can you point me to the appropriate classes? Thanks, Shirley From bmduggan at yahoo.com Tue Nov 13 19:48:22 2007 From: bmduggan at yahoo.com (Brendan Duggan) Date: Wed, 14 Nov 2007 11:48:22 +1100 (EST) Subject: [Biojava-l] Disulfide information in PDB files Message-ID: <454510.91557.qm@web52705.mail.re2.yahoo.com> Greetings I'm trying to mine some information on disulfides in the PDB and was hoping there might be a way of obtaining this information with the BioJava PDB parser. 
However, I haven't been able to see anything like this mentioned in the API docs. If it is currently not possible to extract disulfide information from PDB files, are there any plans to implement this?

Thanks!

Brendan

From holland at ebi.ac.uk Wed Nov 14 03:50:31 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 14 Nov 2007 08:50:31 +0000
Subject: [Biojava-l] maximum parsimony search
In-Reply-To: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>
References: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>
Message-ID: <473AB6D7.2010405@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

There is a class currently only available from the head of CVS, i.e. it has not been released yet. To get it you'll need to check out the very latest BioJava source code from CVS. The JavaDoc for the class is here:

http://www.spice-3d.org/public-files/javadoc/biojava/org/biojavax/bio/phylo/ParsimonyTreeMethod.html

It is designed to take input in the form of blocks of data similar to what you would find in a Nexus file (the Nexus file parsers elsewhere in the org/biojavax/bio/phylo package will provide these). However, you could of course create such objects from your own data without needing to read/write any Nexus files.

cheers,
Richard

Shirley Cohen wrote:
> Hi BioJava People,
>
> I'm looking for existing code that implements a maximum parsimony
> search in Java. Does BioJava have this functionality? If so, can you
> point me to the appropriate classes?
> > Thanks, > > Shirley > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHOrbW4C5LeMEKA/QRAuswAJ9olIwj7DGszOnKORU255YS3m2ohACfbKTw ihjuQVv0j+nlXb+4SL5pIfw= =ldfM -----END PGP SIGNATURE----- From holland at ebi.ac.uk Wed Nov 14 03:55:24 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 14 Nov 2007 08:55:24 +0000 Subject: [Biojava-l] Disulfide information in PDB files In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com> References: <454510.91557.qm@web52705.mail.re2.yahoo.com> Message-ID: <473AB7FC.10403@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Currently this is not parsed - the parser does not read all the tags in the most recent PDB specification. Could you open a bug request at http://bugzilla.open-bio.org/ to formally add this to our to-do list? Thanks! cheers, Richard Brendan Duggan wrote: > Greetings > > I'm trying to mine some information on disulfides in > the PDB and was hoping there might be a way of > obtaining this information with the BioJava PDB > parser. However, I haven't been able to see anything > like this mentioned in the API docs. If it is > currently not possible to extract disulfide > information from PDB files are there any plans to > implement this? > > Thanks! > > Brendan > > > Make the switch to the world's best email. Get the new Yahoo!7 Mail now. 
http://au.yahoo.com/worldsbestmail/viagra/index.html > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHOrf84C5LeMEKA/QRArfeAJ9nCViM2jyVfubIpl5w/1EXMYTv/gCgjVEs zDnxHjv8xJsRBw5pfE2NdkA= =tGqm -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Wed Nov 14 04:32:28 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Wed, 14 Nov 2007 09:32:28 +0000 Subject: [Biojava-l] Disulfide information in PDB files In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com> References: <454510.91557.qm@web52705.mail.re2.yahoo.com> Message-ID: <9B898ADF-78EB-4B5C-A432-98274190815F@sanger.ac.uk> Hi Brendan, SSBOND lines are currently not parsed. If this is what you need, I can add this over the next couple of days. If you want to compute the bonds yourself, the framework can e.g. calculate distances between the sulphur atoms for you. - Andreas On 14 Nov 2007, at 00:48, Brendan Duggan wrote: > Greetings > > I'm trying to mine some information on disulfides in > the PDB and was hoping there might be a way of > obtaining this information with the BioJava PDB > parser. However, I haven't been able to see anything > like this mentioned in the API docs. If it is > currently not possible to extract disulfide > information from PDB files are there any plans to > implement this? > > Thanks! > > Brendan > > > Make the switch to the world's best email. Get the new Yahoo! > 7 Mail now. 
http://au.yahoo.com/worldsbestmail/viagra/index.html
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-----------------------------------------------------------------------

Andreas Prlic Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

-----------------------------------------------------------------------

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From deb at mb.au.dk Thu Nov 15 07:04:02 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Thu, 15 Nov 2007 13:04:02 +0100
Subject: [Biojava-l] Parsing exising gaps
Message-ID: <002701c8277f$9dbdca50$d9395ef0$@au.dk>

Dear all,

I have managed to read an MSF-formatted alignment from a file selected through FileChooser as follows:

  BufferedReader br = new BufferedReader(new FileReader(aFileChooser.getSelectedFile()));
  SimpleAlignment align = (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

I can now retrieve the sequence names and sequences through the Alignment object:

  Iterator aLabels = align.getLabels().iterator();
  Iterator aSequences = align.symbolListIterator();

However, I now want to be able to translate between real sequence numbers and the positions within each alignment string, i.e. retrieve positions that remove the gaps first (gaps are represented by hyphens '-' in the MSF format). How can I tell BioJava to parse the gaps into a GappedSequence format? I have tried the following to check what position 15 (past the first gap) translates into:

  int n = 0;
  while(aSequences.hasNext()) {
    SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
    SimpleGappedSequence aGapped = new SimpleGappedSequence(new SimpleSequence(aSym, "", aLabels.next().toString(), null));
    System.out.println(aGapped.gappedToLocation(new PointLocation(15)));
  }

But I only get 15 back out. I have also studied the constructor of the underlying SimpleGappedSymbolList but it simply copies the SymbolList and creates one big block:

  public SimpleGappedSymbolList(SymbolList source) {
    this.source = source;
    this.alpha = source.getAlphabet();
    this.blocks = new ArrayList();
    this.length = source.length();
    Block b = new Block(1, length, 1, length);
    blocks.add(b);
  }

Is there a way to tell SimpleGappedSequence to parse itself in terms of the gap characters in the sequence string? How is the sequence represented in this case, if not by gaps? Surely the hyphen cannot be a part of the standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?

Best wishes,

Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

From holland at ebi.ac.uk Thu Nov 15 08:51:48 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 15 Nov 2007 13:51:48 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
Message-ID: <473C4EF4.5080301@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think you've uncovered a number of problems here:

1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do all the other predefined alphabets.

2.
The MSF parser doesn't bother trying to build GappedSequence instances, instead it just builds solid sequences with the gaps as normal symbols. 3. There is no constructor or method for taking a sequence with embedded gap symbols and turning it into a GappedSequence with separate chunks. Combined, these three problems make it impossible to do what you want easily. I will make a note to fix this on the plans for the next BioJava development cycle. In the meantime, your best bet would be to construct a second alignment block by iterating over the alignment block you already have and parsing the locations of the gap symbols. You would create a SimpleGappedSequence intially over the ungapped sequence, then use the insert gap methods to insert the gaps into this ungapped sequence before putting all the SimpleGappedSequence objects together into a new alignment. cheers, Richard Ditlev Egeskov Brodersen wrote: > Dear all, > > > > I have managed to read an MSF-formatted alignment from a file selected > through FileChooser as follows: > > > > BufferedReader br = new BufferedReader(new > FileReader(aFileChooser.getSelectedFile())); > > SimpleAlignment align = > (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br); > > > > I can now retrieve the sequence names and sequences through the Alignment > object: > > > > Iterator aLabels = align.getLabels().iterator(); > > Iterator aSequences = align.symbolListIterator(); > > > > However, I now what to be able to translate between real sequence numbers > and the positions within each alignment string, i.e. retrieve positions that > remove the gaps first (gaps are represented by hyphens '-' in the MSF > format). How can I tell BioJava to parse the gaps into an GappedSequence > format? 
I have tried the following to check what position 15 (past the the > first gap) translates into: > > > > int n = 0; > > while(aSequences.hasNext()) { > > SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next(); > > SimpleGappedSequence aGapped = new SimpleGappedSequence(new > SimpleSequence(aSym, "", aLabels.next().toString(), null)); > > System.out.println(aGapped.gappedToLocation(new PointLocation(15))); > > } > > > > But I only get 15 back out. I have also studied the constructor of the > underlying SimpleGappedSymbolList but it simply copies the SymbolList and > creates one big block: > > > > public SimpleGappedSymbolList(SymbolList source) { > > this.source = source; > > this.alpha = source.getAlphabet(); > > this.blocks = new ArrayList(); > > this.length = source.length(); > > Block b = new Block(1, length, 1, length); > > blocks.add(b); > > } > > > > Is there a way to tell SimpleGappedSequence to parse itself in terms of the > gap characters in the sequence string? How is the sequence represented in > this case, if not by gaps? Surely the hyphen cannot be a part of the > standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it? > > > > Best wishes, > > > > Ditlev > > > > -- > > > > Ditlev E. Brodersen, Ph.D. 
> Lektor, Associate Professor > > > > Department of Molecular Biology Office: +45 89425259 > University of AarhusLab: +45 89425022 > Gustav Wieds Vej 10cFax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHPE704C5LeMEKA/QRAniIAJsGv+5HIP3mCDxBIUdw0SjDrWu8dgCeNviA EsJK4gv+EVY7wc4r6W2A0+I= =wCQs -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Nov 16 03:59:41 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 16 Nov 2007 08:59:41 +0000 Subject: [Biojava-l] Parsing exising gaps In-Reply-To: <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> Message-ID: <473D5BFD.8080305@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Ditlev. After some investigation and some helpful hints from Mark, it turns out that there are methods in DNATools/ProteinTools that can construct proper GappedSymbolList objects out of strings. I have managed to modify the MSF parser to use this instead. This means that the MSF parser will now return instances of GappedSymbolList (actually GappedSequences to be accurate) rather than SimpleSymbolList. Thanks to the way the APIs work this will make no difference to existing users (except those who are depending on being able to cast it to a certain type - which they shouldn't, because the API doesn't guarantee it to be of any type!), but it will fix it for you. 
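
As a library-independent illustration of the coordinate translation discussed in this thread (alignment column to residue number in the ungapped sequence), here is a minimal sketch in plain Java. It assumes '-' is the gap character, as in MSF. The class and method names (GapMapper, stripGaps, columnToResidue) are hypothetical and are not part of the BioJava API, and the alignment row is made up for the example.

```java
// Library-independent sketch; GapMapper and its methods are hypothetical
// names for illustration only and are not part of BioJava.
final class GapMapper {

    // Returns the sequence with every gap character removed.
    static String stripGaps(String gapped, char gap) {
        StringBuilder plain = new StringBuilder();
        for (int i = 0; i < gapped.length(); i++) {
            char c = gapped.charAt(i);
            if (c != gap) {
                plain.append(c);
            }
        }
        return plain.toString();
    }

    // Maps each 1-based alignment column to its 1-based residue number in
    // the ungapped sequence. Gap columns map to 0. Index 0 is unused.
    static int[] columnToResidue(String gapped, char gap) {
        int[] map = new int[gapped.length() + 1];
        int residue = 0;
        for (int col = 1; col <= gapped.length(); col++) {
            if (gapped.charAt(col - 1) != gap) {
                residue++;
                map[col] = residue;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        String row = "MK--LV-A"; // made-up alignment row
        int[] map = columnToResidue(row, '-');
        System.out.println(stripGaps(row, '-'));
        for (int col = 1; col <= row.length(); col++) {
            System.out.println("column " + col + " -> residue " + map[col]);
        }
    }
}
```

The 1-based indexing mirrors the convention used by symbolAt(i) in the snippets quoted in this thread; a gapped sequence object in the library would answer the same question without the string round trip.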
Future releases will modify the API (or include a completely new MSF parser) which will explicitly return GappedSymbolLists in the API declarations rather than plain SymbolLists, but I can't do that right now because it would break existing users' code.

To get the modified parser you will need to check out the very latest source code from our CVS repository and compile it using ant. Instructions are on our website at biojava.org if you have not done this before.

Hope this helps you.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi Richard,
>
> thanks for clarifying this and for your useful suggestion, which I've
> managed to implement as shown below. It works nicely, but I was really
> surprised to learn that biojava hasn't yet implemented a proper parsing of
> gap characters from strings into the object structure as this seems central
> to any use of pre-aligned sequences. Also, I find it problematic that the
> API implements the gap characters as part of the alphabets. In my view, this
> breaks the logic of the object model because proteins don't really have gaps
> in their sequences.
>
> Rather, the constructor of the Sequence-derived classes ought to throw an
> exception when non-protein characters are passed and should not allow the
> user to create an object with sequence elements that are non-standard.
> Instead, I think there should be a static method that allows cleaning the
> input sequence before passing it to the Sequence constructor. On the other
> hand, the constructor of the GappedSequence-derived classes should recognise
> the gaps and create an object with blocks of legal protein symbols and gaps
> in the appropriate places.
>
> -- Ditlev
>
> // Read MSF file into Alignment object
> BufferedReader br = new BufferedReader(new FileReader(aFileChooser.getSelectedFile()));
> SimpleAlignment align = (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
>
> // Iterate through sequences in turn
> Iterator aSequences = align.symbolListIterator();
> while(aSequences.hasNext()) {
>
>   // Retrieve SymbolList, the associated gap symbol and sequence string
>   SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>   Symbol aGapSymbol = aSym.getAlphabet().getGapSymbol();
>   String aGappedString = aSym.seqString();
>
>   // Prepare non-gapped string
>   String aPlainString = "";
>
>   // Loop through individual symbols and add non-gap characters to string
>   for(int i=1;i<=aSym.length();i++)
>     if(aSym.symbolAt(i) != aGapSymbol)
>       aPlainString += aGappedString.charAt(i-1);
>
>   // Create a new gapped sequence object with the plain (non-gapped) sequence
>   SimpleGappedSequence aGapped = (SimpleGappedSequence)ProteinTools.createGappedProteinSequence(aPlainString, "");
>
>   // Use separate indices for gapped and plain sequences
>   int n = 1;
>
>   // Loop through individual gapped sequence symbols and insert gap into
>   // object when gap symbol is encountered
>   for(int i=1;i<=aSym.length();i++)
>     if(aSym.symbolAt(i) != aGapSymbol)
>       n++;
>     else
>       aGapped.addGapInSource(n);
> }
>
> --
>
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>
> www.lindeman-brodersen.dk
>
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 15 November 2007 14:52
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: [Biojava-l] Parsing exising gaps
>>
> I think you've uncovered a number of problems here:
>
> 1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
> all the other predefined alphabets.
>
> 2.
The MSF parser doesn't bother trying to build GappedSequence > instances, instead it just builds solid sequences with the gaps as > normal symbols. > > 3. There is no constructor or method for taking a sequence with > embedded > gap symbols and turning it into a GappedSequence with separate chunks. > > Combined, these three problems make it impossible to do what you want > easily. I will make a note to fix this on the plans for the next > BioJava > development cycle. > > In the meantime, your best bet would be to construct a second alignment > block by iterating over the alignment block you already have and > parsing > the locations of the gap symbols. You would create a > SimpleGappedSequence intially over the ungapped sequence, then use the > insert gap methods to insert the gaps into this ungapped sequence > before > putting all the SimpleGappedSequence objects together into a new > alignment. > > cheers, > Richard > > Ditlev Egeskov Brodersen wrote: >>>> Dear all, >>>> >>>> >>>> >>>> I have managed to read an MSF-formatted alignment from a file > selected >>>> through FileChooser as follows: >>>> >>>> >>>> >>>> BufferedReader br = new BufferedReader(new >>>> FileReader(aFileChooser.getSelectedFile())); >>>> >>>> SimpleAlignment align = >>>> (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, > br); >>>> >>>> >>>> I can now retrieve the sequence names and sequences through the > Alignment >>>> object: >>>> >>>> >>>> >>>> Iterator aLabels = align.getLabels().iterator(); >>>> >>>> Iterator aSequences = align.symbolListIterator(); >>>> >>>> >>>> >>>> However, I now what to be able to translate between real sequence > numbers >>>> and the positions within each alignment string, i.e. retrieve > positions that >>>> remove the gaps first (gaps are represented by hyphens '-' in the MSF >>>> format). How can I tell BioJava to parse the gaps into an > GappedSequence >>>> format? 
I have tried the following to check what position 15 (past > the the >>>> first gap) translates into: >>>> >>>> >>>> >>>> int n = 0; >>>> >>>> while(aSequences.hasNext()) { >>>> >>>> SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next(); >>>> >>>> SimpleGappedSequence aGapped = new SimpleGappedSequence(new >>>> SimpleSequence(aSym, "", aLabels.next().toString(), null)); >>>> >>>> System.out.println(aGapped.gappedToLocation(new > PointLocation(15))); >>>> } >>>> >>>> >>>> >>>> But I only get 15 back out. I have also studied the constructor of > the >>>> underlying SimpleGappedSymbolList but it simply copies the SymbolList > and >>>> creates one big block: >>>> >>>> >>>> >>>> public SimpleGappedSymbolList(SymbolList source) { >>>> >>>> this.source = source; >>>> >>>> this.alpha = source.getAlphabet(); >>>> >>>> this.blocks = new ArrayList(); >>>> >>>> this.length = source.length(); >>>> >>>> Block b = new Block(1, length, 1, length); >>>> >>>> blocks.add(b); >>>> >>>> } >>>> >>>> >>>> >>>> Is there a way to tell SimpleGappedSequence to parse itself in terms > of the >>>> gap characters in the sequence string? How is the sequence > represented in >>>> this case, if not by gaps? Surely the hyphen cannot be a part of the >>>> standard PROTEIN-TERM alphabet, yet I get no complaints for the use > of it? >>>> >>>> >>>> Best wishes, >>>> >>>> >>>> >>>> Ditlev >>>> >>>> >>>> >>>> -- >>>> >>>> >>>> >>>> Ditlev E. Brodersen, Ph.D. 
>>>> Lektor, Associate Professor
>>>>
>>>> Department of Molecular Biology   Office:  +45 89425259
>>>> University of Aarhus              Lab:     +45 89425022
>>>> Gustav Wieds Vej 10c              Fax:     +45 86123178
>>>> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>>>> Denmark                           Lab WWW: www.bioxray.dk/~deb
>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPVv84C5LeMEKA/QRAn0cAJ9jJUaA3bjiEwlzxaAo/bsN5+CT1QCcCLxS
Rv73CVmtYpEz+apJwM1L3sA=
=UPU6
-----END PGP SIGNATURE-----

From deb at mb.au.dk Fri Nov 16 04:28:40 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 10:28:40 +0100
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
    <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
    <473D5BFD.8080305@ebi.ac.uk>
Message-ID: <000601c82833$143c5300$3cb4f900$@au.dk>

Hi Richard,

thanks for your super fast reply. I managed to recompile using CVS/ant and
the MSF import now works brilliantly and simply as follows:

    BufferedReader br = new BufferedReader(new
        FileReader(aFileChooser.getSelectedFile()));
    SimpleAlignment align =
        (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

    // Iterate through sequences in turn
    Iterator aSequences = align.symbolListIterator();
    while(aSequences.hasNext()) {
        // Retrieve gapped sequence
        SimpleGappedSequence aGapped = (SimpleGappedSequence)aSequences.next();
        // ...do whatever with each gapped sequence
    }

The returned gapped sequences are all properly set up with gaps, name etc.
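
For anyone following the thread without a BioJava checkout: what a gapped sequence buys you is the translation between alignment columns and real residue numbers that the original question asked about. The following is only a rough sketch of that mapping on a plain string with '-' as the gap character. GapIndex and both of its methods are made-up names for illustration, not BioJava API, and the real classes keep blocks rather than rescanning the string.

```java
// Illustrative stand-in only; GapIndex is NOT a BioJava class.
final class GapIndex {
    private GapIndex() {}

    /**
     * Maps a 1-based alignment column to the 1-based residue number in the
     * ungapped sequence, or returns -1 if that column holds a gap.
     */
    static int gappedToSource(String aligned, int column) {
        if (column < 1 || column > aligned.length()) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        if (aligned.charAt(column - 1) == '-') {
            return -1;
        }
        // Count the non-gap characters up to and including the column.
        int residues = 0;
        for (int i = 0; i < column; i++) {
            if (aligned.charAt(i) != '-') {
                residues++;
            }
        }
        return residues;
    }

    /** Inverse mapping: 1-based residue number to 1-based alignment column. */
    static int sourceToGapped(String aligned, int residue) {
        int seen = 0;
        for (int i = 0; i < aligned.length(); i++) {
            if (aligned.charAt(i) != '-' && ++seen == residue) {
                return i + 1;
            }
        }
        throw new IndexOutOfBoundsException("residue " + residue);
    }

    public static void main(String[] args) {
        String row = "MK--LVT-A";
        System.out.println(gappedToSource(row, 5));
        System.out.println(sourceToGapped(row, 3));
    }
}
```

Positions are 1-based here to match the symbolAt(i) loops earlier in the thread. A gap column is reported as -1 purely as a convention of this sketch.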
But as for other users, I think there may be some problems. Since the
SimpleAlignment object only has a general symbol list iterator, the user
has to cast in every statement that extracts a sequence object, and

    SimpleSequence aSimple = (SimpleSequence)aSequences.next();

now throws a ClassCastException at run time. So old code might not run with
the update as far as I can see.

Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

> -----Original Message-----
> From: Richard Holland [mailto:holland at ebi.ac.uk]
> Sent: 16 November 2007 10:00
> To: Ditlev Egeskov Brodersen
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] Parsing exising gaps
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Ditlev.
>
> After some investigation and some helpful hints from Mark, it turns out
> that there are methods in DNATools/ProteinTools that can construct
> proper GappedSymbolList objects out of strings.
>
> I have managed to modify the MSF parser to use this instead. This means
> that the MSF parser will now return instances of GappedSymbolList
> (actually GappedSequences to be accurate) rather than SimpleSymbolList.
> Thanks to the way the APIs work this will make no difference to existing
> users (except those who are depending on being able to cast it to a
> certain type - which they shouldn't, because the API doesn't guarantee
> it to be of any type!), but it will fix it for you. Future releases will
> modify the API (or include a completely new MSF parser) which will
> explicitly return GappedSymbolLists in the API declarations rather than
> plain SymbolLists, but I can't do that right now because it would break
> existing users' code.

From holland at ebi.ac.uk Fri Nov 16 04:49:35 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 09:49:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <000601c82833$143c5300$3cb4f900$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
    <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
    <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
Message-ID: <473D67AF.2020007@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The returned gapped sequences are all properly set up with gaps, name etc.
> But as for other users, I think there may be some problems, since the
> SimpleAlignment object only has a general symbol list iterator, the user
> will have to cast each statement extracting a sequence object, and
>
> SimpleSequence aSimple = (SimpleSequence)aSequences.next();
>
> returns an ClassCastException at run time. So old code might not run with
> the update as far as I can see.

This is true. However, such code would be unsupported by us as the API
clearly states that SimpleAlignment returns SymbolList instances, and
does not make any guarantees about the exact implementation details of
the objects it returns.
To attempt to cast it to anything other than SymbolList would be a
mistake! (Although actually it is now returning a guarantee of
GappedSymbolList, which is what your code can now take advantage of). To
assume it will return SimpleSequence is outside the behaviour defined by
the API and therefore should not be relied upon.

A more correct behaviour would be to test each item returned:

    // next() on a raw Iterator returns Object, so cast to SymbolList first
    SymbolList symlist = (SymbolList)aSequences.next();
    if (symlist instanceof SimpleSequence) {
        SimpleSequence seq = (SimpleSequence)symlist;
        // Do simple-sequence stuff
    } else {
        // Do something else!
    }

In future, I will modify the API to change the SymbolList guarantee to a
GappedSymbolList guarantee, but I can't do this right now as this really
would break everyone's code!

We are currently planning a redesign as you may be aware, so issues like
this will hopefully be resolved as part of that process. For a start, if
we use Java 5 generics in future as we plan, we can strictly specify
what kinds of objects will be returned by things such as the alignment
API, making it easier for us to enforce API-compliant behaviour in
users' code.

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPWev4C5LeMEKA/QRAvTOAJ9tqdBGWangZ9YQPpEDJ4WWBP/vjQCdHlMB
ITj7O/foDly4aOT4SV1Jb+k=
=g7Vs
-----END PGP SIGNATURE-----

From deb at mb.au.dk Fri Nov 16 05:11:15 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 11:11:15 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D67AF.2020007@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
    <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
    <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
    <473D67AF.2020007@ebi.ac.uk>
Message-ID: <000f01c82839$06722550$13566ff0$@au.dk>

Hi again,

thanks for the info - will do the check just to be proper.
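
In a form that compiles on its own, that kind of check might look roughly like the sketch below. Everything in it is a stand-in: the two interfaces only borrow the names SymbolList and GappedSymbolList from the thread, and PlainList, GappedList, gapCount() and describe() are hypothetical helpers, not BioJava classes or methods. It also uses a typed iterator, which is one way the Java 5 generics mentioned above could take the cast from Object out of user code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in types for illustration only; these are NOT the BioJava classes.
interface SymbolList {
    String seqString();
}

interface GappedSymbolList extends SymbolList {
    int gapCount(); // hypothetical helper, not a BioJava method
}

final class PlainList implements SymbolList {
    private final String seq;
    PlainList(String seq) { this.seq = seq; }
    public String seqString() { return seq; }
}

final class GappedList implements GappedSymbolList {
    private final String seq;
    GappedList(String seq) { this.seq = seq; }
    public String seqString() { return seq; }
    public int gapCount() {
        int gaps = 0;
        for (int i = 0; i < seq.length(); i++) {
            if (seq.charAt(i) == '-') {
                gaps++;
            }
        }
        return gaps;
    }
}

final class CheckedIteration {
    private CheckedIteration() {}

    /** Uses the richer interface only after testing for it. */
    static String describe(SymbolList item) {
        if (item instanceof GappedSymbolList) {
            GappedSymbolList gapped = (GappedSymbolList) item;
            return item.seqString() + " (gaps: " + gapped.gapCount() + ")";
        }
        return item.seqString() + " (no gap information)";
    }

    public static void main(String[] args) {
        List<SymbolList> rows = Arrays.<SymbolList>asList(
                new GappedList("MK--LV"), new PlainList("MKLV"));
        // The typed iterator hands back SymbolList, so no cast from Object.
        for (Iterator<SymbolList> it = rows.iterator(); it.hasNext();) {
            System.out.println(describe(it.next()));
        }
    }
}
```

The point of the sketch is only the shape of the test: code written against the declared type keeps working whichever implementation the parser happens to return.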
I have another question: In my application, I would like to wrap the retrieved
SimpleGappedSequence objects inside another object that extends the
functionality with application-specific stuff. Ideally, I would do this by
extending the SimpleGappedSequence object and creating it by passing the
SimpleGappedSequence from the alignment import to the constructor of the
parent, like so:

    class AlignedSequence extends SimpleGappedSequence {
        public AlignedSequence(SimpleGappedSequence aGapped) {
            super(aGapped);
        }

        // ...custom stuff...
    }

However, the problem is that there is only one constructor for the
SimpleGappedSequence, one which takes a simple Sequence object. I can pass
the derived class alright, but all gap information is lost again, presumably
because the SimpleGappedSequence constructor just takes out the seqString()
and puts it into its own sequence object.

Shouldn't the constructor of the SimpleGappedSequence class recognise when a
derived (and gapped) sequence object is passed, and process it accordingly?

As it stands, I am forced to include the SimpleGappedSequence as a private
member of the AlignedSequence class, which is not nearly as nice since all
statements using the class will have to do something like

    // Wrapper version: the gapped sequence is held as a member, so the
    // class no longer extends SimpleGappedSequence
    class AlignedSequence {
        private SimpleGappedSequence gapped_sequence;

        public AlignedSequence(SimpleGappedSequence aGapped) {
            gapped_sequence = aGapped;
        }

        public SimpleGappedSequence getGappedSequence() {
            return gapped_sequence;
        }

        // ...custom stuff...
    }

    ...

    AlignedSequence aAligned = new AlignedSequence(aGapped);
    aAligned.getGappedSequence().seqString();

rather than simply:

    AlignedSequence aAligned = new AlignedSequence(aGapped);
    aAligned.seqString();

In other words, is there any solution with the current setup that would
allow me to extend SimpleGappedSequence and not lose the gap information?

-- Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

From ap3 at sanger.ac.uk Fri Nov 16 04:51:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Nov 2007 09:51:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
    <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
    <473D5BFD.8080305@ebi.ac.uk>
Message-ID:

> To get the modified parser you will need to check out the very latest
> source code from our CVS repository and compile it using ant.
> Instructions are on our website at biojava.org if you have not done this
> before.

Alternatively you could get the automatically built biojava.jar from

http://www.spice-3d.org/cruise/

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

-----------------------------------------------------------------------

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From holland at ebi.ac.uk Fri Nov 16 05:46:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 10:46:57 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000f01c82839$06722550$13566ff0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
    <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
    <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
    <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.dk>
Message-ID: <473D7521.9070603@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The easiest way is simply for me to alter the constructor to
SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to
copy all gaps if passed another instance of GappedSymbolList as the
parameter. I've just done this in CVS so you should be able to update
your copy and observe the new behaviour.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi again,
>
> thanks for the info - will do the check just to be proper. I have another
> question: In my application, I would like to wrap the retrieved
> SimpleGappedSequence objects inside another object that extends the
> functionality with application-specific stuff.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPXUh4C5LeMEKA/QRAsbqAKCnpCRnIiztjZ69fE2/UaJuI9QjiACfYa0m
8EJTzWZYOyjp9VhmvsgvmNA=
=1uaB
-----END PGP SIGNATURE-----

From deb at mb.au.dk Fri Nov 16 07:39:23 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 13:39:23 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D7521.9070603@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
    <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
    <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
    <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
    <473D7521.9070603@ebi.ac.uk>
Message-ID: <001801c8284d$b8c525e0$2a4f71a0$@au.dk>

Hi again,

I updated CVS and got the new SimpleGappedSymbolList class, but there seem
to be no changes to the SimpleGappedSequence class, which is the one I need
to extend... have I missed something?

Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb
> > > > As it stands, I am forced to include the SimpleGappedSequence as a > private > > member of the AlignedSequence class, which is not near as nice since > all > > statement using the class will have to do something like > > > > class AlignedSequence extends SimpleGappedSequence { > > private SimpleGappedSequence gapped_sequence; > > > > public AlignedSequence(SimpleGappedSequence aGapped) { > > gapped_sequence = aGapped; > > } > > > > public SimpleGappedSequence getGappedSequence() { > > return(gapped_sequence); > > } > > > > ..custom stuff.. > > } > > > > ... > > > > AlignedSequence aAligned = new AlignedSequence(aGapped); > > aAligned.getGappedSequence().seqString(); > > > > rather than simply: > > > > AlignedSequence aAligned = new AlignedSequence(aGapped); > > aAligned.seqString(); > > > > In other words, is there any solution with the current setup that > would > > allow me to extend SimpleGappedSequence and not loose the gap > information? > > > > -- Ditlev > > > > -- > > > > Ditlev E. Brodersen, Ph.D. > > Lektor, Associate Professor > > > > Department of Molecular Biology Office: +45 89425259 > > University of Aarhus Lab: +45 89425022 > > Gustav Wieds Vej 10c Fax: +45 86123178 > > DK-8000 Aarhus C Email: deb at mb.au.dk > > Denmark Lab WWW: www.bioxray.dk/~deb > > > > > >> -----Original Message----- > >> From: Richard Holland [mailto:holland at ebi.ac.uk] > >> Sent: 16 November 2007 10:50 > >> To: Ditlev Egeskov Brodersen > >> Cc: biojava-l at biojava.org > >> Subject: Re: [Biojava-l] Parsing exising gaps > >> > >>>> The returned gapped sequences are all properly set up with gaps, > > name etc. 
> >>>> But as for other users, I think there may be some problems, since > the > >>>> SimpleAlignment object only has a general symbol list iterator, > the > > user > >>>> will have to cast each statement extracting a sequence object, and > >>>> > >>>> SimpleSequence aSimple = (SimpleSequence)aSequences.next(); > >>>> > >>>> returns an ClassCastException at run time. So old code might not > run > > with > >>>> the update as far as I can see. > > This is true. However, such code would be unsupported by us as the > API > > clearly states that SimpleAlignment returns SymbolList instances, and > > does not make any guarantees about the exact implementation details > of > > the objects it returns. To attempt to cast it to anything other than > > SymbolList would be a mistake! (Although actually it is now returning > a > > guarantee of GappedSymbolList, which is what your code can now take > > advantage of). To assume it will return SimpleSequence is outside the > > behaviour defined by the API and therefore should not be relied upon. > > > > A more correct behaviour would be to test each item returned: > > > > SymbolList symlist = aSequences.next(); > > if (symlist instanceof SimpleSequence) { > > SimpleSequence seq = (SimpleSequence)symlist; > > // Do simple-sequence stuff > > } else { > > // Do something else! > > } > > > > In future, I will modify the API to change the SymbolList guarantee > to > > a > > GappedSymbolList guarantee, but I can't do this right now as this > > really > > would break everyone's code! > > > > We are currently planning a redesign as you may be aware, so issues > > like > > this will hopefully be resolved as part of that process. For a start, > > if > > we use Java 5 generics in future as we plan, we can strictly specify > > what kinds of objects will be returned by things such as the > alignment > > API, making it easier for us to enforce API-compliant behaviour in > > user's code. 
> > > > cheers, > > Richard > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHPXUh4C5LeMEKA/QRAsbqAKCnpCRnIiztjZ69fE2/UaJuI9QjiACfYa0m > 8EJTzWZYOyjp9VhmvsgvmNA= > =1uaB > -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Nov 16 07:46:23 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 16 Nov 2007 12:46:23 +0000 Subject: [Biojava-l] Wrapping SimpleGappedSequence In-Reply-To: <001801c8284d$b8c525e0$2a4f71a0$@au.dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> Message-ID: <473D911F.2000303@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor delegates to the SimpleGappedSymbolList constructor. When you extend SimpleGappedSequence you should delegate in your new constructor to the existing SimpleGappedSequence constructor, which in turn will delegate as above and preserve the gaps. By passing any object which implements GappedSymbolList to the SimpleGappedSequence constructor, e.g. SimpleGappedSequence or SimpleGappedSymbolList, it will automatically choose the new constructor from SimpleGappedSymbolList which you hopefully should be able to see in the code you have just checked out. If passed any other non-GappedSymbolList object, it will use the old constructor that already existed from before. cheers, Richard Ditlev Egeskov Brodersen wrote: > Hi again, > > I updated CVS and got the new SimpleGappedSymbolList class, but there > seems to be no changes to the SimpleGappedSequence class, which is the one I > need to extend...have I missed something? > > Ditlev > > -- > > Ditlev E. Brodersen, Ph.D. 
> Lektor, Associate Professor > > Department of Molecular Biology Office: +45 89425259 > University of Aarhus Lab: +45 89425022 > Gustav Wieds Vej 10c Fax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > >> -----Original Message----- >> From: Richard Holland [mailto:holland at ebi.ac.uk] >> Sent: 16 November 2007 11:47 >> To: Ditlev Egeskov Brodersen >> Cc: biojava-l at biojava.org >> Subject: Re: Wrapping SimpleGappedSequence >> > The easiest way is simply for me to alter the constructor to > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to > copy all gaps if passed another instance of GappedSymbolList as the > parameter. I've just done this in CVS so you should be able to update > your copy and observe the new behaviour. > > cheers, > Richard > > Ditlev Egeskov Brodersen wrote: >>>> Hi again, >>>> >>>> thanks for the info - will do the check just to be proper. I have > another >>>> question: In my application, I would like to wrap the retrieved >>>> SimpleGappedSequence objects inside another object that extends the >>>> functionality with application-specific stuff. Ideally, I would do > this by >>>> extending the SimpleGappedSequence object and create it by passing > the >>>> SimpleGappedSequence from the alignment import to the constructor of > the >>>> parent, like so: >>>> >>>> class AlignedSequence extends SimpleGappedSequence { >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >>>> super(aGapped); >>>> } >>>> >>>> ..custom stuff.. >>>> } >>>> >>>> However, the problem is that there is only one constructor for the >>>> SimpleGappedSequence, one which takes a simple Sequence object. I can > pass >>>> the derived class alright, but all gap information is lost again, > presumably >>>> because the SimpleGappedSequence constructor just takes out the > seqString() >>>> and puts it into its own sequence object. 
>>>> >>>> Shouldn't the constructor of the SimpleGappedSequence class recognise > when a >>>> derived (and gapped) sequence object is passed, and process it > accordingly? >>>> As it stands, I am forced to include the SimpleGappedSequence as a > private >>>> member of the AlignedSequence class, which is not near as nice since > all >>>> statement using the class will have to do something like >>>> >>>> class AlignedSequence extends SimpleGappedSequence { >>>> private SimpleGappedSequence gapped_sequence; >>>> >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >>>> gapped_sequence = aGapped; >>>> } >>>> >>>> public SimpleGappedSequence getGappedSequence() { >>>> return(gapped_sequence); >>>> } >>>> >>>> ..custom stuff.. >>>> } >>>> >>>> ... >>>> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >>>> aAligned.getGappedSequence().seqString(); >>>> >>>> rather than simply: >>>> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >>>> aAligned.seqString(); >>>> >>>> In other words, is there any solution with the current setup that > would >>>> allow me to extend SimpleGappedSequence and not loose the gap > information? >>>> -- Ditlev >>>> >>>> -- >>>> >>>> Ditlev E. Brodersen, Ph.D. >>>> Lektor, Associate Professor >>>> >>>> Department of Molecular Biology Office: +45 89425259 >>>> University of Aarhus Lab: +45 89425022 >>>> Gustav Wieds Vej 10c Fax: +45 86123178 >>>> DK-8000 Aarhus C Email: deb at mb.au.dk >>>> Denmark Lab WWW: www.bioxray.dk/~deb >>>> >>>> >>>>> -----Original Message----- >>>>> From: Richard Holland [mailto:holland at ebi.ac.uk] >>>>> Sent: 16 November 2007 10:50 >>>>> To: Ditlev Egeskov Brodersen >>>>> Cc: biojava-l at biojava.org >>>>> Subject: Re: [Biojava-l] Parsing exising gaps >>>>> >>>>>>> The returned gapped sequences are all properly set up with gaps, >>>> name etc. 
>>>>>>> But as for other users, I think there may be some problems, since > the >>>>>>> SimpleAlignment object only has a general symbol list iterator, > the >>>> user >>>>>>> will have to cast each statement extracting a sequence object, and >>>>>>> >>>>>>> SimpleSequence aSimple = (SimpleSequence)aSequences.next(); >>>>>>> >>>>>>> returns an ClassCastException at run time. So old code might not > run >>>> with >>>>>>> the update as far as I can see. >>>> This is true. However, such code would be unsupported by us as the > API >>>> clearly states that SimpleAlignment returns SymbolList instances, and >>>> does not make any guarantees about the exact implementation details > of >>>> the objects it returns. To attempt to cast it to anything other than >>>> SymbolList would be a mistake! (Although actually it is now returning > a >>>> guarantee of GappedSymbolList, which is what your code can now take >>>> advantage of). To assume it will return SimpleSequence is outside the >>>> behaviour defined by the API and therefore should not be relied upon. >>>> >>>> A more correct behaviour would be to test each item returned: >>>> >>>> SymbolList symlist = aSequences.next(); >>>> if (symlist instanceof SimpleSequence) { >>>> SimpleSequence seq = (SimpleSequence)symlist; >>>> // Do simple-sequence stuff >>>> } else { >>>> // Do something else! >>>> } >>>> >>>> In future, I will modify the API to change the SymbolList guarantee > to >>>> a >>>> GappedSymbolList guarantee, but I can't do this right now as this >>>> really >>>> would break everyone's code! >>>> >>>> We are currently planning a redesign as you may be aware, so issues >>>> like >>>> this will hopefully be resolved as part of that process. For a start, >>>> if >>>> we use Java 5 generics in future as we plan, we can strictly specify >>>> what kinds of objects will be returned by things such as the > alignment >>>> API, making it easier for us to enforce API-compliant behaviour in >>>> user's code. 
>>>> >>>> cheers, >>>> Richard - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx H0QV01cFreNi1SNLRPbhepg= =023Y -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Fri Nov 16 09:43:39 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Fri, 16 Nov 2007 14:43:39 +0000 Subject: [Biojava-l] Disulfide information in PDB files In-Reply-To: <459609.71722.qm@web52710.mail.re2.yahoo.com> References: <459609.71722.qm@web52710.mail.re2.yahoo.com> Message-ID: <8F40FBF1-D491-4C3D-BCEB-41316147BD80@sanger.ac.uk> Hi Brendan, I just committed the patches to CVS so BioJava can now parse the SSBond records. Andreas On 14 Nov 2007, at 16:28, Brendan Duggan wrote: > Hi Andreas > > Thanks for the quick response. I submitted a bug > request (#2400) as suggested by Richard. Parsing the > SSBOND records is indeed what I was talking about. I > want to identify the disulfides then calculate their > torsions, dihedrals and bond lengths, all of which I > believe can be implemented with the existing code. If > you could implement this parsing in a few days that > would be great! > > Thanks > > Brendan > > > --- Andreas Prlic wrote: > >> Hi Brendan, >> >> SSBOND lines are currently not parsed. If this is >> what you need, >> I can add this over the next couple of days. >> >> If you want to compute the bonds yourself, the >> framework can >> e.g. calculate distances between the sulphur atoms >> for you. 
- >> >> Andreas >> >> >> >> >> >> On 14 Nov 2007, at 00:48, Brendan Duggan wrote: >> >>> Greetings >>> >>> I'm trying to mine some information on disulfides >> in >>> the PDB and was hoping there might be a way of >>> obtaining this information with the BioJava PDB >>> parser. However, I haven't been able to see >> anything >>> like this mentioned in the API docs. If it is >>> currently not possible to extract disulfide >>> information from PDB files are there any plans to >>> implement this? >>> >>> Thanks! >>> >>> Brendan >>> >>> >>> Make the switch to the world's best email. >> Get the new Yahoo! >>> 7 Mail now. >> http://au.yahoo.com/worldsbestmail/viagra/index.html >>> >>> >>> _______________________________________________ >>> Biojava-l mailing list - >> Biojava-l at lists.open-bio.org >>> >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> > ---------------------------------------------------------------------- > - >> >> Andreas Prlic Wellcome Trust Sanger Institute >> Hinxton, Cambridge >> CB10 1SA, UK >> +44 (0) 1223 49 6891 >> >> > ---------------------------------------------------------------------- > - >> >> >> >> -- >> The Wellcome Trust Sanger Institute is operated by >> Genome Research >> Limited, a charity registered in England with >> number 1021457 and a >> company registered in England with number 2742969, >> whose registered >> office is 215 Euston Road, London, NW1 2BE. >> > > > Brendan M. Duggan, PhD > > bmduggan at yahoo.com > (858) 692-2298 > > > Make the switch to the world's best email. Get the new Yahoo! > 7 Mail now. 
http://au.yahoo.com/worldsbestmail/viagra/index.html > > ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From holland at ebi.ac.uk Sun Nov 18 12:12:04 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Sun, 18 Nov 2007 17:12:04 -0000 (GMT) Subject: [Biojava-l] Wrapping SimpleGappedSequence In-Reply-To: <000901c829d0$daa54620$8fefd260$@dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> <473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk> Message-ID: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk> Interesting stuff. I'm not sure why it isn't working so I'll have to have a closer look. I'm currently on annual leave but will investigate when I return (Nov 27th). cheers, Richard On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote: > Hi Richard, > > I thought that was also correct what you say, but I can't get it to > work. > Below is a small test program to check this. First, I create a > SimpleGappedSequence through Text with > gaps->SymbolList->Sequence->GappedSequence. Gaps are there but not > "understood", as expected. Next, I create the same sequence non-gapped in > the above way, then introduce gaps with addGapsInSource. A gapped location > is now properly translated to a non-gapped sequence position. 
> Finally, I create a new SimpleGappedSequence based on the working one - as
> you can see the gaps are still there but not "understood"...
>
> aSymbolList = MSE--KLMPRT---TWAKG
> aSequence = MSE--KLMPRT---TWAKG
>
> Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped
> Sequence object:
> aGapped = MSE--KLMPRT---TWAKG
> Gapped position 10 = Plain position 10
>
> aSymbolList = MSEKLMPRTTWAKG
> aSequence = MSEKLMPRTTWAKG
>
> Gaps introduced through addGapsInSource work ok:
> aGapped = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 8
>
> Now a new SimpleGappedSequence object is created from the previous one:
> aGapped2 = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 10
>
> This should have been compiled with the new biojava.jar of 161107 (updated
> via CVS), but perhaps I made a mistake updating?
>
> Any clues?
>
> Thanks,
>
> Ditlev
>
> ---
>
> package gappedsequencetest;
>
> import org.biojava.bio.*;
> import org.biojava.bio.seq.*;
> import org.biojava.bio.seq.impl.*;
> import org.biojava.bio.symbol.*;
>
> public class Main {
>
>     public static void main(String[] args) {
>         SymbolList aSymbolList = null;
>         try {
>             aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence = " + aSequence.seqString() + "\n");
>
>         SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
>         System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
>         System.out.println("aGapped = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position " +
>             aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         try {
>             aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence = " + aSequence.seqString() + "\n");
>
>         aGapped = new SimpleGappedSequence(aSequence);
>         aGapped.addGapsInSource(9, 3);
>         aGapped.addGapsInSource(3, 2);
>         System.out.println("Gaps introduced through addGapsInSource work ok:");
>         System.out.println("aGapped = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position " +
>             aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
>         System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
>         System.out.println("aGapped2 = " + aGapped2.seqString());
>         System.out.println("Gapped position 10 = Plain position " +
>             aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>     }
>
> }
>
> --
>
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>
> www.lindeman-brodersen.dk
>
>
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 16 November 2007 13:46
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: Wrapping SimpleGappedSequence
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
>> delegates to the SimpleGappedSymbolList constructor.
>>
>> When you extend SimpleGappedSequence you should delegate in your new
>> constructor to the existing SimpleGappedSequence constructor, which in
>> turn will delegate as above and preserve the gaps.
>>
>> By passing any object which implements GappedSymbolList to the
>> SimpleGappedSequence constructor, e.g.
SimpleGappedSequence or >> SimpleGappedSymbolList, it will automatically choose the new >> constructor >> from SimpleGappedSymbolList which you hopefully should be able to see >> in >> the code you have just checked out. If passed any other >> non-GappedSymbolList object, it will use the old constructor that >> already existed from before. >> >> cheers, >> Richard >> >> Ditlev Egeskov Brodersen wrote: >> > Hi again, >> > >> > I updated CVS and got the new SimpleGappedSymbolList class, but >> there >> > seems to be no changes to the SimpleGappedSequence class, which is >> the one I >> > need to extend...have I missed something? >> > >> > Ditlev >> > >> > -- >> > >> > Ditlev E. Brodersen, Ph.D. >> > Lektor, Associate Professor >> > >> > Department of Molecular Biology Office: +45 89425259 >> > University of Aarhus Lab: +45 89425022 >> > Gustav Wieds Vej 10c Fax: +45 86123178 >> > DK-8000 Aarhus C Email: deb at mb.au.dk >> > Denmark Lab WWW: www.bioxray.dk/~deb >> > >> > >> >> -----Original Message----- >> >> From: Richard Holland [mailto:holland at ebi.ac.uk] >> >> Sent: 16 November 2007 11:47 >> >> To: Ditlev Egeskov Brodersen >> >> Cc: biojava-l at biojava.org >> >> Subject: Re: Wrapping SimpleGappedSequence >> >> >> > The easiest way is simply for me to alter the constructor to >> > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to >> > copy all gaps if passed another instance of GappedSymbolList as the >> > parameter. I've just done this in CVS so you should be able to update >> > your copy and observe the new behaviour. >> > >> > cheers, >> > Richard >> > >> > Ditlev Egeskov Brodersen wrote: >> >>>> Hi again, >> >>>> >> >>>> thanks for the info - will do the check just to be proper. I >> have >> > another >> >>>> question: In my application, I would like to wrap the retrieved >> >>>> SimpleGappedSequence objects inside another object that extends >> the >> >>>> functionality with application-specific stuff. 
Ideally, I would do >> > this by >> >>>> extending the SimpleGappedSequence object and create it by passing >> > the >> >>>> SimpleGappedSequence from the alignment import to the constructor >> of >> > the >> >>>> parent, like so: >> >>>> >> >>>> class AlignedSequence extends SimpleGappedSequence { >> >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >> >>>> super(aGapped); >> >>>> } >> >>>> >> >>>> ..custom stuff.. >> >>>> } >> >>>> >> >>>> However, the problem is that there is only one constructor for the >> >>>> SimpleGappedSequence, one which takes a simple Sequence object. I >> can >> > pass >> >>>> the derived class alright, but all gap information is lost again, >> > presumably >> >>>> because the SimpleGappedSequence constructor just takes out the >> > seqString() >> >>>> and puts it into its own sequence object. >> >>>> >> >>>> Shouldn't the constructor of the SimpleGappedSequence class >> recognise >> > when a >> >>>> derived (and gapped) sequence object is passed, and process it >> > accordingly? >> >>>> As it stands, I am forced to include the SimpleGappedSequence as a >> > private >> >>>> member of the AlignedSequence class, which is not near as nice >> since >> > all >> >>>> statement using the class will have to do something like >> >>>> >> >>>> class AlignedSequence extends SimpleGappedSequence { >> >>>> private SimpleGappedSequence gapped_sequence; >> >>>> >> >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >> >>>> gapped_sequence = aGapped; >> >>>> } >> >>>> >> >>>> public SimpleGappedSequence getGappedSequence() { >> >>>> return(gapped_sequence); >> >>>> } >> >>>> >> >>>> ..custom stuff.. >> >>>> } >> >>>> >> >>>> ... 
>> >>>> >> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >> >>>> aAligned.getGappedSequence().seqString(); >> >>>> >> >>>> rather than simply: >> >>>> >> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >> >>>> aAligned.seqString(); >> >>>> >> >>>> In other words, is there any solution with the current setup that >> > would >> >>>> allow me to extend SimpleGappedSequence and not loose the gap >> > information? >> >>>> -- Ditlev >> >>>> >> >>>> -- >> >>>> >> >>>> Ditlev E. Brodersen, Ph.D. >> >>>> Lektor, Associate Professor >> >>>> >> >>>> Department of Molecular Biology Office: +45 89425259 >> >>>> University of Aarhus Lab: +45 89425022 >> >>>> Gustav Wieds Vej 10c Fax: +45 86123178 >> >>>> DK-8000 Aarhus C Email: deb at mb.au.dk >> >>>> Denmark Lab WWW: www.bioxray.dk/~deb >> >>>> >> >>>> >> >>>>> -----Original Message----- >> >>>>> From: Richard Holland [mailto:holland at ebi.ac.uk] >> >>>>> Sent: 16 November 2007 10:50 >> >>>>> To: Ditlev Egeskov Brodersen >> >>>>> Cc: biojava-l at biojava.org >> >>>>> Subject: Re: [Biojava-l] Parsing exising gaps >> >>>>> >> >>>>>>> The returned gapped sequences are all properly set up with >> gaps, >> >>>> name etc. >> >>>>>>> But as for other users, I think there may be some problems, >> since >> > the >> >>>>>>> SimpleAlignment object only has a general symbol list iterator, >> > the >> >>>> user >> >>>>>>> will have to cast each statement extracting a sequence object, >> and >> >>>>>>> >> >>>>>>> SimpleSequence aSimple = >> (SimpleSequence)aSequences.next(); >> >>>>>>> >> >>>>>>> returns an ClassCastException at run time. So old code might >> not >> > run >> >>>> with >> >>>>>>> the update as far as I can see. >> >>>> This is true. However, such code would be unsupported by us as the >> > API >> >>>> clearly states that SimpleAlignment returns SymbolList instances, >> and >> >>>> does not make any guarantees about the exact implementation >> details >> > of >> >>>> the objects it returns. 
To attempt to cast it to anything other >> than >> >>>> SymbolList would be a mistake! (Although actually it is now >> returning >> > a >> >>>> guarantee of GappedSymbolList, which is what your code can now >> take >> >>>> advantage of). To assume it will return SimpleSequence is outside >> the >> >>>> behaviour defined by the API and therefore should not be relied >> upon. >> >>>> >> >>>> A more correct behaviour would be to test each item returned: >> >>>> >> >>>> SymbolList symlist = aSequences.next(); >> >>>> if (symlist instanceof SimpleSequence) { >> >>>> SimpleSequence seq = (SimpleSequence)symlist; >> >>>> // Do simple-sequence stuff >> >>>> } else { >> >>>> // Do something else! >> >>>> } >> >>>> >> >>>> In future, I will modify the API to change the SymbolList >> guarantee >> > to >> >>>> a >> >>>> GappedSymbolList guarantee, but I can't do this right now as this >> >>>> really >> >>>> would break everyone's code! >> >>>> >> >>>> We are currently planning a redesign as you may be aware, so >> issues >> >>>> like >> >>>> this will hopefully be resolved as part of that process. For a >> start, >> >>>> if >> >>>> we use Java 5 generics in future as we plan, we can strictly >> specify >> >>>> what kinds of objects will be returned by things such as the >> > alignment >> >>>> API, making it easier for us to enforce API-compliant behaviour in >> >>>> user's code. >> >>>> >> >>>> cheers, >> >>>> Richard >> >> - -- >> Richard Holland (BioMart) >> EMBL EBI, Wellcome Trust Genome Campus, >> Hinxton, Cambridgeshire CB10 1SD, UK >> Tel. 
+44 (0)1223 494416
>>
>> http://www.biomart.org/
>> http://www.biojava.org/
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
>> H0QV01cFreNi1SNLRPbhepg=
>> =023Y
>> -----END PGP SIGNATURE-----
>
>

--
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK

From sterk at ebi.ac.uk Mon Nov 19 06:53:00 2007
From: sterk at ebi.ac.uk (Peter Sterk)
Date: Mon, 19 Nov 2007 11:53:00 +0000
Subject: [Biojava-l] biojava.org wiki site down?
Message-ID: <4741791C.2090307@ebi.ac.uk>

Hi,

I only get blank screens in Firefox and IE can't display the pages, either. I think Richard reported something similar a few weeks ago.

cheers,

--Peter
-----------------------------------------------------------------
Dr. Peter Sterk                          Tel:   +44-(0)1223-494405
EMBL-European Bioinformatics Institute   Fax:   +44-(0)1223-494472
Wellcome Trust Genome Campus, Hinxton    email: sterk at ebi.ac.uk
Cambridge CB10 1SD, UK                   WWW:   www.ebi.ac.uk

Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
-----------------------------------------------------------------

From deb at mb.au.dk Mon Nov 19 07:13:53 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 13:13:53 +0100
Subject: [Biojava-l] biojava.org wiki site down?
In-Reply-To: <4741791C.2090307@ebi.ac.uk>
References: <4741791C.2090307@ebi.ac.uk>
Message-ID: <003301c82aa5$a6fabdc0$f4f03940$@au.dk>

www.biojava.org is down now, alright, but I was there less than 10 minutes ago, so it's a recent crash.

Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark
Lab WWW: www.bioxray.dk/~deb > -----Original Message----- > From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l- > bounces at lists.open-bio.org] On Behalf Of Peter Sterk > Sent: 19 November 2007 12:53 > To: biojava-l at lists.open-bio.org > Subject: [Biojava-l] biojava.org wiki site down? > > Hi, > > I only get blank screens in firefox and IE can't display the pages, > either. I think Richard reported something similar a few weeks ago. > > cheers, > > --Peter > ----------------------------------------------------------------- > Dr. Peter Sterk Tel: +44-(0)1223-494405 > EMBL-European Bioinformatics Institute Fax: +44-(0)1223-494472 > Wellcome Trust Genome Campus, Hinxton email: sterk at ebi.ac.uk > Cambridge CB10 1SD, UK WWW: www.ebi.ac.uk > > Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/ > ----------------------------------------------------------------- > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From deb at mb.au.dk Mon Nov 19 09:46:01 2007 From: deb at mb.au.dk (Ditlev Egeskov Brodersen) Date: Mon, 19 Nov 2007 15:46:01 +0100 Subject: [Biojava-l] Wrapping SimpleGappedSequence In-Reply-To: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> <473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk> <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk> Message-ID: <003701c82aba$e85f4320$b91dc960$@au.dk> Dear Richard and all, I've been dissecting the delegation problem encountered when instantiating SimpleGappedSequence(Sequence) with an already gapped sequence. 
The constructor calls the parent SimpleGappedSymbolList(), which in Richard's CVS update of 161107 now contains a separate overloaded constructor for the gapped case:

    public SimpleGappedSymbolList(GappedSymbolList gappedSource)

However, when instantiating a new SimpleGappedSequence based on an existing gapped sequence (with several blocks), the blocks were lost. After checking the path of code execution it appeared that, for some reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified SimpleGappedSequence.java to include an overloaded constructor also for the descendant class, identical to the other constructor but with a GappedSequence argument:

    public SimpleGappedSequence(GappedSequence seq) {
        super(seq);
        this.sequence = seq;
        createOnUnderlying = false;
    }

Now, the correct parent constructor (SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are two other problems with the new SimpleGappedSymbolList constructor that need to be corrected for it to work as expected:

First, the initial introduction of a single, large block is missing from the new code, so insert:

    Block b = new Block(1, length, 1, length);
    blocks.add(b);

Secondly, the code for transferring the gaps from the sequence string needs to use two separate indices, otherwise the gaps will be placed wrongly because their position is affected by previously inserted gaps:

    int n = 1;
    for (int i = 1; i <= this.length(); i++) {
        if (this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
            this.addGapInSource(n);
        else
            n++;
    }

In other words, the index giving the position of the gaps should only increment when there are NO gaps at the corresponding position in the gapped string.
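
A likely explanation for the "for some reason" above is that Java picks among overloaded constructors at compile time, from the declared type of the argument, not from the runtime class of the object. Inside SimpleGappedSequence(Sequence seq) the argument is declared as a Sequence, so super(seq) can only bind to the SymbolList constructor. The sketch below reproduces that behaviour, and the two-index gap bookkeeping, with plain strings. It is an illustration under stated assumptions, not BioJava code: every name in it (GapSketch, PlainList, GappedList, Text, ParentList, ChildBefore, ChildAfter, gapInsertPositions) is a hypothetical stand-in, and '-' is assumed to be the gap character.

```java
import java.util.ArrayList;
import java.util.List;

class GapSketch {

    // Hypothetical stand-ins for SymbolList and GappedSymbolList.
    interface PlainList { String text(); }
    interface GappedList extends PlainList { }

    // Minimal ungapped implementation.
    static class Text implements PlainList {
        private final String text;
        Text(String text) { this.text = text; }
        public String text() { return text; }
    }

    // Stand-in for SimpleGappedSymbolList; remembers which constructor ran.
    static class ParentList implements GappedList {
        final String chosen;
        private final String text;
        ParentList(PlainList source)  { chosen = "plain";  text = source.text(); }
        ParentList(GappedList source) { chosen = "gapped"; text = source.text(); }
        public String text() { return text; }
    }

    // Stand-in for SimpleGappedSequence with only its original constructor.
    static class ChildBefore extends ParentList {
        // 'source' is declared as PlainList, so the compiler binds
        // super(PlainList) even if the runtime object is a GappedList.
        ChildBefore(PlainList source) { super(source); }
    }

    // Stand-in for SimpleGappedSequence after adding the overload.
    static class ChildAfter extends ParentList {
        ChildAfter(PlainList source)  { super(source); }
        ChildAfter(GappedList source) { super(source); }
    }

    // Two-index walk: i runs over the gapped string, n over the
    // ungapped source (both 1-based). Returns the source positions in
    // front of which a gap has to be inserted.
    static List<Integer> gapInsertPositions(String gapped) {
        List<Integer> positions = new ArrayList<>();
        int n = 1;
        for (int i = 1; i <= gapped.length(); i++) {
            if (gapped.charAt(i - 1) == '-') {
                positions.add(n);   // gap goes in front of source position n
            } else {
                n++;                // only residues advance the source index
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        GappedList gapped = new ParentList(new Text("MS--EKLMPR---TTWAKG"));
        System.out.println(new ChildBefore(gapped).chosen);     // plain
        System.out.println(new ChildAfter(gapped).chosen);      // gapped
        System.out.println(gapInsertPositions(gapped.text()));  // [3, 3, 9, 9, 9]
    }
}
```

The last line of output corresponds to the addGapsInSource(3, 2) and addGapsInSource(9, 3) calls in the test program: two gaps in front of source position 3 and three in front of position 9.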
Following these changes, the GappedSequenceTest program from last week now
works as expected:

    aSymbolList = MSE--KLMPRT---TWAKG
    aSequence = MSE--KLMPRT---TWAKG

    Gaps are not parsed when a SimpleGappedSequence is constructed from a
    gapped Sequence object:
    aGapped = MSE--KLMPRT---TWAKG
    Gapped position 10 = Plain position 10

    aSymbolList = MSEKLMPRTTWAKG
    aSequence = MSEKLMPRTTWAKG

    Gaps introduced through addGapsInSource work ok:
    aGapped = MS--EKLMPR---TTWAKG
    Gapped position 10 = Plain position 8

    Now a new SimpleGappedSequence object is created from the previous one:
    aGapped2 = MS--EKLMPR---TTWAKG
    Gapped position 10 = Plain position 8

-- Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Richard Holland
Sent: 18 November 2007 18:12
To: Ditlev Egeskov Brodersen
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence

Interesting stuff. I'm not sure why it isn't working so I'll have to have a
closer look.

I'm currently on annual leave but will investigate when I return (Nov 27th).

cheers,
Richard

On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:

Hi Richard,

I also thought that what you say was correct, but I can't get it to work.
Below is a small test program to check this. First, I create a
SimpleGappedSequence via the route text with gaps -> SymbolList -> Sequence
-> GappedSequence. Gaps are there but not "understood", as expected. Next, I
create the same sequence non-gapped in the above way, then introduce gaps
with addGapsInSource.
A gapped location is now properly translated to a non-gapped sequence
position. Finally, I create a new SimpleGappedSequence based on the working
one - as you can see the gaps are still there but not "understood"...

    aSymbolList = MSE--KLMPRT---TWAKG
    aSequence = MSE--KLMPRT---TWAKG

    Gaps are not parsed when a SimpleGappedSequence is constructed from a
    gapped Sequence object:
    aGapped = MSE--KLMPRT---TWAKG
    Gapped position 10 = Plain position 10

    aSymbolList = MSEKLMPRTTWAKG
    aSequence = MSEKLMPRTTWAKG

    Gaps introduced through addGapsInSource work ok:
    aGapped = MS--EKLMPR---TTWAKG
    Gapped position 10 = Plain position 8

    Now a new SimpleGappedSequence object is created from the previous one:
    aGapped2 = MS--EKLMPR---TTWAKG
    Gapped position 10 = Plain position 10

This should have been compiled with the new biojava.jar of 161107 (updated
via CVS), but perhaps I made a mistake updating?

Any clues?

Thanks,

Ditlev

---

    package gappedsequencetest;

    import org.biojava.bio.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.seq.impl.*;
    import org.biojava.bio.symbol.*;

    public class Main {

        public static void main(String[] args) {
            // Case 1: gaps are already present in the text of the sequence.
            SymbolList aSymbolList = null;
            try {
                aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
            }
            catch(BioException ex) {}

            System.out.println("aSymbolList = " + aSymbolList.seqString());

            Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
            System.out.println("aSequence = " + aSequence.seqString() + "\n");

            SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
            System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
            System.out.println("aGapped = " + aGapped.seqString());
            System.out.println("Gapped position 10 = Plain position " +
                aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");

            // Case 2: ungapped text, gaps added afterwards with addGapsInSource.
            try {
                aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
            }
            catch(BioException ex) {}

            System.out.println("aSymbolList = " + aSymbolList.seqString());

            aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
            System.out.println("aSequence = " + aSequence.seqString() + "\n");

            aGapped = new SimpleGappedSequence(aSequence);
            aGapped.addGapsInSource(9, 3);
            aGapped.addGapsInSource(3, 2);
            System.out.println("Gaps introduced through addGapsInSource work ok:");
            System.out.println("aGapped = " + aGapped.seqString());
            System.out.println("Gapped position 10 = Plain position " +
                aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");

            // Case 3: a new gapped sequence built from the working one of case 2.
            SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
            System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
            System.out.println("aGapped2 = " + aGapped2.seqString());
            System.out.println("Gapped position 10 = Plain position " +
                aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
        }

    }

--

Ditlev Egeskov Brodersen
Lektor
Bakkefaldet 30, Hasle
8210 Århus V

www.lindeman-brodersen.dk

-----Original Message-----
From: Richard Holland [mailto:holland at ebi.ac.uk]
Sent: 16 November 2007 13:46
To: Ditlev Egeskov Brodersen
Cc: biojava-l at biojava.org
Subject: Re: Wrapping SimpleGappedSequence

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
delegates to the SimpleGappedSymbolList constructor.

When you extend SimpleGappedSequence you should delegate in your new
constructor to the existing SimpleGappedSequence constructor, which in turn
will delegate as above and preserve the gaps.

By passing any object which implements GappedSymbolList to the
SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
SimpleGappedSymbolList, it will automatically choose the new constructor
from SimpleGappedSymbolList which you hopefully should be able to see in the
code you have just checked out. If passed any other non-GappedSymbolList
object, it will use the old constructor that already existed from before.
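Whether the gapped constructor really is chosen depends on how Java picks between overloads: the choice is made at compile time from the declared type of the argument, not from its run-time class. The sketch below illustrates this with stand-in types only; `Symbols`, `GappedSymbols`, `Parent`, `ChildGeneralOnly` and `ChildWithOverload` are hypothetical and are not BioJava classes, so this is a plausible explanation for the behaviour discussed in this thread rather than a statement about the actual BioJava source.

```java
final class OverloadSketch {

    interface Symbols {}
    interface GappedSymbols extends Symbols {}
    static final class GappedImpl implements GappedSymbols {}

    /** Parent with one constructor per argument type, recording which one ran. */
    static class Parent {
        final String chosen;
        Parent(Symbols s) { chosen = "general"; }
        Parent(GappedSymbols s) { chosen = "gapped"; }
    }

    /** Child that only declares a constructor with the general type. */
    static class ChildGeneralOnly extends Parent {
        ChildGeneralOnly(Symbols s) {
            super(s); // declared type of s is Symbols, so Parent(Symbols) is selected
        }
    }

    /** Child that adds an overload with the more specific type. */
    static class ChildWithOverload extends Parent {
        ChildWithOverload(Symbols s) { super(s); }
        ChildWithOverload(GappedSymbols s) {
            super(s); // declared type of s is GappedSymbols, so Parent(GappedSymbols) is selected
        }
    }

    public static void main(String[] args) {
        GappedImpl gapped = new GappedImpl();
        // Same run-time object in both cases; only the declared parameter types differ.
        System.out.println(new ChildGeneralOnly(gapped).chosen);
        System.out.println(new ChildWithOverload(gapped).chosen);
    }
}
```

In this sketch the child with only the general constructor always reaches the general parent constructor, even when handed a gapped object; adding the more specific overload in the child is what lets the gapped parent constructor be reached.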
cheers,
Richard

Ditlev Egeskov Brodersen wrote:

Hi again,

I updated CVS and got the new SimpleGappedSymbolList class, but there seem
to be no changes to the SimpleGappedSequence class, which is the one I need
to extend... have I missed something?

Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

-----Original Message-----
From: Richard Holland [mailto:holland at ebi.ac.uk]
Sent: 16 November 2007 11:47
To: Ditlev Egeskov Brodersen
Cc: biojava-l at biojava.org
Subject: Re: Wrapping SimpleGappedSequence

The easiest way is simply for me to alter the constructor of
SimpleGappedSequence (and equivalently of SimpleGappedSymbolList) to copy all
gaps if passed another instance of GappedSymbolList as the parameter. I've
just done this in CVS so you should be able to update your copy and observe
the new behaviour.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:

Hi again,

thanks for the info - will do the check just to be proper. I have another
question: In my application, I would like to wrap the retrieved
SimpleGappedSequence objects inside another object that extends the
functionality with application-specific stuff. Ideally, I would do this by
extending the SimpleGappedSequence class and creating the object by passing
the SimpleGappedSequence from the alignment import to the constructor of the
parent, like so:

    class AlignedSequence extends SimpleGappedSequence {
        public AlignedSequence(SimpleGappedSequence aGapped) {
            super(aGapped);
        }

        ..custom stuff..
    }

However, the problem is that there is only one constructor for
SimpleGappedSequence, one which takes a simple Sequence object.
I can pass the derived class all right, but all gap information is lost
again, presumably because the SimpleGappedSequence constructor just takes
out the seqString() and puts it into its own sequence object.

Shouldn't the constructor of the SimpleGappedSequence class recognise when a
derived (and gapped) sequence object is passed, and process it accordingly?
As it stands, I am forced to include the SimpleGappedSequence as a private
member of the AlignedSequence class, which is not nearly as nice, since
every statement using the class will have to do something like

    class AlignedSequence {
        private SimpleGappedSequence gapped_sequence;

        public AlignedSequence(SimpleGappedSequence aGapped) {
            gapped_sequence = aGapped;
        }

        public SimpleGappedSequence getGappedSequence() {
            return(gapped_sequence);
        }

        ..custom stuff..
    }

    ...

    AlignedSequence aAligned = new AlignedSequence(aGapped);
    aAligned.getGappedSequence().seqString();

rather than simply:

    AlignedSequence aAligned = new AlignedSequence(aGapped);
    aAligned.seqString();

In other words, is there any solution with the current setup that would
allow me to extend SimpleGappedSequence and not lose the gap information?

-- Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

-----Original Message-----
From: Richard Holland [mailto:holland at ebi.ac.uk]
Sent: 16 November 2007 10:50
To: Ditlev Egeskov Brodersen
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] Parsing exising gaps

The returned gapped sequences are all properly set up with gaps, name etc.
But as for other users, I think there may be some problems: since the
SimpleAlignment object only has a general symbol list iterator, the user
will have to cast in each statement extracting a sequence object, and

    SimpleSequence aSimple = (SimpleSequence)aSequences.next();

throws a ClassCastException at run time. So old code might not run with the
update as far as I can see.

This is true. However, such code would be unsupported by us as the API
clearly states that SimpleAlignment returns SymbolList instances, and does
not make any guarantees about the exact implementation details of the
objects it returns. To attempt to cast it to anything other than SymbolList
would be a mistake! (Although actually it now always returns
GappedSymbolList instances, which is what your code can now take advantage
of.) To assume it will return SimpleSequence is outside the behaviour
defined by the API and therefore should not be relied upon.

A more correct behaviour would be to test each item returned:

    SymbolList symlist = (SymbolList)aSequences.next();
    if (symlist instanceof SimpleSequence) {
        SimpleSequence seq = (SimpleSequence)symlist;
        // Do simple-sequence stuff
    } else {
        // Do something else!
    }

In future, I will modify the API to change the SymbolList guarantee to a
GappedSymbolList guarantee, but I can't do this right now as this really
would break everyone's code!

We are currently planning a redesign as you may be aware, so issues like
this will hopefully be resolved as part of that process. For a start, if we
use Java 5 generics in future as we plan, we can strictly specify what kinds
of objects will be returned by things such as the alignment API, making it
easier for us to enforce API-compliant behaviour in users' code.

cheers,
Richard

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
H0QV01cFreNi1SNLRPbhepg=
=023Y
-----END PGP SIGNATURE-----

--
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From allank at sanbi.ac.za Sun Nov 25 08:10:55 2007
From: allank at sanbi.ac.za (Allan Kamau)
Date: Sun, 25 Nov 2007 15:10:55 +0200
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported
Message-ID: <4749745F.9070104@sanbi.ac.za>

Hi all,
I've searched for a conclusive answer to the "Program ncbi-blastn Version is
not supported" error without success.
I would like to know the format of the blast output that Biojava's
blast-like parsing framework likes, including some examples (without the
data) of how such blast output may be created.
For example, I am using ncbi-blastn and I am generating the blast file
(which Biojava doesn't like) as follows.

export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
export BLAST_REPORT_TABULAR=somesequence.blast.txt
export BLAST_REPORT_XML=somesequence.blast.xml
export BLAST_REPORT=somesequence.blast
export INPUT_FASTA=somesequence.fasta
export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence

date;
time cd $WORK_DIR;
cp $REFERENCES_FASTA_NAME .;
$FORMATDB -i $REFERENCES_FASTA_NAME -p F -o T;
echo `pwd`;
$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o $BLAST_REPORT_TABULAR;
$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;
date;

Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"

Then I get the error below.

[aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
Buildfile: build.xml

runBlastParser:
     [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17 is not supported by the biojava blast-like parsing framework
     [java]     at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
     [java]     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)

Allan.

From markjschreiber at gmail.com Sun Nov 25 20:17:03 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 26 Nov 2007 09:17:03 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported
In-Reply-To: <4749745F.9070104@sanbi.ac.za>
References: <4749745F.9070104@sanbi.ac.za>
Message-ID: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>

Hi Allan -

I think the solution is to call the setParserLazy() or some method with a
similar name (I don't have the API handy). This will prevent it doing the
check.
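
The check mentioned here can be pictured as a gate in front of the parser: in strict mode an unlisted program version is rejected, while in lazy mode parsing is attempted anyway. The following is a library-free sketch of that idea only; `VersionGateSketch`, its `Mode` enum and the list of versions are hypothetical and do not reproduce BioJava's internals. With BioJava itself, a later message in this thread reports that calling setModeLazy() on the BlastLikeSAXParser object resolved the error.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class VersionGateSketch {

    enum Mode { STRICT, LAZY }

    // Hypothetical list of validated versions; not BioJava's actual list.
    private static final Set<String> VALIDATED =
            new HashSet<>(Arrays.asList("2.0.11", "2.2.2", "2.2.3"));

    private Mode mode = Mode.STRICT;

    void setMode(Mode mode) {
        this.mode = mode;
    }

    /**
     * Returns true if the version is on the validated list.
     * In strict mode an unlisted version is rejected with an exception;
     * in lazy mode the method returns false and the caller parses anyway.
     */
    boolean check(String program, String version) {
        if (VALIDATED.contains(version)) {
            return true;
        }
        if (mode == Mode.STRICT) {
            throw new IllegalStateException(
                    "Program " + program + " Version " + version + " is not supported");
        }
        return false;
    }

    public static void main(String[] args) {
        VersionGateSketch gate = new VersionGateSketch();
        try {
            gate.check("ncbi-blastn", "2.2.17");
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        gate.setMode(Mode.LAZY);
        System.out.println("lazy, validated = " + gate.check("ncbi-blastn", "2.2.17"));
    }
}
```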

The original idea of this method was that you could check against a list of
version numbers that people had validated. I don't think this is a good
idea, as nothing is truly 100% validated and we haven't kept the list up to
date. If there are no objections, I would propose to deprecate this method
(and its opposite method) and change the default behaviour to lazy checking.

Best regards.

- Mark

From holland at ebi.ac.uk Mon Nov 26 03:55:56 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 08:55:56 +0000
Subject: [Biojava-l] Applet not able to find DNATools class.
In-Reply-To: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>
References: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>
Message-ID: <474A8A1C.4020901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sounds like either a classpath problem (in which case check your classpath
to ensure all parts of biojava are definitely on it) or a broken
biojava.jar (in which case you need to recompile/redownload it).

cheers,
Richard

Abhinav Ram Karhu wrote:
> Hello all,
> I am having an error while loading the applet.
>
> I am getting the following stack trace.
>
> java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
>     at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
>     at Trace.init(Trace.java:161)
>     at sun.applet.AppletPanel.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
>
> I have the directory structure in which I am having my class files, the php page and the biojava jar files together in one folder.
>
> I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java.
> > My applet code in the php page looks like this: > > > > Please suggest if I am missing something. > > Thanks in advance. > > Abhinav > > > - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHSoob4C5LeMEKA/QRAsfkAJ9SlwIzDulzSDQpAfgh0alISRsplACcDqUx uyQUEmRFEWTdnEHsm7k2lg0= =SWHu -----END PGP SIGNATURE----- From holland at ebi.ac.uk Mon Nov 26 07:55:23 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 26 Nov 2007 12:55:23 +0000 Subject: [Biojava-l] Wrapping SimpleGappedSequence In-Reply-To: <003701c82aba$e85f4320$b91dc960$@au.dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> <473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk> <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk> <003701c82aba$e85f4320$b91dc960$@au.dk> Message-ID: <474AC23B.3080500@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I have made the changes you suggest below in CVS. Hopefully it will work for you now. cheers, Richard Ditlev Egeskov Brodersen wrote: > Dear Richard and all, > > I've been dissecting the delegation problem encountered when instantiating > SimpleGappedSequence(Sequence) with an already gapped sequence. 
The constructor calls the parent SimpleGappedSymbolList(), which in Richard's
> CVS update of 161107 now contains a separate overloaded constructor for the
> gapped case:
>
> public SimpleGappedSymbolList(GappedSymbolList gappedSource)
>
> However, when instantiating a new SimpleGappedSequence based on an
> existing gapped sequence (with several blocks), the blocks were lost.
>
> After checking the path of code execution it appeared that for some
> reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified
> SimpleGappedSequence.java to include an overloaded constructor also for the
> descendant class, identical to the other constructor but with a
> GappedSequence argument:
>
> public SimpleGappedSequence(GappedSequence seq) {
>     super(seq);
>     this.sequence = seq;
>     createOnUnderlying = false;
> }
>
> Now, the correct parent constructor
> (SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are
> two other problems with the new SimpleGappedSymbolList constructor that
> need to be corrected for it to work as expected: First, the initial
> introduction of a single, large block is missing from the new code, so
> insert:
>
> Block b = new Block(1, length, 1, length);
> blocks.add(b);
>
> Secondly, the code for transferring the gaps from the sequence string needs
> to use two separate indices, otherwise the gaps will be placed wrongly
> because their position is affected by previously inserted gaps:
>
> int n=1;
> for(int i=1;i<=this.length();i++) {
>     if(this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
>         this.addGapInSource(n);
>     else
>         n++;
> }
>
> In other words, the index giving the position of the gaps should only
> increment when there are NO gaps at the corresponding position in the gapped
> string.
>
> Following these changes, the GappedSequenceTest program from last week now
> works as expected:
>
> aSymbolList = MSE--KLMPRT---TWAKG
> aSequence = MSE--KLMPRT---TWAKG
>
> Gaps are not parsed when a SimpleGappedSequence is constructed from a
> gapped Sequence object:
> aGapped = MSE--KLMPRT---TWAKG
> Gapped position 10 = Plain position 10
>
> aSymbolList = MSEKLMPRTTWAKG
> aSequence = MSEKLMPRTTWAKG
>
> Gaps introduced through addGapsInSource work ok:
> aGapped = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 8
>
> Now a new SimpleGappedSequence object is created from the previous one:
> aGapped2 = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 8
>
> -- Ditlev
>
> --
>
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSsI64C5LeMEKA/QRAg21AKCieEvT2KaWBFdqLFUtxazhHXmD2wCgiRwk
Bz79hrJxD/eZrrCUXUAh758=
=0Jpp
-----END PGP SIGNATURE-----

From allank at sanbi.ac.za Mon Nov 26 07:02:56 2007
From: allank at sanbi.ac.za (Allan Kamau)
Date: Mon, 26 Nov 2007 14:02:56 +0200
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported
In-Reply-To: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
Message-ID: <474AB5F0.6040802@sanbi.ac.za>

Hi Mark,
Thank you for your reply.
Calling setModeLazy() method of the object of type BlastLikeSAXParser did provide the cure. Allan. Mark Schreiber wrote: > Hi Allan - > > I think the solution is to call the setParserLazy() or some method > with a similar name (I don't have the API handy). This will prevent it > doing the check. > > The original idea of this method was you could check against a list of > version numbers that people had validated. I don't think this is a > good idea as nothing is truely 100% validated and we haven't kept the > list up to date. If there are no objections I would propose to make > this method depricated (and it's opposite method) and change the > default behaivour to lazy checking. > > Best regards. > > - Mark > > > On 11/25/07, *Allan Kamau* > wrote: > > Hi all, > I've searched for a conclusive answer to the "Program ncbi-blastn > Version is not supported" without success. > I would like to know format of the blast output the Biojava's > blast-like > parsing framework likes, including some examples (without the data) of > how such blast output may be created. > For example, I am using ncbi-blastn and I am generating the blast > file > (which Biojava doesn't like) as follows. 
> > export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb; > export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall; > export > REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta; > export BLAST_REPORT_TABULAR=somesequence.blast.txt > export BLAST_REPORT_XML=somesequence.blast.xml > export BLAST_REPORT=somesequence.blast > export INPUT_FASTA=somesequence.fasta > export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence > > date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i > $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d > $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o > $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d > $REFERENCES_FASTA_NAME -i > $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date; > > Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied > from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser" > > Then I get the error below. > > > [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser; > Buildfile: build.xml > > runBlastParser: > [java] org.xml.sax.SAXException: Program ncbi-blastn Version > 2.2.17 > is not supported by the biojava blast-like parsing framework > [java] at > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java > :241) > [java] at > org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160) > > Allan. 
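The error quoted above comes from a check of the reported program version against a list of validated versions, which strict mode enforces and lazy mode skips. A minimal sketch of that idea follows. It is not the BioJava implementation: the class name, the whitelist contents and the exception type are assumptions made for illustration, and only the method names setModeLazy() and setModeStrict() echo the thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VersionCheckSketch {

    // Hypothetical list of validated versions; the real list is not shown in this thread.
    private static final Set<String> VALIDATED =
            new HashSet<String>(Arrays.asList("2.0.11", "2.2.2", "2.2.3"));

    // Strict by default, which is the behaviour the thread proposes to change.
    private boolean lazy = false;

    public void setModeLazy() {
        lazy = true;
    }

    public void setModeStrict() {
        lazy = false;
    }

    /** Rejects unlisted versions in strict mode; accepts any version in lazy mode. */
    public void checkVersion(String program, String version) {
        if (!lazy && !VALIDATED.contains(version)) {
            throw new IllegalStateException(
                    "Program " + program + " Version " + version + " is not supported");
        }
    }

    public static void main(String[] args) {
        VersionCheckSketch parser = new VersionCheckSketch();
        try {
            parser.checkVersion("ncbi-blastn", "2.2.17");
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        parser.setModeLazy();
        parser.checkVersion("ncbi-blastn", "2.2.17"); // no exception in lazy mode
        System.out.println("lazy: accepted");
    }
}
```

The trade-off is the one raised in the thread: a strict list is only useful while someone keeps it up to date, whereas lazy mode attempts the parse and leaves any failure to the parsing step itself.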
> _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From markjschreiber at gmail.com Mon Nov 26 22:16:35 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 27 Nov 2007 11:16:35 +0800 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474AB5F0.6040802@sanbi.ac.za> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> Message-ID: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Hi - Does anyone mind if I change the default behaivor to lazy parsing? Technically this would be a break in backwards compatibility (although only if you have a program that relies on strict parsing). Last chance to complain. - Mark On Nov 26, 2007 8:02 PM, Allan Kamau wrote: > Hi Mark, > Thank you for your reply. > Calling setModeLazy() method of the object of type BlastLikeSAXParser > did provide the cure. > > Allan. > > > Mark Schreiber wrote: > > Hi Allan - > > > > I think the solution is to call the setParserLazy() or some method > > with a similar name (I don't have the API handy). This will prevent it > > doing the check. > > > > The original idea of this method was you could check against a list of > > version numbers that people had validated. I don't think this is a > > good idea as nothing is truely 100% validated and we haven't kept the > > list up to date. If there are no objections I would propose to make > > this method depricated (and it's opposite method) and change the > > default behaivour to lazy checking. > > > > Best regards. > > > > - Mark > > > > > > On 11/25/07, *Allan Kamau* > > > > > wrote: > > > > Hi all, > > I've searched for a conclusive answer to the "Program ncbi-blastn > > Version is not supported" without success. 
> > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > > > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > From holland at ebi.ac.uk Tue Nov 27 03:40:10 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 27 Nov 2007 08:40:10 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Message-ID: <474BD7EA.4040604@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Sounds good to me. Mark Schreiber wrote: > Hi - > > Does anyone mind if I change the default behaivor to lazy parsing? > Technically this would be a break in backwards compatibility (although > only if you have a program that relies on strict parsing). > > Last chance to complain. > > - Mark > > On Nov 26, 2007 8:02 PM, Allan Kamau wrote: >> Hi Mark, >> Thank you for your reply. >> Calling setModeLazy() method of the object of type BlastLikeSAXParser >> did provide the cure. >> >> Allan. >> >> >> Mark Schreiber wrote: >>> Hi Allan - >>> >>> I think the solution is to call the setParserLazy() or some method >>> with a similar name (I don't have the API handy). This will prevent it >>> doing the check. >>> >>> The original idea of this method was you could check against a list of >>> version numbers that people had validated. I don't think this is a >>> good idea as nothing is truely 100% validated and we haven't kept the >>> list up to date. If there are no objections I would propose to make >>> this method depricated (and it's opposite method) and change the >>> default behaivour to lazy checking. >>> >>> Best regards. 
>>> >>> >>> [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser; >>> Buildfile: build.xml >>> >>> runBlastParser: >>> [java] org.xml.sax.SAXException: Program ncbi-blastn Version >>> 2.2.17 >>> is not supported by the biojava blast-like parsing framework >>> [java] at >>> org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java >>> :241) >>> [java] at >>> org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160) >>> >>> Allan. >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> >> >> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >>> >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHS9fq4C5LeMEKA/QRAm/3AJ9hi2yrSyeK6a3nXtObyJ2MAk0Y1QCeL5HT iYQc6HTdm6fJ+Lcfssnd34g= =VuJJ -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Tue Nov 27 05:24:49 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 27 Nov 2007 10:24:49 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Message-ID: > Does anyone mind if I change the default behaivor to lazy parsing? Hi Mark, I think this is a good idea. 
we had a couple of questions and feature requests recently regarding the blast parser, so I wonder if we should have a look at how to make it (and the documentation) better also during the V3 discussion... Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From holland at ebi.ac.uk Tue Nov 27 06:01:33 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 27 Nov 2007 11:01:33 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Message-ID: <474BF90D.3070003@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > we had a couple of questions and feature requests recently regarding > the blast parser, so I wonder if we should > have a look at how to make it (and the documentation) better also > during the V3 discussion... A rethink of the blast parser is definitely a good idea. It's starting to need more work than before as the various subtly different file formats used by the most recent versions and variants of blast have evolved beyond the tolerance limits of the existing parser. It also needs to be made simpler to use. 
cheers, Richard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb Nu2SZ7yV4e0rUmyIBxNYTJU= =9nHg -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Tue Nov 27 06:11:30 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 27 Nov 2007 11:11:30 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474BF90D.3070003@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> Message-ID: <474BFB62.3040203@ebi.ac.uk> What format options are there from blast? Just thinking if it supports CIGAR or something like that are we better providing a parser for that format & saying that we do not support the traditional blast output? That said it doesn't help is when that format changes so maybe what is needed is a way to push out parser changes without requiring a full biojava release (v3 discussion) ... Andy Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > >> we had a couple of questions and feature requests recently regarding >> the blast parser, so I wonder if we should >> have a look at how to make it (and the documentation) better also >> during the V3 discussion... > > A rethink of the blast parser is definitely a good idea. It's starting > to need more work than before as the various subtly different file > formats used by the most recent versions and variants of blast have > evolved beyond the tolerance limits of the existing parser. It also > needs to be made simpler to use. 
> > cheers, > Richard > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb > Nu2SZ7yV4e0rUmyIBxNYTJU= > =9nHg > -----END PGP SIGNATURE----- > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From holland at ebi.ac.uk Tue Nov 27 06:18:59 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 27 Nov 2007 11:18:59 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474BFB62.3040203@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> Message-ID: <474BFD23.8060005@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > What format options are there from blast? Just thinking if it supports > CIGAR or something like that are we better providing a parser for that > format & saying that we do not support the traditional blast output? > That said it doesn't help is when that format changes so maybe what is > needed is a way to push out parser changes without requiring a full > biojava release (v3 discussion) ... Exactly! So the modular idea would work nicely here - we could have a blast module and only update that single module (which would be its own JAR) whenever the format changes. In a way, BioJava releases as such would no longer happen, except maybe for some kind of core BioJava module. Everything would be done in terms of individual module+JAR releases instead - one for Genbank, one for BioSQL, one for NEXUS, one for Phylogenetic tools, one for translation/transcription, etc. etc. 
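One way to picture the module-per-format idea is a small core that only knows a registry of format modules, each with its own release number. The sketch below is purely illustrative: the interface, its method names and the version strings are all hypothetical and do not describe any existing or planned BioJava API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ModuleRegistrySketch {

    /** Hypothetical contract for a format module that ships as its own JAR. */
    public interface FormatModule {
        String format();   // e.g. "blast" or "genbank"
        String version();  // release number of this module alone
    }

    private final Map<String, FormatModule> modules =
            new LinkedHashMap<String, FormatModule>();

    /** Installs a module, replacing any earlier release for the same format. */
    public void install(FormatModule module) {
        modules.put(module.format(), module);
    }

    /** Returns the installed release for a format, or null if none is installed. */
    public String versionOf(String format) {
        FormatModule module = modules.get(format);
        return module == null ? null : module.version();
    }

    /** Convenience factory for the demo; a real module would carry a parser too. */
    public static FormatModule module(final String format, final String version) {
        return new FormatModule() {
            public String format() { return format; }
            public String version() { return version; }
        };
    }

    public static void main(String[] args) {
        ModuleRegistrySketch core = new ModuleRegistrySketch();
        core.install(module("blast", "1.0"));
        core.install(module("genbank", "1.0"));
        // The blast format changed: only the blast module is re-released.
        core.install(module("blast", "1.1"));
        System.out.println(core.versionOf("blast") + " " + core.versionOf("genbank"));
    }
}
```

The point of the sketch is only that re-releasing the blast module leaves the genbank module and the core untouched.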
cheers, Richard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHS/0j4C5LeMEKA/QRAkQuAJ9B+mmV7vo9QuFYwEgmnHczExyXqwCfamIx uPFQKdbXRC7pwC6lM5aBcJk= =F3PD -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Tue Nov 27 06:47:54 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 27 Nov 2007 11:47:54 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474BFD23.8060005@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> <474BFD23.8060005@ebi.ac.uk> Message-ID: <474C03EA.4070706@ebi.ac.uk> I think Groovy have adopted a similar system recently & have guidelines for how each module should behave (dependencies, build system etc). This enforces the idea that a module whilst not part of the core project must behave in the same manner the core does. I do like the idea that we can have a core biojava & things get added around it & it might encourage other users to start developing their own modules for any formats/purpose they want. Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > >> What format options are there from blast? Just thinking if it supports >> CIGAR or something like that are we better providing a parser for that >> format & saying that we do not support the traditional blast output? >> That said it doesn't help is when that format changes so maybe what is >> needed is a way to push out parser changes without requiring a full >> biojava release (v3 discussion) ... > > Exactly! So the modular idea would work nicely here - we could have a > blast module and only update that single module (which would be its own > JAR) whenever the format changes. 
In a way, BioJava releases as such > would no longer happen, except maybe for some kind of core BioJava > module. Everything would be done in terms of individual module+JAR > releases instead - one for Genbank, one for BioSQL, one for NEXUS, one > for Phylogenetic tools, one for translation/transcription, etc. etc. > > cheers, > Richard From markjschreiber at gmail.com Tue Nov 27 09:48:12 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 27 Nov 2007 22:48:12 +0800 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474C03EA.4070706@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> <474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk> Message-ID: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com> For a long time now my feeling has been that we should *only* support the XML version of blast output. The other formats are too brittle to be easy to parse. I also feel similarly about Genbank, EMBL, etc that may be an extreme view but the power of generic XML parsers and things like XPath etc really make these formats look very attractive. - Mark On Nov 27, 2007 7:47 PM, Andy Yates wrote: > I think Groovy have adopted a similar system recently & have guidelines > for how each module should behave (dependencies, build system etc). This > enforces the idea that a module whilst not part of the core project must > behave in the same manner the core does. I do like the idea that we can > have a core biojava & things get added around it & it might encourage > other users to start developing their own modules for any > formats/purpose they want. > > Richard Holland wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > >> What format options are there from blast? 
Just thinking if it supports > >> CIGAR or something like that are we better providing a parser for that > >> format & saying that we do not support the traditional blast output? > >> That said it doesn't help is when that format changes so maybe what is > >> needed is a way to push out parser changes without requiring a full > >> biojava release (v3 discussion) ... > > > > Exactly! So the modular idea would work nicely here - we could have a > > blast module and only update that single module (which would be its own > > JAR) whenever the format changes. In a way, BioJava releases as such > > would no longer happen, except maybe for some kind of core BioJava > > module. Everything would be done in terms of individual module+JAR > > releases instead - one for Genbank, one for BioSQL, one for NEXUS, one > > for Phylogenetic tools, one for translation/transcription, etc. etc. > > > > cheers, > > Richard > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From ayates at ebi.ac.uk Tue Nov 27 10:16:12 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 27 Nov 2007 15:16:12 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> <474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk> <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com> Message-ID: <474C34BC.4070209@ebi.ac.uk> I was always under the impression that blast's XML output was nearly as hard to parse as the flat file format but I do agree that if we can use XML whenever we can it would make writing parsers a lot easier (especially 
if there are SAX based XPath libraries available). Actually this brings up a good question about development of this type of parser. The majority of XPath supporting libraries are DOM based which will mean large memory usage in some situations but overall providing an easier coding experience (and hopefully reduce our chances of creating bugs). Or should we code to the edge cases of someone trying to parse a 1GB XML? Personally I'd favour the former. Going back to the original topic there are going to be situations where people want the flat file parsers/writers & I think it's a valid point to say this is where BioJava is meant to come in & help a developer. Afterall XML is a computer science problem where as parsing an EMBL flat file or blast output is a bioinformatics problem. Andy Mark Schreiber wrote: > For a long time now my feeling has been that we should *only* support > the XML version of blast output. The other formats are too brittle to > be easy to parse. I also feel similarly about Genbank, EMBL, etc that > may be an extreme view but the power of generic XML parsers and things > like XPath etc really make these formats look very attractive. > > - Mark > > > On Nov 27, 2007 7:47 PM, Andy Yates wrote: >> I think Groovy have adopted a similar system recently & have guidelines >> for how each module should behave (dependencies, build system etc). This >> enforces the idea that a module whilst not part of the core project must >> behave in the same manner the core does. I do like the idea that we can >> have a core biojava & things get added around it & it might encourage >> other users to start developing their own modules for any >> formats/purpose they want. >> >> Richard Holland wrote: >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA1 >>> >>>> What format options are there from blast? 
Just thinking if it supports >>>> CIGAR or something like that are we better providing a parser for that >>>> format & saying that we do not support the traditional blast output? >>>> That said it doesn't help is when that format changes so maybe what is >>>> needed is a way to push out parser changes without requiring a full >>>> biojava release (v3 discussion) ... >>> Exactly! So the modular idea would work nicely here - we could have a >>> blast module and only update that single module (which would be its own >>> JAR) whenever the format changes. In a way, BioJava releases as such >>> would no longer happen, except maybe for some kind of core BioJava >>> module. Everything would be done in terms of individual module+JAR >>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one >>> for Phylogenetic tools, one for translation/transcription, etc. etc. >>> >>> cheers, >>> Richard >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> From markjschreiber at gmail.com Tue Nov 27 22:34:38 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 28 Nov 2007 11:34:38 +0800 Subject: [Biojava-l] SAX, DOM, XPath and Flat files Message-ID: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> Hi - I think in most cases huge XML files in bioinformatics result from a single XML containing multiple repetitive elements. Eg a BLAST XML output with several hits or a GenBankXML with many Sequences. A nice approach I have seen for dealing with these is to use SAX to read over the file and every time it comes to an element it delegates to a DOM object. You then parse the bits of the DOM you want with XPath or convert to objects or something and then when you are finished with that entry everything gets garbage collected and the SAX parser moves to the next element and repeats the whole process. 
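The approach described above could look roughly like the following Java sketch, which uses only the JAXP classes bundled with the JDK. A SAX handler skips everything outside the repeated record element, builds a small DOM tree for each record, hands it to a callback that queries it with XPath, and then drops it. The class and method names are made up for illustration; the demo input borrows the Hit and Hit_id element names from BLAST XML but is otherwise invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RecordSplitter {

    /** Callback invoked once for each completed record element. */
    public interface RecordHandler {
        void handle(Element record) throws Exception;
    }

    /** SAX handler that builds a throwaway DOM tree per record element. */
    static class SplittingHandler extends DefaultHandler {
        private final String recordName;
        private final RecordHandler callback;
        private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        private Document doc;  // DOM of the record being built; null outside a record
        private Node current;  // current insertion point inside doc

        SplittingHandler(String recordName, RecordHandler callback) {
            this.recordName = recordName;
            this.callback = callback;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (doc == null) {
                if (!qName.equals(recordName)) {
                    return; // outside any record: nothing is kept in memory
                }
                try {
                    doc = factory.newDocumentBuilder().newDocument();
                } catch (ParserConfigurationException e) {
                    throw new SAXException(e);
                }
                current = doc;
            }
            Element element = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) {
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (doc == null) {
                return;
            }
            Node parent = current.getParentNode();
            if (parent == doc) {
                // The record's root element just closed: hand it over, then forget it.
                Element record = (Element) current;
                doc = null;
                current = null;
                try {
                    callback.handle(record);
                } catch (Exception e) {
                    throw new SAXException(e);
                }
            } else {
                current = parent;
            }
        }
    }

    /** Evaluates an XPath expression against each record, one record at a time. */
    public static List<String> extract(String xml, String recordName, final String expression)
            throws Exception {
        final List<String> results = new ArrayList<String>();
        final XPath xpath = XPathFactory.newInstance().newXPath();
        RecordHandler callback = record -> results.add(xpath.evaluate(expression, record));
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new SplittingHandler(recordName, callback));
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<BlastOutput><Iteration_hits>"
                + "<Hit><Hit_id>seq1</Hit_id><Hit_len>120</Hit_len></Hit>"
                + "<Hit><Hit_id>seq2</Hit_id><Hit_len>95</Hit_len></Hit>"
                + "</Iteration_hits></BlastOutput>";
        System.out.println(extract(xml, "Hit", "Hit_id")); // prints [seq1, seq2]
    }
}
```

Memory use is bounded by the largest single record rather than by the whole file, while the per-record code can still use XPath instead of hand-written SAX state tracking.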
This is a hybrid of event-based parsing and object-model-based parsing which could let you deal efficiently with huge files.

I think the BLAST XML has improved substantially, at least in terms of validating against its own DTD. The DTD itself may not be the best design, but that is always a matter of taste, and if you are using XPath to get the relevant bits you don't need to make a SAX parser jump through hoops to get them.

I agree we will have to keep flat file parsers, but we should strongly encourage the use of XML where possible. It is simply easier to deal with. Most biological flat files were designed for Fortran and mainly for human consumption. There is no obvious validation mechanism.

Notably, everything in NCBI is derived from ASN.1; what you see in the flat file is produced from there. I tend to think this means that the ASN.1 is the holy gospel and what you get in the flat file is some translation. Ideally NCBI files should be parsed from the ASN.1, where you can guarantee validation; the more practical alternative is to use the XML, which you can at least validate against a DTD.

With XML we (BioJava) can say that if it validates we will parse it, and if it doesn't we may not. With flat files there are so many dodgy variants that we cannot say anything. Because XML DTDs (or XSDs) have versions, it is also much easier to have parsers for different versions, and the parsing machinery can figure out which one is needed. With flat files it is anyone's guess what version you are dealing with.

Finally, parsers can be auto-generated for XML if you have the DTD or XSD. This often doesn't give you an ideal parser, but it can be a useful starting point for rapid development.

For BioJava v3 I think we should concentrate on XML parsers first and flat files second.
if only Fasta had an XML format - Mark On Nov 27, 2007 11:16 PM, Andy Yates wrote: > I was always under the impression that blast's XML output was nearly as > hard to parse as the flat file format but I do agree that if we can use > XML whenever we can it would make writing parsers a lot easier > (especially if there are SAX based XPath libraries available). Actually > this brings up a good question about development of this type of parser. > The majority of XPath supporting libraries are DOM based which will mean > large memory usage in some situations but overall providing an easier > coding experience (and hopefully reduce our chances of creating bugs). > Or should we code to the edge cases of someone trying to parse a 1GB > XML? Personally I'd favour the former. > > Going back to the original topic there are going to be situations where > people want the flat file parsers/writers & I think it's a valid point > to say this is where BioJava is meant to come in & help a developer. > Afterall XML is a computer science problem where as parsing an EMBL flat > file or blast output is a bioinformatics problem. > > Andy > > > Mark Schreiber wrote: > > For a long time now my feeling has been that we should *only* support > > the XML version of blast output. The other formats are too brittle to > > be easy to parse. I also feel similarly about Genbank, EMBL, etc that > > may be an extreme view but the power of generic XML parsers and things > > like XPath etc really make these formats look very attractive. > > > > - Mark > > > > > > On Nov 27, 2007 7:47 PM, Andy Yates wrote: > >> I think Groovy have adopted a similar system recently & have guidelines > >> for how each module should behave (dependencies, build system etc). This > >> enforces the idea that a module whilst not part of the core project must > >> behave in the same manner the core does. 
I do like the idea that we can > >> have a core biojava & things get added around it & it might encourage > >> other users to start developing their own modules for any > >> formats/purpose they want. > >> > >> Richard Holland wrote: > >>> -----BEGIN PGP SIGNED MESSAGE----- > >>> Hash: SHA1 > >>> > >>>> What format options are there from blast? Just thinking if it supports > >>>> CIGAR or something like that are we better providing a parser for that > >>>> format & saying that we do not support the traditional blast output? > >>>> That said it doesn't help is when that format changes so maybe what is > >>>> needed is a way to push out parser changes without requiring a full > >>>> biojava release (v3 discussion) ... > >>> Exactly! So the modular idea would work nicely here - we could have a > >>> blast module and only update that single module (which would be its own > >>> JAR) whenever the format changes. In a way, BioJava releases as such > >>> would no longer happen, except maybe for some kind of core BioJava > >>> module. Everything would be done in terms of individual module+JAR > >>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one > >>> for Phylogenetic tools, one for translation/transcription, etc. etc. > >>> > >>> cheers, > >>> Richard > >> _______________________________________________ > >> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-l > >> > From ayates at ebi.ac.uk Wed Nov 28 09:29:15 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 28 Nov 2007 14:29:15 +0000 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> Message-ID: <474D7B3B.8030807@ebi.ac.uk> Hi Mark, Okay that sounds like a perfectly sensible way to deal with this. Is this kind of parsing model supported in Java5? 
I only ask as I've not done a lot of XML parsing with Java5; more with
things like XOM (which I think offers a DOM-only representation but I'm
probably wrong).

That's good. There's not much point in having a format & a DTD/XSD and
then having your files not conform to it.

I was thinking the exact same thing about ASN.1 (well that & it looks
bleeding horrible to parse, but that is an uneducated look at the format
which I'm sure is as parsable as JSON & the like).

When it comes to flat file parsers I would be happier to provide
implementations of the more common formats where a viable alternative is
not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
similar output to the above have a chance to write their own
parsers/formatters. This is very similar to the current situation but we
just need to remove dependencies on statically located data structures
(don't get rid of them completely, just give users an option to not use
them).

I'm not sure how much automatically generated parsers would help us. I
guess it depends on whether the data model(s) we use are auto-parser
friendly (which normally means POJO/JavaBean conventions including the
no-args constructor).

Cool, I don't want to exclude flat file parsers completely (if only
because my group has an interest in BioJava being able to read & write
flat files) :)

They decided to have HUPO-PSI Format instead :)

Andy


Mark Schreiber wrote:
> Hi -
>
> I think in most cases huge XML files in bioinformatics result from a
> single XML containing multiple repetitive elements. Eg a BLAST XML
> output with several hits or a GenBankXML with many Sequences. A nice
> approach I have seen for dealing with these is to use SAX to read over
> the file and every time it comes to an element it delegates to a DOM
> object.
You then parse the bits of the DOM you want with XPath or > convert to objects or something and then when you are finished with > that entry everything gets garbage collected and the SAX parser moves > to the next element and repeats the whole process. This is a hybrid > of event based parsing and object-model based parsing which could let > you efficiently deal with huge files. > > I think the BLAST XML has improved substantially, at least in terms of > validating against it's own DTD. The DTD itself may not be the best > design but that is always a matter of taste and if you are using XPath > to get the relevant bits you don't need to make a SAX parser jump > through hoops to get them. > > I agree we will have to keep flat file parsers but we should strongly > encourage the use of XML where possible. It is simply easier to deal > with. Most biological flat-files were designed for Fortran and mainly > for human consumption. There is no obvious validation mechanism. > Notably everything in NCBI is derived from ASN.1, what you see in the > flatfile is produced from there. I tend to think this means that the > ASN.1 is the holy gospel and what you get in the flat file is some > translation. Ideally NCBI files should be parsed from the ASN.1 where > you can guarantee validation, the more practical alternative is to use > the XML which you can at least validate against a DTD. > > With XML we (Biojava) can say if it validates we will parse it and if > it doesn't we may not. With flat files there are so many dodgey > variants we cannot say anything. Because XML dtds (or xsd's) have > versions it also makes it much easier to have parsers for different > versions and the parsing machinery can figure out which is needed. > With flat files it is anyones guess what version you are dealing with. > > Finally parsers can be auto-generated for XML if you have the DTD or > XSD. 
This often doesn't give you an ideal parser but it can be a > useful starting point for rapid development. > > For Biojava v 3 I think we should concentrate on XML parsers first and > flat files second. if only Fasta had an XML format > > - Mark > > On Nov 27, 2007 11:16 PM, Andy Yates wrote: >> I was always under the impression that blast's XML output was nearly as >> hard to parse as the flat file format but I do agree that if we can use >> XML whenever we can it would make writing parsers a lot easier >> (especially if there are SAX based XPath libraries available). Actually >> this brings up a good question about development of this type of parser. >> The majority of XPath supporting libraries are DOM based which will mean >> large memory usage in some situations but overall providing an easier >> coding experience (and hopefully reduce our chances of creating bugs). >> Or should we code to the edge cases of someone trying to parse a 1GB >> XML? Personally I'd favour the former. >> >> Going back to the original topic there are going to be situations where >> people want the flat file parsers/writers & I think it's a valid point >> to say this is where BioJava is meant to come in & help a developer. >> Afterall XML is a computer science problem where as parsing an EMBL flat >> file or blast output is a bioinformatics problem. >> >> Andy >> >> >> Mark Schreiber wrote: >>> For a long time now my feeling has been that we should *only* support >>> the XML version of blast output. The other formats are too brittle to >>> be easy to parse. I also feel similarly about Genbank, EMBL, etc that >>> may be an extreme view but the power of generic XML parsers and things >>> like XPath etc really make these formats look very attractive. >>> >>> - Mark >>> >>> >>> On Nov 27, 2007 7:47 PM, Andy Yates wrote: >>>> I think Groovy have adopted a similar system recently & have guidelines >>>> for how each module should behave (dependencies, build system etc). 
This >>>> enforces the idea that a module whilst not part of the core project must >>>> behave in the same manner the core does. I do like the idea that we can >>>> have a core biojava & things get added around it & it might encourage >>>> other users to start developing their own modules for any >>>> formats/purpose they want. >>>> >>>> Richard Holland wrote: >>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>> Hash: SHA1 >>>>> >>>>>> What format options are there from blast? Just thinking if it supports >>>>>> CIGAR or something like that are we better providing a parser for that >>>>>> format & saying that we do not support the traditional blast output? >>>>>> That said it doesn't help is when that format changes so maybe what is >>>>>> needed is a way to push out parser changes without requiring a full >>>>>> biojava release (v3 discussion) ... >>>>> Exactly! So the modular idea would work nicely here - we could have a >>>>> blast module and only update that single module (which would be its own >>>>> JAR) whenever the format changes. In a way, BioJava releases as such >>>>> would no longer happen, except maybe for some kind of core BioJava >>>>> module. Everything would be done in terms of individual module+JAR >>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one >>>>> for Phylogenetic tools, one for translation/transcription, etc. etc. >>>>> >>>>> cheers, >>>>> Richard >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> From dmitry.repchevski at bsc.es Wed Nov 28 09:49:23 2007 From: dmitry.repchevski at bsc.es (Dmitry Repchevsky) Date: Wed, 28 Nov 2007 15:49:23 +0100 Subject: [Biojava-l] SAX, DOM, XPath and Flat files References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com Message-ID: <474D7FF3.9010901@bsc.es> Hello! 
Actually there is also a StAX parser (http://en.wikipedia.org/wiki/StAX)
which is faster than SAX and allows writing.
In JDK 6, apart from StAX, there is JAXB, which makes a perfect
combination for parsing huge files.
You can go through the XML file using StAX until the element you are
interested in and unmarshall it using JAXB to a POJO object.

Cheers,

Dmitry

From ayates at ebi.ac.uk Wed Nov 28 10:37:03 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 15:37:03 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7FF3.9010901@bsc.es>
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com <474D7FF3.9010901@bsc.es>
Message-ID: <474D8B1F.8070301@ebi.ac.uk>

Hi Dmitry,

StAX still has higher memory consumption than SAX (still not as large as
DOM) but yes it is quite a good parser system & since we're moving
towards the later versions of Java it may be a good idea to use it as
our standard parser ... if it supports XPath (can't remember off the top
of my head) :)

Andy

From markjschreiber at gmail.com Thu Nov 29 21:28:58 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 30 Nov 2007 10:28:58 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7B3B.8030807@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk>
Message-ID: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>

Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
not XQuery although XPath is probably more important for this use.

The DOM model is a direct implementation of the W3C standard which
makes it a little awkward from a Java point of view but it is usable.

Java 6 has StAX (the other one).

There are a few Java APIs for parsing ASN.1, mostly developed for the
telco industry. I've never really looked into which is best (anyone
experienced with this?) but we could probably use one to work directly
off NCBI ASN.1

- Mark

From heuermh at acm.org Fri Nov 30 01:06:26 2007
From: heuermh at acm.org (Michael Heuer)
Date: Fri, 30 Nov 2007 01:06:26 -0500 (EST)
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID:

Mark Schreiber wrote:

> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
>
> Java 6 has StAX (the other one).

Yeah, those jerks. :)

I wrote a note to the spec author a few weeks before "the other" StAX was
announced at a Java One however long ago asking them to reconsider their
project name.

Oh well.
We can still be the "original" StAX. > http://stax.sf.net May I kindly suggest skipping all of this talk about XML and have us jump straight to OWL? ;) > http://dev.isb-sib.ch/projects/uniprot-rdf/ michael From ayates at ebi.ac.uk Fri Nov 30 04:18:45 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Fri, 30 Nov 2007 09:18:45 +0000 Subject: [Biojava-l] SAX, DOM, XPath and Flat files In-Reply-To: References: Message-ID: <474FD575.3060307@ebi.ac.uk> Michael Heuer wrote: > Mark Schreiber wrote: > >> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but >> not XQuery although XPath is probably more important for this use. >> >> The DOM model is a direct implementation of the W3C standard which >> makes it a little awkward from a java point of view but it is usable. >> >> Java 6 has StAX (the other one). > > Yeah, those jerks. :) > > I wrote a note to the spec author a few weeks before "the other" StAX was > announced at a Java One however long ago asking them to reconsider their > project name. > > Oh well. We can still be the "original" StAX. > >> http://stax.sf.net Yup I remember that issue from BOSC 2005 ... oh well not a lot that can be done now. Maybe a re-brand of our StAX to StAX Original. Bit like the Coca Cola & New Coke mess-up. > > > May I kindly suggest skipping all of this talk about XML and have us > jump straight to OWL? ;) > >> http://dev.isb-sib.ch/projects/uniprot-rdf/ Lol just let me fire up my semantic web engine first :). 
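
For readers following the StAX part of this thread, a minimal sketch of
cursor-style parsing with the javax.xml.stream API that ships with
Java 6 and later. The element names seq and residues, the id attribute,
the Seq bean and the class StaxSketch are all hypothetical; the bean is
filled by hand here, where the JAXB suggestion made earlier in the
thread would unmarshal each element instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical illustration of pull parsing with StAX; not BioJava code. */
final class StaxSketch {

    /** Hypothetical plain bean holding one sequence record. */
    static final class Seq {
        String id;
        String residues;
    }

    /** Walks the document with a cursor and fills one Seq per seq element. */
    static List<Seq> read(String xml) {
        List<Seq> out = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Seq current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("seq")) {
                        current = new Seq();
                        current.id = reader.getAttributeValue(null, "id");
                    } else if (name.equals("residues") && current != null) {
                        // getElementText() consumes the text and stops on the end tag
                        current.residues = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("seq")) {
                    out.add(current);
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy input invented for this sketch.
        String xml = "<seqs><seq id=\"s1\"><residues>ACGT</residues></seq>"
                + "<seq id=\"s2\"><residues>GGCC</residues></seq></seqs>";
        for (Seq s : read(xml)) {
            System.out.println(s.id + " " + s.residues);
        }
    }
}
```

The reader itself never holds more than the current element; a caller
who does not want the whole list in memory could hand each Seq to a
callback instead of collecting them.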
From ayates at ebi.ac.uk Fri Nov 30 04:26:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:26:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <474FD737.9080801@ebi.ac.uk>

I think I've seen XPath hanging around in other people's code in a 1.5
code-base (in fact one of the guys I work with). I've used Java's DOM
before & it really isn't very nice & quite verbose. I'd prefer if there
was a better alternative/wrapper around the XML parsers just to cut down
on code chatter.

Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these
Java tools & I think I've gone cross-eyed with the sheer number of
acronyms! You've gotta love something which seems to add a letter to ER
& that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
list know of an ASN.1 parser for Java that's good, and should we support
it (considering NCBI generate their DTD & XML from the ASN.1
representation)?

Andy

From phidias51 at gmail.com Fri Nov 30 13:30:50 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 30 Nov 2007 10:30:50 -0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474FD737.9080801@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com> <474D7B3B.8030807@ebi.ac.uk> <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com> <474FD737.9080801@ebi.ac.uk>
Message-ID: <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>

There's a potential gotcha involved with XPath parsing. If you use the
current implementation that ships with the Java 5 & 6 JDKs, it performs
a DOM parse on the whole document, even if you pass it a specific
starting node in the document. I stumbled across this one the hard way
when using the hybrid approach that you mention. This may be solved
with another XPath implementation such as Saxon.

One other problem I've noticed is that the NCBI XML doesn't always
parse. I've reported this to them, and they've promised to address
this. It usually occurs when submitters put non-escaped characters
into text fields such as author lists in PubMed. NCBI doesn't always
use CDATA blocks around text and as soon as the parser hits one of
these characters it throws an exception.

I've also noticed a tendency (in other code bases) for developers to
use several different parsers; usually, whatever parser they're most
familiar with. The problem with this is that they often introduce
parser-specific code into the code base, so you end up with numerous
dependencies for different parsers, and a potential configuration
problem if you're passing the XML parser as a run-time configuration
parameter. The most frequent external parsers I've seen used are JDOM
and DOM4J.
The usual way to get around this is to write to an interface, but that will require some additional vigilance. Just a few things to watch out for as we move forward. Mark (the other one) :-) On Nov 30, 2007 1:26 AM, Andy Yates wrote: > I think I've seen XPath hanging around in other people's code in a 1.5 > code-base (in fact one of the guys I work with). I've used Java's DOM > before & it really isn't very nice & quite verbose. I'd prefer if there > was a better alternative/wrapper around the XML parsers just to cut down > on code chatter. > > Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these > Java tools & I think I've gone cross-eyed with the sheer number of > acronyms! You've gotta love something which seems to add a letter to ER > & that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the > list know of a ASN.1 parser for Java that's good and should we support > it (considering NCBI generate their DTD & XML from the ASN.1 > representation). > > Andy > > Mark Schreiber wrote: > > Java 5 SDK has both SAX and DOM as standard. I think it has XPath but > > not XQuery although XPath is probably more important for this use. > > > > The DOM model is a direct implementation of the W3C standard which > > makes it a little awkward from a java point of view but it is usable. > > > > Java 6 has StAX (the other one). > > > > There are a few java API's for parsing ASN.1 mostly developed for the > > telco industry, I've never really looked into which is best (anyone > > experienced with this?) but we could probably use one to work directly > > off NCBI ASN.1 > > > > - Mark > > > > On Nov 28, 2007 10:29 PM, Andy Yates wrote: > >> Hi Mark, > >> > >> Okay that sounds like a perfectly sensible way to deal with this. Is > >> this kind of parsing model supported in Java5? I only ask as I've not > >> done a lot of XML parsing with Java5; more with things like XOM (which > I > >> think offers a DOM only representation but I'm probably wrong). 
> >> > >> That's good. There's not a huge point to have a format & a DTD/XSD and > >> then have your files not conform to it. > >> > >> I was thinking the exact same thing about ASN.1 (well that & it looks > >> bleeding horrible to parse but that is an un-educated look at the > format > >> which I'm sure is a parsable as JSON & the alike). > >> > >> When it comes to flat file parsers I would be happier to provide > >> implementations of the more common formats where a viable alternative > is > >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which > provide > >> similar output to the above have a chance to write their own > >> parsers/formatters. This is very similar to the current situation but > we > >> just need to remove dependencies on statically located data structures > >> (don't get rid of them completely just give users an option to not use > >> them). > >> > >> I'm not sure how much automatically generated parsers would help us. I > >> guess it depends on the data model(s) we use if they are auto-parser > >> friendly (which normally means POJO/JavaBean conventions including the > >> no-args constructor). > >> > >> Cool I don't want to exclude flat file parsers completely (if only > >> because my group has an interest in BioJava being able to read & write > >> flat files) :) > >> > >> They decided to have HUPO-PSI Format instead :) > >> > >> Andy > >> > >> > >> Mark Schreiber wrote: > >>> Hi - > >>> > >>> I think in most cases huge XML files in bioinformatics result from a > >>> single XML containing multiple repetitive elements. Eg a BLAST XML > >>> output with several hits or a GenBankXML with many Sequences. A nice > >>> approach I have seen for dealing with these is to use SAX to read over > >>> the file and every time it comes to an element it delegates to a DOM > >>> object. 
You then parse the bits of the DOM you want with XPath or > >>> convert to objects or something and then when you are finished with > >>> that entry everything gets garbage collected and the SAX parser moves > >>> to the next element and repeats the whole process. This is a hybrid > >>> of event based parsing and object-model based parsing which could let > >>> you efficiently deal with huge files. > >>> > >>> I think the BLAST XML has improved substantially, at least in terms of > >>> validating against it's own DTD. The DTD itself may not be the best > >>> design but that is always a matter of taste and if you are using XPath > >>> to get the relevant bits you don't need to make a SAX parser jump > >>> through hoops to get them. > >>> > >>> I agree we will have to keep flat file parsers but we should strongly > >>> encourage the use of XML where possible. It is simply easier to deal > >>> with. Most biological flat-files were designed for Fortran and mainly > >>> for human consumption. There is no obvious validation mechanism. > >>> Notably everything in NCBI is derived from ASN.1, what you see in the > >>> flatfile is produced from there. I tend to think this means that the > >>> ASN.1 is the holy gospel and what you get in the flat file is some > >>> translation. Ideally NCBI files should be parsed from the ASN.1 where > >>> you can guarantee validation, the more practical alternative is to use > >>> the XML which you can at least validate against a DTD. > >>> > >>> With XML we (Biojava) can say if it validates we will parse it and if > >>> it doesn't we may not. With flat files there are so many dodgey > >>> variants we cannot say anything. Because XML dtds (or xsd's) have > >>> versions it also makes it much easier to have parsers for different > >>> versions and the parsing machinery can figure out which is needed. > >>> With flat files it is anyones guess what version you are dealing with. 
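[Editorial sketch, not part of the original thread.] The hybrid approach quoted above (stream over the file with SAX, build a small DOM for each repeated element, query it with XPath, then let it be garbage collected) could look roughly like the following. This is only a minimal illustration under stated assumptions: the class and method names (HybridParse, RecordHandler, extract) and the element names in the sample XML are hypothetical, it uses Java 8 syntax rather than the Java 5 discussed in the thread, and it ignores namespaces and DTD handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class HybridParse {

    /** SAX handler that builds one small DOM per record element and hands it to a callback. */
    static class RecordHandler extends DefaultHandler {
        private final String recordName;
        private final DocumentBuilder builder;
        private final Consumer<Document> sink;
        private Document doc;  // DOM of the record currently being read, or null
        private Node current;  // insertion point inside doc

        RecordHandler(String recordName, DocumentBuilder builder, Consumer<Document> sink) {
            this.recordName = recordName;
            this.builder = builder;
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (doc == null) {
                if (!qName.equals(recordName)) return;  // outside any record: skip
                doc = builder.newDocument();            // a new record starts here
                current = doc;
            }
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(e);
            current = e;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (doc == null) return;
            current = current.getParentNode();
            if (current == doc) {   // the record's root element has just closed
                sink.accept(doc);
                doc = null;         // drop the reference so the DOM can be collected
                current = null;
            }
        }
    }

    /** Streams over the XML and evaluates the XPath expression once per record element. */
    static List<String> extract(String xml, String recordName, String xpathExpr) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> out = new ArrayList<>();
        RecordHandler handler = new RecordHandler(recordName, builder, d -> {
            try {
                out.add(xpath.evaluate(xpathExpr, d));
            } catch (XPathExpressionException ex) {
                throw new IllegalStateException(ex);
            }
        });
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical element names, not the real BLAST or GenBank XML schema.
        String xml = "<hits><hit id='1'><acc>P1</acc></hit><hit id='2'><acc>P2</acc></hit></hits>";
        System.out.println(extract(xml, "hit", "/hit/acc"));  // prints [P1, P2]
    }
}
```

Because each record gets its own small Document, XPath only ever sees one record at a time, so memory use should track the largest record rather than the whole file. For a real file the InputSource would wrap a file stream instead of a String.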
> >>> > >>> Finally parsers can be auto-generated for XML if you have the DTD or > >>> XSD. This often doesn't give you an ideal parser but it can be a > >>> useful starting point for rapid development. > >>> > >>> For Biojava v 3 I think we should concentrate on XML parsers first and > >>> flat files second. if only Fasta had an XML format > >>> > >>> - Mark > >>> > >>> On Nov 27, 2007 11:16 PM, Andy Yates wrote: > >>>> I was always under the impression that blast's XML output was nearly > as > >>>> hard to parse as the flat file format but I do agree that if we can > use > >>>> XML whenever we can it would make writing parsers a lot easier > >>>> (especially if there are SAX based XPath libraries available). > Actually > >>>> this brings up a good question about development of this type of > parser. > >>>> The majority of XPath supporting libraries are DOM based which will > mean > >>>> large memory usage in some situations but overall providing an easier > >>>> coding experience (and hopefully reduce our chances of creating > bugs). > >>>> Or should we code to the edge cases of someone trying to parse a 1GB > >>>> XML? Personally I'd favour the former. > >>>> > >>>> Going back to the original topic there are going to be situations > where > >>>> people want the flat file parsers/writers & I think it's a valid > point > >>>> to say this is where BioJava is meant to come in & help a developer. > >>>> Afterall XML is a computer science problem where as parsing an EMBL > flat > >>>> file or blast output is a bioinformatics problem. > >>>> > >>>> Andy > >>>> > >>>> > >>>> Mark Schreiber wrote: > >>>>> For a long time now my feeling has been that we should *only* > support > >>>>> the XML version of blast output. The other formats are too brittle > to > >>>>> be easy to parse. 
I also feel similarly about Genbank, EMBL, etc > that > >>>>> may be an extreme view but the power of generic XML parsers and > things > >>>>> like XPath etc really make these formats look very attractive. > >>>>> > >>>>> - Mark > >>>>> > >>>>> > >>>>> On Nov 27, 2007 7:47 PM, Andy Yates wrote: > >>>>>> I think Groovy have adopted a similar system recently & have > guidelines > >>>>>> for how each module should behave (dependencies, build system etc). > This > >>>>>> enforces the idea that a module whilst not part of the core project > must > >>>>>> behave in the same manner the core does. I do like the idea that we > can > >>>>>> have a core biojava & things get added around it & it might > encourage > >>>>>> other users to start developing their own modules for any > >>>>>> formats/purpose they want. > >>>>>> > >>>>>> Richard Holland wrote: > >>>>>>> -----BEGIN PGP SIGNED MESSAGE----- > >>>>>>> Hash: SHA1 > >>>>>>> > >>>>>>>> What format options are there from blast? Just thinking if it > supports > >>>>>>>> CIGAR or something like that are we better providing a parser for > that > >>>>>>>> format & saying that we do not support the traditional blast > output? > >>>>>>>> That said it doesn't help is when that format changes so maybe > what is > >>>>>>>> needed is a way to push out parser changes without requiring a > full > >>>>>>>> biojava release (v3 discussion) ... > >>>>>>> Exactly! So the modular idea would work nicely here - we could > have a > >>>>>>> blast module and only update that single module (which would be > its own > >>>>>>> JAR) whenever the format changes. In a way, BioJava releases as > such > >>>>>>> would no longer happen, except maybe for some kind of core BioJava > >>>>>>> module. Everything would be done in terms of individual module+JAR > >>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, > one > >>>>>>> for Phylogenetic tools, one for translation/transcription, etc. > etc. 
> >>>>>>> > >>>>>>> cheers, > >>>>>>> Richard > >>>>>> _______________________________________________ > >>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>>>> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From abhi232 at cc.gatech.edu Sat Nov 24 11:16:17 2007 From: abhi232 at cc.gatech.edu (Abhinav Ram Karhu) Date: Sat, 24 Nov 2007 16:16:17 -0000 Subject: [Biojava-l] Applet not able to find DNATools class. Message-ID: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu> Hello all, I am having an error while loading the applet. I am getting the following stack trace. java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161) at Trace.init(Trace.java:161) at sun.applet.AppletPanel.run(Unknown Source) at java.lang.Thread.run(Unknown Source) I have the directory structure in which I am having my class files , the php page and the biojava jar files together in one folder. I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java. My applet code in the php page looks like this: Please suggest if I am missing something. Thanks in advance. Abhinav
From abhi232 at cc.gatech.edu Mon Nov 5 17:59:15 2007 From: abhi232 at cc.gatech.edu (abhi232 at cc.gatech.edu) Date: Mon, 5 Nov 2007 12:59:15 -0500 (EST) Subject: [Biojava-l] Error while reading byte data for creating a Trace. In-Reply-To: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk> References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk> Message-ID: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu> Hi all, I am having a byte array which is having the data from an .ab1 file.The biojava library provides a class called as ABITrace which takes as input either a byte[] array , a file or a url.If i use the later parameters (the file or the url )the program works but if I pass the byte array to the constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a problem with the ABITrace class or how can I bypass this particular error. I am printing the length of the byte array and it comes to 144930...Can that cause a problem in my code? Thanks in advance. Abhinav From holland at ebi.ac.uk Tue Nov 6 10:15:43 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 06 Nov 2007 10:15:43 +0000 Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
Message-ID: <47303ECF.4020806@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I suspect the byte array itself may contain inaccurate data.

Internally, both the URL and File constructors read the data into a byte array and then pass it to the same method as is used by the byte[] constructor.

So, something must be different between the byte array you have, and the byte array obtained by reading the file in.

The File constructor uses the following code to read the file:

    byte[] bytes = null;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    FileInputStream fis = new FileInputStream(ABIFile);
    BufferedInputStream bis = new BufferedInputStream(fis);
    int b;
    while ((b = bis.read()) >= 0)
    {
        baos.write(b);
    }
    bis.close(); fis.close(); baos.close();
    bytes = baos.toByteArray();

If the above code produces different results to your byte array when reading data from the same file as your code, then something has gone wrong with the construction of your byte array.

Lastly, a full stack trace would help us pinpoint the line that is breaking, and hopefully provide a hint as to what is wrong with the contents of the byte array. If you could provide one that would be very helpful.

cheers,
Richard

abhi232 at cc.gatech.edu wrote:
> Hi all,
> I am having a byte array which is having the data from an .ab1 file.The
> biojava library provides a class called as ABITrace which takes as input
> either a byte[] array , a file or a url.If i use the later parameters (the
> file or the url )the program works but if I pass the byte array to the
> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
> problem with the ABITrace class or how can I bypass this particular error.
> I am printing the length of the byte array and it comes to 144930...Can > that cause a problem in my code? > > Thanks in advance. > Abhinav > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMD7P4C5LeMEKA/QRAmGIAJ9a/V6nZqMROz3H4u69ECQ+9iTgMgCeNZvr oe52S3khmTvi5BFCL1W4KHM= =5JAO -----END PGP SIGNATURE----- From holland at ebi.ac.uk Tue Nov 6 16:53:54 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 06 Nov 2007 16:53:54 +0000 Subject: [Biojava-l] Error while reading byte data for creating a Trace. In-Reply-To: <4730A6F1.9050407@cc.gatech.edu> References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk> <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu> <47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu> Message-ID: <47309C22.10803@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I think that either the file is at fault, or the method you are using to read the file into Java is at fault. Could you provide us with the complete piece of code you are using from the point where you read the file into the array through to the point where you generate the output you quoted? (Not as an attachment as the mailing list will strip those - simply paste it into the message body instead). cheers, Richard abhinav wrote: > Richard Holland wrote: > I suspect the byte array itself may contain inaccurate data. > > Internally, both the URL and File constructors read the data into a byte > array and then pass it to the same method as is used by the byte[] > constructor. > > So, something must be different between the byte array you have, and the > byte array obtained by reading the file in. 
> > The File constructor uses the following code to read the file: > > byte[] bytes = null; > ByteArrayOutputStream baos = new ByteArrayOutputStream(); > FileInputStream fis = new FileInputStream(ABIFile); > BufferedInputStream bis = new BufferedInputStream(fis); > int b; > while ((b = bis.read()) >= 0) > { > baos.write(b); > } > bis.close(); fis.close(); baos.close(); > bytes = baos.toByteArray(); > > If the above code produces different results to your byte array when > reading data from the same file as your code, then something has gone > wrong with the construction of your byte array. > > Lastly, a full stack trace would help us pinpoint the line that is > breaking, and hopefully provide a hint as to what is wrong with the > contents of the byte array. If you could provide one that would be very > helpful. > > cheers, > Richard > > > abhi232 at cc.gatech.edu wrote: > >>>> Hi all, >>>> I am having a byte array which is having the data from an .ab1 file.The >>>> biojava library provides a class called as ABITrace which takes as input >>>> either a byte[] array , a file or a url.If i use the later parameters (the >>>> file or the url )the program works but if I pass the byte array to the >>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a >>>> problem with the ABITrace class or how can I bypass this particular error. >>>> I am printing the length of the byte array and it comes to 144930...Can >>>> that cause a problem in my code? >>>> >>>> Thanks in advance. 
>>>> Abhinav >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >>>> > Yes I looked at the file ABITrace and found out that the first three > characters must be ABI or the 128-130 characters must be ABI.But I > cannot find that in the file that I am having.Also If this is not the > case then there should be an illegal format exception whereas I am > arrayIndexOutOfBound Exception which is also weird. > I am getting the following stack trace. > The bytes that i want are:0 > The bytes that i want are:11 > The bytes that i want are:0 > The size of the byte array generated is:144930 > Byte array also recieved > java.lang.ArrayIndexOutOfBoundsException: 128 > at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552) > at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289) > at org.biojava.bio.program.abi.ABITrace.(ABITrace.java:136) > at Trace.init(Trace.java:138) > at sun.applet.AppletPanel.run(Unknown Source) > at java.lang.Thread.run(Unknown Source) > The bytes I want are the first three bytes that I want to check if my > file is ABI or not.I checked the isABI function as well it returns true > or false value and not arrayIndexOutOfBouond . Also the number 128 does > it hve any significance in this case? > Thanks in advance > Abhinav -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG T1Z8W0JhCe7+hx5rbKLGqVk= =qNcr -----END PGP SIGNATURE----- From abhi232 at cc.gatech.edu Tue Nov 6 18:03:02 2007 From: abhi232 at cc.gatech.edu (abhinav) Date: Tue, 06 Nov 2007 12:03:02 -0600 Subject: [Biojava-l] Error while reading byte data for creating a Trace. 
In-Reply-To: <47309C22.10803@ebi.ac.uk> References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk> <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu> <47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu> <47309C22.10803@ebi.ac.uk> Message-ID: <4730AC56.9060808@cc.gatech.edu> Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I think that either the file is at fault, or the method you are using to > read the file into Java is at fault. > > Could you provide us with the complete piece of code you are using from > the point where you read the file into the array through to the point > where you generate the output you quoted? (Not as an attachment as the > mailing list will strip those - simply paste it into the message body > instead). > > cheers, > Richard > > > abhinav wrote: > >> Richard Holland wrote: >> I suspect the byte array itself may contain inaccurate data. >> >> Internally, both the URL and File constructors read the data into a byte >> array and then pass it to the same method as is used by the byte[] >> constructor. >> >> So, something must be different between the byte array you have, and the >> byte array obtained by reading the file in. >> >> The File constructor uses the following code to read the file: >> >> byte[] bytes = null; >> ByteArrayOutputStream baos = new ByteArrayOutputStream(); >> FileInputStream fis = new FileInputStream(ABIFile); >> BufferedInputStream bis = new BufferedInputStream(fis); >> int b; >> while ((b = bis.read()) >= 0) >> { >> baos.write(b); >> } >> bis.close(); fis.close(); baos.close(); >> bytes = baos.toByteArray(); >> >> If the above code produces different results to your byte array when >> reading data from the same file as your code, then something has gone >> wrong with the construction of your byte array. 
>> >> Lastly, a full stack trace would help us pinpoint the line that is >> breaking, and hopefully provide a hint as to what is wrong with the >> contents of the byte array. If you could provide one that would be very >> helpful. >> >> cheers, >> Richard >> >> >> abhi232 at cc.gatech.edu wrote: >> >> >>>>> Hi all, >>>>> I am having a byte array which is having the data from an .ab1 file.The >>>>> biojava library provides a class called as ABITrace which takes as input >>>>> either a byte[] array , a file or a url.If i use the later parameters (the >>>>> file or the url )the program works but if I pass the byte array to the >>>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a >>>>> problem with the ABITrace class or how can I bypass this particular error. >>>>> I am printing the length of the byte array and it comes to 144930...Can >>>>> that cause a problem in my code? >>>>> >>>>> Thanks in advance. >>>>> Abhinav >>>>> _______________________________________________ >>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>> >>>>> >>>>> > > >> Yes I looked at the file ABITrace and found out that the first three >> characters must be ABI or the 128-130 characters must be ABI.But I >> cannot find that in the file that I am having.Also If this is not the >> case then there should be an illegal format exception whereas I am >> arrayIndexOutOfBound Exception which is also weird. >> I am getting the following stack trace. 
>> The bytes that i want are:0 >> The bytes that i want are:11 >> The bytes that i want are:0 >> The size of the byte array generated is:144930 >> Byte array also recieved >> java.lang.ArrayIndexOutOfBoundsException: 128 >> at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552) >> at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289) >> at org.biojava.bio.program.abi.ABITrace.(ABITrace.java:136) >> at Trace.init(Trace.java:138) >> at sun.applet.AppletPanel.run(Unknown Source) >> at java.lang.Thread.run(Unknown Source) >> The bytes I want are the first three bytes that I want to check if my >> file is ABI or not.I checked the isABI function as well it returns true >> or false value and not arrayIndexOutOfBouond . Also the number 128 does >> it hve any significance in this case? >> Thanks in advance >> Abhinav >> > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG > T1Z8W0JhCe7+hx5rbKLGqVk= > =qNcr > -----END PGP SIGNATURE----- > Ok Yes here is the code that i am using .I establish a connection with a php page which in turn reads the file and prints the content back to me.I am using DataOutputStream for sending data and BufferedReader for taking in the data.Then I am reading the data into a string and converting it to byte[] array . this the code where the connection is estableshed and the data is taken and displayed. 
private HttpURLConnection httpConn; private DataOutputStream out; private DataInputStream temp_stream; private BufferedReader in; private BufferedInputStream in_buff_stream; private String str ; private byte[] bytearray; Chromatogram abif_chromatogram; /** Creates a new instance of testPost */ public testPost() { httpConn = null; str = new String(""); bytearray = new byte[144930]; } public byte[] create_and_write_Connection(String url,String data_request) { try { URL conn_url = new URL(url); httpConn = (HttpURLConnection)conn_url.openConnection(); httpConn.setDoOutput(true); httpConn.setDoInput(true); httpConn.setRequestMethod("POST"); out=new DataOutputStream(httpConn.getOutputStream()); out.writeBytes(data_request); out.flush(); System.out.println("Connection established successfully and data written"); InputStreamReader in_stream = new InputStreamReader(httpConn.getInputStream()); System.out.println("The character encoding used is:"+ in_stream.getEncoding()); in = new BufferedReader(in_stream); System.out.println("Data acceptance started"); while(in.readLine()!=null) { str += in.readLine(); } System.out.println("The string to be returned is:"+str); bytearray = str.getBytes("ISO8859-1"); String temp_string = new String(bytearray,"windows-1252"); System.out.println("The encoded string is as follows:"+ temp_string); System.out.println("The size of byte array inside testpost is:"+ Array.getLength(bytearray)); for(int i = 0 ; i < 3 ; i ++) System.out.println("The bytes that i want are:"+ bytearray[i]); return bytearray; } catch(Exception e) { e.printStackTrace(); } return bytearray; } Please guide me on this point Thanks Abhinav From holland at ebi.ac.uk Tue Nov 6 17:05:12 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 06 Nov 2007 17:05:12 +0000 Subject: [Biojava-l] Error while reading byte data for creating a Trace. 
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk>
	<4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk>
	<4730AC56.9060808@cc.gatech.edu>
Message-ID: <47309EC8.2070904@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The String is where you're going wrong. ABI files are not Stringifyable - they are binary data. Converting them to a String will corrupt them.

cheers,
Richard

abhinav wrote:
> Richard Holland wrote:
> I think that either the file is at fault, or the method you are using to
> read the file into Java is at fault.
>
> Could you provide us with the complete piece of code you are using from
> the point where you read the file into the array through to the point
> where you generate the output you quoted? (Not as an attachment as the
> mailing list will strip those - simply paste it into the message body
> instead).
>
> cheers,
> Richard
>
>
> abhinav wrote:
>
>>>> Richard Holland wrote:
>>>> I suspect the byte array itself may contain inaccurate data.
>>>>
>>>> Internally, both the URL and File constructors read the data into a byte
>>>> array and then pass it to the same method as is used by the byte[]
>>>> constructor.
>>>>
>>>> So, something must be different between the byte array you have, and the
>>>> byte array obtained by reading the file in.
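[Editorial sketch, not part of the original thread.] To illustrate the point in this reply that binary data should not pass through a String: the example below copies a stream byte for byte and, for contrast, pushes the same bytes through a line-oriented text reader. It is a minimal sketch, not BioJava code; the class and method names (BinaryRead, readAllBytes, viaLines) and the sample bytes are made up for illustration. Note also that the posted loop calls readLine() twice per iteration, so it would additionally discard every other line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class BinaryRead {

    /** Copies the stream byte for byte; no character decoding is involved. */
    static byte[] readAllBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** For contrast: the text route. readLine() strips every CR and LF byte it meets. */
    static byte[] viaLines(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "ISO-8859-1"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line);
        }
        return sb.toString().getBytes("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample bytes, not a real .ab1 file.
        byte[] data = {'A', 'B', 'I', 'F', 0x0A, 0x00, 0x0D, (byte) 0xFF};
        byte[] raw = readAllBytes(new ByteArrayInputStream(data));
        byte[] text = viaLines(new ByteArrayInputStream(data));
        // prints "raw: 8 bytes, via lines: 6 bytes"
        System.out.println("raw: " + raw.length + " bytes, via lines: " + text.length + " bytes");
    }
}
```

In the posted code, the equivalent change would presumably be to pass httpConn.getInputStream() straight to a byte-copying loop like readAllBytes and hand the result to the ABITrace byte[] constructor mentioned earlier in the thread. That also assumes the PHP page sends the file's raw bytes unchanged, which the thread does not show.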
>>>> >>>> The File constructor uses the following code to read the file: >>>> >>>> byte[] bytes = null; >>>> ByteArrayOutputStream baos = new ByteArrayOutputStream(); >>>> FileInputStream fis = new FileInputStream(ABIFile); >>>> BufferedInputStream bis = new BufferedInputStream(fis); >>>> int b; >>>> while ((b = bis.read()) >= 0) >>>> { >>>> baos.write(b); >>>> } >>>> bis.close(); fis.close(); baos.close(); >>>> bytes = baos.toByteArray(); >>>> >>>> If the above code produces different results to your byte array when >>>> reading data from the same file as your code, then something has gone >>>> wrong with the construction of your byte array. >>>> >>>> Lastly, a full stack trace would help us pinpoint the line that is >>>> breaking, and hopefully provide a hint as to what is wrong with the >>>> contents of the byte array. If you could provide one that would be very >>>> helpful. >>>> >>>> cheers, >>>> Richard >>>> >>>> >>>> abhi232 at cc.gatech.edu wrote: >>>> >>>> >>>>>>> Hi all, >>>>>>> I am having a byte array which is having the data from an .ab1 file.The >>>>>>> biojava library provides a class called as ABITrace which takes as input >>>>>>> either a byte[] array , a file or a url.If i use the later parameters (the >>>>>>> file or the url )the program works but if I pass the byte array to the >>>>>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a >>>>>>> problem with the ABITrace class or how can I bypass this particular error. >>>>>>> I am printing the length of the byte array and it comes to 144930...Can >>>>>>> that cause a problem in my code? >>>>>>> >>>>>>> Thanks in advance. 
>>>>>>> Abhinav >>>>>>> _______________________________________________ >>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>> >>>>>>> >>>>>>> > > >>>> Yes I looked at the file ABITrace and found out that the first three >>>> characters must be ABI or the 128-130 characters must be ABI.But I >>>> cannot find that in the file that I am having.Also If this is not the >>>> case then there should be an illegal format exception whereas I am >>>> arrayIndexOutOfBound Exception which is also weird. >>>> I am getting the following stack trace. >>>> The bytes that i want are:0 >>>> The bytes that i want are:11 >>>> The bytes that i want are:0 >>>> The size of the byte array generated is:144930 >>>> Byte array also recieved >>>> java.lang.ArrayIndexOutOfBoundsException: 128 >>>> at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552) >>>> at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289) >>>> at org.biojava.bio.program.abi.ABITrace.(ABITrace.java:136) >>>> at Trace.init(Trace.java:138) >>>> at sun.applet.AppletPanel.run(Unknown Source) >>>> at java.lang.Thread.run(Unknown Source) >>>> The bytes I want are the first three bytes that I want to check if my >>>> file is ABI or not.I checked the isABI function as well it returns true >>>> or false value and not arrayIndexOutOfBouond . Also the number 128 does >>>> it hve any significance in this case? >>>> Thanks in advance >>>> Abhinav >>>> > > Ok Yes here is the code that i am using .I establish a connection with a > php page which in turn reads the file and prints the content back to > me.I am using DataOutputStream for sending data and BufferedReader for > taking in the data.Then I am reading the data into a string and > converting it to byte[] array . this the code where the connection is > estableshed and the data is taken and displayed. 
> private HttpURLConnection httpConn; > private DataOutputStream out; > private DataInputStream temp_stream; > private BufferedReader in; > private BufferedInputStream in_buff_stream; > private String str ; > private byte[] bytearray; > Chromatogram abif_chromatogram; > /** Creates a new instance of testPost */ > public testPost() > { > httpConn = null; > str = new String(""); > bytearray = new byte[144930]; > } > public byte[] create_and_write_Connection(String url,String > data_request) > { > try > { > URL conn_url = new URL(url); > httpConn = (HttpURLConnection)conn_url.openConnection(); > httpConn.setDoOutput(true); > httpConn.setDoInput(true); > httpConn.setRequestMethod("POST"); > out=new DataOutputStream(httpConn.getOutputStream()); > out.writeBytes(data_request); > out.flush(); > System.out.println("Connection established successfully and > data written"); > InputStreamReader in_stream = new > InputStreamReader(httpConn.getInputStream()); > System.out.println("The character encoding used is:"+ > in_stream.getEncoding()); > in = new BufferedReader(in_stream); > System.out.println("Data acceptance started"); > while(in.readLine()!=null) > { > str += in.readLine(); > } > System.out.println("The string to be returned is:"+str); > bytearray = str.getBytes("ISO8859-1"); > String temp_string = new String(bytearray,"windows-1252"); > System.out.println("The encoded string is as follows:"+ > temp_string); > System.out.println("The size of byte array inside testpost > is:"+ Array.getLength(bytearray)); > for(int i = 0 ; i < 3 ; i ++) > System.out.println("The bytes that i want are:"+ > bytearray[i]); > return bytearray; > } > catch(Exception e) > { > e.printStackTrace(); > } > return bytearray; > } > Please guide me on this point > Thanks > Abhinav -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMJ7I4C5LeMEKA/QRAupLAJ9YDoGohk5uZSNYZnRRMJ5WeNDpGgCfdCyg +Z/gXBbPmrG3SuQlfeHuD3A= 
=akSf -----END PGP SIGNATURE----- From abhi232 at cc.gatech.edu Tue Nov 6 17:40:01 2007 From: abhi232 at cc.gatech.edu (abhinav) Date: Tue, 06 Nov 2007 11:40:01 -0600 Subject: [Biojava-l] Error while reading byte data for creating a Trace. In-Reply-To: <47303ECF.4020806@ebi.ac.uk> References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk> <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu> <47303ECF.4020806@ebi.ac.uk> Message-ID: <4730A6F1.9050407@cc.gatech.edu> Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I suspect the byte array itself may contain inaccurate data. > > Internally, both the URL and File constructors read the data into a byte > array and then pass it to the same method as is used by the byte[] > constructor. > > So, something must be different between the byte array you have, and the > byte array obtained by reading the file in. > > The File constructor uses the following code to read the file: > > byte[] bytes = null; > ByteArrayOutputStream baos = new ByteArrayOutputStream(); > FileInputStream fis = new FileInputStream(ABIFile); > BufferedInputStream bis = new BufferedInputStream(fis); > int b; > while ((b = bis.read()) >= 0) > { > baos.write(b); > } > bis.close(); fis.close(); baos.close(); > bytes = baos.toByteArray(); > > If the above code produces different results to your byte array when > reading data from the same file as your code, then something has gone > wrong with the construction of your byte array. > > Lastly, a full stack trace would help us pinpoint the line that is > breaking, and hopefully provide a hint as to what is wrong with the > contents of the byte array. If you could provide one that would be very > helpful. 
> > cheers, > Richard > > > abhi232 at cc.gatech.edu wrote: > >> Hi all, >> I am having a byte array which is having the data from an .ab1 file.The >> biojava library provides a class called as ABITrace which takes as input >> either a byte[] array , a file or a url.If i use the later parameters (the >> file or the url )the program works but if I pass the byte array to the >> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a >> problem with the ABITrace class or how can I bypass this particular error. >> I am printing the length of the byte array and it comes to 144930...Can >> that cause a problem in my code? >> >> Thanks in advance. >> Abhinav >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMD7P4C5LeMEKA/QRAmGIAJ9a/V6nZqMROz3H4u69ECQ+9iTgMgCeNZvr > oe52S3khmTvi5BFCL1W4KHM= > =5JAO > -----END PGP SIGNATURE----- > Yes I looked at the file ABITrace and found out that the first three characters must be ABI or the 128-130 characters must be ABI.But I cannot find that in the file that I am having.Also If this is not the case then there should be an illegal format exception whereas I am arrayIndexOutOfBound Exception which is also weird. I am getting the following stack trace. 
The bytes that i want are:0 The bytes that i want are:11 The bytes that i want are:0 The size of the byte array generated is:144930 Byte array also recieved java.lang.ArrayIndexOutOfBoundsException: 128 at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552) at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289) at org.biojava.bio.program.abi.ABITrace.(ABITrace.java:136) at Trace.init(Trace.java:138) at sun.applet.AppletPanel.run(Unknown Source) at java.lang.Thread.run(Unknown Source) The bytes I want are the first three bytes that I want to check if my file is ABI or not.I checked the isABI function as well it returns true or false value and not arrayIndexOutOfBouond . Also the number 128 does it hve any significance in this case? Thanks in advance Abhinav From walsh at andrew.cmu.edu Tue Nov 6 17:23:36 2007 From: walsh at andrew.cmu.edu (Andrew Walsh) Date: Tue, 06 Nov 2007 12:23:36 -0500 Subject: [Biojava-l] Error while reading byte data for creating a Trace. In-Reply-To: <4730AC56.9060808@cc.gatech.edu> References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk> <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu> <47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu> <47309C22.10803@ebi.ac.uk> <4730AC56.9060808@cc.gatech.edu> Message-ID: <4730A318.8010406@andrew.cmu.edu> You also appear to be losing every other line with the following code: while(in.readLine()!=null) { str += in.readLine(); } Every time the while statement checks its condition, a line is read from the inputstream. That line is never stored. Then, if the condition is met, another line is read and that line is added to your String. -Andy abhinav wrote: > Richard Holland wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> I think that either the file is at fault, or the method you are using to >> read the file into Java is at fault. 
>> >> Could you provide us with the complete piece of code you are using from >> the point where you read the file into the array through to the point >> where you generate the output you quoted? (Not as an attachment as the >> mailing list will strip those - simply paste it into the message body >> instead). >> >> cheers, >> Richard >> >> >> abhinav wrote: >> >> >>> Richard Holland wrote: >>> I suspect the byte array itself may contain inaccurate data. >>> >>> Internally, both the URL and File constructors read the data into a byte >>> array and then pass it to the same method as is used by the byte[] >>> constructor. >>> >>> So, something must be different between the byte array you have, and the >>> byte array obtained by reading the file in. >>> >>> The File constructor uses the following code to read the file: >>> >>> byte[] bytes = null; >>> ByteArrayOutputStream baos = new ByteArrayOutputStream(); >>> FileInputStream fis = new FileInputStream(ABIFile); >>> BufferedInputStream bis = new BufferedInputStream(fis); >>> int b; >>> while ((b = bis.read()) >= 0) >>> { >>> baos.write(b); >>> } >>> bis.close(); fis.close(); baos.close(); >>> bytes = baos.toByteArray(); >>> >>> If the above code produces different results to your byte array when >>> reading data from the same file as your code, then something has gone >>> wrong with the construction of your byte array. >>> >>> Lastly, a full stack trace would help us pinpoint the line that is >>> breaking, and hopefully provide a hint as to what is wrong with the >>> contents of the byte array. If you could provide one that would be very >>> helpful. 
>>> >>> cheers, >>> Richard >>> >>> >>> abhi232 at cc.gatech.edu wrote: >>> >>> >>> >>>>>> Hi all, >>>>>> I am having a byte array which is having the data from an .ab1 file.The >>>>>> biojava library provides a class called as ABITrace which takes as input >>>>>> either a byte[] array , a file or a url.If i use the later parameters (the >>>>>> file or the url )the program works but if I pass the byte array to the >>>>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a >>>>>> problem with the ABITrace class or how can I bypass this particular error. >>>>>> I am printing the length of the byte array and it comes to 144930...Can >>>>>> that cause a problem in my code? >>>>>> >>>>>> Thanks in advance. >>>>>> Abhinav >>>>>> _______________________________________________ >>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>> >>>>>> >>>>>> >>>>>> >> >> >>> Yes I looked at the file ABITrace and found out that the first three >>> characters must be ABI or the 128-130 characters must be ABI.But I >>> cannot find that in the file that I am having.Also If this is not the >>> case then there should be an illegal format exception whereas I am >>> arrayIndexOutOfBound Exception which is also weird. >>> I am getting the following stack trace. 
>>> The bytes that i want are:0 >>> The bytes that i want are:11 >>> The bytes that i want are:0 >>> The size of the byte array generated is:144930 >>> Byte array also recieved >>> java.lang.ArrayIndexOutOfBoundsException: 128 >>> at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552) >>> at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289) >>> at org.biojava.bio.program.abi.ABITrace.(ABITrace.java:136) >>> at Trace.init(Trace.java:138) >>> at sun.applet.AppletPanel.run(Unknown Source) >>> at java.lang.Thread.run(Unknown Source) >>> The bytes I want are the first three bytes that I want to check if my >>> file is ABI or not.I checked the isABI function as well it returns true >>> or false value and not arrayIndexOutOfBouond . Also the number 128 does >>> it hve any significance in this case? >>> Thanks in advance >>> Abhinav >>> >>> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2.2 (GNU/Linux) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >> >> iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG >> T1Z8W0JhCe7+hx5rbKLGqVk= >> =qNcr >> -----END PGP SIGNATURE----- >> >> > Ok Yes here is the code that i am using .I establish a connection with a > php page which in turn reads the file and prints the content back to > me.I am using DataOutputStream for sending data and BufferedReader for > taking in the data.Then I am reading the data into a string and > converting it to byte[] array . this the code where the connection is > estableshed and the data is taken and displayed. 
> > > > private HttpURLConnection httpConn; > private DataOutputStream out; > private DataInputStream temp_stream; > private BufferedReader in; > private BufferedInputStream in_buff_stream; > private String str ; > private byte[] bytearray; > Chromatogram abif_chromatogram; > > /** Creates a new instance of testPost */ > public testPost() > { > > httpConn = null; > str = new String(""); > bytearray = new byte[144930]; > > } > public byte[] create_and_write_Connection(String url,String > data_request) > { > try > { > URL conn_url = new URL(url); > httpConn = (HttpURLConnection)conn_url.openConnection(); > httpConn.setDoOutput(true); > httpConn.setDoInput(true); > httpConn.setRequestMethod("POST"); > out=new DataOutputStream(httpConn.getOutputStream()); > out.writeBytes(data_request); > out.flush(); > System.out.println("Connection established successfully and > data written"); > InputStreamReader in_stream = new > InputStreamReader(httpConn.getInputStream()); > > System.out.println("The character encoding used is:"+ > in_stream.getEncoding()); > in = new BufferedReader(in_stream); > > > System.out.println("Data acceptance started"); > > > while(in.readLine()!=null) > { > str += in.readLine(); > } > System.out.println("The string to be returned is:"+str); > bytearray = str.getBytes("ISO8859-1"); > String temp_string = new String(bytearray,"windows-1252"); > System.out.println("The encoded string is as follows:"+ > temp_string); > System.out.println("The size of byte array inside testpost > is:"+ Array.getLength(bytearray)); > for(int i = 0 ; i < 3 ; i ++) > System.out.println("The bytes that i want are:"+ > bytearray[i]); > return bytearray; > } > catch(Exception e) > { > e.printStackTrace(); > } > return bytearray; > } > Please guide me on this point > Thanks > Abhinav > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From holland at ebi.ac.uk Thu 
Nov 8 13:53:09 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 13:53:09 +0000 Subject: [Biojava-l] BioJava 3 Proposals Message-ID: <473314C5.8070207@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Dear BioJava users, The BioJava developers are considering options for the future development of the BioJava toolkit. We consider that it needs improvement in a few major areas to make it easier to use and understand, and also faster and more scalable. The options are to either rewrite large parts of the existing code, working within the existing interfaces and paradigms, or to develop a new set of BioJava packages from the ground up in order to take advantage of lessons learned from the design patterns of the existing code. The BioJava developers have spent the last couple of months discussing ideas and proposals related to these options on a Wiki page, and would now like to open this discussion to all users of BioJava and the bioinformatics community in general. We would like to invite anyone who has any ideas or suggestions to contribute these to the Wiki page, and/or to comment on the ideas and suggestions that have already been posted there. Here is a link to the Wiki page, and also a link to the associated Talk page where much of the discussion has taken place so far: http://biojava.org/wiki/BioJava3_Proposal http://biojava.org/wiki/Talk:BioJava3_Proposal It is our intention to leave the discussion open until mid-January 2008 when we will summarise it and use it as the basis of a plan of action. We will then distribute the summary and the action plan via the BioJava website. We look forward to hearing your comments and ideas. Please do remember to make them directly to the Wiki page so that they are preserved in context, making it easier for us to summarise them later! cheers, Richard (on behalf of all BioJava developers) PS. Just to reassure you, this is NOT a plan to drop the existing codebase. 
It will continue to exist, but the outcome of these discussions will determine whether we will continue to develop and support it or start afresh with a clean slate and a new codebase. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMxTE4C5LeMEKA/QRAlGSAJwKzO0oAe3T2e8ibcG8uRReOVfh7wCdGlwn JkcVzA55Ye32o8Ry48LO+04= =oaaC -----END PGP SIGNATURE----- From holland at ebi.ac.uk Thu Nov 8 13:58:23 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 13:58:23 +0000 Subject: [Biojava-l] Biojava wiki Message-ID: <473315FF.70506@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 what's happened to the biojava wiki today? i get errors from all pages, including the front page, indicating zero-sized replies. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62 6KkoyXhahrWkZo2OWyCL+Uk= =1jK7 -----END PGP SIGNATURE----- From phidias51 at gmail.com Thu Nov 8 15:39:29 2007 From: phidias51 at gmail.com (Mark Fortner) Date: Thu, 8 Nov 2007 07:39:29 -0800 Subject: [Biojava-l] Biojava wiki In-Reply-To: <473315FF.70506@ebi.ac.uk> References: <473315FF.70506@ebi.ac.uk> Message-ID: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com> Richard, That's odd. It comes up fine for me. BTW, in your proposal you mentioned that people had "moved on". I was wondering what types of tasks they had moved on to, and what should be included in the Proposal to insure that BioJava stays relevant to them? Regards, Mark On Nov 8, 2007 5:58 AM, Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > what's happened to the biojava wiki today? i get errors from all pages, > including the front page, indicating zero-sized replies. 
> -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62 > 6KkoyXhahrWkZo2OWyCL+Uk= > =1jK7 > -----END PGP SIGNATURE----- > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From hlapp at gmx.net Thu Nov 8 15:53:03 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 10:53:03 -0500 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <762277.43372.qm@web26507.mail.ukl.yahoo.com> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> Message-ID: Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we explicitly lowercase the value found for alphabet, and the comment says why: # Note: Biojava uses upper-case terms for alphabet, so we # need to change to all-lower in case the sequence was # manipulated by Biojava. $obj->alphabet(lc($rows->[3])) if $rows->[3]; However, when inserting sequences, we leave the value as is in BioPerl (which is lowercase), leading to a potential problem for Biojava upon retrieval. Do the Biojava folks deal with that? Should this may harmonized across the board? -hilmar On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: > Dear Peter, > > All the alphabet are "DNA" (upper case) in my database. The > sequences are taken from NCBI by a BioJava application. > Thus is should be that BioJava inserts the records with "DNA". Thus > no potential "hidden bug" in BioPython. > > Maybe a point to share with the Open-Bio committee. > > Eric > > ----- Message d'origine ---- > De : Peter > ? : Eric Gibert > Cc : biopython at lists.open-bio.org > Envoy? 
le : Jeudi, 8 Novembre 2007, 19h40mn 00s > Objet : Re: [BioPython] small "bug" correction in package BioSql > > Eric Gibert wrote: >> Dear all, >> >> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >> function: >> >> ... >> >> please note my correction: force moltype to be turn in lower case as >> my database has upper case value! this raises the "Unknown moltype" >> error. > > Hi Eric, I've made your suggested change in CVS, > biopython/BioSQL/BioSeq.py revision 1.13, thank you. > > I would encourage you to investigate why some of the "alphabet" fields > in the biosequence table are in upper case. There could be a bug > elsewhere which is writing these entries with the wrong alphabet. Is > this affecting all entries, or just some? > > Peter > > > > > > > > > ______________________________________________________________________ > _______ > Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers > Yahoo! Mail > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From holland at ebi.ac.uk Thu Nov 8 16:17:25 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 16:17:25 +0000 Subject: [Biojava-l] Biojava wiki In-Reply-To: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com> References: <473315FF.70506@ebi.ac.uk> <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com> Message-ID: <47333695.40808@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > BTW, in your proposal you mentioned that people had "moved on". I was > wondering what types of tasks they had moved on to, and what should be > included in the Proposal to insure that BioJava stays relevant to them? Good point. 
From what we can tell, people are not so sequence-focused any more but are more interested in features, alignments, population data, etc. - more 'metadata' so to speak. We do need some mechanism to ensure that we are correct in this thinking, and that future shifts in direction are catered for in this design phase. Could you add a note to the wiki with your points, and/or any ideas you may have about ensuring these requirements are met? cheers, Richard > Regards, > > Mark > > On Nov 8, 2007 5:58 AM, Richard Holland wrote: > > what's happened to the biojava wiki today? i get errors from all pages, > including the front page, indicating zero-sized replies. _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMzaV4C5LeMEKA/QRAoPUAJ0TQ+xFF1J3EtZgHmvYj2HH41koCgCeLYm0 D5Z7SJDWjvJ9rbCrS+RTEeI= =XhE1 -----END PGP SIGNATURE----- From holland at ebi.ac.uk Thu Nov 8 16:18:46 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 08 Nov 2007 16:18:46 +0000 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> Message-ID: <473336E6.6000100@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 we do need a consensus here. I'm happy to go with whatever value is chosen, as the BioJava code can easily be modified to suit. cheers, Richard Hilmar Lapp wrote: > Indeed Biojava uses uppercase for alphabet. 
In Bioperl-db, we > explicitly lowercase the value found for alphabet, and the comment > says why: > > # Note: Biojava uses upper-case terms for alphabet, so we > # need to change to all-lower in case the sequence was > # manipulated by Biojava. > $obj->alphabet(lc($rows->[3])) if $rows->[3]; > > However, when inserting sequences, we leave the value as is in > BioPerl (which is lowercase), leading to a potential problem for > Biojava upon retrieval. Do the Biojava folks deal with that? Should > this may harmonized across the board? > > -hilmar > > On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: > >> Dear Peter, >> >> All the alphabet are "DNA" (upper case) in my database. The >> sequences are taken from NCBI by a BioJava application. >> Thus is should be that BioJava inserts the records with "DNA". Thus >> no potential "hidden bug" in BioPython. >> >> Maybe a point to share with the Open-Bio committee. >> >> Eric >> >> ----- Message d'origine ---- >> De : Peter >> ? : Eric Gibert >> Cc : biopython at lists.open-bio.org >> Envoy? le : Jeudi, 8 Novembre 2007, 19h40mn 00s >> Objet : Re: [BioPython] small "bug" correction in package BioSql >> >> Eric Gibert wrote: >>> Dear all, >>> >>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>> function: >>> >>> ... >>> >>> please note my correction: force moltype to be turn in lower case as >>> my database has upper case value! this raises the "Unknown moltype" >>> error. >> Hi Eric, I've made your suggested change in CVS, >> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >> >> I would encourage you to investigate why some of the "alphabet" fields >> in the biosequence table are in upper case. There could be a bug >> elsewhere which is writing these entries with the wrong alphabet. Is >> this affecting all entries, or just some? >> >> Peter >> >> >> >> >> >> >> >> >> ______________________________________________________________________ >> _______ >> Ne gardez plus qu'une seule adresse mail ! 
Copiez vos mails vers >> Yahoo! Mail >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3 9x+CUHig3GfBCZ56rDb1ZG4= =OJyB -----END PGP SIGNATURE----- From hlapp at gmx.net Thu Nov 8 20:28:19 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 15:28:19 -0500 Subject: [Biojava-l] [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> Message-ID: Maybe we need to hold some mini-hackathon to make the different toolkits compatible in how they map annotation to the schema. Obviously I don't know whether you have the latest Biojava setup here, but I'll just comment how BioPerl/Bioperl-db would map this: 'ORIGIN' - if I'm not mistaken this is only a token that introduces the actual sequence. I'm not sure what Biojava is storing as value here. 'DIVISION' - this maps to column division in table bioentry (though I agree that if perfectly following the weak typing principle this should be tag/value association, but at present it's still an actual column) 'genbank_accessions' - secondary accession numbers indeed go into the qualifier value table. The primary accession maps to column accession in table bioentry 'TITLE' - this is part of a publication reference, and should map to column title in table reference (which it does in bioperl-db) 'cross_references' - not sure where these would be coming from in GenBank format; for EMBL this will map to the dbxref table 'data_file_division' - not sure what this is (same as DIVISION?) 
'VERSION' - in BioPerl we parse this apart into a version for the accession (which is column version in table bioentry) and the GI number, which maps to column identifier in table bioentry
'references' - these map to table reference (and bioentry_reference for association with the bioentry)
'KEYWORDS' - indeed these map to bioentry_qualifier_value
'GI' - maps to column identifier in table bioentry
'SIZE' - not sure what size that is. If it is the length of the sequence, it should (and in BioPerl/bioperl-db does) map to column length in table biosequence
'DEFINITION' - maps to column description in table bioentry
'REFERENCE' - should be the same as for 'references'
'MDAT' - not sure what this is
'ORGANISM' - this is the organism and maps to the table taxon (and taxon_name), with a foreign key in bioentry pointing to the taxon
'JOURNAL' - this is part of a reference, see 'references'
'ACCESSION' - the primary accession, maps to column accession in table bioentry
'LOCUS' - in the file itself this is an entire line consisting of multiple fields; BioPerl/bioperl-db maps the locus name (the first token after the literal token LOCUS) to column name in table bioentry
'SOURCE' - this is the organism, see 'ORGANISM'
'PUBMED' - this is part of a literature reference, and maps to a foreign key in the reference table (reference.dbxref) to a dbxref entry with PUBMED or PMID as the database and the pubmed ID as the accession
'AUTHORS' - part of a literature reference, maps to column authors in table reference
'TYPE' - not sure what this is. If it's the alphabet, it maps to table biosequence, column alphabet
'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value, though there have been plans to make it a column in table biosequence. Note that this could in fact be the way Biojava stores it too, but upon retrieval represents it in the way you are seeing it.
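
To make the 'VERSION' item in the list above a bit more concrete, here is a rough Java sketch of splitting such a value into the three bioentry columns mentioned (accession, version, identifier). This is only an illustration and is not taken from BioPerl, bioperl-db or BioJava: the class name GenBankVersion, the regular expression, and the GI number in the example are all made up, and it assumes the value looks like "<accession>.<version>  GI:<number>" with the GI part optional.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of any Bio* toolkit).
class GenBankVersion {

    // Assumed layout of the VERSION value: "<accession>.<version>  GI:<number>".
    // The GI part is treated as optional.
    private static final Pattern VERSION_VALUE =
            Pattern.compile("^\\s*([A-Za-z0-9_]+)\\.(\\d+)(?:\\s+GI:(\\d+))?\\s*$");

    final String accession;   // would go to bioentry.accession
    final int version;        // would go to bioentry.version
    final String identifier;  // GI number, would go to bioentry.identifier; null if absent

    GenBankVersion(String accession, int version, String identifier) {
        this.accession = accession;
        this.version = version;
        this.identifier = identifier;
    }

    // Split a VERSION value into accession, version and (optional) GI number.
    static GenBankVersion parse(String value) {
        Matcher m = VERSION_VALUE.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised VERSION value: " + value);
        }
        return new GenBankVersion(m.group(1), Integer.parseInt(m.group(2)), m.group(3));
    }

    public static void main(String[] args) {
        // The accession is the one from this thread; the GI number is invented.
        GenBankVersion v = parse("AJ459190.1  GI:123456");
        System.out.println("bioentry.accession  = " + v.accession);
        System.out.println("bioentry.version    = " + v.version);
        System.out.println("bioentry.identifier = " + v.identifier);
    }
}
```

Whether each toolkit actually splits the value this way on insert is exactly the kind of thing that would need to be checked when harmonising the mappings.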
Hth, -hilmar On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote: > Dear all, > > When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted > previously by my BioJava application, I have: > > print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys() > > Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION', > 'genbank_accessions', 'TITLE', 'cross_references', > 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI', > 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL', > 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE', > 'CIRCULAR'] > > but a freshly inserted BioSeq by BioPython 1.44 only gives me: > Debug on Seq: EF631597.1 = ['cross_references', 'dates', > 'references', 'gi', 'data_file_division'] > > > Once I look in the table bioentry_qualifier_value > > * 20 records for a Sequence imported by BioJava > * 1 only for a Sequence inserted by BioPython: the date which > should be inserted by "_load_bioentry_date" in BioSQL/Loader.py > > Quite a few annotations missing, no? > > Any idea? > > Eric > > > > > > ______________________________________________________________________ > _______ > Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers > Yahoo! 
Mail > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Nov 8 20:30:29 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 15:30:29 -0500 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <473336E6.6000100@ebi.ac.uk> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> Message-ID: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> It seems BioPerl and Biopython both want (and have traditionally used) lowercase - do you mind going with that for Biojava as well, or alternatively, simply map upon insert/update and retrieve? -hilmar On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >> explicitly lowercase the value found for alphabet, and the comment >> says why: >> >> # Note: Biojava uses upper-case terms for alphabet, so we >> # need to change to all-lower in case the sequence was >> # manipulated by Biojava. >> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >> >> However, when inserting sequences, we leave the value as is in >> BioPerl (which is lowercase), leading to a potential problem for >> Biojava upon retrieval. Do the Biojava folks deal with that? Should >> this may harmonized across the board? >> >> -hilmar >> >> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >> >>> Dear Peter, >>> >>> All the alphabet are "DNA" (upper case) in my database. 
The >>> sequences are taken from NCBI by a BioJava application. >>> Thus is should be that BioJava inserts the records with "DNA". Thus >>> no potential "hidden bug" in BioPython. >>> >>> Maybe a point to share with the Open-Bio committee. >>> >>> Eric >>> >>> ----- Original message ---- >>> From: Peter >>> To: Eric Gibert >>> Cc: biopython at lists.open-bio.org >>> Sent: Thursday, 8 November 2007, 19:40:00 >>> Subject: Re: [BioPython] small "bug" correction in package BioSql >>> >>> Eric Gibert wrote: >>>> Dear all, >>>> >>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>>> function: >>>> >>>> ... >>>> >>>> please note my correction: force moltype to be turn in lower >>>> case as >>>> my database has upper case value! this raises the "Unknown moltype" >>>> error. >>> Hi Eric, I've made your suggested change in CVS, >>> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >>> >>> I would encourage you to investigate why some of the "alphabet" >>> fields >>> in the biosequence table are in upper case. There could be a bug >>> elsewhere which is writing these entries with the wrong >>> alphabet. Is >>> this affecting all entries, or just some? >>> >>> Peter >>>
>>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3 > 9x+CUHig3GfBCZ56rDb1ZG4= > =OJyB > -----END PGP SIGNATURE----- -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From holland at ebi.ac.uk Fri Nov 9 08:39:01 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 09 Nov 2007 08:39:01 +0000 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> Message-ID: <47341CA5.9080509@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I'll see what I can do. Hilmar Lapp wrote: > It seems BioPerl and Biopython both want (and have traditionally used) > lowercase - do you mind going with that for Biojava as well, or > alternatively, simply map upon insert/update and retrieve? > > -hilmar > > On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >>>> explicitly lowercase the value found for alphabet, and the comment >>>> says why: >>>> >>>> # Note: Biojava uses upper-case terms for alphabet, so we >>>> # need to change to all-lower in case the sequence was >>>> # manipulated by Biojava.
>>>> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >>>> >>>> However, when inserting sequences, we leave the value as is in >>>> BioPerl (which is lowercase), leading to a potential problem for >>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should >>>> this may harmonized across the board? >>>> >>>> -hilmar >>>> >>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >>>> >>>>> Dear Peter, >>>>> >>>>> All the alphabet are "DNA" (upper case) in my database. The >>>>> sequences are taken from NCBI by a BioJava application. >>>>> Thus is should be that BioJava inserts the records with "DNA". Thus >>>>> no potential "hidden bug" in BioPython. >>>>> >>>>> Maybe a point to share with the Open-Bio committee. >>>>> >>>>> Eric >>>>> >>>>> ----- Original message ---- >>>>> From: Peter >>>>> To: Eric Gibert >>>>> Cc: biopython at lists.open-bio.org >>>>> Sent: Thursday, 8 November 2007, 19:40:00 >>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql >>>>> >>>>> Eric Gibert wrote: >>>>>> Dear all, >>>>>> >>>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>>>>> function: >>>>>> >>>>>> ... >>>>>> >>>>>> please note my correction: force moltype to be turn in lower case as >>>>>> my database has upper case value! this raises the "Unknown moltype" >>>>>> error. >>>>> Hi Eric, I've made your suggested change in CVS, >>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >>>>> >>>>> I would encourage you to investigate why some of the "alphabet" fields >>>>> in the biosequence table are in upper case. There could be a bug >>>>> elsewhere which is writing these entries with the wrong alphabet. Is >>>>> this affecting all entries, or just some? >>>>> >>>>> Peter >>>>>
>>>>> _______________________________________________ >>>>> BioPython mailing list - BioPython at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biopython >>>> > --=========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHNByl4C5LeMEKA/QRAmCzAJ9fxSm8l5YAEHAUe2hH+Gwc1Xe5IwCfcMf6 c9sy8lASDV069FQJ79Geemw= =RHM1 -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Nov 9 12:42:38 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 09 Nov 2007 12:42:38 +0000 Subject: [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> Message-ID: <473455BE.6040807@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I did a bit of poking around in our code and internally BioJava represents all the default alphabet names (Protein, DNA, etc.) in upper case. It also allows for mixed case alphabet names. It's not quite as easy as I thought to change these to lower case as they are often referenced by text name, meaning other people's code might break if I change them. Also, as it allows for mixed-case alphabet names, I can't do a toUpper/toLower fudge on persistence to BioSQL, as I wouldn't necessarily get out what I put in! So, I think I'll add this as a point on the recently announced BioJava 3 proposal, that BioSQL interaction must be compliant with standards laid down by the BioSQL project, and that our code will be able to cope with this internally. That brings us back to BioSQL standards - the idea of a mini-hackathon to solve this once and for all is a very good one.
Our previous attempts between BioPerl and BioJava in Singapore were good, but still there are niggles as seen in this thread of discussion. It seems that a schema on its own just isn't enough to make the various projects play nicely, and instructions are needed on exactly how to use that schema if they are truly all going to be able to use it without caring who or what wrote the data that is being read. cheers, Richard Hilmar Lapp wrote: > It seems BioPerl and Biopython both want (and have traditionally used) > lowercase - do you mind going with that for Biojava as well, or > alternatively, simply map upon insert/update and retrieve? > > -hilmar > > On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >>>> explicitly lowercase the value found for alphabet, and the comment >>>> says why: >>>> >>>> # Note: Biojava uses upper-case terms for alphabet, so we >>>> # need to change to all-lower in case the sequence was >>>> # manipulated by Biojava. >>>> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >>>> >>>> However, when inserting sequences, we leave the value as is in >>>> BioPerl (which is lowercase), leading to a potential problem for >>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should >>>> this may harmonized across the board? >>>> >>>> -hilmar >>>> >>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >>>> >>>>> Dear Peter, >>>>> >>>>> All the alphabet are "DNA" (upper case) in my database. The >>>>> sequences are taken from NCBI by a BioJava application. >>>>> Thus is should be that BioJava inserts the records with "DNA". Thus >>>>> no potential "hidden bug" in BioPython. >>>>> >>>>> Maybe a point to share with the Open-Bio committee.
>>>>> >>>>> Eric >>>>> >>>>> ----- Original message ---- >>>>> From: Peter >>>>> To: Eric Gibert >>>>> Cc: biopython at lists.open-bio.org >>>>> Sent: Thursday, 8 November 2007, 19:40:00 >>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql >>>>> >>>>> Eric Gibert wrote: >>>>>> Dear all, >>>>>> >>>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>>>>> function: >>>>>> >>>>>> ... >>>>>> >>>>>> please note my correction: force moltype to be turn in lower case as >>>>>> my database has upper case value! this raises the "Unknown moltype" >>>>>> error. >>>>> Hi Eric, I've made your suggested change in CVS, >>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >>>>> >>>>> I would encourage you to investigate why some of the "alphabet" fields >>>>> in the biosequence table are in upper case. There could be a bug >>>>> elsewhere which is writing these entries with the wrong alphabet. Is >>>>> this affecting all entries, or just some? >>>>> >>>>> Peter >>>>>
>>>>> _______________________________________________ >>>>> BioPython mailing list - BioPython at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biopython >>>> > --=========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHNFW84C5LeMEKA/QRApBiAJ41WqCDKOJhee5NxIsquYaR/ImBRgCfb7zM LX75HHvCUC/v4n3okmUQ+ME= =d6QO -----END PGP SIGNATURE----- From email2ants at gmail.com Fri Nov 9 17:55:36 2007 From: email2ants at gmail.com (Anthony Underwood) Date: Fri, 9 Nov 2007 17:55:36 +0000 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) Message-ID: Hi All, I've generated an alignment and I am retrieving positions within the alignment using Symbol base = alignment.symbolAt(label, i); I am trying to get whether the base at this position is G, A, T or C However when I use base.getName() it returns strings such as "thymine" The documentation states that the method getToken should also be available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html Is there a simple way of retrieving a one letter textual representation of the symbol? Many thanks Anthony From zagato.gekko at gmail.com Fri Nov 9 18:48:02 2007 From: zagato.gekko at gmail.com (Zagato) Date: Fri, 9 Nov 2007 13:48:02 -0500 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) In-Reply-To: References: Message-ID: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com> Try with: String s = alignment.symbolListForLabel( label ).subStr( i, i+1 ); Bye...
Alan Jairo Acosta Cali - Colombia On Nov 9, 2007 12:55 PM, Anthony Underwood wrote: > Hi All, > > I've generated an alignment and I am retrieving positions within the > alignment using > > Symbol base = alignment.symbolAt(label, i); > > I am trying to get whether the base at this position is G, A, T or C > > However when I use base.getName() it returns strings such as "thymine" > > The documentation states that the method getToken should also be > available, but this returns method undefined. > http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html > > Is there a simple way of retrieving a one letter textual > representation of the symbol? > > > Many thanks > > > Anthony > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Farewell. http://www.youtube.com/zagatogekko ruby << __EOF__ puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse __EOF__
From gwaldon at geneinfinity.org Fri Nov 9 18:45:10 2007 From: gwaldon at geneinfinity.org (George Waldon) Date: Fri, 09 Nov 2007 10:45:10 -0800 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) Message-ID: <20071109184510.80580.qmail@mmm1924.dulles19-verio.com> Tokens are associated with alphabets. Get the tokenization from the alphabet using: SymbolTokenization = Alphabet.getTokenization("token"); Get the token from the tokenization using: String = SymbolTokenization.tokenizeSymbol(Symbol); Also, check the tutorial and the cookbook on the biojava web site at www.biojava.org, which are often more informative than the javadoc. Frankly speaking, I agree with you and we should have a method like String = Symbol.getToken(Alphabet,"token"); to do these operations simply and without losing our hair! Best of luck, George > -----Original Message----- > From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l- > bounces at lists.open-bio.org] On Behalf Of Anthony Underwood > Sent: Friday, November 09, 2007 9:56 AM > To: BioJava > Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
> > Hi All, > > I've generated an alignment and I am retrieving positions within the > alignment using > > Symbol base = alignment.symbolAt(label, i); > > I am trying to get whether the base at this position is G, A, T or C > > However when I use base.getName() it returns strings such as "thymine" > > The documentation states that the method getToken should also be > available, but this returns method undefined. > http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html > > Is there a simple way of retrieving a one letter textual > representation of the symbol? > > > Many thanks > > > Anthony > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists From email2ants at gmail.com Fri Nov 9 23:23:01 2007 From: email2ants at gmail.com (Anthony Underwood) Date: Fri, 9 Nov 2007 23:23:01 +0000 Subject: [Biojava-l] Getting a base from an alignment (way to complex?) In-Reply-To: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com> References: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com> Message-ID: <70FC5536-E1B3-41C7-92BC-0B43A0E11E09@gmail.com> Hi Alan, Thanks for the suggestion. That was my first thought, but then I was thinking for amino acids this wouldn't work. I would have to use a hashmap to convert the amino acid to the appropriate single letter code. Hi George, I'll try your suggestion. As you say I think this is too much for something that should be a one liner. Thanks for your advice. Get the tokenization from the alphabet using: SymbolTokenization = Alphabet.getTokenization("token"); Get the token from the tokenization using: String = SymbolTokenization.tokenizeSymbol(Symbol); Thanks to both of you Anthony On 9 Nov 2007, at 18:48, Zagato wrote: > Try with: > String s = alignment.symbolListForLabel( label ).subStr( i, i+1 ); > > Bye... 
> > Alan Jairo Acosta > Cali - Colombia > > On Nov 9, 2007 12:55 PM, Anthony Underwood < email2ants at gmail.com> > wrote: > Hi All, > > I've generated an alignment and I am retrieving positions within the > alignment using > > Symbol base = alignment.symbolAt(label, i); > > I am trying to get whether the base at this position is G, A, T or C > > However when I use base.getName() it returns strings such as "thymine" > > The documentation states that the method getToken should also be > available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html > > Is there a simple way of retrieving a one letter textual > representation of the symbol? > > > Many thanks > > > Anthony > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > Farewell. > http://www.youtube.com/zagatogekko > ruby << __EOF__ > puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse > __EOF__ From hlapp at gmx.net Sat Nov 10 20:38:17 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 10 Nov 2007 15:38:17 -0500 Subject: [Biojava-l] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> Message-ID: <5DDEBCDE-C8DA-4B2C-86F4-47FDB82CADAC@gmx.net> Just a few comments below, specifically where no rows would in fact be what I expect: On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote: > [...] > -------- For you information, I went thru the tables of my BioSQL > database: > [...] > 1) table bioentry: all column populated except for 'taxon_id' which > is NULL > (maybe I need an extra call for populating the 'taxon' table before?) 
Bioperl-db will try to look up (or create if necessary) the taxon from the taxon information attached to the sequence, but for BioPerl we actually recommend to pre-load the database with the NCBI taxonomy, which can be comfortably done with the script load_ncbi_taxonomy.pl that comes with BioSQL. > > 2) table bioentry_dbxref: no data inserted (always empty, even with > BioJava) This would mean that the sequence(s) have no dbxrefs. Note that for GenBank sequences that would be expected, since unfortunately, and unlike EMBL format, GenBank puts the dbxrefs into the feature table. > 3) table bioentry_qualifier_value: > > One entry only, for the 'term_id' = 149, rank = 1, and value = '07- > JUL-2005' > or other 'DD-MMM-YYYY' dates (see my remarks below) Below you say that your term table is empty, so I don't know why you can have value here at all. > [...] > 5) table bioentry_relationships: no entry found (always empty, even > with > BioJava) If you load sequences, they won't have direct relationships to other sequences (except dbxrefs, but those are rather 'pointers' and are stored in their own table). In Bioperl-db, this table is used only if you load sequence clusters through Bio::Cluster objects (such as UniGene). > [...] > 7) table comment: no entry found (always empty, even with BioJava) Again, this is expected with GenBank. AFAIK genbank format doesn't allow for comments at the level of the sequence. You would (i.e., should) find entries here if you load UniProt entries. > 8) table dbxref: some records are generated, for dbname 'PUBMED' > and 'Taxon' > with the correct value Taxon obviously isn't really a dbxref, but rather a taxon (and hence should go into that table). > [...] > 9) table dbxref_qualifier_value: (always empty, even with BioJava) That's almost expected. There's rather few cases where dbxrefs have additional attributes that the language can parse out from a source (and then maps to the schema). > [...] 
> 10) table location: all locations loaded correctly, note that > 'term_id' and > 'dbxref_id' remain NULL for these seq but I have value for other seq. Theoretically, the term_id should point to the term giving the type of the location. If you (or Biopython) are only dealing with simple ('normal') locations, then it's not needed. The dbxref_id gives the reference to the remote sequence if the location for a feature refers to a different sequence than the feature itself does (so-called 'remote locations'). If the sequences you loaded don't have such locations, then this would be expected to be empty (or if Biopython doesn't handle such locations). > 11) table location_qualifier_value: always empty, even with BioJava This is expected if Biopython doesn't support fuzzy locations, or if none of the feature locations that you loaded are fuzzy. > [...] > 13) Table reference: entries correct, note 'dbxref_id' remains NULL > for > these seq but I have value for other seq. It should point to the pubmed ID for the reference but only if there was one. > 14) table seqfeature: entries are there (same as in table 'location'). > FYI:'display_name is always NULL. GenBank doesn't give names to features (and I think EMBL doesn't either), so this is expected. > 15) table seqfeature_dbxref: always empty, even with BioJava That's likely more to do with your language object model than with anything else. dbxref annotation for features is in tag/value pairs, just as any other, so your language (Biopython in this case) will have to do a lot of interpretation to tease out the semantics behind each tag name and based on that decide what to do with the value. Indeed, by default we don't even do this in BioPerl. > [...] > 17) table seqfeature_relationship: always empty, even with BioJava GenBank (and EMBL) feature tables are flat, not hierarchical, so this is expected. > 18) table taxon: always empty, even with BioJava) This is where the organism should go.
> 19) table taxon_name: I have one but not from this test (I tried to > tinker a > little bit with taxon but stopped) That's odd that you can have an entry in taxon_name w/o a corresponding one in taxon. Do you have foreign key checks disabled? > 20) table term: always empty, even with BioJava That's strange, since you say you do have rows in bioentry_qualifier_value, which has an enforced foreign key to term. Did you disable the foreign key checks? > 21) table term_dbxref: always empty, even with BioJava That's expected unless you loaded an ontology whose terms have dbxrefs, and your language object model supports that. > [...] > 23) table term_synonym: always empty, even with BioJava Same as for 21). Your terms would have to have synonyms, and your language object model would have to support those, before you could expect to get anything in here. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From shirleyc at cis.upenn.edu Tue Nov 13 18:45:59 2007 From: shirleyc at cis.upenn.edu (Shirley Cohen) Date: Tue, 13 Nov 2007 13:45:59 -0500 Subject: [Biojava-l] maximum parsimony search Message-ID: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu> Hi BioJava People, I'm looking for existing code that implements a maximum parsimony search in Java. Does BioJava have this functionality? If so, can you point me to the appropriate classes? Thanks, Shirley From bmduggan at yahoo.com Wed Nov 14 00:48:22 2007 From: bmduggan at yahoo.com (Brendan Duggan) Date: Wed, 14 Nov 2007 11:48:22 +1100 (EST) Subject: [Biojava-l] Disulfide information in PDB files Message-ID: <454510.91557.qm@web52705.mail.re2.yahoo.com> Greetings I'm trying to mine some information on disulfides in the PDB and was hoping there might be a way of obtaining this information with the BioJava PDB parser. 
However, I haven't been able to see anything like this mentioned in the API docs. If it is currently not possible to extract disulfide information from PDB files are there any plans to implement this? Thanks! Brendan From holland at ebi.ac.uk Wed Nov 14 08:50:31 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 14 Nov 2007 08:50:31 +0000 Subject: [Biojava-l] maximum parsimony search In-Reply-To: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu> References: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu> Message-ID: <473AB6D7.2010405@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 There is a class currently only available from the head of CVS - ie. it is unreleased yet. To get it you'll need to check out the very latest BioJava source code from CVS. The JavaDoc for the class is here: http://www.spice-3d.org/public-files/javadoc/biojava/org/biojavax/bio/phylo/ParsimonyTreeMethod.html It is designed to take input in the form of blocks of data similar to what you would find in a Nexus file (the Nexus file parsers elsewhere in the org/biojavax/bio/phylo package will provide these). However you could of course create such objects from your own data without needing to read/write any Nexus files. cheers, Richard Shirley Cohen wrote: > Hi BioJava People, > > I'm looking for existing code that implements a maximum parsimony > search in Java. Does BioJava have this functionality? If so, can you > point me to the appropriate classes?
> > Thanks, > > Shirley > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHOrbW4C5LeMEKA/QRAuswAJ9olIwj7DGszOnKORU255YS3m2ohACfbKTw ihjuQVv0j+nlXb+4SL5pIfw= =ldfM -----END PGP SIGNATURE----- From holland at ebi.ac.uk Wed Nov 14 08:55:24 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 14 Nov 2007 08:55:24 +0000 Subject: [Biojava-l] Disulfide information in PDB files In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com> References: <454510.91557.qm@web52705.mail.re2.yahoo.com> Message-ID: <473AB7FC.10403@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Currently this is not parsed - the parser does not read all the tags in the most recent PDB specification. Could you open a bug request at http://bugzilla.open-bio.org/ to formally add this to our to-do list? Thanks! cheers, Richard Brendan Duggan wrote: > Greetings > > I'm trying to mine some information on disulfides in > the PDB and was hoping there might be a way of > obtaining this information with the BioJava PDB > parser. However, I haven't been able to see anything > like this mentioned in the API docs. If it is > currently not possible to extract disulfide > information from PDB files are there any plans to > implement this? > > Thanks! > > Brendan >
> _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHOrf84C5LeMEKA/QRArfeAJ9nCViM2jyVfubIpl5w/1EXMYTv/gCgjVEs zDnxHjv8xJsRBw5pfE2NdkA= =tGqm -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Wed Nov 14 09:32:28 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Wed, 14 Nov 2007 09:32:28 +0000 Subject: [Biojava-l] Disulfide information in PDB files In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com> References: <454510.91557.qm@web52705.mail.re2.yahoo.com> Message-ID: <9B898ADF-78EB-4B5C-A432-98274190815F@sanger.ac.uk> Hi Brendan, SSBOND lines are currently not parsed. If this is what you need, I can add this over the next couple of days. If you want to compute the bonds yourself, the framework can e.g. calculate distances between the sulphur atoms for you. - Andreas On 14 Nov 2007, at 00:48, Brendan Duggan wrote: > Greetings > > I'm trying to mine some information on disulfides in > the PDB and was hoping there might be a way of > obtaining this information with the BioJava PDB > parser. However, I haven't been able to see anything > like this mentioned in the API docs. If it is > currently not possible to extract disulfide > information from PDB files are there any plans to > implement this? > > Thanks! > > Brendan >
> _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From deb at mb.au.dk Thu Nov 15 12:04:02 2007 From: deb at mb.au.dk (Ditlev Egeskov Brodersen) Date: Thu, 15 Nov 2007 13:04:02 +0100 Subject: [Biojava-l] Parsing exising gaps Message-ID: <002701c8277f$9dbdca50$d9395ef0$@au.dk> Dear all, I have managed to read an MSF-formatted alignment from a file selected through FileChooser as follows: BufferedReader br = new BufferedReader(new FileReader(aFileChooser.getSelectedFile())); SimpleAlignment align = (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br); I can now retrieve the sequence names and sequences through the Alignment object: Iterator aLabels = align.getLabels().iterator(); Iterator aSequences = align.symbolListIterator(); However, I now want to be able to translate between real sequence numbers and the positions within each alignment string, i.e. retrieve positions that remove the gaps first (gaps are represented by hyphens '-' in the MSF format). How can I tell BioJava to parse the gaps into a GappedSequence format?
I have tried the following to check what position 15 (past the first gap) translates into: int n = 0; while(aSequences.hasNext()) { SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next(); SimpleGappedSequence aGapped = new SimpleGappedSequence(new SimpleSequence(aSym, "", aLabels.next().toString(), null)); System.out.println(aGapped.gappedToLocation(new PointLocation(15))); } But I only get 15 back out. I have also studied the constructor of the underlying SimpleGappedSymbolList but it simply copies the SymbolList and creates one big block: public SimpleGappedSymbolList(SymbolList source) { this.source = source; this.alpha = source.getAlphabet(); this.blocks = new ArrayList(); this.length = source.length(); Block b = new Block(1, length, 1, length); blocks.add(b); } Is there a way to tell SimpleGappedSequence to parse itself in terms of the gap characters in the sequence string? How is the sequence represented in this case, if not by gaps? Surely the hyphen cannot be a part of the standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it? Best wishes, Ditlev -- Ditlev E. Brodersen, Ph.D. Lektor, Associate Professor Department of Molecular Biology Office: +45 89425259 University of Aarhus Lab: +45 89425022 Gustav Wieds Vej 10c Fax: +45 86123178 DK-8000 Aarhus C Email: deb at mb.au.dk Denmark Lab WWW: www.bioxray.dk/~deb From holland at ebi.ac.uk Thu Nov 15 13:51:48 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 15 Nov 2007 13:51:48 +0000 Subject: [Biojava-l] Parsing exising gaps In-Reply-To: <002701c8277f$9dbdca50$d9395ef0$@au.dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> Message-ID: <473C4EF4.5080301@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I think you've uncovered a number of problems here: 1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do all the other predefined alphabets. 2.
The MSF parser doesn't bother trying to build GappedSequence instances, instead it just builds solid sequences with the gaps as normal symbols. 3. There is no constructor or method for taking a sequence with embedded gap symbols and turning it into a GappedSequence with separate chunks. Combined, these three problems make it impossible to do what you want easily. I will make a note to fix this on the plans for the next BioJava development cycle. In the meantime, your best bet would be to construct a second alignment block by iterating over the alignment block you already have and parsing the locations of the gap symbols. You would create a SimpleGappedSequence intially over the ungapped sequence, then use the insert gap methods to insert the gaps into this ungapped sequence before putting all the SimpleGappedSequence objects together into a new alignment. cheers, Richard Ditlev Egeskov Brodersen wrote: > Dear all, > > > > I have managed to read an MSF-formatted alignment from a file selected > through FileChooser as follows: > > > > BufferedReader br = new BufferedReader(new > FileReader(aFileChooser.getSelectedFile())); > > SimpleAlignment align = > (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br); > > > > I can now retrieve the sequence names and sequences through the Alignment > object: > > > > Iterator aLabels = align.getLabels().iterator(); > > Iterator aSequences = align.symbolListIterator(); > > > > However, I now what to be able to translate between real sequence numbers > and the positions within each alignment string, i.e. retrieve positions that > remove the gaps first (gaps are represented by hyphens '-' in the MSF > format). How can I tell BioJava to parse the gaps into an GappedSequence > format? 
I have tried the following to check what position 15 (past the the > first gap) translates into: > > > > int n = 0; > > while(aSequences.hasNext()) { > > SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next(); > > SimpleGappedSequence aGapped = new SimpleGappedSequence(new > SimpleSequence(aSym, "", aLabels.next().toString(), null)); > > System.out.println(aGapped.gappedToLocation(new PointLocation(15))); > > } > > > > But I only get 15 back out. I have also studied the constructor of the > underlying SimpleGappedSymbolList but it simply copies the SymbolList and > creates one big block: > > > > public SimpleGappedSymbolList(SymbolList source) { > > this.source = source; > > this.alpha = source.getAlphabet(); > > this.blocks = new ArrayList(); > > this.length = source.length(); > > Block b = new Block(1, length, 1, length); > > blocks.add(b); > > } > > > > Is there a way to tell SimpleGappedSequence to parse itself in terms of the > gap characters in the sequence string? How is the sequence represented in > this case, if not by gaps? Surely the hyphen cannot be a part of the > standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it? > > > > Best wishes, > > > > Ditlev > > > > -- > > > > Ditlev E. Brodersen, Ph.D. 
> Lektor, Associate Professor > > > > Department of Molecular Biology Office: +45 89425259 > University of AarhusLab: +45 89425022 > Gustav Wieds Vej 10cFax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHPE704C5LeMEKA/QRAniIAJsGv+5HIP3mCDxBIUdw0SjDrWu8dgCeNviA EsJK4gv+EVY7wc4r6W2A0+I= =wCQs -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Nov 16 08:59:41 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 16 Nov 2007 08:59:41 +0000 Subject: [Biojava-l] Parsing exising gaps In-Reply-To: <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> Message-ID: <473D5BFD.8080305@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Ditlev. After some investigation and some helpful hints from Mark, it turns out that there are methods in DNATools/ProteinTools that can construct proper GappedSymbolList objects out of strings. I have managed to modify the MSF parser to use this instead. This means that the MSF parser will now return instances of GappedSymbolList (actually GappedSequences to be accurate) rather than SimpleSymbolList. Thanks to the way the APIs work this will make no difference to existing users (except those who are depending on being able to cast it to a certain type - which they shouldn't, because the API doesn't guarantee it to be of any type!), but it will fix it for you. 
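[Editorial sketch, not part of the original message.] To make the idea of
"separate chunks" concrete, the following plain-Java fragment splits a
hyphen-gapped string into ungapped runs, each recorded with 1-based source
(ungapped) and view (gapped) coordinates, in the spirit of the
Block(1, length, 1, length) call quoted earlier in the thread. It does not
use BioJava; the class name GapBlocks, its Block type and the parse method
are hypothetical names chosen for illustration, and the real
DNATools/ProteinTools code may work differently.

```java
import java.util.ArrayList;
import java.util.List;

final class GapBlocks {

    /** One ungapped run, with 1-based inclusive source and view coordinates. */
    static final class Block {
        final int sourceStart, sourceEnd, viewStart, viewEnd;

        Block(int sourceStart, int sourceEnd, int viewStart, int viewEnd) {
            this.sourceStart = sourceStart;
            this.sourceEnd = sourceEnd;
            this.viewStart = viewStart;
            this.viewEnd = viewEnd;
        }

        @Override
        public String toString() {
            return "[" + sourceStart + ".." + sourceEnd + " -> "
                    + viewStart + ".." + viewEnd + "]";
        }
    }

    /** Splits a gapped string into its ungapped runs. */
    static List<Block> parse(String gapped, char gap) {
        List<Block> blocks = new ArrayList<>();
        int source = 0;           // residues seen so far (ungapped position)
        int runViewStart = -1;    // view start of the current run, -1 while in a gap
        int runSourceStart = -1;  // source start of the current run
        for (int i = 0; i < gapped.length(); i++) {
            int view = i + 1;     // 1-based alignment column
            if (gapped.charAt(i) != gap) {
                source++;
                if (runViewStart < 0) {
                    runViewStart = view;
                    runSourceStart = source;
                }
            } else if (runViewStart >= 0) {
                // A gap closes the current run.
                blocks.add(new Block(runSourceStart, source, runViewStart, view - 1));
                runViewStart = -1;
            }
        }
        if (runViewStart >= 0) {
            // Close a run that extends to the end of the string.
            blocks.add(new Block(runSourceStart, source, runViewStart, gapped.length()));
        }
        return blocks;
    }
}
```

For the input "AB--CD-E" with gap character '-', parse yields three blocks:
source 1..2 at view 1..2, source 3..4 at view 5..6, and source 5..5 at
view 8..8.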
Future releases will modify the API (or include a completely new MSF
parser) so that it explicitly returns GappedSymbolLists in the API
declarations rather than plain SymbolLists, but I can't do that right now
because it would break existing users' code.

To get the modified parser you will need to check out the very latest
source code from our CVS repository and compile it using ant.
Instructions are on our website at biojava.org if you have not done this
before.

Hope this helps you.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi Richard,
>
> thanks for clarifying this and for your useful suggestion, which I've
> managed to implement as shown below. It works nicely, but I was really
> surprised to learn that biojava hasn't yet implemented a proper parsing of
> gap characters from strings into the object structure, as this seems central
> to any use of pre-aligned sequences. Also, I find it problematic that the
> API implements the gap characters as part of the alphabets. In my view, this
> breaks the logic of the object model because proteins don't really have gaps
> in their sequences.
>
> Rather, the constructor of the Sequence-derived classes ought to throw an
> exception when non-protein characters are passed and should not allow the
> user to create an object with sequence elements that are non-standard.
> Instead, I think there should be a static method that allows cleaning the
> input sequence before passing it to the Sequence constructor. On the other
> hand, the constructor of the GappedSequence-derived classes should recognise
> the gaps and create an object with blocks of legal protein symbols and gaps
> in the appropriate places.
>
> -- Ditlev
>
> // Read MSF file into Alignment object
> BufferedReader br = new BufferedReader(
>     new FileReader(aFileChooser.getSelectedFile()));
> SimpleAlignment align =
>     (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
>
> // Iterate through sequences in turn
> Iterator aSequences = align.symbolListIterator();
> while(aSequences.hasNext()) {
>
>     // Retrieve SymbolList, the associated gap symbol and sequence string
>     SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>     Symbol aGapSymbol = aSym.getAlphabet().getGapSymbol();
>     String aGappedString = aSym.seqString();
>
>     // Prepare non-gapped string
>     String aPlainString = "";
>
>     // Loop through individual symbols and add non-gap characters to string
>     for(int i=1;i<=aSym.length();i++)
>         if(aSym.symbolAt(i) != aGapSymbol)
>             aPlainString += aGappedString.charAt(i-1);
>
>     // Create a new gapped sequence object with the plain (non-gapped) sequence
>     SimpleGappedSequence aGapped =
>         (SimpleGappedSequence)ProteinTools.createGappedProteinSequence(aPlainString, "");
>
>     // Use separate indices for gapped and plain sequences
>     int n = 1;
>
>     // Loop through individual gapped sequence symbols and insert gap into
>     // object when gap symbol is encountered
>     for(int i=1;i<=aSym.length();i++)
>         if(aSym.symbolAt(i) != aGapSymbol)
>             n++;
>         else
>             aGapped.addGapInSource(n);
> }
>
> --
>
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>
> www.lindeman-brodersen.dk

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPVv84C5LeMEKA/QRAn0cAJ9jJUaA3bjiEwlzxaAo/bsN5+CT1QCcCLxS
Rv73CVmtYpEz+apJwM1L3sA=
=UPU6
-----END PGP SIGNATURE-----

From deb at mb.au.dk Fri Nov 16 09:28:40 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 10:28:40 +0100
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk>
Message-ID: <000601c82833$143c5300$3cb4f900$@au.dk>

Hi Richard,

thanks for your super fast reply. I managed to recompile using CVS/ant and
the MSF import now works brilliantly and simply as follows:

    BufferedReader br = new BufferedReader(
        new FileReader(aFileChooser.getSelectedFile()));
    SimpleAlignment align =
        (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

    // Iterate through sequences in turn
    Iterator aSequences = align.symbolListIterator();
    while(aSequences.hasNext()) {

        // Retrieve gapped sequence
        SimpleGappedSequence aGapped = (SimpleGappedSequence)aSequences.next();

        // ...do whatever with each gapped sequence
    }

The returned gapped sequences are all properly set up with gaps, name etc.
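[Editorial sketch, not part of the original message.] The coordinate
translation this thread is about (alignment column to real residue number
and back) can be illustrated without BioJava. The fragment below works on
a plain hyphen-gapped string; the class name GapCoordinates and both method
names are hypothetical, and the choice to return -1 for a gap column is an
assumption made here, not BioJava behaviour.

```java
final class GapCoordinates {

    /**
     * Maps a 1-based alignment column to the 1-based residue number in the
     * ungapped sequence, or returns -1 if the column holds a gap.
     */
    static int gappedToUngapped(String gapped, int column, char gap) {
        if (column < 1 || column > gapped.length()) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        if (gapped.charAt(column - 1) == gap) {
            return -1;
        }
        int residues = 0;
        for (int i = 0; i < column; i++) {
            if (gapped.charAt(i) != gap) {
                residues++;   // count residues up to and including this column
            }
        }
        return residues;
    }

    /** Maps a 1-based residue number back to its 1-based alignment column. */
    static int ungappedToGapped(String gapped, int residue, char gap) {
        int seen = 0;
        for (int i = 0; i < gapped.length(); i++) {
            if (gapped.charAt(i) != gap && ++seen == residue) {
                return i + 1;
            }
        }
        throw new IndexOutOfBoundsException("residue " + residue);
    }
}
```

With the gapped string "MK--TAY", column 5 (the T) maps to residue 3,
column 3 is a gap, and residue 5 (the Y) maps back to column 7.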
But as for other users, I think there may be some problems: since the
SimpleAlignment object only has a general symbol list iterator, the user
will have to cast each statement extracting a sequence object, and

    SimpleSequence aSimple = (SimpleSequence)aSequences.next();

throws a ClassCastException at run time. So old code might not run with
the update as far as I can see.

Ditlev

--
Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

From holland at ebi.ac.uk Fri Nov 16 09:49:35 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 09:49:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <000601c82833$143c5300$3cb4f900$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
Message-ID: <473D67AF.2020007@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The returned gapped sequences are all properly set up with gaps, name etc.
> But as for other users, I think there may be some problems: since the
> SimpleAlignment object only has a general symbol list iterator, the user
> will have to cast each statement extracting a sequence object, and
>
>     SimpleSequence aSimple = (SimpleSequence)aSequences.next();
>
> throws a ClassCastException at run time. So old code might not run with
> the update as far as I can see.

This is true. However, such code would be unsupported by us, as the API
clearly states that SimpleAlignment returns SymbolList instances and
does not make any guarantees about the exact implementation details of
the objects it returns. To attempt to cast it to anything other than
SymbolList would be a mistake! (Although in practice it now always
returns GappedSymbolList instances, which is what your code can now take
advantage of.) To assume it will return SimpleSequence is outside the
behaviour defined by the API and therefore should not be relied upon.

A more correct behaviour would be to test each item returned:

    SymbolList symlist = (SymbolList)aSequences.next();
    if (symlist instanceof SimpleSequence) {
        SimpleSequence seq = (SimpleSequence)symlist;
        // Do simple-sequence stuff
    } else {
        // Do something else!
    }

In future, I will modify the API to change the SymbolList guarantee to a
GappedSymbolList guarantee, but I can't do this right now as this really
would break everyone's code!

We are currently planning a redesign as you may be aware, so issues like
this will hopefully be resolved as part of that process. For a start, if
we use Java 5 generics in future as we plan, we can strictly specify
what kinds of objects will be returned by things such as the alignment
API, making it easier for us to enforce API-compliant behaviour in
users' code.

cheers,
Richard

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPWev4C5LeMEKA/QRAvTOAJ9tqdBGWangZ9YQPpEDJ4WWBP/vjQCdHlMB
ITj7O/foDly4aOT4SV1Jb+k=
=g7Vs
-----END PGP SIGNATURE-----

From deb at mb.au.dk Fri Nov 16 10:11:15 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 11:11:15 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D67AF.2020007@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk>
Message-ID: <000f01c82839$06722550$13566ff0$@au.dk>

Hi again,

thanks for the info - will do the check just to be proper.
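[Editorial sketch, not part of the original message.] The type check
discussed above can be tried out in isolation. The fragment below is
self-contained and does not use BioJava: the nested interfaces SymbolList
and GappedSymbolList are simplified stand-ins that only borrow the names
from the thread, and TypeCheckSketch, plain, gapped and describe are
hypothetical helpers. It tests the most specific type first and only then
casts, so an unexpected implementation class never causes a
ClassCastException.

```java
final class TypeCheckSketch {

    // Simplified stand-ins for the types named in the thread (assumptions).
    interface SymbolList {
        String seqString();
    }

    interface GappedSymbolList extends SymbolList {
    }

    static SymbolList plain(final String s) {
        return new SymbolList() {
            public String seqString() { return s; }
        };
    }

    static GappedSymbolList gapped(final String s) {
        return new GappedSymbolList() {
            public String seqString() { return s; }
        };
    }

    /** Checks the runtime type before casting instead of assuming it. */
    static String describe(Object item) {
        if (item instanceof GappedSymbolList) {
            GappedSymbolList g = (GappedSymbolList) item;
            return "gapped:" + g.seqString();
        } else if (item instanceof SymbolList) {
            SymbolList s = (SymbolList) item;
            return "plain:" + s.seqString();
        }
        return "unknown";
    }
}
```

Items taken from a raw Iterator can be passed straight to describe, since
it accepts Object and performs the checks itself.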
I have another question: In my application, I would like to wrap the
retrieved SimpleGappedSequence objects inside another object that extends
the functionality with application-specific stuff. Ideally, I would do this
by extending the SimpleGappedSequence object and create it by passing the
SimpleGappedSequence from the alignment import to the constructor of the
parent, like so:

    class AlignedSequence extends SimpleGappedSequence {
        public AlignedSequence(SimpleGappedSequence aGapped) {
            super(aGapped);
        }

        // ...custom stuff...
    }

However, the problem is that there is only one constructor for the
SimpleGappedSequence, one which takes a simple Sequence object. I can pass
the derived class alright, but all gap information is lost again, presumably
because the SimpleGappedSequence constructor just takes out the seqString()
and puts it into its own sequence object.

Shouldn't the constructor of the SimpleGappedSequence class recognise when a
derived (and gapped) sequence object is passed, and process it accordingly?

As it stands, I am forced to include the SimpleGappedSequence as a private
member of the AlignedSequence class, which is not nearly as nice, since
every statement using the class will have to do something like

    class AlignedSequence {
        private SimpleGappedSequence gapped_sequence;

        public AlignedSequence(SimpleGappedSequence aGapped) {
            gapped_sequence = aGapped;
        }

        public SimpleGappedSequence getGappedSequence() {
            return(gapped_sequence);
        }

        // ...custom stuff...
    }

    ...

    AlignedSequence aAligned = new AlignedSequence(aGapped);
    aAligned.getGappedSequence().seqString();

rather than simply:

    AlignedSequence aAligned = new AlignedSequence(aGapped);
    aAligned.seqString();

In other words, is there any solution with the current setup that would
allow me to extend SimpleGappedSequence and not lose the gap information?

-- Ditlev

--
Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

From ap3 at sanger.ac.uk Fri Nov 16 09:51:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Nov 2007 09:51:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk>
Message-ID:

> > To get the modified parser you will need to check out the very latest
> source code from our CVS repository and compile it using ant.
> Instructions are on our website at biojava.org if you have not done
> this before.

Alternatively, you could get the automatically built biojava.jar from
http://www.spice-3d.org/cruise/

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

-----------------------------------------------------------------------

From holland at ebi.ac.uk Fri Nov 16 10:46:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 10:46:57 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000f01c82839$06722550$13566ff0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.dk>
Message-ID: <473D7521.9070603@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The easiest way is simply for me to alter the constructor of
SimpleGappedSequence (and equivalently of SimpleGappedSymbolList) so that
it copies all gaps if passed another instance of GappedSymbolList as the
parameter.

I've just done this in CVS so you should be able to update your copy and
observe the new behaviour.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi again,
>
> thanks for the info - will do the check just to be proper. I have another
> question: In my application, I would like to wrap the retrieved
> SimpleGappedSequence objects inside another object that extends the
> functionality with application-specific stuff.
Ideally, I would do this by > extending the SimpleGappedSequence object and create it by passing the > SimpleGappedSequence from the alignment import to the constructor of the > parent, like so: > > class AlignedSequence extends SimpleGappedSequence { > public AlignedSequence(SimpleGappedSequence aGapped) { > super(aGapped); > } > > ..custom stuff.. > } > > However, the problem is that there is only one constructor for the > SimpleGappedSequence, one which takes a simple Sequence object. I can pass > the derived class alright, but all gap information is lost again, presumably > because the SimpleGappedSequence constructor just takes out the seqString() > and puts it into its own sequence object. > > Shouldn't the constructor of the SimpleGappedSequence class recognise when a > derived (and gapped) sequence object is passed, and process it accordingly? > > As it stands, I am forced to include the SimpleGappedSequence as a private > member of the AlignedSequence class, which is not near as nice since all > statement using the class will have to do something like > > class AlignedSequence extends SimpleGappedSequence { > private SimpleGappedSequence gapped_sequence; > > public AlignedSequence(SimpleGappedSequence aGapped) { > gapped_sequence = aGapped; > } > > public SimpleGappedSequence getGappedSequence() { > return(gapped_sequence); > } > > ..custom stuff.. > } > > ... > > AlignedSequence aAligned = new AlignedSequence(aGapped); > aAligned.getGappedSequence().seqString(); > > rather than simply: > > AlignedSequence aAligned = new AlignedSequence(aGapped); > aAligned.seqString(); > > In other words, is there any solution with the current setup that would > allow me to extend SimpleGappedSequence and not loose the gap information? > > -- Ditlev > > -- > > Ditlev E. Brodersen, Ph.D. 
> Lektor, Associate Professor > > Department of Molecular Biology Office: +45 89425259 > University of Aarhus Lab: +45 89425022 > Gustav Wieds Vej 10c Fax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > >> -----Original Message----- >> From: Richard Holland [mailto:holland at ebi.ac.uk] >> Sent: 16 November 2007 10:50 >> To: Ditlev Egeskov Brodersen >> Cc: biojava-l at biojava.org >> Subject: Re: [Biojava-l] Parsing exising gaps >> >>>> The returned gapped sequences are all properly set up with gaps, > name etc. >>>> But as for other users, I think there may be some problems, since the >>>> SimpleAlignment object only has a general symbol list iterator, the > user >>>> will have to cast each statement extracting a sequence object, and >>>> >>>> SimpleSequence aSimple = (SimpleSequence)aSequences.next(); >>>> >>>> returns an ClassCastException at run time. So old code might not run > with >>>> the update as far as I can see. > This is true. However, such code would be unsupported by us as the API > clearly states that SimpleAlignment returns SymbolList instances, and > does not make any guarantees about the exact implementation details of > the objects it returns. To attempt to cast it to anything other than > SymbolList would be a mistake! (Although actually it is now returning a > guarantee of GappedSymbolList, which is what your code can now take > advantage of). To assume it will return SimpleSequence is outside the > behaviour defined by the API and therefore should not be relied upon. > > A more correct behaviour would be to test each item returned: > > SymbolList symlist = aSequences.next(); > if (symlist instanceof SimpleSequence) { > SimpleSequence seq = (SimpleSequence)symlist; > // Do simple-sequence stuff > } else { > // Do something else! 
> }
>
> In future, I will modify the API to change the SymbolList guarantee to a
> GappedSymbolList guarantee, but I can't do this right now as this really
> would break everyone's code!
>
> We are currently planning a redesign as you may be aware, so issues like
> this will hopefully be resolved as part of that process. For a start, if
> we use Java 5 generics in future as we plan, we can strictly specify
> what kinds of objects will be returned by things such as the alignment
> API, making it easier for us to enforce API-compliant behaviour in
> user's code.
>
> cheers,
> Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPXUh4C5LeMEKA/QRAsbqAKCnpCRnIiztjZ69fE2/UaJuI9QjiACfYa0m
8EJTzWZYOyjp9VhmvsgvmNA=
=1uaB
-----END PGP SIGNATURE-----

From deb at mb.au.dk Fri Nov 16 12:39:23 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 13:39:23 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D7521.9070603@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk>
 <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk>
 <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk>
 <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk>
Message-ID: <001801c8284d$b8c525e0$2a4f71a0$@au.dk>

Hi again,

I updated CVS and got the new SimpleGappedSymbolList class, but there
seems to be no changes to the SimpleGappedSequence class, which is the one I
need to extend...have I missed something?

Ditlev

--
Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office: +45 89425259
University of Aarhus              Lab:    +45 89425022
Gustav Wieds Vej 10c              Fax:    +45 86123178
DK-8000 Aarhus C                  Email:  deb at mb.au.dk
Denmark
Lab WWW: www.bioxray.dk/~deb > -----Original Message----- > From: Richard Holland [mailto:holland at ebi.ac.uk] > Sent: 16 November 2007 11:47 > To: Ditlev Egeskov Brodersen > Cc: biojava-l at biojava.org > Subject: Re: Wrapping SimpleGappedSequence > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > The easiest way is simply for me to alter the constructor to > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to > copy all gaps if passed another instance of GappedSymbolList as the > parameter. I've just done this in CVS so you should be able to update > your copy and observe the new behaviour. > > cheers, > Richard > > Ditlev Egeskov Brodersen wrote: > > Hi again, > > > > thanks for the info - will do the check just to be proper. I have > another > > question: In my application, I would like to wrap the retrieved > > SimpleGappedSequence objects inside another object that extends the > > functionality with application-specific stuff. Ideally, I would do > this by > > extending the SimpleGappedSequence object and create it by passing > the > > SimpleGappedSequence from the alignment import to the constructor of > the > > parent, like so: > > > > class AlignedSequence extends SimpleGappedSequence { > > public AlignedSequence(SimpleGappedSequence aGapped) { > > super(aGapped); > > } > > > > ..custom stuff.. > > } > > > > However, the problem is that there is only one constructor for the > > SimpleGappedSequence, one which takes a simple Sequence object. I can > pass > > the derived class alright, but all gap information is lost again, > presumably > > because the SimpleGappedSequence constructor just takes out the > seqString() > > and puts it into its own sequence object. > > > > Shouldn't the constructor of the SimpleGappedSequence class recognise > when a > > derived (and gapped) sequence object is passed, and process it > accordingly? 
> > > > As it stands, I am forced to include the SimpleGappedSequence as a > private > > member of the AlignedSequence class, which is not near as nice since > all > > statement using the class will have to do something like > > > > class AlignedSequence extends SimpleGappedSequence { > > private SimpleGappedSequence gapped_sequence; > > > > public AlignedSequence(SimpleGappedSequence aGapped) { > > gapped_sequence = aGapped; > > } > > > > public SimpleGappedSequence getGappedSequence() { > > return(gapped_sequence); > > } > > > > ..custom stuff.. > > } > > > > ... > > > > AlignedSequence aAligned = new AlignedSequence(aGapped); > > aAligned.getGappedSequence().seqString(); > > > > rather than simply: > > > > AlignedSequence aAligned = new AlignedSequence(aGapped); > > aAligned.seqString(); > > > > In other words, is there any solution with the current setup that > would > > allow me to extend SimpleGappedSequence and not loose the gap > information? > > > > -- Ditlev > > > > -- > > > > Ditlev E. Brodersen, Ph.D. > > Lektor, Associate Professor > > > > Department of Molecular Biology Office: +45 89425259 > > University of Aarhus Lab: +45 89425022 > > Gustav Wieds Vej 10c Fax: +45 86123178 > > DK-8000 Aarhus C Email: deb at mb.au.dk > > Denmark Lab WWW: www.bioxray.dk/~deb > > > > > >> -----Original Message----- > >> From: Richard Holland [mailto:holland at ebi.ac.uk] > >> Sent: 16 November 2007 10:50 > >> To: Ditlev Egeskov Brodersen > >> Cc: biojava-l at biojava.org > >> Subject: Re: [Biojava-l] Parsing exising gaps > >> > >>>> The returned gapped sequences are all properly set up with gaps, > > name etc. 
> >>>> But as for other users, I think there may be some problems, since > the > >>>> SimpleAlignment object only has a general symbol list iterator, > the > > user > >>>> will have to cast each statement extracting a sequence object, and > >>>> > >>>> SimpleSequence aSimple = (SimpleSequence)aSequences.next(); > >>>> > >>>> returns an ClassCastException at run time. So old code might not > run > > with > >>>> the update as far as I can see. > > This is true. However, such code would be unsupported by us as the > API > > clearly states that SimpleAlignment returns SymbolList instances, and > > does not make any guarantees about the exact implementation details > of > > the objects it returns. To attempt to cast it to anything other than > > SymbolList would be a mistake! (Although actually it is now returning > a > > guarantee of GappedSymbolList, which is what your code can now take > > advantage of). To assume it will return SimpleSequence is outside the > > behaviour defined by the API and therefore should not be relied upon. > > > > A more correct behaviour would be to test each item returned: > > > > SymbolList symlist = aSequences.next(); > > if (symlist instanceof SimpleSequence) { > > SimpleSequence seq = (SimpleSequence)symlist; > > // Do simple-sequence stuff > > } else { > > // Do something else! > > } > > > > In future, I will modify the API to change the SymbolList guarantee > to > > a > > GappedSymbolList guarantee, but I can't do this right now as this > > really > > would break everyone's code! > > > > We are currently planning a redesign as you may be aware, so issues > > like > > this will hopefully be resolved as part of that process. For a start, > > if > > we use Java 5 generics in future as we plan, we can strictly specify > > what kinds of objects will be returned by things such as the > alignment > > API, making it easier for us to enforce API-compliant behaviour in > > user's code. 
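The instanceof test and the generics idea quoted above can be sketched in a self-contained form. This is an illustration under stated assumptions, not BioJava code: the SymbolListLike and GappedLike interfaces and the Plain and Gapped classes are hypothetical stand-ins for SymbolList and GappedSymbolList, used only so the example compiles without the library.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class TypedIterationSketch {

    // Hypothetical stand-ins for SymbolList and GappedSymbolList (NOT the BioJava types).
    interface SymbolListLike {
        String seqString();
    }

    interface GappedLike extends SymbolListLike {
        int gapCount();
    }

    static final class Plain implements SymbolListLike {
        private final String text;
        Plain(String text) { this.text = text; }
        public String seqString() { return text; }
    }

    static final class Gapped implements GappedLike {
        private final String text;
        Gapped(String text) { this.text = text; }
        public String seqString() { return text; }
        public int gapCount() {
            // Count '-' characters as gaps.
            int count = 0;
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) == '-') {
                    count++;
                }
            }
            return count;
        }
    }

    // Untyped iterator: test each item before casting, so an unexpected
    // implementation is skipped instead of raising ClassCastException.
    static int gapsWithInstanceof(Iterator<?> items) {
        int total = 0;
        while (items.hasNext()) {
            Object item = items.next();
            if (item instanceof GappedLike) {
                total += ((GappedLike) item).gapCount();
            }
        }
        return total;
    }

    // Typed collection: the compiler guarantees every element is gapped,
    // so no cast and no runtime check are needed.
    static int gapsWithGenerics(List<? extends GappedLike> items) {
        int total = 0;
        for (GappedLike item : items) {
            total += item.gapCount();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object> mixed = new ArrayList<Object>();
        mixed.add(new Plain("MSEKLMPRTTWAKG"));
        mixed.add(new Gapped("MS--EKLMPR---TTWAKG"));
        System.out.println("instanceof route: " + gapsWithInstanceof(mixed.iterator()));

        List<Gapped> typed = new ArrayList<Gapped>();
        typed.add(new Gapped("MS--EKLMPR---TTWAKG"));
        System.out.println("generics route: " + gapsWithGenerics(typed));
    }
}
```

The first method mirrors the defensive check suggested in the quoted reply; the second shows how a typed collection would let the compiler rule out the bad cast before the program runs.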
> > > > cheers, > > Richard > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHPXUh4C5LeMEKA/QRAsbqAKCnpCRnIiztjZ69fE2/UaJuI9QjiACfYa0m > 8EJTzWZYOyjp9VhmvsgvmNA= > =1uaB > -----END PGP SIGNATURE----- From holland at ebi.ac.uk Fri Nov 16 12:46:23 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 16 Nov 2007 12:46:23 +0000 Subject: [Biojava-l] Wrapping SimpleGappedSequence In-Reply-To: <001801c8284d$b8c525e0$2a4f71a0$@au.dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> Message-ID: <473D911F.2000303@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor delegates to the SimpleGappedSymbolList constructor. When you extend SimpleGappedSequence you should delegate in your new constructor to the existing SimpleGappedSequence constructor, which in turn will delegate as above and preserve the gaps. By passing any object which implements GappedSymbolList to the SimpleGappedSequence constructor, e.g. SimpleGappedSequence or SimpleGappedSymbolList, it will automatically choose the new constructor from SimpleGappedSymbolList which you hopefully should be able to see in the code you have just checked out. If passed any other non-GappedSymbolList object, it will use the old constructor that already existed from before. cheers, Richard Ditlev Egeskov Brodersen wrote: > Hi again, > > I updated CVS and got the new SimpleGappedSymbolList class, but there > seems to be no changes to the SimpleGappedSequence class, which is the one I > need to extend...have I missed something? > > Ditlev > > -- > > Ditlev E. Brodersen, Ph.D. 
> Lektor, Associate Professor > > Department of Molecular Biology Office: +45 89425259 > University of Aarhus Lab: +45 89425022 > Gustav Wieds Vej 10c Fax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > >> -----Original Message----- >> From: Richard Holland [mailto:holland at ebi.ac.uk] >> Sent: 16 November 2007 11:47 >> To: Ditlev Egeskov Brodersen >> Cc: biojava-l at biojava.org >> Subject: Re: Wrapping SimpleGappedSequence >> > The easiest way is simply for me to alter the constructor to > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to > copy all gaps if passed another instance of GappedSymbolList as the > parameter. I've just done this in CVS so you should be able to update > your copy and observe the new behaviour. > > cheers, > Richard > > Ditlev Egeskov Brodersen wrote: >>>> Hi again, >>>> >>>> thanks for the info - will do the check just to be proper. I have > another >>>> question: In my application, I would like to wrap the retrieved >>>> SimpleGappedSequence objects inside another object that extends the >>>> functionality with application-specific stuff. Ideally, I would do > this by >>>> extending the SimpleGappedSequence object and create it by passing > the >>>> SimpleGappedSequence from the alignment import to the constructor of > the >>>> parent, like so: >>>> >>>> class AlignedSequence extends SimpleGappedSequence { >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >>>> super(aGapped); >>>> } >>>> >>>> ..custom stuff.. >>>> } >>>> >>>> However, the problem is that there is only one constructor for the >>>> SimpleGappedSequence, one which takes a simple Sequence object. I can > pass >>>> the derived class alright, but all gap information is lost again, > presumably >>>> because the SimpleGappedSequence constructor just takes out the > seqString() >>>> and puts it into its own sequence object. 
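One way a constructor could keep the gaps when it is handed an already gapped object is sketched below. It is only a sketch: Symbols, GappedSymbols, Plain, GappedList and GappedSeq are hypothetical stand-ins rather than the BioJava classes, gaps are modelled as a plain list of positions, and nothing here is taken from the code in CVS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class GapCopySketch {

    // Hypothetical stand-ins for SymbolList and GappedSymbolList (NOT the BioJava types).
    interface Symbols {
        String ungapped();
    }

    interface GappedSymbols extends Symbols {
        List<Integer> gapPositions();
    }

    static final class Plain implements Symbols {
        private final String text;
        Plain(String text) { this.text = text; }
        public String ungapped() { return text; }
    }

    // Plays the role of the gapped symbol list base class.
    static class GappedList implements GappedSymbols {
        private final String source;
        private final List<Integer> gaps = new ArrayList<Integer>();

        GappedList(Symbols src) {
            this.source = src.ungapped();
            // Runtime check: if the argument already carries gaps, copy them.
            if (src instanceof GappedSymbols) {
                gaps.addAll(((GappedSymbols) src).gapPositions());
            }
        }

        void addGap(int position) { gaps.add(position); }
        public String ungapped() { return source; }
        public List<Integer> gapPositions() { return Collections.unmodifiableList(gaps); }
    }

    // Plays the role of the gapped sequence class: it only delegates upwards.
    static class GappedSeq extends GappedList {
        GappedSeq(Symbols src) { super(src); }
    }

    // The application-specific subclass from the question.
    static final class AlignedSequence extends GappedSeq {
        AlignedSequence(GappedSeq gapped) { super(gapped); }
    }

    public static void main(String[] args) {
        GappedSeq gapped = new GappedSeq(new Plain("MSEKLMPRTTWAKG"));
        gapped.addGap(3);
        gapped.addGap(9);

        AlignedSequence aligned = new AlignedSequence(gapped);
        System.out.println(aligned.ungapped() + " " + aligned.gapPositions());
    }
}
```

The sketch uses a runtime instanceof check inside a single constructor rather than a second overloaded constructor, because Java chooses between overloads from the compile-time type of the argument, not from the runtime class of the object.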
>>>> >>>> Shouldn't the constructor of the SimpleGappedSequence class recognise > when a >>>> derived (and gapped) sequence object is passed, and process it > accordingly? >>>> As it stands, I am forced to include the SimpleGappedSequence as a > private >>>> member of the AlignedSequence class, which is not near as nice since > all >>>> statement using the class will have to do something like >>>> >>>> class AlignedSequence extends SimpleGappedSequence { >>>> private SimpleGappedSequence gapped_sequence; >>>> >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >>>> gapped_sequence = aGapped; >>>> } >>>> >>>> public SimpleGappedSequence getGappedSequence() { >>>> return(gapped_sequence); >>>> } >>>> >>>> ..custom stuff.. >>>> } >>>> >>>> ... >>>> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >>>> aAligned.getGappedSequence().seqString(); >>>> >>>> rather than simply: >>>> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >>>> aAligned.seqString(); >>>> >>>> In other words, is there any solution with the current setup that > would >>>> allow me to extend SimpleGappedSequence and not loose the gap > information? >>>> -- Ditlev >>>> >>>> -- >>>> >>>> Ditlev E. Brodersen, Ph.D. >>>> Lektor, Associate Professor >>>> >>>> Department of Molecular Biology Office: +45 89425259 >>>> University of Aarhus Lab: +45 89425022 >>>> Gustav Wieds Vej 10c Fax: +45 86123178 >>>> DK-8000 Aarhus C Email: deb at mb.au.dk >>>> Denmark Lab WWW: www.bioxray.dk/~deb >>>> >>>> >>>>> -----Original Message----- >>>>> From: Richard Holland [mailto:holland at ebi.ac.uk] >>>>> Sent: 16 November 2007 10:50 >>>>> To: Ditlev Egeskov Brodersen >>>>> Cc: biojava-l at biojava.org >>>>> Subject: Re: [Biojava-l] Parsing exising gaps >>>>> >>>>>>> The returned gapped sequences are all properly set up with gaps, >>>> name etc. 
>>>>>>> But as for other users, I think there may be some problems, since > the >>>>>>> SimpleAlignment object only has a general symbol list iterator, > the >>>> user >>>>>>> will have to cast each statement extracting a sequence object, and >>>>>>> >>>>>>> SimpleSequence aSimple = (SimpleSequence)aSequences.next(); >>>>>>> >>>>>>> returns an ClassCastException at run time. So old code might not > run >>>> with >>>>>>> the update as far as I can see. >>>> This is true. However, such code would be unsupported by us as the > API >>>> clearly states that SimpleAlignment returns SymbolList instances, and >>>> does not make any guarantees about the exact implementation details > of >>>> the objects it returns. To attempt to cast it to anything other than >>>> SymbolList would be a mistake! (Although actually it is now returning > a >>>> guarantee of GappedSymbolList, which is what your code can now take >>>> advantage of). To assume it will return SimpleSequence is outside the >>>> behaviour defined by the API and therefore should not be relied upon. >>>> >>>> A more correct behaviour would be to test each item returned: >>>> >>>> SymbolList symlist = aSequences.next(); >>>> if (symlist instanceof SimpleSequence) { >>>> SimpleSequence seq = (SimpleSequence)symlist; >>>> // Do simple-sequence stuff >>>> } else { >>>> // Do something else! >>>> } >>>> >>>> In future, I will modify the API to change the SymbolList guarantee > to >>>> a >>>> GappedSymbolList guarantee, but I can't do this right now as this >>>> really >>>> would break everyone's code! >>>> >>>> We are currently planning a redesign as you may be aware, so issues >>>> like >>>> this will hopefully be resolved as part of that process. For a start, >>>> if >>>> we use Java 5 generics in future as we plan, we can strictly specify >>>> what kinds of objects will be returned by things such as the > alignment >>>> API, making it easier for us to enforce API-compliant behaviour in >>>> user's code. 
>>>> >>>> cheers, >>>> Richard - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx H0QV01cFreNi1SNLRPbhepg= =023Y -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Fri Nov 16 14:43:39 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Fri, 16 Nov 2007 14:43:39 +0000 Subject: [Biojava-l] Disulfide information in PDB files In-Reply-To: <459609.71722.qm@web52710.mail.re2.yahoo.com> References: <459609.71722.qm@web52710.mail.re2.yahoo.com> Message-ID: <8F40FBF1-D491-4C3D-BCEB-41316147BD80@sanger.ac.uk> Hi Brendan, I just committed the patches to CVS so BioJava can now parse the SSBond records. Andreas On 14 Nov 2007, at 16:28, Brendan Duggan wrote: > Hi Andreas > > Thanks for the quick response. I submitted a bug > request (#2400) as suggested by Richard. Parsing the > SSBOND records is indeed what I was talking about. I > want to identify the disulfides then calculate their > torsions, dihedrals and bond lengths, all of which I > believe can be implemented with the existing code. If > you could implement this parsing in a few days that > would be great! > > Thanks > > Brendan > > > --- Andreas Prlic wrote: > >> Hi Brendan, >> >> SSBOND lines are currently not parsed. If this is >> what you need, >> I can add this over the next couple of days. >> >> If you want to compute the bonds yourself, the >> framework can >> e.g. calculate distances between the sulphur atoms >> for you. 
- >> >> Andreas >> >> >> >> >> >> On 14 Nov 2007, at 00:48, Brendan Duggan wrote: >> >>> Greetings >>> >>> I'm trying to mine some information on disulfides >> in >>> the PDB and was hoping there might be a way of >>> obtaining this information with the BioJava PDB >>> parser. However, I haven't been able to see >> anything >>> like this mentioned in the API docs. If it is >>> currently not possible to extract disulfide >>> information from PDB files are there any plans to >>> implement this? >>> >>> Thanks! >>> >>> Brendan >>> >>> >>> Make the switch to the world's best email. >> Get the new Yahoo! >>> 7 Mail now. >> http://au.yahoo.com/worldsbestmail/viagra/index.html >>> >>> >>> _______________________________________________ >>> Biojava-l mailing list - >> Biojava-l at lists.open-bio.org >>> >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> > ---------------------------------------------------------------------- > - >> >> Andreas Prlic Wellcome Trust Sanger Institute >> Hinxton, Cambridge >> CB10 1SA, UK >> +44 (0) 1223 49 6891 >> >> > ---------------------------------------------------------------------- > - >> >> >> >> -- >> The Wellcome Trust Sanger Institute is operated by >> Genome Research >> Limited, a charity registered in England with >> number 1021457 and a >> company registered in England with number 2742969, >> whose registered >> office is 215 Euston Road, London, NW1 2BE. >> > > > Brendan M. Duggan, PhD > > bmduggan at yahoo.com > (858) 692-2298 > > > Make the switch to the world's best email. Get the new Yahoo! > 7 Mail now. 
http://au.yahoo.com/worldsbestmail/viagra/index.html
>
>

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

-----------------------------------------------------------------------

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From holland at ebi.ac.uk Sun Nov 18 17:12:04 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Sun, 18 Nov 2007 17:12:04 -0000 (GMT)
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000901c829d0$daa54620$8fefd260$@dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk>
 <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk>
 <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk>
 <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk>
 <001801c8284d$b8c525e0$2a4f71a0$@au.dk> <473D911F.2000303@ebi.ac.uk>
 <000901c829d0$daa54620$8fefd260$@dk>
Message-ID: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>

Interesting stuff. I'm not sure why it isn't working so I'll have to have
a closer look. I'm currently on annual leave but will investigate when I
return (Nov 27th).

cheers,
Richard

On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
> Hi Richard,
>
> I thought that was also correct what you say, but I can't get it to work.
> Below is a small test program to check this. First, I create a
> SimpleGappedSequence through Text with
> gaps->SymbolList->Sequence->GappedSequence. Gaps are there but not
> "understood", as expected. Next, I create the same sequence non-gapped in
> the above way, then introduce gaps with addGapsInSource. A gapped location
> is now properly translated to a non-gapped sequence position. Finally, I
> create a new SimpleGappedSequence based on the working one - as you can see
> the gaps are still there but not "understood"...
>
> aSymbolList = MSE--KLMPRT---TWAKG
> aSequence = MSE--KLMPRT---TWAKG
>
> Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:
> aGapped = MSE--KLMPRT---TWAKG
> Gapped position 10 = Plain position 10
>
> aSymbolList = MSEKLMPRTTWAKG
> aSequence = MSEKLMPRTTWAKG
>
> Gaps introduced through addGapsInSource work ok:
> aGapped = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 8
>
> Now a new SimpleGappedSequence object is created from the previous one:
> aGapped2 = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 10
>
> This should have been compiled with the new biojava.jar of 161107 (updated
> via CVS), but perhaps I made a mistake updating?
>
> Any clues?
>
> Thanks,
>
> Ditlev
>
> ---
>
> package gappedsequencetest;
>
> import org.biojava.bio.*;
> import org.biojava.bio.seq.*;
> import org.biojava.bio.seq.impl.*;
> import org.biojava.bio.symbol.*;
>
> public class Main {
>
>     public static void main(String[] args) {
>         SymbolList aSymbolList = null;
>         try {
>             aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence = " + aSequence.seqString() + "\n");
>
>         SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
>         System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
>         System.out.println("aGapped = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position " + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         try {
>             aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence = " + aSequence.seqString() + "\n");
>
>         aGapped = new SimpleGappedSequence(aSequence);
>         aGapped.addGapsInSource(9, 3);
>         aGapped.addGapsInSource(3, 2);
>         System.out.println("Gaps introduced through addGapsInSource work ok:");
>         System.out.println("aGapped = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position " + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
>         System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
>         System.out.println("aGapped2 = " + aGapped2.seqString());
>         System.out.println("Gapped position 10 = Plain position " + aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>     }
>
> }
>
> --
>
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>
> www.lindeman-brodersen.dk
>
>
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 16 November 2007 13:46
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: Wrapping SimpleGappedSequence
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
>> delegates to the SimpleGappedSymbolList constructor.
>>
>> When you extend SimpleGappedSequence you should delegate in your new
>> constructor to the existing SimpleGappedSequence constructor, which in
>> turn will delegate as above and preserve the gaps.
>>
>> By passing any object which implements GappedSymbolList to the
>> SimpleGappedSequence constructor, e.g.
SimpleGappedSequence or >> SimpleGappedSymbolList, it will automatically choose the new >> constructor >> from SimpleGappedSymbolList which you hopefully should be able to see >> in >> the code you have just checked out. If passed any other >> non-GappedSymbolList object, it will use the old constructor that >> already existed from before. >> >> cheers, >> Richard >> >> Ditlev Egeskov Brodersen wrote: >> > Hi again, >> > >> > I updated CVS and got the new SimpleGappedSymbolList class, but >> there >> > seems to be no changes to the SimpleGappedSequence class, which is >> the one I >> > need to extend...have I missed something? >> > >> > Ditlev >> > >> > -- >> > >> > Ditlev E. Brodersen, Ph.D. >> > Lektor, Associate Professor >> > >> > Department of Molecular Biology Office: +45 89425259 >> > University of Aarhus Lab: +45 89425022 >> > Gustav Wieds Vej 10c Fax: +45 86123178 >> > DK-8000 Aarhus C Email: deb at mb.au.dk >> > Denmark Lab WWW: www.bioxray.dk/~deb >> > >> > >> >> -----Original Message----- >> >> From: Richard Holland [mailto:holland at ebi.ac.uk] >> >> Sent: 16 November 2007 11:47 >> >> To: Ditlev Egeskov Brodersen >> >> Cc: biojava-l at biojava.org >> >> Subject: Re: Wrapping SimpleGappedSequence >> >> >> > The easiest way is simply for me to alter the constructor to >> > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to >> > copy all gaps if passed another instance of GappedSymbolList as the >> > parameter. I've just done this in CVS so you should be able to update >> > your copy and observe the new behaviour. >> > >> > cheers, >> > Richard >> > >> > Ditlev Egeskov Brodersen wrote: >> >>>> Hi again, >> >>>> >> >>>> thanks for the info - will do the check just to be proper. I >> have >> > another >> >>>> question: In my application, I would like to wrap the retrieved >> >>>> SimpleGappedSequence objects inside another object that extends >> the >> >>>> functionality with application-specific stuff. 
Ideally, I would do >> > this by >> >>>> extending the SimpleGappedSequence object and create it by passing >> > the >> >>>> SimpleGappedSequence from the alignment import to the constructor >> of >> > the >> >>>> parent, like so: >> >>>> >> >>>> class AlignedSequence extends SimpleGappedSequence { >> >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >> >>>> super(aGapped); >> >>>> } >> >>>> >> >>>> ..custom stuff.. >> >>>> } >> >>>> >> >>>> However, the problem is that there is only one constructor for the >> >>>> SimpleGappedSequence, one which takes a simple Sequence object. I >> can >> > pass >> >>>> the derived class alright, but all gap information is lost again, >> > presumably >> >>>> because the SimpleGappedSequence constructor just takes out the >> > seqString() >> >>>> and puts it into its own sequence object. >> >>>> >> >>>> Shouldn't the constructor of the SimpleGappedSequence class >> recognise >> > when a >> >>>> derived (and gapped) sequence object is passed, and process it >> > accordingly? >> >>>> As it stands, I am forced to include the SimpleGappedSequence as a >> > private >> >>>> member of the AlignedSequence class, which is not near as nice >> since >> > all >> >>>> statement using the class will have to do something like >> >>>> >> >>>> class AlignedSequence extends SimpleGappedSequence { >> >>>> private SimpleGappedSequence gapped_sequence; >> >>>> >> >>>> public AlignedSequence(SimpleGappedSequence aGapped) { >> >>>> gapped_sequence = aGapped; >> >>>> } >> >>>> >> >>>> public SimpleGappedSequence getGappedSequence() { >> >>>> return(gapped_sequence); >> >>>> } >> >>>> >> >>>> ..custom stuff.. >> >>>> } >> >>>> >> >>>> ... 
>> >>>> >> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >> >>>> aAligned.getGappedSequence().seqString(); >> >>>> >> >>>> rather than simply: >> >>>> >> >>>> AlignedSequence aAligned = new AlignedSequence(aGapped); >> >>>> aAligned.seqString(); >> >>>> >> >>>> In other words, is there any solution with the current setup that >> > would >> >>>> allow me to extend SimpleGappedSequence and not loose the gap >> > information? >> >>>> -- Ditlev >> >>>> >> >>>> -- >> >>>> >> >>>> Ditlev E. Brodersen, Ph.D. >> >>>> Lektor, Associate Professor >> >>>> >> >>>> Department of Molecular Biology Office: +45 89425259 >> >>>> University of Aarhus Lab: +45 89425022 >> >>>> Gustav Wieds Vej 10c Fax: +45 86123178 >> >>>> DK-8000 Aarhus C Email: deb at mb.au.dk >> >>>> Denmark Lab WWW: www.bioxray.dk/~deb >> >>>> >> >>>> >> >>>>> -----Original Message----- >> >>>>> From: Richard Holland [mailto:holland at ebi.ac.uk] >> >>>>> Sent: 16 November 2007 10:50 >> >>>>> To: Ditlev Egeskov Brodersen >> >>>>> Cc: biojava-l at biojava.org >> >>>>> Subject: Re: [Biojava-l] Parsing exising gaps >> >>>>> >> >>>>>>> The returned gapped sequences are all properly set up with >> gaps, >> >>>> name etc. >> >>>>>>> But as for other users, I think there may be some problems, >> since >> > the >> >>>>>>> SimpleAlignment object only has a general symbol list iterator, >> > the >> >>>> user >> >>>>>>> will have to cast each statement extracting a sequence object, >> and >> >>>>>>> >> >>>>>>> SimpleSequence aSimple = >> (SimpleSequence)aSequences.next(); >> >>>>>>> >> >>>>>>> returns an ClassCastException at run time. So old code might >> not >> > run >> >>>> with >> >>>>>>> the update as far as I can see. >> >>>> This is true. However, such code would be unsupported by us as the >> > API >> >>>> clearly states that SimpleAlignment returns SymbolList instances, >> and >> >>>> does not make any guarantees about the exact implementation >> details >> > of >> >>>> the objects it returns. 
To attempt to cast it to anything other >> than >> >>>> SymbolList would be a mistake! (Although actually it is now >> returning >> > a >> >>>> guarantee of GappedSymbolList, which is what your code can now >> take >> >>>> advantage of). To assume it will return SimpleSequence is outside >> the >> >>>> behaviour defined by the API and therefore should not be relied >> upon. >> >>>> >> >>>> A more correct behaviour would be to test each item returned: >> >>>> >> >>>> SymbolList symlist = aSequences.next(); >> >>>> if (symlist instanceof SimpleSequence) { >> >>>> SimpleSequence seq = (SimpleSequence)symlist; >> >>>> // Do simple-sequence stuff >> >>>> } else { >> >>>> // Do something else! >> >>>> } >> >>>> >> >>>> In future, I will modify the API to change the SymbolList >> guarantee >> > to >> >>>> a >> >>>> GappedSymbolList guarantee, but I can't do this right now as this >> >>>> really >> >>>> would break everyone's code! >> >>>> >> >>>> We are currently planning a redesign as you may be aware, so >> issues >> >>>> like >> >>>> this will hopefully be resolved as part of that process. For a >> start, >> >>>> if >> >>>> we use Java 5 generics in future as we plan, we can strictly >> specify >> >>>> what kinds of objects will be returned by things such as the >> > alignment >> >>>> API, making it easier for us to enforce API-compliant behaviour in >> >>>> user's code. >> >>>> >> >>>> cheers, >> >>>> Richard >> >> - -- >> Richard Holland (BioMart) >> EMBL EBI, Wellcome Trust Genome Campus, >> Hinxton, Cambridgeshire CB10 1SD, UK >> Tel. 
+44 (0)1223 494416
>>
>> http://www.biomart.org/
>> http://www.biojava.org/
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
>> H0QV01cFreNi1SNLRPbhepg=
>> =023Y
>> -----END PGP SIGNATURE-----
>
>

--
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK

From sterk at ebi.ac.uk Mon Nov 19 11:53:00 2007
From: sterk at ebi.ac.uk (Peter Sterk)
Date: Mon, 19 Nov 2007 11:53:00 +0000
Subject: [Biojava-l] biojava.org wiki site down?
Message-ID: <4741791C.2090307@ebi.ac.uk>

Hi,

I only get blank screens in Firefox, and IE can't display the pages either. I think Richard reported something similar a few weeks ago.

cheers,

--Peter
-----------------------------------------------------------------
Dr. Peter Sterk                          Tel:   +44-(0)1223-494405
EMBL-European Bioinformatics Institute   Fax:   +44-(0)1223-494472
Wellcome Trust Genome Campus, Hinxton    email: sterk at ebi.ac.uk
Cambridge CB10 1SD, UK                   WWW:   www.ebi.ac.uk

Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
-----------------------------------------------------------------

From deb at mb.au.dk Mon Nov 19 12:13:53 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 13:13:53 +0100
Subject: [Biojava-l] biojava.org wiki site down?
In-Reply-To: <4741791C.2090307@ebi.ac.uk>
References: <4741791C.2090307@ebi.ac.uk>
Message-ID: <003301c82aa5$a6fabdc0$f4f03940$@au.dk>

www.biojava.org is down now, alright, but I was there less than 10 minutes ago, so it's a recent crash.

Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Peter Sterk
> Sent: 19 November 2007 12:53
> To: biojava-l at lists.open-bio.org
> Subject: [Biojava-l] biojava.org wiki site down?
>
> Hi,
>
> I only get blank screens in firefox and IE can't display the pages,
> either. I think Richard reported something similar a few weeks ago.
>
> cheers,
>
> --Peter
> -----------------------------------------------------------------
> Dr. Peter Sterk                          Tel:   +44-(0)1223-494405
> EMBL-European Bioinformatics Institute   Fax:   +44-(0)1223-494472
> Wellcome Trust Genome Campus, Hinxton    email: sterk at ebi.ac.uk
> Cambridge CB10 1SD, UK                   WWW:   www.ebi.ac.uk
>
> Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
> -----------------------------------------------------------------
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From deb at mb.au.dk Mon Nov 19 14:46:01 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 15:46:01 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> <473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk> <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
Message-ID: <003701c82aba$e85f4320$b91dc960$@au.dk>

Dear Richard and all,

I've been dissecting the delegation problem encountered when instantiating SimpleGappedSequence(Sequence) with an already gapped sequence.
The constructor calls the parent SimpleGappedSymbolList(), which in Richard's CVS update of 161107 now contains a separate overloaded constructor for the gapped case:

    public SimpleGappedSymbolList(GappedSymbolList gappedSource)

However, when instantiating a new SimpleGappedSequence based on an existing gapped sequence (with several blocks), the blocks were lost.

After checking the path of code execution, it appeared that, for some reason, the old SimpleGappedSymbolList(SymbolList) was still being called. So I modified SimpleGappedSequence.java to include an overloaded constructor for the descendant class as well, identical to the other constructor but with a GappedSequence argument:

    public SimpleGappedSequence(GappedSequence seq) {
        super(seq);
        this.sequence = seq;
        createOnUnderlying = false;
    }

Now the correct parent constructor, SimpleGappedSymbolList(GappedSymbolList), is called. However, there are two other problems with the new SimpleGappedSymbolList constructor that need to be corrected for it to work as expected. First, the initial introduction of a single, large block is missing from the new code, so insert:

    Block b = new Block(1, length, 1, length);
    blocks.add(b);

Secondly, the code for transferring the gaps from the sequence string needs to use two separate indices, otherwise the gaps will be placed wrongly because their position is affected by previously inserted gaps:

    int n = 1;
    for (int i = 1; i <= this.length(); i++) {
        if (this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
            this.addGapInSource(n);
        else
            n++;
    }

In other words, the index giving the position of the gaps should only increment when there is NO gap at the corresponding position in the gapped string.
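The two-index idea can be tried out without BioJava. The following is only a minimal sketch on plain strings, assuming '-' as the gap character and assuming that "add a gap at source position n" means "insert one gap before the n-th ungapped symbol"; the class and method names (GapTransferSketch, gapSourcePositions, applyGaps) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

public class GapTransferSketch {
    static final char GAP = '-';

    /**
     * Walks the gapped string with index i and keeps a second index n into
     * the ungapped source. n only advances on non-gap symbols, so each gap
     * is recorded at the source position of the next real symbol (1-based).
     */
    public static List<Integer> gapSourcePositions(String gapped) {
        List<Integer> positions = new ArrayList<>();
        int n = 1;
        for (int i = 1; i <= gapped.length(); i++) {
            if (gapped.charAt(i - 1) == GAP) {
                positions.add(n);
            } else {
                n++;
            }
        }
        return positions;
    }

    /** Rebuilds a gapped string by inserting one gap before each listed source position. */
    public static String applyGaps(String source, List<Integer> positions) {
        // gapsBefore[p] counts the gaps that go in front of source symbol p;
        // index source.length() + 1 holds any trailing gaps.
        int[] gapsBefore = new int[source.length() + 2];
        for (int p : positions) {
            gapsBefore[p]++;
        }
        StringBuilder sb = new StringBuilder();
        for (int p = 1; p <= source.length() + 1; p++) {
            for (int g = 0; g < gapsBefore[p]; g++) {
                sb.append(GAP);
            }
            if (p <= source.length()) {
                sb.append(source.charAt(p - 1));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> positions = gapSourcePositions("MS--EKLMPR---TTWAKG");
        System.out.println(positions);                              // [3, 3, 9, 9, 9]
        System.out.println(applyGaps("MSEKLMPRTTWAKG", positions)); // MS--EKLMPR---TTWAKG
    }
}
```

The recorded positions 3 and 9 are the same ones passed to addGapsInSource(3, 2) and addGapsInSource(9, 3) in the test program, which is what one would expect if the transfer is correct.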
Following these changes, the GappedSequenceTest program from last week now works as expected:

aSymbolList = MSE--KLMPRT---TWAKG
aSequence = MSE--KLMPRT---TWAKG

Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:
aGapped = MSE--KLMPRT---TWAKG
Gapped position 10 = Plain position 10

aSymbolList = MSEKLMPRTTWAKG
aSequence = MSEKLMPRTTWAKG

Gaps introduced through addGapsInSource work ok:
aGapped = MS--EKLMPR---TTWAKG
Gapped position 10 = Plain position 8

Now a new SimpleGappedSequence object is created from the previous one:
aGapped2 = MS--EKLMPR---TTWAKG
Gapped position 10 = Plain position 8

-- Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Richard Holland
Sent: 18 November 2007 18:12
To: Ditlev Egeskov Brodersen
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence

Interesting stuff. I'm not sure why it isn't working so I'll have to have a closer look.

I'm currently on annual leave but will investigate when I return (Nov 27th).

cheers,
Richard

On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:

Hi Richard,

I thought that was also correct what you say, but I can't get it to work. Below is a small test program to check this. First, I create a SimpleGappedSequence through Text with gaps-SymbolList-Sequence-GappedSequence. Gaps are there but not "understood", as expected. Next, I create the same sequence non-gapped in the above way, then introduce gaps with addGapsInSource.
A gapped location is now properly translated to a non-gapped sequence position. Finally, I create a new SimpleGappedSequence based on the working one - as you can see the gaps are still there but not "understood"...

aSymbolList = MSE--KLMPRT---TWAKG
aSequence = MSE--KLMPRT---TWAKG

Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:
aGapped = MSE--KLMPRT---TWAKG
Gapped position 10 = Plain position 10

aSymbolList = MSEKLMPRTTWAKG
aSequence = MSEKLMPRTTWAKG

Gaps introduced through addGapsInSource work ok:
aGapped = MS--EKLMPR---TTWAKG
Gapped position 10 = Plain position 8

Now a new SimpleGappedSequence object is created from the previous one:
aGapped2 = MS--EKLMPR---TTWAKG
Gapped position 10 = Plain position 10

This should have been compiled with the new biojava.jar of 161107 (updated via CVS), but perhaps I made a mistake updating?

Any clues?

Thanks,

Ditlev

---

package gappedsequencetest;

import org.biojava.bio.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.impl.*;
import org.biojava.bio.symbol.*;

public class Main {

    public static void main(String[] args) {
        SymbolList aSymbolList = null;
        try {
            aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
        }
        catch(BioException ex) {}

        System.out.println("aSymbolList = " + aSymbolList.seqString());

        Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
        System.out.println("aSequence = " + aSequence.seqString() + "\n");

        SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
        System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
        System.out.println("aGapped = " + aGapped.seqString());
        System.out.println("Gapped position 10 = Plain position " + aGapped.gappedToLocation(new PointLocation(10)).getMin()+ "\n");

        try {
            aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
        }
        catch(BioException ex) {}

        System.out.println("aSymbolList = " + aSymbolList.seqString());

        aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
        System.out.println("aSequence = " + aSequence.seqString() + "\n");

        aGapped = new SimpleGappedSequence(aSequence);
        aGapped.addGapsInSource(9, 3);
        aGapped.addGapsInSource(3, 2);
        System.out.println("Gaps introduced through addGapsInSource work ok:");
        System.out.println("aGapped = " + aGapped.seqString());
        System.out.println("Gapped position 10 = Plain position " + aGapped.gappedToLocation(new PointLocation(10)).getMin()+ "\n");

        SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
        System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
        System.out.println("aGapped2 = " + aGapped2.seqString());
        System.out.println("Gapped position 10 = Plain position " + aGapped2.gappedToLocation(new PointLocation(10)).getMin()+ "\n");
    }

}

--

Ditlev Egeskov Brodersen
Lektor
Bakkefaldet 30, Hasle
8210 Århus V

www.lindeman-brodersen.dk

-----Original Message-----
From: Richard Holland [mailto:holland at ebi.ac.uk]
Sent: 16 November 2007 13:46
To: Ditlev Egeskov Brodersen
Cc: biojava-l at biojava.org
Subject: Re: Wrapping SimpleGappedSequence

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor delegates to the SimpleGappedSymbolList constructor.

When you extend SimpleGappedSequence you should delegate in your new constructor to the existing SimpleGappedSequence constructor, which in turn will delegate as above and preserve the gaps.

By passing any object which implements GappedSymbolList to the SimpleGappedSequence constructor, e.g. SimpleGappedSequence or SimpleGappedSymbolList, it will automatically choose the new constructor from SimpleGappedSymbolList which you hopefully should be able to see in the code you have just checked out. If passed any other non-GappedSymbolList object, it will use the old constructor that already existed from before.
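One possible explanation for the gapped constructor not being selected, offered as a sketch rather than a statement about the BioJava source: Java chooses among overloaded constructors at compile time, from the declared type of the argument, and not from the run-time class of the object passed in. If the only SimpleGappedSequence constructor declares its parameter as Sequence, its super(seq) call would bind to the SymbolList overload of the parent whatever the caller supplies. The types below are hypothetical stand-ins that only mirror the names in this thread; none of them are BioJava classes.

```java
public class OverloadSketch {
    // Hypothetical stand-ins for the interfaces discussed in the thread.
    interface SymbolList {}
    interface GappedSymbolList extends SymbolList {}
    interface Sequence extends SymbolList {}
    interface GappedSequence extends Sequence, GappedSymbolList {}

    /** Parent with two overloaded constructors; records which one ran. */
    static class ParentList {
        final String chosen;
        ParentList(SymbolList source) { chosen = "SymbolList"; }
        ParentList(GappedSymbolList source) { chosen = "GappedSymbolList"; }
    }

    /** Child with a single constructor whose parameter is declared as Sequence. */
    static class SingleConstructorChild extends ParentList {
        SingleConstructorChild(Sequence seq) {
            // seq has static type Sequence, which is not a GappedSymbolList,
            // so only ParentList(SymbolList) is applicable here.
            super(seq);
        }
    }

    /** Child with an additional overload for the gapped case. */
    static class OverloadedChild extends ParentList {
        OverloadedChild(Sequence seq) { super(seq); }
        OverloadedChild(GappedSequence seq) {
            // seq has static type GappedSequence, so the more specific
            // ParentList(GappedSymbolList) overload is selected.
            super(seq);
        }
    }

    /** A gapped object, analogous to passing an already gapped sequence. */
    static class MyGapped implements GappedSequence {}

    public static String withSingleConstructor() {
        return new SingleConstructorChild(new MyGapped()).chosen;
    }

    public static String withOverloadedConstructor() {
        return new OverloadedChild(new MyGapped()).chosen;
    }

    public static void main(String[] args) {
        System.out.println(withSingleConstructor());     // SymbolList
        System.out.println(withOverloadedConstructor()); // GappedSymbolList
    }
}
```

Under this assumption, adding a constructor overload on the child class that takes the gapped type, as in the 19 November message, is the natural fix; an instanceof check inside the single constructor would be an alternative.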
cheers, Richard Ditlev Egeskov Brodersen wrote: Hi again, I updated CVS and got the new SimpleGappedSymbolList class, but there seems to be no changes to the SimpleGappedSequence class, which is the one I need to extend...have I missed something? Ditlev -- Ditlev E. Brodersen, Ph.D. Lektor, Associate Professor Department of Molecular Biology Office: +45 89425259 University of Aarhus Lab: +45 89425022 Gustav Wieds Vej 10c Fax: +45 86123178 DK-8000 Aarhus C Email: deb at mb.au.dk Denmark Lab WWW: www.bioxray.dk/~deb -----Original Message----- From: Richard Holland [mailto:holland at ebi.ac.uk] Sent: 16 November 2007 11:47 To: Ditlev Egeskov Brodersen Cc: biojava-l at biojava.org Subject: Re: Wrapping SimpleGappedSequence The easiest way is simply for me to alter the constructor to SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to copy all gaps if passed another instance of GappedSymbolList as the parameter. I've just done this in CVS so you should be able to update your copy and observe the new behaviour. cheers, Richard Ditlev Egeskov Brodersen wrote: Hi again, thanks for the info - will do the check just to be proper. I have another question: In my application, I would like to wrap the retrieved SimpleGappedSequence objects inside another object that extends the functionality with application-specific stuff. Ideally, I would do this by extending the SimpleGappedSequence object and create it by passing the SimpleGappedSequence from the alignment import to the constructor of the parent, like so: class AlignedSequence extends SimpleGappedSequence { public AlignedSequence(SimpleGappedSequence aGapped) { super(aGapped); } ..custom stuff.. } However, the problem is that there is only one constructor for the SimpleGappedSequence, one which takes a simple Sequence object. 
I can pass the derived class alright, but all gap information is lost again, presumably because the SimpleGappedSequence constructor just takes out the seqString() and puts it into its own sequence object. Shouldn't the constructor of the SimpleGappedSequence class recognise when a derived (and gapped) sequence object is passed, and process it accordingly? As it stands, I am forced to include the SimpleGappedSequence as a private member of the AlignedSequence class, which is not near as nice since all statement using the class will have to do something like class AlignedSequence extends SimpleGappedSequence { private SimpleGappedSequence gapped_sequence; public AlignedSequence(SimpleGappedSequence aGapped) { gapped_sequence = aGapped; } public SimpleGappedSequence getGappedSequence() { return(gapped_sequence); } ..custom stuff.. } ... AlignedSequence aAligned = new AlignedSequence(aGapped); aAligned.getGappedSequence().seqString(); rather than simply: AlignedSequence aAligned = new AlignedSequence(aGapped); aAligned.seqString(); In other words, is there any solution with the current setup that would allow me to extend SimpleGappedSequence and not loose the gap information? -- Ditlev -- Ditlev E. Brodersen, Ph.D. Lektor, Associate Professor Department of Molecular Biology Office: +45 89425259 University of Aarhus Lab: +45 89425022 Gustav Wieds Vej 10c Fax: +45 86123178 DK-8000 Aarhus C Email: deb at mb.au.dk Denmark Lab WWW: www.bioxray.dk/~deb -----Original Message----- From: Richard Holland [mailto:holland at ebi.ac.uk] Sent: 16 November 2007 10:50 To: Ditlev Egeskov Brodersen Cc: biojava-l at biojava.org Subject: Re: [Biojava-l] Parsing exising gaps The returned gapped sequences are all properly set up with gaps, name etc. 
But as for other users, I think there may be some problems, since the SimpleAlignment object only has a general symbol list iterator, the user will have to cast each statement extracting a sequence object, and SimpleSequence aSimple = (SimpleSequence)aSequences.next(); returns an ClassCastException at run time. So old code might not run with the update as far as I can see. This is true. However, such code would be unsupported by us as the API clearly states that SimpleAlignment returns SymbolList instances, and does not make any guarantees about the exact implementation details of the objects it returns. To attempt to cast it to anything other than SymbolList would be a mistake! (Although actually it is now returning a guarantee of GappedSymbolList, which is what your code can now take advantage of). To assume it will return SimpleSequence is outside the behaviour defined by the API and therefore should not be relied upon. A more correct behaviour would be to test each item returned: SymbolList symlist = aSequences.next(); if (symlist instanceof SimpleSequence) { SimpleSequence seq = (SimpleSequence)symlist; // Do simple-sequence stuff } else { // Do something else! } In future, I will modify the API to change the SymbolList guarantee to a GappedSymbolList guarantee, but I can't do this right now as this really would break everyone's code! We are currently planning a redesign as you may be aware, so issues like this will hopefully be resolved as part of that process. For a start, if we use Java 5 generics in future as we plan, we can strictly specify what kinds of objects will be returned by things such as the alignment API, making it easier for us to enforce API-compliant behaviour in user's code. cheers, Richard - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. 
+44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx H0QV01cFreNi1SNLRPbhepg= =023Y -----END PGP SIGNATURE----- -- Richard Holland BioMart (http://www.biomart.org/) EMBL-EBI Hinxton, Cambridgeshire CB10 1SD, UK _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From allank at sanbi.ac.za Sun Nov 25 13:10:55 2007 From: allank at sanbi.ac.za (Allan Kamau) Date: Sun, 25 Nov 2007 15:10:55 +0200 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported Message-ID: <4749745F.9070104@sanbi.ac.za> Hi all, I've searched for a conclusive answer to the "Program ncbi-blastn Version is not supported" without success. I would like to know format of the blast output the Biojava's blast-like parsing framework likes, including some examples (without the data) of how such blast output may be created. For example, I am using ncbi-blastn and I am generating the blast file (which Biojava doesn't like) as follows. 
export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
export BLAST_REPORT_TABULAR=somesequence.blast.txt
export BLAST_REPORT_XML=somesequence.blast.xml
export BLAST_REPORT=somesequence.blast
export INPUT_FASTA=somesequence.fasta
export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence

date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;

Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser".

Then I get the error below.

[aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
Buildfile: build.xml

runBlastParser:
[java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17 is not supported by the biojava blast-like parsing framework
[java] at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
[java] at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)

Allan.

From markjschreiber at gmail.com Mon Nov 26 01:17:03 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 26 Nov 2007 09:17:03 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported
In-Reply-To: <4749745F.9070104@sanbi.ac.za>
References: <4749745F.9070104@sanbi.ac.za>
Message-ID: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>

Hi Allan -

I think the solution is to call the setParserLazy() or some method with a similar name (I don't have the API handy). This will prevent it doing the check.

The original idea of this method was that you could check against a list of version numbers that people had validated. I don't think this is a good idea, as nothing is truly 100% validated and we haven't kept the list up to date.

If there are no objections I would propose to make this method deprecated (and its opposite method) and change the default behaviour to lazy checking.

Best regards.

- Mark

On 11/25/07, Allan Kamau wrote:
>
> Hi all,
> I've searched for a conclusive answer to the "Program ncbi-blastn
> Version is not supported" without success.
> I would like to know format of the blast output the Biojava's blast-like
> parsing framework likes, including some examples (without the data) of
> how such blast output may be created.
> For example, I am using ncbi-blastn and I am generating the blast file
> (which Biojava doesn't like) as follows.
>
> export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
> export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
> export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
> export BLAST_REPORT_TABULAR=somesequence.blast.txt
> export BLAST_REPORT_XML=somesequence.blast.xml
> export BLAST_REPORT=somesequence.blast
> export INPUT_FASTA=somesequence.fasta
> export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
>
> date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i
> $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d
> $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o
> $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i
> $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;
>
> Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
> from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
>
> Then I get the error below.
> > > [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser; > Buildfile: build.xml > > runBlastParser: > [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17 > is not supported by the biojava blast-like parsing framework > [java] at > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret( > BlastLikeSAXParser.java:241) > [java] at > org.biojava.bio.program.sax.BlastLikeSAXParser.parse( > BlastLikeSAXParser.java:160) > > Allan. > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From holland at ebi.ac.uk Mon Nov 26 08:55:56 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 26 Nov 2007 08:55:56 +0000 Subject: [Biojava-l] Applet not able to find DNATools class. In-Reply-To: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu> References: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu> Message-ID: <474A8A1C.4020901@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Sounds like either a classpath problem (in which case check your classpath to ensure all parts of biojava are definitely on it) or a broken biojava.jar (in which case you need to recompile/redownload it). cheers, Richard Abhinav Ram Karhu wrote: > Hello all, > I am having an error while loading the applet. > > I am getting the following stack trace. > > java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools > at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161) > at Trace.init(Trace.java:161) > at sun.applet.AppletPanel.run(Unknown Source) > at java.lang.Thread.run(Unknown Source) > > I have the directory structure in which I am having my class files , the php page and the biojava jar files together in one folder. > > I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java. 
> > My applet code in the php page looks like this: > > > > Please suggest if I am missing something. > > Thanks in advance. > > Abhinav > > > - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHSoob4C5LeMEKA/QRAsfkAJ9SlwIzDulzSDQpAfgh0alISRsplACcDqUx uyQUEmRFEWTdnEHsm7k2lg0= =SWHu -----END PGP SIGNATURE----- From holland at ebi.ac.uk Mon Nov 26 12:55:23 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 26 Nov 2007 12:55:23 +0000 Subject: [Biojava-l] Wrapping SimpleGappedSequence In-Reply-To: <003701c82aba$e85f4320$b91dc960$@au.dk> References: <002701c8277f$9dbdca50$d9395ef0$@au.dk> <473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk> <473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk> <473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d <473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk> <473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk> <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk> <003701c82aba$e85f4320$b91dc960$@au.dk> Message-ID: <474AC23B.3080500@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I have made the changes you suggest below in CVS. Hopefully it will work for you now. cheers, Richard Ditlev Egeskov Brodersen wrote: > Dear Richard and all, > > I've been dissecting the delegation problem encountered when instantiating > SimpleGappedSequence(Sequence) with an already gapped sequence. 
The > constructor calls the parent SimpleGappedSymbolList(), which in Richard's > CVS update of 161107 now contains a separate overloaded constructor for the > gapped case: > > public SimpleGappedSymbolList(GappedSymbolList gappedSource) > > However, when instantiating a new SimpleGappedSequence based on an > existing gapped sequence (with several blocks), the blocks were lost. > > After checking the path of code execution it appeared that for some > reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified > SimpleGappedSequence.java to include an overloaded constructor also for the > descendant class, identical to the other constructor but with a > GappedSequence argument: > > public SimpleGappedSequence(GappedSequence seq) { > super(seq); > this.sequence = seq; > createOnUnderlying = false; > } > > Now, the correct parent constructor > (SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are > two other problems with the new SimpleGappedSymbolList constructor that > needs to be corrected for it to work as expected: First, the initial > introduction of a single, large block is missing from the new code, so > insert: > > Block b = new Block(1, length, 1, length); > blocks.add(b); > > Secondly, the code for transferring the gaps from the sequence string need > to use two separate indices, otherwise the gaps will be placed wrongly > because their position is affected by previously inserted gaps: > > int n=1; > for(int i=1;i<=this.length();i++) { > if(this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i))) > this.addGappInSource(n); > else > n++; > > In other words, the index giving the position of the gaps should only > increment when there are NO gaps at the corresponding position in the gapped > string. 
> > Following these changes, the GappedSequenceTest program from last week now > works as expected: > > aSymbolList = MSE--KLMPRT---TWAKG > aSequence = MSE--KLMPRT---TWAKG > > Gaps are not parsed when a SimpleGappedSequence is constructed from a > gapped Sequence object: > aGapped = MSE--KLMPRT---TWAKG > Gapped position 10 = Plain position 10 > > aSymbolList = MSEKLMPRTTWAKG > aSequence = MSEKLMPRTTWAKG > > Gaps introduced through addGapsInSource work ok: > aGapped = MS--EKLMPR---TTWAKG > Gapped position 10 = Plain position 8 > > Now a new SimpleGappedSequence object is created from the previous one: > aGapped2 = MS--EKLMPR---TTWAKG > Gapped position 10 = Plain position 8 > > -- Ditlev > > -- > > Ditlev E. Brodersen, Ph.D. > Lektor, Associate Professor > > Department of Molecular Biology Office: +45 89425259 > University of Aarhus Lab: +45 89425022 > Gustav Wieds Vej 10c Fax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > > -----Original Message----- > From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l- > bounces at lists.open-bio.org] On Behalf Of Richard Holland > Sent: 18 November 2007 18:12 > To: Ditlev Egeskov Brodersen > Cc: biojava-l at biojava.org > Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence > > Interesting stuff. I'm not sure why it isn't working so I'll have to > have > a closer look. > > I'm currently on annual leave but will investigate when I return (Nov > 27th). > > cheers, > Richard > > On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote: > Hi Richard, > > I thought that was also correct what you say, but I can't get it to > work. > Below is a small test program to check this. First, I create a > SimpleGappedSequence through Text with > gaps-SymbolList-Sequence-GappedSequence. Gaps are there but not > "understood", as expected. Next, I create the same sequence non- > gapped in > the above way, then introduce gaps with addGapsInSource. 
A gapped > location > is now properly translated to a non-gapped sequence position. > Finally, I > create a new SimpleGappedSequence based on the working one - as you > can > see > the gaps are still there but not "understood"... > > aSymbolList = MSE--KLMPRT---TWAKG > aSequence = MSE--KLMPRT---TWAKG > > Gaps are not parsed when a SimpleGappedSequence is constructed from a > gapped > Sequence object: > aGapped = MSE--KLMPRT---TWAKG > Gapped position 10 = Plain position 10 > > aSymbolList = MSEKLMPRTTWAKG > aSequence = MSEKLMPRTTWAKG > > Gaps introduced through addGapsInSource work ok: > aGapped = MS--EKLMPR---TTWAKG > Gapped position 10 = Plain position 8 > > Now a new SimpleGappedSequence object is created from the previous > one: > aGapped2 = MS--EKLMPR---TTWAKG > Gapped position 10 = Plain position 10 > > This should have been compiled with the new biojava.jar of 161107 > (updated > via CVS), but perhaps I made a mistake updating? > > Any clues? > > Thanks, > > Ditlev > > --- > > package gappedsequencetest; > > import org.biojava.bio.*; > import org.biojava.bio.seq.*; > import org.biojava.bio.seq.impl.*; > import org.biojava.bio.symbol.*; > > public class Main { > > public static void main(String[] args) { > SymbolList aSymbolList = null; > try { > aSymbolList = > ProteinTools.createProtein("MSE--KLMPRT---TWAKG"); > > } > catch(BioException ex) {} > > System.out.println("aSymbolList = " + > aSymbolList.seqString()); > > Sequence aSequence = new SimpleSequence(aSymbolList, "", > "mySequence", null); > System.out.println("aSequence = " + aSequence.seqString() + > "\n"); > > SimpleGappedSequence aGapped = new > SimpleGappedSequence(aSequence); > System.out.println("Gaps are not parsed when a > SimpleGappedSequence > is constructed from a gapped Sequence object:"); > System.out.println("aGapped = " + aGapped.seqString()); > System.out.println("Gapped position 10 = Plain position " + > aGapped.gappedToLocation(new PointLocation(10)).getMin()+ "\n"); > > try { > 
aSymbolList = > ProteinTools.createProtein("MSEKLMPRTTWAKG"); > } > catch(BioException ex) {} > > System.out.println("aSymbolList = " + > aSymbolList.seqString()); > > aSequence = new SimpleSequence(aSymbolList, "", "mySequence", > null); > System.out.println("aSequence = " + aSequence.seqString() + > "\n"); > > aGapped = new SimpleGappedSequence(aSequence); > aGapped.addGapsInSource(9, 3); > aGapped.addGapsInSource(3, 2); > System.out.println("Gaps introduced through addGapsInSource > work > ok:"); > System.out.println("aGapped = " + aGapped.seqString()); > System.out.println("Gapped position 10 = Plain position " + > aGapped.gappedToLocation(new PointLocation(10)).getMin()+ "\n"); > > SimpleGappedSequence aGapped2 = new > SimpleGappedSequence(aGapped); > System.out.println("Now a new SimpleGappedSequence object is > created > from the previous one:"); > System.out.println("aGapped2 = " + aGapped2.seqString()); > System.out.println("Gapped position 10 = Plain position " + > aGapped2.gappedToLocation(new PointLocation(10)).getMin()+ "\n"); > } > > } > > -- > > Ditlev Egeskov Brodersen > Lektor > Bakkefaldet 30, Hasle > 8210 ?rhus V > > www.lindeman-brodersen.dk > > > -----Original Message----- > From: Richard Holland [mailto:holland at ebi.ac.uk] > Sent: 16 November 2007 13:46 > To: Ditlev Egeskov Brodersen > Cc: biojava-l at biojava.org > Subject: Re: Wrapping SimpleGappedSequence > > SimpleGappedSequence extends SimpleGappedSymbolList, and the > constructor > delegates to the SimpleGappedSymbolList constructor. > > When you extend SimpleGappedSequence you should delegate in your new > constructor to the existing SimpleGappedSequence constructor, which >> in > turn will delegate as above and preserve the gaps. > > By passing any object which implements GappedSymbolList to the > SimpleGappedSequence constructor, e.g. 
SimpleGappedSequence or > SimpleGappedSymbolList, it will automatically choose the new > constructor > from SimpleGappedSymbolList which you hopefully should be able to >> see > in > the code you have just checked out. If passed any other > non-GappedSymbolList object, it will use the old constructor that > already existed from before. > > cheers, > Richard > > Ditlev Egeskov Brodersen wrote: > Hi again, > > I updated CVS and got the new SimpleGappedSymbolList class, but > there > seems to be no changes to the SimpleGappedSequence class, which is > the one I > need to extend...have I missed something? > > Ditlev > > -- > > Ditlev E. Brodersen, Ph.D. > Lektor, Associate Professor > > Department of Molecular Biology Office: +45 89425259 > University of Aarhus Lab: +45 89425022 > Gustav Wieds Vej 10c Fax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > > -----Original Message----- > From: Richard Holland [mailto:holland at ebi.ac.uk] > Sent: 16 November 2007 11:47 > To: Ditlev Egeskov Brodersen > Cc: biojava-l at biojava.org > Subject: Re: Wrapping SimpleGappedSequence > > The easiest way is simply for me to alter the constructor to > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) >> to > copy all gaps if passed another instance of GappedSymbolList as >> the > parameter. I've just done this in CVS so you should be able to >> update > your copy and observe the new behaviour. > > cheers, > Richard > > Ditlev Egeskov Brodersen wrote: > Hi again, > > thanks for the info - will do the check just to be proper. I > have > another > question: In my application, I would like to wrap the retrieved > SimpleGappedSequence objects inside another object that extends > the > functionality with application-specific stuff. 
Ideally, I would >> do > this by > extending the SimpleGappedSequence object and create it by >> passing > the > SimpleGappedSequence from the alignment import to the >> constructor > of > the > parent, like so: > > class AlignedSequence extends SimpleGappedSequence { > public AlignedSequence(SimpleGappedSequence aGapped) { > super(aGapped); > } > > ..custom stuff.. > } > > However, the problem is that there is only one constructor for >> the > SimpleGappedSequence, one which takes a simple Sequence object. >> I > can > pass > the derived class alright, but all gap information is lost >> again, > presumably > because the SimpleGappedSequence constructor just takes out the > seqString() > and puts it into its own sequence object. > > Shouldn't the constructor of the SimpleGappedSequence class > recognise > when a > derived (and gapped) sequence object is passed, and process it > accordingly? > As it stands, I am forced to include the SimpleGappedSequence >> as a > private > member of the AlignedSequence class, which is not near as nice > since > all > statement using the class will have to do something like > > class AlignedSequence extends SimpleGappedSequence { > private SimpleGappedSequence gapped_sequence; > > public AlignedSequence(SimpleGappedSequence aGapped) { > gapped_sequence = aGapped; > } > > public SimpleGappedSequence getGappedSequence() { > return(gapped_sequence); > } > > ..custom stuff.. > } > > ... > > AlignedSequence aAligned = new AlignedSequence(aGapped); > aAligned.getGappedSequence().seqString(); > > rather than simply: > > AlignedSequence aAligned = new AlignedSequence(aGapped); > aAligned.seqString(); > > In other words, is there any solution with the current setup >> that > would > allow me to extend SimpleGappedSequence and not loose the gap > information? > -- Ditlev > > -- > > Ditlev E. Brodersen, Ph.D. 
> Lektor, Associate Professor > > Department of Molecular Biology Office: +45 89425259 > University of Aarhus Lab: +45 89425022 > Gustav Wieds Vej 10c Fax: +45 86123178 > DK-8000 Aarhus C Email: deb at mb.au.dk > Denmark Lab WWW: www.bioxray.dk/~deb > > > -----Original Message----- > From: Richard Holland [mailto:holland at ebi.ac.uk] > Sent: 16 November 2007 10:50 > To: Ditlev Egeskov Brodersen > Cc: biojava-l at biojava.org > Subject: Re: [Biojava-l] Parsing exising gaps > > The returned gapped sequences are all properly set up with > gaps, > name etc. > But as for other users, I think there may be some problems, > since > the > SimpleAlignment object only has a general symbol list >> iterator, > the > user > will have to cast each statement extracting a sequence >> object, > and > > SimpleSequence aSimple = > (SimpleSequence)aSequences.next(); > > returns an ClassCastException at run time. So old code might > not > run > with > the update as far as I can see. > This is true. However, such code would be unsupported by us as >> the > API > clearly states that SimpleAlignment returns SymbolList >> instances, > and > does not make any guarantees about the exact implementation > details > of > the objects it returns. To attempt to cast it to anything other > than > SymbolList would be a mistake! (Although actually it is now > returning > a > guarantee of GappedSymbolList, which is what your code can now > take > advantage of). To assume it will return SimpleSequence is >> outside > the > behaviour defined by the API and therefore should not be relied > upon. > > A more correct behaviour would be to test each item returned: > > SymbolList symlist = aSequences.next(); > if (symlist instanceof SimpleSequence) { > SimpleSequence seq = (SimpleSequence)symlist; > // Do simple-sequence stuff > } else { > // Do something else! 
> } > > In future, I will modify the API to change the SymbolList > guarantee > to > a > GappedSymbolList guarantee, but I can't do this right now as >> this > really > would break everyone's code! > > We are currently planning a redesign as you may be aware, so > issues > like > this will hopefully be resolved as part of that process. For a > start, > if > we use Java 5 generics in future as we plan, we can strictly > specify > what kinds of objects will be returned by things such as the > alignment > API, making it easier for us to enforce API-compliant behaviour >> in > user's code. > > cheers, > Richard > > -- > Richard Holland (BioMart) > EMBL EBI, Wellcome Trust Genome Campus, > Hinxton, Cambridgeshire CB10 1SD, UK > Tel. +44 (0)1223 494416 > > http://www.biomart.org/ > http://www.biojava.org/ > -- > Richard Holland > BioMart (http://www.biomart.org/) > EMBL-EBI > Hinxton, Cambridgeshire CB10 1SD, UK > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHSsI64C5LeMEKA/QRAg21AKCieEvT2KaWBFdqLFUtxazhHXmD2wCgiRwk Bz79hrJxD/eZrrCUXUAh758= =0Jpp -----END PGP SIGNATURE----- From allank at sanbi.ac.za Mon Nov 26 12:02:56 2007 From: allank at sanbi.ac.za (Allan Kamau) Date: Mon, 26 Nov 2007 14:02:56 +0200 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> Message-ID: <474AB5F0.6040802@sanbi.ac.za> Hi Mark, Thank you for your reply. 
Calling the setModeLazy() method of the BlastLikeSAXParser object did provide the cure. Allan. Mark Schreiber wrote: > Hi Allan - > > I think the solution is to call the setParserLazy() or some method > with a similar name (I don't have the API handy). This will prevent it > doing the check. > > The original idea of this method was you could check against a list of > version numbers that people had validated. I don't think this is a > good idea as nothing is truly 100% validated and we haven't kept the > list up to date. If there are no objections I would propose to make > this method deprecated (and its opposite method) and change the > default behaviour to lazy checking. > > Best regards. > > - Mark > > > On 11/25/07, *Allan Kamau* > wrote: > > Hi all, > I've searched for a conclusive answer to the "Program ncbi-blastn > Version is not supported" without success. > I would like to know the format of blast output that Biojava's > blast-like > parsing framework accepts, including some examples (without the data) of > how such blast output may be created. > For example, I am using ncbi-blastn and I am generating the blast > file > (which Biojava doesn't like) as follows. 
> > export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb; > export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall; > export > REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta; > export BLAST_REPORT_TABULAR=somesequence.blast.txt > export BLAST_REPORT_XML=somesequence.blast.xml > export BLAST_REPORT=somesequence.blast > export INPUT_FASTA=somesequence.fasta > export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence > > date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i > $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d > $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o > $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d > $REFERENCES_FASTA_NAME -i > $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date; > > Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied > from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser" > > Then I get the error below. > > > [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser; > Buildfile: build.xml > > runBlastParser: > [java] org.xml.sax.SAXException: Program ncbi-blastn Version > 2.2.17 > is not supported by the biojava blast-like parsing framework > [java] at > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java > :241) > [java] at > org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160) > > Allan. 
> _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From markjschreiber at gmail.com Tue Nov 27 03:16:35 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 27 Nov 2007 11:16:35 +0800 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474AB5F0.6040802@sanbi.ac.za> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> Message-ID: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Hi - Does anyone mind if I change the default behavior to lazy parsing? Technically this would be a break in backwards compatibility (although only if you have a program that relies on strict parsing). Last chance to complain. - Mark On Nov 26, 2007 8:02 PM, Allan Kamau wrote: > Hi Mark, > Thank you for your reply. > Calling setModeLazy() method of the object of type BlastLikeSAXParser > did provide the cure. > > Allan. > > > Mark Schreiber wrote: > > Hi Allan - > > > > I think the solution is to call the setParserLazy() or some method > > with a similar name (I don't have the API handy). This will prevent it > > doing the check. > > > > The original idea of this method was you could check against a list of > > version numbers that people had validated. I don't think this is a > > good idea as nothing is truly 100% validated and we haven't kept the > > list up to date. If there are no objections I would propose to make > > this method deprecated (and its opposite method) and change the > > default behaviour to lazy checking. > > > > Best regards. > > > > - Mark > > > > > > On 11/25/07, *Allan Kamau* > > > > > wrote: > > > > Hi all, > > I've searched for a conclusive answer to the "Program ncbi-blastn > > Version is not supported" without success. 
> > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > > > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > From holland at ebi.ac.uk Tue Nov 27 08:40:10 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 27 Nov 2007 08:40:10 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Message-ID: <474BD7EA.4040604@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Sounds good to me. Mark Schreiber wrote: > Hi - > > Does anyone mind if I change the default behaivor to lazy parsing? > Technically this would be a break in backwards compatibility (although > only if you have a program that relies on strict parsing). > > Last chance to complain. > > - Mark > > On Nov 26, 2007 8:02 PM, Allan Kamau wrote: >> Hi Mark, >> Thank you for your reply. >> Calling setModeLazy() method of the object of type BlastLikeSAXParser >> did provide the cure. >> >> Allan. >> >> >> Mark Schreiber wrote: >>> Hi Allan - >>> >>> I think the solution is to call the setParserLazy() or some method >>> with a similar name (I don't have the API handy). This will prevent it >>> doing the check. >>> >>> The original idea of this method was you could check against a list of >>> version numbers that people had validated. I don't think this is a >>> good idea as nothing is truely 100% validated and we haven't kept the >>> list up to date. If there are no objections I would propose to make >>> this method depricated (and it's opposite method) and change the >>> default behaivour to lazy checking. >>> >>> Best regards. 
>>> >>> - Mark >>> >>> >>> On 11/25/07, *Allan Kamau* > >> >>> > wrote: >>> >>> Hi all, >>> I've searched for a conclusive answer to the "Program ncbi-blastn >>> Version is not supported" without success. >>> I would like to know format of the blast output the Biojava's >>> blast-like >>> parsing framework likes, including some examples (without the data) of >>> how such blast output may be created. >>> For example, I am using ncbi-blastn and I am generating the blast >>> file >>> (which Biojava doesn't like) as follows. >>> >>> export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb; >>> export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall; >>> export >>> REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta; >>> export BLAST_REPORT_TABULAR=somesequence.blast.txt >>> export BLAST_REPORT_XML=somesequence.blast.xml >>> export BLAST_REPORT=somesequence.blast >>> export INPUT_FASTA=somesequence.fasta >>> export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence >>> >>> date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i >>> $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d >>> $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o >>> $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d >>> $REFERENCES_FASTA_NAME -i >>> $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date; >>> >>> Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied >>> from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser" >>> >>> Then I get the error below. 
>>> >>> >>> [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser; >>> Buildfile: build.xml >>> >>> runBlastParser: >>> [java] org.xml.sax.SAXException: Program ncbi-blastn Version >>> 2.2.17 >>> is not supported by the biojava blast-like parsing framework >>> [java] at >>> org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java >>> :241) >>> [java] at >>> org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160) >>> >>> Allan. >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> >> >> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >>> >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. +44 (0)1223 494416 http://www.biomart.org/ http://www.biojava.org/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHS9fq4C5LeMEKA/QRAm/3AJ9hi2yrSyeK6a3nXtObyJ2MAk0Y1QCeL5HT iYQc6HTdm6fJ+Lcfssnd34g= =VuJJ -----END PGP SIGNATURE----- From ap3 at sanger.ac.uk Tue Nov 27 10:24:49 2007 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 27 Nov 2007 10:24:49 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Message-ID: > Does anyone mind if I change the default behaivor to lazy parsing? Hi Mark, I think this is a good idea. 
we had a couple of questions and feature requests recently regarding the blast parser, so I wonder if we should have a look at how to make it (and the documentation) better also during the V3 discussion... Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From holland at ebi.ac.uk Tue Nov 27 11:01:33 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 27 Nov 2007 11:01:33 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> Message-ID: <474BF90D.3070003@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > we had a couple of questions and feature requests recently regarding > the blast parser, so I wonder if we should > have a look at how to make it (and the documentation) better also > during the V3 discussion... A rethink of the blast parser is definitely a good idea. It's starting to need more work than before as the various subtly different file formats used by the most recent versions and variants of blast have evolved beyond the tolerance limits of the existing parser. It also needs to be made simpler to use. 
cheers, Richard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb Nu2SZ7yV4e0rUmyIBxNYTJU= =9nHg -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Tue Nov 27 11:11:30 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 27 Nov 2007 11:11:30 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474BF90D.3070003@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> Message-ID: <474BFB62.3040203@ebi.ac.uk> What format options are there from blast? If it supports CIGAR or something similar, would we be better off providing a parser for that format and saying that we do not support the traditional blast output? That said, it doesn't help us when that format changes, so maybe what is needed is a way to push out parser changes without requiring a full biojava release (v3 discussion) ... Andy Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > >> we had a couple of questions and feature requests recently regarding >> the blast parser, so I wonder if we should >> have a look at how to make it (and the documentation) better also >> during the V3 discussion... > > A rethink of the blast parser is definitely a good idea. It's starting > to need more work than before as the various subtly different file > formats used by the most recent versions and variants of blast have > evolved beyond the tolerance limits of the existing parser. It also > needs to be made simpler to use. 
> > cheers, > Richard > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb > Nu2SZ7yV4e0rUmyIBxNYTJU= > =9nHg > -----END PGP SIGNATURE----- > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From holland at ebi.ac.uk Tue Nov 27 11:18:59 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Tue, 27 Nov 2007 11:18:59 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474BFB62.3040203@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> Message-ID: <474BFD23.8060005@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > What format options are there from blast? Just thinking if it supports > CIGAR or something like that are we better providing a parser for that > format & saying that we do not support the traditional blast output? > That said it doesn't help is when that format changes so maybe what is > needed is a way to push out parser changes without requiring a full > biojava release (v3 discussion) ... Exactly! So the modular idea would work nicely here - we could have a blast module and only update that single module (which would be its own JAR) whenever the format changes. In a way, BioJava releases as such would no longer happen, except maybe for some kind of core BioJava module. Everything would be done in terms of individual module+JAR releases instead - one for Genbank, one for BioSQL, one for NEXUS, one for Phylogenetic tools, one for translation/transcription, etc. etc. 
cheers, Richard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHS/0j4C5LeMEKA/QRAkQuAJ9B+mmV7vo9QuFYwEgmnHczExyXqwCfamIx uPFQKdbXRC7pwC6lM5aBcJk= =F3PD -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Tue Nov 27 11:47:54 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Tue, 27 Nov 2007 11:47:54 +0000 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474BFD23.8060005@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> <474BFD23.8060005@ebi.ac.uk> Message-ID: <474C03EA.4070706@ebi.ac.uk> I think Groovy have adopted a similar system recently & have guidelines for how each module should behave (dependencies, build system etc). This enforces the idea that a module whilst not part of the core project must behave in the same manner the core does. I do like the idea that we can have a core biojava & things get added around it & it might encourage other users to start developing their own modules for any formats/purpose they want. Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > >> What format options are there from blast? Just thinking if it supports >> CIGAR or something like that are we better providing a parser for that >> format & saying that we do not support the traditional blast output? >> That said it doesn't help is when that format changes so maybe what is >> needed is a way to push out parser changes without requiring a full >> biojava release (v3 discussion) ... > > Exactly! So the modular idea would work nicely here - we could have a > blast module and only update that single module (which would be its own > JAR) whenever the format changes. 
In a way, BioJava releases as such > would no longer happen, except maybe for some kind of core BioJava > module. Everything would be done in terms of individual module+JAR > releases instead - one for Genbank, one for BioSQL, one for NEXUS, one > for Phylogenetic tools, one for translation/transcription, etc. etc. > > cheers, > Richard From markjschreiber at gmail.com Tue Nov 27 14:48:12 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 27 Nov 2007 22:48:12 +0800 Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported In-Reply-To: <474C03EA.4070706@ebi.ac.uk> References: <4749745F.9070104@sanbi.ac.za> <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com> <474AB5F0.6040802@sanbi.ac.za> <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com> <474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk> <474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk> Message-ID: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com> For a long time now my feeling has been that we should *only* support the XML version of blast output. The other formats are too brittle to be easy to parse. I also feel similarly about Genbank, EMBL, etc that may be an extreme view but the power of generic XML parsers and things like XPath etc really make these formats look very attractive. - Mark On Nov 27, 2007 7:47 PM, Andy Yates wrote: > I think Groovy have adopted a similar system recently & have guidelines > for how each module should behave (dependencies, build system etc). This > enforces the idea that a module whilst not part of the core project must > behave in the same manner the core does. I do like the idea that we can > have a core biojava & things get added around it & it might encourage > other users to start developing their own modules for any > formats/purpose they want. > > Richard Holland wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > >> What format options are there from blast? 
Just thinking if it supports
> >> CIGAR or something like that are we better providing a parser for that
> >> format & saying that we do not support the traditional blast output?
> >> That said it doesn't help is when that format changes so maybe what is
> >> needed is a way to push out parser changes without requiring a full
> >> biojava release (v3 discussion) ...
> >
> > Exactly! So the modular idea would work nicely here - we could have a
> > blast module and only update that single module (which would be its own
> > JAR) whenever the format changes. In a way, BioJava releases as such
> > would no longer happen, except maybe for some kind of core BioJava
> > module. Everything would be done in terms of individual module+JAR
> > releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> > for Phylogenetic tools, one for translation/transcription, etc. etc.
> >
> > cheers,
> > Richard
> > _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk Tue Nov 27 15:16:12 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 15:16:12 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not supported
In-Reply-To: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>
 <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
 <474AB5F0.6040802@sanbi.ac.za>
 <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
 <474BF90D.3070003@ebi.ac.uk>
 <474BFB62.3040203@ebi.ac.uk>
 <474BFD23.8060005@ebi.ac.uk>
 <474C03EA.4070706@ebi.ac.uk>
 <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>
Message-ID: <474C34BC.4070209@ebi.ac.uk>

I was always under the impression that blast's XML output was nearly as
hard to parse as the flat file format, but I do agree that using XML
wherever we can would make writing parsers a lot easier (especially if
there are SAX-based XPath libraries available). Actually this brings up
a good question about development of this type of parser. The majority
of XPath-supporting libraries are DOM-based, which means large memory
usage in some situations but overall an easier coding experience (and
hopefully fewer chances of creating bugs). Or should we code to the edge
case of someone trying to parse a 1GB XML file? Personally I'd favour
the former.

Going back to the original topic, there are going to be situations where
people want the flat file parsers/writers & I think it's a valid point
to say this is where BioJava is meant to come in & help a developer.
After all, XML is a computer science problem whereas parsing an EMBL
flat file or blast output is a bioinformatics problem.

Andy


Mark Schreiber wrote:
> For a long time now my feeling has been that we should *only* support
> the XML version of blast output. The other formats are too brittle to
> be easy to parse. I also feel similarly about Genbank, EMBL, etc that
> may be an extreme view but the power of generic XML parsers and things
> like XPath etc really make these formats look very attractive.
>
> - Mark
>
>
> On Nov 27, 2007 7:47 PM, Andy Yates wrote:
>> I think Groovy have adopted a similar system recently & have guidelines
>> for how each module should behave (dependencies, build system etc). This
>> enforces the idea that a module whilst not part of the core project must
>> behave in the same manner the core does. I do like the idea that we can
>> have a core biojava & things get added around it & it might encourage
>> other users to start developing their own modules for any
>> formats/purpose they want.
>>
>> Richard Holland wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>>> What format options are there from blast? Just thinking if it supports
>>>> CIGAR or something like that are we better providing a parser for that
>>>> format & saying that we do not support the traditional blast output?
>>>> That said it doesn't help is when that format changes so maybe what is
>>>> needed is a way to push out parser changes without requiring a full
>>>> biojava release (v3 discussion) ...
>>> Exactly! So the modular idea would work nicely here - we could have a
>>> blast module and only update that single module (which would be its own
>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>> would no longer happen, except maybe for some kind of core BioJava
>>> module. Everything would be done in terms of individual module+JAR
>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>
>>> cheers,
>>> Richard

From markjschreiber at gmail.com Wed Nov 28 03:34:38 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 28 Nov 2007 11:34:38 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
Message-ID: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>

Hi -

I think in most cases huge XML files in bioinformatics result from a
single XML document containing many repeated elements, e.g. a BLAST XML
output with several hits or a GenBank XML file with many sequences. A
nice approach I have seen for dealing with these is to use SAX to read
over the file and, every time it comes to one of those repeated
elements, delegate that element to a DOM object. You then parse the bits
of the DOM you want with XPath, or convert them to objects or something,
and when you are finished with that entry everything gets garbage
collected and the SAX parser moves to the next element and repeats the
whole process. This is a hybrid of event-based parsing and object-model
based parsing which could let you deal efficiently with huge files.

I think the BLAST XML has improved substantially, at least in terms of
validating against its own DTD. The DTD itself may not be the best
design but that is always a matter of taste, and if you are using XPath
to get the relevant bits you don't need to make a SAX parser jump
through hoops to get them.

I agree we will have to keep flat file parsers but we should strongly
encourage the use of XML where possible. It is simply easier to deal
with. Most biological flat files were designed for Fortran and mainly
for human consumption. There is no obvious validation mechanism.
Notably everything in NCBI is derived from ASN.1; what you see in the
flat file is produced from there. I tend to think this means that the
ASN.1 is the holy gospel and what you get in the flat file is some
translation. Ideally NCBI files should be parsed from the ASN.1, where
you can guarantee validation; the more practical alternative is to use
the XML, which you can at least validate against a DTD.

With XML we (Biojava) can say if it validates we will parse it and if
it doesn't we may not. With flat files there are so many dodgy
variants we cannot say anything. Because XML DTDs (or XSDs) have
versions it is also much easier to have parsers for different
versions, and the parsing machinery can figure out which is needed.
With flat files it is anyone's guess what version you are dealing with.

Finally, parsers can be auto-generated for XML if you have the DTD or
XSD. This often doesn't give you an ideal parser but it can be a
useful starting point for rapid development.

For Biojava v3 I think we should concentrate on XML parsers first and
flat files second.
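
A rough sketch of the SAX-then-DOM idea described in this message, using
only classes bundled with the JDK (javax.xml.parsers, javax.xml.xpath,
org.xml.sax). The class name SaxDomHybridSketch, the method extract()
and the tiny BLAST-like sample document are made up for illustration;
this is not BioJava code, just one way the approach could look. Real
files may carry a DOCTYPE, which a real parser would also have to decide
how to resolve.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of BioJava.
class SaxDomHybridSketch {

    /**
     * Streams the XML with SAX. Each element named recordName is built into
     * its own small DOM document, queried with xpathExpr, then dropped.
     */
    public static List<String> extract(String xml, final String recordName, String xpathExpr)
            throws Exception {
        final List<String> results = new ArrayList<String>();
        final DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        final XPathExpression expr =
                XPathFactory.newInstance().newXPath().compile(xpathExpr);

        DefaultHandler handler = new DefaultHandler() {
            private Document doc;  // DOM of the record being read, null between records
            private Node current;  // node that new children are appended to

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (doc == null) {
                    if (!qName.equals(recordName)) {
                        return; // outside any record: nothing is kept in memory
                    }
                    doc = builder.newDocument();
                    current = doc;
                }
                Element e = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    e.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (doc != null) {
                    current.appendChild(doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (doc == null) {
                    return;
                }
                current = current.getParentNode();
                if (current == doc) { // the record's root element has just closed
                    try {
                        results.add(expr.evaluate(doc));
                    } catch (XPathExpressionException ex) {
                        throw new SAXException(ex);
                    }
                    doc = null;      // let the finished record be garbage collected
                    current = null;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<BlastOutput>"
                + "<Hit><Hit_id>gi|1</Hit_id><Hit_len>10</Hit_len></Hit>"
                + "<Hit><Hit_id>gi|2</Hit_id><Hit_len>20</Hit_len></Hit>"
                + "</BlastOutput>";
        // One XPath result per <Hit> record.
        System.out.println(extract(xml, "Hit", "/Hit/Hit_id"));
    }
}
```

Only one record's DOM is alive at any time, so memory use should depend
on the size of the largest record rather than on the size of the file.
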
If only Fasta had an XML format!

- Mark

From ayates at ebi.ac.uk Wed Nov 28 14:29:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 14:29:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
Message-ID: <474D7B3B.8030807@ebi.ac.uk>

Hi Mark,

Okay, that sounds like a perfectly sensible way to deal with this. Is
this kind of parsing model supported in Java5? I only ask as I've not
done a lot of XML parsing with Java5; more with things like XOM (which I
think offers a DOM-only representation, but I'm probably wrong).

That's good. There's not a huge point in having a format & a DTD/XSD and
then having your files not conform to it.

I was thinking the exact same thing about ASN.1 (well that & it looks
bleeding horrible to parse, but that is an uneducated look at the format,
which I'm sure is as parsable as JSON & the like).

When it comes to flat file parsers I would be happier to provide
implementations of the more common formats where a viable alternative is
not available, e.g. UniProt, EMBL, Genbank etc. Then groups which provide
similar output to the above have a chance to write their own
parsers/formatters. This is very similar to the current situation but we
just need to remove dependencies on statically located data structures
(don't get rid of them completely, just give users an option to not use
them).

I'm not sure how much automatically generated parsers would help us. I
guess it depends on whether the data model(s) we use are auto-parser
friendly (which normally means POJO/JavaBean conventions including the
no-args constructor).

Cool, I don't want to exclude flat file parsers completely (if only
because my group has an interest in BioJava being able to read & write
flat files) :)

They decided to have HUPO-PSI Format instead :)

Andy

From dmitry.repchevski at bsc.es Wed Nov 28 14:49:23 2007
From: dmitry.repchevski at bsc.es (Dmitry Repchevsky)
Date: Wed, 28 Nov 2007 15:49:23 +0100
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com
Message-ID: <474D7FF3.9010901@bsc.es>

Hello!

Actually there is also a StAX parser (http://en.wikipedia.org/wiki/StAX)
which is faster than SAX and allows writing.
In JDK 6, apart from StAX, there is JAXB, and the two make a perfect
combination for parsing huge files.
You can go through the XML file using StAX until you reach the element
you are interested in and unmarshal it to a POJO using JAXB.

Cheers,

Dmitry

From ayates at ebi.ac.uk Wed Nov 28 15:37:03 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 15:37:03 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7FF3.9010901@bsc.es>
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com
 <474D7FF3.9010901@bsc.es>
Message-ID: <474D8B1F.8070301@ebi.ac.uk>

Hi Dmitry,

StAX still has higher memory consumption than SAX (still not as large as
DOM) but yes, it is quite a good parser system & since we're moving
towards the later versions of Java it may be a good idea to use it as
our standard parser ... if it supports XPath (can't remember off the top
of my head) :)

Andy

From markjschreiber at gmail.com Fri Nov 30 02:28:58 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 30 Nov 2007 10:28:58 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7B3B.8030807@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
 <474D7B3B.8030807@ebi.ac.uk>
Message-ID: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>

Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
not XQuery, although XPath is probably more important for this use.

The DOM model is a direct implementation of the W3C standard, which
makes it a little awkward from a Java point of view, but it is usable.

Java 6 has StAX (the other one).

There are a few Java APIs for parsing ASN.1, mostly developed for the
telco industry. I've never really looked into which is best (anyone
experienced with this?) but we could probably use one to work directly
off NCBI ASN.1.

- Mark

From heuermh at acm.org Fri Nov 30 06:06:26 2007
From: heuermh at acm.org (Michael Heuer)
Date: Fri, 30 Nov 2007 01:06:26 -0500 (EST)
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID:

Mark Schreiber wrote:

> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
>
> Java 6 has StAX (the other one).

Yeah, those jerks. :)

I wrote a note to the spec author a few weeks before "the other" StAX was
announced at a JavaOne, however long ago, asking them to reconsider their
project name.

Oh well. We can still be the "original" StAX.

> http://stax.sf.net


May I kindly suggest skipping all of this talk about XML and have us
jump straight to OWL? ;)

> http://dev.isb-sib.ch/projects/uniprot-rdf/

   michael

From ayates at ebi.ac.uk Fri Nov 30 09:18:45 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:18:45 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To:
References:
Message-ID: <474FD575.3060307@ebi.ac.uk>

Michael Heuer wrote:
> Mark Schreiber wrote:
>
>> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
>> not XQuery although XPath is probably more important for this use.
>>
>> The DOM model is a direct implementation of the W3C standard which
>> makes it a little awkward from a java point of view but it is usable.
>>
>> Java 6 has StAX (the other one).
>
> Yeah, those jerks. :)
>
> I wrote a note to the spec author a few weeks before "the other" StAX was
> announced at a Java One however long ago asking them to reconsider their
> project name.
>
> Oh well. We can still be the "original" StAX.
>
>> http://stax.sf.net

Yup, I remember that issue from BOSC 2005 ... oh well, not a lot that can
be done now. Maybe a re-brand of our StAX to StAX Original. Bit like the
Coca Cola & New Coke mess-up.

>
>
> May I kindly suggest skipping all of this talk about XML and have us
> jump straight to OWL? ;)
>
>> http://dev.isb-sib.ch/projects/uniprot-rdf/

Lol just let me fire up my semantic web engine first :).
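
Since StAX (the javax.xml.stream one bundled with Java 6) comes up
several times in this thread, here is a minimal sketch of its cursor
style: walk forward through the document and pull out only the elements
of interest. The class name StaxCursorSketch, the method elementTexts()
and the sample XML are invented for illustration. The JAXB unmarshalling
step Dmitry suggests is left out so that the sketch depends only on
javax.xml.stream.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of BioJava.
class StaxCursorSketch {

    /** Returns the text of every element with the given name, in document order. */
    public static List<String> elementTexts(String xml, String elementName) throws Exception {
        List<String> texts = new ArrayList<String>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // The caller pulls events one at a time instead of receiving callbacks.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // getElementText() is only valid for text-only elements; it
                    // leaves the cursor on the matching END_ELEMENT.
                    texts.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<seqs><seq><id>P1</id></seq><seq><id>P2</id></seq></seqs>";
        System.out.println(elementTexts(xml, "id"));
    }
}
```

Because the loop is driven by the caller, it can stop early or hand the
reader to another component once the element of interest is reached,
which is what makes the StAX-plus-binding combination workable.
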

From ayates at ebi.ac.uk Fri Nov 30 09:26:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:26:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
 <474D7B3B.8030807@ebi.ac.uk>
 <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <474FD737.9080801@ebi.ac.uk>

I think I've seen XPath hanging around in other people's code in a 1.5
code-base (in fact, code from one of the guys I work with). I've used
Java's DOM before & it really isn't very nice & is quite verbose. I'd
prefer it if there was a better alternative/wrapper around the XML
parsers just to cut down on code chatter.

Wow, I've just visited http://asn1.elibel.tm.fr/links/ looking for these
Java tools & I think I've gone cross-eyed with the sheer number of
acronyms! You've gotta love something which seems to add a letter to ER
& that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
list know of an ASN.1 parser for Java that's good, and should we support
it (considering NCBI generate their DTD & XML from the ASN.1
representation)?

Andy

Mark Schreiber wrote:
> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
>
> Java 6 has StAX (the other one).
>
> There are a few java API's for parsing ASN.1 mostly developed for the
> telco industry, I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
>
> - Mark
>
> On Nov 28, 2007 10:29 PM, Andy Yates wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5?
I only ask as I've not >> done a lot of XML parsing with Java5; more with things like XOM (which I >> think offers a DOM only representation but I'm probably wrong). >> >> That's good. There's not a huge point to have a format & a DTD/XSD and >> then have your files not conform to it. >> >> I was thinking the exact same thing about ASN.1 (well that & it looks >> bleeding horrible to parse but that is an un-educated look at the format >> which I'm sure is a parsable as JSON & the alike). >> >> When it comes to flat file parsers I would be happier to provide >> implementations of the more common formats where a viable alternative is >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide >> similar output to the above have a chance to write their own >> parsers/formatters. This is very similar to the current situation but we >> just need to remove dependencies on statically located data structures >> (don't get rid of them completely just give users an option to not use >> them). >> >> I'm not sure how much automatically generated parsers would help us. I >> guess it depends on the data model(s) we use if they are auto-parser >> friendly (which normally means POJO/JavaBean conventions including the >> no-args constructor). >> >> Cool I don't want to exclude flat file parsers completely (if only >> because my group has an interest in BioJava being able to read & write >> flat files) :) >> >> They decided to have HUPO-PSI Format instead :) >> >> Andy >> >> >> Mark Schreiber wrote: >>> Hi - >>> >>> I think in most cases huge XML files in bioinformatics result from a >>> single XML containing multiple repetitive elements. Eg a BLAST XML >>> output with several hits or a GenBankXML with many Sequences. A nice >>> approach I have seen for dealing with these is to use SAX to read over >>> the file and every time it comes to an element it delegates to a DOM >>> object. 
>>> You then parse the bits of the DOM you want with XPath or
>>> convert to objects or something and then when you are finished with
>>> that entry everything gets garbage collected and the SAX parser moves
>>> to the next element and repeats the whole process. This is a hybrid
>>> of event based parsing and object-model based parsing which could let
>>> you efficiently deal with huge files.
>>>
>>> I think the BLAST XML has improved substantially, at least in terms of
>>> validating against its own DTD. The DTD itself may not be the best
>>> design but that is always a matter of taste and if you are using XPath
>>> to get the relevant bits you don't need to make a SAX parser jump
>>> through hoops to get them.
>>>
>>> I agree we will have to keep flat file parsers but we should strongly
>>> encourage the use of XML where possible. It is simply easier to deal
>>> with. Most biological flat-files were designed for Fortran and mainly
>>> for human consumption. There is no obvious validation mechanism.
>>> Notably everything in NCBI is derived from ASN.1, what you see in the
>>> flatfile is produced from there. I tend to think this means that the
>>> ASN.1 is the holy gospel and what you get in the flat file is some
>>> translation. Ideally NCBI files should be parsed from the ASN.1 where
>>> you can guarantee validation, the more practical alternative is to use
>>> the XML which you can at least validate against a DTD.
>>>
>>> With XML we (Biojava) can say if it validates we will parse it and if
>>> it doesn't we may not. With flat files there are so many dodgy
>>> variants we cannot say anything. Because XML dtds (or xsd's) have
>>> versions it also makes it much easier to have parsers for different
>>> versions and the parsing machinery can figure out which is needed.
>>> With flat files it is anyone's guess what version you are dealing with.
>>>
>>> Finally parsers can be auto-generated for XML if you have the DTD or
>>> XSD.
>>> This often doesn't give you an ideal parser but it can be a
>>> useful starting point for rapid development.
>>>
>>> For Biojava v 3 I think we should concentrate on XML parsers first and
>>> flat files second. if only Fasta had an XML format
>>>
>>> - Mark
>>>
>>> On Nov 27, 2007 11:16 PM, Andy Yates wrote:
>>>> I was always under the impression that blast's XML output was nearly as
>>>> hard to parse as the flat file format but I do agree that if we can use
>>>> XML whenever we can it would make writing parsers a lot easier
>>>> (especially if there are SAX based XPath libraries available). Actually
>>>> this brings up a good question about development of this type of parser.
>>>> The majority of XPath supporting libraries are DOM based which will mean
>>>> large memory usage in some situations but overall providing an easier
>>>> coding experience (and hopefully reduce our chances of creating bugs).
>>>> Or should we code to the edge cases of someone trying to parse a 1GB
>>>> XML? Personally I'd favour the former.
>>>>
>>>> Going back to the original topic there are going to be situations where
>>>> people want the flat file parsers/writers & I think it's a valid point
>>>> to say this is where BioJava is meant to come in & help a developer.
>>>> After all XML is a computer science problem whereas parsing an EMBL flat
>>>> file or blast output is a bioinformatics problem.
>>>>
>>>> Andy
>>>>
>>>>
>>>> Mark Schreiber wrote:
>>>>> For a long time now my feeling has been that we should *only* support
>>>>> the XML version of blast output. The other formats are too brittle to
>>>>> be easy to parse. I also feel similarly about Genbank, EMBL, etc that
>>>>> may be an extreme view but the power of generic XML parsers and things
>>>>> like XPath etc really make these formats look very attractive.
>>>>>
>>>>> - Mark
>>>>>
>>>>>
>>>>> On Nov 27, 2007 7:47 PM, Andy Yates wrote:
>>>>>> I think Groovy have adopted a similar system recently & have guidelines
>>>>>> for how each module should behave (dependencies, build system etc). This
>>>>>> enforces the idea that a module whilst not part of the core project must
>>>>>> behave in the same manner the core does. I do like the idea that we can
>>>>>> have a core biojava & things get added around it & it might encourage
>>>>>> other users to start developing their own modules for any
>>>>>> formats/purpose they want.
>>>>>>
>>>>>> Richard Holland wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA1
>>>>>>>
>>>>>>>> What format options are there from blast? Just thinking if it supports
>>>>>>>> CIGAR or something like that are we better providing a parser for that
>>>>>>>> format & saying that we do not support the traditional blast output?
>>>>>>>> That said it doesn't help us when that format changes so maybe what is
>>>>>>>> needed is a way to push out parser changes without requiring a full
>>>>>>>> biojava release (v3 discussion) ...
>>>>>>> Exactly! So the modular idea would work nicely here - we could have a
>>>>>>> blast module and only update that single module (which would be its own
>>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>>>>>> would no longer happen, except maybe for some kind of core BioJava
>>>>>>> module. Everything would be done in terms of individual module+JAR
>>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>

From phidias51 at gmail.com Fri Nov 30 18:30:50 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 30 Nov 2007 10:30:50 -0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474FD737.9080801@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
    <474D7B3B.8030807@ebi.ac.uk>
    <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
    <474FD737.9080801@ebi.ac.uk>
Message-ID: <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>

There's a potential gotcha involved with XPath parsing. If you use the
current implementation that ships with the Java 5 & 6 JDKs, it performs a
DOM parse on the whole document, even if you pass it a specific starting
node in the document. I stumbled across this one the hard way when using
the hybrid approach that you mention. This may be solved with another
XPath implementation such as Saxon.

One other problem I've noticed is that the NCBI XML doesn't always parse.
I've reported this to them, and they've promised to address this. It
usually occurs when submitters put non-escaped characters into text fields
such as author lists in PubMed. NCBI doesn't always use CDATA blocks
around text and as soon as the parser hits one of these characters it
throws an exception.

I've also noticed a tendency (in other code bases) for developers to use
several different parsers; usually, whatever parser they're most familiar
with. The problem with this is that they often introduce parser-specific
code into the code base, so you end up with numerous dependencies for
different parsers, and a potential configuration problem if you're passing
the XML parser as a run-time configuration parameter. The most frequent
external parsers I've seen used are JDOM and DOM4J.
The usual way to get around this is to write to an interface, but that
will require some additional vigilance.

Just a few things to watch out for as we move forward.

Mark (the other one) :-)

From abhi232 at cc.gatech.edu Sat Nov 24 16:16:17 2007
From: abhi232 at cc.gatech.edu (Abhinav Ram Karhu)
Date: Sat, 24 Nov 2007 16:16:17 -0000
Subject: [Biojava-l] Applet not able to find DNATools class.
Message-ID: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>

Hello all,

I am having an error while loading the applet. I am getting the following
stack trace.

java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
    at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
    at Trace.init(Trace.java:161)
    at sun.applet.AppletPanel.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

I have the directory structure in which I am having my class files, the
php page and the biojava jar files together in one folder. I also have
org.biojava.bio.seq.DNATools imported in the java file Trace.java.

My applet code in the php page looks like this:

Please suggest if I am missing something.

Thanks in advance.
Abhinav
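
A general note on the error text in the stack trace above: the JVM reports
"NoClassDefFoundError: Could not initialize class X" when class X was found
but its static initialization had already failed earlier in the same JVM.
The first failure surfaces as an ExceptionInInitializerError that carries
the real cause. Whether that is what happened in this applet is an
assumption, since the message shows only the later error. The sketch below
reproduces the general behaviour in plain Java; InitFailureDemo,
NeedsResource and the simulated failure are hypothetical names and have
nothing to do with BioJava's actual code.

```java
// Minimal sketch (hypothetical names): shows how a failed static initializer
// turns into "Could not initialize class" on every later use of the class.
final class InitFailureDemo {

    /** Stand-in for a class whose static setup depends on something missing. */
    static final class NeedsResource {
        // Not a compile-time constant, so first use triggers class initialization.
        static final String VALUE = load();

        private static String load() {
            // Simulated failure, e.g. a resource or dependency that cannot be found.
            throw new IllegalStateException("resource missing");
        }

        static String value() {
            return VALUE;
        }
    }

    /** Touches the class and reports which Throwable type came back, if any. */
    static String attempt() {
        try {
            return NeedsResource.value();
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First use: the initializer fails and the JVM wraps the cause.
        System.out.println(attempt());
        // Later uses: the class is marked unusable, the original cause is gone.
        System.out.println(attempt());
    }
}
```

If the same thing is happening here, the Java console should show an
earlier ExceptionInInitializerError for DNATools, and its "Caused by"
section would name what was actually missing.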