From alex at coolest.com  Thu Nov  1 04:20:26 2007
From: alex at coolest.com (dasoudesu)
Date: Thu, 1 Nov 2007 01:20:26 -0700 (PDT)
Subject: [Biojava-l]  [ann] Informal Text-mining & Java Meetup in Tokyo
Message-ID: <13524848.post@talk.nabble.com>


Just wanted to announce a mini-event:
        Informal Text-mining & Java Meetup in Tokyo
        http://curehunter.com/public/events.do
Come have a casual drink with some similarly minded devs interested in new
tech.
(We like: Text-mining, Natural Language Processing, Java, C#, Python, Flex,
Dojo, Lucene...)

Time/location:
        November 29th 2007, Thursday 8pm-10pm
        Amarcord in Hatsudai (near Shinjuku), Tokyo
        http://way.sub.jp/amarcord/access.php
        2000-3000yen for food/drinks

If you can attend, please confirm by emailing:
        events at curehunter com

We will do a short demo of CureHunter and talk about some of the tech we
used.
After that we will have a projector available if anyone else would like to
present for 5-15 min on stuff they are working on.  
(the location is best equipped for drinking, however)

Hope to meet a few Java people from around Tokyo.
Best Regards,

Alex
---
http://curehunter.com - http://popjisyo.com - http://winstone.sf.net



From ap3 at sanger.ac.uk  Thu Nov  1 12:59:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 1 Nov 2007 16:59:35 +0000
Subject: [Biojava-l] Biojava migrating to Subversion
Message-ID: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>

Hi all,

Over the next few weeks (until Christmas) BioJava will finally move its
version control system from CVS to Subversion (svn). This is happening
in parallel with the other open-bio projects. We will ensure that
nothing gets lost during this migration: all BioJava modules, branches,
tags and the history of the files will be imported into the new
repository.

Over the next few weeks we will:

A) Test the migration procedure to ensure nothing gets lost.
B) Declare a CVS freeze at some point, giving all developers enough
   time to commit their latest code to CVS.
C) Perform the final svn migration after the freeze. At this point we
   will also do a quick BioJava release (version 1.5.1).
D) From that moment on, do all future BioJava development via svn; CVS
   will remain frozen.

Detailed instructions for how to check out and commit code using svn  
will be announced closer to the migration date.

We will keep you informed about the details of this process. There
is also a wiki page which documents it:
http://biojava.org/wiki/CVS_to_SVN_Migration

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From abhi232 at cc.gatech.edu  Mon Nov  5 12:59:15 2007
From: abhi232 at cc.gatech.edu (abhi232 at cc.gatech.edu)
Date: Mon, 5 Nov 2007 12:59:15 -0500 (EST)
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
Message-ID: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>

Hi all,
I have a byte array holding the data from an .ab1 file. The BioJava
library provides a class called ABITrace which takes as input either a
byte[] array, a File or a URL. If I use the latter two (the File or the
URL) the program works, but if I pass the byte array to the constructor
I get a java.lang.ArrayIndexOutOfBoundsException. Is there a problem
with the ABITrace class, or how can I work around this error?
I printed the length of the byte array and it comes to 144930. Could
that cause a problem in my code?

Thanks in advance.
Abhinav

From holland at ebi.ac.uk  Tue Nov  6 05:15:43 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 10:15:43 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
Message-ID: <47303ECF.4020806@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I suspect the byte array itself may contain inaccurate data.

Internally, both the URL and File constructors read the data into a byte
array and then pass it to the same method as is used by the byte[]
constructor.

So, something must be different between the byte array you have, and the
byte array obtained by reading the file in.

The File constructor uses the following code to read the file:

    byte[] bytes = null;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    FileInputStream fis = new FileInputStream(ABIFile);
    BufferedInputStream bis = new BufferedInputStream(fis);
    int b;
    while ((b = bis.read()) >= 0)
    {
      baos.write(b);
    }
    bis.close(); fis.close(); baos.close();
    bytes = baos.toByteArray();

If the above code produces different results to your byte array when
reading data from the same file as your code, then something has gone
wrong with the construction of your byte array.
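
The comparison suggested above could be sketched roughly as follows.
This is only an illustration: the class ByteArrayDiff and its methods
are hypothetical helpers, not part of BioJava, and the in-memory arrays
in main() stand in for "bytes read from the file" and "bytes produced
by your own code".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of BioJava.
final class ByteArrayDiff {

    /** Reads a stream fully into a byte array, with no character decoding. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            baos.write(buf, 0, n);
        }
        return baos.toByteArray();
    }

    /** Returns the index of the first difference, or -1 if the arrays are identical. */
    static int firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same prefix: either identical, or one array is simply longer.
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the bytes read directly from the .ab1 file.
        byte[] reference = readAll(new ByteArrayInputStream(new byte[] {'A', 'B', 'I', 'F', 0, 1}));
        // Stand-in for the byte array built by the code under suspicion.
        byte[] suspect = {'A', 'B', 'I', 'F', 0, 2};

        System.out.println("lengths: " + reference.length + " vs " + suspect.length);
        System.out.println("first difference at index: " + firstDifference(reference, suspect));
    }
}
```

In a real check, readAll() would be given a FileInputStream on the
.ab1 file, and the second array would be the one you currently pass to
the byte[] constructor.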

Lastly, a full stack trace would help us pinpoint the line that is
breaking, and hopefully provide a hint as to what is wrong with the
contents of the byte array. If you could provide one that would be very
helpful.

cheers,
Richard


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMD7P4C5LeMEKA/QRAmGIAJ9a/V6nZqMROz3H4u69ECQ+9iTgMgCeNZvr
oe52S3khmTvi5BFCL1W4KHM=
=5JAO
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Tue Nov  6 11:53:54 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 16:53:54 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730A6F1.9050407@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
Message-ID: <47309C22.10803@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think that either the file is at fault, or the method you are using to
read the file into Java is at fault.

Could you provide us with the complete piece of code you are using from
the point where you read the file into the array through to the point
where you generate the output you quoted?  (Not as an attachment as the
mailing list will strip those - simply paste it into the message body
instead).

cheers,
Richard



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG
T1Z8W0JhCe7+hx5rbKLGqVk=
=qNcr
-----END PGP SIGNATURE-----

From abhi232 at cc.gatech.edu  Tue Nov  6 13:03:02 2007
From: abhi232 at cc.gatech.edu (abhinav)
Date: Tue, 06 Nov 2007 12:03:02 -0600
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <47309C22.10803@ebi.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk>
Message-ID: <4730AC56.9060808@cc.gatech.edu>

OK, here is the code that I am using. I establish a connection with a
PHP page, which in turn reads the file and prints the content back to
me. I am using a DataOutputStream for sending data and a BufferedReader
for reading the response. Then I read the data into a String and
convert it to a byte[] array. This is the code where the connection is
established and the data is read and displayed.


    private HttpURLConnection httpConn;
    private DataOutputStream out;
    private DataInputStream temp_stream;
    private BufferedReader in;
    private BufferedInputStream in_buff_stream;
    private String str ;
    private byte[] bytearray;
    Chromatogram abif_chromatogram;

    /** Creates a new instance of testPost */
    public testPost()
    {
        httpConn = null;
        str = new String("");
        bytearray = new byte[144930];
    }

    public byte[] create_and_write_Connection(String url, String data_request)
    {
        try
        {
            URL conn_url = new URL(url);
            httpConn = (HttpURLConnection)conn_url.openConnection();
            httpConn.setDoOutput(true);
            httpConn.setDoInput(true);
            httpConn.setRequestMethod("POST");
            out = new DataOutputStream(httpConn.getOutputStream());
            out.writeBytes(data_request);
            out.flush();
            System.out.println("Connection established successfully and data written");
            InputStreamReader in_stream = new InputStreamReader(httpConn.getInputStream());

            System.out.println("The character encoding used is:" + in_stream.getEncoding());
            in = new BufferedReader(in_stream);

            System.out.println("Data acceptance started");

            while(in.readLine()!=null)
            {
                str += in.readLine();
            }
            System.out.println("The string to be returned is:" + str);
            bytearray = str.getBytes("ISO8859-1");
            String temp_string = new String(bytearray, "windows-1252");
            System.out.println("The encoded string is as follows:" + temp_string);
            System.out.println("The size of byte array inside testpost is:" + Array.getLength(bytearray));
            for(int i = 0 ; i < 3 ; i ++)
                System.out.println("The bytes that i want are:" + bytearray[i]);
            return bytearray;
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
        return bytearray;
    }
Please guide me on this point
Thanks
Abhinav
   

From holland at ebi.ac.uk  Tue Nov  6 12:05:12 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 17:05:12 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk> <4730AC56.9060808@cc.gatech.edu>
Message-ID: <47309EC8.2070904@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The String is where you're going wrong. ABI files are not Stringifyable
- they are binary data. Converting them to a String will corrupt them.
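
A minimal sketch of this point, under stated assumptions: the class
RawBytesDemo and its methods are made up for illustration, and UTF-8
is assumed as the decoding charset (the charset actually used by the
InputStreamReader in the posted code is not known). It copies a stream
byte for byte and contrasts that with a round trip through a String.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of BioJava or of the posted code.
final class RawBytesDemo {

    /** Copies the stream byte for byte; no Reader, no charset, no readLine(). */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            baos.write(buf, 0, n);
        }
        return baos.toByteArray();
    }

    /** Mimics the bytes -> String -> bytes detour (decoding charset assumed to be UTF-8). */
    static byte[] viaString(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // A few arbitrary binary bytes; 0xFF is not valid UTF-8 on its own.
        byte[] original = {0x00, 0x0B, (byte) 0xFF, 'A', 'B', 'I'};

        byte[] raw = readAll(new ByteArrayInputStream(original));
        System.out.println("raw copy intact:    " + Arrays.equals(original, raw));
        System.out.println("String copy intact: " + Arrays.equals(original, viaString(original)));
    }
}
```

In the posted code, the equivalent change would be to pass
httpConn.getInputStream() straight to a loop like readAll() instead of
wrapping it in an InputStreamReader and BufferedReader.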

cheers,
Richard


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMJ7I4C5LeMEKA/QRAupLAJ9YDoGohk5uZSNYZnRRMJ5WeNDpGgCfdCyg
+Z/gXBbPmrG3SuQlfeHuD3A=
=akSf
-----END PGP SIGNATURE-----

From abhi232 at cc.gatech.edu  Tue Nov  6 12:40:01 2007
From: abhi232 at cc.gatech.edu (abhinav)
Date: Tue, 06 Nov 2007 11:40:01 -0600
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <47303ECF.4020806@ebi.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk>
Message-ID: <4730A6F1.9050407@cc.gatech.edu>


Yes, I looked at the ABITrace source and found that the first three
characters must be ABI, or characters 128-130 must be ABI. But I
cannot find that in the file I have. Also, if this is not the case
there should be an illegal format exception, whereas I am getting an
ArrayIndexOutOfBoundsException, which is also weird.
I am getting the following output and stack trace.
The bytes that i want are:0
The bytes that i want are:11
The bytes that i want are:0
The size of the byte array generated is:144930
Byte array also recieved
java.lang.ArrayIndexOutOfBoundsException: 128
    at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
    at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
    at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
    at Trace.init(Trace.java:138)
    at sun.applet.AppletPanel.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
The bytes I print are the first three bytes, which I want to check to
see whether my file is ABI or not. I checked the isABI function as
well; it is supposed to return true or false, not throw an
ArrayIndexOutOfBoundsException. Also, does the number 128 have any
significance in this case?
Thanks in advance
Abhinav
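
On the question about 128: in Java, the number in an
ArrayIndexOutOfBoundsException message is the index that was out of
range, so the array being indexed at that point had at most 128
elements. The sketch below shows a bounds-guarded version of the kind
of check described in the message ("ABI" at bytes 0-2 or at bytes
128-130). It is an assumption-laden illustration: AbiSignatureCheck is
a hypothetical class and is not BioJava's isABI implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; NOT the BioJava ABITrace.isABI() code.
final class AbiSignatureCheck {

    private static final byte[] MAGIC = "ABI".getBytes(StandardCharsets.US_ASCII);
    // Second offset taken from the description in the message above.
    private static final int ALT_OFFSET = 128;

    /** True if the magic bytes are present at the given offset; never indexes out of range. */
    static boolean hasMagicAt(byte[] data, int offset) {
        if (data == null || offset < 0 || data.length < offset + MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[offset + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Checks both offsets, returning false for arrays that are too short. */
    static boolean looksLikeAbi(byte[] data) {
        return hasMagicAt(data, 0) || hasMagicAt(data, ALT_OFFSET);
    }

    public static void main(String[] args) {
        byte[] tooShort = {0, 11, 0};
        byte[] atStart = {'A', 'B', 'I', 'F'};
        System.out.println("short array:    " + looksLikeAbi(tooShort));
        System.out.println("magic at start: " + looksLikeAbi(atStart));
    }
}
```

With a guard like this, a truncated or corrupted array yields false
rather than an exception, which makes the two failure modes easier to
tell apart.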


From walsh at andrew.cmu.edu  Tue Nov  6 12:23:36 2007
From: walsh at andrew.cmu.edu (Andrew Walsh)
Date: Tue, 06 Nov 2007 12:23:36 -0500
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>	<47303ECF.4020806@ebi.ac.uk>
	<4730A6F1.9050407@cc.gatech.edu>	<47309C22.10803@ebi.ac.uk>
	<4730AC56.9060808@cc.gatech.edu>
Message-ID: <4730A318.8010406@andrew.cmu.edu>

You also appear to be losing every other line with the following code:

    while(in.readLine()!=null)
        {
            str += in.readLine();
        }

Every time the while statement checks its condition, a line is read from 
the inputstream.  That line is never stored.  Then, if the condition is 
met, another line is read and that line is added to your String.
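
The effect can be sketched with a small self-contained example. The
class name ReadLineLoop and the sample text are made up for
illustration. Note that this only shows the loop pattern for text
input; for binary data such as .ab1 files a Reader should not be used
at all.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class for the loop pattern discussed above.
final class ReadLineLoop {

    /** Same shape as the posted loop: one readLine() in the test, another in the body. */
    static String readEveryOtherLine(BufferedReader in) throws IOException {
        String str = "";
        while (in.readLine() != null) {   // this line is read and thrown away
            str += in.readLine();         // only this second line is kept
        }
        return str;
    }

    /** Usual idiom: read once per iteration and keep the result in a variable. */
    static String readAllLines(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\nb\nc\nd\n";
        System.out.println("buggy loop: " + readEveryOtherLine(new BufferedReader(new StringReader(text))));
        System.out.println("fixed loop: " + readAllLines(new BufferedReader(new StringReader(text))).replace("\n", "|"));
    }
}
```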

-Andy

abhinav wrote:
> Richard Holland wrote:
>   
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> I think that either the file is at fault, or the method you are using to
>> read the file into Java is at fault.
>>
>> Could you provide us with the complete piece of code you are using from
>> the point where you read the file into the array through to the point
>> where you generate the output you quoted?  (Not as an attachment as the
>> mailing list will strip those - simply paste it into the message body
>> instead).
>>
>> cheers,
>> Richard
>>
>>
>> abhinav wrote:
>>   
>>     
>>> Richard Holland wrote:
>>> I suspect the byte array itself may contain inaccurate data.
>>>
>>> Internally, both the URL and File constructors read the data into a byte
>>> array and then pass it to the same method as is used by the byte[]
>>> constructor.
>>>
>>> So, something must be different between the byte array you have, and the
>>> byte array obtained by reading the file in.
>>>
>>> The File constructor uses the following code to read the file:
>>>
>>>     byte[] bytes = null;
>>>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>     FileInputStream fis = new FileInputStream(ABIFile);
>>>     BufferedInputStream bis = new BufferedInputStream(fis);
>>>     int b;
>>>     while ((b = bis.read()) >= 0)
>>>     {
>>>       baos.write(b);
>>>     }
>>>     bis.close(); fis.close(); baos.close();
>>>     bytes = baos.toByteArray();
>>>
>>> If the above code produces different results to your byte array when
>>> reading data from the same file as your code, then something has gone
>>> wrong with the construction of your byte array.
>>>
>>> Lastly, a full stack trace would help us pinpoint the line that is
>>> breaking, and hopefully provide a hint as to what is wrong with the
>>> contents of the byte array. If you could provide one that would be very
>>> helpful.
>>>
>>> cheers,
>>> Richard
>>>
>>>
>>> abhi232 at cc.gatech.edu wrote:
>>>   
>>>     
>>>       
>>>>>> Hi all,
>>>>>> I have a byte array holding the data from an .ab1 file. The
>>>>>> BioJava library provides a class called ABITrace which takes as input
>>>>>> either a byte[] array, a file or a URL. If I use the latter parameters (the
>>>>>> file or the URL) the program works, but if I pass the byte array to the
>>>>>> constructor I get java.lang.ArrayIndexOutOfBoundsException. Is there a
>>>>>> problem with the ABITrace class, or how can I bypass this particular error?
>>>>>> I am printing the length of the byte array and it comes to 144930. Can
>>>>>> that cause a problem in my code?
>>>>>>
>>>>>> Thanks in advance.
>>>>>> Abhinav
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>>     
>>>>>>           
>>>>>>             
>>   
>>     
>>> Yes, I looked at the file ABITrace and found out that the first three
>>> characters must be ABI or characters 128-130 must be ABI. But I
>>> cannot find that in the file that I have. Also, if this is not the
>>> case then there should be an illegal format exception, whereas I am
>>> getting an ArrayIndexOutOfBoundsException, which is also weird.
>>> I am getting the following stack trace.
>>> The bytes that i want are:0
>>> The bytes that i want are:11
>>> The bytes that i want are:0
>>> The size of the byte array generated is:144930
>>> Byte array also recieved
>>> java.lang.ArrayIndexOutOfBoundsException: 128
>>>     at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
>>>     at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
>>>     at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
>>>     at Trace.init(Trace.java:138)
>>>     at sun.applet.AppletPanel.run(Unknown Source)
>>>     at java.lang.Thread.run(Unknown Source)
>>> The bytes printed above are the first three bytes, which I check to see
>>> whether my file is ABI or not. I also checked the isABI function: it
>>> returns a true or false value, not an ArrayIndexOutOfBoundsException.
>>> Also, does the number 128 have any significance in this case?
>>> Thanks in advance
>>> Abhinav
>>>     
>>>       
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG
>> T1Z8W0JhCe7+hx5rbKLGqVk=
>> =qNcr
>> -----END PGP SIGNATURE-----
>>   
>>     
> OK, here is the code that I am using. I establish a connection with a
> PHP page which in turn reads the file and prints the content back to
> me. I am using DataOutputStream for sending data and BufferedReader for
> reading the data. Then I read the data into a String and
> convert it to a byte[] array. This is the code where the connection is
> established and the data is read and displayed.
>
>
>
>  private HttpURLConnection httpConn;
>     private DataOutputStream out;
>     private DataInputStream temp_stream;
>     private BufferedReader in;
>     private BufferedInputStream in_buff_stream;
>     private String str ;
>     private byte[] bytearray;
>     Chromatogram abif_chromatogram;
>
>     /** Creates a new instance of testPost */
>     public testPost()
>     {
>
>         httpConn = null;
>         str = new String("");
>         bytearray = new byte[144930];
>
>     }
>     public byte[] create_and_write_Connection(String url,String 
> data_request)
>     {
>         try
>         {
>             URL conn_url = new URL(url);
>             httpConn = (HttpURLConnection)conn_url.openConnection();
>             httpConn.setDoOutput(true);
>             httpConn.setDoInput(true);
>             httpConn.setRequestMethod("POST");
>             out=new DataOutputStream(httpConn.getOutputStream());
>             out.writeBytes(data_request);
>             out.flush();
>             System.out.println("Connection established successfully and 
> data written");
>             InputStreamReader in_stream = new 
> InputStreamReader(httpConn.getInputStream());
>
>                 System.out.println("The character encoding used is:"+ 
> in_stream.getEncoding());
>             in = new BufferedReader(in_stream);
>
>
>             System.out.println("Data acceptance started");
>
>
>             while(in.readLine()!=null)
>             {
>                 str += in.readLine();
>             }
>             System.out.println("The string to be returned is:"+str);
>             bytearray = str.getBytes("ISO8859-1");
>             String temp_string = new String(bytearray,"windows-1252");
>            System.out.println("The encoded string is as follows:"+ 
> temp_string);
>             System.out.println("The size of byte array inside testpost 
> is:"+ Array.getLength(bytearray));
>              for(int i = 0 ; i < 3 ; i ++)
>                 System.out.println("The bytes that i want are:"+ 
> bytearray[i]);
>             return bytearray;
>         }
>         catch(Exception e)
>         {
>                e.printStackTrace();
>         }
>         return bytearray;
>      }
> Please guide me on this point
> Thanks
> Abhinav


From holland at ebi.ac.uk  Thu Nov  8 08:53:09 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 13:53:09 +0000
Subject: [Biojava-l] BioJava 3 Proposals
Message-ID: <473314C5.8070207@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear BioJava users,

The BioJava developers are considering options for the future
development of the BioJava toolkit. We consider that it needs
improvement in a few major areas to make it easier to use and
understand, and also faster and more scalable.

The options are to either rewrite large parts of the existing code,
working within the existing interfaces and paradigms, or to develop a
new set of BioJava packages from the ground up in order to take
advantage of lessons learned from the design patterns of the existing code.

The BioJava developers have spent the last couple of months discussing
ideas and proposals related to these options on a Wiki page, and would
now like to open this discussion to all users of BioJava and the
bioinformatics community in general. We would like to invite anyone who
has any ideas or suggestions to contribute these to the Wiki page,
and/or to comment on the ideas and suggestions that have already been
posted there.

Here is a link to the Wiki page, and also a link to the associated Talk
page where much of the discussion has taken place so far:

	http://biojava.org/wiki/BioJava3_Proposal
	http://biojava.org/wiki/Talk:BioJava3_Proposal

It is our intention to leave the discussion open until mid-January
2008 when we will summarise it and use it as the basis of a plan of
action. We will then distribute the summary and the action plan via the
BioJava website.

We look forward to hearing your comments and ideas. Please do remember
to make them directly to the Wiki page so that they are preserved in
context, making it easier for us to summarise them later!

cheers,
Richard
(on behalf of all BioJava developers)

PS. Just to reassure you, this is NOT a plan to drop the existing
codebase. It will continue to exist, but the outcome of these
discussions will determine whether we will continue to develop and
support it or start afresh with a clean slate and a new codebase.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMxTE4C5LeMEKA/QRAlGSAJwKzO0oAe3T2e8ibcG8uRReOVfh7wCdGlwn
JkcVzA55Ye32o8Ry48LO+04=
=oaaC
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Thu Nov  8 08:58:23 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 13:58:23 +0000
Subject: [Biojava-l] Biojava wiki
Message-ID: <473315FF.70506@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

What's happened to the BioJava wiki today? I get errors from all pages,
including the front page, indicating zero-sized replies.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62
6KkoyXhahrWkZo2OWyCL+Uk=
=1jK7
-----END PGP SIGNATURE-----

From phidias51 at gmail.com  Thu Nov  8 10:39:29 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Thu, 8 Nov 2007 07:39:29 -0800
Subject: [Biojava-l] Biojava wiki
In-Reply-To: <473315FF.70506@ebi.ac.uk>
References: <473315FF.70506@ebi.ac.uk>
Message-ID: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com>

Richard,
That's odd.  It comes up fine for me.

BTW, in your proposal you mentioned that people had "moved on".  I was
wondering what types of tasks they had moved on to, and what should be
included in the Proposal to ensure that BioJava stays relevant to them?

Regards,

Mark

On Nov 8, 2007 5:58 AM, Richard Holland <holland at ebi.ac.uk> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> what's happened to the biojava wiki today? i get errors from all pages,
> including the front page, indicating zero-sized replies.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62
> 6KkoyXhahrWkZo2OWyCL+Uk=
> =1jK7
> -----END PGP SIGNATURE-----

From hlapp at gmx.net  Thu Nov  8 10:53:03 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 10:53:03 -0500
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
Message-ID: <ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>

Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we  
explicitly lowercase the value found for alphabet, and the comment  
says why:

         # Note: Biojava uses upper-case terms for alphabet, so we
         # need to change to all-lower in case the sequence was
         # manipulated by Biojava.
         $obj->alphabet(lc($rows->[3])) if $rows->[3];

However, when inserting sequences, we leave the value as it is in
BioPerl (which is lowercase), leading to a potential problem for
Biojava upon retrieval. Do the Biojava folks deal with that? Should
this maybe be harmonized across the board?
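
For comparison, a sketch of what the same tolerant read could look like on the Java side. This is an assumption for illustration only: normaliseAlphabet is a hypothetical helper, not an existing Biojava method, and it only roughly mirrors the Perl line above (a missing or empty value is left unset).

```java
import java.util.Locale;

// Hypothetical helper class, not part of Biojava.
class AlphabetReadSketch {

    // Lower-cases the alphabet value read from the biosequence table so that
    // "DNA" and "dna" are treated alike; returns null when no value is stored.
    static String normaliseAlphabet(String stored) {
        if (stored == null || stored.isEmpty()) {
            return null;
        }
        // Locale.ROOT avoids locale-specific case rules (e.g. Turkish dotless i).
        return stored.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normaliseAlphabet("DNA")); // prints dna
        System.out.println(normaliseAlphabet("dna")); // prints dna
    }
}
```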

	-hilmar

On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:

> Dear Peter,
>
> All the alphabets are "DNA" (upper case) in my database. The
> sequences are taken from NCBI by a BioJava application.
> Thus it should be BioJava that inserts the records with "DNA". Thus
> no potential "hidden bug" in BioPython.
>
> Maybe a point to share with the Open-Bio committee.
>
> Eric
>
> ----- Original Message ----
> From: Peter <biopython at maubp.freeserve.co.uk>
> To: Eric Gibert <ericgibert at yahoo.fr>
> Cc: biopython at lists.open-bio.org
> Sent: Thursday, 8 November 2007, 19:40:00
> Subject: Re: [BioPython] small "bug" correction in package BioSql
>
> Eric Gibert wrote:
>> Dear all,
>>
>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the
>> function:
>>
>> ...
>>
>> please note my correction: force moltype to lower case, as
>> my database has upper-case values, which raise the "Unknown moltype"
>> error.
>
> Hi Eric, I've made your suggested change in CVS,
> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>
> I would encourage you to investigate why some of the "alphabet" fields
> in the biosequence table are in upper case.  There could be a bug
> elsewhere which is writing these entries with the wrong alphabet.  Is
> this affecting all entries, or just some?
>
> Peter
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From holland at ebi.ac.uk  Thu Nov  8 11:17:25 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 16:17:25 +0000
Subject: [Biojava-l] Biojava wiki
In-Reply-To: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com>
References: <473315FF.70506@ebi.ac.uk>
	<6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com>
Message-ID: <47333695.40808@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> BTW, in your proposal you mentioned that people had "moved on".  I was
> wondering what types of tasks they had moved on to, and what should be
> included in the Proposal to ensure that BioJava stays relevant to them?

Good point. From what we can tell, people are not so sequence-focused
any more but are more interested in features, alignments, population
data, etc. - more 'metadata' so to speak.

We do need some mechanism to ensure that we are correct in this
thinking, and that future shifts in direction are catered for in this
design phase.

Could you add a note to the wiki with your points, and/or any ideas you
may have about ensuring these requirements are met?

cheers,
Richard


> Regards,
> 
> Mark
> 
> On Nov 8, 2007 5:58 AM, Richard Holland <holland at ebi.ac.uk> wrote:
> 
> what's happened to the biojava wiki today? i get errors from all pages,
> including the front page, indicating zero-sized replies.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMzaV4C5LeMEKA/QRAoPUAJ0TQ+xFF1J3EtZgHmvYj2HH41koCgCeLYm0
D5Z7SJDWjvJ9rbCrS+RTEeI=
=XhE1
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Thu Nov  8 11:18:46 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 16:18:46 +0000
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
Message-ID: <473336E6.6000100@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

We do need a consensus here.

I'm happy to go with whatever value is chosen, as the BioJava code can
easily be modified to suit.

cheers,
Richard

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3
9x+CUHig3GfBCZ56rDb1ZG4=
=OJyB
-----END PGP SIGNATURE-----

From hlapp at gmx.net  Thu Nov  8 15:28:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:28:19 -0500
Subject: [Biojava-l] [BioPython] error on insert new sequences from
	GenBank: no annotations saved in BioSQL database
In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID: <A1A51FDB-C4A6-4894-8C9C-12A210B73C0D@gmx.net>

Maybe we need to hold some mini-hackathon to make the different  
toolkits compatible in how they map annotation to the schema.  
Obviously I don't know whether you have the latest Biojava setup  
here, but I'll just comment how BioPerl/Bioperl-db would map this:

'ORIGIN' - if I'm not mistaken this is only a token that introduces  
the actual sequence. I'm not sure what Biojava is storing as value here.

'DIVISION' - this maps to column division in table bioentry (though I  
agree that if  perfectly following the weak typing principle this  
should be tag/value association, but at present it's still an actual  
column)

'genbank_accessions' - secondary accession numbers indeed go into the  
qualifier value table. The primary accession maps to column accession  
in table bioentry

'TITLE' - this is part of a publication reference, and should map to  
column title in table reference (which it does in bioperl-db)

'cross_references' - not sure where these would be coming from in  
GenBank format; for EMBL this will map to the dbxref table

'data_file_division' - not sure what this is (same as DIVISION?)

'VERSION' - in BioPerl we parse this apart into a version for the  
accession (which is column version in table bioentry) and the GI  
number, which maps to column identifier in table bioentry

'references' - these map to table reference (and bioentry_reference  
for association with the bioentry)

'KEYWORDS' - indeed these map to bioentry_qualifier_value

'GI' - maps to column identifier in table bioentry

'SIZE' - not sure what size that is. If it is the length of the  
sequence, it should (and in BioPerl/bioperl-db does) map to column  
length in table biosequence

'DEFINITION' - maps to column description in table bioentry

'REFERENCE' - should be the same as for 'references'

'MDAT' - not sure what this is

'ORGANISM' - this is the organism and maps to the table taxon (and  
taxon_name), with a foreign key in bioentry pointing to the taxon

'JOURNAL' - this is part of a reference, see 'references'

'ACCESSION' - the primary accession, maps to column accession in  
table bioentry

'LOCUS' - in the file itself this is an entire line consisting of  
multiple fields; BioPerl/bioperl-db maps the locus name (the first  
token after the literal token LOCUS) to column name in table bioentry

'SOURCE' - this is the organism, see 'ORGANISM'

'PUBMED' - this is part of a literature reference, and maps to a  
foreign key in the reference table (reference.dbxref) to a dbxref  
entry with PUBMED or PMID as the database and the pubmed ID as the  
accession

'AUTHORS' - part of a literature reference, maps to column authors in  
table reference

'TYPE' - not sure what this is. If it's the alphabet, it maps to  
table biosequence, column alphabet

'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,  
though there have been plans to make it a column in table biosequence.

Note that this could in fact be the way Biojava stores it too, but  
upon retrieval represents it in the way you are seeing it.
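
The mapping above can be condensed into a lookup table. The following is only a sketch in Java: the class name is hypothetical, the keys are the annotation names from Eric's debug output, and the values are the BioSQL locations as described in this message for BioPerl/bioperl-db. Keys whose meaning is unclear above (for example 'MDAT' or 'TYPE') are deliberately left out rather than guessed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical summary table, not part of any Bio* toolkit.
class GenBankToBioSqlSketch {

    // Annotation key -> BioSQL location, in the wording used in the message.
    static final Map<String, String> TARGET = new LinkedHashMap<String, String>();

    static {
        TARGET.put("LOCUS", "bioentry.name");
        TARGET.put("ACCESSION", "bioentry.accession");
        TARGET.put("VERSION", "bioentry.version (GI part: bioentry.identifier)");
        TARGET.put("GI", "bioentry.identifier");
        TARGET.put("DEFINITION", "bioentry.description");
        TARGET.put("DIVISION", "bioentry.division");
        TARGET.put("KEYWORDS", "bioentry_qualifier_value");
        TARGET.put("genbank_accessions", "bioentry_qualifier_value");
        TARGET.put("CIRCULAR", "bioentry_qualifier_value");
        TARGET.put("TITLE", "reference.title");
        TARGET.put("AUTHORS", "reference.authors");
        TARGET.put("PUBMED", "dbxref entry referenced from reference.dbxref");
        TARGET.put("ORGANISM", "taxon / taxon_name");
        TARGET.put("SOURCE", "taxon / taxon_name");
    }

    // Returns the documented location, or null if the message does not state one.
    static String lookup(String annotationKey) {
        return TARGET.get(annotationKey);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : TARGET.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```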

Hth,

	-hilmar

On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:

> Dear all,
>
> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted  
> previously by my BioJava application, I have:
>
> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>
> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',  
> 'genbank_accessions', 'TITLE', 'cross_references',  
> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',  
> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',  
> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',  
> 'CIRCULAR']
>
> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
> Debug on Seq: EF631597.1 =  ['cross_references', 'dates',  
> 'references', 'gi', 'data_file_division']
>
>
> Once I look in the table bioentry_qualifier_value
>
> * 20 records for a Sequence imported by BioJava
> * 1 only for a Sequence inserted by BioPython: the date which  
> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>
> Quite a few annotations missing, no?
>
> Any idea?
>
> Eric

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Thu Nov  8 15:30:29 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:30:29 -0500
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <473336E6.6000100@ebi.ac.uk>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
Message-ID: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>

It seems BioPerl and Biopython both want (and have traditionally  
used) lowercase - do you mind going with that for Biojava as well, or  
alternatively, simply map upon insert/update and retrieve?

	-hilmar

On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> we do need a consensus here.
>
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
>
> cheers,
> Richard
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3
> 9x+CUHig3GfBCZ56rDb1ZG4=
> =OJyB
> -----END PGP SIGNATURE-----

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From holland at ebi.ac.uk  Fri Nov  9 03:39:01 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 09 Nov 2007 08:39:01 +0000
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
	<9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
Message-ID: <47341CA5.9080509@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'll see what I can do.

Hilmar Lapp wrote:
> It seems BioPerl and Biopython both want (and have traditionally used)
> lowercase - do you mind going with that for Biojava as well, or
> alternatively, simply map upon insert/update and retrieve?
> 
>     -hilmar
> 
> On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:
> 
> we do need a consensus here.
> 
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
> 
> cheers,
> Richard

> --===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHNByl4C5LeMEKA/QRAmCzAJ9fxSm8l5YAEHAUe2hH+Gwc1Xe5IwCfcMf6
c9sy8lASDV069FQJ79Geemw=
=RHM1
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Fri Nov  9 07:42:38 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 09 Nov 2007 12:42:38 +0000
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
	<9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
Message-ID: <473455BE.6040807@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I did a bit of poking around in our code and internally BioJava
represents all the default alphabet names (Protein, DNA, etc.) in upper
case. It also allows for mixed case alphabet names.

It's not quite as easy as I thought to change these to lower case as
they are often referenced by text name, meaning other people's code
might break if I change them.

Also, as it allows for mixed-case alphabet names, I can't do a
toUpper/toLower fudge on persistence to BioSQL, as I wouldn't
necessarily get out what I put in!

So, I think I'll add this as a point on the recently announced BioJava 3
proposal, that BioSQL interaction must be compliant with standards laid
down by the BioSQL project, and that our code will be able to cope with
this internally.
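
To make the mapping idea concrete, here is a minimal sketch in plain Java (no BioJava classes). It assumes that only a fixed set of default alphabet names needs translating and that custom, mixed-case names must round-trip unchanged. The class name, the method names and the list of default names are hypothetical, not BioJava or BioSQL API.

```java
// Hypothetical helper: translates alphabet names at the persistence boundary.
class AlphabetNameMapper {
    // Assumed set of default names, for illustration only.
    private static final String[] DEFAULT_NAMES = {"DNA", "RNA", "PROTEIN"};

    /** Name to store: default names are lower-cased, custom names are left untouched. */
    static String toStored(String internalName) {
        for (String name : DEFAULT_NAMES) {
            if (name.equals(internalName)) {
                return name.toLowerCase(java.util.Locale.ROOT);
            }
        }
        return internalName;
    }

    /** Name to use internally: any casing of a default name maps to its upper-case form. */
    static String toInternal(String storedName) {
        for (String name : DEFAULT_NAMES) {
            if (name.equalsIgnoreCase(storedName)) {
                return name;
            }
        }
        return storedName;
    }
}
```

One caveat of this approach: a custom alphabet whose name differs from a default only by case (say "Dna") would come back as "DNA", so such names would have to be ruled out by convention.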

That brings us back to BioSQL standards - the idea of a mini-hackathon
to solve this once and for all is a very good one. Our previous attempts
between BioPerl and BioJava in Singapore were good, but still there are
niggles as seen in this thread of discussion. It seems that a schema on
its own just isn't enough to make the various projects play nicely, and
instructions are needed on exactly how to use that schema if they are
truly all going to be able to use it without caring who or what wrote
the data that is being read.

cheers,
Richard


Hilmar Lapp wrote:
> It seems BioPerl and Biopython both want (and have traditionally used)
> lowercase - do you mind going with that for Biojava as well, or
> alternatively, simply map upon insert/update and retrieve?
> 
>     -hilmar
> 
> On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:
> 
> we do need a consensus here.
> 
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
> 
> cheers,
> Richard
> 
> Hilmar Lapp wrote:
>>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we
>>>> explicitly lowercase the value found for alphabet, and the comment
>>>> says why:
>>>>
>>>>          # Note: Biojava uses upper-case terms for alphabet, so we
>>>>          # need to change to all-lower in case the sequence was
>>>>          # manipulated by Biojava.
>>>>          $obj->alphabet(lc($rows->[3])) if $rows->[3];
>>>>
>>>> However, when inserting sequences, we leave the value as is in
>>>> BioPerl (which is lowercase), leading to a potential problem for
>>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should
>>>> this be harmonized across the board?
>>>>
>>>>     -hilmar
>>>>
>>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:
>>>>
>>>>> Dear Peter,
>>>>>
>>>>> All the alphabet are "DNA" (upper case) in my database. The
>>>>> sequences are taken from NCBI by a BioJava application.
>>>>> Thus it should be that BioJava inserts the records with "DNA". Thus
>>>>> no potential "hidden bug" in BioPython.
>>>>>
>>>>> Maybe a point to share with the Open-Bio committee.
>>>>>
>>>>> Eric
>>>>>
>>>>> ----- Original Message ----
>>>>> From: Peter <biopython at maubp.freeserve.co.uk>
>>>>> To: Eric Gibert <ericgibert at yahoo.fr>
>>>>> Cc: biopython at lists.open-bio.org
>>>>> Sent: Thursday, 8 November 2007, 19:40:00
>>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>>>>
>>>>> Eric Gibert wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the
>>>>>> function:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> please note my correction: force moltype to be turned into lower case as
>>>>>> my database has upper case value! this raises the "Unknown moltype"
>>>>>> error.
>>>>> Hi Eric, I've made your suggested change in CVS,
>>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>>>>
>>>>> I would encourage you to investigate why some of the "alphabet" fields
>>>>> in the biosequence table are in upper case.  There could be a bug
>>>>> elsewhere which is writing these entries with the wrong alphabet.  Is
>>>>> this affecting all entries, or just some?
>>>>>
>>>>> Peter
>>>>>
>>>>> _______________________________________________
>>>>> BioPython mailing list  -  BioPython at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>>>

> --===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHNFW84C5LeMEKA/QRApBiAJ41WqCDKOJhee5NxIsquYaR/ImBRgCfb7zM
LX75HHvCUC/v4n3okmUQ+ME=
=d6QO
-----END PGP SIGNATURE-----

From email2ants at gmail.com  Fri Nov  9 12:55:36 2007
From: email2ants at gmail.com (Anthony Underwood)
Date: Fri, 9 Nov 2007 17:55:36 +0000
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
Message-ID: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>

Hi All,

I've generated an alignment and I am retrieving positions within the  
alignment using

Symbol base = alignment.symbolAt(label, i);

I am trying to get whether the base at this position is G, A, T or C

However when I use base.getName() it returns strings such as "thymine"

The documentation states that the method getToken should also be  
available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html

Is there a simple way of retrieving a one letter textual  
representation of the symbol?


Many thanks


Anthony

From zagato.gekko at gmail.com  Fri Nov  9 13:48:02 2007
From: zagato.gekko at gmail.com (Zagato)
Date: Fri, 9 Nov 2007 13:48:02 -0500
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
In-Reply-To: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>
References: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>
Message-ID: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com>

Try with:
String s = alignment.symbolListForLabel( label ).subStr( i, i );  // subStr bounds are inclusive

Bye...

Alan Jairo Acosta
Cali - Colombia

On Nov 9, 2007 12:55 PM, Anthony Underwood <email2ants at gmail.com> wrote:

> Hi All,
>
> I've generated an alignment and I am retrieving positions within the
> alignment using
>
> Symbol base = alignment.symbolAt(label, i);
>
> I am trying to get whether the base at this position is G, A, T or C
>
> However when I use base.getName() it returns strings such as "thymine"
>
> The documentation states that the method getToken should also be
> available, but this returns method undefined.
> http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
>
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
>
>
> Many thanks
>
>
> Anthony
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Farewell.
http://www.youtube.com/zagatogekko
ruby << __EOF__
 puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse
__EOF__

From gwaldon at geneinfinity.org  Fri Nov  9 13:45:10 2007
From: gwaldon at geneinfinity.org (George Waldon)
Date: Fri, 09 Nov 2007 10:45:10 -0800
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
Message-ID: <20071109184510.80580.qmail@mmm1924.dulles19-verio.com>

Tokens are associated with alphabets. 

Get the tokenization from the alphabet using:
SymbolTokenization tokenization = alphabet.getTokenization("token");

Get the token from the tokenization using:
String token = tokenization.tokenizeSymbol(symbol);

Also, check the tutorial and the cookbook on the BioJava web site at www.biojava.org, which are often more informative than the javadoc.

Frankly speaking, I agree with you and we should have a method like
String = Symbol.getToken(Alphabet,"token");
to do these operations simply and without losing our hair!
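
Until such a helper exists, the effect being asked for is a lookup from symbol name to one-letter token. The following is a minimal stand-in in plain Java, not BioJava code: the class DnaTokens and its method are hypothetical, and the table assumes the four DNA symbol names follow the pattern of "thymine" reported in the original question.

```java
// Hypothetical stand-in for a name-to-token lookup; does not use BioJava.
class DnaTokens {
    /** Maps a symbol name such as "thymine" to its one-letter code. */
    static char tokenFor(String symbolName) {
        switch (symbolName.toLowerCase(java.util.Locale.ROOT)) {
            case "adenine":  return 'A';
            case "cytosine": return 'C';
            case "guanine":  return 'G';
            case "thymine":  return 'T';
            default:
                // Anything outside the assumed table is rejected explicitly.
                throw new IllegalArgumentException("Unknown DNA symbol: " + symbolName);
        }
    }
}
```

A hand-written table like this only covers one alphabet, which is why asking the alphabet for its tokenization, as described above, is the more general route.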

Best luck,
George


> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
> bounces at lists.open-bio.org] On Behalf Of Anthony Underwood
> Sent: Friday, November 09, 2007 9:56 AM
> To: BioJava
> Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
> 
> Hi All,
> 
> I've generated an alignment and I am retrieving positions within the
> alignment using
> 
> Symbol base = alignment.symbolAt(label, i);
> 
> I am trying to get whether the base at this position is G, A, T or C
> 
> However when I use base.getName() it returns strings such as "thymine"
> 
> The documentation states that the method getToken should also be
> available, but this returns method undefined.
> http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
> 
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
> 
> 
> Many thanks
> 
> 
> Anthony
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists

From email2ants at gmail.com  Fri Nov  9 18:23:01 2007
From: email2ants at gmail.com (Anthony Underwood)
Date: Fri, 9 Nov 2007 23:23:01 +0000
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
In-Reply-To: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com>
References: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>
	<98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com>
Message-ID: <70FC5536-E1B3-41C7-92BC-0B43A0E11E09@gmail.com>

Hi Alan,

Thanks for the suggestion. That was my first thought, but then I was  
thinking for amino acids this wouldn't work. I would have to use a  
hashmap to convert the amino acid to the appropriate single letter code.

Hi George, I'll try your suggestion. As you say I think this is too  
much for something that should be a one liner. Thanks for your advice.
> Get the tokenization from the alphabet using:
> SymbolTokenization = Alphabet.getTokenization("token");
>
> Get the token from the tokenization using:
> String = SymbolTokenization.tokenizeSymbol(Symbol);

Thanks to both of you

Anthony

On 9 Nov 2007, at 18:48, Zagato wrote:

> Try with:
> String s = alignment.symbolListForLabel( label ).subStr( i, i );  // subStr bounds are inclusive
>
> Bye...
>
> Alan Jairo Acosta
> Cali - Colombia
>
> On Nov 9, 2007 12:55 PM, Anthony Underwood < email2ants at gmail.com>  
> wrote:
> Hi All,
>
> I've generated an alignment and I am retrieving positions within the
> alignment using
>
> Symbol base = alignment.symbolAt(label, i);
>
> I am trying to get whether the base at this position is G, A, T or C
>
> However when I use base.getName() it returns strings such as "thymine"
>
> The documentation states that the method getToken should also be
> available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
>
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
>
>
> Many thanks
>
>
> Anthony
> _______________________________________________
> Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
> -- 
> Farewell.
> http://www.youtube.com/zagatogekko
> ruby << __EOF__
>  puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse
> __EOF__


From hlapp at gmx.net  Sat Nov 10 15:38:17 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 10 Nov 2007 15:38:17 -0500
Subject: [Biojava-l] error on insert new sequences from GenBank: no
	annotations saved in BioSQL database
In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
	<47336117.2010102@maubp.freeserve.co.uk>
	<001c01c8238b$2ec64070$6400a8c0@Gecko>
Message-ID: <5DDEBCDE-C8DA-4B2C-86F4-47FDB82CADAC@gmx.net>

Just a few comments below, specifically where no rows would in fact  
be what I expect:

On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote:

> [...]
> --------  For you information, I went thru the tables of my BioSQL  
> database:
> [...]
> 1) table bioentry: all column populated except for 'taxon_id' which  
> is NULL
> (maybe I need an extra call for populating the 'taxon' table before?)

Bioperl-db will try to look up (or create if necessary) the taxon  
from the taxon information attached to the sequence, but for BioPerl  
we actually recommend to pre-load the database with the NCBI  
taxonomy, which can be comfortably done with the script  
load_ncbi_taxonomy.pl that comes with BioSQL.

>
> 2) table bioentry_dbxref: no data inserted (always empty, even with  
> BioJava)

This would mean that the sequence(s) have no dbxrefs. Note that for  
GenBank sequences that would be expected, since unfortunately, and  
unlike EMBL format, GenBank puts the dbxrefs into the feature table.

> 3) table bioentry_qualifier_value:
>
> One entry only, for the 'term_id' = 149, rank = 1, and value = '07- 
> JUL-2005'
> or other 'DD-MMM-YYYY' dates (see my remarks below)

Below you say that your term table is empty, so I don't know why you
can have a value here at all.

> [...]
> 5) table bioentry_relationships: no entry found (always empty, even  
> with
> BioJava)

If you load sequences, they won't have direct relationships to other  
sequences (except dbxrefs, but those are rather 'pointers' and are  
stored in their own table).

In Bioperl-db, this table is used only if you load sequence clusters  
through Bio::Cluster objects (such as UniGene).

> [...]
> 7) table comment: no entry found (always empty, even with BioJava)

Again, this is expected with GenBank. AFAIK genbank format doesn't  
allow for comments at the level of the sequence. You would (i.e.,  
should) find entries here if you load UniProt entries.

> 8) table dbxref: some records are generated, for dbname 'PUBMED'  
> and 'Taxon'
> with the correct value

Taxon obviously isn't really a dbxref, but rather a taxon (and hence  
should go into that table).

> [...]
> 9) table dbxref_qualifier_value: (always empty, even with BioJava)

That's almost expected. There are rather few cases where dbxrefs have
additional attributes that the language can parse out from a source
(and then map to the schema).

> [...]
> 10) table location: all locations loaded correctly, note that  
> 'term_id' and
> 'dbxref_id' remain NULL for these seq but I have value for other seq.

Theoretically, the term_id should point to the term giving the type  
of the location. If you (or Biopython) are only dealing with simple  
('normal') locations, then it's not needed.

The dbxref_id gives the reference to the remote sequence if the  
location for a feature refers to a different sequence than the  
feature itself does (so-called 'remote locations'). If the sequences  
you loaded don't have such locations, then this would be expected to
be empty (or if Biopython doesn't handle such locations).

> 11) table location_qualifier_value: always empty, even with BioJava

This is expected if Biopython doesn't support fuzzy locations, or if  
none of the feature locations that you loaded are fuzzy.

> [...]
> 13) Table reference: entries correct, note 'dbxref_id' remains NULL  
> for
> these seq but I have value for other seq.

It should point to the pubmed ID for the reference but only if there  
was one.

> 14) table seqfeature: entries are there (same as in table 'location').
> FYI:'display_name is always NULL.

GenBank doesn't give names to features (and I think neither does
EMBL), so this is expected.

> 15) table seqfeature_dbxref: always empty, even with BioJava

That's likely more to do with your language object model than with  
anything else. dbxref annotation for features is in tag/value pairs,  
just as any other, so your language (Biopython in this case) will  
have to do a lot of interpretation to tease out the semantics behind  
each tag name and based on that decide what to do with the value.  
Indeed, by default we don't even do this in BioPerl.
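
As a rough illustration of the interpretation step, GenBank db_xref qualifier values take the form database:identifier, so the first thing a loader would have to do is split them. A minimal Java sketch follows; the class and method names are hypothetical, and this is not how Bioperl-db, Biopython or BioJava handle it.

```java
// Hypothetical helper for splitting a /db_xref qualifier value.
class DbXrefValue {
    /**
     * Splits "database:identifier" at the first colon.
     * Returns a two-element array: {database, identifier}.
     */
    static String[] split(String value) {
        int colon = value.indexOf(':');
        // Reject values with no colon, an empty database or an empty identifier.
        if (colon <= 0 || colon == value.length() - 1) {
            throw new IllegalArgumentException("Not a database:identifier value: " + value);
        }
        return new String[] { value.substring(0, colon), value.substring(colon + 1) };
    }
}
```

For example, the value "taxon:9606" splits into the database name "taxon" and the identifier "9606". Deciding which table each pair belongs in is the part that needs the per-tag semantics described above.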

> [...]
> 17) table seqfeature_relationship: always empty, even with BioJava

GenBank (and EMBL) feature tables are flat, not hierarchical, so this  
is expected.

> 18) table taxon: always empty, even with BioJava)

This is where the organism should go.

> 19) table taxon_name: I have one but not from this test (I tried to  
> tinker a
> little bit with taxon but stopped)

That's odd that you can have an entry in taxon_name w/o a  
corresponding one in taxon. Do you have foreign key checks disabled?

> 20) table term: always empty, even with BioJava

That's strange, since you say you do have rows in  
bioentry_qualifier_value, which has an enforced foreign key to term.  
Did you disable the foreign key checks?

> 21) table term_dbxref: always empty, even with BioJava

That's expected unless you loaded an ontology whose terms have  
dbxrefs, and your language object model supports that.

> [...]
> 23) table term_synonym: always empty, even with BioJava

Same as for 21). Your terms would have to have synonyms, and your  
language object model would have to support those, before you could  
expect to get anything in here.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From shirleyc at cis.upenn.edu  Tue Nov 13 13:45:59 2007
From: shirleyc at cis.upenn.edu (Shirley Cohen)
Date: Tue, 13 Nov 2007 13:45:59 -0500
Subject: [Biojava-l] maximum parsimony search
Message-ID: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>

Hi BioJava People,

I'm looking for existing code that implements a maximum parsimony  
search in Java. Does BioJava have this functionality? If so, can you  
point me to the appropriate classes?

Thanks,

Shirley

From bmduggan at yahoo.com  Tue Nov 13 19:48:22 2007
From: bmduggan at yahoo.com (Brendan Duggan)
Date: Wed, 14 Nov 2007 11:48:22 +1100 (EST)
Subject: [Biojava-l] Disulfide information in PDB files
Message-ID: <454510.91557.qm@web52705.mail.re2.yahoo.com>

Greetings

I'm trying to mine some information on disulfides in
the PDB and was hoping there might be a way of
obtaining this information with the BioJava PDB
parser.  However, I haven't been able to see anything
like this mentioned in the API docs.  If it is
currently not possible to extract disulfide
information from PDB files are there any plans to
implement this?

Thanks!

Brendan



From holland at ebi.ac.uk  Wed Nov 14 03:50:31 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 14 Nov 2007 08:50:31 +0000
Subject: [Biojava-l] maximum parsimony search
In-Reply-To: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>
References: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>
Message-ID: <473AB6D7.2010405@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

There is a class currently only available from the head of CVS, i.e. it
has not been released yet. To get it you'll need to check out the very latest
BioJava source code from CVS.

The JavaDoc for the class is here:

http://www.spice-3d.org/public-files/javadoc/biojava/org/biojavax/bio/phylo/ParsimonyTreeMethod.html

It is designed to take input in the form of blocks of data similar to
what you would find in a Nexus file (the Nexus file parsers elsewhere in
the org/biojavax/bio/phylo package will provide these). However you
could of course create such objects from your own data without needing
to read/write any Nexus files.

cheers,
Richard


Shirley Cohen wrote:
> Hi BioJava People,
> 
> I'm looking for existing code that implements a maximum parsimony  
> search in Java. Does BioJava have this functionality? If so, can you  
> point me to the appropriate classes?
> 
> Thanks,
> 
> Shirley
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHOrbW4C5LeMEKA/QRAuswAJ9olIwj7DGszOnKORU255YS3m2ohACfbKTw
ihjuQVv0j+nlXb+4SL5pIfw=
=ldfM
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Wed Nov 14 03:55:24 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 14 Nov 2007 08:55:24 +0000
Subject: [Biojava-l] Disulfide information in PDB files
In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com>
References: <454510.91557.qm@web52705.mail.re2.yahoo.com>
Message-ID: <473AB7FC.10403@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Currently this is not parsed - the parser does not read all the tags in
the most recent PDB specification.

Could you open a bug request at http://bugzilla.open-bio.org/ to
formally add this to our to-do list? Thanks!

cheers,
Richard

Brendan Duggan wrote:
> Greetings
> 
> I'm trying to mine some information on disulfides in
> the PDB and was hoping there might be a way of
> obtaining this information with the BioJava PDB
> parser.  However, I haven't been able to see anything
> like this mentioned in the API docs.  If it is
> currently not possible to extract disulfide
> information from PDB files are there any plans to
> implement this?
> 
> Thanks!
> 
> Brendan
> 
> 
>       Make the switch to the world's best email. Get the new Yahoo!7 Mail now. http://au.yahoo.com/worldsbestmail/viagra/index.html
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHOrf84C5LeMEKA/QRArfeAJ9nCViM2jyVfubIpl5w/1EXMYTv/gCgjVEs
zDnxHjv8xJsRBw5pfE2NdkA=
=tGqm
-----END PGP SIGNATURE-----

From ap3 at sanger.ac.uk  Wed Nov 14 04:32:28 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Wed, 14 Nov 2007 09:32:28 +0000
Subject: [Biojava-l] Disulfide information in PDB files
In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com>
References: <454510.91557.qm@web52705.mail.re2.yahoo.com>
Message-ID: <9B898ADF-78EB-4B5C-A432-98274190815F@sanger.ac.uk>

Hi Brendan,

SSBOND lines are currently not parsed. If this is what you need,
I can add this over the next couple of days.

If you want to compute the bonds yourself, the framework can
e.g. calculate distances between the sulphur atoms for you. -
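
A minimal sketch of the distance approach in plain Java, assuming you have already extracted the SG atom coordinates (in Angstrom) of two cysteines by whatever means. The class name and the 2.5 A cutoff are illustrative assumptions, not BioJava API; a typical S-S bond length is about 2.05 A.

```java
// Hypothetical helper: flags cysteine pairs whose SG atoms are close enough to be bonded.
class DisulfideCheck {
    // Assumed cutoff in Angstrom, chosen loosely above the typical S-S bond length.
    static final double MAX_SS_DISTANCE = 2.5;

    /** Euclidean distance between two points given as {x, y, z}. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** True if the two SG atoms lie within the assumed cutoff of each other. */
    static boolean isLikelyDisulfide(double[] sgA, double[] sgB) {
        return distance(sgA, sgB) <= MAX_SS_DISTANCE;
    }
}
```

Running this over every pair of cysteine SG atoms in a structure gives candidate disulfides without relying on the SSBOND records.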

Andreas


On 14 Nov 2007, at 00:48, Brendan Duggan wrote:

> Greetings
>
> I'm trying to mine some information on disulfides in
> the PDB and was hoping there might be a way of
> obtaining this information with the BioJava PDB
> parser.  However, I haven't been able to see anything
> like this mentioned in the API docs.  If it is
> currently not possible to extract disulfide
> information from PDB files are there any plans to
> implement this?
>
> Thanks!
>
> Brendan
>
>
>       Make the switch to the world's best email. Get the new Yahoo! 
> 7 Mail now. http://au.yahoo.com/worldsbestmail/viagra/index.html
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From deb at mb.au.dk  Thu Nov 15 07:04:02 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Thu, 15 Nov 2007 13:04:02 +0100
Subject: [Biojava-l] Parsing exising gaps
Message-ID: <002701c8277f$9dbdca50$d9395ef0$@au.dk>

Dear all,

 
I have managed to read an MSF-formatted alignment from a file selected
through FileChooser as follows:

 
  BufferedReader br = new BufferedReader(
      new FileReader(aFileChooser.getSelectedFile()));
  SimpleAlignment align =
      (SimpleAlignment) SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

 
I can now retrieve the sequence names and sequences through the Alignment
object:

 
  Iterator aLabels = align.getLabels().iterator();

  Iterator aSequences = align.symbolListIterator();

 
However, I now want to be able to translate between real sequence numbers
and the positions within each alignment string, i.e. retrieve positions with
the gaps removed (gaps are represented by hyphens '-' in the MSF
format). How can I tell BioJava to parse the gaps into a GappedSequence
format? I have tried the following to check what position 15 (past the
first gap) translates into:

 
  int n = 0;
  while (aSequences.hasNext()) {
      SimpleSymbolList aSym = (SimpleSymbolList) aSequences.next();
      SimpleGappedSequence aGapped = new SimpleGappedSequence(
          new SimpleSequence(aSym, "", aLabels.next().toString(), null));
      System.out.println(aGapped.gappedToLocation(new PointLocation(15)));
  }

 
But I only get 15 back out. I have also studied the constructor of the
underlying SimpleGappedSymbolList but it simply copies the SymbolList and
creates one big block:

 
  public SimpleGappedSymbolList(SymbolList source) {
    this.source = source;
    this.alpha = source.getAlphabet();
    this.blocks = new ArrayList();
    this.length = source.length();
    Block b = new Block(1, length, 1, length);
    blocks.add(b);
  }

 
Is there a way to tell SimpleGappedSequence to parse itself in terms of the
gap characters in the sequence string? How is the sequence represented in
this case, if not by gaps? Surely the hyphen cannot be a part of the
standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?

 
Best wishes,

 
  Ditlev

 
--

 
Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

 
Department of Molecular Biology   Office:   +45 89425259
University of Aarhus              Lab:      +45 89425022
Gustav Wieds Vej 10c              Fax:      +45 86123178
DK-8000 Aarhus C                  Email:    deb at mb.au.dk
Denmark                           Lab WWW:  www.bioxray.dk/~deb

 
From holland at ebi.ac.uk  Thu Nov 15 08:51:48 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 15 Nov 2007 13:51:48 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
Message-ID: <473C4EF4.5080301@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think you've uncovered a number of problems here:

1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
all the other predefined alphabets.

2. The MSF parser doesn't bother trying to build GappedSequence
instances, instead it just builds solid sequences with the gaps as
normal symbols.

3. There is no constructor or method for taking a sequence with embedded
gap symbols and turning it into a GappedSequence with separate chunks.

Combined, these three problems make it impossible to do what you want
easily. I will make a note to fix this on the plans for the next BioJava
development cycle.

In the meantime, your best bet would be to construct a second alignment
block by iterating over the alignment block you already have and parsing
the locations of the gap symbols. You would create a
SimpleGappedSequence initially over the ungapped sequence, then use the
insert gap methods to insert the gaps into this ungapped sequence before
putting all the SimpleGappedSequence objects together into a new alignment.
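
The gap-parsing step can be sketched independently of BioJava. The following plain Java is a minimal illustration that works on the alignment row as a string with '-' for gaps; the class and method names are hypothetical, and positions are 1-based to match the coordinates used in this thread.

```java
// Hypothetical helper: relates alignment columns to ungapped sequence positions.
class GapMapping {
    static final char GAP = '-';

    /** Returns the sequence with all gap characters removed. */
    static String ungap(String gapped) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < gapped.length(); i++) {
            char c = gapped.charAt(i);
            if (c != GAP) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Maps a 1-based alignment column to the 1-based position in the
     * ungapped sequence, or returns -1 if the column holds a gap.
     */
    static int columnToResidue(String gapped, int column) {
        if (column < 1 || column > gapped.length()) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        if (gapped.charAt(column - 1) == GAP) {
            return -1;
        }
        int residue = 0;
        // Count the non-gap characters up to and including the requested column.
        for (int i = 0; i < column; i++) {
            if (gapped.charAt(i) != GAP) {
                residue++;
            }
        }
        return residue;
    }
}
```

For the row "MK--LV", column 5 maps to residue 3 and column 3 reports a gap. The same scan gives the gap positions needed when inserting gaps into a SimpleGappedSequence built over the ungapped sequence.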

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Dear all,
> 
>  
> 
> I have managed to read an MSF-formatted alignment from a file selected
> through FileChooser as follows:
> 
>  
> 
>   BufferedReader br = new BufferedReader(new
> FileReader(aFileChooser.getSelectedFile()));
> 
>   SimpleAlignment align =
> (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
> 
>  
> 
> I can now retrieve the sequence names and sequences through the Alignment
> object:
> 
>  
> 
>   Iterator aLabels = align.getLabels().iterator();
> 
>   Iterator aSequences = align.symbolListIterator();
> 
>  
> 
> However, I now want to be able to translate between real sequence numbers
> and the positions within each alignment string, i.e. retrieve positions with
> the gaps removed (gaps are represented by hyphens '-' in the MSF
> format). How can I tell BioJava to parse the gaps into a GappedSequence
> format? I have tried the following to check what position 15 (past the
> first gap) translates into:
> 
>  
> 
>   int n = 0;
> 
>   while(aSequences.hasNext()) {
> 
>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
> 
>       SimpleGappedSequence aGapped = new SimpleGappedSequence(new
> SimpleSequence(aSym, "", aLabels.next().toString(), null));
> 
>       System.out.println(aGapped.gappedToLocation(new PointLocation(15)));
> 
>   }
> 
>  
> 
> But I only get 15 back out. I have also studied the constructor of the
> underlying SimpleGappedSymbolList but it simply copies the SymbolList and
> creates one big block:
> 
>  
> 
>   public SimpleGappedSymbolList(SymbolList source) {
> 
>     this.source = source;
> 
>     this.alpha = source.getAlphabet();
> 
>     this.blocks = new ArrayList();
> 
>     this.length = source.length();
> 
>     Block b = new Block(1, length, 1, length);
> 
>     blocks.add(b);
> 
>   }
> 
>  
> 
> Is there a way to tell SimpleGappedSequence to parse itself in terms of the
> gap characters in the sequence string? How is the sequence represented in
> this case, if not by gaps? Surely the hyphen cannot be a part of the
> standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?
> 
>  
> 
> Best wishes,
> 
>  
> 
>   Ditlev
> 
>  
> 
> --
> 
>  
> 
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
> 
>  
> 
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
>  
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPE704C5LeMEKA/QRAniIAJsGv+5HIP3mCDxBIUdw0SjDrWu8dgCeNviA
EsJK4gv+EVY7wc4r6W2A0+I=
=wCQs
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Fri Nov 16 03:59:41 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 08:59:41 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
Message-ID: <473D5BFD.8080305@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Ditlev.

After some investigation and some helpful hints from Mark, it turns out
that there are methods in DNATools/ProteinTools that can construct
proper GappedSymbolList objects out of strings.

I have managed to modify the MSF parser to use this instead. This means
that the MSF parser will now return instances of GappedSymbolList
(actually GappedSequences to be accurate) rather than SimpleSymbolList.
Thanks to the way the APIs work this will make no difference to existing
users (except those who are depending on being able to cast it to a
certain type - which they shouldn't, because the API doesn't guarantee
it to be of any type!), but it will fix it for you. Future releases will
modify the API (or include a completely new MSF parser) which will
explicitly return GappedSymbolLists in the API declarations rather than
plain SymbolLists, but I can't do that right now because it would break
existing users' code.
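
To make concrete what a properly gapped object buys you, here is a minimal
sketch of the coordinate translation Ditlev asked about, done on a plain
string. It is an assumption-laden illustration only: the class and method
names are hypothetical and are not BioJava API, and the gap character is
assumed to be '-'.

```java
// Hypothetical helper on plain strings; the method names are illustrative, not BioJava API.
class GapCoords {
    /**
     * Maps a 1-based alignment column to the 1-based residue number in the
     * ungapped sequence, or returns -1 if the column holds a gap.
     */
    static int columnToResidue(String row, int column) {
        if (column < 1 || column > row.length()) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        if (row.charAt(column - 1) == '-') {
            return -1;
        }
        int residue = 0;
        for (int i = 0; i < column; i++) {
            if (row.charAt(i) != '-') {
                residue++;
            }
        }
        return residue;
    }

    /** Maps a 1-based residue number back to its 1-based alignment column. */
    static int residueToColumn(String row, int residue) {
        int seen = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != '-' && ++seen == residue) {
                return i + 1;
            }
        }
        throw new IndexOutOfBoundsException("residue " + residue);
    }
}
```

For the row "MK--LV-A", column 5 is the third residue, and residue 5 sits
in column 8; a sequence built as one solid block would instead hand the
column number back unchanged.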

To get the modified parser you will need to check out the very latest
source code from our CVS repository and compile it using ant.
Instructions are on our website at biojava.org if you have not done this
before.

Hope this helps you.

cheers,
Richard


Ditlev Egeskov Brodersen wrote:
> Hi Richard,
> 
>   thanks for clarifying this and for your useful suggestion, which I've
> managed to implement as shown below. It works nicely, but I was really
> surprised to learn that biojava hasn't yet implemented a proper parsing of
> gap characters from strings into the object structure as this seems central
> to any use of pre-aligned sequences. Also, I find it problematic that the
> API implements the gap characters as part of the alphabets. In my view, this
> breaks the logic of the object model because proteins don't really have gaps
> in their sequences.
> 
>   Rather, the constructor of the Sequence-derived classes ought to throw an
> exception when non-protein characters are passed and should not allow the
> user to create an object with sequence elements that are non-standard.
> Instead, I think there should be a static method that allows cleaning the
> input sequence before passing it to the Sequence constructor. On the other
> hand, the constructor of the GappedSequence-derived classes should recognise
> the gaps and create an object with blocks of legal protein symbols and gaps
> in the appropriate places.
> 
>   -- Ditlev
> 
>   // Read MSF file into Alignment object
>   BufferedReader br = new BufferedReader(
>       new FileReader(aFileChooser.getSelectedFile()));
>   SimpleAlignment align = (SimpleAlignment)
>       SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
> 
>   // Iterate through sequences in turn
>   Iterator aSequences = align.symbolListIterator();
>   while(aSequences.hasNext()) {
> 
>       // Retrieve SymbolList, the associated gap symbol and sequence string
>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>       Symbol aGapSymbol = aSym.getAlphabet().getGapSymbol();
>       String aGappedString = aSym.seqString();
> 
>       // Prepare non-gapped string
>       String aPlainString = "";
> 
>       // Loop through individual symbols and add non-gap characters to string
>       for(int i=1;i<=aSym.length();i++)
>           if(aSym.symbolAt(i) != aGapSymbol)
>               aPlainString += aGappedString.charAt(i-1);
> 
>       // Create a new gapped sequence object with the plain (non-gapped) sequence
>       SimpleGappedSequence aGapped = (SimpleGappedSequence)
>           ProteinTools.createGappedProteinSequence(aPlainString, "");
> 
>       // Use separate indices for gapped and plain sequences
>       int n = 1;
> 
>       // Loop through individual gapped sequence symbols and insert gap into
>       // object when gap symbol is encountered
>       for(int i=1;i<=aSym.length();i++)
>           if(aSym.symbolAt(i) != aGapSymbol)
>               n++;
>           else
>               aGapped.addGapInSource(n);
>   }
> 
> --
>  
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>  
> www.lindeman-brodersen.dk
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPVv84C5LeMEKA/QRAn0cAJ9jJUaA3bjiEwlzxaAo/bsN5+CT1QCcCLxS
Rv73CVmtYpEz+apJwM1L3sA=
=UPU6
-----END PGP SIGNATURE-----

From deb at mb.au.dk  Fri Nov 16 04:28:40 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 10:28:40 +0100
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
Message-ID: <000601c82833$143c5300$3cb4f900$@au.dk>

Hi Richard,

  thanks for your super fast reply. I managed to recompile using CVS/ant and
the MSF import now works brilliantly and simply as follows:

  BufferedReader br = new BufferedReader(
      new FileReader(aFileChooser.getSelectedFile()));
  SimpleAlignment align = (SimpleAlignment)
      SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

  // Iterate through sequences in turn
  Iterator aSequences = align.symbolListIterator();
  while(aSequences.hasNext()) {

      // Retrieve gapped sequence
      SimpleGappedSequence aGapped =
          (SimpleGappedSequence)aSequences.next();

      // ...do whatever with each gapped sequence
  }

  The returned gapped sequences are all properly set up with gaps, name etc.
But I think there may be problems for other users: since the
SimpleAlignment object only has a general symbol list iterator, the user
has to cast each object extracted from it, and

      SimpleSequence aSimple = (SimpleSequence)aSequences.next();

now throws a ClassCastException at run time. So old code might not run with
the update as far as I can see.

  Ditlev 

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb


From holland at ebi.ac.uk  Fri Nov 16 04:49:35 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 09:49:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <000601c82833$143c5300$3cb4f900$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
Message-ID: <473D67AF.2020007@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>   The returned gapped sequences are all properly set up with gaps, name etc.
> But as for other users, I think there may be some problems, since the
> SimpleAlignment object only has a general symbol list iterator, the user
> will have to cast each statement extracting a sequence object, and
> 
>       SimpleSequence aSimple = (SimpleSequence)aSequences.next();
> 
> returns an ClassCastException at run time. So old code might not run with
> the update as far as I can see.

This is true. However, such code would be unsupported by us as the API
clearly states that SimpleAlignment returns SymbolList instances, and
does not make any guarantees about the exact implementation details of
the objects it returns. To attempt to cast it to anything other than
SymbolList would be a mistake! (Although in practice the MSF parser now
always returns GappedSymbolList instances, which is what your code can now
take advantage of.) To assume it will return SimpleSequence is outside the
behaviour defined by the API and therefore should not be relied upon.

A more correct behaviour would be to test each item returned:

	SymbolList symlist = (SymbolList) aSequences.next();
	if (symlist instanceof SimpleSequence) {
		SimpleSequence seq = (SimpleSequence)symlist;
		// Do simple-sequence stuff
	} else {
		// Do something else!
	}

In future, I will modify the API to change the SymbolList guarantee to a
GappedSymbolList guarantee, but I can't do this right now as this really
would break everyone's code!

We are currently planning a redesign as you may be aware, so issues like
this will hopefully be resolved as part of that process. For a start, if
we use Java 5 generics in future as we plan, we can strictly specify
what kinds of objects will be returned by things such as the alignment
API, making it easier for us to enforce API-compliant behaviour in
users' code.
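
As a rough sketch of what the generics point could look like, the types
below are hypothetical stand-ins (not BioJava classes, and not the planned
design): an alignment parameterised on its row type hands back typed rows,
so the caller needs neither a cast nor an instanceof test.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// All types below are hypothetical stand-ins, not BioJava classes.
interface SymList {
    String seqString();
}

class GappedSymList implements SymList {
    private final String row;

    GappedSymList(String row) {
        this.row = row;
    }

    public String seqString() {
        return row;
    }

    /** Counts the '-' characters in this row. */
    int gapCount() {
        int count = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == '-') {
                count++;
            }
        }
        return count;
    }
}

// The type parameter states exactly what kind of row the alignment returns.
class TypedAlignment<T extends SymList> implements Iterable<T> {
    private final List<T> rows = new ArrayList<T>();

    void add(T row) {
        rows.add(row);
    }

    public Iterator<T> iterator() {
        return rows.iterator();
    }
}

class GenericsDemo {
    static int totalGaps(TypedAlignment<GappedSymList> alignment) {
        int total = 0;
        for (GappedSymList row : alignment) { // no cast needed
            total += row.gapCount();
        }
        return total;
    }
}
```

Code that tried to treat the rows as some other implementation type would
then fail at compile time rather than with a ClassCastException at run time.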

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPWev4C5LeMEKA/QRAvTOAJ9tqdBGWangZ9YQPpEDJ4WWBP/vjQCdHlMB
ITj7O/foDly4aOT4SV1Jb+k=
=g7Vs
-----END PGP SIGNATURE-----

From deb at mb.au.dk  Fri Nov 16 05:11:15 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 11:11:15 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D67AF.2020007@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk>
Message-ID: <000f01c82839$06722550$13566ff0$@au.dk>

Hi again,

  thanks for the info - will do the check just to be proper. I have another
question: In my application, I would like to wrap the retrieved
SimpleGappedSequence objects inside another object that extends the
functionality with application-specific stuff. Ideally, I would do this by
extending the SimpleGappedSequence class and creating the object by passing
the SimpleGappedSequence from the alignment import to the constructor of the
parent, like so:

  class AlignedSequence extends SimpleGappedSequence {
    public AlignedSequence(SimpleGappedSequence aGapped) {
      super(aGapped);
    }

    // ...custom stuff...
  }

However, the problem is that there is only one constructor for the
SimpleGappedSequence, one which takes a simple Sequence object. I can pass
the derived class alright, but all gap information is lost again, presumably
because the SimpleGappedSequence constructor just takes out the seqString()
and puts it into its own sequence object.

Shouldn't the constructor of the SimpleGappedSequence class recognise when a
derived (and gapped) sequence object is passed, and process it accordingly?

As it stands, I am forced to include the SimpleGappedSequence as a private
member of the AlignedSequence class, which is not nearly as nice, since every
statement using the class will have to do something like

  class AlignedSequence {
    private SimpleGappedSequence gapped_sequence;

    public AlignedSequence(SimpleGappedSequence aGapped) {
      gapped_sequence = aGapped;
    }

    public SimpleGappedSequence getGappedSequence() {
      return(gapped_sequence);
    }

    // ...custom stuff...
  }

  ...

  AlignedSequence aAligned = new AlignedSequence(aGapped);
  aAligned.getGappedSequence().seqString();

rather than simply:

  AlignedSequence aAligned = new AlignedSequence(aGapped);
  aAligned.seqString();

In other words, is there any solution with the current setup that would
allow me to extend SimpleGappedSequence and not lose the gap information?
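
For comparison, the private-member approach can be made less clumsy by
forwarding the calls, so that callers still write aAligned.seqString(). The
sketch below uses hypothetical stand-in types rather than the real BioJava
interfaces, which have many more methods to forward.

```java
// Hypothetical stand-ins for the BioJava types discussed above.
interface GappedSeq {
    String seqString();
    int length();
}

class StringGappedSeq implements GappedSeq {
    private final String row;

    StringGappedSeq(String row) {
        this.row = row;
    }

    public String seqString() {
        return row;
    }

    public int length() {
        return row.length();
    }
}

// Wrapper that holds the gapped sequence and forwards the interface methods to it.
class AlignedSequenceWrapper implements GappedSeq {
    private final GappedSeq delegate;
    private final String label; // example of application-specific state

    AlignedSequenceWrapper(GappedSeq delegate, String label) {
        this.delegate = delegate;
        this.label = label;
    }

    public String seqString() {
        return delegate.seqString();
    }

    public int length() {
        return delegate.length();
    }

    String label() {
        return label;
    }
}
```

The cost is one forwarding method per interface method, which is why
extending the class would still be the more convenient option if the
constructor kept the gaps.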

--  Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb




From ap3 at sanger.ac.uk  Fri Nov 16 04:51:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Nov 2007 09:51:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
Message-ID: <A750D1C7-7659-4F29-9BD0-B30558FBF38E@sanger.ac.uk>

>
> To get the modified parser you will need to check out the very latest
> source code from our CVS repository and compile it using ant.
> Instructions are on our website at biojava.org if you have not done  
> this
> before.

alternatively you could get the automatically built biojava.jar from
http://www.spice-3d.org/cruise/

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From holland at ebi.ac.uk  Fri Nov 16 05:46:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 10:46:57 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000f01c82839$06722550$13566ff0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk>
	<000f01c82839$06722550$13566ff0$@au.dk>
Message-ID: <473D7521.9070603@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The easiest way is simply for me to alter the constructor of
SimpleGappedSequence (and equivalently of SimpleGappedSymbolList) to
copy all gaps if passed another instance of GappedSymbolList as the
parameter. I've just done this in CVS so you should be able to update
your copy and observe the new behaviour.
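
For illustration only, the idea of "copy all gaps if passed another gapped
instance" can be sketched as below. This is a minimal, self-contained model
and not the code committed to CVS: PlainSeq, GappedView, SimplePlainSeq and
SimpleGappedView are hypothetical stand-ins (for a plain SymbolList, a
GappedSymbolList and their simple implementations), and gaps are modelled as
'-' characters in a string.

```java
public class CopyGapsSketch {

    // Hypothetical stand-in for a plain (ungapped) symbol list.
    interface PlainSeq {
        String seqString();
    }

    // Hypothetical stand-in for a gapped symbol list.
    interface GappedView extends PlainSeq {
        String ungappedString(); // the underlying source without gaps
    }

    static class SimplePlainSeq implements PlainSeq {
        private final String residues;
        SimplePlainSeq(String residues) { this.residues = residues; }
        public String seqString() { return residues; }
    }

    static class SimpleGappedView implements GappedView {
        private final String source;      // ungapped residues
        private final StringBuilder view; // residues plus '-' gap characters

        // A single constructor: the run-time type of the argument decides
        // whether existing gaps are copied or a gap-free view is created.
        SimpleGappedView(PlainSeq seq) {
            if (seq instanceof GappedView) {
                GappedView other = (GappedView) seq;
                source = other.ungappedString();
                view = new StringBuilder(other.seqString()); // keep the gaps
            } else {
                source = seq.seqString();
                view = new StringBuilder(source);            // no gaps yet
            }
        }

        // Insert 'length' gaps before the 1-based view position 'pos'.
        void addGapsInView(int pos, int length) {
            for (int i = 0; i < length; i++) {
                view.insert(pos - 1, '-');
            }
        }

        public String seqString() { return view.toString(); }
        public String ungappedString() { return source; }
    }

    public static void main(String[] args) {
        SimpleGappedView first =
            new SimpleGappedView(new SimplePlainSeq("MSEKLMPRTTWAKG"));
        first.addGapsInView(3, 2);
        SimpleGappedView copy = new SimpleGappedView(first);
        System.out.println(copy.seqString());      // MS--EKLMPRTTWAKG
        System.out.println(copy.ungappedString()); // MSEKLMPRTTWAKG
    }
}
```

A subclass that calls super(aGapped) on a class built this way keeps the gaps,
because the check happens on the object actually passed in.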

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi again,
> 
>   thanks for the info - will do the check just to be proper. I have another
> question: In my application, I would like to wrap the retrieved
> SimpleGappedSequence objects inside another object that extends the
> functionality with application-specific stuff. Ideally, I would do this by
> extending the SimpleGappedSequence object and create it by passing the
> SimpleGappedSequence from the alignment import to the constructor of the
> parent, like so:
> 
>   class AlignedSequence extends SimpleGappedSequence {
>     public AlignedSequence(SimpleGappedSequence aGapped) {
>       super(aGapped);
>     }
> 
>     ..custom stuff..
>   }
> 
> However, the problem is that there is only one constructor for the
> SimpleGappedSequence, one which takes a simple Sequence object. I can pass
> the derived class alright, but all gap information is lost again, presumably
> because the SimpleGappedSequence constructor just takes out the seqString()
> and puts it into its own sequence object.
> 
> Shouldn't the constructor of the SimpleGappedSequence class recognise when a
> derived (and gapped) sequence object is passed, and process it accordingly?
> 
> As it stands, I am forced to include the SimpleGappedSequence as a private
> member of the AlignedSequence class, which is not nearly as nice, since all
> statements using the class will have to do something like
> 
>   class AlignedSequence {
>     private SimpleGappedSequence gapped_sequence;
> 
>     public AlignedSequence(SimpleGappedSequence aGapped) {
>       gapped_sequence = aGapped;
>     }
> 
>     public SimpleGappedSequence getGappedSequence() {
>       return(gapped_sequence);
>     }
> 
>     ..custom stuff..
>   }
> 
>   ...
> 
>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>   aAligned.getGappedSequence().seqString();
> 
> rather than simply:
> 
>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>   aAligned.seqString();
> 
> In other words, is there any solution with the current setup that would
> allow me to extend SimpleGappedSequence and not lose the gap information?
> 
> --  Ditlev
> 
> --
>  
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>  
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 16 November 2007 10:50
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: [Biojava-l] Parsing exising gaps
>>
>>>>   The returned gapped sequences are all properly set up with gaps, name etc.
>>>> But as for other users, I think there may be some problems, since the
>>>> SimpleAlignment object only has a general symbol list iterator, the user
>>>> will have to cast each statement extracting a sequence object, and
>>>>
>>>>       SimpleSequence aSimple = (SimpleSequence)aSequences.next();
>>>>
>>>> throws a ClassCastException at run time. So old code might not run with
>>>> the update as far as I can see.
> This is true. However, such code would be unsupported by us as the API
> clearly states that SimpleAlignment returns SymbolList instances, and
> does not make any guarantees about the exact implementation details of
> the objects it returns. To attempt to cast it to anything other than
> SymbolList would be a mistake! (Although actually it is now returning a
> guarantee of GappedSymbolList, which is what your code can now take
> advantage of). To assume it will return SimpleSequence is outside the
> behaviour defined by the API and therefore should not be relied upon.
> 
> A more correct behaviour would be to test each item returned:
> 
> 	SymbolList symlist = (SymbolList) aSequences.next();
> 	if (symlist instanceof SimpleSequence) {
> 		SimpleSequence seq = (SimpleSequence)symlist;
> 		// Do simple-sequence stuff
> 	} else {
> 		// Do something else!
> 	}
> 
> In future, I will modify the API to change the SymbolList guarantee to a
> GappedSymbolList guarantee, but I can't do this right now as this really
> would break everyone's code!
> 
> We are currently planning a redesign as you may be aware, so issues like
> this will hopefully be resolved as part of that process. For a start, if
> we use Java 5 generics in future as we plan, we can strictly specify
> what kinds of objects will be returned by things such as the alignment
> API, making it easier for us to enforce API-compliant behaviour in
> users' code.
> 
> cheers,
> Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPXUh4C5LeMEKA/QRAsbqAKCnpCRnIiztjZ69fE2/UaJuI9QjiACfYa0m
8EJTzWZYOyjp9VhmvsgvmNA=
=1uaB
-----END PGP SIGNATURE-----

From deb at mb.au.dk  Fri Nov 16 07:39:23 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 13:39:23 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D7521.9070603@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
	<473D7521.9070603@ebi.ac.uk>
Message-ID: <001801c8284d$b8c525e0$2a4f71a0$@au.dk>

Hi again,

  I updated CVS and got the new SimpleGappedSymbolList class, but there
seem to be no changes to the SimpleGappedSequence class, which is the one I
need to extend...have I missed something?

  Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb




From holland at ebi.ac.uk  Fri Nov 16 07:46:23 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 12:46:23 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <001801c8284d$b8c525e0$2a4f71a0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
	<473D7521.9070603@ebi.ac.uk>
	<001801c8284d$b8c525e0$2a4f71a0$@au.dk>
Message-ID: <473D911F.2000303@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
delegates to the SimpleGappedSymbolList constructor.

When you extend SimpleGappedSequence you should delegate in your new
constructor to the existing SimpleGappedSequence constructor, which in
turn will delegate as above and preserve the gaps.

By passing any object which implements GappedSymbolList to the
SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
SimpleGappedSymbolList, it will automatically choose the new constructor
from SimpleGappedSymbolList which you hopefully should be able to see in
the code you have just checked out. If passed any other
non-GappedSymbolList object, it will use the old constructor that
already existed from before.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi again,
> 
>   I updated CVS and got the new SimpleGappedSymbolList class, but there
> seem to be no changes to the SimpleGappedSequence class, which is the one I
> need to extend...have I missed something?
> 
>   Ditlev
> 
> --
>  
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>  
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
H0QV01cFreNi1SNLRPbhepg=
=023Y
-----END PGP SIGNATURE-----

From ap3 at sanger.ac.uk  Fri Nov 16 09:43:39 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Nov 2007 14:43:39 +0000
Subject: [Biojava-l] Disulfide information in PDB files
In-Reply-To: <459609.71722.qm@web52710.mail.re2.yahoo.com>
References: <459609.71722.qm@web52710.mail.re2.yahoo.com>
Message-ID: <8F40FBF1-D491-4C3D-BCEB-41316147BD80@sanger.ac.uk>

Hi Brendan,

I just committed the patches to CVS, so
BioJava can now parse the SSBOND records.
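
As a rough illustration of the geometry Brendan wants to derive once the
SSBOND partners are known, here is a self-contained sketch that works on
plain coordinate arrays. It deliberately does not use the BioJava structure
API: the class DisulfideGeometry, its methods and the coordinates in main
are made up for this example.

```java
public class DisulfideGeometry {

    // Euclidean distance between two atoms given as {x, y, z},
    // e.g. the two SG atoms of a disulfide.
    public static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    private static double[] sub(double[] a, double[] b) {
        return new double[] { a[0] - b[0], a[1] - b[1], a[2] - b[2] };
    }

    private static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    private static double[] cross(double[] a, double[] b) {
        return new double[] {
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]
        };
    }

    // Torsion (dihedral) angle in degrees for four atoms p1-p2-p3-p4,
    // e.g. CB-SG-SG'-CB', using the usual atan2 formulation.
    public static double torsion(double[] p1, double[] p2,
                                 double[] p3, double[] p4) {
        double[] b1 = sub(p2, p1);
        double[] b2 = sub(p3, p2);
        double[] b3 = sub(p4, p3);
        double[] n1 = cross(b1, b2); // normal of the plane p1, p2, p3
        double[] n2 = cross(b2, b3); // normal of the plane p2, p3, p4
        double y = Math.sqrt(dot(b2, b2)) * dot(b1, n2);
        double x = dot(n1, n2);
        return Math.toDegrees(Math.atan2(y, x));
    }

    public static void main(String[] args) {
        // Made-up coordinates (in Angstrom), not taken from a real PDB entry.
        double[] cb1 = { 1.0, 0.0, 0.0 };
        double[] sg1 = { 0.0, 0.0, 0.0 };
        double[] sg2 = { 0.0, 0.0, 2.0 };
        double[] cb2 = { 0.0, 1.0, 2.0 };
        System.out.println("S-S distance: " + distance(sg1, sg2));
        System.out.println("CB-SG-SG'-CB' torsion: "
            + torsion(cb1, sg1, sg2, cb2));
    }
}
```

With real data, the coordinate arrays would come from the atoms of the two
cysteines named in each SSBOND record.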

Andreas


On 14 Nov 2007, at 16:28, Brendan Duggan wrote:

> Hi Andreas
>
> Thanks for the quick response.  I submitted a bug
> request (#2400) as suggested by Richard.  Parsing the
> SSBOND records is indeed what I was talking about.  I
> want to identify the disulfides then calculate their
> torsions, dihedrals and bond lengths, all of which I
> believe can be implemented with the existing code.  If
> you could implement this parsing in a few days that
> would be great!
>
> Thanks
>
> Brendan
>
>
> --- Andreas Prlic <ap3 at sanger.ac.uk> wrote:
>
>> Hi Brendan,
>>
>> SSBOND lines are currently not parsed. If this is what you need,
>> I can add this over the next couple of days.
>>
>> If you want to compute the bonds yourself, the framework can
>> e.g. calculate distances between the sulphur atoms for you.
>>
>> Andreas
>>
>>
>>
>>
>>
>> On 14 Nov 2007, at 00:48, Brendan Duggan wrote:
>>
>>> Greetings
>>>
>>> I'm trying to mine some information on disulfides in
>>> the PDB and was hoping there might be a way of
>>> obtaining this information with the BioJava PDB
>>> parser.  However, I haven't been able to see anything
>>> like this mentioned in the API docs.  If it is
>>> currently not possible to extract disulfide
>>> information from PDB files are there any plans to
>>> implement this?
>>>
>>> Thanks!
>>>
>>> Brendan
>
> Brendan M. Duggan, PhD
>
> bmduggan at yahoo.com
> (858) 692-2298

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From holland at ebi.ac.uk  Sun Nov 18 12:12:04 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Sun, 18 Nov 2007 17:12:04 -0000 (GMT)
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000901c829d0$daa54620$8fefd260$@dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
	<473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk>
	<473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk>
Message-ID: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>

Interesting stuff. I'm not sure why it isn't working so I'll have to have
a closer look.
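
One possibility worth ruling out (an assumption, not a diagnosis of the
BioJava code): if the gap-copying behaviour is provided by an overloaded
constructor, Java selects the overload at compile time from the declared
type of the argument, not from the object's run-time class. The hypothetical
types Plain, Gapped, Base, Derived and CheckedBase below show how a
delegating constructor whose parameter is declared as the plain type always
reaches the plain overload, and how a run-time instanceof check avoids that.

```java
public class OverloadSketch {

    // Hypothetical stand-ins; none of these types exist in BioJava.
    interface Plain {}
    interface Gapped extends Plain {}
    static class GappedImpl implements Gapped {}

    // Two overloads: the compiler picks one from the declared argument type.
    static class Base {
        final String chosen;
        Base(Plain p)  { chosen = "plain"; }
        Base(Gapped g) { chosen = "gapped"; }
    }

    // The parameter is declared as Plain, so super(p) always binds to
    // Base(Plain), even when a Gapped object is passed at run time.
    static class Derived extends Base {
        Derived(Plain p) { super(p); }
    }

    // Alternative: one constructor that inspects the run-time type.
    static class CheckedBase {
        final String chosen;
        CheckedBase(Plain p) {
            chosen = (p instanceof Gapped) ? "gapped" : "plain";
        }
    }

    public static void main(String[] args) {
        Plain declaredPlain = new GappedImpl();
        System.out.println(new Base(new GappedImpl()).chosen);      // gapped
        System.out.println(new Base(declaredPlain).chosen);         // plain
        System.out.println(new Derived(new GappedImpl()).chosen);   // plain
        System.out.println(new CheckedBase(declaredPlain).chosen);  // gapped
    }
}
```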

I'm currently on annual leave but will investigate when I return (Nov 27th).

cheers,
Richard
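
For reference, the mapping that the test program quoted below expects
("Gapped position 10 = Plain position 8") can be reproduced with a few lines
of plain Java. This is only a sketch of the expected arithmetic on a gapped
string; GapMappingSketch and gappedToSource are hypothetical names, not
BioJava methods, and returning -1 for a gap is a choice made for this example.

```java
public class GapMappingSketch {

    // Map a 1-based position in a gapped string to the 1-based position in
    // the ungapped source. Returns -1 if the gapped position holds a gap.
    public static int gappedToSource(String gapped, int gappedPos) {
        if (gappedPos < 1 || gappedPos > gapped.length()) {
            throw new IndexOutOfBoundsException("position " + gappedPos);
        }
        if (gapped.charAt(gappedPos - 1) == '-') {
            return -1;
        }
        int residues = 0;
        // Count the non-gap characters up to and including gappedPos.
        for (int i = 0; i < gappedPos; i++) {
            if (gapped.charAt(i) != '-') {
                residues++;
            }
        }
        return residues;
    }

    public static void main(String[] args) {
        // Same gapped string as in the second case of the test output.
        System.out.println(gappedToSource("MS--EKLMPR---TTWAKG", 10)); // 8
        System.out.println(gappedToSource("MS--EKLMPR---TTWAKG", 3));  // -1
    }
}
```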

On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
> Hi Richard,
>
>   I also thought that what you say was correct, but I can't get it to work.
> Below is a small test program to check this. First, I create a
> SimpleGappedSequence through Text with
> gaps->SymbolList->Sequence->GappedSequence. Gaps are there but not
> "understood", as expected. Next, I create the same sequence non-gapped in
> the above way, then introduce gaps with addGapsInSource. A gapped location
> is now properly translated to a non-gapped sequence position. Finally, I
> create a new SimpleGappedSequence based on the working one - as you can see
> the gaps are still there but not "understood"...
>
> aSymbolList = MSE--KLMPRT---TWAKG
> aSequence   = MSE--KLMPRT---TWAKG
>
> Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:
> aGapped     = MSE--KLMPRT---TWAKG
> Gapped position 10 = Plain position 10
>
> aSymbolList = MSEKLMPRTTWAKG
> aSequence   = MSEKLMPRTTWAKG
>
> Gaps introduced through addGapsInSource work ok:
> aGapped     = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 8
>
> Now a new SimpleGappedSequence object is created from the previous one:
> aGapped2    = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 10
>
> This should have been compiled with the new biojava.jar of 161107 (updated
> via CVS), but perhaps I made a mistake updating?
>
> Any clues?
>
> Thanks,
>
>   Ditlev
>
> ---
>
> package gappedsequencetest;
>
> import org.biojava.bio.*;
> import org.biojava.bio.seq.*;
> import org.biojava.bio.seq.impl.*;
> import org.biojava.bio.symbol.*;
>
> public class Main {
>
>     public static void main(String[] args) {
>         SymbolList aSymbolList = null;
>         try {
>             aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>
>         SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
>         System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
>         System.out.println("aGapped     = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position "
>             + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         try {
>             aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>
>         aGapped = new SimpleGappedSequence(aSequence);
>         aGapped.addGapsInSource(9, 3);
>         aGapped.addGapsInSource(3, 2);
>         System.out.println("Gaps introduced through addGapsInSource work ok:");
>         System.out.println("aGapped     = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position "
>             + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
>         System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
>         System.out.println("aGapped2    = " + aGapped2.seqString());
>         System.out.println("Gapped position 10 = Plain position "
>             + aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>     }
>
> }
>
> --
>
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>
> www.lindeman-brodersen.dk
>
>
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 16 November 2007 13:46
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: Wrapping SimpleGappedSequence
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> SimpleGappedSequence extends SimpleGappedSymbolList, and the
>> constructor
>> delegates to the SimpleGappedSymbolList constructor.
>>
>> When you extend SimpleGappedSequence you should delegate in your new
>> constructor to the existing SimpleGappedSequence constructor, which in
>> turn will delegate as above and preserve the gaps.
>>
>> By passing any object which implements GappedSymbolList to the
>> SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
>> SimpleGappedSymbolList, it will automatically choose the new
>> constructor
>> from SimpleGappedSymbolList which you hopefully should be able to see
>> in
>> the code you have just checked out. If passed any other
>> non-GappedSymbolList object, it will use the old constructor that
>> already existed from before.
>>
>> cheers,
>> Richard
>>
>> Ditlev Egeskov Brodersen wrote:
>> > Hi again,
>> >
>> >   I updated CVS and got the new SimpleGappedSymbolList class, but there
>> > seem to be no changes to the SimpleGappedSequence class, which is the one I
>> > need to extend...have I missed something?
>> >
>> >   Ditlev
>> >
>> > --
>> >
>> > Ditlev E. Brodersen, Ph.D.
>> > Lektor, Associate Professor
>> >
>> > Department of Molecular Biology   Office:  +45 89425259
>> > University of Aarhus              Lab:     +45 89425022
>> > Gustav Wieds Vej 10c              Fax:     +45 86123178
>> > DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>> > Denmark                           Lab WWW: www.bioxray.dk/~deb
>> >
>> >
>> >> -----Original Message-----
>> >> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> >> Sent: 16 November 2007 11:47
>> >> To: Ditlev Egeskov Brodersen
>> >> Cc: biojava-l at biojava.org
>> >> Subject: Re: Wrapping SimpleGappedSequence
>> >>
>> > The easiest way is simply for me to alter the constructor to
>> > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to
>> > copy all gaps if passed another instance of GappedSymbolList as the
>> > parameter. I've just done this in CVS so you should be able to update
>> > your copy and observe the new behaviour.
>> >
>> > cheers,
>> > Richard
>> >
>> > Ditlev Egeskov Brodersen wrote:
>> >>>> Hi again,
>> >>>>
>> >>>>   thanks for the info - will do the check just to be proper. I
>> have
>> > another
>> >>>> question: In my application, I would like to wrap the retrieved
>> >>>> SimpleGappedSequence objects inside another object that extends
>> the
>> >>>> functionality with application-specific stuff. Ideally, I would do
>> > this by
>> >>>> extending the SimpleGappedSequence object and create it by passing
>> > the
>> >>>> SimpleGappedSequence from the alignment import to the constructor
>> of
>> > the
>> >>>> parent, like so:
>> >>>>
>> >>>>   class AlignedSequence extends SimpleGappedSequence {
>> >>>>     public AlignedSequence(SimpleGappedSequence aGapped) {
>> >>>>       super(aGapped);
>> >>>>     }
>> >>>>
>> >>>>     ..custom stuff..
>> >>>>   }
>> >>>>
>> >>>> However, the problem is that there is only one constructor for the
>> >>>> SimpleGappedSequence, one which takes a simple Sequence object. I
>> can
>> > pass
>> >>>> the derived class alright, but all gap information is lost again,
>> > presumably
>> >>>> because the SimpleGappedSequence constructor just takes out the
>> > seqString()
>> >>>> and puts it into its own sequence object.
>> >>>>
>> >>>> Shouldn't the constructor of the SimpleGappedSequence class
>> recognise
>> > when a
>> >>>> derived (and gapped) sequence object is passed, and process it
>> > accordingly?
>> >>>> As it stands, I am forced to include the SimpleGappedSequence as a
>> > private
>> >>>> member of the AlignedSequence class, which is not near as nice
>> since
>> > all
>> >>>> statement using the class will have to do something like
>> >>>>
>> >>>>   class AlignedSequence extends SimpleGappedSequence {
>> >>>>     private SimpleGappedSequence gapped_sequence;
>> >>>>
>> >>>>     public AlignedSequence(SimpleGappedSequence aGapped) {
>> >>>>       gapped_sequence = aGapped;
>> >>>>     }
>> >>>>
>> >>>>     public SimpleGappedSequence getGappedSequence() {
>> >>>>       return(gapped_sequence);
>> >>>>   }
>> >>>>
>> >>>>     ..custom stuff..
>> >>>>   }
>> >>>>
>> >>>>   ...
>> >>>>
>> >>>>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>> >>>>   aAligned.getGappedSequence().seqString();
>> >>>>
>> >>>> rather than simply:
>> >>>>
>> >>>>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>> >>>>   aAligned.seqString();
>> >>>>
>> >>>> In other words, is there any solution with the current setup that
>> > would
>> >>>> allow me to extend SimpleGappedSequence and not loose the gap
>> > information?
>> >>>> --  Ditlev
>> >>>>
>> >>>> --
>> >>>>
>> >>>> Ditlev E. Brodersen, Ph.D.
>> >>>> Lektor, Associate Professor
>> >>>>
>> >>>> Department of Molecular Biology   Office:  +45 89425259
>> >>>> University of Aarhus              Lab:     +45 89425022
>> >>>> Gustav Wieds Vej 10c              Fax:     +45 86123178
>> >>>> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>> >>>> Denmark                           Lab WWW: www.bioxray.dk/~deb
>> >>>>
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> >>>>> Sent: 16 November 2007 10:50
>> >>>>> To: Ditlev Egeskov Brodersen
>> >>>>> Cc: biojava-l at biojava.org
>> >>>>> Subject: Re: [Biojava-l] Parsing exising gaps
>> >>>>>
>> >>>>>>>   The returned gapped sequences are all properly set up with
>> gaps,
>> >>>> name etc.
>> >>>>>>> But as for other users, I think there may be some problems,
>> since
>> > the
>> >>>>>>> SimpleAlignment object only has a general symbol list iterator,
>> > the
>> >>>> user
>> >>>>>>> will have to cast each statement extracting a sequence object,
>> and
>> >>>>>>>
>> >>>>>>>       SimpleSequence aSimple =
>> (SimpleSequence)aSequences.next();
>> >>>>>>>
>> >>>>>>> returns an ClassCastException at run time. So old code might
>> not
>> > run
>> >>>> with
>> >>>>>>> the update as far as I can see.
>> >>>> This is true. However, such code would be unsupported by us as the
>> > API
>> >>>> clearly states that SimpleAlignment returns SymbolList instances,
>> and
>> >>>> does not make any guarantees about the exact implementation
>> details
>> > of
>> >>>> the objects it returns. To attempt to cast it to anything other
>> than
>> >>>> SymbolList would be a mistake! (Although actually it is now
>> returning
>> > a
>> >>>> guarantee of GappedSymbolList, which is what your code can now
>> take
>> >>>> advantage of). To assume it will return SimpleSequence is outside
>> the
>> >>>> behaviour defined by the API and therefore should not be relied
>> upon.
>> >>>>
>> >>>> A more correct behaviour would be to test each item returned:
>> >>>>
>> >>>> 	SymbolList symlist = aSequences.next();
>> >>>> 	if (symlist instanceof SimpleSequence) {
>> >>>> 		SimpleSequence seq = (SimpleSequence)symlist;
>> >>>> 		// Do simple-sequence stuff
>> >>>> 	} else {
>> >>>> 		// Do something else!
>> >>>> 	}
>> >>>>
>> >>>> In future, I will modify the API to change the SymbolList
>> guarantee
>> > to
>> >>>> a
>> >>>> GappedSymbolList guarantee, but I can't do this right now as this
>> >>>> really
>> >>>> would break everyone's code!
>> >>>>
>> >>>> We are currently planning a redesign as you may be aware, so
>> issues
>> >>>> like
>> >>>> this will hopefully be resolved as part of that process. For a
>> start,
>> >>>> if
>> >>>> we use Java 5 generics in future as we plan, we can strictly
>> specify
>> >>>> what kinds of objects will be returned by things such as the
>> > alignment
>> >>>> API, making it easier for us to enforce API-compliant behaviour in
>> >>>> user's code.
>> >>>>
>> >>>> cheers,
>> >>>> Richard
>>
>> - --
>> Richard Holland (BioMart)
>> EMBL EBI, Wellcome Trust Genome Campus,
>> Hinxton, Cambridgeshire CB10 1SD, UK
>> Tel. +44 (0)1223 494416
>>
>> http://www.biomart.org/
>> http://www.biojava.org/
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
>> H0QV01cFreNi1SNLRPbhepg=
>> =023Y
>> -----END PGP SIGNATURE-----
>
>


-- 
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK


From sterk at ebi.ac.uk  Mon Nov 19 06:53:00 2007
From: sterk at ebi.ac.uk (Peter Sterk)
Date: Mon, 19 Nov 2007 11:53:00 +0000
Subject: [Biojava-l] biojava.org wiki site down?
Message-ID: <4741791C.2090307@ebi.ac.uk>

Hi,

I only get blank screens in Firefox, and IE can't display the pages 
either. I think Richard reported something similar a few weeks ago.

cheers,

--Peter
-----------------------------------------------------------------
Dr. Peter Sterk                           Tel: +44-(0)1223-494405
EMBL-European Bioinformatics Institute    Fax: +44-(0)1223-494472
Wellcome Trust Genome Campus, Hinxton     email: sterk at ebi.ac.uk
Cambridge CB10 1SD, UK                    WWW: www.ebi.ac.uk

   Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
-----------------------------------------------------------------

From deb at mb.au.dk  Mon Nov 19 07:13:53 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 13:13:53 +0100
Subject: [Biojava-l] biojava.org wiki site down?
In-Reply-To: <4741791C.2090307@ebi.ac.uk>
References: <4741791C.2090307@ebi.ac.uk>
Message-ID: <003301c82aa5$a6fabdc0$f4f03940$@au.dk>

www.biojava.org is down now, all right, but I was there less than 10 minutes
ago, so it's a recent crash.

  Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb


> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
> bounces at lists.open-bio.org] On Behalf Of Peter Sterk
> Sent: 19 November 2007 12:53
> To: biojava-l at lists.open-bio.org
> Subject: [Biojava-l] biojava.org wiki site down?
> 
> Hi,
> 
> I only get blank screens in firefox and IE can't display the pages,
> either. I think Richard reported something similar a few weeks ago.
> 
> cheers,
> 
> --Peter
> -----------------------------------------------------------------
> Dr. Peter Sterk                           Tel: +44-(0)1223-494405
> EMBL-European Bioinformatics Institute    Fax: +44-(0)1223-494472
> Wellcome Trust Genome Campus, Hinxton     email: sterk at ebi.ac.uk
> Cambridge CB10 1SD, UK                    WWW: www.ebi.ac.uk
> 
>    Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
> -----------------------------------------------------------------
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l


From deb at mb.au.dk  Mon Nov 19 09:46:01 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 15:46:01 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>	<473C4EF4.5080301@ebi.ac.uk>
	<000a01c827c1$8c8e5a50$a5ab0ef0$@dk>	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>	<473D67AF.2020007@ebi.ac.uk>
	<000f01c82839$06722550$13566ff0$@au.d	<473D7521.9070603@ebi.ac.uk>
	<001801c8284d$b8c525e0$2a4f71a0$@au.dk>	<473D911F.2000303@ebi.ac.uk>
	<000901c829d0$daa54620$8fefd260$@dk>
	<48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
Message-ID: <003701c82aba$e85f4320$b91dc960$@au.dk>

Dear Richard and all,

  I've been dissecting the delegation problem encountered when instantiating
SimpleGappedSequence(Sequence) with an already gapped sequence. The
constructor calls the parent SimpleGappedSymbolList(), which in Richard's
CVS update of 161107 now contains a separate overloaded constructor for the
gapped case:

  public SimpleGappedSymbolList(GappedSymbolList gappedSource)

  However, when instantiating a new SimpleGappedSequence based on an
existing gapped sequence (with several blocks), the blocks were lost. 

  After checking the path of code execution it appeared that for some
reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified
SimpleGappedSequence.java to include an overloaded constructor also for the
descendant class, identical to the other constructor but with a
GappedSequence argument:

  public SimpleGappedSequence(GappedSequence seq) {
    super(seq);
    this.sequence = seq;
    createOnUnderlying = false;
  }
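
  A likely explanation for the old constructor being chosen, offered as a
sketch rather than a confirmed reading of the BioJava source: Java selects
among overloaded constructors at compile time, from the declared type of the
argument and not from its runtime type. Inside a constructor declared as
SimpleGappedSequence(Sequence seq), the call super(seq) only sees a Sequence,
so it binds to the SymbolList overload even when the object passed in is
gapped. The class names below are hypothetical stand-ins, not BioJava classes.

```java
public class OverloadDemo {
    // Hypothetical stand-ins for SymbolList and GappedSymbolList.
    interface SymbolListLike {}
    interface GappedLike extends SymbolListLike {}
    static class Gapped implements GappedLike {}

    // Stand-in for a parent class with two overloaded constructors.
    static class Parent {
        final String chosen;
        Parent(SymbolListLike s) { chosen = "plain"; }
        Parent(GappedLike g) { chosen = "gapped"; }
    }

    // Only one constructor: super(s) is resolved against the declared
    // type SymbolListLike, so the gapped overload is never selected.
    static class ChildOneCtor extends Parent {
        ChildOneCtor(SymbolListLike s) { super(s); }
    }

    // A second, more specific constructor lets the compiler pick the
    // gapped overload of the parent as well.
    static class ChildTwoCtors extends Parent {
        ChildTwoCtors(SymbolListLike s) { super(s); }
        ChildTwoCtors(GappedLike g) { super(g); }
    }

    public static void main(String[] args) {
        GappedLike g = new Gapped();
        System.out.println(new ChildOneCtor(g).chosen);   // prints "plain"
        System.out.println(new ChildTwoCtors(g).chosen);  // prints "gapped"
    }
}
```

  This is why adding the overloaded constructor to the descendant class
changes which parent constructor runs.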

  Now, the correct parent constructor
(SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are
two other problems with the new SimpleGappedSymbolList constructor that
need to be corrected for it to work as expected. First, the initial
introduction of a single, large block is missing from the new code, so
insert:

  Block b = new Block(1, length, 1, length);
  blocks.add(b);

  Secondly, the code for transferring the gaps from the sequence string needs
to use two separate indices; otherwise the gaps will be placed wrongly
because their positions are affected by previously inserted gaps:

  int n=1;
  for(int i=1;i<=this.length();i++) {
    if(this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
      this.addGapInSource(n);
    else
      n++;
  }

  In other words, the index giving the position of the gaps should only
increment when there are NO gaps at the corresponding position in the gapped
string.
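
  The two-index idea can be tried out on plain strings, independent of
BioJava. The sketch below is only an illustration under that assumption: the
class and method names are made up, and a String stands in for the symbol
list. It records the source position before which each gap is inserted, then
rebuilds the gapped string from those positions to check the round trip.

```java
import java.util.ArrayList;
import java.util.List;

public class GapTransferSketch {

    // Walk the gapped string with index i; n is the 1-based position in the
    // ungapped source and only advances on non-gap characters.
    static List<Integer> gapSourcePositions(String gapped, char gap) {
        List<Integer> positions = new ArrayList<>();
        int n = 1;
        for (int i = 1; i <= gapped.length(); i++) {
            if (gapped.charAt(i - 1) == gap) {
                positions.add(n);   // gap goes before source position n
            } else {
                n++;
            }
        }
        return positions;
    }

    // Rebuild the gapped string from the ungapped source and the sorted
    // list of source positions produced above.
    static String applyGaps(String ungapped, List<Integer> positions, char gap) {
        StringBuilder out = new StringBuilder();
        int next = 0;
        for (int n = 1; n <= ungapped.length() + 1; n++) {
            while (next < positions.size() && positions.get(next) == n) {
                out.append(gap);
                next++;
            }
            if (n <= ungapped.length()) {
                out.append(ungapped.charAt(n - 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> p = gapSourcePositions("MSE--KLMPRT---TWAKG", '-');
        System.out.println(p);                                  // [4, 4, 10, 10, 10]
        System.out.println(applyGaps("MSEKLMPRTTWAKG", p, '-')); // MSE--KLMPRT---TWAKG
    }
}
```

  With a single index, the later gaps would be recorded at their gapped
positions (12, 13, 14) instead of source position 10, which is the
misplacement described above.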

  Following these changes, the GappedSequenceTest program from last week now
works as expected:

 aSymbolList = MSE--KLMPRT---TWAKG
 aSequence   = MSE--KLMPRT---TWAKG

 Gaps are not parsed when a SimpleGappedSequence is constructed from a 
 gapped Sequence object:
 aGapped     = MSE--KLMPRT---TWAKG
 Gapped position 10 = Plain position 10

 aSymbolList = MSEKLMPRTTWAKG
 aSequence   = MSEKLMPRTTWAKG

 Gaps introduced through addGapsInSource work ok:
 aGapped     = MS--EKLMPR---TTWAKG
 Gapped position 10 = Plain position 8

 Now a new SimpleGappedSequence object is created from the previous one:
 aGapped2    = MS--EKLMPR---TTWAKG
 Gapped position 10 = Plain position 8

  -- Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb


 -----Original Message-----
 From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
 bounces at lists.open-bio.org] On Behalf Of Richard Holland
 Sent: 18 November 2007 18:12
 To: Ditlev Egeskov Brodersen
 Cc: biojava-l at biojava.org
 Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence
 
 Interesting stuff. I'm not sure why it isn't working so I'll have to
 have
 a closer look.
 
 I'm currently on annual leave but will investigate when I return (Nov
 27th).
 
 cheers,
 Richard
 
 On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
  Hi Richard,
 
    I thought that was also correct what you say, but I can't get it to
  work.
  Below is a small test program to check this. First, I create a
  SimpleGappedSequence through Text with
  gaps-SymbolList-Sequence-GappedSequence. Gaps are there but not
  "understood", as expected. Next, I create the same sequence non-
 gapped in
  the above way, then introduce gaps with addGapsInSource. A gapped
 location
  is now properly translated to a non-gapped sequence position.
 Finally, I
  create a new SimpleGappedSequence based on the working one - as you
 can
  see
  the gaps are still there but not "understood"...
 
  aSymbolList = MSE--KLMPRT---TWAKG
  aSequence   = MSE--KLMPRT---TWAKG
 
  Gaps are not parsed when a SimpleGappedSequence is constructed from a
  gapped
  Sequence object:
  aGapped     = MSE--KLMPRT---TWAKG
  Gapped position 10 = Plain position 10
 
  aSymbolList = MSEKLMPRTTWAKG
  aSequence   = MSEKLMPRTTWAKG
 
  Gaps introduced through addGapsInSource work ok:
  aGapped     = MS--EKLMPR---TTWAKG
  Gapped position 10 = Plain position 8
 
  Now a new SimpleGappedSequence object is created from the previous
 one:
  aGapped2    = MS--EKLMPR---TTWAKG
  Gapped position 10 = Plain position 10
 
  This should have been compiled with the new biojava.jar of 161107
 (updated
  via CVS), but perhaps I made a mistake updating?
 
  Any clues?
 
  Thanks,
 
    Ditlev
 
  ---
 
  package gappedsequencetest;
 
  import org.biojava.bio.*;
  import org.biojava.bio.seq.*;
  import org.biojava.bio.seq.impl.*;
  import org.biojava.bio.symbol.*;
 
  public class Main {
 
      public static void main(String[] args) {
          SymbolList aSymbolList = null;
          try {
              aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
          }
          catch(BioException ex) {}
 
          System.out.println("aSymbolList = " + aSymbolList.seqString());
 
          Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
          System.out.println("aSequence   = " + aSequence.seqString() + "\n");
 
          SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
          System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
          System.out.println("aGapped     = " + aGapped.seqString());
          System.out.println("Gapped position 10 = Plain position " +
              aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
 
          try {
              aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
          }
          catch(BioException ex) {}
 
          System.out.println("aSymbolList = " + aSymbolList.seqString());
 
          aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
          System.out.println("aSequence   = " + aSequence.seqString() + "\n");
 
          aGapped = new SimpleGappedSequence(aSequence);
          aGapped.addGapsInSource(9, 3);
          aGapped.addGapsInSource(3, 2);
          System.out.println("Gaps introduced through addGapsInSource work ok:");
          System.out.println("aGapped     = " + aGapped.seqString());
          System.out.println("Gapped position 10 = Plain position " +
              aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
 
          SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
          System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
          System.out.println("aGapped2    = " + aGapped2.seqString());
          System.out.println("Gapped position 10 = Plain position " +
              aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
      }
 
  }
 
  --
 
  Ditlev Egeskov Brodersen
  Lektor
  Bakkefaldet 30, Hasle
  8210 Århus V
 
  www.lindeman-brodersen.dk
 
 
  -----Original Message-----
  From: Richard Holland [mailto:holland at ebi.ac.uk]
  Sent: 16 November 2007 13:46
  To: Ditlev Egeskov Brodersen
  Cc: biojava-l at biojava.org
  Subject: Re: Wrapping SimpleGappedSequence
 
  -----BEGIN PGP SIGNED MESSAGE-----
  Hash: SHA1
 
  SimpleGappedSequence extends SimpleGappedSymbolList, and the
  constructor
  delegates to the SimpleGappedSymbolList constructor.
 
  When you extend SimpleGappedSequence you should delegate in your new
  constructor to the existing SimpleGappedSequence constructor, which
 in
  turn will delegate as above and preserve the gaps.
 
  By passing any object which implements GappedSymbolList to the
  SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
  SimpleGappedSymbolList, it will automatically choose the new
  constructor
  from SimpleGappedSymbolList which you hopefully should be able to
 see
  in
  the code you have just checked out. If passed any other
  non-GappedSymbolList object, it will use the old constructor that
  already existed from before.
 
  cheers,
  Richard
 
  Ditlev Egeskov Brodersen wrote:
   Hi again,
  
     I updated CVS and got the new SimpleGappedSymbolList class, but
  there
   seems to be no changes to the SimpleGappedSequence class, which is
  the one I
   need to extend...have I missed something?
  
     Ditlev
  
   --
  
   Ditlev E. Brodersen, Ph.D.
   Lektor, Associate Professor
  
   Department of Molecular Biology   Office:  +45 89425259
   University of Aarhus              Lab:     +45 89425022
   Gustav Wieds Vej 10c              Fax:     +45 86123178
   DK-8000 Aarhus C                  Email:   deb at mb.au.dk
   Denmark                           Lab WWW: www.bioxray.dk/~deb
  
  
   -----Original Message-----
   From: Richard Holland [mailto:holland at ebi.ac.uk]
   Sent: 16 November 2007 11:47
   To: Ditlev Egeskov Brodersen
   Cc: biojava-l at biojava.org
   Subject: Re: Wrapping SimpleGappedSequence
  
   The easiest way is simply for me to alter the constructor to
   SimpleGappedSequence (and equivalently to SimpleGappedSymbolList)
 to
   copy all gaps if passed another instance of GappedSymbolList as
 the
   parameter. I've just done this in CVS so you should be able to
 update
   your copy and observe the new behaviour.
  
   cheers,
   Richard
  
   Ditlev Egeskov Brodersen wrote:
   Hi again,
  
     thanks for the info - will do the check just to be proper. I
  have
   another
   question: In my application, I would like to wrap the retrieved
   SimpleGappedSequence objects inside another object that extends
  the
   functionality with application-specific stuff. Ideally, I would
 do
   this by
   extending the SimpleGappedSequence object and create it by
 passing
   the
   SimpleGappedSequence from the alignment import to the
 constructor
  of
   the
   parent, like so:
  
     class AlignedSequence extends SimpleGappedSequence {
       public AlignedSequence(SimpleGappedSequence aGapped) {
         super(aGapped);
       }
  
       ..custom stuff..
     }
  
   However, the problem is that there is only one constructor for
 the
   SimpleGappedSequence, one which takes a simple Sequence object.
 I
  can
   pass
   the derived class alright, but all gap information is lost
 again,
   presumably
   because the SimpleGappedSequence constructor just takes out the
   seqString()
   and puts it into its own sequence object.
  
   Shouldn't the constructor of the SimpleGappedSequence class
  recognise
   when a
   derived (and gapped) sequence object is passed, and process it
   accordingly?
   As it stands, I am forced to include the SimpleGappedSequence
 as a
   private
   member of the AlignedSequence class, which is not near as nice
  since
   all
   statement using the class will have to do something like
  
     class AlignedSequence extends SimpleGappedSequence {
       private SimpleGappedSequence gapped_sequence;
  
       public AlignedSequence(SimpleGappedSequence aGapped) {
         gapped_sequence = aGapped;
       }
  
       public SimpleGappedSequence getGappedSequence() {
         return(gapped_sequence);
     }
  
       ..custom stuff..
     }
  
     ...
  
     AlignedSequence aAligned = new AlignedSequence(aGapped);
     aAligned.getGappedSequence().seqString();
  
   rather than simply:
  
     AlignedSequence aAligned = new AlignedSequence(aGapped);
     aAligned.seqString();
  
   In other words, is there any solution with the current setup
 that
   would
   allow me to extend SimpleGappedSequence and not loose the gap
   information?
   --  Ditlev
  
   --
  
   Ditlev E. Brodersen, Ph.D.
   Lektor, Associate Professor
  
   Department of Molecular Biology   Office:  +45 89425259
   University of Aarhus              Lab:     +45 89425022
   Gustav Wieds Vej 10c              Fax:     +45 86123178
   DK-8000 Aarhus C                  Email:   deb at mb.au.dk
   Denmark                           Lab WWW: www.bioxray.dk/~deb
  
  
   -----Original Message-----
   From: Richard Holland [mailto:holland at ebi.ac.uk]
   Sent: 16 November 2007 10:50
   To: Ditlev Egeskov Brodersen
   Cc: biojava-l at biojava.org
   Subject: Re: [Biojava-l] Parsing exising gaps
  
     The returned gapped sequences are all properly set up with
  gaps,
   name etc.
   But as for other users, I think there may be some problems,
  since
   the
   SimpleAlignment object only has a general symbol list
 iterator,
   the
   user
   will have to cast each statement extracting a sequence
 object,
  and
  
         SimpleSequence aSimple =
  (SimpleSequence)aSequences.next();
  
   returns an ClassCastException at run time. So old code might
  not
   run
   with
   the update as far as I can see.
   This is true. However, such code would be unsupported by us as
 the
   API
   clearly states that SimpleAlignment returns SymbolList
 instances,
  and
   does not make any guarantees about the exact implementation
  details
   of
   the objects it returns. To attempt to cast it to anything other
  than
   SymbolList would be a mistake! (Although actually it is now
  returning
   a
   guarantee of GappedSymbolList, which is what your code can now
  take
   advantage of). To assume it will return SimpleSequence is
 outside
  the
   behaviour defined by the API and therefore should not be relied
  upon.
  
   A more correct behaviour would be to test each item returned:
  
   	SymbolList symlist = aSequences.next();
   	if (symlist instanceof SimpleSequence) {
   		SimpleSequence seq = (SimpleSequence)symlist;
   		// Do simple-sequence stuff
   	} else {
   		// Do something else!
   	}
  
   In future, I will modify the API to change the SymbolList
  guarantee
   to
   a
   GappedSymbolList guarantee, but I can't do this right now as
 this
   really
   would break everyone's code!
  
   We are currently planning a redesign as you may be aware, so
  issues
   like
   this will hopefully be resolved as part of that process. For a
  start,
   if
   we use Java 5 generics in future as we plan, we can strictly
  specify
   what kinds of objects will be returned by things such as the
   alignment
   API, making it easier for us to enforce API-compliant behaviour
 in
   user's code.
  
   cheers,
   Richard
 
  - --
  Richard Holland (BioMart)
  EMBL EBI, Wellcome Trust Genome Campus,
  Hinxton, Cambridgeshire CB10 1SD, UK
  Tel. +44 (0)1223 494416
 
  http://www.biomart.org/
  http://www.biojava.org/
  -----BEGIN PGP SIGNATURE-----
  Version: GnuPG v1.4.2.2 (GNU/Linux)
  Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
 
  iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
  H0QV01cFreNi1SNLRPbhepg=
  =023Y
  -----END PGP SIGNATURE-----
 
 
 --
 Richard Holland
 BioMart (http://www.biomart.org/)
 EMBL-EBI
 Hinxton, Cambridgeshire CB10 1SD, UK
 
 _______________________________________________
 Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
 http://lists.open-bio.org/mailman/listinfo/biojava-l


From allank at sanbi.ac.za  Sun Nov 25 08:10:55 2007
From: allank at sanbi.ac.za (Allan Kamau)
Date: Sun, 25 Nov 2007 15:10:55 +0200
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
Message-ID: <4749745F.9070104@sanbi.ac.za>

Hi all,
I've searched without success for a conclusive answer to the "Program 
ncbi-blastn Version <some value> is not supported" error.
I would like to know which BLAST output format BioJava's blast-like 
parsing framework accepts, ideally with some examples (without the data) of 
how such BLAST output can be generated.
For example, I am using ncbi-blastn and generating the BLAST file 
(which BioJava rejects) as follows.

export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
export BLAST_REPORT_TABULAR=somesequence.blast.txt
export BLAST_REPORT_XML=somesequence.blast.xml
export BLAST_REPORT=somesequence.blast
export INPUT_FASTA=somesequence.fasta
export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence

date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i 
$REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d 
$REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o 
$BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i 
$INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;

Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied 
from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"

Then I get the error below.


[aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
Buildfile: build.xml

runBlastParser:
     [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17 
is not supported by the biojava blast-like parsing framework
     [java]     at 
org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
     [java]     at 
org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)

Allan.

From markjschreiber at gmail.com  Sun Nov 25 20:17:03 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 26 Nov 2007 09:17:03 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <4749745F.9070104@sanbi.ac.za>
References: <4749745F.9070104@sanbi.ac.za>
Message-ID: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>

Hi Allan -

I think the solution is to call setParserLazy() or some method with a
similar name (I don't have the API handy). This will prevent it from doing
the check.

The original idea of this method was that you could check against a list of
version numbers that people had validated.  I don't think this is a good
idea, as nothing is truly 100% validated and we haven't kept the list up to
date.  If there are no objections, I propose to deprecate this method
(and its opposite) and to change the default behaviour to
lazy checking.
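
To make the strict/lazy distinction concrete, here is a minimal,
self-contained sketch of the idea. It is not the BioJava implementation: the
class is hypothetical and the version list holds placeholder values. The
method names mirror what I believe BlastLikeSAXParser calls setModeLazy() and
setModeStrict(), but please verify those names against the Javadoc.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VersionCheckSketch {
    // Placeholder values only, not BioJava's real list of validated versions.
    private final Set<String> validated =
            new HashSet<>(Arrays.asList("1.0.0", "1.0.1"));

    // Strict by default, matching the behaviour reported in this thread.
    private boolean lazy = false;

    void setModeLazy()   { lazy = true; }
    void setModeStrict() { lazy = false; }

    // Returns true if parsing may proceed; in strict mode an unknown
    // version is rejected with an exception.
    boolean checkVersion(String program, String version) {
        if (lazy || validated.contains(version)) {
            return true;
        }
        throw new IllegalStateException(
                "Program " + program + " Version " + version + " is not supported");
    }
}
```

In lazy mode the parser simply attempts the parse and leaves it to the user
to judge whether the output of an unlisted version was read correctly.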

Best regards.

- Mark


On 11/25/07, Allan Kamau <allank at sanbi.ac.za> wrote:
>
> Hi all,
> I've searched for a conclusive answer to the "Program ncbi-blastn
> Version <some value> is not supported" without success.
> I would like to know format of the blast output the Biojava's blast-like
> parsing framework likes, including some examples (without the data) of
> how such blast output may be created.
> For example, I am using ncbi-blastn and I am generating the blast file
> (which Biojava doesn't like) as follows.
>
> export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
> export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
> export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
> export BLAST_REPORT_TABULAR=somesequence.blast.txt
> export BLAST_REPORT_XML=somesequence.blast.xml
> export BLAST_REPORT=somesequence.blast
> export INPUT_FASTA=somesequence.fasta
> export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
>
> date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i
> $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d
> $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o
> $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i
> $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;
>
> Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
> from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
>
> Then I get the error below.
>
>
> [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
> Buildfile: build.xml
>
> runBlastParser:
>     [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17
> is not supported by the biojava blast-like parsing framework
>     [java]     at
> org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(
> BlastLikeSAXParser.java:241)
>     [java]     at
> org.biojava.bio.program.sax.BlastLikeSAXParser.parse(
> BlastLikeSAXParser.java:160)
>
> Allan.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From holland at ebi.ac.uk  Mon Nov 26 03:55:56 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 08:55:56 +0000
Subject: [Biojava-l] Applet not able to find DNATools class.
In-Reply-To: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>
References: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>
Message-ID: <474A8A1C.4020901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sounds like either a classpath problem (in which case check your
classpath to ensure all parts of biojava are definitely on it) or a
broken biojava.jar (in which case you need to recompile/redownload it).
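
One quick way to tell the two cases apart is to try loading the class by name
from the same classpath. The helper below is only a sketch with a made-up
class name; it uses the standard Class.forName call, which also runs the
class's static initialiser. Note that the message "Could not initialize
class" usually means the static initialiser already failed once, so the
first exception in the Java console is the one worth reading.

```java
public class ClasspathCheck {

    // Returns true if the named class can be found and initialised on the
    // current classpath, false if it is missing or fails to initialise.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError and
            // ExceptionInInitializerError.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.biojava.bio.seq.DNATools";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it with the same jars on the classpath as the applet uses should show
whether DNATools itself is reachable.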

cheers,
Richard

Abhinav Ram Karhu wrote:
> Hello all,
> I am having an error while loading the applet.
> 
> I am getting the following stack trace.
> 
> java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
> 	at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
> 	at Trace.init(Trace.java:161)
> 	at sun.applet.AppletPanel.run(Unknown Source)
> 	at java.lang.Thread.run(Unknown Source)
> 
> I have the directory structure in which I am having my class files , the php page and the biojava jar files together in one folder.
> 
> I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java.
> 
> My applet code in the php page looks like this:
> 
> <applet code="Trace.class"  archive="biojava-1.5.jar , bytecode.jar" height=800 width=800>
> 
> Please suggest if I am missing something.
> 
> Thanks in advance.
> 
> Abhinav
> 
> 
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSoob4C5LeMEKA/QRAsfkAJ9SlwIzDulzSDQpAfgh0alISRsplACcDqUx
uyQUEmRFEWTdnEHsm7k2lg0=
=SWHu
-----END PGP SIGNATURE-----

From holland at ebi.ac.uk  Mon Nov 26 07:55:23 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 12:55:23 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <003701c82aba$e85f4320$b91dc960$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>	<473C4EF4.5080301@ebi.ac.uk>
	<000a01c827c1$8c8e5a50$a5ab0ef0$@dk>	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>	<473D67AF.2020007@ebi.ac.uk>
	<000f01c82839$06722550$13566ff0$@au.d	<473D7521.9070603@ebi.ac.uk>
	<001801c8284d$b8c525e0$2a4f71a0$@au.dk>	<473D911F.2000303@ebi.ac.uk>
	<000901c829d0$daa54620$8fefd260$@dk>
	<48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
	<003701c82aba$e85f4320$b91dc960$@au.dk>
Message-ID: <474AC23B.3080500@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have made the changes you suggest below in CVS. Hopefully it will work
for you now.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Dear Richard and all,
> 
>   I've been dissecting the delegation problem encountered when instantiating
> SimpleGappedSequence(Sequence) with an already gapped sequence. The
> constructor calls the parent SimpleGappedSymbolList(), which in Richard's
> CVS update of 161107 now contains a separate overloaded constructor for the
> gapped case:
> 
>   public SimpleGappedSymbolList(GappedSymbolList gappedSource)
> 
>   However, when instantiating a new SimpleGappedSequence based on an
> existing gapped sequence (with several blocks), the blocks were lost. 
> 
>   After checking the path of code execution it appeared that for some
> reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified
> SimpleGappedSequence.java to include an overloaded constructor also for the
> descendant class, identical to the other constructor but with a
> GappedSequence argument:
> 
>   public SimpleGappedSequence(GappedSequence seq) {
>     super(seq);
>     this.sequence = seq;
>     createOnUnderlying = false;
>   }
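
A likely reason the old constructor was picked: Java chooses between overloaded constructors at compile time, from the declared type of the argument, not from the runtime class of the object. A constructor that only accepts the general type therefore always forwards to the general parent constructor. The sketch below shows this with stand-in types; SymbolListLike, GappedLike, Parent and the two Child classes are hypothetical names, not BioJava classes.

```java
interface SymbolListLike {}
interface GappedLike extends SymbolListLike {}

class GappedImpl implements GappedLike {}

class Parent {
    final String chosen;

    Parent(SymbolListLike s) { chosen = "plain"; }   // the "old" constructor
    Parent(GappedLike g)     { chosen = "gapped"; }  // the "new" overload
}

// Only one constructor: inside it, s has the declared type SymbolListLike,
// so super(s) is bound to Parent(SymbolListLike) whatever object is passed.
class ChildOneCtor extends Parent {
    ChildOneCtor(SymbolListLike s) { super(s); }
}

// A matching overload in the subclass lets the gapped case reach
// Parent(GappedLike).
class ChildTwoCtors extends Parent {
    ChildTwoCtors(SymbolListLike s) { super(s); }
    ChildTwoCtors(GappedLike g)     { super(g); }
}

class OverloadDemo {
    public static void main(String[] args) {
        GappedImpl gapped = new GappedImpl();
        System.out.println(new ChildOneCtor(gapped).chosen);   // prints "plain"
        System.out.println(new ChildTwoCtors(gapped).chosen);  // prints "gapped"
    }
}
```

This matches the observation above that adding a GappedSequence overload to the descendant class was needed before the gapped parent constructor was reached.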
> 
>   Now, the correct parent constructor
> (SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are
> two other problems with the new SimpleGappedSymbolList constructor that
> need to be corrected for it to work as expected: First, the initial
> introduction of a single, large block is missing from the new code, so
> insert:
> 
>   Block b = new Block(1, length, 1, length);
>   blocks.add(b);
> 
>   Secondly, the code for transferring the gaps from the sequence string needs
> to use two separate indices, otherwise the gaps will be placed wrongly
> because their position is affected by previously inserted gaps:
> 
>   int n=1;
>   for(int i=1;i<=this.length();i++) {
>     if(this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
>       this.addGapInSource(n);
>     else
>       n++;
>   }
> 
>   In other words, the index giving the position of the gaps should only
> increment when there are NO gaps at the corresponding position in the gapped
> string.
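
The two-index idea can be checked without BioJava on plain strings. The sketch below walks a gapped string with one index (i) and keeps a second index (n) into the ungapped source, recording the source position at which each gap would be added. GapTransfer and its method names are hypothetical helpers for illustration only; positions are 1-based to match the output above.

```java
import java.util.ArrayList;
import java.util.List;

class GapTransfer {
    // For every gap character in the gapped string, returns the 1-based
    // source position the gap has to be inserted at.
    static List<Integer> gapInsertPositions(String gapped, char gap) {
        List<Integer> positions = new ArrayList<>();
        int n = 1;                                   // index into the ungapped source
        for (int i = 1; i <= gapped.length(); i++) { // index into the gapped string
            if (gapped.charAt(i - 1) == gap) {
                positions.add(n);                    // gap: n stays where it is
            } else {
                n++;                                 // residue: advance the source index
            }
        }
        return positions;
    }

    // Maps a 1-based gapped position to its 1-based source position,
    // or -1 if the gapped position holds a gap.
    static int gappedToSource(String gapped, char gap, int gappedPos) {
        if (gapped.charAt(gappedPos - 1) == gap) {
            return -1;
        }
        int n = 0;
        for (int i = 0; i < gappedPos; i++) {
            if (gapped.charAt(i) != gap) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String gapped = "MS--EKLMPR---TTWAKG";
        System.out.println(gapInsertPositions(gapped, '-'));  // [3, 3, 9, 9, 9]
        System.out.println(gappedToSource(gapped, '-', 10));  // 8
    }
}
```

The positions 3 and 9 are the same ones used in the addGapsInSource(3, 2) and addGapsInSource(9, 3) calls of the test program, and gapped position 10 maps to plain position 8 as in the output.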
> 
>   Following these changes, the GappedSequenceTest program from last week now
> works as expected:
> 
>  aSymbolList = MSE--KLMPRT---TWAKG
>  aSequence   = MSE--KLMPRT---TWAKG
> 
>  Gaps are not parsed when a SimpleGappedSequence is constructed from a 
>  gapped Sequence object:
>  aGapped     = MSE--KLMPRT---TWAKG
>  Gapped position 10 = Plain position 10
> 
>  aSymbolList = MSEKLMPRTTWAKG
>  aSequence   = MSEKLMPRTTWAKG
> 
>  Gaps introduced through addGapsInSource work ok:
>  aGapped     = MS--EKLMPR---TTWAKG
>  Gapped position 10 = Plain position 8
> 
>  Now a new SimpleGappedSequence object is created from the previous one:
>  aGapped2    = MS--EKLMPR---TTWAKG
>  Gapped position 10 = Plain position 8
> 
>   -- Ditlev
> 
> --
>  
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>  
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 
>  -----Original Message-----
>  From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Richard Holland
>  Sent: 18 November 2007 18:12
>  To: Ditlev Egeskov Brodersen
>  Cc: biojava-l at biojava.org
>  Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence
>  
>  Interesting stuff. I'm not sure why it isn't working so I'll have to
>  have a closer look.
>  
>  I'm currently on annual leave but will investigate when I return (Nov
>  27th).
>  
>  cheers,
>  Richard
>  
>  On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
>   Hi Richard,
>  
>     I also thought that what you say was correct, but I can't get it to
>   work. Below is a small test program to check this. First, I create a
>   SimpleGappedSequence through Text with
>   gaps-SymbolList-Sequence-GappedSequence. Gaps are there but not
>   "understood", as expected. Next, I create the same sequence non-gapped
>   in the above way, then introduce gaps with addGapsInSource. A gapped
>   location is now properly translated to a non-gapped sequence position.
>   Finally, I create a new SimpleGappedSequence based on the working one -
>   as you can see the gaps are still there but not "understood"...
>  
>   aSymbolList = MSE--KLMPRT---TWAKG
>   aSequence   = MSE--KLMPRT---TWAKG
>  
>   Gaps are not parsed when a SimpleGappedSequence is constructed from a
>   gapped Sequence object:
>   aGapped     = MSE--KLMPRT---TWAKG
>   Gapped position 10 = Plain position 10
>  
>   aSymbolList = MSEKLMPRTTWAKG
>   aSequence   = MSEKLMPRTTWAKG
>  
>   Gaps introduced through addGapsInSource work ok:
>   aGapped     = MS--EKLMPR---TTWAKG
>   Gapped position 10 = Plain position 8
>  
>   Now a new SimpleGappedSequence object is created from the previous one:
>   aGapped2    = MS--EKLMPR---TTWAKG
>   Gapped position 10 = Plain position 10
>  
>   This should have been compiled with the new biojava.jar of 161107
>   (updated via CVS), but perhaps I made a mistake updating?
>  
>   Any clues?
>  
>   Thanks,
>  
>     Ditlev
>  
>   ---
>  
>   package gappedsequencetest;
>  
>   import org.biojava.bio.*;
>   import org.biojava.bio.seq.*;
>   import org.biojava.bio.seq.impl.*;
>   import org.biojava.bio.symbol.*;
>  
>   public class Main {
>  
>       public static void main(String[] args) {
>           SymbolList aSymbolList = null;
>           try {
>               aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
>           }
>           catch(BioException ex) {}
>  
>           System.out.println("aSymbolList = " + aSymbolList.seqString());
>  
>           Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>           System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>  
>           SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
>           System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
>           System.out.println("aGapped     = " + aGapped.seqString());
>           System.out.println("Gapped position 10 = Plain position " +
>               aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>  
>           try {
>               aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
>           }
>           catch(BioException ex) {}
>  
>           System.out.println("aSymbolList = " + aSymbolList.seqString());
>  
>           aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>           System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>  
>           aGapped = new SimpleGappedSequence(aSequence);
>           aGapped.addGapsInSource(9, 3);
>           aGapped.addGapsInSource(3, 2);
>           System.out.println("Gaps introduced through addGapsInSource work ok:");
>           System.out.println("aGapped     = " + aGapped.seqString());
>           System.out.println("Gapped position 10 = Plain position " +
>               aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>  
>           SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
>           System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
>           System.out.println("aGapped2    = " + aGapped2.seqString());
>           System.out.println("Gapped position 10 = Plain position " +
>               aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>       }
>  
>   }
>  
>   --
>  
>   Ditlev Egeskov Brodersen
>   Lektor
>   Bakkefaldet 30, Hasle
>   8210 Århus V
>  
>   www.lindeman-brodersen.dk
>  
>  
>   -----Original Message-----
>   From: Richard Holland [mailto:holland at ebi.ac.uk]
>   Sent: 16 November 2007 13:46
>   To: Ditlev Egeskov Brodersen
>   Cc: biojava-l at biojava.org
>   Subject: Re: Wrapping SimpleGappedSequence
>  
> SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
> delegates to the SimpleGappedSymbolList constructor.
> 
> When you extend SimpleGappedSequence you should delegate in your new
> constructor to the existing SimpleGappedSequence constructor, which in
> turn will delegate as above and preserve the gaps.
> 
> By passing any object which implements GappedSymbolList to the
> SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
> SimpleGappedSymbolList, it will automatically choose the new constructor
> from SimpleGappedSymbolList which you hopefully should be able to see in
> the code you have just checked out. If passed any other
> non-GappedSymbolList object, it will use the old constructor that
> already existed from before.
> 
> cheers,
> Richard
> 
> Ditlev Egeskov Brodersen wrote:
>  Hi again,
> 
>    I updated CVS and got the new SimpleGappedSymbolList class, but there
>  seem to be no changes to the SimpleGappedSequence class, which is the
>  one I need to extend...have I missed something?
> 
>    Ditlev
> 
> 
> 
>  -----Original Message-----
>  From: Richard Holland [mailto:holland at ebi.ac.uk]
>  Sent: 16 November 2007 11:47
>  To: Ditlev Egeskov Brodersen
>  Cc: biojava-l at biojava.org
>  Subject: Re: Wrapping SimpleGappedSequence
> 
>  The easiest way is simply for me to alter the constructor to
>  SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to
>  copy all gaps if passed another instance of GappedSymbolList as the
>  parameter. I've just done this in CVS so you should be able to update
>  your copy and observe the new behaviour.
> 
>  cheers,
>  Richard
> 
>  Ditlev Egeskov Brodersen wrote:
>  Hi again,
> 
>    thanks for the info - will do the check just to be proper. I have
>  another question: In my application, I would like to wrap the retrieved
>  SimpleGappedSequence objects inside another object that extends the
>  functionality with application-specific stuff. Ideally, I would do this
>  by extending the SimpleGappedSequence object and create it by passing
>  the SimpleGappedSequence from the alignment import to the constructor of
>  the parent, like so:
> 
>    class AlignedSequence extends SimpleGappedSequence {
>      public AlignedSequence(SimpleGappedSequence aGapped) {
>        super(aGapped);
>      }
> 
>      ..custom stuff..
>    }
> 
>  However, the problem is that there is only one constructor for the
>  SimpleGappedSequence, one which takes a simple Sequence object. I can
>  pass the derived class alright, but all gap information is lost again,
>  presumably because the SimpleGappedSequence constructor just takes out
>  the seqString() and puts it into its own sequence object.
> 
>  Shouldn't the constructor of the SimpleGappedSequence class recognise
>  when a derived (and gapped) sequence object is passed, and process it
>  accordingly? As it stands, I am forced to include the
>  SimpleGappedSequence as a private member of the AlignedSequence class,
>  which is not nearly as nice since all statements using the class will
>  have to do something like
> 
>    class AlignedSequence {
>      private SimpleGappedSequence gapped_sequence;
> 
>      public AlignedSequence(SimpleGappedSequence aGapped) {
>        gapped_sequence = aGapped;
>      }
> 
>      public SimpleGappedSequence getGappedSequence() {
>        return(gapped_sequence);
>      }
> 
>      ..custom stuff..
>    }
> 
>    ...
> 
>    AlignedSequence aAligned = new AlignedSequence(aGapped);
>    aAligned.getGappedSequence().seqString();
> 
>  rather than simply:
> 
>    AlignedSequence aAligned = new AlignedSequence(aGapped);
>    aAligned.seqString();
> 
>  In other words, is there any solution with the current setup that would
>  allow me to extend SimpleGappedSequence and not lose the gap
>  information?
>  --  Ditlev
> 
> 
> 
>  -----Original Message-----
>  From: Richard Holland [mailto:holland at ebi.ac.uk]
>  Sent: 16 November 2007 10:50
>  To: Ditlev Egeskov Brodersen
>  Cc: biojava-l at biojava.org
>  Subject: Re: [Biojava-l] Parsing exising gaps
> 
>    The returned gapped sequences are all properly set up with gaps, name
>  etc. But as for other users, I think there may be some problems, since
>  the SimpleAlignment object only has a general symbol list iterator, the
>  user will have to cast each statement extracting a sequence object, and
> 
>        SimpleSequence aSimple = (SimpleSequence)aSequences.next();
> 
>  throws a ClassCastException at run time. So old code might not run
>  with the update as far as I can see.
> 
>  This is true. However, such code would be unsupported by us as the API
>  clearly states that SimpleAlignment returns SymbolList instances, and
>  does not make any guarantees about the exact implementation details of
>  the objects it returns. To attempt to cast it to anything other than
>  SymbolList would be a mistake! (Although actually it is now returning a
>  guarantee of GappedSymbolList, which is what your code can now take
>  advantage of). To assume it will return SimpleSequence is outside the
>  behaviour defined by the API and therefore should not be relied upon.
> 
>  A more correct behaviour would be to test each item returned:
> 
>  	SymbolList symlist = (SymbolList) aSequences.next();
>  	if (symlist instanceof SimpleSequence) {
>  		SimpleSequence seq = (SimpleSequence)symlist;
>  		// Do simple-sequence stuff
>  	} else {
>  		// Do something else!
>  	}
> 
>  In future, I will modify the API to change the SymbolList guarantee to
>  a GappedSymbolList guarantee, but I can't do this right now as this
>  really would break everyone's code!
> 
>  We are currently planning a redesign as you may be aware, so issues
>  like this will hopefully be resolved as part of that process. For a
>  start, if we use Java 5 generics in future as we plan, we can strictly
>  specify what kinds of objects will be returned by things such as the
>  alignment API, making it easier for us to enforce API-compliant
>  behaviour in users' code.
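
To illustrate the generics point, here is a rough sketch of what a typed alignment could look like. None of these types exist in BioJava; Row, GappedRow, GappedText and TypedAlignment are hypothetical stand-ins, and the eventual design may well differ. The point is only that the element type becomes part of the signature, so callers no longer cast what the iterator returns.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

interface Row {
    String seqString();
}

interface GappedRow extends Row {
    int gapCount();
}

final class GappedText implements GappedRow {
    private final String text;

    GappedText(String text) {
        this.text = text;
    }

    public String seqString() {
        return text;
    }

    public int gapCount() {
        int count = 0;
        for (char c : text.toCharArray()) {
            if (c == '-') {
                count++;
            }
        }
        return count;
    }
}

// The type parameter states exactly what the alignment hands back.
final class TypedAlignment<T extends Row> implements Iterable<T> {
    private final List<T> rows = new ArrayList<>();

    void add(T row) {
        rows.add(row);
    }

    public Iterator<T> iterator() {
        return rows.iterator();
    }
}

class GenericsDemo {
    static int totalGaps(TypedAlignment<GappedRow> alignment) {
        int total = 0;
        for (GappedRow row : alignment) {  // no cast, checked by the compiler
            total += row.gapCount();
        }
        return total;
    }

    public static void main(String[] args) {
        TypedAlignment<GappedRow> alignment = new TypedAlignment<>();
        alignment.add(new GappedText("MS--EK"));
        alignment.add(new GappedText("MSE-KL"));
        System.out.println(totalGaps(alignment));  // 3
    }
}
```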
> 
>  cheers,
>  Richard
> 

>  _______________________________________________
>  Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>  http://lists.open-bio.org/mailman/listinfo/biojava-l


- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSsI64C5LeMEKA/QRAg21AKCieEvT2KaWBFdqLFUtxazhHXmD2wCgiRwk
Bz79hrJxD/eZrrCUXUAh758=
=0Jpp
-----END PGP SIGNATURE-----

From allank at sanbi.ac.za  Mon Nov 26 07:02:56 2007
From: allank at sanbi.ac.za (Allan Kamau)
Date: Mon, 26 Nov 2007 14:02:56 +0200
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
Message-ID: <474AB5F0.6040802@sanbi.ac.za>

Hi Mark,
Thank you for your reply.
Calling setModeLazy() method of the object of type BlastLikeSAXParser 
did provide the cure.
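
For anyone hitting the same error, here is a small model of what strict and lazy modes appear to mean, going by Mark's description of a validated version list. This is not BioJava code: VersionGate is a hypothetical class, "2.2.11" is a made-up entry in the validated list, and the real check lives inside BlastLikeSAXParser. In BioJava itself the only call I needed was setModeLazy() on the parser instance before parsing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class VersionGate {
    private final Set<String> validated;
    private boolean lazy = false;  // strict is the default discussed in this thread

    VersionGate(String... validatedVersions) {
        this.validated = new HashSet<>(Arrays.asList(validatedVersions));
    }

    void setModeLazy() {
        lazy = true;
    }

    void setModeStrict() {
        lazy = false;
    }

    // Strict mode rejects versions that are not on the validated list;
    // lazy mode skips the check and lets parsing go ahead.
    void check(String program, String version) {
        if (!lazy && !validated.contains(version)) {
            throw new IllegalStateException(
                "Program " + program + " Version " + version + " is not supported");
        }
    }

    public static void main(String[] args) {
        VersionGate gate = new VersionGate("2.2.11");  // hypothetical validated version
        try {
            gate.check("ncbi-blastn", "2.2.17");
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        gate.setModeLazy();
        gate.check("ncbi-blastn", "2.2.17");
        System.out.println("lazy: accepted");
    }
}
```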

Allan.

Mark Schreiber wrote:
> Hi Allan -
>  
> I think the solution is to call the setParserLazy() or some method 
> with a similar name (I don't have the API handy). This will prevent it 
> doing the check.
>  
> The original idea of this method was you could check against a list of 
> version numbers that people had validated.  I don't think this is a 
> good idea as nothing is truly 100% validated and we haven't kept the 
> list up to date.  If there are no objections I would propose to make 
> this method deprecated (and its opposite method) and change the 
> default behaviour to lazy checking.
>  
> Best regards.
>  
> - Mark
>
>  
> On 11/25/07, Allan Kamau <allank at sanbi.ac.za> wrote:
>
>     Hi all,
>     I've searched for a conclusive answer to the "Program ncbi-blastn
>     Version <some value> is not supported" error without success.
>     I would like to know which BLAST output format BioJava's blast-like
>     parsing framework accepts, including some examples (without the data)
>     of how such BLAST output may be created.
>     For example, I am using ncbi-blastn and I am generating the BLAST
>     file (which BioJava doesn't like) as follows.
>
>     export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
>     export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
>     export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
>     export BLAST_REPORT_TABULAR=somesequence.blast.txt
>     export BLAST_REPORT_XML=somesequence.blast.xml
>     export BLAST_REPORT=somesequence.blast
>     export INPUT_FASTA=somesequence.fasta
>     export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
>
>     date
>     time cd $WORK_DIR
>     cp $REFERENCES_FASTA_NAME .
>     $FORMATDB -i $REFERENCES_FASTA_NAME -p F -o T
>     echo `pwd`
>     $BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o $BLAST_REPORT_TABULAR
>     $BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML
>     date
>
>     Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
>     from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
>
>     Then I get the error below.
>
>
>     [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
>     Buildfile: build.xml
>
>     runBlastParser:
>         [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17 is not supported by the biojava blast-like parsing framework
>         [java]     at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
>         [java]     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
>
>     Allan.


From markjschreiber at gmail.com  Mon Nov 26 22:16:35 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 27 Nov 2007 11:16:35 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <474AB5F0.6040802@sanbi.ac.za>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
	<474AB5F0.6040802@sanbi.ac.za>
Message-ID: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>

Hi -

Does anyone mind if I change the default behavior to lazy parsing?
Technically this would be a break in backwards compatibility (although
only if you have a program that relies on strict parsing).

Last chance to complain.

- Mark

On Nov 26, 2007 8:02 PM, Allan Kamau <allank at sanbi.ac.za> wrote:
> Hi Mark,
> Thank you for your reply.
> Calling setModeLazy() method of the object of type BlastLikeSAXParser
> did provide the cure.
>
> Allan.

From holland at ebi.ac.uk  Tue Nov 27 03:40:10 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 08:40:10 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
Message-ID: <474BD7EA.4040604@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sounds good to me.

Mark Schreiber wrote:
> Hi -
> 
> Does anyone mind if I change the default behavior to lazy parsing?
> Technically this would be a break in backwards compatibility (although
> only if you have a program that relies on strict parsing).
> 
> Last chance to complain.
> 
> - Mark

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS9fq4C5LeMEKA/QRAm/3AJ9hi2yrSyeK6a3nXtObyJ2MAk0Y1QCeL5HT
iYQc6HTdm6fJ+Lcfssnd34g=
=VuJJ
-----END PGP SIGNATURE-----

From ap3 at sanger.ac.uk  Tue Nov 27 05:24:49 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 27 Nov 2007 10:24:49 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
	<474AB5F0.6040802@sanbi.ac.za>
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
Message-ID: <C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>

> Does anyone mind if I change the default behavior to lazy parsing?

Hi Mark,

I think this is a good idea.

We had a couple of questions and feature requests recently regarding
the blast parser, so I wonder if we should also have a look at how to
make it (and the documentation) better during the V3 discussion...

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
                               +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From holland at ebi.ac.uk  Tue Nov 27 06:01:33 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 11:01:33 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
Message-ID: <474BF90D.3070003@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> we had a couple of questions and feature requests recently regarding  
> the blast parser, so I wonder if we should
> have a look at how to make it (and the documentation) better also  
> during the V3 discussion...

A rethink of the blast parser is definitely a good idea. It's starting
to need more work than before as the various subtly different file
formats used by the most recent versions and variants of blast have
evolved beyond the tolerance limits of the existing parser. It also
needs to be made simpler to use.

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb
Nu2SZ7yV4e0rUmyIBxNYTJU=
=9nHg
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk  Tue Nov 27 06:11:30 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 11:11:30 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <474BF90D.3070003@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk>
Message-ID: <474BFB62.3040203@ebi.ac.uk>

What format options are there from blast? Just thinking: if it supports 
CIGAR or something like that, would we be better off providing a parser 
for that format & saying that we do not support the traditional blast 
output? That said, it doesn't help us when that format changes, so maybe 
what is needed is a way to push out parser changes without requiring a 
full BioJava release (v3 discussion) ...

Andy


From holland at ebi.ac.uk  Tue Nov 27 06:18:59 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 11:18:59 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <474BFB62.3040203@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>
Message-ID: <474BFD23.8060005@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> What format options are there from blast? Just thinking if it supports
> CIGAR or something like that are we better providing a parser for that
> format & saying that we do not support the traditional blast output?
> That said it doesn't help us when that format changes so maybe what is
> needed is a way to push out parser changes without requiring a full
> biojava release (v3 discussion) ...

Exactly! So the modular idea would work nicely here - we could have a
blast module and only update that single module (which would be its own
JAR) whenever the format changes. In a way, BioJava releases as such
would no longer happen, except maybe for some kind of core BioJava
module. Everything would be done in terms of individual module+JAR
releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
for Phylogenetic tools, one for translation/transcription, etc. etc.

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS/0j4C5LeMEKA/QRAkQuAJ9B+mmV7vo9QuFYwEgmnHczExyXqwCfamIx
uPFQKdbXRC7pwC6lM5aBcJk=
=F3PD
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk  Tue Nov 27 06:47:54 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 11:47:54 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <474BFD23.8060005@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>
	<474BFD23.8060005@ebi.ac.uk>
Message-ID: <474C03EA.4070706@ebi.ac.uk>

I think Groovy have adopted a similar system recently & have guidelines 
for how each module should behave (dependencies, build system etc). This 
enforces the idea that a module, whilst not part of the core project, 
must behave in the same manner as the core does. I do like the idea that 
we can have a core BioJava with things added around it; it might 
encourage other users to start developing their own modules for any 
formats/purposes they want.
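
To make the core-plus-modules idea a bit more concrete, here is a minimal
sketch of how a core could discover parsers shipped in separate JARs. All
names in it (FormatParser, ParserRegistry, BlastXmlParser, the "blast-xml"
label and the version check) are hypothetical and only for illustration;
nothing like this exists in BioJava as far as this thread says. In a real
set-up the registration step could be automated, for instance with
java.util.ServiceLoader from Java 6, but that is an assumption and not a
decision made here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical contract every format module would have to implement. */
interface FormatParser {
    String formatName();
    boolean supportsVersion(String version);
}

/** Hypothetical core registry: modules add themselves, users look them up. */
final class ParserRegistry {
    private final Map<String, FormatParser> parsers =
            new HashMap<String, FormatParser>();

    void register(FormatParser parser) {
        parsers.put(parser.formatName(), parser);
    }

    FormatParser lookup(String format) {
        FormatParser parser = parsers.get(format);
        if (parser == null) {
            throw new IllegalArgumentException(
                    "No module installed for format: " + format);
        }
        return parser;
    }
}

/** Stand-in for a parser that would live in its own blast module JAR. */
final class BlastXmlParser implements FormatParser {
    public String formatName() {
        return "blast-xml";
    }

    public boolean supportsVersion(String version) {
        // Illustrative rule only: accept the 2.2.x series.
        return version.startsWith("2.2.");
    }
}
```

With something along these lines, supporting a new blast release would mean
shipping a new blast JAR with an updated supportsVersion(), while the core
and the other modules stay untouched.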


From markjschreiber at gmail.com  Tue Nov 27 09:48:12 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 27 Nov 2007 22:48:12 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <474C03EA.4070706@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
	<474AB5F0.6040802@sanbi.ac.za>
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>
	<474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk>
Message-ID: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>

For a long time now my feeling has been that we should *only* support
the XML version of blast output.  The other formats are too brittle to
be easy to parse.  I also feel similarly about Genbank, EMBL, etc. That
may be an extreme view, but the power of generic XML parsers and things
like XPath really makes the XML formats look very attractive.

- Mark



From ayates at ebi.ac.uk  Tue Nov 27 10:16:12 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 15:16:12 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>	
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	
	<474AB5F0.6040802@sanbi.ac.za>	
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	
	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>	
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>	
	<474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk>
	<93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>
Message-ID: <474C34BC.4070209@ebi.ac.uk>

I was always under the impression that blast's XML output was nearly as 
hard to parse as the flat file format, but I do agree that using XML 
whenever we can would make writing parsers a lot easier (especially if 
there are SAX based XPath libraries available). Actually this brings up 
a good question about development of this type of parser. The majority 
of XPath supporting libraries are DOM based, which will mean large 
memory usage in some situations but overall provides an easier coding 
experience (and hopefully reduces our chances of creating bugs). Or 
should we code to the edge cases of someone trying to parse a 1GB XML 
file? Personally I'd favour the former.

Going back to the original topic, there are going to be situations where 
people want the flat file parsers/writers & I think it's a valid point 
to say this is where BioJava is meant to come in & help a developer. 
After all, XML is a computer science problem whereas parsing an EMBL 
flat file or blast output is a bioinformatics problem.

Andy


From markjschreiber at gmail.com  Tue Nov 27 22:34:38 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 28 Nov 2007 11:34:38 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
Message-ID: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>

Hi -

I think in most cases huge XML files in bioinformatics result from a
single XML document containing many repeated elements, e.g. a BLAST XML
output with several hits or a GenBankXML with many Sequences.  A nice
approach I have seen for dealing with these is to use SAX to read over
the file and, every time it comes to one of those elements, delegate to
a DOM object.  You then parse the bits of the DOM you want with XPath,
or convert them to objects or something, and when you are finished with
that entry everything gets garbage collected and the SAX parser moves
to the next element and repeats the whole process.  This is a hybrid
of event based parsing and object-model based parsing which could let
you deal efficiently with huge files.
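
The hybrid described above can be sketched roughly as follows, using only
the SAX, DOM and XPath classes that ship with the JDK. This is an
illustration under assumptions, not an existing BioJava class: the names
HybridSaxDom, RecordHandler and extract are made up, and the element names
used with it ("Hit", "Hit_id") refer to a simplified BLAST-like sample, not
a full BLAST XML report.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class HybridSaxDom {

    /** Builds one small DOM per record element, queries it, then drops it. */
    static final class RecordHandler extends DefaultHandler {
        private final String recordName;
        private final String xpathExpr;
        private final XPath xpath = XPathFactory.newInstance().newXPath();
        final List<String> results = new ArrayList<String>();
        private Document doc;  // DOM of the record being read, null outside records
        private Node current;  // insertion point inside that DOM

        RecordHandler(String recordName, String xpathExpr) {
            this.recordName = recordName;
            this.xpathExpr = xpathExpr;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (doc == null) {
                if (!qName.equals(recordName)) {
                    return; // outside any record: nothing is kept in memory
                }
                try {
                    doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().newDocument();
                } catch (Exception e) {
                    throw new SAXException(e);
                }
                current = doc;
            }
            Element element = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) {
                current.appendChild(
                        doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (doc == null) {
                return;
            }
            current = current.getParentNode();
            if (current == doc) { // the record element itself just closed
                try {
                    results.add(
                            xpath.evaluate(xpathExpr, doc.getDocumentElement()));
                } catch (Exception e) {
                    throw new SAXException(e);
                }
                doc = null;      // this record's DOM is now garbage
                current = null;
            }
        }
    }

    /** Evaluates xpathExpr relative to each recordName element in turn. */
    static List<String> extract(String xml, String recordName, String xpathExpr)
            throws Exception {
        RecordHandler handler = new RecordHandler(recordName, xpathExpr);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }
}
```

Calling HybridSaxDom.extract(xml, "Hit", "Hit_id") would collect the text of
each Hit_id child, with at most one Hit held as a DOM at any time. A real
parser would read from a stream and hand each record to a callback instead
of collecting strings.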

I think the BLAST XML has improved substantially, at least in terms of
validating against its own DTD.  The DTD itself may not be the best
design but that is always a matter of taste and if you are using XPath
to get the relevant bits you don't need to make a SAX parser jump
through hoops to get them.

I agree we will have to keep flat file parsers but we should strongly
encourage the use of XML where possible. It is simply easier to deal
with. Most biological flat-files were designed for Fortran and mainly
for human consumption. There is no obvious validation mechanism.
Notably everything in NCBI is derived from ASN.1; what you see in the
flat file is produced from there. I tend to think this means that the
ASN.1 is the holy gospel and what you get in the flat file is some
translation.  Ideally NCBI files should be parsed from the ASN.1, where
you can guarantee validation; the more practical alternative is to use
the XML, which you can at least validate against a DTD.

With XML we (BioJava) can say if it validates we will parse it and if
it doesn't we may not.  With flat files there are so many dodgy
variants we cannot say anything.  Because XML DTDs (or XSDs) have
versions it also makes it much easier to have parsers for different
versions, and the parsing machinery can figure out which is needed.
With flat files it is anyone's guess what version you are dealing with.
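
As a rough illustration of the "if it validates we will parse it" policy,
the standard JAXP DOM parser can be asked to validate against the DTD named
in the document. The class name DtdGate and the method validates are made
up for this sketch, and the tiny inline DTDs used to exercise it are
assumptions, not the real NCBI ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DtdGate {

    /** Returns true only if the document is well formed and valid against its DTD. */
    static boolean validates(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation, driven by the DOCTYPE
        DocumentBuilder builder = factory.newDocumentBuilder();
        // By default validity errors are only reported, so turn them into failures.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                // warnings do not make a document invalid
            }
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        });
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

A parser front end could call something like this first and refuse input
that fails, which gives a clear answer that flat files cannot give.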

Finally parsers can be auto-generated for XML if you have the DTD or
XSD. This often doesn't give you an ideal parser but it can be a
useful starting point for rapid development.

For Biojava v 3 I think we should concentrate on XML parsers first and
flat files second. <sigh>if only Fasta had an XML format</sigh>

- Mark


From ayates at ebi.ac.uk  Wed Nov 28 09:29:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 14:29:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
Message-ID: <474D7B3B.8030807@ebi.ac.uk>

Hi Mark,

Okay that sounds like a perfectly sensible way to deal with this. Is 
this kind of parsing model supported in Java5? I only ask as I've not 
done a lot of XML parsing with Java5; more with things like XOM (which I 
think offers a DOM only representation but I'm probably wrong).

That's good. There's not much point in having a format & a DTD/XSD and 
then having your files not conform to it.

I was thinking the exact same thing about ASN.1 (well that & it looks 
bleeding horrible to parse but that is an un-educated look at the format 
which I'm sure is as parsable as JSON & the like).

When it comes to flat file parsers I would be happier to provide 
implementations of the more common formats where a viable alternative is 
not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide 
similar output to the above have a chance to write their own 
parsers/formatters. This is very similar to the current situation but we 
just need to remove dependencies on statically located data structures 
(don't get rid of them completely just give users an option to not use 
them).

I'm not sure how much automatically generated parsers would help us. I 
guess it depends on the data model(s) we use if they are auto-parser 
friendly (which normally means POJO/JavaBean conventions including the 
no-args constructor).

Cool I don't want to exclude flat file parsers completely (if only 
because my group has an interest in BioJava being able to read & write 
flat files) :)

They decided to have HUPO-PSI Format instead :)

Andy


From dmitry.repchevski at bsc.es  Wed Nov 28 09:49:23 2007
From: dmitry.repchevski at bsc.es (Dmitry Repchevsky)
Date: Wed, 28 Nov 2007 15:49:23 +0100
Subject: [Biojava-l]  SAX, DOM, XPath and Flat files
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com
Message-ID: <474D7FF3.9010901@bsc.es>

Hello!

Actually there is also a StAX parser (http://en.wikipedia.org/wiki/StAX) 
which is faster than SAX and allows writing.
In JDK 6, apart from StAX, there is JAXB, which makes a perfect 
combination for parsing huge files.
You can go through the XML file using StAX until you reach the element 
you are interested in and unmarshal it to a POJO using JAXB.
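
A minimal sketch of the StAX half of this, using javax.xml.stream from the
JDK: the reader is pulled forward until it reaches the element of interest.
The class and method names are invented for illustration, and "Hit_id" is
just an example element from a simplified BLAST-like document. The JAXB
unmarshalling step is left out on purpose, since JAXB is no longer bundled
with recent JDKs (it was removed in Java 11) and would need an extra
dependency.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

final class StaxHitIds {

    /** Pulls events from the stream and keeps only the text of Hit_id elements. */
    static List<String> hitIds(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> ids = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "Hit_id".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag
                    ids.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return ids;
    }
}
```

Because the caller drives the loop, it is easy to stop early or to hand the
reader to another component once the interesting element is reached.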

Cheers,

Dmitry

From ayates at ebi.ac.uk  Wed Nov 28 10:37:03 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 15:37:03 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7FF3.9010901@bsc.es>
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com
	<474D7FF3.9010901@bsc.es>
Message-ID: <474D8B1F.8070301@ebi.ac.uk>

Hi Dmitry,

StAX still has higher memory consumption than SAX (though not as large 
as DOM) but yes, it is quite a good parser system & since we're moving 
towards the later versions of Java it may be a good idea to use it as 
our standard parser ... if it supports XPath (can't remember off the top 
of my head) :)

Andy


From markjschreiber at gmail.com  Thu Nov 29 21:28:58 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 30 Nov 2007 10:28:58 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7B3B.8030807@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
	<474D7B3B.8030807@ebi.ac.uk>
Message-ID: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>

Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
not XQuery although XPath is probably more important for this use.

The DOM model is a direct implementation of the W3C standard which
makes it a little awkward from a Java point of view but it is usable.
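
For what it is worth, the javax.xml.xpath package has shipped with the JDK
since Java 5, so XPath over the standard DOM needs no extra libraries. A
small sketch follows; the class name, method names and the BlastOutput/Hit
layout are assumptions made up for the example, not the real BLAST DTD
structure.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathOverDom {

    /** Parses the whole document into a W3C DOM tree. */
    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Text of the first Hit_id, or an empty string if there is none. */
    static String firstHitId(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/BlastOutput/Hit[1]/Hit_id", parse(xml));
    }

    /** Number of Hit elements anywhere in the document. */
    static int hitCount(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double count = (Double) xpath.evaluate(
                "count(//Hit)", parse(xml), XPathConstants.NUMBER);
        return count.intValue();
    }
}
```

The awkward part of the W3C DOM (NodeList loops, casts) mostly disappears
when XPath does the navigation, at the cost of holding the whole tree in
memory.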

Java 6 has StAX (the other one).

There are a few Java APIs for parsing ASN.1, mostly developed for the
telco industry. I've never really looked into which is best (anyone
experienced with this?) but we could probably use one to work directly
off NCBI ASN.1.

- Mark

On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Mark,
>
> Okay that sounds like a perfectly sensible way to deal with this. Is
> this kind of parsing model supported in Java5? I only ask as I've not
> done a lot of XML parsing with Java5; more with things like XOM (which I
> think offers a DOM only representation but I'm probably wrong).
>
> That's good. There's not a huge point to have a format & a DTD/XSD and
> then have your files not conform to it.
>
> I was thinking the exact same thing about ASN.1 (well that & it looks
> bleeding horrible to parse but that is an un-educated look at the format
> which I'm sure is a parsable as JSON & the alike).
>
> When it comes to flat file parsers I would be happier to provide
> implementations of the more common formats where a viable alternative is
> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
> similar output to the above have a chance to write their own
> parsers/formatters. This is very similar to the current situation but we
> just need to remove dependencies on statically located data structures
> (don't get rid of them completely just give users an option to not use
> them).
>
> I'm not sure how much automatically generated parsers would help us. I
> guess it depends on the data model(s) we use if they are auto-parser
> friendly (which normally means POJO/JavaBean conventions including the
> no-args constructor).
>
> Cool I don't want to exclude flat file parsers completely (if only
> because my group has an interest in BioJava being able to read & write
> flat files) :)
>
> They decided to have HUPO-PSI Format instead :)
>
> Andy
>
>
> Mark Schreiber wrote:
> > Hi -
> >
> > I think in most cases huge XML files in bioinformatics result from a
> > single XML containing multiple repetitive elements. Eg a BLAST XML
> > output with several hits or a GenBankXML with many Sequences.  A nice
> > approach I have seen for dealing with these is to use SAX to read over
> > the file and every time it comes to an element it delegates to a DOM
> > object.  You then parse the bits of the DOM you want with XPath or
> > convert to objects or something and then when you are finished with
> > that entry everything gets garbage collected and the SAX parser moves
> > to the next element and repeats the whole process.  This is a hybrid
> > of event based parsing and object-model based parsing which could let
> > you efficiently deal with huge files.
> >
> > I think the BLAST XML has improved substantially, at least in terms of
> > validating against it's own DTD.  The DTD itself may not be the best
> > design but that is always a matter of taste and if you are using XPath
> > to get the relevant bits you don't need to make a SAX parser jump
> > through hoops to get them.
> >
> > I agree we will have to keep flat file parsers but we should strongly
> > encourage the use of XML where possible. It is simply easier to deal
> > with. Most biological flat-files were designed for Fortran and mainly
> > for human consumption. There is no obvious validation mechanism.
> > Notably everything in NCBI is derived from ASN.1; what you see in the
> > flatfile is produced from there. I tend to think this means that the
> > ASN.1 is the holy gospel and what you get in the flat file is some
> > translation.  Ideally NCBI files should be parsed from the ASN.1 where
> > you can guarantee validation; the more practical alternative is to use
> > the XML which you can at least validate against a DTD.
> >
> > With XML we (Biojava) can say if it validates we will parse it and if
> > it doesn't we may not.  With flat files there are so many dodgy
> > variants we cannot say anything.  Because XML DTDs (or XSDs) have
> > versions it also makes it much easier to have parsers for different
> > versions and the parsing machinery can figure out which is needed.
> > With flat files it is anyone's guess what version you are dealing with.
> >
> > Finally parsers can be auto-generated for XML if you have the DTD or
> > XSD. This often doesn't give you an ideal parser but it can be a
> > useful starting point for rapid development.
> >
> > For Biojava v 3 I think we should concentrate on XML parsers first and
> > flat files second. <sigh>if only Fasta had an XML format</sigh>
> >
> > - Mark
> >
> > On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> I was always under the impression that blast's XML output was nearly as
> >> hard to parse as the flat file format but I do agree that using XML
> >> whenever we can would make writing parsers a lot easier
> >> (especially if there are SAX based XPath libraries available). Actually
> >> this brings up a good question about development of this type of parser.
> >> The majority of XPath supporting libraries are DOM based which will mean
> >> large memory usage in some situations but overall providing an easier
> >> coding experience (and hopefully reduce our chances of creating bugs).
> >> Or should we code to the edge cases of someone trying to parse a 1GB
> >> XML? Personally I'd favour the former.
> >>
> >> Going back to the original topic there are going to be situations where
> >> people want the flat file parsers/writers & I think it's a valid point
> >> to say this is where BioJava is meant to come in & help a developer.
> >> After all, XML is a computer science problem whereas parsing an EMBL flat
> >> file or blast output is a bioinformatics problem.
> >>
> >> Andy
> >>
> >>
> >> Mark Schreiber wrote:
> >>> For a long time now my feeling has been that we should *only* support
> >>> the XML version of blast output.  The other formats are too brittle to
> >>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc. That
> >>> may be an extreme view but the power of generic XML parsers and things
> >>> like XPath etc really make these formats look very attractive.
> >>>
> >>> - Mark
> >>>
> >>>
> >>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> I think Groovy have adopted a similar system recently & have guidelines
> >>>> for how each module should behave (dependencies, build system etc). This
> >>>> enforces the idea that a module whilst not part of the core project must
> >>>> behave in the same manner the core does. I do like the idea that we can
> >>>> have a core biojava & things get added around it & it might encourage
> >>>> other users to start developing their own modules for any
> >>>> formats/purpose they want.
> >>>>
> >>>> Richard Holland wrote:
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA1
> >>>>>
> >>>>>> What format options are there from blast? Just thinking if it supports
> >>>>>> CIGAR or something like that are we better providing a parser for that
> >>>>>> format & saying that we do not support the traditional blast output?
> >>>>>> That said it doesn't help us when that format changes so maybe what is
> >>>>>> needed is a way to push out parser changes without requiring a full
> >>>>>> biojava release (v3 discussion) ...
> >>>>> Exactly! So the modular idea would work nicely here - we could have a
> >>>>> blast module and only update that single module (which would be its own
> >>>>> JAR) whenever the format changes. In a way, BioJava releases as such
> >>>>> would no longer happen, except maybe for some kind of core BioJava
> >>>>> module. Everything would be done in terms of individual module+JAR
> >>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> >>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> >>>>>
> >>>>> cheers,
> >>>>> Richard
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
>
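
The "if it validates we will parse it" policy quoted above could be
sketched roughly as follows. This is only an illustration using the JAXP
parser bundled with the JDK: the class name DtdGate and the toy DTD in the
usage note are assumptions for the example, not BioJava code and not the
NCBI DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: accept a document only if it is valid against its own DOCTYPE. */
final class DtdGate {

    /** Returns true only if the XML is well-formed and valid against its DTD. */
    static boolean validates(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity problems are reported as recoverable errors,
            // so rethrow them to make the parse fail.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignore warnings */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed or invalid
        } catch (Exception e) {
            throw new IllegalStateException("XML parser unavailable or I/O failure", e);
        }
    }
}
```

With a toy internal DTD such as
<!DOCTYPE Hit [<!ELEMENT Hit (Hit_id)><!ELEMENT Hit_id (#PCDATA)>]>
a matching <Hit><Hit_id>P1</Hit_id></Hit> is accepted, while a document
containing an undeclared child element is rejected before any
format-specific parsing code runs.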

From heuermh at acm.org  Fri Nov 30 01:06:26 2007
From: heuermh at acm.org (Michael Heuer)
Date: Fri, 30 Nov 2007 01:06:26 -0500 (EST)
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0711300054110.26684-100000@shell3.shore.net>

Mark Schreiber wrote:

> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
>
> Java 6 has StAX (the other one).

Yeah, those jerks.  :)

I wrote a note to the spec author a few weeks before "the other" StAX was
announced at JavaOne, however long ago that was, asking them to reconsider
their project name.

Oh well.  We can still be the "original" StAX.

> http://stax.sf.net


May I kindly suggest that we skip all of this talk about XML and jump
straight to OWL?  ;)

> http://dev.isb-sib.ch/projects/uniprot-rdf/

   michael


From ayates at ebi.ac.uk  Fri Nov 30 04:18:45 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:18:45 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <Pine.GSO.4.44.0711300054110.26684-100000@shell3.shore.net>
References: <Pine.GSO.4.44.0711300054110.26684-100000@shell3.shore.net>
Message-ID: <474FD575.3060307@ebi.ac.uk>


Michael Heuer wrote:
> Mark Schreiber wrote:
> 
>> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
>> not XQuery although XPath is probably more important for this use.
>>
>> The DOM model is a direct implementation of the W3C standard which
>> makes it a little awkward from a java point of view but it is usable.
>>
>> Java 6 has StAX (the other one).
> 
> Yeah, those jerks.  :)
> 
> I wrote a note to the spec author a few weeks before "the other" StAX was
> announced at a Java One however long ago asking them to reconsider their
> project name.
> 
> Oh well.  We can still be the "original" StAX.
> 
>> http://stax.sf.net

Yup, I remember that issue from BOSC 2005 ... oh well, there's not a lot 
that can be done now. Maybe we could re-brand our StAX as StAX Original, 
a bit like the Coca-Cola & New Coke mess-up.

> 
> 
> May I kindly suggest skipping all of this talk about XML and have us
> jump straight to OWL?  ;)
> 
>> http://dev.isb-sib.ch/projects/uniprot-rdf/

Lol just let me fire up my semantic web engine first :).


From ayates at ebi.ac.uk  Fri Nov 30 04:26:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:26:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>	
	<474D7B3B.8030807@ebi.ac.uk>
	<93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <474FD737.9080801@ebi.ac.uk>

I think I've seen XPath used in other people's code in a 1.5 
code-base (in fact by one of the guys I work with). I've used Java's DOM 
before & it really isn't very nice; it is quite verbose. I'd prefer a 
better alternative, or a wrapper around the XML parsers, just to cut down 
on code chatter.
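
As a rough idea of what such a wrapper might look like, here is a minimal
sketch over the standard javax.xml.xpath API. XmlQuery and its methods are
hypothetical names made up for this example (not an existing BioJava or
JDK class), and the element names merely imitate BLAST XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical wrapper that hides the JAXP/DOM boilerplate behind two calls. */
final class XmlQuery {
    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    private XmlQuery(Document doc) {
        this.doc = doc;
    }

    /** Parses the whole string into a DOM; only sensible for small inputs. */
    static XmlQuery parse(String xml) {
        try {
            return new XmlQuery(DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** String value of the first match of the expression. */
    String text(String expression) {
        try {
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    /** Text content of every node matched by the expression, in document order. */
    List<String> texts(String expression) {
        try {
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }
}
```

Calling code then shrinks to something like
XmlQuery.parse(xml).texts("//Hit/Hit_id"), with the checked exceptions
and NodeList loops kept in one place.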

Wow, I've just visited http://asn1.elibel.tm.fr/links/ looking for these 
Java tools & I think I've gone cross-eyed with the sheer number of 
acronyms! You've gotta love something where adding a letter to ER 
gives you a new acronym (e.g. BER, DER, PER and XER). Does anyone on the 
list know of a good ASN.1 parser for Java, and should we support 
it (considering NCBI generate their DTD & XML from the ASN.1 
representation)?

Andy

Mark Schreiber wrote:
> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
> 
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
> 
> Java 6 has StAX (the other one).
> 
> There are a few java API's for parsing ASN.1 mostly developed for the
> telco industry, I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
> 
> - Mark
> 

From phidias51 at gmail.com  Fri Nov 30 13:30:50 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 30 Nov 2007 10:30:50 -0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474FD737.9080801@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
	<474D7B3B.8030807@ebi.ac.uk>
	<93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
	<474FD737.9080801@ebi.ac.uk>
Message-ID: <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>

There's a potential gotcha involved with XPath parsing.  If you use the
current implementation that ships with the Java 5 & 6 JDKs, it performs a
DOM parse on the whole document, even if you pass it a specific starting
node in the document.  I stumbled across this one the hard way when using
the hybrid approach that you mention.  This may be solved with another XPath
implementation such as Saxon.
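
One way to keep the hybrid approach while making sure XPath can only ever
see one record is to build a small standalone DOM per record from the SAX
events, rather than handing XPath a node inside a larger document. The
sketch below is an assumption-laden illustration: RecordStreamer is a
hypothetical name, the element names imitate BLAST XML, and whether this
sidesteps the JDK behaviour described above has not been measured here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that turns each record element into its own tiny DOM. */
final class RecordStreamer extends DefaultHandler {
    private final String recordName;
    private final String expression;
    private final List<String> results = new ArrayList<>();
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final DocumentBuilder builder;
    private Document doc;  // non-null only while inside a record
    private Node current;  // node that receives the next child

    private RecordStreamer(String recordName, String expression)
            throws ParserConfigurationException {
        this.recordName = recordName;
        this.expression = expression;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Streams over the XML and evaluates the expression once per record. */
    static List<String> extract(String xml, String recordName, String expression) {
        try {
            RecordStreamer handler = new RecordStreamer(recordName, expression);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.results;
        } catch (Exception e) {
            throw new IllegalStateException("Streaming parse failed", e);
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (doc == null) {
            if (!qName.equals(recordName)) {
                return; // outside any record: skip
            }
            doc = builder.newDocument(); // fresh document for this record only
            current = doc;
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (doc == null) {
            return;
        }
        current = current.getParentNode();
        if (current == doc) { // the record's root element has just closed
            try {
                results.add(xpath.evaluate(expression, doc));
            } catch (XPathExpressionException e) {
                throw new SAXException(e);
            }
            doc = null; // drop the record so it can be garbage collected
            current = null;
        }
    }
}
```

For example, RecordStreamer.extract(xml, "Hit", "/Hit/Hit_id") would
return one Hit_id per Hit while holding only a single Hit in memory at a
time. Note that the XPath expression is rooted at the record element, not
at the root of the original file.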

One other problem I've noticed is that the NCBI XML doesn't always parse.
I've reported this to them, and they've promised to address it. It usually
occurs when submitters put non-escaped characters into text fields such as
author lists in PubMed. NCBI doesn't always use CDATA blocks around text, and
as soon as the parser hits one of these characters it throws an exception.
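
The failure mode is easy to reproduce with the JDK's own parser. The
snippet below is a small illustration only: EscapeDemo and the author
string are made up for the example and are not taken from any NCBI record.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo of how a bare '&' in a text field breaks an XML parse. */
final class EscapeDemo {

    /** Returns true if the string is well-formed XML, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException("XML parser unavailable or I/O failure", e);
        }
    }

    /** Escapes the characters that must not appear raw in element text. */
    static String escape(String text) {
        // '&' must be replaced first so the other entities are not double-escaped.
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Here parses("<Author>Smith & Jones</Author>") is false, whereas wrapping
the text in a CDATA section or passing it through escape() first gives a
document that parses. A reader cannot repair such input reliably after the
fact, so the fix has to happen where the XML is produced.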

I've also noticed a tendency (in other code bases) for developers to use
several different parsers; usually, whatever parser they're most familiar
with.  The problem with this is that they often introduce parser-specific
code into the code base, so you end up with numerous dependencies for
different parsers, and a potential configuration problem if you're passing
the XML parser as a run-time configuration parameter.  The most frequent
external parsers I've seen used are JDOM and DOM4J.  The usual way to get
around this is to write to an interface, but that will require some
additional vigilance.
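
"Writing to an interface" could look something like the sketch below. The
names ElementTextReader and JaxpElementTextReader are hypothetical, chosen
for this example only. The point is that calling code depends on the
interface alone, so a JDOM- or DOM4J-backed implementation could be
swapped in without parser-specific types leaking into the code base.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical parser-neutral contract: no DOM, JDOM or DOM4J types in the signature. */
interface ElementTextReader {
    /** Returns the text of every element named {@code tag}, in document order. */
    List<String> values(String xml, String tag);
}

/** One possible implementation, using only the JAXP DOM parser bundled with the JDK. */
final class JaxpElementTextReader implements ElementTextReader {
    @Override
    public List<String> values(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read XML", e);
        }
    }
}
```

The vigilance mentioned above then comes down to one reviewable rule:
only classes implementing the interface may import a parser package.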

Just a few things to watch out for as we move forward.

Mark (the other one) :-)

On Nov 30, 2007 1:26 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> I think I've seen XPath hanging around in other people's code in a 1.5
> code-base (in fact one of the guys I work with). I've used Java's DOM
> before & it really isn't very nice & quite verbose. I'd prefer if there
> was a better alternative/wrapper around the XML parsers just to cut down
> on code chatter.
>
> Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these
> Java tools & I think I've gone cross-eyed with the sheer number of
> acronyms! You've gotta love something which seems to add a letter to ER
> & that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
> list know of a ASN.1 parser for Java that's good and should we support
> it (considering NCBI generate their DTD & XML from the ASN.1
> representation).
>
> Andy
>

From abhi232 at cc.gatech.edu  Sat Nov 24 11:16:17 2007
From: abhi232 at cc.gatech.edu (Abhinav Ram Karhu)
Date: Sat, 24 Nov 2007 16:16:17 -0000
Subject: [Biojava-l] Applet not able to find DNATools class.
Message-ID: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>

Hello all,
I am getting an error while loading an applet.

The stack trace is as follows:

java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
	at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
	at Trace.init(Trace.java:161)
	at sun.applet.AppletPanel.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

My class files, the PHP page and the BioJava jar files are all together in one folder.

I also import org.biojava.bio.seq.DNATools in Trace.java.

The applet tag in my PHP page looks like this:

<applet code="Trace.class"  archive="biojava-1.5.jar , bytecode.jar" height=800 width=800>

Please let me know if I am missing something.

Thanks in advance.

Abhinav
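
A note on reading the trace above, offered as a general JVM observation
rather than a diagnosis of this applet: "Could not initialize class" is
the message reported when a class's static initializer has already failed
once, so the class was found but could not be set up. The first failure
is normally an ExceptionInInitializerError carrying the real cause, and it
may appear earlier in the Java console. Why DNATools' initializer would
fail inside an applet is not established here. The toy class below
(InitFailureDemo, a made-up name) only demonstrates the two-step pattern.

```java
/** Hypothetical demo of how a failed static initializer surfaces on later use. */
final class InitFailureDemo {

    /** Stand-in for a class whose static setup cannot complete. */
    static class Broken {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("resource missing");
        }
    }

    /** Touches Broken twice and records which error each attempt raises. */
    static String[] touchTwice() {
        String[] seen = new String[2];
        for (int i = 0; i < 2; i++) {
            try {
                int value = Broken.VALUE; // triggers class initialization
                seen[i] = "ok " + value;
            } catch (Throwable t) {
                seen[i] = t.getClass().getSimpleName();
            }
        }
        return seen;
    }
}
```

The first access reports ExceptionInInitializerError (with the
IllegalStateException as its cause) and every later access reports
NoClassDefFoundError, which is the form seen in the stack trace above.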


From ap3 at sanger.ac.uk  Thu Nov  1 16:59:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 1 Nov 2007 16:59:35 +0000
Subject: [Biojava-l] Biojava migrating to Subversion
Message-ID: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>

Hi all,

Over the next weeks (until Christmas) BioJava will finally move the version
control system from CVS to Subversion (svn). This is happening in parallel to
the other open-bio projects. We will ensure that nothing gets lost during this
migration. This means that all Biojava modules, branches, tags and the history
of the files will be imported into the new repository.

Over the next weeks we will:

A) Test the migration procedure to ensure nothing gets lost.
B) Declare a CVS freeze at some point, giving all developers enough time to
   commit their latest code to CVS.
C) Perform the final svn migration after the freeze. At this point we will
   also do a quick BioJava release (version 1.5.1).
D) From that moment on, carry out all future BioJava development via svn;
   CVS will remain frozen.

Detailed instructions for how to check out and commit code using svn will be
announced closer to the migration date.

We will keep you informed about the progress of the migration. There is also a
wiki page which documents it:
http://biojava.org/wiki/CVS_to_SVN_Migration

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From abhi232 at cc.gatech.edu  Mon Nov  5 17:59:15 2007
From: abhi232 at cc.gatech.edu (abhi232 at cc.gatech.edu)
Date: Mon, 5 Nov 2007 12:59:15 -0500 (EST)
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
Message-ID: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>

Hi all,
I have a byte array holding the data from an .ab1 file. The BioJava
library provides a class called ABITrace whose constructors take either a
byte[] array, a File or a URL. If I use the latter two (the File or the
URL) the program works, but if I pass the byte array to the constructor I
get a java.lang.ArrayIndexOutOfBoundsException. Is there a problem with the
ABITrace class, or how can I work around this error? The length of the
byte array prints as 144930. Could that cause a problem in my code?

Thanks in advance.
Abhinav


From holland at ebi.ac.uk  Tue Nov  6 10:15:43 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 10:15:43 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
Message-ID: <47303ECF.4020806@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I suspect the byte array itself may contain inaccurate data.

Internally, both the URL and File constructors read the data into a byte
array and then pass it to the same method as is used by the byte[]
constructor.

So, something must be different between the byte array you have, and the
byte array obtained by reading the file in.

The File constructor uses the following code to read the file:

    byte[] bytes = null;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    FileInputStream fis = new FileInputStream(ABIFile);
    BufferedInputStream bis = new BufferedInputStream(fis);
    int b;
    while ((b = bis.read()) >= 0)
    {
      baos.write(b);
    }
    bis.close(); fis.close(); baos.close();
    bytes = baos.toByteArray();

If the above code produces different results to your byte array when
reading data from the same file as your code, then something has gone
wrong with the construction of your byte array.
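
The comparison described above could be run with something like the
following sketch. ByteArrayDiff and firstMismatch are hypothetical names,
not part of BioJava; the idea is only to report where the array read from
the file and the array built by your own code first diverge.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of BioJava): finds where two byte arrays diverge. */
final class ByteArrayDiff {

    /** Returns the index of the first differing byte, or -1 if the arrays are identical. */
    static int firstMismatch(byte[] expected, byte[] actual) {
        int common = Math.min(expected.length, actual.length);
        for (int i = 0; i < common; i++) {
            if (expected[i] != actual[i]) {
                return i;
            }
        }
        // One array is a prefix of the other: they differ at the shorter length.
        return expected.length == actual.length ? -1 : common;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the two arrays being compared.
        byte[] fromFile = {'A', 'B', 'I', 'F', 0, 101};
        byte[] fromNetwork = {'A', 'B', 'I', 'F', 0};
        System.out.println("identical: " + Arrays.equals(fromFile, fromNetwork));
        System.out.println("first mismatch at index " + firstMismatch(fromFile, fromNetwork));
    }
}
```

A mismatch at index 0 or a large difference in length would suggest the
data was altered while being read, rather than a problem in the file.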

Lastly, a full stack trace would help us pinpoint the line that is
breaking, and hopefully provide a hint as to what is wrong with the
contents of the byte array. If you could provide one that would be very
helpful.

cheers,
Richard


abhi232 at cc.gatech.edu wrote:
> Hi all,
> I am having a byte array which is having the data from an .ab1 file.The
> biojava library provides a class called as ABITrace which takes as input
> either a byte[] array , a file or a url.If i use the later parameters (the
> file or the url )the program works but if I pass the byte array to the
> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
> problem with the ABITrace class or how can I bypass this particular error.
> I am printing the length of the byte array and it comes to 144930...Can
> that cause a problem in my code?
> 
> Thanks in advance.
> Abhinav
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMD7P4C5LeMEKA/QRAmGIAJ9a/V6nZqMROz3H4u69ECQ+9iTgMgCeNZvr
oe52S3khmTvi5BFCL1W4KHM=
=5JAO
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Tue Nov  6 16:53:54 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 16:53:54 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730A6F1.9050407@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
Message-ID: <47309C22.10803@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think that either the file is at fault, or the method you are using to
read the file into Java is at fault.

Could you provide us with the complete piece of code you are using from
the point where you read the file into the array through to the point
where you generate the output you quoted?  (Not as an attachment as the
mailing list will strip those - simply paste it into the message body
instead).

cheers,
Richard


abhinav wrote:
> Richard Holland wrote:
> I suspect the byte array itself may contain inaccurate data.
> 
> Internally, both the URL and File constructors read the data into a byte
> array and then pass it to the same method as is used by the byte[]
> constructor.
> 
> So, something must be different between the byte array you have, and the
> byte array obtained by reading the file in.
> 
> The File constructor uses the following code to read the file:
> 
>     byte[] bytes = null;
>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>     FileInputStream fis = new FileInputStream(ABIFile);
>     BufferedInputStream bis = new BufferedInputStream(fis);
>     int b;
>     while ((b = bis.read()) >= 0)
>     {
>       baos.write(b);
>     }
>     bis.close(); fis.close(); baos.close();
>     bytes = baos.toByteArray();
> 
> If the above code produces different results to your byte array when
> reading data from the same file as your code, then something has gone
> wrong with the construction of your byte array.
> 
> Lastly, a full stack trace would help us pinpoint the line that is
> breaking, and hopefully provide a hint as to what is wrong with the
> contents of the byte array. If you could provide one that would be very
> helpful.
> 
> cheers,
> Richard
> 
> 
> abhi232 at cc.gatech.edu wrote:
>   
>>>> Hi all,
>>>> I am having a byte array which is having the data from an .ab1 file.The
>>>> biojava library provides a class called as ABITrace which takes as input
>>>> either a byte[] array , a file or a url.If i use the later parameters (the
>>>> file or the url )the program works but if I pass the byte array to the
>>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
>>>> problem with the ABITrace class or how can I bypass this particular error.
>>>> I am printing the length of the byte array and it comes to 144930...Can
>>>> that cause a problem in my code?
>>>>
>>>> Thanks in advance.
>>>> Abhinav
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>>     

> Yes I looked at the file ABITrace and found out that the first three
> characters must be ABI or the 128-130 characters must be ABI.But I
> cannot find that in the file that I am having.Also If this is not the
> case then there should be an illegal format exception whereas I am
> arrayIndexOutOfBound Exception which is also weird.
> I am getting the following stack trace.
> The bytes that i want are:0
> The bytes that i want are:11
> The bytes that i want are:0
> The size of the byte array generated is:144930
> Byte array also recieved
> java.lang.ArrayIndexOutOfBoundsException: 128
>     at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
>     at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
>     at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
>     at Trace.init(Trace.java:138)
>     at sun.applet.AppletPanel.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> The bytes I want are the first three bytes that I want to check if my
> file is ABI or not.I checked the isABI function as well it returns true
> or false value and not arrayIndexOutOfBouond . Also the number 128 does
> it hve any significance in this case?
> Thanks in advance
> Abhinav

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG
T1Z8W0JhCe7+hx5rbKLGqVk=
=qNcr
-----END PGP SIGNATURE-----


From abhi232 at cc.gatech.edu  Tue Nov  6 18:03:02 2007
From: abhi232 at cc.gatech.edu (abhinav)
Date: Tue, 06 Nov 2007 12:03:02 -0600
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <47309C22.10803@ebi.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk>
Message-ID: <4730AC56.9060808@cc.gatech.edu>

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I think that either the file is at fault, or the method you are using to
> read the file into Java is at fault.
>
> Could you provide us with the complete piece of code you are using from
> the point where you read the file into the array through to the point
> where you generate the output you quoted?  (Not as an attachment as the
> mailing list will strip those - simply paste it into the message body
> instead).
>
> cheers,
> Richard
>
>
> abhinav wrote:
>   
>> Richard Holland wrote:
>> I suspect the byte array itself may contain inaccurate data.
>>
>> Internally, both the URL and File constructors read the data into a byte
>> array and then pass it to the same method as is used by the byte[]
>> constructor.
>>
>> So, something must be different between the byte array you have, and the
>> byte array obtained by reading the file in.
>>
>> The File constructor uses the following code to read the file:
>>
>>     byte[] bytes = null;
>>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>     FileInputStream fis = new FileInputStream(ABIFile);
>>     BufferedInputStream bis = new BufferedInputStream(fis);
>>     int b;
>>     while ((b = bis.read()) >= 0)
>>     {
>>       baos.write(b);
>>     }
>>     bis.close(); fis.close(); baos.close();
>>     bytes = baos.toByteArray();
>>
>> If the above code produces different results to your byte array when
>> reading data from the same file as your code, then something has gone
>> wrong with the construction of your byte array.
>>
>> Lastly, a full stack trace would help us pinpoint the line that is
>> breaking, and hopefully provide a hint as to what is wrong with the
>> contents of the byte array. If you could provide one that would be very
>> helpful.
>>
>> cheers,
>> Richard
>>
>>
>> abhi232 at cc.gatech.edu wrote:
>>   
>>     
>>>>> Hi all,
>>>>> I am having a byte array which is having the data from an .ab1 file.The
>>>>> biojava library provides a class called as ABITrace which takes as input
>>>>> either a byte[] array , a file or a url.If i use the later parameters (the
>>>>> file or the url )the program works but if I pass the byte array to the
>>>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
>>>>> problem with the ABITrace class or how can I bypass this particular error.
>>>>> I am printing the length of the byte array and it comes to 144930...Can
>>>>> that cause a problem in my code?
>>>>>
>>>>> Thanks in advance.
>>>>> Abhinav
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>
>>>>>     
>>>>>           
>
>   
>> Yes I looked at the file ABITrace and found out that the first three
>> characters must be ABI or the 128-130 characters must be ABI.But I
>> cannot find that in the file that I am having.Also If this is not the
>> case then there should be an illegal format exception whereas I am
>> arrayIndexOutOfBound Exception which is also weird.
>> I am getting the following stack trace.
>> The bytes that i want are:0
>> The bytes that i want are:11
>> The bytes that i want are:0
>> The size of the byte array generated is:144930
>> Byte array also recieved
>> java.lang.ArrayIndexOutOfBoundsException: 128
>>     at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
>>     at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
>>     at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
>>     at Trace.init(Trace.java:138)
>>     at sun.applet.AppletPanel.run(Unknown Source)
>>     at java.lang.Thread.run(Unknown Source)
>> The bytes I want are the first three bytes that I want to check if my
>> file is ABI or not.I checked the isABI function as well it returns true
>> or false value and not arrayIndexOutOfBouond . Also the number 128 does
>> it hve any significance in this case?
>> Thanks in advance
>> Abhinav
>>     
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHMJwi4C5LeMEKA/QRAhAOAJ0ZjIWk1CXSLYlU2CUCp7xodAfFeACgjtFG
> T1Z8W0JhCe7+hx5rbKLGqVk=
> =qNcr
> -----END PGP SIGNATURE-----
>   
OK, here is the code that I am using. I establish a connection with a
PHP page, which in turn reads the file and prints the content back to
me. I am using DataOutputStream for sending data and BufferedReader for
reading the response. Then I read the data into a String and convert it
to a byte[] array. This is the code where the connection is established
and the data is read and displayed.


    private HttpURLConnection httpConn;
    private DataOutputStream out;
    private DataInputStream temp_stream;
    private BufferedReader in;
    private BufferedInputStream in_buff_stream;
    private String str;
    private byte[] bytearray;
    Chromatogram abif_chromatogram;

    /** Creates a new instance of testPost */
    public testPost()
    {
        httpConn = null;
        str = new String("");
        bytearray = new byte[144930];
    }

    public byte[] create_and_write_Connection(String url, String data_request)
    {
        try
        {
            URL conn_url = new URL(url);
            httpConn = (HttpURLConnection)conn_url.openConnection();
            httpConn.setDoOutput(true);
            httpConn.setDoInput(true);
            httpConn.setRequestMethod("POST");
            out = new DataOutputStream(httpConn.getOutputStream());
            out.writeBytes(data_request);
            out.flush();
            System.out.println("Connection established successfully and data written");
            InputStreamReader in_stream = new InputStreamReader(httpConn.getInputStream());

            System.out.println("The character encoding used is:" + in_stream.getEncoding());
            in = new BufferedReader(in_stream);

            System.out.println("Data acceptance started");

            while (in.readLine() != null)
            {
                str += in.readLine();
            }
            System.out.println("The string to be returned is:" + str);
            bytearray = str.getBytes("ISO8859-1");
            String temp_string = new String(bytearray, "windows-1252");
            System.out.println("The encoded string is as follows:" + temp_string);
            System.out.println("The size of byte array inside testpost is:" + Array.getLength(bytearray));
            for (int i = 0; i < 3; i++)
                System.out.println("The bytes that i want are:" + bytearray[i]);
            return bytearray;
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return bytearray;
    }
Please guide me on this point
Thanks
Abhinav
   

From holland at ebi.ac.uk  Tue Nov  6 17:05:12 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Nov 2007 17:05:12 +0000
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk> <4730A6F1.9050407@cc.gatech.edu>
	<47309C22.10803@ebi.ac.uk> <4730AC56.9060808@cc.gatech.edu>
Message-ID: <47309EC8.2070904@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The String is where you're going wrong. ABI files are not Stringifyable
- they are binary data. Converting them to a String will corrupt them.

cheers,
Richard
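
As a rough illustration of the point about binary data, here is a minimal
sketch that copies a stream byte for byte, with no Reader, String or
charset involved. RawBytes and readFully are hypothetical names introduced
for this example, and the ByteArrayInputStream in main only stands in for
the stream returned by httpConn.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: reads a stream as raw bytes without any text decoding. */
final class RawBytes {

    /** Copies every byte from the stream unchanged into a new array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample containing values that a text decoder or readLine() could alter or drop.
        byte[] original = {0, 11, 0, (byte) 0x81, (byte) 0xFF, '\r', '\n', 'A'};
        byte[] copy = readFully(new ByteArrayInputStream(original));
        System.out.println("unchanged: " + Arrays.equals(original, copy));
    }
}
```

In the applet, the result of readFully(httpConn.getInputStream()) could
then be passed to the byte[] constructor of ABITrace, assuming the PHP
page sends the file contents unmodified.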

abhinav wrote:
> Richard Holland wrote:
> I think that either the file is at fault, or the method you are using to
> read the file into Java is at fault.
> 
> Could you provide us with the complete piece of code you are using from
> the point where you read the file into the array through to the point
> where you generate the output you quoted?  (Not as an attachment as the
> mailing list will strip those - simply paste it into the message body
> instead).
> 
> cheers,
> Richard
> 
> 
> abhinav wrote:
>   
>>>> Richard Holland wrote:
>>>> I suspect the byte array itself may contain inaccurate data.
>>>>
>>>> Internally, both the URL and File constructors read the data into a byte
>>>> array and then pass it to the same method as is used by the byte[]
>>>> constructor.
>>>>
>>>> So, something must be different between the byte array you have, and the
>>>> byte array obtained by reading the file in.
>>>>
>>>> The File constructor uses the following code to read the file:
>>>>
>>>>     byte[] bytes = null;
>>>>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>>     FileInputStream fis = new FileInputStream(ABIFile);
>>>>     BufferedInputStream bis = new BufferedInputStream(fis);
>>>>     int b;
>>>>     while ((b = bis.read()) >= 0)
>>>>     {
>>>>       baos.write(b);
>>>>     }
>>>>     bis.close(); fis.close(); baos.close();
>>>>     bytes = baos.toByteArray();
>>>>
>>>> If the above code produces different results to your byte array when
>>>> reading data from the same file as your code, then something has gone
>>>> wrong with the construction of your byte array.
>>>>
>>>> Lastly, a full stack trace would help us pinpoint the line that is
>>>> breaking, and hopefully provide a hint as to what is wrong with the
>>>> contents of the byte array. If you could provide one that would be very
>>>> helpful.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>>
>>>> abhi232 at cc.gatech.edu wrote:
>>>>   
>>>>     
>>>>>>> Hi all,
>>>>>>> I am having a byte array which is having the data from an .ab1 file.The
>>>>>>> biojava library provides a class called as ABITrace which takes as input
>>>>>>> either a byte[] array , a file or a url.If i use the later parameters (the
>>>>>>> file or the url )the program works but if I pass the byte array to the
>>>>>>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
>>>>>>> problem with the ABITrace class or how can I bypass this particular error.
>>>>>>> I am printing the length of the byte array and it comes to 144930...Can
>>>>>>> that cause a problem in my code?
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>> Abhinav
>>>>>>> _______________________________________________
>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>
>>>>>>>     
>>>>>>>           
> 
>   
>>>> Yes I looked at the file ABITrace and found out that the first three
>>>> characters must be ABI or the 128-130 characters must be ABI.But I
>>>> cannot find that in the file that I am having.Also If this is not the
>>>> case then there should be an illegal format exception whereas I am
>>>> arrayIndexOutOfBound Exception which is also weird.
>>>> I am getting the following stack trace.
>>>> The bytes that i want are:0
>>>> The bytes that i want are:11
>>>> The bytes that i want are:0
>>>> The size of the byte array generated is:144930
>>>> Byte array also recieved
>>>> java.lang.ArrayIndexOutOfBoundsException: 128
>>>>     at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
>>>>     at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
>>>>     at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
>>>>     at Trace.init(Trace.java:138)
>>>>     at sun.applet.AppletPanel.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> The bytes I want are the first three bytes that I want to check if my
>>>> file is ABI or not.I checked the isABI function as well it returns true
>>>> or false value and not arrayIndexOutOfBouond . Also the number 128 does
>>>> it hve any significance in this case?
>>>> Thanks in advance
>>>> Abhinav
>>>>     
> 

> Ok Yes here is the code that i am using .I establish a connection with a
> php page which in turn reads the file and prints the content back to
> me.I am using DataOutputStream for sending data and BufferedReader for
> taking in the data.Then I am reading the data into a string and
> converting it to byte[] array . this the code where the connection is
> estableshed and the data is taken and displayed.


>  private HttpURLConnection httpConn;
>     private DataOutputStream out;
>     private DataInputStream temp_stream;
>     private BufferedReader in;
>     private BufferedInputStream in_buff_stream;
>     private String str ;
>     private byte[] bytearray;
>     Chromatogram abif_chromatogram;

>     /** Creates a new instance of testPost */
>     public testPost()
>     {

>         httpConn = null;
>         str = new String("");
>         bytearray = new byte[144930];

>     }
>     public byte[] create_and_write_Connection(String url,String
> data_request)
>     {
>         try
>         {
>             URL conn_url = new URL(url);
>             httpConn = (HttpURLConnection)conn_url.openConnection();
>             httpConn.setDoOutput(true);
>             httpConn.setDoInput(true);
>             httpConn.setRequestMethod("POST");
>             out=new DataOutputStream(httpConn.getOutputStream());
>             out.writeBytes(data_request);
>             out.flush();
>             System.out.println("Connection established successfully and
> data written");
>             InputStreamReader in_stream = new
> InputStreamReader(httpConn.getInputStream());

>                 System.out.println("The character encoding used is:"+
> in_stream.getEncoding());
>             in = new BufferedReader(in_stream);


>             System.out.println("Data acceptance started");


>             while(in.readLine()!=null)
>             {
>                 str += in.readLine();
>             }
>             System.out.println("The string to be returned is:"+str);
>             bytearray = str.getBytes("ISO8859-1");
>             String temp_string = new String(bytearray,"windows-1252");
>            System.out.println("The encoded string is as follows:"+
> temp_string);
>             System.out.println("The size of byte array inside testpost
> is:"+ Array.getLength(bytearray));
>              for(int i = 0 ; i < 3 ; i ++)
>                 System.out.println("The bytes that i want are:"+
> bytearray[i]);
>             return bytearray;
>         }
>         catch(Exception e)
>         {
>                e.printStackTrace();
>         }
>         return bytearray;
>      }
> Please guide me on this point
> Thanks
> Abhinav

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMJ7I4C5LeMEKA/QRAupLAJ9YDoGohk5uZSNYZnRRMJ5WeNDpGgCfdCyg
+Z/gXBbPmrG3SuQlfeHuD3A=
=akSf
-----END PGP SIGNATURE-----


From abhi232 at cc.gatech.edu  Tue Nov  6 17:40:01 2007
From: abhi232 at cc.gatech.edu (abhinav)
Date: Tue, 06 Nov 2007 11:40:01 -0600
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <47303ECF.4020806@ebi.ac.uk>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>
	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>
	<47303ECF.4020806@ebi.ac.uk>
Message-ID: <4730A6F1.9050407@cc.gatech.edu>

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I suspect the byte array itself may contain inaccurate data.
>
> Internally, both the URL and File constructors read the data into a byte
> array and then pass it to the same method as is used by the byte[]
> constructor.
>
> So, something must be different between the byte array you have, and the
> byte array obtained by reading the file in.
>
> The File constructor uses the following code to read the file:
>
>     byte[] bytes = null;
>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>     FileInputStream fis = new FileInputStream(ABIFile);
>     BufferedInputStream bis = new BufferedInputStream(fis);
>     int b;
>     while ((b = bis.read()) >= 0)
>     {
>       baos.write(b);
>     }
>     bis.close(); fis.close(); baos.close();
>     bytes = baos.toByteArray();
>
> If the above code produces different results to your byte array when
> reading data from the same file as your code, then something has gone
> wrong with the construction of your byte array.
>
> Lastly, a full stack trace would help us pinpoint the line that is
> breaking, and hopefully provide a hint as to what is wrong with the
> contents of the byte array. If you could provide one that would be very
> helpful.
>
> cheers,
> Richard
>
>
> abhi232 at cc.gatech.edu wrote:
>   
>> Hi all,
>> I am having a byte array which is having the data from an .ab1 file.The
>> biojava library provides a class called as ABITrace which takes as input
>> either a byte[] array , a file or a url.If i use the later parameters (the
>> file or the url )the program works but if I pass the byte array to the
>> constructor I get java.lang.arrayIndexOutOfBound.Exception.Is there a
>> problem with the ABITrace class or how can I bypass this particular error.
>> I am printing the length of the byte array and it comes to 144930...Can
>> that cause a problem in my code?
>>
>> Thanks in advance.
>> Abhinav
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>>     
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHMD7P4C5LeMEKA/QRAmGIAJ9a/V6nZqMROz3H4u69ECQ+9iTgMgCeNZvr
> oe52S3khmTvi5BFCL1W4KHM=
> =5JAO
> -----END PGP SIGNATURE-----
>   

Yes, I looked at the ABITrace source and found that either the first three
characters or characters 128-130 must be "ABI", but I cannot find that in
the file I have. Also, if that is not the case there should be an illegal
format exception, whereas I am getting an ArrayIndexOutOfBoundsException,
which is also weird.
I am getting the following output and stack trace.
The bytes that i want are:0
The bytes that i want are:11
The bytes that i want are:0
The size of the byte array generated is:144930
Byte array also recieved
java.lang.ArrayIndexOutOfBoundsException: 128
    at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
    at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
    at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
    at Trace.init(Trace.java:138)
    at sun.applet.AppletPanel.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
The bytes printed are the first three bytes, which I wanted to check to
see whether my file is ABI or not. I also checked the isABI function: it
returns true or false, and should not throw an
ArrayIndexOutOfBoundsException. Also, does the number 128 have any
significance in this case?
Thanks in advance
Abhinav
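
On the question about 128: in a Java ArrayIndexOutOfBoundsException the
number in the message is the index that was accessed, so the array that
reached isABI appears to have had no element at index 128. The sketch below
shows a bounds-checked version of the kind of check described in the
message ("ABI" at offset 0 or at offset 128). It is an assumption-laden
illustration, not BioJava's actual isABI code; AbiMagic, hasMagicAt and
looksLikeAbi are hypothetical names.

```java
/** Hypothetical sketch of a bounds-safe "ABI" marker check; not BioJava's implementation. */
final class AbiMagic {

    /** Alternate marker position, taken from the description in the message above. */
    private static final int ALTERNATE_OFFSET = 128;

    /** True if the three bytes starting at offset spell "ABI". */
    static boolean hasMagicAt(byte[] data, int offset) {
        // Check the length first so a short array gives false instead of an exception.
        return data != null
                && data.length >= offset + 3
                && data[offset] == 'A'
                && data[offset + 1] == 'B'
                && data[offset + 2] == 'I';
    }

    /** True if the marker is found at the start of the data or at the alternate offset. */
    static boolean looksLikeAbi(byte[] data) {
        return hasMagicAt(data, 0) || hasMagicAt(data, ALTERNATE_OFFSET);
    }

    public static void main(String[] args) {
        byte[] firstThreeFromLog = {0, 11, 0};  // the values printed in the output above
        System.out.println(looksLikeAbi(firstThreeFromLog));
    }
}
```

Running a check like this on the array just before constructing the
ABITrace, together with printing that same array's length, should show
whether the data passed in is really the 144930-byte file.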


From walsh at andrew.cmu.edu  Tue Nov  6 17:23:36 2007
From: walsh at andrew.cmu.edu (Andrew Walsh)
Date: Tue, 06 Nov 2007 12:23:36 -0500
Subject: [Biojava-l] Error while reading byte data for creating a Trace.
In-Reply-To: <4730AC56.9060808@cc.gatech.edu>
References: <6EDA8DA0-39B2-40A3-B3B3-DB5F3463DB51@sanger.ac.uk>	<2839.130.207.66.142.1194285555.squirrel@webmail.cc.gatech.edu>	<47303ECF.4020806@ebi.ac.uk>
	<4730A6F1.9050407@cc.gatech.edu>	<47309C22.10803@ebi.ac.uk>
	<4730AC56.9060808@cc.gatech.edu>
Message-ID: <4730A318.8010406@andrew.cmu.edu>

You also appear to be losing every other line with the following code:

    while(in.readLine()!=null)
        {
            str += in.readLine();
        }

Every time the while statement checks its condition, a line is read from 
the inputstream.  That line is never stored.  Then, if the condition is 
met, another line is read and that line is added to your String.

-Andy
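
For text input, the usual idiom is to assign the line inside the loop
condition so that each line is read exactly once. The following is a
minimal sketch with hypothetical names (ReadAllLines, joinLines) and a
StringReader standing in for the network stream. Note that this only
addresses the skipped lines; readLine() still drops line terminators, so
binary data such as an .ab1 file should be read as bytes instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical example of reading every line of a text stream exactly once. */
final class ReadAllLines {

    /** Reads all lines and joins them, appending '\n' after each one. */
    static String joinLines(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        // The line read by the condition is the same line used in the body.
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("one\ntwo\nthree"));
        System.out.print(joinLines(in));
    }
}
```

With the original loop, an input of three lines would keep only the second
one and then append the text "null", because the final readLine() call in
the loop body returns null.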

abhinav wrote:
> Richard Holland wrote:
>   
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> I think that either the file is at fault, or the method you are using to
>> read the file into Java is at fault.
>>
>> Could you provide us with the complete piece of code you are using from
>> the point where you read the file into the array through to the point
>> where you generate the output you quoted?  (Not as an attachment as the
>> mailing list will strip those - simply paste it into the message body
>> instead).
>>
>> cheers,
>> Richard
>>
>>
>> abhinav wrote:
>>   
>>     
>>> Richard Holland wrote:
>>> I suspect the byte array itself may contain inaccurate data.
>>>
>>> Internally, both the URL and File constructors read the data into a byte
>>> array and then pass it to the same method as is used by the byte[]
>>> constructor.
>>>
>>> So, something must be different between the byte array you have, and the
>>> byte array obtained by reading the file in.
>>>
>>> The File constructor uses the following code to read the file:
>>>
>>>     byte[] bytes = null;
>>>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>     FileInputStream fis = new FileInputStream(ABIFile);
>>>     BufferedInputStream bis = new BufferedInputStream(fis);
>>>     int b;
>>>     while ((b = bis.read()) >= 0)
>>>     {
>>>       baos.write(b);
>>>     }
>>>     bis.close(); fis.close(); baos.close();
>>>     bytes = baos.toByteArray();
>>>
>>> If the above code produces different results to your byte array when
>>> reading data from the same file as your code, then something has gone
>>> wrong with the construction of your byte array.
>>>
>>> Lastly, a full stack trace would help us pinpoint the line that is
>>> breaking, and hopefully provide a hint as to what is wrong with the
>>> contents of the byte array. If you could provide one that would be very
>>> helpful.
>>>
>>> cheers,
>>> Richard
>>>
>>>
>>> abhi232 at cc.gatech.edu wrote:
>>>   
>>>     
>>>       
>>>>>> Hi all,
>>>>>> I have a byte array containing the data from an .ab1 file. The
>>>>>> BioJava library provides a class called ABITrace which takes as input
>>>>>> either a byte[] array, a File or a URL. If I use the latter two (the
>>>>>> File or the URL) the program works, but if I pass the byte array to the
>>>>>> constructor I get a java.lang.ArrayIndexOutOfBoundsException. Is there a
>>>>>> problem with the ABITrace class, and how can I get around this error?
>>>>>> The length of the byte array prints as 144930. Can that cause a
>>>>>> problem in my code?
>>>>>>
>>>>>> Thanks in advance.
>>>>>> Abhinav
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>>     
>>>>>>           
>>>>>>             
>>   
>>     
>>> Yes, I looked at the ABITrace source and found that either the first three
>>> characters or characters 128-130 must be "ABI". But I cannot find that in
>>> the file I have. Also, if that is not the case there should be an illegal
>>> format exception, whereas I am getting an ArrayIndexOutOfBoundsException,
>>> which is also weird.
>>> I am getting the following output and stack trace.
>>> The bytes that i want are:0
>>> The bytes that i want are:11
>>> The bytes that i want are:0
>>> The size of the byte array generated is:144930
>>> Byte array also recieved
>>> java.lang.ArrayIndexOutOfBoundsException: 128
>>>     at org.biojava.bio.program.abi.ABITrace.isABI(ABITrace.java:552)
>>>     at org.biojava.bio.program.abi.ABITrace.initData(ABITrace.java:289)
>>>     at org.biojava.bio.program.abi.ABITrace.<init>(ABITrace.java:136)
>>>     at Trace.init(Trace.java:138)
>>>     at sun.applet.AppletPanel.run(Unknown Source)
>>>     at java.lang.Thread.run(Unknown Source)
>>> The bytes I print are the first three bytes, which I use to check whether
>>> my file is ABI or not. I checked the isABI function as well; it returns
>>> true or false, not an ArrayIndexOutOfBoundsException. Also, does the
>>> number 128 have any significance in this case?
>>> Thanks in advance
>>> Abhinav
>>>     
>>>       
>>   
>>     
> OK, here is the code that I am using. I establish a connection with a
> PHP page, which in turn reads the file and prints its content back to
> me. I am using a DataOutputStream for sending data and a BufferedReader
> for reading the response. Then I read the data into a String and
> convert it to a byte[] array. This is the code where the connection is
> established and the data is read and displayed.
>
>
>
>     private HttpURLConnection httpConn;
>     private DataOutputStream out;
>     private DataInputStream temp_stream;
>     private BufferedReader in;
>     private BufferedInputStream in_buff_stream;
>     private String str;
>     private byte[] bytearray;
>     Chromatogram abif_chromatogram;
>
>     /** Creates a new instance of testPost */
>     public testPost()
>     {
>         httpConn = null;
>         str = new String("");
>         bytearray = new byte[144930];
>     }
>
>     public byte[] create_and_write_Connection(String url, String data_request)
>     {
>         try
>         {
>             URL conn_url = new URL(url);
>             httpConn = (HttpURLConnection)conn_url.openConnection();
>             httpConn.setDoOutput(true);
>             httpConn.setDoInput(true);
>             httpConn.setRequestMethod("POST");
>             out = new DataOutputStream(httpConn.getOutputStream());
>             out.writeBytes(data_request);
>             out.flush();
>             System.out.println("Connection established successfully and data written");
>             InputStreamReader in_stream = new InputStreamReader(httpConn.getInputStream());
>             System.out.println("The character encoding used is:" + in_stream.getEncoding());
>             in = new BufferedReader(in_stream);
>
>             System.out.println("Data acceptance started");
>
>             while(in.readLine()!=null)
>             {
>                 str += in.readLine();
>             }
>             System.out.println("The string to be returned is:" + str);
>             bytearray = str.getBytes("ISO8859-1");
>             String temp_string = new String(bytearray, "windows-1252");
>             System.out.println("The encoded string is as follows:" + temp_string);
>             System.out.println("The size of byte array inside testpost is:" + Array.getLength(bytearray));
>             for(int i = 0 ; i < 3 ; i ++)
>                 System.out.println("The bytes that i want are:" + bytearray[i]);
>             return bytearray;
>         }
>         catch(Exception e)
>         {
>             e.printStackTrace();
>         }
>         return bytearray;
>     }
> Please guide me on this point
> Thanks
> Abhinav
>    
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>   


From holland at ebi.ac.uk  Thu Nov  8 13:53:09 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 13:53:09 +0000
Subject: [Biojava-l] BioJava 3 Proposals
Message-ID: <473314C5.8070207@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear BioJava users,

The BioJava developers are considering options for the future
development of the BioJava toolkit. We consider that it needs
improvement in a few major areas to make it easier to use and
understand, and also faster and more scalable.

The options are to either rewrite large parts of the existing code,
working within the existing interfaces and paradigms, or to develop a
new set of BioJava packages from the ground up in order to take
advantage of lessons learned from the design patterns of the existing code.

The BioJava developers have spent the last couple of months discussing
ideas and proposals related to these options on a Wiki page, and would
now like to open this discussion to all users of BioJava and the
bioinformatics community in general. We would like to invite anyone who
has any ideas or suggestions to contribute these to the Wiki page,
and/or to comment on the ideas and suggestions that have already been
posted there.

Here is a link to the Wiki page, and also a link to the associated Talk
page where much of the discussion has taken place so far:

	http://biojava.org/wiki/BioJava3_Proposal
	http://biojava.org/wiki/Talk:BioJava3_Proposal

It is our intention to leave the discussion open until mid-January
2008 when we will summarise it and use it as the basis of a plan of
action. We will then distribute the summary and the action plan via the
BioJava website.

We look forward to hearing your comments and ideas. Please do remember
to make them directly to the Wiki page so that they are preserved in
context, making it easier for us to summarise them later!

cheers,
Richard
(on behalf of all BioJava developers)

PS. Just to reassure you, this is NOT a plan to drop the existing
codebase. It will continue to exist, but the outcome of these
discussions will determine whether we will continue to develop and
support it or start afresh with a clean slate and a new codebase.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMxTE4C5LeMEKA/QRAlGSAJwKzO0oAe3T2e8ibcG8uRReOVfh7wCdGlwn
JkcVzA55Ye32o8Ry48LO+04=
=oaaC
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Thu Nov  8 13:58:23 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 13:58:23 +0000
Subject: [Biojava-l] Biojava wiki
Message-ID: <473315FF.70506@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

what's happened to the biojava wiki today? i get errors from all pages,
including the front page, indicating zero-sized replies.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMxX/4C5LeMEKA/QRAmBPAJ9hx450OqBsD8s4DPgL8LsvpD4aRwCfZA62
6KkoyXhahrWkZo2OWyCL+Uk=
=1jK7
-----END PGP SIGNATURE-----


From phidias51 at gmail.com  Thu Nov  8 15:39:29 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Thu, 8 Nov 2007 07:39:29 -0800
Subject: [Biojava-l] Biojava wiki
In-Reply-To: <473315FF.70506@ebi.ac.uk>
References: <473315FF.70506@ebi.ac.uk>
Message-ID: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com>

Richard,
That's odd.  It comes up fine for me.

BTW, in your proposal you mentioned that people had "moved on".  I was
wondering what types of tasks they had moved on to, and what should be
included in the Proposal to ensure that BioJava stays relevant to them?

Regards,

Mark

On Nov 8, 2007 5:58 AM, Richard Holland <holland at ebi.ac.uk> wrote:

> what's happened to the biojava wiki today? i get errors from all pages,
> including the front page, indicating zero-sized replies.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From hlapp at gmx.net  Thu Nov  8 15:53:03 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 10:53:03 -0500
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
Message-ID: <ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>

Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we  
explicitly lowercase the value found for alphabet, and the comment  
says why:

         # Note: Biojava uses upper-case terms for alphabet, so we
         # need to change to all-lower in case the sequence was
         # manipulated by Biojava.
         $obj->alphabet(lc($rows->[3])) if $rows->[3];

However, when inserting sequences, we leave the value as is in  
BioPerl (which is lowercase), leading to a potential problem for  
Biojava upon retrieval. Do the Biojava folks deal with that? Should  
this maybe be harmonized across the board?

	-hilmar

On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:

> Dear Peter,
>
> All the alphabets are "DNA" (upper case) in my database. The
> sequences are taken from NCBI by a BioJava application.
> Thus it should be BioJava that inserts the records with "DNA". Thus
> no potential "hidden bug" in BioPython.
>
> Maybe a point to share with the Open-Bio committee.
>
> Eric
>
> ----- Original Message ----
> From: Peter <biopython at maubp.freeserve.co.uk>
> To: Eric Gibert <ericgibert at yahoo.fr>
> Cc: biopython at lists.open-bio.org
> Sent: Thursday, 8 November 2007, 19:40:00
> Subject: Re: [BioPython] small "bug" correction in package BioSql
>
> Eric Gibert wrote:
>> Dear all,
>>
>> In BioSQL/BioSeq.py, in the class DBSeq definition, we have the
>> function:
>>
>> ...
>>
>> please note my correction: force moltype to be turned into lower case, as
>> my database has upper-case values! Otherwise this raises the "Unknown
>> moltype" error.
>
> Hi Eric, I've made your suggested change in CVS,
> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>
> I would encourage you to investigate why some of the "alphabet" fields
> in the biosequence table are in upper case.  There could be a bug
> elsewhere which is writing these entries with the wrong alphabet.  Is
> this affecting all entries, or just some?
>
> Peter
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From holland at ebi.ac.uk  Thu Nov  8 16:17:25 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 16:17:25 +0000
Subject: [Biojava-l] Biojava wiki
In-Reply-To: <6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com>
References: <473315FF.70506@ebi.ac.uk>
	<6e1d61f50711080739t6df72848se87e6001f97d01ce@mail.gmail.com>
Message-ID: <47333695.40808@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> BTW, in your proposal you mentioned that people had "moved on".  I was
> wondering what types of tasks they had moved on to, and what should be
> included in the Proposal to ensure that BioJava stays relevant to them?

Good point. From what we can tell, people are not so sequence-focused
any more but are more interested in features, alignments, population
data, etc. - more 'metadata' so to speak.

We do need some mechanism to ensure that we are correct in this
thinking, and that future shifts in direction are catered for in this
design phase.

Could you add a note to the wiki with your points, and/or any ideas you
may have about ensuring these requirements are met?

cheers,
Richard


> Regards,
> 
> Mark
> 
> On Nov 8, 2007 5:58 AM, Richard Holland <holland at ebi.ac.uk> wrote:
> 
> what's happened to the biojava wiki today? i get errors from all pages,
> including the front page, indicating zero-sized replies.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMzaV4C5LeMEKA/QRAoPUAJ0TQ+xFF1J3EtZgHmvYj2HH41koCgCeLYm0
D5Z7SJDWjvJ9rbCrS+RTEeI=
=XhE1
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Thu Nov  8 16:18:46 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 08 Nov 2007 16:18:46 +0000
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
Message-ID: <473336E6.6000100@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

we do need a consensus here.

I'm happy to go with whatever value is chosen, as the BioJava code can
easily be modified to suit.

cheers,
Richard

Hilmar Lapp wrote:
> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we  
> explicitly lowercase the value found for alphabet, and the comment  
> says why:
> 
>          # Note: Biojava uses upper-case terms for alphabet, so we
>          # need to change to all-lower in case the sequence was
>          # manipulated by Biojava.
>          $obj->alphabet(lc($rows->[3])) if $rows->[3];
> 
> However, when inserting sequences, we leave the value as is in  
> BioPerl (which is lowercase), leading to a potential problem for  
> Biojava upon retrieval. Do the Biojava folks deal with that? Should  
> this maybe be harmonized across the board?
> 
> 	-hilmar
> 
> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:
> 
>> Dear Peter,
>>
>> All the alphabets are "DNA" (upper case) in my database. The
>> sequences are taken from NCBI by a BioJava application.
>> Thus it should be BioJava that inserts the records with "DNA". Thus
>> no potential "hidden bug" in BioPython.
>>
>> Maybe a point to share with the Open-Bio committee.
>>
>> Eric
>>
>> ----- Original Message ----
>> From: Peter <biopython at maubp.freeserve.co.uk>
>> To: Eric Gibert <ericgibert at yahoo.fr>
>> Cc: biopython at lists.open-bio.org
>> Sent: Thursday, 8 November 2007, 19:40:00
>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>
>> Eric Gibert wrote:
>>> Dear all,
>>>
>>> In BioSQL/BioSeq.py, in the class DBSeq definition, we have the
>>> function:
>>>
>>> ...
>>>
>>> please note my correction: force moltype to be turned into lower case, as
>>> my database has upper-case values! Otherwise this raises the "Unknown
>>> moltype" error.
>> Hi Eric, I've made your suggested change in CVS,
>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>
>> I would encourage you to investigate why some of the "alphabet" fields
>> in the biosequence table are in upper case.  There could be a bug
>> elsewhere which is writing these entries with the wrong alphabet.  Is
>> this affecting all entries, or just some?
>>
>> Peter
>>
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3
9x+CUHig3GfBCZ56rDb1ZG4=
=OJyB
-----END PGP SIGNATURE-----


From hlapp at gmx.net  Thu Nov  8 20:28:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:28:19 -0500
Subject: [Biojava-l] [BioPython] error on insert new sequences from
	GenBank: no annotations saved in BioSQL database
In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID: <A1A51FDB-C4A6-4894-8C9C-12A210B73C0D@gmx.net>

Maybe we need to hold some mini-hackathon to make the different  
toolkits compatible in how they map annotation to the schema.  
Obviously I don't know whether you have the latest Biojava setup  
here, but I'll just comment how BioPerl/Bioperl-db would map this:

'ORIGIN' - if I'm not mistaken this is only a token that introduces  
the actual sequence. I'm not sure what Biojava is storing as value here.

'DIVISION' - this maps to column division in table bioentry (though I
agree that if we followed the weak typing principle perfectly this
should be a tag/value association, but at present it's still an actual
column)

'genbank_accessions' - secondary accession numbers indeed go into the  
qualifier value table. The primary accession maps to column accession  
in table bioentry

'TITLE' - this is part of a publication reference, and should map to  
column title in table reference (which it does in bioperl-db)

'cross_references' - not sure where these would be coming from in  
GenBank format; for EMBL this will map to the dbxref table

'data_file_division' - not sure what this is (same as DIVISION?)

'VERSION' - in BioPerl we parse this apart into a version for the  
accession (which is column version in table bioentry) and the GI  
number, which maps to column identifier in table bioentry

'references' - these map to table reference (and bioentry_reference  
for association with the bioentry)

'KEYWORDS' - indeed these map to bioentry_qualifier_value

'GI' - maps to column identifier in table bioentry

'SIZE' - not sure what size that is. If it is the length of the  
sequence, it should (and in BioPerl/bioperl-db does) map to column  
length in table biosequence

'DEFINITION' - maps to column description in table bioentry

'REFERENCE' - should be the same as for 'references'

'MDAT' - not sure what this is

'ORGANISM' - this is the organism and maps to the table taxon (and  
taxon_name), with a foreign key in bioentry pointing to the taxon

'JOURNAL' - this is part of a reference, see 'references'

'ACCESSION' - the primary accession, maps to column accession in  
table bioentry

'LOCUS' - in the file itself this is an entire line consisting of  
multiple fields; BioPerl/bioperl-db maps the locus name (the first  
token after the literal token LOCUS) to column name in table bioentry

'SOURCE' - this is the organism, see 'ORGANISM'

'PUBMED' - this is part of a literature reference, and maps to a  
foreign key in the reference table (reference.dbxref) to a dbxref  
entry with PUBMED or PMID as the database and the pubmed ID as the  
accession

'AUTHORS' - part of a literature reference, maps to column authors in  
table reference

'TYPE' - not sure what this is. If it's the alphabet, it maps to  
table biosequence, column alphabet

'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,  
though there have been plans to make it a column in table biosequence.

Note that this could in fact be the way Biojava stores it too, but  
upon retrieval represents it in the way you are seeing it.
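
As an illustration of the 'VERSION' mapping described above, here is a
small sketch in Java that splits a GenBank VERSION value into accession,
version and GI number. The GI number in the example is made up, and
VersionLine is a hypothetical helper written for this message, not code
taken from BioPerl, Biopython or BioJava.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VersionLine {

    // Matches values such as "AJ459190.1  GI:12345"; the GI part is optional.
    private static final Pattern FORMAT =
            Pattern.compile("^\\s*([^.\\s]+)\\.(\\d+)(?:\\s+GI:(\\d+))?\\s*$");

    final String accession; // would go to bioentry.accession
    final int version;      // would go to bioentry.version
    final String gi;        // would go to bioentry.identifier; null if absent

    private VersionLine(String accession, int version, String gi) {
        this.accession = accession;
        this.version = version;
        this.gi = gi;
    }

    static VersionLine parse(String value) {
        Matcher m = FORMAT.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised VERSION value: " + value);
        }
        return new VersionLine(m.group(1), Integer.parseInt(m.group(2)), m.group(3));
    }

    public static void main(String[] args) {
        VersionLine v = parse("AJ459190.1  GI:12345");
        // prints "AJ459190 / 1 / 12345"
        System.out.println(v.accession + " / " + v.version + " / " + v.gi);
    }
}
```

If each toolkit split the value in this way before writing, the three
columns would hold the same thing regardless of which toolkit did the
insert.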

Hth,

	-hilmar

On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:

> Dear all,
>
> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted  
> previously by my BioJava application, I have:
>
> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>
> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',  
> 'genbank_accessions', 'TITLE', 'cross_references',  
> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',  
> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',  
> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',  
> 'CIRCULAR']
>
> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
> Debug on Seq: EF631597.1 =  ['cross_references', 'dates',  
> 'references', 'gi', 'data_file_division']
>
>
> Once I look in the table bioentry_qualifier_value
>
> * 20 records for a Sequence imported by BioJava
> * 1 only for a Sequence inserted by BioPython: the date which  
> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>
> Quite a few annotations missing, no?
>
> Any idea?
>
> Eric
>
>
>
>
>        
> ______________________________________________________________________ 
> _______
> Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers  
> Yahoo! Mail
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Thu Nov  8 20:30:29 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:30:29 -0500
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <473336E6.6000100@ebi.ac.uk>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
Message-ID: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>

It seems BioPerl and Biopython both want (and have traditionally  
used) lowercase - do you mind going with that for Biojava as well, or  
alternatively, simply map upon insert/update and retrieve?

	-hilmar

On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:

> we do need a consensus here.
>
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
>
> cheers,
> Richard
>
> Hilmar Lapp wrote:
>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we
>> explicitly lowercase the value found for alphabet, and the comment
>> says why:
>>
>>          # Note: Biojava uses upper-case terms for alphabet, so we
>>          # need to change to all-lower in case the sequence was
>>          # manipulated by Biojava.
>>          $obj->alphabet(lc($rows->[3])) if $rows->[3];
>>
>> However, when inserting sequences, we leave the value as is in
>> BioPerl (which is lowercase), leading to a potential problem for
>> Biojava upon retrieval. Do the Biojava folks deal with that? Should
>> this maybe be harmonized across the board?
>>
>> 	-hilmar
>>
>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:
>>
>>> Dear Peter,
>>>
>>> All the alphabets are "DNA" (upper case) in my database. The
>>> sequences are taken from NCBI by a BioJava application.
>>> Thus it should be BioJava that inserts the records with "DNA". Thus
>>> no potential "hidden bug" in BioPython.
>>>
>>> Maybe a point to share with the Open-Bio committee.
>>>
>>> Eric
>>>
>>> ----- Original Message ----
>>> From: Peter <biopython at maubp.freeserve.co.uk>
>>> To: Eric Gibert <ericgibert at yahoo.fr>
>>> Cc: biopython at lists.open-bio.org
>>> Sent: Thursday, 8 November 2007, 19:40:00
>>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>>
>>> Eric Gibert wrote:
>>>> Dear all,
>>>>
>>>> In BioSQL/BioSeq.py, in the class DBSeq definition, we have the
>>>> function:
>>>>
>>>> ...
>>>>
>>>> please note my correction: force moltype to be turned into lower case, as
>>>> my database has upper-case values! Otherwise this raises the "Unknown
>>>> moltype" error.
>>> Hi Eric, I've made your suggested change in CVS,
>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>>
>>> I would encourage you to investigate why some of the "alphabet"  
>>> fields
>>> in the biosequence table are in upper case.  There could be a bug
>>> elsewhere which is writing these entries with the wrong  
>>> alphabet.  Is
>>> this affecting all entries, or just some?
>>>
>>> Peter
>>>
>>> _______________________________________________
>>> BioPython mailing list  -  BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From holland at ebi.ac.uk  Fri Nov  9 08:39:01 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 09 Nov 2007 08:39:01 +0000
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
	<9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
Message-ID: <47341CA5.9080509@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

i'll see what i can do.

Hilmar Lapp wrote:
> It seems BioPerl and Biopython both want (and have traditionally used)
> lowercase - do you mind going with that for Biojava as well, or
> alternatively, simply map upon insert/update and retrieve?
> 
>     -hilmar
> 
> On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:
> 
> we do need a consensus here.
> 
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
> 
> cheers,
> Richard
> 
> Hilmar Lapp wrote:
>>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we
>>>> explicitly lowercase the value found for alphabet, and the comment
>>>> says why:
>>>>
>>>>          # Note: Biojava uses upper-case terms for alphabet, so we
>>>>          # need to change to all-lower in case the sequence was
>>>>          # manipulated by Biojava.
>>>>          $obj->alphabet(lc($rows->[3])) if $rows->[3];
>>>>
>>>> However, when inserting sequences, we leave the value as is in
>>>> BioPerl (which is lowercase), leading to a potential problem for
>>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should
>>>> this maybe be harmonized across the board?
>>>>
>>>>     -hilmar
>>>>
>>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:
>>>>
>>>>> Dear Peter,
>>>>>
>>>>> All the alphabets are "DNA" (upper case) in my database. The
>>>>> sequences are taken from NCBI by a BioJava application.
>>>>> Thus it should be BioJava that inserts the records with "DNA". Thus
>>>>> no potential "hidden bug" in BioPython.
>>>>>
>>>>> Maybe a point to share with the Open-Bio committee.
>>>>>
>>>>> Eric
>>>>>
>>>>> ----- Original Message ----
>>>>> From: Peter <biopython at maubp.freeserve.co.uk>
>>>>> To: Eric Gibert <ericgibert at yahoo.fr>
>>>>> Cc: biopython at lists.open-bio.org
>>>>> Sent: Thursday, 8 November 2007, 19:40:00
>>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>>>>
>>>>> Eric Gibert wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> In BioSQL/BioSeq.py, in the class DBSeq definition, we have the
>>>>>> function:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> please note my correction: force moltype to be turned into lower case, as
>>>>>> my database has upper-case values! Otherwise this raises the "Unknown
>>>>>> moltype" error.
>>>>> Hi Eric, I've made your suggested change in CVS,
>>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>>>>
>>>>> I would encourage you to investigate why some of the "alphabet" fields
>>>>> in the biosequence table are in upper case.  There could be a bug
>>>>> elsewhere which is writing these entries with the wrong alphabet.  Is
>>>>> this affecting all entries, or just some?
>>>>>
>>>>> Peter
>>>>>
>>>>> _______________________________________________
>>>>> BioPython mailing list  -  BioPython at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>>>

> --===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHNByl4C5LeMEKA/QRAmCzAJ9fxSm8l5YAEHAUe2hH+Gwc1Xe5IwCfcMf6
c9sy8lASDV069FQJ79Geemw=
=RHM1
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Fri Nov  9 12:42:38 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 09 Nov 2007 12:42:38 +0000
Subject: [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
	<9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
Message-ID: <473455BE.6040807@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I did a bit of poking around in our code and internally BioJava
represents all the default alphabet names (Protein, DNA, etc.) in upper
case. It also allows for mixed case alphabet names.

It's not quite as easy as I thought to change these to lower case as
they are often referenced by text name, meaning other people's code
might break if I change them.

Also, as it allows for mixed-case alphabet names, I can't do a
toUpper/toLower fudge on persistence to BioSQL, as I wouldn't
necessarily get out what I put in!
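
To make the round-trip problem concrete, here is a minimal sketch. The
store/load methods are stand-ins for real persistence code, and the
mixed-case name "MyCustomAlphabet" is invented for the example; neither
is taken from the BioJava codebase.

```java
import java.util.Locale;

class CaseRoundTrip {

    // What a "fudge" on the way into BioSQL would do: force lower case.
    static String store(String alphabetName) {
        return alphabetName.toLowerCase(Locale.ROOT);
    }

    // What the matching fudge on the way out would do: force upper case.
    static String load(String storedName) {
        return storedName.toUpperCase(Locale.ROOT);
    }

    // True only if the name comes back exactly as it went in.
    static boolean survives(String alphabetName) {
        return load(store(alphabetName)).equals(alphabetName);
    }

    public static void main(String[] args) {
        System.out.println(survives("DNA"));              // prints "true"
        System.out.println(survives("MyCustomAlphabet")); // prints "false"
    }
}
```

The all-upper-case default names survive the trip, but any mixed-case
name comes back changed, so code that looks the alphabet up by its text
name would no longer find it.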

So, I think I'll add this as a point on the recently announced BioJava 3
proposal, that BioSQL interaction must be compliant with standards laid
down by the BioSQL project, and that our code will be able to cope with
this internally.
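
One way the case mismatch could be handled at the persistence boundary,
without a blanket toUpper/toLower, is to translate only the known default
alphabet names and pass every other name through untouched. The sketch below
is plain Java, not BioJava code: the class name, the method names and the
list of default names are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: translates only the default alphabet names at the BioSQL boundary. */
final class AlphabetNameMapper {
    // Assumed defaults: upper case internally, lower case in the database.
    private static final String[] DEFAULTS = {"DNA", "RNA", "PROTEIN"};
    private static final Map<String, String> TO_DB = new HashMap<>();
    private static final Map<String, String> FROM_DB = new HashMap<>();

    static {
        for (String internal : DEFAULTS) {
            String stored = internal.toLowerCase(Locale.ROOT);
            TO_DB.put(internal, stored);
            FROM_DB.put(stored, internal);
        }
    }

    /** Name to write to the database; custom names are stored exactly as given. */
    static String toDatabase(String internalName) {
        return TO_DB.getOrDefault(internalName, internalName);
    }

    /** Name to use internally; rows already written in upper case come back unchanged. */
    static String fromDatabase(String storedName) {
        return FROM_DB.getOrDefault(storedName, storedName);
    }

    public static void main(String[] args) {
        System.out.println(toDatabase("DNA"));          // default name, translated
        System.out.println(fromDatabase("dna"));        // translated back
        System.out.println(fromDatabase("MyAlphabet")); // custom name, untouched
    }
}
```

Mixed-case custom names round-trip unchanged. The one case this does not
cover is a custom alphabet whose name is exactly a lower-cased default
(e.g. "dna"), which would come back as the default.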

That brings us back to BioSQL standards - the idea of a mini-hackathon
to solve this once and for all is a very good one. Our previous attempts
between BioPerl and BioJava in Singapore were good, but still there are
niggles as seen in this thread of discussion. It seems that a schema on
its own just isn't enough to make the various projects play nicely, and
instructions are needed on exactly how to use that schema if they are
truly all going to be able to use it without caring who or what wrote
the data that is being read.

cheers,
Richard


Hilmar Lapp wrote:
> It seems BioPerl and Biopython both want (and have traditionally used)
> lowercase - do you mind going with that for Biojava as well, or
> alternatively, simply map upon insert/update and retrieve?
> 
>     -hilmar
> 
> On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:
> 
> we do need a consensus here.
> 
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
> 
> cheers,
> Richard
> 
> Hilmar Lapp wrote:
>>>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we
>>>> explicitly lowercase the value found for alphabet, and the comment
>>>> says why:
>>>>
>>>>          # Note: Biojava uses upper-case terms for alphabet, so we
>>>>          # need to change to all-lower in case the sequence was
>>>>          # manipulated by Biojava.
>>>>          $obj->alphabet(lc($rows->[3])) if $rows->[3];
>>>>
>>>> However, when inserting sequences, we leave the value as is in
>>>> BioPerl (which is lowercase), leading to a potential problem for
>>>> Biojava upon retrieval. Do the Biojava folks deal with that? Should
>>>> this be harmonized across the board?
>>>>
>>>>     -hilmar
>>>>
>>>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:
>>>>
>>>>> Dear Peter,
>>>>>
>>>>> All the alphabet values are "DNA" (upper case) in my database. The
>>>>> sequences are taken from NCBI by a BioJava application.
>>>>> Thus it should be BioJava that inserts the records with "DNA", so
>>>>> there is no potential "hidden bug" in BioPython.
>>>>>
>>>>> Maybe a point to share with the Open-Bio committee.
>>>>>
>>>>> Eric
>>>>>
>>>>> ----- Original Message ----
>>>>> From: Peter <biopython at maubp.freeserve.co.uk>
>>>>> To: Eric Gibert <ericgibert at yahoo.fr>
>>>>> Cc: biopython at lists.open-bio.org
>>>>> Sent: Thursday, 8 November 2007, 19:40:00
>>>>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>>>>
>>>>> Eric Gibert wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the
>>>>>> function:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> please note my correction: I force moltype to lower case, as my
>>>>>> database has upper-case values, which otherwise raise the
>>>>>> "Unknown moltype" error.
>>>>> Hi Eric, I've made your suggested change in CVS,
>>>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>>>>
>>>>> I would encourage you to investigate why some of the "alphabet" fields
>>>>> in the biosequence table are in upper case.  There could be a bug
>>>>> elsewhere which is writing these entries with the wrong alphabet.  Is
>>>>> this affecting all entries, or just some?
>>>>>
>>>>> Peter
>>>>>
>>>>> _______________________________________________
>>>>> BioPython mailing list  -  BioPython at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>>>

> --===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHNFW84C5LeMEKA/QRApBiAJ41WqCDKOJhee5NxIsquYaR/ImBRgCfb7zM
LX75HHvCUC/v4n3okmUQ+ME=
=d6QO
-----END PGP SIGNATURE-----


From email2ants at gmail.com  Fri Nov  9 17:55:36 2007
From: email2ants at gmail.com (Anthony Underwood)
Date: Fri, 9 Nov 2007 17:55:36 +0000
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
Message-ID: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>

Hi All,

I've generated an alignment and I am retrieving positions within the  
alignment using

Symbol base = alignment.symbolAt(label, i);

I am trying to get whether the base at this position is G, A, T or C

However when I use base.getName() it returns strings such as "thymine"

The documentation states that the method getToken should also be  
available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html

Is there a simple way of retrieving a one letter textual  
representation of the symbol?


Many thanks


Anthony


From zagato.gekko at gmail.com  Fri Nov  9 18:48:02 2007
From: zagato.gekko at gmail.com (Zagato)
Date: Fri, 9 Nov 2007 13:48:02 -0500
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
In-Reply-To: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>
References: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>
Message-ID: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com>

Try with:
String s = alignment.symbolListForLabel( label ).subStr( i, i );

Bye...

Alan Jairo Acosta
Cali - Colombia

On Nov 9, 2007 12:55 PM, Anthony Underwood <email2ants at gmail.com> wrote:

> Hi All,
>
> I've generated an alignment and I am retrieving positions within the
> alignment using
>
> Symbol base = alignment.symbolAt(label, i);
>
> I am trying to get whether the base at this position is G, A, T or C
>
> However when I use base.getName() it returns strings such as "thymine"
>
> The documentation states that the method getToken should also be
> available, but this returns method undefined.
> http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
>
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
>
>
> Many thanks
>
>
> Anthony
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Farewell.
http://www.youtube.com/zagatogekko
ruby << __EOF__
 puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse
__EOF__


From gwaldon at geneinfinity.org  Fri Nov  9 18:45:10 2007
From: gwaldon at geneinfinity.org (George Waldon)
Date: Fri, 09 Nov 2007 10:45:10 -0800
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
Message-ID: <20071109184510.80580.qmail@mmm1924.dulles19-verio.com>

Tokens are associated with alphabets.

Get the tokenization from the alphabet using:
SymbolTokenization tokenization = alphabet.getTokenization("token");

Get the token from the tokenization using:
String token = tokenization.tokenizeSymbol(symbol);

Also, check the tutorial and the cookbook on the BioJava web site at www.biojava.org, which are often more informative than the javadoc.

Frankly speaking, I agree with you and we should have a method like
String token = symbol.getToken(alphabet, "token");
to do these operations simply and without losing our hair!

Best luck,
George


> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
> bounces at lists.open-bio.org] On Behalf Of Anthony Underwood
> Sent: Friday, November 09, 2007 9:56 AM
> To: BioJava
> Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
> 
> Hi All,
> 
> I've generated an alignment and I am retrieving positions within the
> alignment using
> 
> Symbol base = alignment.symbolAt(label, i);
> 
> I am trying to get whether the base at this position is G, A, T or C
> 
> However when I use base.getName() it returns strings such as "thymine"
> 
> The documentation states that the method getToken should also be
> available, but this returns method undefined.
> http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
> 
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
> 
> 
> Many thanks
> 
> 
> Anthony
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists


From email2ants at gmail.com  Fri Nov  9 23:23:01 2007
From: email2ants at gmail.com (Anthony Underwood)
Date: Fri, 9 Nov 2007 23:23:01 +0000
Subject: [Biojava-l] Getting a base from an alignment (way to complex?)
In-Reply-To: <98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com>
References: <B394BA8E-E112-44E7-A2B6-10A189128D10@gmail.com>
	<98028b00711091048k26c61fc7qc68b14d8d289c769@mail.gmail.com>
Message-ID: <70FC5536-E1B3-41C7-92BC-0B43A0E11E09@gmail.com>

Hi Alan,

Thanks for the suggestion. That was my first thought, but then I was  
thinking for amino acids this wouldn't work. I would have to use a  
hashmap to convert the amino acid to the appropriate single letter code.

Hi George, I'll try your suggestion. As you say I think this is too  
much for something that should be a one liner. Thanks for your advice.

> Get the tokenization from the alphabet using:
> SymbolTokenization = Alphabet.getTokenization("token");
>
> Get the token from the tokenization using:
> String = SymbolTokenization.tokenizeSymbol(Symbol);

Thanks to both of you

Anthony

On 9 Nov 2007, at 18:48, Zagato wrote:

> Try with:
> String s = alignment.symbolListForLabel( label ).subStr( i, i+1 );
>
> Bye...
>
> Alan Jairo Acosta
> Cali - Colombia
>
> On Nov 9, 2007 12:55 PM, Anthony Underwood < email2ants at gmail.com>  
> wrote:
> Hi All,
>
> I've generated an alignment and I am retrieving positions within the
> alignment using
>
> Symbol base = alignment.symbolAt(label, i);
>
> I am trying to get whether the base at this position is G, A, T or C
>
> However when I use base.getName() it returns strings such as "thymine"
>
> The documentation states that the method getToken should also be
> available, but this returns method undefined. http://www.biojava.org/docs/api15/org/biojava/bio/symbol/Symbol.html
>
> Is there a simple way of retrieving a one letter textual
> representation of the symbol?
>
>
> Many thanks
>
>
> Anthony
> _______________________________________________
> Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
> -- 
> Farewell.
> http://www.youtube.com/zagatogekko
> ruby << __EOF__
>  puts [ 111, 116, 97, 103, 97, 90 ].collect{|v| v.chr}.join.reverse
> __EOF__


From hlapp at gmx.net  Sat Nov 10 20:38:17 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 10 Nov 2007 15:38:17 -0500
Subject: [Biojava-l] error on insert new sequences from GenBank: no
	annotations saved in BioSQL database
In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
	<47336117.2010102@maubp.freeserve.co.uk>
	<001c01c8238b$2ec64070$6400a8c0@Gecko>
Message-ID: <5DDEBCDE-C8DA-4B2C-86F4-47FDB82CADAC@gmx.net>

Just a few comments below, specifically where no rows would in fact  
be what I expect:

On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote:

> [...]
> --------  For you information, I went thru the tables of my BioSQL  
> database:
> [...]
> 1) table bioentry: all column populated except for 'taxon_id' which  
> is NULL
> (maybe I need an extra call for populating the 'taxon' table before?)

Bioperl-db will try to look up (or create if necessary) the taxon
from the taxon information attached to the sequence, but for BioPerl
we actually recommend pre-loading the database with the NCBI
taxonomy, which can be done comfortably with the script
load_ncbi_taxonomy.pl that comes with BioSQL.

>
> 2) table bioentry_dbxref: no data inserted (always empty, even with  
> BioJava)

This would mean that the sequence(s) have no dbxrefs. Note that for  
GenBank sequences that would be expected, since unfortunately, and  
unlike EMBL format, GenBank puts the dbxrefs into the feature table.

> 3) table bioentry_qualifier_value:
>
> One entry only, for the 'term_id' = 149, rank = 1, and value = '07- 
> JUL-2005'
> or other 'DD-MMM-YYYY' dates (see my remarks below)

Below you say that your term table is empty, so I don't know how you
can have a value here at all.

> [...]
> 5) table bioentry_relationships: no entry found (always empty, even  
> with
> BioJava)

If you load sequences, they won't have direct relationships to other  
sequences (except dbxrefs, but those are rather 'pointers' and are  
stored in their own table).

In Bioperl-db, this table is used only if you load sequence clusters  
through Bio::Cluster objects (such as UniGene).

> [...]
> 7) table comment: no entry found (always empty, even with BioJava)

Again, this is expected with GenBank. AFAIK genbank format doesn't  
allow for comments at the level of the sequence. You would (i.e.,  
should) find entries here if you load UniProt entries.

> 8) table dbxref: some records are generated, for dbname 'PUBMED'  
> and 'Taxon'
> with the correct value

Taxon obviously isn't really a dbxref, but rather a taxon (and hence  
should go into that table).

> [...]
> 9) table dbxref_qualifier_value: (always empty, even with BioJava)

That's almost expected. There are rather few cases where dbxrefs have
additional attributes that the language can parse out from a source
(and then map to the schema).

> [...]
> 10) table location: all locations loaded correctly, note that  
> 'term_id' and
> 'dbxref_id' remain NULL for these seq but I have value for other seq.

Theoretically, the term_id should point to the term giving the type  
of the location. If you (or Biopython) are only dealing with simple  
('normal') locations, then it's not needed.

The dbxref_id gives the reference to the remote sequence if the
location for a feature refers to a different sequence than the
feature itself does (so-called 'remote locations'). If the sequences
you loaded don't have such locations (or if Biopython doesn't handle
them), then this would be expected to be empty.

> 11) table location_qualifier_value: always empty, even with BioJava

This is expected if Biopython doesn't support fuzzy locations, or if  
none of the feature locations that you loaded are fuzzy.

> [...]
> 13) Table reference: entries correct, note 'dbxref_id' remains NULL  
> for
> these seq but I have value for other seq.

It should point to the pubmed ID for the reference but only if there  
was one.

> 14) table seqfeature: entries are there (same as in table 'location').
> FYI:'display_name is always NULL.

GenBank doesn't give names to features (and I think EMBL doesn't
either), so this is expected.

> 15) table seqfeature_dbxref: always empty, even with BioJava

That's likely more to do with your language object model than with  
anything else. dbxref annotation for features is in tag/value pairs,  
just as any other, so your language (Biopython in this case) will  
have to do a lot of interpretation to tease out the semantics behind  
each tag name and based on that decide what to do with the value.  
Indeed, by default we don't even do this in BioPerl.

> [...]
> 17) table seqfeature_relationship: always empty, even with BioJava

GenBank (and EMBL) feature tables are flat, not hierarchical, so this  
is expected.

> 18) table taxon: always empty, even with BioJava)

This is where the organism should go.

> 19) table taxon_name: I have one but not from this test (I tried to  
> tinker a
> little bit with taxon but stopped)

That's odd that you can have an entry in taxon_name w/o a  
corresponding one in taxon. Do you have foreign key checks disabled?

> 20) table term: always empty, even with BioJava

That's strange, since you say you do have rows in  
bioentry_qualifier_value, which has an enforced foreign key to term.  
Did you disable the foreign key checks?

> 21) table term_dbxref: always empty, even with BioJava

That's expected unless you loaded an ontology whose terms have  
dbxrefs, and your language object model supports that.

> [...]
> 23) table term_synonym: always empty, even with BioJava

Same as for 21). Your terms would have to have synonyms, and your  
language object model would have to support those, before you could  
expect to get anything in here.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From shirleyc at cis.upenn.edu  Tue Nov 13 18:45:59 2007
From: shirleyc at cis.upenn.edu (Shirley Cohen)
Date: Tue, 13 Nov 2007 13:45:59 -0500
Subject: [Biojava-l] maximum parsimony search
Message-ID: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>

Hi BioJava People,

I'm looking for existing code that implements a maximum parsimony  
search in Java. Does BioJava have this functionality? If so, can you  
point me to the appropriate classes?

Thanks,

Shirley


From bmduggan at yahoo.com  Wed Nov 14 00:48:22 2007
From: bmduggan at yahoo.com (Brendan Duggan)
Date: Wed, 14 Nov 2007 11:48:22 +1100 (EST)
Subject: [Biojava-l] Disulfide information in PDB files
Message-ID: <454510.91557.qm@web52705.mail.re2.yahoo.com>

Greetings

I'm trying to mine some information on disulfides in
the PDB and was hoping there might be a way of
obtaining this information with the BioJava PDB
parser.  However, I haven't been able to see anything
like this mentioned in the API docs.  If it is
currently not possible to extract disulfide
information from PDB files are there any plans to
implement this?

Thanks!

Brendan




From holland at ebi.ac.uk  Wed Nov 14 08:50:31 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 14 Nov 2007 08:50:31 +0000
Subject: [Biojava-l] maximum parsimony search
In-Reply-To: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>
References: <3001DEBB-AD61-4089-AE42-910AAC097D99@cis.upenn.edu>
Message-ID: <473AB6D7.2010405@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

There is a class currently only available from the head of CVS - i.e. it
has not been released yet. To get it you'll need to check out the very
latest BioJava source code from CVS.

The JavaDoc for the class is here:

http://www.spice-3d.org/public-files/javadoc/biojava/org/biojavax/bio/phylo/ParsimonyTreeMethod.html

It is designed to take input in the form of blocks of data similar to
what you would find in a Nexus file (the Nexus file parsers elsewhere in
the org/biojavax/bio/phylo package will provide these). However you
could of course create such objects from your own data without needing
to read/write any Nexus files.

cheers,
Richard


Shirley Cohen wrote:
> Hi BioJava People,
> 
> I'm looking for existing code that implements a maximum parsimony  
> search in Java. Does BioJava have this functionality? If so, can you  
> point me to the appropriate classes?
> 
> Thanks,
> 
> Shirley
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHOrbW4C5LeMEKA/QRAuswAJ9olIwj7DGszOnKORU255YS3m2ohACfbKTw
ihjuQVv0j+nlXb+4SL5pIfw=
=ldfM
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Wed Nov 14 08:55:24 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 14 Nov 2007 08:55:24 +0000
Subject: [Biojava-l] Disulfide information in PDB files
In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com>
References: <454510.91557.qm@web52705.mail.re2.yahoo.com>
Message-ID: <473AB7FC.10403@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Currently this is not parsed - the parser does not read all the tags in
the most recent PDB specification.

Could you open a bug request at http://bugzilla.open-bio.org/ to
formally add this to our to-do list? Thanks!

cheers,
Richard

Brendan Duggan wrote:
> Greetings
> 
> I'm trying to mine some information on disulfides in
> the PDB and was hoping there might be a way of
> obtaining this information with the BioJava PDB
> parser.  However, I haven't been able to see anything
> like this mentioned in the API docs.  If it is
> currently not possible to extract disulfide
> information from PDB files are there any plans to
> implement this?
> 
> Thanks!
> 
> Brendan
> 
> 
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHOrf84C5LeMEKA/QRArfeAJ9nCViM2jyVfubIpl5w/1EXMYTv/gCgjVEs
zDnxHjv8xJsRBw5pfE2NdkA=
=tGqm
-----END PGP SIGNATURE-----


From ap3 at sanger.ac.uk  Wed Nov 14 09:32:28 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Wed, 14 Nov 2007 09:32:28 +0000
Subject: [Biojava-l] Disulfide information in PDB files
In-Reply-To: <454510.91557.qm@web52705.mail.re2.yahoo.com>
References: <454510.91557.qm@web52705.mail.re2.yahoo.com>
Message-ID: <9B898ADF-78EB-4B5C-A432-98274190815F@sanger.ac.uk>

Hi Brendan,

SSBOND lines are currently not parsed. If this is what you need,
I can add this over the next couple of days.

If you want to compute the bonds yourself, the framework can
e.g. calculate distances between the sulphur atoms for you.
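
The distance approach can be sketched in plain Java, independent of the
BioJava structure classes. The sketch reads the SG atoms of CYS residues
from PDB ATOM records (fixed columns as in the PDB format) and reports every
pair closer than a cutoff. The class and method names are made up for this
example, and the 2.5 Angstrom cutoff is an assumption (a disulfide S-S bond
is roughly 2 Angstrom long).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: candidate disulfide bonds from the SG atoms of CYS residues in PDB ATOM records. */
final class DisulfideSketch {
    /** Assumed cutoff in Angstrom; tune it for your own data. */
    static final double MAX_SS_DISTANCE = 2.5;

    /** One cysteine gamma-sulphur atom. */
    static final class Sulphur {
        final char chain;
        final int resSeq;
        final double x, y, z;

        Sulphur(char chain, int resSeq, double x, double y, double z) {
            this.chain = chain; this.resSeq = resSeq;
            this.x = x; this.y = y; this.z = z;
        }

        String label() { return "" + chain + resSeq; }
    }

    /** Builds a minimal ATOM record (columns 1-54 only), for demonstration and testing. */
    static String atomLine(int serial, String atomName, String resName,
                           char chain, int resSeq, double x, double y, double z) {
        return String.format(Locale.ROOT, "ATOM  %5d %-4s %3s %c%4d    %8.3f%8.3f%8.3f",
                serial, atomName, resName, chain, resSeq, x, y, z);
    }

    /** Keeps only the SG atoms of CYS residues. */
    static List<Sulphur> parseSulphurs(List<String> pdbLines) {
        List<Sulphur> atoms = new ArrayList<>();
        for (String line : pdbLines) {
            if (line.length() < 54 || !line.startsWith("ATOM")) continue;
            String atomName = line.substring(12, 16).trim();   // columns 13-16
            String resName = line.substring(17, 20);           // columns 18-20
            if (!atomName.equals("SG") || !resName.equals("CYS")) continue;
            atoms.add(new Sulphur(
                    line.charAt(21),                                   // column 22
                    Integer.parseInt(line.substring(22, 26).trim()),   // columns 23-26
                    Double.parseDouble(line.substring(30, 38).trim()), // columns 31-38
                    Double.parseDouble(line.substring(38, 46).trim()), // columns 39-46
                    Double.parseDouble(line.substring(46, 54).trim()))); // columns 47-54
        }
        return atoms;
    }

    static double distance(Sulphur a, Sulphur b) {
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Returns all SG pairs within the cutoff, formatted like "A6-A127". */
    static List<String> candidateBonds(List<Sulphur> atoms) {
        List<String> bonds = new ArrayList<>();
        for (int i = 0; i < atoms.size(); i++) {
            for (int j = i + 1; j < atoms.size(); j++) {
                if (distance(atoms.get(i), atoms.get(j)) <= MAX_SS_DISTANCE) {
                    bonds.add(atoms.get(i).label() + "-" + atoms.get(j).label());
                }
            }
        }
        return bonds;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(atomLine(1, " SG ", "CYS", 'A', 6, 0.0, 0.0, 0.0));
        lines.add(atomLine(2, " SG ", "CYS", 'A', 127, 0.0, 0.0, 2.04));
        lines.add(atomLine(3, " SG ", "CYS", 'A', 30, 10.0, 0.0, 0.0));
        System.out.println(candidateBonds(parseSulphurs(lines)));
    }
}
```

The pairwise loop is quadratic in the number of cysteines, which should be
fine for single structures. This only finds geometric candidates; it does
not read the SSBOND records themselves.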

Andreas


On 14 Nov 2007, at 00:48, Brendan Duggan wrote:

> Greetings
>
> I'm trying to mine some information on disulfides in
> the PDB and was hoping there might be a way of
> obtaining this information with the BioJava PDB
> parser.  However, I haven't been able to see anything
> like this mentioned in the API docs.  If it is
> currently not possible to extract disulfide
> information from PDB files are there any plans to
> implement this?
>
> Thanks!
>
> Brendan
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From deb at mb.au.dk  Thu Nov 15 12:04:02 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Thu, 15 Nov 2007 13:04:02 +0100
Subject: [Biojava-l] Parsing exising gaps
Message-ID: <002701c8277f$9dbdca50$d9395ef0$@au.dk>

Dear all,

 
I have managed to read an MSF-formatted alignment from a file selected
through FileChooser as follows:

 
  BufferedReader br = new BufferedReader(
      new FileReader(aFileChooser.getSelectedFile()));

  SimpleAlignment align =
      (SimpleAlignment) SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

 
I can now retrieve the sequence names and sequences through the Alignment
object:

 
  Iterator aLabels = align.getLabels().iterator();

  Iterator aSequences = align.symbolListIterator();

 
However, I now want to be able to translate between real sequence numbers
and the positions within each alignment string, i.e. retrieve positions that
remove the gaps first (gaps are represented by hyphens '-' in the MSF
format). How can I tell BioJava to parse the gaps into a GappedSequence
format? I have tried the following to check what position 15 (past the
first gap) translates into:


  int n = 0;

  while(aSequences.hasNext()) {

      SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();

      SimpleGappedSequence aGapped = new SimpleGappedSequence(
          new SimpleSequence(aSym, "", aLabels.next().toString(), null));

      System.out.println(aGapped.gappedToLocation(new PointLocation(15)));

  }

 
But I only get 15 back out. I have also studied the constructor of the
underlying SimpleGappedSymbolList but it simply copies the SymbolList and
creates one big block:

 
  public SimpleGappedSymbolList(SymbolList source) {

    this.source = source;

    this.alpha = source.getAlphabet();

    this.blocks = new ArrayList();

    this.length = source.length();

    Block b = new Block(1, length, 1, length);

    blocks.add(b);

  }

 
Is there a way to tell SimpleGappedSequence to parse itself in terms of the
gap characters in the sequence string? How is the sequence represented in
this case, if not by gaps? Surely the hyphen cannot be a part of the
standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?

 
Best wishes,

 
  Ditlev

 
--

 
Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

 
Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb

 
From holland at ebi.ac.uk  Thu Nov 15 13:51:48 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 15 Nov 2007 13:51:48 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
Message-ID: <473C4EF4.5080301@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think you've uncovered a number of problems here:

1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
all the other predefined alphabets.

2. The MSF parser doesn't bother trying to build GappedSequence
instances, instead it just builds solid sequences with the gaps as
normal symbols.

3. There is no constructor or method for taking a sequence with embedded
gap symbols and turning it into a GappedSequence with separate chunks.

Combined, these three problems make it impossible to do what you want
easily. I will make a note to fix this on the plans for the next BioJava
development cycle.

In the meantime, your best bet would be to construct a second alignment
block by iterating over the alignment block you already have and parsing
the locations of the gap symbols. You would create a
SimpleGappedSequence initially over the ungapped sequence, then use the
insert gap methods to insert the gaps into this ungapped sequence before
putting all the SimpleGappedSequence objects together into a new alignment.
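
The bookkeeping behind that workaround can be sketched in plain Java on the
aligned string itself, before any BioJava objects are involved. The class
and method names below are hypothetical. The sketch strips the gaps, records
where each run of gaps starts and how long it is (the information an
insert-gap call would need), and converts between 1-based alignment columns
and 1-based residue numbers.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: coordinate mapping between a gapped alignment row and its ungapped sequence. */
final class GapIndexSketch {
    static final char GAP = '-';

    /** Returns the row with all gap characters removed. */
    static String ungapped(String aligned) {
        StringBuilder sb = new StringBuilder();
        for (char c : aligned.toCharArray()) {
            if (c != GAP) sb.append(c);
        }
        return sb.toString();
    }

    /** Returns each run of gaps as {1-based start column, run length}. */
    static List<int[]> gapRuns(String aligned) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < aligned.length()) {
            if (aligned.charAt(i) == GAP) {
                int start = i;
                while (i < aligned.length() && aligned.charAt(i) == GAP) i++;
                runs.add(new int[] {start + 1, i - start});
            } else {
                i++;
            }
        }
        return runs;
    }

    /** Maps a 1-based alignment column to a 1-based residue number, or 0 if the column is a gap. */
    static int columnToResidue(String aligned, int column) {
        if (column < 1 || column > aligned.length()) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        if (aligned.charAt(column - 1) == GAP) return 0;
        int residue = 0;
        for (int i = 0; i < column; i++) {
            if (aligned.charAt(i) != GAP) residue++;
        }
        return residue;
    }

    /** Maps a 1-based residue number to its 1-based alignment column. */
    static int residueToColumn(String aligned, int residue) {
        int seen = 0;
        for (int i = 0; i < aligned.length(); i++) {
            if (aligned.charAt(i) != GAP && ++seen == residue) return i + 1;
        }
        throw new IndexOutOfBoundsException("residue " + residue);
    }

    public static void main(String[] args) {
        String row = "MK--LV-A";
        System.out.println(ungapped(row));
        System.out.println(columnToResidue(row, 5));
        System.out.println(residueToColumn(row, 5));
    }
}
```

Each lookup rescans the row, which keeps the sketch short. For long
alignments a precomputed column-to-residue array would avoid the repeated
scans.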

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Dear all,
> 
>  
> 
> I have managed to read an MSF-formatted alignment from a file selected
> through FileChooser as follows:
> 
>  
> 
>   BufferedReader br = new BufferedReader(new
> FileReader(aFileChooser.getSelectedFile()));
> 
>   SimpleAlignment align =
> (SimpleAlignment)SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
> 
>  
> 
> I can now retrieve the sequence names and sequences through the Alignment
> object:
> 
>  
> 
>   Iterator aLabels = align.getLabels().iterator();
> 
>   Iterator aSequences = align.symbolListIterator();
> 
>  
> 
> However, I now what to be able to translate between real sequence numbers
> and the positions within each alignment string, i.e. retrieve positions that
> remove the gaps first (gaps are represented by hyphens '-' in the MSF
> format). How can I tell BioJava to parse the gaps into an GappedSequence
> format? I have tried the following to check what position 15 (past the the
> first gap) translates into:
> 
>  
> 
>   int n = 0;
> 
>   while(aSequences.hasNext()) {
> 
>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
> 
>       SimpleGappedSequence aGapped = new SimpleGappedSequence(new
> SimpleSequence(aSym, "", aLabels.next().toString(), null));
> 
>       System.out.println(aGapped.gappedToLocation(new PointLocation(15)));
> 
>   }
> 
>  
> 
> But I only get 15 back out. I have also studied the constructor of the
> underlying SimpleGappedSymbolList but it simply copies the SymbolList and
> creates one big block:
> 
>  
> 
>   public SimpleGappedSymbolList(SymbolList source) {
> 
>     this.source = source;
> 
>     this.alpha = source.getAlphabet();
> 
>     this.blocks = new ArrayList();
> 
>     this.length = source.length();
> 
>     Block b = new Block(1, length, 1, length);
> 
>     blocks.add(b);
> 
>   }
> 
>  
> 
> Is there a way to tell SimpleGappedSequence to parse itself in terms of the
> gap characters in the sequence string? How is the sequence represented in
> this case, if not by gaps? Surely the hyphen cannot be a part of the
> standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?
> 
>  
> 
> Best wishes,
> 
>  
> 
>   Ditlev
> 
>  
> 
> --
> 
>  
> 
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
> 
>  
> 
> Department of Molecular Biology   Office:  +45 89425259
> University of AarhusLab:     +45 89425022
> Gustav Wieds Vej 10cFax:     +45 86123178
> DK-8000 Aarhus C    Email:    <mailto:deb at mb.au.dk> deb at mb.au.dk
> Denmark             Lab WWW:  <http://bioxray.dk/~deb> www.bioxray.dk/~deb
> 
>  
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPE704C5LeMEKA/QRAniIAJsGv+5HIP3mCDxBIUdw0SjDrWu8dgCeNviA
EsJK4gv+EVY7wc4r6W2A0+I=
=wCQs
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Fri Nov 16 08:59:41 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 08:59:41 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
Message-ID: <473D5BFD.8080305@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Ditlev.

After some investigation and some helpful hints from Mark, it turns out
that there are methods in DNATools/ProteinTools that can construct
proper GappedSymbolList objects out of strings.

I have managed to modify the MSF parser to use this instead. This means
that the MSF parser will now return instances of GappedSymbolList
(actually GappedSequences to be accurate) rather than SimpleSymbolList.
Thanks to the way the APIs work this will make no difference to existing
users (except those who are depending on being able to cast it to a
certain type - which they shouldn't, because the API doesn't guarantee
it to be of any type!), but it will fix it for you. Future releases will
modify the API (or include a completely new MSF parser) which will
explicitly return GappedSymbolLists in the API declarations rather than
plain SymbolLists, but I can't do that right now because it would break
existing users' code.
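
To make the gapped-versus-plain distinction concrete, here is a minimal
plain-Java sketch of what parsing gaps out of a string involves. It does not
use BioJava at all: the class and method names are made up for illustration,
and the only thing taken from this thread is that '-' marks a gap in MSF rows.

```java
/** Plain-Java sketch; not the BioJava API. All names here are hypothetical. */
final class GappedStringSketch {

    /**
     * Maps each 1-based alignment column to a 1-based position in the
     * ungapped sequence. A value of 0 means the column holds a gap.
     */
    static int[] columnToSource(String gapped, char gapChar) {
        int[] map = new int[gapped.length() + 1]; // index 0 is unused
        int sourcePos = 0;
        for (int col = 1; col <= gapped.length(); col++) {
            if (gapped.charAt(col - 1) == gapChar) {
                map[col] = 0;
            } else {
                map[col] = ++sourcePos;
            }
        }
        return map;
    }

    /** Returns the sequence with every gap character removed. */
    static String ungapped(String gapped, char gapChar) {
        StringBuilder sb = new StringBuilder();
        for (char c : gapped.toCharArray()) {
            if (c != gapChar) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String row = "MK--LV-A";
        int[] map = columnToSource(row, '-');
        System.out.println(ungapped(row, '-'));               // MKLVA
        System.out.println("column 5 -> residue " + map[5]);  // column 5 -> residue 3
    }
}
```

A real gapped list would store blocks rather than a per-column array, but the
coordinate translation it has to answer is the same one shown here.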

To get the modified parser you will need to check out the very latest
source code from our CVS repository and compile it using ant.
Instructions are on our website at biojava.org if you have not done this
before.

Hope this helps you.

cheers,
Richard


Ditlev Egeskov Brodersen wrote:
> Hi Richard,
> 
>   thanks for clarifying this and for your useful suggestion, which I've
> managed to implement as shown below. It works nicely, but I was really
> surprised to learn that biojava hasn't yet implemented a proper parsing of
> gap characters from strings into the object structure as this seems central
> to any use of pre-aligned sequences. Also, I find it problematic that the
> API implements the gap characters as part of the alphabets. In my view, this
> breaks the logic of the object model because proteins don't really have gaps
> in their sequences.
> 
>   Rather, the constructor of the Sequence-derived classes ought to throw an
> exception when non-protein characters are passed and should not allow the
> user to create an object with sequence elements that are non-standard.
> Instead, I think there should be a static method that allows cleaning the
> input sequence before passing it to the Sequence constructor. On the other
> hand, the constructor of the GappedSequence-derived classes should recognise
> the gaps and create an object with blocks of legal protein symbols and gaps
> in the appropriate places.
> 
>   -- Ditlev
> 
>   // Read MSF file into Alignment object
>   BufferedReader br = new BufferedReader(
>       new FileReader(aFileChooser.getSelectedFile()));
>   SimpleAlignment align = (SimpleAlignment)
>       SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
> 
>   // Iterate through sequences in turn
>   Iterator aSequences = align.symbolListIterator();
>   while(aSequences.hasNext()) {
> 
>       // Retrieve SymbolList, the associated gap symbol and sequence string
>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>       Symbol aGapSymbol = aSym.getAlphabet().getGapSymbol();
>       String aGappedString = aSym.seqString();
> 
>       // Prepare non-gapped string
>       String aPlainString = "";
> 
>       // Loop through individual symbols and add non-gap characters to string
>       for(int i=1;i<=aSym.length();i++)
>           if(aSym.symbolAt(i) != aGapSymbol)
>               aPlainString += aGappedString.charAt(i-1);
> 
>       // Create a new gapped sequence object with the plain (non-gapped) sequence
>       SimpleGappedSequence aGapped = (SimpleGappedSequence)
>           ProteinTools.createGappedProteinSequence(aPlainString, "");
> 
>       // Use separate indices for gapped and plain sequences
>       int n = 1;
> 
>       // Loop through individual gapped sequence symbols and insert a gap
>       // into the object when a gap symbol is encountered
>       for(int i=1;i<=aSym.length();i++)
>           if(aSym.symbolAt(i) != aGapSymbol)
>               n++;
>           else
>               aGapped.addGapInSource(n);
>   }
> 
> --
>  
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>  
> www.lindeman-brodersen.dk
> 
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 15 November 2007 14:52
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: [Biojava-l] Parsing exising gaps
>>
> I think you've uncovered a number of problems here:
> 
> 1. The PROTEIN-TERM alphabet does define '-' as a valid symbol, as do
> all the other predefined alphabets.
> 
> 2. The MSF parser doesn't bother trying to build GappedSequence
> instances, instead it just builds solid sequences with the gaps as
> normal symbols.
> 
> 3. There is no constructor or method for taking a sequence with embedded
> gap symbols and turning it into a GappedSequence with separate chunks.
> 
> Combined, these three problems make it impossible to do what you want
> easily. I will make a note to fix this on the plans for the next BioJava
> development cycle.
> 
> In the meantime, your best bet would be to construct a second alignment
> block by iterating over the alignment block you already have and parsing
> the locations of the gap symbols. You would create a SimpleGappedSequence
> initially over the ungapped sequence, then use the insert gap methods to
> insert the gaps into this ungapped sequence before putting all the
> SimpleGappedSequence objects together into a new alignment.
> 
> cheers,
> Richard
> 
> Ditlev Egeskov Brodersen wrote:
>>>> Dear all,
>>>>
>>>>
>>>>
>>>> I have managed to read an MSF-formatted alignment from a file selected
>>>> through FileChooser as follows:
>>>>
>>>>   BufferedReader br = new BufferedReader(
>>>>       new FileReader(aFileChooser.getSelectedFile()));
>>>>
>>>>   SimpleAlignment align = (SimpleAlignment)
>>>>       SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);
>>>>
>>>> I can now retrieve the sequence names and sequences through the Alignment
>>>> object:
>>>>
>>>>
>>>>
>>>>   Iterator aLabels = align.getLabels().iterator();
>>>>
>>>>   Iterator aSequences = align.symbolListIterator();
>>>>
>>>>
>>>>
>>>> However, I now want to be able to translate between real sequence numbers
>>>> and the positions within each alignment string, i.e. retrieve positions
>>>> that remove the gaps first (gaps are represented by hyphens '-' in the MSF
>>>> format). How can I tell BioJava to parse the gaps into a GappedSequence
>>>> format? I have tried the following to check what position 15 (past the
>>>> first gap) translates into:
>>>>
>>>>
>>>>
>>>>   int n = 0;
>>>>
>>>>   while(aSequences.hasNext()) {
>>>>
>>>>       SimpleSymbolList aSym = (SimpleSymbolList)aSequences.next();
>>>>
>>>>       SimpleGappedSequence aGapped = new SimpleGappedSequence(
>>>>           new SimpleSequence(aSym, "", aLabels.next().toString(), null));
>>>>
>>>>       System.out.println(aGapped.gappedToLocation(new PointLocation(15)));
>>>>   }
>>>>
>>>> But I only get 15 back out. I have also studied the constructor of the
>>>> underlying SimpleGappedSymbolList but it simply copies the SymbolList and
>>>> creates one big block:
>>>>
>>>>
>>>>
>>>>   public SimpleGappedSymbolList(SymbolList source) {
>>>>
>>>>     this.source = source;
>>>>
>>>>     this.alpha = source.getAlphabet();
>>>>
>>>>     this.blocks = new ArrayList();
>>>>
>>>>     this.length = source.length();
>>>>
>>>>     Block b = new Block(1, length, 1, length);
>>>>
>>>>     blocks.add(b);
>>>>
>>>>   }
>>>>
>>>>
>>>>
>>>> Is there a way to tell SimpleGappedSequence to parse itself in terms of the
>>>> gap characters in the sequence string? How is the sequence represented in
>>>> this case, if not by gaps? Surely the hyphen cannot be a part of the
>>>> standard PROTEIN-TERM alphabet, yet I get no complaints for the use of it?
>>>>
>>>>
>>>> Best wishes,
>>>>
>>>>
>>>>
>>>>   Ditlev
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>> Ditlev E. Brodersen, Ph.D.
>>>> Lektor, Associate Professor
>>>>
>>>>
>>>>
>>>> Department of Molecular Biology   Office:  +45 89425259
>>>> University of Aarhus              Lab:     +45 89425022
>>>> Gustav Wieds Vej 10c              Fax:     +45 86123178
>>>> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>>>> Denmark                           Lab WWW: www.bioxray.dk/~deb
>>>>
>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPVv84C5LeMEKA/QRAn0cAJ9jJUaA3bjiEwlzxaAo/bsN5+CT1QCcCLxS
Rv73CVmtYpEz+apJwM1L3sA=
=UPU6
-----END PGP SIGNATURE-----


From deb at mb.au.dk  Fri Nov 16 09:28:40 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 10:28:40 +0100
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
Message-ID: <000601c82833$143c5300$3cb4f900$@au.dk>

Hi Richard,

  thanks for your super fast reply. I managed to recompile using CVS/ant and
the MSF import now works brilliantly and simply as follows:

  BufferedReader br = new BufferedReader(
      new FileReader(aFileChooser.getSelectedFile()));
  SimpleAlignment align = (SimpleAlignment)
      SeqIOTools.fileToBiojava(AlignIOConstants.MSF_AA, br);

  // Iterate through sequences in turn
  Iterator aSequences = align.symbolListIterator();
  while(aSequences.hasNext()) {

      // Retrieve gapped sequence
      SimpleGappedSequence aGapped = (SimpleGappedSequence)aSequences.next();

      // ...do whatever with each gapped sequence
  }

  The returned gapped sequences are all properly set up with gaps, name etc.
But for other users, I think there may be some problems. Since the
SimpleAlignment object only has a general symbol list iterator, the user
has to cast the result of every call that extracts a sequence object, and

      SimpleSequence aSimple = (SimpleSequence)aSequences.next();

now throws a ClassCastException at run time. So old code might not run with
the update as far as I can see.

  Ditlev 

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb


From holland at ebi.ac.uk  Fri Nov 16 09:49:35 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 09:49:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <000601c82833$143c5300$3cb4f900$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
Message-ID: <473D67AF.2020007@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>   The returned gapped sequences are all properly set up with gaps, name etc.
> But as for other users, I think there may be some problems, since the
> SimpleAlignment object only has a general symbol list iterator, the user
> will have to cast each statement extracting a sequence object, and
> 
>       SimpleSequence aSimple = (SimpleSequence)aSequences.next();
> 
> returns an ClassCastException at run time. So old code might not run with
> the update as far as I can see.

This is true. However, such code would be unsupported by us as the API
clearly states that SimpleAlignment returns SymbolList instances, and
does not make any guarantees about the exact implementation details of
the objects it returns. To attempt to cast it to anything other than
SymbolList would be a mistake! (Although actually it is now returning a
guarantee of GappedSymbolList, which is what your code can now take
advantage of). To assume it will return SimpleSequence is outside the
behaviour defined by the API and therefore should not be relied upon.

A more correct behaviour would be to test each item returned:

	SymbolList symlist = (SymbolList)aSequences.next();
	if (symlist instanceof SimpleSequence) {
		SimpleSequence seq = (SimpleSequence)symlist;
		// Do simple-sequence stuff
	} else {
		// Do something else!
	}

In future, I will modify the API to change the SymbolList guarantee to a
GappedSymbolList guarantee, but I can't do this right now as this really
would break everyone's code!
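
The same "test before you cast" idea can be tried outside BioJava. The sketch
below uses small stand-in types (SymbolListLike, GappedLike and the two
implementing classes are hypothetical, not BioJava classes) and checks against
an interface rather than a concrete class, which is the part that keeps
working when an implementation is swapped. It uses Java 5 syntax for brevity.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the kinds of types discussed in this thread.
interface SymbolListLike { String seqString(); }
interface GappedLike extends SymbolListLike { int gapCount(); }

final class PlainList implements SymbolListLike {
    private final String s;
    PlainList(String s) { this.s = s; }
    public String seqString() { return s; }
}

final class GappedList implements GappedLike {
    private final String s;
    GappedList(String s) { this.s = s; }
    public String seqString() { return s; }
    public int gapCount() {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c == '-') n++;
        }
        return n;
    }
}

final class CastCheckSketch {
    /** Describes an item without assuming its concrete class. */
    static String describe(Object item) {
        if (item instanceof GappedLike) {
            GappedLike g = (GappedLike) item; // safe: checked just above
            return "gapped(" + g.gapCount() + ")";
        } else if (item instanceof SymbolListLike) {
            return "plain";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        List<Object> rows =
            Arrays.<Object>asList(new GappedList("A--C"), new PlainList("ACGT"));
        for (Iterator<Object> it = rows.iterator(); it.hasNext(); ) {
            System.out.println(describe(it.next()));
        }
    }
}
```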

We are currently planning a redesign as you may be aware, so issues like
this will hopefully be resolved as part of that process. For a start, if
we use Java 5 generics in future as we plan, we can strictly specify
what kinds of objects will be returned by things such as the alignment
API, making it easier for us to enforce API-compliant behaviour in
users' code.
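
As a rough idea of what a generics-based signature buys, here is a small
sketch. It is only an assumption about the general shape: GappedRow and
TypedAlignmentSketch are invented names, and the redesigned BioJava API may
look quite different.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical row type; not a BioJava interface.
interface GappedRow { String seqString(); }

final class TypedAlignmentSketch<T extends GappedRow> {
    private final List<T> rows = new ArrayList<T>();

    void add(T row) { rows.add(row); }

    /** The element type is part of the signature, so callers need no cast. */
    Iterator<T> rowIterator() { return rows.iterator(); }

    /** Convenience factory for the demo below. */
    static GappedRow row(final String s) {
        return new GappedRow() {
            public String seqString() { return s; }
        };
    }

    public static void main(String[] args) {
        TypedAlignmentSketch<GappedRow> align = new TypedAlignmentSketch<GappedRow>();
        align.add(row("MK--LV"));
        for (Iterator<GappedRow> it = align.rowIterator(); it.hasNext(); ) {
            GappedRow r = it.next(); // no cast; the type is checked at compile time
            System.out.println(r.seqString());
        }
    }
}
```

With a signature like this, code that assumes the wrong element type fails at
compile time instead of throwing a ClassCastException at run time.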

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPWev4C5LeMEKA/QRAvTOAJ9tqdBGWangZ9YQPpEDJ4WWBP/vjQCdHlMB
ITj7O/foDly4aOT4SV1Jb+k=
=g7Vs
-----END PGP SIGNATURE-----


From deb at mb.au.dk  Fri Nov 16 10:11:15 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 11:11:15 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D67AF.2020007@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk>
Message-ID: <000f01c82839$06722550$13566ff0$@au.dk>

Hi again,

  thanks for the info - will do the check just to be proper. I have another
question: In my application, I would like to wrap the retrieved
SimpleGappedSequence objects inside another object that extends the
functionality with application-specific stuff. Ideally, I would do this by
extending the SimpleGappedSequence object and create it by passing the
SimpleGappedSequence from the alignment import to the constructor of the
parent, like so:

  class AlignedSequence extends SimpleGappedSequence {
    public AlignedSequence(SimpleGappedSequence aGapped) {
      super(aGapped);
    }

    // ...custom stuff...
  }

However, the problem is that there is only one constructor for the
SimpleGappedSequence, one which takes a simple Sequence object. I can pass
the derived class alright, but all gap information is lost again, presumably
because the SimpleGappedSequence constructor just takes out the seqString()
and puts it into its own sequence object.

Shouldn't the constructor of the SimpleGappedSequence class recognise when a
derived (and gapped) sequence object is passed, and process it accordingly?

As it stands, I am forced to include the SimpleGappedSequence as a private
member of the AlignedSequence class, which is not nearly as nice, since every
statement using the class will have to do something like

  class AlignedSequence {
    private SimpleGappedSequence gapped_sequence;

    public AlignedSequence(SimpleGappedSequence aGapped) {
      gapped_sequence = aGapped;
    }

    public SimpleGappedSequence getGappedSequence() {
      return gapped_sequence;
    }

    // ...custom stuff...
  }

  ...

  AlignedSequence aAligned = new AlignedSequence(aGapped);
  aAligned.getGappedSequence().seqString();

rather than simply:

  AlignedSequence aAligned = new AlignedSequence(aGapped);
  aAligned.seqString();

In other words, is there any solution with the current setup that would
allow me to extend SimpleGappedSequence and not lose the gap information?
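
One middle ground between the two variants above is a wrapper that implements
the same interface and forwards each call to the wrapped object, so callers
can still write aAligned.seqString(). The sketch below shows the pattern with
a deliberately tiny, hypothetical interface (GappedSeq is not a BioJava type);
the real gapped-sequence interfaces have many more methods to forward.

```java
// Hypothetical stand-in for a gapped-sequence interface; not BioJava code.
interface GappedSeq {
    String seqString();
    int length();
}

final class SimpleGapped implements GappedSeq {
    private final String s;
    SimpleGapped(String s) { this.s = s; }
    public String seqString() { return s; }
    public int length() { return s.length(); }
}

/** Wraps a gapped sequence and forwards the interface methods to it. */
final class AlignedSequenceSketch implements GappedSeq {
    private final GappedSeq delegate;
    private final String label; // example of application-specific state

    AlignedSequenceSketch(GappedSeq delegate, String label) {
        this.delegate = delegate;
        this.label = label;
    }

    // Forwarding methods: the gap layout stays in the wrapped object.
    public String seqString() { return delegate.seqString(); }
    public int length() { return delegate.length(); }

    String getLabel() { return label; }

    public static void main(String[] args) {
        AlignedSequenceSketch a =
            new AlignedSequenceSketch(new SimpleGapped("MK--LV"), "row1");
        System.out.println(a.getLabel() + ": " + a.seqString());
    }
}
```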

--  Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb




From ap3 at sanger.ac.uk  Fri Nov 16 09:51:35 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Nov 2007 09:51:35 +0000
Subject: [Biojava-l] Parsing exising gaps
In-Reply-To: <473D5BFD.8080305@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
Message-ID: <A750D1C7-7659-4F29-9BD0-B30558FBF38E@sanger.ac.uk>

>
> To get the modified parser you will need to check out the very latest
> source code from our CVS repository and compile it using ant.
> Instructions are on our website at biojava.org if you have not done  
> this
> before.

Alternatively, you could get the automatically built biojava.jar from
http://www.spice-3d.org/cruise/

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From holland at ebi.ac.uk  Fri Nov 16 10:46:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 10:46:57 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000f01c82839$06722550$13566ff0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk>
	<000f01c82839$06722550$13566ff0$@au.dk>
Message-ID: <473D7521.9070603@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The easiest way is simply for me to alter the constructor of
SimpleGappedSequence (and equivalently of SimpleGappedSymbolList) to
copy all gaps if passed another instance of GappedSymbolList as the
parameter. I've just done this in CVS, so you should be able to update
your copy and observe the new behaviour.
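
For readers who want to see the idea in isolation, here is a plain-Java sketch
of a gap-preserving copy constructor. It is not the code committed to CVS:
GappedSketch and AlignedSketch are hypothetical classes, and the gap storage
(a count of gaps in front of each source position) is a simplification chosen
for the example. Only the method name addGapInSource is borrowed from the
thread.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the idea; these classes are hypothetical, not BioJava's.
class GappedSketch {
    protected final String source; // ungapped residues
    // gapsBefore.get(i) = number of gaps in front of source position i+1;
    // the final entry counts trailing gaps.
    protected final List<Integer> gapsBefore;

    /** Builds an ungapped view: no gaps anywhere. */
    GappedSketch(String source) {
        this.source = source;
        this.gapsBefore = new ArrayList<Integer>();
        for (int i = 0; i <= source.length(); i++) {
            gapsBefore.add(0);
        }
    }

    /** Copy constructor: keeps the gap layout of another gapped instance. */
    GappedSketch(GappedSketch other) {
        this.source = other.source;
        this.gapsBefore = new ArrayList<Integer>(other.gapsBefore);
    }

    /** Inserts one gap before 1-based source position pos. */
    void addGapInSource(int pos) {
        gapsBefore.set(pos - 1, gapsBefore.get(pos - 1) + 1);
    }

    String seqString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= source.length(); i++) {
            for (int g = 0; g < gapsBefore.get(i); g++) {
                sb.append('-');
            }
            if (i < source.length()) {
                sb.append(source.charAt(i));
            }
        }
        return sb.toString();
    }
}

/** A subclass can be built from an existing gapped object without losing gaps. */
class AlignedSketch extends GappedSketch {
    AlignedSketch(GappedSketch gapped) { super(gapped); }

    public static void main(String[] args) {
        GappedSketch g = new GappedSketch("MKLV");
        g.addGapInSource(3);
        g.addGapInSource(3);
        System.out.println(new AlignedSketch(g).seqString()); // MK--LV
    }
}
```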

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi again,
> 
>   thanks for the info - will do the check just to be proper. I have another
> question: In my application, I would like to wrap the retrieved
> SimpleGappedSequence objects inside another object that extends the
> functionality with application-specific stuff. Ideally, I would do this by
> extending the SimpleGappedSequence object and create it by passing the
> SimpleGappedSequence from the alignment import to the constructor of the
> parent, like so:
> 
>   class AlignedSequence extends SimpleGappedSequence {
>     public AlignedSequence(SimpleGappedSequence aGapped) {
>       super(aGapped);
>     }
> 
>     ..custom stuff..
>   }
> 
> However, the problem is that there is only one constructor for the
> SimpleGappedSequence, one which takes a simple Sequence object. I can pass
> the derived class alright, but all gap information is lost again, presumably
> because the SimpleGappedSequence constructor just takes out the seqString()
> and puts it into its own sequence object.
> 
> Shouldn't the constructor of the SimpleGappedSequence class recognise when a
> derived (and gapped) sequence object is passed, and process it accordingly?
> 
> As it stands, I am forced to include the SimpleGappedSequence as a private
> member of the AlignedSequence class, which is not nearly as nice, since every
> statement using the class will have to do something like
> 
>   class AlignedSequence {
>     private SimpleGappedSequence gapped_sequence;
> 
>     public AlignedSequence(SimpleGappedSequence aGapped) {
>       gapped_sequence = aGapped;
>     }
> 
>     public SimpleGappedSequence getGappedSequence() {
>       return gapped_sequence;
>     }
> 
>     // ...custom stuff...
>   }
> 
>   ...
> 
>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>   aAligned.getGappedSequence().seqString();
> 
> rather than simply:
> 
>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>   aAligned.seqString();
> 
> In other words, is there any solution with the current setup that would
> allow me to extend SimpleGappedSequence and not lose the gap information?
> 
> --  Ditlev
> 
> --
>  
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>  
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 16 November 2007 10:50
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: [Biojava-l] Parsing exising gaps
>>
>>>>   The returned gapped sequences are all properly set up with gaps,
> name etc.
>>>> But as for other users, I think there may be some problems, since the
>>>> SimpleAlignment object only has a general symbol list iterator, the
> user
>>>> will have to cast each statement extracting a sequence object, and
>>>>
>>>>       SimpleSequence aSimple = (SimpleSequence)aSequences.next();
>>>>
>>>> throws a ClassCastException at run time. So old code might not run
>>>> with the update as far as I can see.
> This is true. However, such code would be unsupported by us as the API
> clearly states that SimpleAlignment returns SymbolList instances, and
> does not make any guarantees about the exact implementation details of
> the objects it returns. To attempt to cast it to anything other than
> SymbolList would be a mistake! (Although actually it is now returning a
> guarantee of GappedSymbolList, which is what your code can now take
> advantage of). To assume it will return SimpleSequence is outside the
> behaviour defined by the API and therefore should not be relied upon.
> 
> A more correct behaviour would be to test each item returned:
> 
> 	SymbolList symlist = aSequences.next();
> 	if (symlist instanceof SimpleSequence) {
> 		SimpleSequence seq = (SimpleSequence)symlist;
> 		// Do simple-sequence stuff
> 	} else {
> 		// Do something else!
> 	}
> 
> In future, I will modify the API to change the SymbolList guarantee to
> a
> GappedSymbolList guarantee, but I can't do this right now as this
> really
> would break everyone's code!
> 
> We are currently planning a redesign as you may be aware, so issues
> like
> this will hopefully be resolved as part of that process. For a start,
> if
> we use Java 5 generics in future as we plan, we can strictly specify
> what kinds of objects will be returned by things such as the alignment
> API, making it easier for us to enforce API-compliant behaviour in
> user's code.
> 
> cheers,
> Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPXUh4C5LeMEKA/QRAsbqAKCnpCRnIiztjZ69fE2/UaJuI9QjiACfYa0m
8EJTzWZYOyjp9VhmvsgvmNA=
=1uaB
-----END PGP SIGNATURE-----


From deb at mb.au.dk  Fri Nov 16 12:39:23 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Fri, 16 Nov 2007 13:39:23 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <473D7521.9070603@ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
	<473D7521.9070603@ebi.ac.uk>
Message-ID: <001801c8284d$b8c525e0$2a4f71a0$@au.dk>

Hi again,

  I updated CVS and got the new SimpleGappedSymbolList class, but there
seem to be no changes to the SimpleGappedSequence class, which is the one I
need to extend... have I missed something?

  Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb




From holland at ebi.ac.uk  Fri Nov 16 12:46:23 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 16 Nov 2007 12:46:23 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <001801c8284d$b8c525e0$2a4f71a0$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
	<473D7521.9070603@ebi.ac.uk>
	<001801c8284d$b8c525e0$2a4f71a0$@au.dk>
Message-ID: <473D911F.2000303@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
delegates to the SimpleGappedSymbolList constructor.

When you extend SimpleGappedSequence you should delegate in your new
constructor to the existing SimpleGappedSequence constructor, which in
turn will delegate as above and preserve the gaps.

If you pass any object that implements GappedSymbolList to the
SimpleGappedSequence constructor, e.g. a SimpleGappedSequence or a
SimpleGappedSymbolList, it will automatically choose the new constructor
from SimpleGappedSymbolList, which you should be able to see in
the code you have just checked out. If passed any other
non-GappedSymbolList object, it will use the old constructor that
already existed from before.
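
The delegation described above can be sketched with stand-in classes.
This is only an illustration under stated assumptions: GappedSketch and
AlignedSketch are hypothetical names, not the BioJava classes, and gaps
are modelled simply as '-' characters in a string rather than as
BioJava's internal gap bookkeeping.

```java
// Hypothetical stand-ins for illustration; NOT the BioJava classes.
class GappedSketch {
    private final String gapped;   // gapped view, '-' marks a gap column

    // Build from a gapped text representation.
    GappedSketch(String gappedText) {
        this.gapped = gappedText;
    }

    // Copy constructor: keeps the gap layout of the source object.
    GappedSketch(GappedSketch other) {
        this.gapped = other.gapped;
    }

    String seqString() {
        return gapped;
    }

    // Map a 1-based gapped column to a 1-based ungapped position.
    // A gap column reports the last residue to its left (0 if none).
    int gappedToPlain(int column) {
        int plain = 0;
        for (int i = 0; i < column; i++) {
            if (gapped.charAt(i) != '-') {
                plain++;
            }
        }
        return plain;
    }
}

// Subclass that adds application-specific data and delegates upwards.
class AlignedSketch extends GappedSketch {
    private final String label;

    AlignedSketch(GappedSketch source, String label) {
        super(source);   // declared type is GappedSketch, so the copy constructor runs
        this.label = label;
    }

    String label() {
        return label;
    }
}

class GappedSketchDemo {
    public static void main(String[] args) {
        GappedSketch original = new GappedSketch("MS--EKLMPR---TTWAKG");
        AlignedSketch aligned = new AlignedSketch(original, "demo");
        System.out.println(aligned.seqString());
        System.out.println("Gapped position 10 = Plain position "
                + aligned.gappedToPlain(10));
    }
}
```

With the layout MS--EKLMPR---TTWAKG, gapped column 10 maps to plain
position 8 both in the original and in the subclass built from it, which
is the behaviour a gap-copying constructor is meant to give.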

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Hi again,
> 
>   I updated CVS and got the new SimpleGappedSymbolList class, but there
> seem to be no changes to the SimpleGappedSequence class, which is the one I
> need to extend... have I missed something?
> 
>   Ditlev
> 
> --
>  
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>  
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
H0QV01cFreNi1SNLRPbhepg=
=023Y
-----END PGP SIGNATURE-----


From ap3 at sanger.ac.uk  Fri Nov 16 14:43:39 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Nov 2007 14:43:39 +0000
Subject: [Biojava-l] Disulfide information in PDB files
In-Reply-To: <459609.71722.qm@web52710.mail.re2.yahoo.com>
References: <459609.71722.qm@web52710.mail.re2.yahoo.com>
Message-ID: <8F40FBF1-D491-4C3D-BCEB-41316147BD80@sanger.ac.uk>

Hi Brendan,

I just committed the patches to CVS so
BioJava can now parse the SSBond records.
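
For reference, a rough standalone sketch of reading one SSBOND line. It
is not the parser committed to BioJava: the class name SsbondSketch is
hypothetical, and the column offsets are an assumption based on the
fixed-width SSBOND layout of the PDB file format (chain IDs in columns
16 and 30, residue numbers in columns 18-21 and 32-35).

```java
// Standalone sketch; not the parser committed to BioJava.
class SsbondSketch {
    final String chain1;
    final int resNum1;
    final String chain2;
    final int resNum2;

    private SsbondSketch(String chain1, int resNum1, String chain2, int resNum2) {
        this.chain1 = chain1;
        this.resNum1 = resNum1;
        this.chain2 = chain2;
        this.resNum2 = resNum2;
    }

    // Assumed fixed-width layout (1-based columns): chain IDs in columns
    // 16 and 30, residue sequence numbers in columns 18-21 and 32-35.
    static SsbondSketch parse(String line) {
        if (line.length() < 35 || !line.startsWith("SSBOND")) {
            throw new IllegalArgumentException("not an SSBOND record: " + line);
        }
        String c1 = line.substring(15, 16);
        int n1 = Integer.parseInt(line.substring(17, 21).trim());
        String c2 = line.substring(29, 30);
        int n2 = Integer.parseInt(line.substring(31, 35).trim());
        return new SsbondSketch(c1, n1, c2, n2);
    }

    public static void main(String[] args) {
        // Build a record with the assumed column layout instead of typing
        // the spaces by hand.
        String line = String.format("SSBOND %3d CYS %s %4d    CYS %s %4d",
                1, "A", 6, "A", 127);
        SsbondSketch bond = parse(line);
        System.out.println(bond.chain1 + bond.resNum1 + " - "
                + bond.chain2 + bond.resNum2);
    }
}
```

The two residue identifiers are what one would then use to look up the
cysteine sulphur atoms and measure bond lengths or torsions.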

Andreas


On 14 Nov 2007, at 16:28, Brendan Duggan wrote:

> Hi Andreas
>
> Thanks for the quick response.  I submitted a bug
> request (#2400) as suggested by Richard.  Parsing the
> SSBOND records is indeed what I was talking about.  I
> want to identify the disulfides then calculate their
> torsions, dihedrals and bond lengths, all of which I
> believe can be implemented with the existing code.  If
> you could implement this parsing in a few days that
> would be great!
>
> Thanks
>
> Brendan
>
>
> --- Andreas Prlic <ap3 at sanger.ac.uk> wrote:
>
>> Hi Brendan,
>>
>> SSBOND lines are currently not parsed. If this is what you need,
>> I can add this over the next couple of days.
>>
>> If you want to compute the bonds yourself, the framework can
>> e.g. calculate distances between the sulphur atoms for you.
>>
>> Andreas
>>
>>
>>
>>
>>
>> On 14 Nov 2007, at 00:48, Brendan Duggan wrote:
>>
>>> Greetings
>>>
>>> I'm trying to mine some information on disulfides in
>>> the PDB and was hoping there might be a way of
>>> obtaining this information with the BioJava PDB
>>> parser.  However, I haven't been able to see anything
>>> like this mentioned in the API docs.  If it is
>>> currently not possible to extract disulfide
>>> information from PDB files, are there any plans to
>>> implement this?
>>>
>>> Thanks!
>>>
>>> Brendan
>>>
>>>
>>
>>
>> -----------------------------------------------------------------------
>>
>> Andreas Prlic      Wellcome Trust Sanger Institute
>>                                Hinxton, Cambridge CB10 1SA, UK
>> 			 +44 (0) 1223 49 6891
>>
>> -----------------------------------------------------------------------
>>
>>
>>
>> -- 
>>  The Wellcome Trust Sanger Institute is operated by
>> Genome Research
>>  Limited, a charity registered in England with
>> number 1021457 and a
>>  company registered in England with number 2742969,
>> whose registered
>>  office is 215 Euston Road, London, NW1 2BE.
>>
>
>
> Brendan M. Duggan, PhD
>
> bmduggan at yahoo.com
> (858) 692-2298
>
>
>
>

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From holland at ebi.ac.uk  Sun Nov 18 17:12:04 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Sun, 18 Nov 2007 17:12:04 -0000 (GMT)
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <000901c829d0$daa54620$8fefd260$@dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>
	<473C4EF4.5080301@ebi.ac.uk> <000a01c827c1$8c8e5a50$a5ab0ef0$@dk>
	<473D5BFD.8080305@ebi.ac.uk> <000601c82833$143c5300$3cb4f900$@au.dk>
	<473D67AF.2020007@ebi.ac.uk> <000f01c82839$06722550$13566ff0$@au.d
	<473D7521.9070603@ebi.ac.uk> <001801c8284d$b8c525e0$2a4f71a0$@au.dk>
	<473D911F.2000303@ebi.ac.uk> <000901c829d0$daa54620$8fefd260$@dk>
Message-ID: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>

Interesting stuff. I'm not sure why it isn't working so I'll have to have
a closer look.
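
One thing that may be worth ruling out (a guess, not something
established in this thread): Java selects among overloaded constructors
using the declared, compile-time type of the argument, not its run-time
class. If a constructor declared to take a plain sequence type forwards
that reference to super(...), the non-gapped overload is chosen even when
the object actually is gapped. The stand-in types below (SymList,
GappedSymList, BaseGapped, OverloadDemo) are hypothetical and only mimic
the shape of the situation; they are not the BioJava classes.

```java
// Stand-in types only; these are NOT the BioJava classes.
interface SymList {
    String seqString();
}

interface GappedSymList extends SymList {
}

class BaseGapped implements GappedSymList {
    final String how;            // records which constructor ran
    private final String text;

    BaseGapped(SymList source) {          // "old" constructor
        this.text = source.seqString();
        this.how = "plain";
    }

    BaseGapped(GappedSymList source) {    // "new" gap-copying constructor
        this.text = source.seqString();
        this.how = "gapped";
    }

    public String seqString() {
        return text;
    }
}

class OverloadDemo {
    // The parameter is declared as SymList, so the SymList overload is
    // selected even when s is a GappedSymList at run time.
    static String viaStaticType(SymList s) {
        return new BaseGapped(s).how;
    }

    // An explicit instanceof test plus a cast restores the intended choice.
    static String viaRuntimeCheck(SymList s) {
        if (s instanceof GappedSymList) {
            return new BaseGapped((GappedSymList) s).how;
        }
        return new BaseGapped(s).how;
    }

    public static void main(String[] args) {
        SymList plain = new SymList() {
            public String seqString() {
                return "MSEKLMPRTTWAKG";
            }
        };
        GappedSymList gapped = new BaseGapped(plain);
        System.out.println("static type:   " + viaStaticType(gapped));
        System.out.println("runtime check: " + viaRuntimeCheck(gapped));
    }
}
```

If something like this is happening, either an overload that is declared
to take the gapped type, or a run-time instanceof check inside the
existing constructor, would route gapped arguments to the gap-copying
code.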

I'm currently on annual leave but will investigate when I return (Nov 27th).

cheers,
Richard

On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
> Hi Richard,
>
>   I also thought that what you say was correct, but I can't get it to
> work. Below is a small test program to check this. First, I create a
> SimpleGappedSequence through text with
> gaps->SymbolList->Sequence->GappedSequence. Gaps are there but not
> "understood", as expected. Next, I create the same sequence non-gapped in
> the above way, then introduce gaps with addGapsInSource. A gapped location
> is now properly translated to a non-gapped sequence position. Finally, I
> create a new SimpleGappedSequence based on the working one - as you can
> see, the gaps are still there but not "understood"...
>
> aSymbolList = MSE--KLMPRT---TWAKG
> aSequence   = MSE--KLMPRT---TWAKG
>
> Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:
> aGapped     = MSE--KLMPRT---TWAKG
> Gapped position 10 = Plain position 10
>
> aSymbolList = MSEKLMPRTTWAKG
> aSequence   = MSEKLMPRTTWAKG
>
> Gaps introduced through addGapsInSource work ok:
> aGapped     = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 8
>
> Now a new SimpleGappedSequence object is created from the previous one:
> aGapped2    = MS--EKLMPR---TTWAKG
> Gapped position 10 = Plain position 10
>
> This should have been compiled with the new biojava.jar of 161107 (updated
> via CVS), but perhaps I made a mistake updating?
>
> Any clues?
>
> Thanks,
>
>   Ditlev
>
> ---
>
> package gappedsequencetest;
>
> import org.biojava.bio.*;
> import org.biojava.bio.seq.*;
> import org.biojava.bio.seq.impl.*;
> import org.biojava.bio.symbol.*;
>
> public class Main {
>
>     public static void main(String[] args) {
>         SymbolList aSymbolList = null;
>         try {
>             aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>
>         SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
>         System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
>         System.out.println("aGapped     = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position " + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         try {
>             aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
>         }
>         catch(BioException ex) {}
>
>         System.out.println("aSymbolList = " + aSymbolList.seqString());
>
>         aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>         System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>
>         aGapped = new SimpleGappedSequence(aSequence);
>         aGapped.addGapsInSource(9, 3);
>         aGapped.addGapsInSource(3, 2);
>         System.out.println("Gaps introduced through addGapsInSource work ok:");
>         System.out.println("aGapped     = " + aGapped.seqString());
>         System.out.println("Gapped position 10 = Plain position " + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>
>         SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
>         System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
>         System.out.println("aGapped2    = " + aGapped2.seqString());
>         System.out.println("Gapped position 10 = Plain position " + aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>     }
>
> }
>
> --
>
> Ditlev Egeskov Brodersen
> Lektor
> Bakkefaldet 30, Hasle
> 8210 Århus V
>
> www.lindeman-brodersen.dk
>
>
>> -----Original Message-----
>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> Sent: 16 November 2007 13:46
>> To: Ditlev Egeskov Brodersen
>> Cc: biojava-l at biojava.org
>> Subject: Re: Wrapping SimpleGappedSequence
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> SimpleGappedSequence extends SimpleGappedSymbolList, and the
>> constructor
>> delegates to the SimpleGappedSymbolList constructor.
>>
>> When you extend SimpleGappedSequence you should delegate in your new
>> constructor to the existing SimpleGappedSequence constructor, which in
>> turn will delegate as above and preserve the gaps.
>>
>> By passing any object which implements GappedSymbolList to the
>> SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
>> SimpleGappedSymbolList, it will automatically choose the new
>> constructor
>> from SimpleGappedSymbolList which you hopefully should be able to see
>> in
>> the code you have just checked out. If passed any other
>> non-GappedSymbolList object, it will use the old constructor that
>> already existed from before.
>>
>> cheers,
>> Richard
>>
>> Ditlev Egeskov Brodersen wrote:
>> > Hi again,
>> >
>> >   I updated CVS and got the new SimpleGappedSymbolList class, but there
>> > seem to be no changes to the SimpleGappedSequence class, which is the
>> > one I need to extend... have I missed something?
>> >
>> >   Ditlev
>> >
>> > --
>> >
>> > Ditlev E. Brodersen, Ph.D.
>> > Lektor, Associate Professor
>> >
>> > Department of Molecular Biology   Office:  +45 89425259
>> > University of Aarhus              Lab:     +45 89425022
>> > Gustav Wieds Vej 10c              Fax:     +45 86123178
>> > DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>> > Denmark                           Lab WWW: www.bioxray.dk/~deb
>> >
>> >
>> >> -----Original Message-----
>> >> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> >> Sent: 16 November 2007 11:47
>> >> To: Ditlev Egeskov Brodersen
>> >> Cc: biojava-l at biojava.org
>> >> Subject: Re: Wrapping SimpleGappedSequence
>> >>
>> > The easiest way is simply for me to alter the constructor to
>> > SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to
>> > copy all gaps if passed another instance of GappedSymbolList as the
>> > parameter. I've just done this in CVS so you should be able to update
>> > your copy and observe the new behaviour.
>> >
>> > cheers,
>> > Richard
>> >
>> > Ditlev Egeskov Brodersen wrote:
>> >>>> Hi again,
>> >>>>
>> >>>>   thanks for the info - will do the check just to be proper. I
>> have
>> > another
>> >>>> question: In my application, I would like to wrap the retrieved
>> >>>> SimpleGappedSequence objects inside another object that extends
>> the
>> >>>> functionality with application-specific stuff. Ideally, I would do
>> > this by
>> >>>> extending the SimpleGappedSequence object and create it by passing
>> > the
>> >>>> SimpleGappedSequence from the alignment import to the constructor
>> of
>> > the
>> >>>> parent, like so:
>> >>>>
>> >>>>   class AlignedSequence extends SimpleGappedSequence {
>> >>>>     public AlignedSequence(SimpleGappedSequence aGapped) {
>> >>>>       super(aGapped);
>> >>>>     }
>> >>>>
>> >>>>     ..custom stuff..
>> >>>>   }
>> >>>>
>> >>>> However, the problem is that there is only one constructor for the
>> >>>> SimpleGappedSequence, one which takes a simple Sequence object. I
>> can
>> > pass
>> >>>> the derived class alright, but all gap information is lost again,
>> > presumably
>> >>>> because the SimpleGappedSequence constructor just takes out the
>> > seqString()
>> >>>> and puts it into its own sequence object.
>> >>>>
>> >>>> Shouldn't the constructor of the SimpleGappedSequence class
>> recognise
>> > when a
>> >>>> derived (and gapped) sequence object is passed, and process it
>> > accordingly?
>> >>>> As it stands, I am forced to include the SimpleGappedSequence as a
>> >>>> private member of the AlignedSequence class, which is not nearly as
>> >>>> nice, since all statements using the class will have to do something
>> >>>> like
>> >>>>
>> >>>>   class AlignedSequence {
>> >>>>     private SimpleGappedSequence gapped_sequence;
>> >>>>
>> >>>>     public AlignedSequence(SimpleGappedSequence aGapped) {
>> >>>>       gapped_sequence = aGapped;
>> >>>>     }
>> >>>>
>> >>>>     public SimpleGappedSequence getGappedSequence() {
>> >>>>       return(gapped_sequence);
>> >>>>     }
>> >>>>
>> >>>>     ..custom stuff..
>> >>>>   }
>> >>>>
>> >>>>   ...
>> >>>>
>> >>>>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>> >>>>   aAligned.getGappedSequence().seqString();
>> >>>>
>> >>>> rather than simply:
>> >>>>
>> >>>>   AlignedSequence aAligned = new AlignedSequence(aGapped);
>> >>>>   aAligned.seqString();
>> >>>>
>> >>>> In other words, is there any solution with the current setup that
>> >>>> would allow me to extend SimpleGappedSequence and not lose the gap
>> >>>> information?
>> >>>> --  Ditlev
>> >>>>
>> >>>> --
>> >>>>
>> >>>> Ditlev E. Brodersen, Ph.D.
>> >>>> Lektor, Associate Professor
>> >>>>
>> >>>> Department of Molecular Biology   Office:  +45 89425259
>> >>>> University of Aarhus              Lab:     +45 89425022
>> >>>> Gustav Wieds Vej 10c              Fax:     +45 86123178
>> >>>> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>> >>>> Denmark                           Lab WWW: www.bioxray.dk/~deb
>> >>>>
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Richard Holland [mailto:holland at ebi.ac.uk]
>> >>>>> Sent: 16 November 2007 10:50
>> >>>>> To: Ditlev Egeskov Brodersen
>> >>>>> Cc: biojava-l at biojava.org
>> >>>>> Subject: Re: [Biojava-l] Parsing exising gaps
>> >>>>>
>> >>>>>>>   The returned gapped sequences are all properly set up with
>> gaps,
>> >>>> name etc.
>> >>>>>>> But as for other users, I think there may be some problems,
>> since
>> > the
>> >>>>>>> SimpleAlignment object only has a general symbol list iterator,
>> > the
>> >>>> user
>> >>>>>>> will have to cast each statement extracting a sequence object,
>> and
>> >>>>>>>
>> >>>>>>>       SimpleSequence aSimple =
>> (SimpleSequence)aSequences.next();
>> >>>>>>>
>> >>>>>>> returns an ClassCastException at run time. So old code might
>> not
>> > run
>> >>>> with
>> >>>>>>> the update as far as I can see.
>> >>>> This is true. However, such code would be unsupported by us as the
>> > API
>> >>>> clearly states that SimpleAlignment returns SymbolList instances,
>> and
>> >>>> does not make any guarantees about the exact implementation
>> details
>> > of
>> >>>> the objects it returns. To attempt to cast it to anything other
>> than
>> >>>> SymbolList would be a mistake! (Although actually it is now
>> returning
>> > a
>> >>>> guarantee of GappedSymbolList, which is what your code can now
>> take
>> >>>> advantage of). To assume it will return SimpleSequence is outside
>> the
>> >>>> behaviour defined by the API and therefore should not be relied
>> upon.
>> >>>>
>> >>>> A more correct behaviour would be to test each item returned:
>> >>>>
>> >>>> 	SymbolList symlist = aSequences.next();
>> >>>> 	if (symlist instanceof SimpleSequence) {
>> >>>> 		SimpleSequence seq = (SimpleSequence)symlist;
>> >>>> 		// Do simple-sequence stuff
>> >>>> 	} else {
>> >>>> 		// Do something else!
>> >>>> 	}
>> >>>>
>> >>>> In future, I will modify the API to change the SymbolList
>> guarantee
>> > to
>> >>>> a
>> >>>> GappedSymbolList guarantee, but I can't do this right now as this
>> >>>> really
>> >>>> would break everyone's code!
>> >>>>
>> >>>> We are currently planning a redesign as you may be aware, so
>> issues
>> >>>> like
>> >>>> this will hopefully be resolved as part of that process. For a
>> start,
>> >>>> if
>> >>>> we use Java 5 generics in future as we plan, we can strictly
>> specify
>> >>>> what kinds of objects will be returned by things such as the
>> > alignment
>> >>>> API, making it easier for us to enforce API-compliant behaviour in
>> >>>> user's code.
>> >>>>
>> >>>> cheers,
>> >>>> Richard
>>
>> - --
>> Richard Holland (BioMart)
>> EMBL EBI, Wellcome Trust Genome Campus,
>> Hinxton, Cambridgeshire CB10 1SD, UK
>> Tel. +44 (0)1223 494416
>>
>> http://www.biomart.org/
>> http://www.biojava.org/
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
>> H0QV01cFreNi1SNLRPbhepg=
>> =023Y
>> -----END PGP SIGNATURE-----
>
>


-- 
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK


From sterk at ebi.ac.uk  Mon Nov 19 11:53:00 2007
From: sterk at ebi.ac.uk (Peter Sterk)
Date: Mon, 19 Nov 2007 11:53:00 +0000
Subject: [Biojava-l] biojava.org wiki site down?
Message-ID: <4741791C.2090307@ebi.ac.uk>

Hi,

I only get blank screens in firefox and IE can't display the pages, 
either. I think Richard reported something similar a few weeks ago.

cheers,

--Peter
-----------------------------------------------------------------
Dr. Peter Sterk                           Tel: +44-(0)1223-494405
EMBL-European Bioinformatics Institute    Fax: +44-(0)1223-494472
Wellcome Trust Genome Campus, Hinxton     email: sterk at ebi.ac.uk
Cambridge CB10 1SD, UK                    WWW: www.ebi.ac.uk

   Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
-----------------------------------------------------------------


From deb at mb.au.dk  Mon Nov 19 12:13:53 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 13:13:53 +0100
Subject: [Biojava-l] biojava.org wiki site down?
In-Reply-To: <4741791C.2090307@ebi.ac.uk>
References: <4741791C.2090307@ebi.ac.uk>
Message-ID: <003301c82aa5$a6fabdc0$f4f03940$@au.dk>

www.biojava.org is down now, all right, but I was there less than 10 minutes
ago, so it's a recent crash.

  Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb


> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
> bounces at lists.open-bio.org] On Behalf Of Peter Sterk
> Sent: 19 November 2007 12:53
> To: biojava-l at lists.open-bio.org
> Subject: [Biojava-l] biojava.org wiki site down?
> 
> Hi,
> 
> I only get blank screens in firefox and IE can't display the pages,
> either. I think Richard reported something similar a few weeks ago.
> 
> cheers,
> 
> --Peter
> -----------------------------------------------------------------
> Dr. Peter Sterk                           Tel: +44-(0)1223-494405
> EMBL-European Bioinformatics Institute    Fax: +44-(0)1223-494472
> Wellcome Trust Genome Campus, Hinxton     email: sterk at ebi.ac.uk
> Cambridge CB10 1SD, UK                    WWW: www.ebi.ac.uk
> 
>    Genome Reviews home page: http://www.ebi.ac.uk/GenomeReviews/
> -----------------------------------------------------------------
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l


From deb at mb.au.dk  Mon Nov 19 14:46:01 2007
From: deb at mb.au.dk (Ditlev Egeskov Brodersen)
Date: Mon, 19 Nov 2007 15:46:01 +0100
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>	<473C4EF4.5080301@ebi.ac.uk>
	<000a01c827c1$8c8e5a50$a5ab0ef0$@dk>	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>	<473D67AF.2020007@ebi.ac.uk>
	<000f01c82839$06722550$13566ff0$@au.d	<473D7521.9070603@ebi.ac.uk>
	<001801c8284d$b8c525e0$2a4f71a0$@au.dk>	<473D911F.2000303@ebi.ac.uk>
	<000901c829d0$daa54620$8fefd260$@dk>
	<48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
Message-ID: <003701c82aba$e85f4320$b91dc960$@au.dk>

Dear Richard and all,

  I've been dissecting the delegation problem encountered when instantiating
SimpleGappedSequence(Sequence) with an already gapped sequence. The
constructor calls the parent SimpleGappedSymbolList(), which in Richard's
CVS update of 161107 now contains a separate overloaded constructor for the
gapped case:

  public SimpleGappedSymbolList(GappedSymbolList gappedSource)

  However, when instantiating a new SimpleGappedSequence based on an
existing gapped sequence (with several blocks), the blocks were lost. 

  After checking the path of code execution it appeared that for some
reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified
SimpleGappedSequence.java to include an overloaded constructor also for the
descendant class, identical to the other constructor but with a
GappedSequence argument:

  public SimpleGappedSequence(GappedSequence seq) {
    super(seq);
    this.sequence = seq;
    createOnUnderlying = false;
  }
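
  A likely reason the old constructor was picked: Java resolves overloaded
constructors at compile time from the declared type of the argument, not
from its run-time class. Inside SimpleGappedSequence(Sequence seq), the call
super(seq) only sees a Sequence, so the SymbolList overload of the parent
wins even when the object passed in is gapped. The sketch below reproduces
this with stand-in types. The interface names mirror BioJava, but everything
in it (Parent, ChildBefore, ChildAfter, Impl) is a hypothetical stand-in
and not BioJava code.

```java
class OverloadDemo {
    // Stand-ins for the BioJava interface hierarchy (hypothetical, simplified).
    interface SymbolList {}
    interface GappedSymbolList extends SymbolList {}
    interface Sequence extends SymbolList {}
    interface GappedSequence extends Sequence, GappedSymbolList {}
    static class Impl implements GappedSequence {}

    // Stand-in for the parent class with its two overloaded constructors.
    static class Parent {
        final String chosen;
        Parent(SymbolList source) { chosen = "SymbolList"; }
        Parent(GappedSymbolList source) { chosen = "GappedSymbolList"; }
    }

    // Only a Sequence constructor: super(seq) is bound to Parent(SymbolList).
    static class ChildBefore extends Parent {
        ChildBefore(Sequence seq) { super(seq); }
    }

    // Extra GappedSequence constructor: its super(seq) is bound to
    // Parent(GappedSymbolList), because the declared type is now gapped.
    static class ChildAfter extends Parent {
        ChildAfter(Sequence seq) { super(seq); }
        ChildAfter(GappedSequence seq) { super(seq); }
    }

    public static void main(String[] args) {
        Impl gapped = new Impl();
        System.out.println(new ChildBefore(gapped).chosen); // SymbolList
        System.out.println(new ChildAfter(gapped).chosen);  // GappedSymbolList

        // The declared type still decides: a gapped object held in a
        // Sequence variable goes through the non-gapped path again.
        Sequence declaredAsSequence = gapped;
        System.out.println(new ChildAfter(declaredAsSequence).chosen); // SymbolList
    }
}
```

  The last line of main shows the remaining caveat: callers that hold the
gapped object in a plain Sequence variable would still reach the old path
unless the constructor also tests the argument with instanceof.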

  Now, the correct parent constructor
(SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are
two other problems with the new SimpleGappedSymbolList constructor that
need to be corrected for it to work as expected: First, the initial
introduction of a single, large block is missing from the new code, so
insert:

  Block b = new Block(1, length, 1, length);
  blocks.add(b);

  Secondly, the code for transferring the gaps from the sequence string needs
to use two separate indices, otherwise the gaps will be placed wrongly
because their position is affected by previously inserted gaps:

  int n=1;
  for(int i=1;i<=this.length();i++) {
    if(this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
      this.addGapInSource(n);
    else
      n++;
  }

  In other words, the index giving the position of the gaps should only
increment when there are NO gaps at the corresponding position in the gapped
string.
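
  The two-index idea can be tried outside BioJava on plain strings. The
sketch below is only an illustration under stated assumptions: '-' stands in
for the gap symbol, positions are 1-based, and a gap recorded at source
position n is taken to sit directly in front of residue n. The class and
method names (GapTransferDemo, gapSourcePositions, rebuild) are made up for
this example.

```java
import java.util.ArrayList;
import java.util.List;

class GapTransferDemo {
    // For every gap in the gapped string, record the 1-based position in
    // the ungapped source in front of which the gap sits.
    static List<Integer> gapSourcePositions(String gapped) {
        List<Integer> positions = new ArrayList<>();
        int n = 1; // index into the ungapped source
        for (int i = 1; i <= gapped.length(); i++) { // index into the gapped view
            if (gapped.charAt(i - 1) == '-') {
                positions.add(n); // gap: source index stays where it is
            } else {
                n++;              // residue: advance the source index
            }
        }
        return positions;
    }

    // Re-insert the recorded gaps into the ungapped source (round-trip check).
    static String rebuild(String source, List<Integer> positions) {
        StringBuilder out = new StringBuilder();
        int k = 0;
        for (int pos = 1; pos <= source.length(); pos++) {
            while (k < positions.size() && positions.get(k) == pos) {
                out.append('-');
                k++;
            }
            out.append(source.charAt(pos - 1));
        }
        while (k < positions.size()) { // gaps after the last residue
            out.append('-');
            k++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String gapped = "MS--EKLMPR---TTWAKG";
        List<Integer> positions = gapSourcePositions(gapped);
        System.out.println(positions); // [3, 3, 9, 9, 9]
        System.out.println(rebuild("MSEKLMPRTTWAKG", positions).equals(gapped)); // true
    }
}
```

  The recorded positions 3 (twice) and 9 (three times) agree with the
addGapsInSource(3, 2) and addGapsInSource(9, 3) calls in the test program.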

  Following these changes, the GappedSequenceTest program from last week now
works as expected:

 aSymbolList = MSE--KLMPRT---TWAKG
 aSequence   = MSE--KLMPRT---TWAKG

 Gaps are not parsed when a SimpleGappedSequence is constructed from a 
 gapped Sequence object:
 aGapped     = MSE--KLMPRT---TWAKG
 Gapped position 10 = Plain position 10

 aSymbolList = MSEKLMPRTTWAKG
 aSequence   = MSEKLMPRTTWAKG

 Gaps introduced through addGapsInSource work ok:
 aGapped     = MS--EKLMPR---TTWAKG
 Gapped position 10 = Plain position 8

 Now a new SimpleGappedSequence object is created from the previous one:
 aGapped2    = MS--EKLMPR---TTWAKG
 Gapped position 10 = Plain position 8

  -- Ditlev

--

Ditlev E. Brodersen, Ph.D.
Lektor, Associate Professor

Department of Molecular Biology   Office:  +45 89425259
University of Aarhus              Lab:     +45 89425022
Gustav Wieds Vej 10c              Fax:     +45 86123178
DK-8000 Aarhus C                  Email:   deb at mb.au.dk
Denmark                           Lab WWW: www.bioxray.dk/~deb


 -----Original Message-----
 From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-
 bounces at lists.open-bio.org] On Behalf Of Richard Holland
 Sent: 18 November 2007 18:12
 To: Ditlev Egeskov Brodersen
 Cc: biojava-l at biojava.org
 Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence
 
 Interesting stuff. I'm not sure why it isn't working so I'll have to
 have
 a closer look.
 
 I'm currently on annual leave but will investigate when I return (Nov
 27th).
 
 cheers,
 Richard
 
 On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
  Hi Richard,
 
    I thought that was also correct what you say, but I can't get it to
  work.
  Below is a small test program to check this. First, I create a
  SimpleGappedSequence through Text with
  gaps-SymbolList-Sequence-GappedSequence. Gaps are there but not
  "understood", as expected. Next, I create the same sequence non-
 gapped in
  the above way, then introduce gaps with addGapsInSource. A gapped
 location
  is now properly translated to a non-gapped sequence position.
 Finally, I
  create a new SimpleGappedSequence based on the working one - as you
 can
  see
  the gaps are still there but not "understood"...
 
  aSymbolList = MSE--KLMPRT---TWAKG
  aSequence   = MSE--KLMPRT---TWAKG
 
  Gaps are not parsed when a SimpleGappedSequence is constructed from a
  gapped
  Sequence object:
  aGapped     = MSE--KLMPRT---TWAKG
  Gapped position 10 = Plain position 10
 
  aSymbolList = MSEKLMPRTTWAKG
  aSequence   = MSEKLMPRTTWAKG
 
  Gaps introduced through addGapsInSource work ok:
  aGapped     = MS--EKLMPR---TTWAKG
  Gapped position 10 = Plain position 8
 
  Now a new SimpleGappedSequence object is created from the previous
 one:
  aGapped2    = MS--EKLMPR---TTWAKG
  Gapped position 10 = Plain position 10
 
  This should have been compiled with the new biojava.jar of 161107
 (updated
  via CVS), but perhaps I made a mistake updating?
 
  Any clues?
 
  Thanks,
 
    Ditlev
 
  ---
 
  package gappedsequencetest;
 
  import org.biojava.bio.*;
  import org.biojava.bio.seq.*;
  import org.biojava.bio.seq.impl.*;
  import org.biojava.bio.symbol.*;
 
  public class Main {
 
      public static void main(String[] args) {
          // Case 1: gaps are already present in the text.
          SymbolList aSymbolList = null;
          try {
              aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
          }
          catch(BioException ex) {}
 
          System.out.println("aSymbolList = " + aSymbolList.seqString());
 
          Sequence aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
          System.out.println("aSequence   = " + aSequence.seqString() + "\n");
 
          SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
          System.out.println("Gaps are not parsed when a SimpleGappedSequence is constructed from a gapped Sequence object:");
          System.out.println("aGapped     = " + aGapped.seqString());
          System.out.println("Gapped position 10 = Plain position " +
              aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
 
          // Case 2: ungapped text, gaps added through addGapsInSource.
          try {
              aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
          }
          catch(BioException ex) {}
 
          System.out.println("aSymbolList = " + aSymbolList.seqString());
 
          aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
          System.out.println("aSequence   = " + aSequence.seqString() + "\n");
 
          aGapped = new SimpleGappedSequence(aSequence);
          aGapped.addGapsInSource(9, 3);
          aGapped.addGapsInSource(3, 2);
          System.out.println("Gaps introduced through addGapsInSource work ok:");
          System.out.println("aGapped     = " + aGapped.seqString());
          System.out.println("Gapped position 10 = Plain position " +
              aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
 
          // Case 3: a new gapped sequence built from the working one.
          SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
          System.out.println("Now a new SimpleGappedSequence object is created from the previous one:");
          System.out.println("aGapped2    = " + aGapped2.seqString());
          System.out.println("Gapped position 10 = Plain position " +
              aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
      }
 
  }
 
  --
 
  Ditlev Egeskov Brodersen
  Lektor
  Bakkefaldet 30, Hasle
  8210 Århus V
 
  www.lindeman-brodersen.dk
 
 
  -----Original Message-----
  From: Richard Holland [mailto:holland at ebi.ac.uk]
  Sent: 16 November 2007 13:46
  To: Ditlev Egeskov Brodersen
  Cc: biojava-l at biojava.org
  Subject: Re: Wrapping SimpleGappedSequence
 
  -----BEGIN PGP SIGNED MESSAGE-----
  Hash: SHA1
 
  SimpleGappedSequence extends SimpleGappedSymbolList, and the
  constructor
  delegates to the SimpleGappedSymbolList constructor.
 
  When you extend SimpleGappedSequence you should delegate in your new
  constructor to the existing SimpleGappedSequence constructor, which
 in
  turn will delegate as above and preserve the gaps.
 
  By passing any object which implements GappedSymbolList to the
  SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
  SimpleGappedSymbolList, it will automatically choose the new
  constructor
  from SimpleGappedSymbolList which you hopefully should be able to
 see
  in
  the code you have just checked out. If passed any other
  non-GappedSymbolList object, it will use the old constructor that
  already existed from before.
 
  cheers,
  Richard
 
  Ditlev Egeskov Brodersen wrote:
   Hi again,
  
     I updated CVS and got the new SimpleGappedSymbolList class, but
  there
   seems to be no changes to the SimpleGappedSequence class, which is
  the one I
   need to extend...have I missed something?
  
     Ditlev
  
   --
  
   Ditlev E. Brodersen, Ph.D.
   Lektor, Associate Professor
  
   Department of Molecular Biology   Office:  +45 89425259
   University of Aarhus              Lab:     +45 89425022
   Gustav Wieds Vej 10c              Fax:     +45 86123178
   DK-8000 Aarhus C                  Email:   deb at mb.au.dk
   Denmark                           Lab WWW: www.bioxray.dk/~deb
  
  
   -----Original Message-----
   From: Richard Holland [mailto:holland at ebi.ac.uk]
   Sent: 16 November 2007 11:47
   To: Ditlev Egeskov Brodersen
   Cc: biojava-l at biojava.org
   Subject: Re: Wrapping SimpleGappedSequence
  
   The easiest way is simply for me to alter the constructor to
   SimpleGappedSequence (and equivalently to SimpleGappedSymbolList)
 to
   copy all gaps if passed another instance of GappedSymbolList as
 the
   parameter. I've just done this in CVS so you should be able to
 update
   your copy and observe the new behaviour.
  
   cheers,
   Richard
  
   Ditlev Egeskov Brodersen wrote:
   Hi again,
  
     thanks for the info - will do the check just to be proper. I
  have
   another
   question: In my application, I would like to wrap the retrieved
   SimpleGappedSequence objects inside another object that extends
  the
   functionality with application-specific stuff. Ideally, I would
 do
   this by
   extending the SimpleGappedSequence object and create it by
 passing
   the
   SimpleGappedSequence from the alignment import to the
 constructor
  of
   the
   parent, like so:
  
     class AlignedSequence extends SimpleGappedSequence {
       public AlignedSequence(SimpleGappedSequence aGapped) {
         super(aGapped);
       }
  
       ..custom stuff..
     }
  
   However, the problem is that there is only one constructor for
 the
   SimpleGappedSequence, one which takes a simple Sequence object.
 I
  can
   pass
   the derived class alright, but all gap information is lost
 again,
   presumably
   because the SimpleGappedSequence constructor just takes out the
   seqString()
   and puts it into its own sequence object.
  
   Shouldn't the constructor of the SimpleGappedSequence class
  recognise
   when a
   derived (and gapped) sequence object is passed, and process it
   accordingly?
   As it stands, I am forced to include the SimpleGappedSequence
 as a
   private
   member of the AlignedSequence class, which is not near as nice
  since
   all
   statement using the class will have to do something like
  
     class AlignedSequence extends SimpleGappedSequence {
       private SimpleGappedSequence gapped_sequence;
  
       public AlignedSequence(SimpleGappedSequence aGapped) {
         gapped_sequence = aGapped;
       }
  
       public SimpleGappedSequence getGappedSequence() {
         return(gapped_sequence);
     }
  
       ..custom stuff..
     }
  
     ...
  
     AlignedSequence aAligned = new AlignedSequence(aGapped);
     aAligned.getGappedSequence().seqString();
  
   rather than simply:
  
     AlignedSequence aAligned = new AlignedSequence(aGapped);
     aAligned.seqString();
  
   In other words, is there any solution with the current setup
 that
   would
   allow me to extend SimpleGappedSequence and not loose the gap
   information?
   --  Ditlev
  
   --
  
   Ditlev E. Brodersen, Ph.D.
   Lektor, Associate Professor
  
   Department of Molecular Biology   Office:  +45 89425259
   University of Aarhus              Lab:     +45 89425022
   Gustav Wieds Vej 10c              Fax:     +45 86123178
   DK-8000 Aarhus C                  Email:   deb at mb.au.dk
   Denmark                           Lab WWW: www.bioxray.dk/~deb
  
  
   -----Original Message-----
   From: Richard Holland [mailto:holland at ebi.ac.uk]
   Sent: 16 November 2007 10:50
   To: Ditlev Egeskov Brodersen
   Cc: biojava-l at biojava.org
   Subject: Re: [Biojava-l] Parsing exising gaps
  
     The returned gapped sequences are all properly set up with
  gaps,
   name etc.
   But as for other users, I think there may be some problems,
  since
   the
   SimpleAlignment object only has a general symbol list
 iterator,
   the
   user
   will have to cast each statement extracting a sequence
 object,
  and
  
         SimpleSequence aSimple =
  (SimpleSequence)aSequences.next();
  
   returns an ClassCastException at run time. So old code might
  not
   run
   with
   the update as far as I can see.
   This is true. However, such code would be unsupported by us as
 the
   API
   clearly states that SimpleAlignment returns SymbolList
 instances,
  and
   does not make any guarantees about the exact implementation
  details
   of
   the objects it returns. To attempt to cast it to anything other
  than
   SymbolList would be a mistake! (Although actually it is now
  returning
   a
   guarantee of GappedSymbolList, which is what your code can now
  take
   advantage of). To assume it will return SimpleSequence is
 outside
  the
   behaviour defined by the API and therefore should not be relied
  upon.
  
   A more correct behaviour would be to test each item returned:
  
   	SymbolList symlist = aSequences.next();
   	if (symlist instanceof SimpleSequence) {
   		SimpleSequence seq = (SimpleSequence)symlist;
   		// Do simple-sequence stuff
   	} else {
   		// Do something else!
   	}
  
   In future, I will modify the API to change the SymbolList
  guarantee
   to
   a
   GappedSymbolList guarantee, but I can't do this right now as
 this
   really
   would break everyone's code!
  
   We are currently planning a redesign as you may be aware, so
  issues
   like
   this will hopefully be resolved as part of that process. For a
  start,
   if
   we use Java 5 generics in future as we plan, we can strictly
  specify
   what kinds of objects will be returned by things such as the
   alignment
   API, making it easier for us to enforce API-compliant behaviour
 in
   user's code.
  
   cheers,
   Richard
 
  - --
  Richard Holland (BioMart)
  EMBL EBI, Wellcome Trust Genome Campus,
  Hinxton, Cambridgeshire CB10 1SD, UK
  Tel. +44 (0)1223 494416
 
  http://www.biomart.org/
  http://www.biojava.org/
  -----BEGIN PGP SIGNATURE-----
  Version: GnuPG v1.4.2.2 (GNU/Linux)
  Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
 
  iD8DBQFHPZEf4C5LeMEKA/QRAr/JAJ4p/DvZRqkCwPqgKNkcY0LLJvnanQCeJcWx
  H0QV01cFreNi1SNLRPbhepg=
  =023Y
  -----END PGP SIGNATURE-----
 
 
 --
 Richard Holland
 BioMart (http://www.biomart.org/)
 EMBL-EBI
 Hinxton, Cambridgeshire CB10 1SD, UK
 
 _______________________________________________
 Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
 http://lists.open-bio.org/mailman/listinfo/biojava-l


From allank at sanbi.ac.za  Sun Nov 25 13:10:55 2007
From: allank at sanbi.ac.za (Allan Kamau)
Date: Sun, 25 Nov 2007 15:10:55 +0200
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
Message-ID: <4749745F.9070104@sanbi.ac.za>

Hi all,
I've searched for a conclusive answer to the "Program ncbi-blastn 
Version <some value> is not supported" error without success.
I would like to know which format of BLAST output BioJava's blast-like 
parsing framework accepts, including some examples (without the data) of 
how such BLAST output may be created.
For example, I am using ncbi-blastn and I am generating the blast file 
(which BioJava doesn't like) as follows.

export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
export BLAST_REPORT_TABULAR=somesequence.blast.txt
export BLAST_REPORT_XML=somesequence.blast.xml
export BLAST_REPORT=somesequence.blast
export INPUT_FASTA=somesequence.fasta
export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence

date
time cd $WORK_DIR
cp $REFERENCES_FASTA_NAME .
$FORMATDB -i $REFERENCES_FASTA_NAME -p F -o T
echo `pwd`
$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o $BLAST_REPORT_TABULAR
$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML
date

Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied 
from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"

Then I get the error below.


[aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
Buildfile: build.xml

runBlastParser:
     [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17 is not supported by the biojava blast-like parsing framework
     [java]     at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
     [java]     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)

Allan.


From markjschreiber at gmail.com  Mon Nov 26 01:17:03 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Mon, 26 Nov 2007 09:17:03 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <4749745F.9070104@sanbi.ac.za>
References: <4749745F.9070104@sanbi.ac.za>
Message-ID: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>

Hi Allan -

I think the solution is to call the setParserLazy() or some method with a
similar name (I don't have the API handy). This will prevent it doing the
check.

The original idea of this method was that you could check against a list of
version numbers that people had validated.  I don't think this is a good
idea as nothing is truly 100% validated and we haven't kept the list up to
date.  If there are no objections I would propose to make this method
deprecated (and its opposite method) and change the default behaviour to
lazy checking.
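
To make the strict/lazy distinction concrete, here is a small stand-alone
sketch of the idea: a parser-side check against a list of validated
versions that a lazy switch turns off. It is not the BioJava implementation;
the Checker class, its setLazy() method and the version list are
hypothetical. If memory serves, the real methods on BlastLikeSAXParser are
named setModeLazy() and setModeStrict(), but treat those names as an
assumption to be checked against the Javadoc.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class VersionCheckDemo {
    // Hypothetical stand-in for a parser's version check.
    static final class Checker {
        private final Set<String> validated;
        private boolean lazy = false; // strict unless told otherwise

        Checker(String... validatedVersions) {
            validated = new HashSet<>(Arrays.asList(validatedVersions));
        }

        void setLazy(boolean lazy) {
            this.lazy = lazy;
        }

        // Strict mode rejects any version missing from the validated list;
        // lazy mode skips the check and lets parsing go ahead.
        void check(String program, String version) {
            if (!lazy && !validated.contains(version)) {
                throw new IllegalStateException(
                    "Program " + program + " Version " + version + " is not supported");
            }
        }
    }

    public static void main(String[] args) {
        Checker checker = new Checker("2.2.11"); // made-up validated list
        try {
            checker.check("ncbi-blastn", "2.2.17");
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        checker.setLazy(true);
        checker.check("ncbi-blastn", "2.2.17"); // no exception in lazy mode
        System.out.println("lazy: accepted");
    }
}
```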

Best regards.

- Mark


On 11/25/07, Allan Kamau <allank at sanbi.ac.za> wrote:
>
> Hi all,
> I've searched for a conclusive answer to the "Program ncbi-blastn
> Version <some value> is not supported" without success.
> I would like to know format of the blast output the Biojava's blast-like
> parsing framework likes, including some examples (without the data) of
> how such blast output may be created.
> For example, I am using ncbi-blastn and I am generating the blast file
> (which Biojava doesn't like) as follows.
>
> export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
> export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
> export REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
> export BLAST_REPORT_TABULAR=somesequence.blast.txt
> export BLAST_REPORT_XML=somesequence.blast.xml
> export BLAST_REPORT=somesequence.blast
> export INPUT_FASTA=somesequence.fasta
> export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
>
> date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i
> $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d
> $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o
> $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d $REFERENCES_FASTA_NAME -i
> $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;
>
> Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
> from "http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
>
> Then I get the error below.
>
>
> [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
> Buildfile: build.xml
>
> runBlastParser:
>     [java] org.xml.sax.SAXException: Program ncbi-blastn Version 2.2.17
> is not supported by the biojava blast-like parsing framework
>     [java]     at
> org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(
> BlastLikeSAXParser.java:241)
>     [java]     at
> org.biojava.bio.program.sax.BlastLikeSAXParser.parse(
> BlastLikeSAXParser.java:160)
>
> Allan.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From holland at ebi.ac.uk  Mon Nov 26 08:55:56 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 08:55:56 +0000
Subject: [Biojava-l] Applet not able to find DNATools class.
In-Reply-To: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>
References: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>
Message-ID: <474A8A1C.4020901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sounds like either a classpath problem (in which case check your
classpath to ensure all parts of biojava are definitely on it) or a
broken biojava.jar (in which case you need to recompile/redownload it).
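
One more thing that may be worth checking: the wording "Could not
initialize class" is what the JVM reports when a class was found but its
static initialization had already failed on an earlier access. That first
failure is thrown as an ExceptionInInitializerError, so if one appears
earlier in the Java console it should name the underlying cause. A minimal
sketch of that behaviour with a made-up class (Fragile and its simulated
failure are assumptions for illustration, nothing from BioJava):

```java
class InitFailureDemo {
    // Made-up class whose static initialization always fails.
    static class Fragile {
        static final int VALUE = load();

        private static int load() {
            throw new IllegalStateException("simulated missing resource");
        }
    }

    // Touch the class and report which kind of error the JVM raised.
    static String touch() {
        try {
            return "value " + Fragile.VALUE;
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The first access runs the static initializer and fails there.
        System.out.println(touch());
        // Later accesses no longer retry initialization; the JVM reports
        // NoClassDefFoundError ("Could not initialize class ...") instead.
        System.out.println(touch());
    }
}
```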

cheers,
Richard

Abhinav Ram Karhu wrote:
> Hello all,
> I am having an error while loading the applet.
> 
> I am getting the following stack trace.
> 
> java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
> 	at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
> 	at Trace.init(Trace.java:161)
> 	at sun.applet.AppletPanel.run(Unknown Source)
> 	at java.lang.Thread.run(Unknown Source)
> 
> I have the directory structure in which I am having my class files , the php page and the biojava jar files together in one folder.
> 
> I also have org.biojava.bio.seq.DNATools imported in the java file Trace.java.
> 
> My applet code in the php page looks like this:
> 
> <applet code="Trace.class"  archive="biojava-1.5.jar , bytecode.jar" height=800 width=800>
> 
> Please suggest if I am missing something.
> 
> Thanks in advance.
> 
> Abhinav
> 
> 
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSoob4C5LeMEKA/QRAsfkAJ9SlwIzDulzSDQpAfgh0alISRsplACcDqUx
uyQUEmRFEWTdnEHsm7k2lg0=
=SWHu
-----END PGP SIGNATURE-----


From holland at ebi.ac.uk  Mon Nov 26 12:55:23 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 12:55:23 +0000
Subject: [Biojava-l] Wrapping SimpleGappedSequence
In-Reply-To: <003701c82aba$e85f4320$b91dc960$@au.dk>
References: <002701c8277f$9dbdca50$d9395ef0$@au.dk>	<473C4EF4.5080301@ebi.ac.uk>
	<000a01c827c1$8c8e5a50$a5ab0ef0$@dk>	<473D5BFD.8080305@ebi.ac.uk>
	<000601c82833$143c5300$3cb4f900$@au.dk>	<473D67AF.2020007@ebi.ac.uk>
	<000f01c82839$06722550$13566ff0$@au.d	<473D7521.9070603@ebi.ac.uk>
	<001801c8284d$b8c525e0$2a4f71a0$@au.dk>	<473D911F.2000303@ebi.ac.uk>
	<000901c829d0$daa54620$8fefd260$@dk>
	<48631.80.42.116.113.1195405924.squirrel@webmail.ebi.ac.uk>
	<003701c82aba$e85f4320$b91dc960$@au.dk>
Message-ID: <474AC23B.3080500@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have made the changes you suggest below in CVS. Hopefully it will work
for you now.

cheers,
Richard

Ditlev Egeskov Brodersen wrote:
> Dear Richard and all,
> 
>   I've been dissecting the delegation problem encountered when instantiating
> SimpleGappedSequence(Sequence) with an already gapped sequence. The
> constructor calls the parent SimpleGappedSymbolList(), which in Richard's
> CVS update of 161107 now contains a separate overloaded constructor for the
> gapped case:
> 
>   public SimpleGappedSymbolList(GappedSymbolList gappedSource)
> 
>   However, when instantiating a new SimpleGappedSequence based on an
> existing gapped sequence (with several blocks), the blocks were lost. 
> 
>   After checking the path of code execution it appeared that for some
> reason, the old SimpleGappedSymbolList(SymbolList) was called. So I modified
> SimpleGappedSequence.java to include an overloaded constructor also for the
> descendant class, identical to the other constructor but with a
> GappedSequence argument:
> 
>   public SimpleGappedSequence(GappedSequence seq) {
>     super(seq);
>     this.sequence = seq;
>     createOnUnderlying = false;
>   }
> 
>   Now, the correct parent constructor
> (SimpleGappedSymbolList(GappedSymbolList)) was called. However, there are
> two other problems with the new SimpleGappedSymbolList constructor that
> need to be corrected for it to work as expected: First, the initial
> introduction of a single, large block is missing from the new code, so
> insert:
> 
>   Block b = new Block(1, length, 1, length);
>   blocks.add(b);
> 
>   Secondly, the code for transferring the gaps from the sequence string
> needs to use two separate indices, otherwise the gaps will be placed wrongly
> because their position is affected by previously inserted gaps:
> 
>   int n=1;
>   for(int i=1;i<=this.length();i++) {
>     if(this.alpha.getGapSymbol().equals(gappedSource.symbolAt(i)))
>       this.addGapInSource(n);
>     else
>       n++;
>   }
> 
>   In other words, the index giving the position of the gaps should only
> increment when there are NO gaps at the corresponding position in the gapped
> string.
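
The two-index bookkeeping described above can be sketched without BioJava. This is a simplified model under stated assumptions: plain strings stand in for symbol lists, and a hypothetical gapsBefore array stands in for addGapInSource by counting the gaps placed in front of each 1-based source position. It is not the library's implementation, only an illustration of why the gap index must advance on non-gap symbols alone.

```java
public class GapTransferSketch {
    static final char GAP = '-';

    // Walks the gapped string with index i. With twoIndices == true the gap is
    // recorded at n, which only advances on non-gap symbols; with false the
    // gapped index i is (wrongly) used as the source position.
    public static int[] transferGaps(String gapped, boolean twoIndices) {
        int[] gapsBefore = new int[gapped.length() + 2];
        int n = 1;
        for (int i = 1; i <= gapped.length(); i++) {
            if (gapped.charAt(i - 1) == GAP) {
                gapsBefore[twoIndices ? n : i]++;
            } else {
                n++;
            }
        }
        return gapsBefore;
    }

    // Rebuilds a gapped string from the ungapped source and the gap counts.
    // Counts recorded beyond source.length() + 1 are ignored in this model.
    public static String render(String source, int[] gapsBefore) {
        StringBuilder out = new StringBuilder();
        for (int p = 1; p <= source.length() + 1; p++) {
            for (int g = 0; g < gapsBefore[p]; g++) {
                out.append(GAP);
            }
            if (p <= source.length()) {
                out.append(source.charAt(p - 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String gapped = "MSE--KLMPRT---TWAKG";
        String source = gapped.replace("-", "");
        System.out.println(render(source, transferGaps(gapped, true)));  // MSE--KLMPRT---TWAKG
        System.out.println(render(source, transferGaps(gapped, false))); // MSE-K-LMPRTTW-A-K-G
    }
}
```

With two indices the gapped string is reproduced exactly; with a single index every gap after the first lands too far to the right, because earlier gaps have already shifted the gapped coordinates.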
> 
>   Following these changes, the GappedSequenceTest program from last week now
> works as expected:
> 
>  aSymbolList = MSE--KLMPRT---TWAKG
>  aSequence   = MSE--KLMPRT---TWAKG
> 
>  Gaps are not parsed when a SimpleGappedSequence is constructed from a 
>  gapped Sequence object:
>  aGapped     = MSE--KLMPRT---TWAKG
>  Gapped position 10 = Plain position 10
> 
>  aSymbolList = MSEKLMPRTTWAKG
>  aSequence   = MSEKLMPRTTWAKG
> 
>  Gaps introduced through addGapsInSource work ok:
>  aGapped     = MS--EKLMPR---TTWAKG
>  Gapped position 10 = Plain position 8
> 
>  Now a new SimpleGappedSequence object is created from the previous one:
>  aGapped2    = MS--EKLMPR---TTWAKG
>  Gapped position 10 = Plain position 8
> 
>   -- Ditlev
> 
> --
>  
> Ditlev E. Brodersen, Ph.D.
> Lektor, Associate Professor
>  
> Department of Molecular Biology   Office:  +45 89425259
> University of Aarhus              Lab:     +45 89425022
> Gustav Wieds Vej 10c              Fax:     +45 86123178
> DK-8000 Aarhus C                  Email:   deb at mb.au.dk
> Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 
>  -----Original Message-----
>  From: biojava-l-bounces at lists.open-bio.org
>  [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Richard Holland
>  Sent: 18 November 2007 18:12
>  To: Ditlev Egeskov Brodersen
>  Cc: biojava-l at biojava.org
>  Subject: Re: [Biojava-l] Wrapping SimpleGappedSequence
>  
>  Interesting stuff. I'm not sure why it isn't working so I'll have to
>  have a closer look.
>  
>  I'm currently on annual leave but will investigate when I return (Nov
>  27th).
>  
>  cheers,
>  Richard
>  
>  On Sun, November 18, 2007 10:50 am, Ditlev Egeskov Brodersen wrote:
>   Hi Richard,
>  
>     I thought what you say was correct too, but I can't get it to work.
>   Below is a small test program to check this. First, I create a
>   SimpleGappedSequence through Text with
>   gaps-SymbolList-Sequence-GappedSequence. Gaps are there but not
>   "understood", as expected. Next, I create the same sequence non-gapped
>   in the above way, then introduce gaps with addGapsInSource. A gapped
>   location is now properly translated to a non-gapped sequence position.
>   Finally, I create a new SimpleGappedSequence based on the working one -
>   as you can see the gaps are still there but not "understood"...
>  
>   aSymbolList = MSE--KLMPRT---TWAKG
>   aSequence   = MSE--KLMPRT---TWAKG
>  
>   Gaps are not parsed when a SimpleGappedSequence is constructed from a
>   gapped Sequence object:
>   aGapped     = MSE--KLMPRT---TWAKG
>   Gapped position 10 = Plain position 10
>  
>   aSymbolList = MSEKLMPRTTWAKG
>   aSequence   = MSEKLMPRTTWAKG
>  
>   Gaps introduced through addGapsInSource work ok:
>   aGapped     = MS--EKLMPR---TTWAKG
>   Gapped position 10 = Plain position 8
>  
>   Now a new SimpleGappedSequence object is created from the previous one:
>   aGapped2    = MS--EKLMPR---TTWAKG
>   Gapped position 10 = Plain position 10
>  
>   This should have been compiled with the new biojava.jar of 161107
>   (updated via CVS), but perhaps I made a mistake updating?
>  
>   Any clues?
>  
>   Thanks,
>  
>     Ditlev
>  
>   ---
>  
>   package gappedsequencetest;
>  
>   import org.biojava.bio.*;
>   import org.biojava.bio.seq.*;
>   import org.biojava.bio.seq.impl.*;
>   import org.biojava.bio.symbol.*;
>  
>   public class Main {
>  
>       public static void main(String[] args) {
>           SymbolList aSymbolList = null;
>           try {
>               aSymbolList = ProteinTools.createProtein("MSE--KLMPRT---TWAKG");
>           }
>           catch(BioException ex) {}
>  
>           System.out.println("aSymbolList = " + aSymbolList.seqString());
>  
>           Sequence aSequence = new SimpleSequence(aSymbolList, "",
>               "mySequence", null);
>           System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>  
>           SimpleGappedSequence aGapped = new SimpleGappedSequence(aSequence);
>           System.out.println("Gaps are not parsed when a SimpleGappedSequence "
>               + "is constructed from a gapped Sequence object:");
>           System.out.println("aGapped     = " + aGapped.seqString());
>           System.out.println("Gapped position 10 = Plain position "
>               + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>  
>           try {
>               aSymbolList = ProteinTools.createProtein("MSEKLMPRTTWAKG");
>           }
>           catch(BioException ex) {}
>  
>           System.out.println("aSymbolList = " + aSymbolList.seqString());
>  
>           aSequence = new SimpleSequence(aSymbolList, "", "mySequence", null);
>           System.out.println("aSequence   = " + aSequence.seqString() + "\n");
>  
>           aGapped = new SimpleGappedSequence(aSequence);
>           aGapped.addGapsInSource(9, 3);
>           aGapped.addGapsInSource(3, 2);
>           System.out.println("Gaps introduced through addGapsInSource work ok:");
>           System.out.println("aGapped     = " + aGapped.seqString());
>           System.out.println("Gapped position 10 = Plain position "
>               + aGapped.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>  
>           SimpleGappedSequence aGapped2 = new SimpleGappedSequence(aGapped);
>           System.out.println("Now a new SimpleGappedSequence object is created "
>               + "from the previous one:");
>           System.out.println("aGapped2    = " + aGapped2.seqString());
>           System.out.println("Gapped position 10 = Plain position "
>               + aGapped2.gappedToLocation(new PointLocation(10)).getMin() + "\n");
>       }
>  
>   }
>  
>   --
>  
>   Ditlev Egeskov Brodersen
>   Lektor
>   Bakkefaldet 30, Hasle
>   8210 Århus V
>  
>   www.lindeman-brodersen.dk
>  
>  
>   -----Original Message-----
>   From: Richard Holland [mailto:holland at ebi.ac.uk]
>   Sent: 16 November 2007 13:46
>   To: Ditlev Egeskov Brodersen
>   Cc: biojava-l at biojava.org
>   Subject: Re: Wrapping SimpleGappedSequence
>  
> SimpleGappedSequence extends SimpleGappedSymbolList, and the constructor
> delegates to the SimpleGappedSymbolList constructor.
> 
> When you extend SimpleGappedSequence you should delegate in your new
> constructor to the existing SimpleGappedSequence constructor, which in
> turn will delegate as above and preserve the gaps.
> 
> By passing any object which implements GappedSymbolList to the
> SimpleGappedSequence constructor, e.g. SimpleGappedSequence or
> SimpleGappedSymbolList, it will automatically choose the new constructor
> from SimpleGappedSymbolList which you hopefully should be able to see in
> the code you have just checked out. If passed any other
> non-GappedSymbolList object, it will use the old constructor that
> already existed from before.
> 
> cheers,
> Richard
> 
> Ditlev Egeskov Brodersen wrote:
>  Hi again,
> 
>    I updated CVS and got the new SimpleGappedSymbolList class, but there
>  seem to be no changes to the SimpleGappedSequence class, which is the
>  one I need to extend...have I missed something?
> 
>    Ditlev
> 
>  --
> 
>  Ditlev E. Brodersen, Ph.D.
>  Lektor, Associate Professor
> 
>  Department of Molecular Biology   Office:  +45 89425259
>  University of Aarhus              Lab:     +45 89425022
>  Gustav Wieds Vej 10c              Fax:     +45 86123178
>  DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>  Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 
>  -----Original Message-----
>  From: Richard Holland [mailto:holland at ebi.ac.uk]
>  Sent: 16 November 2007 11:47
>  To: Ditlev Egeskov Brodersen
>  Cc: biojava-l at biojava.org
>  Subject: Re: Wrapping SimpleGappedSequence
> 
>  The easiest way is simply for me to alter the constructor to
>  SimpleGappedSequence (and equivalently to SimpleGappedSymbolList) to
>  copy all gaps if passed another instance of GappedSymbolList as the
>  parameter. I've just done this in CVS so you should be able to update
>  your copy and observe the new behaviour.
> 
>  cheers,
>  Richard
> 
>  Ditlev Egeskov Brodersen wrote:
>  Hi again,
> 
>    thanks for the info - will do the check just to be proper. I have
>  another question: In my application, I would like to wrap the retrieved
>  SimpleGappedSequence objects inside another object that extends the
>  functionality with application-specific stuff. Ideally, I would do
>  this by extending the SimpleGappedSequence object and create it by
>  passing the SimpleGappedSequence from the alignment import to the
>  constructor of the parent, like so:
> 
>    class AlignedSequence extends SimpleGappedSequence {
>      public AlignedSequence(SimpleGappedSequence aGapped) {
>        super(aGapped);
>      }
> 
>      ..custom stuff..
>    }
> 
>  However, the problem is that there is only one constructor for the
>  SimpleGappedSequence, one which takes a simple Sequence object. I can
>  pass the derived class alright, but all gap information is lost again,
>  presumably because the SimpleGappedSequence constructor just takes out
>  the seqString() and puts it into its own sequence object.
> 
>  Shouldn't the constructor of the SimpleGappedSequence class recognise
>  when a derived (and gapped) sequence object is passed, and process it
>  accordingly?
>  As it stands, I am forced to include the SimpleGappedSequence as a
>  private member of the AlignedSequence class, which is not nearly as
>  nice since all statements using the class will have to do something like
> 
>    class AlignedSequence {
>      private SimpleGappedSequence gapped_sequence;
> 
>      public AlignedSequence(SimpleGappedSequence aGapped) {
>        gapped_sequence = aGapped;
>      }
> 
>      public SimpleGappedSequence getGappedSequence() {
>        return(gapped_sequence);
>      }
> 
>      ..custom stuff..
>    }
> 
>    ...
> 
>    AlignedSequence aAligned = new AlignedSequence(aGapped);
>    aAligned.getGappedSequence().seqString();
> 
>  rather than simply:
> 
>    AlignedSequence aAligned = new AlignedSequence(aGapped);
>    aAligned.seqString();
> 
>  In other words, is there any solution with the current setup that
>  would allow me to extend SimpleGappedSequence and not lose the gap
>  information?
>  --  Ditlev
> 
>  --
> 
>  Ditlev E. Brodersen, Ph.D.
>  Lektor, Associate Professor
> 
>  Department of Molecular Biology   Office:  +45 89425259
>  University of Aarhus              Lab:     +45 89425022
>  Gustav Wieds Vej 10c              Fax:     +45 86123178
>  DK-8000 Aarhus C                  Email:   deb at mb.au.dk
>  Denmark                           Lab WWW: www.bioxray.dk/~deb
> 
> 
>  -----Original Message-----
>  From: Richard Holland [mailto:holland at ebi.ac.uk]
>  Sent: 16 November 2007 10:50
>  To: Ditlev Egeskov Brodersen
>  Cc: biojava-l at biojava.org
>  Subject: Re: [Biojava-l] Parsing exising gaps
> 
>    The returned gapped sequences are all properly set up with gaps,
>  name etc.
>  But as for other users, I think there may be some problems, since
>  the SimpleAlignment object only has a general symbol list iterator,
>  the user will have to cast each statement extracting a sequence
>  object, and
> 
>        SimpleSequence aSimple = (SimpleSequence)aSequences.next();
> 
>  throws a ClassCastException at run time. So old code might not
>  run with the update as far as I can see.
>  This is true. However, such code would be unsupported by us as the
>  API clearly states that SimpleAlignment returns SymbolList instances,
>  and does not make any guarantees about the exact implementation
>  details of the objects it returns. To attempt to cast it to anything
>  other than SymbolList would be a mistake! (Although actually it is now
>  returning a guarantee of GappedSymbolList, which is what your code can
>  now take advantage of). To assume it will return SimpleSequence is
>  outside the behaviour defined by the API and therefore should not be
>  relied upon.
> 
>  A more correct behaviour would be to test each item returned:
> 
>  	SymbolList symlist = (SymbolList)aSequences.next();
>  	if (symlist instanceof SimpleSequence) {
>  		SimpleSequence seq = (SimpleSequence)symlist;
>  		// Do simple-sequence stuff
>  	} else {
>  		// Do something else!
>  	}
> 
>  In future, I will modify the API to change the SymbolList guarantee
>  to a GappedSymbolList guarantee, but I can't do this right now as this
>  really would break everyone's code!
> 
>  We are currently planning a redesign as you may be aware, so issues
>  like this will hopefully be resolved as part of that process. For a
>  start, if we use Java 5 generics in future as we plan, we can strictly
>  specify what kinds of objects will be returned by things such as the
>  alignment API, making it easier for us to enforce API-compliant
>  behaviour in users' code.
> 
>  cheers,
>  Richard
> 
> --
> Richard Holland (BioMart)
> EMBL EBI, Wellcome Trust Genome Campus,
> Hinxton, Cambridgeshire CB10 1SD, UK
> Tel. +44 (0)1223 494416
> 
> http://www.biomart.org/
> http://www.biojava.org/

>  --
>  Richard Holland
>  BioMart (http://www.biomart.org/)
>  EMBL-EBI
>  Hinxton, Cambridgeshire CB10 1SD, UK

>  _______________________________________________
>  Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>  http://lists.open-bio.org/mailman/listinfo/biojava-l


- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSsI64C5LeMEKA/QRAg21AKCieEvT2KaWBFdqLFUtxazhHXmD2wCgiRwk
Bz79hrJxD/eZrrCUXUAh758=
=0Jpp
-----END PGP SIGNATURE-----


From allank at sanbi.ac.za  Mon Nov 26 12:02:56 2007
From: allank at sanbi.ac.za (Allan Kamau)
Date: Mon, 26 Nov 2007 14:02:56 +0200
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
Message-ID: <474AB5F0.6040802@sanbi.ac.za>

Hi Mark,
Thank you for your reply.
Calling setModeLazy() method of the object of type BlastLikeSAXParser 
did provide the cure.

Allan.

Mark Schreiber wrote:
> Hi Allan -
>  
> I think the solution is to call the setParserLazy() or some method 
> with a similar name (I don't have the API handy). This will prevent it 
> doing the check.
>  
> The original idea of this method was you could check against a list of 
> version numbers that people had validated.  I don't think this is a 
> good idea as nothing is truly 100% validated and we haven't kept the
> list up to date.  If there are no objections I would propose to make
> this method deprecated (and its opposite method) and change the
> default behaviour to lazy checking.
>  
> Best regards.
>  
> - Mark
>
>  
> On 11/25/07, *Allan Kamau* <allank at sanbi.ac.za 
> <mailto:allank at sanbi.ac.za>> wrote:
>
>     Hi all,
>     I've searched for a conclusive answer to the "Program ncbi-blastn
>     Version <some value> is not supported" without success.
>     I would like to know which blast output format BioJava's blast-like
>     parsing framework accepts, including some examples (without the data)
>     of how such blast output may be created.
>     For example, I am using ncbi-blastn and I am generating the blast
>     file
>     (which Biojava doesn't like) as follows.
>
>     export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
>     export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
>     export
>     REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
>     export BLAST_REPORT_TABULAR=somesequence.blast.txt
>     export BLAST_REPORT_XML=somesequence.blast.xml
>     export BLAST_REPORT=somesequence.blast
>     export INPUT_FASTA=somesequence.fasta
>     export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
>
>     date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i
>     $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d
>     $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o
>     $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d
>     $REFERENCES_FASTA_NAME -i
>     $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;
>
>     Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
>     from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
>
>     Then I get the error below.
>
>
>     [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
>     Buildfile: build.xml
>
>     runBlastParser:
>         [java] org.xml.sax.SAXException: Program ncbi-blastn Version
>     2.2.17
>     is not supported by the biojava blast-like parsing framework
>         [java]     at
>     org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java
>     :241)
>         [java]     at
>     org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
>
>     Allan.
>     _______________________________________________
>     Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>     <mailto:Biojava-l at lists.open-bio.org>
>     http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>


From markjschreiber at gmail.com  Tue Nov 27 03:16:35 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 27 Nov 2007 11:16:35 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <474AB5F0.6040802@sanbi.ac.za>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
	<474AB5F0.6040802@sanbi.ac.za>
Message-ID: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>

Hi -

Does anyone mind if I change the default behavior to lazy parsing?
Technically this would be a break in backwards compatibility (although
only if you have a program that relies on strict parsing).
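
To make the trade-off concrete, here is a minimal sketch of what strict versus lazy checking amounts to. It is a hypothetical model, not the BlastLikeSAXParser source: the class name, the accepts/check helpers and the list of validated versions are invented for illustration, setModeLazy mirrors the method named in this thread, and setModeStrict is an assumed name for its opposite.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VersionCheckSketch {
    private final Set<String> validated; // versions someone has checked by hand
    private boolean lazy;                // true: skip the version check entirely

    public VersionCheckSketch(boolean lazyByDefault, String... validatedVersions) {
        this.validated = new HashSet<>(Arrays.asList(validatedVersions));
        this.lazy = lazyByDefault;
    }

    public void setModeLazy()   { lazy = true; }
    public void setModeStrict() { lazy = false; }

    // In lazy mode every version is accepted; in strict mode only listed ones.
    public boolean accepts(String version) {
        return lazy || validated.contains(version);
    }

    // Strict mode rejects anything not on the (possibly stale) list.
    public void check(String program, String version) {
        if (!accepts(version)) {
            throw new IllegalStateException(
                "Program " + program + " Version " + version + " is not supported");
        }
    }

    public static void main(String[] args) {
        // The version numbers below are made up for the example.
        VersionCheckSketch parser = new VersionCheckSketch(false, "2.2.11", "2.2.15");
        System.out.println(parser.accepts("2.2.17")); // false
        parser.setModeLazy();
        System.out.println(parser.accepts("2.2.17")); // true
    }
}
```

Under this model, changing the default only affects callers that never set a mode explicitly and that depend on unlisted versions being rejected.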

Last chance to complain.

- Mark

On Nov 26, 2007 8:02 PM, Allan Kamau <allank at sanbi.ac.za> wrote:
> Hi Mark,
> Thank you for your reply.
> Calling setModeLazy() method of the object of type BlastLikeSAXParser
> did provide the cure.
>
> Allan.
>
>
> Mark Schreiber wrote:
> > Hi Allan -
> >
> > I think the solution is to call the setParserLazy() or some method
> > with a similar name (I don't have the API handy). This will prevent it
> > doing the check.
> >
> > The original idea of this method was you could check against a list of
> > version numbers that people had validated.  I don't think this is a
> > good idea as nothing is truly 100% validated and we haven't kept the
> > list up to date.  If there are no objections I would propose to make
> > this method deprecated (and its opposite method) and change the
> > default behaviour to lazy checking.
> >
> > Best regards.
> >
> > - Mark
> >
> >
> > On 11/25/07, *Allan Kamau* <allank at sanbi.ac.za
>
>
>
> > <mailto:allank at sanbi.ac.za>> wrote:
> >
> >     Hi all,
> >     I've searched for a conclusive answer to the "Program ncbi-blastn
> >     Version <some value> is not supported" without success.
> >     I would like to know which blast output format BioJava's blast-like
> >     parsing framework accepts, including some examples (without the data)
> >     of how such blast output may be created.
> >     For example, I am using ncbi-blastn and I am generating the blast
> >     file
> >     (which Biojava doesn't like) as follows.
> >
> >     export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
> >     export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
> >     export
> >     REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
> >     export BLAST_REPORT_TABULAR=somesequence.blast.txt
> >     export BLAST_REPORT_XML=somesequence.blast.xml
> >     export BLAST_REPORT=somesequence.blast
> >     export INPUT_FASTA=somesequence.fasta
> >     export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
> >
> >     date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i
> >     $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d
> >     $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o
> >     $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d
> >     $REFERENCES_FASTA_NAME -i
> >     $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;
> >
> >     Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
> >     from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
> >
> >     Then I get the error below.
> >
> >
> >     [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
> >     Buildfile: build.xml
> >
> >     runBlastParser:
> >         [java] org.xml.sax.SAXException: Program ncbi-blastn Version
> >     2.2.17
> >     is not supported by the biojava blast-like parsing framework
> >         [java]     at
> >     org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java
> >     :241)
> >         [java]     at
> >     org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
> >
> >     Allan.
> >     _______________________________________________
> >     Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> >     <mailto:Biojava-l at lists.open-bio.org>
>
>
>
> >     http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> >
>
>


From holland at ebi.ac.uk  Tue Nov 27 08:40:10 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 08:40:10 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
Message-ID: <474BD7EA.4040604@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sounds good to me.

Mark Schreiber wrote:
> Hi -
> 
> Does anyone mind if I change the default behavior to lazy parsing?
> Technically this would be a break in backwards compatibility (although
> only if you have a program that relies on strict parsing).
> 
> Last chance to complain.
> 
> - Mark
> 
> On Nov 26, 2007 8:02 PM, Allan Kamau <allank at sanbi.ac.za> wrote:
>> Hi Mark,
>> Thank you for your reply.
>> Calling setModeLazy() method of the object of type BlastLikeSAXParser
>> did provide the cure.
>>
>> Allan.
>>
>>
>> Mark Schreiber wrote:
>>> Hi Allan -
>>>
>>> I think the solution is to call the setParserLazy() or some method
>>> with a similar name (I don't have the API handy). This will prevent it
>>> doing the check.
>>>
>>> The original idea of this method was you could check against a list of
>>> version numbers that people had validated.  I don't think this is a
>>> good idea as nothing is truly 100% validated and we haven't kept the
>>> list up to date.  If there are no objections I would propose to make
>>> this method deprecated (and its opposite method) and change the
>>> default behaviour to lazy checking.
>>>
>>> Best regards.
>>>
>>> - Mark
>>>
>>>
>>> On 11/25/07, *Allan Kamau* <allank at sanbi.ac.za
>>
>>
>>> <mailto:allank at sanbi.ac.za>> wrote:
>>>
>>>     Hi all,
>>>     I've searched for a conclusive answer to the "Program ncbi-blastn
>>>     Version <some value> is not supported" without success.
>>>     I would like to know which blast output format BioJava's blast-like
>>>     parsing framework accepts, including some examples (without the data)
>>>     of how such blast output may be created.
>>>     For example, I am using ncbi-blastn and I am generating the blast
>>>     file
>>>     (which Biojava doesn't like) as follows.
>>>
>>>     export FORMATDB=/usr/local/share/blast/blast-2.2.17/bin/formatdb;
>>>     export BLASTALL=/usr/local/share/blast/blast-2.2.17/bin/blastall;
>>>     export
>>>     REFERENCES_FASTA_NAME=/study/tmp/somesequence/blast/20071017.fasta;
>>>     export BLAST_REPORT_TABULAR=somesequence.blast.txt
>>>     export BLAST_REPORT_XML=somesequence.blast.xml
>>>     export BLAST_REPORT=somesequence.blast
>>>     export INPUT_FASTA=somesequence.fasta
>>>     export WORK_DIR=/development/study/try1/tmpDataFiles/somesequence
>>>
>>>     date;time cd $WORK_DIR;cp $REFERENCES_FASTA_NAME .;$FORMATDB -i
>>>     $REFERENCES_FASTA_NAME -p F -o T;echo `pwd`;$BLASTALL -p blastn -d
>>>     $REFERENCES_FASTA_NAME -i $INPUT_FASTA -q -1 -r 1 -m 8 -o
>>>     $BLAST_REPORT_TABULAR;$BLASTALL -p blastn -d
>>>     $REFERENCES_FASTA_NAME -i
>>>     $INPUT_FASTA -q -1 -r 1 -m 7 -o $BLAST_REPORT_XML;date;
>>>
>>>     Then I supply the $BLAST_REPORT as an argument to "BlastParser" copied
>>>     from " http://biojava.org/wiki/BioJava:CookBook:Blast:Parser"
>>>
>>>     Then I get the error below.
>>>
>>>
>>>     [aaron at localhost try1]$ $ANT_HOME/bin/ant runBlastParser;
>>>     Buildfile: build.xml
>>>
>>>     runBlastParser:
>>>         [java] org.xml.sax.SAXException: Program ncbi-blastn Version
>>>     2.2.17
>>>     is not supported by the biojava blast-like parsing framework
>>>         [java]     at
>>>     org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java
>>>     :241)
>>>         [java]     at
>>>     org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
>>>
>>>     Allan.
>>>     _______________________________________________
>>>     Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>     <mailto:Biojava-l at lists.open-bio.org>
>>
>>
>>>     http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>>
>>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS9fq4C5LeMEKA/QRAm/3AJ9hi2yrSyeK6a3nXtObyJ2MAk0Y1QCeL5HT
iYQc6HTdm6fJ+Lcfssnd34g=
=VuJJ
-----END PGP SIGNATURE-----


From ap3 at sanger.ac.uk  Tue Nov 27 10:24:49 2007
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 27 Nov 2007 10:24:49 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
	<474AB5F0.6040802@sanbi.ac.za>
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
Message-ID: <C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>

> Does anyone mind if I change the default behavior to lazy parsing?

Hi Mark,

I think this is a good idea.

We had a couple of questions and feature requests recently regarding
the blast parser, so I wonder if we should also have a look at how to
make it (and the documentation) better during the V3 discussion...

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
                               +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From holland at ebi.ac.uk  Tue Nov 27 11:01:33 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 11:01:33 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
Message-ID: <474BF90D.3070003@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> we had a couple of questions and feature requests recently regarding  
> the blast parser, so I wonder if we should
> have a look at how to make it (and the documentation) better also  
> during the V3 discussion...

A rethink of the blast parser is definitely a good idea. It's starting
to need more work than before as the various subtly different file
formats used by the most recent versions and variants of blast have
evolved beyond the tolerance limits of the existing parser. It also
needs to be made simpler to use.

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb
Nu2SZ7yV4e0rUmyIBxNYTJU=
=9nHg
-----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Tue Nov 27 11:11:30 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 11:11:30 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <474BF90D.3070003@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk>
Message-ID: <474BFB62.3040203@ebi.ac.uk>

What format options are there from blast? Just thinking: if it supports
CIGAR or something like that, are we better off providing a parser for that
format & saying that we do not support the traditional blast output?
That said, it doesn't help us when that format changes, so maybe what is
needed is a way to push out parser changes without requiring a full
biojava release (v3 discussion) ...

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
>> we had a couple of questions and feature requests recently regarding  
>> the blast parser, so I wonder if we should
>> have a look at how to make it (and the documentation) better also  
>> during the V3 discussion...
> 
> A rethink of the blast parser is definitely a good idea. It's starting
> to need more work than before as the various subtly different file
> formats used by the most recent versions and variants of blast have
> evolved beyond the tolerance limits of the existing parser. It also
> needs to be made simpler to use.
> 
> cheers,
> Richard
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHS/kM4C5LeMEKA/QRAho9AJkB28pMowj5OBXtokCKqNtmcBBq8ACdGGeb
> Nu2SZ7yV4e0rUmyIBxNYTJU=
> =9nHg
> -----END PGP SIGNATURE-----
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l


From holland at ebi.ac.uk  Tue Nov 27 11:18:59 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 11:18:59 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <474BFB62.3040203@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>
Message-ID: <474BFD23.8060005@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> What format options are there from blast? Just thinking if it supports
> CIGAR or something like that are we better providing a parser for that
> format & saying that we do not support the traditional blast output?
> That said it doesn't help is when that format changes so maybe what is
> needed is a way to push out parser changes without requiring a full
> biojava release (v3 discussion) ...

Exactly! So the modular idea would work nicely here - we could have a
blast module and only update that single module (which would be its own
JAR) whenever the format changes. In a way, BioJava releases as such
would no longer happen, except maybe for some kind of core BioJava
module. Everything would be done in terms of individual module+JAR
releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
for Phylogenetic tools, one for translation/transcription, etc. etc.

cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS/0j4C5LeMEKA/QRAkQuAJ9B+mmV7vo9QuFYwEgmnHczExyXqwCfamIx
uPFQKdbXRC7pwC6lM5aBcJk=
=F3PD
-----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Tue Nov 27 11:47:54 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 11:47:54 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <474BFD23.8060005@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	<474AB5F0.6040802@sanbi.ac.za>	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>
	<474BFD23.8060005@ebi.ac.uk>
Message-ID: <474C03EA.4070706@ebi.ac.uk>

I think Groovy have adopted a similar system recently & have guidelines
for how each module should behave (dependencies, build system etc). This
enforces the idea that a module, whilst not part of the core project,
must behave in the same manner as the core does. I do like the idea that
we can have a core biojava with things added around it; it might
encourage other users to start developing their own modules for any
formats/purposes they want.

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
>> What format options are there from blast? Just thinking if it supports
>> CIGAR or something like that are we better providing a parser for that
>> format & saying that we do not support the traditional blast output?
>> That said it doesn't help is when that format changes so maybe what is
>> needed is a way to push out parser changes without requiring a full
>> biojava release (v3 discussion) ...
> 
> Exactly! So the modular idea would work nicely here - we could have a
> blast module and only update that single module (which would be its own
> JAR) whenever the format changes. In a way, BioJava releases as such
> would no longer happen, except maybe for some kind of core BioJava
> module. Everything would be done in terms of individual module+JAR
> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> for Phylogenetic tools, one for translation/transcription, etc. etc.
> 
> cheers,
> Richard


From markjschreiber at gmail.com  Tue Nov 27 14:48:12 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 27 Nov 2007 22:48:12 +0800
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
	supported
In-Reply-To: <474C03EA.4070706@ebi.ac.uk>
References: <4749745F.9070104@sanbi.ac.za>
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>
	<474AB5F0.6040802@sanbi.ac.za>
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>
	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>
	<474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk>
Message-ID: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>

For a long time now my feeling has been that we should *only* support
the XML version of blast output.  The other formats are too brittle to
be easy to parse.  I feel similarly about Genbank, EMBL, etc.  That
may be an extreme view, but the power of generic XML parsers and things
like XPath really makes the XML formats look very attractive.

- Mark


On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> I think Groovy have adopted a similar system recently & have guidelines
> for how each module should behave (dependencies, build system etc). This
> enforces the idea that a module whilst not part of the core project must
> behave in the same manner the core does. I do like the idea that we can
> have a core biojava & things get added around it & it might encourage
> other users to start developing their own modules for any
> formats/purpose they want.
>
> Richard Holland wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> >> What format options are there from blast? Just thinking if it supports
> >> CIGAR or something like that are we better providing a parser for that
> >> format & saying that we do not support the traditional blast output?
> >> That said it doesn't help is when that format changes so maybe what is
> >> needed is a way to push out parser changes without requiring a full
> >> biojava release (v3 discussion) ...
> >
> > Exactly! So the modular idea would work nicely here - we could have a
> > blast module and only update that single module (which would be its own
> > JAR) whenever the format changes. In a way, BioJava releases as such
> > would no longer happen, except maybe for some kind of core BioJava
> > module. Everything would be done in terms of individual module+JAR
> > releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> > for Phylogenetic tools, one for translation/transcription, etc. etc.
> >
> > cheers,
> > Richard


From ayates at ebi.ac.uk  Tue Nov 27 15:16:12 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Tue, 27 Nov 2007 15:16:12 +0000
Subject: [Biojava-l] error: Program ncbi-blastn Version 2.2.17 is not
 supported
In-Reply-To: <93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>
References: <4749745F.9070104@sanbi.ac.za>	
	<93b45ca50711251717i24d9b7d8o6f5d4639ab573d00@mail.gmail.com>	
	<474AB5F0.6040802@sanbi.ac.za>	
	<93b45ca50711261916pbd2dd82hd69da9f985437fa4@mail.gmail.com>	
	<C720946F-B141-4C8D-9AD8-F5E328A3918D@sanger.ac.uk>	
	<474BF90D.3070003@ebi.ac.uk> <474BFB62.3040203@ebi.ac.uk>	
	<474BFD23.8060005@ebi.ac.uk> <474C03EA.4070706@ebi.ac.uk>
	<93b45ca50711270648q53d4deeeh3ffa7d6cef26c328@mail.gmail.com>
Message-ID: <474C34BC.4070209@ebi.ac.uk>

I was always under the impression that blast's XML output was nearly as
hard to parse as the flat file format, but I do agree that using XML
wherever we can would make writing parsers a lot easier (especially if
there are SAX based XPath libraries available). Actually this brings up
a good question about development of this type of parser. The majority
of XPath supporting libraries are DOM based, which will mean large
memory usage in some situations but overall provides an easier coding
experience (and hopefully reduces our chances of creating bugs). Or
should we code to the edge case of someone trying to parse a 1GB XML
file? Personally I'd favour the former.

Going back to the original topic, there are going to be situations where
people want the flat file parsers/writers & I think it's a valid point
to say this is where BioJava is meant to come in & help a developer.
After all, XML is a computer science problem whereas parsing an EMBL
flat file or blast output is a bioinformatics problem.

Andy

Mark Schreiber wrote:
> For a long time now my feeling has been that we should *only* support
> the XML version of blast output.  The other formats are too brittle to
> be easy to parse.  I also feel similarly about Genbank, EMBL, etc that
> may be an extreme view but the power of generic XML parsers and things
> like XPath etc really make these formats look very attractive.
> 
> - Mark
> 
> 
> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> I think Groovy have adopted a similar system recently & have guidelines
>> for how each module should behave (dependencies, build system etc). This
>> enforces the idea that a module whilst not part of the core project must
>> behave in the same manner the core does. I do like the idea that we can
>> have a core biojava & things get added around it & it might encourage
>> other users to start developing their own modules for any
>> formats/purpose they want.
>>
>> Richard Holland wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>>> What format options are there from blast? Just thinking if it supports
>>>> CIGAR or something like that are we better providing a parser for that
>>>> format & saying that we do not support the traditional blast output?
>>>> That said it doesn't help is when that format changes so maybe what is
>>>> needed is a way to push out parser changes without requiring a full
>>>> biojava release (v3 discussion) ...
>>> Exactly! So the modular idea would work nicely here - we could have a
>>> blast module and only update that single module (which would be its own
>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>> would no longer happen, except maybe for some kind of core BioJava
>>> module. Everything would be done in terms of individual module+JAR
>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>
>>> cheers,
>>> Richard


From markjschreiber at gmail.com  Wed Nov 28 03:34:38 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 28 Nov 2007 11:34:38 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
Message-ID: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>

Hi -

I think in most cases huge XML files in bioinformatics result from a
single XML document containing many repetitive elements, e.g. a BLAST
XML output with several hits or a GenBankXML with many sequences.  A
nice approach I have seen for dealing with these is to use SAX to read
over the file and, every time it comes to one of those elements,
delegate to a DOM object.  You then parse the bits of the DOM you want
with XPath, or convert them to objects or something, and when you are
finished with that entry everything gets garbage collected and the SAX
parser moves to the next element and repeats the whole process.  This
is a hybrid of event based parsing and object-model based parsing which
could let you deal efficiently with huge files.
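
The hybrid described above can be sketched roughly as follows. This is
only an illustration, not BioJava code: the class SaxDomHybrid, the
RecordCallback interface and the hitIds helper are made-up names, and
the sample input is a cut-down fragment using the Hit / Hit_id element
names from BLAST XML output. It assumes a current JDK (the lambda needs
Java 8) and an input without a DOCTYPE, so no DTD is fetched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxDomHybrid {

    /** Receives one small DOM fragment per record (hypothetical interface). */
    interface RecordCallback {
        void onRecord(Element record) throws Exception;
    }

    /** SAX handler that builds a DOM tree only while inside a record element. */
    static final class RecordHandler extends DefaultHandler {
        private final String recordName;
        private final RecordCallback callback;
        private final DocumentBuilder builder;
        private Document doc;  // fragment under construction, null outside a record
        private Node current;  // node that receives the next child

        RecordHandler(String recordName, RecordCallback callback) throws Exception {
            this.recordName = recordName;
            this.callback = callback;
            this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (doc == null) {
                if (!qName.equals(recordName)) {
                    return; // outside any record: build nothing
                }
                doc = builder.newDocument(); // fresh document for each record
                current = doc;
            }
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(e);
            current = e;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) {
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (doc == null) {
                return;
            }
            if (current.getParentNode() == doc) { // the record itself just closed
                try {
                    callback.onRecord((Element) current);
                } catch (Exception ex) {
                    throw new SAXException(ex);
                }
                doc = null;     // drop the fragment so it can be garbage collected
                current = null;
            } else {
                current = current.getParentNode();
            }
        }
    }

    /** Streams over the input and collects the Hit_id of every Hit via XPath. */
    static List<String> hitIds(String xml) {
        List<String> ids = new ArrayList<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            RecordHandler handler =
                new RecordHandler("Hit", record -> ids.add(xpath.evaluate("Hit_id", record)));
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<Iteration_hits>"
            + "<Hit><Hit_id>gi|1</Hit_id><Hit_len>120</Hit_len></Hit>"
            + "<Hit><Hit_id>gi|2</Hit_id><Hit_len>95</Hit_len></Hit>"
            + "</Iteration_hits>";
        System.out.println(hitIds(xml));
    }
}
```

Only one Hit is held in memory at a time, so memory use should depend
on the size of the largest record rather than on the size of the file.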

I think the BLAST XML has improved substantially, at least in terms of
validating against its own DTD.  The DTD itself may not be the best
design, but that is always a matter of taste, and if you are using XPath
to get the relevant bits you don't need to make a SAX parser jump
through hoops to get them.

I agree we will have to keep flat file parsers but we should strongly
encourage the use of XML where possible. It is simply easier to deal
with. Most biological flat files were designed for Fortran and mainly
for human consumption. There is no obvious validation mechanism.
Notably, everything in NCBI is derived from ASN.1; what you see in the
flat file is produced from there. I tend to think this means that the
ASN.1 is the holy gospel and what you get in the flat file is some
translation.  Ideally NCBI files should be parsed from the ASN.1, where
you can guarantee validation; the more practical alternative is to use
the XML, which you can at least validate against a DTD.

With XML we (Biojava) can say if it validates we will parse it and if
it doesn't we may not.  With flat files there are so many dodgy
variants we cannot say anything.  Because XML DTDs (or XSDs) have
versions it is also much easier to have parsers for different
versions, and the parsing machinery can figure out which is needed.
With flat files it is anyone's guess what version you are dealing with.

Finally parsers can be auto-generated for XML if you have the DTD or
XSD. This often doesn't give you an ideal parser but it can be a
useful starting point for rapid development.

For Biojava v 3 I think we should concentrate on XML parsers first and
flat files second. <sigh>if only Fasta had an XML format</sigh>

- Mark

On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> I was always under the impression that blast's XML output was nearly as
> hard to parse as the flat file format but I do agree that if we can use
> XML whenever we can it would make writing parsers a lot easier
> (especially if there are SAX based XPath libraries available). Actually
> this brings up a good question about development of this type of parser.
> The majority of XPath supporting libraries are DOM based which will mean
> large memory usage in some situations but overall providing an easier
> coding experience (and hopefully reduce our chances of creating bugs).
> Or should we code to the edge cases of someone trying to parse a 1GB
> XML? Personally I'd favour the former.
>
> Going back to the original topic there are going to be situations where
> people want the flat file parsers/writers & I think it's a valid point
> to say this is where BioJava is meant to come in & help a developer.
> Afterall XML is a computer science problem where as parsing an EMBL flat
> file or blast output is a bioinformatics problem.
>
> Andy
>
>
> Mark Schreiber wrote:
> > For a long time now my feeling has been that we should *only* support
> > the XML version of blast output.  The other formats are too brittle to
> > be easy to parse.  I also feel similarly about Genbank, EMBL, etc that
> > may be an extreme view but the power of generic XML parsers and things
> > like XPath etc really make these formats look very attractive.
> >
> > - Mark
> >
> >
> > On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> I think Groovy have adopted a similar system recently & have guidelines
> >> for how each module should behave (dependencies, build system etc). This
> >> enforces the idea that a module whilst not part of the core project must
> >> behave in the same manner the core does. I do like the idea that we can
> >> have a core biojava & things get added around it & it might encourage
> >> other users to start developing their own modules for any
> >> formats/purpose they want.
> >>
> >> Richard Holland wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>>> What format options are there from blast? Just thinking if it supports
> >>>> CIGAR or something like that are we better providing a parser for that
> >>>> format & saying that we do not support the traditional blast output?
> >>>> That said it doesn't help is when that format changes so maybe what is
> >>>> needed is a way to push out parser changes without requiring a full
> >>>> biojava release (v3 discussion) ...
> >>> Exactly! So the modular idea would work nicely here - we could have a
> >>> blast module and only update that single module (which would be its own
> >>> JAR) whenever the format changes. In a way, BioJava releases as such
> >>> would no longer happen, except maybe for some kind of core BioJava
> >>> module. Everything would be done in terms of individual module+JAR
> >>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> >>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> >>>
> >>> cheers,
> >>> Richard


From ayates at ebi.ac.uk  Wed Nov 28 14:29:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 14:29:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
Message-ID: <474D7B3B.8030807@ebi.ac.uk>

Hi Mark,

Okay that sounds like a perfectly sensible way to deal with this. Is 
this kind of parsing model supported in Java5? I only ask as I've not 
done a lot of XML parsing with Java5; more with things like XOM (which I 
think offers a DOM only representation but I'm probably wrong).

That's good. There's not much point in having a format & a DTD/XSD and
then having your files not conform to it.

I was thinking the exact same thing about ASN.1 (well that & it looks
bleeding horrible to parse, but that is an uneducated look at the format,
which I'm sure is as parsable as JSON & the like).

When it comes to flat file parsers I would be happier to provide 
implementations of the more common formats where a viable alternative is 
not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide 
similar output to the above have a chance to write their own 
parsers/formatters. This is very similar to the current situation but we 
just need to remove dependencies on statically located data structures 
(don't get rid of them completely just give users an option to not use 
them).

I'm not sure how much automatically generated parsers would help us. I 
guess it depends on the data model(s) we use if they are auto-parser 
friendly (which normally means POJO/JavaBean conventions including the 
no-args constructor).

Cool I don't want to exclude flat file parsers completely (if only 
because my group has an interest in BioJava being able to read & write 
flat files) :)

They decided to have HUPO-PSI Format instead :)

Andy

Mark Schreiber wrote:
> Hi -
> 
> I think in most cases huge XML files in bioinformatics result from a
> single XML containing multiple repetitive elements. Eg a BLAST XML
> output with several hits or a GenBankXML with many Sequences.  A nice
> approach I have seen for dealing with these is to use SAX to read over
> the file and every time it comes to an element it delegates to a DOM
> object.  You then parse the bits of the DOM you want with XPath or
> convert to objects or something and then when you are finished with
> that entry everything gets garbage collected and the SAX parser moves
> to the next element and repeats the whole process.  This is a hybrid
> of event based parsing and object-model based parsing which could let
> you efficiently deal with huge files.
> 
> I think the BLAST XML has improved substantially, at least in terms of
> validating against it's own DTD.  The DTD itself may not be the best
> design but that is always a matter of taste and if you are using XPath
> to get the relevant bits you don't need to make a SAX parser jump
> through hoops to get them.
> 
> I agree we will have to keep flat file parsers but we should strongly
> encourage the use of XML where possible. It is simply easier to deal
> with. Most biological flat-files were designed for Fortran and mainly
> for human consumption. There is no obvious validation mechanism.
> Notably everything in NCBI is derived from ASN.1, what you see in the
> flatfile is produced from there. I tend to think this means that the
> ASN.1 is the holy gospel and what you get in the flat file is some
> translation.  Ideally NCBI files should be parsed from the ASN.1 where
> you can guarantee validation, the more practical alternative is to use
> the XML which you can at least validate against a DTD.
> 
> With XML we (Biojava) can say if it validates we will parse it and if
> it doesn't we may not.  With flat files there are so many dodgey
> variants we cannot say anything.  Because XML dtds (or xsd's) have
> versions it also makes it much easier to have parsers for different
> versions and the parsing machinery can figure out which is needed.
> With flat files it is anyones guess what version you are dealing with.
> 
> Finally parsers can be auto-generated for XML if you have the DTD or
> XSD. This often doesn't give you an ideal parser but it can be a
> useful starting point for rapid development.
> 
> For Biojava v 3 I think we should concentrate on XML parsers first and
> flat files second. <sigh>if only Fasta had an XML format</sigh>
> 
> - Mark
> 
> On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> I was always under the impression that blast's XML output was nearly as
>> hard to parse as the flat file format but I do agree that if we can use
>> XML whenever we can it would make writing parsers a lot easier
>> (especially if there are SAX based XPath libraries available). Actually
>> this brings up a good question about development of this type of parser.
>> The majority of XPath supporting libraries are DOM based which will mean
>> large memory usage in some situations but overall providing an easier
>> coding experience (and hopefully reduce our chances of creating bugs).
>> Or should we code to the edge cases of someone trying to parse a 1GB
>> XML? Personally I'd favour the former.
>>
>> Going back to the original topic there are going to be situations where
>> people want the flat file parsers/writers & I think it's a valid point
>> to say this is where BioJava is meant to come in & help a developer.
>> Afterall XML is a computer science problem where as parsing an EMBL flat
>> file or blast output is a bioinformatics problem.
>>
>> Andy
>>
>>
>> Mark Schreiber wrote:
>>> For a long time now my feeling has been that we should *only* support
>>> the XML version of blast output.  The other formats are too brittle to
>>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc that
>>> may be an extreme view but the power of generic XML parsers and things
>>> like XPath etc really make these formats look very attractive.
>>>
>>> - Mark
>>>
>>>
>>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> I think Groovy have adopted a similar system recently & have guidelines
>>>> for how each module should behave (dependencies, build system etc). This
>>>> enforces the idea that a module whilst not part of the core project must
>>>> behave in the same manner the core does. I do like the idea that we can
>>>> have a core biojava & things get added around it & it might encourage
>>>> other users to start developing their own modules for any
>>>> formats/purpose they want.
>>>>
>>>> Richard Holland wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>>> What format options are there from blast? Just thinking if it supports
>>>>>> CIGAR or something like that are we better providing a parser for that
>>>>>> format & saying that we do not support the traditional blast output?
>>>>>> That said it doesn't help is when that format changes so maybe what is
>>>>>> needed is a way to push out parser changes without requiring a full
>>>>>> biojava release (v3 discussion) ...
>>>>> Exactly! So the modular idea would work nicely here - we could have a
>>>>> blast module and only update that single module (which would be its own
>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>>>> would no longer happen, except maybe for some kind of core BioJava
>>>>> module. Everything would be done in terms of individual module+JAR
>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>>>
>>>>> cheers,
>>>>> Richard


From dmitry.repchevski at bsc.es  Wed Nov 28 14:49:23 2007
From: dmitry.repchevski at bsc.es (Dmitry Repchevsky)
Date: Wed, 28 Nov 2007 15:49:23 +0100
Subject: [Biojava-l]  SAX, DOM, XPath and Flat files
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com
Message-ID: <474D7FF3.9010901@bsc.es>

Hello!

Actually there is also a StAX parser (http://en.wikipedia.org/wiki/StAX)
which is faster than SAX and allows writing.
In JDK 6, apart from StAX, there is JAXB; together they are a perfect
combination for parsing huge files.
You can go through the XML file using StAX until you reach the element
you are interested in and unmarshal it into a POJO using JAXB.
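
A minimal sketch of the StAX half of this idea is below. The class name
StaxPullExample and the elementTexts helper are invented for
illustration, and the sample input is a cut-down fragment using BLAST
XML element names. The JAXB unmarshalling step is left out so that the
sketch depends only on the StAX classes in javax.xml.stream;
getElementText() stands in for it by reading the element's text
directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxPullExample {

    /** Pulls events from the stream and collects the text of every matching element. */
    static List<String> elementTexts(String xml, String elementName) {
        List<String> texts = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not process DTDs; the sketch only needs element content.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // Reads up to the matching end tag; valid for text-only elements.
                    texts.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read input", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<Iteration_hits>"
            + "<Hit><Hit_id>gi|1</Hit_id></Hit>"
            + "<Hit><Hit_id>gi|2</Hit_id></Hit>"
            + "</Iteration_hits>";
        System.out.println(elementTexts(xml, "Hit_id"));
    }
}
```

Unlike SAX, the calling code asks for the next event itself, so it can
stop early or hand the reader to another component at any point.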

Cheers,

Dmitry


From ayates at ebi.ac.uk  Wed Nov 28 15:37:03 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 28 Nov 2007 15:37:03 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7FF3.9010901@bsc.es>
References: 93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com
	<474D7FF3.9010901@bsc.es>
Message-ID: <474D8B1F.8070301@ebi.ac.uk>

Hi Dmitry,

StAX still has higher memory consumption than SAX (though not as large
as DOM) but yes, it is quite a good parser system & since we're moving
towards the later versions of Java it may be a good idea to use it as
our standard parser ... if it supports XPath (can't remember off the top
of my head) :)

Andy

Dmitry Repchevsky wrote:
> Hello!
> 
> Actually there is also a StAX parser (http://en.wikipedia.org/wiki/StAX) 
> which is faster when  SAX and allows writing.
> In JDK 6 apart of StAX there is JAXB which is a perfect combination to 
> parse a huge files.
> You can go through the XML fie using StAX until the element you are 
> interested in and unmarshall it using JAXB to POJO object.
> 
> Cheers,
> 
> Dmitry


From markjschreiber at gmail.com  Fri Nov 30 02:28:58 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 30 Nov 2007 10:28:58 +0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474D7B3B.8030807@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
	<474D7B3B.8030807@ebi.ac.uk>
Message-ID: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>

Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
not XQuery although XPath is probably more important for this use.

The DOM model is a direct implementation of the W3C standard which
makes it a little awkward from a Java point of view but it is usable.
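
For comparison, a whole-document DOM plus XPath lookup using only the
standard javax.xml packages might look like the sketch below. The class
DomXPathExample and the select helper are illustrative names, not
existing BioJava API, and the sample input is a cut-down fragment using
BLAST XML element names. The code is written for a current JDK,
although the XML classes it calls date back to Java 5.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomXPathExample {

    /** Loads the whole document into a DOM and returns the text of each selected node. */
    static List<String> select(String xml, String expression) {
        List<String> values = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            // NodeList is not a java.util collection, hence the index loop.
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<Iteration_hits>"
            + "<Hit><Hit_id>gi|1</Hit_id></Hit>"
            + "<Hit><Hit_id>gi|2</Hit_id></Hit>"
            + "</Iteration_hits>";
        System.out.println(select(xml, "//Hit/Hit_id"));
    }
}
```

The index loop over NodeList is a small example of the awkwardness
mentioned above; the trade-off is that the entire document sits in
memory while the query runs.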

Java 6 has StAX (the other one).

There are a few Java APIs for parsing ASN.1, mostly developed for the
telco industry. I've never really looked into which is best (anyone
experienced with this?) but we could probably use one to work directly
off NCBI ASN.1.

- Mark

On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Mark,
>
> Okay that sounds like a perfectly sensible way to deal with this. Is
> this kind of parsing model supported in Java5? I only ask as I've not
> done a lot of XML parsing with Java5; more with things like XOM (which I
> think offers a DOM only representation but I'm probably wrong).
>
> That's good. There's not a huge point to have a format & a DTD/XSD and
> then have your files not conform to it.
>
> I was thinking the exact same thing about ASN.1 (well that & it looks
> bleeding horrible to parse but that is an un-educated look at the format
> which I'm sure is a parsable as JSON & the alike).
>
> When it comes to flat file parsers I would be happier to provide
> implementations of the more common formats where a viable alternative is
> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
> similar output to the above have a chance to write their own
> parsers/formatters. This is very similar to the current situation but we
> just need to remove dependencies on statically located data structures
> (don't get rid of them completely just give users an option to not use
> them).
>
> I'm not sure how much automatically generated parsers would help us. I
> guess it depends on the data model(s) we use if they are auto-parser
> friendly (which normally means POJO/JavaBean conventions including the
> no-args constructor).
>
> Cool I don't want to exclude flat file parsers completely (if only
> because my group has an interest in BioJava being able to read & write
> flat files) :)
>
> They decided to have HUPO-PSI Format instead :)
>
> Andy
>
>
> Mark Schreiber wrote:
> > Hi -
> >
> > I think in most cases huge XML files in bioinformatics result from a
> > single XML containing multiple repetitive elements. Eg a BLAST XML
> > output with several hits or a GenBankXML with many Sequences.  A nice
> > approach I have seen for dealing with these is to use SAX to read over
> > the file and every time it comes to an element it delegates to a DOM
> > object.  You then parse the bits of the DOM you want with XPath or
> > convert to objects or something and then when you are finished with
> > that entry everything gets garbage collected and the SAX parser moves
> > to the next element and repeats the whole process.  This is a hybrid
> > of event based parsing and object-model based parsing which could let
> > you efficiently deal with huge files.
> >
> > I think the BLAST XML has improved substantially, at least in terms of
> > validating against it's own DTD.  The DTD itself may not be the best
> > design but that is always a matter of taste and if you are using XPath
> > to get the relevant bits you don't need to make a SAX parser jump
> > through hoops to get them.
> >
> > I agree we will have to keep flat file parsers but we should strongly
> > encourage the use of XML where possible. It is simply easier to deal
> > with. Most biological flat-files were designed for Fortran and mainly
> > for human consumption. There is no obvious validation mechanism.
> > Notably everything in NCBI is derived from ASN.1, what you see in the
> > flatfile is produced from there. I tend to think this means that the
> > ASN.1 is the holy gospel and what you get in the flat file is some
> > translation.  Ideally NCBI files should be parsed from the ASN.1 where
> > you can guarantee validation, the more practical alternative is to use
> > the XML which you can at least validate against a DTD.
> >
> > With XML we (Biojava) can say if it validates we will parse it and if
> > it doesn't we may not.  With flat files there are so many dodgey
> > variants we cannot say anything.  Because XML dtds (or xsd's) have
> > versions it also makes it much easier to have parsers for different
> > versions and the parsing machinery can figure out which is needed.
> > With flat files it is anyones guess what version you are dealing with.
> >
> > Finally parsers can be auto-generated for XML if you have the DTD or
> > XSD. This often doesn't give you an ideal parser but it can be a
> > useful starting point for rapid development.
> >
> > For Biojava v 3 I think we should concentrate on XML parsers first and
> > flat files second. <sigh>if only Fasta had an XML format</sigh>
> >
> > - Mark
> >
> > On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> I was always under the impression that blast's XML output was nearly as
> >> hard to parse as the flat file format but I do agree that if we can use
> >> XML whenever we can it would make writing parsers a lot easier
> >> (especially if there are SAX based XPath libraries available). Actually
> >> this brings up a good question about development of this type of parser.
> >> The majority of XPath supporting libraries are DOM based which will mean
> >> large memory usage in some situations but overall providing an easier
> >> coding experience (and hopefully reduce our chances of creating bugs).
> >> Or should we code to the edge cases of someone trying to parse a 1GB
> >> XML? Personally I'd favour the former.
> >>
> >> Going back to the original topic there are going to be situations where
> >> people want the flat file parsers/writers & I think it's a valid point
> >> to say this is where BioJava is meant to come in & help a developer.
> >> After all, XML is a computer science problem whereas parsing an EMBL flat
> >> file or blast output is a bioinformatics problem.
> >>
> >> Andy
> >>
> >>
> >> Mark Schreiber wrote:
> >>> For a long time now my feeling has been that we should *only* support
> >>> the XML version of blast output.  The other formats are too brittle to
> >>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc that
> >>> may be an extreme view but the power of generic XML parsers and things
> >>> like XPath etc really make these formats look very attractive.
> >>>
> >>> - Mark
> >>>
> >>>
> >>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> I think Groovy have adopted a similar system recently & have guidelines
> >>>> for how each module should behave (dependencies, build system etc). This
> >>>> enforces the idea that a module whilst not part of the core project must
> >>>> behave in the same manner the core does. I do like the idea that we can
> >>>> have a core biojava & things get added around it & it might encourage
> >>>> other users to start developing their own modules for any
> >>>> formats/purpose they want.
> >>>>
> >>>> Richard Holland wrote:
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA1
> >>>>>
> >>>>>> What format options are there from blast? Just thinking if it supports
> >>>>>> CIGAR or something like that are we better providing a parser for that
> >>>>>> format & saying that we do not support the traditional blast output?
> >>>>>> That said it doesn't help us when that format changes so maybe what is
> >>>>>> needed is a way to push out parser changes without requiring a full
> >>>>>> biojava release (v3 discussion) ...
> >>>>> Exactly! So the modular idea would work nicely here - we could have a
> >>>>> blast module and only update that single module (which would be its own
> >>>>> JAR) whenever the format changes. In a way, BioJava releases as such
> >>>>> would no longer happen, except maybe for some kind of core BioJava
> >>>>> module. Everything would be done in terms of individual module+JAR
> >>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> >>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> >>>>>
> >>>>> cheers,
> >>>>> Richard
>


From heuermh at acm.org  Fri Nov 30 06:06:26 2007
From: heuermh at acm.org (Michael Heuer)
Date: Fri, 30 Nov 2007 01:06:26 -0500 (EST)
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0711300054110.26684-100000@shell3.shore.net>

Mark Schreiber wrote:

> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
>
> Java 6 has StAX (the other one).

Yeah, those jerks.  :)

A few weeks before "the other" StAX was announced at JavaOne, however
long ago that was, I wrote a note to the spec author asking them to
reconsider their project name.

Oh well.  We can still be the "original" StAX.

> http://stax.sf.net


May I kindly suggest skipping all of this talk about XML and have us
jump straight to OWL?  ;)

> http://dev.isb-sib.ch/projects/uniprot-rdf/

   michael


From ayates at ebi.ac.uk  Fri Nov 30 09:18:45 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:18:45 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <Pine.GSO.4.44.0711300054110.26684-100000@shell3.shore.net>
References: <Pine.GSO.4.44.0711300054110.26684-100000@shell3.shore.net>
Message-ID: <474FD575.3060307@ebi.ac.uk>


Michael Heuer wrote:
> Mark Schreiber wrote:
> 
>> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
>> not XQuery although XPath is probably more important for this use.
>>
>> The DOM model is a direct implementation of the W3C standard which
>> makes it a little awkward from a java point of view but it is usable.
>>
>> Java 6 has StAX (the other one).
> 
> Yeah, those jerks.  :)
> 
> I wrote a note to the spec author a few weeks before "the other" StAX was
> announced at a Java One however long ago asking them to reconsider their
> project name.
> 
> Oh well.  We can still be the "original" StAX.
> 
>> http://stax.sf.net

Yup, I remember that issue from BOSC 2005 ... oh well, not a lot can be
done now. Maybe a re-brand of our StAX to StAX Original, a bit like the
Coca-Cola & New Coke mess-up.

> 
> 
> May I kindly suggest skipping all of this talk about XML and have us
> jump straight to OWL?  ;)
> 
>> http://dev.isb-sib.ch/projects/uniprot-rdf/

Lol just let me fire up my semantic web engine first :).


From ayates at ebi.ac.uk  Fri Nov 30 09:26:15 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 30 Nov 2007 09:26:15 +0000
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>	
	<474D7B3B.8030807@ebi.ac.uk>
	<93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
Message-ID: <474FD737.9080801@ebi.ac.uk>

I think I've seen XPath used in other people's code in a 1.5 code base
(in fact, by one of the guys I work with). I've used Java's DOM before;
it really isn't very nice and is quite verbose. I'd prefer a better
alternative, or a wrapper around the XML parsers, just to cut down on
code chatter.
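
To show the sort of wrapper I mean, here is a minimal sketch using only
the JAXP classes that ship with Java 5 (javax.xml.parsers and
javax.xml.xpath). The class name, the helper names and the element names
in the sample string are made up for illustration; none of this is
BioJava API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    /** Parse an XML string into a DOM tree with the JDK's built-in parser. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Return the text content of every node matched by the XPath expression. */
    static List<String> texts(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, heavily simplified record layout.
        String xml = "<Hits><Hit><Hit_id>a1</Hit_id></Hit><Hit><Hit_id>b2</Hit_id></Hit></Hits>";
        System.out.println(texts(parse(xml), "/Hits/Hit/Hit_id")); // prints [a1, b2]
    }
}
```

The two helpers hide the factory boilerplate, so calling code is one
line per query. Note that this builds the whole DOM in memory, so it
only suits files of modest size.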

Wow, I've just visited http://asn1.elibel.tm.fr/links/ looking for these
Java tools & I think I've gone cross-eyed with the sheer number of
acronyms! You've gotta love something which seems to add a letter to ER
& that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
list know of a good ASN.1 parser for Java, and should we support it
(considering NCBI generate their DTD & XML from the ASN.1
representation)?

Andy

Mark Schreiber wrote:
> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
> 
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
> 
> Java 6 has StAX (the other one).
> 
> There are a few java API's for parsing ASN.1 mostly developed for the
> telco industry, I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
> 
> - Mark
> 
> On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5? I only ask as I've not
>> done a lot of XML parsing with Java5; more with things like XOM (which I
>> think offers a DOM only representation but I'm probably wrong).
>>
>> That's good. There's not a huge point to have a format & a DTD/XSD and
>> then have your files not conform to it.
>>
>> I was thinking the exact same thing about ASN.1 (well that & it looks
>> bleeding horrible to parse but that is an un-educated look at the format
>> which I'm sure is as parsable as JSON & the like).
>>
>> When it comes to flat file parsers I would be happier to provide
>> implementations of the more common formats where a viable alternative is
>> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
>> similar output to the above have a chance to write their own
>> parsers/formatters. This is very similar to the current situation but we
>> just need to remove dependencies on statically located data structures
>> (don't get rid of them completely just give users an option to not use
>> them).
>>
>> I'm not sure how much automatically generated parsers would help us. I
>> guess it depends on the data model(s) we use if they are auto-parser
>> friendly (which normally means POJO/JavaBean conventions including the
>> no-args constructor).
>>
>> Cool I don't want to exclude flat file parsers completely (if only
>> because my group has an interest in BioJava being able to read & write
>> flat files) :)
>>
>> They decided to have HUPO-PSI Format instead :)
>>
>> Andy
>>
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> I think in most cases huge XML files in bioinformatics result from a
>>> single XML containing multiple repetitive elements. Eg a BLAST XML
>>> output with several hits or a GenBankXML with many Sequences.  A nice
>>> approach I have seen for dealing with these is to use SAX to read over
>>> the file and every time it comes to an element it delegates to a DOM
>>> object.  You then parse the bits of the DOM you want with XPath or
>>> convert to objects or something and then when you are finished with
>>> that entry everything gets garbage collected and the SAX parser moves
>>> to the next element and repeats the whole process.  This is a hybrid
>>> of event based parsing and object-model based parsing which could let
>>> you efficiently deal with huge files.
>>>
>>> I think the BLAST XML has improved substantially, at least in terms of
>>> validating against its own DTD.  The DTD itself may not be the best
>>> design but that is always a matter of taste and if you are using XPath
>>> to get the relevant bits you don't need to make a SAX parser jump
>>> through hoops to get them.
>>>
>>> I agree we will have to keep flat file parsers but we should strongly
>>> encourage the use of XML where possible. It is simply easier to deal
>>> with. Most biological flat-files were designed for Fortran and mainly
>>> for human consumption. There is no obvious validation mechanism.
>>> Notably everything in NCBI is derived from ASN.1, what you see in the
>>> flatfile is produced from there. I tend to think this means that the
>>> ASN.1 is the holy gospel and what you get in the flat file is some
>>> translation.  Ideally NCBI files should be parsed from the ASN.1 where
>>> you can guarantee validation, the more practical alternative is to use
>>> the XML which you can at least validate against a DTD.
>>>
>>> With XML we (Biojava) can say if it validates we will parse it and if
>>> it doesn't we may not.  With flat files there are so many dodgy
>>> variants we cannot say anything.  Because XML dtds (or xsd's) have
>>> versions it also makes it much easier to have parsers for different
>>> versions and the parsing machinery can figure out which is needed.
>>> With flat files it is anyone's guess what version you are dealing with.
>>>
>>> Finally parsers can be auto-generated for XML if you have the DTD or
>>> XSD. This often doesn't give you an ideal parser but it can be a
>>> useful starting point for rapid development.
>>>
>>> For Biojava v 3 I think we should concentrate on XML parsers first and
>>> flat files second. <sigh>if only Fasta had an XML format</sigh>
>>>
>>> - Mark
>>>
>>> On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> I was always under the impression that blast's XML output was nearly as
>>>> hard to parse as the flat file format but I do agree that if we can use
>>>> XML whenever we can it would make writing parsers a lot easier
>>>> (especially if there are SAX based XPath libraries available). Actually
>>>> this brings up a good question about development of this type of parser.
>>>> The majority of XPath supporting libraries are DOM based which will mean
>>>> large memory usage in some situations but overall providing an easier
>>>> coding experience (and hopefully reduce our chances of creating bugs).
>>>> Or should we code to the edge cases of someone trying to parse a 1GB
>>>> XML? Personally I'd favour the former.
>>>>
>>>> Going back to the original topic there are going to be situations where
>>>> people want the flat file parsers/writers & I think it's a valid point
>>>> to say this is where BioJava is meant to come in & help a developer.
>>>> After all, XML is a computer science problem whereas parsing an EMBL flat
>>>> file or blast output is a bioinformatics problem.
>>>>
>>>> Andy
>>>>
>>>>
>>>> Mark Schreiber wrote:
>>>>> For a long time now my feeling has been that we should *only* support
>>>>> the XML version of blast output.  The other formats are too brittle to
>>>>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc that
>>>>> may be an extreme view but the power of generic XML parsers and things
>>>>> like XPath etc really make these formats look very attractive.
>>>>>
>>>>> - Mark
>>>>>
>>>>>
>>>>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>> I think Groovy have adopted a similar system recently & have guidelines
>>>>>> for how each module should behave (dependencies, build system etc). This
>>>>>> enforces the idea that a module whilst not part of the core project must
>>>>>> behave in the same manner the core does. I do like the idea that we can
>>>>>> have a core biojava & things get added around it & it might encourage
>>>>>> other users to start developing their own modules for any
>>>>>> formats/purpose they want.
>>>>>>
>>>>>> Richard Holland wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA1
>>>>>>>
>>>>>>>> What format options are there from blast? Just thinking if it supports
>>>>>>>> CIGAR or something like that are we better providing a parser for that
>>>>>>>> format & saying that we do not support the traditional blast output?
>>>>>>>> That said it doesn't help us when that format changes so maybe what is
>>>>>>>> needed is a way to push out parser changes without requiring a full
>>>>>>>> biojava release (v3 discussion) ...
>>>>>>> Exactly! So the modular idea would work nicely here - we could have a
>>>>>>> blast module and only update that single module (which would be its own
>>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>>>>>> would no longer happen, except maybe for some kind of core BioJava
>>>>>>> module. Everything would be done in terms of individual module+JAR
>>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard


From phidias51 at gmail.com  Fri Nov 30 18:30:50 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 30 Nov 2007 10:30:50 -0800
Subject: [Biojava-l] SAX, DOM, XPath and Flat files
In-Reply-To: <474FD737.9080801@ebi.ac.uk>
References: <93b45ca50711271934k377759adk11d65ab889964497@mail.gmail.com>
	<474D7B3B.8030807@ebi.ac.uk>
	<93b45ca50711291828u57b1cb5dj31cc7ef7f87eb701@mail.gmail.com>
	<474FD737.9080801@ebi.ac.uk>
Message-ID: <6e1d61f50711301030s60eee3cduf99109d0fa079a2e@mail.gmail.com>

There's a potential gotcha involved with XPath parsing.  If you use the
current implementation that ships with the Java 5 & 6 JDKs, it performs a
DOM parse on the whole document, even if you pass it a specific starting
node in the document.  I stumbled across this one the hard way when using
the hybrid approach that you mention.  This may be solved with another XPath
implementation such as Saxon.
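
One way to make sure the XPath engine never sees more than one record is
to have the SAX handler copy each record into its own small, standalone
Document and evaluate the expression against that. What follows is only
a sketch under assumptions: the element names (Hit, Hit_id) and the
class and method names are hypothetical, and I have not measured how it
compares with handing XPath a node from a larger tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HybridSketch {

    /** SAX handler that rebuilds each record element as its own small DOM Document. */
    static class RecordHandler extends DefaultHandler {
        final List<String> results = new ArrayList<String>();
        private final String recordName;
        private final String expression;
        private final DocumentBuilder builder;
        private final XPath xpath = XPathFactory.newInstance().newXPath();
        private Document doc;  // non-null only while inside a record
        private Node current;  // node that receives the next child

        RecordHandler(String recordName, String expression) throws Exception {
            this.recordName = recordName;
            this.expression = expression;
            this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (doc == null) {
                if (!qName.equals(recordName)) return; // outside any record: ignore
                doc = builder.newDocument();           // fresh document per record
                current = doc;
            }
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(e);
            current = e;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (doc == null) return;
            current = current.getParentNode();
            if (current == doc) { // the record element itself has just closed
                try {
                    results.add(xpath.evaluate(expression, doc));
                } catch (XPathExpressionException ex) {
                    throw new SAXException(ex);
                }
                doc = null; // drop the small tree so it can be garbage collected
                current = null;
            }
        }
    }

    /** Stream over the XML and evaluate the expression once per record. */
    static List<String> extract(String xml, String recordName, String expression) throws Exception {
        RecordHandler handler = new RecordHandler(recordName, expression);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Hits><Hit><Hit_id>a1</Hit_id></Hit><Hit><Hit_id>b2</Hit_id></Hit></Hits>";
        System.out.println(extract(xml, "Hit", "/Hit/Hit_id")); // prints [a1, b2]
    }
}
```

Because each record becomes the root of its own document, the expression
is written relative to the record (/Hit/...) rather than to the root of
the original file.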

One other problem I've noticed is that the NCBI XML doesn't always parse.
I've reported this to them, and they've promised to address it. It usually
occurs when submitters put unescaped characters into text fields such as
author lists in PubMed. NCBI doesn't always use CDATA blocks around text,
and as soon as the parser hits one of these characters it throws an
exception.
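
The failure is easy to reproduce with the JDK's own parser. A small
sketch, with a made-up Author element standing in for the real fields
and a hypothetical helper name:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class EscapeSketch {

    /** True if the JDK parser accepts the string as well-formed XML. */
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // A bare ampersand in text content is not well-formed XML.
        System.out.println(isWellFormed("<Author>Smith & Jones</Author>"));                 // false
        // Escaping it, or wrapping the text in CDATA, makes it parse.
        System.out.println(isWellFormed("<Author>Smith &amp; Jones</Author>"));             // true
        System.out.println(isWellFormed("<Author><![CDATA[Smith & Jones]]></Author>"));     // true
    }
}
```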

I've also noticed a tendency (in other code bases) for developers to use
several different parsers; usually, whatever parser they're most familiar
with.  The problem with this is that they often introduce parser-specific
code into the code base, so you end up with numerous dependencies for
different parsers, and a potential configuration problem if you're passing
the XML parser as a run-time configuration parameter.  The most frequent
external parsers I've seen used are JDOM and DOM4J.  The usual way to get
around this is to write to an interface, but that will require some
additional vigilance.
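
By "write to an interface" I mean something along these lines. This is
only a sketch: XmlQuery, JaxpXmlQuery and the Entry/Accession element
names are all hypothetical, and the default implementation uses just the
JAXP classes from the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParserFacadeSketch {

    /** Hypothetical facade: the only XML-related type the rest of the code sees. */
    interface XmlQuery {
        String value(String xml, String xpathExpression) throws Exception;
    }

    /** Default implementation on top of the JDK's JAXP classes. */
    static class JaxpXmlQuery implements XmlQuery {
        public String value(String xml, String xpathExpression) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(xpathExpression, doc);
        }
    }

    /** Calling code depends on XmlQuery only, never on JDOM, DOM4J or JAXP types. */
    static String firstAccession(XmlQuery query, String xml) throws Exception {
        return query.value(xml, "/Entry/Accession");
    }

    public static void main(String[] args) throws Exception {
        XmlQuery query = new JaxpXmlQuery();
        System.out.println(firstAccession(query, "<Entry><Accession>P12345</Accession></Entry>"));
    }
}
```

A JDOM- or DOM4J-backed class could implement the same interface, so the
choice of parser stays in one place. The vigilance is in making sure no
parser-specific types leak out through the interface.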

Just a few things to watch out for as we move forward.

Mark (the other one) :-)

On Nov 30, 2007 1:26 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> I think I've seen XPath hanging around in other people's code in a 1.5
> code-base (in fact one of the guys I work with). I've used Java's DOM
> before & it really isn't very nice & quite verbose. I'd prefer if there
> was a better alternative/wrapper around the XML parsers just to cut down
> on code chatter.
>
> Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these
> Java tools & I think I've gone cross-eyed with the sheer number of
> acronyms! You've gotta love something which seems to add a letter to ER
> & that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
> list know of a ASN.1 parser for Java that's good and should we support
> it (considering NCBI generate their DTD & XML from the ASN.1
> representation).
>
> Andy


From abhi232 at cc.gatech.edu  Sat Nov 24 16:16:17 2007
From: abhi232 at cc.gatech.edu (Abhinav Ram Karhu)
Date: Sat, 24 Nov 2007 16:16:17 -0000
Subject: [Biojava-l] Applet not able to find DNATools class.
Message-ID: <893100947.48481195919828028.JavaMail.root@pinky.cc.gatech.edu>

Hello all,
I am getting an error while loading my applet.

This is the stack trace:

java.lang.NoClassDefFoundError: Could not initialize class org.biojava.bio.seq.DNATools
	at org.biojava.bio.program.abi.ABITrace.getSequence(ABITrace.java:161)
	at Trace.init(Trace.java:161)
	at sun.applet.AppletPanel.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
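
As far as I understand, the wording "Could not initialize class" (as
opposed to a plain class-not-found message) means the JVM did locate
DNATools but its static initializer had already failed on an earlier
attempt, so there may be an earlier exception that I am not seeing. A
minimal sketch that reproduces the two stages, using a hypothetical
Broken class that has nothing to do with the BioJava code:

```java
public class InitFailureSketch {

    /** Hypothetical stand-in for a class whose static setup fails. */
    static class Broken {
        static final Object VALUE = load();

        private static Object load() {
            throw new IllegalStateException("simulated failure during static initialization");
        }

        static void touch() { }
    }

    /** Use Broken and report which error the JVM raises. */
    static String describe() {
        try {
            Broken.touch();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // First use: the static initializer runs and fails (ExceptionInInitializerError).
        System.out.println(describe());
        // Any later use: the class is marked unusable (NoClassDefFoundError).
        System.out.println(describe());
    }
}
```

If that is what is happening here, the first error would carry the real
cause, and it may only show up earlier in the Java console output.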

My class files, the PHP page and the BioJava jar files are all together in one folder.

I also import org.biojava.bio.seq.DNATools in the Java file Trace.java.

The applet code in my PHP page looks like this:

<applet code="Trace.class"  archive="biojava-1.5.jar , bytecode.jar" height=800 width=800>

Please let me know if I am missing something.

Thanks in advance.

Abhinav