From hamish.mcwilliam at bioinfo-user.org.uk  Mon Nov  8 16:42:33 2010
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Mon, 8 Nov 2010 21:42:33 +0000
Subject: [BioRuby] Bio::Fetch and EBI dbfetch
Message-ID: <AANLkTi=hAv3iGoz+u3wHX3WgEq7vyhCSwRoJKHUm2-1O@mail.gmail.com>

Hi folks,

The recent update to the EBI's dbfetch service
<http://www.ebi.ac.uk/Tools/webservices/about/news#st_october_2010>
adds support for the BioRuby biofetch meta-information methods and
means that the Bio::Fetch documentation
<http://bioruby.org/rdoc/classes/Bio/Fetch.html> needs to be updated.
The databases(), formats() and maxids() methods are now supported by
the EBI service as well as biofetch. For full details of the extended
syntax supported by dbfetch see
http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.

Another thing I notice is that Bio::Fetch does not set a user-agent,
so no trace of BioRuby appears in the service logs. I'm not sure what
the most appropriate user-agent would be. Something like
"BioRuby/1.1.0 Ruby/1.8.6" would probably do, but maybe a
module-specific user-agent in the style used by BioPerl would be better
(e.g. "Bio::DB::RefSeq/0.8" or "bioperl-Bio_DB_RefSeq/1.4")?

All the best,

Hamish
-- 
----
"Saying the internet has changed dramatically over the last five years
is cliché - the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.
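
On the user-agent point above, a minimal sketch of how a client could
identify itself using only Ruby's standard Net::HTTP. This is not how
Bio::Fetch is implemented; the helper names (default_user_agent,
build_fetch_request) are hypothetical, and the agent string simply
follows the "BioRuby/x.y.z Ruby/a.b.c" pattern suggested in the message.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds an agent string in the
# "BioRuby/<version> Ruby/<version>" style suggested above.
def default_user_agent(bioruby_version)
  "BioRuby/#{bioruby_version} Ruby/#{RUBY_VERSION}"
end

# Hypothetical helper: builds (but does not send) a GET request that
# carries the given User-Agent header, so it shows up in service logs.
def build_fetch_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  request
end
```

The request would then be sent with Net::HTTP.start(host, port) and
http.request(request). Whether a project-wide or a module-specific
string is preferable is the open question raised above.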


From philipp.comans at googlemail.com  Sun Nov 21 06:16:56 2010
From: philipp.comans at googlemail.com (Philipp Comans)
Date: Sun, 21 Nov 2010 12:16:56 +0100
Subject: [BioRuby] Performance of Bio::Blast.reports
Message-ID: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>

Hi everyone,

I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB. 
Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.

Right now, I am parsing the reports in XML format using the following command:

blast_reports = Bio::Blast.reports(file, :rexml)

Is there any performance advantage when using REXML instead of the default XML parser?

In your opinion, is it possible to parse such a large report in XML format?
An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line-by-line. As far as I know however, the tabular output does not contain all the information in XML output so I would have to take additional steps to recover that information.

Thanks for your help!

Best regards,

Philipp

From pjotr.public14 at thebird.nl  Sun Nov 21 06:36:53 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 21 Nov 2010 12:36:53 +0100
Subject: [BioRuby] Performance of Bio::Blast.reports
In-Reply-To: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>
References: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>
Message-ID: <20101121113653.GA23100@thebird.nl>

Unfortunately BioRuby still loads it in RAM. Someone should do a
libxml2 version.

Easiest is to split the XML beforehand; that is what I do.
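
A rough sketch of what such a split could look like, reading the file
line by line so that only one query's results are held in memory at a
time. It assumes each <Iteration> and </Iteration> tag sits on its own
line and that everything before the first <Iteration> can be reused as
a header; the method name each_iteration_document is made up for
illustration, and this is not necessarily the script referred to above.

```ruby
# Yields one small, self-contained XML document per <Iteration> element.
# Assumption: <Iteration> and </Iteration> each appear on their own line.
def each_iteration_document(io)
  header = String.new          # everything before the first <Iteration>
  chunk = nil                  # lines of the iteration being collected
  seen_iteration = false
  footer = "</BlastOutput_iterations>\n</BlastOutput>\n"

  io.each_line do |line|
    if line.include?("<Iteration>")
      seen_iteration = true
      chunk = String.new
    end

    if chunk
      chunk << line
      if line.include?("</Iteration>")
        yield header + chunk + footer
        chunk = nil
      end
    elsif !seen_iteration
      header << line
    end
  end
end
```

Each yielded string is small enough to write to its own file or to hand
to an in-memory parser.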

Pj.

On Sun, Nov 21, 2010 at 12:16:56PM +0100, Philipp Comans wrote:
> Hi everyone,
> 
> I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB. 
> Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
> I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.
> 
> Right now, I am parsing the reports in XML format using the following command:
> 
> blast_reports = Bio::Blast.reports(file, :rexml)
> 
> Is there any performance advantage when using REXML instead of the default XML parser?
> 
> In your opinion, is it possible to parse such a large report in XML format?
> An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line-by-line. As far as I know however, the tabular output does not contain all the information in XML output so I would have to take additional steps to recover that information.
> 
> Thanks for your help!
> 
> Best regards,
> 
> Philipp
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From cjfields at illinois.edu  Sun Nov 21 23:41:21 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Sun, 21 Nov 2010 22:41:21 -0600
Subject: [BioRuby] Performance of Bio::Blast.reports
In-Reply-To: <20101121113653.GA23100@thebird.nl>
References: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>
	<20101121113653.GA23100@thebird.nl>
Message-ID: <A7B4B89B-6532-4B1A-9468-A9536394DACA@illinois.edu>

One could, in fact, use the libxml2 pull parser to grab what you need, which doesn't require pulling the entire XML document into memory.
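
To illustrate the pull style, here is a small sketch that uses REXML's
pull parser from the Ruby standard library in place of libxml2, purely
so the example has no extra dependencies. The element name Hit_def and
the method name each_hit_definition are chosen for illustration, and
the memory behaviour on files of several hundred MB has not been
measured here.

```ruby
require 'rexml/parsers/pullparser'

# Yields the text of every <Hit_def> element, pulling one parser event
# at a time instead of building a document tree.
def each_hit_definition(io)
  parser = REXML::Parsers::PullParser.new(io)
  current = nil                      # text collected for the open <Hit_def>

  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == "Hit_def"
      current = String.new
    elsif event.end_element? && event[0] == "Hit_def"
      yield current
      current = nil
    elsif event.text? && current
      current << event[0]            # raw text, entities not decoded
    end
  end
end
```

The caller would pass an open File and handle each definition in the
block, keeping only the values it actually needs.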

chris

On Nov 21, 2010, at 5:36 AM, Pjotr Prins wrote:

> Unfortunately BioRuby still loads it in RAM. Someone should do a
> libxml2 version.
> 
> Easiest is to split the XML beforehand; that is what I do.
> 
> Pj.
> 
> On Sun, Nov 21, 2010 at 12:16:56PM +0100, Philipp Comans wrote:
>> Hi everyone,
>> 
>> I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB. 
>> Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
>> I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.
>> 
>> Right now, I am parsing the reports in XML format using the following command:
>> 
>> blast_reports = Bio::Blast.reports(file, :rexml)
>> 
>> Is there any performance advantage when using REXML instead of the default XML parser?
>> 
>> In your opinion, is it possible to parse such a large report in XML format?
>> An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line-by-line. As far as I know however, the tabular output does not contain all the information in XML output so I would have to take additional steps to recover that information.
>> 
>> Thanks for your help!
>> 
>> Best regards,
>> 
>> Philipp


From anurag08priyam at gmail.com  Mon Nov 22 00:07:08 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 22 Nov 2010 10:37:08 +0530
Subject: [BioRuby] Performance of Bio::Blast.reports
In-Reply-To: <A7B4B89B-6532-4B1A-9468-A9536394DACA@illinois.edu>
References: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>
	<20101121113653.GA23100@thebird.nl>
	<A7B4B89B-6532-4B1A-9468-A9536394DACA@illinois.edu>
Message-ID: <AANLkTim0F=bQiSWXxxyuF9QEUHhtkn+qrMUK2qdGX6J4@mail.gmail.com>

> One could, in fact, use the libxml2 pull parser to grab what you need, which doesn't require pulling the entire XML document into memory.

libxml2-ruby is quite buggy (segfaults, namespaces do not work
correctly on attributes, inconsistent interface). AFAIK, it's not
ready for Ruby 1.9.2. I would suggest Nokogiri (Nokogiri::XML::Reader,
a pull parser[1]), which is almost as fast and has a better interface.

[1] http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Reader.html

-- 
Anurag Priyam,
3rd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642
