From hamish.mcwilliam at bioinfo-user.org.uk Mon Nov 8 16:42:33 2010
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Mon, 8 Nov 2010 21:42:33 +0000
Subject: [BioRuby] Bio::Fetch and EBI dbfetch
Message-ID:

Hi folks,

The recent update to the EBI's dbfetch service adds support for the BioRuby biofetch meta-information methods and means that the Bio::Fetch documentation needs to be updated. The databases(), formats() and maxids() methods are now supported by the EBI service as well as biofetch. For full details of the extended syntax supported by dbfetch see http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.

Another thing I notice is that Bio::Fetch does not set a user-agent, so no trace of BioRuby appears in the service logs. I'm not sure what the most appropriate user-agent would be; something like "BioRuby/1.1.0 Ruby/1.8.6" would probably do, but maybe a module-specific user-agent in the style used by BioPerl would be better (e.g. "Bio::DB::RefSeq/0.8" or "bioperl-Bio_DB_RefSeq/1.4")?

All the best,

Hamish

--
----
"Saying the internet has changed dramatically over the last five years is cliché - the internet is always changing dramatically" - Craig Labovitz, Arbor Networks.

From philipp.comans at googlemail.com Sun Nov 21 06:16:56 2010
From: philipp.comans at googlemail.com (Philipp Comans)
Date: Sun, 21 Nov 2010 12:16:56 +0100
Subject: [BioRuby] Performance of Bio::Blast.reports
Message-ID: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>

Hi everyone,

I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB.
Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM.
Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.

Right now, I am parsing the reports in XML format using the following command:

blast_reports = Bio::Blast.reports(file, :rexml)

Is there any performance advantage when using REXML instead of the default XML parser?

In your opinion, is it possible to parse such a large report in XML format?
An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line-by-line. As far as I know however, the tabular output does not contain all the information in XML output so I would have to take additional steps to recover that information.

Thanks for your help!

Best regards,

Philipp

From pjotr.public14 at thebird.nl Sun Nov 21 06:36:53 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 21 Nov 2010 12:36:53 +0100
Subject: [BioRuby] Performance of Bio::Blast.reports
In-Reply-To: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>
References: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com>
Message-ID: <20101121113653.GA23100@thebird.nl>

Unfortunately BioRuby still loads it in RAM. Someone should do a
libxml2 version.

Easiest is to split the XML beforehand; that is what I do.

Pj.

On Sun, Nov 21, 2010 at 12:16:56PM +0100, Philipp Comans wrote:
> Hi everyone,
>
> I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB.
> Is the XML parser for Blast reports able to parse files that big? I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
> I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach?
> The latter would be very advantageous in my case as far as I understand.
>
> Right now, I am parsing the reports in XML format using the following command:
>
> blast_reports = Bio::Blast.reports(file, :rexml)
>
> Is there any performance advantage when using REXML instead of the default XML parser?
>
> In your opinion, is it possible to parse such a large report in XML format?
> An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line-by-line. As far as I know however, the tabular output does not contain all the information in XML output so I would have to take additional steps to recover that information.
>
> Thanks for your help!
>
> Best regards,
>
> Philipp
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From cjfields at illinois.edu Sun Nov 21 23:41:21 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Sun, 21 Nov 2010 22:41:21 -0600
Subject: [BioRuby] Performance of Bio::Blast.reports
In-Reply-To: <20101121113653.GA23100@thebird.nl>
References: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com> <20101121113653.GA23100@thebird.nl>
Message-ID:

One could, in fact, use the libxml2 pull parser to grab what you need, which doesn't require pulling the entire XML document into memory.

chris

On Nov 21, 2010, at 5:36 AM, Pjotr Prins wrote:

> Unfortunately BioRuby still loads it in RAM. Someone should do a
> libxml2 version.
>
> Easiest is to split the XML beforehand; that is what I do.
>
> Pj.
>
> On Sun, Nov 21, 2010 at 12:16:56PM +0100, Philipp Comans wrote:
>> Hi everyone,
>>
>> I would like to parse large Blast reports with BioRuby. Each report will be around 150 - 400 MB.
>> Is the XML parser for Blast reports able to parse files that big?
>> I tested it with smaller reports of about 5 MB size and found that the performance was very poor when using MRI 1.8.7 but quite good when using JRuby.
>> I don't think that the system I am working on has enough RAM to keep the whole report in memory as the 5 MB input file already requires about 600 MB of RAM. Does the Blast report parser in BioRuby parse the whole file at once or does it use a streaming approach? The latter would be very advantageous in my case as far as I understand.
>>
>> Right now, I am parsing the reports in XML format using the following command:
>>
>> blast_reports = Bio::Blast.reports(file, :rexml)
>>
>> Is there any performance advantage when using REXML instead of the default XML parser?
>>
>> In your opinion, is it possible to parse such a large report in XML format?
>> An alternative for me would be to create Blast reports in tabular format because I am sure that these can be read line-by-line. As far as I know however, the tabular output does not contain all the information in XML output so I would have to take additional steps to recover that information.
>>
>> Thanks for your help!
>>
>> Best regards,
>>
>> Philipp

From anurag08priyam at gmail.com Mon Nov 22 00:07:08 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 22 Nov 2010 10:37:08 +0530
Subject: [BioRuby] Performance of Bio::Blast.reports
In-Reply-To:
References: <8651F652-4F60-4004-A93A-9632DB0A15CF@googlemail.com> <20101121113653.GA23100@thebird.nl>
Message-ID:

> One could, in fact, use the libxml2 pull parser to grab what you need, which doesn't require pulling the entire XML document into memory.

libxml2-ruby is quite buggy (segfaults, namespace does not work correctly on attributes, inconsistent interface). AFAIK, it's not Ruby 1.9.2 ready. I would suggest Nokogiri (Nokogiri::XML::Reader - pull parser [1]), which is almost as fast and has a better interface.

[1] http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Reader.html

--
Anurag Priyam,
3rd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642
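
The streaming approach asked about in this thread can be sketched roughly as follows. This is only a sketch under stated assumptions, not BioRuby code: it uses REXML's stream parser (REXML::Document.parse_stream with a REXML::StreamListener) from the Ruby distribution rather than the libxml2 or Nokogiri::XML::Reader pull parsers suggested above, so that it runs without extra gems. The names BlastHitListener and each_blast_hit and the sample XML are made up for illustration; the element names (Hit, Hit_id, Hit_def, Hit_len) are the ones used in NCBI BLAST XML output.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener (not part of BioRuby): builds one small Hash per <Hit>
# and hands it to a block as soon as the hit's closing tag is seen.
class BlastHitListener
  include REXML::StreamListener

  # Child elements of <Hit> whose text we want to keep.
  WANTED = %w[Hit_id Hit_def Hit_len].freeze

  def initialize(&on_hit)
    @on_hit = on_hit
    @hit = nil    # Hash for the hit currently being read, or nil
    @field = nil  # name of the wanted element we are inside, or nil
  end

  def tag_start(name, _attrs)
    if name == 'Hit'
      @hit = {}
    elsif @hit && WANTED.include?(name)
      @field = name
      @hit[name] = String.new
    end
  end

  # Text may arrive in several pieces, so append rather than assign.
  def text(data)
    @hit[@field] << data if @field
  end

  def tag_end(name)
    if name == 'Hit'
      @on_hit.call(@hit)
      @hit = nil
    elsif name == @field
      @field = nil
    end
  end
end

# Hypothetical helper: yields each hit of a BLAST XML report read from +io+.
def each_blast_hit(io, &block)
  REXML::Document.parse_stream(io, BlastHitListener.new(&block))
end

# Made-up miniature report, only to exercise the listener.
sample = <<~XML
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration>
        <Iteration_hits>
          <Hit><Hit_id>seq1</Hit_id><Hit_def>first hit</Hit_def><Hit_len>100</Hit_len></Hit>
          <Hit><Hit_id>seq2</Hit_id><Hit_def>second hit</Hit_def><Hit_len>200</Hit_len></Hit>
        </Iteration_hits>
      </Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML

hits = []
each_blast_hit(StringIO.new(sample)) { |hit| hits << hit }
hits.each { |hit| puts hit.values_at('Hit_id', 'Hit_len', 'Hit_def').join("\t") }
```

For a real report one would pass an open File instead of the StringIO and process each hit inside the block rather than collecting them in an array; the listener itself then holds only one hit at a time. Whether this is fast enough for 150 - 400 MB reports on MRI 1.8.7 has not been measured here.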