[Bioperl-l] How to download EST files via bioperl script?

Alberto Davila davila at ioc.fiocruz.br
Tue Jul 10 13:58:29 UTC 2007


Hi Xing,

Unfortunately that did not work for me... there are 5133 T. brucei ESTs 
(http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5691[Organism:exp]&cmd=Search&db=nucest&QueryKey=8) 
and 13971 from T. cruzi 
(http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5693[Organism:exp]&cmd=Search&db=nucest&QueryKey=11) 
that I cannot download at once in GenBank format... even when I select 
"GenBank" format in the Display menu, I can only view and download 500 
ESTs at a time...
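
For what it's worth, the number of 500-record pages this implies is easy to work out. Below is a minimal sketch in core Perl only; batch_windows is a hypothetical helper name of my own, not part of BioPerl or Entrez, and the totals are simply the two counts quoted above.

```perl
use strict;
use warnings;

# Return [first, last] record numbers (1-based) for each page of up to
# $size records. Hypothetical helper, for illustration only.
sub batch_windows {
    my ($total, $size) = @_;
    my @windows;
    for (my $start = 0; $start < $total; $start += $size) {
        my $end = $start + $size;
        $end = $total if $end > $total;    # clamp the final, shorter page
        push @windows, [ $start + 1, $end ];
    }
    return @windows;
}

for my $total (5133, 13971) {
    my @w = batch_windows($total, 500);
    printf "%d records: %d pages, last page %d-%d\n",
        $total, scalar @w, @{ $w[-1] };
}
```

This prints "5133 records: 11 pages, last page 5001-5133" and "13971 records: 28 pages, last page 13501-13971", i.e. 11 and 28 manual downloads respectively.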

I also downloaded all the ESTs from GenBank (a pity there are no subsets 
of them!), but merging them all generates a file bigger than 120 GB to 
process...
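
One way to avoid building the merged file at all might be to filter each downloaded file as a stream and keep only the records of interest. The following is a minimal sketch in core Perl, assuming GenBank flat-file input in which every record starts with a LOCUS line, carries an ORGANISM line, and ends with a line containing only "//". The name filter_by_organism and the FAKE records are made up for illustration; nothing here comes from BioPerl.

```perl
use strict;
use warnings;

# Copy the records whose ORGANISM line starts with $organism from $in to
# $out, holding only one record in memory at a time. Returns the number
# of records kept. Hypothetical helper, for illustration only.
sub filter_by_organism {
    my ($in, $out, $organism) = @_;
    # Assumption: "//" followed by a newline occurs only as the record
    # terminator, never inside a record.
    local $/ = "//\n";
    my $kept = 0;
    while (my $record = <$in>) {
        # Drop anything before the LOCUS line (e.g. a file header
        # preceding the first record).
        $record =~ s/\A.*?(?=^LOCUS\b)//ms;
        # Prefix match, so "Trypanosoma brucei" also keeps subspecies.
        next unless $record =~ /^\s+ORGANISM\s+\Q$organism\E\b/m;
        print {$out} $record;
        $kept++;
    }
    return $kept;
}

# Tiny in-memory demonstration with made-up records (not real entries).
my $sample = join '',
    "LOCUS       FAKE0001\n  ORGANISM  Trypanosoma brucei\nORIGIN\n        1 acgt\n//\n",
    "LOCUS       FAKE0002\n  ORGANISM  Homo sapiens\nORIGIN\n        1 ttga\n//\n",
    "LOCUS       FAKE0003\n  ORGANISM  Trypanosoma cruzi\nORIGIN\n        1 ggcc\n//\n";

open my $in, '<', \$sample or die "cannot read sample: $!";
my $filtered = '';
open my $out, '>', \$filtered or die "cannot open output buffer: $!";
my $kept = filter_by_organism($in, $out, 'Trypanosoma brucei');
close $out;
print "kept $kept record(s)\n";
```

With real data the two in-memory handles would be replaced by a handle on each downloaded file and one output file opened for appending, so the 120 GB merge is never needed.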

I just asked Diogo (my student) to try the script sent by Chris 
Fields, so fingers crossed ;-)

Cheers, Alberto


Xing Hu wrote:
> Thank you, guys.
> 
> I have to confess how stupid I was. The easiest way seems to be the 
> NCBI Taxonomy Browser, as suggested by Alex. As a matter of fact, I 
> knew about it, but I thought it was necessary to have all items 
> selected before pressing save to launch the download. So I was 
> desperate to find a button that could do that without hundreds of 
> thousands of clicks from me. "What about selecting none of those 
> items at all?" -- This idea finally came to me after days of 
> struggling, and the problem was solved.
> 
> Xing
> 
> 
> 
> Chris Fields wrote:
>> Caveat: if you have millions of ESTs please consider NOT using my 
>> eutil script below or NCBI Batch Entrez, which would repeatedly hit 
>> the NCBI server thousands of times.  At least try looking for other 
>> ways to retrieve the data you want (ftp, organism-specific resources 
>> like Ensembl, and so on), or run any scripts or data retrieval in off 
>> hours so you don't overtax the NCBI server.
>>
>> There is a way you can use BioPerl if you don't mind living on the 
>> bleeding edge by using bioperl-live (core code from CVS).  I have been 
>> working on a set of modules for the last year (Bio::DB::EUtilities) 
>> that interact with all the various eutils, for building data pipelines 
>> that use the NCBI CGI interface.  You could possibly retrieve all 
>> relevant ESTs using a variation of the example script here:
>>
>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
>>
>> Note that the code examples do NOT work with rel. 1.5.2 code as the 
>> API has changed quite a bit; I'm working to rectify some of that.
>>
>> The script I would use is below.  It retrieves batches of 500 
>> sequences (in fasta format) at a time, for a total of 10000 max seq 
>> records, saving the raw record data directly to a file (appending as 
>> you go along).  I added an eval block to check the server status and 
>> redo the call up to 4 times before giving up completely.  Using eval 
>> this way hasn't been extensively tested but should work.
>>
>> ---------------------------------------
>>
>> use strict;
>> use warnings;
>> use Bio::DB::EUtilities;
>>
>> # esearch: find all EST records for the taxon and keep the result set
>> # on the NCBI history server
>> my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
>>                                        -db => 'nucest',
>>                                        -term => 'txid3702',
>>                                        -usehistory => 'y',
>>                                        -keep_histories => 1);
>>
>> my $count = $factory->get_count;
>>
>> print "Count: $count\n";
>>
>> if (my $hist = $factory->next_History) {
>>     print "History returned\n";
>>     # note db carries over from above
>>     $factory->set_parameters(-eutil => 'efetch',
>>                              -rettype => 'fasta',
>>                              -history => $hist);
>>     my ($retmax, $retstart) = (500,0);
>>     my $retry = 1;   # counts failed calls across the whole run
>>     # set max # seq records to return
>>     my $maxcount = $count < 10000 ? $count : 10000;
>>     RETRIEVE_SEQS:
>>     while ($retstart < $maxcount) {
>>         print "Returning from ",$retstart+1," to ",$retstart+$retmax,"\n";
>>         $factory->set_parameters(-retmax => $retmax,
>>                                  -retstart => $retstart);
>>         # check in case of server error
>>         eval{
>>             $factory->get_Response(-file => ">>ESTs.fas");
>>         };
>>         if ($@) {
>>             die "Server error: $@.  Try again later" if $retry == 5;
>>             print STDERR "Server error, redo #$retry\n";
>>             # repeat this batch without advancing $retstart
>>             $retry++ && redo RETRIEVE_SEQS;
>>         }
>>         $retstart += $retmax;
>>     }
>> }
>>
>>
>> ---------------------------------------
>>
>>
>> chris
>>
>> On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
>>
>>> To download genomic sequences or ESTs for any organism (in various
>>> formats) you can use NCBI Taxonomy Browser:
>>> http://www.ncbi.nlm.nih.gov/Taxonomy/
>>>
>>> you can use taxonomy id to access different organisms, Arabidopsis for
>>> example (3702):
>>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702 
>>>
>>>
>>> or by direct web link:
>>> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1 
>>>
>>>
>>> assembled genomes can be accessed via ftp:
>>> ftp://ftp.ncbi.nih.gov/genomes/
>>>
>>> To download a large number of selected sequences (ESTs for example) you
>>> can use batch Entrez:
>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
>>> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
>>> (select EST for EST, it's critical)
>>>
>>> It seems you don't need bioperl to solve the problem you describe.
>>> NCBI GenBank Entrez provides all the necessary tools for these simple
>>> and frequent tasks.
>>>
>>> -Alex
>>>
>>> --Alexander Kozik
>>> Bioinformatics Specialist
>>> Genome and Biomedical Sciences Facility
>>> 451 East Health Sciences Drive
>>> University of California
>>> Davis, CA 95616-8816
>>> Phone: (530) 754-9127
>>> email#1: akozik at atgc.org
>>> email#2: akozik at gmail.com
>>> web: http://www.atgc.org/
>>>
>>>
>>>
>>> Xing Hu wrote:
>>>> Hi friends,
>>>>
>>>>     I wrote a script for getting genomic sequence files from GenBank. To
>>>> do that, I used the Bio::DB::GenBank module to get the sequence via
>>>> get_Seq_by_acc, and it works well. But this time, facing an enormous
>>>> number of ESTs, I have no idea how to download them swiftly and
>>>> elegantly.
>>>>
>>>>     PROBLEM DESCRIPTION:
>>>>     goal: download all EST files of a specific species from GenBank,
>>>> say Arabidopsis thaliana or Oryza sativa (rice).
>>>>     other: whether all of the ESTs are in a single file or placed
>>>> separately does not matter.
>>>>
>>>>     Can I use a bioperl script to achieve that? And how? I would really
>>>> appreciate any help.
>>>>
>>>> Xing.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>


