[Bioperl-l] How to download EST files via bioperl script?

Tue Jul 10 14:05:43 UTC 2007

Just make sure you're using the latest from CVS.  Let me know if it  
doesn't work and I'll look into it.

chris

On Jul 10, 2007, at 8:58 AM, Alberto Davila wrote:

> Hi Xing,
>
> Unfortunately that did not work for me. There are 5133 T. brucei ESTs
> (http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5691[Organism:exp]&cmd=Search&db=nucest&QueryKey=8)
> and 13971 from T. cruzi
> (http://www.ncbi.nlm.nih.gov/sites/entrez?term=txid5693[Organism:exp]&cmd=Search&db=nucest&QueryKey=11)
> that I cannot download at once in GenBank format. Even when I select
> "GenBank" format in the Display menu, I can only view and download 500
> ESTs at a time.
>
> I also downloaded all ESTs from GenBank (a pity there are no subsets
> of them!), but merging them all generates a file bigger than 120 GB to
> be processed.
>
> I just asked Diogo (my student) to try the script sent by Chris
> Fields, so fingers crossed ;-)
>
> Cheers, Alberto
>
>
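For a merged GenBank flat file as large as the one described above, one option is to stream it record by record and keep only the organism of interest, so the whole file never has to fit in memory. What follows is only a minimal sketch: the sub name filter_by_organism is hypothetical, and it assumes standard GenBank flat-file layout (each record ends with a line holding only "//" and carries an indented ORGANISM line). Any release header before the first LOCUS line would travel with the first record, and a wrapped line that happens to end in "//" inside a record would split it early.

```perl
use strict;
use warnings;

# Copy every GenBank record whose ORGANISM line starts with $organism
# from the input handle to the output handle. Returns the number of
# records kept. (Hypothetical helper, not part of BioPerl.)
sub filter_by_organism {
    my ($in, $out, $organism) = @_;
    local $/ = "//\n";    # read one "//"-terminated record per iteration
    my $kept = 0;
    while (my $record = <$in>) {
        # \Q...\E quotes the name so "T. brucei"-style dots are literal
        next unless $record =~ /^\s+ORGANISM\s+\Q$organism\E/m;
        print {$out} $record;
        $kept++;
    }
    return $kept;
}
```

Called as filter_by_organism(\*STDIN, \*STDOUT, 'Trypanosoma cruzi'), it can sit at the end of a pipe, so the downloaded files can be filtered one at a time without ever building the merged file.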
> Xing Hu wrote:
>> Thank you, guys.
>>
>> I have to confess how stupid I was. The easiest way seems to be the
>> NCBI Taxonomy Browser, as suggested by Alex. As a matter of fact, I
>> knew about it, but I thought it was necessary to have all items
>> selected before pressing save to launch the download. So I was
>> desperate to find a button that could do that without hundreds of
>> thousands of clicks from me. "What about selecting none of the items
>> at all?" This idea finally came to me after days of struggling, and
>> the problem was solved.
>>
>> Xing
>>
>>
>>
>> Chris Fields wrote:
>>> Caveat: if you have millions of ESTs, please consider NOT using my
>>> eutil script below or NCBI Batch Entrez, either of which would hit
>>> the NCBI server thousands of times.  At least try looking for other
>>> ways to retrieve the data you want (ftp, organism-specific resources
>>> like Ensembl, and so on), or run any scripts or data retrieval in
>>> off hours so you don't overtax the NCBI server.
>>>
>>> There is a way to do this with BioPerl if you don't mind living on
>>> the bleeding edge by using bioperl-live (core code from CVS).  For
>>> the last year I have been working on a set of modules
>>> (Bio::DB::EUtilities) that interact with the various eutils through
>>> the NCBI CGI interface and can be used to build data pipelines.  You
>>> could possibly retrieve all relevant ESTs using a variation of the
>>> example script here:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esearch-.3Eefetch
>>>
>>> Note that the code examples do NOT work with rel. 1.5.2 code as the
>>> API has changed quite a bit; I'm working to rectify some of that.
>>>
>>> The script I would use is below.  It retrieves sequences in fasta
>>> format in batches of 500, up to a maximum of 10000 records, and
>>> appends the raw record data directly to a file as it goes.  I added
>>> an eval block to catch server errors and redo the call up to 4 times
>>> before giving up completely.  Using eval this way hasn't been
>>> extensively tested but should work.
>>>
>>> ---------------------------------------
>>>
>>> use strict;
>>> use warnings;
>>> use Bio::DB::EUtilities;
>>>
>>> # esearch: find all nucest records for taxon 3702 (Arabidopsis) and
>>> # keep the result set on the NCBI history server
>>> my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
>>>                                        -db => 'nucest',
>>>                                        -term => 'txid3702',
>>>                                        -usehistory => 'y',
>>>                                        -keep_histories => 1);
>>>
>>> my $count = $factory->get_count;
>>>
>>> print "Count: $count\n";
>>>
>>> if (my $hist = $factory->next_History) {
>>>     print "History returned\n";
>>>     # note db carries over from above
>>>     $factory->set_parameters(-eutil => 'efetch',
>>>                              -rettype => 'fasta',
>>>                              -history => $hist);
>>>     my ($retmax, $retstart) = (500,0);
>>>     # $retry counts failures over the whole run, not per batch
>>>     my $retry = 1;
>>>     # set max # seq records to return
>>>     my $maxcount = $count < 10000 ? $count : 10000;
>>>     RETRIEVE_SEQS:
>>>     while ($retstart < $maxcount) {
>>>         print "Returning from ",$retstart+1," to ",$retstart+$retmax,"\n";
>>>         $factory->set_parameters(-retmax => $retmax,
>>>                                 -retstart => $retstart);
>>>         # check in case of server error
>>>         eval{
>>>             # ">>" appends each batch to the same file
>>>             $factory->get_Response(-file => ">>ESTs.fas");
>>>         };
>>>         if ($@) {
>>>             die "Server error: $@.  Try again later" if $retry == 5;
>>>             print STDERR "Server error, redo #$retry\n";
>>>             $retry++;
>>>             # repeat the same batch without advancing $retstart
>>>             redo RETRIEVE_SEQS;
>>>         }
>>>         $retstart += $retmax;
>>>     }
>>> }
>>>
>>>
>>> ---------------------------------------
>>>
>>>
>>> chris
>>>
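The batching and retry pattern in the script above can be tried out without touching the NCBI server. The sketch below is only an illustration under stated assumptions: fetch_in_batches and its named arguments (total, retmax, fetch, max_retries, pause) are hypothetical and not part of Bio::DB::EUtilities, and the fetch callback stands in for the efetch call. Unlike the script, it resets the retry counter for every window, trims the last window to the records that remain, and pauses between attempts. The 3 second default pause is an arbitrary conservative choice, not an NCBI-mandated value.

```perl
use strict;
use warnings;

# Call $fetch->($retstart, $n) for each window needed to cover $total
# records. A window whose callback dies is retried up to $max_retries
# times before the whole run gives up. (Hypothetical helper.)
sub fetch_in_batches {
    my (%arg) = @_;
    my ($total, $retmax, $fetch) = @arg{qw(total retmax fetch)};
    my $max_retries = defined $arg{max_retries} ? $arg{max_retries} : 4;
    my $pause       = defined $arg{pause}       ? $arg{pause}       : 3;

    for (my $retstart = 0; $retstart < $total; $retstart += $retmax) {
        # do not ask for more records than remain in the last window
        my $left = $total - $retstart;
        my $want = $left < $retmax ? $left : $retmax;
        my $attempt = 0;    # retry counter restarts for each window
        while (1) {
            # eval returns 1 only if the callback did not die
            last if eval { $fetch->($retstart, $want); 1 };
            my $err = $@ || "unknown error\n";
            die "Giving up at retstart=$retstart: $err"
                if ++$attempt > $max_retries;
            warn "Fetch failed (retry $attempt of $max_retries): $err";
            sleep $pause if $pause;    # be gentle with the server
        }
    }
    return;
}
```

In a real run the callback would wrap the set_parameters and get_Response calls from the script, with $retstart and $n passed through as -retstart and -retmax.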
>>> On Jul 9, 2007, at 7:25 AM, Alexander Kozik wrote:
>>>
>>>> To download genomic sequences or ESTs for any organism (in various
>>>> formats) you can use the NCBI Taxonomy Browser:
>>>> http://www.ncbi.nlm.nih.gov/Taxonomy/
>>>>
>>>> You can use the taxonomy ID to access different organisms, for
>>>> example Arabidopsis (3702):
>>>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&dopt=DocSum&term=txid3702
>>>>
>>>>
>>>> or by direct web link:
>>>> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Arabidopsis+thaliana&lvl=0&srchmode=1
>>>>
>>>>
>>>> Assembled genomes can be accessed via ftp:
>>>> ftp://ftp.ncbi.nih.gov/genomes/
>>>>
>>>> To download a large number of selected sequences (ESTs, for
>>>> example) you can use Batch Entrez:
>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
>>>> http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
>>>> (select EST as the database for ESTs; this is critical)
>>>>
>>>> It seems that you don't need BioPerl to solve the problem you
>>>> describe. NCBI Entrez provides all the tools necessary for these
>>>> simple and frequent tasks.
>>>>
>>>> -Alex
>>>>
>>>> --Alexander Kozik
>>>> Bioinformatics Specialist
>>>> Genome and Biomedical Sciences Facility
>>>> 451 East Health Sciences Drive
>>>> University of California
>>>> Davis, CA 95616-8816
>>>> Phone: (530) 754-9127
>>>> email#1: akozik at atgc.org
>>>> email#2: akozik at gmail.com
>>>> web: http://www.atgc.org/
>>>>
>>>>
>>>>
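The txid-based Entrez searches above can also be issued programmatically through the E-utilities esearch endpoint, which is what Bio::DB::EUtilities wraps. The sketch below only builds the request URL and sends nothing. The helper names uri_encode and esearch_url are hypothetical, and the base URL is assumed to be the current public E-utilities address, which differs from the web URLs quoted in this thread.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
sub uri_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Build an esearch URL selecting all records of one taxon (including
# its subtaxa, via [Organism:exp]) in one Entrez database.
sub esearch_url {
    my ($db, $taxid) = @_;
    # assumed base URL of the public E-utilities service
    my $base  = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi';
    my %param = (
        db         => $db,
        term       => 'txid' . $taxid . '[Organism:exp]',
        usehistory => 'y',    # keep the result set on the history server
    );
    my @pairs = map { "$_=" . uri_encode($param{$_}) } sort keys %param;
    return $base . '?' . join('&', @pairs);
}

print esearch_url('nucest', 3702), "\n";
```

Setting usehistory to y mirrors the -usehistory option in the BioPerl script: the matching IDs stay on the server, so later efetch calls can page through them with retstart and retmax.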
>>>> Xing Hu wrote:
>>>>> Hi friends,
>>>>>
>>>>>     I wrote a script to get genomic sequence files from GenBank.
>>>>> For that I used the Bio::DB::GenBank module to fetch each sequence
>>>>> via get_Seq_by_acc, and it works well. But this time, facing an
>>>>> enormous number of ESTs, I have no idea how to download them
>>>>> swiftly and elegantly.
>>>>>
>>>>>     PROBLEM DESCRIPTION:
>>>>>     goal: download all EST files of a specific species from
>>>>> GenBank, say Arabidopsis thaliana or Oryza sativa (rice).
>>>>>     other: whether the ESTs are all in a single file or stored
>>>>> separately does not matter.
>>>>>
>>>>>     Can I use a BioPerl script to achieve that? And how? I would
>>>>> really appreciate any help.
>>>>>
>>>>> Xing.
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> Christopher Fields
>>> Postdoctoral Researcher
>>> Lab of Dr. Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign
>>>
>>>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign