[Biopython] Searching for and downloading sequences using the history

Peter biopython at maubp.freeserve.co.uk
Fri Sep 18 16:56:34 EDT 2009


On Fri, Sep 18, 2009 at 9:09 PM, Carlos Javier Borroto
<carlos.borroto at gmail.com> wrote:
>
> On Fri, Sep 18, 2009 at 2:51 PM, Peter wrote:
>> I would first suggest you refine your Entrez search to use "species
>> name[orgn]" rather than just "species name" (i.e. explicitly search
>> on the organism rather than all fields). That may reduce things
>> further. Even better, search using an NCBI taxonomy ID to be
>> absolutely explicit. This may reduce the dataset a bit.
>
> Nice advice, I was thinking about doing that. Now I'm using something
> like txid6945[Organism:noexp], but I still have 100,000+ sequences to
> download.

That is what I meant - but still, you have a lot of sequences!
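
As an aside, the sort of thing I had in mind on the search side looks
like this (just an untested sketch with Bio.Entrez using the history
feature - the database name and FASTA assumption are guesses based on
your txid6945 query, and the email address is a placeholder):

from Bio import Entrez

Entrez.email = "your.address@example.com"  # placeholder - tell the NCBI who you are

# Organism-restricted search, asking the NCBI to keep the matching IDs
# on their history server (WebEnv/QueryKey) for later EFetch calls.
search_handle = Entrez.esearch(db="nucleotide",
                               term="txid6945[Organism:noexp]",
                               usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print "Search found %i records" % count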

>> Secondly, this seems like an awfully large amount of data to
>> try to download via Entrez. Email the NCBI to ask if this is
>> OK (and if so, what batch size you should use for EFetch calls),
>> or if they have an alternative suggestion (e.g. some FTP site).
>
> But how would I know what is and isn't "an awfully large amount
> of data" to download via Entrez? I'm going to write to them and
> see what they think. The FTP site was my first option, but it is
> unresponsive right now, and I don't think they have this specific
> subset of sequences there anyway.

Well over 100,000 records sounds like a lot to me, but I agree, the
NCBI could provide more explicit guidance.

It is a shame if the NCBI don't provide what you want by FTP,
as that would probably be easier.

>> P.S. You could try wrapping each EFetch call in a
>> try/except in order to retry any individual retrieval which fails.
>
> Great, I just did it and it seems to be working fine! Here is what I did:
>
>                while True :
>                        try :
>                                fetch_handle = ...
>                                data = fetch_handle.read()
>                                fetch_handle.close()
>                                out_handle.write(data)
>                                break
>                        except IOError :
>                                print "Server error, going to try again from record %i" % (start+1)

That will loop forever if there is a persistent problem - which is fine
if you are going to sit watching the script, but a very bad idea for
automation. I would limit it to, say, three attempts before giving up,
and maybe add a sleep of a few seconds between retries too.
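
i.e. Something along these lines (an untested sketch reusing your
variable names - start, batch_size and out_handle come from your
surrounding loop, and the EFetch arguments are my guess at what your
elided call looks like, fetching via the history):

import time

for attempt in range(3):
    try:
        # Guess at the elided EFetch call, fetching one batch via the
        # history (webenv/query_key from the earlier ESearch):
        fetch_handle = Entrez.efetch(db="nucleotide",
                                     rettype="fasta", retmode="text",
                                     retstart=start, retmax=batch_size,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
        break
    except IOError:
        print "Server error on attempt %i, will retry from record %i" \
              % (attempt + 1, start + 1)
        time.sleep(5)  # give the server a few seconds before retrying
else:
    # for/else: only reached if the loop never hit the break above
    raise IOError("Giving up after three failed attempts at record %i" % (start + 1))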

Peter


