[Bioperl-l] Best Practices for Downloading/Mirroring Genbank

Mark Johnson mjohnson at watson.wustl.edu
Thu Jun 24 12:13:07 EDT 2004


Rsync is your friend.  Both NCBI and Biomirror are rsync-friendly.  You
can use rsync to maintain a local copy of whatever parts of the NCBI FTP
site you'd like.  Once the rsync finishes, you can be assured that you
have a consistent local snapshot (as long as you didn't rsync in the
middle of a file update on the other end).  It will even minimize your
bandwidth consumption: on subsequent invocations it will only transfer
files you don't have, or changes to files you do have.
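
A minimal sketch of such a mirror pull, wrapped in Python so it can be
scheduled or logged. The remote URL and local path are hypothetical
placeholders, not real NCBI or Biomirror module paths; substitute whatever
your chosen mirror publishes. The flags are standard rsync options.

```python
import shlex
import subprocess

# Hypothetical source and destination; replace with your mirror's real paths.
REMOTE = "rsync://mirror.example.org/genbank/"
LOCAL = "/data/mirror/genbank/"


def build_rsync_command(remote, local, dry_run=False):
    """Return the rsync argument list for a one-way mirror pull."""
    cmd = [
        "rsync",
        "--archive",  # recurse and preserve timestamps and permissions
        "--partial",  # keep partially transferred files so a rerun can resume
        "--delete",   # remove local files that no longer exist upstream
    ]
    if dry_run:
        cmd.append("--dry-run")  # report what would change, transfer nothing
    cmd += [remote, local]
    return cmd


def mirror(remote=REMOTE, local=LOCAL, dry_run=False):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    cmd = build_rsync_command(remote, local, dry_run)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Calling mirror(dry_run=True) first shows what a real run would transfer or
delete. Because --archive preserves modification times, the local copies
carry the timestamps of the server they were pulled from.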

> I'm working on setting up a local mirror of Genbank here at work and am
> unsure of what the best way to go about it is.
>
> I started off real simple with a wget -m ftp://genbank.sdsc.edu/pub (yes, I
> wanted the BLAST-formatted databases and executables as well) and the
> transfer is going just fine, albeit excruciatingly slowly at times.
>
> But what happens:
> 1) between now and the next build?;
> 2) if I choose to mirror from an alternate source?;
> 3) after the next build?
>
> For the first part, I just planned on doing daily wgets for the updates,
> and the possibility occurred to me that if I miss the last couple of days'
> worth of updates before the new build, those updates get shuffled into the
> main build files and I have to download the whole thing again?
>
> For the second, if I choose to mirror from Biomirror or NCBI instead of San
> Diego, the timestamps seem to be different for what I am assuming to be
> the same build.  For example,
>
> File: gbest1.seq.gz
>
> Source       Size (bytes)  Date     Time
> SDSC Mirror  19,454,020    5/22/04  5:04am
> NCBI Mirror  19,454,020    4/25/04  2:01am
> BioMirror    19,454,020    4/25/04  2:01am
>
> For the third part, do the build files really change, or are new entries and
> revisions just added on as extra build files?  I read that the files are
> non-cumulative, so that would seem to confirm it, but the timestamps are
> updated in sync with the latest build date.
>
> How do I keep an updated mirror without losing daily builds or having to
> download the whole thing every couple of months?  How do I verify that I do
> have the latest data, given that checking timestamps does not seem like it
> will work?  Should I even bother with creating a true mirror?
>
> I ran across this recent thesis on some of the issues in maintaining these
> types of databases accurately while minimizing file transfers
> http://if.anu.edu.au/Students/DamonSearle-2003-thesis.pdf
>
> I know that Biomirror has some scripts to facilitate efficient transfers,
> but do they handle updates?  I'm guessing this problem has already been
> addressed; I just can't find the solution.
>
> Thanks in advance for any input,
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Joseph Karalius
> RA, Bioinformatics
> Molecular Markers and Applied Genomics
> Seminis Vegetable Seeds, Inc
> 37437 State Highway 16
> Woodland, CA 95695-9353
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
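
On the timestamp question above: the identical byte counts suggest the
three mirrors carry the same file under different modification times, so
comparing content is more reliable than comparing dates. rsync's --checksum
option does this during a transfer. For files already on disk, a minimal
Python sketch might look like the following; the function names md5_of and
same_content are illustrative, not part of any NCBI or Biomirror tool.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Compare two files by size, then by digest; timestamps are ignored."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never be the same content
    return md5_of(path_a) == md5_of(path_b)
```

The size check is a cheap early exit; the digest is only computed when the
sizes already agree, as they do in the gbest1.seq.gz example.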


