[Bioperl-l] contigs in NCBIHelper (RE: WGS sequences through Bio::DB::GenBank)

Chris Fields cjfields at uiuc.edu
Mon Mar 6 18:03:28 UTC 2006


I noticed this morning, while looking into ways of retrieving WGS sequences
from WGS master files with Bio::DB::GenBank, that NCBIHelper post-processes
all files to check for CONTIG lines (I believe Brian pointed this out to
me last week).  I found a blurb in the eutils course file saying that the
assembly can be done directly by NCBI, using rettype=gbwithparts, which I
mentioned previously:

Application 4: Downloading Contigs
I want to download a flatfile with the full sequence of an assembly (eg. a
contig).
Solution: Use EFetch with &rettype=gbwithparts
URL:efetch.fcgi?db=nucleotide&id=27479347&rettype=gbwithparts
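
For reference, a minimal sketch of how the query string quoted above could be put together in plain Perl. The blurb gives only the relative efetch.fcgi address, so the base URL in $EUTILS_BASE is an assumption, and efetch_url is a hypothetical helper, not part of bioperl. Values are assumed to be URL-safe already, so no escaping is done.

```perl
use strict;
use warnings;

# Assumed base address of the E-utilities (not given in the quoted blurb).
my $EUTILS_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: build an EFetch URL from key => value pairs,
# keeping the pairs in the order they were passed.
sub efetch_url {
    my @params = @_;
    my @pairs;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        push @pairs, "$key=$value";
    }
    return $EUTILS_BASE . 'efetch.fcgi?' . join('&', @pairs);
}

# Same parameters as in the course-file example.
print efetch_url(db => 'nucleotide', id => 27479347, rettype => 'gbwithparts'), "\n";
```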

I changed %FORMATMAP in the NCBIHelper BEGIN block to include this return
type and it seems to catch these files w/o problems (i.e. passes through
postprocessing w/o a hitch).  This, of course, doesn't work with WGS files,
my original intent.  oh well ;{ 
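
The kind of change described above can be pictured with a small stand-in. This is only a sketch: %format_map and parser_format_for are hypothetical names, and the entries are illustrative rather than a copy of what %FORMATMAP actually holds in NCBIHelper. The idea is that the new return type is mapped to the same parser format as 'gb', since the record that comes back is still a GenBank flatfile.

```perl
use strict;
use warnings;

# Stand-in for a rettype => parser-format map; entries are illustrative only.
my %format_map = (
    gb          => 'genbank',
    fasta       => 'fasta',
    gbwithparts => 'genbank',   # the added return type, parsed as GenBank
);

# Hypothetical lookup: returns the parser format, or undef if the
# return type is not known.
sub parser_format_for {
    my ($rettype) = @_;
    return $format_map{$rettype};
}

print "$_ -> ", parser_format_for($_), "\n" for qw(gb gbwithparts);
```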

This seems to speed up the process tremendously as well, considering all the
work is done on NCBI's end; a quick test using the same file (CH398084) and
the following:

my $gb = Bio::DB::GenBank->new(-verbose => $v,
                               -format => 'gbwithparts');


took ~10-15 secs, most of it retrieval time, while this:

my $gb = Bio::DB::GenBank->new(-verbose => $v,
                               -format => 'gb');

took ~45-55 seconds on my 2GHz computer with ~1 GB RAM running WinXP.  There
are substantial differences between the files, which seem to come down to 'n'
padding between joined segments.  NCBI's version had varying numbers of n's
padding the places where contigs were joined, based on the presence of
'gap(x)' in the CONTIG join lines from the master file. I didn't see any
padding with bioperl's version.
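
To make the padding difference concrete, here is a rough sketch of how gap(x) elements in a CONTIG join statement could be turned into runs of n's. It assumes gap(x) stands for x n's, which is how the NCBI output described above reads; contig_elements and gap_padding are hypothetical helpers, the accessions are made up, and other forms such as gap(unk100) are not handled.

```perl
use strict;
use warnings;

# Split the body of a CONTIG join(...) statement into its top-level elements.
sub contig_elements {
    my ($contig) = @_;
    $contig =~ s/\s+//g;    # CONTIG lines wrap in the flatfile; drop whitespace
    my ($body) = $contig =~ /^join\((.*)\)$/
        or die "not a join(...) statement: $contig";
    return split /,/, $body;
}

# Return the run of 'n' characters that a gap(x) element stands for,
# or undef if the element is an ordinary sequence segment.
sub gap_padding {
    my ($element) = @_;
    return $element =~ /^gap\((\d+)\)$/ ? 'n' x $1 : undef;
}

# Made-up CONTIG statement, for illustration only.
my $contig = 'join(AAAA01000001.1:1..20,gap(5),complement(AAAA01000002.1:1..12))';
for my $element (contig_elements($contig)) {
    my $pad = gap_padding($element);
    print defined $pad ? "gap     -> $pad\n" : "segment -> $element\n";
}
```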


I haven't committed any of these changes just yet as I'm still working on
the WGS issue (I'm thinking about a module based on Bio::DB::GenBank
aliasing some of the get methods at the moment).

So, should we now change the _post_process sub to fall back to 'gbwithparts'
when it catches a CONTIG file on the backend, such as when someone requests a
CONTIG file using the rettype of 'gb' instead of 'gbwithparts'?
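
One way the fallback asked about above might look, as a sketch only. It assumes that a record fetched as 'gb' which still needs assembling has a CONTIG line but no ORIGIN section. is_unexpanded_contig and retry_rettype are hypothetical helpers, not existing NCBIHelper methods, and the record below is a made-up stub.

```perl
use strict;
use warnings;

# True if a GenBank flatfile record has a CONTIG line but no ORIGIN section,
# i.e. the sequence would still have to be assembled from its parts.
sub is_unexpanded_contig {
    my ($record) = @_;
    return ($record =~ /^CONTIG\s/m && $record !~ /^ORIGIN/m) ? 1 : 0;
}

# Hypothetical policy: if the caller asked for 'gb' and got an unexpanded
# CONTIG record, suggest refetching with 'gbwithparts'; otherwise nothing.
sub retry_rettype {
    my ($record, $requested) = @_;
    return 'gbwithparts'
        if $requested eq 'gb' && is_unexpanded_contig($record);
    return;    # no second request needed
}

# Made-up stub record, for illustration only.
my $record = join "\n",
    'LOCUS       EXAMPLE',
    'CONTIG      join(AAAA01000001.1:1..20,gap(5),AAAA01000002.1:1..12)',
    '//', '';

my $retry = retry_rettype($record, 'gb');
print defined $retry ? "refetch with rettype=$retry\n" : "record is complete\n";
```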

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 





