[Bioperl-l] acquiring a local refseq + index

Erik er at xs4all.nl
Mon Jan 1 20:17:42 UTC 2007


> Agree with Hilmar, in that we need examples.

Another problematic one was NC_004822. However, most of the other problems
I referred to were ones that did *not* stop the DBD indexing. (Of course, I
do not know how many error-throwing entries there still are in the files
that are not yet indexed: about 75%.)

The most common error was pollution of the 'binomial' name with
classification lines as a result of faulty parsing. If Bio::Species is
deprecated (I hadn't noticed that before) then these problems are of
course correspondingly less important.

>> If you are referring to your submitted bug:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2167

Yes, the above was one that stopped refseq indexing (there is one more
that I will stick into bugzilla in a minute). Thanks for the commit.
>
> we could add this in as long as it passes (I'll try giving it a
> workout with my local bacterial seqs tonight or tomorrow).  However,
> in the not-too-distant future your patch would likely be rendered
> obsolete, as any parsing in Bio::SeqIO modules pertaining to
> Bio::Species-related matters will be deprecated in favor of simple
> parsing (more foolproof, less uncertainty) and Bio::Taxon (which has
> optional db lookups using NCBI Taxonomy).  Bio::Species and anything
> related to it are considered marked for deprecation.  Fair warning...

What does simple parsing mean? Just returning the whole ORGANISM string,
and leaving further parsing to application side?
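
To make the question concrete, here is a rough sketch of what I imagine
"simple parsing" could amount to. This is only my guess at the idea, not a
description of what Bio::SeqIO::genbank does or will do; raw_organism is a
made-up helper and the record is a trimmed, made-up example. It returns the
whole ORGANISM block as one string and makes no attempt to split it into a
name and a lineage.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the raw ORGANISM block
# of one GenBank record as a single string, without splitting it into
# organism name and classification.
sub raw_organism {
    my ($record) = @_;
    my @lines;
    my $in_block = 0;
    for my $line (split /\n/, $record) {
        if ($line =~ /^\s{2}ORGANISM\s+(.*)$/) {
            $in_block = 1;          # first line of the block
            push @lines, $1;
        }
        elsif ($in_block && $line =~ /^\s{12}(\S.*)$/) {
            push @lines, $1;        # indented continuation line
        }
        elsif ($in_block) {
            last;                   # any other line ends the block
        }
    }
    return join ' ', @lines;
}

# Trimmed example record, for illustration only.
my $record = <<'END';
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria;
            Enterobacteriales; Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 100)
END

print raw_organism($record), "\n";
```

With something like this, deciding where the name ends and the
classification begins would be left entirely to the application.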

I shall look a bit closer at the Bio::Taxon and its relation to the parser
modules, assuming there still *is* a relation. :)

Maybe someone could elaborate just a little bit to get me started on how
to get taxonomic data from a refseq ID or a GenBank entry?


thanks,

Erik

> On Dec 30, 2006, at 7:48 PM, Hilmar Lapp wrote:
>
>> Can you send examples and the resulting error messages? Also, I'm
>> assuming you are running the 1.5.2 release of Bioperl; if not, that's
>> what I would try first.
>>
>> 	-hilmar
>>
>> On Dec 30, 2006, at 7:05 PM, Erik wrote:
>>
>>> Hi all,
>>>
>>> I downloaded the refseq files (.gbff) and want to index the lot with
>>> Bio::DB::Flat.
>>>
>>> It turns out that there are many cases where the SOURCE and ORGANISM
>>> lines are messed up, sometimes to a degree where the indexing fails
>>> on a Bio::SeqIO::genbank error.
>>>
>>> I'd like to change Bio::SeqIO::genbank to let this parsing go at
>>> least so far as to make the indexing of the refseq files possible,
>>> and hopefully improving the taxonomic output
>>> ($seq->species->binomial is often mutilated at the moment).
>>>
>>> Is it still worthwhile to change parsing modules like
>>> Bio::SeqIO::genbank? Is anyone already working on a rewrite? Because
>>> if this is the case I may be better off writing my own indexing
>>> scheme?
>>>
>>> Below is (outline of) my indexing program, which uses
>>> Bio::DB::Flat::DBD. If anyone knows of a better way to get a locally
>>> searchable refseq flat file index, I would be very interested.
>>>
>>> Thanks for your help,
>>>
>>> Erikjan
>>>
>>>
>>> -------------
>>> use Bio::DB::Flat;
>>>
>>> my $refseq_dir = '/data/ftp.ncbi.nih.gov/refseq/release/complete';
>>> my $db=Bio::DB::Flat->new(
>>>    -directory  => $refseq_dir,
>>>    -dbname     => 'refseq',
>>>    -format     => 'genbank',
>>>    -index      => 'bdb',
>>>    -write_flag => 1,
>>> );
>>> my @files = getfiles($refseq_dir);
>>> for my $f (@files) {
>>>         $db->build_index($f);
>>> }
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> --
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
