[Bioperl-l] retrieval of PRELIMINARY uniprot sequences using Bio::Registry fails

Wed Sep 6 18:09:42 UTC 2006

Brian

There is a problem with the way the division is parsed in SeqIO::swiss (it is
pulled from the entry name 'xxxx_yyyy', after the underscore).  This leads
to some weird 'division' designations, like 'CHLTR', '9PICO', etc.  What we
probably should do is not set division() at all (or set it to 'unknown'),
since the release notes specify that SwissProt/TrEMBL don't use divisions
(and, looking at our old test sequences, never used them).
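
To make the problem concrete, here is a minimal sketch of that kind of
parsing.  It is written in Python purely for illustration and is not the
actual swiss.pm code; entry_name_suffix is a hypothetical helper name.

```python
def entry_name_suffix(id_line):
    """Return the text after the underscore in the entry name, or None.

    Hypothetical helper: it mimics pulling a 'division' out of the
    entry name on a Swiss-Prot/TrEMBL ID line.
    """
    fields = id_line.split()
    # fields[0] is the line code 'ID', fields[1] is the entry name.
    if len(fields) < 2 or fields[0] != "ID":
        raise ValueError("not an ID line: %r" % id_line)
    _, sep, suffix = fields[1].partition("_")
    # Entry names without an underscore (e.g. 'Q5XPV6') have no suffix.
    return suffix if sep else None


if __name__ == "__main__":
    for line in (
        "ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.",
        "ID   Q16EZ1_AEDAE   PRELIMINARY;   PRT;   222 AA.",
        "ID   Q5XPV6      PRELIMINARY;      PRT;   231 AA.",
    ):
        print(entry_name_suffix(line))
```

The suffix that comes back ('BOVIN', 'AEDAE') is the organism code from the
entry name, not a GenBank-style division such as 'PRI', and an entry name
with no underscore yields nothing at all.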

We could always leave it the way it is, but I don't think using these odd
designations for divisions makes much sense.

I'll probably use namespace(), as it seems to be the closest fit for storing
the specific database name (Swiss-Prot or TrEMBL).  I'm using version() to
hold the current number for the entry version (latest update version), which
should probably go in seq_version() instead.

As SwissProt => 'STANDARD' and TrEMBL => 'PRELIMINARY', we could just build
the ID line based on the namespace() designation, falling back to the true
division() or 'UNK'.
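
A minimal sketch of that fallback, again in Python for illustration only and
not the real swiss.pm sprintf().  The names build_id_line and
DATA_CLASS_BY_NAMESPACE are hypothetical, and the column widths are an
assumption inferred from the example ID lines quoted below.

```python
# Assumed mapping from the database name held in namespace() to the
# data class token written on the ID line.
DATA_CLASS_BY_NAMESPACE = {
    "Swiss-Prot": "STANDARD",
    "TrEMBL": "PRELIMINARY",
}


def build_id_line(entry_name, length, namespace=None, division=None):
    """Build an ID line, preferring the namespace-derived data class.

    Falls back to the stored division, and finally to 'UNK', when the
    namespace is missing or not recognised.
    """
    data_class = DATA_CLASS_BY_NAMESPACE.get(namespace) or division or "UNK"
    # Spacing is a guess based on the sample lines, not a format spec.
    return "ID   %-14s %s;      PRT; %5d AA." % (entry_name, data_class, length)


if __name__ == "__main__":
    print(build_id_line("CYC_BOVIN", 104, namespace="Swiss-Prot"))
    print(build_id_line("Q5XPV6", 231, namespace="TrEMBL"))
    print(build_id_line("Q5XPV6", 231))
```

With this arrangement a TrEMBL record round-trips as PRELIMINARY instead of
being written out as STANDARD, and records with no usable namespace still
produce a well-formed line.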

Chris

> Chris,
> 
> Yes, I saw this but was waiting for Daniel's sample.
> 
> division() is not a great way to set this value since it's meant for
> taxonomic "divisions" (e.g. "PRI" in Genbank). On the other hand what else
> is there? authority() doesn't seem right either. What about:
> 
> $seq->seq_version($DATA_CLASS)
> 
> None of them are ideal but this is the closest, in my opinion. Then
> "Swiss-prot" and "TrEMBL" could be set by namespace() or authority().
> 
> Brian O.
> 
> On 9/6/06 10:59 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> 
> > Brian,
> >
> > I have found the issue with Bio::SeqIO::swiss; apparently UniProt has
> > switched to using the following ID line format:
> >
> > ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
> >
> > For SwissProt ID's
> >
> > ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.
> > ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.
> >
> > For TrEMBL (preliminary protein):
> >
> > ID   Q5XPV6      PRELIMINARY;      PRT;   231 AA.
> >
> > SeqIO 'swiss' sequence output currently uses the first (SwissProt) version;
> > it's hardcoded in a sprintf() statement.  I guess TrEMBL didn't have a
> > designation before, so this complicates things a little.
> >
> > There are a few other (small) formatting differences I have also found
> > which we could update fairly easily.
> >
> > In the section of the release notes describing differences between
> > SwissProt/EMBL format, this is listed:
> >
> > * EMBL entry ID lines have an additional three-letter taxonomic division
> > 'token' inserted between the data class and the molecule type;
> >
> > I suppose we could use division() to store 'STANDARD' and 'PRELIMINARY'
> > (or 'Swiss-Prot' and 'TrEMBL' if that's nicer).
> >
> > Chris
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org
> >> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Daniel Lang
> >> Sent: Wednesday, September 06, 2006 4:12 AM
> >> To: bioperl-l at lists.open-bio.org
> >> Subject: Re: [Bioperl-l] retrieval of PRELIMINARY uniprot sequences using
> >> Bio::Registry fails
> >>
> >> Hi Brian,
> >>
> >> I'm now iterating over all uniprot_trembl sequences and recording which
> >> retrievals fail. Let's see if STANDARDs also fail...
> >>
> >> How is the second field of the swissprot ID line handled anyway? Because
> >> PRELIMINARYs end up as STANDARD when being parsed by Bio::SeqIO::swiss.
> >>
> >> On the other hand, I'm still confused about why there's no error or
> >> warning when the retrieval fails. Can you give me a hint which modules
> >> (besides swiss.pm) to look at?
> >>
> >> Cheers,
> >> Daniel
> >>
> >> Brian Osborne wrote:
> >>> Daniel,
> >>>
> >>> Well, if you can isolate the bug please add it to bugzilla.
> >>>
> >>> Brian O.
> >>>
> >>>
> >>> On 9/5/06 5:57 AM, "Daniel Lang" <daniel.lang at biologie.uni-freiburg.de>
> >>> wrote:
> >>>
> >>>> Hi Brian,
> >>>>
> >>>> sorry for the belated response!
> >>>> I've compiled you a set of 100 PRELIMINARY entries from the latest
> >>>> uniprot_trembl release. I've tried to reproduce the bug using only these
> >>>> as input to build an index, but (sadly) all of them can be retrieved
> >>>> using the latest checkout:-(
> >>>> Maybe it's not connected to these entries after all, but to the size or
> >>>> some other feature of the uniprot distribution?
> >>>> I now could make it work using the 1.5.1 release.
> >>>>
> >>>> Originally, I've built the index using flat protocol, when I try bdb and
> >>>> bioperl-live even more problems occur:
> >>>>
> >>>> bp_bioflat_index.pl --dbname sw -i bdb -f swiss -l . -c uniprot_sprot.dat
> >>>>
> >>>> ------------- EXCEPTION  -------------
> >>>> MSG: The lineage 'Eukaryota, Metazoa, Chordata, Craniata, Vertebrata,
> >>>> Euteleostomi, Amphibia, Batrachia, Anura, Mesobatrachia, Pipoidea,
> >>>> Pipidae, Xenopodinae, Xenopus, Silurana, Xenopus, tropicalis' had two
> >>>> non-consecutive nodes with the same name. Can't cope!
> >>>> STACK Bio::DB::Taxonomy::list::add_lineage
> >>>> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy/list.pm:163
> >>>> STACK Bio::DB::Taxonomy::list::new
> >>>> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy/list.pm:100
> >>>> STACK Bio::DB::Taxonomy::new
> >>>> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy.pm:106
> >>>> STACK Bio::Species::classification
> >>>> /home/lang/bioperl/bioperl-live/Bio/Species.pm:171
> >>>> STACK Bio::SeqIO::swiss::_read_swissprot_Species
> >>>> /home/lang/bioperl/bioperl-live/Bio/SeqIO/swiss.pm:1049
> >>>> STACK Bio::SeqIO::swiss::next_seq
> >>>> /home/lang/bioperl/bioperl-live/Bio/SeqIO/swiss.pm:240
> >>>> STACK Bio::DB::Flat::parse_one_record
> >>>> /home/lang/bioperl/bioperl-live/Bio/DB/Flat.pm:333
> >>>> STACK Bio::DB::Flat::BDB::_index_file
> >>>> /home/lang/bioperl/bioperl-live/Bio/DB/Flat/BDB.pm:235
> >>>> STACK Bio::DB::Flat::BDB::build_index
> >>>> /home/lang/bioperl/bioperl-live/Bio/DB/Flat/BDB.pm:218
> >>>> STACK toplevel
> >>>> /share/apps/bioperl/bioperl-live/scripts_temp/bp_bioflat_index.pl:113
> >>>>
> >>>> But I think this is connected to the new changes to taxonomy handling in
> >>>> Bio::Taxon...
> >>>> I'm unsure whether to submit this separately, but I could also provide an
> >>>> example of such a swissprot entry that causes this error.
> >>>>
> >>>> Thanks, again.
> >>>>
> >>>> Daniel
> >>>>
> >>>> Brian Osborne wrote:
> >>>>> Daniel,
> >>>>>
> >>>>> Bug, presumably in SeqIO/swiss.pm. Can you send me a small file with
> >>>>> such a PRELIMINARY entry?
> >>>>>
> >>>>> Brian O.
> >>>>>
> >>>>>
> >>>>> On 9/1/06 6:11 AM, "Daniel Lang" <daniel.lang at biologie.uni-freiburg.de>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> when using Bio::Registry (bioperl-live) to fetch uniprot entries from
> >>>>>> local indexed uniprot *.dats, I had to realize that several entries
> >>>>>> could not be retrieved despite the fact that they are present in the
> >>>>>> files! A closer look reveals that they are of status PRELIMINARY:
> >>>>>>
> >>>>>> uniprot_trembl.dat:ID   Q16EZ1_AEDAE   PRELIMINARY;   PRT;   222 AA.
> >>>>>>
> >>>>>> I don't "grep" PRELIMINARY anywhere in my cvs checkout..
> >>>>>> I also can't retrieve the sequences from the online database defined as
> >>>>>> follows:
> >>>>>> [swissprot_ebi]
> >>>>>> protocol=biofetch
> >>>>>> location=http://www.ebi.ac.uk/cgi-bin/dbfetch
> >>>>>> dbname=swall
> >>>>>>
> >>>>>> Is this a bug or a feature? If it's a feature, how can I bypass it?
> >>>>>>
> >>>>>> Thanks in advance,
> >>>>>> Daniel
> >>>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >