[Biopython-dev] [Bug 2681] BioSQL: record annotations enhancements

Mon Nov 24 20:40:49 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2681

------- Comment #4 from cymon.cox at gmail.com  2008-11-24 15:40 EST -------
(In reply to comment #2)
> (In reply to comment #0)
> > 1) Fixed date/dates typo.
> 
> Why is it a typo?  Change not checked in.

The function _load_bioentry_date in Loader.py inserts the annotation 'date', if
present, or the current date if not, into the bioentry_qualifier_value table.
This is pulled by BioSeq.py _retrieve_qualifier_value and stored as the
attribute 'dates'. Hence I considered line 307 in BioSeq.py to be a typo, which
should be 'date' and not 'dates'. Also, because Loader.py handles dates
separately, they should not be handled by the function load_annotations.
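
To make the mismatch concrete, here is a minimal, self-contained sketch. It
does not use the real BioSQL code: the dict standing in for the
bioentry_qualifier_value table and the helpers store_date and
retrieve_annotations are invented purely for illustration.

```python
import datetime

# Hypothetical stand-in for rows of the bioentry_qualifier_value table.
fake_table = {}


def store_date(annotations):
    """Mimic the loader: store 'date' if present, else the current date."""
    default = datetime.date.today().strftime("%d-%b-%Y").upper()
    fake_table["date"] = annotations.get("date", default)


def retrieve_annotations(key_for_date):
    """Mimic retrieval, exposing the stored date under the given key."""
    return {key_for_date: fake_table["date"]}


original = {"date": "24-NOV-2008"}
store_date(original)

mismatched = retrieve_annotations("dates")  # the behaviour described above
round_trip = retrieve_annotations("date")   # the behaviour after the fix

print(mismatched == original)  # the key differs, so no round trip
print(round_trip == original)  # same key in and out
```

With the 'dates' key the annotation never round-trips, whatever the stored
value is; with 'date' it does.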

> > 2) Comments were being stored but not retrieved - fixed with test.
> 
> Looks good, except for returning an empty list if there were no comments.
> 
> > 3) A 'reference' annotation, even if an empty list, was being retrieved in a
> > DBSeqRecord. Fixed so that if there are no references there is no annotation
> > in DBSeqRecord.
> 
> I agree, but preferred a smaller change for this:
> 
> Checking in BioSQL/BioSeq.py;
> /home/repository/biopython/biopython/BioSQL/BioSeq.py,v  <--  BioSeq.py
> new revision: 1.33; previous revision: 1.32
> done
> Checking in Tests/test_BioSQL_SeqIO.py;
> /home/repository/biopython/biopython/Tests/test_BioSQL_SeqIO.py,v  <-- 
> test_BioSQL_SeqIO.py
> new revision: 1.29; previous revision: 1.28
> done

Actually, your version of _retrieve_comment never returns comments ;-)

On the wider issue: perhaps it's best if DBSeqRecords always have the same
set of attributes, even if comments and references are empty lists. Trying to
regenerate the attributes present in the loaded SeqRecord is, I think, not the
way to go, and not possible (or at least currently not attempted) for fasta
records. Perhaps we should be coding around the issue in the test suite rather
than changing the attributes of the DBSeqRecord so that it passes the test...
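
As a rough sketch of what "coding around the issue in the test suite" could
look like: a comparison helper that treats a missing annotation and an empty
list as equivalent. The helper names and the choice of 'comment' and
'references' as the optional keys are my assumptions, not what the current
tests do.

```python
def normalise(annotations, optional_keys=("comment", "references")):
    """Return a copy of the annotations with empty optional entries dropped."""
    cleaned = dict(annotations)
    for key in optional_keys:
        # An empty list (or other falsy value) is treated like "not present".
        if key in cleaned and not cleaned[key]:
            del cleaned[key]
    return cleaned


def annotations_match(loaded, retrieved):
    """Compare two annotation dicts, ignoring empty optional entries."""
    return normalise(loaded) == normalise(retrieved)


# A loaded record with no comments or references at all...
loaded = {"organism": "Homo sapiens"}
# ...versus a retrieved record that always carries the keys as empty lists.
retrieved = {"organism": "Homo sapiens", "comment": [], "references": []}

print(annotations_match(loaded, retrieved))
```

This way the DBSeqRecord can keep a fixed set of attributes and only the
comparison in the tests needs to know that empty and absent mean the same.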

> > Some swiss prot SeqRecords have ncbi_taxid and they are retrieved
> > correctly by DBSeqRecord. TODO: others have ncbi_taxid that is missing
> > from the retrieved DBSeqRecord: sp012, sp014, 
> 
> Note some swiss prot records may be multi-species, which the BioSQL schema
> can't cope with.  Not sure if that applies here.

Yep, that's exactly what was causing the problem. Currently the code refuses
to load an ncbi_taxid in that case, which I think is correct; after all, which
one should be loaded? Anyway, I'll look into this a bit more...
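
For illustration only, the "refuse when ambiguous" rule could be sketched as
below. This is not the Loader's actual logic: pick_taxid is a made-up helper,
and I am assuming the annotation holds either a single string or a list of
strings.

```python
def pick_taxid(annotations):
    """Return the taxon id only if the record names exactly one species."""
    taxids = annotations.get("ncbi_taxid", [])
    if isinstance(taxids, str):
        taxids = [taxids]
    unique = sorted(set(taxids))
    if len(unique) == 1:
        return unique[0]
    # Zero ids, or several (a multi-species record): nothing safe to load.
    return None


print(pick_taxid({"ncbi_taxid": ["9606"]}))
print(pick_taxid({"ncbi_taxid": ["9606", "10090"]}))
```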

> > Swissprot, fasta, and EMBL SeqRecords don't have a gi annotation, retrieved
> > DBSeqRecords do. Loader uses the 'record_id' (line 522) as the identifier in
> > bioentry, if the gi annotation is missing, which is pulled as the gi
> > annotation.
> 
> There probably is something not quite right here.  Are you talking about the
> bioentry.identifier entry in the database?  Perhaps an explicit example might
> help.  As an aside, I think "gi" (GenInfo Identifier, used by NCBI) might be better
> stored in the record.dbxrefs, but that could be a parser change...

Ah, OK, will look further into this as well...

> > 'contig' is ignored by loader because it's a SeqFeature object. Is there any
> > reason it couldn't be loaded and retrieved? (record is GenBank/NT_019265.gb)
> 
> I couldn't even say off hand how the CONTIG line in that example would be
> parsed, let alone how it gets dealt with when loading into BioSQL.

Well, the parser correctly deals with it as a SeqFeature (with a whole bunch of
sub_features), but it never gets loaded: it's not dealt with at all and falls
off the bottom of the function. I can't see any reason not to load it...

C.
