[BioSQL-l] Swissprot Problems

Dave Howorth dhoworth at mrc-lmb.cam.ac.uk
Thu Aug 19 04:38:59 EDT 2004


Raphael A. Bauer wrote:
> just an interesting thing from Swiss-Prot:
> If we want to load the latest Swiss-Prot flatfile with
> load_seqdatabase.pl we get the following error (normally our
> load_seqdatabase.pl works fine):
> 
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were
> ("",""A multicenter comparison of methods for typing strains of Pseudomonas
> aeruginosa predominantly from patients with cystic fibrosis."","J. Infect.
> Dis. 169:134-142(1994).","CRC-237261AF859664D3","","") FKs (613823)
> ERROR:  null value in column "authors" violates not-null constraint
> ---------------------------------------------------
> Could not store Q53391:
> ------------- EXCEPTION  -------------
> ...
> 
> And that is what I would expect, because Q53391 has no RA line.
> The Swiss-Prot manual says:
> 
> RG    Reference group    Once or more (Optional if RA line)   
> RA    Reference authors    Once or more (Optional if RG line)
> ...
> ...
> so I don't know how one can deal with this, because it is a clear
> violation of the Swiss-Prot manual statements and therefore a violation
> of the BioSQL schema definition (authors NOT NULL)...
> 
> We will remove the NOT NULL statements from the authors line in the
> BioSQL schema to deal with this..

Hilmar Lapp replied:
 >> We will remove the NOT NULL statements from the authors line in the
 >> BioSQL schema to deal with this..
 > Yep, I'll do this in the repository too.


I'm a little confused by this. I wanted to learn a bit more about these 
entries, so I went to browse the entry: 
<http://www.ebi.uniprot.org/uniprot-srv/flatView.do?proteinId=FMK7_PSEAE&pager.offset=0> 

The relevant section seems to be:

RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=KB7;
RX   MEDLINE=94103636; PubMed=7903973;
RG   INTERNATIONAL PSEUDOMONAS AERUGINOSA TYPING STUDY GROUP;
RT   "A multicenter comparison of methods for typing strains of Pseudomonas
RT   aeruginosa predominantly from patients with cystic fibrosis.";
RL   J. Infect. Dis. 169:134-142(1994).

Then I went to the user manual 
<http://www.expasy.org/sprot/userman.html#Ref_line> where the relevant 
text seems to be:

     3.10.5. The RG line

The Reference Group (RG) line lists the consortium name associated with 
a given citation. The RG line is mainly used in submission reference 
blocks, but can also be used in paper references, if the working group 
is cited as an author in the paper. RG line and RA line (Reference 
Author) can be present in the same reference block; at least one RG or 
RA line is mandatory per reference block. An example of the use of RG 
lines is shown below:

RG   The mouse genome sequencing consortium;

     3.10.6. The RA line

The RA (Reference Author) lines list the authors of the paper (or other 
work) cited. The RA line is present in most references, but might be 
missing in references that cite a reference group (see RG line). At 
least one RG or RA line is mandatory per reference block.
--------------------

So it seems to me that the record is valid according to the spec: a 
reference block does not need an RA line if it has an RG line. It is 
probably appropriate to use the value of the RG line as the authors 
field in the database. Or am I missing something?
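
As a rough illustration of that fallback, here is a minimal Python sketch 
(not the bioperl-db loader, which is Perl). The function name 
reference_authors is hypothetical, and the sketch assumes the usual 
flatfile layout of a two-letter line code followed by three spaces:

```python
def reference_authors(block: str) -> str:
    """Return an authors string for one Swiss-Prot reference block.

    Prefers the RA lines; falls back to the RG lines when there is no RA.
    Raises ValueError when neither is present, since the user manual says
    at least one of them is mandatory.
    """
    fields = {"RA": [], "RG": []}
    for line in block.splitlines():
        # Assumed layout: line code in columns 1-2, data from column 6.
        code, value = line[:2], line[5:].strip()
        if code in fields and value:
            fields[code].append(value)
    for code in ("RA", "RG"):
        if fields[code]:
            # Join continuation lines and drop the trailing semicolon.
            return " ".join(fields[code]).rstrip(";")
    raise ValueError("reference block has neither an RA nor an RG line")


block = """\
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=KB7;
RX   MEDLINE=94103636; PubMed=7903973;
RG   INTERNATIONAL PSEUDOMONAS AERUGINOSA TYPING STUDY GROUP;
RT   "A multicenter comparison of methods for typing strains of Pseudomonas
RT   aeruginosa predominantly from patients with cystic fibrosis.";
RL   J. Infect. Dis. 169:134-142(1994).
"""
print(reference_authors(block))
```

With a fallback like this the authors column could stay NOT NULL, at the 
cost of mixing group names and personal names in one field.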


 >> Any better ideas?
 > No. There's not much you can do if people violate their own specs.

There is another possible way to deal with errant records that really 
do violate the spec: maintain an exception dictionary. For each record 
that would fail validation, make a curated patch that is applied to the 
record before validation. Clearly this can be a lot of work unless the 
initial record quality is already high. Submitting the exceptions back 
to the originating institution is good to do as well :)
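
A minimal Python sketch of such an exception dictionary might look like 
the following. Everything in it is invented for illustration: the 
accession X00001, the record text, the patch, and the toy validate() 
rule (at least one RA or RG line) are assumptions, not real Swiss-Prot 
data or any existing tool's API.

```python
# Hypothetical exception dictionary: accession -> list of (old, new)
# text replacements curated by hand for records that fail validation.
PATCHES = {
    "X00001": [("RN   [1]\n", "RN   [1]\nRG   EXAMPLE CONSORTIUM;\n")],
}


def validate(record: str) -> bool:
    """Toy validator: the record needs at least one RA or RG line."""
    return any(
        line.startswith(("RA   ", "RG   ")) for line in record.splitlines()
    )


def apply_patches(accession: str, record: str, patches=PATCHES) -> str:
    """Apply the curated patches for this accession, if there are any."""
    for old, new in patches.get(accession, []):
        if old not in record:
            # The upstream record changed, so the patch needs re-curating.
            raise ValueError(f"patch for {accession} no longer matches")
        record = record.replace(old, new, 1)
    return record


# Invented record with neither an RA nor an RG line.
record = (
    "RN   [1]\n"
    'RT   "An invented title.";\n'
    "RL   Invented J. 1:1-2(2004).\n"
)
patched = apply_patches("X00001", record)
print(validate(record), validate(patched))
```

Failing loudly when a patch no longer matches is one way to notice that 
the originating institution has fixed the record, so the exception can 
be retired.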

Cheers, Dave
-- 
Dave Howorth
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH
01223 252960


