[BioSQL-l] Swissprot Problems
Hilmar Lapp
hlapp at gnf.org
Thu Aug 19 15:51:19 EDT 2004
Thanks, this is helpful. So it appears we need to add recognition of
the RG line to the swissprot SeqIO parser. I just committed a fix to
the main trunk. There should be a test as well, but I haven't had the
time yet, so everybody feel free to step in. I also added code to
handle RG lines when writing entries out, which meant extending
Bio::Annotation::Reference as well. I was being lazy and didn't check
the manual myself, and sure enough paid the price ...
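
For anyone who wants to see the idea in isolation, here is a minimal
sketch in Python. It is not the BioPerl change itself: the function name
parse_reference_block and the returned keys are made up for illustration.
It assumes each line starts with a two-letter line code followed by
spaces, collects the RA and RG lines of one reference block, and falls
back to the RG value when there is no RA line, so that an "authors"
column declared NOT NULL would still get a value.

```python
def parse_reference_block(lines):
    """Collect RA, RG, RT and RL values from one Swiss-Prot reference block.

    Hypothetical helper for illustration only; not part of BioPerl or BioSQL.
    """
    fields = {"RA": [], "RG": [], "RT": [], "RL": []}
    for line in lines:
        # The line code is everything before the first space.
        code, _, value = line.partition(" ")
        if code in fields:
            fields[code].append(value.strip())

    # Continuation lines repeat the line code, so join them with a space.
    joined = {code: " ".join(parts) for code, parts in fields.items()}
    authors = joined["RA"].rstrip(";")
    group = joined["RG"].rstrip(";")
    return {
        # Fall back to the reference group when no RA line is present.
        "authors": authors or group,
        "group": group,
        "title": joined["RT"].rstrip(";").strip('"'),
        "location": joined["RL"],
    }


# The reference block of the entry discussed in this thread (no RA line).
block = [
    "RN   [1]",
    "RP   SEQUENCE FROM N.A.",
    "RC   STRAIN=KB7;",
    "RX   MEDLINE=94103636; PubMed=7903973;",
    "RG   INTERNATIONAL PSEUDOMONAS AERUGINOSA TYPING STUDY GROUP;",
    'RT   "A multicenter comparison of methods for typing strains of Pseudomonas',
    'RT   aeruginosa predominantly from patients with cystic fibrosis.";',
    "RL   J. Infect. Dis. 169:134-142(1994).",
]
ref = parse_reference_block(block)
```

Whether the group name should really land in the authors field, or in a
field of its own, is a design choice; the sketch keeps both so either
can be stored.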
-hilmar
On Aug 19, 2004, at 1:38 AM, Dave Howorth wrote:
> Raphael A. Bauer wrote:
>> just an interesting thing from Swiss-Prot:
>> If we want to load the latest Swiss-Prot flatfile with
>> load_seqdatabase.pl we get the following error (normally our
>> load_seqdatabase.pl works fine):
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were
>> ("",""A multicenter comparison of methods for typing strains of Pseudomonas
>> aeruginosa predominantly from patients with cystic fibrosis."","J. Infect.
>> Dis. 169:134-142(1994).","CRC-237261AF859664D3","","") FKs (613823)
>> ERROR: null value in column "authors" violates not-null constraint
>> ---------------------------------------------------
>> Could not store Q53391:
>> ------------- EXCEPTION -------------
>> ...
>> And that is what I would expect, because Q53391 has no RA line.
>> The Swiss-Prot manual says:
>> RG   Reference group     Once or more (Optional if RA line)
>> RA   Reference authors   Once or more (Optional if RG line)
>> ...
>> ...
>> so I don't know how one can deal with this, because it is a clear
>> violation of the Swiss-Prot manual statements and therefore a violation
>> of the BioSQL schema definition (authors NOT NULL)...
>> We will remove the NOT NULL statements from the authors line in the
>> BioSQL schema to deal with this..
>
> Hilmar Lapp replied:
> >> We will remove the NOT NULL statements from the authors line in the
> >> BioSQL schema to deal with this..
> > Yep, I'll do this in the repository too.
>
>
> I'm a little confused by this. I'm interested in learning a bit about
> these entries so I went to browse the entry
> <http://www.ebi.uniprot.org/uniprot-srv/flatView.do?proteinId=FMK7_PSEAE&pager.offset=0>
> The relevant section seems to be:
>
> RN [1]
> RP SEQUENCE FROM N.A.
> RC STRAIN=KB7;
> RX MEDLINE=94103636; PubMed=7903973;
> RG INTERNATIONAL PSEUDOMONAS AERUGINOSA TYPING STUDY GROUP;
> RT "A multicenter comparison of methods for typing strains of Pseudomonas
> RT aeruginosa predominantly from patients with cystic fibrosis.";
> RL J. Infect. Dis. 169:134-142(1994).
>
> Then I went to the user manual
> <http://www.expasy.org/sprot/userman.html#Ref_line> where the relevant
> text seems to be:
>
> 3.10.5. The RG line
>
> The Reference Group (RG) line lists the consortium name associated
> with a given citation. The RG line is mainly used in submission
> reference blocks, but can also be used in paper references, if the
> working group is cited as an author in the paper. RG line and RA line
> (Reference Author) can be present in the same reference block; at
> least one RG or RA line is mandatory per reference block. An example
> of the use of RG lines is shown below:
>
> RG The mouse genome sequencing consortium;
>
> 3.10.6. The RA line
>
> The RA (Reference Author) lines list the authors of the paper (or
> other work) cited. The RA line is present in most references, but
> might be missing in references that cite a reference group (see RG
> line). At least one RG or RA line is mandatory per reference block.
> --------------------
>
> So it seems to me that the record is valid according to the spec and
> records do not need to have an RA line if they do have an RG line. It
> is probably appropriate to use the value of the RG field as the
> authors field in the database. Or am I missing something?
>
>
> >> Any better ideas?
> > No. There's not much you can do if people violate their own specs.
>
> There is another possible way to deal with errant records that violate
> the spec. That is to maintain an exception dictionary. That is, for
> each record that would fail validation, make a curated patch that can
> be applied to the record before validation. Clearly this can be a lot
> of work unless the initial record quality is already high. Submitting
> the exceptions back to the originating institution is good to do as
> well :)
>
> Cheers, Dave
> --
> Dave Howorth
> MRC Centre for Protein Engineering
> Hills Road, Cambridge, CB2 2QH
> 01223 252960
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------