[Biopython-dev] More SwissProt inconsistencies

Michiel de Hoon mjldehoon at yahoo.com
Sat May 30 05:37:35 EDT 2009


Looking some more at how Bio.SeqIO and Bio.SwissProt store the information in a SwissProt file, I found the following two inconsistencies:

1) A multi-line author list such as the following:
RA   Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,
RA   Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,
RA   Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,
RA   Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,
RA   Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,
RA   Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,
RA   Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,
RA   Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,
RA   Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,
RA   Barrell B.G., Hall N.;
is stored without newlines by Bio.SeqIO:
>>> seq_record.annotations['references'][0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,Barrell B.G., Hall N.;"
but with newlines by Bio.SwissProt:
>>> swiss_record.references[0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,\nKerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,\nCoulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,\nGardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,\nLarke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,\nNene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,\nRawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,\nSquares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,\nLangsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,\nBarrell B.G., Hall N.;"

To me, the Bio.SeqIO approach seems more reasonable, though I think we should add a space wherever there is a newline in the file.

The same happens for multi-line RL lines such as

RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.);
RL   Proceedings of the XVII international grassland congress,
RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993).

and for multiline RT lines such as

RT   "Genome of the host-cell transforming parasite Theileria annulata
RT   compared with T. parva.";

This is stored by Bio.SeqIO as

'"Genome of the host-cell transforming parasite Theileria annulatacompared with T. parva.";'

and by Bio.SwissProt as

'"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";'

whereas I think that both should be stored as

'"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";'
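
As a rough sketch of the behaviour proposed here (this is not the actual Bio.SeqIO or Bio.SwissProt code; join_continuation_lines is a hypothetical helper, and it assumes the data on each line starts in column 6, after the two-letter code and three spaces), the continuation lines could be joined with a single space:

```python
def join_continuation_lines(lines, prefix):
    """Join the data of all lines starting with `prefix`, separated by single spaces.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    parts = []
    for line in lines:
        if line.startswith(prefix):
            # Assumed layout: two-letter line code, three spaces, then the data.
            parts.append(line[5:].strip())
    # A space (rather than "" or "\n") marks where the file had a line break.
    return " ".join(parts)


rt_lines = [
    'RT   "Genome of the host-cell transforming parasite Theileria annulata',
    'RT   compared with T. parva.";',
]
title = join_continuation_lines(rt_lines, "RT")
print(title)
```

The same helper would apply unchanged to the RA and RL lines, since only the two-letter prefix differs.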


2) Comments in a reference such as the following:
RC   STRAIN=cv. VF36; TISSUE=Anther;
are stored as a single string by Bio.SeqIO:
>>> seq_record.annotations['references'][i].comment
'STRAIN=cv. VF36; TISSUE=Anther;'
but as a list of (key, value) pairs by Bio.SwissProt:
[('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')]
While both representations seem reasonable to me, Bio.SeqIO drops the space between two (key, value) pairs when they are on separate lines:
RC   STRAIN=C57BL/6J;
RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;
is stored as
>>> seq_record.annotations['references'][i].comment
'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;'
I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing.
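
To make the two options concrete, here is a minimal sketch (parse_rc_lines is a hypothetical helper, not the existing parser code; it assumes that values never contain ';' and that every token has the form KEY=value) that builds the (key, value) pairs and then rebuilds the single string with a space between pairs:

```python
def parse_rc_lines(lines):
    """Turn RC lines into a list of (key, value) pairs.

    Hypothetical helper for illustration only; not part of Biopython.
    Assumes ';' only appears as the terminator of a KEY=value token.
    """
    # Join the data of all RC lines with a space so nothing runs together.
    text = " ".join(line[5:].strip() for line in lines if line.startswith("RC"))
    pairs = []
    for token in text.split(";"):
        token = token.strip()
        if not token:
            continue  # skip the empty piece after the final ';'
        key, _, value = token.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


rc_lines = [
    "RC   STRAIN=C57BL/6J;",
    "RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;",
]
pairs = parse_rc_lines(rc_lines)
# Single-string form, with a space between consecutive pairs.
comment = " ".join("%s=%s;" % pair for pair in pairs)
print(pairs)
print(comment)
```

Either form can be derived from the other under these assumptions, so the choice is mainly about which one is more convenient for users.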

Any objections or comments?

--Michiel


      

