[Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines

Sun Jun 7 11:38:10 UTC 2009

Hi everybody,

Comments in SwissProt files such as the following:

CC   -!- FUNCTION: Core subunit of the mitochondrial membrane respiratory
CC       chain NADH dehydrogenase (Complex I) that is believed to belong to
CC       the minimal assembly required for catalysis. Complex I functions
CC       in the transfer of electrons from NADH to the respiratory chain.
CC       The immediate electron acceptor for the enzyme is believed to be
CC       ubiquinone (By similarity).
CC   -!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.
CC   -!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane
CC       protein (By similarity).
CC   -!- SIMILARITY: Belongs to the complex I subunit 3 family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------

are currently being stored differently by Bio.SeqIO and Bio.SwissProt.

Bio.SeqIO stores the comments as one string, as follows:

>>> record.annotations['comment']
'-!- FUNCTION: Core subunit of the mitochondrial membrane respiratory\n\n    chain NADH dehydrogenase (Complex I) that is believed to belong to\n\n    the minimal assembly required for catalysis. Complex I functions\n\n    in the transfer of electrons from NADH to the respiratory chain.\n\n    The immediate electron acceptor for the enzyme is believed to be\n\n    ubiquinone (By similarity).\n\n-!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n\n-!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n\n    protein (By similarity).\n\n-!- SIMILARITY: Belongs to the complex I subunit 3 family.\n\n-----------------------------------------------------------------------\n\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\n\nDistributed under the Creative Commons Attribution-NoDerivs License\n\n-----------------------------------------------------------------------\n'

Note that each line ends with two newlines instead of one; I don't know why.

Bio.SwissProt, on the other hand, stores a list of comments (with single newlines):

>>> record.comments
[' FUNCTION: Core subunit of the mitochondrial membrane respiratory\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n the minimal assembly required for catalysis. Complex I functions\n in the transfer of electrons from NADH to the respiratory chain.\n The immediate electron acceptor for the enzyme is believed to be\n ubiquinone (By similarity).\n', ' CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n', ' SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n protein (By similarity).\n', ' SIMILARITY: Belongs to the complex I subunit 3 family.\n', '-----------------------------------------------------------------------\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\nDistributed under the Creative Commons Attribution-NoDerivs License\n-----------------------------------------------------------------------\n']

I think that the approach used by Bio.SwissProt is more reasonable, although I'd prefer to remove the newlines and to skip the copyright statement altogether (since it's the same for all SwissProt records anyway).
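A minimal sketch of that clean-up, assuming the comments arrive as a list of strings shaped like the record.comments output above. The helper name normalize_comments is hypothetical and not part of Biopython, and the check for a leading dashed rule is an assumption about how the copyright block can be recognised:

```python
def normalize_comments(comments):
    """Collapse each comment onto one line and drop the copyright block.

    Hypothetical helper, not part of Biopython.
    """
    cleaned = []
    for comment in comments:
        # Assumption: only the copyright notice starts with a dashed rule.
        if comment.startswith("-----"):
            continue
        # str.split() with no arguments discards newlines and runs of spaces,
        # so joining with a single space removes the embedded line breaks.
        cleaned.append(" ".join(comment.split()))
    return cleaned


example = [
    " CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n",
    " SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n"
    " protein (By similarity).\n",
    "-----------------------------------------------------------------------\n"
    "Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\n"
    "-----------------------------------------------------------------------\n",
]
result = normalize_comments(example)
```

Here result would hold two one-line strings, one per comment topic, with the copyright entry gone.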

Can we do the same for Bio.SeqIO? Or is there a need to keep record.annotations['comment'] as a single string? If the comments are kept as a single string, how about using a single newline between comments, and no newlines within comments?
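To illustrate the single-string variant, here is a rough sketch that turns a string like the current Bio.SeqIO annotation into one comment per line. The function name one_line_per_comment is hypothetical, and the code assumes that "-!-" only ever marks the start of a comment and that the first dashed rule starts the copyright block (so a comment that itself contained "-----" would be cut short):

```python
def one_line_per_comment(text):
    """Return the comments as one string, one comment per line.

    Hypothetical helper, not part of Biopython.
    """
    entries = []
    for chunk in text.split("-!-"):
        # Assumption: everything from the first dashed rule on is the
        # copyright block, which is dropped here.
        chunk = chunk.split("-----", 1)[0]
        # Remove the newlines and indentation inside the comment.
        chunk = " ".join(chunk.split())
        if chunk:
            entries.append(chunk)
    return "\n".join(entries)


raw = (
    "-!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n\n"
    "-!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n\n"
    "    protein (By similarity).\n\n"
    "----------\n\n"
    "Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\n\n"
    "----------\n"
)
flat = one_line_per_comment(raw)
```

With this layout, flat.split("\n") gives back the individual comments, so the single string and the list form carry the same information.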

This, by the way, is the last inconsistency between Bio.SeqIO and Bio.SwissProt. Once it is resolved, Bio.SeqIO could use Bio.SwissProt as a backend, which is about three times faster than the current parser, with the added benefit that only one SwissProt parser would need to be maintained.

--Michiel.
