[Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines
Michiel de Hoon
mjldehoon at yahoo.com
Sun Jun 7 07:38:10 EDT 2009
Hi everybody,
Comments in SwissProt files such as the following:
CC -!- FUNCTION: Core subunit of the mitochondrial membrane respiratory
CC chain NADH dehydrogenase (Complex I) that is believed to belong to
CC the minimal assembly required for catalysis. Complex I functions
CC in the transfer of electrons from NADH to the respiratory chain.
CC The immediate electron acceptor for the enzyme is believed to be
CC ubiquinone (By similarity).
CC -!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.
CC -!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane
CC protein (By similarity).
CC -!- SIMILARITY: Belongs to the complex I subunit 3 family.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
are currently being stored differently by Bio.SeqIO and Bio.SwissProt.
Bio.SeqIO stores the comments as one string, as follows:
>>> record.annotations['comment']
'-!- FUNCTION: Core subunit of the mitochondrial membrane respiratory\n\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n\n the minimal assembly required for catalysis. Complex I functions\n\n in the transfer of electrons from NADH to the respiratory chain.\n\n The immediate electron acceptor for the enzyme is believed to be\n\n ubiquinone (By similarity).\n\n-!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n\n-!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n\n protein (By similarity).\n\n-!- SIMILARITY: Belongs to the complex I subunit 3 family.\n\n-----------------------------------------------------------------------\n\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\n\nDistributed under the Creative Commons Attribution-NoDerivs License\n\n-----------------------------------------------------------------------\n'
Note that two newlines appear at the end of each line; I don't know why.
Bio.SwissProt, on the other hand, stores a list of comments (with single newlines):
>>> record.comments
[' FUNCTION: Core subunit of the mitochondrial membrane respiratory\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n the minimal assembly required for catalysis. Complex I functions\n in the transfer of electrons from NADH to the respiratory chain.\n The immediate electron acceptor for the enzyme is believed to be\n ubiquinone (By similarity).\n', ' CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n', ' SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n protein (By similarity).\n', ' SIMILARITY: Belongs to the complex I subunit 3 family.\n', '-----------------------------------------------------------------------\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\nDistributed under the Creative Commons Attribution-NoDerivs License\n-----------------------------------------------------------------------\n']
I think that the approach used by Bio.SwissProt is more reasonable, although I'd prefer to remove the newlines and to skip the copyright statement altogether (since it's the same for all SwissProt records anyway).
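To make that preference concrete, here is a minimal sketch in plain Python. It is not Biopython code: the helper name tidy_comments is made up for illustration, and it assumes the copyright block is always the list entry that starts with a run of dashes, as in the record above.

```python
def tidy_comments(comments):
    """Return the comments with line breaks removed and the copyright block dropped.

    Hypothetical helper, not part of Bio.SwissProt.
    """
    tidied = []
    for comment in comments:
        # Assumption: the copyright block is the entry framed by dashed lines.
        if comment.startswith("-----"):
            continue
        # split() with no argument breaks on any whitespace, newlines included,
        # so joining with single spaces yields a one-line comment.
        tidied.append(" ".join(comment.split()))
    return tidied


comments = [
    " CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n",
    " SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n"
    " protein (By similarity).\n",
    "-----\nCopyrighted by the UniProt Consortium\n-----\n",
]
for comment in tidy_comments(comments):
    print(comment)
```

Each remaining comment then fits on one line, which would also make the list easy to join into a single string if that is needed.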
Can we do the same for Bio.SeqIO? Or is there a need to keep record.annotations['comment'] as a single string? If the comments are kept as a single string, how about using a single newline between comments, and no newlines within comments?
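If the single string has to stay, the layout I have in mind could be produced roughly as follows. This is only a sketch in plain Python, not what either parser does today: the function name one_comment_per_line is hypothetical, and it assumes that "-!-" only ever appears as the comment marker and that the first run of dashes marks the start of the copyright block.

```python
def one_comment_per_line(text):
    """Reformat a CC block into one comment per line, without the copyright block.

    Hypothetical helper, not part of Bio.SeqIO.
    """
    comments = []
    for chunk in text.split("-!-"):
        # Assumption: everything from the first dashed line on is the
        # copyright block, so keep only what precedes it.
        chunk = chunk.split("-----")[0]
        words = chunk.split()
        if words:  # skips the empty piece before the first "-!-"
            comments.append(" ".join(words))
    return "\n".join(comments)


sample = (
    "-!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n\n"
    "-!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n\n"
    " protein (By similarity).\n\n"
    "-----\n\nCopyrighted by the UniProt Consortium\n\n-----\n"
)
print(one_comment_per_line(sample))
```

Calling splitlines() on the result would give the list form back, so the two representations stay easy to convert.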
By the way, this is the last inconsistency between Bio.SeqIO and Bio.SwissProt. Once it is resolved, Bio.SeqIO could use Bio.SwissProt as a backend, which is about three times faster than the current parser, and we would have only one SwissProt parser to maintain.
--Michiel.