[Biopython-dev] Genbank structured comments

Wed Sep 9 13:56:13 UTC 2015

All,

I noticed that Biopython, like the versions of BioPerl on CPAN, does not handle GenBank structured comments (http://www.ncbi.nlm.nih.gov/genbank/structuredcomment) ideally. Here’s an example structured comment:

COMMENT     ##FluData-START##
           EPI_ISOLATE_ID        :: EPI_ISL_77637
           NAME                  :: A/California/07/2009
           TYPE                  :: H1N1
           Segment_name          :: M'
           HOST_AGE              :: 54
           HOST_GENDER           :: F'
           PASSAGE               :: M1/C1 (2009-04-24)
           LOCATION              :: United States / California'
           COLLECT_DATE          :: 09-Apr-2009
           Lineage               :: A(H1N1)pdm09
           RESIST_TO_ADAMANTANES :: Resistant'
           RESIST_TO_OSELTAMIVIR :: Sensitive'
           RESIST_TO_ZANAMVIR    :: Sensitive'
           SPECIMEN_ID           :: H13596
           SENDER_LAB            :: Naval Health Research Center'
           SEQLAB_SAMPLE_ID      :: 2009712111
           EPI_SEQUENCE_ID       :: EPI273604
           ##FluData-END##

Another example is here: http://www.ncbi.nlm.nih.gov/nuccore/291609868

A structured comment is a table of tag/value pairs. A fair number of bacterial genomes in GenBank use the structured comment to hold MIGS/MIMS data. The comment() method should return something like this, with one tag/value pair per line, which is easily parsed:

##FluData-START##
EPI_ISOLATE_ID        :: EPI_ISL_77637
NAME                  :: A/California/07/2009
TYPE                  :: H1N1
Segment_name          :: M'
HOST_AGE              :: 54
HOST_GENDER           :: F'
PASSAGE               :: M1/C1 (2009-04-24)
LOCATION              :: United States / California'
COLLECT_DATE          :: 09-Apr-2009
Lineage               :: A(H1N1)pdm09
RESIST_TO_ADAMANTANES :: Resistant'
RESIST_TO_OSELTAMIVIR :: Sensitive'
RESIST_TO_ZANAMVIR    :: Sensitive'
SPECIMEN_ID           :: H13596
SENDER_LAB            :: Naval Health Research Center'
SEQLAB_SAMPLE_ID      :: 2009712111
EPI_SEQUENCE_ID       :: EPI273604
##FluData-END##
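
As a rough sketch of what "easily parsed" means here, the newline-preserving form can be turned into a dictionary with a few lines of Python. The function name parse_structured_comment is hypothetical and not part of Biopython; the sketch assumes each block is delimited by ##NAME-START## and ##NAME-END## lines and that tags and values are separated by "::".

```python
def parse_structured_comment(comment):
    """Return {block_name: {tag: value}} for every structured block found.

    Hypothetical helper, not a Biopython API. Assumes the comment text
    keeps one tag/value pair per line.
    """
    blocks = {}
    current = None  # name of the block currently being read, if any
    for line in comment.splitlines():
        line = line.strip()
        if line.startswith("##") and line.endswith("-START##"):
            # "##FluData-START##" -> "FluData"
            current = line[2:-len("-START##")]
            blocks[current] = {}
        elif line.startswith("##") and line.endswith("-END##"):
            current = None
        elif current is not None and "::" in line:
            # Split on the first "::" only, so values may contain "::"
            tag, value = line.split("::", 1)
            blocks[current][tag.strip()] = value.strip()
    return blocks
```

Given the comment text above, the result would have a single "FluData" key whose value maps each tag (for example TYPE) to its value (H1N1). Lines outside a START/END pair are ignored by this sketch.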

Rather than this, which is what it currently returns:

##FluData-START## EPI_ISOLATE_ID        :: EPI_ISL_77637 NAME                  :: A/California/07/2009 TYPE                  :: H1N1 Segment_name          :: M' HOST_AGE              :: 54 HOST_GENDER           :: F' PASSAGE               :: M1/C1 (2009-04-24) LOCATION              :: United States / California' COLLECT_DATE          :: 09-Apr-2009 Lineage               :: A(H1N1)pdm09 RESIST_TO_ADAMANTANES :: Resistant' RESIST_TO_OSELTAMIVIR :: Sensitive' RESIST_TO_ZANAMVIR    :: Sensitive' SPECIMEN_ID           :: H13596 SENDER_LAB            :: Naval Health Research Center' SEQLAB_SAMPLE_ID      :: 2009712111 EPI_SEQUENCE_ID       :: EPI273604 ##FluData-END##

Are there any objections to me submitting a pull request with this change? I made this same fix in BioPerl. Of course, if the comment is a “normal” one, it will be treated the same as it is treated now. In other words, the vast majority of comments stay the same.
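
To make the proposed behaviour concrete, here is a minimal sketch of the decision, not the actual Biopython parser code. The name join_comment_lines and the regular expression are assumptions for illustration, and it assumes normal comments are currently joined with single spaces, as in the flattened output above.

```python
import re

# Assumed marker for the start of a structured block, e.g. "##FluData-START##"
STRUCTURED_START = re.compile(r"^##.+-START##$")


def join_comment_lines(lines):
    """Join the lines of one COMMENT section into a single string.

    Hypothetical helper: keeps line breaks only when the comment contains
    a structured block, so normal comments come out unchanged.
    """
    stripped = [line.strip() for line in lines]
    if any(STRUCTURED_START.match(line) for line in stripped):
        # Structured comment: one tag/value pair per line
        return "\n".join(stripped)
    # Normal comment: same space-joined text as before
    return " ".join(stripped)
```

A comment with no ##...-START## line takes the second branch, so its text is the same as it is today.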

I’ll also add tests.

Thanks again,

Brian O.