[Biopython-dev] Genbank structured comments

Sun Sep 13 18:00:22 UTC 2015

All,

Code done, tests pass. I didn’t hear back about the dict-of-dicts approach, but it currently works as in this example:

        record = SeqIO.read(path.join('GenBank', 'KF527485.gbk'),'genbank')
        self.assertEqual(record.annotations['structured_comment']['Assembly-Data']['Assembly Method'], 'Lasergene v. 10')
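
For illustration, here is a minimal sketch of how a dict of dicts like this could be built from the raw comment lines. It is not the code in the patch: the marker strings, the `parse_structured_comment` name, and the hand-written input lines are assumptions modelled on the NCBI structured comment layout.

```python
from collections import defaultdict

# Assumed markers, modelled on the NCBI structured comment layout.
# These names are hypothetical, not necessarily the constants used in the patch.
START = "-START##"
END = "-END##"
DELIM = " :: "


def parse_structured_comment(lines):
    """Return {section: {key: value}} for structured comment lines."""
    result = defaultdict(dict)
    section = None
    for line in lines:
        line = line.strip()
        if line.startswith("##") and line.endswith(START):
            # "##Assembly-Data-START##" gives the section "Assembly-Data"
            section = line[2:-len(START)]
        elif line.startswith("##") and line.endswith(END):
            section = None
        elif section is not None and DELIM in line:
            key, value = line.split(DELIM, 1)
            result[section][key.strip()] = value.strip()
    return dict(result)


# Hand-written input for illustration only.
example = [
    "##Assembly-Data-START##",
    "Assembly Method       :: Lasergene v. 10",
    "##Assembly-Data-END##",
]
parsed = parse_structured_comment(example)
```

The primary key is the name between the `##` markers, and the secondary keys are the field names to the left of the `::` delimiter.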

However, I have a question. Will code that uses format() like this work, given that Python versions older than 2.6 might be in use?

	re.search(r"([^#]+){}$".format(STRUCTURED_COMMENT_START), data)
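
For reference, str.format() was added in Python 2.6, and the auto-numbered "{}" field only works from Python 2.7 and 3.1 onwards, so the line above would fail on 2.6 and earlier. A hedged sketch of two more portable spellings follows; the value given to STRUCTURED_COMMENT_START here is an assumption for illustration, and the re.escape() call is an extra precaution that the original line does not have.

```python
import re

# Assumed value, for illustration only.
STRUCTURED_COMMENT_START = "-START##"

# Option 1: plain concatenation, which does not need str.format() at all.
pattern = re.compile(r"([^#]+)" + re.escape(STRUCTURED_COMMENT_START) + "$")

# Option 2: an explicitly numbered field, which Python 2.6 accepts.
pattern_numbered = re.compile(
    r"([^#]+){0}$".format(re.escape(STRUCTURED_COMMENT_START))
)

match = pattern.search("##Assembly-Data-START##")
name = match.group(1)
```

Both options compile to the same pattern, so the choice is only about which Python versions need to be supported.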

Thanks again,

Brian O.

> On Sep 10, 2015, at 3:52 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
> 
> Chris,
> 
> BioPerl does what you might call a compromise (in bioperl-live, not in any CPAN release). If a structured comment appears in COMMENT it’s still part of the comment (a string) but no returns are removed, it stays tabular. Thus it’s easy to detect and parse.
> 
> Yes, if there is a ‘structured_comment’ dict it could have a primary key and secondary keys. This was my first thought. So something like:
> 
> defaultdict(<class 'dict'>, {'Assembly-Data': {'a': 1, 'b': 2, 'c': 3}})
> 
> Brian O.
> 
> 
>> On Sep 10, 2015, at 10:06 AM, Fields, Christopher J <cjfields at illinois.edu> wrote:
>> 
>> This is very similar to the issue BioPerl had with nested annotations; namely, that some annotation data from SwissProt (GENE NAME, I believe) had a hierarchical structure. It seems a bit thornier in this case, as the annotation would have both a standard comment field and a named collection of metadata tied together.
>> 
>> Brian, how is this implemented in BioPerl? 
>> 
>> chris
>> 
>>> On Sep 10, 2015, at 10:47 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> 
>>> Good question...
>>> 
>>> e.g. http://www.ncbi.nlm.nih.gov/nuccore/291609868
>>> and http://www.ncbi.nlm.nih.gov/nuccore/FJ966082
>>> 
>>> It almost makes me wonder if that should have top level
>>> keys of MIENS-Data and FluData - or is that too nested?
>>> 
>>> Peter
>>> 
>>> On Thu, Sep 10, 2015 at 4:37 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
>>>> Peter,
>>>> 
>>>> Another question, maybe the last one: what do we do with the “header” and “footer” strings, things like “FluData”, “GISAID_EpiFlu(TM)Data”, and “Assembly-Data”?
>>>> 
>>>> They could also be keys in the dict, of course. Values are ‘’?
>>>> 
>>>> Thanks again,
>>>> 
>>>> Brian O.
>>>> 
>>>> 
>>>>> On Sep 10, 2015, at 1:25 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>>> 
>>>>> On Wed, Sep 9, 2015 at 11:37 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
>>>>>> Chris,
>>>>>> 
>>>>>> This is the documentation I’m familiar with, but there may be more:
>>>>>> 
>>>>>> http://www.ncbi.nlm.nih.gov/genbank/structuredcomment
>>>>>> 
>>>>>> Peter, I can definitely separate these using ‘comment’ and
>>>>>> ‘structured_comment’ keys in the record.annotations dict.
>>>>>> 
>>>>>> If there’s no structured comment in the Genbank file, would
>>>>>> there simply be an empty dict in the SeqRecord?
>>>>>> 
>>>>>> E.g.
>>>>>> 
>>>>>> >>> record.annotations['structured_comment']
>>>>>> {}
>>>>> 
>>>>> That makes sense - equally no entry in the annotation dictionary
>>>>> would be reasonable.
>>>>> 
>>>>> Peter
>>>> 
>> 
> 
