[Biopython-dev] Genbank structured comments
Brian Osborne
bosborne11 at verizon.net
Sun Sep 13 18:00:22 UTC 2015
All,
The code is done and the tests pass. I didn’t hear back about the dict-of-dicts approach, but it currently works as in this example:
record = SeqIO.read(path.join('GenBank', 'KF527485.gbk'),'genbank')
self.assertEqual(record.annotations['structured_comment']['Assembly-Data']['Assembly Method'], 'Lasergene v. 10')
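For anyone who wants to see the shape of the result without the test file, here is a rough, standalone sketch of how a structured comment block could be turned into that dict of dicts. It is not the actual patch: the function name parse_structured_comment and the marker values for STRUCTURED_COMMENT_END and STRUCTURED_COMMENT_DELIM are assumptions made for illustration, and the sample contains only the one value quoted in the test above.

```python
import re
from collections import defaultdict

# Assumed marker values, chosen to match the "##Name-START##" layout
# of GenBank structured comments; not taken from the patch itself.
STRUCTURED_COMMENT_START = "-START##"
STRUCTURED_COMMENT_END = "-END##"
STRUCTURED_COMMENT_DELIM = " :: "


def parse_structured_comment(lines):
    """Return {header: {key: value}} for structured blocks found in lines."""
    result = defaultdict(dict)
    header = None
    start_re = re.compile(r"([^#]+)" + re.escape(STRUCTURED_COMMENT_START) + "$")
    for line in lines:
        line = line.strip()
        start = start_re.search(line)
        if start:
            # "##Assembly-Data-START##" gives the header "Assembly-Data"
            header = start.group(1)
            continue
        if line.endswith(STRUCTURED_COMMENT_END):
            header = None
            continue
        if header is not None and STRUCTURED_COMMENT_DELIM in line:
            key, value = line.split(STRUCTURED_COMMENT_DELIM, 1)
            result[header][key.strip()] = value.strip()
    return dict(result)


sample = [
    "##Assembly-Data-START##",
    "Assembly Method :: Lasergene v. 10",
    "##Assembly-Data-END##",
]
parsed = parse_structured_comment(sample)
```

Lines outside a START/END pair are ignored here, so an ordinary free-text comment would yield an empty dict.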
However, I have a question: will code that uses format() like this work, given that versions of Python older than 2.6 might be used?
re.search(r"([^#]+){}$".format(STRUCTURED_COMMENT_START), data)
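For reference, str.format() was added in Python 2.6, and the empty "{}" field (auto-numbering) only arrived in 2.7 and 3.1, so on 2.6 the call above raises a ValueError. A sketch of spellings that avoid the problem follows; the value of STRUCTURED_COMMENT_START and the sample data string are assumptions for illustration only.

```python
import re

# Assumed value for illustration; the real constant lives in the parser.
STRUCTURED_COMMENT_START = "-START##"
data = "##Assembly-Data-START##"

# Explicit index: works wherever str.format() exists (Python 2.6 and later).
pattern_indexed = r"([^#]+){0}$".format(STRUCTURED_COMMENT_START)

# %-formatting: works on older interpreters as well.
pattern_percent = r"([^#]+)%s$" % STRUCTURED_COMMENT_START

# Plain concatenation, with the marker escaped in case it ever
# contains regex metacharacters.
pattern_concat = r"([^#]+)" + re.escape(STRUCTURED_COMMENT_START) + "$"

match = re.search(pattern_percent, data)
header = match.group(1)
```

All three patterns match the same text here; the %-formatting or concatenation forms are the safest choice if very old interpreters matter.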
Thanks again,
Brian O.
> On Sep 10, 2015, at 3:52 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
>
> Chris,
>
> BioPerl does what you might call a compromise (in bioperl-live, not in any CPAN release). If a structured comment appears in COMMENT, it’s still part of the comment (a string), but no line breaks are removed, so it stays tabular. Thus it’s easy to detect and parse.
>
> Yes, if there is a ‘structured_comment’ dict it could have a primary key and secondary keys. This was my first thought. So something like:
>
> defaultdict(<class 'dict'>, {'Assembly-Data': {'a': 1, 'b': 2, 'c': 3}})
>
> Brian O.
>
>
>> On Sep 10, 2015, at 10:06 AM, Fields, Christopher J <cjfields at illinois.edu> wrote:
>>
>> This is very similar to the issue BioPerl had with nested annotations; namely, that some annotation data from SwissProt (GENE NAME, I believe) had a hierarchical structure. It seems a bit thornier in this case, as the annotation would have both a standard comment field and a named collection of metadata tied together.
>>
>> Brian, how is this implemented in BioPerl?
>>
>> chris
>>
>>> On Sep 10, 2015, at 10:47 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>
>>> Good question...
>>>
>>> e.g. http://www.ncbi.nlm.nih.gov/nuccore/291609868
>>> and http://www.ncbi.nlm.nih.gov/nuccore/FJ966082
>>>
>>> It almost makes me wonder if that should have top level
>>> keys of MIENS-Data and FluData - or is that too nested?
>>>
>>> Peter
>>>
>>> On Thu, Sep 10, 2015 at 4:37 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
>>>> Peter,
>>>>
>>>> Another question, maybe the last one: what do we do with the “header” and “footer” strings, things like “FluData”, “GISAID_EpiFlu(TM)Data”, and “Assembly-Data”?
>>>>
>>>> They could also be keys in the dict, of course. Values are ‘’?
>>>>
>>>> Thanks again,
>>>>
>>>> Brian O.
>>>>
>>>>
>>>>> On Sep 10, 2015, at 1:25 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>>>
>>>>> On Wed, Sep 9, 2015 at 11:37 PM, Brian Osborne <bosborne11 at verizon.net> wrote:
>>>>>> Chris,
>>>>>>
>>>>>> This is the documentation I’m familiar with, but there may be more:
>>>>>>
>>>>>> http://www.ncbi.nlm.nih.gov/genbank/structuredcomment
>>>>>>
>>>>>> Peter, I can definitely separate these using ‘comment’ and
>>>>>> ‘structured_comment’ keys in the record.annotations dict.
>>>>>>
>>>>>> If there’s no structured comment in the Genbank file, would
>>>>>> there simply be an empty dict in the SeqRecord?
>>>>>>
>>>>>> E.g.
>>>>>>
>>>>>> >>> record.annotations['structured_comment']
>>>>>> {}
>>>>>
>>>>> That makes sense - equally no entry in the annotation dictionary
>>>>> would be reasonable.
>>>>>
>>>>> Peter
>>>>
>>
>