[Biopython-dev] Consumer of "KW" in embl format

Peter Cock p.j.a.cock at googlemail.com
Tue Mar 12 10:02:15 UTC 2013


On Tue, Mar 12, 2013 at 9:54 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Mar 12, 2013 at 9:36 AM, Xabier Bello <xbello at gmail.com> wrote:
>> Hi:
>>
>> I don't know if this is the right way to do this. The code:
>>
>> from Bio import SeqIO
>>
>> records = SeqIO.parse("MyFile.embl", "embl")
>> for record in records:
>>     print(record.annotations["keywords"])
>>
>> Doesn't work
>>
>> I've added to Bio/GenBank/Scanner.py, in _feed_header_lines():
>>
>> elif line_type == 'KW':
>>     consumer.keywords(data.rstrip(";"))
>>
>> And now it seems to parse the keyword lines.
>>
>> Regards.
>
> Good idea, although it needs a little more generalisation for handling
> multiple keywords - a list of strings seems sensible here. Quoting
> ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt
>
> <quote>
> 3.4.6  The KW Line
> The KW (KeyWord) lines provide information which can be used to generate
> cross-reference indexes of the sequence entries based on functional,
> structural, or other categories deemed important.
> The format for a KW line is:
>      KW   keyword[; keyword ...].
> More than one keyword may be listed on each KW line; the keywords are
> separated by semicolons, and the last keyword is followed by a full
> stop. Keywords may consist of more than one word, and they may contain
> embedded blanks and stops. A keyword is never split between lines.
> An example of a keyword line is:
>      KW   beta-glucosidase.
> The keywords are ordered alphabetically; the ordering implies no hierarchy
> of importance or function.  If an entry has no keywords assigned to it,
> it will contain a single KW line like this:
>      KW   .
> </quote>
>
> Likewise the GenBank parser should support the KEYWORDS line
> too - and then writing the keywords out again too.
>
> Is this something you'd like to work on, or should I do it?

To clarify: Biopython should already be reading and writing any
KEYWORDS line in GenBank files. The same data structure (a list of
strings) should be used for EMBL files. Your suggestion looks good,
but an explicit unit test covering single and multiple keywords would
be ideal. The EMBL writer would then need updating to write the
keywords out, i.e. code added in Bio/SeqIO/InsdcIO.py
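
As a rough illustration of the list-of-strings handling, here is a
standalone sketch based only on the KW line format quoted above. It is
not the actual Scanner.py change, and parse_kw_lines is a hypothetical
name rather than anything in Biopython:

```python
def parse_kw_lines(lines):
    """Return the keywords from raw EMBL KW lines as a list of strings.

    Hypothetical helper for illustration only, not part of Biopython.
    """
    # Drop the five-character "KW   " line prefix and join the lines;
    # a keyword is never split between lines, so a space is a safe joiner.
    data = " ".join(line[5:].strip() for line in lines if line.startswith("KW"))
    # The last keyword is followed by a full stop. Remove only that one,
    # because keywords may themselves contain embedded stops.
    if data.endswith("."):
        data = data[:-1]
    # Keywords are separated by semicolons; skip empty pieces so that
    # the "KW   ." (no keywords) case gives an empty list.
    return [kw.strip() for kw in data.split(";") if kw.strip()]


single = parse_kw_lines(["KW   beta-glucosidase."])
multiple = parse_kw_lines(["KW   alpha; beta;", "KW   gamma delta."])
empty = parse_kw_lines(["KW   ."])
```

A unit test could check the same three cases (single keyword,
multiple keywords over several lines, and the empty "KW   ." line).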

Peter


