[Biopython-dev] Consumer of "KW" in embl format

Tue Mar 12 09:54:51 UTC 2013

On Tue, Mar 12, 2013 at 9:36 AM, Xabier Bello <xbello at gmail.com> wrote:
> Hi:
>
> I don't know if this is the right way to do this. The code:
>
> from Bio import SeqIO
>
> # Parse every record in the EMBL file and show its keywords.
> for record in SeqIO.parse("MyFile.embl", "embl"):
>     print(record.annotations["keywords"])
>
> Doesn't work
>
> I've added to Bio/GenBank/Scanner.py, in _feed_header_lines():
>
> elif line_type == 'KW':
>     consumer.keywords(data.rstrip(";"))
>
> And now it seems to parse the keyword lines.
>
> Regards.

Good idea, although it needs generalising to handle multiple
keywords; a list of strings seems sensible here. Quoting
ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

<quote>
3.4.6  The KW Line
The KW (KeyWord) lines provide information which can be used to generate
cross-reference indexes of the sequence entries based on functional,
structural, or other categories deemed important.
The format for a KW line is:
     KW   keyword[; keyword ...].
More than one keyword may be listed on each KW line; the keywords are
separated by semicolons, and the last keyword is followed by a full
stop. Keywords may consist of more than one word, and they may contain
embedded blanks and stops. A keyword is never split between lines.
An example of a keyword line is:
     KW   beta-glucosidase.
The keywords are ordered alphabetically; the ordering implies no hierarchy
of importance or function.  If an entry has no keywords assigned to it,
it will contain a single KW line like this:
     KW   .
</quote>
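
Going by the format quoted above, here is a rough sketch of how the KW
data could be turned into a list of strings. The helper name
parse_kw_lines is hypothetical (it is not part of Bio/GenBank/Scanner.py),
and it assumes it is given only the text after the "KW   " prefix of each
line, with continuation lines ending in a semicolon:

```python
def parse_kw_lines(kw_data_lines):
    """Turn the data part of an entry's KW lines into a list of keywords.

    Hypothetical helper for illustration only. Each item in
    kw_data_lines is the text after the "KW   " prefix of one line.
    """
    # A keyword is never split between lines, so the lines can simply
    # be joined before splitting on the semicolon separator.
    text = " ".join(line.strip() for line in kw_data_lines)
    # The last keyword is followed by a full stop. Keywords may contain
    # embedded stops, so remove only the single trailing one.
    if text.endswith("."):
        text = text[:-1]
    # An entry with no keywords ("KW   .") gives an empty list.
    return [kw.strip() for kw in text.split(";") if kw.strip()]
```

With this, the "KW   ." case gives an empty list rather than a list
holding one empty string, which seems the less surprising choice for
record.annotations["keywords"].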

Likewise, the GenBank parser should support the KEYWORDS line, and
the keywords should then be written out again on output.
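
For the writing side, a minimal sketch of turning a list of keywords
back into EMBL KW text might look like this. The function name
format_kw_line is made up for illustration, and it ignores wrapping a
long keyword list over several KW lines, which a real writer would need
to handle:

```python
def format_kw_line(keywords):
    """Format a list of keyword strings as a single EMBL KW line.

    Hypothetical helper for illustration only; no line wrapping.
    """
    # Keywords are separated by semicolons and the last one is followed
    # by a full stop. An empty list gives the "KW   ." placeholder line.
    return "KW   " + "; ".join(keywords) + "."
```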

Is this something you'd like to work on, or should I do it?

(If you are interested in getting involved in Biopython development
this seems like a nice project to start with - not too complicated, but
large enough to make creating a fork on GitHub and your own
enhancement branch a good idea.)

Thanks,

Peter