[Biopython-dev] GenBank parser fails (on large files?)

Fri Sep 28 08:17:02 EDT 2001

Hi Michel; 

> Thanks, the fix worked. 

Great to hear. Thanks for reporting back.

> However your solution to make parsing of large sequences 
> faster has currently a side effect. If I print the first 
> feature with qualifier 'translation', I get
> 
> ['MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ\012                     
>               CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR']
> 
> before, when I would have gotten a slightly different result:
> 
> "MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ\012                     
>               CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR"

This is actually not a side-effect of the recent changes, but a
deliberate change I made in CVS. I wrote a long message about this
last week concerning non-compatible fixes I made to the GenBank
SeqFeature parser:

http://www.biopython.org/pipermail/biopython-dev/2001-September/000579.html

The part related to your problem involves how I was handling features
that had multiple qualifier keys with the same name (i.e. two
'translation' keys). Previously, I was doing something really ugly
-- appending numbers to the end of repeated keys to make them
unique (translation, translation1, translation2 ...). This allowed
me to have one key and one string value and store things in a
dictionary.

But this is an ugly way to do things, and it actually makes life very
hard for people who want to get, say, all translation qualifiers
in a feature (if there are multiple translations). The fix was to
use the qualifier name as the key and store the values as a list, i.e.:

qualifiers = {"translation" : ["CREL", "CRET"]}

When there is only one qualifier with a given name, I also store its
value as a list, so that people can avoid having to do:

if type(qualifier[key]) == type(""):
    # do something with the string
elif type(qualifier[key]) == type([]):
    # do something with the list

in their code.
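
To illustrate the list-valued layout described above, here is a minimal
sketch. It is not the parser's actual code: the helper name add_qualifier
is hypothetical and the 'gene' value is made-up sample data. It only shows
that every key maps to a list, whether the qualifier occurs once or
several times.

```python
def add_qualifier(qualifiers, key, value):
    """Append value to the list stored under key, creating the list if needed."""
    qualifiers.setdefault(key, []).append(value)


qualifiers = {}
add_qualifier(qualifiers, "translation", "CREL")
add_qualifier(qualifiers, "translation", "CRET")
add_qualifier(qualifiers, "gene", "example")  # made-up single-valued qualifier

# Every value is a list, so the same loop works for one value or many,
# with no type check on the stored value.
for key, values in qualifiers.items():
    for value in values:
        print(key, value)
```

Because a single-valued qualifier is stored the same way as a repeated one,
calling code never has to branch on the type of the value.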

I am definitely sensitive to the fact that the change is bad news
for current code -- I'm sorry about that; it's all due to that bad
design decision I made earlier.

> Now the problem is, I had a hack to shape this string better, namely
> 
> >>> newseq= string.join(string.split(sq.qualifiers['translation']), sep=''))
> 
> This works with the " " form, but not with the [' '] form, which is how I 
> noticed the difference. 

Yes, sorry about that. A potential change (untested) would be:

clean_translations = []
for translation in sq.qualifiers['translation']:
    # split() with no arguments drops the newlines and padding spaces
    clean_translations.append("".join(translation.split()))
sq.qualifiers['translation'] = clean_translations

But on to the other problem:

[Talking about the translation]
> Note, incidentally, that this is a bit ugly, because the \012's and spaces 
> should have been cleaned out 

I agree with you here -- I haven't yet done any work at massaging
the feature value information. I'll think about a good way to do
this (I'm sure there are other cases where this also needs to be
done), and try to get something done on it this weekend.
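
As a rough idea of what such massaging could look like, here is a sketch
under stated assumptions, not what the parser does. The function name
clean_qualifier_values is hypothetical, and so is the policy: whitespace
is removed completely from 'translation' values (sequence data), while
runs of whitespace in any other value are collapsed to a single space.
The 'note' value is made-up sample data.

```python
def clean_qualifier_values(qualifiers, sequence_keys=("translation",)):
    """Return a new dict with line-wrap whitespace cleaned from every value.

    Assumed policy: keys in sequence_keys hold sequence data, so all
    whitespace is dropped; other values are free text, so each run of
    whitespace becomes one space.
    """
    cleaned = {}
    for key, values in qualifiers.items():
        joiner = "" if key in sequence_keys else " "
        cleaned[key] = [joiner.join(value.split()) for value in values]
    return cleaned


raw = {
    "translation": ["MTDDPGSG\n                     CRELTDLS"],
    "note": ["made-up free text\n                     wrapped over two lines"],
}
cleaned = clean_qualifier_values(raw)
print(cleaned["translation"][0])
print(cleaned["note"][0])
```

Returning a new dictionary leaves the parsed values untouched, which may
be useful if some code still wants the raw text.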

Thanks again for the feedback.
Brad