[Bioperl-l] Memory requirements for conversion from embl to genbank

Chris Fields cjfields at uiuc.edu
Thu Aug 31 15:53:54 UTC 2006


> Hi Chris,
> 
> so it has been killed after a while.
> 
> Chris Fields wrote:
> > Martin,
> >
> > That's the issue; I believe the tags are supposed to be unique (part of
> > the EMBL standard, I think).  I'll look at it but this may be, again, one
> > of those issues which we may not fix as it's a problem with the input
> > sequence (not in the correct format).
> 
> Why? You can merge at least the text in two successively appearing /note
> feature lines, right? Can you fix your code, Chris? It would take me a
> while to get familiar with it. What I still have in mind: expect that
> mostly either there is no closing quote or there are two closing quotes.
> And a single quote often appears in the middle of the string, e.g. 5'UTR,
> 5'-UTR. As I already mentioned, the loop should be used as a last resort.
> And now you see why. Definitely, the loop in genbank.pm must have a
> built-in limit so it never adds, say, more than 4 or 6 quotes. ;)

Well, it isn't my code, but I'm flattered (or maybe shocked is more
appropriate based on the condition of the GenBank parser).

The parser is set up to grab any tag, regardless of the tag name.  The only
thing it supposedly relies on is proper tag format and balanced quotes.  I
added a fix for those tags that lack a balanced quote (bug 2077).  I'm
looking into a better (faster) way of going about it that would be amenable
to EMBL/GenBank/Swiss yet still retains some semblance of format checking.  
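
As a rough illustration of the bounded-loop idea discussed here, below is a
minimal sketch in Python. It is not the actual genbank.pm code, and the names
read_qualifier, UnbalancedQuoteError and max_continuation are made up for this
example. It assumes a qualifier looks like /tag="value", that a value may
continue over following lines, and that an even number of double quotes means
the value is closed.

```python
class UnbalancedQuoteError(ValueError):
    """Raised when a qualifier value never closes its double quote."""


def read_qualifier(lines, start, max_continuation=50):
    """Collect one /tag="value" qualifier beginning at lines[start].

    Returns (tag, value, next_index). Continuation lines are appended
    until the double quotes balance, but never more than
    max_continuation of them, so malformed input cannot loop forever.
    """
    first = lines[start].strip()
    if not first.startswith("/"):
        raise ValueError(f"not a qualifier line: {first!r}")

    # Any tag name is accepted; only the format is checked.
    tag, sep, value = first[1:].partition("=")
    i = start + 1
    if not sep:
        return tag, None, i      # flag-style qualifier with no value
    if not value.startswith('"'):
        return tag, value, i     # unquoted value, e.g. /codon_start=1

    used = 0
    # An odd number of double quotes means the value is still open.
    # Single quotes (as in 5'UTR) are deliberately ignored.
    while value.count('"') % 2 != 0:
        if used >= max_continuation or i >= len(lines):
            raise UnbalancedQuoteError(
                f"/{tag}: no closing quote after {used} continuation line(s)"
            )
        value += " " + lines[i].strip()
        i += 1
        used += 1

    if not value.endswith('"'):
        raise UnbalancedQuoteError(f"/{tag}: text after the closing quote")

    # Drop the enclosing quotes and unescape doubled quotes.
    return tag, value[1:-1].replace('""', '"'), i
```

With a limit like this, a /note that is missing its closing quote raises an
error the caller can report or skip, instead of consuming the rest of the
file. The limit of 50 lines is arbitrary.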

You'll have to give me a bit of time, though, as I'm juggling a lot of other
priorities right now (the beginning of the fall semester here and all).
I'll work on it as soon as I can.

> > At the very least it should break out of an infinite loop with a thrown
> > message.  Have you tried adding a debugging statement to the specific
> > line in genbank.pm to verify the infinite loop?
> 
> Definitely, I would even opt for ignoring such /note lines, they are not
> critical.

It would be more of a pain to ignore specific tags as you would have to add
exceptions for every tag you didn't want.  

> > Wow, you've run into a hornet's nest of bad sequences.  Missing quotes,
> > too many quotes, now this!
> 
> Reality. :(
> M.
...

Chris
