[Bioperl-l] Memory requirements for conversion from embl to genbank

Chris Fields cjfields at uiuc.edu
Thu Aug 31 22:13:23 UTC 2006


> So, to recap, the script used to generate UTRdb (supposed UTRdb_gen)
> mangles the input GenBank or EMBL formatted input. According to notes
> on the ftp server EMBL rel. 86 has been used to generate this record:
> 
...
> OS   Hepatitis GB virus B
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
...

Yes, but they very obviously made an error in their conversion (several, I
would say, based on this ongoing conversation).  AFAIK there are no
GenBank/EMBL files containing multiple lineages (the ORGANISM/OC lines),
unless they are specialized multi-sequence files.  We haven't come across
them here yet, at least.
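
For anyone who wants to screen records for this before feeding them to a
parser, here is a rough sketch (in Python, and not BioPerl code) of a check
for more than one lineage in a single EMBL record.  It assumes that a lineage
is a run of OC lines terminated by a period, as in the records quoted in this
thread; the function name count_lineages is made up for illustration.

```python
def count_lineages(record: str) -> int:
    """Count lineages in one EMBL record.

    Assumption: each lineage is spread over one or more OC lines and the
    last OC line of a lineage ends with a period.
    """
    count = 0
    for line in record.splitlines():
        if line.startswith("OC") and line.rstrip().endswith("."):
            count += 1
    return count


# The mangled UTRdb-style record quoted above: two organisms, two lineages.
mangled = """\
OS   Hepatitis GB virus B
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
"""

# A record with a single lineage, as in the original EMBL entry.
clean = """\
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
"""

for name, text in (("mangled", mangled), ("clean", clean)):
    n = count_lineages(text)
    status = "suspicious" if n > 1 else "ok"
    print(f"{name}: {n} lineage(s), {status}")
```

A record that reports more than one lineage is probably worth taking up with
whoever produced the file, rather than with the parser.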

> But the original record in both GenBank and EMBL does make sense, right?

They do; they both have single lineage lines (the above sequence has two).

GenBank file (from your post):
....
> SOURCE      Hepatitis GB virus B
>   ORGANISM  Hepatitis GB virus B
>             Viruses; ssRNA positive-strand viruses, no DNA stage;
> Flaviviridae.
....

EMBL file:

....
> OS   Hepatitis GB virus B
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> XX
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
....
> [...]

Which again indicates that the people who ran the original sequence
conversion are the ones at fault (not us).  Talk to them.

> The above official GenBank record cannot be parsed and the parsing code
> silently leaks through and exits with no data written out. I have filed
> bug #2087.

Working on that (already a bugfix in for embl).  The genbank fix isn't so
straightforward either.

> This official EMBL record cannot be parsed either:
> 
> ------------- EXCEPTION  -------------
> MSG: Can't see new qualifier in: /focus
> from:
> /organism="Hepatitis GB virus B"
> /focus
> /isolate="FL3"
> /mol_type="mRNA"
> /db_xref="taxon:39113"
> 
> STACK Bio::SeqIO::embl::_read_FTHelper_EMBL
> /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO/embl.pm:1245
> STACK Bio::SeqIO::embl::next_seq
> /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO/embl.pm:383
> STACK toplevel testparsing.pl:20
> 
> --------------------------------------

Yes, that's an issue.
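
To make the failure mode concrete: the qualifier block in the exception
contains /focus, which carries no ="value" part.  The sketch below (Python,
purely illustrative, and not how Bio::SeqIO::embl is implemented or was
fixed) shows one way a line-based qualifier reader could accept such flag
qualifiers.  The function name parse_qualifiers and the choice of None for a
missing value are assumptions, and multi-line qualifier values are not
handled.

```python
import re

# "/name" optionally followed by "=value"; the value part may be absent.
QUALIFIER = re.compile(r'^/(\w+)(?:=(.*))?$')


def parse_qualifiers(lines):
    """Map qualifier names to values; flag qualifiers map to None."""
    result = {}
    for line in lines:
        match = QUALIFIER.match(line.strip())
        if match is None:
            raise ValueError(f"not a qualifier line: {line!r}")
        name, value = match.group(1), match.group(2)
        # A qualifier such as /focus has no value at all.
        result[name] = None if value is None else value.strip('"')
    return result


# The qualifier lines from the exception message above.
source_qualifiers = [
    '/organism="Hepatitis GB virus B"',
    '/focus',
    '/isolate="FL3"',
    '/mol_type="mRNA"',
    '/db_xref="taxon:39113"',
]

parsed = parse_qualifiers(source_qualifiers)
print(parsed)
```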

> Shall I file another bugreport or attach under the bug #2077, my favourite
> one? ;-)

As I have mentioned to you several times before, once a bug you have
reported is fixed, the report should NEVER be reopened unless the fix
(1) doesn't work, (2) introduces more bugs, or (3) could be done in a better
way (and the last can be avoided by just contacting the developer).  Any new
bug must be filed as a new bug report.

It's too hard to track what's fixed and what isn't if you keep piling extra
data or submitting new bugs to the same report.  It's just not good practice
(and not good bug reporting).

> I don't have the originally generated files anymore but parsing finished
> "successfully" with "some" data written out. ;)

Unless there is a specific problem with bioperl writing or parsing erroneous
data, there is no bug.  You have indicated a few instances where that may be
an issue (the LOCUS line, the above one, lack of proper error handling).
For that we thank you.  

However, everything else I have seen you submit concerns sequence data we
have no control over and no responsibility for, so we could just completely
absolve ourselves of fixing it.  Missing or extra quotes and multiple
lineages are not our fault.

We'll try to do what we can, but only within the bounds of not breaking
code, hurting performance, losing our sanity, etc.

Chris




