[Bioperl-l] BioPerl 1.6 and parsing multiple EMBL records

Tue Jan 12 11:19:32 EST 2010

On Tue, Jan 12, 2010 at 4:02 PM, Chris Fields <cjfields at illinois.edu> wrote:
> On Jan 11, 2010, at 9:55 AM, Peter wrote:
>
>> On Mon, Jan 11, 2010 at 3:42 PM, Hotz, Hans-Rudolf <hrh at fmi.ch> wrote:
>>>
>>> These entries form the CON data class, see:
>>> http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_4_14
>>> and they don't contain any sequence information.
>>
>> I know - GenBank files have a similar system with CONTIG
>> lines instead of sequences. I was expecting BioPerl to be
>> able to convert these EMBL files with CO lines into GenBank
>> files with CONTIG lines.
>
> IIRC the contig information for GenBank is stored in annotation.
> We can try to ensure the data is carried over to EMBL properly.

For contig records (where there is no sequence) I think we just
need to map the GenBank CONTIG lines to the EMBL CO lines,
and vice versa. At least, that's what Biopython now does (trunk
code, not yet released).
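
As a rough illustration of that mapping (a standalone sketch, not the
BioPerl or Biopython implementation), the location string can be lifted
out of one line type and re-wrapped under the other. The function names
and the 80-column wrapping rule here are my own assumptions, and the
accessions in the usage note are made up.

```python
import re


def _wrap_location(location, first_prefix, next_prefix, width=80):
    """Re-wrap a join(...) location string, breaking only after commas."""
    lines, current, prefix = [], "", first_prefix
    for piece in re.findall(r"[^,]+,?", location):
        if current and len(prefix) + len(current) + len(piece) > width:
            lines.append(prefix + current)
            current, prefix = piece, next_prefix
        else:
            current += piece
    if current:
        lines.append(prefix + current)
    return lines


def genbank_contig_to_embl_co(genbank_lines, width=80):
    """Collect the GenBank CONTIG block and emit it as EMBL CO lines."""
    parts, in_contig = [], False
    for line in genbank_lines:
        if line.startswith("CONTIG"):
            in_contig = True
            parts.append(line[len("CONTIG"):].strip())
        elif in_contig and line.startswith(" "):
            # Continuation lines of the CONTIG block are indented.
            parts.append(line.strip())
        elif in_contig:
            break  # next keyword or the // terminator ends the block
    return _wrap_location("".join(parts), "CO   ", "CO   ", width)


def embl_co_to_genbank_contig(embl_lines, width=80):
    """Collect EMBL CO lines and emit them as a GenBank CONTIG block."""
    parts = [line[2:].strip() for line in embl_lines if line.startswith("CO ")]
    return _wrap_location("".join(parts), "CONTIG      ", " " * 12, width)
```

For example, with made-up accessions, the two GenBank lines
"CONTIG      join(AB000001.1:1..100,gap(50)," and
"            AB000002.1:1..200)" become a single CO line, and feeding
that CO line back through embl_co_to_genbank_contig() gives a single
CONTIG line with the same location.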

>>> If you take the 'expanded' entries from
>>> ftp://ftp.ebi.ac.uk/pub/databases/embl/expanded_con/release/rel_con_hum_01_r102.dat.gz
>>> your script will work.
>>
>> That's a useful tip - thanks.
>>
>> Peter
>
> NCBI's eutil option 'gbwithparts' is similar (always retrieves the sequence).

Indeed. This is a useful workaround when a parser can't cope
with the contig version of a GenBank file for some reason, e.g.
http://bugzilla.open-bio.org/show_bug.cgi?id=2745
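
For what it's worth, here is a minimal sketch of building an EFetch
request that asks for 'gbwithparts'. The helper name efetch_url() is
hypothetical, and the choice of db=nuccore and retmode=text is my
assumption about typical usage; it only constructs the URL and does not
contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, with_parts=True):
    """Build an EFetch URL for a nucleotide record.

    with_parts=True requests rettype 'gbwithparts', which should return
    the sequence rather than a CONTIG line; False requests plain 'gb'.
    """
    params = {
        "db": "nuccore",  # assumption: nucleotide records
        "id": accession,
        "rettype": "gbwithparts" if with_parts else "gb",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)
```

Calling efetch_url() with the accession of a contig record gives a URL
that can be passed to any HTTP client to download the expanded record.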

Peter