[Biopython] Increase line length when writing EMBL format

Fri Sep 18 19:50:57 UTC 2020

That sounds like a plan - if EBI/EMBL can confirm how they would
prefer these long lines to be done, we can adjust the Biopython output
accordingly.

Thank you!

Peter

On Fri, Sep 18, 2020 at 2:47 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>
> > The ENA website would only show me the parent record:
> >
> > https://www.ebi.ac.uk/ena/browser/api/embl/CAADRP010000000.1
> >
> > Anyway, I downloaded the full archive, and saw lots of examples of
> > 81-character lines in the feature table.
>
>
> Yes, I also noticed that, so thought to share the location of the file itself.
>
> > Do you have any contacts at EBI/EMBL who might be able to help clarify this?
>
>
> Not at the moment, but I plan to get in contact with them in the next few days to update the EMBL file. I’m happy to also discuss these specifications with them and let you know who I’ve been in contact with.
>
> > Does the current 80 character line output from Biopython actually break
> > anything downstream?
>
> Also don’t know yet. I think that their Webin-CLI application checks for errors in flat files so I might have news about this as well in the next few days.
>
> Pedro
>
>
> > On 18 Sep 2020, at 14:33, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >
> > Thank you. The ENA website would only show me the parent record:
> >
> > https://www.ebi.ac.uk/ena/browser/api/embl/CAADRP010000000.1
> >
> > Anyway, I downloaded the full archive, and saw lots of examples of
> > 81-character lines in the feature table.
> >
> > I've rechecked the INSDC feature table specification, and maximum line
> > lengths and wrapping remain under-specified:
> >
> > http://www.insdc.org/files/feature_table.html
> >
> > Do you have any contacts at EBI/EMBL who might be able to help clarify this?
> >
> > Does the current 80 character line output from Biopython actually break
> > anything downstream?
> >
> > Peter
> >
> > On Fri, Sep 18, 2020 at 2:07 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
> >>
> >> Thanks Peter!
> >>
> >> The accession is CAADRP010000001 and the EMBL file for the genome annotation can be downloaded directly from: ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/ca/CAADRP01.dat.gz
> >>
> >> All the best,
> >> Pedro
> >>
> >>
> >>> On 18 Sep 2020, at 13:54, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >>>
> >>> Thanks Pedro,
> >>>
> >>> Could you share the accession / URL for the problem record(s) then?
> >>>
> >>> And to clarify why your experiment didn't work, the Bio.GenBank.Record
> >>> objects are irrelevant to Bio.SeqIO which uses SeqRecord objects. The
> >>> GenBank parser can either produce records using Bio.GenBank.Record
> >>> (mimics a GenBank record very closely, see the Bio.GenBank.parse
> >>> function), or SeqRecords (as used in SeqIO).
> >>>
> >>> The output from SeqIO is via the EmblWriter object here, where MAX_WIDTH = 80:
> >>>
> >>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/InsdcIO.py#L1105
> >>>
> >>> Peter
> >>>
> >>> On Fri, Sep 18, 2020 at 1:52 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> thank you so much for the prompt reply. Yes, it was downloaded directly from EMBL. It's from a recent submission early this year, so maybe there were some modifications related to these cases as you pointed out.
> >>>>
> >>>> All the best,
> >>>> Pedro
> >>>>
> >>>>
> >>>>> On 18 Sep 2020, at 13:47, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >>>>>
> >>>>> Hello Pedro,
> >>>>>
> >>>>> Sadly this annotation value is one of those awkward cases of a long
> >>>>> value with no spaces, so there is no good place to break it for line
> >>>>> wrapping.
> >>>>>
> >>>>> Where is your original file from? The input line was 81 characters
> >>>>> long, which I believe is too long. Is it from EMBL themselves? If so,
> >>>>> perhaps we need to more closely match how they now handle this corner
> >>>>> case - which may have changed since I last looked at this code.
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>> On Fri, Sep 18, 2020 at 1:36 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
> >>>>>>
> >>>>>> Dear BioPython Developers and enthusiasts,
> >>>>>>
> >>>>>> I’m working on a script to make some modifications to an EMBL-format file I have at hand. Everything seems to be working OK, except for some features where `SeqIO.write(record, fh, 'embl')` writes the last closing quote (`"`) on a new line, as a feature line of its own.
> >>>>>>
> >>>>>> Here’s how the original feature is:
> >>>>>>
> >>>>>> ```
> >>>>>> FT                   /standard_name="species:rnd-4_family-1331|genus:Unspecified"
> >>>>>> ```
> >>>>>>
> >>>>>> but with `SeqIO.write` it gets printed on 2 lines as:
> >>>>>>
> >>>>>> ```
> >>>>>> FT                   /standard_name="species:rnd-4_family-1331|genus:Unspecified
> >>>>>> FT                   "
> >>>>>> ```
> >>>>>>
> >>>>>> I remember seeing (can’t remember where, though) that the ‘embl’ format mostly uses the GenBank structure, so I thought that increasing `record.GB_LINE_LENGTH`, say to 100 (`record.GB_LINE_LENGTH = 100`), could work, but it doesn’t…
> >>>>>>
> >>>>>> I actually think that `record.GB_LINE_LENGTH` is not taken into account with ‘embl’ writing format because the default value seems to be [79](https://biopython.org/docs/1.75/api/Bio.GenBank.Record.html#Bio.GenBank.Record.Record.GB_LINE_LENGTH) but by default it prints the line above with a width of 81.
> >>>>>>
> >>>>>> Any ideas/suggestions on how to work around this? I could probably write another parser to correct for this, but it would be easier/better if this could be handled within Biopython.
> >>>>>>
> >>>>>> Many thanks,
> >>>>>> Pedro
> >>>>
> >>
>
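
For readers following the thread, here is a minimal sketch of the wrapping behaviour described above, plus the kind of post-processing workaround Pedro mentions. It is not Biopython's code: `wrap_qualifier` and `rejoin_dangling_quotes` are hypothetical helpers written only for illustration. It assumes a rule of "break at the last space that fits, otherwise hard-split at the width limit", with the 80-character limit and the 21-character `FT` indent taken from the messages above.

```python
# Illustrative only: hypothetical helpers, not part of Biopython.
INDENT = "FT" + " " * 19  # 21-character qualifier indent seen in the thread


def wrap_qualifier(key, value, max_width=80):
    """Wrap one quoted qualifier line, breaking at a space where possible."""
    line = f'{INDENT}/{key}="{value}"'
    lines = []
    while len(line) > max_width:
        # Look for the last space that keeps the line within max_width.
        cut = line.rfind(" ", len(INDENT) + 1, max_width + 1)
        if cut == -1:
            # No space to break at, so hard-split at the width limit.
            cut = max_width
        lines.append(line[:cut])
        line = INDENT + line[cut:].lstrip()
    lines.append(line)
    return lines


def rejoin_dangling_quotes(lines):
    """Merge a line holding only a closing quote into the line before it."""
    fixed = []
    for line in lines:
        if fixed and line.startswith("FT") and line[2:].strip() == '"':
            fixed[-1] += '"'
        else:
            fixed.append(line)
    return fixed


value = "species:rnd-4_family-1331|genus:Unspecified"
wrapped = wrap_qualifier("standard_name", value)
for text in wrapped:
    print(len(text), text)

# Post-processing restores the single 81-character line from the input file.
for text in rejoin_dangling_quotes(wrapped):
    print(len(text), text)
```

With the value from Pedro's record, the full line is 81 characters, so an 80-character limit and no spaces to break at leaves the closing quote alone on a continuation line. Rejoining it gives back the 81-character form seen in the ENA download. Whether that form is what EBI/EMBL actually prefer is the open question in this thread.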