[Biopython] LOCUS name length on GenBank output: option to adjust?

Thu Feb 2 19:57:06 UTC 2017

I just wanted to note that there was a similar discussion in #747
<https://github.com/biopython/biopython/issues/747> which I submitted about
a year ago. I'm not sure if it adds anything to the discussion, but since
the issue and Bastien's proposed change are similar to what I suggested, I
thought I'd throw it out there.

I think the behavior implemented in #802
<https://github.com/biopython/biopython/pull/802> is a good solution, and
I agree that additional complexity is not advisable. In the situation I
linked to above, I ended up storing the longer names in a dictionary and
used short keys for everything that Biopython dealt with. When I was all
done, I went back and rewrote the LOCUS lines using the dictionary and
normal file I/O. It's a messy hack, but I can go dig up the code if it
would be helpful.
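
A minimal sketch of that rewrite step, assuming the GenBank file was
already written with the short keys as LOCUS names. This is not the
original code: `restore_locus_names` and the `long_names` dictionary are
hypothetical names used only for illustration. It splits the LOCUS line
on whitespace rather than on column positions, so the rewritten line no
longer follows the fixed-column layout.

```python
def restore_locus_names(lines, long_names):
    """Yield lines, swapping short LOCUS keys for their long names.

    lines      -- iterable of text lines from a GenBank file
    long_names -- dict mapping short key -> original long name
    """
    for line in lines:
        if line.startswith("LOCUS"):
            # Token-based split: keyword, name, then the untouched remainder
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[1] in long_names:
                line = "LOCUS       %s %s" % (long_names[parts[1]], parts[2])
        yield line


# Usage with plain file I/O (file names are placeholders):
# with open("short.gb") as src, open("long.gb", "w") as dst:
#     dst.writelines(restore_locus_names(src, {"k1": "contig_0000123456_17"}))
```

Lines other than LOCUS pass through unchanged, and a LOCUS line whose name
is not in the dictionary is left as written.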

Cheers!
Kevin

On Thu, Feb 2, 2017 at 2:11 PM Peter Cock <p.j.a.cock at googlemail.com> wrote:

> I was alluding to this paragraph in
> ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>
>   Although each of these data values can be found at column-specific
> positions, we encourage those who parse the contents of the LOCUS
> line to use a token-based approach. This will prevent the need for
> software changes if the spacing of the data values ever has to be
> modified.
>
> See also
> https://github.com/biopython/biopython/issues/526#issuecomment-276994748
>
> Given the current Bio.SeqIO API, I take it you are suggesting a
> file format variant like "genbank-relaxed" or "genbank-strict"?
> (Precedent with the FASTQ variants, or PHYLIP in AlignIO)
>
> I would prefer just to have one GenBank mode - it is complicated
> enough already.
>
> Peter
>
> On Thu, Feb 2, 2017 at 5:33 PM, Chevreux, Bastien
> <bastien.chevreux at dsm.com> wrote:
> > I could not find any discussion about LOCUS line and separators, so I
> > cannot comment on that.
> >
> > The way I would implement output in BioPython is a mode flag. If set to
> > strict (default) it would give exactly what we have today, if set to
> > 'loose' it would try to wiggle name and size numbers into the available
> > space for name and size, leaving the rest untouched. That would leave the
> > possibility to have a "separator" mode sometime later should LOCUS lines
> > switch to that.
> >
> > B.
> >
> > --
> > DSM Nutritional Products Microbia Inc | Bioinformatics
> > 60 Westview Street | Lexington, MA 02421 | United States
> > Phone +1 781 259 7613 | Fax +1 781 259 0615
> >
> > -----Original Message-----
> > From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
> > Sent: Thursday, February 02, 2017 10:39 AM
> > To: Chevreux, Bastien <bastien.chevreux at dsm.com>
> > Cc: biopython at biopython.org
> > Subject: Re: [Biopython] LOCUS name length on GenBank output: option to
> > adjust?
> >
> > I've not checked lately but has there been any progress on the NCBI
> > moving GenBank format away from a strict positional interpretation of the
> > LOCUS line to being separator based? The issue and discussion you
> > referenced was back in 2015...
> >
> > Practically speaking perhaps we could write a spec-breaking minimal
> > LOCUS line (leaving out things like the date and division which make
> > parsing a problem once the column positions are lost) with a warning?
> >
> > Peter
> >
> >
> > On Thu, Feb 2, 2017 at 3:16 PM, Chevreux, Bastien
> > <bastien.chevreux at dsm.com> wrote:
> >> Dear list,
> >>
> >>
> >>
> >> could BioPython implement an option to adjust the strictness of name
> >> length checking of a sequence when writing GenBank output?
> >>
> >>
> >>
> >> I am aware of the short discussion in
> >>
> >>   https://github.com/biopython/biopython/issues/526
> >>
> >>
> >>
> >> and that BioPython wants to be strict on writing GenBank. However, I
> >> beg to reconsider this decision and allow for a user override.
> >>
> >>
> >>
> >> Background: annotation of metagenomics / metatranscriptomic datasets
> >> where one can easily have a million contigs or more. Projects were
> >> named accordingly short so that names of DNA sequences fit into 16
> >> characters. However, what was not considered was GenBank output of
> >> peptides, where the locus names are
> >> <name_of_contig>+<underscore>+<CDScounter_in_contig> … and that may go
> >> up to 20.
> >>
> >>
> >>
> >> As we are talking protein sequences here and there is no known
> >> protein >99999 amino acids, there would be a lot of wiggle room to
> >> allow the user to set a 20 char limit (or more) during GB output.
> >>
> >>
> >>
> >> Best,
> >>
> >>   Bastien
> >>
> >>
> >>
> >> --
> >> DSM Nutritional Products Microbia Inc | Bioinformatics
> >> 60 Westview Street | Lexington, MA 02421 | United States
> >> Phone +1 781 259 7613 | Fax +1 781 259 0615
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at mailman.open-bio.org
> >> http://mailman.open-bio.org/mailman/listinfo/biopython

-- 
Lecturer, Harvard Medical School
617-432-2210