[Biopython] Problem with parsing strand in Homo_sapiens.GRCh37.68 genbank files

Tue Aug 14 15:38:42 UTC 2012

On Tue, Aug 14, 2012 at 3:54 PM, Susan Wilson <smwilson at hpc.unm.edu> wrote:
> Hi Peter,
>
> Thanks for quick response. I have downloaded the files from
> ftp://ftp.ensembl.org/pub/release-68/genbank/homo_sapiens/. Got version 1.53
> of biopython. Maybe I should try 1.6?

Biopython 1.53 was released over two years ago (December 2009). The
current release is 1.60 (one dot sixty); there never was a 1.6 (one dot six).

Yes, please try the current Biopython. It works fine here at least. With
this quick test I get only strands of +1 or -1, as expected:

from Bio import SeqIO
genome = SeqIO.read("Homo_sapiens.GRCh37.68.chromosome.1.dat", "gb")
for f in genome.features: print f.strand, f.location, f.qualifiers.get("gene")
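
Printing every feature of chromosome 1 is a lot of output, so a tally
of the strand values may be easier to read. This is only a sketch: the
helper names are made up for illustration, and it assumes the strand is
exposed either as feature.location.strand (as in recent Biopython) or
as feature.strand (as in the one-liner above).

```python
from collections import Counter


def feature_strand(feature):
    """Return the strand of a feature, or None if it is not set.

    Assumption: the strand lives on feature.location.strand when that
    attribute exists, otherwise on feature.strand.
    """
    location = getattr(feature, "location", None)
    if location is not None and hasattr(location, "strand"):
        return location.strand
    return getattr(feature, "strand", None)


def tally_strands(features):
    """Count how many features carry each strand value (+1, -1, None, ...)."""
    return Counter(feature_strand(f) for f in features)


def tally_genbank_strands(path):
    """Parse a single-record GenBank file and tally its feature strands."""
    # Imported here so tally_strands() is usable without Biopython installed.
    from Bio import SeqIO

    record = SeqIO.read(path, "gb")
    return tally_strands(record.features)
```

If the tally for the chromosome 1 file contains a None key, the parser
did not assign strands to those features.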

Going back to Biopython 1.53 on my machine (which didn't accept a filename
in SeqIO and so needs an explicit open), I get a parser warning:

UserWarning: Malformed LOCUS line found - is this correct?
LOCUS       1 249250621 bp DNA HTG 14-JUL-2012

You should have seen this warning on your machine. Did you?
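
If the warning may have scrolled past unnoticed, one way to check is to
record warnings while parsing. A minimal sketch using only the standard
library warnings module; the helper name call_and_collect_warnings is
hypothetical.

```python
import warnings


def call_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages as strings)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones already shown once.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]
```

Wrapping the parse call, for example
call_and_collect_warnings(SeqIO.read, handle, "gb") with handle being an
open file handle, would return the record together with any parser
warnings raised along the way.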

This meant the sequence wasn't treated as DNA or RNA (it got an
unspecified alphabet instead), and as a result the strand wasn't set
to +1 but left as None (which would normally only happen for proteins).
At some point the LOCUS line handling was updated, so it now
does recognise this as a nucleotide sequence.
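
To illustrate the chain of cause and effect, here is a rough sketch of
how a molecule type on the LOCUS line could decide the default strand.
This is not Biopython's actual parser: the whitespace splitting and both
function names are assumptions made purely for illustration.

```python
def guess_molecule_type(locus_line):
    """Return the first DNA/RNA-like token on a LOCUS line, or None."""
    tokens = locus_line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % locus_line)
    # tokens[1] is the locus name, so start looking after it.
    for token in tokens[2:]:
        upper = token.upper()
        if "DNA" in upper or "RNA" in upper:
            return token
    return None


def default_strand(locus_line):
    """Return +1 for a nucleotide record, None if the type is unknown."""
    return 1 if guess_molecule_type(locus_line) is not None else None
```

For the Ensembl line quoted above, this check finds the DNA token, so
the default strand would be +1 rather than None.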

Peter