[Biopython] format fasta files to genbank: problem with too long Locus identifier

Sat Jul 17 07:50:50 EDT 2010

2010/7/17 Björn Johansson <bjorn_johansson at bio.uminho.pt>:
> Hi all, this is an example of parsing a fasta file and then trying to
> convert it to genbank.
> It seems that the fasta header file is not split between the "|", and all
> that is in the fasta header ends up as "LOCUS" in the genbank file. Is this
> the expected behavior? Can this be set somehow?
>
> Thanks for any help on this!
> /bjorn

Hi Bjorn,

Yes, this is expected behaviour. There are no standards for FASTA
identifiers; the NCBI conventions are just one of dozens of styles.
Therefore we don't try to parse the identifiers in FASTA files (we
can't do it reliably). For GenBank files, the identifier field in
the LOCUS line is very limited in length, so you'll have to shorten
your ID manually. Try something like this:

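from Bio import SeqIO
a = SeqIO.read("newfile.fasta", "fasta")
# Keep only the fourth "|"-separated field (the accession in NCBI-style IDs)
a.id = a.id.split("|")[3]
# The LOCUS line is taken from the record name, so shorten that as well
a.name = a.id
print(a.format("genbank"))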

(untested)
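
Since not every FASTA header follows the NCBI layout, a slightly more
defensive version of the same idea may be useful. The sketch below is
plain Python with no Biopython dependency. The helper name
shorten_fasta_id, the default limit of 16 characters (the width of the
name field in the traditional LOCUS line), and the example identifier
are all assumptions made for illustration, not part of Biopython.

```python
def shorten_fasta_id(identifier, max_len=16):
    """Pick a short name from a '|'-separated FASTA identifier.

    Hypothetical helper for illustration; max_len=16 is an assumed limit.
    """
    # Split on "|" and drop empty fields (NCBI-style IDs often end with "|").
    fields = [f for f in identifier.split("|") if f]
    if len(fields) >= 4:
        # NCBI-style "gi|number|db|accession": take the fourth field.
        candidate = fields[3]
    elif fields:
        # Unknown style: fall back to the last non-empty field.
        candidate = fields[-1]
    else:
        # Nothing usable after splitting: keep the identifier as given.
        candidate = identifier
    # Truncate as a last resort so the name fits within max_len characters.
    return candidate[:max_len]


# Made-up NCBI-style identifier, used only to show the behaviour.
print(shorten_fasta_id("gi|12345|gb|AB000001.1|"))
```

With a record read as above, this would be used as
a.id = a.name = shorten_fasta_id(a.id) before calling a.format("genbank").
Recent Biopython releases also appear to require
a.annotations["molecule_type"] (for example "DNA") to be set before
GenBank output is written.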

Peter