[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Seth Johnson
johnson.biotech at gmail.com
Tue Jun 6 15:03:23 UTC 2006
I've found the cause of the incorrect formatting: it was the command-line
option for Release formatting. With that sorted out, most of the sequences
now parse correctly. However, some of them still cause the exception below.
I hope I'm not being too much of a nuisance.
~~~~~~~~~~~~~~~~~~
org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
    at exonhit.parsers.GenBankParser.main(GenBankParser.java:366)
Caused by: org.biojava.bio.seq.io.ParseException: Bad ID line found:
DX588312 standard; DNA ; GSS; 25 BP.
    at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:321)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
    ... 1 more
Java Result: -1
~~~~~~~~~~~~~~~~~~
Here's the entire sequence file:
==================
ID DX588312 standard; DNA ; GSS; 25 BP.
XX
AC DX588312;
XX
SV DX588312.1
DT 18-MAY-2006
XX
DE Lewinski-HIVchimera-HeLa-MLVGagPuro-11D09.rev HIVmGag MLV/HIV chimera
DE Integration Site Library Homo sapiens genomic, genomic survey sequence.
XX
KW GSS.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini;
OC Hominidae; Homo.
XX
RN [1]
RP 1-25
RA Lewinski M.K., Yamashita M., Emerman M., Ciuffi A., Marshall H., Crawford
RA G., Collins F., Shinn P., Leipzig J., Hannenhalli S., Berry C.C., Ecker
RA J.R., Bushman F.D.;
RT "Retroviral DNA Integration: Viral and Cellular Determinants of Target
RT Site Selection";
RL PLoS Pathog. 0:0-0 (2006).
XX
CC Contact: Bushman FD
CC Department of Microbiology
CC University of Pennsylvania School of Medicine
CC 402C Johnson Pavilion, 3610 Hamilton Walk, Philadelphia, PA 19104-6076,
CC USA
CC Tel: 215 573 8732
CC Fax: 215 573 4856
CC Email: bushman at mail.med.upenn.edu
CC The hg17 freeze of the human genome was used.
CC Class: shotgun.
XX
FH Key Location/Qualifiers
FH
FT source 1..25
FT /organism="Homo sapiens"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:9606"
FT /cell_line="HeLa"
FT /clone_lib="HIVmGag MLV/HIV chimera Integration Site
FT Library"
FT /note="HeLa cells were infected with an HIV-based
FT chimeric virus with MLV MA, p12 and CA substituted for
FT HIV MA and CA and the puromycin resistance gene in place
FT of nef. Cells were selected with puromycin for 2 weeks.
FT Genomic DNA was extracted, digested with MseI, and
FT ligated to a linker. Viral-host DNA junctions were
FT amplified by nested PCR and cloned into TOPO TA vectors."
XX
SQ Sequence 25 BP; 13 A; 0 C; 5 G; 7 T; 0 other;
agaagtaaaa atgtagatat gatta 25
//
==================
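
A minimal standalone sketch for pre-checking ID lines like the one in the exception before handing a file to the parser. The cause of the failure is not confirmed anywhere in this thread; one guess is the space before the semicolon in "DNA ;". The pattern below is an assumption modelled on the old-style ID layout (name, "standard;", molecule, division, length) and is not copied from EMBLFormat. The class and method names (IdLineCheck, isStrict, normalize) are hypothetical.

```java
import java.util.regex.Pattern;

final class IdLineCheck {
    // Hypothetical strict pattern (NOT BioJava's own):
    // ID <name> standard; <molecule>; <division>; <length> BP.
    // Assumption: the molecule field may contain spaces but must not
    // end in whitespace, so "DNA ;" is rejected while "DNA;" is accepted.
    private static final Pattern STRICT = Pattern.compile(
        "^ID\\s+(\\S+)\\s+standard;\\s+([^;]*\\S);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // True if the line fits the strict layout assumed above.
    static boolean isStrict(String line) {
        return STRICT.matcher(line).matches();
    }

    // Removes any whitespace that sits directly before a semicolon.
    static String normalize(String line) {
        return line.replaceAll("\\s+;", ";");
    }
}
```

Under these assumptions, the ID line from the file above fails isStrict as written and passes once it has been run through normalize. Whether the same normalization would satisfy EMBLFormat is untested.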
On 6/6/06, Seth Johnson <johnson.biotech at gmail.com> wrote:
> I see now! It looks like the ASN2GB converter is taking some liberties
> with EMBL format. I'll try to experiment with command line options of
> that software and if all else fails get hold of the NCBI developers.
>
> On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > The program used to generate that EMBL file is doing it incorrectly - it
> > is missing the XX tag after the feature table, and is also missing the
> > SQ tag before the sequence begins.
> >
> > If you generated it using BJX then that's my problem to fix so let me
> > know ASAP if that is the case!
> >
> > cheers,
> > Richard
> >
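
The two problems Richard describes (no XX line after the feature table, no SQ line before the sequence) can be checked mechanically before parsing. The following is a rough sketch only: it looks at two-letter line tags and nothing else, and the class and method names are hypothetical, not part of BioJava.

```java
import java.util.List;

final class EmblStructureCheck {
    // True if an SQ line appears before the record terminator "//".
    static boolean hasSqLine(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith("//")) {
                return false; // reached end of record without seeing SQ
            }
            if (line.startsWith("SQ")) {
                return true;
            }
        }
        return false;
    }

    // True if the last FT/FH line is immediately followed by an XX line.
    // A record with no feature table at all is treated as acceptable.
    static boolean hasXxAfterFeatureTable(List<String> lines) {
        int last = -1;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.startsWith("FT") || line.startsWith("FH")) {
                last = i;
            }
        }
        if (last < 0) {
            return true;
        }
        return last + 1 < lines.size() && lines.get(last + 1).startsWith("XX");
    }
}
```

A record that fails either check is likely a converter problem rather than a parser bug, which is the distinction Richard is drawing.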