[Biojava-l] Re: Biojava-l digest, Vol 1 #334 - 2 msgs
Cox, Greg
gcox@netgenics.com
Mon, 11 Jun 2001 15:00:28 -0400
Here are a couple of links on the Genbank format:
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Neither indicates that the GI field is optional. I checked Genbank through
the Web interface, and that version of AE000783 does have a GI field. On a
first pass, I'd guess it's a crossed wire at Genbank. If this turns out to
be widespread, it's probably worth changing the parser, but it looks more
like an occasional bug.
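If we did decide to relax the parser, the change could look roughly like the
sketch below. This is not the actual BioJava parser code: the class and
method names (VersionLine, parse) are made up for illustration, and it
assumes the GI token, when present, has the form GI:<digits> as in the
AE000784 example quoted further down.

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of BioJava: reads a Genbank VERSION line
// and treats the GI token as optional instead of mandatory.
final class VersionLine {
    final String accession; // e.g. "AE000784.1"
    final String gi;        // e.g. "2690041", or null when the line has no GI

    private VersionLine(String accession, String gi) {
        this.accession = accession;
        this.gi = gi;
    }

    static VersionLine parse(String line) {
        StringTokenizer tokens = new StringTokenizer(line);
        if (!tokens.hasMoreTokens() || !"VERSION".equals(tokens.nextToken())) {
            throw new IllegalArgumentException("Not a VERSION line: " + line);
        }
        if (!tokens.hasMoreTokens()) {
            throw new IllegalArgumentException("VERSION line has no accession: " + line);
        }
        String accession = tokens.nextToken();

        // Tolerant part: accept the line whether or not a GI token follows.
        String gi = null;
        if (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("GI:")) {
                gi = token.substring(3);
            }
        }
        return new VersionLine(accession, gi);
    }

    public static void main(String[] args) {
        VersionLine usual = parse("VERSION     AE000784.1  GI:2690041");
        VersionLine odd = parse("VERSION     AE000783");
        System.out.println(usual.accession + " gi=" + usual.gi);
        System.out.println(odd.accession + " gi=" + odd.gi);
    }
}
```

With something like this, the one-token form is accepted and simply reports
no GI, while the usual two-token form is read as before.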
Greg
-----Original Message-----
From: Sarath [mailto:sarath@decodon.com]
Sent: Monday, June 11, 2001 2:18 PM
To: Thomas Down
Cc: Sarath; biojava-l@biojava.org
Subject: Re: [Biojava-l] Re: Biojava-l digest, Vol 1 #334 - 2 msgs
Hi Thomas,
It was nice to get such an early response, and it did work the way you
said, but I had to include a set of dummy characters to fool the program.
Is this the only way I can manage with such files? The files I suggested
as a reference are newly sequenced ones, i.e. the sequencing of these
genomes was completed on 1st June, so what do you say to this? I don't
know exactly what the purpose of the GI is in the Genbank format, but do
you think this level of rigidity is necessary for reading the Genbank
format?
from sarath
On Mon, 11 Jun 2001, Thomas Down wrote:
> On Mon, Jun 11, 2001 at 06:32:08PM +0200, Sarath wrote:
> > Hello everyone
> > I have been using the biojava package for around a month, and today,
> > surprisingly, I ran into a strange case: a simple program is not
> > able to compute the GC content from a file in the Genbank format. I
> > would be very glad if somebody could tell me the bug in the program
> > to find the GC content from the file (AE000783.gbk) from the URL
> > ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Borrelia_burgdorferi/
> > I am pasting the source code here; I hope this is not an
> > inconvenience to some people. The code is exactly the same as that
> > given in the tutorials. Probably there is a bug in biojava... who
> > knows, or should I blame the Genbank people?
>
> Bug? No, never... ;)
>
> I've taken a quick look at this. The problem is with
> the VERSION line of the file. Most Genbank entries look
> like:
>
> VERSION AE000784.1 GI:2690041
>
> The BioJava parser is explicitly expecting to see two tokens
> following the VERSION keyword.
>
> The file you are trying to read has a line like:
>
>
> VERSION AE000783
>
> The only format documentation I can find for Genbank is at:
>
> ftp://ncbi.nlm.nih.gov/genbank/docs/
>
> This concentrates on the feature tables, and doesn't seem
> to give a normative description of the headers. However,
> the example given includes the two-token VERSION string,
> and this seems to be found in the vast majority of entries.
>
> That said, we probably ought to take the `strict in what you
> produce, tolerant in what you accept' approach. I'll leave
> the final decision to people who use the Genbank parser more
> regularly, though (Greg, are you listening?)
>
> Is there a normative specification for the Genbank header
> lines anywhere? If so, maybe it is worth complaining about
> that entry...
>
> Thomas.
>
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l