[Biojava-l] Re: Biojava-l digest, Vol 1 #334 - 2 msgs

Mon, 11 Jun 2001 15:00:28 -0400

Here are a couple of links on the Genbank format:
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Neither indicates that the GI field is optional.  I checked Genbank through
the Web interface, and that version of AE000783 does have a GI field.  On a
first pass, I'd guess it's a crossed wire at Genbank.  If this is epidemic,
it's probably worth changing the parser, but it seems to be more an
occasional bug.

Greg

-----Original Message-----
From: Sarath [mailto:sarath@decodon.com]
Sent: Monday, June 11, 2001 2:18 PM
To: Thomas Down
Cc: Sarath; biojava-l@biojava.org
Subject: Re: [Biojava-l] Re: Biojava-l digest, Vol 1 #334 - 2 msgs

Hi Thomas,
  Thanks for the quick response. It did work the way you said, but I had
to add a set of dummy characters to mislead the program. Is this the only
way to handle such files? The files I gave as a reference are newly
sequenced ones, i.e. the sequencing of these genomes was completed on 1st
June, so what do you say to this? I don't know exactly what the purpose of
the GI is in the GenBank format, but do you think this level of rigidity is
necessary for reading the GenBank format?
From Sarath

On Mon, 11 Jun 2001, Thomas Down wrote:

> On Mon, Jun 11, 2001 at 06:32:08PM +0200, Sarath wrote:
> >   Hello everyone
> >     I have been using the BioJava package for around a month, and today,
> > surprisingly, I ran into a strange situation: a simple program is not
> > able to compute the GC content from a file in the GenBank format. I
> > would be very glad if somebody could point out the bug in the program
> > that finds the GC content of the file (AE000783.gbk) from the URL
> >   ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Borrelia_burgdorferi/
> >
> >     I am pasting the source code here; I hope this is not an
> > inconvenience to anyone. The code is exactly the same as that given in
> > the tutorials. Probably there is a bug in BioJava... who knows? Or
> > should I blame the GenBank people?
> 
> Bug?  No, never... ;)
> 
> I've taken a quick look at this.  The problem is with
> the VERSION line of the file.  Most Genbank entries look
> like:
> 
>   VERSION     AE000784.1  GI:2690041
> 
> The BioJava parser is explicitly expecting to see two tokens
> following the VERSION keyword.
> 
> The file you are trying to read has a line like:
> 
> 
>   VERSION     AE000783
> 
> The only format documentation I can find for Genbank is at:
> 
>   ftp://ncbi.nlm.nih.gov/genbank/docs/
> 
> This concentrates on the feature tables, and doesn't seem
> to give a normative description of the headers.  However,
> the example given includes the two-token VERSION string,
> and this seems to be found in the vast majority of entries.
> 
> That said, we probably ought to take the `strict in what you
> produce, tolerant in what you accept' approach.  I'll leave
> the final decision to people who use the Genbank parser more
> regularly, though (Greg, are you listening?)
> 
> Is there a normative specification for the Genbank header
> lines anywhere?  If so, maybe it is worth complaining about
> that entry...
> 
>    Thomas.
> 
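
For illustration, here is a minimal sketch of the "tolerant in what you
accept" behaviour discussed above. It is plain Java and is not the actual
BioJava parser code: the class name VersionLine and its fields are
hypothetical, and the handling of the "GI:" prefix is an assumption based
only on the example VERSION lines quoted in this thread. It accepts a
VERSION line with or without the second token.

```java
import java.util.StringTokenizer;

/** Hypothetical holder for the fields of a GenBank VERSION line (not a BioJava class). */
final class VersionLine {
    final String accession; // e.g. "AE000784.1"
    final String gi;        // e.g. "2690041", or null when the GI token is absent

    private VersionLine(String accession, String gi) {
        this.accession = accession;
        this.gi = gi;
    }

    /** Parses a VERSION line, treating the GI token as optional. */
    static VersionLine parse(String line) {
        StringTokenizer tokens = new StringTokenizer(line);
        if (!tokens.hasMoreTokens() || !"VERSION".equals(tokens.nextToken())) {
            throw new IllegalArgumentException("Not a VERSION line: " + line);
        }
        if (!tokens.hasMoreTokens()) {
            throw new IllegalArgumentException("VERSION line has no accession: " + line);
        }
        String accession = tokens.nextToken();

        // Tolerant step: only read the GI token if it is actually there.
        String gi = null;
        if (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            // Assumption: the GI token looks like "GI:2690041"; strip the prefix if present.
            gi = token.startsWith("GI:") ? token.substring(3) : token;
        }
        return new VersionLine(accession, gi);
    }
}
```

With this approach, VersionLine.parse("VERSION     AE000783") yields an
accession and a null GI instead of failing, so the caller can decide
whether a missing GI is worth a warning.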

_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l