[BioRuby] Patch for Bug 18019.

Thu Apr 15 08:39:28 UTC 2010

>
> Because genbank and genpept format file downloaded from NCBI with entrez
> usually ends with double new line characters,
> the latter behavior is really desired.
>
> $ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
>  ff.each { |e| c += 1 }; p c' sequences.gb
> #==> 4
> I hope it becomes 3, as there are 3 entries.
> $ grep LOCUS sequences.gb
> LOCUS       A00002                   194 bp    DNA     linear   PAT 10-FEB-1993
> LOCUS       A00003                   194 bp    DNA     linear   PAT 10-FEB-1993
> LOCUS       X17276                   556 bp    DNA     linear   MAM 26-FEB-1992
>
> Actually, this file has an excess newline at the end of each entry.
> His patch will work in this case, even though it is not right, as you
> mentioned.
>
> Although no error is reported in this example because we don't do anything
> with the entry, accessing the last entry (the fourth in this case) will
> cause an error.
>

As I mentioned in my previous mail, the extra entry is caused by a trailing
"\n". Even that "\n" gets parsed into a Bio::GenBank object, and no errors
are raised. Here:

$ ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| puts e.entry_id }' sequences.gb
A00002
A00003
X17276
nil

$ ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| puts e.class }' sequences.gb
Bio::GenBank
Bio::GenBank
Bio::GenBank
Bio::GenBank
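
To illustrate where the fourth object comes from, here is a minimal sketch in
plain Ruby. It does not use BioRuby and is not how Bio::FlatFile is actually
implemented; it only assumes that entries are split on the "//" terminator
line. The sample data and the helper names (split_entries,
split_entries_skipping_blank) are made up for this example.

```ruby
require 'stringio'

# Made-up miniature of a GenBank-style stream: each entry ends with "//\n",
# and an extra "\n" follows every entry, as in the efetch download.
data = "LOCUS       A00002\n//\n\n" \
       "LOCUS       A00003\n//\n\n" \
       "LOCUS       X17276\n//\n\n"

# Hypothetical helper: split the stream on the entry terminator only.
def split_entries(io)
  io.each_line("//\n").to_a
end

# Hypothetical helper: same split, but drop whitespace-only chunks.
def split_entries_skipping_blank(io)
  io.each_line("//\n").reject { |chunk| chunk.strip.empty? }
end

raw   = split_entries(StringIO.new(data))
clean = split_entries_skipping_blank(StringIO.new(data))

puts raw.size      # number of chunks, including the trailing "\n"
puts clean.size    # number of chunks that contain an actual entry
p raw.last         # the leftover chunk after the last "//"
```

Under that assumption, the leftover "\n" after the last "//" becomes a chunk
of its own, which would explain the extra entry whose entry_id is nil.
Skipping whitespace-only chunks is one possible way to get the count down to 3.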

-- 
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642