[BioRuby] Patch for Bug 18019.
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Thu Apr 15 07:26:42 UTC 2010
Hi Goto-san,
> Splitting entries by using such delimiter is simple and the
> performance
> is well, but it can only work with correct data which should always be
> ended with the delimiter. Characters after the last delimiter in the
> file is regarded as a single entry because we don't want to lose data.
>
> The behavior can be changed, for example, when getting only white
> spaces and then the end of file without delimiter, it is ignored and
> treated as EOF with no entries.
Because GenBank and GenPept format files downloaded from NCBI via Entrez
usually end with two newline characters,
the latter behavior is really desirable.
$ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"
$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
ff.each { |e| c += 1 }; p c' sequences.gb
#==> 4
I would hope this gives 3, since there are 3 entries:
$ grep LOCUS sequences.gb
LOCUS A00002 194 bp DNA linear PAT 10-FEB-1993
LOCUS A00003 194 bp DNA linear PAT 10-FEB-1993
LOCUS X17276 556 bp DNA linear MAM 26-FEB-1992
Actually, this file has an extra newline at the end of each entry,
so his patch will work in this case, although it is not the right fix, as
you mentioned.
In this example no error is reported because we don't do anything with
the entries, but accessing the last entry (the fourth in this case) will
cause an error.
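To make the desired behavior concrete, here is a rough sketch in plain Ruby.
It is not the actual Bio::FlatFile code: split_entries and DELIMITER are
names made up for illustration, and it assumes every entry ends with a
"//" line. Text left after the last delimiter is kept as an entry only if it
contains something other than whitespace.

```ruby
# Hypothetical helper, not part of BioRuby: split flat-file text into
# entries on the "//" terminator line.
DELIMITER = "\n//\n"

def split_entries(text)
  entries = []
  rest = text
  while (idx = rest.index(DELIMITER))
    cut = idx + DELIMITER.length
    entries << rest[0, cut]   # entry including its terminator
    rest = rest[cut..-1]      # remaining text after the terminator
  end
  # Keep a non-blank remainder so no data is lost,
  # but treat a whitespace-only remainder as plain EOF.
  entries << rest unless rest.strip.empty?
  entries
end

# Three entries, each followed by an extra newline, as in sequences.gb.
sample = "LOCUS A00002\n//\n\nLOCUS A00003\n//\n\nLOCUS X17276\n//\n\n"
p split_entries(sample).size   # prints 3
```

With this rule the trailing newline is ignored, while a truncated last entry
without a delimiter would still be returned rather than silently dropped.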
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan