[BioRuby] Patch for Bug 18019.
Anurag Priyam
anurag08priyam at gmail.com
Thu Apr 15 08:39:28 UTC 2010
>
> Because GenBank and GenPept format files downloaded from NCBI with Entrez
> usually end with double newline characters,
> the latter behavior is really desired.
>
> $ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
> ff.each { |e| c += 1 }; p c' sequences.gb
> #==> 4
> I hope it becomes 3, as there are 3 entries.
> $ grep LOCUS sequences.gb
> LOCUS A00002 194 bp DNA linear PAT 10-FEB-1993
> LOCUS A00003 194 bp DNA linear PAT 10-FEB-1993
> LOCUS X17276 556 bp DNA linear MAM 26-FEB-1992
>
> Actually, this file has an excess newline at the end of each entry.
> And his patch will work in this case, although it is not right, as you
> mentioned.
>
> Although in this example no error is reported because we don't do anything
> with the entry, accessing the last entry (the fourth in this case) will
> cause an error.
>
As I mentioned in my previous mail, the extra entry is caused by a trailing
"\n". Even that "\n" gets parsed into a Bio::GenBank object, and no errors
are raised. Here:
$ ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| puts e.entry_id }' sequences.gb
A00002
A00003
X17276
nil
$ ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| puts e.class }' sequences.gb
Bio::GenBank
Bio::GenBank
Bio::GenBank
Bio::GenBank
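
To illustrate where the fourth record comes from, here is a minimal sketch in
plain Ruby that does not use BioRuby at all. It assumes the entry terminator is
"\n//\n" and that each entry is followed by one extra blank line, as in the
efetch output above. The split_entries helper and its skip_blank option are
hypothetical names made up for this example; they are not the FlatFile
implementation or the patch under discussion.

```ruby
require 'stringio'

# Assumed GenBank entry terminator (a line containing only "//").
DELIMITER = "\n//\n"

# Hypothetical helper: read delimiter-terminated chunks from an IO.
# With skip_blank: true, chunks containing only whitespace are dropped.
def split_entries(io, skip_blank: false)
  entries = []
  while (chunk = io.gets(DELIMITER))
    next if skip_blank && chunk.strip.empty?
    entries << chunk
  end
  entries
end

# Three tiny fake entries, each followed by an extra blank line.
data = %w[A00002 A00003 X17276].map { |id| "LOCUS       #{id}\n//\n\n" }.join

naive    = split_entries(StringIO.new(data))
filtered = split_entries(StringIO.new(data), skip_blank: true)

p naive.size      # the leftover "\n" after the last "//" counts as an entry
p naive.last      # that leftover chunk
p filtered.size   # whitespace-only chunks skipped
```

The leftover chunk holds no LOCUS line, which would be consistent with
entry_id returning nil for the fourth object.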
--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642