[BioRuby] Proposal: Bio::FastaFormat#each_entry
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Fri Jan 29 10:25:29 UTC 2010
Hi,
On Fri, 29 Jan 2010 15:46:15 +0900
"MISHIMA, Hiroyuki" <missy at be.to> wrote:
> Hi all,
>
> How about implementing the following methods?
>
> Bio::FastaFormat#each_entry
> Bio::FastaNumericFormat#each_entry
>
> The following is a sample code to generate a FASTQ string from a FASTA
> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>
> I am afraid that simpler or easier ways are already existed in BioRuby...
I think mixing a single-entry parser with a multiple-entry iterator
would cause confusion, and it is not a good approach.
For most parser classes in BioRuby, the expected data source is a
String containing a single entry. In addition, for an IO with
possibly multiple entries, Bio::FlatFile is the front end that
detects the data type, splits the input into entries, and calls
the assigned parser class on each one.
For a String containing multiple entries, wrapping it in StringIO and
then passing it to Bio::FlatFile is the easiest way, although indirect.
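To illustrate the division of labour described above, here is a
minimal plain-Ruby sketch. It is not BioRuby code and not how
Bio::FlatFile is implemented; each_raw_fasta_entry and
SimpleFastaEntry are hypothetical names made up for this example.
The splitter only cuts an IO into entries, and the parser only ever
sees the text of one entry.

```ruby
require 'stringio'

# Hypothetical splitter (not part of BioRuby): reads an IO-like object
# line by line and yields the raw text of one FASTA entry at a time.
def each_raw_fasta_entry(io)
  buffer = nil
  io.each_line do |line|
    if line.start_with?('>')
      yield buffer if buffer   # a new header closes the previous entry
      buffer = line.dup
    elsif buffer
      buffer << line           # sequence line of the current entry
    end
  end
  yield buffer if buffer       # last entry in the stream
end

# Hypothetical single-entry parser: it receives one entry's text only.
SimpleFastaEntry = Struct.new(:definition, :sequence) do
  def self.parse(text)
    header, *body = text.lines.map(&:chomp)
    new(header.sub(/\A>/, ''), body.join)
  end
end

# A String with multiple entries is wrapped in StringIO so that the
# splitter can treat it like any other IO.
multi = ">id1\nACGT\nTT\n>id2\nGGCC\n"
entries = []
each_raw_fasta_entry(StringIO.new(multi)) do |raw|
  entries << SimpleFastaEntry.parse(raw)
end
entries.each { |e| puts "#{e.definition}: #{e.sequence}" }
```

Keeping the two roles separate means the parser class never has to
know whether its input came from a file, a pipe, or a String.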
Recently, many efficient memory-based data transfer methods have
become available, e.g. memcached, IPC shared memory, and the mmap(2)
system call. I'm now thinking about how to treat such data efficiently.
Below is an example using StringIO and Bio::FlatFile.
#------------------------------------------------
require 'stringio'
require 'bio'
# When copying and pasting this script, remove the "> " at the
# head of each line.
> fasta = <<EOS
> >FXQB1I00000001
> TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
> >FXQB1I00000002
> CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
> EOS
>
> qual = <<EOS
> >FXQB1I00000001
> 30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30
> 42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
> >FXQB1I00000002
> 30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34
> 22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30
> 27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
> EOS
ff_fasta = Bio::FlatFile.open(StringIO.new(fasta))
ff_qual = Bio::FlatFile.open(StringIO.new(qual))
while entry_fasta = ff_fasta.next_entry
seq = entry_fasta.to_biosequence
seq.quality_score_type = :phred
seq.quality_scores = ff_qual.next_entry.data
puts seq.output(:fastq, :title => entry_fasta.definition)
end
#------------------------------------------------
> enum_fasta = Bio::FastaFormat.new(fasta).each_entry
> enum_qual = Bio::FastaNumericFormat.new(qual).each_entry
>
> loop do
> fastq = Bio::Sequence.adapter(enum_fasta.next,
> Bio::Sequence::Adapter::Fastq)
> fastq.quality_score_type = :phred
> fastq.quality_scores = enum_qual.next.data
> puts fastq.output(:fastq)
> end
Bio::Sequence.adapter is for internal use within the BioRuby library
and normally should not be used in user scripts. In addition,
using Adapter::Fastq for Bio::FastaFormat data is a mismatch.
In this case, use Bio::FastaFormat#to_biosequence.
>
> --
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences
Thanks,
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org