[BioRuby] Fwd: Disk cash on the parse genes
Toshiaki Katayama
ktym at hgc.jp
Mon May 23 03:28:08 UTC 2011
Dear Endoh-san,
> Using your code, I can overcome the problem.
Good!
> "join(M52614.1:1..1456,5216..5823), the code returned a error.
External reference should be detected by the Bio::Location class
and the ID will be stored in an instance variable @xref_id,
however, how to deal with it is up to users, so you need to implement
some code to fetch external entry (in this case @xref_id="M52614.1") from
available services (local DB or web service etc.) and extract
the sub-sequence from the entry.
Please take a look at pattern (G) in the documentation.
http://bioruby.open-bio.org/rdoc/classes/Bio/Locations.html
Unfortunately, I've got an unexplained "Exception" error from NCBI
when retrieving http://www.ncbi.nlm.nih.gov/nuccore?term=M52614.1
so, I'll use "join(U75473.1:1..293,1..216)" found in a GenBank entry
SMMFD02 (gbbct65.seq) for example.
# obtain a genbank record
bioruby> entry = getobj("genbank:SMMFD02")
or
bioruby> entry = Bio::GenBank.new(open("http://togows.dbcls.jp/entry/ncbi-genbank/SMMFD02").read)
# cache whole sequence as we learnt in this thread :-)
bioruby> naseq = entry.naseq
# pick up "gene" features only
bioruby> genes = entry.features.select {|x| x.feature == "gene" }
==> [#<Bio::Feature:0x102340870
@feature="gene",
@position="join(U75473.1:1..293,1..216)",
@qualifiers=
[#<Bio::Feature::Qualifier:0x102340280 @qualifier="gene", @value="mfd">]>]
# example to handle external references in a given position
bioruby> genes.each do |gene|
locations = Bio::Locations.new(gene.position)
locations.each do |location|
if xref = location.xref_id
xref_entry = open("http://togows.dbcls.jp/entry/ncbi-genbank/#{xref}").read
location.sequence = Bio::GenBank.new(xref_entry).naseq.subseq(location.from, location.to)
end
end
gene.position = locations.to_s # (*1)
puts naseq.splice(gene.position) # (*2)
end
(*1) will generate the following string
join(replace(U75473.1:1..293,"gtcttcttgttggtgatgttggttttggaaaaacggaagtagcgatgagagctgcttttaaagcagttaatgatgataaacaagttgctgttttggtgccaacaacagttcttgctcaacagcactataatacttttaaggagcgctttgaaaattttcctgtcaatgttgccatgatgagtcgttttaaaaccaagactgaacagtctgaaacgttaactaaattagctaagggacaggttgatatcattattggaacacatcgtctactttctaaagatgttacgtttaaa"),1..216)
(*2) will return 293 + 216 = 509 bp sequence
gtcttcttgttggtgatgttggttttggaaaaacggaagtagcgatgagagctgcttttaaagcagttaatgatgataaacaagttgctgttttggtgccaacaacagttcttgctcaacagcactataatacttttaaggagcgctttgaaaattttcctgtcaatgttgccatgatgagtcgttttaaaaccaagactgaacagtctgaaacgttaactaaattagctaagggacaggttgatatcattattggaacacatcgtctactttctaaagatgttacgtttaaaggggttaaacacaaggaaacattgaaagaattaaaaactaaggttgatgtcttgaccttgacagcaactcctattccacggacattacatatgtctatgcttggtatacgagatttatcagttattgaaacacctccaagtaatcgttaccctgtccagacttatgttatggaaacaaatgcaagtgtcattcgtgaagctattatgcgtgaaatt
During this trial, I found a bug in the Bio::Sequence#splice method.
bio/sequence/common.rb:
def splice(position)
unless position.is_a?(Locations) then
position = Locations.new(position)
end
s = ''
position.each do |location|
if location.sequence
s << location.sequence
else # <----- (*3)
exon = self.subseq(location.from, location.to)
begin
exon.complement! if location.strand < 0
rescue NameError
end
s << exon
end
end
return self.class.new(s)
end
alias splicing splice
We need to fix this else block (*3) to mind if @xref_id exists or not.
Currently, "join(U75473.1:1..293,1..216)" will be treated as
"join(1..293,1..216)" and, obviously, it is not feasible.
Toshiaki
On 2011/05/21, at 14:33, 遠藤大二 wrote:
> Dear Katayama-san
>
> I am very very grateful to your suggestion. I have been struggled on
> this problem for 6 months.
> Using your code, I can overcome the problem.
>
> But, only one point the code stopped.
> If the feature.position refer to the other entry such as
> "join(M52614.1:1..1456,5216..5823), the code returned a error.
> So I added a line below.
>
> next if position =~ /[A-Z]+\d+\W*\d*\:/
>
> The inserting code now working.
> I attached the modified code.
> Thanks again,
>
> Daiji Endoh
> ************************************************************************
> Dear Endoh-san,
>
> Thank you for pointing this problem out.
>
> I tried to parse gbbct12.seq file with the example code based on
> our tutorial at http://bioruby.open-bio.org/wiki/Tutorial and found
> that the actual problem is in the multiple calling of the gb.naseq method.
>
> The method is defined as shown in below and which doesn't cache
> the generated Bio::Sequence::NA object, therefore, it will take
> long time if called multiple times, especially for a long sequence.
>
> bio/db/genbank/genbank.rb:
> def seq
> unless @data['SEQUENCE']
> origin
> end
> Bio::Sequence::NA.new(@data['SEQUENCE'])
> end
> alias naseq seq
>
> If I store the object outside of the loop of feature manipulation,
> it became much faster.
>
> % ruby gbparse.rb gbbct12.seq > gbbct12.out 2> gbbct12.err
> Parsed 16125 entries in 1645.838824 sec.
>
> % ruby gbparse_new.rb gbbct12.seq > gbbct12.out_new 2> gbbct12.err_new
> Parsed 16125 entries in 39.012607 sec.
>
> Based on this observation, could you check the algorithm of your code?
>
> Regards,
> Toshiaki Katayama
> ****************************************************************************************
>
>
> On 2011/05/19, at 10:04, 遠藤大二 wrote:
>
>> Dear All
>>
>> I often download whole genbank data from bio at mirror ( such as
>> gbbct12.seq ) and parse them.
>> But recently, parsing the whole data became to be difficult. On some
>> some step, the program need a long time to select nucleic acid
>> sequences of genes or transcripts. It seems that selection of spliced
>> or partial sequences from a long (genome) nucleic acid sequence using
>> feature data.
>>
>> Anyone have strategies or methods avoiding these heavy steps ?
>>
>> Daiji Endoh
>> Rakuno Gakuen University
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
>
>
>
> --
> 酪農学園大学 獣医学部 放射線学教室
> 遠藤大二
> 〒069-8501 北海道江別市文京台緑町582
> Tel: 011-388-4847
> Fax:011-387-5890
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
More information about the BioRuby
mailing list