[Bioperl-l] Bio::SearchIO::Writer::TextResultWriter is buggy
Lee Katz
lskatz at gmail.com
Thu Mar 15 21:26:30 UTC 2012
Splitting the blast output file, enquing the file, and then adding the
header works!
In the Main thread:
logmsg "Splitting the blast output into chunks";
system("csplit -s -f '$$settings{tempdir}/xx' -b '%02d.bls'
$$settings{blastfile} /Query=/ '{*}'");
die "Problem with splitting the blast output into chunks: $!" if $?;
# remove xx00 from the chunk array and add it to each chunk
$$settings{blastHeader}=`cat $$settings{tempdir}/xx00.bls`;
die "Could not get blast header from file because $!" if $?;
unlink("$$settings{tempdir}/xx00.bls"); # remove this file before it's
caught by glob()
my @chunk=glob("$$settings{tempdir}/xx*.bls");
In the threads:
while(defined(my $resultFile=$Q->dequeue)){
my $completeBlastContents=`(echo '$$settings{blastHeader}' && cat
$resultFile)`;
open(BLASTIN, "<", \$completeBlastContents) or die "Could not open
string for reading: $!";
my $searchIn=Bio::SearchIO->new(-fh=>\*BLASTIN,-format=>"blast");
On Wed, Mar 14, 2012 at 12:51 PM, Lee Katz <lskatz at gmail.com> wrote:
> I haven't gotten too far, but the following code might be appropriate for
> splitting a human-readable blast. The output files should be appropriate
> for passing to threads. I'll give an update tomorrow when I have more time.
>
> logmsg "Splitting the blast output into chunks";
> system("csplit -s -f '$$settings{tempdir}/xx' -b '%02d.bls'
> $$settings{blastfile} /Query=/ '{*}'");
> die "Problem with splitting the blast output into chunks: $!" if $?;
> # remove xx00 from the chunk array and add it to each chunk
> for my $chunk(glob("$$settings{tempdir}/xx*.bls")){
> next if($chunk=~/xx00.bls/);
> my $newChunk=`cat $$settings{tempdir}/xx00.bls $chunk`;
> system("cat $$settings{tempdir}/xx00.bls $chunk >
> $$settings{tempdir}/chunktmp.bls");
> system("cp $$settings{tempdir}/chunktmp.bls $chunk");
> die "Error moving temp blast file" if $?;
> }
>
>
>
> On Wed, Mar 14, 2012 at 11:21 AM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
>
>> Can you pass a file handle? SearchIO accepts them using '-fh', but I
>> have no idea if this would work. See:
>>
>>
>> http://bytes.com/topic/perl/answers/497453-howto-share-filehandles-between-threads
>>
>> The problem is whether this will be at the specific file point you need
>> or whether the file pointer will be reset to the beginning. According to
>> the last answer there, if the variable is sharing scope it should work, but
>> I haven't tried it out myself to be honest. If you try that out let us
>> know, I would be interested to see if it works.
>>
>> Re: splitting the file, this is a little tricky as plain text output has
>> changed. Older versions of BLAST simply concatenate output files, so the
>> header for the file is repeated (lines up to 'Query'). Latter version
>> leave off the repeated header. You could simply split the file up based on
>> the query using a regex, I believe the result object should still be
>> generated, but the header contains information that you may want, such as
>> BLAST version, etc.
>>
>> chris
>>
>> On Mar 14, 2012, at 9:35 AM, Lee Katz wrote:
>>
>> > I just want to clarify: I have an already existing blast output. Is
>> there
>> > a non-buggy way to split it? It is in human-readable text form (-m 0).
>> >
>> > On Tue, Mar 13, 2012 at 6:06 PM, Lee Katz <lskatz at gmail.com> wrote:
>> >
>> >> Hi, I am separating a blast output file into individual results, so
>> that I
>> >> can multithread the reading of the results. I cannot pass a result
>> object
>> >> through Perl threads because it contains code, which is not sharable
>> via
>> >> threads::share (sharing is used internally in
>> Thread::Queue)--therefore I
>> >> must pass a sub-file. My strategy is to read the whole file into
>> >> Bio::SearchIO and then write the result objects to a file, so that a
>> thread
>> >> can read the file. The thread would thus read one file at a time
>> >> containing one query and all its results.
>> >>
>> >> Reading the original file works, but then outputting the blast file is
>> >> buggy. The last line of the HSP is empty and has bad coordinates. I
>> have
>> >> an example, with an error when trying to read it again with SearchIO,
>> and
>> >> its fasta file below.
>> >>
>> >> Any help debugging? Maybe I just need to update BioPerl since I
>> installed
>> >> it around several months ago, maybe a year ago? Thanks.
>> >>
>> >>
>> >> MSG: In sequence lcl|R009125 residue count gives end value 341.
>> >> Overriding value [340] with value 341 for Bio::LocatableSeq::end().
>> >>
>> >>
>> ANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFF-----SLSDVMLL-----YRYVIINDFE-------INEGKYF----FAVVIVFFKIIGFPLFFCVLSAVLPTLVQTKFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIFSQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYIEQEGHFETKSRRRELHIEILSEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLIETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFVDFEHLDEVLRLIVWLEQVEN1
>> >> ---------------------------------------------------
>> >>
>> >>
>> >>
>> >>> lcl|R009125 (gi:13449103) spa40 (pWR501_p164) - Type III secretion
>> >> protein [Shigella flexneri str. M90T (serotype 5a) plasmid pWR501]
>> >> Length = 342
>> >>
>> >> Score = 79.3 bits (194), Expect = 2e-15
>> >> Identities = 87/360 (24%), Positives = 175/360 (48%), Gaps = 35/360
>> (9%)
>> >>
>> >>
>> >> Query: 4
>> GDKTEQASSQKLDKARKQGQIARSKEFSSAIMLMV----CIGYFYANADSLSGHLMQLFE 59
>> >> +KTE+ + +KL A K+GQ + K+ ++ ++++V I +F SLS ++
>> >> Sbjct: 2
>> ANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFF-----SLSDVMLL--- 53
>> >>
>> >> Query: 60
>> VSFRFTAESQSDHDHILHLITQSLYLMIKVFAPLIIF-QFIASAIATCLLGGF------- 111
>> >> +R+ + + I + Y FA +I+F + I + C+L
>> >> Sbjct: 54
>> --YRYVIINDFE-------INEGKYF----FAVVIVFFKIIGFPLFFCVLSAVLPTLVQT 100
>> >>
>> >> Query: 112
>> HFNLSLLAPK--FSKINPLSGIKRIFSKQTLVEFLKNVAKISLIFALLYYMISTNFHMIG 169
>> >> F L+ A K FS +NP+ G+K+IFS +T+ EF K++ + ++ Y+ + +I
>> >> Sbjct: 101
>> KFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIF 160
>> >>
>> >> Query: 170
>> SLVRASFQTTIHFSLQYVLELLGMLILIAILFGVIDIPYQKMTFGTQMKMTkqevkqehk 229
>> >> S V +S + +++ + +IL ++D + + + M M KQE+K+E+
>> >> Sbjct: 161
>> SQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYI 220
>> >>
>> >> Query: 230
>> eqeGRPEIKSRIRQIQMQNARRSASQTVPTADVVLMNPTHFAVALKYDLTKAEAPFVVAK 289
>> >> EQEG E KSR R++ ++ + + +V+MNPTH A+ + ++ A APF+
>> >> Sbjct: 221
>> EQEGHFETKSRRRELHIEILSEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLI 280
>> >>
>> >> Query: 290
>> GKNEVAFYIRTLAEQHQVEVLVVPEITRSIYHTTQLNQMIPNQLFLAVAQILKYVQQLKS 349
>> >> N+ A +R A + + + ++ R +Y T + + V +++ +++Q+++
>> >> Sbjct: 281
>> ETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFVDFEHLDEVLRLIVWLEQVEN 340
>> >>
>> >> Query: 350 349
>> >>
>> >> Sbjct: 341 340
>> >>
>> >> And the whole fasta entry is:
>> >>> lcl|R009125 (gi:13449103) spa40 (pWR501_p164) - Type III secretion
>> >> protein [Shigella flexneri str. M90T (serotype 5a) plasmid pWR501]
>> >>
>> >>
>> MANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFFSLSDVMLLYRYVIINDFEINEGKYFFAVVIVFFKI
>> >>
>> >>
>> IGFPLFFCVLSAVLPTLVQTKFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIF
>> >>
>> >>
>> SQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYIEQEGHFETKSRRRELHIEIL
>> >>
>> >>
>> SEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLIETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFV
>> >> DFEHLDEVLRLIVWLEQVENTH
>> >>
>> >>
>> >>
>> >> --
>> >> Lee Katz, Ph.D.
>> >>
>> >
>> >
>> >
>> > --
>> > Lee Katz, Ph.D.
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
>
> --
> Lee Katz, Ph.D.
>
--
Lee Katz, Ph.D.
More information about the Bioperl-l
mailing list