[Bioperl-l] Bio::SearchIO::Writer::TextResultWriter is buggy

Lee Katz lskatz at gmail.com
Thu Mar 15 17:26:30 EDT 2012


Splitting the blast output file, enquing the file, and then adding the
header works!

In the Main thread:

  logmsg "Splitting the blast output into chunks";
  system("csplit -s -f '$$settings{tempdir}/xx' -b '%02d.bls'
$$settings{blastfile} /Query=/ '{*}'");
  die "Problem with splitting the blast output into chunks: $!" if $?;
  # remove xx00 from the chunk array and add it to each chunk
  $$settings{blastHeader}=`cat $$settings{tempdir}/xx00.bls`;
  die "Could not get blast header from file because $!" if $?;
  unlink("$$settings{tempdir}/xx00.bls"); # remove this file before it's
caught by glob()
  my @chunk=glob("$$settings{tempdir}/xx*.bls");


In the threads:

  while(defined(my $resultFile=$Q->dequeue)){
    my $completeBlastContents=`(echo '$$settings{blastHeader}' && cat
$resultFile)`;
    open(BLASTIN, "<", \$completeBlastContents) or die "Could not open
string for reading: $!";
    my $searchIn=Bio::SearchIO->new(-fh=>\*BLASTIN,-format=>"blast");


On Wed, Mar 14, 2012 at 12:51 PM, Lee Katz <lskatz at gmail.com> wrote:

> I haven't gotten too far, but the following code might be appropriate for
> splitting a human-readable blast.  The output files should be appropriate
> for passing to threads.  I'll give an update tomorrow when I have more time.
>
>   logmsg "Splitting the blast output into chunks";
>   system("csplit -s -f '$$settings{tempdir}/xx' -b '%02d.bls'
> $$settings{blastfile} /Query=/ '{*}'");
>   die "Problem with splitting the blast output into chunks: $!" if $?;
>   # remove xx00 from the chunk array and add it to each chunk
>   for my $chunk(glob("$$settings{tempdir}/xx*.bls")){
>     next if($chunk=~/xx00.bls/);
>     my $newChunk=`cat $$settings{tempdir}/xx00.bls $chunk`;
>     system("cat $$settings{tempdir}/xx00.bls $chunk >
> $$settings{tempdir}/chunktmp.bls");
>     system("cp $$settings{tempdir}/chunktmp.bls $chunk");
>     die "Error moving temp blast file" if $?;
>   }
>
>
>
> On Wed, Mar 14, 2012 at 11:21 AM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
>
>> Can you pass a file handle?  SearchIO accepts them using '-fh', but I
>> have no idea if this would work.  See:
>>
>>
>> http://bytes.com/topic/perl/answers/497453-howto-share-filehandles-between-threads
>>
>> The problem is whether this will be at the specific file point you need
>> or whether the file pointer will be reset to the beginning.  According to
>> the last answer there, if the variable is sharing scope it should work, but
>> I haven't tried it out myself to be honest.  If you try that out let us
>> know, I would be interested to see if it works.
>>
>> Re: splitting the file, this is a little tricky as plain text output has
>> changed.  Older versions of BLAST simply concatenate output files, so the
>> header for the file is repeated (lines up to 'Query').  Latter version
>> leave off the repeated header.  You could simply split the file up based on
>> the query using a regex, I believe the result object should still be
>> generated, but the header contains information that you may want, such as
>> BLAST version, etc.
>>
>> chris
>>
>> On Mar 14, 2012, at 9:35 AM, Lee Katz wrote:
>>
>> > I just want to clarify: I have an already existing blast output.  Is
>> there
>> > a non-buggy way to split it?  It is in human-readable text form (-m 0).
>> >
>> > On Tue, Mar 13, 2012 at 6:06 PM, Lee Katz <lskatz at gmail.com> wrote:
>> >
>> >> Hi, I am separating a blast output file into individual results, so
>> that I
>> >> can multithread the reading of the results.  I cannot pass a result
>> object
>> >> through Perl threads because it contains code, which is not sharable
>> via
>> >> threads::share (sharing is used internally in
>> Thread::Queue)--therefore I
>> >> must pass a sub-file.  My strategy is to read the whole file into
>> >> Bio::SearchIO and then write the result objects to a file, so that a
>> thread
>> >> can read the file.  The thread would thus read one file at a time
>> >> containing one query and all its results.
>> >>
>> >> Reading the original file works, but then outputting the blast file is
>> >> buggy.  The last line of the HSP is empty and has bad coordinates.  I
>> have
>> >> an example, with an error when trying to read it again with SearchIO,
>> and
>> >> its fasta file below.
>> >>
>> >> Any help debugging?  Maybe I just need to update BioPerl since I
>> installed
>> >> it around several months ago, maybe a year ago?   Thanks.
>> >>
>> >>
>> >> MSG: In sequence lcl|R009125 residue count gives end value 341.
>> >> Overriding value [340] with value 341 for Bio::LocatableSeq::end().
>> >>
>> >>
>> ANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFF-----SLSDVMLL-----YRYVIINDFE-------INEGKYF----FAVVIVFFKIIGFPLFFCVLSAVLPTLVQTKFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIFSQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYIEQEGHFETKSRRRELHIEILSEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLIETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFVDFEHLDEVLRLIVWLEQVEN1
>> >> ---------------------------------------------------
>> >>
>> >>
>> >>
>> >>> lcl|R009125 (gi:13449103) spa40 (pWR501_p164) - Type III secretion
>> >> protein [Shigella flexneri str. M90T (serotype 5a) plasmid pWR501]
>> >>         Length = 342
>> >>
>> >> Score = 79.3 bits (194), Expect = 2e-15
>> >> Identities = 87/360 (24%), Positives = 175/360 (48%), Gaps = 35/360
>> (9%)
>> >>
>> >>
>> >> Query: 4
>> GDKTEQASSQKLDKARKQGQIARSKEFSSAIMLMV----CIGYFYANADSLSGHLMQLFE 59
>> >>            +KTE+ + +KL  A K+GQ  + K+ ++ ++++V     I +F     SLS  ++
>> >> Sbjct: 2
>> ANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFF-----SLSDVMLL--- 53
>> >>
>> >> Query: 60
>>  VSFRFTAESQSDHDHILHLITQSLYLMIKVFAPLIIF-QFIASAIATCLLGGF------- 111
>> >>             +R+   +  +       I +  Y     FA +I+F + I   +  C+L
>> >> Sbjct: 54
>>  --YRYVIINDFE-------INEGKYF----FAVVIVFFKIIGFPLFFCVLSAVLPTLVQT 100
>> >>
>> >> Query: 112
>> HFNLSLLAPK--FSKINPLSGIKRIFSKQTLVEFLKNVAKISLIFALLYYMISTNFHMIG 169
>> >>            F L+  A K  FS +NP+ G+K+IFS +T+ EF K++  + ++    Y+    +  +I
>> >> Sbjct: 101
>> KFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIF 160
>> >>
>> >> Query: 170
>> SLVRASFQTTIHFSLQYVLELLGMLILIAILFGVIDIPYQKMTFGTQMKMTkqevkqehk 229
>> >>           S V +S         +   +++   +  +IL  ++D   + + +   M M KQE+K+E+
>> >> Sbjct: 161
>> SQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYI 220
>> >>
>> >> Query: 230
>> eqeGRPEIKSRIRQIQMQNARRSASQTVPTADVVLMNPTHFAVALKYDLTKAEAPFVVAK 289
>> >>           EQEG  E KSR R++ ++         +  + +V+MNPTH A+ + ++   A APF+
>> >> Sbjct: 221
>> EQEGHFETKSRRRELHIEILSEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLI 280
>> >>
>> >> Query: 290
>> GKNEVAFYIRTLAEQHQVEVLVVPEITRSIYHTTQLNQMIPNQLFLAVAQILKYVQQLKS 349
>> >>             N+ A  +R  A +  +  +   ++ R +Y T      +  +    V +++ +++Q+++
>> >> Sbjct: 281
>> ETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFVDFEHLDEVLRLIVWLEQVEN 340
>> >>
>> >> Query: 350  349
>> >>
>> >> Sbjct: 341  340
>> >>
>> >> And the whole fasta entry is:
>> >>> lcl|R009125 (gi:13449103) spa40 (pWR501_p164) - Type III secretion
>> >> protein [Shigella flexneri str. M90T (serotype 5a) plasmid pWR501]
>> >>
>> >>
>> MANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFFSLSDVMLLYRYVIINDFEINEGKYFFAVVIVFFKI
>> >>
>> >>
>> IGFPLFFCVLSAVLPTLVQTKFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIF
>> >>
>> >>
>> SQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYIEQEGHFETKSRRRELHIEIL
>> >>
>> >>
>> SEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLIETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFV
>> >> DFEHLDEVLRLIVWLEQVENTH
>> >>
>> >>
>> >>
>> >> --
>> >> Lee Katz, Ph.D.
>> >>
>> >
>> >
>> >
>> > --
>> > Lee Katz, Ph.D.
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
>
> --
> Lee Katz, Ph.D.
>



-- 
Lee Katz, Ph.D.



More information about the Bioperl-l mailing list