[Bioperl-l] Bio::SearchIO::Writer::TextResultWriter is buggy
Lee Katz
lskatz at gmail.com
Wed Mar 14 12:51:03 EDT 2012
I haven't gotten too far, but the following code might be appropriate for
splitting a human-readable blast. The output files should be appropriate
for passing to threads. I'll give an update tomorrow when I have more time.
logmsg "Splitting the blast output into chunks";
system("csplit -s -f '$$settings{tempdir}/xx' -b '%02d.bls'
$$settings{blastfile} /Query=/ '{*}'");
die "Problem with splitting the blast output into chunks: $!" if $?;
# remove xx00 from the chunk array and add it to each chunk
for my $chunk(glob("$$settings{tempdir}/xx*.bls")){
next if($chunk=~/xx00.bls/);
my $newChunk=`cat $$settings{tempdir}/xx00.bls $chunk`;
system("cat $$settings{tempdir}/xx00.bls $chunk >
$$settings{tempdir}/chunktmp.bls");
system("cp $$settings{tempdir}/chunktmp.bls $chunk");
die "Error moving temp blast file" if $?;
}
On Wed, Mar 14, 2012 at 11:21 AM, Fields, Christopher J <
cjfields at illinois.edu> wrote:
> Can you pass a file handle? SearchIO accepts them using '-fh', but I have
> no idea if this would work. See:
>
>
> http://bytes.com/topic/perl/answers/497453-howto-share-filehandles-between-threads
>
> The problem is whether this will be at the specific file point you need or
> whether the file pointer will be reset to the beginning. According to the
> last answer there, if the variable is sharing scope it should work, but I
> haven't tried it out myself to be honest. If you try that out let us know,
> I would be interested to see if it works.
>
> Re: splitting the file, this is a little tricky as plain text output has
> changed. Older versions of BLAST simply concatenate output files, so the
> header for the file is repeated (lines up to 'Query'). Latter version
> leave off the repeated header. You could simply split the file up based on
> the query using a regex, I believe the result object should still be
> generated, but the header contains information that you may want, such as
> BLAST version, etc.
>
> chris
>
> On Mar 14, 2012, at 9:35 AM, Lee Katz wrote:
>
> > I just want to clarify: I have an already existing blast output. Is
> there
> > a non-buggy way to split it? It is in human-readable text form (-m 0).
> >
> > On Tue, Mar 13, 2012 at 6:06 PM, Lee Katz <lskatz at gmail.com> wrote:
> >
> >> Hi, I am separating a blast output file into individual results, so
> that I
> >> can multithread the reading of the results. I cannot pass a result
> object
> >> through Perl threads because it contains code, which is not sharable via
> >> threads::share (sharing is used internally in Thread::Queue)--therefore
> I
> >> must pass a sub-file. My strategy is to read the whole file into
> >> Bio::SearchIO and then write the result objects to a file, so that a
> thread
> >> can read the file. The thread would thus read one file at a time
> >> containing one query and all its results.
> >>
> >> Reading the original file works, but then outputting the blast file is
> >> buggy. The last line of the HSP is empty and has bad coordinates. I
> have
> >> an example, with an error when trying to read it again with SearchIO,
> and
> >> its fasta file below.
> >>
> >> Any help debugging? Maybe I just need to update BioPerl since I
> installed
> >> it around several months ago, maybe a year ago? Thanks.
> >>
> >>
> >> MSG: In sequence lcl|R009125 residue count gives end value 341.
> >> Overriding value [340] with value 341 for Bio::LocatableSeq::end().
> >>
> >>
> ANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFF-----SLSDVMLL-----YRYVIINDFE-------INEGKYF----FAVVIVFFKIIGFPLFFCVLSAVLPTLVQTKFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIFSQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYIEQEGHFETKSRRRELHIEILSEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLIETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFVDFEHLDEVLRLIVWLEQVEN1
> >> ---------------------------------------------------
> >>
> >>
> >>
> >>> lcl|R009125 (gi:13449103) spa40 (pWR501_p164) - Type III secretion
> >> protein [Shigella flexneri str. M90T (serotype 5a) plasmid pWR501]
> >> Length = 342
> >>
> >> Score = 79.3 bits (194), Expect = 2e-15
> >> Identities = 87/360 (24%), Positives = 175/360 (48%), Gaps = 35/360 (9%)
> >>
> >>
> >> Query: 4 GDKTEQASSQKLDKARKQGQIARSKEFSSAIMLMV----CIGYFYANADSLSGHLMQLFE
> 59
> >> +KTE+ + +KL A K+GQ + K+ ++ ++++V I +F SLS ++
> >> Sbjct: 2 ANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFF-----SLSDVMLL---
> 53
> >>
> >> Query: 60 VSFRFTAESQSDHDHILHLITQSLYLMIKVFAPLIIF-QFIASAIATCLLGGF-------
> 111
> >> +R+ + + I + Y FA +I+F + I + C+L
> >> Sbjct: 54 --YRYVIINDFE-------INEGKYF----FAVVIVFFKIIGFPLFFCVLSAVLPTLVQT
> 100
> >>
> >> Query: 112 HFNLSLLAPK--FSKINPLSGIKRIFSKQTLVEFLKNVAKISLIFALLYYMISTNFHMIG
> 169
> >> F L+ A K FS +NP+ G+K+IFS +T+ EF K++ + ++ Y+ + +I
> >> Sbjct: 101 KFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIF
> 160
> >>
> >> Query: 170 SLVRASFQTTIHFSLQYVLELLGMLILIAILFGVIDIPYQKMTFGTQMKMTkqevkqehk
> 229
> >> S V +S + +++ + +IL ++D + + + M M KQE+K+E+
> >> Sbjct: 161 SQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYI
> 220
> >>
> >> Query: 230 eqeGRPEIKSRIRQIQMQNARRSASQTVPTADVVLMNPTHFAVALKYDLTKAEAPFVVAK
> 289
> >> EQEG E KSR R++ ++ + + +V+MNPTH A+ + ++ A APF+
> >> Sbjct: 221 EQEGHFETKSRRRELHIEILSEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLI
> 280
> >>
> >> Query: 290 GKNEVAFYIRTLAEQHQVEVLVVPEITRSIYHTTQLNQMIPNQLFLAVAQILKYVQQLKS
> 349
> >> N+ A +R A + + + ++ R +Y T + + V +++ +++Q+++
> >> Sbjct: 281 ETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFVDFEHLDEVLRLIVWLEQVEN
> 340
> >>
> >> Query: 350 349
> >>
> >> Sbjct: 341 340
> >>
> >> And the whole fasta entry is:
> >>> lcl|R009125 (gi:13449103) spa40 (pWR501_p164) - Type III secretion
> >> protein [Shigella flexneri str. M90T (serotype 5a) plasmid pWR501]
> >>
> >>
> MANKTEKPTPKKLKDAAKKGQSFKFKDLTTVVIILVGTFTIISFFSLSDVMLLYRYVIINDFEINEGKYFFAVVIVFFKI
> >>
> >>
> IGFPLFFCVLSAVLPTLVQTKFVLATKAIKIDFSVLNPVKGLKKIFSIKTIKEFFKSILLLIILALTTYFFWINDRKIIF
> >>
> >>
> SQVFSSVDGLYLIWGRLFKDIILFFLAFSILVIILDFVIEFILYMKDMMMDKQEIKREYIEQEGHFETKSRRRELHIEIL
> >>
> >>
> SEQTKSDIRNSKLVVMNPTHIAIGIYFNPEIAPAPFISLIETNQCALAVRKYANEVGIPTVRDVKLARKLYKTHTKYSFV
> >> DFEHLDEVLRLIVWLEQVENTH
> >>
> >>
> >>
> >> --
> >> Lee Katz, Ph.D.
> >>
> >
> >
> >
> > --
> > Lee Katz, Ph.D.
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
--
Lee Katz, Ph.D.
More information about the Bioperl-l
mailing list