[Biopython] Pulling Alignment From PSI-Blast Output

Brett Bowman bnbowman at gmail.com
Tue Feb 8 22:46:02 UTC 2011


I stand corrected.  I couldn't get the buildali script to work, and it
wasn't mentioned in the paper that I am trying to emulate, so I tried to
build something in Python instead.

But it appears to confirm my suspicions that the PSI-Blast standalone
doesn't have the output that I want, as they use the less efficient blastpgp
algorithm and trap the output, so I think I will give that a try...

-Brett

On Tue, Feb 8, 2011 at 12:54 PM, Ruchira Datta <ruchira.datta at gmail.com>wrote:

> Hi Brett,
>
> I have run HHmake numerous times myself and the overall script does include
> running PSI-BLAST.  I suspect they didn't include it in the paper because
> it's not part of their algorithm which they invented.
>
> From the hh_1.5.1 version, from the script buildali.pl, here are some
> relevant lines:
>
> my $blastpgp=$ncbidir."/blastpgp -I T -s T"; # blastpgp executable
>
> our $blastpgp = $blastpgp . " -I T -s T"; # show gi's in defline; use
> Smith-Waterman
>
>     # Fast psiblast search for very similar sequences
>     &System("$blastpgp -I T -b 10 -v 10 -e 1e-6 -A 10 -f 15 -d $dbfile -i
> $tmp.seq &> $tmp.bla");
>
>   if ($nseqin<=1) {
>       &blastpgp("-e $Eult -d $db -i $seqfile","$blafile");
>   } else {
>       if ($core==1) {
>     &blastpgp("-e $Eult -d $db -i $seqfile","$blafile");
>     &System("perl $perl/alignhits.pl -cov $covcore -e $E2 $qid $bcore
> $pmaxopt -best -psi -q $seqfile $blafile $coreali",$v2);
>     if ($bcore ne "") {$bcore.=" -B $coreali";} # From here on use $coreali
> for end pruning of $coreali
>       }
>       &blastpgp("-e $Eult -d $db -i $seqfile -B $psifile","$blafile");
>   }
>
> It's possible you could *only* use their script alignhits.pl:
>
> #! /usr/bin/perl -w
> # Extract a multiple alignment of hits from Blast or PsiBlast output
> # Usage:   alignhits.pl [options] blast.out alignment-file
>
> Very sorry for subjecting you to Perl, but it does seem like it might be
> helpful in your case. :-)
>
> --Ruchira
>
>
> On Tue, Feb 8, 2011 at 12:28 PM, Brett Bowman <bnbowman at gmail.com> wrote:
>
>>  It might be there implicitly in HSPs (high-scoring segment pairs).
>>> Trying to make a multiple sequence alignment out of these is a huge pain.
>>>
>>
>>  It isn't, I checked - the alignments are only pairwise, so combining them
>> produces non-sense.  Thats why I have been using Muscle and ClustalW to
>> create an alignment after the Blast, despite the large computational cost.
>>
>>
>>> Brett, might you not run all of HHmake (which includes running
>>> PSI-BLAST)?  It extracts the multiple sequence alignment.
>>>
>>
>> HHmake does not include PSI-Blast - all HHmake does is convert a multiple
>> sequence alignment to an HMM profile.  The original paper that inspired this
>> project used PSI-Blast and then used HHmake, but make no mention of
>> alignments.  I have sent the authors an e-mail for clarification, but have
>> not received a reply.  But from the way the paper was written, it does not
>> appear that they used an alignment program, hence my confusion at my
>> inability to do the same.
>>
>> -Brett
>>
>> On Tue, Feb 8, 2011 at 9:59 AM, Ruchira Datta <ruchira.datta at gmail.com>wrote:
>>
>>>
>>>
>>> On Tue, Feb 8, 2011 at 4:05 AM, Michiel de Hoon <mjldehoon at yahoo.com>wrote:
>>>
>>>> I am surprised that the multiple alignment is not in the XML at all. It
>>>> can not be constructed from the information in the XML?
>>>
>>>
>>> It might be there implicitly in HSPs (high-scoring segment pairs).
>>> Trying to make a multiple sequence alignment out of these is a huge pain.
>>>
>>> Brett, might you not run all of HHmake (which includes running
>>> PSI-BLAST)?  It extracts the multiple sequence alignment.  (It might "clean
>>> up" after itself by deleting it -- you might need to go in and take that
>>> line out.)  Then, if you wanted to do more stuff to the MSA before running
>>> HHmake, you could just discard the rest of the results of the first run and
>>> run HHmake again at the point that you want.
>>>
>>> --Ruchira
>>>
>>> Anyway, if it is in there, I would suggest to use Bio.Entrez to parse the
>>>> XML instead of the parser in Bio.Blast. The Bio.Entrez parser will give you
>>>> all the information in the XML; the parser in Bio.Blast is more polished but
>>>> may not give you all the information present in the PSI-Blast output.
>>>>
>>>> --Michiel.
>>>>
>>>> --- On Tue, 2/8/11, Brett Bowman <bnbowman at gmail.com> wrote:
>>>>
>>>> From: Brett Bowman <bnbowman at gmail.com>
>>>> Subject: Re: [Biopython] Pulling Alignment From PSI-Blast Output
>>>> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
>>>> Cc: biopython at biopython.org
>>>> Date: Tuesday, February 8, 2011, 2:40 AM
>>>>
>>>> I thought about that, but there doesn't appear to be any
>>>> multiple-alignment data in the XML file - just pair-wise alignments of the
>>>> query with each hit.  In addition, when I parse the output file with NCBIXML
>>>> I get a Bio.Blast.Record.Blast object, instead of a
>>>> Bio.Blast.Record.PSIBlast object.  The Biopython cookbook describes how to
>>>> work with a PSIBlast object, but it doesn't really cover how to make one...
>>>>
>>>>
>>>>
>>>> On Mon, Feb 7, 2011 at 5:20 PM, Michiel de Hoon <mjldehoon at yahoo.com>
>>>> wrote:
>>>>
>>>>
>>>> One option you could try is to let PSI-Blast generate its output in XML
>>>> and check if the information you need is present in the XML. If it is, you
>>>> can parse the XML with the read() function in Bio.Entrez. You may find that
>>>> Bio.Entrez needs an additional DTD file to be able to parse the PSI-Blast
>>>> XML output (Bio.Entrez will tell you which one and where to store it). If
>>>> so, please let us know, so we can include the required DTDs in the next
>>>> release of Biopython.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --Michiel.
>>>>
>>>>
>>>>
>>>> --- On Mon, 2/7/11, Brett Bowman <bnbowman at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> > From: Brett Bowman <bnbowman at gmail.com>
>>>>
>>>> > Subject: [Biopython] Pulling Alignment From PSI-Blast Output
>>>>
>>>> > To: biopython at biopython.org
>>>>
>>>> > Date: Monday, February 7, 2011, 5:30 PM
>>>>
>>>> > I'm trying to use the PSI-Blast
>>>>
>>>> > results from a series of proteins to detect
>>>>
>>>> > distant homologues, using HMMs of various sorts.
>>>>
>>>> > Currently I'm pulling down
>>>>
>>>> > the sequence IDs with PSI-Blast, downloading the full
>>>>
>>>> > sequences from NCBI,
>>>>
>>>> > then aligning everything with ClustalW or Muscle.
>>>>
>>>> > However this is eating up
>>>>
>>>> > way more processor time than I have to spare, so I want to
>>>>
>>>> > just pull the
>>>>
>>>> > full multi-sequence alignment from the PSI-blast results if
>>>>
>>>> > possible (OUTFMT
>>>>
>>>> > option #3 or 4), for use in building the HMMs.  But it
>>>>
>>>> > doesn't look like
>>>>
>>>> > AlignIO has a module for reading the peculiar format that
>>>>
>>>> > PSI-Blast
>>>>
>>>> > generates...
>>>>
>>>> >
>>>>
>>>> > Has this been done before, or will I need to write my own
>>>>
>>>> > parser?
>>>>
>>>> >
>>>>
>>>> > Brett Bowman
>>>>
>>>> > Woelk Lab
>>>>
>>>> > UCSD School of Medicine
>>>>
>>>> > _______________________________________________
>>>>
>>>> > Biopython mailing list  -  Biopython at lists.open-bio.org
>>>>
>>>> > http://lists.open-bio.org/mailman/listinfo/biopython
>>>>
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Biopython mailing list  -  Biopython at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>>>
>>>
>>>
>>
>



More information about the Biopython mailing list