[BioPython] how to convert file full of BLAST runs into a FASTA file of sequences?

Thu Apr 9 22:46:37 UTC 2009

On 4/9/09, jchen at alumni.caltech.edu <jchen at alumni.caltech.edu> wrote:
>  > Do you just want the FASTA file to contain the matched region of the
>  > sequences in the database?  That information should be in the BLAST
>  > output - you'll need to remove any gap characters.
>  >
>  > If you want the full sequence of each matched target, that isn't in
>  > the BLAST output.  You'd have to take the reference number and look it up.
>  >  If you made the database yourself from a FASTA file, that should be
>  > easy.  If it was from NR/NT or another large database then maybe
>  > fetching the sequences from the NCBI would be easiest (try
>  > Bio.Entrez).
>
> Yeah, I actually do want the full length FASTA sequences. I didn't think
>  about the fact that the BLAST output only contains (partial) match
>  regions. I have a FASTA file of the entire proteome for the organism we
>  are studying.

You should be able to get the match IDs from the BLAST output and
match them up to your FASTA file easily enough.

>  > Are you sure you are using the XML output?
>  >
>  > With the plain text output and BLAST v.2.2.18, Biopython can only cope
>  > with single query output.  The NCBI regularly change their plain text
>  > output, and we have more-or-less given up on our plain text
>  > parser.  The NCBI themselves do not recommend parsing it - that is
>  > what the XML format was introduced for.
>
> It's unfortunate that there's no standard BLAST format. Yeah, I am trying to
>  parse the plain text BLAST output. I'm not familiar with the XML output -
>  I don't know how to have BLAST output in XML format.

If you are using the blastall tool at the command line directly, use
the argument -m 7 (from memory - check the blastall help).  If you are
using the wrapper in Bio.Blast.NCBIStandalone, this defaults to
requesting XML.  Have you looked at our documentation or the tutorial?
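
As a rough illustration of what the XML output buys you, here is a
minimal sketch that collects the hit IDs for each query. It uses only
the Python standard library so it runs without Biopython installed
(Bio.Blast.NCBIXML is the Biopython route). The blastall command in the
comment, the file names, the sample data, the 1e-5 cutoff and the
function name hits_per_query are all assumptions made for the example.
Depending on how the database was formatted, Hit_id may be a placeholder
(something like gnl|BL_ORD_ID|0) with the real name at the start of
Hit_def, so check that against your own output.

```python
import io
import xml.etree.ElementTree as ET

# Assumed command line (check the blastall help for your version):
#   blastall -p blastp -d proteome -i queries.fasta -m 7 -o results.xml


def hits_per_query(xml_source, e_cutoff=1e-5):
    """Map each query name to the IDs of hits with an HSP E-value <= e_cutoff."""
    results = {}
    # iterparse streams the file, so a few hundred queries need not fit in memory.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Iteration":
            continue  # wait until a whole query block has been read
        words = elem.findtext("Iteration_query-def", default="").split()
        query = words[0] if words else "unknown"
        hit_ids = []
        for hit in elem.iterfind("Iteration_hits/Hit"):
            evalues = [float(e.text)
                       for e in hit.iterfind("Hit_hsps/Hsp/Hsp_evalue")]
            if evalues and min(evalues) <= e_cutoff:
                hit_ids.append(hit.findtext("Hit_id"))
        results[query] = hit_ids
        elem.clear()  # discard the finished block to free memory
    return results


# Tiny hand-written sample in the shape of BLAST XML (illustrative data only).
SAMPLE_XML = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1 example protein</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>protA</Hit_id>
          <Hit_def>hypothetical protein A</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>protB</Hit_id>
          <Hit_def>hypothetical protein B</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>2.5</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

example_hits = hits_per_query(io.BytesIO(SAMPLE_XML))
print(example_hits)
```

For a real run you would pass the path of the XML file (or an open
binary handle) instead of the in-memory sample.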

>  My file contains a few hundred queries. I ended up writing a little script
>  that extracted the name of each query and each of its significant hits. I
>  will probably end up writing my own scripts for getting the FASTA
>  sequences for each of these hits from a FASTA proteome file.

If you have already run the BLAST search and it would be slow to rerun
it with XML output, then writing your own parser might be expedient.
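
For the record, a home-made parser of that kind could look roughly like
the sketch below. It assumes the plain text layout where each query
starts with a "Query=" line and its hits are listed one per line under
"Sequences producing significant alignments", ending at a blank line.
That layout is an assumption about this particular BLAST version and, as
noted above, it can change between releases. The function name and the
sample text are invented for the example. It only collects the names
listed in the summary table and does no E-value filtering of its own.

```python
def hits_from_plain_text(lines):
    """Map each query name to the hit IDs listed in its summary table."""
    results = {}
    query = None
    in_table = False
    for line in lines:
        if line.startswith("Query="):
            fields = line[len("Query="):].split()
            query = fields[0] if fields else "unknown"
            results[query] = []
            in_table = False
        elif line.startswith("Sequences producing significant alignments"):
            in_table = True
        elif in_table and query is not None:
            if line.startswith(">"):
                in_table = False  # the alignments have started
            elif line.strip():
                results[query].append(line.split()[0])  # first word is the ID
            elif results[query]:
                in_table = False  # blank line after the last table row
    return results


# Hand-written sample in the assumed layout (illustrative data only).
SAMPLE_TEXT = """\
Query= query1 example protein
         (120 letters)

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

protA hypothetical protein A                                       200   1e-50
protB hypothetical protein B                                        45   3e-06

>protA hypothetical protein A
          Length = 130

Query= query2 another protein
         (80 letters)

 ***** No hits found ******
"""

print(hits_from_plain_text(SAMPLE_TEXT.splitlines()))
```

An open file handle can be passed in place of the list of lines.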

Anyway, once I had the sequence identifiers, I would use Bio.SeqIO to
read the FASTA file.  If the file is small, loading into memory as a
Python dictionary would be the simplest solution - see the
Bio.SeqIO.to_dict function as one way to do this.
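
To make the dictionary idea concrete, here is a minimal sketch written
against the standard library only, so it runs without Biopython. With
Biopython, Bio.SeqIO.to_dict(SeqIO.parse(handle, "fasta")) builds an
equivalent dictionary of SeqRecord objects keyed by record id (the first
word of the header line). The function names read_fasta and write_hits,
the 60-column line width and the sample data are assumptions made for
the example.

```python
import io


def read_fasta(handle):
    """Return {identifier: (header, sequence)} for every record in a FASTA file."""
    records = {}
    ident = None
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            header = line[1:]
            words = header.split()
            ident = words[0] if words else ""
            if ident in records:
                raise ValueError("Duplicate identifier: %s" % ident)
            records[ident] = [header, []]
        elif line and ident is not None:
            records[ident][1].append(line)  # sequence may span several lines
    return {key: (header, "".join(parts))
            for key, (header, parts) in records.items()}


def write_hits(records, hit_ids, out_handle, width=60):
    """Write full-length sequences for hit_ids as FASTA; return IDs not found."""
    missing = []
    for ident in hit_ids:
        if ident not in records:
            missing.append(ident)
            continue
        header, seq = records[ident]
        out_handle.write(">%s\n" % header)
        for start in range(0, len(seq), width):
            out_handle.write(seq[start:start + width] + "\n")
    return missing


# Illustrative data: a two-protein "proteome" and a hit list with one unknown ID.
proteome = io.StringIO(
    ">protA hypothetical protein A\nMKTAYIAK\nQRQISFVK\n"
    ">protB hypothetical protein B\nMSDNE\n"
)
records = read_fasta(proteome)
out = io.StringIO()
missing = write_hits(records, ["protA", "protC"], out)
print(out.getvalue(), end="")
print("Not found:", missing)
```

Returning the missing IDs makes it easy to spot hits whose names in the
BLAST output do not match the FASTA headers exactly.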

Finally, sending attachments to the mailing list isn't a good idea -
especially not half a megabyte of BLAST results!  I think the mailing
list has rejected that email anyway...

Peter