[BioPython] parsing blast results for use in clustal

Thu Mar 25 19:39:02 EST 2004

Hi Aaron;

> I'm new to biopython and python in general. I am trying to take the 
> results from a blast search to feed into a clustal multiple alignment.  
> I followed the cookbook tutorials and can get results from blast but 
> parsing into a file that clustal can read is giving me some trouble.

Okay, so if I'm understanding you correctly, what you want is a file
that you can put into Clustalw to do an alignment. From the code you
supplied, it looks like what you are printing out is clustalw aln
output -- the results from an alignment.

If I can try and extrapolate, what you probably want to do is
retrieve the FASTA record for the hit and then write this to a file
-- then subsequently use clustalw for the alignment.

If I'm interpreting you correctly, then you can do this quite
readily. Since it's NCBIWWW, I'll assume you are BLASTing against
some kind of standard NCBI database. Then you'll just need to split
the title of the hit to get out the GI or accession number. With
this, you can retrieve the corresponding full-length FASTA record
from NCBI with code like:

>>> accession = "AAN04997.1"
>>> from Bio import GenBank
>>> ncbi_dict = GenBank.NCBIDictionary(format = "fasta")
>>> rec = ncbi_dict[accession]
>>> print rec
>gi|22725997|gb|AAN04997.1| putative transcription initiation factor [Oryza sativa (japonica cultivar-group)]
MGSADLVLKAACEGCGSPSDLYGTSCKHTTLCSSCGKSMALSGARCLVCSAPITNLIREYNVRANATTDK
SFSIGRFVTGLPPFSKKKSAENKWSLHKEGLQGRQIPENMREKYNRKPWILEDETGQYQYQGQMEGSQSS
TATYYLLMMHGKEFHAYPAGSWYNFSKIAQYKQLTLEEAEEKMNKRKTSATGYERWMMKAATNGPAAFGS
DVKKLEPTNGTEKENARPKKGKNNEEGNNSDKGEEDEEEEAARKNRLALNKKSMDDDEEGGKDLDFDLDD
EIEKGDDWEHEETFTDDDEAVDIDPEERADLAPEIPAPPEIKQDDEENEEEGGLSKSGKELKKLLGKAAG
LNESDADEDDEDDDQEDESSPVLAPKQKDQPKDEPVDNSPAKPTPSGHARGTPPASKSKQKRKSGGGDDS
KASGGAASKKAKVESDTKPSVAKDETPSSSKPASKATAASKTSANVSPVTEDEIRTVLLAVAPVTTQDLV
SRFKSRLRGPEDKNAFAEILKKISKIQKTNGHNYVVLRDDKK

So the returned record is a plain string holding the FASTA record,
and you can replace your output_file.write(...) code with:

output_file.write(rec)

and then end up with a file full of FASTA sequences, which clustalw
will take as input to do a subsequent alignment.
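
For the title-splitting step mentioned earlier, here is a minimal
sketch in plain Python. It assumes the hit title uses the
pipe-delimited NCBI layout seen in the record above
(gi|number|db|accession|), which will not hold for every database.
The function name extract_ids is made up for illustration and is
not part of Biopython.

```python
def extract_ids(title):
    """Pull the GI number and accession out of an NCBI-style title.

    Assumes a layout like 'gi|22725997|gb|AAN04997.1| description',
    optionally with a leading '>'. Returns (gi, accession); both are
    None if the title does not start with a 'gi' field.
    """
    fields = title.lstrip(">").split("|")
    gi = None
    accession = None
    if len(fields) >= 4 and fields[0] == "gi":
        gi = fields[1]
        # An empty accession field is reported as None.
        accession = fields[3].strip() or None
    return gi, accession
```

The accession that comes back is what you would use as the lookup
key when retrieving the record.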

If you wanted to trim the sequence to the length of the hit, you
could parse the Fasta result you retrieve:

>>> from Bio import Fasta
>>> fasta_parser = Fasta.RecordParser()
>>> import StringIO
>>> fasta_rec = fasta_parser.parse(StringIO.StringIO(rec))

Manipulate the sequence:

>>> fasta_rec.sequence = fasta_rec.sequence[20:70]

And then write this out to your file:

output_file.write(str(fasta_rec) + "\n")
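
If you would rather do the trimming on the FASTA string directly,
without going through the parser, something like the following
should work. This is only a sketch: trim_fasta is a hypothetical
helper, not a Biopython function, and it assumes the string holds
a single record and that your start and end positions are 1-based
and inclusive, which is how BLAST reports hit coordinates.

```python
def trim_fasta(fasta_text, start, end, width=70):
    """Return a FASTA string holding residues start..end of the input.

    start and end are 1-based and inclusive. Assumes fasta_text holds
    exactly one record: a '>' title line followed by sequence lines.
    """
    lines = fasta_text.strip().splitlines()
    title = lines[0]
    # Join the wrapped sequence lines back into one string.
    sequence = "".join(line.strip() for line in lines[1:])
    # Convert the 1-based inclusive range to a Python slice.
    piece = sequence[start - 1:end]
    # Re-wrap the trimmed sequence at the requested line width.
    wrapped = [piece[i:i + width] for i in range(0, len(piece), width)]
    return "\n".join([title] + wrapped) + "\n"
```

With this, output_file.write(trim_fasta(rec, 21, 70)) would keep
the same residues as the [20:70] slice above.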

Hope some of that helped!
Brad