[Biopython] Retrieving fasta seqs

Brad Chapman chapmanb at 50mail.com
Tue Feb 2 13:09:22 UTC 2010


Hi Alvaro;

> Hi all! This time the issue is about retrieving fasta records. I have a huge
> multifasta file and another file that has a list of ids.
> The latter has several ids, ex:
> FBgn0010441
> FBgn0011598
> FBgn0011761
> The purpose of this script is to retrieve the fasta sequences for this ids
> from the multifasta file and save the data to a file.
>  Ex. output file
> 
> >FBgn0010441
> ACTAGACCC
> >FBgn0011598
> GGTAATAAA

What you want to do here is read in your list of IDs first, and then
loop through the large FASTA file writing out the records you want.
More specific suggestions below:

> import sys
> from Bio import SeqIO
> try:
>     sec = open(sys.argv[1], 'r')
>     lista = open(sys.argv[2], 'r')
> except:
>     print "Error"

This is an aside, but type of code is a bad idea. You don't want to
blindly catch errors and keep moving on; it's fine to raise an error
if you can't find a file. I would remove the try/except from this
code.

On to the actual code, first read through the list of IDs and store
those as a list:

lista = open(sys.argv[2], 'r')
listita = []
for lines in lista:
   listita.append(line.rstrip())

Now open an output handle to write the records you want:

out_handle = open("your_out_file.fa", "w")

Finally, iterate through the large FASTA file, and write records of
interest:

sec = open(sys.argv[1], 'r')
for rec in SeqIO.parse(sec, "fasta"):
    if rec.id in listita:
        SeqIO.write([rec], out_handle, "fasta")

Hope this helps,
Brad



More information about the Biopython mailing list