[BioPython] Can't download a FASTA file from NCBI to BLAST
Roger Barrette
rwbarrette at gmail.com
Tue Jun 19 09:50:08 EDT 2007
Hi again Peter,
You are correct in your assumptions as to what I'm trying to accomplish. I
have a habit of pulling random code from different places when I'm at a loss
for how to do something, when I can't find documentation or examples. The
Seq5 and Seq3 were residual from a previous script where I was pulling out
overlapping 50mers. Regardless... I added your code to my script, and
changed the format for the search parameter to "fasta", but I'm getting the
following error:
\ Traceback (most recent call last):
\ File "<pyshell#4>", line 1, in <module>
\ Get_FASTA_Seq("NC_001653")
\ File "C:/Python25/FASTAtry2.py", line 17, in Get_FASTA_Seq
\ SeqIO.write(gb_records, fasta_file, "fasta")
\ File "C:\Python25\lib\site-packages\Bio\SeqIO\__init__.py", line
214, in write
\ writer_class(handle).write_file(sequences)
\ File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line
243, in write_file
\ self.write_records(records)
\ File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line
232, in write_records
\ self.write_record(record)
\ File "C:\Python25\Lib\site-packages\Bio\SeqIO\FastaIO.py", line
103, in write_record
\ id = self.clean(record.id)
\ AttributeError: 'str' object has no attribute 'id'
Am I not supposed to use GenBank.search_for, or NCBIDictionary, or do I need
to parse the raw output first. I did try to use the GenBank RecordParser,
as well as the Feature Parser set to output as "fasta", but I get the same
error? My current (+your) modified code is:
\ def Get_FASTA_Seq(NC_ID):
\ i = NC_ID
\
\ ## Search for Viruses based on TXID, and count number of hits ##
\
\ from Bio import GenBank
\ from Bio import SeqIO
\
\ gi_list = GenBank.search_for(i)
\ ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")
\
\ ## Extract individual Sequence from NCBI for each TXID ##
\
\ gb_records = [ncbi_dict[gi] for gi in gi_list]
\ fasta_file = open("c:\\Current_Query.fasta","w")
\ SeqIO.write(gb_records, fasta_file, "fasta")
\ fasta_file.close()
Thank you for your insights. Sorry to be such a noob.
-Roger
On 6/19/07, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> On 6/19/07, Roger Barrette <rwbarrette at gmail.com> wrote:
> > I am trying to set up a script to automatically go into NCBI and
> retrieve
> > individual FASTA file based on a list of accession numbers (either gi or
> > NC). The code that I have written gets the sequences and saves the
> file,
> > but when I run a blast against the file, it doesn't work, Am I not using
> the
> > correct parser for preparing to save the file for blasting? I tried to
> set
> > the format to "fasta", but I was getting errors saying that gi_list[0]
> > doesn't contain the arguement 'data.seq'. I also tried the arguement
> > .sequence, and it gave me the same errror. I realize I'm not currently
> > calling the file in as a FASTA, but this is the only way I've been able
> to
> > even automate the record retrieval process for the long series of
> Blasting
> > that I have to do. I have a separate function for calling the Blast,
> > but it works fine with manually downloaded FASTA files, so the
> > problem appears to be here. Any suggestions for a fix, or even a better
> way
> > to do this would be greatly appreciated. Thanks. My code is:
>
> You seem to have tried a lot of things and its difficult to follow
> exactly what you are trying to do. I think you want to:
>
> (1) start with a list of gi numbers
> (2) get the matching sequence data online from the NCBI
> (3) save these as a fasta file
> (4) call blast using this fasta file as the input query
>
> Anyway - I've included a bit of code for (2) and (3) at the end of this
> email.
>
> > def Get_FASTA_Seq(NC_ID):
> >
> > i = NC_ID
> >
> > ## Search for Viruses based on TXID
> >
> > from Bio import GenBank
> > gi_list = GenBank.search_for(i)
> > ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
> >
> > fasta_file = open("c:\Current_Query.gbk", "w")
>
> Its a bit odd to save a fasta file with a gbk extension, fasta is more
> common. And there may be a problem with the single unescaped slash,
> try r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk".
>
> Remember, in python the slash is used for things like \n (new line),
> \t (tab) etc.
>
> > ## Extract individual Sequence from NCBI based on gi# or NC# ##
> >
> > gb_record = ncbi_dict[gi_list[0]]
> > record_parser = GenBank.FeatureParser()
> > ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser =
> > record_parser)
> > gb_seqrecord = ncbi_dict[gi_list[0]]
> >
> > SeqValue = ncbi_dict[gi_list[0]].seq.data
> > NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
> > Length = len(SeqValue)
> > Seq5 = 0
> > Seq3 = Seq5 + Length
> >
> > print NameValue
> > print Length
> > print SeqValue
> >
> > ## Write sequences into the FASTA file ##
> >
> > fasta_file.write(">" + i + " " + NameValue + "\n")
> > for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
> > fasta_file.write(SeqValue[Seq5:Seq3])
> > fasta_file.write("\n")
> > ## Close and Save the FASTA file ##
> > fasta_file.close()
>
> That is all very complicated - why mess about with Seq5 and Seq4 when
> you seem to want the whole sequence anyway?
>
> Have you opened the output file in a text editor to check it looks
> sensible?
>
> If you can construct a list SeqRecords, why not write the file using
> Bio.SeqIO (Biopython 1.43 or later) like this:
>
> ...
> gb_records = [ncbi_dict[gi] for gi in gi_list]
> from Bio import SeqIO
> fasta_file = open("c:\\Current_Query.fasta","w")
> SeqIO.write(gb_records, fasta_file, "fasta")
> fasta_file.close()
>
> Peter
>
More information about the BioPython
mailing list