[BioPython] FASTA parsing errors

Aaron Zschau aaron at ocelot-atroxen.dyndns.org
Tue Aug 3 17:50:57 EDT 2004


That parsed just fine and I'm getting output now. However, I'm having  
trouble figuring out how to extract the protein sequence from this  
record. Looking at the API for the Record class, there isn't a clear  
'protein' attribute the way there is a 'sequence' attribute.  Do you  
have any recommendations on how I would access that part of the Record?

thanks for your help so far,

Aaron
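
One possible approach, sketched under assumptions: in a GenBank record the protein is not a top-level field but the /translation qualifier of the CDS feature, so it can be found by walking the record's feature list. The sketch below assumes the attribute names documented for Bio.GenBank.Record (a `features` list whose items have `key` and `qualifiers`, each qualifier having `key` such as `/translation=` and a quoted `value`). The namedtuples and the `cds_translations` helper are hypothetical stand-ins for illustration, not part of Biopython, and the demo values are shortened.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic the attribute names assumed for
# Bio.GenBank.Record objects; a real parsed record would be passed instead.
Qualifier = namedtuple("Qualifier", ["key", "value"])
Feature = namedtuple("Feature", ["key", "location", "qualifiers"])
Record = namedtuple("Record", ["sequence", "features"])


def cds_translations(record):
    """Return the /translation value of every CDS feature in the record."""
    proteins = []
    for feature in record.features:
        if feature.key != "CDS":
            continue
        for qualifier in feature.qualifiers:
            # The qualifier key is assumed to be stored as '/translation='.
            if qualifier.key.strip("/=") == "translation":
                # Drop the surrounding quotes and any whitespace left over
                # from line wrapping in the flat file.
                proteins.append("".join(qualifier.value.strip('"').split()))
    return proteins


# Shortened demo data, not the full record from this thread.
demo = Record(
    sequence="atggcgccgc",
    features=[
        Feature("gene", "1..2107", [Qualifier("/gene=", '"LOC416117"')]),
        Feature("CDS", "1..486", [Qualifier("/translation=", '"MAPPGPGPAG PCLRS"')]),
    ],
)
print(cds_translations(demo))
```

With a record returned by the GenBank parser, the call would presumably be `cds_translations(rec)`; a list is returned because a record may carry more than one CDS.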


On Aug 3, 2004, at 5:06 PM, Jeffrey Chang wrote:

> Hi Aaron,
>
> This looks like a GenBank format file, so try using the GenBank parser  
> by doing something like (untested code):
> from Bio import GenBank
> parser = GenBank.RecordParser()
> rec = parser.parse(open('/var/www/html/data/a12345.fasta','r'))
> print rec.sequence
>
> Also, for future reference, please send samples of files as an  
> attachment rather than as part of the email.  Many email clients  
> change the text (for example, wrapping long lines) so that it is no  
> longer parseable by Biopython.  Thus, I was not able to check to make  
> sure that the GenBank parser can handle this file correctly. :)
>
> Jeff
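
The mix-up discussed here (a GenBank record saved under a .fasta name) could be caught before choosing a parser by looking at the first non-blank line: FASTA records start with ">" and GenBank records start with "LOCUS". A minimal sketch, assuming those two markers are enough for the files involved; `guess_flatfile_format` is a hypothetical helper name, not a Biopython function.

```python
def guess_flatfile_format(text):
    """Guess 'fasta' or 'genbank' from the first non-blank line, else None."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        return None  # first real line matches neither marker
    return None  # empty input


print(guess_flatfile_format("LOCUS       XM_414447   2107 bp    mRNA"))
print(guess_flatfile_format(">example\nACGT"))
```

The check only inspects the start of the text, so it is cheap to run on the first few lines read from a file before handing the file to a parser.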
>
>
> On Aug 3, 2004, at 4:48 PM, Aaron Zschau wrote:
>
>> This is the file that is being read. I know it worked in 1.24 just  
>> fine, but maybe something changed between versions that makes the  
>> parser reject this format.
>>
>>
>> thanks,
>>
>> Aaron Zschau
>>
>> ------a12345.fasta----------
>>
>>
>> LOCUS       XM_414447               2107 bp    mRNA    linear   VRT 28-JUL-2004
>> DEFINITION  PREDICTED: Gallus gallus similar to von Hippel-Lindau protein
>>             (LOC416117), mRNA.
>> ACCESSION   XM_414447
>> VERSION     XM_414447.1  GI:50754623
>> KEYWORDS    .
>> SOURCE      Gallus gallus (red jungle fowl)
>>   ORGANISM  Gallus gallus
>>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>>             Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
>>             Phasianinae; Gallus.
>> COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
>>             analysis. This record is derived from an annotated genomic sequence
>>             (NW_060494) using gene prediction method: GNOMON, supported by EST
>>             evidence.
>>             Also see:
>>                 Documentation of NCBI's Annotation Process
>>
>> FEATURES             Location/Qualifiers
>>      source          1..2107
>>                      /organism="Gallus gallus"
>>                      /mol_type="mRNA"
>>                      /strain="inbred line UCD001"
>>                      /isolate="#256"
>>                      /db_xref="taxon:9031"
>>                      /chromosome="12"
>>                      /sex="female"
>>                      /note="inbred line derived from a wild population of red
>>                      jungle fowl in Malaysia in the late 1930s, with the
>>                      possible introgression of a limited amount of White
>>                      Leghorn genome during its captive breeding history
>>                      common: red jungle fowl"
>>      gene            1..2107
>>                      /gene="LOC416117"
>>                      /note="Derived by automated computational analysis using
>>                      gene prediction method: GNOMON."
>>                      /db_xref="GeneID:416117"
>>                      /db_xref="InterimID:416117"
>>      CDS             1..486
>>                      /gene="LOC416117"
>>                      /codon_start=1
>>                      /product="similar to von Hippel-Lindau protein"
>>                      /protein_id="XP_414447.1"
>>                      /db_xref="GI:50754624"
>>                      /db_xref="GeneID:416117"
>>                      /db_xref="InterimID:416117"
>>                      /translation="MAPPGPGPAGPCLRSANTRELSEVVFNNRSPRAVLPIWVDFEGR
>>                      PRYYPVLRPRTGRIMHSYRGHLWLFRDAGTHDGLLVNRQELFVAAPDVNKADITLPVF
>>                      TLKERCLQVVRSLVRPGDYRKLDIVRSLYEELEDHPDVKKDLQRLSMERSKTLQEEIL
>>                      H"
>>      misc_feature    37..453
>>                      /gene="LOC416117"
>>                      /note="VHL; Region: von Hippel-Lindau disease tumour
>>                      suppressor protein. VHL forms a ternary complex with the
>>                      elonginB and elonginC proteins. This complex binds Cul2,
>>                      which then is involved in regulation of vascular
>>                      endothelial growth factor mRNA"
>>                      /db_xref="CDD:pfam01847"
>> ORIGIN
>>         1 atggcgccgc cgggtccggg tcccgccggg ccgtgcctgc gctccgccaa cacgcgcgaa
>>        61 ctctccgagg tcgtcttcaa caaccgcagc ccgcgcgccg tgctccccat ctgggtggac
>>       121 ttcgagggcc ggccgcgcta ctaccccgtg ctgcggccgc gcaccgggcg gatcatgcac
>>       181 agctaccgcg ggcacctgtg gctgttccgc gacgcgggca cgcacgacgg gctgctcgtc
>>       241 aaccggcagg agctgttcgt ggccgcgccg gacgtcaaca aggccgacat cacgctgcca
>>       301 gtgttcacgc tgaaggagcg gtgcctgcag gtggtgcgca gcctggtccg gccgggggac
>>       361 taccggaagc tggacatcgt gcgctcgctg tacgaggagc tggaggacca ccccgacgtc
>>       421 aagaaggacc tgcagcggct ctccatggag aggagcaaaa cgttacagga ggaaatcctc
>>       481 cactaacagg gctgtgcgtc ccgagccgtg tagatagcaa agcaccgagc ttaggagggg
>>       541 cagctgccgt gcagcgtgcc gggagctaac gtctgcatcg acgttctgga acgaactcag
>>       601 tcatgctgta gaacatttgc tatgctggta ggtcagattc caaagagcaa acagtgtgca
>>       661 ggaacgtact gctttgtgag ggctctgctc ccggtctcat gcactggtga gcagtgaccc
>>       721 cagtggcctg gcacagacgg ggctcagaga agcttgcttc cgactgtttc agaacattcc
>>       781 atagtaacac aagatttatc cgtctggagg aaatacatgc agctcagctt cctctgagtt
>>       841 agaaagaaaa ctacatcaag ggttcactta atccagacta taaaatcagt ggcagagcag
>>       901 caccaggttt gcttgaatga tttggttttg gcagaaattc gctctcacat gctaaattta
>>       961 cttttgaatc acaaagcgtg gagcgtgttc atgtgagagc ttccacggtt gccttctgag
>>      1021 ggctcggccc aaaacttctg tgctggcgga aagatgtccg taagcatttc tgtgttagcc
>>      1081 tctgtctgtg cgttcataaa ccctcattgt agcaactctg aagctgacaa attcttacac
>>      1141 agaacatgcc ttgaatgcct taatttgtct ttcattcctg aattcctgct tagtttatct
>>      1201 ctagatgatg gaaccttgtc agccatatgg actgcatctt ggttttagga cccctttctg
>>      1261 ctttgcacct ctgtgcccac accctcagct cccatagtgg tataccaagg gagcgttccc
>>      1321 agaaggtggg tgctctgagc ctcatctttc ccttgtccca gggattggcc ttggggagca
>>      1381 cagtccgccc aggccgctgg tgccccctga ggcacagaag ctgccccagc tgcaggcgtg
>>      1441 gctcccccaa gcagagctgt gcttttcagc aggccagctg cacagagaga aatcatagaa
>>      1501 tcacagaatc atacaatggc ctgggctgaa aaggaccaca atgcccatcc agttccaacc
>>      1561 ccctgctatg tgcagggtca ccaaccagca gaccaggctg cccagagcca catccagcct
>>      1621 ggccttgaat gcctccaggg atggggcctc cttgggcgac ctgttccaat gcatcaacac
>>      1681 cctccaagtg aaaaacttcc tcctgatata cctgaacatc ccctgtctta tttaagatca
>>      1741 ttcccccttg tcctgtcact atccaccctc gtgaacagct gttccccttc ctgtttatat
>>      1801 gcttcctaaa atcaagaaag gttctaggcc tatatgttct cttcccccat acatcaaata
>>      1861 cacaggtgtg tgtctgtatg tctctgtgca taactcaaag cagcgttgtt tttagcagat
>>      1921 aggtgaattg ttccccaagt tgcaggcagg cgcagtgctg ctcagcatgc agagcagcag
>>      1981 gttgctaaca gatagcagca ggctgttctg tggtgtaagg ttcttaagta tgcaatgtgt
>>      2041 gcccttctcg tggacttttt ttttcttaaa tgtttgtgta tgaactgatc tttgtttctc
>>      2101 ataaaaa
>> //
>>
>>
>> ------end file----------
>> On Aug 3, 2004, at 4:23 PM, Jeffrey Chang wrote:
>>
>>> Hi Aaron,
>>>
>>> Can you send the file that is generating the error?  I believe it is  
>>> called /var/www/html/data/a12345.fasta.  In general, the fasta  
>>> parser should be well-tested.  It works on a test file in fasta  
>>> format that I have here.  It would help most if someone could look  
>>> at your file to see what's going on.
>>>
>>> Thanks,
>>> Jeff
>>>
>>>
>>> On Aug 3, 2004, at 3:42 PM, Aaron Zschau wrote:
>>>
>>>> I've sent a couple of messages to the list about this, but I'm not  
>>>> sure if they're going through as I haven't seen any replies.  I am  
>>>> trying to get a section of my code, based on the cookbook  
>>>> tutorials, working again; it worked before the 1.30 revision of  
>>>> Biopython. My code looks up a gene by name in GenBank and saves the  
>>>> FASTA version of that data so that the protein string can be fed  
>>>> into a BLAST search.  The lookup works fine and I get a FASTA file  
>>>> saved just fine; however, I then get an error at the parse stage at  
>>>> character 0 of the file.
>>>>
>>>> Any help would be greatly appreciated
>>>>
>>>> thanks
>>>>
>>>> Aaron Zschau
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> #file_for_blast = open(data_path_prefix + file_unique_id + 'fasta', 'r')
>>>> file_for_blast = open('/var/www/html/data/a12345.fasta','r')
>>>>
>>>> f_iterator = Fasta.Iterator(file_for_blast)
>>>> print "iterator created"
>>>> sys.stdout.flush()
>>>>
>>>> f_record = f_iterator.next()
>>>> print "f_record created"
>>>> sys.stdout.flush()
>>>>
>>>> -----------------------
>>>>
>>>> iterator created
>>>> Traceback (most recent call last):
>>>>   File "cluster-debug.py", line 119, in ?
>>>>     f_record = f_iterator.next()
>>>>   File "/root/biopython-1.30/build/lib.linux-i586-2.2/Bio/Fasta/__init__.py", line 72, in next
>>>>     result = self._iterator.next()
>>>>   File "/root/biopython-1.30/build/lib.linux-i586-2.2/Martel/IterParser.py", line 152, in iterateFile
>>>>     self.header_parser.parseString(rec)
>>>>   File "/root/biopython-1.30/build/lib.linux-i586-2.2/Martel/Parser.py", line 361, in parseString
>>>>     self._err_handler.fatalError(ParserIncompleteException(pos))
>>>>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
>>>>     raise exception
>>>> Martel.Parser.ParserIncompleteException: error parsing at or beyond character 0 (unparsed text remains)
>>>>
>>>> _______________________________________________
>>>> BioPython mailing list  -  BioPython at biopython.org
>>>> http://biopython.org/mailman/listinfo/biopython