[BioPython] FASTA parsing errors

Jonathan Taylor jonathan.taylor at utoronto.ca
Tue Aug 3 17:01:50 EDT 2004


Hi,

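I don't think that file conforms to the FASTA format
(see http://ngfnblast.gbf.de/docs/fasta.html). It looks like a GenBank
flat-file record instead: it starts with a LOCUS line, whereas a FASTA
record starts with a ">" header line.
I could be wrong though.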
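
For what it's worth, a quick sanity check before handing a file to the
parser is to look at its first non-blank line: a FASTA record begins with
">", while a GenBank flat-file record begins with "LOCUS". The sketch below
is only an illustration in current Python 3 (not the Python 2.2 shown in the
traceback); the function name sniff_sequence_format is made up here and is
not part of Biopython.

```python
import io


def sniff_sequence_format(handle):
    """Guess 'fasta', 'genbank', or 'unknown' from the first non-blank line.

    Hypothetical helper for illustration only; not a Biopython function.
    """
    for line in handle:
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines, as in the posted file
        if stripped.startswith(">"):
            return "fasta"  # FASTA records open with a '>' header line
        if stripped.startswith("LOCUS"):
            return "genbank"  # GenBank flat files open with a LOCUS line
        return "unknown"  # first real line matches neither convention
    return "unknown"  # empty input


# Small in-memory examples standing in for real files.
genbank_like = io.StringIO("\n\nLOCUS       XM_414447               2107 bp\n")
fasta_like = io.StringIO(">example_id example description\nATGGCGCCGC\n")

print(sniff_sequence_format(genbank_like))
print(sniff_sequence_format(fasta_like))
```

With a real file you would pass the open file object instead of the
io.StringIO stand-ins, and only call the FASTA parser when the check
reports "fasta".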

Jon Taylor.


On Tue, 2004-08-03 at 16:48, Aaron Zschau wrote:
> This is the file that is being read. I know it worked just fine in 1.24,
> but maybe something changed between versions so that the parser no longer
> accepts this format.
> 
> 
> thanks,
> 
> Aaron Zschau
> 
> ------a12345.fasta----------
> 
> 
> LOCUS       XM_414447               2107 bp    mRNA    linear   VRT 28-JUL-2004
> DEFINITION  PREDICTED: Gallus gallus similar to von Hippel-Lindau protein
>             (LOC416117), mRNA.
> ACCESSION   XM_414447
> VERSION     XM_414447.1  GI:50754623
> KEYWORDS    .
> SOURCE      Gallus gallus (red jungle fowl)
>   ORGANISM  Gallus gallus
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
>             Phasianinae; Gallus.
> COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
>             analysis. This record is derived from an annotated genomic sequence
>             (NW_060494) using gene prediction method: GNOMON, supported by EST
>             evidence.
>             Also see:
>                 Documentation of NCBI's Annotation Process
> 
> FEATURES             Location/Qualifiers
>      source          1..2107
>                      /organism="Gallus gallus"
>                      /mol_type="mRNA"
>                      /strain="inbred line UCD001"
>                      /isolate="#256"
>                      /db_xref="taxon:9031"
>                      /chromosome="12"
>                      /sex="female"
>                      /note="inbred line derived from a wild population of red
>                      jungle fowl in Malaysia in the late 1930s, with the
>                      possible introgression of a limited amount of White
>                      Leghorn genome during its captive breeding history
>                      common: red jungle fowl"
>      gene            1..2107
>                      /gene="LOC416117"
>                      /note="Derived by automated computational analysis using
>                      gene prediction method: GNOMON."
>                      /db_xref="GeneID:416117"
>                      /db_xref="InterimID:416117"
>      CDS             1..486
>                      /gene="LOC416117"
>                      /codon_start=1
>                      /product="similar to von Hippel-Lindau protein"
>                      /protein_id="XP_414447.1"
>                      /db_xref="GI:50754624"
>                      /db_xref="GeneID:416117"
>                      /db_xref="InterimID:416117"
>                      /translation="MAPPGPGPAGPCLRSANTRELSEVVFNNRSPRAVLPIWVDFEGR
>                      PRYYPVLRPRTGRIMHSYRGHLWLFRDAGTHDGLLVNRQELFVAAPDVNKADITLPVF
>                      TLKERCLQVVRSLVRPGDYRKLDIVRSLYEELEDHPDVKKDLQRLSMERSKTLQEEIL
>                      H"
>      misc_feature    37..453
>                      /gene="LOC416117"
>                      /note="VHL; Region: von Hippel-Lindau disease tumour
>                      suppressor protein. VHL forms a ternary complex with the
>                      elonginB and elonginC proteins. This complex binds Cul2,
>                      which then is involved in regulation of vascular
>                      endothelial growth factor mRNA"
>                      /db_xref="CDD:pfam01847"
> ORIGIN
>         1 atggcgccgc cgggtccggg tcccgccggg ccgtgcctgc gctccgccaa cacgcgcgaa
>        61 ctctccgagg tcgtcttcaa caaccgcagc ccgcgcgccg tgctccccat ctgggtggac
>       121 ttcgagggcc ggccgcgcta ctaccccgtg ctgcggccgc gcaccgggcg gatcatgcac
>       181 agctaccgcg ggcacctgtg gctgttccgc gacgcgggca cgcacgacgg gctgctcgtc
>       241 aaccggcagg agctgttcgt ggccgcgccg gacgtcaaca aggccgacat cacgctgcca
>       301 gtgttcacgc tgaaggagcg gtgcctgcag gtggtgcgca gcctggtccg gccgggggac
>       361 taccggaagc tggacatcgt gcgctcgctg tacgaggagc tggaggacca ccccgacgtc
>       421 aagaaggacc tgcagcggct ctccatggag aggagcaaaa cgttacagga ggaaatcctc
>       481 cactaacagg gctgtgcgtc ccgagccgtg tagatagcaa agcaccgagc ttaggagggg
>       541 cagctgccgt gcagcgtgcc gggagctaac gtctgcatcg acgttctgga acgaactcag
>       601 tcatgctgta gaacatttgc tatgctggta ggtcagattc caaagagcaa acagtgtgca
>       661 ggaacgtact gctttgtgag ggctctgctc ccggtctcat gcactggtga gcagtgaccc
>       721 cagtggcctg gcacagacgg ggctcagaga agcttgcttc cgactgtttc agaacattcc
>       781 atagtaacac aagatttatc cgtctggagg aaatacatgc agctcagctt cctctgagtt
>       841 agaaagaaaa ctacatcaag ggttcactta atccagacta taaaatcagt ggcagagcag
>       901 caccaggttt gcttgaatga tttggttttg gcagaaattc gctctcacat gctaaattta
>       961 cttttgaatc acaaagcgtg gagcgtgttc atgtgagagc ttccacggtt gccttctgag
>      1021 ggctcggccc aaaacttctg tgctggcgga aagatgtccg taagcatttc tgtgttagcc
>      1081 tctgtctgtg cgttcataaa ccctcattgt agcaactctg aagctgacaa attcttacac
>      1141 agaacatgcc ttgaatgcct taatttgtct ttcattcctg aattcctgct tagtttatct
>      1201 ctagatgatg gaaccttgtc agccatatgg actgcatctt ggttttagga cccctttctg
>      1261 ctttgcacct ctgtgcccac accctcagct cccatagtgg tataccaagg gagcgttccc
>      1321 agaaggtggg tgctctgagc ctcatctttc ccttgtccca gggattggcc ttggggagca
>      1381 cagtccgccc aggccgctgg tgccccctga ggcacagaag ctgccccagc tgcaggcgtg
>      1441 gctcccccaa gcagagctgt gcttttcagc aggccagctg cacagagaga aatcatagaa
>      1501 tcacagaatc atacaatggc ctgggctgaa aaggaccaca atgcccatcc agttccaacc
>      1561 ccctgctatg tgcagggtca ccaaccagca gaccaggctg cccagagcca catccagcct
>      1621 ggccttgaat gcctccaggg atggggcctc cttgggcgac ctgttccaat gcatcaacac
>      1681 cctccaagtg aaaaacttcc tcctgatata cctgaacatc ccctgtctta tttaagatca
>      1741 ttcccccttg tcctgtcact atccaccctc gtgaacagct gttccccttc ctgtttatat
>      1801 gcttcctaaa atcaagaaag gttctaggcc tatatgttct cttcccccat acatcaaata
>      1861 cacaggtgtg tgtctgtatg tctctgtgca taactcaaag cagcgttgtt tttagcagat
>      1921 aggtgaattg ttccccaagt tgcaggcagg cgcagtgctg ctcagcatgc agagcagcag
>      1981 gttgctaaca gatagcagca ggctgttctg tggtgtaagg ttcttaagta tgcaatgtgt
>      2041 gcccttctcg tggacttttt ttttcttaaa tgtttgtgta tgaactgatc tttgtttctc
>      2101 ataaaaa
> //
> 
> 
> ------end file----------
> On Aug 3, 2004, at 4:23 PM, Jeffrey Chang wrote:
> 
> > Hi Aaron,
> >
> > Can you send the file that is generating the error?  I believe it is  
> > called /var/www/html/data/a12345.fasta.  In general, the fasta parser  
> > should be well-tested.  It works on a test file in fasta format that I  
> > have here.  It would help most if someone could look at your file to  
> > see what's going on.
> >
> > Thanks,
> > Jeff
> >
> >
> > On Aug 3, 2004, at 3:42 PM, Aaron Zschau wrote:
> >
> >> I've sent a couple of messages to the list about this, but I'm not sure
> >> they're going through, as I haven't seen any replies.  I am trying to
> >> get a section of my code working again; it is based on the cookbook
> >> tutorials and worked before the 1.30 revision of biopython.  My code
> >> looks up a gene by name in GenBank and saves the FASTA version of that
> >> data so that the protein string can be fed into a BLAST search.  The
> >> lookup works and the file is saved without problems, but I then get an
> >> error at the parse stage, at character 0 of the file.
> >>
> >> Any help would be greatly appreciated
> >>
> >> thanks
> >>
> >> Aaron Zschau
> >>
> >>
> >>
> >>
> >>
> >>
> >> #file_for_blast = open(data_path_prefix + file_unique_id + 'fasta', 'r')
> >> file_for_blast = open('/var/www/html/data/a12345.fasta','r')
> >>
> >> f_iterator = Fasta.Iterator(file_for_blast)
> >> print "iterator created"
> >> sys.stdout.flush()
> >>
> >> f_record = f_iterator.next()
> >> print "f_record created"
> >> sys.stdout.flush()
> >>
> >> -----------------------
> >>
> >> iterator created
> >> Traceback (most recent call last):
> >>   File "cluster-debug.py", line 119, in ?
> >>     f_record = f_iterator.next()
> >>   File "/root/biopython-1.30/build/lib.linux-i586-2.2/Bio/Fasta/__init__.py", line 72, in next
> >>     result = self._iterator.next()
> >>   File "/root/biopython-1.30/build/lib.linux-i586-2.2/Martel/IterParser.py", line 152, in iterateFile
> >>     self.header_parser.parseString(rec)
> >>   File "/root/biopython-1.30/build/lib.linux-i586-2.2/Martel/Parser.py", line 361, in parseString
> >>     self._err_handler.fatalError(ParserIncompleteException(pos))
> >>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
> >>     raise exception
> >> Martel.Parser.ParserIncompleteException: error parsing at or beyond character 0 (unparsed text remains)
> >>
> >> _______________________________________________
> >> BioPython mailing list  -  BioPython at biopython.org
> >> http://biopython.org/mailman/listinfo/biopython


