[BioPython] XML Parser problem
alper soyler
alpersoyler at yahoo.com
Mon Dec 11 14:44:12 UTC 2006
Dear all,
I ran blastall with the -m 7 option to save the results as XML. However, when I open the XML file in Firefox, it gives the following error message:
XML Parsing Error: junk after document element
Location: file:///home/alper/Desktop/genes/combinedblastfile.xml
Line Number 38, Column 1:
<?xml version="1.0"?>
^
The file can still be opened in a text editor. When I try to parse the results with Biopython, I get the errors below, and I do not understand the reason. I would be very glad for any help. Thank you in advance.
Traceback (most recent call last):
  File "XMLBlastParser.py", line 13, in ?
    b_record = b_parser.parse(blast_out)
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: /home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element
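A possible reading of the error, offered as a sketch rather than a diagnosis: both Firefox and the SAX parser stop at a second `<?xml version="1.0"?>` declaration on line 38, which is what a parser reports when several XML documents have been written one after another into a single file. That the file was built by combining separate BLAST reports is an assumption suggested only by its name. Under that assumption, the text could be split at each declaration and every piece parsed on its own. The helper name `split_xml_documents` and the toy input are hypothetical and not part of Biopython.

```python
import re
import xml.etree.ElementTree as ET


def split_xml_documents(text):
    """Split text at each XML declaration and return the non-empty pieces.

    Hypothetical helper: assumes every document starts with '<?xml ...?>'.
    """
    starts = [m.start() for m in re.finditer(r"<\?xml\s", text)]
    if not starts:
        return [text] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Toy stand-in for two reports written back to back into one file.
combined = (
    '<?xml version="1.0"?>\n<BlastOutput><n>1</n></BlastOutput>\n'
    '<?xml version="1.0"?>\n<BlastOutput><n>2</n></BlastOutput>\n'
)

# Parsing the combined text as a single document fails.
try:
    ET.fromstring(combined)
    combined_is_valid = True
except ET.ParseError:
    combined_is_valid = False

# Parsing each piece separately succeeds.
documents = split_xml_documents(combined)
roots = [ET.fromstring(doc) for doc in documents]
numbers = [root.findtext("n") for root in roots]
print(combined_is_valid, len(documents), numbers)
```

Each piece is then a well-formed document that could be handed to a BLAST XML parser separately.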
Alper Soyler
Dept. of Food Engineering
Middle East Technical University,Turkey
Tel:+90312 2105625
Fax:+90312 2102767
http://www.metu.edu.tr/~soyler
----- Original Message ----
From: "biopython-request at lists.open-bio.org" <biopython-request at lists.open-bio.org>
To: biopython at lists.open-bio.org
Sent: Sunday, October 8, 2006 8:03:28 AM
Subject: BioPython Digest, Vol 46, Issue 2
Today's Topics:
1. Re: Genbank parsing problem and fix (Gemma Atkinson)
2. Re: Genbank parsing problem and fix (Peter)
3. BioPython for TRANSFAC (Wijaya Edward)
4. Creating fusion protein like constructs with BioPython
(Mitchell Stanton-Cook)
5. Re: Creating fusion protein like constructs with BioPython (Peter)
6. Re: Creating fusion protein like constructs with BioPython
(Thomas Hamelryck)
7. Problem parsing Blast XML output from different sources
(Steffi Gebauer-Jung)
8. Re: Problem parsing Blast XML output from different sources
(Michiel Jan Laurens de Hoon)
9. Join kirby white on Yahoo! Messenger! (kirbywhite at sbcglobal.net)
10. Re: Problem parsing Blast XML output from different sources
(Michiel de Hoon)
----------------------------------------------------------------------
Message: 1
Date: Tue, 3 Oct 2006 12:36:58 +0100
From: Gemma Atkinson <gca500 at york.ac.uk>
Subject: Re: [BioPython] Genbank parsing problem and fix
To: biopython at lists.open-bio.org
Message-ID: <E1279C3E-878C-4F85-8551-793B82D90013 at york.ac.uk>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Hi Peter,
I was using the Bio.Genbank module. This is the code I've been using:
from Bio import GenBank
parser = GenBank.RecordParser(debug_level=2)
record = parser.parse(open("test4.txt"))
It was the expressions/genbank.py file, imported from within the
Genbank module that I've been changing. I haven't touched the
formatdefs/genbank.py file (should have made that clear before - sorry).
This was the error I was getting before I changed
expressions/genbank.py:
  File "testgbparser.py", line 3, in ?
    record = parser.parse(open("test4.txt"))
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Bio/GenBank/__init__.py", line 240, in parse
    self._scanner.feed(handle, self._consumer)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Bio/GenBank/__init__.py", line 1259, in feed
    self._parser.parseFile(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Martel/Parser.py", line 328, in parseFile
    self.parseString(fileobj.read())
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Martel/Parser.py", line 356, in parseString
    self._err_handler.fatalError(result)
  File "/Library/Frameworks/Python.framework/Versions/2.4//lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 1153
Gemma
On 3 Oct 2006, at 10:54, Peter wrote:
> gca500 at york.ac.uk wrote:
>> Hi All,
>> Been having a problem using the Genbank RecordParser with some
>> Genbank files that have recently been added to NCBI. After a bit
>> of trial and error, I realised the problem only occurs if a
>> REFERENCE field isn't followed by an AUTHOR field (for example in
>> reference 2 of this record:
>> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=88602864).
>> There's a very easy fix on line 289 of Genbank.py. Decided to post
>> this to the list to save any one else who stumbles across this
>> problem tearing their hair out like I've been doing this afternoon!
>> Change ... and it works!
>> Hope this is useful,
>> Gemma
>
> Hi Gemma,
>
> I have made your suggested change to biopython/Bio/formatdefs/genbank.py
> as CVS revision 1.10, which should be viewable online soon:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/expressions/genbank.py?cvsroot=biopython
>
> I am curious as to why you are using this code (part of the
> FormatIO system), rather than the Bio.GenBank module.
>
> Thank you,
>
> Peter
>
------------------------------
Message: 2
Date: Tue, 03 Oct 2006 14:33:48 +0100
From: Peter <biopython at maubp.freeserve.co.uk>
Subject: Re: [BioPython] Genbank parsing problem and fix
To: Gemma Atkinson <gca500 at york.ac.uk>
Cc: biopython at lists.open-bio.org
Message-ID: <452266BC.9060809 at maubp.freeserve.co.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>> Hi Gemma,
>>
>> I have made your suggested change to biopython/Bio/formatdefs/genbank.py
>> as CVS revision 1.10, which should be viewable online soon:
>>
>> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/expressions/genbank.py?cvsroot=biopython
I got the URL right, but I meant to say Bio/expressions/genbank.py (which
actually has the Martel definition in it), not Bio/formatdefs/genbank.py.
Peter wrote:
>> I am curious as to why you are using this code ...
Gemma replied:
> I was using the Bio.Genbank module. This is the code I've been using:
>
> from Bio import GenBank
> parser = GenBank.RecordParser(debug_level=2)
> record = parser.parse(open("test4.txt"))
I would guess you are using BioPython 1.41 (or older) then, as your
stack trace was indeed using Martel internally.
Recent versions of BioPython (1.42 and later) use a pure python parser
in Bio.GenBank as the old Martel code didn't scale well with large input
files (to the point of being almost useless on large genomes).
If you do update your installation, and run into any problems with the
GenBank parser, please do let us know.
Peter
------------------------------
Message: 3
Date: Tue, 03 Oct 2006 22:16:27 +0800
From: Wijaya Edward <ewijaya at i2r.a-star.edu.sg>
Subject: [BioPython] BioPython for TRANSFAC
To: biopython at lists.open-bio.org
Message-ID:
<3ACF03E372996C4EACD542EA8A05E66A061584 at mailbe01.teak.local.net>
Content-Type: text/plain; charset=iso-8859-1
Hi there,
Is there a method in BioPython that allows me to pass the query "fruitfly" or "drosophila"
and then returns:
1. the already characterized TFs and their binding sites (BS),
2. their respective coregulated genes, and
3. the location/position of the TFBS in the genes,
all from the TRANSFAC database?
--
Regards,
Edward WIJAYA
------------------------------
Message: 4
Date: Wed, 4 Oct 2006 23:38:03 +1000
From: "Mitchell Stanton-Cook" <m.stantoncook at gmail.com>
Subject: [BioPython] Creating fusion protein like constructs with
BioPython
To: BioPython at lists.open-bio.org
Message-ID:
<cb57bffd0610040638i2fb4c0cahf9021d3f5d3dc9dd at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hello all.
I am trying to create a fusion protein-like model from two separate PDB files.
I introduce a CYS mutant in the target protein, and then wish to form a
disulphide bond between it and a small peptide.
This is pure computational work.
I am using Bio.PDB. As the two structures are in arbitrary frames of
reference I need to rotate and translate to form the "construct".
I wish to have
TargetProtein-CB-SY-SY-CB-SmallPeptide (the peptide is not really added to
the N/C term)
I have tried many different approaches but have failed miserably to get
SmallPeptide rotated relative to TargetProtein at the correct dihedral angle
(+/-90deg) and bond lengths.
My current approach is (omitting the correct bond length at this time):
TP-CB-SY   SY-CB-SP
   1  2    3  4
Translate 2 onto 3
Calculate the angle between 1-(23)-4
Calculate the cross product of 1-23 x 23-4
Generate the rotation matrix given the angle and vector
Rotate all SP (SmallPeptide) atoms by this rotation matrix.
This has not worked.
I have had some other ideas and have written code for them. Ideally, I wish
to calculate the rotations about X,Y,Z to place the SP at the correct
dihedral angle followed by translation, but I have no idea how to do this.
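One way this could be sketched, as an illustration only and not something taken from Bio.PDB: once atoms 2 and 3 (the two sulfurs) sit at bonding distance, the 1-2-3-4 dihedral can be changed by rotating every SmallPeptide atom about the axis through atoms 2 and 3 by the difference between the target and the current dihedral. The rotation uses Rodrigues' formula in pure Python. All function names and the toy coordinates below are hypothetical, and the sketch assumes the peptide has already been translated into place.

```python
import math


def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle p1-p2-p3-p4 in radians."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))


def rotate_about_axis(point, origin, axis, angle):
    """Rotate point by angle (radians) about the line through origin along axis."""
    k = scale(axis, 1.0 / norm(axis))
    v = sub(point, origin)
    c, s = math.cos(angle), math.sin(angle)
    # Rodrigues' rotation formula.
    rotated = add(add(scale(v, c), scale(cross(k, v), s)),
                  scale(k, dot(k, v) * (1.0 - c)))
    return add(rotated, origin)


def set_dihedral(p1, p2, p3, moving, target):
    """Rotate the moving atoms (the first one is atom 4) about the p2->p3 axis
    so that the p1-p2-p3-p4 dihedral becomes target (radians)."""
    delta = target - dihedral(p1, p2, p3, moving[0])
    axis = sub(p3, p2)
    return [rotate_about_axis(p, p3, axis, delta) for p in moving]


# Toy coordinates (made up): atom 1 = CB and atom 2 = S of the target protein,
# atom 3 = S of the peptide, then the peptide atoms starting with its CB (atom 4).
cb_target = (1.0, 0.5, -0.8)
s_target = (0.0, 0.0, 0.0)
s_peptide = (0.3, 0.2, 2.0)
peptide_atoms = [(1.5, 1.0, 2.6), (2.4, 0.1, 3.3)]

new_atoms = set_dihedral(cb_target, s_target, s_peptide, peptide_atoms,
                         math.radians(90.0))
new_angle = math.degrees(dihedral(cb_target, s_target, s_peptide, new_atoms[0]))
print(round(new_angle, 6))
```

Because the rotation axis passes through atom 3, the peptide sulfur stays where it is and all distances within the peptide are preserved; only the dihedral changes.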
1) Can I use Bio.PDB to do this above task or do I need to look at something
else?
2) Does anyone have any ideas on how to complete this goal?
Thanking you for your time.
Mitch
------------------------------
Message: 5
Date: Thu, 05 Oct 2006 10:47:30 +0100
From: Peter <biopython at maubp.freeserve.co.uk>
Subject: Re: [BioPython] Creating fusion protein like constructs with
BioPython
To: Mitchell Stanton-Cook <m.stantoncook at gmail.com>
Cc: BioPython at lists.open-bio.org
Message-ID: <4524D4B2.8030600 at maubp.freeserve.co.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Mitchell Stanton-Cook wrote:
> Hello all.
>
> I am trying to create a fusion protein-like model from two separate PDB files.
> I introduce a CYS mutant in the target protein, and then wish to form a
> disulphide bond between it and a small peptide.
>
> This is pure computational work.
>
> ...
>
> 1) Can I use Bio.PDB to do this above task or do I need to look at something
> else?
My gut instinct is that yes, you probably can - but you will have to do
a lot of the work with your own code. It's not something I have ever
tried though.
> 2) Does anyone have any ideas on how to complete this goal?
You might want to have a look at MMTK, which on the face of it would be
better suited. Assuming MMTK will read both PDB files you might have
better luck - this proviso is because I have found MMTK will choke on
"odd" PDB files, and its support for non-standard residues could be better.
http://starship.python.net/crew/hinsen/MMTK/index.html
Peter
------------------------------
Message: 6
Date: Thu, 5 Oct 2006 11:52:56 +0200
From: "Thomas Hamelryck" <thamelry at binf.ku.dk>
Subject: Re: [BioPython] Creating fusion protein like constructs with
BioPython
To: "Mitchell Stanton-Cook" <m.stantoncook at gmail.com>
Cc: BioPython at lists.open-bio.org
Message-ID:
<2d7c25310610050252j2f889242h84411e0927fb4502 at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Hi,
> I am trying to create a fusion protein-like model from two separate PDB files.
> I introduce a CYS mutant in the target protein, and then wish to form a
> disulphide bond between it and a small peptide.
...
> 1) Can I use Bio.PDB to do this above task or do I need to look at something
> else?
Bio.PDB has functionality to do vector/rotation calculations.
Take a look at the Vector.py module.
Best,
----
Thomas Hamelryck, Post-doctoral researcher
Bioinformatics center
Institute of Molecular Biology and Physiology
University of Copenhagen
Universitetsparken 15 - Bygning 10
DK-2100 Copenhagen Ø
Denmark
Homepage: http://www.binf.ku.dk/Protein_structure
------------------------------
Message: 7
Date: Thu, 05 Oct 2006 12:30:36 +0200
From: Steffi Gebauer-Jung <gebauer-jung at ice.mpg.de>
Subject: [BioPython] Problem parsing Blast XML output from different
sources
To: biopython at lists.open-bio.org
Message-ID: <4524DECC.3030307 at ice.mpg.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hello,
Because blastall 2.2.14 output was not parsed by the
Bio.Blast.NCBIStandalone parser,
I tried to switch to the recommended Bio.Blast.NCBIXML parser.
In doing so I found that the XML output of the locally installed standalone
blastall (2.2.14)
differs from the web XML output.
For BlastN HSPs on Plus/Minus strands, the XML gives
query_frame/hit_frame 1 / -1 as usual.
But the query and hit positions and sequences are reversed in direction
(they would match frames -1/1).
As the Bio.Blast.Record returned by the NCBIXML parser only gives
frames, sequences
and start positions, it is not possible (without knowing the source of
the XML file)
to be sure of finding the right data.
This is clearly a problem with Blast.
But because the end positions are missing from the returned record object,
it becomes a problem for users of the parser too.
Could somebody try to confirm the different behaviour of the XML blast
output
with his/her own examples/installation?
Thanks, Steffi
------------------------------
Message: 8
Date: Thu, 05 Oct 2006 12:01:04 -0400
From: Michiel Jan Laurens de Hoon <mdehoon at c2b2.columbia.edu>
Subject: Re: [BioPython] Problem parsing Blast XML output from
different sources
To: Steffi Gebauer-Jung <gebauer-jung at ice.mpg.de>
Cc: biopython at lists.open-bio.org
Message-ID: <45252C40.8040806 at c2b2.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Which sequence are you running blast on?
I'd like to try this on our local blast installation.
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
------------------------------
Message: 9
Date: 06 Oct 2006 01:30:32 -0700
From: kirbywhite at sbcglobal.net
Subject: [BioPython] Join kirby white on Yahoo! Messenger!
To: biopython at biopython.org
Message-ID: <200610060837.k968bH7m002645 at portal.open-bio.org>
Content-Type: text/plain; charset=en_US.ISO-8859-1
------------------------------
Message: 10
Date: Sun, 08 Oct 2006 00:51:09 -0400
From: Michiel de Hoon <mdehoon at c2b2.columbia.edu>
Subject: Re: [BioPython] Problem parsing Blast XML output from
different sources
To: Steffi Gebauer-Jung <gebauer-jung at ice.mpg.de>,
biopython at biopython.org
Message-ID: <452883BD.7050907 at c2b2.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi Steffi,
I am trying to replicate this problem with Blast. Where did you get the
pat database? I searched for it with google, but there seems to be more
than one blast database called pat.
--Michiel.
Steffi Gebauer-Jung wrote:
> Hello,
>
> I don't know what local databases you have available for testing.
> The discrepancy between xml and 'pairwise text' output should be seen
> for every Plus/Minus Hsp created by local Blastn (local server or
> standalone blastall from command line, I use version 2.2.14)
>
> I tried several combinations, one is M38240 vs. pat database,
> the hsp hit was BD298385.
> Here are the interesting output snippets:
>
>> dbj|BD298385.1|
>> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=92136243&dopt=GenBank>
>> CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS AND PLANT PARTS
> CONTAINING THEM, AND METHODS FOR OBTAINING THEM
> Length = 14108
>
> Score = 125 bits (63), Expect = 1e-25
> Identities = 63/63 (100%)
> Strand = Plus / Minus
>
>
> Query: 727  aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc 786
>             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 8332 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc 8273
>
> Query: 787  cga 789
>             |||
> Sbjct: 8272 cga 8270
>
> =====================================================
> <Hit>
> <Hit_num>15</Hit_num>
> <Hit_id>gi|92136243|dbj|BD298385.1|</Hit_id>
> <Hit_def>CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS
> AND PLANT PARTS CONTAINING THEM, AND METHODS FOR OBTAINING THEM</Hit_def>
> <Hit_accession>BD298385</Hit_accession>
> <Hit_len>14108</Hit_len>
> <Hit_hsps>
> <Hsp>
> <Hsp_num>1</Hsp_num>
> <Hsp_bit-score>125.381</Hsp_bit-score>
> <Hsp_score>63</Hsp_score>
> <Hsp_evalue>9.63859e-26</Hsp_evalue>
> <Hsp_query-from>789</Hsp_query-from>
> <Hsp_query-to>727</Hsp_query-to>
> <Hsp_hit-from>8270</Hsp_hit-from>
> <Hsp_hit-to>8332</Hsp_hit-to>
> <Hsp_query-frame>1</Hsp_query-frame>
> <Hsp_hit-frame>-1</Hsp_hit-frame>
> <Hsp_identity>63</Hsp_identity>
> <Hsp_positive>63</Hsp_positive>
> <Hsp_align-len>63</Hsp_align-len>
>
> <Hsp_qseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_qseq>
>
>
> <Hsp_hseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_hseq>
>
>
> <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||</Hsp_midline>
>
> </Hsp>
> </Hit_hsps>
> </Hit>
>
> Thanks, Steffi
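Judging only from the two snippets above, the XML output appears to report this HSP as the reverse complement of the pairwise text output: the from/to pairs are swapped for both query and hit, and reverse-complementing the XML sequence gives back the sequence shown in the text alignment. As a hedged sketch of how a user might bring such an HSP back to the text orientation (query on the plus strand), the helper below swaps the coordinates and reverse-complements both sequences whenever the query runs from high to low. The names `reverse_complement` and `normalize_hsp` are hypothetical, not Biopython functions, and whether other BLAST versions behave the same way is not known from this thread.

```python
# Translation table for nucleotide complements; other characters (such as gap
# symbols) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_hsp(query_from, query_to, hit_from, hit_to, qseq, hseq):
    """Return the HSP oriented so that the query runs from low to high.

    Hypothetical helper: if the query coordinates are descending, both
    coordinate pairs are swapped and both sequences are reverse-complemented.
    """
    if query_from <= query_to:
        return query_from, query_to, hit_from, hit_to, qseq, hseq
    return (query_to, query_from, hit_to, hit_from,
            reverse_complement(qseq), reverse_complement(hseq))


# Values copied from the XML snippet quoted above.
xml_seq = "TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT"
hsp = normalize_hsp(789, 727, 8270, 8332, xml_seq, xml_seq)

# Query sequence from the pairwise text output quoted above (both rows joined).
text_seq = ("aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc"
            "cga")

print(hsp[:4])
print(hsp[4].lower() == text_seq)
```

After normalization the coordinates read 727..789 for the query and 8332..8270 for the hit, which is the orientation of the Plus / Minus text alignment.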
------------------------------
_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython
End of BioPython Digest, Vol 46, Issue 2
****************************************