[BioPython] XML Parser problem

Mon Dec 11 14:44:12 UTC 2006

Dear all,

I ran blastall with the option -m7 to save the results as XML. However, when I open the XML file in Firefox, it gives the following error message:

XML Parsing Error: junk after document element
Location: file:///home/alper/Desktop/genes/combinedblastfile.xml
Line Number 38, Column 1:
<?xml version="1.0"?>
^
The file can still be opened in a text editor. When I try to parse the results with Biopython, I also get the error below. I do not understand the cause. I would be very grateful for any help. Thank you in advance.
Traceback (most recent call last):
  File "XMLBlastParser.py", line 13, in ?
    b_record = b_parser.parse(blast_out)
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: /home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element
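
A second `<?xml version="1.0"?>` declaration at line 38 suggests that the file may hold more than one XML document written back to back (for example, several BLAST reports combined into one file), which a single-document XML parser rejects as "junk after document element". The sketch below assumes that is the cause; it is not taken from Biopython. The helper name `split_concatenated_xml` and the sample text are hypothetical, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text):
    """Yield each XML document in text, splitting at '<?xml ' declarations."""
    chunk = []
    for line in text.splitlines(keepends=True):
        # A declaration seen after other lines marks the start of a new document.
        if line.lstrip().startswith("<?xml ") and chunk:
            doc = "".join(chunk).strip()
            if doc:
                yield doc
            chunk = []
        chunk.append(line)
    doc = "".join(chunk).strip()
    if doc:
        yield doc


# Made-up stand-in for two BLAST reports written one after the other.
sample = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_program>blastn</BlastOutput_program></BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_program>blastp</BlastOutput_program></BlastOutput>\n"
)

docs = list(split_concatenated_xml(sample))
# Each piece is now a well-formed document and parses on its own.
programs = [ET.fromstring(d).findtext("BlastOutput_program") for d in docs]
print(len(docs), programs)
```

Each piece could then be wrapped in a file-like object (for example `io.StringIO`) and handed to the BLAST XML parser one report at a time.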

Alper Soyler
Dept. of Food Engineering
Middle East Technical University,Turkey
Tel:+90312 2105625
Fax:+90312 2102767
http://www.metu.edu.tr/~soyler

----- Original Message ----
From: "biopython-request at lists.open-bio.org" <biopython-request at lists.open-bio.org>
To: biopython at lists.open-bio.org
Sent: Sunday, October 8, 2006 8:03:28 AM
Subject: BioPython Digest, Vol 46, Issue 2

Send BioPython mailing list submissions to
    biopython at lists.open-bio.org

To subscribe or unsubscribe via the World Wide Web, visit
    http://lists.open-bio.org/mailman/listinfo/biopython
or, via email, send a message with subject or body 'help' to
    biopython-request at lists.open-bio.org

You can reach the person managing the list at
    biopython-owner at lists.open-bio.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of BioPython digest..."

Today's Topics:

   1. Re: Genbank parsing problem and fix (Gemma Atkinson)
   2. Re: Genbank parsing problem and fix (Peter)
   3. BioPython for TRANSFAC (Wijaya Edward)
   4. Creating fusion protein like constructs with BioPython
      (Mitchell Stanton-Cook)
   5. Re: Creating fusion protein like constructs with    BioPython (Peter)
   6. Re: Creating fusion protein like constructs with    BioPython
      (Thomas Hamelryck)
   7. Problem parsing Blast XML output from different sources
      (Steffi Gebauer-Jung)
   8. Re: Problem parsing Blast XML output from different    sources
      (Michiel Jan Laurens de Hoon)
  10. Re: Problem parsing Blast XML output from different    sources
      (Michiel de Hoon)

----------------------------------------------------------------------

Message: 1
Date: Tue, 3 Oct 2006 12:36:58 +0100
From: Gemma Atkinson <gca500 at york.ac.uk>
Subject: Re: [BioPython] Genbank parsing problem and fix
To: biopython at lists.open-bio.org
Message-ID: <E1279C3E-878C-4F85-8551-793B82D90013 at york.ac.uk>
Content-Type: text/plain;    charset=US-ASCII;    delsp=yes;    format=flowed

Hi Peter,

I was using the Bio.GenBank module. This is the code I've been using:

from Bio import GenBank
parser = GenBank.RecordParser(debug_level=2)
record = parser.parse(open("test4.txt"))

It was the expressions/genbank.py file, imported from within the
GenBank module, that I've been changing. I haven't touched the
formatdefs/genbank.py file (I should have made that clear before, sorry).

This was the error I was getting before I changed expressions/genbank.py:

  File "testgbparser.py", line 3, in ?
    record = parser.parse(open("test4.txt"))
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Bio/GenBank/__init__.py", line 240, in parse
    self._scanner.feed(handle, self._consumer)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Bio/GenBank/__init__.py", line 1259, in feed
    self._parser.parseFile(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Martel/Parser.py", line 328, in parseFile
    self.parseString(fileobj.read())
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/Martel/Parser.py", line 356, in parseString
    self._err_handler.fatalError(result)
  File "/Library/Frameworks/Python.framework/Versions/2.4//lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 1153

Gemma

On 3 Oct 2006, at 10:54, Peter wrote:

> gca500 at york.ac.uk wrote:
>> Hi All,
>> Been having a problem using the Genbank RecordParser with some
>> Genbank files that have recently been added to NCBI. After a bit
>> of trial and error, I realised the problem only occurs if a
>> REFERENCE field isn't followed by an AUTHOR field (for example in
>> reference 2 of this record:
>> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=88602864).
>> There's a very easy fix on line 289 of Genbank.py. Decided to post
>> this to the list to save anyone else who stumbles across this
>> problem tearing their hair out like I've been doing this afternoon!
>> Change ... and it works!
>> Hope this is useful,
>> Gemma
>
> Hi Gemma,
>
> I have made your suggested change to biopython/Bio/formatdefs/genbank.py
> as CVS revision 1.10, which should be viewable online soon:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/expressions/genbank.py?cvsroot=biopython
>
> I am curious as to why you are using this code (part of the  
> FormatIO system), rather than the Bio.GenBank module.
>
> Thank you,
>
> Peter
>

------------------------------

Message: 2
Date: Tue, 03 Oct 2006 14:33:48 +0100
From: Peter <biopython at maubp.freeserve.co.uk>
Subject: Re: [BioPython] Genbank parsing problem and fix
To: Gemma Atkinson <gca500 at york.ac.uk>
Cc: biopython at lists.open-bio.org
Message-ID: <452266BC.9060809 at maubp.freeserve.co.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

>> Hi Gemma,
>>
>> I have made your suggested change to biopython/Bio/formatdefs/genbank.py
>> as CVS revision 1.10, which should be viewable online soon:
>>
>> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/expressions/genbank.py?cvsroot=biopython

I got the URL right, but I meant to say Bio/expressions/genbank.py (which
actually has the Martel definition in it), not Bio/formatdefs/genbank.py.

Peter wrote:
>> I am curious as to why you are using this code ... 

Gemma replied:
 > I was using the Bio.Genbank module. This is the code I've been using:
 >
 > from Bio import GenBank
 > parser = GenBank.RecordParser(debug_level=2)
 > record = parser.parse(open("test4.txt"))

I would guess you are using BioPython 1.41 (or older) then, as your 
stack trace was indeed using Martel internally.

Recent versions of BioPython (1.42 and later) use a pure Python parser
in Bio.GenBank, as the old Martel code didn't scale well with large input
files (to the point of being almost useless on large genomes).

If you do update your installation, and run into any problems with the 
GenBank parser, please do let us know.

Peter

------------------------------

Message: 3
Date: Tue, 03 Oct 2006 22:16:27 +0800
From: Wijaya Edward <ewijaya at i2r.a-star.edu.sg>
Subject: [BioPython] BioPython for TRANSFAC
To: biopython at lists.open-bio.org
Message-ID:
    <3ACF03E372996C4EACD542EA8A05E66A061584 at mailbe01.teak.local.net>
Content-Type: text/plain; charset=iso-8859-1

Hi there,

Is there a method in BioPython that lets me pass the query "fruitfly" or "drosophila"
and returns:

1.    the already characterized TFs and their binding sites (BS),
2.    their respective coregulated genes, and
3.    the positions of the TFBSs in the genes,

all from the TRANSFAC database?

-- 
Regards,
Edward WIJAYA 

------------------------------

Message: 4
Date: Wed, 4 Oct 2006 23:38:03 +1000
From: "Mitchell Stanton-Cook" <m.stantoncook at gmail.com>
Subject: [BioPython] Creating fusion protein like constructs with
    BioPython
To: BioPython at lists.open-bio.org
Message-ID:
    <cb57bffd0610040638i2fb4c0cahf9021d3f5d3dc9dd at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hello all.

I am trying to create a fusion protein-like model from two separate PDB files.
I introduce a CYS mutant in the target protein, and then wish to form a
disulphide bond between it and a small peptide.

This is pure computational work.

I am using Bio.PDB. As the two structures are in arbitrary frames of
reference I need to rotate and translate to form the "construct".

I wish to have

TargetProtein-CB-SY-SY-CB-SmallPeptide (the peptide is not really added to
the N/C term)

I have tried many different approaches but have failed miserably to get
SmallPeptide rotated relative to TargetProtein at the correct dihedral angle
(+/-90deg) and bond lengths.

My current approach is (omitting the correct bond length at this time):

TP-CB-SY     SY-CB-SP
   1  2      3  4

Translate  2 onto 3
Calculate the angle between 1-(23)-4
Calculate the cross product of 1-23 x 23-4
Generate the rotation matrix given the angle and vector
Rotate all SP (SmallPeptide) atoms by this rotation matrix.

This has not worked.

I have had some other ideas and have written code for them. Ideally, I wish
to calculate the rotations about X,Y,Z to place the SP at the correct
dihedral angle followed by translation, but I have no idea how to do this.
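
One possible way to reach a chosen CB-SY-SY-CB dihedral is sketched below in plain Python. It is a different technique from the steps listed above: instead of superimposing atoms 2 and 3, it assumes the two sulphur atoms (SY here, named SG in PDB files) already sit one S-S bond length apart, and it rotates every peptide atom about the S-S axis by the difference between the target and the current dihedral (Rodrigues' rotation formula). All function names and coordinates are made up for illustration; nothing here is part of Bio.PDB.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle p1-p2-p3-p4 in radians."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(dot(b2, cross(n1, n2)), norm(b2) * dot(n1, n2))


def rotate_about_axis(point, origin, axis, angle):
    """Rodrigues rotation of point about the line through origin along axis."""
    k = scale(axis, 1.0 / norm(axis))
    v = sub(point, origin)
    c, s = math.cos(angle), math.sin(angle)
    rotated = add(add(scale(v, c), scale(cross(k, v), s)),
                  scale(k, dot(k, v) * (1.0 - c)))
    return add(rotated, origin)


def set_dihedral(cb1, sy1, sy2, peptide_atoms, target):
    """Rotate peptide atoms about the sy1->sy2 axis; peptide_atoms[0] is its CB."""
    delta = target - dihedral(cb1, sy1, sy2, peptide_atoms[0])
    axis = sub(sy2, sy1)
    return [rotate_about_axis(p, sy2, axis, delta) for p in peptide_atoms]


# Hypothetical coordinates (in Angstrom) for atoms 1-4 plus one more peptide atom.
cb1, sy1, sy2 = (-0.6, 1.7, 0.0), (0.0, 0.0, 0.0), (2.05, 0.0, 0.0)
peptide = [(2.6, -1.2, 1.2), (3.9, -1.0, 2.0)]

moved = set_dihedral(cb1, sy1, sy2, peptide, math.radians(90.0))
new_angle = dihedral(cb1, sy1, sy2, moved[0])
print(round(math.degrees(new_angle), 3))
```

Because the rotation axis passes through both sulphur atoms, the S-S bond and all distances within the peptide are left unchanged; only the dihedral moves.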

1) Can I use Bio.PDB to do this above task or do I need to look at something
else?
2) Does anyone have any ideas on how to complete this goal?

Thanking you for your time.

Mitch

------------------------------

Message: 5
Date: Thu, 05 Oct 2006 10:47:30 +0100
From: Peter <biopython at maubp.freeserve.co.uk>
Subject: Re: [BioPython] Creating fusion protein like constructs with
    BioPython
To: Mitchell Stanton-Cook <m.stantoncook at gmail.com>
Cc: BioPython at lists.open-bio.org
Message-ID: <4524D4B2.8030600 at maubp.freeserve.co.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Mitchell Stanton-Cook wrote:
> Hello all.
> 
> I am trying to create a fusion protein-like model from two separate PDB files.
> I introduce a CYS mutant in the target protein, and then wish to form a
> disulphide bond between it and a small peptide.
> 
> This is pure computational work.
> 
> ...
> 
> 1) Can I use Bio.PDB to do this above task or do I need to look at something
> else?

My gut instinct is that yes, you probably can, but you will have to do
a lot of the work with your own code.  It's not something I have ever
tried though.

> 2) Does anyone have any ideas on how to complete this goal?

You might want to have a look at MMTK, which on the face of it would be 
better suited.  Assuming MMTK will read both PDB files you might have 
better luck - this proviso is because I have found MMTK will choke on 
"odd" PDB files, and its support for non-standard residues could be better.

http://starship.python.net/crew/hinsen/MMTK/index.html

Peter

------------------------------

Message: 6
Date: Thu, 5 Oct 2006 11:52:56 +0200
From: "Thomas Hamelryck" <thamelry at binf.ku.dk>
Subject: Re: [BioPython] Creating fusion protein like constructs with
    BioPython
To: "Mitchell Stanton-Cook" <m.stantoncook at gmail.com>
Cc: BioPython at lists.open-bio.org
Message-ID:
    <2d7c25310610050252j2f889242h84411e0927fb4502 at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi,

> I am trying to create a fusion protein-like model from two separate PDB files.
> I introduce a CYS mutant in the target protein, and then wish to form a
> disulphide bond between it and a small peptide.
...
> 1) Can I use Bio.PDB to do this above task or do I need to look at something
> else?

Bio.PDB has functionality to do vector/rotation calculations.
Take a look at the Vector.py module.

Best,

----
Thomas Hamelryck, Post-doctoral researcher
Bioinformatics center
Institute of Molecular Biology and Physiology
University of Copenhagen
Universitetsparken 15 - Bygning 10
DK-2100 Copenhagen Ø
Denmark
Homepage: http://www.binf.ku.dk/Protein_structure

------------------------------

Message: 7
Date: Thu, 05 Oct 2006 12:30:36 +0200
From: Steffi Gebauer-Jung <gebauer-jung at ice.mpg.de>
Subject: [BioPython] Problem parsing Blast XML output from different
    sources
To: biopython at lists.open-bio.org
Message-ID: <4524DECC.3030307 at ice.mpg.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hello,

Because blastall 2.2.14 output could not be parsed by the
Bio.Blast.NCBIStandalone parser, I tried to switch to the recommended
Bio.Blast.NCBIXML parser.

In doing so I found that the XML output of the locally installed
standalone blastall (2.2.14) differs from the web XML output.

For BlastN HSPs on Plus/Minus strands, the XML gives
query_frame/hit_frame  1 / -1 as usual.
But the query and hit positions and sequences are reversed in direction
(they would match frames -1/1).

As the Bio.Blast.Record returned by the NCBIXML parser only gives
frames, sequences and start positions, it is not possible (without
knowing the source of the XML file) to be sure of finding the right data.

This is clearly a problem with Blast.
But because the end positions are missing from the returned record object,
it becomes a problem for users of the parser too.
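
As a rough illustration of why the end positions matter, the sketch below reads the from/to coordinates straight from an `<Hsp>` element with the Python standard library and derives a strand from their order. The numbers are the ones in the XML snippet quoted later in this digest; the function name `hsp_coordinates` and the rule "from greater than to means minus strand" are assumptions made for the example, not documented behaviour of blastall or of the NCBIXML parser.

```python
import xml.etree.ElementTree as ET

# Coordinates copied from the standalone blastall XML quoted later in this digest.
HSP_XML = """<Hsp>
  <Hsp_query-from>789</Hsp_query-from>
  <Hsp_query-to>727</Hsp_query-to>
  <Hsp_hit-from>8270</Hsp_hit-from>
  <Hsp_hit-to>8332</Hsp_hit-to>
  <Hsp_query-frame>1</Hsp_query-frame>
  <Hsp_hit-frame>-1</Hsp_hit-frame>
</Hsp>"""


def hsp_coordinates(hsp):
    """Return {name: (start, end, strand)} with start <= end for query and hit."""
    def span(prefix):
        start = int(hsp.findtext(f"Hsp_{prefix}-from"))
        end = int(hsp.findtext(f"Hsp_{prefix}-to"))
        # Assumption: descending coordinates mean the minus strand.
        strand = 1 if start <= end else -1
        return min(start, end), max(start, end), strand
    return {"query": span("query"), "hit": span("hit")}


coords = hsp_coordinates(ET.fromstring(HSP_XML))
# The relative orientation does not depend on which sequence was reversed.
relative = coords["query"][2] * coords["hit"][2]
print(coords, relative)
```

With both ends available, the relative orientation of query and hit comes out the same whichever of the two sequences the report shows reversed, which is the information that start positions alone do not give.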

Could somebody try to confirm the different behaviour of the XML blast
output with his/her own examples/installation?

Thanks, Steffi

------------------------------

Message: 8
Date: Thu, 05 Oct 2006 12:01:04 -0400
From: Michiel Jan Laurens de Hoon <mdehoon at c2b2.columbia.edu>
Subject: Re: [BioPython] Problem parsing Blast XML output from
    different    sources
To: Steffi Gebauer-Jung <gebauer-jung at ice.mpg.de>
Cc: biopython at lists.open-bio.org
Message-ID: <45252C40.8040806 at c2b2.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Which sequence are you running blast on?
I'd like to try this on our local blast installation.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

------------------------------

Message: 10
Date: Sun, 08 Oct 2006 00:51:09 -0400
From: Michiel de Hoon <mdehoon at c2b2.columbia.edu>
Subject: Re: [BioPython] Problem parsing Blast XML output from
    different    sources
To: Steffi Gebauer-Jung <gebauer-jung at ice.mpg.de>,
    biopython at biopython.org
Message-ID: <452883BD.7050907 at c2b2.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Steffi,

I am trying to replicate this problem with Blast. Where did you get the 
pat database? I searched for it with google, but there seems to be more 
than one blast database called pat.

--Michiel.

Steffi Gebauer-Jung wrote:
> Hello,
> 
> I don't know what local databases you have available for testing.
> The discrepancy between xml and 'pairwise text' output  should be seen
> for every Plus/Minus Hsp created by local Blastn (local server or
> standalone blastall from command line, I use version 2.2.14)
> 
> I tried several combinations, one is M38240 vs. pat database,
> the hsp hit was BD298385.
> Here are the interesting output snippets:
> 
>> dbj|BD298385.1| 
>> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=92136243&dopt=GenBank> 
>> CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS AND PLANT PARTS
>            CONTAINING THEM, AND METHODS FOR OBTAINING THEM
>          Length = 14108
> 
> Score =  125 bits (63), Expect = 1e-25
> Identities = 63/63 (100%)
> Strand = Plus / Minus
> 
> 
> Query: 727  aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc 786
>             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 8332 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc 8273
> 
> Query: 787  cga 789
>             |||
> Sbjct: 8272 cga 8270
> 
> =====================================================
>        <Hit>
>          <Hit_num>15</Hit_num>
>          <Hit_id>gi|92136243|dbj|BD298385.1|</Hit_id>
>          <Hit_def>CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS AND PLANT PARTS CONTAINING THEM, AND METHODS FOR OBTAINING THEM</Hit_def>
>          <Hit_accession>BD298385</Hit_accession>
>          <Hit_len>14108</Hit_len>
>          <Hit_hsps>
>            <Hsp>
>              <Hsp_num>1</Hsp_num>
>              <Hsp_bit-score>125.381</Hsp_bit-score>
>              <Hsp_score>63</Hsp_score>
>              <Hsp_evalue>9.63859e-26</Hsp_evalue>
>              <Hsp_query-from>789</Hsp_query-from>
>              <Hsp_query-to>727</Hsp_query-to>
>              <Hsp_hit-from>8270</Hsp_hit-from>
>              <Hsp_hit-to>8332</Hsp_hit-to>
>              <Hsp_query-frame>1</Hsp_query-frame>
>              <Hsp_hit-frame>-1</Hsp_hit-frame>
>              <Hsp_identity>63</Hsp_identity>
>              <Hsp_positive>63</Hsp_positive>
>              <Hsp_align-len>63</Hsp_align-len>
>              <Hsp_qseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_qseq>
>              <Hsp_hseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_hseq>
>              <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||</Hsp_midline>
>            </Hsp>
>          </Hit_hsps>
>        </Hit>
> 
> Thanks, Steffi

------------------------------

_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

End of BioPython Digest, Vol 46, Issue 2
****************************************
