[Biopython-dev] NCBIStandalone Blast HSP parsing
Michiel De Hoon
mdehoon at c2b2.columbia.edu
Mon Oct 17 13:51:21 EDT 2005
The current patch breaks the parser if the Blast output does not contain
query_end and sbjct_end. The problem seems to be in the line:
    start, seq, end = m.groups()
(traceback ends with
  File "/usr/local/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 995, in query
    start, seq, end = m.groups()
ValueError: need more than 2 values to unpack).
But this should be easy to fix.
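For reference, a minimal sketch of one way to make the unpacking robust. This is not the actual Biopython code: the pattern and the parse_query_line helper are hypothetical, and the sketch assumes the sequence text contains no spaces. The end coordinate sits in an optional group, so m.groups() always yields three items and end is None when the number is missing.

```python
import re

# Hypothetical tolerant pattern: the trailing end coordinate is optional.
# Assumes the aligned sequence itself contains no whitespace.
_query_re = re.compile(r"Query:\s*(\d+)\s*(\S+)(?:\s+(\d+))?\s*$")


def parse_query_line(line):
    """Return (start, seq, end); end is None if the line has no end coordinate."""
    m = _query_re.search(line)
    if m is None:
        raise ValueError("not a Query alignment line: %r" % line)
    # Three capture groups, so this unpacking cannot fail on a match.
    start, seq, end = m.groups()
    return int(start), seq, (int(end) if end is not None else None)


with_end = parse_query_line("Query: 1    MKVLAAGIVG 10")
without_end = parse_query_line("Query: 1    MKVLAAGIVG")
```

The caller can then set query_end only when the returned end is not None.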
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
-----Original Message-----
From: Mark Hoebeke [mailto:Mark.Hoebeke at jouy.inra.fr]
Sent: Mon 10/17/2005 1:05 PM
To: Michiel De Hoon
Cc: biopython-dev at biopython.org
Subject: Re: [Biopython-dev] NCBIStandalone Blast HSP parsing
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Michiel De Hoon wrote:
> Just to make sure I understand what you're doing:
>
> Are the query_end and sbjct_end attributes found in the Blast output, or do
> you calculate them from the other attributes in the Blast output?
I directly grab them from the Blast report.
> If they're in the Blast output,
> 1) Do they always appear in the Blast output, or does it depend on the
> query? In the latter case, does the modified Blast parser choke on Blast
> output that does not contain these attributes?
The patterns in the official release 1.4b module check for "a single
digit" following the string of sequence characters at the end of the
alignment lines.
All I did was to extend the patterns to "one or more digits" and to
capture them in order to store their contents in the HSP attributes. So
AFAIK, the patch does not change the way reports are currently parsed.
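To illustrate the difference, here is a small sketch comparing the two Query patterns quoted in the patch further down. The sample alignment line is made up for the example, and int() is used where the patch uses string.atoi.

```python
import re

# Patterns as quoted in the patch: the old one only checks that a digit
# follows the sequence, the new one captures the whole number.
old_re = re.compile(r"Query: (\d+)\s*(.+) \d")
new_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")

# Made-up alignment line, for illustration only.
line = "Query: 1    MKVLAAGIVG 10"

old_groups = old_re.search(line).groups()  # two groups: start, seq
new_groups = new_re.search(line).groups()  # three groups: start, seq, end

start, seq, end = new_groups
query_end = int(end)  # the value the patch stores in the HSP's query_end
```

Both patterns match the same line; only the number of captured groups differs.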
> 2) Do these attributes also appear in Blast XML output? The XML parser is
> easier to maintain than the text-based parser in NCBIStandalone, and may
> therefore become the main Blast parser in Biopython in the long run.
With the sequence set I'm currently working on (and with NCBI Blast
2.2.12), the XML output does indeed contain the elements Hsp_query-to
and Hsp_hit-to, which seem to have the intended meaning.
I suppose I should be able to adapt the XML parser while I'm at it, if
the patch is officially accepted.
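As a rough sketch of what reading those elements could look like, the snippet below uses the standard-library xml.etree.ElementTree rather than Biopython's own XML parser. The Hsp fragment is hand-written for the example, and the variable names simply mirror the attribute names proposed for the HSP class.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; a real report nests <Hsp>
# elements much deeper and carries many more fields.
fragment = """<Hsp>
  <Hsp_query-from>1</Hsp_query-from>
  <Hsp_query-to>10</Hsp_query-to>
  <Hsp_hit-from>5</Hsp_hit-from>
  <Hsp_hit-to>14</Hsp_hit-to>
</Hsp>"""

hsp = ET.fromstring(fragment)

# Endpoints are read directly from the report, no arithmetic needed.
query_end = int(hsp.findtext("Hsp_query-to"))
sbjct_end = int(hsp.findtext("Hsp_hit-to"))
```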
Mark
>
> --Michiel.
>
>
>
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1150 St Nicholas Avenue
> New York, NY 10032
>
>
>
> -----Original Message-----
> From: biopython-dev-bounces at portal.open-bio.org on behalf of Mark Hoebeke
> Sent: Mon 10/17/2005 10:07 AM
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
>
> Hi all,
>
> I wanted a quick and easy way to determine the endpoints of HSPs extracted
> from Blast reports parsed with NCBIStandalone. Unfortunately the HSP class
> lacks the query_end and sbjct_end attributes. Googling around led me to a
> recipe describing how to compute the endpoint using the total length, gap
> length and other niceties. Not exactly intuitive to me.
>
> Hence I dove into the NCBIStandalone and HSP modules and made some slight
> modifications. Basically I added the two attributes to HSP and the
> following snippets to NCBIStandalone (release 1.4b):
>
> 972c972
> < _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
> ---
> > _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
>
> 977,978c977
> < start, seq, end = m.groups()
> < self._hsp.query_end=string.atoi(end);
> ---
> > start, seq = m.groups()
>
> 997,998c996,997
> < start, seq, end = _re_search(
> < r"Sbjct: (\d+)\s*(.+) (\d+)", line,
> ---
> > start, seq = _re_search(
> > r"Sbjct: (\d+)\s*(.+) \d", line,
>
> 1014c1013
> < self._hsp.sbjct_end=string.atoi(end)
> ---
>
>
> Looks too easy to be true, I thought. Now sorry if I'm missing some
> important issues here (I'm quite new to Biopython), but is there a reason
> no one has made this patch yet?
>
> Thanks for any comments (flames and others.)
>
> Cheers,
>
> Mark
>
>
> --
> Mark.Hoebeke at jouy.inra.fr
> Unité Statistique & Génome, http://stat.genopole.cnrs.fr
> Tél : +33 (0)1 60 87 38 03, Fax : +33 (0)1 60 87 38 09
> Tour Evry 2, 523, pl. des Terrasses
> F-91000, Evry
> PGP : A2AD52E3
>
- --
- -------------------------Mark.Hoebeke at jouy.inra.fr---------------------
Unité Statistique & Génome Unité MIG
+33 (0)1 60 87 38 03 Tél. +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09 Fax. +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses INRA - Domaine de Vilvert
F - 91000 Evry F - 78352 Jouy-en-Josas CEDEX
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFDU9nxa3nTV6KtUuMRApqXAJ9a9z7J0bvigZ1NiZZxmTUziMocIgCdE0O9
EvX5Bm6f7dMcAUFGfNIO8tk=
=mWo3
-----END PGP SIGNATURE-----