[Bioperl-l] new directions

Geer, Lewis (NLM/NCBI) lewisg@mail.nih.gov
Wed, 7 Mar 2001 12:00:13 -0500


Hi,
Just in case you haven't seen it, XML output is an option for the NCBI
public BLAST servers (it's been an option in standalone BLAST for a while).
Here's a sample:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>blastp 2.1.2 [Nov-13-2000]</BlastOutput_version>
  <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden,
Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David
J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new generation of
protein database search~programs&quot;,  Nucleic Acids Res.
25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>lcl|1_20397</BlastOutput_query-ID>
  <BlastOutput_query-def>gi|7291680|gb|AAF47102.1| </BlastOutput_query-def>
  <BlastOutput_query-len>1020</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_include>0</Parameters_include>
      <Parameters_sc-match>0</Parameters_sc-match>
      <Parameters_sc-mismatch>0</Parameters_sc-mismatch>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|280603|pir||A36691</Hit_id>
          <Hit_def>Ca2+-transporting ATPase (EC 3.6.1.38), sarcoplasmic
reticulum - fruit fly (Drosophila melanogaster) &gt;gi|158416|gb|AAB00735.1|
(M62892) sarco/endoplasmic reticulum-type Ca-2+-ATPase [Drosophila
melanogaster]</Hit_def>
          <Hit_accession>A36691</Hit_accession>
          <Hit_len>1002</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>1792.44</Hsp_bit-score>
              <Hsp_score>4642</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>993</Hsp_query-to>
              <Hsp_hit-from>1</Hsp_hit-from>
              <Hsp_hit-to>993</Hsp_hit-to>
              <Hsp_pattern-from>0</Hsp_pattern-from>
              <Hsp_pattern-to>0</Hsp_pattern-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>992</Hsp_identity>
              <Hsp_positive>993</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>993</Hsp_align-len>
              <Hsp_density>0</Hsp_density>
<Hsp_qseq>MEDGHSKTVEQSLNFFGTDPERGLTLDQIKANQKKYGPNELPTEEGKSIWQLVLEQFDDLLVKILL
LAAIISFVLALFEEHEETFTAFVEPLVILLILIANAVVGVWQERNAESAIEALKEYEPEMGKVVRQDKSGIQKVRA
KEIVPGDLVEVSVGDKIPADIRITHIYSTTLRIDQSILTGESVSVIKHTDAIPDPRAVNQDKKNILFSGTNVAAGK
ARGVVIGTGLSTAIGKIRTEMSETEEIKTPLQQKLDEFGEQLSKVISVICVAVWAINIGHFNDPAHGGSWIKGAIY
YFKIAVALAVAAIPEGLPAVITTCLALGTRRMAKKNAIVRSLPSVETLGCTSVICSDKTGTLTTNQMSVSRMFIFD
KVEGNDSSFLEFEMTGSTYEPIGEVFLNGQRIKAADYDTLQELSTICIMCNDSAIDYNEFKQAFEKVGEATETALI
VLAEKLNSFSVNKSGLDRRSAAIACRGEIETKWKKEFTLEFSRDRKSMSSYCTPLKASRLGTGPKLFVKGAPEGVL
ERCTHARVGTTKVPLTSALKAKILALTGQYGTGRDTLRCLALAVADSPMKPDEMDLGDSTKFYQYEVNLTFVGVVG
MLDPPRKEVFDSIVRCRAAGIRVIVITGDNKATAEAICRRIGVFAEDEDTTGKSYSGREFDDLSPTEQKAAVARSR
LFSRVEPQHKSKIVEFLQSMNEISAMTGDGVNDAPALKKAEIGIAMGSGTAVAKSAAEMVLADDNFSSIVSAVEEG
RAIYNNMKQFIRYLISSNIGEVVSIFLTAALGLPEALIPVQLLWVNLVTDGLPATALGFNPPDLDIMEKPPRKADE
GLISGWLFFRYMAIGFYVGAATVGAAAWWFVFSDEGPKLSYWQLTHHLSCLGGGDEFKGVDCKIFSDPHAMTMALS
VLVTIEMLNAMNSLSENQSLITMPPWCNLWLIGSMALSFTLHFVILYVDVLSTVFQVTPLSAEEWITVMKFSIPVV
LLDETLKFVARKIAD</Hsp_qseq>
<Hsp_hseq>MEDGHSKTVEQSLNFFGTDPERGLTLDQIKANQKKYGPNELPTEEGKSIWQLVLEQFDDLLVKILL
LAAIISFVLALFEEHEETFTAFVEPLVILLILIANAVVGVWQERNAESAIEALKEYEPEMGKVVRQDKSGIQKVRA
KEIVPGDLVEVSVGDKIPADIRITHIYSTTLRIDQSILTGESVSVIKHTDAIPDPRAVNQDKKNILFSGTNVAAGK
ARGVVIGTGLSTAIGKIRTEMSETEEIKTPLQQKLDEFGEQLSKVISVICVAVWAINIGHFNDPAHGGSWIKGAIY
YFKIAVAVAVAAIPEGLPAVITTCLALGTRRMAKKNAIVRSLPSVETLGCTSVICSDKTGTLTTNQMSVSRMFIFD
KVEGNDSSFLEFEMTGSTYEPIGEVFLNGQRIKAADYDTLQELSTICIMCNDSAIDYNEFKQAFEKVGEATETALI
VLAEKLNSFSVNKSGLDRRSAAIACRGEIETKWKKEFTLEFSRDRKSMSSYCTPLKASRLGTGPKLFVKGAPEGVL
ERCTHARVGTTKVPLTSALKAKILALTGQYGTGRDTLRCLALAVADSPMKPDEMDLGDSTKFYQYEVNLTFVGVVG
MLDPPRKEVFDSIVRCRAAGIRVIVITGDNKATAEAICRRIGVFAEDEDTTGKSYSGREFDDLSPTEQKAAVARSR
LFSRVEPQHKSKIVEFLQSMNEISAMTGDGVNDAPALKKAEIGIAMGSGTAVAKSAAEMVLADDNFSSIVSAVEEG
RAIYNNMKQFIRYLISSNIGEVVSIFLTAALGLPEALIPVQLLWVNLVTDGLPATALGFNPPDLDIMEKPPRKADE
GLISGWLFFRYMAIGFYVGAATVGAAAWWFVFSDEGPKLSYWQLTHHLSCLGGGDEFKGVDCKIFSDPHAMTMALS
VLVTIEMLNAMNSLSENQSLITMPPWCNLWLIGSMALSFTLHFVILYVDVLSTVFQVTPLSAEEWITVMKFSIPVV
LLDETLKFVARKIAD</Hsp_hseq>
<Hsp_midline>MEDGHSKTVEQSLNFFGTDPERGLTLDQIKANQKKYGPNELPTEEGKSIWQLVLEQFDDLLVK
ILLLAAIISFVLALFEEHEETFTAFVEPLVILLILIANAVVGVWQERNAESAIEALKEYEPEMGKVVRQDKSGIQK
VRAKEIVPGDLVEVSVGDKIPADIRITHIYSTTLRIDQSILTGESVSVIKHTDAIPDPRAVNQDKKNILFSGTNVA
AGKARGVVIGTGLSTAIGKIRTEMSETEEIKTPLQQKLDEFGEQLSKVISVICVAVWAINIGHFNDPAHGGSWIKG
AIYYFKIAVA+AVAAIPEGLPAVITTCLALGTRRMAKKNAIVRSLPSVETLGCTSVICSDKTGTLTTNQMSVSRMF
IFDKVEGNDSSFLEFEMTGSTYEPIGEVFLNGQRIKAADYDTLQELSTICIMCNDSAIDYNEFKQAFEKVGEATET
ALIVLAEKLNSFSVNKSGLDRRSAAIACRGEIETKWKKEFTLEFSRDRKSMSSYCTPLKASRLGTGPKLFVKGAPE
GVLERCTHARVGTTKVPLTSALKAKILALTGQYGTGRDTLRCLALAVADSPMKPDEMDLGDSTKFYQYEVNLTFVG
VVGMLDPPRKEVFDSIVRCRAAGIRVIVITGDNKATAEAICRRIGVFAEDEDTTGKSYSGREFDDLSPTEQKAAVA
RSRLFSRVEPQHKSKIVEFLQSMNEISAMTGDGVNDAPALKKAEIGIAMGSGTAVAKSAAEMVLADDNFSSIVSAV
EEGRAIYNNMKQFIRYLISSNIGEVVSIFLTAALGLPEALIPVQLLWVNLVTDGLPATALGFNPPDLDIMEKPPRK
ADEGLISGWLFFRYMAIGFYVGAATVGAAAWWFVFSDEGPKLSYWQLTHHLSCLGGGDEFKGVDCKIFSDPHAMTM
ALSVLVTIEMLNAMNSLSENQSLITMPPWCNLWLIGSMALSFTLHFVILYVDVLSTVFQVTPLSAEEWITVMKFSI
PVVLLDETLKFVARKIAD</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>

[...]

      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>807597</Statistics_db-num>
          <Statistics_db-len>-1431139411</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>4.94066e-324</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
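
For anyone who wants to poke at the format, here is a minimal sketch of
pulling per-HSP numbers out of a report like the one above. It uses only the
Python standard library. The function name summarize_hits and the trimmed
SAMPLE string are illustrative assumptions, not part of any NCBI or bioperl
tool; the element names are the ones that appear in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample report (assumption: only the fields used
# below are kept; a real report carries many more elements).
SAMPLE = """<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|280603|pir||A36691</Hit_id>
          <Hit_accession>A36691</Hit_accession>
          <Hit_len>1002</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>1792.44</Hsp_bit-score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_identity>992</Hsp_identity>
              <Hsp_align-len>993</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def summarize_hits(xml_text):
    """Return one dict per HSP with accession, scores and percent identity."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        for hsp in hit.iter("Hsp"):
            identity = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            rows.append({
                "accession": accession,
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "percent_identity": 100.0 * identity / align_len,
            })
    return rows


if __name__ == "__main__":
    for row in summarize_hits(SAMPLE):
        print(row)
```

The same loop should carry over to a full report read from disk with
ET.parse(), since each Hit/Hsp pair is self-contained in the tree.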


> -----Original Message-----
> From: Jason Stajich [mailto:jason@chg.mc.duke.edu]
> Sent: Wednesday, March 07, 2001 11:45 AM
> To: Bioperl
> Subject: [Bioperl-l] new directions
> 
> 
> So very happy to have 0.7 out.  I know there are some minor issues that
> have begun to be resolved; once these reach a suitable number or enough
> time has passed, we can think about a point release.  Not for at least 3
> weeks though.
> 
> The branching gives us a chance to take stock and look at where we want to
> go next.  Interest has been expressed in expanding outside of the sequence
> analysis realm bioperl has pretty much occupied.  I'm all for it.  The new
> projects I hint at below should go on the main trunk; only bug fixes and
> minor feature changes should go on the branch.  We're probably flexible
> here, so when in doubt we can discuss on the list.
> 
> I'd like to throw some ideas out there and encourage people on the list
> who maybe haven't felt comfortable jumping in while we were churning on
> the release to think about picking up a project, especially if any of
> these (or your own project ideas) scratch a particular itch you have.
> Some of these don't have to be part of bioperl-live but can be satellite
> projects which utilize the bioperl core objects.
> 
> These are just some ideas I have bouncing around, perhaps you have your
> own ideas and would like to contribute:
>
> This is also in the wiki at
> http://www.bioperl.org/wiki/html/BioPerl/BioperlProjects.html - so any
> critiques or additions could be added there as well, just CC the list so
> we know to check.
> 
>  o Perl is not an ideal language for doing something like huge microarray
>    clustering, but it is ideal for dealing with formatting issues.
>    Perhaps code that can deal with converting different microarray formats
>    would be helpful.
>  
>  o Expansion into other expression data, code to help link expression data
>    for genes (sometimes unknown genes) to available information in IGI,
>    NCBI UniGene, etc.  All in software so that it can be automated.
> 
>  o The Blast issues.  I think the pluggable features to BPlite would be
>    ideal; I don't know how well it will work (wanting to parse more or
>    less of the report -- runtime plugging of 'adaptors'?).  I like the
>    html features of Bio::Tools::Blast.  What about parsing NCBI Blast XML?
>  
>  o Fasta parsing.  We should find a way to support this, either with a
>    formal grammar or just some perl code.
> 
>  o Speaking of grammars, what about a grammar for parsing EMBL/Genbank?
>    Would this be more/less efficient?  We seem kind of kludgy in parts of
>    the feature table parsing and it has gotten pretty heavy down there;
>    are there ways to simplify this code?
> 
>  o Bio::Index::Blast which can read, fetch (and store?) seqs from a blast
>    index.
>  
>  o Map data - genetic, RH maps and their markers.  Adopting code for
>    manipulating this information.  A simple ePCR parser would fit in here
>    too.
> 
>  o visualization - perhaps visualization is best done in java, but the
>    bioperl-gui modules provide a nice way to look at a sequence with
>    annotation.  Is there interest in a png/gif/ps renderer as well,
>    adopting existing code -- perhaps something similar to gff2ps?
> 
>  o Tree drawing - plugging into PHYLIP or something similar to provide
>    some nice drawings of phylogenetic trees.
> 
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center 
> http://www.chg.duke.edu/ 