[Bioperl-l] getting the Sbjct database from a hit

Stephen A. Chervitz sac@neomorphic.com
Sat, 07 Oct 2000 13:09:30 -0700


Catherine,

You've discovered a recent change in how the Blast Sbjct object handles hit identifiers.
The Sbjct no longer assumes any structure in the hit identifier. It simply takes the first
whitespace-free chunk of text from the hit description line and assigns it to the name of
the hit. You now need to parse $hit->name yourself to extract the database.

The behavior you described below, in which the hit name contains something like
"Q00752/MSMK_STRMU", comes from the former version of Sbjct, which stripped out the
database name and concatenated the identifiers with a '/' separator. The new version will
produce a string such as "sp|Q00752|MSMK_STRMU" and $hit->database will be '-', as Hilmar
said. (The database method is unofficially deprecated - we should make this official.)
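
For hit names in the pipe-delimited form shown above, pulling out the database yourself
could look roughly like the sketch below. It is only an illustration: split_hit_name is a
hypothetical helper, not part of Bioperl, and it assumes the database tag is the first
'|'-separated field. Names without a '|' come back with an undefined database, since
nothing can be inferred from them.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): split a pipe-delimited hit name
# into its database tag and the remaining identifier fields.
# Assumption: the database tag is the first '|'-separated field.
sub split_hit_name {
    my ($name) = @_;
    my ($db, @ids) = split /\|/, $name;

    # No '|' in the name: the database cannot be determined from it.
    return (undef, $name) unless @ids;

    # Drop empty fields such as the one in "pir||MMECMK".
    @ids = grep { length } @ids;
    return ($db, @ids);
}

# Example names: two taken from the reports in this thread, plus one made-up
# name ('MYSEQ001') that does not follow the convention.
for my $name ('sp|Q00752|MSMK_STRMU', 'pir||MMECMK', 'MYSEQ001') {
    my ($db, @ids) = split_hit_name($name);
    printf "%-22s db=%s ids=%s\n",
        $name, defined $db ? $db : '(unknown)', join('/', @ids);
}
```

In a real script you would pass $hit->name to the helper instead of the literal strings.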

Here's my rationale for making this change. Previously (prior to about Feb 2000) the Sbjct
object would attempt to parse hit identifiers to separate database name from the sequence
identifier. This worked well enough for GenBank/EMBL/DDBJ-style ids such as
"sp|P02914|MALK_ECOLI", but it often caused problems for other databases which don't
conform to this convention.

Because the parser needs to be able to work with a variety of Blast reports and databases,
it is safest to be as semantically neutral as possible regarding the format of hit
identifiers. This means that the user may need to do more work, but it makes the Blast
parser more flexible and robust.

Steve

Catherine Letondal wrote:

> Hilmar Lapp writes:
> > First, usually you don't create Sbjct objects yourself, but obtain them
> > from a Bio::Tools::Blast object created from a BLAST report. As you're
> > writing this first, I assume that's in fact what you're doing.
> >
> > I'm not sure what you're referring to with the database() method. If you
> > want to get the name of the database the sequence was searched against,
> > you can simply call $blast->database().
>
> Actually I need the database of the subject - since the searched database may be
> a non-redundant database containing entries from pir or swissprot :
> [...]
> sp|P02914|MALK_ECOLI MALTOSE/MALTODEXTRIN TRANSPORT ATP-BINDING ...   738  0.0
> pir||MMECMK inner membrane protein malK - Escherichia coli >gi|4...   734  0.0
> sp|P19566|MALK_SALTY MALTOSE/MALTODEXTRIN TRANSPORT ATP-BINDING ...   696  0.0
> pir||S05329 inner membrane protein malK - Salmonella typhimurium      689  0.0
> [...]
>
> > To obtain the database containing
> > the hit (i.e., the database part of the identifier for those cases in
> > which the identifier is a compound of database and accession) you have to
> > interpret $hit->name() yourself, as the documentation tells.
> > $hit->database() will only return a dash ('-').
>
> That's right! But the name $hit->name() does *not* return the database in it:
>
> For example, for the hit:
>
>         sp|Q00752|MSMK_STRMU MULTIPLE SUGAR-BINDING TRANSPORT ATP-BINDI...   298  2e-81
>
> This statement:
>         print STDERR "hit name: ",$hit->name,"\n";
> prints:
>         hit name: Q00752/MSMK_STRMU
> The 'sp' has disappeared...
>
> --
> Catherine Letondal -- Pasteur Institute Computing Center