[Bioperl-l] Bio::SeqIO; seq->desc() gives back too (!!!) full header

Thu, 8 Aug 2002 13:59:45 -0700

Another set of conventions for formatting fasta headers is presented in the
doc for ncbi's formatdb, given in

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/doc/README.formatdb
#L441

where we see the table with entries such as:

Database Name                         Identifier Syntax
NBRF PIR                              pir||entry
GenBank                               gb|accession|locus
NCBI Reference Sequence               ref|accession|locus

Later, this document says that... "ID's must fit into the framework
described above to ensure that they can be reliably parsed and indices built
for them.  This means that it is not possible for users to invent new ID
formats on the fly." - which nonetheless happens.  Of course, this doc
really only matters if your going to create blastable dbs, but I think in
general there is little reason not to abide by their conventions when
possible in your own dbs.

Also, sometimes duplicate entries are merged (by NCBI) and then they utilize
their "Non-redundant defline syntax", which is described in
ftp://ftp.ncbi.nih.gov/blast/db/README as "identical sequences are merged
into one entry.  To be merged two sequences must have identical lengths and
every residue (or basepair) at every position must be the same.  The FASTA
deflines for the different entries that belong to one sequence are separated
by control-A's (^A).  In the example below both entries gi|1469284 and
gi|1477453 have the same sequence, in every respect:

>gi|1469284 (U05042) afuC gene product [Actinobacillus
pleuropneumoniae]^Agi|1477453
(U04954) afuC gene product [Actinobacillus pleuropneumoniae]

It appears that your defline suffers from such a 'merging' event.  No?

While trying to deconstructing your defline, I searched genbank with gi of
15233744 and displayed in FASTA format, which yielded a shorter version of
it, namely:

>gi|15233744|ref|NP_194152.1| putative protein [Arabidopsis thaliana]

with the COMMENT that "The reference sequence was derived from
gnl|TIGR|At4g24210"

then, searching for gi 7487330 yields protein:

>gi|7487330|pir||T09884[7487330]

and on 5051763 yields

gi|5051763|emb|CAB45056.1|[5051763]

So, it looks to me like all these entries were 'merged' by some process at
some time using the control-a convention, and somehow/where the control-a is
not being respected/printed/cut&paste as such.

Hope this helps...

Malcolm Cook

> -----Original Message-----
> From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org]On
> Behalf Of Jason Stajich
> Sent: Thursday, August 08, 2002 6:40 AM
> To: Benjamin Breu
> Cc: Bioperl-Mailing Liste (E-Mail)
> Subject: Re: [Bioperl-l] Bio::SeqIO; seq->desc() gives back too (!!!)
> full header
>
>
> FASTA/Pearson format is unstructured.  That means you (the programmer) get
> to do the work figuring out what you want out of the desc/id line.  We
> have taken the approach that the first section (white space is the
> separator) is the id and everything else is the description
> more succiently (in perl):
>
> my ($id,$desc) = ($str =~ /^>\s*(\S+)\s*(.*)/);
>
> You're only going to get a recapitulation of what is in the header file we
> don't do anything magic to detect this because it would invariably break
> with someone else's scheme.  If you only want the pir number you write a
> simple regexp
>
> (this [untested] will find the gi,optional pir num, and acc for the 1st
> pir if it exists)
>
> my ($pirgi,$pirnum,$piracc) = ($seq->desc() =~
> /(\d+)\|pir\|(\S+)?\|(\S+)/);
>
> You also can investigate the split function to split on whitespace.
> Programming the correct soln depends on what you actually want to extract
> from the desc line.
>
> -jason
>
>
>
> On Thu, 8 Aug 2002, Benjamin Breu wrote:
>
> > Hi,
> >
> > thx Jason for your help.
> >
> > The desc() funktion prints out the header but there is too much stuff
> > in it. I thought it would print only the description, but if there are
> > multiple gi numbers for one protein (I'm using NCBI-Fasta (nr)), it
> > shows me the description and the following gi, pir, etc. number plus
> > their description. See below.
> >
> > use Bio::SeqIO;
> > my $seq = Bio::SeqIO->new(-format => 'fasta', -file =>
> 'filename');   	#filename = my filename
> > while( my $seq = $in->next_seq ) {
> >  print  $seq->display_id(), "\n",$seq->desc(), "\n",
> $seq->seq(), "\n\n";
> > }
> >
> > format as folows for output:
> >
> > ID
> > description
> > sequence
> >
> > gi|15233744|ref|NP_194152.1|
> > (NM_118554) putative protein [Arabidopsis
> thaliana]gi|7487330|pir||T09884 hypothetical protein T22A6.40 -
> Arabidopsis thalianagi|5051763|emb|CAB45056.1| (AL078637)
> putative protein [Arabidopsis thaliana]gi|7269271|emb|CAB79331.1|
> (AL161561) putative protein [Arabidopsis thaliana]
> >
> MKRSTTDSDLAGDAHNETNKKMKSTEEEEIGFSNLDENLVYEVLKHVDAKTLAMSSCVSKIWHKTA
> QDERLWELICTRHWTNIGCGQNQLRSVVLALGGFRRLHSLYLWPLSKPNPRARFGKDELKLTLSLL
> SIRYYEKMSFTKRPLPESK
> >
> > Is there a problem with the parser or what options does it need
> in order to tell me the whole gi, pir, etc. -numbers when I call
> for an ID. That could be an hash with key = database (e.g. dbj,
> pir) and values = @arrayofnumbers. Is there such a smart little
> parser or do I have to spend (a lot of) hours to do this myself?
> >
> > Thx
> >
> > Ben
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>