[Bioperl-l] parsing protein accession numbers and types from >fasta headers

Wed Sep 13 13:17:33 UTC 2006

Hi

I tried to handle this variability and pull out the database types. First I
capture the DB type in $1, then I pick out the ID I need for my purposes.
Of course it's not *Bio*Perl, but it worked for me ;-)

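if ( m/>gi\|\d+\|(\w+)\|([^\|\s]*)\|(\S*)\s/ ) {
	my $name;
	# one-line equivalent of the SWITCH block below:
	#if ($1 eq 'pdb') { $name = $2.$3 } elsif ($1 eq 'sp' || $1 eq 'pir') { $name = $3 } else { $name = $2 }
	SWITCH: {
		if ($1 eq 'pdb') { $name = $2.$3; last SWITCH; }	# pdb: field 2 + field 3 (entry + chain)
		if ($1 eq 'sp' ) { $name = $3; last SWITCH; }	# sp: take field 3
		if ($1 eq 'pir') { $name = $3; last SWITCH; }	# pir: take field 3
		$name = $2;	# any other DB type: take field 2
	}
	# ... use $name here ...
}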
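
In case it helps, here is the same logic wrapped up as a small standalone
sub. This is only a sketch: the sub name parse_gi_header and the sample
headers are made up for illustration, and I end the pattern with (?:\s|\z)
instead of \s on the assumption that some headers have no description after
the last field. It only covers ">gi|..." style headers; other sources would
need their own patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (db_type, name) for a header of the form
# ">gi|number|db|field2|field3 description", or an empty list if the
# header does not match that layout.
sub parse_gi_header {
    my ($header) = @_;
    return unless $header =~ m/^>gi\|\d+\|(\w+)\|([^\|\s]*)\|(\S*)(?:\s|\z)/;
    my ($db, $f2, $f3) = ($1, $2, $3);

    # Same rules as the SWITCH block: pdb joins fields 2 and 3,
    # sp and pir use field 3, everything else uses field 2.
    my $name = $db eq 'pdb'                  ? $f2 . $f3
             : ($db eq 'sp' || $db eq 'pir') ? $f3
             :                                 $f2;
    return ($db, $name);
}

# Made-up sample headers, for illustration only.
for my $h ('>gi|1|sp|P00001|TEST_HUMAN some protein',
           '>gi|2|pdb|1ABC|A chain A',
           '>gi|3|ref|NP_000001.1| another protein') {
    my ($db, $name) = parse_gi_header($h);
    print "$db\t$name\n";
}
```

Copying $1, $2 and $3 into named variables right after the match keeps them
safe if another regex is run later in the same scope.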

bernd	

On 9/13/06, Antonio Ramos Fernández <tniram at hotmail.com> wrote:
>
> I'd like to write a script that parses the headers of FASTA-formatted protein
> databases and extracts protein accession numbers and identifiers (UniProt, IPI,
> gi, RefSeq, Ensembl...). The idea is to build a simple local database that
> relates the accession number of a protein sequence to all of its valid
> identifiers and to the FASTA files on my system they were obtained from, or to
> check, for instance, whether a UniProt accession exists for a given gi.
> However, the structure of the FASTA header varies considerably depending on
> the source. Any suggestions?