[Bioperl-l] Genbank parsing using Bioperl
Chris Fields
cjfields at uiuc.edu
Fri Apr 21 11:26:53 EDT 2006
I'm adding my 2c since I've got a bit of time on my hands. Note that I
found most of these answers by searching the mailing list archives (now
searchable through Gmane) and the BioPerl wiki.
I believe Sean pointed out the HOWTO on the BioPerl wiki:
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Getting_Sequences
In theory, you should be able to work out from a CDS feature which gene
or transcript (mRNA) feature it belongs to, and normally vice versa. I may
be wrong (I work mainly with bacterial genome sequences), but I believe
this depends entirely on how well the features are annotated, which can
vary greatly between sequencing centers, so it can be tricky depending on
the source of the GenBank file. I would instead try a database that is
well curated and has a consistent interface across genome projects; in
other words, something like what Sean suggested, such as Ensembl:
http://www.ensembl.org/index.html
You can use the Ensembl Perl API to retrieve data from Ensembl databases:
http://www.ensembl.org/info/software/core/core_tutorial.html
You could also have a look at Entrez Gene; Brian's working on modules (in
CVS) for retrieving and parsing Entrez Gene's output:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
You'll need the Bio::ASN1 parser for Brian's modules:
http://sourceforge.net/projects/egparser
Both Ensembl and Entrez Gene are regularly updated with transcript/protein
information and are likely what you are looking for.
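If you still want to try pairing transcripts and proteins from the GenBank
file itself, the following is only a minimal sketch, not a tested recipe.
It assumes the mRNA features carry /gene and /transcript_id tags and the
CDS features carry /gene and /protein_id tags, as RefSeq-style records
usually do. The sub names (cds_fits_mrna, pair_ids, read_features) are made
up for illustration, and the exon-matching rule is a heuristic of this
sketch, not something BioPerl provides. Fuzzy or remote locations, strand,
and records with several sequences sharing a gene name are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristic (an assumption of this sketch): a CDS fits an mRNA if every CDS
# segment lies inside an mRNA exon, consecutive CDS segments fall in
# consecutive exons, and internal CDS boundaries coincide with exon
# boundaries. Both arguments are array refs of [start, end] pairs sorted
# by start.
sub cds_fits_mrna {
    my ($cds, $mrna) = @_;
    my $prev;
    for my $i (0 .. $#$cds) {
        my ($s, $e) = @{ $cds->[$i] };
        # index of the first exon that contains this CDS segment
        my ($j) = grep { $mrna->[$_][0] <= $s && $e <= $mrna->[$_][1] }
                  0 .. $#$mrna;
        return 0 unless defined $j;
        return 0 if $i > 0      && ($s != $mrna->[$j][0] || $j != $prev + 1);
        return 0 if $i < $#$cds && $e != $mrna->[$j][1];
        $prev = $j;
    }
    return 1;
}

# For each CDS, look for mRNAs of the same gene that fit. The transcript ID
# is reported only when exactly one mRNA fits; otherwise it is left undef.
sub pair_ids {
    my ($mrnas, $cdss) = @_;
    my @pairs;
    for my $cds (@$cdss) {
        my @hits;
        for my $mrna (@$mrnas) {
            next unless $mrna->{gene} eq $cds->{gene};
            push @hits, $mrna if cds_fits_mrna($cds->{exons}, $mrna->{exons});
        }
        push @pairs, [ (@hits == 1 ? $hits[0]{id} : undef), $cds->{id} ];
    }
    return \@pairs;
}

# Collect mRNA and CDS features from a GenBank file with Bio::SeqIO.
# BioPerl is loaded here so the subs above work without it.
sub read_features {
    my ($file) = @_;
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    my (@mrnas, @cdss);
    while (my $seq = $in->next_seq) {
        for my $feat ($seq->get_SeqFeatures) {
            my $type = $feat->primary_tag;
            next unless $type eq 'mRNA' || $type eq 'CDS';
            my $id_tag = $type eq 'mRNA' ? 'transcript_id' : 'protein_id';
            next unless $feat->has_tag('gene') && $feat->has_tag($id_tag);
            my ($gene) = $feat->get_tag_values('gene');
            my ($id)   = $feat->get_tag_values($id_tag);
            my @exons  = sort { $a->[0] <=> $b->[0] }
                         map  { [ $_->start, $_->end ] }
                         $feat->location->each_Location;
            my $rec = { gene => $gene, id => $id, exons => \@exons };
            if ($type eq 'mRNA') { push @mrnas, $rec } else { push @cdss, $rec }
        }
    }
    return (\@mrnas, \@cdss);
}

# Print "transcript_id <tab> protein_id" for the file named on the command line.
if (@ARGV) {
    my ($mrnas, $cdss) = read_features($ARGV[0]);
    for my $pair (@{ pair_ids($mrnas, $cdss) }) {
        my $tx = defined $pair->[0] ? $pair->[0] : 'AMBIGUOUS_OR_NONE';
        print "$tx\t$pair->[1]\n";
    }
}
```

Run it with the path to a gbk file as the only argument. CDS features that
fit zero or several mRNAs are flagged rather than guessed at, which is
where a curated source such as Ensembl or Entrez Gene is the safer choice.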
Chris
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Prabu R
> Sent: Friday, April 21, 2006 8:25 AM
> To: bioperl-l at lists.open-bio.org; Sean Davis
> Subject: Re: [Bioperl-l] Genbank parsing using Bioperl
>
> Dear All,
>
> I apologize for a small mistake in my earlier mail.
>
> I am not actually using GenBank releases, but the RefSeq genome build gbk
> files from NCBI (ftp.ncbi.nih.gov/genomes/).
>
> Those files are GenBank-formatted and contain RefSeq IDs.
>
> Kindly help.
>
> R. Prabu
>
> ----------------------------
> Dear all!
>
> I am a novice bioperl user, trying to parse Genbank files with Bioperl
> modules to get some specific features and details.
>
> Can anyone please tell me whether we can retrieve a gene, its transcript
> ID, and its protein ID from the GenBank file?
>
> I mainly need to extract the one-to-one relationship between transcript
> ID and protein ID.
>
> I was trying this. I was able to take these details if the gene is not
> alternatively spliced.
>
> If a gene contains multiple mRNA/CDS features, I am not able to build the
> relationship between each transcript and its protein.
>
> Kindly help me to find out whether this is possible in Bioperl.
>
> Thanks in advance,
>
> R. Prabu