[Bioperl-l] Indexing CDS file
Dave Messina
David.Messina at sbc.su.se
Wed Feb 11 05:29:41 EST 2009
Thanks, Heikki.
I took a closer look at the EBI ftp site where Sviya and I got the file, and
in their README (ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt) it
says:
PA line - contains the accession.version of the "parent" EMBL entry
(entry where the CDS is annotated)
So, unfortunately, they've decided that a CDS record, which has no accession
of its own, doesn't inherit its parent's accession number, but instead refers
to it via the PA line.
Furthermore, there's an
OX line - contains the NCBI taxid for the organism; taxonomic data are taken
from the parent EMBL entries
which is also not part of the formal spec (although this one is a more
worthwhile addition, IMO).
Sooooo, I think we'll need to add support for these.
'PA' seems easy enough -- the EMBL parser can look for it if there isn't an
'AC' line.
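To make the fallback concrete, here is a rough standalone sketch in Python. It is not BioPerl code, and not how the EMBL parser is actually written. The function name pick_accession and the sample record are made up for illustration, and it assumes the PA line looks like "PA   <accession>.<version>", which I've inferred from the README wording rather than checked against the file.

```python
def pick_accession(lines):
    """Return (accession, line_tag) for one EMBL-style record.

    Prefers the first accession on an 'AC' line; falls back to the
    'PA' (parent accession.version) line when no 'AC' line is present.
    Returns (None, None) if neither is found.
    """
    ac = None
    pa = None
    for raw in lines:
        line = raw.rstrip()
        if line == "//":  # end-of-record marker
            break
        tag, _, rest = line.partition(" ")
        value = rest.strip()
        if tag == "AC" and ac is None and value:
            # AC lines hold ';'-separated accessions; the first is primary.
            ac = value.split(";")[0].strip()
        elif tag == "PA" and pa is None and value:
            # Assumed layout: 'PA   ACCESSION.VERSION' (optional trailing ';').
            pa = value.rstrip(";").strip()
    if ac:
        return ac, "AC"
    if pa:
        return pa, "PA"
    return None, None


# Made-up record with no AC line, so the PA value is used.
record = [
    "ID   EXAMPLE_CDS",
    "PA   AB000001.1",
    "OX   NCBI_TaxID=9606;",
    "//",
]
accession, origin = pick_accession(record)
```

Returning the tag alongside the value lets the caller tell a record's own accession from a borrowed parent one, which may matter for indexing.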
As for 'OX', is there a standard slot for a taxonID in a RichSeq SeqFeature
table? Coming from a Genbank record or a vanilla EMBL record, this is
normally encoded as
primary tag: source
tag: db_xref
value: taxon:9606
right?
Should we do the same if we're coming from an EMBL CDS entry, even though the
taxid isn't actually in the feature table?
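For illustration, the mapping I have in mind would look roughly like this, again as a standalone Python sketch rather than BioPerl code. The function name ox_to_source_xref is hypothetical, and the "OX   NCBI_TaxID=9606;" layout is an assumption on my part; the sketch just takes the first run of digits on the line, so a bare number would work too.

```python
import re


def ox_to_source_xref(line):
    """Map an 'OX' line to a source-feature style db_xref entry.

    Returns a dict mirroring the primary tag / tag / value triple used
    for GenBank and plain EMBL records, or None if the line is not an
    OX line or carries no numeric taxid.
    """
    if not line.startswith("OX"):
        return None
    # Assumed layout: 'OX   NCBI_TaxID=9606;' -- take the first digit run.
    match = re.search(r"\d+", line[2:])
    if match is None:
        return None
    return {
        "primary_tag": "source",
        "tag": "db_xref",
        "value": "taxon:" + match.group(0),
    }


xref = ox_to_source_xref("OX   NCBI_TaxID=9606;")
```

That way downstream code that looks for the taxon db_xref on the source feature would not need to care which kind of file the record came from.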
Dave