[Bioperl-l] Indexing CDS file

Wed Feb 11 05:29:41 EST 2009

Thanks, Heikki.

I took a closer look at the EBI ftp site where Sviya and I got the file, and
in their README (ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt) it
says:

PA line - contains the accession.version of the "parent" EMBL entry
          (entry where the CDS is annotated)

So, unfortunately they've decided that a CDS record, which has no accession
of its own, doesn't get its parent's accession number, but gets to refer to
its parent's accession number via the PA line.

Furthermore, there's an

OX line - contains the NCBI taxid for the organism; taxonomic data are taken
          from the parent EMBL entries

which is also not part of the formal spec (although this one is a more
worthwhile addition, IMO).

Sooooo, I think we'll need to add support for these.

'PA' seems easy enough -- the EMBL parser can look for it if there isn't an
'AC' line.
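
To make the fallback concrete, here is a rough sketch in Python (not BioPerl
code, and not how the EMBL parser is actually written). The function name
pick_accession is made up for illustration, and it assumes each line is a
two-letter code followed by spaces and a value, with AC accessions separated
by ';' and the PA value being a single accession.version:

```python
def pick_accession(record_lines):
    """Return (accession, line_code) for one flat-file record.

    Prefers the first accession on an AC line; falls back to the PA
    line (parent accession.version) when no usable AC line exists.
    Hypothetical helper: the line layout is an assumption.
    """
    ac = None
    pa = None
    for line in record_lines:
        # Split the two-letter line code from the rest of the line.
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "AC" and ac is None and rest:
            # AC lines may list several accessions; keep the first.
            ac = rest.split(";")[0].strip()
        elif code == "PA" and pa is None and rest:
            pa = rest.rstrip(";").strip()
    if ac:
        return ac, "AC"
    if pa:
        return pa, "PA"
    return None, None
```

Returning the line code alongside the value lets the caller tell a real
accession from a borrowed parent accession, which matters if several CDS
records share one parent.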

As for 'OX', is there a standard slot for a taxon ID in a RichSeq SeqFeature
table? Coming from a GenBank record or a vanilla EMBL record, this is
normally encoded as

primary tag: source
tag: db_xref
value: taxon:9606

right?

Should we do the same when the taxon ID comes from an 'OX' line in one of
these EMBL CDS entries, even though it's not actually in the feature table?
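
If the answer is yes, the mapping might look something like this sketch
(Python again, purely illustrative; source_feature_from_ox and the dict
layout are invented here and are not BioPerl objects). It assumes the OX line
carries the numeric taxid somewhere after the line code, and simply takes the
first run of digits, since I haven't checked the exact layout:

```python
import re


def source_feature_from_ox(ox_line):
    """Build a synthetic 'source' feature from an OX line.

    Hypothetical helper: returns a plain dict mimicking the
    primary tag / tag / value layout, or None if the line is not
    an OX line or holds no numeric taxid.
    """
    if not ox_line.startswith("OX"):
        return None
    # Assumption: the first run of digits after the code is the taxid.
    match = re.search(r"\d+", ox_line[2:])
    if match is None:
        return None
    return {
        "primary_tag": "source",
        "tags": {"db_xref": ["taxon:" + match.group(0)]},
    }
```

Putting the taxid in the same place as for GenBank and plain EMBL records
would mean downstream code doesn't need to know which flavour of file it
came from.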

Dave