[Bioperl-l] Indexing GenBank files

Mark Dalphin mdalphin@amgen.com
Wed, 20 Sep 2000 11:50:52 -0700


Brad Chapman wrote:

> 2. Right now GenBank.pm creates the index keys using the identifier after
> the LOCUS keyword. Personally, I like using the accession number better
> (ie. the first thing in all of the junk after ACCESSION) and made a
> little hack to do this. It seems to work okay for what I've been doing
> thus far, but I'm not sure if this is a good idea and there is a reason
> why using the locus identifier is better. I am definitely not the most
> experienced person ever when it comes to dealing with GenBank files, so
> I'd really like to hear what people's opinion on this is.

With the rapid rate of sequence changes, I find the GenBank VERSION number is
the safest field to use. Otherwise scientists find that their "interesting"
region in some HTG is gone two weeks after they find it. (The NID number is
also equivalent to the VERSION; both reflect the particular sequence and not the
documentation of the sequence.) By using the VERSION number, I can get them the
original sequence they worked with, whether it was 2 days, 2 weeks or 2 months
ago. Of course, this leads to the problem that papers with published ACCESSION
numbers don't readily map to the index (ACC No=AC000134; Search index for
AC000134.1, not found; Search index for AC000134.2, not found; ... Finally found
at AC000134.14!).
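
One way around the lookup problem described above might be to keep a second
map from the bare accession to every VERSION seen for it, so a published
accession resolves in one step instead of by trial and error. The following is
only a rough sketch in Python, not the GenBank.pm indexer: the function names
build_index and resolve are hypothetical, the index is an in-memory dict of
character offsets, and it assumes each record carries a "VERSION" line of the
form ACCESSION.N and ends with a "//" line.

```python
def build_index(text):
    """Index flat-file records by VERSION, plus an accession -> versions map.

    Hypothetical helper: offsets are character offsets into `text`.
    """
    by_version = {}      # e.g. "AC000134.14" -> offset of the record start
    by_accession = {}    # e.g. "AC000134" -> [(14, "AC000134.14"), ...]
    record_start = 0
    pos = 0
    for line in text.splitlines(keepends=True):
        if line.startswith("VERSION"):
            fields = line.split()
            if len(fields) >= 2:
                version = fields[1]
                by_version[version] = record_start
                accession, _, number = version.partition(".")
                # Store the version number as an int so that .14 sorts after .2
                rank = int(number) if number.isdigit() else 0
                by_accession.setdefault(accession, []).append((rank, version))
        elif line.rstrip() == "//":
            # The next record begins right after the terminator line
            record_start = pos + len(line)
        pos += len(line)
    return by_version, by_accession


def resolve(key, by_version, by_accession):
    """Return the VERSION key to look up, or None if nothing matches.

    An exact VERSION is returned unchanged; a bare accession is mapped to
    the highest version number that was indexed for it.
    """
    if key in by_version:
        return key
    versions = by_accession.get(key)
    if not versions:
        return None
    return max(versions)[1]
```

With an index like this, a search for AC000134 would go straight to the newest
indexed version, while a search for an exact VERSION still returns the
sequence the scientist originally worked with.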

Just my $0.02.

Mark

--
Mark Dalphin                          email: mdalphin@amgen.com
Mail Stop: 29-2-A                     phone: +1-805-447-4951 (work)
One Amgen Center Drive                       +1-805-375-0680 (home)
Thousand Oaks, CA 91320                 fax: +1-805-499-9955 (work)