[Biopython-dev] [Bug 2225] New: Do something with the PROJECT line in GenBank files

Thu Mar 8 18:45:47 UTC 2007

http://bugzilla.open-bio.org/show_bug.cgi?id=2225

           Summary: Do something with the PROJECT line in GenBank files
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

See also bug 1946, where the introduction of this line broke the parser.  At the
moment the PROJECT line is ignored.

Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
-------------------------------------------------
1.4 Upcoming Changes

1.4.1 Multiple identifiers for the PROJECT line

  The recently-introduced PROJECT linetype (see Section 3.4.7.2) provides a
way to link GenBank sequences that are part of a sequencing project to the
Entrez Genome Project database, where further details about the project can
be found.

  As of June 2007, multiple identifiers will be valid for the PROJECT line.
Here is a mocked-up example of the expected usage:

LOCUS       AANA01000001               2 rc    DNA     linear   BCT 09-FEB-2007
DEFINITION  Polaribacter dokdonensis MED152 whole genome shotgun sequencing
            project.
ACCESSION   AANA01000001
VERSION     AANA01000001.1  GI:85822094
PROJECT     GenomeProject:13543  GenomeProject:99999

  There are several situations in which a record could be considered part
of two different Genome Projects. For example, consider an
environmental-sampling metagenomic WGS project for which the individual
sequence-overlap contigs are not attributed to a specific organism. A
Genome Project could exist that provides further details about the
sequencing effort, the centers involved, etc.

  If, in subsequent assembly and annotation phases, scaffold/super-contig/
chromosomal records are created which **are** attributed to a specific
organism, then those CON-division records could have two Genome Project IDs:
one for the WGS sequencing project as a whole; and a second for organism-
specific Genome Projects.

  Additional examples illustrating the use of multiple Genome Project IDs 
will be provided in future release notes, and via the GenBank listserv.
-------------------------------------------------
End quote

For the RecordParser, storing this line as a string should be fine (?)

However, for the FeatureParser, which turns the data into a SeqRecord, perhaps
this data should be held in the annotation as a list of strings:

['GenomeProject:13543', 'GenomeProject:99999']
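
A minimal sketch of how the list above could be produced from the raw line. The
helper name parse_project_line is hypothetical and not part of Biopython; it
assumes the identifiers are whitespace-separated, as in the mocked-up example
from the release notes.

```python
def parse_project_line(line):
    """Split a GenBank PROJECT line into a list of identifier strings.

    Hypothetical helper for illustration only; assumes the identifiers
    are separated by whitespace, as in the release-notes example.
    """
    keyword, _, rest = line.partition(" ")
    if keyword != "PROJECT":
        raise ValueError("not a PROJECT line: %r" % line)
    # str.split() with no arguments collapses runs of whitespace,
    # so one or many identifiers are handled the same way.
    return rest.split()


line = "PROJECT     GenomeProject:13543  GenomeProject:99999"
print(parse_project_line(line))
```

Returning a list even when there is only one identifier would keep the
annotation type the same before and after the June 2007 change.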
