[Biopython-dev] [Bug 2225] New: Do something with the PROJECT line in GenBank files
bugzilla-daemon at portal.open-bio.org
Thu Mar 8 13:45:47 EST 2007
http://bugzilla.open-bio.org/show_bug.cgi?id=2225
Summary: Do something with the PROJECT line in GenBank files
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
See also bug 1946, where the introduction of this line broke the parser. At the
moment the PROJECT line is simply ignored.
Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
-------------------------------------------------
1.4 Upcoming Changes
1.4.1 Multiple identifiers for the PROJECT line
The recently-introduced PROJECT linetype (see Section 3.4.7.2) provides a
way to link GenBank sequences that are part of a sequencing project to the
Entrez Genome Project database, where further details about the project can
be found.
As of June 2007, multiple identifiers will be valid for the PROJECT line.
Here is a mocked-up example of the expected usage:
LOCUS       AANA01000001 2 rc DNA linear BCT 09-FEB-2007
DEFINITION  Polaribacter dokdonensis MED152 whole genome shotgun sequencing
            project.
ACCESSION   AANA01000001
VERSION     AANA01000001.1 GI:85822094
PROJECT     GenomeProject:13543 GenomeProject:99999
There are several situations in which a record could be considered part
of two different Genome Projects. For example, consider an
environmental-sampling metagenomic WGS project for which the individual
sequence-overlap contigs are not attributed to a specific organism. A
Genome Project could exist that provides further details about the
sequencing effort, the centers involved, etc.
If, in subsequent assembly and annotation phases, scaffold/super-contig/
chromosomal records are created which **are** attributed to a specific
organism, then those CON-division records could have two Genome Project IDs:
one for the WGS sequencing project as a whole; and a second for organism-
specific Genome Projects.
Additional examples illustrating the use of multiple Genome Project IDs
will be provided in future release notes, and via the GenBank listserv.
-------------------------------------------------
End quote
For the RecordParser, storing this line as a string should be fine (?)
However, for the FeatureParser, which turns the data into a SeqRecord, perhaps
this data should be held in the annotation as a list of strings:
['GenomeProject:13543', 'GenomeProject:99999']
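A minimal sketch of the list-of-strings idea, for discussion only. The function name parse_project_line and the annotation key "project" are hypothetical, not existing Biopython names, and a plain dict stands in for the SeqRecord's annotations. It assumes the identifiers on the line are separated by whitespace, as in the mocked-up example quoted above.

```python
def parse_project_line(line):
    """Split a GenBank PROJECT line into a list of identifier strings.

    Hypothetical helper: assumes the line starts with the PROJECT keyword
    and that identifiers are whitespace-separated.
    """
    keyword, _, rest = line.partition(" ")
    if keyword != "PROJECT":
        raise ValueError("not a PROJECT line: {!r}".format(line))
    # str.split() with no argument collapses runs of whitespace, so the
    # padding after the keyword does not produce empty entries.
    return rest.split()


# A plain dict standing in for SeqRecord.annotations; the "project" key
# is an assumption, not an agreed name.
annotations = {}
annotations["project"] = parse_project_line(
    "PROJECT     GenomeProject:13543 GenomeProject:99999"
)
print(annotations["project"])
```

With this approach a record carrying a single identifier would still yield a list (of length one), so code reading the annotation would not need to special-case old and new style records.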