[Biojava-dev] [Bug 2402] New: Parsed genbank file lacks some annotations from original record
bugzilla-daemon at portal.open-bio.org
bugzilla-daemon at portal.open-bio.org
Fri Nov 16 14:02:29 UTC 2007
http://bugzilla.open-bio.org/show_bug.cgi?id=2402
Summary: Parsed genbank file lacks some annotations from original
record
Product: BioJava
Version: 1.5
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: seq.io
AssignedTo: biojava-dev at biojava.org
ReportedBy: p.troshin at dl.ac.uk
Hi there,
I am not sure whether this is a feature or a bug.
I discovered that some annotation is not put in the Annotation object, however,
there were clearly in the genbank file. Some Annotation seems to be always
ignored like "db_xref" for instance.
In the genBank protein record YP_006528 the source feature contains 5
annotations they are:
/virion
/isolate
/specific_host
/db_xref
/organism
BioJava annotation after parsing this file contains only the first 3, thereas
db-xref and organism gets ignored.
Here is a testcase for this bug.
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;
import junit.framework.TestCase;
import org.biojava.bio.Annotation;
import org.biojava.bio.seq.Feature;
import org.biojava.bio.seq.FeatureFilter;
import org.biojava.bio.seq.FeatureHolder;
import org.biojava.bio.seq.ProteinTools;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojavax.RichObjectFactory;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;
import org.biojavax.bio.seq.io.RichSequenceBuilderFactory;
public class BioJavaGenBankProteinParserForDev extends TestCase {
RichSequence bio = null;
static final String genBankProtein =
" LOCUS YP_006528 398 aa linear PHG
25-JUL-2007\r\n"
+ "DEFINITION ParA [Enterobacteria phage P1].\r\n" + "ACCESSION
YP_006528\r\n"
+ "VERSION YP_006528.1 GI:46401682\r\n" + "DBSOURCE REFSEQ:
accession NC_005856.1\r\n"
+ "KEYWORDS .\r\n" + "SOURCE Enterobacteria phage P1\r\n"
+ " ORGANISM Enterobacteria phage P1\r\n"
+ " Viruses; dsDNA viruses, no RNA stage; Caudovirales;
Myoviridae;\r\n"
+ " P1-like viruses.\r\n" + "REFERENCE 1 (residues 1
to 398)\r\n"
+ " AUTHORS Lobocka,M.B., Rose,D.J., Plunkett,G., Rusin,M.,
Samojedny,A.,\r\n"
+ " Lehnherr,H., Yarmolinsky,M.B. and Blattner,F.R.\r\n"
+ " TITLE Genome of bacteriophage p1\r\n"
+ " JOURNAL J. Bacteriol. 186 (21), 7032-7068 (2004)\r\n" + "
PUBMED 15489417\r\n"
+ "REFERENCE 2 (residues 1 to 398)\r\n" + " CONSRTM NCBI
Genome Project\r\n"
+ " TITLE Direct Submission\r\n"
+ " JOURNAL Submitted (06-APR-2006) National Center for
Biotechnology\r\n"
+ " Information, NIH, Bethesda, MD 20894, USA\r\n"
+ "REFERENCE 3 (residues 1 to 398)\r\n" + " AUTHORS
Lobocka,M.B.\r\n"
+ " TITLE Direct Submission\r\n"
+ " JOURNAL Submitted (14-FEB-2000) Department of Microbial
Biochemistry,\r\n"
+ " Institute of Biochemistry and Biophysics of the
Polish Academy of\r\n"
+ " Sciences, Ul. Pawinskiego 5A, Warsaw 02-106,
Poland\r\n"
+ "REFERENCE 4 (residues 1 to 398)\r\n" + " AUTHORS Rusin,M.
and Samojedny,A.\r\n"
+ " TITLE Direct Submission\r\n"
+ " JOURNAL Submitted (14-FEB-2000) Department of Tumor Biology,
Centre of\r\n"
+ " Oncology, M. Sklodowska-Curie Memorial Institute,
Ul. Wybrzeze AK\r\n"
+ " 15, Gliwice 44-101, Poland\r\n"
+ "COMMENT PROVISIONAL REFSEQ: This record has not yet been
subject to final\r\n"
+ " NCBI review. The reference sequence was derived from
AAQ14032.\r\n"
+ " Method: conceptual translation.\r\n"
+ "FEATURES Location/Qualifiers\r\n" + " source
1..398\r\n"
+ " /organism=\"Enterobacteria phage P1\"\r\n"
+ " /virion\r\n"
+ " /isolate=\"mod749::IS5 c1.100 mutant\"\r\n"
+ " /specific_host=\"Escherichia coli\"\r\n"
+ " /db_xref=\"taxon:10678\"\r\n" + "
Protein 1..398\r\n"
+ " /product=\"ParA\"\r\n"
+ " /name=\"encodes ParA/SopA family protein
involved in active\r\n"
+ " partitioning of P1 plasmid prophage at cell
division\"\r\n"
+ " /calculated_mol_wt=44139\r\n" + "
Region 108..393\r\n"
+ " /region_name=\"Soj\"\r\n"
+ " /note=\"ATPases involved in chromosome
partitioning [Cell\r\n"
+ " division and chromosome partitioning];
COG1192\"\r\n"
+ " /db_xref=\"CDD:31385\"\r\n" + " Region
110..>154\r\n"
+ " /region_name=\"ParA\"\r\n"
+ " /note=\"ParA and ParB of Caulobacter
crescentus belong to a\r\n"
+ " conserved family of bacterial proteins
implicated in\r\n"
+ " chromosome segregation; cd02042\"\r\n"
+ " /db_xref=\"CDD:73302\"\r\n" + " Site
117..123\r\n"
+ " /site_type=\"other\"\r\n" + "
/note=\"P-loop\"\r\n"
+ " /db_xref=\"CDD:73302\"\r\n" + " Site
123\r\n"
+ " /site_type=\"other\"\r\n"
+ " /note=\"Magnesium ion binding site\"\r\n"
+ " /db_xref=\"CDD:73302\"\r\n" + " Region
<237..300\r\n"
+ " /region_name=\"ParA\"\r\n"
+ " /note=\"ParA and ParB of Caulobacter
crescentus belong to a\r\n"
+ " conserved family of bacterial proteins
implicated in\r\n"
+ " chromosome segregation; cd02042\"\r\n"
+ " /db_xref=\"CDD:73302\"\r\n" + " Site
251\r\n"
+ " /site_type=\"other\"\r\n"
+ " /note=\"Magnesium ion binding site\"\r\n"
+ " /db_xref=\"CDD:73302\"\r\n" + " CDS
1..398\r\n"
+ " /gene=\"parA\"\r\n" + "
/locus_tag=\"P1_gp060\"\r\n"
+ "
/coded_by=\"complement(NC_005856.1:60017..61213)\"\r\n"
+ " /note=\"weak ATPase, binds to par operator
site to repress\r\n"
+ " transcription; 100 pct identical to
previously predicted\r\n"
+ " product of parA of P1 Swissprot:PARA_ECOLI;
similar to\r\n"
+ " partition ATPases of ParA/SopA family of
many low copy\r\n"
+ " number plasmids and bacteria\"\r\n"
+ " /transl_table=11\r\n"
+ " /db_xref=\"GeneID:2777494\"\r\n" + "ORIGIN
\r\n"
+ " 1 msdssqlhkv aqranrmlnv lteqvqlqkd elhanefyqv yakaalaklp
lltranvdya\r\n"
+ " 61 vsemeekgyv fdkrpagssm kyamsiqnii diyehrgvpk yrdryseayv
ifisnlkggv\r\n"
+ " 121 sktvstvsla hamrahphll medlrilvid ldpqssatmf lshkhsigiv
natsaqamlq\r\n"
+ " 181 nvsreellee fivpsvvpgv dvmpasidda fiasdwrelc nehlpgqnih
avlkenvidk\r\n"
+ " 241 lksdydfilv dsgphldafl knalasanil ftplppatvd fhsslkyvar
lpelvklisd\r\n"
+ " 301 egcecqlatn igfmsklsnk adhkychsla kevfggdmld vflprldgfe
rcgesfdtvi\r\n"
+ " 361 sanpatyvgs adalknaria aedfakavfd riefirsn\r\n" +
"//\r\n" + "\r\n";
/**
* @see junit.framework.TestCase#setUp()
*/
@Override
protected void setUp() throws Exception {
BufferedReader br = new BufferedReader(new
StringReader(genBankProtein));
SymbolTokenization rParser =
ProteinTools.getAlphabet().getTokenization("token");
RichSequenceIterator seqI =
RichSequence.IOTools.readGenbank(br, rParser,
RichSequenceBuilderFactory.FACTORY,
RichObjectFactory.getDefaultNamespace());
bio = seqI.nextRichSequence();
}
/**
* @see junit.framework.TestCase#tearDown()
*/
@Override
protected void tearDown() throws Exception {
bio = null;
}
public void testFeatureList() {
FeatureHolder cds = bio.filter(new FeatureFilter.ByType("source"));
for (Iterator iterator = cds.features(); iterator.hasNext();) {
Feature f = (Feature) iterator.next();
assertEquals("GenBank", f.getSource());
assertEquals("source", f.getType());
Annotation annot = f.getAnnotation();
assertTrue(annot.containsProperty("virion"));
assertTrue(annot.containsProperty("specific_host"));
assertTrue(annot.containsProperty("isolate"));
assertTrue(annot.containsProperty("organism"));
assertTrue(annot.containsProperty("db_xref"));
}
}
}
Thank you for your help.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
More information about the biojava-dev
mailing list