[Bioperl-l] SeqIO-based parser for Vector NTI sequence files

Scott Markel SMarkel at accelrys.com
Mon Feb 9 18:34:51 UTC 2009


Malcolm,

It looks like Vector NTI puts features into COMMENT lines rather than leveraging
the DDBJ/EMBL/GenBank Feature table syntax.  I'd like to treat these features the
same way I treat other features, hence my interest in parsing them.

My only example file is from a customer so the following snippets have been tweaked
a bit.  My replacements are in angle brackets: <...>.

COMMENT     <date here>  <user name here> wrote: 
            <user comment here>

COMMENT     This file is created by Vector NTI
            http://www.informaxinc.com/
COMMENT     ORIGDB|GenBank
COMMENT     VNTDATE|<integers here>|
COMMENT     VNTDBDATE|<integers here>|
COMMENT     LSOWNER|
COMMENT     VNTNAME|<string here>|
COMMENT     VNTAUTHORNAME|<user name here>|
COMMENT     VNTREPLTYPE|<string here>
COMMENT     VNTEXTCHREPL|Animal/Other Eukaryotic
COMMENT     Vector_NTI_Display_Data_(Do_Not_Edit!)
COMMENT     (SXF 
COMMENT      (CGexDoc "<string here>" 0 7616 
COMMENT       (CDBMol 0 0 1 1 1 0 0 0 0 "" "" 0 0 0 0 (CObList) (CObList) (CObList) 
COMMENT        (CObList) -1 "") 
COMMENT       (CDocSetData 1 1 0 1 0 1 "MAIN" 1 1 1 1 1 0 1 1 1 0 10 10 4294967295 50 0 
COMMENT        1 0 (CHomObj 1 0 0 3 100) (CWordArray 23) (CWordArray) 
COMMENT        (CStringList <multiple quoted strings here>) 
COMMENT        (CStringList <multiple quoted strings here>) (CStringList <multiple quoted strings here>) 
COMMENT        (CObList 
COMMENT         #0=(COligo <quoted string here> <quoted string here> 
COMMENT             "Tm: 52.1C Length: 16mer GC: 56.3%" 0 (CStringList) 0) 
COMMENT         #1=(COligo <quoted string here> <quoted string here> 
COMMENT             "Tm: 56.8C Length: 18mer GC: 61.1%" 0 (CStringList) 0)
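
The pipe-delimited lines in the snippet above (ORIGDB, VNTDATE, VNTNAME, and so on) could be collected into key/value pairs with something like the following. This is only a minimal sketch in Python, not existing BioPerl code: the function name parse_vnti_comment_fields is hypothetical, and it assumes each such line has the shape KEY|value with an optional trailing pipe, which is all the snippet above shows.

```python
import re

# Assumed shape of a metadata line after the COMMENT keyword:
# an upper-case key, a pipe, a value, and an optional trailing pipe.
FIELD = re.compile(r"([A-Z]+)\|(.*?)\|?")


def parse_vnti_comment_fields(lines):
    """Collect KEY|value pairs from GenBank COMMENT lines (hypothetical helper)."""
    fields = {}
    for line in lines:
        if not line.startswith("COMMENT"):
            continue
        body = line[len("COMMENT"):].strip()
        match = FIELD.fullmatch(body)
        if match:
            # Free-text and display-data lines do not match and are skipped.
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    sample = [
        "COMMENT     This file is created by Vector NTI",
        "COMMENT     ORIGDB|GenBank",
        "COMMENT     VNTDATE|<integers here>|",
        "COMMENT     LSOWNER|",
        "COMMENT     VNTEXTCHREPL|Animal/Other Eukaryotic",
        "COMMENT     (SXF ",
    ]
    for key, value in parse_vnti_comment_fields(sample).items():
        print(key, repr(value))
```

The free-text comment lines and the parenthesised display data are deliberately ignored here; only lines that begin with an upper-case key followed by a pipe are kept.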

There are also some hierarchical sections.

COMMENT       (CObList) (CObList) (CObList) 
COMMENT       (CTextView 0 
COMMENT        #120=(CGroupPar (CParagraph 0 (0 0) 1 2 0 0 180) 
COMMENT              (CObjectList 
COMMENT               #121=(CRefLinePar 
COMMENT                     (CLinePar (CParagraph 0 (0 0) 0 2 0 1 233) <quoted string here> 2) 5 
COMMENT                     "" 0 4) 
COMMENT               #122=(CFolderPar 
COMMENT                     (CGroupPar (CParagraph 1 (0 0) 1 1 0 0 178) 
COMMENT                      (CObjectList 
COMMENT                       #123=(CLinePar (CParagraph 0 (0 0) 1 2 1 0 180) 
COMMENT                             <quoted string here> 1) 
COMMENT                       #124=(CLinePar (CParagraph 0 (0 0) 1 2 1 0 180) 
COMMENT                             <quoted string here> 1) 
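
The display data above looks like nested, parenthesised lists, so one way to get at the hierarchy is to tokenize the COMMENT bodies and build nested lists from them. The following is a rough Python sketch under several assumptions: the names tokenize and parse are hypothetical, quoted strings are assumed to contain no embedded double quotes, labels such as #120= are simply kept as atoms in front of the list they precede, and input that stops before its closing parentheses (as the snippets here do) is closed implicitly.

```python
import re

# Tokens: "(" or ")", a double-quoted string, or a run of other characters.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def tokenize(lines):
    """Strip the COMMENT keyword, join the lines, and split into tokens."""
    text = " ".join(
        line[len("COMMENT"):] for line in lines if line.startswith("COMMENT")
    )
    return TOKEN.findall(text)


def parse(tokens):
    """Turn a token stream into nested Python lists of strings."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    # Truncated input: close any lists that are still open.
    while len(stack) > 1:
        finished = stack.pop()
        stack[-1].append(finished)
    return stack[0]


if __name__ == "__main__":
    sample = [
        "COMMENT       (CTextView 0 ",
        "COMMENT        #120=(CGroupPar (CParagraph 0 (0 0) 1 2 0 0 180) ",
        "COMMENT              (CObjectList",
    ]
    print(parse(tokenize(sample)))
```

All atoms are left as strings; deciding which of them are integers, flags, or feature coordinates would need knowledge of the Vector NTI format that the example file alone does not provide.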

Scott

> -----Original Message-----
> From: Cook, Malcolm [mailto:MEC at stowers.org]
> Sent: Monday, 09 February 2009 6:51 AM
> To: Scott Markel; 'bioperl-ml'
> Subject: RE: [Bioperl-l] SeqIO-based parser for Vector NTI sequence files
> 
> Scott,
> 
> What do you expect to extract from the COMMENT lines?
> 
> 
> Malcolm Cook
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, Missouri
> 
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Scott Markel
> Sent: Tuesday, October 21, 2008 3:49 PM
> To: bioperl-ml
> Cc: smarkel at accelrys.com
> Subject: [Bioperl-l] SeqIO-based parser for Vector NTI sequence files
> 
> I'm looking for a BioPerl-related solution to parsing Vector NTI sequence
> files.  The genbank.pm parser will work, but it doesn't parse the COMMENT
> lines beyond grabbing the simple string value, so it misses all of the
> added information in those lines.
> 
> If you know of any existing code, I'd be interested in hearing about it.
> I checked BioPerl, BioJava, and EMBOSS documentation.
> I also checked the Invitrogen web site.
> 
> Scott
> 
> --
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> Accelrys (SciTegic R&D)             mobile: +1 858 205 3653
> 10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:    http://www.accelrys.com
> 
> http://www.linkedin.com/in/smarkel
> Board of Directors: International Society for Computational Biology
> Co-chair: ISCB Publications Committee
> Associate Editor: PLoS Computational Biology
> Editorial Board: Briefings in Bioinformatics
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l



