[Bioperl-l] SeqIO-based parser for Vector NTI sequence files

Cook, Malcolm MEC at stowers.org
Mon Feb 9 14:32:03 EST 2009


Hi Scott,

It is my understanding that the Informax developers used the COMMENT lines to encode molecule-level attribute-value pairs.

e.g. your VNTDATE|<integers here>|
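
As a rough illustration of treating those lines as attribute-value pairs, here is a minimal sketch in Python (not BioPerl, and not an existing parser). The function name `parse_vnti_comment_pairs` is hypothetical, the date value in the sample is a made-up placeholder, and the sketch assumes that keys are all uppercase and that a trailing pipe is optional, which is all the snippets in this thread show.

```python
import re

def parse_vnti_comment_pairs(lines):
    """Collect KEY|value| pairs from GenBank-style COMMENT lines."""
    pairs = {}
    for line in lines:
        if not line.startswith("COMMENT"):
            continue
        body = line[len("COMMENT"):].strip()
        key, sep, rest = body.partition("|")
        # Assumption: attribute names are uppercase identifiers such as VNTDATE.
        if not sep or not re.fullmatch(r"[A-Z_]+", key):
            continue
        # Drop the trailing pipe when there is one.
        pairs[key] = rest[:-1] if rest.endswith("|") else rest
    return pairs

sample = [
    "COMMENT     This file is created by Vector NTI",
    "COMMENT     ORIGDB|GenBank",
    "COMMENT     VNTDATE|123456789|",   # placeholder value, not real data
    "COMMENT     LSOWNER|",
    "COMMENT     VNTEXTCHREPL|Animal/Other Eukaryotic",
]
pairs = parse_vnti_comment_pairs(sample)
```

Free-text COMMENT lines and the S-expression lines are skipped because they do not start with an uppercase key followed by a pipe.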

Some dialect of LISP had a role in the product's early days.  The attribute 'Vector_NTI_Display_Data_(Do_Not_Edit!)' is a LISP S-expression whose CAR is 'SXF' (ever program LISP?).

It should be easily 'parseable' by any LISP interpreter or anything that can balance parens.

The only 'rub' is the #NNN= tokens, which appear to be some form of forward reference used by the LISP serializer to allow for internal references.
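
A minimal sketch of the "balance the parens" approach, in Python rather than LISP or BioPerl. The names `tokenize` and `parse` are hypothetical, and the sample is a made-up, balanced fragment modelled on the snippets in this thread with placeholder strings. It assumes the COMMENT prefixes have already been stripped and wrapped lines rejoined, that quoted strings contain no embedded double quotes, and that a #NNN= token simply labels the expression that follows it (as the #n= reader syntax does in Common Lisp). None of this is confirmed by any Vector NTI documentation.

```python
import re

TOKEN = re.compile(r'''
    \s*(?:
        (?P<label>\#\d+=)      # label such as #120=
      | (?P<open>\()
      | (?P<close>\))
      | (?P<string>"[^"]*")    # assumes no embedded double quotes
      | (?P<atom>[^\s()"]+)    # symbols and numbers
    )
''', re.VERBOSE)

def tokenize(text):
    """Yield (kind, value) pairs for each token in the S-expression text."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        yield m.lastgroup, m.group(m.lastgroup)

def parse(text):
    """Return (top-level expressions as nested lists, {label number: node})."""
    labels = {}
    stack = [[]]            # stack[0] collects the top-level expressions
    next_label = None       # label waiting to be attached to the next node
    for kind, value in tokenize(text):
        if kind == "label":
            next_label = int(value[1:-1])
            continue
        if kind == "close":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
            continue
        if kind == "open":
            node = []
        elif kind == "string":
            node = value[1:-1]
        else:
            # Integers become int; symbols stay as plain strings.
            node = int(value) if re.fullmatch(r"-?\d+", value) else value
        if next_label is not None:
            labels[next_label] = node
            next_label = None
        stack[-1].append(node)
        if kind == "open":
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0], labels

sample = '''(SXF (CObList
  #0=(COligo "name-a" "ACGT" "Tm: 52.1C Length: 16mer GC: 56.3%" 0 (CStringList) 0)
  #1=(COligo "name-b" "TTGA" "Tm: 56.8C Length: 18mer GC: 61.1%" 0 (CStringList) 0)))'''
tree, labels = parse(sample)
```

Here `labels[0]` and `labels[1]` are the two COligo lists, so the labelled objects can be looked up directly without walking the tree. The sketch does not distinguish quoted strings from bare symbols once parsed, which would matter if the two ever collide.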

I would be very surprised if you could learn anything about the internal structure of this from ABI (was Invitrogen, was Informax).  I asked once, about 5 years ago.  I am/was a supported user.  It was arcane historical knowledge even then.

I would also be very surprised if there was anything in it that was meaningful, unless the creator of the molecule in VNTI was using some very odd convention.

I used to know which dialect of LISP was being used by them and might track it down if it were important to you....

If you are after the oligo descriptive text in the COMMENT, it is likely that the oligos ALSO have a GenBank feature associated with them.  But you probably already looked....
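
If the oligo descriptive text is the goal, a small Python sketch along these lines could pull the numbers out of a summary string. The pattern is based only on the two examples quoted in this thread ("Tm: 52.1C Length: 16mer GC: 56.3%"), so the exact layout is an assumption, and `parse_oligo_summary` is a hypothetical name.

```python
import re

# Assumed layout, taken from the two COligo examples in this thread.
OLIGO_SUMMARY = re.compile(
    r"Tm:\s*(?P<tm>\d+(?:\.\d+)?)C\s+"
    r"Length:\s*(?P<length>\d+)mer\s+"
    r"GC:\s*(?P<gc>\d+(?:\.\d+)?)%"
)

def parse_oligo_summary(text):
    """Return Tm, length and GC content from an oligo summary, or None."""
    m = OLIGO_SUMMARY.search(text)
    if m is None:
        return None
    return {
        "tm_celsius": float(m["tm"]),
        "length": int(m["length"]),
        "gc_percent": float(m["gc"]),
    }

summary = parse_oligo_summary("Tm: 52.1C Length: 16mer GC: 56.3%")
```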

What are you really after...?

One thing you should know (if you don't) that might bear on your underlying problem (whatever it may be....).  Vector NTI can open GFF files and display them as an external analysis.  The elements of the analysis can then be promoted interactively by the user to be features (GenBank style).

Malcolm


> -----Original Message-----
> From: Scott Markel [mailto:SMarkel at accelrys.com]
> Sent: Monday, February 09, 2009 12:35 PM
> To: Cook, Malcolm; 'bioperl-ml'
> Subject: RE: [Bioperl-l] SeqIO-based parser for Vector NTI
> sequence files
>
> Malcolm,
>
> It looks like Vector NTI puts features into COMMENT lines
> rather than leveraging the DDBJ/EMBL/GenBank Feature table
> syntax.  I'd like to treat these features the same way I
> treat other features, hence my interest in parsing them.
>
> My only example file is from a customer so the following
> snippets have been tweaked a bit.  My replacements are in
> angle brackets: <...>.
>
> COMMENT     <date here>  <user name here> wrote:
>             <user comment here>
>
> COMMENT     This file is created by Vector NTI
>             http://www.informaxinc.com/
> COMMENT     ORIGDB|GenBank
> COMMENT     VNTDATE|<integers here>|
> COMMENT     VNTDBDATE|<integers here>|
> COMMENT     LSOWNER|
> COMMENT     VNTNAME|<string here>|
> COMMENT     VNTAUTHORNAME|<user name here>|
> COMMENT     VNTREPLTYPE|<string here>
> COMMENT     VNTEXTCHREPL|Animal/Other Eukaryotic
> COMMENT     Vector_NTI_Display_Data_(Do_Not_Edit!)
> COMMENT     (SXF
> COMMENT      (CGexDoc "<string here>" 0 7616
> COMMENT       (CDBMol 0 0 1 1 1 0 0 0 0 "" "" 0 0 0 0
> (CObList) (CObList) (CObList)
> COMMENT        (CObList) -1 "")
> COMMENT       (CDocSetData 1 1 0 1 0 1 "MAIN" 1 1 1 1 1 0 1 1
> 1 0 10 10 4294967295 50 0
> COMMENT        1 0 (CHomObj 1 0 0 3 100) (CWordArray 23) (CWordArray)
> COMMENT        (CStringList <multiple quoted strings here>)
> COMMENT        (CStringList <multiple quoted strings here>)
> (CStringList <multiple quoted strings here>)
> COMMENT        (CObList
> COMMENT         #0=(COligo <quoted string here> <quoted string here>
> COMMENT             "Tm: 52.1C Length: 16mer GC: 56.3%" 0
> (CStringList) 0)
> COMMENT         #1=(COligo <quoted string here> <quoted string here>
> COMMENT             "Tm: 56.8C Length: 18mer GC: 61.1%" 0
> (CStringList) 0)
>
> There are also some hierarchical sections.
>
> COMMENT       (CObList) (CObList) (CObList)
> COMMENT       (CTextView 0
> COMMENT        #120=(CGroupPar (CParagraph 0 (0 0) 1 2 0 0 180)
> COMMENT              (CObjectList
> COMMENT               #121=(CRefLinePar
> COMMENT                     (CLinePar (CParagraph 0 (0 0) 0 2
> 0 1 233) <quoted string here> 2) 5
> COMMENT                     "" 0 4)
> COMMENT               #122=(CFolderPar
> COMMENT                     (CGroupPar (CParagraph 1 (0 0) 1
> 1 0 0 178)
> COMMENT                      (CObjectList
> COMMENT                       #123=(CLinePar (CParagraph 0 (0
> 0) 1 2 1 0 180)
> COMMENT                             <quoted string here> 1)
> COMMENT                       #124=(CLinePar (CParagraph 0 (0
> 0) 1 2 1 0 180)
> COMMENT                             <quoted string here> 1)
>
> Scott
>
> > -----Original Message-----
> > From: Cook, Malcolm [mailto:MEC at stowers.org]
> > Sent: Monday, 09 February 2009 6:51 AM
> > To: Scott Markel; 'bioperl-ml'
> > Subject: RE: [Bioperl-l] SeqIO-based parser for Vector NTI sequence
> > files
> >
> > Scott,
> >
> > What do you expect to extract from the COMMENT lines?
> >
> >
> > Malcolm Cook
> > Database Applications Manager - Bioinformatics
> > Stowers Institute for Medical Research - Kansas City, Missouri
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of Scott Markel
> > Sent: Tuesday, October 21, 2008 3:49 PM
> > To: bioperl-ml
> > Cc: smarkel at accelrys.com
> > Subject: [Bioperl-l] SeqIO-based parser for Vector NTI sequence files
> >
> > I'm looking for a BioPerl-related solution to parsing Vector NTI
> > sequence files.  The genbank.pm parser will work, but it doesn't
> > parse the COMMENT lines beyond grabbing the simple string value, so
> > it misses all of the added information in those lines.
> >
> > If you know of any existing code, I'd be interested in hearing
> > about it.
> > I checked BioPerl, BioJava, and EMBOSS documentation.
> > I also checked the Invitrogen web site.
> >
> > Scott
> >
> > --
> > Scott Markel, Ph.D.
> > Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> > Accelrys (SciTegic R&D)             mobile: +1 858 205 3653
> > 10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
> > San Diego, CA 92121                 fax:    +1 858 799 5222
> > USA                                 web:    http://www.accelrys.com
> >
> > http://www.linkedin.com/in/smarkel
> > Board of Directors: International Society for Computational Biology
> > Co-chair: ISCB Publications Committee
> > Associate Editor: PLoS Computational Biology
> > Editorial Board: Briefings in Bioinformatics
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


