[Bioperl-l] Trying to get a mysql DB from genbank flat files

Raphael LaFrance rafe@scinq.org
Thu, 15 Nov 2001 14:30:50 -0500


Here's the record that crashes it. I'm adding to an empty DB... if that
matters.

==============================Cut Here========================================
GBBCT1.SEQ           Genetic Sequence Data Bank
                          October 15 2001

                 NCBI-GenBank Flat File Release 126

                        Bacterial Sequences (Part 1)

   17495 loci,   103824523 bases, from    17495 reported sequences


LOCUS       AARPOB2       871 bp    DNA             BCT       03-FEB-2000
DEFINITION  Abiotrophia adiacens RNA polymerase beta subunit (rpoB) gene,
            partial cds.
ACCESSION   AF194508
VERSION     AF194508.1  GI:6449110
KEYWORDS    .
SEGMENT     2 of 2
SOURCE      Granulicatella adiacens.
  ORGANISM  Granulicatella adiacens
            Bacteria; Firmicutes; Bacillus/Clostridium group; Granulicatella.
REFERENCE   1  (bases 1 to 871)
  AUTHORS   Majewski,J., Zawadzki,P., Pickerill,P., Cohan,F.M. and Dowson,C.G.
  TITLE     Barriers to genetic exchange between bacterial species:
            Streptococcus pneumoniae transformation
  JOURNAL   J. Bacteriol. 182 (4), 1016-1023 (2000)
  MEDLINE   20115546
   PUBMED   10648528
REFERENCE   2  (bases 1 to 871)
  AUTHORS   Majewski,J., Zawadzki,P., Pickerill,P., Cohan,F.M. and Dowson,C.G.
  TITLE     Direct Submission
  JOURNAL   Submitted (13-OCT-1999) Biology, Wesleyan University, Church
            Street, Middletown, CT 06457, USA
FEATURES             Location/Qualifiers
     source          1..871
                     /organism="Granulicatella adiacens"
                     /strain="ATCC 49175"
                     /db_xref="taxon:46124"
                     /db_xref="ATCC:49175"
     gene            order(AF194507.1:<1..510,1..>871)
                     /gene="rpoB"
     CDS             <1..>871
                     /gene="rpoB"
                     /codon_start=1
                     /transl_table=11
                     /product="RNA polymerase beta subunit"
                     /protein_id="AAF08832.1"
                     /db_xref="GI:6449113"
                     /translation="GELTYKRRLSALGPGGLTRDRAGYEVRDVHYSHYGRMCPIETPE
                     GPNIGLINSLSTYAKINKYGFIETPYRRVDWNTHKVTDKIDYLTADEEDSFVVAQANS
                     PLNEDGSFVNDVVMARYVSENLEVPVERVDYMDVSPKQVVAVATACIPFLENDDSNRA
                     LMGANMQRQAVPLLNPKAPFIGTGMEYVSAHDSGVALLCKRDGVVEFVDAKEVRVRTA
                     DGSLDTYHITKFHGSNAGMCYNQRPIVAQGDKVVKGEILADGPSMEKGELALGQNVLV
                     AFMTWEGYIYEDAV"
BASE COUNT      263 a    174 c    206 g    228 t
ORIGIN
        1 ggagagttaa catacaaacg ccgtctatca gcgttaggac ctggtggttt gactcgtgac
       61 cgtgctggat atgaagttcg tgacgttcac tattctcact atggtcgtat gtgtcctatc
      121 gaaactcctg aaggaccaaa catcgggttg atcaacagct tatcaaccta tgcgaagatc
      181 aataaatatg gtttcatcga aactccatac cgtcgtgtag actggaacac tcataaagtt
      241 acagataaaa ttgactactt aacagctgac gaagaagata gcttcgtagt agcgcaagca
      301 aactctccat taaatgaaga tggaagcttc gtgaatgatg ttgttatggc gcgttacgta
      361 tctgaaaact tagaagtgcc agtagaacgc gttgactata tggacgtttc tccaaaacaa
      421 gtagttgcag ttgcgacagc atgtatcccg ttcttagaaa acgacgactc aaaccgtgcg
      481 ttgatgggtg cgaacatgca acgtcaagct gttccattgt taaatccaaa agcaccattc
      541 atcggtacag gtatggaata cgtatctgca catgactcag gggttgcctt gttatgtaaa
      601 cgtgatggtg tagtcgaatt cgttgatgct aaagaagtac gtgtacgtac agctgatggc
      661 tcattagata cttaccacat cactaagttc cacggatcaa acgcgggtat gtgttacaac
      721 caacgtccaa tcgtggcaca aggggataaa gtcgttaaag gcgaaatcct agcagatgga
      781 ccttctatgg aaaaaggtga attagcatta ggacaaaacg ttctagtagc gttcatgact
      841 tgggaaggtt acatctacga ggatgcggtt a
//
================================Cut Here========================================

Sorry for the slow turnaround.

rafe

Raphael LaFrance wrote:
> 
> Sorry for the delay in replying... meetings all morning.
> 
> Yup, crazy chars abound in the GenBank data. That quote() function is
> sorely missing in my program & a much better solution than my 1st try,
> which was rather brute force-ish.  I also like to strip multi-whitespace
> chars down to a single space char.
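
The whitespace cleanup mentioned above could look roughly like the sketch
below. The helper name collapse_ws is made up for illustration and is not
taken from the ripper scripts; the sample text is the DEFINITION field of
the record at the top of this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original scripts):
# squeeze every run of whitespace, including the newlines left by wrapped
# GenBank continuation lines, down to one space and trim both ends.
sub collapse_ws {
    my ($text) = @_;
    $text =~ s/\s+/ /g;     # any run of whitespace becomes a single space
    $text =~ s/^ | $//g;    # drop the one space that may remain at either end
    return $text;
}

print collapse_ws(
    "Abiotrophia adiacens RNA polymerase beta subunit (rpoB) gene,\n"
  . "            partial cds."
), "\n";
```

Doing this before the value goes into the database keeps wrapped fields on
one line, but it is independent of quoting: the result still has to go
through quote() or a bind variable.
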
> 
> Oh yeah, you probably already knew this but... on quoted data like the
> feature qualifiers you cannot depend on the "/" being a reliable
> delimiter. You have to look for the closing quote by stripping trailing
> white space & checking the last char found (well, the last two if you
> think there might be a \" sequence in the data itself, which I didn't find
> in the 60 or so files I stripped, but you never know).
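
A minimal sketch of that closing-quote check, assuming the qualifier lines
of one feature have already been collected into a list. The name
parse_qualifiers and the /note lines in the example are invented for
illustration; only the check itself (strip trailing blanks, look at the
last one or two characters) comes from the description above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original rippers): turn
# the qualifier lines of one feature into [key, value] pairs. A quoted value
# ends only on a line whose last non-blank character is a double quote not
# preceded by a backslash, so a "/" inside the text never ends it early.
sub parse_qualifiers {
    my @lines = @_;
    my (@pairs, $key, $value);
    for my $line (@lines) {
        (my $text = $line) =~ s/^\s+|\s+$//g;    # strip blanks at both ends
        if (defined $key) {
            # Still inside a quoted value: append, even if it starts with "/".
            $value .= (length $value ? ' ' : '') . $text;
        }
        elsif ($text =~ m{^/(\w+)="(.*)$}s) {
            ($key, $value) = ($1, $2);           # a quoted value opens here
        }
        elsif ($text =~ m{^/(\w+)(?:=(.*))?$}s) {
            push @pairs, [$1, defined $2 ? $2 : ''];   # unquoted qualifier
            next;
        }
        else {
            next;                                # not a qualifier line
        }
        # Closing test: last char is a quote and the one before is not "\".
        if ($value =~ /(?<!\\)"$/) {
            chop $value;                         # drop the closing quote
            push @pairs, [$key, $value];
            ($key, $value) = (undef, undef);
        }
    }
    return @pairs;
}

# The first two lines are from the record above; the /note lines are made up.
my @pairs = parse_qualifiers(
    '                     /gene="rpoB"',
    '                     /codon_start=1',
    '                     /note="similar to A/B hydrolase;',
    '                     /see text"',
);
print "$_->[0] => $_->[1]\n" for @pairs;
```

Continuation lines are joined with a single space, which suits free-text
qualifiers; a /translation value would want them joined with nothing in
between, so treat that as a choice to adjust per qualifier.
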
> 
> Maybe new bugs will be introduced by using the substr function; I've seen
> some odd things in the data set. This, however, is a peculiar record where
> the format is the same but the fields may be missing & there are no easy
> delimiters to go after. I also didn't post that the date was thrown off
> too.  Same problem tho.
> 
> Of course I am willing to send the entire rippers, but I'm not sure
> how much use they'll be (aside from comic relief :D) given that I'm a
> total newbie.
> 
> Apologies. I'm trying to track down the nasty record but the obvious n+1
> isn't crashing it. More as soon as I get out of the next set of
> meetings.
> 
> rafe
> 
> "Osborne, Brian" wrote:
> >
> > Wilfred,
> >
> > Yes. And there is also the quote() method, which escapes offending
> > characters. However, I don't know what the complete offending character
> > set is or how it overlaps with the set of annoying characters in the
> > description, so it might not help here.
> >
> > quote
> > $sql = $dbh->quote($string);
> > This method escapes special characters (quotation marks, etc.) in strings
> > and adds the required outer quotation marks. It may not be able to handle
> > all types of input (e.g. binary data).
> >
> > Example using the above methods:
> > #!/usr/bin/perl -w
> > use DBI;
> > use strict;
> > # If connect() fails there is no handle yet, so read $DBI::errstr.
> > my $dbh = DBI->connect("DBI:mysql:contacts",undef,undef)
> >   or die "Unable to connect to contacts Database: $DBI::errstr\n";
> > my $sth = $dbh->prepare(
> >   "SELECT uid FROM contact WHERE last_name = 'Flaherty'");
> > # Method calls are not interpolated in strings, so append errstr.
> > $sth->execute or die "Unable to execute query: " . $sth->errstr . "\n";
> > my $row = $sth->fetchrow_arrayref or die "No matching contact\n";
> > my $uid = $row->[0];
> > $sth->finish;
> > # quote() adds the outer quotation marks itself; do not add more.
> > my $newname = $dbh->quote("The Flahertys'");
> > my $statement = qq(UPDATE contact SET last_name = $newname
> >                   WHERE uid = $uid);
> > my $rc = $dbh->do($statement)
> >   or die "Unable to prepare/execute $statement: " . $dbh->errstr . "\n";
> > print "$rc rows were updated\n";
> > $dbh->disconnect;
> > exit;
> >
> > Brian O.
> >
> >  -----Original Message-----
> > From:   Wilfred Li, Ph.D. [mailto:wilfred@sdsc.edu]
> > Sent:   Thursday, November 15, 2001 12:41 PM
> > To:     bioperl-l@bioperl.org
> > Subject:        RE: [Bioperl-l] Trying to get a mysql DB from genbank flat
> > files
> >
> > >thanks for your mail, you have magically bumped into our daily nightmare,
> > >that is that very often people put crazy characters in the description
> > >lines, and we have to find ways of backslashing all of them otherwise
> > >they will break mysql statements. Could you mail me the offending record?
> > Hi,
> >
> > If bind variables are used in place of a plain insert statement, the
> > special characters in the bound values are taken care of by Perl DBI,
> > e.g.
> >
> > $sth = $dbh->prepare("insert ... values (?, ..., ?)");
> > $sth->execute($id, ..., $kw);
> >
> > or use
> >
> > $dbh->do("insert ... values (?, ..., ?)", undef, $id, ..., $kw);
> >
> > to combine the two steps into one (the second argument to do() is an
> > attribute hash, hence the undef).
> >
> > I had the problem with SeqAdaptor.pm when parsing SwissProt.
> >
> > Wilfred
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l