[Bioperl-l] getting pubmed id from genbank files

Wed Jul 27 04:09:59 EDT 2005

Yeah, I'm pretty sure I was using bioperl-live updated that morning. Your
explanation of the problem seems plausible from what I was seeing in the
Perl debugger. I'll look into this a bit more later this morning.

Nathan

Quoting Barry Moore <bmoore at genetics.utah.edu>:

> Nathan-
> 
> That sounds like you are using bioperl 1.4?  The error is in
> Bio/SeqIO/genbank.pm  and was fixed by Jason in cvs version 1.102 of
> that file.  However the current code still looks a bit odd to me.
> Starting at line 1068 of the current cvs version (1.119) of genbank.pm
> we have:
> 
> 1068  if (/^\s{2}JOURNAL\s+(.*)/o) {
> 1069     push(@loc, $1);
> 1070     while ( defined($_ = $self->_readline) ) {
> 1071           # we only match when there are at least 4 spaces
> 1072           # there is probably a better way to match this
> 1073           # as it assumes that the describing tag is short enough
> 1074           /^\s{4,}(.*)/o && do { push(@loc, $1);
> 1075           next;
> 1076     };
> 1077     last;
> 1078  }
> 1079  $ref->location(join(' ', @loc));
> 
> This all deals with parsing the JOURNAL line, which lines 1068-69
> handle fine.  The while loop at 1070 looks at successive lines to find
> anything that continues the JOURNAL line.  The regex at line 1074 used
> to read /^\s{3,}(.*)/o, which would not match if the line after
> JOURNAL began with '  MEDLINE', but would match '   PUBMED' (Nathan's
> situation), causing that line to be appended to the JOURNAL line.  Is
> there ever a JOURNAL entry with more than one line?  If so, shouldn't
> the continuation lines always be untagged and thus indented 12 spaces,
> making the regex /^\s{12}(.*)/o safer?  The current code would append
> any line to the JOURNAL line if its tag is shorter than 6 characters,
> and I don't think that's what we want.
> 
> Barry
> 
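
For anyone wanting to check the three patterns discussed above side by
side, here is a minimal sketch in plain Perl (no BioPerl needed). The
sample lines are assumptions: the MEDLINE id is a placeholder, the
XXXXX tag is hypothetical and only stands in for "a short tag indented
4 columns", and the continuation text is made up. The helper name
is_appended is also just for illustration.

```perl
use strict;
use warnings;

# Sample lines laid out as in the REFERENCE block of a GenBank flat file.
# Assumption: continuation lines carry no tag and start at column 13.
my %samples = (
    medline      => '  MEDLINE   00000000',                    # placeholder id
    pubmed       => '   PUBMED   15082560',                    # from Nathan's record
    short_tag    => '    XXXXX   value',                       # hypothetical tag
    continuation => ( ' ' x 12 ) . 'continued journal text',   # made-up text
);

# The three variants of the line-1074 regex mentioned in the thread.
my %patterns = (
    old      => qr/^\s{3,}(.*)/,
    current  => qr/^\s{4,}(.*)/,
    proposed => qr/^\s{12}(.*)/,
);

# Returns 1 if the named pattern would append the named line to JOURNAL.
sub is_appended {
    my ( $pattern_name, $sample_name ) = @_;
    return $samples{$sample_name} =~ $patterns{$pattern_name} ? 1 : 0;
}

for my $p (qw(old current proposed)) {
    for my $s (qw(medline pubmed short_tag continuation)) {
        printf "%-8s %-12s %s\n", $p, $s,
            is_appended( $p, $s ) ? 'appended to JOURNAL' : 'left alone';
    }
}
```

With these samples the old pattern swallows the PUBMED line, the
current one leaves PUBMED alone but still takes the 4-column tag, and
only the proposed pattern restricts itself to the 12-column
continuation line.
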
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Tuesday, July 26, 2005 11:05 AM
> To: n.haigh at sheffield.ac.uk
> Cc: 'bioperl-l'
> Subject: Re: [Bioperl-l] getting pubmed id from genbank files
> 
> 
> On Jul 26, 2005, at 7:49 AM, Nathan Haigh wrote:
> 
> > -- snip --
> > $VAR1 = bless( {
> >        'authors' => 'Clauss,M.J. and Mitchell-Olds,T.',
> >        'location' => 'Genetics 166 (3), 1419-1436 (2004) PUBMED   15082560',
> >        'title' => 'Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes',
> >        'tagname' => 'reference'
> >      }, 'Bio::Annotation::Reference' );
> > -- snip --
> 
> This is odd. The PUBMED line should not be concatenated with the 
> JOURNAL line. I wonder where this happens and why. Can you download the 
> record from NCBI (using the web interface, format 'GenBank', 'Send all 
> to file') and then parse it with Bio::SeqIO? If it works then the 
> problem must be in the code that deals with the HTTP-response.
> 
> 	-hilmar
> 
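
The dump above shows the PUBMED line folded into 'location'. For
anyone stuck on a genbank.pm without the fix, one possible stopgap is
to split it off again after parsing. This is only a sketch in plain
Perl: split_pubmed_from_location is a hypothetical helper, not part of
BioPerl, and it assumes the stray text always has the form
"PUBMED <digits>" at the end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): separate a trailing
# "PUBMED <id>" from a reference location string.
# Returns (location, pubmed_id); pubmed_id is undef if none was found.
sub split_pubmed_from_location {
    my ($location) = @_;
    if ( $location =~ /^(.*?)\s+PUBMED\s+(\d+)\s*$/ ) {
        return ( $1, $2 );
    }
    return ( $location, undef );
}

# The location string from the dump in this thread.
my ( $journal, $pmid ) = split_pubmed_from_location(
    'Genetics 166 (3), 1419-1436 (2004) PUBMED   15082560');

print "journal: $journal\n";
print "pubmed:  ", ( defined $pmid ? $pmid : 'none' ), "\n";
```

Updating to a genbank.pm that includes the fix is the better route;
this only papers over the symptom.
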
> 
> >
> > -----Original Message-----
> > From: Jason Stajich [mailto:jason.stajich at duke.edu]
> > Sent: 26 July 2005 15:28
> > To: Bioperl-l at portal.open-bio.org
> > Cc: Nathan Haigh
> > Subject: [Bioperl-l] getting pubmed id from genbank files
> >
> >
> >
> > Here is part of the synopsis in Bio::Seq:
> >
> >      foreach my $ref ( $ann->get_Annotations('reference') ) {
> >          print "Reference ",$ref->title,"\n";
> >      }
> >
> >   so do $ref->pubmed instead of $ref->title.
> >
> >
> > -jason
> >> On Jul 26, 2005, at 6:02 AM, Nathan Haigh wrote:
> >>
> >>> I want to be able to supply a list of GIs, retrieve the GenBank
> >>> files and parse out the PubMed IDs.
> >>>
> >>> I know I can do the first step of retrieving the GenBank files
> >>> directly, but how do I get the PubMed IDs? I've been playing around
> >>> with things and haven't yet found out if this can be done.
> >>>
> >>> Cheers,
> >>> Nathan
> >>>
> >>> ----------------------------------
> >>> Nathan Haigh
> >>> Bioinformatics PostDoctoral Research Associate
> >>>
> >>> Room B2 211
> >>> Department of Animal and Plant Sciences
> >>> University of Sheffield
> >>> Western Bank
> >>> Sheffield
> >>> S10 2TN
> >>>
> >>> Tel: +44 (0)114 22 20112
> >>> Mob: +44 (0)7742 533 569
> >>> Fax: +44 (0)114 22 20002
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at portal.open-bio.org
> >>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >>>
> >> --
> >> Jason Stajich
> >> http://www.duke.edu/~jes12
> >> jason.stajich -at- duke.edu
> >>
> >>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------