[Bioperl-l] getting pubmed id from genbank files

Tue Jul 26 16:07:16 EDT 2005

Nathan-

That sounds like you are using bioperl 1.4?  The error is in
Bio/SeqIO/genbank.pm  and was fixed by Jason in cvs version 1.102 of
that file.  However the current code still looks a bit odd to me.
Starting at line 1068 of the current cvs version (1.119) of genbank.pm
we have:

1068  if (/^\s{2}JOURNAL\s+(.*)/o) {
1069     push(@loc, $1);
1070     while ( defined($_ = $self->_readline) ) {
1071           # we only match when there are at least 4 spaces
1072           # there is probably a better way to match this
1073           # as it assumes that the describing tag is short enough
1074           /^\s{4,}(.*)/o && do { push(@loc, $1);
1075           next;
1076     };
1077     last;
1078  }
1079  $ref->location(join(' ', @loc));

This all deals with parsing the JOURNAL line, which lines 1068-69 handle
fine.  The while loop at 1070 looks at successive lines to find
continuation text to append to the JOURNAL line.  The regex at line 1074
used to read /^\s{3,}(.*)/o, which would not match if the next line after
JOURNAL began with '  MEDLINE', but would match '   PUBMED' (Nathan's
situation), causing that line to be appended to the JOURNAL line.  Is
there ever a JOURNAL entry with more than one line?  If so, shouldn't the
following lines always be untagged and thus indented 12 spaces, making
the regex /^\s{12}(.*)/o safer?  The current code would still append any
line indented by 4 or more spaces to the JOURNAL line, tagged or not (for
example, one whose tag is shorter than 6 characters and right-aligned
like PUBMED), and I don't think that's what we want.
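
To make the difference concrete, here is a minimal standalone sketch (it
does not use bioperl) that emulates the loop above with the three
candidate patterns.  The sub name journal_location, the multi-line
JOURNAL record and the NOTE tag are made up for illustration; only the
first record is taken from Nathan's output.

```perl
use strict;
use warnings;

# Emulate the JOURNAL loop from the excerpt above: take the text after the
# JOURNAL tag, then keep appending lines for as long as they match the
# continuation pattern, stopping at the first line that does not.
sub journal_location {
    my ($continuation_re, @lines) = @_;
    my $first = shift @lines;
    return unless defined $first && $first =~ /^\s{2}JOURNAL\s+(.*)/;
    my @loc = ($1);
    for my $line (@lines) {
        last unless $line =~ $continuation_re;
        push @loc, $1;
    }
    return join ' ', @loc;
}

my @patterns = (
    [ 'old      /^\s{3,}(.*)/' => qr/^\s{3,}(.*)/ ],
    [ 'current  /^\s{4,}(.*)/' => qr/^\s{4,}(.*)/ ],
    [ 'proposed /^\s{12}(.*)/' => qr/^\s{12}(.*)/ ],
);

my @records = (
    # Lines as implied by Nathan's dump: PUBMED is indented 3 spaces.
    [ "Nathan's record" => [
        '  JOURNAL   Genetics 166 (3), 1419-1436 (2004)',
        '   PUBMED   15082560',
    ] ],
    # Made-up record: untagged continuation text indented 12 spaces.
    [ 'hypothetical two-line JOURNAL' => [
        '  JOURNAL   Hypothetical Journal 1 (1),',
        '            1-10 (2005)',
        '   PUBMED   15082560',
    ] ],
    # Made-up tag (NOTE is not claimed to be a real GenBank keyword),
    # indented 4 spaces to show what the current pattern would accept.
    [ 'hypothetical short tag' => [
        '  JOURNAL   Hypothetical Journal 1 (1), 1-10 (2005)',
        '    NOTE    made-up tagged line',
    ] ],
);

for my $record (@records) {
    my ($name, $lines) = @$record;
    print "$name\n";
    for my $pattern (@patterns) {
        my ($label, $re) = @$pattern;
        printf "  %-24s => '%s'\n", $label, journal_location($re, @$lines);
    }
}
```

On Nathan's lines the old pattern reproduces the concatenated location
from his dump, while the other two stop after the JOURNAL text.  On the
made-up short-tag record only the 12-space pattern leaves the tagged
line alone.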

Barry

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
Sent: Tuesday, July 26, 2005 11:05 AM
To: n.haigh at sheffield.ac.uk
Cc: 'bioperl-l'
Subject: Re: [Bioperl-l] getting pubmed id from genbank files

On Jul 26, 2005, at 7:49 AM, Nathan Haigh wrote:

> -- snip --
> $VAR1 = bless( {
>        'authors' => 'Clauss,M.J. and Mitchell-Olds,T.',
>        'location' => 'Genetics 166 (3), 1419-1436 (2004) PUBMED   15082560',
>        'title' => 'Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes',
>        'tagname' => 'reference'
>      }, 'Bio::Annotation::Reference' );
> -- snip --

This is odd. The PUBMED line should not be concatenated with the 
JOURNAL line. I wonder where this happens and why. Can you download the 
record from NCBI (using the web interface, format 'GenBank', 'Send all 
to file') and then parse it with Bio::SeqIO? If it works then the 
problem must be in the code that deals with the HTTP-response.
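
A minimal sketch of that check, assuming bioperl is installed and the
record has been saved to a local file (the sub name
references_from_file and the file name in the usage comment are
placeholders).  It parses the file with Bio::SeqIO and prints each
reference's PUBMED id and JOURNAL location separately, so a
concatenated value is easy to spot.

```perl
use strict;
use warnings;

# Return [pubmed, location] pairs for every reference annotation found in
# a GenBank-format file.
sub references_from_file {
    my ($path) = @_;
    # Assumption: bioperl is installed.  It is loaded here, at call time,
    # so a missing install shows up as a run-time error from this sub.
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new(-file => $path, -format => 'genbank');
    my @refs;
    while (my $seq = $in->next_seq) {
        for my $ref ($seq->annotation->get_Annotations('reference')) {
            # scalar() keeps the pairs aligned even if a field is unset
            push @refs, [ scalar $ref->pubmed, scalar $ref->location ];
        }
    }
    return @refs;
}

# Usage: perl check_refs.pl record.gb   (both names are placeholders)
if (@ARGV) {
    for my $pair (references_from_file($ARGV[0])) {
        my ($pubmed, $location) = @$pair;
        printf "PUBMED:  %s\nJOURNAL: %s\n\n",
            $pubmed // '(none)', $location // '(none)';
    }
}
```

If the JOURNAL value printed here still ends in 'PUBMED ...', the
parser itself is concatenating the lines and the HTTP retrieval is not
to blame.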

	-hilmar

>
> -----Original Message-----
> From: Jason Stajich [mailto:jason.stajich at duke.edu]
> Sent: 26 July 2005 15:28
> To: Bioperl-l at portal.open-bio.org
> Cc: Nathan Haigh
> Subject: [Bioperl-l] getting pubmed id from genbank files
>
>
>
> Here is part of the synopsis in Bio::Seq:
>
>      foreach my $ref ( $ann->get_Annotations('reference') ) {
>          print "Reference ",$ref->title,"\n";
>      }
>
>   so do $ref->pubmed instead of $ref->title.
>
>
> -jason
>> On Jul 26, 2005, at 6:02 AM, Nathan Haigh wrote:
>>
>>> I want to be able to supply a list of GIs, retrieve the GenBank
>>> files and parse out the PubMed IDs.
>>>
>>> I know I can do the first steps of retrieving the GenBank files
>>> directly, but how do I get the PubMed IDs? I've been playing around
>>> with things and haven't yet found out if this can be done.
>>>
>>> Cheers,
>>> Nathan
>>>
>>> ----------------------------------
>>> Nathan Haigh
>>> Bioinformatics PostDoctoral Research Associate
>>>
>>> Room B2 211
>>> Department of Animal and Plant Sciences
>>> University of Sheffield
>>> Western Bank
>>> Sheffield
>>> S10 2TN
>>>
>>> Tel: +44 (0)114 22 20112
>>> Mob: +44 (0)7742 533 569
>>> Fax: +44 (0)114 22 20002
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>
>> --
>> Jason Stajich
>> http://www.duke.edu/~jes12
>> jason.stajich -at- duke.edu
>>
>>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------
