Bioperl: NCBI Entrez queries and Perl file handling

Simon Twigger simont@mcw.edu
Wed, 02 Jun 1999 09:47:52 -0500


Hi there,

Not strictly a bio-perl question, I know, but I didn't manage to get a
helpful answer from NCBI so I thought I'd ask here.

I have a Perl script and I'm using LWP to handle the retrieval of
sequences from NCBI. One problem I'm finding is that I don't always get
just the one sequence I request; I get a load of associated ones I don't
want. For example, I'm using the Entrez query below to try to get the
nucleotide sequence for L34657:

http://www4.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n&form=6&uid=L34657&dopt=f

What I get back is about 15+ sequences (introns and exons, etc.) with
L34657 at the end. How can I configure this to give me just the single
sequence I requested and not all the other associated introns and exons?
I tried things like dispmax=1 to no avail; with FastA format I always
get all of the sequences.

If I change the output to GenBank using the dopt=g option then I get
just the sequence I want. I could always just parse the GenBank format
instead, but I'd rather not have to unless it's really necessary. Is
there a simple way I can get just the one specified sequence and not
everything else, or am I missing some command line options here?

On a Perl note:
With all of this I'm trying to get the one sequence that NCBI have
flagged for each Unigene cluster as being the 'best quality' sequence
for that cluster (from their Hs.seq.uniq file). I actually have the
Hs.seq.uniq file, which contains the individual sequences, but this is a
60Mb file and I'm not entirely sure whether conventional file IO would
object to a file that large (or just go painfully slowly).

When Perl reads in a file, say using the normal code such as:

open (FILE, "Hs.seq.uniq") or die "Can't open file: $!";

while (<FILE>) {
	# deal with each line as it comes through
	# for example, to look for a specific Unigene ID
	# (dot escaped so it matches literally, not as a wildcard)
	if( /Hs\.12345/ ) {
		# deal with the unigene information
	}
}

close FILE;

does it keep the whole thing in memory as it reads through the file, or
does it just keep the current line (in $_) in memory? If it's the former
then I'm not sure reading in a 60Mb file is a good thing; if it's the
latter, then file size shouldn't have too many adverse effects other
than taking a while to go through the whole thing.
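For contrast, the slurping style I'm worried about would be something
like this, pulling every line into an array up front:

open (FILE, "Hs.seq.uniq") or die "Can't open file: $!";

# all 60Mb of the file ends up in @lines at once here
my @lines = <FILE>;

close FILE;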

I also thought of trying to grep out the sequence rather than going all
the way through the file sequentially, as grep seems pretty fast from
the command line.
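Something along these lines is what I had in mind, shelling out to the
system grep from inside the script (just a sketch - it only finds the
header lines, the sequence body would still need pulling out):

my $id = 'Hs.12345';

# -F treats the ID as a fixed string (the '.' would otherwise be a
# wildcard to grep); this captures the matching header lines only
my @headers = `grep -F '$id' Hs.seq.uniq`;
print @headers;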

Any suggestions on efficient ways to pull data out of large flat files
like this?
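One idea I've been toying with, in case it's sensible: scan the file
once to build a byte-offset index with tell(), then seek() straight to
the record I want afterwards. A sketch, assuming the Unigene ID sits on
the FastA '>' header line:

open (FILE, "Hs.seq.uniq") or die "Can't open file: $!";

my %offset;
my $pos = tell(FILE);    # byte offset of the line about to be read
while (<FILE>) {
	# remember where each record starts, keyed by its Unigene ID
	$offset{$1} = $pos if /^>.*?(Hs\.\d+)/;
	$pos = tell(FILE);
}

# later, jump straight to a record without rescanning the whole file
if (exists $offset{'Hs.12345'}) {
	seek(FILE, $offset{'Hs.12345'}, 0);    # 0 = from start of file
	my $header = <FILE>;
	print $header;
}

close FILE;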

Thanks for any help you can give me!

Simon.


-- 
--------------------------------------------------
Simon Twigger, Ph.D.
Laboratory for Genetic Research,
Cardiovascular Research Center,
Medical College of Wisconsin,
8701 Watertown Plank Road, 
Milwaukee, WI, 53226

http://legba.ifrc.mcw.edu/~simont/

tel. 414-456-4409               fax. 414-456-6516
--------------------------------------------------