[BioPython] reading large sequence files
Leighton Pritchard
L.Pritchard at scri.sari.ac.uk
Tue Sep 23 10:57:07 EDT 2003
Hi Karin,
Assuming you have a single .fna file containing the whole sequence (or one
file per chromosome/plasmid), you can use quick_FASTA_reader from
Bio.SeqUtils, along these lines:
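from Bio.SeqUtils import quick_FASTA_reader
# genome_file is the path to the .fna file (a filename, not an open handle)
# [0] takes the first (name, sequence) tuple in the file
name, seq = quick_FASTA_reader(genome_file)[0]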
quick_FASTA_reader returns a list of (name, sequence) tuples of plain
strings, without doing anything too clever or time-consuming like parsing
each sequence into a SeqRecord. It's *much* faster than using the
Fasta.Iterator class.
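To give an idea of what reading plain (name, sequence) tuples amounts to, here is a minimal sketch in pure Python. It is not the SeqUtils implementation: read_fasta_tuples is a hypothetical name used only for illustration, and the sketch assumes the file fits comfortably in memory and that the header is everything after the ">".

```python
def read_fasta_tuples(path):
    """Return a list of (name, sequence) string tuples from a FASTA file.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    records = []
    name = None      # header of the record currently being read
    chunks = []      # sequence lines collected for that record
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any
                if name is not None:
                    records.append((name, "".join(chunks)))
                name = line[1:]
                chunks = []
            elif name is not None:
                chunks.append(line)
    # Store the last record in the file
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

It would be used the same way as above, e.g. name, seq = read_fasta_tuples(genome_file)[0]. Collecting the lines in a list and joining them once per record avoids repeatedly concatenating onto a growing multi-megabase string.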
Hope this helps,
At 16:19 23/09/2003 +0200, Karin Lagesen wrote:
>Hi!
>
>I am working on whole (procaryote) genomes, and due to this I need to
>work with whole genomes at a time. I am for instance reading in the
>ecoli genome like this:
>
>ecoliDir = genomePath + ecoli
>ecoliFiles = os.listdir(ecoliDir)
>ecoliFile = fnmatch.filter(ecoliFiles, '*.fna')
>ecoliFile = open(os.path.join(ecoliDir, ecoliFile[0]), 'r')
>iterator = Fasta.Iterator(ecoliFile, parser)
>fileContents = iterator.next()
>ecoliSeq = fileContents.sequence
>ecoliFile.close()
>
>where genomePath tells the program where the genome files are, and
>ecoli just gives the genome name. All genome files end in .fna
>
>However, when I do it this way it takes a looooooooong time to read in
>the genome, it currently takes almost 10 minutes. Is there some way I
>can make this go faster? I need to work with altogether 13 genomes,
>and it would be nice if this part of it wasn't the bottleneck.
>
>
>Karin
>--
>Karin Lagesen, PhD student
>karin.lagesen at labmed.uio.no
Dr Leighton Pritchard AMRSC
PPI, Scottish Crop Research Institute
Invergowrie, Dundee, DD2 5DA, Scotland, UK
L.Pritchard at scri.sari.ac.uk
PGP key FEFC205C: http://www.keyserver.net http://pgp.mit.edu