[Biopython] biopython question
Wibowo Arindrarto
w.arindrarto at gmail.com
Wed Apr 4 18:26:29 EDT 2012
Hi Tychele,
Glad to hear that and thanks for attaching the code as well :).
Just one more heads up on the code, the trimming function assumes that
for any record sequence, there is only one matching primer sequence at
most. If by any random chance a sequence begins with two or more
primer sequences, then it will only trim the first primer sequence. So
if you still see some primer sequences left in the trimmed sequences,
this could be the case and you'll need to modify the code.
However, that seems unlikely ~ the current code should suffice.
cheers,
Bow
On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu> wrote:
> Hi Bow,
>
> Thank you! This works great. I have attached the final code to the email in case it may benefit others.
>
> Tychele
>
>
> ________________________________________
> From: Wibowo Arindrarto [w.arindrarto at gmail.com]
> Sent: Wednesday, April 04, 2012 2:05 PM
> To: Tychele Turner
> Cc: biopython at biopython.org
> Subject: Re: [Biopython] biopython question
>
> Hi Tychele,
>
> If I understood correctly, you have a list of primers stored in a file
> and you want to trim those primer sequences off your fastq sequences,
> correct? One way I could think of is to first store the primers in a
> list (since they will be used repeatedly to check every single fastq
> sequence).
>
> Here's the code:
>
> from Bio import SeqIO
>
> def trim_primers(records, 'primer_file_name'):
>
> # read the primers
> primer_list = []
> with open('primer_file_name', 'r') as source:
> for line in source:
> primer_list.append(line.strip())
>
> for record in records:
> # list to check if the sequence begins with any of the primers
> check = [record.seq.startswith(x) for x in primer_list]
> # if any of the primer is present in the beginning of the
> sequence, then we trim it off
> if any(check):
> # get index of primer that matches the beginning
> idx = check.index(True)
> len_primer = len(primer_list[idx])
> yield record[len_primer:]
> # otherwise just return the whole record
> else:
> yield record
>
> and then, you can use the function like so:
>
> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> trimmed_reads = trim_primers(original_reads, 'primer_file_name')
> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> print "Saved %i reads" % count
>
> I haven't tested the function, but I suppose that should do the trick.
>
> Hope that helps :),
> Bow
>
>
> On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
>> Hi,
>>
>> I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows:
>>
>> from Bio import SeqIO
>> def trim_primers(records, primer):
>> """Removes perfect primer sequences at start of reads.
>>
>> This is a generator function, the records argument should
>> be a list or iterator returning SeqRecord objects.
>> """
>> len_primer = len(primer) #cache this for later
>> for record in records:
>> if record.seq.startswith(primer):
>> yield record[len_primer:]
>> else:
>> yield record
>>
>> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> print "Saved %i reads" % count
>>
>>
>>
>>
>> My question is: Is there a way to loop through a primer file for instance instead of looking for only
>>
>> 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read.
>>
>> Primer file structured as:
>> GATGACGGTGT
>> GATGACGGTGA
>> GATGACGGCCT
>>
>> If you have any suggestions it would be greatly appreciated. Thanks.
>>
>> Tychele
>>
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
More information about the Biopython
mailing list