[Biopython] biopython question
Wibowo Arindrarto
w.arindrarto at gmail.com
Thu Apr 5 22:20:09 EDT 2012
Hi Mic,
I'm not familiar with picard, but it seems that this program detects
whole duplicate molecules instead of detecting whether a primer is
present in sequences (which may or may not be duplicates). Plus, it
doesn't do any removal ~ it only flags them. So I don't think the two
are comparable.
cheers,
Bow
On Fri, Apr 6, 2012 at 04:06, Mic <mictadlo at gmail.com> wrote:
> What is the difference to remove primer from the fastq file rather to use
> markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on
> an alignment?
>
> Would both ways deliver the same results?
>
> Thank you in advance.
>
>
> On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <w.arindrarto at gmail.com>
> wrote:
>>
>> Hi Tychele,
>>
>> Glad to hear that and thanks for attaching the code as well :).
>>
>> Just one more heads up on the code, the trimming function assumes that
>> for any record sequence, there is only one matching primer sequence at
>> most. If by any random chance a sequence begins with two or more
>> primer sequences, then it will only trim the first primer sequence. So
>> if you still see some primer sequences left in the trimmed sequences,
>> this could be the case and you'll need to modify the code.
>>
>> However, that seems unlikely ~ the current code should suffice.
>>
>> cheers,
>> Bow
>>
>>
>> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu> wrote:
>> > Hi Bow,
>> >
>> > Thank you! This works great. I have attached the final code to the email
>> > in case it may benefit others.
>> >
>> > Tychele
>> >
>> >
>> > ________________________________________
>> > From: Wibowo Arindrarto [w.arindrarto at gmail.com]
>> > Sent: Wednesday, April 04, 2012 2:05 PM
>> > To: Tychele Turner
>> > Cc: biopython at biopython.org
>> > Subject: Re: [Biopython] biopython question
>> >
>> > Hi Tychele,
>> >
>> > If I understood correctly, you have a list of primers stored in a file
>> > and you want to trim those primer sequences off your fastq sequences,
>> > correct? One way I could think of is to first store the primers in a
>> > list (since they will be used repeatedly to check every single fastq
>> > sequence).
>> >
>> > Here's the code:
>> >
>> > from Bio import SeqIO
>> >
>> > def trim_primers(records, 'primer_file_name'):
>> >
>> > # read the primers
>> > primer_list = []
>> > with open('primer_file_name', 'r') as source:
>> > for line in source:
>> > primer_list.append(line.strip())
>> >
>> > for record in records:
>> > # list to check if the sequence begins with any of the primers
>> > check = [record.seq.startswith(x) for x in primer_list]
>> > # if any of the primer is present in the beginning of the
>> > sequence, then we trim it off
>> > if any(check):
>> > # get index of primer that matches the beginning
>> > idx = check.index(True)
>> > len_primer = len(primer_list[idx])
>> > yield record[len_primer:]
>> > # otherwise just return the whole record
>> > else:
>> > yield record
>> >
>> > and then, you can use the function like so:
>> >
>> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
>> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> > print "Saved %i reads" % count
>> >
>> > I haven't tested the function, but I suppose that should do the trick.
>> >
>> > Hope that helps :),
>> > Bow
>> >
>> >
>> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
>> >> Hi,
>> >>
>> >> I have a question regarding one of the biopython capabilities. I would
>> >> like to trim primers off the end of reads in a fastq file and I found
>> >> wonderful documentation of how to do this on your website as follows:
>> >>
>> >> from Bio import SeqIO
>> >> def trim_primers(records, primer):
>> >> """Removes perfect primer sequences at start of reads.
>> >>
>> >> This is a generator function, the records argument should
>> >> be a list or iterator returning SeqRecord objects.
>> >> """
>> >> len_primer = len(primer) #cache this for later
>> >> for record in records:
>> >> if record.seq.startswith(primer):
>> >> yield record[len_primer:]
>> >> else:
>> >> yield record
>> >>
>> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> >> print "Saved %i reads" % count
>> >>
>> >>
>> >>
>> >>
>> >> My question is: Is there a way to loop through a primer file for
>> >> instance instead of looking for only
>> >>
>> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
>> >> from the start of its respective read.
>> >>
>> >> Primer file structured as:
>> >> GATGACGGTGT
>> >> GATGACGGTGA
>> >> GATGACGGCCT
>> >>
>> >> If you have any suggestions it would be greatly appreciated. Thanks.
>> >>
>> >> Tychele
>> >>
>> >>
>> >> _______________________________________________
>> >> Biopython mailing list - Biopython at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
More information about the Biopython
mailing list