[Biopython] biopython question

Thu Apr 5 22:20:09 EDT 2012

Hi Mic,

I'm not familiar with picard, but it seems that this program detects
whole duplicate molecules instead of detecting whether a primer is
present in sequences (which may or may not be duplicates). Plus, it
doesn't do any removal ~ it only flags them. So I don't think the two
are comparable.

cheers,
Bow

On Fri, Apr 6, 2012 at 04:06, Mic <mictadlo at gmail.com> wrote:
> What is the difference to remove primer from the fastq file rather to use
> markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on
> an alignment?
>
> Would both ways deliver the same results?
>
> Thank you in advance.
>
>
> On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <w.arindrarto at gmail.com>
> wrote:
>>
>> Hi Tychele,
>>
>> Glad to hear that and thanks for attaching the code as well :).
>>
>> Just one more heads up on the code, the trimming function assumes that
>> for any record sequence, there is only one matching primer sequence at
>> most. If by any random chance a sequence begins with two or more
>> primer sequences, then it will only trim the first primer sequence. So
>> if you still see some primer sequences left in the trimmed sequences,
>> this could be the case and you'll need to modify the code.
>>
>> However, that seems unlikely ~ the current code should suffice.
>>
>> cheers,
>> Bow
>>
>>
>> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu> wrote:
>> > Hi Bow,
>> >
>> > Thank you! This works great. I have attached the final code to the email
>> > in case it may benefit others.
>> >
>> > Tychele
>> >
>> >
>> > ________________________________________
>> > From: Wibowo Arindrarto [w.arindrarto at gmail.com]
>> > Sent: Wednesday, April 04, 2012 2:05 PM
>> > To: Tychele Turner
>> > Cc: biopython at biopython.org
>> > Subject: Re: [Biopython] biopython question
>> >
>> > Hi Tychele,
>> >
>> > If I understood correctly, you have a list of primers stored in a file
>> > and you want to trim those primer sequences off your fastq sequences,
>> > correct? One way I could think of is to first store the primers in a
>> > list (since they will be used repeatedly to check every single fastq
>> > sequence).
>> >
>> > Here's the code:
>> >
>> > from Bio import SeqIO
>> >
>> > def trim_primers(records, 'primer_file_name'):
>> >
>> >    # read the primers
>> >    primer_list = []
>> >    with open('primer_file_name', 'r') as source:
>> >      for line in source:
>> >        primer_list.append(line.strip())
>> >
>> >    for record in records:
>> >        # list to check if the sequence begins with any of the primers
>> >        check = [record.seq.startswith(x) for x in primer_list]
>> >        # if any of the primer is present in the beginning of the
>> > sequence, then we trim it off
>> >        if any(check):
>> >            # get index of primer that matches the beginning
>> >            idx = check.index(True)
>> >            len_primer = len(primer_list[idx])
>> >            yield record[len_primer:]
>> >        # otherwise just return the whole record
>> >        else:
>> >            yield record
>> >
>> > and then, you can use the function like so:
>> >
>> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
>> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> > print "Saved %i reads" % count
>> >
>> > I haven't tested the function, but I suppose that should do the trick.
>> >
>> > Hope that helps :),
>> > Bow
>> >
>> >
>> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
>> >> Hi,
>> >>
>> >> I have a question regarding one of the biopython capabilities. I would
>> >> like to trim primers off the end of reads in a fastq file and I found
>> >> wonderful documentation of how to do this on your website as follows:
>> >>
>> >> from Bio import SeqIO
>> >> def trim_primers(records, primer):
>> >>    """Removes perfect primer sequences at start of reads.
>> >>
>> >>    This is a generator function, the records argument should
>> >>    be a list or iterator returning SeqRecord objects.
>> >>    """
>> >>    len_primer = len(primer) #cache this for later
>> >>    for record in records:
>> >>        if record.seq.startswith(primer):
>> >>            yield record[len_primer:]
>> >>        else:
>> >>            yield record
>> >>
>> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> >> print "Saved %i reads" % count
>> >>
>> >>
>> >>
>> >>
>> >> My question is: Is there a way to loop through a primer file for
>> >> instance instead of looking for only
>> >>
>> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
>> >> from the start of its respective read.
>> >>
>> >> Primer file structured as:
>> >> GATGACGGTGT
>> >> GATGACGGTGA
>> >> GATGACGGCCT
>> >>
>> >> If you have any suggestions it would be greatly appreciated. Thanks.
>> >>
>> >> Tychele
>> >>
>> >>
>> >> _______________________________________________
>> >> Biopython mailing list  -  Biopython at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>