[Biopython] biopython question

Thu Apr 5 22:06:54 EDT 2012

What is the difference to remove primer from the fastq file rather to use
markDuplicates
http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on
an alignment?

Would both ways deliver the same results?

Thank you in advance.

On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <w.arindrarto at gmail.com>wrote:

> Hi Tychele,
>
> Glad to hear that and thanks for attaching the code as well :).
>
> Just one more heads up on the code, the trimming function assumes that
> for any record sequence, there is only one matching primer sequence at
> most. If by any random chance a sequence begins with two or more
> primer sequences, then it will only trim the first primer sequence. So
> if you still see some primer sequences left in the trimmed sequences,
> this could be the case and you'll need to modify the code.
>
> However, that seems unlikely ~ the current code should suffice.
>
> cheers,
> Bow
>
>
> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu> wrote:
> > Hi Bow,
> >
> > Thank you! This works great. I have attached the final code to the email
> in case it may benefit others.
> >
> > Tychele
> >
> >
> > ________________________________________
> > From: Wibowo Arindrarto [w.arindrarto at gmail.com]
> > Sent: Wednesday, April 04, 2012 2:05 PM
> > To: Tychele Turner
> > Cc: biopython at biopython.org
> > Subject: Re: [Biopython] biopython question
> >
> > Hi Tychele,
> >
> > If I understood correctly, you have a list of primers stored in a file
> > and you want to trim those primer sequences off your fastq sequences,
> > correct? One way I could think of is to first store the primers in a
> > list (since they will be used repeatedly to check every single fastq
> > sequence).
> >
> > Here's the code:
> >
> > from Bio import SeqIO
> >
> > def trim_primers(records, 'primer_file_name'):
> >
> >    # read the primers
> >    primer_list = []
> >    with open('primer_file_name', 'r') as source:
> >      for line in source:
> >        primer_list.append(line.strip())
> >
> >    for record in records:
> >        # list to check if the sequence begins with any of the primers
> >        check = [record.seq.startswith(x) for x in primer_list]
> >        # if any of the primer is present in the beginning of the
> > sequence, then we trim it off
> >        if any(check):
> >            # get index of primer that matches the beginning
> >            idx = check.index(True)
> >            len_primer = len(primer_list[idx])
> >            yield record[len_primer:]
> >        # otherwise just return the whole record
> >        else:
> >            yield record
> >
> > and then, you can use the function like so:
> >
> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> > print "Saved %i reads" % count
> >
> > I haven't tested the function, but I suppose that should do the trick.
> >
> > Hope that helps :),
> > Bow
> >
> >
> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
> >> Hi,
> >>
> >> I have a question regarding one of the biopython capabilities. I would
> like to trim primers off the end of reads in a fastq file and I found
> wonderful documentation of how to do this on your website as follows:
> >>
> >> from Bio import SeqIO
> >> def trim_primers(records, primer):
> >>    """Removes perfect primer sequences at start of reads.
> >>
> >>    This is a generator function, the records argument should
> >>    be a list or iterator returning SeqRecord objects.
> >>    """
> >>    len_primer = len(primer) #cache this for later
> >>    for record in records:
> >>        if record.seq.startswith(primer):
> >>            yield record[len_primer:]
> >>        else:
> >>            yield record
> >>
> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> >> print "Saved %i reads" % count
> >>
> >>
> >>
> >>
> >> My question is: Is there a way to loop through a primer file for
> instance instead of looking for only
> >>
> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
> from the start of its respective read.
> >>
> >> Primer file structured as:
> >> GATGACGGTGT
> >> GATGACGGTGA
> >> GATGACGGCCT
> >>
> >> If you have any suggestions it would be greatly appreciated. Thanks.
> >>
> >> Tychele
> >>
> >>
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>