[Biopython] biopython question
Mic
mictadlo at gmail.com
Thu Apr 5 22:06:54 EDT 2012
What is the difference to remove primer from the fastq file rather to use
markDuplicates
http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on
an alignment?
Would both ways deliver the same results?
Thank you in advance.
On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <w.arindrarto at gmail.com>wrote:
> Hi Tychele,
>
> Glad to hear that and thanks for attaching the code as well :).
>
> Just one more heads up on the code, the trimming function assumes that
> for any record sequence, there is only one matching primer sequence at
> most. If by any random chance a sequence begins with two or more
> primer sequences, then it will only trim the first primer sequence. So
> if you still see some primer sequences left in the trimmed sequences,
> this could be the case and you'll need to modify the code.
>
> However, that seems unlikely ~ the current code should suffice.
>
> cheers,
> Bow
>
>
> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu> wrote:
> > Hi Bow,
> >
> > Thank you! This works great. I have attached the final code to the email
> in case it may benefit others.
> >
> > Tychele
> >
> >
> > ________________________________________
> > From: Wibowo Arindrarto [w.arindrarto at gmail.com]
> > Sent: Wednesday, April 04, 2012 2:05 PM
> > To: Tychele Turner
> > Cc: biopython at biopython.org
> > Subject: Re: [Biopython] biopython question
> >
> > Hi Tychele,
> >
> > If I understood correctly, you have a list of primers stored in a file
> > and you want to trim those primer sequences off your fastq sequences,
> > correct? One way I could think of is to first store the primers in a
> > list (since they will be used repeatedly to check every single fastq
> > sequence).
> >
> > Here's the code:
> >
> > from Bio import SeqIO
> >
> > def trim_primers(records, 'primer_file_name'):
> >
> > # read the primers
> > primer_list = []
> > with open('primer_file_name', 'r') as source:
> > for line in source:
> > primer_list.append(line.strip())
> >
> > for record in records:
> > # list to check if the sequence begins with any of the primers
> > check = [record.seq.startswith(x) for x in primer_list]
> > # if any of the primer is present in the beginning of the
> > sequence, then we trim it off
> > if any(check):
> > # get index of primer that matches the beginning
> > idx = check.index(True)
> > len_primer = len(primer_list[idx])
> > yield record[len_primer:]
> > # otherwise just return the whole record
> > else:
> > yield record
> >
> > and then, you can use the function like so:
> >
> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> > print "Saved %i reads" % count
> >
> > I haven't tested the function, but I suppose that should do the trick.
> >
> > Hope that helps :),
> > Bow
> >
> >
> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
> >> Hi,
> >>
> >> I have a question regarding one of the biopython capabilities. I would
> like to trim primers off the end of reads in a fastq file and I found
> wonderful documentation of how to do this on your website as follows:
> >>
> >> from Bio import SeqIO
> >> def trim_primers(records, primer):
> >> """Removes perfect primer sequences at start of reads.
> >>
> >> This is a generator function, the records argument should
> >> be a list or iterator returning SeqRecord objects.
> >> """
> >> len_primer = len(primer) #cache this for later
> >> for record in records:
> >> if record.seq.startswith(primer):
> >> yield record[len_primer:]
> >> else:
> >> yield record
> >>
> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> >> print "Saved %i reads" % count
> >>
> >>
> >>
> >>
> >> My question is: Is there a way to loop through a primer file for
> instance instead of looking for only
> >>
> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
> from the start of its respective read.
> >>
> >> Primer file structured as:
> >> GATGACGGTGT
> >> GATGACGGTGA
> >> GATGACGGCCT
> >>
> >> If you have any suggestions it would be greatly appreciated. Thanks.
> >>
> >> Tychele
> >>
> >>
> >> _______________________________________________
> >> Biopython mailing list - Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
More information about the Biopython
mailing list