[Biopython] [Samtools-help] biopython question

Sat Apr 7 11:05:30 EDT 2012

Hi Mic,

I just saw your message regarding Mark Duplicates and the script Bow and I were discussing which recognizes and cleaves primers.

First off, I'm familiar with Mark Duplicates from Picard and I do use it for exome data.

However, in this instance I was looking at sequences coming from short amplicon sequencing. In this instance, marking duplicates is not appropriate because most of the reads will be duplicates due to the nature of the bench experiment (in contrast to shotgun sequencing where your looking at random fragments in which PCR artifacts arise in the PCR steps post-shearing). In my short amplicon sequence data, the read will start with the primer sequence and then extend to be a total length of 100 nucleotides. For this reason, I wanted to use a script which could recognize the primer and ultimately cleave that primer from the read so it would not go into the rest of the pipeline which would ultimately go to a variant calling program.

As for your last point of sending other software which cut adapters that's fine but I'm not cutting adapters I'm looking for primer sequences and cleaving those. Also, I thought that if Biopython already has such a nice setup to do this I would use that especially since python is quite efficient at this task. Hope this helps.

Tychele

From: Mic [mictadlo at gmail.com]
Sent: Friday, April 06, 2012 5:59 AM
To: Wibowo Arindrarto
Cc: samtools-help; biopython at biopython.org
Subject: Re: [Samtools-help] [Biopython] biopython question

Hi Bow,
You can remove duplicates in the input file or create a new output file. With the following commands you create an output file with no duplicates:

$ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam

$ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024   INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics

Are adapters and fragments the same?

I found the following software for adapter:
* TagDust - eliminate artifactual sequence from NGS data
http://www.biomedcentral.com/1471-2164/12/382
http://bioinformatics.oxfordjournals.org/content/25/21/2839.full
* FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?
title=Main_Page
* Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
* http://code.google.com/p/cutadapt/
* https://github.com/vsbuffalo/scythe
* http://code.google.com/p/biopieces/wiki/find_adaptor

Thank you in advance.

Cheers,

On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto <w.arindrarto at gmail.com<mailto:w.arindrarto at gmail.com>> wrote:
Hi Mic,

I'm not familiar with picard, but it seems that this program detects
whole duplicate molecules instead of detecting whether a primer is
present in sequences (which may or may not be duplicates). Plus, it
doesn't do any removal ~ it only flags them. So I don't think the two
are comparable.

cheers,
Bow

On Fri, Apr 6, 2012 at 04:06, Mic <mictadlo at gmail.com<mailto:mictadlo at gmail.com>> wrote:
> What is the difference to remove primer from the fastq file rather to use
> markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on
> an alignment?
>
> Would both ways deliver the same results?
>
> Thank you in advance.
>
>
> On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <w.arindrarto at gmail.com<mailto:w.arindrarto at gmail.com>>
> wrote:
>>
>> Hi Tychele,
>>
>> Glad to hear that and thanks for attaching the code as well :).
>>
>> Just one more heads up on the code, the trimming function assumes that
>> for any record sequence, there is only one matching primer sequence at
>> most. If by any random chance a sequence begins with two or more
>> primer sequences, then it will only trim the first primer sequence. So
>> if you still see some primer sequences left in the trimmed sequences,
>> this could be the case and you'll need to modify the code.
>>
>> However, that seems unlikely ~ the current code should suffice.
>>
>> cheers,
>> Bow
>>
>>
>> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu<mailto:tturne18 at jhmi.edu>> wrote:
>> > Hi Bow,
>> >
>> > Thank you! This works great. I have attached the final code to the email
>> > in case it may benefit others.
>> >
>> > Tychele
>> >
>> >
>> > ________________________________________
>> > From: Wibowo Arindrarto [w.arindrarto at gmail.com<mailto:w.arindrarto at gmail.com>]
>> > Sent: Wednesday, April 04, 2012 2:05 PM
>> > To: Tychele Turner
>> > Cc: biopython at biopython.org<mailto:biopython at biopython.org>
>> > Subject: Re: [Biopython] biopython question
>> >
>> > Hi Tychele,
>> >
>> > If I understood correctly, you have a list of primers stored in a file
>> > and you want to trim those primer sequences off your fastq sequences,
>> > correct? One way I could think of is to first store the primers in a
>> > list (since they will be used repeatedly to check every single fastq
>> > sequence).
>> >
>> > Here's the code:
>> >
>> > from Bio import SeqIO
>> >
>> > def trim_primers(records, 'primer_file_name'):
>> >
>> >    # read the primers
>> >    primer_list = []
>> >    with open('primer_file_name', 'r') as source:
>> >      for line in source:
>> >        primer_list.append(line.strip())
>> >
>> >    for record in records:
>> >        # list to check if the sequence begins with any of the primers
>> >        check = [record.seq.startswith(x) for x in primer_list]
>> >        # if any of the primer is present in the beginning of the
>> > sequence, then we trim it off
>> >        if any(check):
>> >            # get index of primer that matches the beginning
>> >            idx = check.index(True)
>> >            len_primer = len(primer_list[idx])
>> >            yield record[len_primer:]
>> >        # otherwise just return the whole record
>> >        else:
>> >            yield record
>> >
>> > and then, you can use the function like so:
>> >
>> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
>> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> > print "Saved %i reads" % count
>> >
>> > I haven't tested the function, but I suppose that should do the trick.
>> >
>> > Hope that helps :),
>> > Bow
>> >
>> >
>> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu<mailto:tturne18 at jhmi.edu>> wrote:
>> >> Hi,
>> >>
>> >> I have a question regarding one of the biopython capabilities. I would
>> >> like to trim primers off the end of reads in a fastq file and I found
>> >> wonderful documentation of how to do this on your website as follows:
>> >>
>> >> from Bio import SeqIO
>> >> def trim_primers(records, primer):
>> >>    """Removes perfect primer sequences at start of reads.
>> >>
>> >>    This is a generator function, the records argument should
>> >>    be a list or iterator returning SeqRecord objects.
>> >>    """
>> >>    len_primer = len(primer) #cache this for later
>> >>    for record in records:
>> >>        if record.seq.startswith(primer):
>> >>            yield record[len_primer:]
>> >>        else:
>> >>            yield record
>> >>
>> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> >> print "Saved %i reads" % count
>> >>
>> >>
>> >>
>> >>
>> >> My question is: Is there a way to loop through a primer file for
>> >> instance instead of looking for only
>> >>
>> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
>> >> from the start of its respective read.
>> >>
>> >> Primer file structured as:
>> >> GATGACGGTGT
>> >> GATGACGGTGA
>> >> GATGACGGCCT
>> >>
>> >> If you have any suggestions it would be greatly appreciated. Thanks.
>> >>
>> >> Tychele
>> >>
>> >>
>> >> _______________________________________________
>> >> Biopython mailing list  -  Biopython at lists.open-bio.org<mailto:Biopython at lists.open-bio.org>
>> >> http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org<mailto:Biopython at lists.open-bio.org>
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>