[Biopython] Paired-End Read Splitting & Joining

Fields, Christopher J cjfields at illinois.edu
Thu Nov 17 13:39:29 UTC 2011


On Nov 17, 2011, at 5:53 AM, Yaqiang Cao wrote:

> On 2011-11-17 at 17:21, Peter Cock wrote:
>> On Thu, Nov 17, 2011 at 3:20 AM, 曹亚强<caoyaqiang0410 at gmail.com>  wrote:
>>> Dear mailing list,
>>>        Hi, this is my first time asking a question on a mailing list, so
>>> please excuse any mistakes.
>>>        I'm new to Python and Biopython, with almost no practical
>>> programming experience in bioinformatics. Recently my work has come to
>>> involve transcriptome analysis and TopHat (http://tophat.cbcb.umd.edu/manual.html);
>>> the software needs paired-end sequences in two FASTQ files. So I wonder, can
>>> Biopython do the job in a convenient way? The paired-end file
>>> is too big to handle conveniently in *Galaxy*.
>>>        Please give me some guidance. Thanks.
>>> 
>>> Best wishes,
>>> Yaqiang Cao
>> Probably, yes.
>> 
>> So you have one large FASTQ file containing both parts of
>> each pair (say part one and part two, or they might be
>> labelled as the forward and reverse reads), and you want
>> to split this into two FASTQ files?
>> 
>> How are your reads named? The hard part is inferring which
>> part of the pair each read is. One common scheme used /1 and
>> /2 suffixes, but Illumina have changed this in their latest
>> pipeline and the part is now in the description instead.
>> 
>> Could you show us the first 6 reads (or so) from the big
>> FASTQ file?
>> 
>> Also, are there any single reads in your file, either never
>> paired or orphaned where one of a pair failed QC?
>> 
>> Peter
> Thanks for replying.
> 
> Yes, I have a .fastq file converted from .sra using fastq-dump, one of the NCBI SRA tools. The file is over 1 GB. I want to split it into two FASTQ files because TopHat requires two files of paired-end sequences. A screenshot of the first 20 lines of the .fastq file is in the attached picture:
> And because I'm new, I don't quite understand what you meant by "

One of my gripes about the SRA tools is that, by default, they dump each paired-end spot as one concatenated read; it's a nasty gotcha.  Pass the --split-files option to fastq-dump to dump these as paired-end reads; it writes the two mates to two separate files.
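
If re-running fastq-dump is not practical, the already-concatenated file can in principle be cut apart afterwards. What follows is only a minimal sketch under loud assumptions: every record occupies exactly four lines, and both mates have the same length, so each sequence and quality string can be cut in the middle. The function names (read_fastq, split_concatenated) and the /1 and /2 suffixes are illustrative choices, not something fastq-dump or Biopython prescribes. Biopython's FastqGeneralIterator (in Bio.SeqIO.QualityIO) could replace the small parser here; plain Python is used to keep the sketch dependency-free.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) tuples.

    Assumes strict four-line FASTQ records (no wrapped sequence lines).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def split_concatenated(in_handle, out1, out2):
    """Cut every record in half and write the halves to out1 and out2.

    Assumption: both mates have equal length, so the midpoint is the
    boundary. Only the read ID is kept; the description is dropped.
    Returns the number of records written to each output.
    """
    count = 0
    for title, seq, qual in read_fastq(in_handle):
        if len(seq) % 2 or len(seq) != len(qual):
            raise ValueError("cannot split record %r evenly" % title)
        read_id = title.split(None, 1)[0]
        half = len(seq) // 2
        out1.write("@%s/1\n%s\n+\n%s\n" % (read_id, seq[:half], qual[:half]))
        out2.write("@%s/2\n%s\n+\n%s\n" % (read_id, seq[half:], qual[half:]))
        count += 1
    return count


# Tiny in-memory demonstration with a made-up record.
demo = io.StringIO("@SRR0.1 demo length=8\nACGTTTGA\n+\nIIIIHHHH\n")
first, second = io.StringIO(), io.StringIO()
split_concatenated(demo, first, second)
```

With real data the three handles would be ordinary files opened with open(); the records are streamed one at a time, so a file over 1 GB does not need to fit in memory. If the two mates differ in length, the midpoint cut is wrong and --split-files remains the safer route.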

> Also, are there any single reads in your file, either never
> paired or orphaned where one of a pair failed QC?
> 
> "
> 
> All I have is in the screenshot. Its original NCBI SRA accession number is SRR100235.
> Thanks to the mailing list and to Peter.

These should all be matched pairs.
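
If you want to confirm that for yourself once you have two files, a quick check is to walk both files in step and compare read names. This is a minimal sketch, assuming strict four-line FASTQ records and that mates share the same ID apart from an optional /1 or /2 suffix; the helper names (fastq_ids, base_name, first_mismatch) are made up for illustration.

```python
import io
from itertools import zip_longest


def fastq_ids(handle):
    """Yield the read ID from every fourth line (the '@' header line).

    Assumes four lines per record, so header lines are found by line
    number rather than by looking for '@' (quality lines may start with it).
    """
    for line_number, line in enumerate(handle):
        if line_number % 4 == 0 and line.strip():
            fields = line[1:].split()
            yield fields[0] if fields else ""


def base_name(read_id):
    """Strip a trailing /1 or /2 so that mates compare equal."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def first_mismatch(handle1, handle2):
    """Return the 0-based index of the first unpaired record, or None.

    A record counts as unpaired if the names differ or if one file
    runs out of records before the other.
    """
    pairs = zip_longest(fastq_ids(handle1), fastq_ids(handle2))
    for index, (id1, id2) in enumerate(pairs):
        if id1 is None or id2 is None or base_name(id1) != base_name(id2):
            return index
    return None


# Tiny in-memory demonstration with made-up reads.
mates1 = io.StringIO("@r1/1\nAC\n+\nII\n@r2/1\nGT\n+\nII\n")
mates2 = io.StringIO("@r1/2\nTT\n+\nII\n@r2/2\nGG\n+\nII\n")
result = first_mismatch(mates1, mates2)
```

A result of None means every record in the first file lines up with its mate in the second; an integer points at the first record where the files fall out of step.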

> Best wishes,
> Yaqiang Cao


chris



