[Biopython] Paired-End Read Splitting & Joining
Fields, Christopher J
cjfields at illinois.edu
Thu Nov 17 13:39:29 UTC 2011
On Nov 17, 2011, at 5:53 AM, Yaqiang Cao wrote:
> On Nov 17, 2011, at 17:21, Peter Cock wrote:
>> On Thu, Nov 17, 2011 at 3:20 AM, 曹亚强<caoyaqiang0410 at gmail.com> wrote:
>>> Dear mailing list,
>>> Hi, this is my first time asking a question on a mailing list, so
>>> please excuse any mistakes.
>>> I'm new to Python and Biopython, with almost no practical
>>> programming experience in bioinformatics. Recently my work has come to
>>> involve transcriptome analysis with TopHat
>>> (http://tophat.cbcb.umd.edu/manual.html), which needs paired-end
>>> sequences in two FASTQ files. Can Biopython do this in a convenient
>>> way? The paired-end file is too big to handle conveniently in *Galaxy*.
>>> Please give me some guidance. Thanks.
>>>
>>> Best wishes,
>>> Yaqiang Cao
>> Probably, yes.
>>
>> So you have one large FASTQ file containing both parts of
>> each pair (say part one and part two, or they might be
>> labelled as the forward and reverse reads), and you want
>> to split this into two FASTQ files?
>>
>> How are your reads named? The hard part is inferring this,
>> one common scheme used /1 and /2 suffixes, but Illumina
>> have changed this in their latest pipeline and the part is
>> now in the description instead.
>>
>> Could you show us the first 6 reads (or so) from the big
>> FASTQ file?
>>
>> Also are there any single reads in your file, either never
>> paired or orphaned where one of a pair failed QC?
>>
>> Peter
> Thanks for replying.
>
> Yes, I have a .fastq file converted from .sra using fastq-dump, one of the NCBI SRA tools. The file is over 1 GB. I want to split it into two FASTQ files because TopHat requires paired-end sequences in two files. A screenshot of the first 20 lines of the .fastq file is attached:
> And because I'm new, I can't quite understand what you meant by "
This is one of my gripes about the SRA tools, that they (by default) dump paired-end data as one concatenated string; it's a nasty gotcha. You need to specify the --split-files option to fastq-dump to dump these as paired end, and this will split them into two files.
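
If re-dumping with --split-files is not practical, a rough fallback is to cut each concatenated record in half after the fact. The Python sketch below is only an illustration under stated assumptions, not part of Biopython or the SRA tools: it assumes plain four-line FASTQ records (no line wrapping), that both mates have the same length so the midpoint is the boundary, and it appends /1 and /2 suffixes to the read names. The function name split_concatenated_fastq is hypothetical.

```python
def split_concatenated_fastq(handle, out1, out2):
    """Split each concatenated record at its midpoint into two mates.

    Assumes four-line FASTQ records and equal-length mates; returns the
    number of records processed.
    """
    count = 0
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the "+" separator line
        qual = handle.readline().rstrip("\n")
        # Refuse anything that does not look like an even-length record.
        if not header.startswith("@") or len(seq) != len(qual) or len(seq) % 2:
            raise ValueError("unexpected record: %s" % header)
        half = len(seq) // 2
        name = header[1:].split()[0]  # read identifier without description
        out1.write("@%s/1\n%s\n+\n%s\n" % (name, seq[:half], qual[:half]))
        out2.write("@%s/2\n%s\n+\n%s\n" % (name, seq[half:], qual[half:]))
        count += 1
    return count
```

It would be called with three open file handles, for example the dumped file for reading and two new files (names of your choosing) for writing. If the mates are not the same length, the midpoint assumption fails and re-running fastq-dump with --split-files is the safer route.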
> Also are there any single reads in your file, either never
> paired or orphaned where one of a pair failed QC?
>
> "
>
> All I have is in the screenshot. Its original NCBI SRA accession number is SRR100235.
> Thanks to the mailing list and to Peter.
These should all be matched pairs.
> Best wishes,
> Yaqiang Cao
chris