[Biopython] Paired-End Read Splitting & Joining

Thu Nov 17 11:53:41 UTC 2011

On 17 November 2011 at 17:21, Peter Cock wrote:
> On Thu, Nov 17, 2011 at 3:20 AM, 曹亚强<caoyaqiang0410 at gmail.com>  wrote:
>> Dear mailing list,
>>         Hi, this is my first time asking a question on a mailing list, so
>> please excuse any mistakes.
>>         I'm new to Python and Biopython, with almost no practical
>> programming experience in bioinformatics. Recently my work has come to
>> involve transcriptome analysis and TopHat
>> (http://tophat.cbcb.umd.edu/manual.html). The software needs paired-end
>> sequences in two FASTQ files, so I wonder whether Biopython can do this
>> job in a convenient way? The paired-end file is too big to be handled
>> conveniently in *Galaxy*.
>>         Please give me some guidance. Thanks.
>>
>> Best wishes,
>> Yaqiang Cao
> Probably, yes.
>
> So you have one large FASTQ file containing both parts of
> each pair (say part one and part two, or they might be
> labelled as the forward and reverse reads), and you want
> to split this into two FASTQ files?
>
> How are your reads named? The hard part is inferring this.
> One common scheme uses /1 and /2 suffixes, but Illumina
> has changed this in its latest pipeline, and the part is
> now in the description instead.
>
> Could you show us the first 6 reads (or so) from the big
> FASTQ file?
>
> Also, are there any single reads in your file, either never
> paired or orphaned where one of a pair failed QC?
>
> Peter
Thanks for replying.

Yes, I have a .fastq file converted from .sra with fastq-dump, one of the
NCBI SRA tools. The file is over 1 GB. I want to split it into
two FASTQ files because TopHat requires the paired-end sequences in two
files. The attached screenshot shows the first 20 lines of the .fastq file.

Because I'm new to this, I don't quite understand what you mean by:

"Also are there any single reads in your file, either never
paired or orphaned where one of a pair failed QC?"

All I have is what is shown in the screenshot. The original NCBI SRA
accession number is SRR100235.
Thanks to the mailing list and to Peter.

Best wishes,
Yaqiang Cao
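
For readers following the thread, here is a minimal sketch of the split Peter describes. It assumes the big file is interleaved (each forward read immediately followed by its mate), that reads use the older /1 and /2 name suffixes, and that every record is exactly four lines. None of this has been confirmed for SRR100235, and the function names are made up for illustration. It uses only the standard library; Biopython's FastqGeneralIterator (in Bio.SeqIO.QualityIO) could stand in for read_fastq here.

```python
def read_fastq(handle):
    """Yield (title, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip()
        yield header.strip()[1:], seq, qual  # drop the leading '@'


def write_fastq(handle, record):
    """Write one (title, sequence, quality) tuple as a FASTQ record."""
    title, seq, qual = record
    handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


def read_name(title):
    """Return the read identifier: the title up to the first whitespace."""
    return title.split(None, 1)[0] if title.strip() else ""


def split_interleaved(in_handle, out1, out2, orphans):
    """Split an interleaved FASTQ into /1, /2 and unpaired reads.

    Assumes (hypothetically) that a /2 read directly follows its /1 mate.
    Returns (number_of_pairs, number_of_orphans).
    """
    n_pairs = n_orphans = 0
    pending = None  # a /1 read still waiting for its /2 mate
    for record in read_fastq(in_handle):
        name = read_name(record[0])
        if pending is not None:
            first_name = read_name(pending[0])
            if name.endswith("/2") and first_name[:-2] == name[:-2]:
                write_fastq(out1, pending)
                write_fastq(out2, record)
                n_pairs += 1
                pending = None
                continue
            # The waiting /1 read has no mate, so it is an orphan.
            write_fastq(orphans, pending)
            n_orphans += 1
            pending = None
        if name.endswith("/1"):
            pending = record
        else:
            # A /2 without its /1, or a read with no suffix at all.
            write_fastq(orphans, record)
            n_orphans += 1
    if pending is not None:
        write_fastq(orphans, pending)
        n_orphans += 1
    return n_pairs, n_orphans
```

To use it on real data, open the input for reading and three outputs for writing (file names are up to you) and pass the four handles to split_interleaved. Because the file is streamed one record at a time, memory use stays small even for files over 1 GB. If the reads turn out to follow the newer Illumina naming, with the pair member in the description rather than a suffix, the suffix test would need to change.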

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot-2011-11-17 19:49:04.png
Type: image/png
Size: 122983 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20111117/76e7bb55/attachment-0001.png>