[Bioperl-l] Output a subset of FASTA data from a single large file

Chris Fields cjfields at uiuc.edu
Fri Jun 9 19:05:44 UTC 2006


There's information in the HOWTOs:

http://www.bioperl.org/wiki/HOWTO:Flat_databases

http://www.bioperl.org/wiki/HOWTO:OBDA

Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
('fasta' format I/O) and this is what I got as output:

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;



i.e. an empty sequence, which is what I guessed might happen, though I
thought it might pick up the second '>' and the full sequence there.  Since
the sequence is tossed you'll have to prescreen your sequence input stream
by either concatenating the two '>' lines together or screening for the
relevant information you want to retain.  You can try maybe getting this
info into Bio::Seq objects and writing to a Bio::SeqIO stream (to file or
file handle).

Once you have that set up, the HOWTO tells you how to set up custom or
secondary namespaces, so you can use a regex to parse out the information
for a primary or secondary keys:

http://www.bioperl.org/wiki/HOWTO:Flat_databases#Secondary_or_custom_namespa
ces

then you could select specific sequences this way (per the HOWTO):

$db->secondary_namespaces("GI");
my $acc_seq = $db->get_Seq_by_id("P84139");
my $gi_seq = $db->get_Seq_by_secondary("GI",443893);

or for multiple sequences (judging from the POD):

my $acc_seqio = $db->get_Stream_by_id(@ids);

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Michael Oldham
> Sent: Thursday, June 08, 2006 9:08 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Output a subset of FASTA data from a single large
> file
> 
> Dear all,
> 
> I am a total Bioperl newbie struggling to accomplish a conceptually simple
> task.  I have a single large fasta file containing about 200,000 probe
> sequences (from an Affymetrix microarray), each of which looks like this:
> 
> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; 
> >Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
> 
> What I would like to do is extract from this file a subset of ~130,800
> probes (both the header and the sequence) and output this subset into a
> new
> fasta file.  These 130,800 probes correspond to 8,175 probe set IDs
> ("1138_at" is the probe set ID in the header listed above); I have these
> 8,175 IDs listed in a separate file.  I *think* that I managed to create
> an
> index of all 200,000 probes in the original fasta file using the following
> script:
> 
> #!/usr/bin/perl -w
> 
>  # script 1: create the index
> 
>  use Bio::Index::Fasta;
>  use strict;
>  my $Index_File_Name = shift;
>  my $inx = Bio::Index::Fasta->new(
>      -filename => $Index_File_Name,
>      -write_flag => 1);
>  $inx->make_index(@ARGV);
> 
> I'm not sure if this is the most sensible approach, and even if it is, I'm
> not sure what to do next.  Any help would be greatly appreciated!
> 
> Many thanks,
> Mike O.
> 
> 
> 
> 
> --
> No virus found in this outgoing message.
> Checked by AVG Free Edition.
> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list