[Bioperl-l] R: Re: fasta file parser

Chris Fields cjfields at uiuc.edu
Tue Jul 22 15:01:11 UTC 2008


On Jul 22, 2008, at 9:17 AM, ste.ghi at libero.it wrote:

> guys, thanks for your ready replies
> but now I'm a little confused...
>
> the first 'while' was just to avoid duplicates in the list...
> does your code create a cleared list? what should I write where you  
> say "# do whatever else here..."
>
> thanks

Up to you; you can do absolutely nothing if you prefer.  The script is  
simply a demo of what could be done based upon what we assume you want  
using the code you provided.

Sendu's approach is also quite applicable (remember, this is perl, so  
TIMTOWTDI).  For instance, the way I would go about this myself: start  
by creating a FASTA database (flatfile or otherwise) of the sequences  
using the ID as the primary key (always a good practice IMHO).  After  
the database is set up, create a lookup table by mapping the list of  
IDs from the database to a hash, similarly to what Sendu demonstrated;  
I believe all DB implementations in BioPerl allow you to retrieve all  
seq IDs in the database as an array.

Using that you would then delete any matching IDs in the file from the  
lookup hash; the leftovers (i.e. those NOT found in the file) would be  
used to retrieve the sequences from the database.  Avoids iterating  
over every sequence (unless you count database creation, of course,  
but again that could be used for other purposes as well).

chris

>
> ----Messaggio originale----
> Da: cjfields at uiuc.edu
> Data: 22/07/2008 15.00
> A: "ste.ghi at libero.it"<ste.ghi at libero.it>
> Cc: <bioperl-l at lists.open-bio.org>
> Ogg: Re: [Bioperl-l] fasta file parser
>
> Use exists() directly on the %seen lookup hash for your test.  Also,  
> use chomp up front when creating %seen to get rid of newlines.
>
> # assuming you have one ID per line...
> while (my $line = <LIST>) {
>    chomp $line;
>    $seen{$line}++;
>    # do whatever else here...
> }
>
> close LIST;
>
> my $newSeqFileName = Bio::SeqIO->new(-file=> ">>INFILE", - 
> format=>'fasta');
> while (my $query = $SeqFileName->next_seq()) {
>     my $id = $query->id;
>     if ( exists( $seen{$id} ) ) {
>         print "$id matched with $seen{$id} listed in $ARGV[1]:  
> skipped!\n";
>         next;
>   }
>     elsif ( $elem ne $query->id ) {
>         if ( exists( $seen2{$id}++ ) ) {;
>             print "$id matched with $seen2{$id}, already found:  
> skipped!\n";
>             next;
>         }
>         $newSeqFileName->write_seq($query);
>     }
> }
>
> chris
>
> On Jul 22, 2008, at 6:28 AM, ste.ghi at libero.it wrote:
>
>> Dear all,
>> I'm trying to write a script wich, given a file containing a list of
>> IDs, parses a big fasta file returning only sequences NOT listed in  
>> the list-
>> file.
>>
>> To do so, I first create an array with the IDs to be excluded:
>>
>> [...]
>>
>> #Load LIST content in an array; avoids duplicates
>> while (my $line = <LIST>) {
>>
>>
>>    push(@array1,$line );
>>
>>    foreach my $uniq ( @array1 ){
>>
>> 	next if $seen
>> { $uniq }++;
>>
>> 	push @unique, $uniq;
>>
>>    }
>> }
>>
>> then, process the fasta file in
>> this way (NOT WORKING).
>>
>> #Fasta file processing
>> my $newSeqFileName  = Bio::
>> SeqIO->new(-file=> ">>INFILE", -format=>'fasta');
>> while (my $query =
>> $SeqFileName->next_seq()) {
>>       foreach my $elem(@unique){
>> 		chomp $elem;
>>
>>
>>       	if ($elem eq $query->id) {
>>
>>            		print $query->id." matched
>> with $elem listed in $ARGV[1]: skipped!\n";
>>                        next;
>> 		}
>> elsif ($elem ne $query->id) {
>>       			next if $seen2{ $query->id }++;
>> 			
>> $newSeqFileName->write_seq($query);
>>
>>            	}
>>
>>        }
>> }
>>
>>
>> ...
>> in this way I get only an exact copy of the input file....where am  
>> I wrong?
>>
>> Thanks a lot for your kind help!
>> Stefano
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>
>
>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign







More information about the Bioperl-l mailing list