any other way? - Perl Programming

Posted by rick (rick), 14 November 2007

Hi Graham, many thanks for your wonderful help.

I have a file containing an amino acid sequence, like:

MERIKELRDLMSQSRTREILTKTTVDHMAIIKKYTSGRQEKNPALRMKWMMAMKYPITADKRIIEMIPER
NEQGQTLWSKTNDAGSDRVMVSPLAVTWWNRNGPPTSTVHYPKVYKTYFEKVERLKHGTFGPVHFRNQVK
IRRRVDTNPGHADLSAKEAQDVIMEVVFPNEVGARILTSESQLTITKEKKEELQDCKIAPLMVAYMLERE

I need to count each individual character, so I used this script:

#!/usr/bin/perl -w
open(FILE,"<FASTA1.fa") or die "cant open file:$!\n";
$sequence = '';
@data=<FILE>;
foreach $line (@data){
if ($line =~m/^\s*$/) {
next;
} elsif($line =~m/^\s*#/) {
next;
} elsif($line =~m/^>/) {
next;
} else {
$sequence .= $line;
}
}

$sequence =~ s/\s//g;

@sq=split(//,$sequence);
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$count_of_Q=0;
$count_of_S=0;
$count_of_R=0;
$count_of_W=0;
$count_of_Y=0;
$count_of_I=0;
$count_of_H=0;
$count_of_D=0;
$count_of_E=0;
$count_of_M=0;
$count_of_N=0;
$count_of_P=0;
$count_of_V=0;
$count_of_F=0;
$count_of_K=0;
$count_of_L=0;
$errors = 0;

foreach $char(@sq) {

if ( $char eq 'A' ) {
++$count_of_A;
} elsif ( $char eq 'C' ) {
++$count_of_C;
} elsif ( $char eq 'G' ) {
++$count_of_G;
} elsif ( $char eq 'T' ) {
++$count_of_T;
}
elsif ( $char eq 'S' ) {
++$count_of_S;
}
elsif ( $char eq 'D' ) {
++$count_of_D;
}
elsif ( $char eq 'P' ) {
++$count_of_P;
}
elsif ( $char eq 'V' ) {
++$count_of_V;
}
elsif ( $char eq 'L' ) {
++$count_of_L;
}
elsif ( $char eq 'I' ) {
++$count_of_I;
}
elsif ( $char eq 'M' ) {
++$count_of_M;
}
elsif ( $char eq 'F' ) {
++$count_of_F;
}
elsif ( $char eq 'Y' ) {
++$count_of_Y;
}
elsif ( $char eq 'W' ) {
++$count_of_W;
}
elsif ( $char eq 'H' ) {
++$count_of_H;
}
elsif ( $char eq 'K' ) {
++$count_of_K;
}
elsif ( $char eq 'R' ) {
++$count_of_R;
}
elsif ( $char eq 'Q' ) {
++$count_of_Q;
}
elsif ( $char eq 'N' ) {
++$count_of_N;
}
elsif ( $char eq 'E' ) {
++$count_of_E;
}
else {
print "!!!!!!!! Error - I don\'t recognize this char: $char\n";
++$errors;
}
}

print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "Q = $count_of_Q\n";
print "S = $count_of_S\n";
print "R = $count_of_R\n";
print "W = $count_of_W\n";
print "Y = $count_of_Y\n";
print "I = $count_of_I\n";
print "H = $count_of_H\n";
print "D = $count_of_D\n";
print "E = $count_of_E\n";
print "L = $count_of_L\n";
print "N = $count_of_N\n";
print "P = $count_of_P\n";
print "V = $count_of_V\n";
print "F = $count_of_F\n";
print "K = $count_of_K\n";
print "M = $count_of_M\n";
print "errors = $errors\n";

close FILE;

This works fine, but it is really long. Is there any way to do the same job with a much smaller script?

Posted by KevinAD (KevinAD), 14 November 2007

I am not sure this will be any faster than your existing code, but you can try:

Code:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @SEQ = qw(A C D E F G H I K L M N P Q R S T V W Y);
my %counts = map{ $_ => 0} @SEQ;
my %errors = ();
open(FILE,"<FASTA1.fa") or die "cant open file:$!\n";
while ( <FILE> ) {
if (m/^\s*$/ ) {
next;
}
elsif ( m/^\s*#/ ) {
next;
}
elsif ( m/^>/ ) {
next;
}
else {
while (/(.)/g) {
my $c = $1;
if ( exists $counts{$c} ) {
$counts{$c}++;
}
else {
$errors{$c}++;
}
}
}
}

close FILE;
print Dumper \%counts, \%errors;

I used Data::Dumper only for convenience; you can print the results however you want.
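For plain-text output instead of Data::Dumper, the two hashes could be printed with a short loop. This is only a minimal sketch: it assumes hashes shaped like %counts and %errors in the script above, and the helper name format_counts and the sample values are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a hash of letter => count into
# "A = 3" style lines, sorted by letter.
sub format_counts {
    my ($counts) = @_;
    return map { "$_ = $counts->{$_}\n" } sort keys %$counts;
}

# Sample data for illustration only (not real results).
my %counts = (A => 3, C => 0, M => 5);
my %errors = (X => 1);

print format_counts(\%counts);
print "Unrecognised characters:\n", format_counts(\%errors) if %errors;
```

Sorting the keys keeps the report in a stable order from run to run, which a plain hash traversal would not guarantee.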

Posted by deep (deep), 24 November 2007

You can use the wonderful "substr".
Here is some partial code I have used to do the same thing you are doing.
if (/[ARNDCQEGHILKMFPSTWYVJU]+[\n\r]+/) # regex to capture peptide sequences
{
$pep_seq = $&;
#print "$&";
$len = length($pep_seq); # length of the peptide (the match includes the trailing line ending)
#print "==>$len";

#print "Len==>$len";

for ($i = 0; $i < $len; $i++) # loop that uses the length of the peptide as the sentinel value
{
$temp = substr $pep_seq, $i, 1; # use the Perl function substr to capture each AA in the current peptide sequence
# ... process $temp here ...
}
}

Ignore the commented-out parts; that's just the way I test my scripts. But this should get you going. My regex might be different from what you need, so you can change it accordingly.

Oops, I just realized that you will need to compare "temp" with the 20 AA. I "had" to do it this way as I am doing various processing on each AA.
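The comparison of each character with the 20 AA could be done with a hash lookup rather than twenty separate tests. The following is a rough sketch, not the code from the post above: the helper name count_residues and the sample sequence are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The 20 standard amino acid one-letter codes.
my %is_aa = map { $_ => 1 } qw(A C D E F G H I K L M N P Q R S T V W Y);

# Hypothetical helper: count residues in one sequence string,
# walking it one character at a time with substr.
sub count_residues {
    my ($seq) = @_;
    my (%counts, %unknown);
    for my $i (0 .. length($seq) - 1) {
        my $aa = substr $seq, $i, 1;
        if ($is_aa{$aa}) { $counts{$aa}++ }
        else             { $unknown{$aa}++ }
    }
    return (\%counts, \%unknown);
}

# Sample sequence for illustration only.
my ($counts, $unknown) = count_residues("MERIKELRDLMSQ");
print "$_ = $counts->{$_}\n" for sort keys %$counts;
```

Anything that is not one of the 20 letters, including a stray line ending, ends up in the second hash instead of being silently dropped.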

Posted by KevinAD (KevinAD), 24 November 2007

This can slow down your perl scripts:

$pep_seq = $&;

For reasons I am not sure of, if you use $& (or even worse: $` or $') they can slow down perl scripts. Once any of them is used in one regexp, Perl is forced to set them up for every regexp in the program, and this can have an impact on performance.

Posted by admin (Graham Ellis), 25 November 2007

on 11/24/07 at 22:48:58, KevinAD wrote:

This can slow down your perl scripts:

$pep_seq = $&;

For reasons I am not sure of, if you use $& (or even worse: $` or $') they can slow down perl scripts.

My understanding is that if you match against a string that is (say) 8 Mbytes in length, then every single match in your program will write 8 Mbytes of temporary variables if there is any mention of $& $' or $`.

See longer article ....

http://www.wellho.net/mouth/1444_Using-English-can-slow-you-right-down-.html
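One common way to sidestep the issue is to wrap the part of the pattern you need in parentheses and read $1 instead of $&. A minimal sketch, assuming the character class from the regex earlier in this thread; the helper name first_peptide is made up, and unlike the earlier regex this version does not require or capture the trailing line ending.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first run of amino acid letters in a
# line, or undef if there is none. The parentheses capture into $1, so
# the code never has to read the whole-match variable.
sub first_peptide {
    my ($line) = @_;
    return $line =~ /([ARNDCQEGHILKMFPSTWYVJU]+)/ ? $1 : undef;
}

# Sample line for illustration only.
my $pep_seq = first_peptide("MERIKELRDLMSQ\n");
print "peptide: $pep_seq, length: ", length($pep_seq), "\n";
```

Because only the captured group is copied, the rest of the program's regexps are unaffected.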

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.