How to make this work faster? - Perl Programming

Posted by revtopo (revtopo), 30 October 2007

Hi all,

I am thoroughly confused. I have some code that counts the number of matches of RNA sequences against a genome, but it takes days to get through the complete genome. I have attached the code below. Please suggest any way of improving the efficiency.

#!/usr/bin/perl -w
use DBI;
use strict;
use Bio::SeqIO;
use Bio::Perl;
use GD::Graph::bars;

$|=1;

my $window_size = 500;
my $step_size = 500;

my $connect;
my $database = 'Cloned_RNA_2_dev';
my $user ='root';
my $pass = '';
my $host = 'localhost';
my %sequences;
my $total_length;
my $length;
open (RESULT, ">result.csv");
my $dsn = qq(DBI:mysql:database=$database; host=$host);

$connect = DBI->connect($dsn, $user, $pass, {PrintError => 1}) or die $DBI::errstr;

#querying the database
my $query = qq/select count(c.Rna_sequence_sequence)
from Cloned_rna c, Hsp h
where c.Rna_sequence_sequence=h.Rna_sequence_sequence
and c.Project_idproject =72
and h.Sequence_db_idSequence_db =1133
and h.accession =?
and h.hit_start >= ?
and h.hit_end<=?
group by h.accession/;

my $sql = $connect->prepare($query);
my %window_counts ;
print RESULT "\"Chromosome\"\t\"start\"\t\"end\"\t\"counts\"\n" ;
my $genome = Bio::SeqIO->new(-file=>'/home/shared-projects/sequence-dbs/TAIR7/TAIR7_nuclear_genome.fna', -format=>'fasta');

while (my $seq = $genome->next_seq()) {
    $length = $seq->length();
    my $id = $seq->id();
    #print RESULT "Windows on chromosome $id\n" ;
    warn "Getting windows on chromosome $id\n";
    #walking along the genome
    for (my $i = 1; $i <= ($length - $window_size); $i += $step_size) {
        my $window_start = $i;
        my $window_end = $i + $window_size - 1;

        my $sucess = ($sql->execute($id, $window_start, $window_end));
        die "query failed!\n $query \n" unless $sucess;
        my $window_id = "$id,$window_start";
        if (my @result = $sql->fetchrow_array()) {
            ($window_counts{$window_id}) = @result; # has the counts as the value of the hash

            warn "\$window_counts{$window_id}='$window_counts{$window_id}'\n";

            print RESULT "$id\t$window_start\t$window_end\t$window_counts{$window_id}\n";
        }

        # exit if $i > 10000;
    }
}

Any help in improving this would be appreciated.

Posted by george_Ball (george), 30 October 2007

Well, I'm no database guru, but I'd probably bet money that the problem is in the database rather than in this code, which does not seem to do anything sophisticated. Of course, the way to attack this is to use a profiler, such as Dev::Profiler, which is written up nicely in Brian Foy's recent book "Mastering Perl". My guess is that you'll find that the program is spending most of its time in the call to execute() the SQL statement.

He also discusses a DBI::Profiler, which can be used to investigate database-related profiling in more detail.
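Short of a full profiler, a crude way to test that guess is to accumulate wall-clock time around the suspect call. The sketch below is not one of the profilers mentioned above; it swaps in the core Time::HiRes module, and the helper names timed() and report() are made up for illustration. The workload is a stand-in loop so that the example runs without a database.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %elapsed;   # label => total seconds spent
my %calls;     # label => number of calls

# Hypothetical helper: run $code and add its wall-clock time to $label's total.
sub timed {
    my ($label, $code) = @_;
    my $t0     = time();
    my @result = $code->();
    $elapsed{$label} += time() - $t0;
    $calls{$label}++;
    return wantarray ? @result : $result[0];
}

# Hypothetical helper: print one line per label to STDERR, slowest first.
sub report {
    for my $label (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
        printf STDERR "%-10s %8d calls %10.3f s\n",
            $label, $calls{$label}, $elapsed{$label};
    }
}

# Stand-in workload. In the posted script the wrapped call would be:
#   timed(execute => sub { $sql->execute($id, $window_start, $window_end) });
my $sum = timed(loop => sub { my $s = 0; $s += $_ for 1 .. 100_000; $s });
report();
```

If the total for execute dwarfs everything else, the time is going to the database rather than to the Perl loop.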

Posted by KevinAD (KevinAD), 30 October 2007

How big is the genome file?

Posted by revtopo (revtopo), 30 October 2007

I can't give the exact length of the genome, but chromosome 1 is 3056785 and chromosome 2 is 1908764, and chromosomes 3, 4 and 5 are of similar length to chromosomes 1 and 2.

Posted by KevinAD (KevinAD), 31 October 2007

Do you really need to do this?

my $sucess=($sql->execute($id, $window_start, $window_end));
die "query failed!\n $query \n" unless $sucess

Posted by revtopo (revtopo), 31 October 2007

Yes, of course, to know which chromosome they are parsed from and their positions.

Posted by george_Ball (george), 31 October 2007

I have no background in biology or Bio::Perl, so I have no understanding of the numbers you give here...

But look at the code, and you see nested loops. What is the *actual* value (or at least the order of magnitude) of the number of iterations through the while loop (i.e. how many times will $seq = $genome->next_seq() return a non-undef value)?

What is the *actual* value (or at least the order of magnitude) of

($length-$window_size) / $step_size

If we take the first of these as A and the second as B, then you are issuing the select statement to the database A*B times.
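As a rough worked example, B can be computed directly from the loop bounds. The sketch below uses the two chromosome lengths quoted earlier in the thread and the script's window and step size of 500; the function name windows_per_sequence() is made up for illustration. With those figures, chromosome 1 accounts for 6113 statements and chromosome 2 for 3817.

```perl
use strict;
use warnings;

# Number of times the inner loop of the posted script runs for one sequence:
#   for ($i = 1; $i <= $length - $window_size; $i += $step_size)
sub windows_per_sequence {
    my ($length, $window_size, $step_size) = @_;
    my $last_start = $length - $window_size;   # largest permitted value of $i
    return 0 if $last_start < 1;
    return int(($last_start - 1) / $step_size) + 1;
}

my ($window_size, $step_size) = (500, 500);

# Lengths as quoted in the thread; the other chromosomes are not given exactly.
my %length = (chr1 => 3_056_785, chr2 => 1_908_764);

my $total = 0;
for my $id (sort keys %length) {
    my $n = windows_per_sequence($length{$id}, $window_size, $step_size);
    print "$id: $n queries\n";
    $total += $n;
}
print "total: $total queries\n";
```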

If these are very large numbers then you are hitting the database a *very* large number of times. That may well be the cause of your problem, but it's impossible to say without having some remote idea what these numbers are.

And, as I said earlier, problems like this are best approached through use of a profiling tool.
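If the number of statements does turn out to be the problem, one possible way to cut it down is to fetch every hit for a chromosome with a single query and do the windowing in Perl. The sketch below shows only the windowing step, on made-up rows; the function name count_hits_per_window() and the sample data are hypothetical. It assumes non-overlapping windows (step size equal to window size, as in the posted script), 1-based coordinates with start <= end, and one [start, end] pair per row that the original query would have counted.

```perl
use strict;
use warnings;

# Count hits that lie entirely inside a window, in one pass over the hits.
# $hits is a reference to a list of [start, end] pairs for one sequence.
sub count_hits_per_window {
    my ($hits, $length, $window_size) = @_;
    my %count;    # window start => number of hits fully inside that window
    for my $hit (@$hits) {
        my ($start, $end) = @$hit;
        # Start of the window that contains the hit's first base.
        my $window_start = int(($start - 1) / $window_size) * $window_size + 1;
        my $window_end   = $window_start + $window_size - 1;
        next if $window_start > $length - $window_size;  # window the original loop never visits
        next if $end > $window_end;                      # hit spills into the next window
        $count{$window_start}++;
    }
    return \%count;
}

# Hypothetical rows standing in for the result of one SELECT per chromosome.
my @hits  = ([1, 500], [10, 20], [400, 600], [501, 1000], [1200, 1300]);
my $count = count_hits_per_window(\@hits, 3_056_785, 500);
print "$_\t$count->{$_}\n" for sort { $a <=> $b } keys %$count;
```

This issues one statement per chromosome instead of one per window, at the cost of holding that chromosome's hits in memory.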

Posted by KevinAD (KevinAD), 31 October 2007

I'm in the same boat as george here. If there is a forum on the bioperl site you might want to start asking these questions over there where there will hopefully be members with similar issues and that understand the types of files you are working with.

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.