to make work faster??

Posted by revtopo (revtopo), 30 October 2007
hi all,

I am quite confused. I have some code that counts the number of matches of RNA sequences against a genome, but it takes days to run over the complete genome. I have attached the code below. Please suggest any way of improving the efficiency.

#!/usr/bin/perl -w
use DBI;
use strict;
use Bio::SeqIO;
use Bio::Perl;
use GD::Graph::bars;


my $window_size = 500;
my $step_size = 500;

my $connect;
my $database = 'Cloned_RNA_2_dev';
my $user ='root';
my $pass = '';
my $host = 'localhost';
my %sequences;
my $total_length;
my $length;
open (RESULT, ">result.csv") or die "Cannot open result.csv: $!";
my $dsn = qq(DBI:mysql:database=$database; host=$host);

$connect = DBI->connect($dsn, $user, $pass, {PrintError => 1}) or die $DBI::errstr;

#querying the database
my $query = qq/select count(c.Rna_sequence_sequence)
   from Cloned_rna c, Hsp h  
   where c.Rna_sequence_sequence=h.Rna_sequence_sequence
   and c.Project_idproject =72
   and h.Sequence_db_idSequence_db =1133
   and h.accession =?
   and h.hit_start >= ?
   and h.hit_end<=?
   group by h.accession/;

my $sql = $connect->prepare($query);
my %window_counts ;
print RESULT "\"Chromosome\"\t\"start\"\t\"end\"\t\"counts\"\n" ;
my $genome = Bio::SeqIO->new(-file=>'/home/shared-projects/sequence-dbs/TAIR7/TAIR7_nuclear_genome.fna', -format=>'fasta');

while (my $seq = $genome->next_seq()){
   $length = $seq->length();
   my $id =$seq->id();
#print RESULT "Windows on chromosome $id\n" ;
warn "Getting windows on chromosome $id\n" ;
#walking along the genome
   for(my $i=1;$i<=($length-$window_size); $i+=$step_size){
     my $window_start = $i;
     my $window_end = $i+$window_size-1;

     my $sucess=($sql->execute($id, $window_start, $window_end));
     die "query failed!\n $query \n" unless $sucess;
     my $window_id = "$id,$window_start" ;
     if (my @result = $sql->fetchrow_array() ) {          
         ($window_counts{$window_id}) = @result; # has the counts as the value of the hash
         warn "\$window_counts{$window_id}='$window_counts{$window_id}'\n";
         print RESULT "$id\t$window_start\t$window_end\t$window_counts{$window_id}\n" ;

#      exit if $i > 10000;
     }
   }
}
close RESULT;


any help in improving this?

Posted by george_Ball (george), 30 October 2007
Well, I'm no database guru, but I'd probably bet money that the problem is in the database rather than in this code, which does not seem to do anything sophisticated. Of course, the way to attack this is to use a profiler, such as Devel::Profiler, which is written up nicely in brian d foy's recent book "Mastering Perl". My guess is that you'll find that the program is spending most of its time in the call to execute() the SQL statement.

He also discusses DBI::Profile, which can be used to investigate database-related profiling in more detail.
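
Before reaching for a full profiler, a coarse check of the same hypothesis is to time the suspect call by hand. The sketch below is only that, using the core Time::HiRes module. The names `timed`, `%elapsed` and `%calls` are hypothetical, and the `sleep`-based stand-in replaces the real `$sql->execute(...)` call, which needs a live database.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my %elapsed;   # label => total wall-clock seconds
my %calls;     # label => number of calls

# Hypothetical helper: run a code block and add its elapsed time to a label.
sub timed {
    my ($label, $code) = @_;
    my $start  = time();
    my @result = $code->();
    $elapsed{$label} += time() - $start;
    $calls{$label}++;
    return wantarray ? @result : $result[0];
}

# Stand-ins for the two database calls in the loop body of the posted script.
for my $i (1 .. 20) {
    timed(execute => sub { sleep(0.005); 1 });        # would be $sql->execute(...)
    timed(fetch   => sub { my @row = ($i); @row });   # would be $sql->fetchrow_array()
}

# Report the most expensive label first.
for my $label (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
    printf "%-8s %3d calls %.4f s\n", $label, $calls{$label}, $elapsed{$label};
}
```

If the label wrapped around execute() dominates the report in the real script, that supports the guess that the time is going to the database rather than to the Perl code.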

Posted by KevinAD (KevinAD), 30 October 2007
How big is the genome file?

Posted by revtopo (revtopo), 30 October 2007
I can't say the exact length of the genome, but chromosome 1 is 3056785 and chromosome 2 is 1908764, and chromosomes 3, 4 and 5 are of similar length to chromosomes 1 and 2.

Posted by KevinAD (KevinAD), 31 October 2007
do you really need to do this?

my $sucess=($sql->execute($id, $window_start, $window_end));
die "query failed!\n $query \n" unless $sucess

Posted by revtopo (revtopo), 31 October 2007
Yes, of course: I need it to know which chromosome the matches come from, and their positions.

Posted by george_Ball (george), 31 October 2007
I have no background in biology or Bio::Perl, so I have no understanding of the numbers you give here...

But look at the code, and you see nested loops. What is the *actual* value (or at least the order of magnitude) of the number of iterations through the while loop ( ie how many times will $seq = $genome->next_seq() return a non-undef value ) ?

What is the *actual* value (or at least the order of magnitude) of

 ($length-$window_size) / $step_size

If we take the first of these as A and the second as B, then you are issuing the select statement to the database A*B times.
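
As a rough illustration of B, the loop bounds from the posted script can be evaluated directly. This sketch assumes the two chromosome lengths quoted earlier in the thread (3056785 and 1908764) and the script's window and step size of 500. `windows_per_sequence` is a hypothetical helper name, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: number of iterations of
#   for (my $i = 1; $i <= $length - $window_size; $i += $step_size)
sub windows_per_sequence {
    my ($length, $window_size, $step_size) = @_;
    my $last_start = $length - $window_size;   # largest allowed value of $i
    return 0 if $last_start < 1;
    return int(($last_start - 1) / $step_size) + 1;
}

# Lengths quoted in the thread for chromosomes 1 and 2.
my %chromosome_length = (chr1 => 3056785, chr2 => 1908764);

my $total = 0;
for my $id (sort keys %chromosome_length) {
    my $n = windows_per_sequence($chromosome_length{$id}, 500, 500);
    print "$id: $n queries\n";
    $total += $n;
}
print "total for these two sequences: $total queries\n";
```

With those figures the loop runs 6113 times for chromosome 1 and 3817 times for chromosome 2, so the two sequences alone account for 9930 separate execute() calls.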

If these are very large numbers then you are hitting the database a *very* large number of times. That may well be the cause of your problem, but it's impossible to say without having some remote idea what these numbers are.
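
One way to cut the number of round trips, offered here as an alternative rather than anything proposed in the thread, is to fetch the hit coordinates once and count them per window in Perl. The sketch below assumes non-overlapping windows (the script's window and step size are both 500) and uses a small made-up list of hits in place of a real query result. `count_hits_per_window` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: count hits that fall entirely inside one window.
# $hits is an array ref of [accession, hit_start, hit_end] rows, such as a
# single query selecting those three columns might return.
sub count_hits_per_window {
    my ($hits, $window_size, $length_of) = @_;
    my %count;   # "accession,window_start" => number of hits
    for my $hit (@$hits) {
        my ($id, $start, $end) = @$hit;
        my $first = int(($start - 1) / $window_size);   # window holding hit_start
        my $last  = int(($end   - 1) / $window_size);   # window holding hit_end
        next if $first != $last;                         # hit crosses a window boundary
        my $window_start = $first * $window_size + 1;
        # Same upper bound as the for loop in the posted script.
        next if $window_start > $length_of->{$id} - $window_size;
        $count{"$id,$window_start"}++;
    }
    return \%count;
}

# Made-up data for illustration only.
my @hits = (
    [ 'Chr1',  10,  30 ],
    [ 'Chr1', 480, 499 ],
    [ 'Chr1', 490, 510 ],   # crosses 500/501, so no window contains it
    [ 'Chr1', 501, 521 ],
    [ 'Chr2', 700, 720 ],
);
my $counts = count_hits_per_window(\@hits, 500, { Chr1 => 2000, Chr2 => 2000 });
print "$_\t$counts->{$_}\n" for sort keys %$counts;
```

Whether this is faster in practice depends on how many rows the single query returns and on how the tables are indexed, so it is worth measuring before committing to it.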

And, as I said earlier, problems like this are best approached through use of a profiling tool.

Posted by KevinAD (KevinAD), 31 October 2007
I'm in the same boat as george here. If there is a forum on the bioperl site, you might want to start asking these questions over there, where there will hopefully be members with similar issues who understand the types of files you are working with.

This page is a thread posted to the opentalk forum and archived here for reference.

