Training, Open Source computer languages
PerlPHPPythonMySQLApache / TomcatTclRubyJavaC and C++LinuxCSS 
Search for:
Home Accessibility Courses Diary The Mouth Forum Resources Site Map About Us Contact
 
For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatability most of our examples remain operational and even relevant ad you are welcome to make us if them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other area and still look after a lot of web sites - you can find an index ((here))
CDS regions: hashes vs. Set::IntSpan

Posted by Milo (Milo), 6 June 2003
Sorry if this is a bit of a tedious question, but it has been baffling me for days.
I'm trying to determine which regions of a sequence contain potential genes, based on Genbank notations (it's the best we have). These entries are in the form of:

  CDS             complement(1321..1593)

...and the numbers can be extracted by bioperl. I thought that a hash would be a cunning way of determining which bases were coding, like so:

foreach my $feat ($seq->all_SeqFeatures())
{
     my $tag = $feat->primary_tag();
     if ($tag =~ "CDS")
     {
           my $featstart = $feat->start();
           my $featend = $feat->end();
           for (my $i=$featstart;$i<=$featend;$i++)
           {
                 $poscheck{$i} = 1;
           }
     }
}

It's then possible to determine whether a particular region is within a CDS with:

if ($poscheck{$loc} == 1)

...and to count the coding bases with:

$coding =  grep defined, values %poscheck;

However, if used on large genomes (e.g. bacteria) then setting up the hash wastes so much CPU time that the program is rendered almost unusable (24 hours rather than 30 minutes to run). Is there any way to set up a similar hash more quickly?
Alterntively, I'm told that Set::IntSpan may do the trick, but I have found the man page to be incomprehensible. Is there an easier explanation of it about anywhere?
Thanks for any help.
Milo.


Posted by admin (Graham Ellis), 6 June 2003
Hi, Milo ... I'm supposed to be on holiday today - but couldn't resist checking in   ... I don't have a quick one-line answer for you; I'll post a follow up by Monday morning

Posted by Milo (Milo), 6 June 2003
Graham,
Thanks, that would be appreciated.
I've tried making a smaller hash by doing $poscheck{$start} = $stop, but this may break other parts of the code, and I haven't done a proper benchmark yet.
Milo.

Posted by admin (Graham Ellis), 6 June 2003
I can never resist a challenge ... going against the rule that says "don't
re-invent the wheel" I've written a piece of code to keep a list of ranges,
and to merge those ranges.   Simply call "addregion" with the start and end
of each range and it will keep tabs for you.  Call "isitin" with a single
value to be told whether the value given is within one of the ranges, and
isitin also sets a global variable called totbases so that you can save the
need for that grep or anything similar.

Code:
#!/usr/bin/perl

sub addregion {
       my ($start,$end) = @_;
       if ($start > $end) {
               ($start,$end) = ($end,$start);
               }
       my $done = 0;
       my $k;
       for ($k=0;$k<@nreg;$k++) {
# new region starts after old region ends
               next if ($nreg[$k][1] < $start);
# new region ends before old region starts
               if ($end < $nreg[$k][0]) {
                       splice(@nreg,$k,0,[$start,$end]);
                       $done = 1;
                       last;
                       }
# regions overlap
               $nreg[$k][0] = $start < $nreg[$k][0] ? $start : $nreg[$k][0];
               $nreg[$k][1] = $end > $nreg[$k][1] ? $end : $nreg[$k][1];
               $done = 1;
# Combine two regions if the new section has brough about an overlap
               if ($k < $#nreg and $nreg[$k][1] > $nreg[$k+1][0]) {
                       $nreg[$k][1] = $nreg[$k+1][1];
                       splice(@nreg,$k+1,1);
                       }
               last;
       }
       unless ($done) {
               push @nreg,[$start,$end];
               }
}

sub isitin {
       my ($wot) = @_;
       my $within = 0;
       $totbases = 0;
       for ($k=0; $k<@nreg; $k++) {
               if ($wot >= $nreg[$k][0] and $wot <= $nreg[$k][1]) {
                       $within = 1;
               }
               $totbases += $nreg[$k][1] - $nreg[$k][0] + 1;
       }
       return $within;
}

while (<DATA>) {
       ($from,$to) = split;
       addregion($from,$to);
       }

for ($k=0;$k<@nreg;$k++) {
       print ("Covers $nreg[$k][0] to $nreg[$k][1]\n");
       }

while (1) {
       print "Value to check? ";
       chop ($val = <STDIN>) ;
       last unless $val;
       $status = isitin($val);
       print ("status $status, count $totbases\n");
       }

__END__
10 20
17 34
-4000 -2000
40 50
1 7
200 5005
5 12
5000 5100
12322 12500
5050 7070
20000 100000000


Note - this will work with real numbers too and it does not internally combine
regions if the first ends at "n" and the second starts at "n+1".  I don't suppose
that matters, but if it does the code could be altered.   My test numbers go up to
a hundred million which would really hurt a hash - but this code is virtually
instant!

Code:
[localhost:~/jun03] graham% perl regions
Covers -4000 to -2000
Covers 1 to 34
Covers 40 to 50
Covers 200 to 7070
Covers 12322 to 12500
Covers 20000 to 100000000
Value to check? 4
status 1, count 99989097
Value to check? 60
status 0, count 99989097
Value to check? 250
status 1, count 99989097
Value to check?
[localhost:~/jun03] graham%



Posted by Milo (Milo), 6 June 2003
Thanks - that looks useful.
However, I need to ask what nreg is doing. Is it meant to be a global variable?
Milo.

Posted by admin (Graham Ellis), 7 June 2003
Yes - @nreg is a global list - it builds up pairs of values for the start and end of regions covered.  I probaly should have declared it "our" to avoid any problems with strict.

Each time you call addregion, it's @nreg that has another entry splice in, or an existing entry taken out, or two regions are combined if the new entry causes an overlap.

New subject - slight code change.  I've replaced an "if" by a "while" so that if a new regions overlaps a whole series of regions previously added, they'll all be combined, not just two combined.  Don't know if this would effect you or if it's something that couldn't happen on your input data but I thought I may as well do the general fix when I spotted the problem.  Ah - the danger of roll your own clever code is that you MUST check it carefully.

Code:
# Combine multiple regions if the new section has brough about an overlap
               while ($k < $#nreg and $nreg[$k][1] > $nreg[$k+1][0]) {


Posted by Milo (Milo), 7 June 2003
I've managed to get a comparison of the original, and your new code done.
Here are the results, for running the script on 106 genomes, totalling 622M.

The original version:
Code:
Total Elapsed Time = 80765.75 Seconds
 User+System Time = 64475.60 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
90.2   58175 58175.    106   548.82 548.82  main::gbparser


The new version:
Code:
Total Elapsed Time = 2897.801 Seconds
 User+System Time = 2093.501 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
34.4   721.3 721.26 271985   0.0027 0.0027  main::addregion
....


So, a considerable time saving!
Thanks,
Milo.

Posted by Milo (Milo), 5 August 2003
This code turned out to be very useful. As it's ended up in a couple of scripts, I've put it in a module. However, this leads me to ask what to do with @nreg (discussed earlier in the thread). Is it acceptable to leave @nreg in the script (re-set for each new file opened), given that only my scripts will use the module? Or, is it necessary/convenient/possible to place it within the module somehow? I can't see any immediate way to do this, though.
Thanks,
Milo.

Posted by admin (Graham Ellis), 5 August 2003
A quick initial reply - I'm on a public access phone terminal ...

My concern is that you might want several ranges in one program, so each will need an @nreg; solution is to take the module object oriented -- big subject, great fun .. I'll make a note to write something to describe it when  I've got a better connection if you like. May be a few days ....


Posted by Milo (Milo), 5 August 2003
on 08/05/03 at 19:19:39, Graham Ellis wrote:
My concern is that you might want several ranges in one program, so each will need an @nreg; solution is to take the module object oriented -- big subject, great fun .. I'll make a note to write something to describe it when  I've got a better connection if you like. May be a few days ....


Thanks. I've tried to do so, and it _appears_ to work (more data checking needed to be sure). Here's an example:

In the module:

sub new
{
       my ($hit,$start,$end,@nreg) = @_;
       my %rempos;
       $rempos{"hit"} = $hit;
       $rempos{"start"} = $start;
       $rempos{"end"} = $end;
       $rempos{"nreg"} = @nreg;
       bless \%rempos;
}

...and in the script:

my $position = new msatminer ($hit,($start-$flank_size),($end+$flank_size),undef);
$position -> addregion(($start-$flank_size),($end+$flank_size));
$closematch{$selfhit} = $position;
$closematch{$selfhit}{START} = $start;
$closematch{$selfhit}{END} = $end;

...then later:
my $reduce = 0;
foreach my $hit (keys %closematch)
{
      my $position = $closematch{$hit};
      my $check1 = $position->isitin(($closematch{$hit}{START}-$flank_size));
      my $check2 = $position->isitin(($closematch{$hit}{END}+$flank_size));

      if ($check1 == 1 or $check2 == 1) { $reduce++; }
}


This may well be dodgy, but no bugs spotted so far...
Milo.



Posted by admin (Graham Ellis), 7 August 2003
Looks exactly like the sort of thing - sure, everything needs testing, but one of the beauties of OO is than once it's done, it's a robust class  

Posted by Milo (Milo), 7 August 2003
Unfortunately, it was not robust enough in this case  
I got the correct results with one script, but it did not work with the other - @nreg still appeared to be global.
So, I'm trying other permutations to see if I can get a working object.
Milo.

Posted by Milo (Milo), 7 August 2003
Due to my complete inability to comprehend the complexities of OO, I've come up with this instead. It works, but is not so fancy.

In addition to adding isitin and addregion to the module, I've added this:

sub zap
{
     @nreg = ();
}

The calling script now has stuff like:

&msatminer::zap;
foreach my $feat (@features)
{      
       my ($start,$end) = &getstartend($feat);
       &msatminer::addregion($start,$end);
}

Re-setting @nreg in the module namespace each time I examine a new sequece seems to do the trick, as I can write out or save  the summary of which things are "in" before examining the next genome before @nreg is zapped.
Not elegant, but at least I have data
Milo.



Posted by admin (Graham Ellis), 7 August 2003
If it works for you, fine .... taking the "OO" further would mean you can have a separate @nreg for each sequence and be working on 2 at the same time!

Posted by Milo (Milo), 7 August 2003
I'd like to take it further, and it would be useful, but the problem is that, though it makes sense in classes, and in (most) books, my mind just cannot grasp it at the moment. Perhaps that BASIC programming I studied at school has rotted my mind
Perhaps in a few years....
Milo.



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.

You can Add a comment or ranking to this page

© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho