| ||||||||||||||
| ||||||||||||||
CDS regions: hashes vs. Set::IntSpan Posted by Milo (Milo), 6 June 2003 Sorry if this is a bit of a tedious question, but it has been baffling me for days.I'm trying to determine which regions of a sequence contain potential genes, based on Genbank notations (it's the best we have). These entries are in the form of: CDS complement(1321..1593) ...and the numbers can be extracted by bioperl. I thought that a hash would be a cunning way of determining which bases were coding, like so: foreach my $feat ($seq->all_SeqFeatures()) { my $tag = $feat->primary_tag(); if ($tag =~ "CDS") { my $featstart = $feat->start(); my $featend = $feat->end(); for (my $i=$featstart;$i<=$featend;$i++) { $poscheck{$i} = 1; } } } It's then possible to determine whether a particular region is within a CDS with: if ($poscheck{$loc} == 1) ...and to count the coding bases with: $coding = grep defined, values %poscheck; However, if used on large genomes (e.g. bacteria) then setting up the hash wastes so much CPU time that the program is rendered almost unusable (24 hours rather than 30 minutes to run). Is there any way to set up a similar hash more quickly? Alterntively, I'm told that Set::IntSpan may do the trick, but I have found the man page to be incomprehensible. Is there an easier explanation of it about anywhere? Thanks for any help. Milo. Posted by admin (Graham Ellis), 6 June 2003 Hi, Milo ... I'm supposed to be on holiday today - but couldn't resist checking in ![]() Posted by Milo (Milo), 6 June 2003 Graham,Thanks, that would be appreciated. I've tried making a smaller hash by doing $poscheck{$start} = $stop, but this may break other parts of the code, and I haven't done a proper benchmark yet. Milo. Posted by admin (Graham Ellis), 6 June 2003 I can never resist a challenge ... going against the rule that says "don'tre-invent the wheel" I've written a piece of code to keep a list of ranges, and to merge those ranges. Simply call "addregion" with the start and end of each range and it will keep tabs for you. Call "isitin" with a single value to be told whether the value given is within one of the ranges, and isitin also sets a global variable called totbases so that you can save the need for that grep or anything similar. Code:
Note - this will work with real numbers too and it does not internally combine regions if the first ends at "n" and the second starts at "n+1". I don't suppose that matters, but if it does the code could be altered. My test numbers go up to a hundred million which would really hurt a hash - but this code is virtually instant! Code:
Posted by Milo (Milo), 6 June 2003 Thanks - that looks useful.However, I need to ask what nreg is doing. Is it meant to be a global variable? Milo. Posted by admin (Graham Ellis), 7 June 2003 Yes - @nreg is a global list - it builds up pairs of values for the start and end of regions covered. I probaly should have declared it "our" to avoid any problems with strict.Each time you call addregion, it's @nreg that has another entry splice in, or an existing entry taken out, or two regions are combined if the new entry causes an overlap. New subject - slight code change. I've replaced an "if" by a "while" so that if a new regions overlaps a whole series of regions previously added, they'll all be combined, not just two combined. Don't know if this would effect you or if it's something that couldn't happen on your input data but I thought I may as well do the general fix when I spotted the problem. Ah - the danger of roll your own clever code is that you MUST check it carefully. Code:
Posted by Milo (Milo), 7 June 2003 I've managed to get a comparison of the original, and your new code done.Here are the results, for running the script on 106 genomes, totalling 622M. The original version: Code:
The new version: Code:
So, a considerable time saving! Thanks, Milo. Posted by Milo (Milo), 5 August 2003 This code turned out to be very useful. As it's ended up in a couple of scripts, I've put it in a module. However, this leads me to ask what to do with @nreg (discussed earlier in the thread). Is it acceptable to leave @nreg in the script (re-set for each new file opened), given that only my scripts will use the module? Or, is it necessary/convenient/possible to place it within the module somehow? I can't see any immediate way to do this, though.Thanks, Milo. Posted by admin (Graham Ellis), 5 August 2003 A quick initial reply - I'm on a public access phone terminal ...My concern is that you might want several ranges in one program, so each will need an @nreg; solution is to take the module object oriented -- big subject, great fun .. I'll make a note to write something to describe it when I've got a better connection if you like. May be a few days .... Posted by Milo (Milo), 5 August 2003 on 08/05/03 at 19:19:39, Graham Ellis wrote:
Thanks. I've tried to do so, and it _appears_ to work (more data checking needed to be sure). Here's an example: In the module: sub new { my ($hit,$start,$end,@nreg) = @_; my %rempos; $rempos{"hit"} = $hit; $rempos{"start"} = $start; $rempos{"end"} = $end; $rempos{"nreg"} = @nreg; bless \%rempos; } ...and in the script: my $position = new msatminer ($hit,($start-$flank_size),($end+$flank_size),undef); $position -> addregion(($start-$flank_size),($end+$flank_size)); $closematch{$selfhit} = $position; $closematch{$selfhit}{START} = $start; $closematch{$selfhit}{END} = $end; ...then later: my $reduce = 0; foreach my $hit (keys %closematch) { my $position = $closematch{$hit}; my $check1 = $position->isitin(($closematch{$hit}{START}-$flank_size)); my $check2 = $position->isitin(($closematch{$hit}{END}+$flank_size)); if ($check1 == 1 or $check2 == 1) { $reduce++; } } This may well be dodgy, but no bugs spotted so far... Milo. Posted by admin (Graham Ellis), 7 August 2003 Looks exactly like the sort of thing - sure, everything needs testing, but one of the beauties of OO is than once it's done, it's a robust class ![]() Posted by Milo (Milo), 7 August 2003 Unfortunately, it was not robust enough in this case ![]() I got the correct results with one script, but it did not work with the other - @nreg still appeared to be global. So, I'm trying other permutations to see if I can get a working object. Milo. Posted by Milo (Milo), 7 August 2003 Due to my complete inability to comprehend the complexities of OO, I've come up with this instead. It works, but is not so fancy.In addition to adding isitin and addregion to the module, I've added this: sub zap { @nreg = (); } The calling script now has stuff like: &msatminer::zap; foreach my $feat (@features) { my ($start,$end) = &getstartend($feat); &msatminer::addregion($start,$end); } Re-setting @nreg in the module namespace each time I examine a new sequece seems to do the trick, as I can write out or save the summary of which things are "in" before examining the next genome before @nreg is zapped. Not elegant, but at least I have data ![]() Milo. Posted by admin (Graham Ellis), 7 August 2003 If it works for you, fine .... taking the "OO" further would mean you can have a separate @nreg for each sequence and be working on 2 at the same time!Posted by Milo (Milo), 7 August 2003 I'd like to take it further, and it would be useful, but the problem is that, though it makes sense in classes, and in (most) books, my mind just cannot grasp it at the moment. Perhaps that BASIC programming I studied at school has rotted my mind ![]() Perhaps in a few years.... Milo. This page is a thread posted to the opentalk forum
at www.opentalk.org.uk and
archived here for reference. To jump to the archive index please
follow this link.
|
| |||||||||||||
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho |