to find the matches - Perl Programming

Posted by revtopo (revtopo), 8 October 2007

hi there,

I would like to find whether two strings match exactly. The two strings are $gseq and $Rna_sequence. I have to match the list of RNA sequences held in $Rna_sequence against $gseq, and I also have to find how many matches are present in each window.

The code is similar to this:

#!/usr/bin/perl -w
use DBI;
use strict;
use Bio::SeqIO;

use Bio::Perl;
my $connect;
my $database = 'Cloned_RNA_2_dev';
my $user ='root';
my $pass = '';
my $host = 'localhost';
my %sequences;
my $total_length;
my $length;
my $dsn = qq(DBI:mysql:database=$database;host=$host);
$connect = DBI->connect($dsn, $user, $pass, {PrintError => 1}) or die $DBI::errstr; # DBI attribute names are case-sensitive
my $query = $connect->prepare(qq( select Rna_sequence_sequence , Project_idproject, clone_name from Cloned_rna where Project_idproject =72));
$query-> execute;

while (my($Rna_sequence,$Projectid, $clone_name) = $query->fetchrow_array()){
$sequences{$Rna_sequence} = $clone_name;
}
foreach my $Rna_sequence(sort keys %sequences){
print "$Rna_sequence\t $sequences{$Rna_sequence}\n";

my $genome = Bio::SeqIO->new(-file=>'/home/shared-projects/sequence-dbs/TAIR7/TAIR7_nuclear_genome.fna', -format=>'fasta');
while (my $seq = $genome->next_seq()){
$length = $seq->length();
my $gseq =$seq->seq();
#print "the length is: $length\n";
my $window_size = 500; #sets the length of the string in which the match should be found in the whole sequence
my $step_size = 500;
for (my $i = 0; $i <= $length - $window_size; $i += $step_size) { # substr offsets start at 0, not 1
my $seq_window = substr($gseq, $i, $window_size); # gets the substring of length $window_size from $gseq
#print "the seq_window is $seq_window\n";
}
}
}

Is this something that can be done with a regular expression?

thanks

Posted by admin (Graham Ellis), 8 October 2007

Can you give a short example of the data and what it should achieve?

I think you're looking for how many times one string occurs exactly in another - have I read this right? Do you want to count overlapping matches? For example, if you want to find how many times "CAC" occurs in GGTGGTGGTCACACTTTGGCACGGGG, would you say the answer was 2 or 3?
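
To make the distinction concrete, here is a minimal sketch of both counts for the example string above. The variable names are illustrative only and are not taken from the original script.

```perl
use strict;
use warnings;

my $text   = 'GGTGGTGGTCACACTTTGGCACGGGG';
my $target = 'CAC';

# Non-overlapping count: a global regex match resumes after the end
# of each match, so the second CAC inside CACAC is skipped.
my $non_overlapping = () = $text =~ /\Q$target\E/g;

# Overlapping count: restart the search one character after the
# start of each match instead of after its end.
my $overlapping = 0;
my $pos = 0;
while (($pos = index($text, $target, $pos)) != -1) {
    $overlapping++;
    $pos++;
}

print "non-overlapping: $non_overlapping, overlapping: $overlapping\n";
```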

Posted by revtopo (revtopo), 9 October 2007

$gseq may look like atgaaatttggccatttgggggggcacacata.................. and extends to a length of about 306453. $seq_window holds the first 500 characters of $gseq. $Rna_sequence holds many short RNA sequences such as aa, aaa, aaaaa, atatagcccc, .......atgcatgcataatgggccccaaatttt. The maximum length of an $Rna_sequence may be 27.

What I want to do is match all of these $Rna_sequence values against each window (say 500 here) and find how many of them have exact matches in $seq_window, which in turn is part of $gseq. This has to return both the genomic position and the number of matches in each window. I hope that makes it clear now.

thanks,

Posted by admin (Graham Ellis), 9 October 2007

on 10/09/07 at 08:12:47, revtopo wrote:

hope I make it clear now.

thanks,

It answers a lot of questions ... but not the one I asked, which will really affect the answer. However .. I may have misunderstood the question based on some of your comments. Am I right in thinking you're not interested in how many times each sequence is contained in the 500 element segment, but rather how many different sequences of interest it contains?

Posted by revtopo (revtopo), 9 October 2007

I would say 3; that is, overlapping matches are accepted.

Posted by admin (Graham Ellis), 10 October 2007

on 10/09/07 at 19:13:01, revtopo wrote:

I would say 3; that is, overlapping matches are accepted.

Then I'm afraid that regular expressions are not really the way to go, as a global match resumes searching after the end of the previous match; in other words, it misses overlaps. But as the string you're looking for is fixed at each iteration, Perl's index function will work a treat!
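
As a rough sketch of how index could be applied here: the helper name find_overlapping, the toy data, and the small window size below are all assumptions for illustration, not part of the original script. Each match is assigned to the window that contains its start position, which is one possible reading of "matches in each window". Searching the whole sequence and then grouping by window also avoids losing matches that straddle a window boundary, which a search inside each substr slice would miss.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based start positions of every
# (possibly overlapping) occurrence of $needle in $haystack.
sub find_overlapping {
    my ($haystack, $needle) = @_;
    my @positions;
    return @positions if $needle eq '';   # an empty needle would loop forever
    my $pos = 0;
    while (($pos = index($haystack, $needle, $pos)) != -1) {
        push @positions, $pos;
        $pos++;    # advance one character, not one match length, to keep overlaps
    }
    return @positions;
}

# Toy data standing in for $gseq and the keys of %sequences.
my $gseq        = 'atgaaatttggccatttgggggggcacacata';
my @rna         = ('aaa', 'cac', 'ttt');
my $window_size = 10;    # 500 in the original script

# $hits{window start}{rna sequence} = list of genomic start positions
my %hits;
for my $rna (@rna) {
    for my $pos (find_overlapping($gseq, $rna)) {
        my $window_start = int($pos / $window_size) * $window_size;
        push @{ $hits{$window_start}{$rna} }, $pos;
    }
}

# Report the match count and positions per window, in genomic order.
for my $window_start (sort { $a <=> $b } keys %hits) {
    for my $rna (sort keys %{ $hits{$window_start} }) {
        my @found = @{ $hits{$window_start}{$rna} };
        printf "window %d: %s x%d at %s\n",
            $window_start, $rna, scalar @found, join(',', @found);
    }
}
```

With real data, @rna would be the keys of %sequences and $gseq would come from $seq->seq(). The positions are 0-based, so add 1 if you want conventional 1-based genomic coordinates.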

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.