to find the matches

Posted by revtopo (revtopo), 8 October 2007
hi there,

I would like to find whether two strings match exactly. The two strings are $gseq and $Rna_sequence. I have to match the list of RNA sequences contained in $Rna_sequence against $gseq, and also find how many matches are present in each window.

The code is similar to this:

#!/usr/bin/perl -w
use DBI;
use strict;
use Bio::SeqIO;

use Bio::Perl;
my $connect;
my $database = 'Cloned_RNA_2_dev';
my $user = 'root';
my $pass = '';
my $host = 'localhost';
my %sequences;
my $total_length;
my $length;
my $dsn = qq(DBI:mysql:database=$database;host=$host);
# DBI attribute names are case-sensitive: PrintError, not printError
$connect = DBI->connect($dsn, $user, $pass, {PrintError => 1}) or die $DBI::errstr;
my $query = $connect->prepare(qq(select Rna_sequence_sequence, Project_idproject, clone_name from Cloned_rna where Project_idproject = 72));
$query->execute;

# map each RNA sequence to its clone name
while (my ($Rna_sequence, $Projectid, $clone_name) = $query->fetchrow_array()) {
    $sequences{$Rna_sequence} = $clone_name;
}
foreach my $Rna_sequence (sort keys %sequences) {
    print "$Rna_sequence\t $sequences{$Rna_sequence}\n";

    my $genome = Bio::SeqIO->new(-file => '/home/shared-projects/sequence-dbs/TAIR7/TAIR7_nuclear_genome.fna', -format => 'fasta');
    while (my $seq = $genome->next_seq()) {
        $length = $seq->length();
        my $gseq = $seq->seq();
        #print "the length is: $length\n";
        my $window_size = 500; # length of the window in which the match should be found within the whole sequence
        my $step_size = 500;
        # substr offsets are 0-based, so start at 0 to include the first base
        for (my $i = 0; $i <= ($length - $window_size); $i += $step_size) {
            my $seq_window = substr($gseq, $i, $window_size); # the substring of length 500 from gseq starting at offset $i
            #print "the seq_window is $seq_window\n";
        }
    }
}

Is this something that can be done with a regular expression?

Thanks


Posted by admin (Graham Ellis), 8 October 2007
Can you give a short example of the data and what it should achieve?

I think you're looking for how many times one string occurs exactly in another - have I read this right?   Do you want to count overlapping matches?   For example, if you want to find how many times "CAC" occurs in GGTGGTGGTCACACTTTGGCACGGGG, would you say the answer was 2 or 3?
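
To make the 2-versus-3 distinction concrete, here is a minimal sketch that counts "CAC" in that example string both ways. The variable names are made up for illustration and are not from the original script; the overlapping count here simply compares the target against the text at every offset.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text   = 'GGTGGTGGTCACACTTTGGCACGGGG';
my $target = 'CAC';

# Non-overlapping: a global regex match in list context resumes after the
# end of each match, so the second CAC inside CACAC is never seen.
my $non_overlapping = () = $text =~ /\Q$target\E/g;

# Overlapping: compare the target against the text at every possible offset.
my $overlapping = 0;
for my $offset (0 .. length($text) - length($target)) {
    $overlapping++ if substr($text, $offset, length($target)) eq $target;
}

print "non-overlapping: $non_overlapping\n";   # prints 2
print "overlapping:     $overlapping\n";       # prints 3
```

The extra overlapping hit comes from CACAC, where the target occurs at two offsets that share the middle C.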

Posted by revtopo (revtopo), 9 October 2007
The gseq may be something like atgaaatttggccatttgggggggcacacata.................. which extends to a length of about 306453. The seq_window holds the first 500 characters of gseq. $Rna_sequence holds many short RNA sequences such as aa, aaa, aaaaa, atatagcccc, .......atgcatgcataatgggccccaaatttt. The maximum length of an Rna_sequence may be 27. $Rna_sequence contains all of these sequences.

What I want to do is match all these $Rna_sequence values against each window (say 500 here) and find how many RNAs have exact matches in seq_window, which in turn comes from gseq. This has to return both the genomic position and the number of matches in each window. Hope I make it clear now.

Thanks,

Posted by admin (Graham Ellis), 9 October 2007
on 10/09/07 at 08:12:47, revtopo wrote:
hope I make it clear now.

thanks,



It answers a lot of questions ... but not the one I asked, which will really affect the answer.   However ... I may have misunderstood the question, based on some of your comments.  Am I right in thinking you're not interested in how many times each sequence is contained in the 500 element segment, but rather in how many different sequences of interest it contains?

Posted by revtopo (revtopo), 9 October 2007
I would say 3; that is, overlapping matches are accepted.

Posted by admin (Graham Ellis), 10 October 2007
on 10/09/07 at 19:13:01, revtopo wrote:
I would say 3; that is, overlapping matches are accepted.



Then I'm afraid that regular expressions are not really the way to go, as global matching starts each new search after the end of the previous match - in other words, it misses overlaps.  But as the string you're looking for is fixed at each iteration, Perl's index function will work a treat!
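
As a rough sketch of how index could be used here: call it repeatedly, restarting one character past each hit so that overlaps are found. The helper name find_all, the toy sequence, the short RNA list and the window size of 16 are all assumptions made for illustration; the real script would take $gseq from Bio::SeqIO and the RNA sequences from the database, with a window of 500.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based offsets of every occurrence of
# $target in $text, overlapping occurrences included.
sub find_all {
    my ($text, $target) = @_;
    return if $target eq '';          # an empty target would never stop matching
    my @positions;
    my $pos = index($text, $target);
    while ($pos != -1) {
        push @positions, $pos;
        $pos = index($text, $target, $pos + 1);   # advance one character only
    }
    return @positions;
}

# Toy data standing in for $gseq and the keys of %sequences.
my $gseq        = 'atgaaatttggccatttgggggggcacacata';
my @rna         = ('aa', 'aaa', 'cac', 'ttt');
my $window_size = 16;
my $step_size   = 16;

for (my $start = 0; $start < length($gseq); $start += $step_size) {
    my $window = substr($gseq, $start, $window_size);
    for my $rna (@rna) {
        my @hits = find_all($window, $rna);
        next unless @hits;
        # convert window offsets to 1-based positions in the whole sequence
        my @genomic = map { $_ + $start + 1 } @hits;
        printf "window starting at %d: %s matched %d time(s) at %s\n",
            $start + 1, $rna, scalar @hits, join(',', @genomic);
    }
}
```

Two things to be aware of in this sketch: it also scans a final window shorter than the window size, and with the step equal to the window size a match that straddles two windows is not counted in either. If such matches matter, one option is to search the whole sequence once and assign each hit to a window by its start position.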





This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.


© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho