Training, Open Source computer languages
PerlPHPPythonMySQLApache / TomcatTclRubyJavaC and C++LinuxCSS 
Search for:
Home Accessibility Courses Diary The Mouth Forum Resources Site Map About Us Contact
 
20.9.2014 - We have just updated our course layouts and descriptions and added our 2015 schedule.

keyword searches

Posted by jill (jill), 15 October 2002
I would appreciate advice on the following problem:
We have to search several text fields for keywords, whereby the user types the keywords on a form and the search returns records which have any of those keywords in any of the text fields.
A simple wild card search works, but if for example the user types in the word 'gene', the search returns records containing 'gene' as well as genes genetics etc but also all records containing 'general'. So it returns too many irrelevant records.

What is required is that the search returns records which contain the specified word as a whole word only.  The word can exist in a field in these ways: It can be at the start of the field and followed by space, comma, fullstop,...closing bracket, dash, question mark etc, or it can be at the end of the field and preceded by space, opening bracket, dash, quote etc, or it can be in the middle of the field preceded by and followed by the same lists of punctuation.

I am currently trying to write the sql with a great many or's to search for the keyword with all these possibilities.  I am trying to use arrays and foreach loops to add the ors on to each other.

Am I on the right lines or is there something I am missing?

Posted by admin (Graham Ellis), 15 October 2002
This sounds like a case for regular expressions which you can use in your PHP or in your MySQL.  They have various word boundary options available which should help.  I'll write some samples and  post them up a little later.

Posted by admin (Graham Ellis), 15 October 2002
I've provided a PHP answer, since you posted to the PHP forum ... my example sets up an array of text lines to search through, then does so with a regular expression written to match your very specific requirement.   My example then goes on to use the \b assertation (word boundary) instead of your specification of what is and isn't a word boundary - the results are the same, but of course it's up to you if you go with the one that's provided, or roll your own!

Code:
<?php

$want = "gene";

$demo = array(
"This is a test line with the word gene in it",
"Genetic Enginnering is interesting but no match",
"This line has Gene with a capital G",
"The General Bus Company",
"If you are looking for a multi-gene solution, this may be it",
"Gene Kelly and Doris Day may not be what you're looking for",
"gentle example",
"gene",
"another gentle example",
"This is a line almost ending in gene.",
"This is a line ending in gene",
"This last line has a last word with g-e-n-e at the end; engene");

foreach ($demo as $trythis) {
       if (eregi("(^|[- \('])$want([- ,.\)?]|$)",$trythis)) {
               print $trythis."<br>\n";
       }
}
print "<hr>";
foreach ($demo as $trythis) {
       if (preg_match("/\\b$want\\b/i",$trythis)) {
               print $trythis."<br>\n";
       }
}
?>


Quote:
This is a test line with the word gene in it
This line has Gene with a capital G
If you are looking for a multi-gene solution, this may be it
Gene Kelly and Doris Day may not be what you're looking for
gene
This is a line almost ending in gene.
This is a line ending in gene
------------------------------------------------------------------------
This is a test line with the word gene in it
This line has Gene with a capital G
If you are looking for a multi-gene solution, this may be it
Gene Kelly and Doris Day may not be what you're looking for
gene
This is a line almost ending in gene.
This is a line ending in gene


If you want to do the check within MySQL, have a look at the
REGEXP or RLIKE keywords.   Most of the regular expressions themselves are the same in PHP and MySQL, but the \b word boundary is, I recall, \< for word start and \> for word end in MySQL.  Don't have the manual here tonight, and am writing of line to keep my hotel connection costs down


Posted by jill (jill), 16 October 2002
Many thanks for the help.

I have used an sql statement somewhat like this

select * from olmg where expertise  regexp "[[:<:]]gene[[:>:]]"

Huge amounts of effort saved and much better code, thanks again.

Posted by jill (jill), 16 October 2002
Further to the keyword problem the customer has now modified the requirements.  They would like end users to be able to type in a plain keyword and have the search find it as a whole word, but also to be able to add a wildcard (%) at the beginning or end of the word.  

With the code I have used, namely

select * from olmg where expertise  regexp "[[:<:]]".$keyword[$n]."[[:>:]]"

if the user enters e.g. %gene% in the form, it returns no records

Is there a way to combine these two requirements?  At present I am thinking of checking for the percents with an if and building the sql up in different ways accordingly - not quite sure how though?

TIA for any help

Posted by admin (Graham Ellis), 16 October 2002
If the user is allowed to enter % to mean "perhaps some alphanumerics at the start or end (or even in the middle)  of the word he/ she is searching for, you could replace the % by [[:alnum:]] before you pass it in to the MySQL.  Thus in PHP:
     $lookfor = implode("[[:alnum:]]",explode("%",$lookfor));

I would strongly suggest that you filter the incoming word(s) that your user enters for other special characters, unless your user is going to be able to understand regular expressions, before you pass them from the input through to the SQL.

Posted by jill (jill), 18 October 2002
Many thanks for the reply above.  I found I had to add the asterisk as in
$keyarray[$n] = implode("[[:alnum:]]*",explode("%",$keyarray[$n]));
but then it worked fine.

Can you recommend an idiot's guide to regular expressions?

Posted by admin (Graham Ellis), 20 October 2002
Ah yes - the extra "*" means "One or more of the previous element" - so my original answer would cause % to match exactly one alphanumeric which wasn't quite what you asked   ... your modification means that % matches any number (zero or more) alphanumrics, which is what you asked for  

You ask about an "idiot's guide".  Umm - I think you're one step beyond a starter guide (the term "idiot's guide" worries me in any case),  since you were able to sort our my slipup above .... HOWEVER ... there is a lack of / need for some way for newcomers to get into the philosophy of regular expressions. The "Mastering Regular Expressions" book is excellent but perhaps too advanced and frightening;  have a look also at http://www.regularexpression.info (which is "one of mine") .



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.

You can Add a comment or ranking to this page

© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho