Posted by jill (jill), 15 October 2002I would appreciate advice on the following problem:
We have to search several text fields for keywords, whereby the user types the keywords on a form and the search returns records which have any of those keywords in any of the text fields.
A simple wild card search works, but if for example the user types in the word 'gene', the search returns records containing 'gene' as well as genes genetics etc but also all records containing 'general'. So it returns too many irrelevant records.
What is required is that the search returns records which contain the specified word as a whole word only. The word can exist in a field in these ways: It can be at the start of the field and followed by space, comma, fullstop,...closing bracket, dash, question mark etc, or it can be at the end of the field and preceded by space, opening bracket, dash, quote etc, or it can be in the middle of the field preceded by and followed by the same lists of punctuation.
I am currently trying to write the sql with a great many or's to search for the keyword with all these possibilities. I am trying to use arrays and foreach loops to add the ors on to each other.
Am I on the right lines or is there something I am missing?
Posted by admin (Graham Ellis), 15 October 2002This sounds like a case for regular expressions which you can use in your PHP or in your MySQL. They have various word boundary options available which should help. I'll write some samples and post them up a little later.
Posted by admin (Graham Ellis), 15 October 2002I've provided a PHP answer, since you posted to the PHP forum ... my example sets up an array of text lines to search through, then does so with a regular expression written to match your very specific requirement. My example then goes on to use the \b assertation (word boundary) instead of your specification of what is and isn't a word boundary - the results are the same, but of course it's up to you if you go with the one that's provided, or roll your own!
If you want to do the check within MySQL, have a look at the
REGEXP or RLIKE keywords. Most of the regular expressions themselves are the same in PHP and MySQL, but the \b word boundary is, I recall, \< for word start and \> for word end in MySQL. Don't have the manual here tonight, and am writing of line to keep my hotel connection costs down
Posted by jill (jill), 16 October 2002Many thanks for the help.
I have used an sql statement somewhat like this
select * from olmg where expertise regexp "[[:<:]]gene[[:>:]]"
Huge amounts of effort saved and much better code, thanks again.
Posted by jill (jill), 16 October 2002Further to the keyword problem the customer has now modified the requirements. They would like end users to be able to type in a plain keyword and have the search find it as a whole word, but also to be able to add a wildcard (%) at the beginning or end of the word.
With the code I have used, namely
select * from olmg where expertise regexp "[[:<:]]".$keyword[$n]."[[:>:]]"
if the user enters e.g. %gene% in the form, it returns no records
Is there a way to combine these two requirements? At present I am thinking of checking for the percents with an if and building the sql up in different ways accordingly - not quite sure how though?
TIA for any help
Posted by admin (Graham Ellis), 16 October 2002If the user is allowed to enter % to mean "perhaps some alphanumerics at the start or end (or even in the middle) of the word he/ she is searching for, you could replace the % by [[:alnum:]] before you pass it in to the MySQL. Thus in PHP:
$lookfor = implode("[[:alnum:]]",explode("%",$lookfor));
I would strongly suggest that you filter the incoming word(s) that your user enters for other special characters, unless your user is going to be able to understand regular expressions, before you pass them from the input through to the SQL.
Posted by jill (jill), 18 October 2002Many thanks for the reply above. I found I had to add the asterisk as in
$keyarray[$n] = implode("[[:alnum:]]*",explode("%",$keyarray[$n]));
but then it worked fine.
Can you recommend an idiot's guide to regular expressions?
Posted by admin (Graham Ellis), 20 October 2002Ah yes - the extra "*" means "One or more of the previous element" - so my original answer would cause % to match exactly one alphanumeric which wasn't quite what you asked ... your modification means that % matches any number (zero or more) alphanumrics, which is what you asked for
You ask about an "idiot's guide". Umm - I think you're one step beyond a starter guide (the term "idiot's guide" worries me in any case), since you were able to sort our my slipup above .... HOWEVER ... there is a lack of / need for some way for newcomers to get into the philosophy of regular expressions. The "Mastering Regular Expressions" book is excellent but perhaps too advanced and frightening; have a look also at http://www.regularexpression.info (which is "one of mine") .
PH: 01225 708225 • FAX: 01225 793803 • EMAIL: email@example.com • WEB: http://www.wellho.net • SKYPE: wellho