what database should I use?
Posted by critter (critter), 10 March 2003I have an email account that only accepts text messages, and the messages come in about 100 per minute (met data) or 80 Meg of data per day. The text data becomes 30 Meg when compressed. My problem is to design a very fast search engine to find a particular phrase in the messages for an archive that has 5 years of data. We are thinking of perhaps using MySql & PHP to build an online web application. Anyone thinks this can be done effectively with MySql? Suggestions? thanks...
Posted by admin (Graham Ellis), 10 March 2003Let's see - I make that over 250 million records - with indexes, other fields apart from the data, may rise to 40G or 50G. Do you agree my sums?
The volume of data shouldn't be a problem ... if you take a look at the MySQL Reference manual (full details at http://www.wellho.net/book/0-596-00265-3.html ), section 5.1.3, you'll read "Currently we have at least 50G of transaction tables and 200G of other customer data ..... we haven't had and problems handling this with quite modest Sun Ultra ..... we are planning ..... a tenfold increase .... we think we can keep up with this by just adding more disks to our systems". Beware - I'm summarising to quote / review - get yourself the book, or have a look at the MySQL web site which it's reproduced from.
Some thoughts and questions to think about:
- Although the data volume isn't a problem, I do wonder if you plan to have 250 million rows in one table how that will be!
- If you use InnoDb tables, they can be spread over several discs which will give you a major efficiency gain (must be several physical discs, not just several partitions)
- Are you wanting to search through the compressed content, or just through uncompressed indexes, etc?
- If you're looking at using PHP to do the query, this implies that you want your queries to run as quickly as they're entered - i.e. at web speed. Bear in mind that complex queries (for example, if you're searching with RLIKE) may take longer than you wish, but you might have a situation where you can email results back?
Hope this is a starter for you ... handling huge amounts of data doesn't seem to be covered in too many of the guide books and cookbooks - I had quite a job finding the specific reference above ; do post further if you want to follow up on the points I have raised or others
Posted by critter (critter), 11 March 2003Hello Graham,
This is very encouraging... and many thanks for the reference.
Can you elaborate about why having many physical discs will gain efficiency? Does it imply separating the tables so that they reside in different discs. How about RAID 5 cofiguration, would that count eventhough it is one logical disc made up of several discs?
I think if the search interface requires a period entry such as from/to, that may force a quicker respond than say search the whole database.
Another idea I may try is to spawn many threads concurrently that would query the database for different periods... hmm, would that gain overall quicker respond?
Posted by admin (Graham Ellis), 12 March 2003Funny thing how I was commenting the other day about the lack of texts on handling very large databases / optimising use of MySQL. I was in Swindon (our local town here in England) yesterday, and I spotted "SQL Peroformance Tuning", newly published this year, on the shelf. The book is no longer on the shelf in Swindon - it's now here in Melksham, and I'll be getting full details up on our web site in the next few days.
Answering your questions ...
A lot of the time that MySQL takes to do an enquiry is spent with disc heads seeking from one cylinder to another. By splitting the enquiry between various physical discs, it's said that you can share out the seeking; in essence, you're providing extra resources at the busiest point at the system - rather like clearing traffic jams by building an extra road at the point where jams occur. BUT you do need to beware that you don't cause traffic chaos at the junction where the new road starts and ends.
I don't know about RAID v MySQL - I guess that it might help just a little, but that really is a guess (and not a very educated one), so please don't trust my answer. Any gain would probably come more from the RAID buffering than anything else but with smaller block to be written to each disc, the number of seeks on each might be trimmed. As I say, a guess.
Multiple threads to query? I'm concerned that would cause the traffic jam I described above at the start of the new road, as the heads seek firstly the cylinders needed by one thread, then the cylinders needed by another, then back to the first. If the enquiry is primarily disc bound, I would expect that multiple enquiries in parallel might take longer than a sequential series. Again, an excellent question that's very hard to answer and will differ depending on so many hard-to-evaluate criteria such as the resource slicing algorithm of the operating system.
Hopefully some food for thought here. This sounds like a very serious big application and it may be worth investing in some sort of testing and evaluation; big disks are cheap these days, and I'm sure you can write a program to fill up a disc with test data over a few days.
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: email@example.com • WEB: http://www.wellho.net • SKYPE: wellho