For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatibility most of our examples remain operational and even relevant, and you are welcome to make use of them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other areas and still look after a lot of web sites - you can find an index ((here))
Perl & a Database

Posted by ChrisS (ChrisS), 13 November 2003
Evening all,

A year after Well House training, I've eventually found the message board!

Anyway, I currently develop and maintain a middleware solution written in Perl that passes messages between different platforms. The application accepts flat-file and XML messages and maps the data to the format that the relevant receiving system expects.

As our number of input XML messages increases, so does the number of translate files used to generate the flat-file output. Currently each translate file (and each parsing file for incoming messages) is a simple text file that Perl uses as a template for how to build the required message.

To ease the maintenance of future messages and requirements, as well as to increase the flexibility of the application, I was considering holding all the mapping data in a central repository (i.e. a database) that the application and application maintainers could reference.

My first concern, however, is that the application currently processes on average 1000 messages an hour, and that is likely to quadruple over the coming years. If I reference the database every time, I guess there could be a performance issue?

I then thought about loading each mapping table into a hash at first start-up, which the application would continue to reference throughout the day; however, I need to consider what happens if the database fails or becomes corrupt. I guess this could also potentially cause issues with memory space.

My final idea was to write the database mapping data out to text files at reboot, which the application would read from (similar to the current setup). If the DB failed, the application would continue to use the files from the previous day.
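
Something along these lines is what I had in mind, using Perl's Storable module - the file name and hash layout below are only for illustration, not the real application:

    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    my $snapshot = '/var/spool/middleware/mappings.stored';   # illustrative path

    # At reboot, after the mapping tables have been read from the database
    # into %mappings, dump them to disk so yesterday's copy is always there.
    sub save_snapshot {
        my (%mappings) = @_;
        nstore(\%mappings, $snapshot) or die "cannot write $snapshot: $!";
    }

    # If the database is down or corrupt, fall back to the last snapshot.
    sub load_snapshot {
        return %{ retrieve($snapshot) };
    }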

Does anyone have any other ideas of how I could best go about this?



Finally, I am about to start planning the migration of the application from a Unix platform to a Linux platform. Are there any obvious considerations I should keep in mind when I start?

Posted by Custard (Custard), 13 November 2003
You could always do a load test on it.

1000 per hour isn't that many. That's 3.6 seconds per record (if my maths is correct).

If your mapping data is held in a database and you are worried about retrieving it slowly, and it's not vast amounts of data, you could always cache it into memory structures (hashes, arrays etc.) as you suggested.  I think databases like MySQL cache frequently accessed data anyway.  There's also a table type in MySQL (HEAP) which is held in memory for speed of access; you could load your data into that on startup/refresh and use it as a cache.
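
The cache-into-a-hash idea is only a few lines with DBI - the table and column names below are made up for the example, not taken from your application:

    use strict;
    use warnings;
    use DBI;

    # Connect once at startup; connection details are placeholders.
    my $dbh = DBI->connect('DBI:mysql:database=middleware;host=localhost',
                           'appuser', 'secret', { RaiseError => 1 });

    # Pull the whole mapping table into a hash keyed by message type and
    # field name, then work from the hash for the rest of the day.
    my %mapping;
    my $rows = $dbh->selectall_arrayref(
        'SELECT message_type, field_name, template FROM field_mapping',
        { Slice => {} });
    for my $row (@$rows) {
        $mapping{ $row->{message_type} }{ $row->{field_name} } = $row->{template};
    }
    $dbh->disconnect;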

Databases are quite quick though, and I bet they'd find an indexed record faster than you would using a text file.

If you're worried about the database getting corrupted, or failing, then IMHO the risk is the same as a text file corrupting or the disk failing. The data is, after all, stored on disk.

I have a similar application to yours that collects XML records and indexes them into several tables. Along the way the records get checked for duplication, have postcodes looked up, etc.  This app has processed 10-15,000 records an hour in the past.
The average speed it processes at is 26 per second, which, if my dodgy maths is right, is 93,600 per hour - although data comes in in bursts, so sustaining that would be very unusual, and of course these figures are dependent on the hardware I'm running.

Without knowing details of your application it is difficult to judge what is 'The Right Thing'.

It sounds like you need to give the database a try and do some load testing to see what happens.  You could also use DProf and dprofpp to profile your program and see where the slow bits are.  I ran this on mine and found that 50% of the time was spent opening and closing filehandles to the database tables, so I created a cache of open filehandles.  With a database, this optimisation would be a connection pool - probably overkill for this, though.
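
Profiling is only a two-step job from the shell - the script name here is just an example:

    # Run the program under Devel::DProf; timings are written to tmon.out
    perl -d:DProf translate.pl sample_message.xml

    # Summarise the profile, slowest subroutines first
    dprofpp tmon.out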

hth

B

ps. The machine was a Sun E250, 2 x 300MHz, and my database was Berkeley DB (Sleepycat) using tied hashes (see a previous long thread).

Posted by ChrisS (ChrisS), 13 November 2003
Thanks Custard,

Food for thought. As you may have assumed (and assumed correctly) I have little experience with databases.

I will have to set up the database before I start playing, whatever the solution, I guess...

Trial and error could be the key, I feel...

Posted by Custard (Custard), 13 November 2003
I think you're right.

To a certain extent you get a 'feel' for how a system will perform, but only based on experience.  For instance, in my current project I have a 'feeling' it will be able to process a hundred or so records a second, but I really won't know until I have it all implemented in a few weeks and have thrown lumps of data at it. I know I will be disappointed if I can only process four or five a second, even though that's five times faster than the current system.

Good luck with your project.

B

Posted by admin (Graham Ellis), 14 November 2003
It may turn out that I'm just paralleling Custard's reply, but several minds / thought sets on questions like this can help.

Some questions I would ask myself about the application if I were doing it:
a) How peaky is the traffic? If the average is 1000 per hour, what traffic do you expect at a peak time (say, after the end of Coronation Street if this is a pizza ordering site)?
b) What do you expect the CPU time per transaction to be? (I would look to running at, at most, around 30 to 40% of capacity most of the time.)
c) How realistic is the "quadruple" factor for the longer term, or is that simply a projected figure after (say) 3 years because that's the time period you've been asked to consider?  One of our contract customers gave me a metric of 74 cost centres when the application was written and told me it might increase "a little".  It's now over 1000 cost centres.  Might this happen to you?
d) How big will the data store get? You should avoid any architecture which would result in progressive slowing, such as a directory containing one file per recent message for the last 4 hours.

The good news is that I don't see any problem with this sort of traffic level, provided the application is neither too big nor too slow to start up for each hit, the data storage issue is addressed, and you don't have a major peak - but it does need careful thought.

* Data, yes, should be stored in a structure fit for the purpose, such as a MySQL relational database.
* Code startup should be very carefully considered; we're talking Perl here, and it might be expensive for the middleware to be started up for every message.  How about running it, or most of it, as a service listening on an IP port, or in another container such as mod_perl if this is web based? (There's a rough sketch of the port-listening approach below.)
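
A minimal sketch of the "service on an IP port" idea - a plain TCP listener that forks a child per connection; the port number and message handling are placeholders, not anything from Chris's system:

    use strict;
    use warnings;
    use IO::Socket::INET;

    $SIG{CHLD} = 'IGNORE';                  # let the OS reap finished children

    my $server = IO::Socket::INET->new(
        LocalPort => 5151,                  # placeholder port
        Listen    => 10,
        ReuseAddr => 1,
    ) or die "cannot listen: $!";

    while (my $client = $server->accept) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                    # child: handle one message, then exit
            my $message = do { local $/; <$client> };
            # ... translate $message and pass it on here ...
            close $client;
            exit 0;
        }
        close $client;                      # parent: go back to accepting
    }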

Further thought - be aware early on of synchronisation / interference issues between transactions.  This is an issue that's easily overlooked in development and can come back and bite you later!

Posted by ChrisS (ChrisS), 14 November 2003
Thanks Graham,

The application was originally designed with some guidance from you, so as all can imagine it's not too shabby!

The application currently acts as a listener waiting for inbound TCP/IP messages, at which point it spawns itself to handle each message through to completion. This raises another point... when I said 1000 messages, I was only thinking along the lines of transactions - one transaction can involve two or three messages being passed back and forth. Each message is assigned a unique id before being passed on via MQSeries so that return messages can be matched.

I know you can lock a file for writing to stop data becoming muddled when two applications try to write at exactly the same time... but can something similar be done with the DB?
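
The file side I mean is just flock, something like this (the file name is only an example); I'm guessing the DB has a rough equivalent along the lines of LOCK TABLES, but I'm not sure:

    use strict;
    use warnings;
    use Fcntl qw(:flock);

    # Take an exclusive lock before writing so two processes can't
    # interleave their output in the same file.
    open my $out, '>>', 'outbound.dat' or die "open: $!";
    flock($out, LOCK_EX)                or die "lock: $!";
    print {$out} "translated message here\n";
    close $out;                          # closing the file releases the lock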

I don't intend to log any data from transmissions... I only want to use the database for the application reference files, generally to make their maintenance easier and idiot-proof.

As for quadrupling the message levels over the coming years... well, marketing have been involved, so I'm hoping they're slightly optimistic (only joking). However, given the current setup I know it could handle a lot more... and I'm hoping the DB will increase the versatility of the app.

Anyway, thanks again for the help.



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.
