alexander Posted November 7, 2006

It however seems that Buffy has an algo in mind, but doesn't want to let anyone know what she's thinking... Buffy, can we point out flaws in your algorithms and you do the thinking? :)

Ok, so you want this to use the least storage or processing resources? It seems like it's one of the two: if you use little hard-drive space, you use a lot of system resources to process the info, or the reverse, using lots of storage and significantly fewer processing resources...?

Plus, say we come up with a crazy data structure, who will implement it?
Buffy Posted November 7, 2006 (Author)

> It however seems that Buffy has an algo in mind, but doesn't want to let anyone know what she's thinking... Buffy, can we point out flaws in your algorithms and you do the thinking? :)

No I don't! I actually like Q's suggestion; the only one I thought of initially was Laurie's, because it's the plain old canonical m<->n relationship mechanism that's the obvious solution. What I've found through experience is that by *changing the rules*--which Q has done by summarily saying "I'm going to get rid of a particular query you might *conceivably* want to run, but you really might not need"--you end up with order-of-magnitude improvements in performance. That's what I've been trying to prod you all to do! :) Keep trying!

> Ok, so you want this to use the least storage or processing resources...?

Disk usage and processing time are highly correlated. Since disk access is orders of magnitude slower than RAM access, there's a certain logic to emphasizing minimization of disk usage, but as a practical matter, RAM is so huge these days that you often end up caching the whole data table you're working with anyway, so that difference goes away. Thus the key to getting real improvements is to figure out a radically different approach that takes the algorithm from O(n^2) to at least O(n log n), or if you're really cool, O(n)!

> Plus, say we come up with a crazy data structure, who will implement it?

I will! I actually need the dang thing, although it may pop out initially in SQL Server sproc code and VB (fie!). You could use a subselect instead of an outer join... most query optimizers do this automatically under the covers anyway.

Algorithmic transubstantiation,
Buffy
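The subselect-instead-of-outer-join idea can be sketched concretely. This is a minimal example, assuming a hypothetical two-table schema (messages, plus a per-user messages_read marking table); the table and column names are my invention, not Buffy's actual design:

```python
import sqlite3

# Hypothetical schema: messages(id), messages_read(user_id, message_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY);
    CREATE TABLE messages_read (user_id INTEGER, message_id INTEGER);
    INSERT INTO messages (id) VALUES (1), (2), (3);
    INSERT INTO messages_read (user_id, message_id) VALUES (42, 2);
""")

# Unread messages for user 42 via a subselect rather than an outer join;
# many optimizers rewrite one form into the other anyway.
unread = conn.execute("""
    SELECT id FROM messages
    WHERE id NOT IN (SELECT message_id FROM messages_read
                     WHERE user_id = 42)
    ORDER BY id
""").fetchall()
print([row[0] for row in unread])  # -> [1, 3]
```

The equivalent outer-join form would LEFT JOIN messages_read and filter on NULL; the subselect just states the "not marked read" condition directly.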
CraigD Posted November 7, 2006

> Thus the key to getting real improvements is to figure out a radically different approach that results in the algorithm going from O(n^2) to at least O(n log n)

Under a typical language implementation, the MUMPS solution in post #12's query (ShowUnRd) has order O(log n + log m), where n is the number of messages and m the number of users in the database – considerably better than O(n). Though not demanded by the language standard, MUMPS implementations usually implement their data structures as balanced b-trees. Any database implemented as balanced b-trees has query performance of order O(log n), provided the query exactly matches an index. Such databases are fairly commonplace.
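The O(log n) behavior CraigD describes comes from probing a sorted index, which is what a balanced b-tree does internally. Not MUMPS, but the same idea can be sketched in Python with binary search over a sorted key list (the data here is made up for illustration):

```python
from bisect import bisect_left

def index_contains(sorted_keys, key):
    """O(log n) membership test on a sorted index,
    analogous to a single b-tree probe."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

# A sorted "index" of message ids a user has already read.
read_index = [3, 7, 11, 19, 42]
print(index_contains(read_index, 11))  # True
print(index_contains(read_index, 12))  # False
```

The key caveat from the post carries over: you only get O(log n) when the query exactly matches the index order; a query on an unindexed column degrades to a full scan.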
C1ay Posted November 7, 2006

> Disk usage and processing time are highly correlated. ... RAM is so huge these days that you often end up caching the whole data table you're working with anyway, so that difference goes away.

Then again, if you're using Prevayler you have the whole db in RAM.....
Killean Posted November 7, 2006

I've been thinking through this dilemma of yours for a while here. Message systems are so bloated with options like this. Anyways, let's do this: I've put together a program algorithm using chronological sorting. It has its problems using this method (as debuted here at Hypography a little while ago), but is sound enough should the system time remain unaltered. Creation of a script to fix the times if problems arise would be great to have in backup.

Here is my Time Format: YYYYMMDDhhmmss

Y - Year
M - Month
D - Day
h - Hour
m - Minute
s - Second

In the script, you can make comparisons of which timestamp is lesser or greater. Should the language support numbers into the ten trillions, you can compare them as integers. I know MySQL can understand the differences, as I have tried using this method of sorting before.

-30 Days: 20061009000000
Yesterday: 20061106000000
Today: 20061107000000
Tomorrow: 20061108000000

We need 3 tables to do our job:

Users(ID, UserName, Password, OtherInfo, TimeStamp, ThreadsRead)
  OtherInfo - emails, messengers, department, etc...
  TimeStamp - string in Time Format to show last activity
  ThreadsRead - sorted list of Thread IDs designated as read

Threads(ID, ThreadName, CreatorUserID, TimeStamp)
  TimeStamp - string in Time Format to show last post time

Posts(ID, ThreadID, CreatorUserID, TimeStamp, Message)
  TimeStamp - string in Time Format to show post time

Last thing you need is a way to set global data. What you use (cookies, sessions, temp files) is up to you.
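The fixed-width YYYYMMDDhhmmss format sorts correctly as plain integers, because the most significant field comes first and every field is zero-padded. A quick sanity check, using the example values above:

```python
# Killean's YYYYMMDDhhmmss timestamps compare correctly as integers:
# fields are fixed-width and ordered most-significant first.
yesterday = 20061106000000
today     = 20061107000000
tomorrow  = 20061108000000

print(yesterday < today < tomorrow)  # True

# The largest value needed (~2.1e13) fits easily in a 64-bit integer,
# which is the "numbers into the ten trillions" requirement.
print(tomorrow < 2**63 - 1)  # True
```

The same property is why the format works for string comparison and for MySQL ORDER BY: lexicographic and numeric order coincide.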
Two globals:
Logged In User Data
Logged In Read List

Now we start the areas the code needs to be:

Function (Update Read List) {
- Set a Time Limit (Time Format - #days)
- Split Logged In Read List
- Grab ID and TimeStamp from Threads DB where TimeStamp > Time Limit
- Grab TimeStamp from Users DB of Logged In User Data
- Start Loop of Threads Data
  - If Thread TimeStamp < User TimeStamp Then
    - Start Loop of Read List Array
      - If Thread ID is on Read List, add to Looped List
    - End Loop
- End Loop
- Sort Looped List
- Concatenate Looped List
- Save Concat List and current TimeStamp to Users DB
- Set new Logged In Read List from Concat List
}

Function (Thread Read)(Passed In: Thread ID) {
- Split Logged In Read List
- Start Loop of Read List Array
  - If Thread ID is on Read List Array, Then Flag True and Break Loop
  - Else Flag False
- End Loop
- Return Flag
}

Thread Display {
- Set a Time Limit (Time Format - #days)
- Grab All from Threads DB, ordering by TimeStamp
- Grab UserName and extras from Users DB for each Thread Data
- Function (Update Read List)
- Start Loop of Thread Data
  - If Thread TimeStamp > Time Limit Then
    - Output up to Thread Title
    - Function (Thread Read): if True, then Read Signifier
    - Finish Output
  - Else Output Thread Data with no Read Signifier
- End Loop
}

Post Display {
- Add Thread ID to Logged In Read List
- Save Logged In Read List to Users DB
- Grab All from Posts DB of Thread ID, ordering by TimeStamp
- Start Loop of Post Data
  - Grab All but Read List from Users DB of Post Data
  - Output Post
- End Loop
}

Login {
- Usual login output
- Set Logged In User Data and Read List from Users DB
}

Let us see: data storage should be at a minimum if you keep the Time Limit within a month or two. I suppose you can find a way to shrink the table usage down to just Users and Posts (by adding a post title and parent ID to the latter).
As for the queries:

Thread Display:
  4 Grabs (min) - 3 + # Threads above Time Limit
  1 Save

Post Display:
  2 Grabs (min) - 1 + # Posts
  1 Save

Thread grabs can be reduced to 2+, and Post grabs to 1, if you include the user name and extra info (like email or ID) in the Posts table. :)

5 hours past my usual 8 hours past bedtime. Be on the lookout for bugs.
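The two core routines from the pseudocode above can be sketched in a few lines of Python. This is a sketch under my own assumptions: the ThreadsRead column is taken to be a comma-separated string of thread IDs (the post only says "sorted list"), and timestamps are the integer form of the Time Format:

```python
def parse_read_list(read_list_str):
    """Split the stored ThreadsRead string into a set of thread ids.
    Assumes comma-separated ids, e.g. "3,7,42" (storage format is a guess)."""
    return {int(t) for t in read_list_str.split(",") if t}

def thread_read(read_list_str, thread_id):
    """Function (Thread Read): flag True if the id is on the read list."""
    return thread_id in parse_read_list(read_list_str)

def update_read_list(read_list_str, threads, user_last_seen):
    """Function (Update Read List): keep only read threads whose last post
    predates the user's last activity; a newer post makes a thread unread
    again.  `threads` maps thread id -> thread TimeStamp (YYYYMMDDhhmmss)."""
    read = parse_read_list(read_list_str)
    kept = sorted(t for t, ts in threads.items()
                  if t in read and ts < user_last_seen)
    return ",".join(str(t) for t in kept)

# Thread 1 got a new post after the user's last visit, so it drops off
# the read list; thread 2 stays read.
threads = {1: 20061107090000, 2: 20061101120000}
print(thread_read("1,2", 2))                             # True
print(update_read_list("1,2", threads, 20061106000000))  # 2
```

Note the space saving this buys: threads older than the Time Limit are never consulted, so the read list stays bounded regardless of forum size.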
Killean Posted November 11, 2006

Browsing through the vB config here, I came across this:

Thread/Forum Read Marking Type
This option controls how threads and forums are marked as read.

1. Inactivity/Cookie Based - once a user has been inactive for a certain amount of time (the value of the session timeout option), all threads and forums are considered read. Individual threads are marked as read within a session via cookies. This is how all versions of vBulletin before 3.5 functioned.

2. Database (no automatic forum marking) - this option uses the database to store thread and forum read times. This allows accurate read markers to be kept indefinitely. However, in order for a forum to be marked read when all threads are read, the user must view the list of threads for that forum. This option is more space- and processor-intensive than inactivity-based marking.

3. Database (automatic forum marking) - same as the previous option, but forums are automatically marked as read when the last new thread is read. This is the most usable option for end users, but the most processor-intensive.

I think 1 is exactly what I wrote above, except I save the 'cookie' data to the database on every page view (which may be almost as bad as 2 or 3).

Another way to look at my previous algorithm might be to just highlight all threads that have seen activity since your last visit, and create a global data list to selectively prevent highlighting of those you have read this session (or future sessions). You would also have to save your last activity in the global data, or the moment you log in (or wherever you place the save) your last-activity timestamp will be updated. :)
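That last variant (highlight everything active since the last visit, minus what the session has already seen) is simple enough to sketch directly. A minimal Python version, with function and variable names of my own choosing:

```python
def unread_highlights(threads, last_visit, read_this_session):
    """Highlight threads active since the user's last visit, except those
    already read this session (the per-session global data list).
    `threads` maps thread id -> last-post timestamp (YYYYMMDDhhmmss int)."""
    return sorted(t for t, ts in threads.items()
                  if ts > last_visit and t not in read_this_session)

threads = {1: 20061110090000,   # active since last visit -> highlight
           2: 20061101120000,   # no new activity -> skip
           3: 20061111080000}   # active, but read this session -> skip
print(unread_highlights(threads, 20061107000000, {3}))  # [1]
```

This matches the cost profile Killean describes for option 1: no per-thread read times in the database, just one last-visit timestamp plus transient session state.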