Message Read Algorithm/Data Structure: A Programming Challenge

Buffy · October 31, 2006

I'm too lazy to wade through the various code bases for vBulletin and the rest, so I thought I'd put this up as a program design challenge.

Problem Statement: I have a system that allows users to enter messages that can be read by other users (yes, think Hypography). I want the system to be able to tell each user what messages they have read or not.

Challenge: describe a data structure and program algorithm to 1) store the messages and "message read" information 2) find and report "list of unread messages" for a particular user.

Goals: Minimize space usage and Minimize query time.

Prize: The satisfaction of a job well done.

Format: You can use any language or descriptive approach you wish to convey the information. I don't care, I speak 'em all (and will translate for those of you who need it!)

There are a bunch of different approaches to this seemingly simple problem, but the trade-offs on the goals are critical. Choose wisely.

Survivors ready, go,

Buffy

Jay-qu · October 31, 2006

I dont really have the time to right out an algorithm, but I am thinking of a sorted linked list structure, with each node been a message with a boolean value for been read or not.

Buffy · October 31, 2006

I dont really have the time to right out an algorithm,

Remember, you don't need to really write a whole program, just an outline or approach...

...but I am thinking of a sorted linked list structure, with each node been a message with a boolean value for been read or not.

Right although see the following clarification, and think about the sizes: Hypography has nearly 140,000 posts and 5000 members. Do the math...not pretty! Choice of data structure and algorithm are key!

Clarification: When I say "data structure" I'm really talking about the database design. Its unlikely I have to keep any of this information in memory, and if I do, its an implementation detail that you can ignore.

Challenging,

Buffy

LaurieAG · November 1, 2006

Goals: Minimize space usage and Minimize query time.

Hello Buffy, here's one (psuedo Xbase) way, excuse the indenting. You don't really need to find or report anything.

Basic Data Structures

USERS

userID, unique, numeric/characters

FORUMS

forumID, unique, numeric/characters

MESSAGES

forumID, numeric/characters

messageID, unique numeric/characters

messageName, unique, characters

message, characters

readstatus, logical

userID, numeric/characters

usertoID, numeric/characters

* the program starts here

* start the main event loop

user_input='MAIN'

currentForum='MAIN'

DO WHILE user_input <> 'EXIT'

*** generate/refresh the generic screen bits etc for the currentForum script

*** test if current userName='GUEST' here

*** IF GUEST

****** don't generate/disable options for 'messages to me'/edit etc

*** ELSE

****** generate options for 'messages to me'/edit etc here

*** ENDIF

*** complete user/forum script generation here and load the generated script

*** WAIT for input

*** TRANSLATE user_input

*** DO CASE

****** CASE user_input = 'ADD message'

********* collect/input/select/check all required data here

********* select MESSAGES

********* get next unique number

********* append blank record

********* replace messageID with next unique number/character

********* replace forumID with currentforumID

********* replace messageName with messageTitle

********* replace message with messageText

********* replace readstatus with TRUE

********* replace userID with currentuserID

********* replace usertoID with touserID

****** CASE user_input = 'Messages to me'

********* select MESSAGES

********* set filter to usertoID=currentuserID && very minimal but useful

****** CASE user_input = 'View All Messages'

********* select MESSAGES

********* set filter to all && and easy to change back

ENDDO

RETURN

Buffy · November 1, 2006

Your algorithm is great if there's only one recipient, but I've posing a slightly more complicated problem:

Complexifying factor: Each message could potentially be read by any user, but I want to be able to tell if any particular user has read a particular message.

Hint: there really is more than one way to solve the problem!

Format hint: I don't need this much code detail, so don't work harder than necessary.

Permutationally,

Buffy

LaurieAG · November 1, 2006

Each message could potentially be read by any user, but I want to be able to tell if any particular user has read a particular message.

All you do is add a new MESSAGEREAD data table that contains userID and messageID fields for all messages viewed on the forum. This table has a new record added for each individual message on the screen when any user accesses or navigates a screen (unless you have some other way of telling like the user pressing a button etc?). You can SELECT unique userID from MESSAGEREAD for messageID = currentmessageID to see which users have read the current message, or SELECT unique messageID from MESSAGEREAD for userID = currentuserID to see which messages have been read by the current user.

Also, a very large file with slow access can be created if you use unique character ID fields so it is more efficient to use the much smaller numeric ID field formats in your initial design.

Buffy · November 1, 2006

Good solution--again, not the only one--but it still has weaknesses. If your most common request is "show me the messages I have not seen" then:

you have a full-table scan against a very large table that grows O(n^2)
the result is produced by a xor outer-join which is normally quite inefficient

Can you think of other solutions?

Denormalized,

Buffy

alexander · November 2, 2006

select messageid from user_read where messageid like "%message_beind_searched_for%" - query gets you whether or not you have read a particular message

to get unread messages:

$query1= mysql_query("select messageid from messages", connection);

$query2 = mysql_query("select messageid from user_read", connection);

explode($query1, $total);

explode($query2, $read_array);

$not_total=count($read_array);

foreach($total as $msg)

{

*$r=false;

*foreach($read_array as $read)

*{

**if($msg==$read)

**{

***echo "read";

***$r=true;

***break;

**}

*}

*if(!$r)

*{

**echo "$msg - message not read";

*}

}

just to demo buffy's fears :hihi:

Qfwfq · November 3, 2006

I would say first of all that it's more reasonable to track how many posts of a thread (i. e. up to which post) each user has viewed. The cartesian product to handle is therefore threads times users, and if user Tom has never clicked into thread Blueberries you don't need to have an instance for that pair, saving loads of table rows.

THID | UID | number

The two IDs should belong to the PK and to an index. If Big Brother wants to know if user Tom has viewed post #37 of a thread, select for that thread and user and see if number >= 37. If Tom has skipped pages in that thread, BB can jolly well pound salt (and, of course, he's got the same problem if Tom visited that thread without bothering to actually read).

LaurieAG · November 4, 2006

you have a full-table scan against a very large table that grows O(n^2)
the result is produced by a xor outer-join which is normally quite inefficient

The basic structure is optimised to reduce these problems.

You can get over the table size issue by using only 2 numeric key fields with an index for each field in the readmessage file. With an index set to user or message, the first user or message searched for will be in numeric order and all other relevant message/user results will follow directly in order after the first. The much smaller results file can then be joined to the much smaller messages and/or users files for further analysis. These things don't really change, whichever way you look at it.

Other solutions, with regards to speed and size, really depend on how much information you want to gather.

moo · November 4, 2006

Two options:

(1) log user IDs to message data

(2) log message IDs to user data

The method used would depend on whether you plan to scan messages or users to determine who read what. Scanning both requires both.

Assuming users come and go, (1) will clog message data with seldom used info, so I'd log message IDs to user data (to be deleted with dead accounts).

My 2 cents (allowing for inflation)... :hihi:

moo

CraigD · November 4, 2006

Here’s actual MUMPS code, with the data specification as comment (“;”) text, that does what Buffy requested, and the little more necessary to demonstrate that it does:

 ;Global spec:
;^MBMsg(MsgID,1)=Message title
;^MBMsg(MsgID,2,UserID)=nil
;^MBUser(UserID,1)=User name
;^MBUser(UserID,2,MsgID)=nil
NewMsg(MsgTitle) ;create a new message titled MsgTitle
s MsgID=$o(^MBMsg(""),-1)+1,^MBMsg(MsgID,1)=MsgTitle
s UserID=""
f  s UserID=$o(^MBUser(UserID)) q:UserID=""  s (^MBMsg(MsgID,2,UserID),^MBUser(UserID,2,MsgID))=""
q
ShowUnRd(UserID) ;show unread message titles for user UserID
w "Unread messages for ",^MBUser(UserID,1),!
s MsgID=""
f Count=1:1 s MsgID=$o(^MBUser(UserID,2,MsgID)) q:MsgID=""  d
. w Count,". ",^MBMsg(MsgID,1),!
w "(",Count-1," unread messages)",!
q
MarkRd(UserID,MsgID) ;mark a message as read
k ^MBMsg(MsgID,2,UserID) ,^MBUser(UserID,2,MsgID)
q

It’s slightly more verbose than it could have been. Total design and build time: 11 minutes. MUMPS is very terse.

The following code will create a new message:

d NewMsg(“Test message title”)

This will output a text listing of unread messages for a particular user with ID 123:

d ShowUnRd(123)

This will remove message 456 from the unread list of user 123:

d MarkRd(123,456)

From an efficiency perspective, this algorithm is “front loaded” – the most costly function is NewMsg, which must execute 1 medium, 2 low, and 1 high-cost instruction for every user in the user database (the ^MBUser global). Listing unread messages is almost as low cost as possible, as is marking a message as read.

Buffy · November 4, 2006

Think outside the box people! So far I'm only giving points to Q's answer. Why? Its really got to do with Laurie's last post: In the real world, you've got to define your *own* restrictions based on what *you* think the problem is. Q does a *great* job of this!

Keep thinking! :doh:

Computationally, permutations are totally sucky,

Buffy

CraigD · November 4, 2006

Computationally, permutations are totally sucky

Not to whine, but the solution I gave in post #12 scales at worst linearly with the size of the user database – not what I’d call permutations. Practically, the database engine running on a single commodity box can handle a user for every human on earth with only about a 20 minute lag in posting time (The IO, processor, and storage farm would be a different matter!)

This version avoids the front-loading in NewMsg altogether, at the expense of adding 4 instructions for every message the user has read in ShowUnRd. No posting lag, but more whirring and clicking

NewMsg(MsgTitle) ;create a new message titled MsgTitle
s MsgID=$o(^MBMsg(""),-1)+1,^MBMsg(MsgID,1)=MsgTitle
q
ShowUnRd(UserID) ;show unread message titles for user UserID
w "Unread messages for ",^MBUser(UserID,1),!
s MsgID="",Count=0
f  s MsgID=$o(^MBMSg(MsgID)) q:MsgID=""  d:'$d(^(MsgID,2,UserID))
. s Count=Count+1 w Count,". ",^MBMsg(MsgID,1),!
w "(",Count," unread messages)",!
q
MarkRd(UserID,MsgID) ;mark a message as read
s (^MBMsg(MsgID,2,UserID),^MBUser(UserID,2,MsgID))=""
q

The best MUMPS solution would, I think, be a hybrid of the 2, but, involving background processes, would be a bit unwieldy to post. (Though it could still be written in under 30 minutes)

I’m not sure this problem requires much thinking outside the box. Some problems and solutions are squarely in the box.

C1ay · November 5, 2006

I think it is safe to assume that at the time a post is made it's default state is unread by all users. This is the default for all of the posts in the system and it is really not necessary to use any space to store that in the db.

This only leaves the need to actually store a list of users that have read a message for each message in the system OR a list of messages that each user has read for each user in the system. Both of these datasets are the same size so either way is the same from a storage point of view.

That said it seems the minimum required would be to have a table with two columns, one for messageIDs read by one or more users and one for the userIDs that have read those messages. The a simple select statement can return a list of read or unread messages by user or a list users that have read or not read any particular message. You could additionally create an index on this table if faster queries are needed.

Southtown · November 5, 2006

Think outside the box people! So far I'm only giving points to Q's answer. Why? Its really got to do with Laurie's last post: In the real world, you've got to define your *own* restrictions based on what *you* think the problem is. Q does a *great* job of this!

Keep thinking! :D

Computationally, permutations are totally sucky,
Buffy

Attach tables to the user with forum, thread, and post IDs successively as they exhaust them. (i.e. when some posts are loaded, check to see if all posts in that thread have been displayed, and flag the thread read also.) Then only query what's loaded into a particular window. Keeps you from querying threads or posts on the forum index page, for example... Could you accomplish the same thing by querying only the relevant portion of one big table?

C1ay · November 5, 2006

Good solution--again, not the only one--but it still has weaknesses. If your most common request is "show me the messages I have not seen" then:
you have a full-table scan against a very large table that grows O(n^2)
the result is produced by a xor outer-join which is normally quite inefficient
Can you think of other solutions?

Denormalized,
Buffy

You could use a subselect instead of an outer join....

Sign In

Message Read Algorithm/Data Structure: A Programming Challenge

Recommended Posts

Buffy

Jay-qu

Buffy

LaurieAG

Buffy

LaurieAG

Buffy

alexander

Qfwfq

LaurieAG

moo

CraigD

Buffy

CraigD

C1ay

Southtown

C1ay

Join the conversation

Browse

Activity