spec.txt
Sylbi Specification & Notes
"Blurring the line between a blog and a forum -- Where conversations are key."
Sylbi is a conversation system that provides the functionality of both a blog
platform and a forum.
TODO List
- Pagination of conversation lists in blog and topic buckets
- Moderator functionality (delete posts, conversations)
- Character encoding... ?
- User preferences (like change password)
Definitions
USER          A user account, representing an actual person
TOPIC         A bucket for conversations within a certain category, or topic
POST          A post is either an ENTRY or a RESPONSE
CONVERSATION  A chronological list of posts where the first post is an ENTRY
              and the remaining posts are RESPONSES
ENTRY         A new posting, not flagged as a response to another POST
RESPONSE      A new posting in response to an existing posting, regardless of
              that posting's type
Perspectives
Blogs with comments and forums share almost exactly the same characteristics in
that they are a chronological set of postings initiated by someone and replied
to by others. The main difference is the perspective they present. A blog
presents the perspective of an individual (or set of individuals, typically
small), while a forum is the perspective of a category defined on the forum that
users may make an ENTRY or a RESPONSE to. The perspective of the individual is
subjugated by the perspective of the forum.
Other than that, the two things are extremely similar. Sylbi merely integrates
the two perspectives, and adds a slick, low cost categorization technique.
1. User perspective (blog)
Once a user is registered in a Sylbi site, they immediately have a presence and
view that is solely their own. By selecting the user account, you may view their
space, which may be customized to their preference. The user space provides all
the postings made by that particular user.
E.g.
User space for user A (home page):
________________________________
User A's Blogorium
--------------------------------
[Entry 1]
[Entry 2]
[Entry 3]
[ [Entry X]
[Response] ]
...
________________________________
POSTS are listed newest first.
ENTRIES may be clicked to see any responses to them. This is effectively the
CONVERSATION view.
RESPONSES are shown with a fragment of the initial ENTRY in the CONVERSATION to
indicate that it is not an ENTRY.
RESPONSES made to a user's own entry (i.e. a RESPONSE to a RESPONSE where user A
made the initial ENTRY) are not displayed on the user's home page.
2. Topic perspective (forum)
The site admin sets up Topic Buckets, which are topics that he wants to
focus his site on. This is similar to a forum, where the site admin creates
categories for discussion.
Instead of asking users to pick the appropriate topic bucket, user posts are
automatically categorized to buckets by vector-based relevancy check.
?? When an admin creates a Bucket, he can describe what he envisions the Bucket
to contain. This is used as the initial vector by which other posts are included
or excluded from the Bucket.
Admins may manually assign a conversation to a bucket. This is how the bucket is
trained to identify future conversations and add them. It implies that an admin
will have extra work initially when launching a site, but that over time things
will categorize easily as the buckets will become "trained".
There is a "catchall" bucket created by default. It may not be removed. Posts
and conversations that cannot be categorized yet (or ever), reside in this
bucket until they are either expired (an admin setting), deleted manually by an
admin, or found to match a threshold and make it into an admin created bucket.
Admins can adjust the length of time a conversation may live in "catchall"
before expiring. This is an easy way for an admin to enforce "topicalness" on
his site.
The admin determines the vector threshold for each bucket. A lower threshold
means that the bucket has a "loose" topicalness; a higher threshold means it is
more "focused".
Graphical view of buckets:
________________________________
   ____        ____
  (____)      (____)
   \__/        \__/
  Topic 1     Topic 2
   ____        ____
  (____)      (____)
   \__/        \__/
  Topic 4     catchall
________________________________
List views:
________________________________
Bucket ...Entry lead in...
----- ----------------------
----- ----------------------
----- ----------------------
----- ----------------------
----- ----------------------
----- ----------------------
----- ----------------------
----- ----------------------
----- ----------------------
________________________________
View of conversations -- Entries that have one or more responses
View of entries -- All entries, regardless of responses
3. Conversation perspective
Conversations are listed with the initial entry at the top, and all the
responses following. The view is flat, but the responses are ordered in a nested
manner: Responses at a given level N are listed chronologically. However,
responses to a post P (nesting) immediately follow that post in the view. A
response R to a post creates a new level, L, below P's level, N. Any additional
posts at L are chronological. But at level N, R follows P, and any responses to
R immediately follow R, and then each other, chronologically. Higher level posts
"hold place" for lower level responses.
E.g.
View remains flat, although responses are nested:
________________________________
[A. Entry]
[B. Response to A] level 1
[C. Response to A] level 1
[D. Response to C] level 2
[E. Response to D] level 3
[F. Response to A] level 1
________________________________
C held place for D, and D held place for E, all of them pushing F toward the
bottom.
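The ordering rule above amounts to a depth-first walk in which the responses to
any one post are visited oldest first. A minimal sketch follows, written in
Python purely for illustration (Sylbi itself is planned in Perl). The function
name and the (post_id, parent_id, epoch_ts) tuple layout are assumptions made
for this example, not part of the schema.

```python
from collections import defaultdict


def flatten_conversation(entry_id, responses):
    """Return (post_id, level) pairs in display order.

    responses is an iterable of (post_id, parent_id, epoch_ts) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    children = defaultdict(list)
    for post_id, parent_id, epoch_ts in responses:
        children[parent_id].append((epoch_ts, post_id))
    for siblings in children.values():
        siblings.sort()  # chronological order within one level

    ordered = []
    stack = [(entry_id, 0)]
    while stack:
        post_id, level = stack.pop()
        ordered.append((post_id, level))
        # Push in reverse so that the oldest response is popped first.
        for _, child_id in reversed(children[post_id]):
            stack.append((child_id, level + 1))
    return ordered


# The example conversation from the text: B, C and F respond to A,
# D responds to C, and E responds to D.
responses = [
    ("B", "A", 1),
    ("C", "A", 2),
    ("D", "C", 3),
    ("E", "D", 4),
    ("F", "A", 5),
]
order = flatten_conversation("A", responses)
for post_id, level in order:
    print(post_id, "level", level)
```

Because the whole subtree under C is emitted before the walk returns to level 1,
F lands at the bottom even if it had been posted before D or E.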
The flat view is easier to read, and if you get into a set of responses on a
certain tangent you are not interested in, you will naturally and easily scan
quickly past until you are back in the tangent you want to read about. This is
how forums are typically read through anyway.
Development Targets
Sylbi should be easy to implement
Sylbi should be as easy to install as possible, regardless of platform. It will
be written in Perl, and all the needed modules will be bundled with it so that
users have to execute commands in as few places as possible.
Sylbi should reach the largest number of users
It will be written using CGI as opposed to mod_perl, because it can then run on
cheap web hosting (like I have) without any real server configuration. (It will
use MySQL, but database creation will be easy in this scenario, too.) Also, Perl
CGI can be run under IIS without a problem. mod_perl is Apache specific.
Sylbi should have strong separation of concerns (SOC)
Sylbi will use Template::Recall to keep HTML out of the code. This should also
make it easy for users to customize their personal views (home page).
Sylbi should contain a logical OO code structure
Sylbi will maintain an object oriented structure and use a uniform coding style,
and implement the popular Model/View/Controller abstraction (via CGI::Application).
Application layout
Sylbi will contain code, installation files, and modules together - even modules
that we get from CPAN.
It will use CGI::Application to handle the run-modes.
It will use CGI::Session to deal with user session state. (I'm leaning towards
using filesystem for state storage. One less demand on the MySQL server.)
It will use Template::Recall to provide templating.
Directory layout:
/index.cgi            (Instance script, calls /msite/Sylbi/Index.pm)
/.cgi
/...etc...
/msite/               (Modules)
    /CGI/Application
    /Template/Recall
    /Sylbi/Index.pm   (Inherits CGI::Application)
    /Sylbi/...etc...
    /Sylbi/Data.pm    (The data abstraction layer. Methods interact with SQL)
/templates/           (Used by Template::Recall)
/session/             (Session state cache)
Notes on templates
Sylbi uses Template::Recall for templating.
.htmt HTML templates
.plt Perl templates (i.e. anything to be processed by "eval")
.sqlt SQL templates
Administration
The first account created once a Sylbi site has launched is the admin. This user
is responsible for the following:
* Creating buckets and managing them
    * Remove bucket
    * Set bucket threshold
    * Write initial bucket description
* Managing user accounts
    * Deleting
    * Banning
    * Promoting to admin
    * TODO: moderators?
* Managing conversations
    * Adding to buckets
    * Moving to alternative buckets (re-categorization)
    * Deleting
* Administering posts
    * Deleting (e.g. offensive)
Data structure (MySQL schema)
-- MySQL dump 10.11
--
-- Host: localhost Database: sylbi
-- ------------------------------------------------------
-- Server version 5.0.45-community-nt
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `blog_config`
--
DROP TABLE IF EXISTS `blog_config`;
CREATE TABLE `blog_config` (
`user_id` mediumint(8) unsigned NOT NULL,
`title` varchar(255) default NULL,
`index_template` text,
`convers_template` text,
PRIMARY KEY (`user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
--
-- Table structure for table `convers_termcache`
--
DROP TABLE IF EXISTS `convers_termcache`;
CREATE TABLE `convers_termcache` (
`convers_id` mediumint(8) unsigned NOT NULL default '0',
`termcache` text,
PRIMARY KEY (`convers_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
--
-- Table structure for table `conversations`
--
DROP TABLE IF EXISTS `conversations`;
CREATE TABLE `conversations` (
`convers_id` int(10) unsigned NOT NULL auto_increment,
`post_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`convers_id`),
KEY `post_id` (`post_id`)
) ENGINE=MyISAM AUTO_INCREMENT=48 DEFAULT CHARSET=latin1;
--
-- Table structure for table `posts`
--
DROP TABLE IF EXISTS `posts`;
CREATE TABLE `posts` (
`post_id` int(10) unsigned NOT NULL auto_increment,
`user_id` mediumint(8) unsigned default NULL,
`epoch_ts` int(10) unsigned default NULL,
`edit_epoch` int(10) unsigned default NULL,
PRIMARY KEY (`post_id`),
KEY `user_id` (`user_id`),
KEY `epoch_ts` (`epoch_ts`)
) ENGINE=MyISAM AUTO_INCREMENT=202 DEFAULT CHARSET=latin1;
--
-- Table structure for table `posts_text`
--
DROP TABLE IF EXISTS `posts_text`;
CREATE TABLE `posts_text` (
`post_id` int(10) unsigned NOT NULL,
`post_text` text,
PRIMARY KEY (`post_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
--
-- Table structure for table `responses`
--
DROP TABLE IF EXISTS `responses`;
CREATE TABLE `responses` (
`convers_id` int(10) unsigned NOT NULL,
`post_id` int(10) unsigned NOT NULL,
`r_post_id` int(10) unsigned NOT NULL,
KEY `post_id` (`post_id`),
KEY `convers_id` (`convers_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
--
-- Table structure for table `topics`
--
DROP TABLE IF EXISTS `topics`;
CREATE TABLE `topics` (
`topic_id` mediumint(8) unsigned NOT NULL auto_increment,
`topic_name` varchar(100) default NULL,
`threshold` float unsigned default NULL,
`description` text,
PRIMARY KEY (`topic_id`)
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=latin1;
--
-- Table structure for table `topics_links`
--
DROP TABLE IF EXISTS `topics_links`;
CREATE TABLE `topics_links` (
`topic_id` mediumint(8) unsigned default NULL,
`convers_id` int(10) unsigned default NULL,
`epoch_ts` int(10) unsigned default NULL,
KEY `topic_id` (`topic_id`),
KEY `convers_id` (`convers_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
--
-- Table structure for table `topics_termcache`
--
DROP TABLE IF EXISTS `topics_termcache`;
CREATE TABLE `topics_termcache` (
`topic_id` mediumint(8) unsigned default NULL,
`termcache` text,
KEY `topic_id` (`topic_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
--
-- Table structure for table `users`
--
DROP TABLE IF EXISTS `users`;
CREATE TABLE `users` (
`email_addr` varchar(50) NOT NULL,
`passwd` varchar(32) NOT NULL,
`handle` varchar(15) default NULL,
`user_type` varchar(1) NOT NULL,
`confirmed` tinyint(1) default '0',
`user_id` mediumint(8) unsigned NOT NULL auto_increment,
`signed_ts` int(10) unsigned default NULL,
PRIMARY KEY (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=latin1;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
-- Dump completed on 2007-11-17 23:48:45
Matching conversations to buckets
I had first thought of using a vector based approach, but chose the following
for performance reasons. This has many similarities to Bayesian spam
categorization. Instead of there being two classifying categories (spam, ham),
there are N categories, one for each topic bucket. Each topic is compared, and the
highest scoring one gets the conversation. This is sped up by caching the hash
of the topic stopwords in SQL.
UPDATE 10/2/2007 --
Improvement of the algorithm for comparing a conversation to a topic bucket.
---
Hash of stopwords in conversation with count, (C)
Hash of stopwords in topic with count, (T)
Intersection of stopwords in C and T, (I)
Calculate the score (this is what gets compared against the bucket's threshold):
    (   ( sum(I) / sum(C) )
      + ( sum(I) / sum(T) )
      + max(0, 1 - ( sum(C) / sum(T) ))
    ) / 3
The resulting score is an average of the following factors --
- The ratio of the stopword weight from the intersection I and the conversation C.
- The ratio of the stopword weight from the intersection I and the topic T.
Note: This is typically a low number, because a conversation will usually be
much smaller than a topic
- The reverse ratio of the sum of all the stopwords in both the conversation and
the topic with a floor of zero. This gives larger conversations that are
comparable in size to the topic a smaller number -- at this point the other
two factors should be more important. When a conversation is small, however,
it needs this number to bump the average up and make a more fair comparison.
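The three-factor average can be sketched as a small function. This is Python
for illustration only, not Sylbi's Perl code. One point the notes leave open is
which counts sum(I) uses; the sketch assumes a multiset intersection (the
smaller of the two counts for each shared term), and both that choice and the
function name are assumptions.

```python
from collections import Counter


def bucket_score(conv_counts, topic_counts):
    """Average of the three factors described in the notes.

    Assumption: sum(I) takes the smaller of the two counts for every
    term that appears in both hashes (multiset intersection).
    """
    sum_c = sum(conv_counts.values())
    sum_t = sum(topic_counts.values())
    if sum_c == 0 or sum_t == 0:
        return 0.0  # nothing to compare against

    shared = Counter(conv_counts) & Counter(topic_counts)  # min of counts
    sum_i = sum(shared.values())

    conv_ratio = sum_i / sum_c              # share of the conversation that matched
    topic_ratio = sum_i / sum_t             # share of the topic that matched
    size_bonus = max(0.0, 1 - sum_c / sum_t)  # floor of zero
    return (conv_ratio + topic_ratio + size_bonus) / 3


# Hypothetical counts: a small conversation against a larger topic.
conversation = {"perl": 2, "cgi": 1, "cats": 1}
topic = {"perl": 10, "cgi": 6, "mysql": 4}
score = bucket_score(conversation, topic)
print(round(score, 3))
```

In this example the topic ratio is only 3/20, and the size term (1 - 4/20)
is what lifts the average for the small conversation.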
PREVIOUS NOTES
I will use a "relevancy average" between the bucket and the conversation that
calculates the relevancy based on the importance of the terms matched to the
bucket, and the importance of the terms in the conversation. Importance is
defined by term count. It works as follows.
We have a bucket Y with the following word frequencies
freq. word
--------------
10 A
10 B
7 C
6 D
5 E
4 F
4 G
--------------
46 total occurrences
Now, we have a conversation Z with the following frequencies
freq. word
--------------
5 A
3 B
3 C
2 D
1 E
--------------
14 total occurrences
Say that the following words match between the two:
Z[B] = Y[C]
Z[C] = Y[D]
Z[D] = Y[B]
So taking C, D, and B from Y, we get an aggregate frequency count of 7+6+10=23.
Out of 46 possible occurrences, these words constitute 23/46 = .50 of the
"important" terms in the bucket.
Taking B, C, and D from Z, we get an aggregate of 3+3+2=8. So, 8/14=.57. This
means that the words that matched against the bucket make up 57% of the
important terms in the conversation.
If we take an average of these two numbers, (.50 + .57) / 2 = .535. Depending
on the threshold set by the admin of Sylbi, this conversation will make it into
the bucket, or not.
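The worked example can be checked with a few lines of code. The sketch below is
Python for illustration only. It swaps the letter labels for hypothetical words
so that matching terms share a key, while keeping the example's counts; the
function name is also an assumption.

```python
def relevancy_average(bucket_counts, conv_counts):
    """Average the matched share of the bucket and of the conversation."""
    bucket_total = sum(bucket_counts.values())
    conv_total = sum(conv_counts.values())
    if bucket_total == 0 or conv_total == 0:
        return 0.0
    matched = set(bucket_counts) & set(conv_counts)
    bucket_share = sum(bucket_counts[t] for t in matched) / bucket_total
    conv_share = sum(conv_counts[t] for t in matched) / conv_total
    return (bucket_share + conv_share) / 2


# Hypothetical words standing in for bucket Y (46 occurrences) ...
bucket_y = {"perl": 10, "cgi": 10, "mysql": 7, "forum": 6,
            "blog": 5, "topic": 4, "post": 4}
# ... and conversation Z (14 occurrences). "mysql", "forum" and "cgi"
# are the three matching terms.
conversation_z = {"apache": 5, "mysql": 3, "forum": 3, "cgi": 2, "iis": 1}

score = relevancy_average(bucket_y, conversation_z)
print(round(score, 4))
```

Without intermediate rounding the result is (23/46 + 8/14) / 2, or about .536;
the .535 in the text comes from rounding 8/14 to .57 first.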
There will necessarily be a limit on the number of terms contained in a bucket's
frequencies. This is both because of performance and space constraints. Once you
eliminate stop words and reduce terms to their stems (Lingua::Stem), I think
that each bucket will have from 250-500 possible important terms. I will have to
test this to be sure.
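A rough sketch of how such a capped term list might be built is given below, in
Python for illustration only. The stop word set is a tiny hypothetical sample,
the default cap of 500 is taken from the upper end of the estimate above, and
stemming (Lingua::Stem in the Perl implementation) is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for the example.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "on"}


def build_term_list(text, max_terms=500):
    """Count non-stop-word terms and keep only the most frequent ones.

    Stemming is omitted here; a real implementation would reduce each
    word to its stem before counting.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return dict(counts.most_common(max_terms))


term_list = build_term_list("The cat and the hat. The cat sat on the mat.",
                            max_terms=2)
print(term_list)
```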
When a user creates an entry, matching is tried. If the entry cannot be
resolved to a bucket, it goes into "catchall". Then, each time a user posts a
response, the same function is executed, with the default behavior that
the conversation remains in "catchall".
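The "try to match, otherwise stay in catchall" behavior might look like the
sketch below (Python for illustration only). It assumes that only buckets whose
own threshold is met are candidates and that the best-scoring candidate wins.
The scoring function is passed in as a parameter, and the simple overlap_share
used here is just a stand-in, not the scoring described in these notes.

```python
CATCHALL_ID = 0  # the spec gives the "catchall" bucket a topic_id of 0


def choose_bucket(conv_counts, buckets, score):
    """Return the topic_id of the best matching bucket, or CATCHALL_ID.

    buckets maps topic_id -> (threshold, term_counts); this layout is
    an assumption made for the example.
    """
    best_id, best_score = CATCHALL_ID, 0.0
    for topic_id, (threshold, topic_counts) in buckets.items():
        current = score(conv_counts, topic_counts)
        if current >= threshold and current > best_score:
            best_id, best_score = topic_id, current
    return best_id


def overlap_share(conv_counts, topic_counts):
    """Stand-in score: share of the conversation's counts that matched."""
    total = sum(conv_counts.values())
    if total == 0:
        return 0.0
    matched = sum(n for term, n in conv_counts.items() if term in topic_counts)
    return matched / total


buckets = {
    1: (0.5, {"perl": 10, "cgi": 6}),
    2: (0.5, {"cats": 8, "dogs": 5}),
}
on_topic = choose_bucket({"perl": 3, "cgi": 1, "cats": 1}, buckets, overlap_share)
off_topic = choose_bucket({"weather": 2}, buckets, overlap_share)
print(on_topic, off_topic)
```

Running the same function again after every response gives a conversation
further chances to leave "catchall" as its term list grows.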
If the conversation matches a bucket (it must check all buckets, and take the
highest match), or if the admin forces the conversation into a bucket manually,
the following must occur.
All terms that exist in the bucket have their frequency incremented by the terms
that match in the conversation. E.g. from the example above
Y[C] = 7+Z[B], or 7+3 = 10
Y[D] = 6+Z[C], or 6+3 = 9
Y[B] = 10+Z[D], or 10+2 = 12
For the remaining terms, we insert them into the bucket as new terms, if there
are slots available at the bottom, or if they have a high enough count to push
other terms down. We'll assume that Y, with 7 elements, is at the maximum
number of terms it can hold. Z[E] contains only one count. The lowest term in
Y has a count of 4, so Z[E] is discarded. Z[A], however, has a count of 5, which
is greater than elements Y[F,G]. It is inserted ahead of them, becoming Y[F],
effectively. Y[F] then becomes Y[G], and the original Y[G] is lost. Our new
topic bucket then becomes
freq. word
--------------
10 A
12 B
10 C
9 D
5 E
5 F (new term)
4 G (previously word 'F'; the old 'G' is lost)
--------------
55 total occurrences
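The update step can be sketched as "add the counts, then keep the top N terms".
The Python below is for illustration only. It uses hypothetical words that
carry the example's starting counts, and it assumes that on a tie an existing
bucket term keeps its slot ahead of a newcomer.

```python
def merge_into_bucket(bucket_counts, conv_counts, max_terms):
    """Add a conversation's term counts to a bucket, capped at max_terms."""
    merged = dict(bucket_counts)
    for term, count in conv_counts.items():
        merged[term] = merged.get(term, 0) + count
    # sorted() is stable, so among equal counts the terms that were
    # already in the bucket (inserted first) stay ahead of new ones.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_terms])


# Hypothetical words standing in for bucket Y and conversation Z.
bucket_y = {"perl": 10, "cgi": 10, "mysql": 7, "forum": 6,
            "blog": 5, "topic": 4, "post": 4}
conversation_z = {"apache": 5, "mysql": 3, "forum": 3, "cgi": 2, "iis": 1}

updated = merge_into_bucket(bucket_y, conversation_z, max_terms=7)
for term, count in updated.items():
    print(count, term)
```

Here "apache" (count 5) takes a slot, "post" (the lowest original term) drops
out, and "iis" (count 1) never makes it in.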
Until a conversation is added to a bucket, every time a post is made, the
conversation's term list is loaded from cache (convers_termcache). (The
exception is the first entry, whose term list is saved in cache initially if it
cannot immediately be assigned a bucket.) The conversation term list (TL) is
compared against the topic bucket's TL, and if the match is above the threshold,
the conversation is assigned to the bucket and removed from convers_termcache.
Its terms are also added to the topic's termcache, making it a "more trained"
list.
The conversation may be manually moved after initial addition to another topic by an
administrator. Otherwise, it remains with the topic bucket it matched, and future posts
need not do the comparison.
If no buckets match, terms in the conversation's TL are incremented, and new
terms added. We will also need to check the size of the hash structure before
saving it to the database, to ensure that it will fit (we'll have to drop the
lowest frequency terms at this point).
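One way to do the size check is sketched below, in Python for illustration
only. The termcache columns are MySQL TEXT, which holds at most 65,535 bytes.
Storing the hash as JSON is purely an assumption for this example, since the
notes do not say how the hash is serialized.

```python
import json

TEXT_COLUMN_BYTES = 65535  # maximum size of a MySQL TEXT column


def serialize_term_list(term_counts, max_bytes=TEXT_COLUMN_BYTES):
    """Serialize a term list, dropping the rarest terms until it fits.

    JSON is a stand-in format chosen for the example.
    """
    ranked = sorted(term_counts.items(), key=lambda item: item[1], reverse=True)
    while True:
        payload = json.dumps(dict(ranked), separators=(",", ":"))
        if len(payload.encode("utf-8")) <= max_bytes or not ranked:
            return payload
        ranked.pop()  # drop the lowest-frequency term and try again


# A deliberately tiny limit so that the trimming is visible.
payload = serialize_term_list({"perl": 10, "cgi": 6, "mysql": 1}, max_bytes=20)
print(payload)
```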
topics_termcache is a permanent storage of TL, convers_termcache is not.
The special topic bucket, "catchall" is created initially, and has topic_id of
0. Sylbi is always trying to move conversations from this topic to a "real"
topic. catchall has an empty description and no TL. If no other buckets are
defined, Sylbi does not perform any categorization attempts.
Re-categorization
If a conversation is re-categorized, this means that the administrator felt that
it belonged to a different bucket than the one it was placed in. A
re-categorization causes Sylbi to assemble a TL for the conversation and subtract
the matching values from the existing bucket. It then performs the above
categorization to the destination bucket.
If a conversation is deleted, its TL is subtracted from the topic's TL.
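Removing a conversation's contribution from a bucket could be sketched as
follows (Python for illustration only). It assumes that a term whose count
falls to zero or below is removed from the bucket entirely; the notes do not
specify this, and the function name is hypothetical.

```python
def subtract_from_bucket(bucket_counts, conv_counts):
    """Subtract a conversation's term counts from a bucket's term list.

    Assumption: terms that reach zero (or less) are dropped.
    """
    remaining = {}
    for term, count in bucket_counts.items():
        left = count - conv_counts.get(term, 0)
        if left > 0:
            remaining[term] = left
    return remaining


bucket = {"cgi": 12, "mysql": 10, "apache": 5}
conversation = {"cgi": 2, "mysql": 3, "apache": 5, "iis": 1}
after_removal = subtract_from_bucket(bucket, conversation)
print(after_removal)
```

Terms that appear only in the conversation ("iis" here) have nothing to
subtract from and are ignored.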