This post describes what is pretty much the starting point for our development work: the sux0r open source software. My next post will describe what we plan to add.
I came across sux0r while investigating the feasibility of the project, before writing the bid. I found several references to the idea of Bayesian filtering of RSS feeds, and a couple of projects that had made a start on software to implement the idea, but sux0r was the only open source project I found that was still active. Sux0r is not just a personal feed filter, though; it is something of an all-round content management system with Bayesian classification and support for group collaboration. It comprises a blogging platform, bookmarking, an image repository and an RSS feed aggregator.
While that’s great for a content management system, it’s a lot more than we really want to deal with. We considered the option of stripping out the functionality that we didn’t want to use, leaving just RSS aggregator and filter, but that seemed like fairly radical surgery to be performing, especially at the start of a project before we really got familiar with what did what in the sux0r code. It also didn’t seem to be a good strategy for contributing back to the sux0r project. So we adopted a more superficial approach: we have a complete installation of sux0r but we have customised the interface so that our users don’t get to see that there is an image library, blogging platform or social bookmark facility.
Using sux0r for feed filtering involves the following steps.
First you need to register as a user (top right of the welcome screen). When you’ve registered and logged in, you’ll want to subscribe to some feeds. Clicking on the “feeds” link at the top of the page will show you that you haven’t subscribed to any, but there will be options on the left to “manage feeds” and “suggest feeds”. Click on “manage feeds” and you will see a list of the feeds that sux0r already knows about; check the box next to a feed to subscribe to it. If the feed you want isn’t there then you’ll need to suggest it. Sux0r is set up so that a “feed administrator” needs to approve feeds before users can subscribe to them, so a suggested feed won’t appear in the list at “manage feeds” until one of us gets round to looking at it. (We hope to write the API so that an application using it can streamline this, if the sux0r administrator chooses to let it.) Having subscribed to some feeds, you should be able to see aggregated feed items when you click on the “feeds” link (there can be a delay while newly added feeds are gathered).
To categorize feed items you need both feeds and categories, so next you need to set up some categories. There is a short video tutorial on YouTube from the lead developer of sux0r describing this.
If you are logged in, you will see your user name as a link at the top right; click on it and you’ll be taken to your profile page. On the left will be a link that lets you “edit Bayesian”, which is where you set up the categories you want to use for classification. In general, resources can be categorised according to different aspects, for example subject, origin or “interestingness”. Sux0r calls each of these aspects a vector. For our trials we add an “interestingness” vector and put two categories on it: “interesting” and “not interesting”. One vector with two categories is the simplest case and the most likely to give good results.
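The relationship between vectors and categories can be pictured as a small data structure. This is only an illustrative sketch, not how sux0r stores them; the names `vectors`, `is_valid_label` and `item_labels` are made up for the example.

```python
# Hypothetical layout: each vector name maps to the categories placed on it.
# Our trial setup has a single vector with two categories.
vectors = {
    "interestingness": ["interesting", "not interesting"],
}


def is_valid_label(vectors, vector, category):
    """Return True if `category` is one of the categories on `vector`."""
    return category in vectors.get(vector, [])


# A feed item gets one category per vector.
item_labels = {"interestingness": "interesting"}

for vector, category in item_labels.items():
    print(vector, ":", category, "->", is_valid_label(vectors, vector, category))
```

Adding a second aspect, such as subject, would simply mean adding another key to `vectors` with its own list of categories.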
The system then needs to know something about what documents should come under each of the categories, e.g. what is interesting (to you) and what is not. You provide this information by training the classifier. If you go back to your feeds page, you should see your aggregated feed items each with a suggested classification shown in a drop-down selection box, and the probability that the classifier worked out for that classification. At first these will be garbage, something like “interestingness : interesting (71%)” for everything. You use this selection box to tell the system whether something really is interesting or not. When you do this the text will turn green to show that the classifier has been trained using the text in this item. As you train the classifier with more text it will begin to recognise those words that are more likely to occur in text of one category than another, and will use this information to analyse the text of feed items and assign them to a category.
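To make the idea of training concrete, here is a minimal sketch of a word-count based naive Bayes classifier for a single vector. It is not sux0r's code and makes no claim to match its algorithm: the class name `NaiveBayesVector`, the tokeniser and the add-one smoothing are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


class NaiveBayesVector:
    """One classification vector (e.g. interestingness) and its categories."""

    def __init__(self, categories):
        self.categories = list(categories)
        # Per-category word counts and number of training documents.
        self.token_counts = {c: Counter() for c in self.categories}
        self.doc_counts = Counter()

    @staticmethod
    def tokenize(text):
        # Assumed tokeniser: lower-case runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, category):
        """Record the words of `text` as evidence for `category`."""
        self.token_counts[category].update(self.tokenize(text))
        self.doc_counts[category] += 1

    def classify(self, text):
        """Return (most likely category, its probability) for `text`."""
        tokens = self.tokenize(text)
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())

        log_scores = {}
        for c in self.categories:
            # Add-one smoothing keeps unseen categories and words non-zero.
            prior = (self.doc_counts[c] + 1) / (total_docs + len(self.categories))
            total_tokens = sum(self.token_counts[c].values())
            score = math.log(prior)
            for t in tokens:
                likelihood = (self.token_counts[c][t] + 1) / (total_tokens + len(vocab) + 1)
                score += math.log(likelihood)
            log_scores[c] = score

        # Convert log scores to probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        probs = {c: w / norm for c, w in weights.items()}
        best = max(probs, key=probs.get)
        return best, probs[best]


clf = NaiveBayesVector(["interesting", "not interesting"])
clf.train("Bayesian filtering of RSS feeds for researchers", "interesting")
clf.train("open source feed aggregator with Bayesian classification", "interesting")
clf.train("celebrity gossip and football transfer rumours", "not interesting")
clf.train("football scores and celebrity news", "not interesting")

label, prob = clf.classify("Bayesian classification of journal feeds")
print(f"interestingness : {label} ({prob:.0%})")
```

With only four training texts the probabilities mean very little, which is the same situation as an untrained sux0r account: the suggestions only become useful once each category has seen a good amount of text.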
It’ll take quite a bit of training before the classification is reliable. (It keeps telling me that stuff that I have written isn’t interesting. I don’t like to say whether it’s right or not.) For best results you have to train it with a large number of texts for each category. If you’re not finding items for one of the categories, you can top up the training for that category by going to “edit Bayesian” from your profile, pasting text into the “train documents” box, selecting the classification you want it to have and hitting the “train” button. For our trials with items from current journals’ tables of contents, we’re assuming that there might be a deficit of papers that are interesting, so we are suggesting that researchers train the classifier in what is interesting with text taken from the titles and abstracts of papers that they have written or cited, or that are seminal to their field.
You can of course install sux0r for yourself to fully explore it. But if you just want to try it out for RSS filtering, you’re welcome to use our customised installation. It is provided with no guarantees of service quality or privacy, and no promise of support. We may have to reset user data or kick off users from time to time without warning (though we will try not to), and data may also be lost by accident. All we ask is that you drop us a line to tell us who you are, what you’re trying to do and whether you think it’s working; this will also help us warn you if we have to do anything drastic that would affect your use of the system. Remember, we’re not responsible for the sux0r user interface or the implementation of the Bayesian algorithm; if you have comments on those, direct them to the sux0r project. My suspicion is that classification will work best for well-defined categories, so for “interestingness” I think it will work better for specialists than generalists. Also, if you really want the filter to reduce the number of items you’re presented with, you’ll have to be honest about which really are interesting enough to read.