Project kicks off

The Bayesian Feed Filtering project will try to identify the articles that interest specific researchers within a set of RSS feeds of Journal Tables of Contents, applying the same approach that is used to filter out junk email. We had the first project meeting this afternoon, though we’ve each done a little bit of work in the last week or two, and we went over our plans for the two main work packages in some detail.
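For readers unfamiliar with the junk-email approach, here is a minimal naive Bayes text classifier in Python. This is a generic sketch of the technique, not sux0r's actual implementation; the function names and the two labels are ours.

```python
from collections import Counter
import math

def train(items):
    """items: list of (text, label) pairs; labels here are
    "interesting" / "boring" (our placeholder names)."""
    counts = {"interesting": Counter(), "boring": Counter()}
    totals = Counter()
    for text, label in items:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def p_interesting(text, counts, totals):
    """Naive Bayes posterior that `text` is interesting,
    computed in log space with add-one smoothing."""
    vocab = set(counts["interesting"]) | set(counts["boring"])
    logp = {}
    for label in ("interesting", "boring"):
        n = sum(counts[label].values())
        lp = math.log(totals[label] / sum(totals.values()))  # prior
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        logp[label] = lp
    # normalise the two log-probabilities into a 0..1 score
    m = max(logp.values())
    num = math.exp(logp["interesting"] - m)
    return num / (num + math.exp(logp["boring"] - m))
```

Trained on article titles a researcher has marked interesting or not, `p_interesting` gives each new item a score that can be used to filter or rank a feed.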

Technical development
We plan to use the sux0r software as our starting point. We’ve contacted the developer and are really pleased with the positive response we got. We’ll develop an API for sux0r that will allow its use by other applications, starting with a basic application that we will build ourselves for testing. We hope this approach will allow us to deliver what we promised, based on sux0r, without modifying the sux0r code in any way that would limit its use by other sux0r users. (The alternative would have been to hack the code we wanted to keep out of the sux0r code base and start a new project as a fork, but we don’t want that.) So here’s the basic architecture I have in mind:
Initial architecture
On the left is sux0r, on the right is an application, and joining them is the API. One of Santy’s first jobs is to get familiar with what’s going on inside sux0r and work out how to expose the relevant parts through the API. We plan to build a basic application of our own to test and demonstrate the API, and to give anyone wanting to build something better a starting point. We’ll also use this application to provide an interface for our test users when we run trials to see whether Bayesian filtering actually works for Journal Table of Contents RSS feeds.

Since we’ll be using feature-driven development, we first need to scope the project and then come up with a list of features to develop. Lisa and I will be writing usage scenarios. Our initial brainstorm has produced the following starting list of scenarios for using the API:

  • register a new user account
  • authenticate/authorize a user
  • submit feeds (individually or as an OPML?)
  • create a new relevance vector (or perhaps we just limit each user to one?)
  • submit training data (from RSS items or from plain text)
  • obtain feeds with filtered or rated items
  • export all Bayesian probability data
  • export those terms characteristic of interesting items
  • remove feeds
  • remove vector(?)
  • remove account
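To make the scenario list above a little more concrete, here is a rough sketch of what the API might look like from an application's point of view, as an in-memory Python stub. Every name here is a placeholder of ours for discussion, not anything sux0r actually defines.

```python
class FilterAPIStub:
    """In-memory stand-in for the planned sux0r API.
    All method and field names are hypothetical."""

    def __init__(self):
        self.users = {}  # username -> {"feeds": set, "training": list}

    def register(self, username):
        # scenario: register a new user account
        self.users[username] = {"feeds": set(), "training": []}

    def submit_feed(self, username, feed_url):
        # scenario: submit feeds (one at a time in this sketch)
        self.users[username]["feeds"].add(feed_url)

    def submit_training(self, username, text, interesting):
        # scenario: submit training data (plain text plus a label)
        self.users[username]["training"].append((text, interesting))

    def remove_feed(self, username, feed_url):
        self.users[username]["feeds"].discard(feed_url)

    def remove_account(self, username):
        del self.users[username]
```

A real implementation would sit behind HTTP and handle authentication, but even a stub like this lets us write the test application against the scenario list before the sux0r side exists.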

There may be more; we’ll need to prioritize these, and some of the functionality might belong to the application rather than the filterer (e.g. if sux0r doesn’t support OPML, it might be easier to parse the OPML on the application side and submit feeds to sux0r one at a time).
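Parsing a feed list on the application side would be straightforward; here is a small sketch using Python's standard library (the function name is ours):

```python
import xml.etree.ElementTree as ET

def feed_urls_from_opml(opml_text):
    """Pull feed URLs out of an OPML subscription list, so each feed
    can then be submitted to the filter one at a time."""
    root = ET.fromstring(opml_text)
    # OPML puts each subscription in an <outline> with an xmlUrl attribute
    return [o.attrib["xmlUrl"]
            for o in root.iter("outline")
            if "xmlUrl" in o.attrib]
```

The application would loop over the returned URLs and call whatever single-feed submission the API ends up providing.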

Trialling with users
We plan to trial the effectiveness of filtering Journal Tables of Contents feeds with about 20 researchers with whom we will work quite closely (others are welcome to use the software and tell us what they think, but they’ll be on their own). These researchers may start off using the native sux0r interface, but we want to build a pared-down user interface tailored more specifically to what we want them to test. We sketched out the following tasks:

Starting in August: Identify a group of ~20 researchers with whom we can work closely (probably from Heriot-Watt) and get them registered with sux0r. Get each researcher to tell us which journals they are interested in, and load the ToC feeds for these journals into sux0r. Get the researchers to train sux0r by identifying interesting and non-interesting articles from these ToCs. This isn’t likely to provide enough text for the filter to work; in particular, it’s likely to be short of “interesting” examples, so we will supplement this training with text from papers they have written, papers they have cited, and any other seminal papers from their field.

We’ll let them use sux0r and/or our test application for a couple of months, so they can continue the training, before we run a test in early November in which we try to tell them which papers out that month are of interest to them. They can then tell us whether we got it right.

One factor that we are interested in is the balance of false positives to false negatives. My hunch is that it will be OK to have a false positive rate of up to 50% (i.e. half the stuff we think is interesting actually isn’t), but we’ll need to keep the false negatives very low (i.e. we mustn’t throw too many babies out with the bathwater).
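In confusion-matrix terms, the two numbers described above would be computed like this (a sketch; the argument names tp/fp/fn/tn for true/false positives/negatives are ours):

```python
def error_rates(tp, fp, fn, tn):
    """tp: interesting items correctly kept; fp: boring items kept;
    fn: interesting items wrongly discarded; tn: boring items discarded."""
    # share of the kept items that aren't actually interesting
    # (the "up to 50%" figure): wasted reading time
    false_positive_share = fp / (tp + fp)
    # share of the genuinely interesting items we threw away:
    # the babies lost with the bathwater, which must stay very low
    false_negative_rate = fn / (tp + fn)
    return false_positive_share, false_negative_rate
```

For example, a trial month in which we keep 20 items of which 10 are genuinely interesting, while discarding 1 interesting item, would sit exactly at the 50% false-positive hunch with a false-negative rate of about 9%.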

For the record, the other two work packages are project management (e.g. holding this meeting) and engagement with the community. There’s a long list of JISC and other projects that I need to get in touch with, but if you’re interested (and why else would you have read this far?) don’t wait for me to contact you; please get in touch with me: Phil Barker



Filed under management, technical, trialling