Category Archives: technical

How to Install the Bayesian Feed Filter

The Bayesian Feed Filter (BayesFF) is an optional interface for the popular sux0r software package. To be able to use the BayesFF interface you only need to follow the normal process for installing sux0r and make a few edits in the sux0r configuration file.

The BayesFF interface will allow you to use the API and the web interface developed by the BayesFF project. In general, installing sux0r is a simple process that takes less than 30 minutes to complete, depending on the PHP configuration of your web server. If you are not familiar with installing and configuring PHP packages that require access to the web server configuration files, you may want to ask your IT support team to install sux0r for you. However, if you wish to install sux0r yourself, the following detailed installation guide should help you.

A. Prerequisites

* Configuring PHP to enable mb, gd, and PDO libraries:
– mb (mbstring) is a non-default extension, so you need to enable it explicitly with the configure option. See http://www.php.net/manual/en/mbstring.installation.php for details
– gd represents the GD library that you will need to install (available at http://www.libgd.org/) and enable with the configure PHP command. See http://www.php.net/manual/en/image.installation.php webpage for details
– PDO driver is enabled by default as of PHP 5.1.0, but you may need to enable it to work with MySQL. Please consult the documentation at http://www.php.net/manual/en/pdo.installation.php and http://www.php.net/manual/en/ref.pdo-mysql.php web pages to find out more about PDO installation.

* MySQL 5.0.x, set to support UTF characters
(Further information on http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html)

* Apache 2.x webserver with mod_rewrite module enabled
(a simple but good tutorial on enabling mod_rewrite can be found at http://www.tutorio.com/tutorial/enable-mod-rewrite-on-apache)

B. Installation

To install sux0r code on your web server:
1. Log in to your server and go to the directory where you want to install sux0r
2. Execute the following Unix command:
svn export https://sux0r.svn.sourceforge.net/svnroot/sux0r/branches/icbl/
3. Execute these two commands:
chmod 777 ./data
chmod 777 ./temporary

To create the MySQL database and tables for sux0r:
4. Create a database named “sux0r” on your MySQL server
5. Import ./supplemental/sql/db-mysql.sql into MySQL

C. Configuration

1. From the shell, execute these commands:
mv ./sample-config.php ./config.php
mv ./sample-.htaccess ./.htaccess

2. Edit ./config.php and ./.htaccess as appropriate, following the instructions included inside these files. The settings you need to change are:

Edit Database Connection: $CONFIG['DSN']
Edit URL for your installation of sux0r: $CONFIG['URL']
Edit Title: $CONFIG['TITLE']
If you want to use the BayesFF interface, you will need to change the default value of the $CONFIG['PARTITION'] configuration parameter found in config.php,
from:
$CONFIG['PARTITION'] = 'sux0r';
to:
$CONFIG['PARTITION'] = 'bayesff';

3. To check your installation, run the ./supplemental/dependencies.php script from your browser. Example:
http://yourwebsite/sux0r210/supplemental/dependencies.php (If there are no errors, OK will be returned with a link to your new installation.)

4. If the previous step didn’t produce any errors, point your web browser to http://yourwebsite/sux0r210/supplemental/root.php and follow the onscreen instructions to make yourself a sux0r root user.

5. Set up a cron job to fetch RSS feeds every x minutes (we recommend starting by running the cron job every 60 minutes). The PHP script that fetches the feeds is already provided by sux0r and is available at http://yourwebsite/sux0r210/modules/feeds/cron.php
For example:
0 * * * * /bin/nice /usr/bin/wget -q -O /dev/null "http://yourwebsite/sux0r210/modules/feeds/cron.php" > /dev/null 2>&1
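
If you want to try a different fetch interval, the crontab entry can be generated rather than typed by hand. The following is only a minimal sketch: the function name crontab_line is made up for illustration, it assumes the same /bin/nice and /usr/bin/wget paths as the example above, and it only accepts intervals that divide 60 evenly because that is what cron's step syntax handles cleanly.

```python
def crontab_line(base_url, every_minutes=60):
    """Build a crontab entry that fetches sux0r's feed cron script.

    crontab_line is a hypothetical helper, not part of sux0r.
    """
    if every_minutes < 1 or 60 % every_minutes != 0:
        raise ValueError("every_minutes must divide 60 evenly, e.g. 5, 15, 30 or 60")
    # Hourly runs use minute 0; shorter intervals use cron's */n step syntax.
    minute_field = "0" if every_minutes == 60 else f"*/{every_minutes}"
    url = base_url.rstrip("/") + "/modules/feeds/cron.php"
    return (f"{minute_field} * * * * /bin/nice /usr/bin/wget -q -O /dev/null "
            f'"{url}" > /dev/null 2>&1')


print(crontab_line("http://yourwebsite/sux0r210", 60))
print(crontab_line("http://yourwebsite/sux0r210", 15))
```

The first line printed is the hourly entry shown above; the second runs the same fetch every 15 minutes.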

6. Delete the ./supplemental directory from the webserver.

Sux0r should now be successfully installed on your website.

Filed under dissemination, technical

New feature: return RSS feeds for user

This feature allows a user to see all the feeds that they are filtering: useful if they just want to look at a specific feed or if they want to manage their feeds, e.g. by deleting one. If you’re interested you can read the original feature spec, though we haven’t implemented this in full (see comments).

The API call is an HTTP GET on [sux0rURL]/api/feeds/ (where [sux0rURL] is the URL for your sux0r installation, for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). The only parameter you can use is user= to specify the user name.

Example
So if you wanted to see the feeds that I have subscribed to you would GET http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/feeds/?user=philb which would return something like

<?xml version="1.0"?>
<rss version="2.0" xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Philb's RSS Feeds</title>
    <link>http://icbl.macs.hw.ac.uk/sux0r206/user/profile/philb</link>
    <description>Use Case: Return the RSS Feeds for a User. User Nickname: philb</description>
        <atom:link href="http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/feeds/?user=philb" rel="self" type="application/rss+xml" />
    <item>
      <title>OUseful.Info, the blog...</title>
      <link>http://feeds.feedburner.com/ouseful</link>
      <guid>50</guid>
      <description>Thinking differently...</description>
    </item>
    <item>
      <title>Lorcan Dempsey's weblog</title>
      <link>http://orweblog.oclc.org/atom.xml</link>
      <guid>57</guid>
      <description>On libraries, services and networks.</description>
    </item>
<!--snip-->
    <item>
      <title>ehabitus</title>
      <link>http://ehabitus.blogspot.com/feeds/posts/default</link>
      <guid>53</guid>
      <description>(n). "e" + "a system of dispositions (unarticulated, habitual, acquired patterns of perception, thought and action)"</description>
    </item>
  </channel>
</rss>

i.e. an RSS feed where the items provide information on the feeds to which I am subscribed.
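
For anyone consuming this from a script, here is a minimal Python sketch of pulling the feed ids, titles and URLs out of such a response. It works on a shortened copy of the example above rather than calling the live server, and the function name list_feeds is just an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example response shown above.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0" xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Philb's RSS Feeds</title>
    <item>
      <title>OUseful.Info, the blog...</title>
      <link>http://feeds.feedburner.com/ouseful</link>
      <guid>50</guid>
    </item>
    <item>
      <title>Lorcan Dempsey's weblog</title>
      <link>http://orweblog.oclc.org/atom.xml</link>
      <guid>57</guid>
    </item>
  </channel>
</rss>"""


def list_feeds(rss_text):
    """Return one dict per <item>: sux0r feed id, feed title and feed URL."""
    root = ET.fromstring(rss_text)
    feeds = []
    for item in root.iter("item"):
        feeds.append({
            "feed_id": int(item.findtext("guid")),  # sux0r's id is carried in <guid>
            "title": item.findtext("title"),
            "url": item.findtext("link"),
        })
    return feeds


for feed in list_feeds(SAMPLE):
    print(feed["feed_id"], feed["title"], feed["url"])
```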

comments
Why an RSS feed? Well, we have the code return information in RSS feeds so why not? A simple extension would be to allow a parameter that would let one specify the format of the results, with OPML being the obvious alternative.

You’ll see that we’ve used the guid element to return the identifier used by sux0r for the feed (OK, we’re stretching the definition of the “gu” part of guid). This can be used as the feed_id to identify the feed in other API calls, for example to return the items in that feed.

A PUT or POST (not sure which yet) to the same URL base would be a way of adding feeds.

Another extension would be to make the user parameter optional, returning information on all feeds in the system if no user name is provided–this could be useful for some admin functions.

Unfortunately we haven’t been able to implement the error codes properly on our server: you get an HTTP status code of 200 OK whether or not the request succeeded. However, if you specify an invalid user name you do get sensible error messages returned in the body.

Even more unfortunately, the current implementation does not cover the authentication requirements.

Filed under technical

noAuth

One of the “weaknesses” I put in the SWOT analysis was that we had a lot to learn. Fully understanding and implementing authentication and authorization for the API was one of the things that we had to learn. As of now, at the end of the funded work on the project, we seem to have failed in this.

Our first point of failure was in being pointlessly over-ambitious in what we wanted to do via the API. When drawing up the initial feature set for the API I took the starting position that anything that you could do through the native sux0r interface should be doable remotely; so the feature set included register new user. This muddied the requirements for accessing the sux0r security procedures in a way that I can now see was quite unnecessary: it’s really not unreasonable to expect people to have an account with a service before they interact with it from another application.

Having clarified this it became clear that oAuth would be the authorization mechanism of choice, though we had no experience in implementing it. Santy got a client working with twitter and flickr based on Andy Smith’s library. He used the Google PHP OAuth library for the server on sux0r, but it didn’t work with either that client or Google’s own client. There is another library he would like to test for the server side, but he had already spent more time than was available.

Struggling with oAuth meant less time to spend on actual features. In retrospect we should have implemented the features without authorization in the hope of adding some form of authorization later (which is indeed what Santy has done towards the end of the project), but it is always tempting to keep trying one more thing in the hope that the next try will succeed.

As a result we have fewer features implemented than we planned, and features that should require authorization don’t have it. We still hope to add some form of restriction on access; even HTTP digest authentication, which would require the sux0r user name and password to be entered into the third-party app, is better than nothing.

Lessons learnt: 1) you don’t have to do everything through an API (god, that seems obvious when I write it); 2) get on with what you can do in parallel to trying to overcome road blocks; 3) analysing the problem and implementing the client did give us a better understanding of what oAuth should do.

Filed under management, technical

Features: ReturnVectors and ReturnCategories

The Return Items for a user feature assumes that if you want to get only those items that have been classified under a certain category you know the numerical code used by sux0r to identify the vector and category. These features allow you to find those codes.

Return vectors for a user
The full design for this feature is available; however, the current implementation does not cover the authentication requirements.

The API call is an HTTP GET on [sux0rURL]/api/vectors/ (where [sux0rURL] is the URL for your sux0r installation, for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). The only parameter is user= to specify a username.

examples
HTTP GET on http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/vectors/?user=philb will return a list of the vectors used by philb (me). The data returned is pretty self-explanatory; in this case you get:

<?xml version="1.0"?>
<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <api:userNickname>philb</api:userNickname>
  <api:vectors>
    <api:vector>
      <api:vectorID>6</api:vectorID>
      <api:vectorName>WorkInterest</api:vectorName>
    </api:vector>

    <api:vector>
      <api:vectorID>33</api:vectorID>
      <api:vectorName>CETIS-Domain</api:vectorName>
    </api:vector>
  </api:vectors>
</response>
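
Since the point of this call is to look up the numerical code for a vector, a client will usually want a name-to-id mapping. A minimal Python sketch, run against a copy of the response above (the function name vector_ids is just an illustrative choice):

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:api declaration in the response.
NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}

VECTORS_XML = """<?xml version="1.0"?>
<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <api:userNickname>philb</api:userNickname>
  <api:vectors>
    <api:vector>
      <api:vectorID>6</api:vectorID>
      <api:vectorName>WorkInterest</api:vectorName>
    </api:vector>
    <api:vector>
      <api:vectorID>33</api:vectorID>
      <api:vectorName>CETIS-Domain</api:vectorName>
    </api:vector>
  </api:vectors>
</response>"""


def vector_ids(xml_text):
    """Map each vector name to its numerical sux0r id."""
    root = ET.fromstring(xml_text)
    ids = {}
    for vector in root.iterfind("api:vectors/api:vector", NS):
        name = vector.findtext("api:vectorName", namespaces=NS)
        ids[name] = int(vector.findtext("api:vectorID", namespaces=NS))
    return ids


print(vector_ids(VECTORS_XML))  # {'WorkInterest': 6, 'CETIS-Domain': 33}
```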

Return categories for a user’s vector

The full design for this feature is available; however, the current implementation does not cover the authentication requirements.

The API call is an HTTP GET on [sux0rURL]/api/categories/ (where [sux0rURL] is the URL for your sux0r installation, for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). There are two required parameters:
user to specify a username;
vec_id to specify the id of a vector used by that user.

examples
HTTP GET on http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/categories/?user=philb&vec_id=6 will return a list of the categories used by philb (me) for the vector with id number 6 (which is “work interest”). The data returned is pretty self-explanatory; in this case you get:

<?xml version="1.0"?>
<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <api:userNickname>philb</api:userNickname>
  <api:categories>
    <api:vector>
      <api:vectorID>6</api:vectorID>
      <api:vectorName>WorkInterest</api:vectorName>
    </api:vector>

    <api:category>
      <api:categoryID>12</api:categoryID>
      <api:categoryName>interesting</api:categoryName>
    </api:category>
    <api:category>
      <api:categoryID>13</api:categoryID>
      <api:categoryName>not interesting</api:categoryName>

    </api:category>
  </api:categories>
</response>
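
The category codes can be extracted in the same way. The sketch below parses a copy of the response above and returns the vector id together with a name-to-id mapping for its categories; the function name categories_for_vector is an illustrative choice, not part of the API.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:api declaration in the response.
NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}

CATEGORIES_XML = """<?xml version="1.0"?>
<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <api:userNickname>philb</api:userNickname>
  <api:categories>
    <api:vector>
      <api:vectorID>6</api:vectorID>
      <api:vectorName>WorkInterest</api:vectorName>
    </api:vector>
    <api:category>
      <api:categoryID>12</api:categoryID>
      <api:categoryName>interesting</api:categoryName>
    </api:category>
    <api:category>
      <api:categoryID>13</api:categoryID>
      <api:categoryName>not interesting</api:categoryName>
    </api:category>
  </api:categories>
</response>"""


def categories_for_vector(xml_text):
    """Return (vector id, {category name: category id}) from a categories response."""
    root = ET.fromstring(xml_text)
    container = root.find("api:categories", NS)
    vector_id = int(container.findtext("api:vector/api:vectorID", namespaces=NS))
    categories = {}
    for category in container.iterfind("api:category", NS):
        name = category.findtext("api:categoryName", namespaces=NS)
        categories[name] = int(category.findtext("api:categoryID", namespaces=NS))
    return vector_id, categories


print(categories_for_vector(CATEGORIES_XML))
```

The vec_id and cat_id values found this way are the ones the Return Items call expects.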

Error trapping
Unfortunately we couldn’t implement the error codes properly on our server: you get an HTTP status code of 200 OK whether or not the request succeeded. However, if you specify an invalid user name or vector id you do get sensible error messages returned, which include links to set you on the right track.
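
Because the status code cannot be trusted, a client has to inspect the body instead. The exact format of the error messages isn't documented here, so the sketch below makes no assumption about it: it simply treats a response as a success only if it parses as XML and contains the element the caller expected. The function name has_expected_payload and both sample error bodies are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:api declaration in the responses.
NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}


def has_expected_payload(body, expected_path):
    """True if body is well-formed XML whose root contains expected_path."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # not XML at all, so certainly not a normal response
    return root.find(expected_path, NS) is not None


ok_body = (
    '<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">'
    "<api:vectors><api:vector><api:vectorID>6</api:vectorID></api:vector></api:vectors>"
    "</response>"
)
# Hypothetical error bodies: the real error format is not shown in this post.
html_error = "<html><body>No such user</body></html>"
text_error = "No such user"

print(has_expected_payload(ok_body, "api:vectors"))     # True
print(has_expected_payload(html_error, "api:vectors"))  # False
print(has_expected_payload(text_error, "api:vectors"))  # False
```

For the categories call the expected path would be "api:categories" instead.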

Filed under technical

Feature implemented: Return RSS items for a user

The single most important feature that we are adding with this project is the ability to publish feeds from sux0r corresponding to specified criteria, for example a feed aggregated from all the feeds that a user is subscribed to that have been classified under the same heading by the Bayesian algorithm. (Here’s the full specification if you’re interested). We have now completed work on this.

The API call is an HTTP GET on [sux0rURL]/api/items/ (where [sux0rURL] is the URL for your sux0r installation, for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). The parameters you can use are:
user to specify the user name;
vec_id to specify the vector id;
cat_id to specify the category id;
feed_id to specify the id or URL of the feed;
keywords to specify any keywords for filtering the result feed;
threshold to specify the threshold value for the probable relevance against the category;
maxHits to specify a maximum number of hits to return.
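
To avoid hand-assembling query strings, the request URL can be built from these parameters. This is a minimal Python sketch using only the standard library; the function name items_url is an illustrative choice, and the base URL is the project installation quoted above.

```python
from urllib.parse import urlencode

BASE = "http://icbl.macs.hw.ac.uk/sux0rAPI/icbl"

# Optional parameters, in the order they are listed above.
ALLOWED = ("vec_id", "cat_id", "feed_id", "keywords", "threshold", "maxHits")


def items_url(base, user, **filters):
    """Build the GET URL for [sux0rURL]/api/items/ from the given filters."""
    unknown = set(filters) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported parameter(s): {sorted(unknown)}")
    params = [("user", user)] + [(key, filters[key]) for key in ALLOWED if key in filters]
    # urlencode escapes spaces and other special characters in the values.
    return base.rstrip("/") + "/api/items/?" + urlencode(params)


print(items_url(BASE, "philb", maxHits=20))
print(items_url(BASE, "philb", vec_id=12, cat_id=24, threshold=0.5, maxHits=30))
```

Passing a multi-word value such as keywords="jisc cetis" yields keywords=jisc+cetis in the query string.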

Sorting wasn’t implemented; the default sort order is by date. Also we didn’t get authentication working (but we dithered about whether it was necessary for this feature anyway, and life is easier if you can just get a feed into any feed reader).

Examples:
http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&maxHits=20
Gives the most recent 20 items from all the feeds to which user philb (that’s me!) subscribes. (I should note that not many of the feeds I subscribe to are Journal ToCs, so I’m not really using this for the type of feed for which it was intended. Nevertheless I find it kind of works.)

http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&keywords=jisc&maxHits=20
Gives the most recent 20 items containing the word jisc from all the feeds to which I subscribe. Try changing jisc to jisc cetis or “jisc cetis” or “jisc AND cetis”.

http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&vec_id=12&cat_id=24&threshold=0.5&maxHits=30
This is more interesting: vector 12 is my vector for classifying relevance to my research interests and category 24 is the stuff that is relevant. So this is a feed of the stuff that is predicted to be relevant to my research interests (since the probability threshold is set to 0.5).

The results feed for that last call looks like this:

<?xml version="1.0"?>
<rss version="2.0" xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Philb's RSS ItemsVector ID: 12, Category ID: 24, Threshold: 0.5, maxHits: 30</title>
    <link>http://icbl.macs.hw.ac.uk/sux0r206/user/profile/philb</link>
    <description>Use Case: Return the RSS Items for a User. User Nickname: philb. Summary of applied filters:  Threshold: 0.5;  maxHits: 30 results</description>
        <atom:link href="http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&amp;vec_id=12&amp;cat_id=24&amp;threshold=0.5&amp;maxHits=30" rel="self" type="application/rss+xml" />
    <item>
      <title>An infrastructure service anti-pattern</title>
      <link>http://blog.paulwalk.net/2009/12/07/an-infrastructure-service-anti-pattern</link>
      <guid>http://blog.paulwalk.net/2009/12/07/an-infrastructure-service-anti-pattern</guid>
      <description>Last week I outlined an idea, that of the service anti-pattern, as part of a presentation I gave last week to the Resource Discovery Taskforce (organised by JISC in partnership with RLUK). The idea seemed to really catch the interest of and resonate with several of those members of the taskforce who were present at [...]</description>
      <pubDate>Mon, 07 Dec 2009 10:37:05 EST</pubDate>
      <source url="http://blog.paulwalk.net/feed">paul walk's weblog</source>
      <api:relevance>1</api:relevance>
    </item>
    <item>
      <title>Statistics of user trial results</title>
      <link>https://bayesianfeedfilter.wordpress.com/2009/12/07/statistics-of-user-trial-results</link>
      <guid>https://bayesianfeedfilter.wordpress.com/2009/12/07/statistics-of-user-trial-results</guid>
      <description>We now have results from our user trials showing how effective sux0r may be in filtering items from journal table of contents RSS feeds that are relevant to a user’s research interests. Quick reminder of how we ran the trials: 20 users had access to sux0r for 4 weeks to train the analyser in what [...]</description>
      <pubDate>Mon, 07 Dec 2009 07:41:18 EST</pubDate>
      <source url="https://bayesianfeedfilter.wordpress.com/feed">Bayesian Feed Filter</source>
      <api:relevance>1</api:relevance>
    </item>
<!--lots more items-->
  </channel>
</rss>

Apart from an additional element for the relevance of the item to the specified category, it’s plain RSS 2.0.
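
A client that wants to use the relevance score, for example to apply a stricter cut-off locally than the threshold sent to the server, needs to read that extra namespaced element. Here is a minimal Python sketch on a shortened sample; the second item and its relevance value of 0.55 are invented for illustration, and the function name items_at_least is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:api declaration in the feed.
NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}

ITEMS_XML = """<?xml version="1.0"?>
<rss version="2.0" xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <channel>
    <item>
      <title>An infrastructure service anti-pattern</title>
      <link>http://blog.paulwalk.net/2009/12/07/an-infrastructure-service-anti-pattern</link>
      <api:relevance>1</api:relevance>
    </item>
    <item>
      <title>Hypothetical borderline item</title>
      <link>http://example.org/borderline</link>
      <api:relevance>0.55</api:relevance>
    </item>
  </channel>
</rss>"""


def items_at_least(rss_text, minimum):
    """Return (title, relevance) for items whose api:relevance is >= minimum."""
    root = ET.fromstring(rss_text)
    hits = []
    for item in root.iter("item"):
        # Items without the element are treated as relevance 0 (an assumption).
        relevance = float(item.findtext("api:relevance", default="0", namespaces=NS))
        if relevance >= minimum:
            hits.append((item.findtext("title"), relevance))
    return hits


print(items_at_least(ITEMS_XML, 0.9))
```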

Unfortunately we couldn’t implement the error codes properly on our server: you get an HTTP status code of 200 OK whether or not the request succeeded. Also, I think there are some error conditions that we don’t trap satisfactorily, for example specifying a non-existent user or category.

Filed under technical

OAuth

Congratulations to Santy on getting an OAuth test client working. We’re going to be using OAuth to authorise remote access to the Feed Filter (I guess that should be obvious); about 90% of our features require it. One of the “weaknesses” I put in the SWOT analysis was that we had a lot to learn, and fully understanding and implementing OAuth relates directly to that. I guess that makes us stronger now. Next: OAuth on the server.

Filed under technical

New features planned for sux0r

My last post described what sux0r already does; this one describes the features for the API that we plan to add.

The idea is to allow users of a remote application to classify feeds and to see the results, i.e. do what was described in that last post but without using the sux0r interface. The hope is that this will allow the use of the filter to be embedded in their own personal toolset, and more generally make the functionality of sux0r as a feed filter/classifier available to other services and applications.

To do this we think the API needs to provide access to the following sux0r functionality (the priority refers to our priority for implementing the feature):

1. Authorise account access for user application
A user gains access to their account through an application using the API (using OAuth). High priority.

2. Add a New Feed
A user suggests a feed to be made available for adding to sux0r users’ accounts. High priority

3. Approve a Feed for a User
A feed administrator approves a feed added by a user so that it can be added to users’ accounts. High priority

4. Associate feed with a user
A user associates an approved feed with their account. High priority

5. Create a new Vector for a User
A user creates a new classification vector. Medium priority

6. Create a new Category for a User’s Vector
A user creates a new classification category on a specified vector. Medium priority.

7. Train a Document for a User
The user submits a document and the desired classification to train the classifier. High Priority.

Note: The document could be an RSS Item, which already exists in the database and hence will have an RSS ID number, or it could be plain text, which needs to be added to the database and then trained.

8. Return the RSS Items for a User
A user gets all Items from RSS Feeds to which a user is subscribed. Feeds may be sorted or filtered according to specified criteria (e.g. only those in a certain category). Very high priority.

9. Return RSS Feeds for All Users
A user gets a list of all the feeds in the database. Medium priority.

10. Return RSS Feeds for a User
A user gets a list of all the feeds they are subscribed to. High priority

11. Remove feed
A user requests to remove a feed (association) from their account. Medium priority

12. Return vectors
A user gets a list of all the vectors she has created. Medium priority

13. Return categories
A user wants to view all the categories they have created for a vector. Medium priority

14. Export the Bayesian Token Analysis for a User
A user gets the information on frequency of occurrence of words in each vector-category.

Filed under dissemination, technical