Features: ReturnVectors and ReturnCategories

The Return Items for a User feature assumes that, if you want only those items that have been classified under a certain category, you already know the numerical codes sux0r uses to identify the vector and the category. These two features let you find those codes.

Return vectors for a user
The full design for this feature is available; however, the current implementation does not cover the authentication requirements.

The API call is an HTTP GET on [sux0rURL]/api/vectors/ (where [sux0rURL] is the URL for your sux0r installation; for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). The only parameter is user= to specify a username.

Examples
An HTTP GET on http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/vectors/?user=philb will return a list of the vectors used by philb (me). The data returned is pretty self-explanatory; in this case you get:

<?xml version="1.0"?>
<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <api:userNickname>philb</api:userNickname>
  <api:vectors>
    <api:vector>
      <api:vectorID>6</api:vectorID>
      <api:vectorName>WorkInterest</api:vectorName>
    </api:vector>

    <api:vector>
      <api:vectorID>33</api:vectorID>
      <api:vectorName>CETIS-Domain</api:vectorName>
    </api:vector>
  </api:vectors>
</response>
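
A minimal client sketch for this call, assuming Python with the requests library and the namespace shown above (illustrative only, not part of sux0r itself):

# List a user's vectors via the sux0r API
import requests
import xml.etree.ElementTree as ET

BASE = "http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api"
NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}

resp = requests.get(BASE + "/vectors/", params={"user": "philb"})
root = ET.fromstring(resp.content)

# Each api:vector carries the numerical ID (needed for later calls) and a name
for vector in root.findall(".//api:vector", NS):
    print(vector.findtext("api:vectorID", namespaces=NS),
          vector.findtext("api:vectorName", namespaces=NS))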

Return categories for a user’s vector

The full design for this feature is available; however, the current implementation does not cover the authentication requirements.

The API call is an HTTP GET on [sux0rURL]/api/categories/ (where [sux0rURL] is the URL for your sux0r installation; for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). There are two required parameters:
user to specify a username;
vec_id to specify the id of a vector used by that user.

Examples
An HTTP GET on http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/categories/?user=philb&vec_id=6 will return a list of the categories used by philb (me) for the vector with id number 6 (which is “work interest”). The data returned is pretty self-explanatory; in this case you get:

<?xml version="1.0"?>
<response xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns">
  <api:userNickname>philb</api:userNickname>
  <api:categories>
    <api:vector>
      <api:vectorID>6</api:vectorID>
      <api:vectorName>WorkInterest</api:vectorName>
    </api:vector>

    <api:category>
      <api:categoryID>12</api:categoryID>
      <api:categoryName>interesting</api:categoryName>
    </api:category>
    <api:category>
      <api:categoryID>13</api:categoryID>
      <api:categoryName>not interesting</api:categoryName>

    </api:category>
  </api:categories>
</response>
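
The same pattern gives you the category codes once you have a vector ID (again a sketch in Python with requests, not part of sux0r itself):

# List the categories for one of a user's vectors
import requests
import xml.etree.ElementTree as ET

BASE = "http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api"
NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}

resp = requests.get(BASE + "/categories/", params={"user": "philb", "vec_id": 6})
root = ET.fromstring(resp.content)

for cat in root.findall(".//api:category", NS):
    print(cat.findtext("api:categoryID", namespaces=NS),
          cat.findtext("api:categoryName", namespaces=NS))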

Error trapping
Unfortunately we couldn’t implement the error codes properly on our server: you get an HTTP status code of 200 OK whether or not the request succeeds. However, if you specify an invalid user name or vector id you do get sensible error messages returned, which include links to set you on the right track.



Filed under technical

Feature implemented: Return RSS items for a user

The single most important feature that we are adding with this project is the ability to publish feeds from sux0r corresponding to specified criteria, for example a feed aggregating, from all the feeds a user is subscribed to, the items that the Bayesian algorithm has classified under the same heading. (Here’s the full specification if you’re interested). We have now completed work on this.

The API call is an HTTP GET on [sux0rURL]/api/items/ (where [sux0rURL] is the URL for your sux0r installation; for this project that is http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/ ). The parameters you can use are:
user to specify the user name;
vec_id to specify the vector id;
cat_id to specify the category id;
feed_id to specify the id or URL of the feed;
keywords to specify any keywords for filtering the result feed;
threshold to specify the threshold value for the probable relevance against the category;
maxHits to specify a maximum number of hits to return.
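
To give a feel for how the parameters fit together, here is a minimal sketch of building a request in Python with the requests library (illustrative only; the parameter values are the ones used in the third example below):

# Request a filtered feed of items for a user
import requests

BASE = "http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api"

params = {
    "user": "philb",    # whose subscriptions to aggregate
    "vec_id": 12,       # vector to classify against
    "cat_id": 24,       # category within that vector
    "threshold": 0.5,   # minimum probability of belonging to the category
    "maxHits": 30,      # cap on the number of items returned
}
resp = requests.get(BASE + "/items/", params=params)
print(resp.url)     # the assembled query string
print(resp.text)    # an RSS 2.0 feed, as shown below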

Sorting wasn’t implemented; the default sort order is by date. Also, we didn’t get authentication working (but we dithered about whether it was necessary for this feature anyway, and life is easier if you can just get a feed into any feed reader).

Examples:
http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&maxHits=20
Gives the most recent 20 items from all the feeds to which user philb (that’s me!) subscribes. (I should note that not many of the feeds I subscribe to are Journal ToCs, so I’m not really using this for the type of feed for which it was intended. Nevertheless I find it kind of works.)

http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&keywords=jisc&maxHits=20
Gives the most recent 20 items containing the word jisc from all the feeds to which I subscribe. Try changing jisc to jisc cetis or “jisc cetis” or “jisc AND cetis”.

http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&vec_id=12&cat_id=24&threshold=0.5&maxHits=30
This is more interesting: vector 12 is my vector for classifying relevance to my research interests, and category 24 is the stuff that is relevant. So this is a feed of the stuff that is predicted to be relevant to my research interests (since the probability threshold is set to 0.5).

The results feed for that last call looks like this:

<?xml version="1.0"?>
<rss version="2.0" xmlns:api="http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Philb's RSS ItemsVector ID: 12, Category ID: 24, Threshold: 0.5, maxHits: 30</title>
    <link>http://icbl.macs.hw.ac.uk/sux0r206/user/profile/philb</link>
    <description>Use Case: Return the RSS Items for a User. User Nickname: philb. Summary of applied filters:  Threshold: 0.5;  maxHits: 30 results</description>
        <atom:link href="http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/?user=philb&amp;vec_id=12&amp;cat_id=24&amp;threshold=0.5&amp;maxHits=30" rel="self" type="application/rss+xml" />
    <item>
      <title>An infrastructure service anti-pattern</title>
      <link>http://blog.paulwalk.net/2009/12/07/an-infrastructure-service-anti-pattern</link>
      <guid>http://blog.paulwalk.net/2009/12/07/an-infrastructure-service-anti-pattern</guid>
      <description>Last week I outlined an idea, that of the service anti-pattern, as part of a presentation I gave last week to the Resource Discovery Taskforce (organised by JISC in partnership with RLUK). The idea seemed to really catch the interest of and resonate with several of those members of the taskforce who were present at [...]</description>
      <pubDate>Mon, 07 Dec 2009 10:37:05 EST</pubDate>
      <source url="http://blog.paulwalk.net/feed">paul walk's weblog</source>
      <api:relevance>1</api:relevance>
    </item>
    <item>
      <title>Statistics of user trial results</title>
      <link>https://bayesianfeedfilter.wordpress.com/2009/12/07/statistics-of-user-trial-results</link>
      <guid>https://bayesianfeedfilter.wordpress.com/2009/12/07/statistics-of-user-trial-results</guid>
      <description>We now have results from our user trials showing how effective sux0r may be in filtering items from journal table of contents RSS feeds that are relevant to a user’s research interests. Quick reminder of how we ran the trials: 20 users had access to sux0r for 4 weeks to train the analyser in what [...]</description>
      <pubDate>Mon, 07 Dec 2009 07:41:18 EST</pubDate>
      <source url="https://bayesianfeedfilter.wordpress.com/feed">Bayesian Feed Filter</source>
      <api:relevance>1</api:relevance>
    </item>
<!--lots more items-->
  </channel>
</rss>

Apart from an additional element for the relevance of the item to the specified category, it’s plain RSS 2.0.
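
If you do want to read that relevance value programmatically, any XML parser will do; here is a minimal sketch using Python’s standard library (again illustrative, not part of sux0r):

# Print the title and api:relevance of each item in the result feed
import requests
import xml.etree.ElementTree as ET

NS = {"api": "http://icbl.macs.hw.ac.uk/sux0rAPI/api/xmlns"}
url = ("http://icbl.macs.hw.ac.uk/sux0rAPI/icbl/api/items/"
       "?user=philb&vec_id=12&cat_id=24&threshold=0.5&maxHits=30")

root = ET.fromstring(requests.get(url).content)
for item in root.iter("item"):    # plain RSS 2.0 <item> elements
    title = item.findtext("title")
    relevance = item.findtext("api:relevance", namespaces=NS)
    print(relevance, title)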

Unfortunately we couldn’t implement the error codes properly on our server: you get an HTTP status code of 200 OK whether or not the request succeeds. Also, I think there are some error conditions that we don’t trap satisfactorily, for example specifying a non-existent user or category.


Filed under technical

Statistics of user trial results

We now have results from our user trials showing how effective sux0r may be in filtering items from journal table of contents RSS feeds that are relevant to a user’s research interests.

Quick reminder of how we ran the trials: 20 users had access to sux0r for 6 weeks to train the analyser in what they found interesting and not interesting. We then barred access for 4 weeks but continued to aggregate feeds and filter them based on that training. Then we invited the users to look at the results of the filtering: two feeds from sux0r, one aggregating information about the journal articles published while the users were barred that sux0r predicted they would find relevant, the other containing information about the rest of the articles, the ones that sux0r predicted they wouldn’t find relevant. We had our users look through both feeds and tell us whether the articles really were relevant to their research interests. We lost two triallists and so have data for 18; you can see this data as a web page (or get the spreadsheet if you prefer).

The initial data needs a little explanation. The first columns (in yellow) relate to the number of items used in the initial six weeks to train the Bayesian analyser in what was relevant to the user’s research interests, what wasn’t, and the total number of items used in training. The “Additional docs” column relates to information added that didn’t come from the RSS feeds: we asked users to provide some documents that were relevant to their research interests for training, to make up for the fact that in a fairly short trial period the number of relevant items published may be low.

The next set of columns (in green) relate to the feed of items aggregated after the training (while the users had no access) that were predicted to match the user’s research interests, showing the number of items of interest in that feed, the total number of items in that feed, and the proportion of items in the feed that were interesting. The next three columns (in red) do exactly the same for the feed of items that were predicted not to be relevant.

For a quick overview of the results, here’s a chart of the fraction of interesting items in both feeds:

You need to be careful interpreting this chart. It hides some things. For example, the data point showing that the fraction of interesting items in one of the feeds was 1 (i.e. the feed of interesting items did indeed only have interesting items in it) hides the fact that this feed only had 2 items in it; the user found 9 items overall to be relevant to their research interests, and 7 of them were in the wrong feed. Perhaps that’s not so good.

So, did it work? Well, one way of rephrasing that question is to ask whether the feed that was supposed to be relevant (the “interesting feed”) did indeed contain more items relevant to the user’s research interests than would otherwise have been the case. That is, is the proportion of interesting items in the interesting feed higher than the proportion of interesting items in the two feeds combined? The answer in all but one case is yes, typically by a factor of between two and three. (The exception is a feed which achieved similar success in getting it wrong. We don’t know what happened here.)

Also we can look at the false negatives, i.e. the number of items that really were of relevance to the user’s interests that were in the feed that was predicted not to be interesting. The chart above shows quite nicely that after using about 150 items for training this was very low.

What about some statistics? It’s worth checking whether the increase in the concentration of items related to a user’s research interests as a result of filtering is statistically significant. We used a two-sample Z test to compare the difference in the proportion of interesting items in the two feeds with the size of difference that could be expected to arise by chance.
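
For reference, the usual pooled two-sample statistic for comparing two proportions looks like this (a sketch of the kind of calculation involved, in LaTeX notation; the exact variant we used may differ slightly):

% Pooled two-sample Z statistic for comparing two proportions
% \hat{p}_1, \hat{p}_2 : proportion of interesting items in each feed
% n_1, n_2 : number of items in each feed
% \hat{p} : pooled proportion of interesting items across both feeds
Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}

The larger Z is, the less plausible it is that the difference between the two feeds arose by chance.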

I have some reservations about this because of the small number of “interesting” items found in the feed that should be uninteresting when the filtering works (this means that one of the assumptions of the Z test might not be valid when the filtering is working best), but any value of Z above 3 cannot reasonably be expected to have happened by chance.

Conclusion: for users who used more than about 150 items in training, the filtering produces a statistically significant increase in the proportion of items in the feed that are relevant to the user’s research interests, without filtering out a large number of items that would have been of interest. Next post: were the users happy with these results?


Filed under trialling

User activity

One indirect measure we have of the level of engagement from our trial users is how often they signed into the system, looked at their feeds and did some training. Some analysis of the sux0r logs gives the following chart of activity by date (each colour represents a different user):

There was obviously a lot of variation between users in how much they used the system (more on that very soon), but what I like from this graph is that for several users (about a third of them) it shows continual spontaneous use throughout the trial period, not just at the points when we were pushing them.


Filed under trialling

Preliminary findings of user trials

We’re now coming to the end of the user trials. Here are some preliminary conclusions, which mostly relate to the start of the trials, when we gave our users a questionnaire to try to check our assumptions about what would help and their expectations of what we might do.

Our users come from the Science and Engineering schools at Heriot-Watt University: they’re computer scientists, engineers, physicists, chemists, bioscientists and mathematicians. Just over half are PhD students; most of the others are post-docs, though there are two lecturers and a professor.

This still seems like a good idea.
That is to say, potential users seem to think it will help them. We wanted 20 volunteer users for the trial and we didn’t find it difficult to get them; in fact we got 21. Nor was it too difficult to get them to use sux0r; only one failed to use it to the extent we required. Of course there was a bit of chivvying involved, and we’re giving them an Amazon voucher as a thank-you when they complete the trial, which has probably helped, but compared to other similar evaluations it hasn’t been difficult to get potential users engaged with what we’re trying to do.

Our assumptions about how researchers keep up to date are valid for a section of potential users.
We assumed that researchers would try to keep up to date with what was happening in their field by monitoring what was in the latest issues of a defined selection of relevant journals. That is true of most of them to some extent. So, for example, 11 said that they received email alerts to stay up to date with journal papers. On the other hand, the number of journals monitored was typically quite small (5 people looked at none; 8 at 1-4; 6 at 5-10; and 2 at 11-25). This matched what we heard from some volunteers: that monitoring current journals wasn’t particularly important to them compared to fairly tightly focused library searches when starting a new project and hearing about papers through social means (by which I mean through colleagues, at conferences and through citations). Our impression is that it was the newer researchers, the PhD students, who made more use of journal tables of contents. This would need checking, but perhaps it is because they work on a fairly specific topic for a number of years and are less well connected to the social research network, whereas a more mature researcher will have accreted a number of research interests and will know and communicate with others in the same field.

Feeds alone won’t do it.
Of our 21 mostly young science and technology researchers, 9 use RSS feeds (mostly through a personal homepage such as Netvibes), 5 don’t use them but know what they are, and 7 have never heard of them. Only 2 use RSS feeds to keep up to date with journals (the same number as use print copies of journals and photocopies of journal ToCs), compared with 11 who use email alerts.

If you consider this alongside the use of other means of finding new research papers I think the conclusion is that we need to embed the filtered results into some other information discovery service rather than just provide an RSS feed from sux0r. Just as well we’re producing an API.

We have a definition of what “works” means for the filtering
We found that currently fewer than 25% of the articles in a table of contents are of interest to the individual researchers, and they expect this to rise to 50% or higher in the filtered feed (7 want 50%, 7 want 75% and one wants everything to be of interest). On the other hand, false negatives, that is interesting articles that wrongly get filtered out, need to be lower than 5-10%.

Those are challenging targets. We’ll be checking the results against them in the second part of the user tests (which are happening as I’ve been writing this), but we’ll also check whether what we do achieve is perceived as good enough.

Just for the ultra-curious among you, here’s the aggregate data from the questionnaire for this part of the trials:

Total Started Survey: 21

Total Completed Survey: 21 (100%)

No participant skipped any questions

1. What methods do you use to stay up to date with journal papers?
Email Alerts 52.4% 11
Print copy of Journals 14.3% 3
Photocopy of Table of Contents 9.5% 2
RSS Feeds 9.5% 2
Use Current Awareness service (i.e. ticTOCs) 4.8% 1
None   0.0% 0
Other (please specify) 61.9% 13
2. How do you find out when an interesting paper has been published?
Find in a table of contents 14.3% 3
Alerted by a colleague 38.1% 8
Read about it in a blog 9.5% 2
Find by searching latest articles 76.2% 16
Other (please specify) 47.6% 10
3. How many journals do you regularly follow?
None 23.8% 5
1-4 38.1% 8
5-10 28.6% 6
11-25 9.5% 2
26+   0.0% 0
4. Do you subscribe to any RSS Feeds?
Yes, using a feed reader (i.e. bloglines, google reader) 9.5% 2
Yes, using a personal homepage (i.e. iGoogle, Netvibes, pageflakes) 23.8% 5
Yes, using a desktop client (thunderbird, outlook) 4.8% 1
Yes, using my mobile phone 4.8% 1
No, but I know what RSS Feeds are 23.8% 5
No, never heard of them 33.3% 7
Other (please specify)   0.0% 0
5. When scanning a table of contents for a journal you follow, on average, what percentage of articles are of interest to you?
100%   0.0% 0
Over 75%   0.0% 0
Over 50% 4.8% 1
Over 25% 19.0% 4
Less than 25% 71.4% 15
I don’t scan tables of contents 4.8% 1
6. The Bayesian Feed Filter project is investigating a tool which will filter out articles from the latest tables of contents for journals that are not of interest to you.
What would be an acceptable percentage of interesting articles for such a tool?
I would expect all articles to be of interest 4.8% 1
I would expect at least 75% of articles to be of interest 33.3% 7
I would expect at least 50% of articles to be of interest 33.3% 7
I would expect at least 25% of articles to be of interest 19.0% 4
I would only occasionally expect an article to be of interest 9.5% 2
7. What percentage of false negatives (i.e. wrongly filtering out interesting articles) would be acceptable for such a tool?
0% (No articles wrongly filtered out) 14.3% 3
<5% 23.8% 5
<10% 38.1% 8
<20% 4.8% 1
<30% 4.8% 1
<50%   0.0% 0
False negatives are not a problem 14.3% 3
8. What sources of research literature do you follow?
Journal Articles 95.2% 20
Conference proceedings 71.4% 15
Pre-prints 14.3% 3
Industry News 33.3% 7
Articles in Institutional or Subject Repositories 19.0% 4
Theses or Dissertations 57.1% 12
Blogs 33.3% 7
Other (please specify) 19.0% 4


Filed under trialling

OAuth

Congratulations to Santy on getting an OAuth test client working. We’re going to be using OAuth to authorise remote access to the Feed Filter (I guess that should be obvious); about 90% of our features require it. One of the “weaknesses” I put in the SWOT analysis was that we had a lot to learn, and fully understanding and implementing OAuth relates directly to that. I guess that makes us stronger now. Next: OAuth on the server.


Filed under technical

BayesFF in 45 seconds

I’m doing a 45 second presentation on the Bayes Feed Filter project at the JISC Rapid Innovation Development meeting in Manchester today. This is it:

The Bayesian Feed Filter will help researchers keep up to date with current developments in their field. It will automatically filter RSS and Atom feeds from journals’ tables of contents to (hopefully) select those items that are relevant to an individual’s research interests.

It uses Bayesian statistical analysis, the same approach used in many spam filters. First you need to train it with samples of what you are and aren’t interested in; then it compares the frequency with which words occur in the text to predict whether new items are on a similar topic to the samples that you were interested in.
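
To make that concrete, here is a toy sketch of word-frequency classification in Python. It’s purely illustrative (sux0r’s actual implementation differs, and the training texts here are made up), but it shows the train-then-compare pattern described above:

# Toy Bayesian text classifier: train on labelled samples, then score new text
from collections import Counter
import math

def train(samples):
    """samples: dict mapping a label to a list of training texts."""
    counts = {label: Counter(w for text in texts for w in text.lower().split())
              for label, texts in samples.items()}
    totals = {label: sum(c.values()) for label, c in counts.items()}
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def classify(text, counts, totals, vocab):
    """Return the label whose training words best match the new text."""
    scores = {}
    for label in counts:
        log_prob = 0.0
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the probability
            log_prob += math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

counts, totals, vocab = train({
    "interesting":     ["bayesian filtering of rss feeds", "repository metadata standards"],
    "not interesting": ["football transfer rumours", "celebrity gossip roundup"],
})
print(classify("filtering journal rss feeds with bayesian statistics",
               counts, totals, vocab))    # prints: interesting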

We are testing whether this approach works for researchers and table of contents feeds and building an API, so we would like to talk to anyone who can use it to personalize their own data presentation.


Filed under dissemination