Author Archives: Phil Barker

Preliminary findings of user trials

We’re now coming to the end of the user trials. Here are some preliminary conclusions, which mostly relate to the start of the trials, when we gave our users a questionnaire to check our assumptions about what would help and their expectations of what we might do.

Our users come from the Science and Engineering schools at Heriot-Watt University: they’re computer scientists, engineers, physicists, chemists, bioscientists and mathematicians. Just over half are PhD students; most of the others are post-docs, though there are two lecturers and a professor.

This still seems like a good idea.
That is to say, potential users seem to think it will help them. We wanted 20 volunteer users for the trial and we didn’t find it difficult to get them; in fact we got 21. Nor was it too difficult to get them to use Sux0r; only one failed to use it to the extent we required. Of course there was a bit of chivvying involved, and we’re giving them an Amazon voucher as a thank-you when they complete the trial, which has probably helped, but compared to other similar evaluations it hasn’t been difficult to get potential users engaged with what we’re trying to do.

Our assumptions about how researchers keep up to date are valid for a section of potential users.
We assumed that researchers would try to keep up to date with what was happening in their field by monitoring what was in the latest issues of a defined selection of relevant journals. That is true of most of them to some extent. For example, 11 said that they received email alerts to stay up to date with journal papers. On the other hand, the number of journals monitored was typically quite small (5 people looked at none; 8 at 1-4; 6 at 5-10; and 2 at 11-25). This matched what we heard from some volunteers: that monitoring current journals wasn’t particularly important to them compared to fairly tightly focused library searches when starting a new project, and hearing about papers through social means (by which I mean through colleagues, at conferences and through citations). Our impression is that it was the newer researchers, the PhD students, who made more use of journal tables of contents. This would need checking, but perhaps it is because they work on a fairly specific topic for a number of years and are less well connected to the social research network, whereas a more mature researcher will have accreted a number of research interests and will know and communicate with others in the same field.

Feeds alone won’t do it.
Of our 21 mostly young science and technology researchers, 9 know they use RSS feeds (mostly through a personal homepage such as Netvibes), 5 don’t use them but know what they are, 7 have never heard of them; 2 use RSS feeds to keep up to date with journals (the same number as use print copies of journals and photocopies of journal ToCs), compared with 11 who use email alerts.

If you consider this alongside the use of other means of finding new research papers I think the conclusion is that we need to embed the filtered results into some other information discovery service rather than just provide an RSS feed from sux0r. Just as well we’re producing an API.

We have defined “works” for filtering
We found that currently fewer than 25% of articles in a table of contents are of interest to the individual researchers, and they have an expectation that this will rise to 50% or higher (7 want 50%, 7 want 75% and one wants everything to be of interest) in the filtered feed. On the other hand false negatives, that is the interesting articles that wrongly get filtered out, need to be lower than 5-10%.

Those are challenging targets. We’ll be checking the results against them in the second part of the user tests (which are happening as I’ve been writing this), but we’ll also check whether what we do achieve is perceived as good enough.
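To make the two targets concrete, here is a minimal sketch of how they could be computed for one filtered table of contents. It assumes that "of interest" means the share of kept articles the researcher marked as interesting, and that the false negative rate is the share of interesting articles that were filtered out; the function name and the article IDs are made up for illustration and are not trial data.

```python
def filter_metrics(interesting, kept):
    """Return (share of kept articles that are interesting,
    share of interesting articles that were filtered out)."""
    interesting, kept = set(interesting), set(kept)
    hits = interesting & kept      # interesting articles that survived the filter
    missed = interesting - kept    # false negatives
    share_of_interest = len(hits) / len(kept) if kept else 0.0
    false_negative_rate = len(missed) / len(interesting) if interesting else 0.0
    return share_of_interest, false_negative_rate


# Made-up table of contents: 40 articles, 8 of them interesting (20%).
interesting = {1, 2, 3, 4, 5, 6, 7, 8}
# The hypothetical filter keeps 12 articles, 7 of which are interesting.
kept = {1, 2, 3, 4, 5, 6, 7, 20, 21, 22, 23, 24}

share, fn_rate = filter_metrics(interesting, kept)
print(f"{share:.0%} of the filtered feed is of interest")
print(f"{fn_rate:.1%} of interesting articles were filtered out")
```

In this made-up case 7 of the 12 kept articles are interesting, which would meet the 50% expectation, but 1 of the 8 interesting articles is lost (12.5%), which would miss the 5-10% false negative target.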

Just for the ultra-curious among you, here’s the aggregate data from the questionnaire for this part of the trials

Total Started Survey: 21

Total Completed Survey: 21 (100%)

No participant skipped any questions

1. What methods do you use to stay up to date with journal papers?
Email Alerts 52.4% 11
Print copy of Journals 14.3% 3
Photocopy of Table of Contents 9.5% 2
RSS Feeds 9.5% 2
Use Current Awareness service (i.e. ticTOCs) 4.8% 1
None   0.0% 0
Other (please specify) 61.9% 13
2. How do you find out when an interesting paper has been published?
Find in a table of contents 14.3% 3
Alerted by a colleague 38.1% 8
Read about it in a blog 9.5% 2
Find by searching latest articles 76.2% 16
Other (please specify) 47.6% 10
3. How many journals do you regularly follow?
None 23.8% 5
1-4 38.1% 8
5-10 28.6% 6
11-25 9.5% 2
26+   0.0% 0
4. Do you subscribe to any RSS Feeds?
Yes, using a feed reader (i.e. bloglines, google reader) 9.5% 2
Yes, using a personal homepage (i.e. iGoogle, Netvibes, pageflakes) 23.8% 5
Yes, using a desktop client (thunderbird, outlook) 4.8% 1
Yes, using my mobile phone 4.8% 1
No, but I know what RSS Feeds are 23.8% 5
No, never heard of them 33.3% 7
Other (please specify)   0.0% 0
5. When scanning a table of contents for a journal you follow, on average, what percentage of articles are of interest to you?
100%   0.0% 0
Over 75%   0.0% 0
Over 50% 4.8% 1
Over 25% 19.0% 4
Less than 25% 71.4% 15
I don’t scan tables of contents 4.8% 1
6. The Bayesian Feed Filter project is investigating a tool which will filter out articles from the latest tables of contents for journals that are not of interest to you.
What would be an acceptable percentage of interesting articles for such a tool?
I would expect all articles to be of interest 4.8% 1
I would expect at least 75% of articles to be of interest 33.3% 7
I would expect at least 50% of articles to be of interest 33.3% 7
I would expect at least 25% of articles to be of interest 19.0% 4
I would only occasionally expect an article to be of interest 9.5% 2
7. What percentage of false negatives (i.e. wrongly filtering out interesting articles) would be acceptable for such a tool?
0% (No articles wrongly filtered out) 14.3% 3
<5% 23.8% 5
<10% 38.1% 8
<20% 4.8% 1
<30% 4.8% 1
<50%   0.0% 0
False negatives are not a problem 14.3% 3
8. What sources of research literature do you follow?
Journal Articles 95.2% 20
Conference proceedings 71.4% 15
Pre-prints 14.3% 3
Industry News 33.3% 7
Articles in Institutional or Subject Repositories 19.0% 4
Theses or Dissertation 57.1% 12
Blogs 33.3% 7
Other (please specify) 19.0% 4

Filed under trialling

OAuth

Congratulations to Santy on getting an OAuth test client working. We’re going to be using OAuth to authorise remote access to the Feed Filter (I guess that should be obvious); about 90% of our features require it. One of the “weaknesses” I put in the SWOT analysis was that we had a lot to learn, and fully understanding and implementing OAuth relates directly to that. I guess that makes us stronger now. Next: OAuth on the server.
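For anyone wondering what an OAuth client has to do, here is a minimal sketch of signing a request. It assumes OAuth 1.0 with the HMAC-SHA1 signature method (as described in RFC 5849); this is not Santy's test client, and the URL, keys, secrets and function names are all hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode, leaving only RFC 3986 unreserved characters as-is."""
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64 HMAC-SHA1 signature for a request.
    Assumes `url` carries no query string; all parameters are in `params`."""
    # 1. Encode and sort the parameters, then join them as key=value pairs.
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    # 2. The signature base string is METHOD & URL & parameters, each encoded.
    base_string = "&".join([method.upper(), pct(url), pct(normalized)])
    # 3. The signing key combines the consumer secret and the token secret.
    key = f"{pct(consumer_secret)}&{pct(token_secret)}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Hypothetical request for a user's filtered items.
params = {
    "oauth_consumer_key": "example-consumer-key",
    "oauth_token": "example-access-token",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1250000000",
    "oauth_nonce": "abc123",
    "oauth_version": "1.0",
    "category": "interesting",
}
signature = sign_request(
    "GET", "https://feedfilter.example.org/api/items", params,
    consumer_secret="example-consumer-secret",
    token_secret="example-token-secret",
)
print(signature)
```

The server repeats the same calculation with the secrets it holds and only grants access if the two signatures match, so the secrets themselves never travel with the request.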

Filed under technical

BayesFF in 45 seconds

I’m doing a 45 second presentation on the Bayes Feed Filter project at the JISC Rapid Innovation Development meeting in Manchester today. This is it:

The Bayesian Feed Filter will help researchers keep up to date with current developments in their field. It will automatically filter RSS and ATOM feeds from Journals’ tables of contents to (hopefully) select those items that are relevant to an individual’s research interests.

It uses Bayesian statistical analysis, the same approach used in many spam filters. First you need to train it with samples of what you are and aren’t interested in; then it compares the frequency with which words occur in the text to predict whether new items are on a similar topic to the samples that you were interested in.
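The train-then-predict idea can be sketched in a few lines. This is a minimal naive Bayes classifier with add-one smoothing, written only to illustrate the approach; it is not sux0r's code, and the class name, category labels and sample titles are all made up.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    def __init__(self):
        # Word frequencies and document counts, kept per category.
        self.word_counts = {"interesting": Counter(), "not_interesting": Counter()}
        self.doc_counts = Counter()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, category):
        self.word_counts[category].update(self.tokenize(text))
        self.doc_counts[category] += 1

    def score(self, text):
        """Return the estimated probability that `text` is interesting."""
        vocab_size = max(len(set().union(*self.word_counts.values())), 1)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, counts in self.word_counts.items():
            # Prior: how often this category was seen in training (smoothed).
            log_p = math.log((self.doc_counts[category] + 1) / (total_docs + 2))
            total_words = sum(counts.values())
            for word in self.tokenize(text):
                # Likelihood: how often the word occurs in this category (smoothed).
                log_p += math.log((counts[word] + 1) / (total_words + vocab_size))
            log_scores[category] = log_p
        # Turn the log scores into a probability for "interesting".
        top = max(log_scores.values())
        weights = {c: math.exp(v - top) for c, v in log_scores.items()}
        return weights["interesting"] / sum(weights.values())


bayes = NaiveBayesFilter()
bayes.train("Crystallization kinetics of biopolymer gels", "interesting")
bayes.train("Polymer crystallization under shear flow", "interesting")
bayes.train("Quarterly football results and transfer news", "not_interesting")
bayes.train("Stock market report for technology shares", "not_interesting")

on_topic = bayes.score("Shear-induced crystallization in biopolymer melts")
off_topic = bayes.score("Football transfer news")
print(on_topic > 0.5, off_topic < 0.5)
```

With only four training samples the probabilities are crude, but the first new title shares words with the "interesting" samples and scores above 0.5, while the second scores below it.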

We are testing whether this approach works for researchers and Table of Contents feeds, and building an API, so we would like to talk to anyone who can use it to personalize their own data presentation.

Filed under dissemination

New features planned for sux0r

My last post described what sux0r already does; this one describes the features for the API that we plan to add.

The idea is to allow users of a remote application to classify feeds and to see the results, i.e. do what was described in that last post but without using the sux0r interface. The hope is that this will allow the use of the filter to be embedded in their own personal toolset, and more generally make the functionality of sux0r as a feed filter/classifier available to other services and applications.

To do this we think the API needs to provide access to the following sux0r functionality (the priority refers to our priority for implementing the feature):

1. Authorise account access for user application
A user gains access to their account through an application using the API (using OAuth). High priority.

2. Add a New Feed
A user suggests a feed to be made available for adding to sux0r users’ accounts. High priority

3. Approve a Feed for a User
A feed administrator approves a feed added by a user so that it can be added to users’ accounts. High priority

4. Associate feed with a user
A user associates an approved feed with their account. High priority

5. Create a new Vector for a User
A user creates a new classification vector. Medium priority

6. Create a new Category for a User’s Vector
A user creates a new classification category on a specified vector. Medium priority.

7. Train a Document for a User
The user submits a document and the desired classification to train the classifier. High Priority.

Note: The document could be an RSS Item, which already exists in the database and hence will have an RSS ID number, or it could be plain text, which needs to be added to the database and then trained.

8. Return the RSS Items for a User
A user gets all Items from RSS Feeds to which the user is subscribed. Feeds may be sorted or filtered according to specified criteria (e.g. only those in a certain category). Very high priority.

9. Return RSS Feeds for All Users
A user gets a list of all the feeds in the database. Medium priority.

10. Return RSS Feeds for a User
A user gets a list of all the feeds they are subscribed to. High priority

11. Remove feed
A user requests to remove a feed (association) from their account. Medium priority

12. Return vectors
A user gets a list of all the vectors she has created. Medium priority

13. Return categories
A user wants to view all the categories they have created for a vector. Medium priority

14. Export the Bayesian Token Analysis for a User
A user gets the information on frequency of occurrence of words in each vector-category.
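As a rough illustration of how some of these features hang together, here is a small in-memory sketch of the feed workflow (features 2, 3, 4, 10 and 11) and of training with either an existing RSS item or plain text (feature 7). It is not the planned API or sux0r code; the class, method names and URL are hypothetical, and a real implementation would sit behind OAuth and a database.

```python
from dataclasses import dataclass, field


@dataclass
class FeedFilterModel:
    """Hypothetical stand-in for the server side of the planned API."""
    suggested: set = field(default_factory=set)        # feeds awaiting approval
    approved: set = field(default_factory=set)         # feeds users may add
    subscriptions: dict = field(default_factory=dict)  # user -> set of feed URLs
    documents: dict = field(default_factory=dict)      # document id -> text
    training: list = field(default_factory=list)       # (user, doc id, vector, category)

    def add_feed(self, url):                           # feature 2
        self.suggested.add(url)

    def approve_feed(self, url):                       # feature 3
        if url not in self.suggested:
            raise ValueError("feed has not been suggested")
        self.suggested.discard(url)
        self.approved.add(url)

    def associate_feed(self, user, url):               # feature 4
        if url not in self.approved:
            raise PermissionError("feed is not approved yet")
        self.subscriptions.setdefault(user, set()).add(url)

    def user_feeds(self, user):                        # feature 10
        return sorted(self.subscriptions.get(user, set()))

    def remove_feed(self, user, url):                  # feature 11
        self.subscriptions.get(user, set()).discard(url)

    def train(self, user, vector, category, rss_id=None, text=None):  # feature 7
        if (rss_id is None) == (text is None):
            raise ValueError("give exactly one of rss_id or text")
        if rss_id is None:
            # Plain text is stored first, then trained like any other item.
            rss_id = max(self.documents, default=0) + 1
            self.documents[rss_id] = text
        elif rss_id not in self.documents:
            raise KeyError(rss_id)
        self.training.append((user, rss_id, vector, category))
        return rss_id


model = FeedFilterModel()
feed = "http://journal.example.org/toc.rss"
model.add_feed(feed)
model.approve_feed(feed)
model.associate_feed("alice", feed)
new_id = model.train("alice", "research", "interesting",
                     text="Crystallization of biopolymer gels")
print(model.user_feeds("alice"), new_id)
```

The point of the sketch is the ordering: a feed has to be suggested and approved before anyone can subscribe to it, and plain text has to be stored before it can be trained.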

Filed under dissemination, technical

About sux0r

This post describes what is pretty much the starting point for our development work: the Sux0r OS software; my next will describe what we plan to add.

I came across sux0r while investigating the feasibility for the project, before writing the bid: while I found several references to the idea of Bayesian filtering of RSS feeds, and a couple of projects that had made a start on software to implement the idea, sux0r was the only open source project that I found that was still active. But sux0r is not just a personal feed filter, in fact it is something of an all-round content management system with Bayesian classification and support for group collaboration. It comprises a blogging platform, bookmarking, image repository and RSS feed aggregator.

While that’s great for a content management system, it’s a lot more than we really want to deal with. We considered the option of stripping out the functionality that we didn’t want to use, leaving just RSS aggregator and filter, but that seemed like fairly radical surgery to be performing, especially at the start of a project before we really got familiar with what did what in the sux0r code. It also didn’t seem to be a good strategy for contributing back to the sux0r project. So we adopted a more superficial approach: we have a complete installation of sux0r but we have customised the interface so that our users don’t get to see that there is an image library, blogging platform or social bookmark facility.

Using sux0r for feed filtering involves the following steps.

Continue reading

Filed under dissemination, technical

Idea: extension to previous literature

I think Bayesian Feed Filtering isn’t just limited to current issues alerts. When searching for everything that has been published on a particular topic (e.g. when I used to do searches on everything published on the biopolymer system I researched the crystallization of) it’s easy to get an overwhelming number of responses but difficult to focus down onto just those that are of interest. So how about doing a general search remotely via SRU, transforming the results into RSS (a page at a time) and passing them through the Bayesian feed filter?

Work I did a while back on transforming SRU responses to HTML might be a starting point (though I swore off ever again trying to do anything like that with XSLTs).
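A rough sketch of the idea, done in Python rather than XSLT: build an SRU searchRetrieve URL for one page of results, then turn the Dublin Core fields in the response into RSS items that could be passed to the filter. The server address, the sample response and the function names are made up, and real SRU servers differ in the record schemas they return, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SRW = "{http://www.loc.gov/zing/srw/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def sru_url(base_url, cql_query, start=1, page_size=10):
    """Build an SRU 1.1 searchRetrieve URL for one page of results."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": page_size,
    }
    return base_url + "?" + urlencode(params)


def sru_to_rss(sru_xml, title, link):
    """Map each SRU record's Dublin Core fields onto an RSS 2.0 item."""
    root = ET.fromstring(sru_xml)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "SRU search results"
    for record in root.iter(SRW + "record"):
        item = ET.SubElement(channel, "item")
        for dc_name, rss_name in (("title", "title"),
                                  ("identifier", "link"),
                                  ("description", "description")):
            found = record.find(".//" + DC + dc_name)
            if found is not None and found.text:
                ET.SubElement(item, rss_name).text = found.text
    return ET.tostring(rss, encoding="unicode")


# Made-up response with a single record, standing in for a real server reply.
SAMPLE = """<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <srw:numberOfRecords>1</srw:numberOfRecords>
  <srw:records>
    <srw:record>
      <srw:recordData>
        <dc:title>Crystallization of a biopolymer gel</dc:title>
        <dc:identifier>http://example.org/paper/1</dc:identifier>
      </srw:recordData>
    </srw:record>
  </srw:records>
</srw:searchRetrieveResponse>"""

url = sru_url("http://sru.example.org/search", 'dc.title = "biopolymer"')
rss_page = sru_to_rss(SAMPLE, "Biopolymer search", url)
print(rss_page)
```

Fetching the next page would just mean calling sru_url again with start increased by page_size, which fits the "a page at a time" approach.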

Filed under applications

SWOT Analysis

Here are the Strengths, Weaknesses, Opportunities & Threats of the project, as estimated by Lisa and Phil during an informal project meeting over coffee. Following standard SWOT procedures (I used info and templates from businessballs.com and CIPD for guidance), Strengths and Weaknesses are internal and Opportunities and Threats are external. We think the “internals” of the project comprise the project team (our skills and connections to others) and the idea itself and the approach to realising it; the “externals” are the users, the sux0r project, the JISC environment and others (e.g. commercial interests, our host institution and the wider HE system).

(The points are numbered for ease of referencing, not for ordering.)

Strengths

  1. We think we’re starting with a good idea, at least in principle; an innovative solution to a recognized need.
  2. Using sux0r as a starting point has given us access to existing OS code and put us in contact with a knowledgeable developer.
  3. We have a settled team who have worked well together on a number of previous projects over the last 4-10 years.
  4. We have good existing links with experts in JISC, CETIS, the IE, UKOLN, JISC services and projects (and we’re not afraid to use them).
  5. We have previous experience in related projects dealing with Journal ToC and other RSS feeds (e.g. PerX, TicTocs, GoldDust . . .).
  6. We work in close proximity to our intended test user group (which should help with encouraging engagement for the trials).

Weaknesses

  1. We have lots of new stuff to learn: this is the most deliberately RESTful development we have undertaken; we’re using a project management technique that is new to us; this is the first time we’ve worked on a branch of an existing OSS project; we need more robust user trials than we’ve previously managed.
  2. We have all that to learn in a short project time frame (six months, all the team are working part time on this project).
  3. Bayesian filtering is not a complete solution. Other techniques (e.g. popularity from usage data analysis; manual over-rides to specify that everything from some authors is important, no matter what the topic) would help identify important items but are out of scope.
  4. Bayesian filtering might not work for our users with the type of data and sources we have (see threats), though as a good academic I think this is not so much a weakness as a potential research finding.

Opportunities

  1. Working with sux0r provides an opportunity to work with an existing user base and experienced developer.
  2. Other projects in the information environment provide additional/alternative usage scenarios (but see threat 2).
  3. It may be possible to embed the output of this project into other services, e.g. TicTocs, TechXtra, JISC IE or commercial services.
  4. There is good support for RESTful development approaches.
  5. There is a good developer community in the JISCRI projects.

Threats

  1. Lack of user engagement. We don’t know that users will be as enthusiastic about this approach as we are; they might just resent disruptive technologies.
  2. Expectation mismatch (see opportunity 2 & weakness 3), possibly leading to scope creep.
  3. There might be some unexpected conflict with the sux0r project (over approach or priorities).
  4. There might be a lack of table of content information from the right journals in RSS form, or what there is might be polluted (garbage in garbage out).
  5. Competing demands on time from other projects/tasks that the team are working on (see weakness 2).

I guess some mitigation of the negative factors is called for. That will come later, but a quick reflection is that engagement with the project externals is going to be important.

The programme guidance documentation suggests that the SWOT analysis is best undertaken in small steps throughout the duration of the project, and the other guidance I read suggested that it should draw on as many viewpoints as possible. So hopefully this isn’t the last on SWOT, and please comment on anything that has been overlooked.

Filed under management