Notes from the WWW 2012 conference
Last week I attended the International World Wide Web conference in Lyon, France. This conference is probably the largest one in that space: around 2500 participants and 15 parallel tracks. I presented two papers:
Automatic interlinking of speech radio archives, in the Linked Data on the Web workshop, focusing on the automated tagging algorithm we mentioned earlier on this blog (slides);
Automated Semantic Tagging of Speech Audio, in the demo track, focusing on the various tools we built to process very large archives with this algorithm, and on applications we built with MetaBroadcast using the resulting tags (slides).
I also contributed to a panel with Peter Mika from Yahoo! Research, Ivan Herman from the W3C, and Sir Tim Berners-Lee from MIT/W3C. The panel, entitled 'Microdata, RDFa, Web APIs, Linked Data: Competing or Complementary?', looked at publishing statistics for structured data extracted from the Web Data Commons dataset and from a Yahoo! dataset, to try to understand which formats are used and for which use-cases. One of the main messages from this panel is that structured web data is already mainstream - Yahoo! reports that 25% of all web pages contain RDFa data and 7% contain Microdata.

From left to right, Peter Mika, Yves Raimond, Ivan Herman, Tim Berners-Lee (c) Inria / picture T. Fournier
I thought I would write up my notes from the conference. Of course, I wasn't able to see everything, so the selection of papers below just reflects the presentations I attended. Given the general quality of the papers, I strongly suggest going through the online proceedings.
Linked Data on the Web workshop
I spent the first day of the conference in the Linked Data on the Web workshop. A couple of personal highlights were the following papers:
- NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. As more and more online services for Named Entity Recognition become available, the NERD framework attempts to align them, providing a unified way of accessing their results as well as a way to compare them. It looks like most of them perform well in particular domains, and perhaps the best results could be obtained by combining several of them.
- Towards Interoperable Provenance Publication on the Linked Data Web. This position paper describes how the work done by the W3C Provenance Working Group could be used to express provenance as Linked Data. One interesting aspect was the application of 'follow-your-nose' principles to provenance data. Some data could be marked as derived from another dataset, identified by a URI; dereferencing that URI would in turn expose its own derivation information, ultimately leading to a full provenance trail for any derived data (a minimal sketch of this idea follows after this list). This would be very useful for scientific datasets, but also for news articles, weather reports, etc.
- Using read/write Linked Data for Application Integration -- Towards a Linked Data Basic Profile. This paper introduces the Linked Data Basic Profile W3C member submission, which defines a read-write Linked Data architecture, apparently already in use in some IBM products.
- Interacting with the Web of Data through a Web of Inter-connected Lenses. This paper introduces Mashpoint, a framework for pivoting selections of data (e.g. a list of countries) between data visualisation web applications. Mashpoint looks like a very promising tool for data journalism.
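To make the 'follow-your-nose' provenance idea from the paper above a bit more concrete, here is a minimal sketch in Python using rdflib. It assumes datasets publish PROV-O style prov:wasDerivedFrom links at dereferenceable URIs; the starting URI is purely hypothetical, and this is an illustration of the pattern rather than code from the paper.

```python
# A minimal sketch, assuming each dataset URI dereferences to RDF containing
# PROV-O style prov:wasDerivedFrom links. The starting URI below is purely
# illustrative, not one from the paper.
from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")

def provenance_trail(start_uri, max_depth=5):
    """Follow prov:wasDerivedFrom links from dataset to dataset."""
    trail, frontier = [], [URIRef(start_uri)]
    for _ in range(max_depth):
        next_frontier = []
        for dataset in frontier:
            g = Graph()
            g.parse(str(dataset))  # dereference the URI and parse the RDF it returns
            for source in g.objects(dataset, PROV.wasDerivedFrom):
                trail.append((dataset, source))
                next_frontier.append(source)
        if not next_frontier:
            break
        frontier = next_frontier
    return trail

# Hypothetical usage:
# for derived, source in provenance_trail("http://example.org/dataset/derived-report"):
#     print(derived, "was derived from", source)
```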
AdMIRe and PhiloWeb workshops
On the second day I attended the AdMIRe workshop and the end of the PhiloWeb workshop. The former focused on advances in Music Information Retrieval, while the latter focused on the intersection of Web Science and Philosophy.
- Music Retagging Using Label Propagation and Robust Principal Component Analysis. This paper uses content-based similarities between musical tracks to improve the quality of user tags on those tracks.
- Melody, bassline and harmony representations for music version identification. This paper compares and combines a number of content-based features for the task of identifying different versions of the same musical work.
- Power-Law Distribution in Encoded MFCC Frames of Speech, Music, and Environmental Sound Signals. This paper was particularly interesting in that it dealt with sound classification based on Mel-frequency cepstral coefficients (MFCCs), which we have used in BBC R&D for a couple of projects. Most sound similarity metrics using aggregates of MFCCs assume that their distribution is homogeneous. However, for a wide range of sounds the MFCC distribution fits a shifted power-law distribution, which means that very few selected frames can be used to obtain similar performance (a rough sketch of this idea follows below). Perhaps using similarity measures which do not assume homogeneity could help take such biases towards particular combinations of coefficients into account?
Xavier Serra's keynote described the CompMusic project. Two particularly interesting aspects of the project are that it focuses solely on non-Western music and that it contributes directly to making MusicBrainz better, a bit like what we do for the BBC Music website.
- The Million Song Dataset Challenge. This paper describes a very large-scale dataset for the evaluation of music recommendation algorithms, providing a wide range of data about a million songs.
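As a side note on the MFCC power-law paper above, here is a rough Python sketch (using librosa, with placeholder file paths) of the idea that a small subset of frames can approximate a similarity computed over all frames. This is not the paper's method, just an illustration of the intuition.

```python
# A rough sketch (not the paper's method) of the idea that a small subset of
# MFCC frames can approximate a similarity computed over all frames.
# File paths are placeholders; requires librosa and numpy.
import numpy as np
import librosa

def mfcc_summary(path, n_frames=None, n_mfcc=13):
    """Mean/std summary of a track's MFCCs, optionally over a random frame subset."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    if n_frames is not None and n_frames < mfcc.shape[1]:
        idx = np.random.choice(mfcc.shape[1], size=n_frames, replace=False)
        mfcc = mfcc[:, idx]
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def distance(a, b):
    return float(np.linalg.norm(a - b))

# full = distance(mfcc_summary("a.wav"), mfcc_summary("b.wav"))
# few  = distance(mfcc_summary("a.wav", n_frames=50), mfcc_summary("b.wav", n_frames=50))
# print(full, few)  # if the power-law observation holds, these may be close
```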
I arrived quite late at the PhiloWeb workshop, but early enough to see a presentation about Common Logic, which provides most of the logical framework behind languages such as RDF. The workshop ended with a panel of the W3C Technical Architecture Group discussing various philosophical aspects of the Web. One of the biggest issues raised was the huge discrepancy between the 'normal' use of the Web (asynchronous JavaScript everywhere, many resources needed to construct any single web page) and the Semantic Web or 'purist' view of the Web.
Main conference - day 1
The main conference started on the Wednesday with a very inspiring keynote by Tim Berners-Lee. He tackled a number of very interesting topics, such as the 'principle of least power' when designing new languages, the need for open mobile web applications and the issues around hierarchical systems such as DNS and PKI. He then talked about what he called the 'three sides of privacy': personal data held by businesses, personal data leaks (and the so-called 'jigsaw effect') and privacy invasion (e.g. through Deep Packet Inspection). He concluded by asking the audience to spend 90% of their time building new things, but 10% of their time protecting the open Web infrastructure and information accountability.
I attended the demo sessions all afternoon, where I was presenting our automated tagging framework. The Google Art Project gave the keynote of this session, describing the work they have been doing capturing artworks from an international selection of museums. They demonstrated the ability to look at specific parts of artworks in detail, their 'street view' for museums, and the creation of personal collections of artworks. They also mentioned that an API to access the data would be opened - we'll certainly keep an eye out for that! In the same session, Rai presented their personalised newscasts use-case within the NoTube project, as well as some archive-related work aimed at helping journalists find news-related information in their archive.
Main conference - day 2
Thursday started with a keynote from Chris Welty (IBM Research), who was part of the team behind IBM Watson, which won the Jeopardy! quiz programme last year. Part of his keynote described the approach used for Watson, which is quite different from the traditional approach to automated question-answering, where a question is typically translated into some formal language and the resulting query is executed against a large knowledge base. Watson never tries to understand the 'meaning' of the questions. Rather, it finds documents that could hold the answer and scores them along many dimensions; it then learns the best combination of those scores from previous Jeopardy! games. Semantic technologies in Watson are used only for some of these scores, not as a goal in themselves. They are nonetheless an important tool, bringing around a 10% performance boost.
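To illustrate that general pattern (and definitely not IBM Watson's actual implementation), here is a toy Python sketch: each candidate answer gets scores along a few invented evidence dimensions, and a simple classifier trained on past question/answer pairs learns how to combine them into a single ranking score.

```python
# A toy sketch of the general pattern described above, not IBM Watson itself:
# candidate answers are scored along several evidence dimensions, and a model
# trained on past question/answer pairs learns how to combine the scores.
# Feature names and data are invented for illustration; requires scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: scores for one candidate answer, e.g.
# [passage_match, type_match, popularity, temporal_consistency]
X_train = np.array([
    [0.9, 1.0, 0.4, 0.8],   # correct candidate from a past game
    [0.7, 0.0, 0.9, 0.2],   # incorrect candidate
    [0.2, 1.0, 0.1, 0.9],   # incorrect candidate
    [0.8, 1.0, 0.6, 0.7],   # correct candidate
])
y_train = np.array([1, 0, 0, 1])  # 1 = was the right answer

model = LogisticRegression().fit(X_train, y_train)

# Rank new candidates by their combined score:
candidates = {"Lyon": [0.85, 1.0, 0.5, 0.9], "Paris": [0.6, 1.0, 0.9, 0.3]}
scores = {name: model.predict_proba([feats])[0, 1] for name, feats in candidates.items()}
print(max(scores, key=scores.get))
```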
This keynote was followed by a panel on the open Web, introduced by a keynote by Neelie Kroes from the European Commission. The panel was very good, with a lot of controversial questions being tackled, like the HADOPI law in France.
In the afternoon I attended the Entity Linking session. The LINDEN framework was presented first: a Named Entity Recognition technique using YAGO concepts as target identifiers. Candidate entities are generated and then disambiguated using a number of features, e.g. link probability (estimated using count information in the dictionary), semantic associativity (using the Wikipedia hyperlink structure), semantic similarity (derived from the YAGO taxonomy) and the topical coherence of a document around the candidate entity. The approach was interesting, but the paper suggests that a big part of the algorithm relies on concepts extracted by Wikipedia-Miner, which provide some of the context for the disambiguation. It wasn't clear how LINDEN compares with that tool on its own, and whether it actually improves on the results first obtained by Wikipedia-Miner.
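Here is a schematic Python sketch of the candidate-generation-plus-disambiguation pattern described above. The mention dictionary, features and weights are all invented for illustration and do not reproduce LINDEN's actual scoring functions.

```python
# A schematic sketch of the candidate-generation-plus-disambiguation pattern,
# with an invented dictionary, features and weights (not LINDEN's own scoring).
# Mention dictionary: surface form -> candidate entities with link probability
# (how often that anchor text points at each entity).
CANDIDATES = {
    "Lyon": {
        "http://yago/Lyon_(city)": {"link_prob": 0.8},
        "http://yago/Lyon_(surname)": {"link_prob": 0.2},
    }
}

WEIGHTS = {"link_prob": 0.4, "semantic_similarity": 0.3, "topical_coherence": 0.3}

def disambiguate(mention, context_score):
    """Pick the candidate entity with the highest weighted feature score.

    context_score(entity) should return the context-based features for an entity."""
    best, best_score = None, float("-inf")
    for entity, feats in CANDIDATES.get(mention, {}).items():
        feats = {**feats, **context_score(entity)}
        score = sum(WEIGHTS[name] * feats.get(name, 0.0) for name in WEIGHTS)
        if score > best_score:
            best, best_score = entity, score
    return best, best_score

def fake_context(entity):
    """Invented context features, just for the usage example."""
    is_city = "city" in entity
    return {"semantic_similarity": 0.9 if is_city else 0.1,
            "topical_coherence": 0.7 if is_city else 0.2}

print(disambiguate("Lyon", fake_context))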
The second paper was about generating cross-lingual links in Wikipedia. A significant number of Wikipedia pages are lacking cross-lingual links, as everything is currently done manually. The algorithm presented in this paper exploits the fact that articles linked to or from equivalent articles tend to be equivalent.
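A toy sketch of that intuition (not the paper's actual algorithm, and with invented data): score a candidate English/French article pair by counting how many of their linked articles are already connected by a known cross-lingual link.

```python
# A toy sketch of the intuition above, with invented data: candidate pairs get
# more support if their neighbours are already linked across languages.

# Article -> articles it links to, per language edition
EN_LINKS = {"en:Lyon": {"en:France", "en:Rhône", "en:Fourvière"}}
FR_LINKS = {"fr:Lyon": {"fr:France", "fr:Rhône", "fr:Marseille"}}

# Known cross-lingual links (English article -> French article)
KNOWN = {"en:France": "fr:France", "en:Rhône": "fr:Rhône"}

def support(en_article, fr_article):
    """Count neighbours of the English article whose known French equivalent
    is also a neighbour of the French article."""
    fr_neighbours = FR_LINKS.get(fr_article, set())
    return sum(1 for n in EN_LINKS.get(en_article, set())
               if KNOWN.get(n) in fr_neighbours)

print(support("en:Lyon", "fr:Lyon"))  # 2 shared linked equivalents
```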
The final paper of the session was about ZenCrowd, which uses probabilistic reasoning to combine automated and manual work (done through Amazon Mechanical Turk, which came up a lot during the conference for user evaluations) for an RDFa enrichment task.
The last session I attended that day was specifically about Semantic Web technologies. It covered why SPARQL 1.1 property paths are not scalable as currently specified and why their semantics need to be changed (this paper also won the best paper award at the conference), template-based question answering (which addresses the problem in a very different way to IBM Watson, by translating full-text questions into SPARQL queries), and mapping relational databases to RDF.
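For readers who haven't come across property paths, here is a minimal rdflib example of what such a query looks like. The toy graph is invented and far too small to exhibit the scalability problems the paper analyses.

```python
# A minimal illustration of a SPARQL 1.1 property path query using rdflib,
# just to show the feature being discussed; the toy graph is invented.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.a, EX.knows, EX.b))
g.add((EX.b, EX.knows, EX.c))
g.add((EX.c, EX.knows, EX.d))

# 'ex:knows+' matches one or more ex:knows hops
query = """
PREFIX ex: <http://example.org/>
SELECT ?person WHERE { ex:a ex:knows+ ?person }
"""
for row in g.query(query):
    print(row.person)  # prints ex:b, ex:c and ex:d (in some order)
```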
Main conference - day 3
I attended the EU track on the Friday morning, where current EU projects were showcased, including LAWA (tracking entities through time in Web archives) and ARCOMEM (making use of the social web for identifying Web documents to archive).
Finally, I attended the Web Mining session in the afternoon. This session included three very interesting papers. The first one started from the premise that 'real stories are not linear' and described an algorithm for generating 'tube maps' for news stories. The second one tackled the ambitious goal of predicting news events. Their system gathered a wide range of Linked Data and news articles, extracted causal links between events described within them, and tried to generalise those causal links. Then, given a particular event as input, the generalised links can be used to predict future events: for example, "China overtakes Germany as world's biggest exporter" leads their system to predict "wheat price will fall". The last paper mined the Google news archive, which holds several articles per day going back to 1895, and derived statistics about how long a person stays mentioned in the news. Apparently, the median duration of a person's news fame has consistently been 7 days over the last century.
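As a very rough sketch of the 'generalise causal links' step in the event prediction paper (nothing like the paper's actual system, with an invented toy ontology and rule), generalisation might look like replacing the specific entities in an extracted causal link with their classes, so the resulting rule can fire on future, similar events.

```python
# A very rough sketch of the 'generalise causal links' idea described above,
# not the paper's actual system; the toy ontology, events and rule are invented.

# Toy ontology: entity -> more general class
ONTOLOGY = {"China": "country", "Germany": "country", "France": "country",
            "wheat": "commodity"}

def generalise(event):
    """Replace known entities in an extracted event tuple with their classes,
    turning a specific causal link into a reusable rule."""
    return tuple(ONTOLOGY.get(term, term) for term in event)

# A specific observed causal link...
cause = ("China", "overtakes_as_biggest_exporter", "Germany")
effect = ("wheat", "price_falls")

# ...becomes a generalised rule that can match future, similar events.
rule = (generalise(cause), generalise(effect))

def matches(rule_cause, new_event):
    return generalise(new_event) == rule_cause

print(rule)
print(matches(rule[0], ("France", "overtakes_as_biggest_exporter", "China")))  # True
```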