
Notes from the WWW 2012 conference


Yves Raimond | 09:36 UK time, Thursday, 26 April 2012

Last week I attended the International World Wide Web conference in Lyon, France. This conference is probably the largest one in that space: around 2500 participants and 15 parallel tracks. I presented two papers:

I also contributed to a panel with Peter Mika from Yahoo! Research, Ivan Herman from the W3C, and Sir Tim Berners-Lee from MIT/W3C. The panel, entitled 'Microdata, RDFa, Web APIs, Linked Data: Competing or Complementary?', looked at statistics for structured data extracted from the Web Data Commons dataset and from a Yahoo! dataset, to try to understand which formats were used and for which use-cases. One of the main messages from this panel is that structured web data is already mainstream - Yahoo! reports that 25% of all web pages contain RDFa data and 7% contain Microdata.

WWW 2012, LDOW panel, day 1

From left to right, Peter Mika, Yves Raimond, Ivan Herman, Tim Berners-Lee (c) Inria / picture T. Fournier

I thought I would write my notes from the conference. Of course, I wasn't able to see everything so the selection of papers below just reflects the presentations I attended. Given the general quality of the papers, I strongly suggest going through the online proceedings.

Linked Data on the Web workshop

I spent the first day of the conference in the Linked Data on the Web workshop. A couple of personal highlights were the following papers:

AdMIRe and PhiloWeb workshops

On the second day I attended the AdMIRe workshop and the end of the PhiloWeb workshop. The former focused on advances in Music Information Retrieval, while the latter focused on the intersection of Web Science and Philosophy.

I arrived quite late at the PhiloWeb workshop, but early enough to see a presentation about Common Logic, which provides most of the logical framework behind languages such as RDF. The workshop ended with a panel of the W3C Technical Architecture Group discussing various philosophical aspects of the Web. One of the biggest issues raised was the huge discrepancy between the 'normal' use of the Web (asynchronous JavaScript everywhere, many resources needed to construct any single web page) and the Semantic Web or 'purist' view of the Web.

Main conference - day 1

The main conference started on the Wednesday with a very inspiring keynote by Tim Berners-Lee. He tackled a number of very interesting topics, such as the 'principle of least power' when designing new languages, the need for open mobile web applications and the issues around hierarchical systems such as DNS and PKI. He finished his keynote by talking about what he called the 'three sides of privacy': personal data held by businesses, personal data leaks (and the so-called 'jigsaw effect') and privacy invasion (e.g. through Deep Packet Inspection). He concluded by asking the audience to spend 90% of their time building new things, but 10% of their time protecting the open Web infrastructure and information accountability.

I attended the demo sessions all afternoon, where I was presenting our automated tagging framework. The Google Art Project held the keynote of this session, describing the work they have been doing capturing a number of artworks from an international selection of museums. They demonstrated the ability to look at specific parts of artworks in detail, their 'street view' for museums, and the creation of personal collections of artworks. They also mentioned that an API to access the data would be opened - we'll certainly keep an eye out for that! Rai also presented their personalised newscasts use-case within the NoTube project in the same session, as well as some archive-related work aimed at helping journalists find information in the news domain from their archive.

Main conference - day 2

Thursday started with a keynote from Chris Welty (IBM Research), who was part of the team behind IBM Watson, which won the Jeopardy! quiz show last year. Part of his keynote was spent describing the approach used for Watson, which is quite different from the traditional approach to automated question-answering, where a question is translated into some formal language and the resulting query is executed against a large knowledge base. Watson never tries to understand the 'meaning' of the questions. Rather, it finds documents that could hold the answer and scores them along lots of dimensions. Then, it learns the best combination of those scores based on previous Jeopardy! games. Semantic technologies in Watson are just used for some of these scores, not as an end in themselves. However, they are an important tool, bringing a 10% performance boost.
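The score-combination idea described above can be sketched in a few lines. This is purely illustrative (not IBM's actual implementation): the feature names, weights and candidates are made up, and the learned combination is reduced to a simple weighted sum.

```python
# Illustrative sketch: candidate answers are scored along several evidence
# dimensions, and a learned weight per dimension combines them into a single
# confidence. Feature names and weights here are hypothetical.

def combine_scores(features, weights):
    """Weighted sum of per-dimension evidence scores."""
    return sum(weights[name] * value for name, value in features.items())

def best_answer(candidates, weights):
    """Pick the candidate whose combined score is highest."""
    return max(candidates, key=lambda c: combine_scores(c["features"], weights))

# Weights would be learned offline, e.g. from past Jeopardy! games.
weights = {"passage_match": 0.5, "type_match": 0.3, "popularity": 0.2}

candidates = [
    {"answer": "Toronto",
     "features": {"passage_match": 0.4, "type_match": 0.1, "popularity": 0.9}},
    {"answer": "Chicago",
     "features": {"passage_match": 0.8, "type_match": 0.9, "popularity": 0.7}},
]

print(best_answer(candidates, weights)["answer"])  # Chicago
```

The point of the design is that no single scorer needs to be right: each dimension contributes weak evidence, and the learned combination does the heavy lifting.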

This keynote was followed by a panel on the open Web, introduced by a keynote by Neelie Kroes from the European Commission. The panel was very good, with a lot of controversial questions being tackled, like the HADOPI law in France.

In the afternoon I attended the Entity Linking session. The LINDEN framework was presented first, describing a Named Entity Recognition technique that uses YAGO entities as link targets. Candidate entities are generated and then disambiguated using a number of features, e.g. link probability (estimated using count information in the dictionary), semantic associativity (using the Wikipedia hyperlink structure), semantic similarity (derived from the YAGO taxonomy) and topical coherence of a document around the candidate entity. The approach was interesting, but the paper suggests that a big part of the algorithm relies on concepts extracted by Wikipedia-Miner, which provide some context for the disambiguation. It wasn't clear how LINDEN compares with that tool and whether it actually improves on the results first obtained by Wikipedia-Miner.
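To make the first of those features concrete, here is a minimal sketch of dictionary-based candidate generation with link probability. The dictionary counts are invented for illustration; the real system builds them from Wikipedia anchor text.

```python
# Illustrative sketch: how often a surface form ("mention") links to each
# entity in a corpus gives a prior probability for each candidate.
# The counts below are hypothetical.

anchor_counts = {
    "Lyon": {"Lyon_(city)": 950, "Olympique_Lyonnais": 50},
}

def candidates_with_link_probability(mention):
    """Return each candidate entity with its link probability."""
    counts = anchor_counts.get(mention, {})
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}

probs = candidates_with_link_probability("Lyon")
print(probs["Lyon_(city)"])  # 0.95
```

On its own this prior is often wrong for ambiguous mentions, which is why LINDEN layers the semantic and coherence features on top of it.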

The second paper was about generating cross-lingual links in Wikipedia. A significant number of Wikipedia pages are lacking cross-lingual links, as everything is currently done manually. The algorithm presented in this paper exploits the fact that articles linked to or from equivalent articles tend to be equivalent.
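The intuition behind that algorithm can be sketched as follows: if many already-equivalent article pairs link to both an English page and a French page, those two pages are likely equivalent themselves. The toy data here is made up.

```python
# Illustrative sketch of cross-lingual link inference. Known equivalences
# and link structures below are hypothetical toy data.

known_equivalents = {("Paris", "Paris_fr"), ("France", "France_fr")}

# Outgoing links in each language edition.
links_en = {"Paris": {"Seine"}, "France": {"Seine"}}
links_fr = {"Paris_fr": {"Seine_fr"}, "France_fr": {"Seine_fr"}}

def shared_equivalent_sources(en_page, fr_page):
    """Count known-equivalent pairs that link to both en_page and fr_page."""
    return sum(
        1
        for en_src, fr_src in known_equivalents
        if en_page in links_en.get(en_src, set())
        and fr_page in links_fr.get(fr_src, set())
    )

# Both known pairs link to (Seine, Seine_fr), so it is a strong candidate.
print(shared_equivalent_sources("Seine", "Seine_fr"))  # 2
```

A real system would turn such counts into a score, threshold it, and iterate, since each newly found equivalence provides evidence for further ones.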

The final paper of the session presented ZenCrowd, which uses probabilistic reasoning to combine automated and manual work (done through Amazon Mechanical Turk, which came up a lot during the conference for user evaluations) for an RDFa enrichment task.

The last session I attended that day was specifically about Semantic Web technologies. It included a paper describing why SPARQL 1.1 property paths are not scalable and why their semantics need to be changed (which won the best paper award at the conference), one on template-based question answering (which addresses the problem in a very different way from IBM Watson, translating full-text queries into SPARQL queries), and one on mapping relational databases to RDF.
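The property-paths problem can be illustrated with a toy example: the draft SPARQL 1.1 semantics effectively required counting paths matching an expression like `foaf:knows+`, and the number of simple paths can explode even on tiny graphs, whereas checking reachability alone stays cheap. This is an illustrative Python sketch, not the paper's formalism.

```python
# Illustrative sketch: counting simple paths vs. checking reachability.

def count_simple_paths(graph, src, dst, visited=()):
    """Count simple (cycle-free) paths from src to dst."""
    if src == dst:
        return 1
    total = 0
    for nxt in graph.get(src, ()):
        if nxt not in visited:
            total += count_simple_paths(graph, nxt, dst, visited + (src,))
    return total

def reachable(graph, src, dst):
    """Plain reachability via depth-first search."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

# Complete graph on 8 nodes: path count is already in the thousands,
# and it grows factorially with the number of nodes.
n = 8
graph = {i: [j for j in range(n) if j != i] for i in range(n)}
print(count_simple_paths(graph, 0, n - 1))  # 1957
print(reachable(graph, 0, n - 1))           # True
```

The paper's proposed fix was, roughly, to base the semantics on reachability-style evaluation rather than path counting.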

Main conference - day 3

I attended the EU track on the Friday morning, where current EU projects were showcased, including LAWA (tracking entities through time in Web archives) and ARCOMEM (making use of the social web for identifying Web documents to archive).

Finally, I attended the Web Mining session in the afternoon. This session included three very interesting papers. The first one started from the observation that 'real stories are not linear' and described an algorithm for generating 'tube maps' for news stories. The second one tried to address the ambitious goal of predicting news events. Their system gathered a wide range of Linked Data and news articles, extracted causal links between the events described within them, and tried to generalise those causal links. Then, given a particular input event, these generalised links can be used to predict future events, e.g. "China overtakes Germany as world's biggest exporter" is used by their system to predict "wheat price will fall". The last paper mined the Google news archive, which holds several articles per day going back to 1895, and derived statistics about how long a person stays mentioned in the news. Apparently, the median duration of a person's news fame has consistently been 7 days for the last century.


BBC © 2014
