Ólafur Páll Geirsson

Orðspor: labeling named entities with computer games

01 Jun 2014.

My Bachelor’s thesis. Abstract: Most current natural language processing models rely on large labeled datasets to achieve good performance, but large, accurately labeled datasets for different languages and domains are hard to come by. The standard method of producing labeled data has been manual labor, which is expensive and tedious. This thesis presents an experiment in using computer games and crowdsourcing as a cheap and fun alternative for producing labeled data for the named entity recognition task. We introduce Orðspor, a website with three computer games designed to collect named entity tags for Icelandic text while giving players an enjoyable time. The games are similar in many ways: they share a common API, they receive input from the same source, and their output is merged to produce a single dataset. By using a shared API, we were able to rapidly develop the three games and experiment with different game designs. We envision that many more games could be developed with our API and attract a broad range of players. Moreover, with proper marketing effort, such games could be embedded into people’s daily routines in various ways, producing a wealth of datasets for use in a variety of natural language processing tasks.

About: crowdsourcing, named entity recognition