[STANBOL-1141] Wikilinks Parser and TDB Generator - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: Enhancer, Entityhub
Labels:

Description

Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components.
Wikilinks is one of the results of work on this task. The Wikilinks dataset comprises 40 million mentions over 3 million entities. The method is based on finding hyperlinks to Wikipedia in a web crawl and using their anchor text as mentions. The resource provides the URLs of the webpages, along with the anchor text of the links and the Wikipedia pages they link to. As provided, the dataset can be used to get all the surface strings that refer to a Wikipedia page; beyond that, it can be used to download the webpages and extract the context around the mentions.
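
As a rough illustration of the "surface strings per Wikipedia page" use case, the following sketch groups anchor texts by the page they link to. The record layout (source page URL, anchor text, Wikipedia URL), the sample data, and the name `surface_strings` are assumptions made for this example; they are not the actual Wikilinks file format.

```python
from collections import defaultdict

# Hypothetical mention records: (source page URL, anchor text, Wikipedia URL).
# The real dataset would be parsed into this shape first.
MENTIONS = [
    ("http://example.org/a", "Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("http://example.org/b", "NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("http://example.org/b", "Paris", "http://en.wikipedia.org/wiki/Paris"),
]


def surface_strings(mentions):
    """Map each Wikipedia URL to the set of anchor texts that link to it."""
    index = defaultdict(set)
    for _page_url, anchor, wikipedia_url in mentions:
        index[wikipedia_url].add(anchor)
    return dict(index)


SURFACES = surface_strings(MENTIONS)
```

Each key of `SURFACES` identifies one entity, and its value is the set of strings observed as link text for that entity across the crawl.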

UMass (http://www.iesl.cs.umass.edu/) has created expanded versions of the dataset containing the following extra features:

Complete webpage content (with cleaned DOM structure)
Extracted context for the mentions
Alignment to Freebase entities

The expanded dataset can be downloaded from http://iesl.cs.umass.edu/downloads/wiki-link/context-only/

A tool is needed to parse this information and store it in some kind of storage that can be consumed later within Stanbol. For the first version, the dataset can be converted to RDF and stored in a triple store such as Jena TDB. The goal of this task is to provide an API on top of this store that eases the retrieval of entities' contextual data. That way, at disambiguation time, the URI of the referenced entity can be used to look up disambiguation contexts in Wikilinks.
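
The idea can be sketched as follows, with an in-memory index standing in for the triple store. This is not the attached implementation and does not use Jena: the class `ContextStore`, its method names, and the `http://example.org/wikilinks#` vocabulary are all hypothetical, chosen only to show the shape of a "look up contexts by entity URI" API and a possible RDF (N-Triples) serialization.

```python
from collections import defaultdict

# Hypothetical vocabulary namespace; a real converter would pick its own terms.
NS = "http://example.org/wikilinks#"


def _literal(text):
    """Encode a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


class ContextStore:
    """In-memory stand-in for a triple store of mention contexts."""

    def __init__(self):
        self._by_entity = defaultdict(list)

    def add_mention(self, entity_uri, anchor, context, source_url):
        """Record one mention of an entity together with its textual context."""
        self._by_entity[entity_uri].append(
            {"anchor": anchor, "context": context, "source": source_url})

    def contexts_for(self, entity_uri):
        """Return every stored context string for the given entity URI."""
        return [m["context"] for m in self._by_entity.get(entity_uri, [])]

    def to_ntriples(self):
        """Serialize all mentions as N-Triples lines, one blank node per mention."""
        lines = []
        counter = 0
        for entity_uri, mentions in self._by_entity.items():
            for m in mentions:
                node = f"_:m{counter}"
                counter += 1
                lines.append(f"{node} <{NS}refersTo> <{entity_uri}> .")
                lines.append(f"{node} <{NS}anchor> {_literal(m['anchor'])} .")
                lines.append(f"{node} <{NS}context> {_literal(m['context'])} .")
                lines.append(f"{node} <{NS}source> <{m['source']}> .")
        return lines
```

A disambiguation component would call `contexts_for` with the candidate entity's URI and compare the returned contexts with the text being analyzed. In a store-backed version, the same lookup would become a query over the `refersTo` and `context` properties.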

Attachments

gsoc-wikilinks-1.0-SNAPSHOT.zip
27/Sep/13 12:13
55 kB
Antonio David Pérez Morales

People

Assignee: Unassigned
Reporter: Antonio David Pérez Morales
Votes: 0
Watchers: 1

Dates

Created: 26/Jul/13 10:28
Updated: 27/Sep/13 12:13
Resolved: 26/Jul/13 10:32

Time Tracking

Estimated: 336h
Remaining: 336h
Logged: Not Specified