[STANBOL-613] Define a standard way on how to obtain the extracted language - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.9.0-incubating
Fix Version/s: enhancer-0.10.0
Component/s: Enhancer
Labels: None

Description

With the addition of the CELI Language Identification Engine, there are now two different engines that support the same feature.

However, engines that consume the detected language are currently hard-coded to the LangId Engine (enhancer/engines/langid). This needs to change to allow the adoption of alternatives such as the CELI-based implementation.

The suggestion is to use the following pattern to extract the language:

(1) via Annotations:

?x rdf:type fise:TextAnnotation .
?x dc:language ?language .
OPTIONAL { ?x dc:created ?engine }
OPTIONAL { ?x fise:confidence ?confidence }

(2) via ContentItem metadata:

?ci dc:language ?language

(2) is a fallback if (1) delivers no results.

The following methods are added to the EnhancementEngineHelper utility of the enhancer.servicesapi module:

extract the language (with the highest confidence), including the fallback to (2)
extract all languages (sorted by confidence), including the fallback to (2)
extract all TextAnnotations with dc:language values
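
The selection logic behind the first two methods can be sketched in plain Java. This is only an illustration of the rule described above, not the actual EnhancementEngineHelper code: the class LanguageLookupSketch, the LanguageAnnotation type, and the method names allLanguages and bestLanguage are hypothetical, and treating a missing fise:confidence value as the lowest confidence is an assumption that the issue does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class LanguageLookupSketch {

    /** Hypothetical stand-in for a fise:TextAnnotation that carries a dc:language value. */
    static final class LanguageAnnotation {
        final String language;    // value of dc:language
        final Double confidence;  // value of fise:confidence, null if absent (it is OPTIONAL)

        LanguageAnnotation(String language, Double confidence) {
            this.language = language;
            this.confidence = confidence;
        }

        // Assumption: an annotation without a confidence value ranks last.
        double confidenceOrZero() {
            return confidence == null ? 0.0 : confidence;
        }
    }

    /**
     * Pattern (1): languages of all annotations, highest confidence first.
     * Pattern (2): if there are no annotations, fall back to the dc:language
     * value of the ContentItem metadata (may be null).
     */
    static List<String> allLanguages(List<LanguageAnnotation> annotations,
                                     String contentItemLanguage) {
        List<LanguageAnnotation> sorted = new ArrayList<>(annotations);
        // Descending order by confidence; List.sort is stable, so ties keep input order.
        sorted.sort((a, b) -> Double.compare(b.confidenceOrZero(), a.confidenceOrZero()));

        List<String> languages = new ArrayList<>();
        for (LanguageAnnotation annotation : sorted) {
            languages.add(annotation.language);
        }
        if (languages.isEmpty() && contentItemLanguage != null) {
            languages.add(contentItemLanguage);
        }
        return languages;
    }

    /** The language with the highest confidence, including the fallback to (2). */
    static Optional<String> bestLanguage(List<LanguageAnnotation> annotations,
                                         String contentItemLanguage) {
        return allLanguages(annotations, contentItemLanguage).stream().findFirst();
    }

    public static void main(String[] args) {
        List<LanguageAnnotation> annotations = Arrays.asList(
                new LanguageAnnotation("de", 0.4),
                new LanguageAnnotation("en", 0.9));
        System.out.println(allLanguages(annotations, "it"));
        System.out.println(bestLanguage(Collections.<LanguageAnnotation>emptyList(), "it"));
    }
}
```

Deriving the single-language lookup from the sorted list keeps the fallback rule in one place, so both methods always agree on when (2) applies.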

Issue Links

relates to

STANBOL-1417 Create Language Annotation for parsed "Content-Language" header

Resolved

People

Assignee: Rupert Westenthaler

Reporter: Rupert Westenthaler

Votes: 0

Watchers: 1

Dates

Created: 15/May/12 09:29

Updated: 16/Apr/15 08:36

Resolved: 18/May/12 10:08