STANBOL-706

DBpedia Spotlight EnhancementEngines integration

    Details

      Description

      As part of the IKS early adopters programme, we developed four EnhancementEngines that integrate the different aspects of DBpedia Spotlight into Apache Stanbol. We would like to contribute them so that they can eventually become part of the Stanbol stack. The engines are as follows:

      • dbpediaspotlightannotate - spots the potential mentions, retrieves the candidate DBpedia resources, disambiguates them if needed, and links the mentions to the best one
      • dbpediaspotlightcandidates - same as annotate, but does not disambiguate the candidates for each mention. Rather it returns the top K ones.
      • dbpediaspotlightdisambiguate - does not do spotting, it just selects the candidates for the given mentions and does disambiguation.
      • dbpediaspotlightspot - does only spotting, no candidate resource selection, disambiguation or linking
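
The division of labour between the four engines can be summarised in a small sketch. The enum types below are hypothetical and exist only for illustration; they are not classes from the contribution. Each engine is given only the steps that the list above states explicitly.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Processing steps named in the engine descriptions above. */
enum Step { SPOTTING, CANDIDATE_SELECTION, DISAMBIGUATION, LINKING }

/** Hypothetical summary type for illustration; not a class from the contribution. */
enum SpotlightEngine {
    ANNOTATE("dbpediaspotlightannotate",
            EnumSet.of(Step.SPOTTING, Step.CANDIDATE_SELECTION, Step.DISAMBIGUATION, Step.LINKING)),
    CANDIDATES("dbpediaspotlightcandidates",
            EnumSet.of(Step.SPOTTING, Step.CANDIDATE_SELECTION)),
    DISAMBIGUATE("dbpediaspotlightdisambiguate",
            EnumSet.of(Step.CANDIDATE_SELECTION, Step.DISAMBIGUATION)),
    SPOT("dbpediaspotlightspot", EnumSet.of(Step.SPOTTING));

    private final String engineName;
    private final Set<Step> steps;

    SpotlightEngine(String engineName, Set<Step> steps) {
        this.engineName = engineName;
        this.steps = Collections.unmodifiableSet(steps);
    }

    /** Engine name as written in the issue description. */
    String engineName() { return engineName; }

    /** True if the description above says this engine performs the step. */
    boolean performs(Step step) { return steps.contains(step); }
}
```

For example, `SpotlightEngine.SPOT.performs(Step.DISAMBIGUATION)` is false, which matches the "does only spotting" description.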

          Activity

          rwesten Rupert Westenthaler added a comment - - edited

          marking this as resolved with http://svn.apache.org/viewvc?rev=1396570&view=rev.

          Further improvements to this Engine should be carried out in separate issues.

          rwesten Rupert Westenthaler added a comment -

          With http://svn.apache.org/viewvc?rev=1386604&view=rev the Spotlight engine was added to the trunk. So all users with a later SNAPSHOT version will have it.

          Note also that I deleted the branch as further development should be carried out in the trunk.

          rwesten Rupert Westenthaler added a comment -

          Hi Iavor,

          If you check out the branch

          http://svn.apache.org/repos/asf/incubator/stanbol/branches/dbpedia-spotlight-engines/

          you will find the merged version of the DBpedia Spotlight EnhancementEngines under

          engines/dbpedia-spotlight/

          With revision http://svn.apache.org/viewvc?rev=1376912&view=rev I have made the following changes

          • moved all Engines into a single Module
          • Parameters are now shared between all of them
          • The Domain Model (annotation, surfaceform, candidates) is shared
          • added utility methods for writing enhancements

          All those changes were mainly restructuring of code and removing duplicates.

          In addition to that I have also made some changes

          • Requests/Responses to the RESTful services are now handled differently to avoid creating in-memory copies of request/response data. However, NOTE that the request data (the text of the contentItem) is still held in memory twice (the text and its URL-encoded version). This cannot easily be avoided as long as "application/x-www-form-urlencoded" is used to communicate with the server.
          • Added code to the "Annotation" class that extracts the most generic dbpedia-ont Class from the types. This code makes a lot of assumptions and NEEDS to be validated (see comments in the class).
          • Added functionality to extract the fise:selection-context for created fise:TextAnnotations (this was a TODO in the contributed version)
          • Added unit tests that validate the written Enhancements for each of the Engines
          • Ensured that the engines are deactivated if Stanbol runs in OfflineMode (by adding a @Reference to OnlineMode)
          • Added a default configuration for a "dpbedia-spotlight" EnhancementChain that is automatically deployed with the bundle. This uses "metaxa;options, tike;optional, langdetect, dbpspotlightannotate".
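
The point about the request text being held in memory twice can be seen in a minimal sketch of building an "application/x-www-form-urlencoded" body. The class name and the form parameter name `text` are assumptions made for illustration; this is not the code from the engine.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper for illustration; not the engine's actual request code. */
final class FormBodySketch {

    /**
     * Builds a form body of the shape "text=<url-encoded text>".
     * The original String and the encoded String exist at the same time,
     * which is the second in-memory copy mentioned in the comment above.
     */
    static String formBody(String text) {
        try {
            // "text" as the parameter name is an assumption of this sketch.
            return "text=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }
}
```

Because the encoded form can only be produced from the complete parameter value, the duplication is tied to the content type rather than to how the request is written to the connection.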

          In addition STANBOL-717 solves the default Configuration Issue

          Open Issues (sorted by importance)

          1. Determine the "dc:type" value for TextAnnotations. Currently I try to use the most generic dbpedia-ont class. First, I am not sure if this is a good idea, and second, the code for extracting this type (see above) makes a lot of assumptions.
          2. Data for suggested Entities: Created fise:EntityAnnotations do not have the correct "fise:entity-label", but use the SurfaceForm instead. Also, the "fise:entity-type" values (the rdf:type values of the suggested Entity) sometimes seem to deviate from the list returned by dbpedia.org. Can Spotlight provide Entity data? If not, I was thinking about a "dereference entity" option that downloads the entity data from dbpedia.org instead of using the information within the Spotlight response.
          3. Is it possible for the annotate and disambiguate Engines to return multiple suggestions for a fise:TextAnnotation (spotted Entity)? Stanbol Enhancer users are used to getting multiple suggestions. So even though "disambiguation" re-ranks the suggestions, there is no harm if multiple are returned.
          4. Spotter: There are several different possibilities (NER, LingPipeSpotter, OpenNLPChunkerSpotter and Kea). I was thinking of offering those as preconfigured options (with a human-readable name and description) instead of a simple String field.
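
To make open issue 1 concrete, here is one possible sketch of picking the "most generic" dbpedia-ont class. It rests on assumptions that the thread does not confirm: that the types arrive as a comma-separated string, that DBpedia ontology classes carry a `DBpedia:` prefix, and that entries are ordered from most specific to most generic. The actual code in the "Annotation" class may work differently.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not the code from the "Annotation" class. */
final class GenericTypeSketch {

    /**
     * Returns the last entry with the assumed "DBpedia:" prefix.
     * Under the assumed specific-to-generic ordering this is the most generic class.
     */
    static Optional<String> mostGenericDbpediaType(String types) {
        String result = null;
        for (String type : types.split(",")) {
            String trimmed = type.trim();
            if (trimmed.startsWith("DBpedia:")) {
                result = trimmed; // later entries are assumed to be more generic
            }
        }
        return Optional.ofNullable(result);
    }
}
```

If the ordering assumption does not hold, a lookup of the class hierarchy in the ontology would be needed instead, which is part of why the original comment asks for validation.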

          best
          Rupert

          iavorjelev Iavor Jelev added a comment -

          Hi Rupert,

          Sorry for the late reply; I was on the road almost the whole week. Thanks for your effort and feedback! My take on the questions:

              1. DBpedia Spotlight Modules/Bundles
          • Yes, I think option 3 (shared bundle) will be the best option, as the engines contain redundant code.
              2. Effects on the Stanbol default Configuration
          • I agree that option 3 will be the best one.

          Can I help with the changes in some way?

          rwesten Rupert Westenthaler added a comment -

          Updates related to the two questions in the above comment

              1. DBpedia Spotlight Modules/Bundles

          After looking at all four Spotlight engines I came to the conclusion that it would be best to have them all in the same module, as described in option (3) of the above comment. The main reason is the potential code savings of this solution.

          For that I will move all engines to a module with the artifactId "org.apache.stanbol.enhancer.engines.dbpspotlight" and the path "{stanbol-trunk}/enhancer/engines/dbpedia-spotlight"

              2. Effects on the Stanbol default Configuration

          Those problems are solved by STANBOL-717

          rwesten Rupert Westenthaler added a comment -

          Progress update (see [1])

          Changes:


          • The POM files are now updated to use the versions of the trunk (0.10.0-incubating-SNAPSHOT)
          • The DBpedia Spotlight Spot engine now behaves as expected for an EnhancementEngine
          • It supports asynchronous enhancements (as highly recommended for Engines calling remote services)
          • It respects OfflineMode: it does not allow connections to external services
          • It does not catch any Exceptions: the EnhancementJobManager MUST deal with those, as only it knows if an engine is OPTIONAL or REQUIRED.
          • In addition I changed the communication with the Spotlight RESTful service so that request/response data are not loaded in memory twice (e.g. the Response as both a String and an XML document)
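
The reasoning behind not catching exceptions inside an engine can be sketched as follows. All types here are hypothetical stand-ins and not the Stanbol EnhancementJobManager API; the sketch runs the engines sequentially and leaves asynchronous execution out. It only shows why the decision belongs to the caller: only the caller knows whether a failed engine was optional or required.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a job manager; not the Stanbol API. */
final class JobManagerSketch {

    /** One engine invocation; implementations simply let exceptions propagate. */
    interface EngineCall {
        void enhance(String text) throws Exception;
    }

    /** A chain entry: the engine plus the optional/required flag known only to the chain. */
    static final class Entry {
        final String name;
        final boolean optional;
        final EngineCall call;

        Entry(String name, boolean optional, EngineCall call) {
            this.name = name;
            this.optional = optional;
            this.call = call;
        }
    }

    /**
     * Runs the chain in order. A failing optional engine is skipped and reported;
     * a failing required engine aborts the whole job.
     */
    static List<String> run(List<Entry> chain, String text) {
        List<String> skipped = new ArrayList<>();
        for (Entry entry : chain) {
            try {
                entry.call.enhance(text);
            } catch (Exception e) {
                if (!entry.optional) {
                    throw new IllegalStateException("Required engine failed: " + entry.name, e);
                }
                skipped.add(entry.name);
            }
        }
        return skipped;
    }
}
```

If an engine swallowed its own exceptions, the caller in this sketch could not tell a skipped optional engine from a successful one, and a required engine could fail silently.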

          NOTES:


          I also added the Spot Engine to the Enhancer Bundlelist. So users who "mvn clean install" the branch and then "mvn clean install" the Full/Stable Launcher in the trunk ("{stanbol-trunk}/launchers/full") will see the DBpedia Spotlight Spot engine.

          TODOs


          • Similar changes as for the Spot engine need to be done for the other Spotlight engines

          Questions:


              1. DBpedia Spotlight Modules/Bundles

          I have noticed that some functionality (most noticeably the XMLParser class) is duplicated in some/all of the Spotlight engines. I see the following possibilities to deal with that

          1. ignore the duplicated code
          2. create an extra module (bundle) that contains the shared functionality
          3. move all engines into a single module

          (1) and (2) would be favorable if typical users would only want to install a subset of the DBpedia Spotlight engines. (3) works best if it is OK to install all of them (but maybe use only a few, e.g. by configuring the corresponding enhancement engines or by deactivating the unused ones).

              2. Effects on the Stanbol default Configuration

          With the addition of the DBpedia Spotlight engines we might need to think about changing the default configuration of Apache Stanbol.

          Currently the default EnhancementChain of the Stanbol Launchers includes all active EnhancementEngines. When we add the DBpedia Spotlight Engines this might no longer make sense, as the results of the DBpedia Spotlight Engines will be very similar to those of the NER+EntityTagging engine with the default DBpedia dataset. More concretely, an EnhancementChain containing all active Enhancement Engines will result in a lot of duplicate results that might confuse users new to Stanbol.

          To avoid this I see three possibilities

          1. Do not include the DBpedia Spotlight Engines in the default Launcher
          2. Deactivate the DBpedia Spotlight Engines by default.
          3. Switch from the "All active Engines Chain" to an explicitly configured Chain for the default configuration and add a DBpedia Spotlight Chain.

          I am strongly favoring (3) and only included (1) and (2) to give people who want to keep the "All active Engines Chain" the chance to leave a comment. Note that even with (3) we can keep the "All active Engines Chain", but it would no longer be the default chain.
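
An explicitly configured chain, as favored in option (3), is in essence an ordered list of engine names with an optional flag per entry. The sketch below parses a definition written as comma-separated `name` or `name;optional` entries, a shape modeled on the chain string quoted elsewhere in this thread. The parser is hypothetical and is not Stanbol's own chain configuration code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for illustration; not Stanbol's chain configuration code. */
final class ChainDefinitionSketch {

    /**
     * Parses "name" or "name;optional" entries separated by commas.
     * Returns engine name -> optional flag, preserving the configured order.
     */
    static Map<String, Boolean> parse(String definition) {
        Map<String, Boolean> engines = new LinkedHashMap<>();
        for (String entry : definition.split(",")) {
            String[] parts = entry.trim().split(";");
            String name = parts[0].trim();
            if (name.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            boolean optional = parts.length > 1 && "optional".equals(parts[1].trim());
            engines.put(name, optional);
        }
        return engines;
    }
}
```

With such a list as the default, only the named engines run, so results from the DBpedia Spotlight engines and the NER+EntityTagging engines would no longer be mixed unless a chain asks for both.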

          [1] http://svn.apache.org/viewvc?rev=1375468&view=rev

          pablomendes Pablo Mendes added a comment -

          Fine by me.

          • I like "dbpedia-spotlight-{name}"
          • "org.apache.stanbol.enhancer.engines.dbpspotlight.{name}" sounds good
          • ok
          rwesten Rupert Westenthaler added a comment -

          I would suggest making the following changes to the module paths and artifactIds/packages

          • rename the module path from "dbpsotlight{name}" to "dbpedia-spotlight-{name}" to make it similar to others, e.g. the "opennlp-ner" engine.
          • change the artifactIds of the modules from "org.apache.stanbol.enhancer.engines.dbpspotlight{name}" to "org.apache.stanbol.enhancer.engines.dbpspotlight.{name}" to make them more readable
          • change the Java packages according to the proposed change of the artifactIds
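
The proposed naming scheme can be written down as a small sketch. The class and method names are hypothetical; only the two patterns come from the proposal above, and the example value "spot" for {name} is an assumption.

```java
/** Hypothetical helper that spells out the proposed naming patterns. */
final class ModuleNamingSketch {

    static final String ARTIFACT_PREFIX = "org.apache.stanbol.enhancer.engines.dbpspotlight";

    /** Module path pattern "dbpedia-spotlight-{name}". */
    static String modulePath(String name) {
        return "dbpedia-spotlight-" + name;
    }

    /** ArtifactId pattern "org.apache.stanbol.enhancer.engines.dbpspotlight.{name}". */
    static String artifactId(String name) {
        return ARTIFACT_PREFIX + "." + name;
    }

    /** The Java package follows the artifactId, per the third bullet. */
    static String javaPackage(String name) {
        return artifactId(name);
    }
}
```

For the assumed name "spot" this gives the module path "dbpedia-spotlight-spot" and the artifactId "org.apache.stanbol.enhancer.engines.dbpspotlight.spot".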

          WDYT?


            People

            • Assignee:
              rwesten Rupert Westenthaler
              Reporter:
              iavorjelev Iavor Jelev