Uploaded image for project: 'Stanbol (Retired)'
  1. Stanbol (Retired)
  2. STANBOL-1132

Add co-reference resolution and dependency tree support in the Stanbol NLP processing API

Details

    Description

      Extend the Stanbol NLP Processing API with annotations for co-reference resolution and dependency trees.
      Also, add support for JSON Serialisation/Parsing for the co-reference and dependency tree annotations so that the RESTful NLP Analysis Service can provide co-reference information.

      Attachments

        1. coref_dependency_tree_datamodel.patch
          59 kB
          Cristian Aurelian Petroaca
        2. coref_dependency_tree_datamodel_vers_2.patch
          24 kB
          Cristian Aurelian Petroaca
        3. coref_dependency_tree_vers_3.patch
          27 kB
          Cristian Aurelian Petroaca
        4. coref_dependency_tree_vers_3_fixed.patch
          28 kB
          Cristian Aurelian Petroaca
        5. coref_dependency_tree_datamode_vers_4.patch
          10 kB
          Cristian Aurelian Petroaca
        6. coref_dependency_tree_datamodel_vers_5.patch
          14 kB
          Cristian Aurelian Petroaca

        Issue Links

          Activity

            Changes needed for adding the coref and dependency trees datamodel and the respective json serialization/deserialization routines.

            cpetroaca Cristian Aurelian Petroaca added a comment - Changes needed for adding the coref and dependency trees datamodel and the respective json serialization/deserialization routines.

            Cristian, I will try to have a look at your patch until the first half of the coming week. If I do not write any comments until than feel free to send a reminder^^

            rwesten Rupert Westenthaler added a comment - Cristian, I will try to have a look at your patch until the first half of the coming week. If I do not write any comments until than feel free to send a reminder^^
            rwesten Rupert Westenthaler added a comment - - edited

            Hi Cristian

            Had a look at you patch and IMO it is very good quality. Because of this and because the patch only contains new features I decided to commit it directly to trunk. This should make further work more simple (no merges needed, you can use the default Stanbol Launcher for testing). However if you would prefer to continue in a branch just let me know and I will also apply the patch to the branch.

            In the following a list of things I noticed while working through your patch:

            1. Can you adapt the implementation of GrammaticalRelation to represent the hierarchy similar to the Pos enum. Using a static {} block to build up the transitive closure over parents instead of doing it in the #addParents() method. The getParents() method would then use the static EnumMap<GrammaticalRelation,EnumSet<GrammaticalRelation>>(GrammaticalRelation.class) instead of a local HashMap. I would like to use EnumSet/-Map implementations as they provide much better performance in contrast to HashSet/-Map implementations.

            2. Annotations should be immutable. So I would like to remove all add** set** ... methods from CorefTag, DependencyFeature and GramaticalReleationTag. In addition Getter methods should return immutable wrappers. The best would be to use Collections.unmodifiableSet(..) when setting fields in the Constructor. I added according comments to the CorefTag

            3. IMO DependencyFeatures is obsolete. That's because the Annotated interface already supports multiple values and DependencyFeatures is just a collection of multiple DependencyRelations. So instead of having a DependencyFeatures class I would suggest to just add multiple DependencyRelations values. In detail I suggest to

            • rename DependencyRelation to DependencyRelationTag.
            • DependencyRelationTag extends Tag<DependencyRelationTag>
            • Change NlpAnnotations#DEPENDENCY_ANNOTATION to new Annotation<DependencyFeatures>(
              "stanbol.enhancer.nlp.dependency", DependencyRelationTag.class);
            • Simple add multiple DependencyRelationTag Values to a token. This has also the advantage the iterations over those are sorted by the confidence of those.
            • simple add a new DependencyRelationTag

            Again thanks for the patch. It is great to getting Coref and dependendy tree support to the Stanbol NLP module!

            rwesten Rupert Westenthaler added a comment - - edited Hi Cristian Had a look at you patch and IMO it is very good quality. Because of this and because the patch only contains new features I decided to commit it directly to trunk. This should make further work more simple (no merges needed, you can use the default Stanbol Launcher for testing). However if you would prefer to continue in a branch just let me know and I will also apply the patch to the branch. In the following a list of things I noticed while working through your patch: 1. Can you adapt the implementation of GrammaticalRelation to represent the hierarchy similar to the Pos enum. Using a static {} block to build up the transitive closure over parents instead of doing it in the #addParents() method. The getParents() method would then use the static EnumMap<GrammaticalRelation,EnumSet<GrammaticalRelation>>(GrammaticalRelation.class) instead of a local HashMap. I would like to use EnumSet/-Map implementations as they provide much better performance in contrast to HashSet/-Map implementations. 2. Annotations should be immutable. So I would like to remove all add** set** ... methods from CorefTag, DependencyFeature and GramaticalReleationTag. In addition Getter methods should return immutable wrappers. The best would be to use Collections.unmodifiableSet(..) when setting fields in the Constructor. I added according comments to the CorefTag 3. IMO DependencyFeatures is obsolete. That's because the Annotated interface already supports multiple values and DependencyFeatures is just a collection of multiple DependencyRelations. So instead of having a DependencyFeatures class I would suggest to just add multiple DependencyRelations values. In detail I suggest to rename DependencyRelation to DependencyRelationTag. DependencyRelationTag extends Tag<DependencyRelationTag> Change NlpAnnotations#DEPENDENCY_ANNOTATION to new Annotation<DependencyFeatures>( "stanbol.enhancer.nlp.dependency", DependencyRelationTag.class); Simple add multiple DependencyRelationTag Values to a token. This has also the advantage the iterations over those are sorted by the confidence of those. simple add a new DependencyRelationTag Again thanks for the patch. It is great to getting Coref and dependendy tree support to the Stanbol NLP module!

            applied the patch with http://svn.apache.org/r1532246 to trunk

            rwesten Rupert Westenthaler added a comment - applied the patch with http://svn.apache.org/r1532246 to trunk

            Hi Rupert,

            Working on the trunk is fine. I'll work on the changes you suggested and let you know when I have the patch ready.

            cpetroaca Cristian Aurelian Petroaca added a comment - Hi Rupert, Working on the trunk is fine. I'll work on the changes you suggested and let you know when I have the patch ready.

            Hi Rupert,

            Made the changes you suggested with one comment at no 3: the DependencyRelation doesn't need to extend Tag because there would be nothing to store in its 'tag' since the grammatical relation tag is already stored in the GrammaticalRelationTag class. So I made it a simple class.

            cpetroaca Cristian Aurelian Petroaca added a comment - Hi Rupert, Made the changes you suggested with one comment at no 3: the DependencyRelation doesn't need to extend Tag because there would be nothing to store in its 'tag' since the grammatical relation tag is already stored in the GrammaticalRelationTag class. So I made it a simple class.

            Hi Cristian, I would like to see a example usage of those annotations. IMO this should become much more clear as soon as you add unit tests.

            rwesten Rupert Westenthaler added a comment - Hi Cristian, I would like to see a example usage of those annotations. IMO this should become much more clear as soon as you add unit tests.

            For the sentence "Obama visited China." we have the dependency tree : nsubj(visited-2, Obama-1), root(ROOT-0, visited-2), dobj(visited-2, China-3)

            Let's take the token "visited". We would have the following DependencyRelation objects created and attached as annotations:

            GrammaticalRelationTag nSubjTag = new GrammaticalRelationTag("nsubj", GrammaticalRelation.NominalSubject);
            DependencyRelation relation1 = new DependencyRelation(nSubjTag, false /*isDependent */, ObamaToken);

            GrammaticalRelationTag rootTag = new GrammaticalRelationTag("root", GrammaticalRelation.Root);
            DependencyRelation relation2 = new DependencyRelation(rootTag, true /* isDependent /, null / no token for root */)

            GrammaticalRelationTag dObjTag = new GrammaticalRelationTag("dobj", GrammaticalRelation.DirectObject)
            DependencyRelation relation3 = new DependencyRelation(dObjTag, false /* isDependent, ChinaToken);

            The nSubjTag, rootTag and dObjTag are taken out of the TagSetRegistry in the Stanford NLP API following the examples from PosTag, etc.. That is actually the reason I created a GrammaticalRelationTag and I didn't put the 'tag' at DependencyRelation level.

            By the way, looking at this hierarchy I think it should also be reflected in the json. So instead of the initial json :

            { "tag" : "nsubj", //type of relation - Stanford NLP notation "relationType" : 12, // type of relation - Stanbol NLP mapped value - ordinal number in enum Dependency "isDependent" : "true", "type" : "Token", // type of element with which this token is in relation "start" : 123, // start index of the relating token "end" : 130 // end index of the relating token }

            I would have :

            { "grammaticalRelation" :

            { "tag" : "nsubj", //type of relation - Stanford NLP notation "relationType" : 12, // type of relation - Stanbol NLP mapped value - ordinal number in enum Dependency }

            ,
            "isDependent" : "true",
            "type" : "Token", // type of element with which this token is in relation
            "start" : 123, // start index of the relating token
            "end" : 130 // end index of the relating token
            }

            cpetroaca Cristian Aurelian Petroaca added a comment - For the sentence "Obama visited China." we have the dependency tree : nsubj(visited-2, Obama-1), root(ROOT-0, visited-2), dobj(visited-2, China-3) Let's take the token "visited". We would have the following DependencyRelation objects created and attached as annotations: GrammaticalRelationTag nSubjTag = new GrammaticalRelationTag("nsubj", GrammaticalRelation.NominalSubject); DependencyRelation relation1 = new DependencyRelation(nSubjTag, false /*isDependent */, ObamaToken); GrammaticalRelationTag rootTag = new GrammaticalRelationTag("root", GrammaticalRelation.Root); DependencyRelation relation2 = new DependencyRelation(rootTag, true /* isDependent /, null / no token for root */) GrammaticalRelationTag dObjTag = new GrammaticalRelationTag("dobj", GrammaticalRelation.DirectObject) DependencyRelation relation3 = new DependencyRelation(dObjTag, false /* isDependent, ChinaToken); The nSubjTag, rootTag and dObjTag are taken out of the TagSetRegistry in the Stanford NLP API following the examples from PosTag, etc.. That is actually the reason I created a GrammaticalRelationTag and I didn't put the 'tag' at DependencyRelation level. By the way, looking at this hierarchy I think it should also be reflected in the json. So instead of the initial json : { "tag" : "nsubj", //type of relation - Stanford NLP notation "relationType" : 12, // type of relation - Stanbol NLP mapped value - ordinal number in enum Dependency "isDependent" : "true", "type" : "Token", // type of element with which this token is in relation "start" : 123, // start index of the relating token "end" : 130 // end index of the relating token } I would have : { "grammaticalRelation" : { "tag" : "nsubj", //type of relation - Stanford NLP notation "relationType" : 12, // type of relation - Stanbol NLP mapped value - ordinal number in enum Dependency } , "isDependent" : "true", "type" : "Token", // type of element with which this token is in relation "start" : 123, // start index of the relating token "end" : 130 // end index of the relating token }

            Hi Rupert,

            This is a reminder in case you missed this.

            cpetroaca Cristian Aurelian Petroaca added a comment - Hi Rupert, This is a reminder in case you missed this.

            Hi Cristian,

            I have not missed it. I simple had no time to look at it.

            BTW: I do not see the patch related to your commend from the 20.Okt. Maybe you have forgotten to attach it?

            rwesten Rupert Westenthaler added a comment - Hi Cristian, I have not missed it. I simple had no time to look at it. BTW: I do not see the patch related to your commend from the 20.Okt. Maybe you have forgotten to attach it?

            I didn't put it yet because I first wanted your feedback on my comments to point no 3 in case I'd need to change something.

            cpetroaca Cristian Aurelian Petroaca added a comment - I didn't put it yet because I first wanted your feedback on my comments to point no 3 in case I'd need to change something.

            I attached the changes to the Jira. I added a couple of Junit tests in DependencyRelationTest,java

            cpetroaca Cristian Aurelian Petroaca added a comment - I attached the changes to the Jira. I added a couple of Junit tests in DependencyRelationTest,java

            Hi Rupert,

            Did you get a chance to take a look at the changes? I completed the changes related to dependency tree parsing in the Stanford NLP project as well and I'd like to know that the datamodel code is correct before I make the pull request.

            cpetroaca Cristian Aurelian Petroaca added a comment - Hi Rupert, Did you get a chance to take a look at the changes? I completed the changes related to dependency tree parsing in the Stanford NLP project as well and I'd like to know that the datamodel code is correct before I make the pull request.

            Hi Christian,

            thanks for the remainder. I have reviewed and applied your patch with revision http://svn.apache.org/r1539574

            NOTE: that I had to made a changes to your code because the SpanTypeEnum was moved to an own file.

            rwesten Rupert Westenthaler added a comment - Hi Christian, thanks for the remainder. I have reviewed and applied your patch with revision http://svn.apache.org/r1539574 NOTE: that I had to made a changes to your code because the SpanTypeEnum was moved to an own file.

            The following files also need to be deleted because I removed the DependencyFeatures and they are not needed anymore. Don't know why they did not show up in the patch diff.

            nlp-json/src/main/java/org/apache/stanbol/enhancer/nlp/json/valuetype/impl/DependencyFeaturesSupport.java
            nlp/src/main/java/org/apache/stanbol/enhancer/nlp/dependency/DependencyFeatures.java

            cpetroaca Cristian Aurelian Petroaca added a comment - The following files also need to be deleted because I removed the DependencyFeatures and they are not needed anymore. Don't know why they did not show up in the patch diff. nlp-json/src/main/java/org/apache/stanbol/enhancer/nlp/json/valuetype/impl/DependencyFeaturesSupport.java nlp/src/main/java/org/apache/stanbol/enhancer/nlp/dependency/DependencyFeatures.java

            deleted the two files with 1541855

            rwesten Rupert Westenthaler added a comment - deleted the two files with 1541855
            cpetroaca Cristian Aurelian Petroaca added a comment - - edited

            Hi Rupert,

            I attached a new patch to the jira : coref_dependency_tree_vers_3.patch. It contains the following :

            1. Changed the CorefTag to not extend Tag anymore because it doesn't actually need a tag. Renamed CorefTag and CorefTagSupport to CorefFeature and CorefFeatureSupport.
            2. Added junit tests for CorefFeatureSupport and DependencyRelationSupport serialization and parse routines.

            Again, the deleted CorefTag and CoefTagSupport classes appeared in the patch creation dialog but after I created the patch I could not see them in the patch diff.So please make sure you delete those as well.

            Thanks,
            Cristian

            cpetroaca Cristian Aurelian Petroaca added a comment - - edited Hi Rupert, I attached a new patch to the jira : coref_dependency_tree_vers_3.patch. It contains the following : 1. Changed the CorefTag to not extend Tag anymore because it doesn't actually need a tag. Renamed CorefTag and CorefTagSupport to CorefFeature and CorefFeatureSupport. 2. Added junit tests for CorefFeatureSupport and DependencyRelationSupport serialization and parse routines. Again, the deleted CorefTag and CoefTagSupport classes appeared in the patch creation dialog but after I created the patch I could not see them in the patch diff.So please make sure you delete those as well. Thanks, Cristian

            Hi Cristian,

            after applying your patch I see failing Tests. See the following example.

            testSerializationAndParse(org.apache.stanbol.enhancer.nlp.json.valuetype.CorefFeatureSupportTest) Time elapsed: 0.002 sec <<< FAILURE!
            java.lang.AssertionError
            at org.junit.Assert.fail(Assert.java:86)
            at org.junit.Assert.assertTrue(Assert.java:41)
            at org.junit.Assert.assertTrue(Assert.java:52)
            at org.apache.stanbol.enhancer.nlp.json.valuetype.CorefFeatureSupportTest.testSerializationAndParse(CorefFeatureSupportTest.java:66)

            As far as I can see the serialized Text does match the 'jsonCorefCheckObama' and 'jsonCorefCheckHe' only if I remove all '\r' entries from the fields. This might be a OS specific Issue. The same is also true for the org.apache.stanbol.enhancer.nlp.json.valuetype.DependencyRelationSupportTest.

            For now I removed all '\r' in my local version, but I think this will not solve this problem as this is most likely an OS specific issue. Can you please update the test so that it does not depend on the line separator(s) used by the OS

            best
            Rupert

            rwesten Rupert Westenthaler added a comment - Hi Cristian, after applying your patch I see failing Tests. See the following example. testSerializationAndParse(org.apache.stanbol.enhancer.nlp.json.valuetype.CorefFeatureSupportTest) Time elapsed: 0.002 sec <<< FAILURE! java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.stanbol.enhancer.nlp.json.valuetype.CorefFeatureSupportTest.testSerializationAndParse(CorefFeatureSupportTest.java:66) As far as I can see the serialized Text does match the 'jsonCorefCheckObama' and 'jsonCorefCheckHe' only if I remove all '\r' entries from the fields. This might be a OS specific Issue. The same is also true for the org.apache.stanbol.enhancer.nlp.json.valuetype.DependencyRelationSupportTest. For now I removed all '\r' in my local version, but I think this will not solve this problem as this is most likely an OS specific issue. Can you please update the test so that it does not depend on the line separator(s) used by the OS best Rupert

            Fixed in coref_dependency_tree_vers_3_fixed.patch. Should work fine on Unix based OS as well.

            cpetroaca Cristian Aurelian Petroaca added a comment - Fixed in coref_dependency_tree_vers_3_fixed.patch. Should work fine on Unix based OS as well.

            Hi Cristian,

            Applied the fixed patch to trunk and merged it back to the 0.12 branch.

            Some Notes:

            I had to replace the equals implementation of DependencyRelation, because of several issues: (1) the call to super.equals is a very bad thing, because it checks for instance equality (2) possibility of NPE in equals checks. I replaced your version of hashCode and equals with the IDE generated versions.

            While fixing the equals method I noticed that the NLP features you added do not validate parsed parameters. Is is intended that users can parse NULL for all the defined arguments? If not you should validate those and throw IllegalArgumentExceptions. This could also allow to write simpler hashcode and equals methods as one would not check for NULL values.

            rwesten Rupert Westenthaler added a comment - Hi Cristian, Applied the fixed patch to trunk and merged it back to the 0.12 branch. Some Notes: I had to replace the equals implementation of DependencyRelation, because of several issues: (1) the call to super.equals is a very bad thing, because it checks for instance equality (2) possibility of NPE in equals checks. I replaced your version of hashCode and equals with the IDE generated versions. While fixing the equals method I noticed that the NLP features you added do not validate parsed parameters. Is is intended that users can parse NULL for all the defined arguments? If not you should validate those and throw IllegalArgumentExceptions. This could also allow to write simpler hashcode and equals methods as one would not check for NULL values.

            All fields in the CorefFeature and DependencyRelation classes should be non null except DependencyRelation.partner. I'll do the validations.

            cpetroaca Cristian Aurelian Petroaca added a comment - All fields in the CorefFeature and DependencyRelation classes should be non null except DependencyRelation.partner. I'll do the validations.

            The last patch broke the build on Jenkins with the message

            [ERROR] <https://builds.apache.org/job/stanbol-trunk-1.6/org.apache.stanbol$org.apache.stanbol.enhancer.nlp.json/ws/src/test/java/org/apache/stanbol/enhancer/nlp/json/valuetype/ValueTypeSupportTest.java>:[34,57] cannot find symbol
            symbol : method lineSeparator()
            location: class java.lang.System

            The reason is that System#lineSeparator() was only added in JDK 7 and the build is an Java 6.

            To fix this the code need to be adapted to use the System property instead.

            rwesten Rupert Westenthaler added a comment - The last patch broke the build on Jenkins with the message [ERROR] < https://builds.apache.org/job/stanbol-trunk-1.6/org.apache.stanbol$org.apache.stanbol.enhancer.nlp.json/ws/src/test/java/org/apache/stanbol/enhancer/nlp/json/valuetype/ValueTypeSupportTest.java >: [34,57] cannot find symbol symbol : method lineSeparator() location: class java.lang.System The reason is that System#lineSeparator() was only added in JDK 7 and the build is an Java 6. To fix this the code need to be adapted to use the System property instead.

            I wasn't aware that the build doesn't use Java 7. My bad.
            I'll provide a fixed patch tonight together with the rest of the validations.

            cpetroaca Cristian Aurelian Petroaca added a comment - I wasn't aware that the build doesn't use Java 7. My bad. I'll provide a fixed patch tonight together with the rest of the validations.
            rwesten Rupert Westenthaler added a comment - Already changed with http://svn.apache.org/r1552247 (0.12) and http://svn.apache.org/r1552248 (trunk)

            Added patch with validations and other small changes.

            cpetroaca Cristian Aurelian Petroaca added a comment - Added patch with validations and other small changes.

            Hi Rupert,

            I added a new patch to the jira. Basically it's the same changes as the pervious one + added a missing GrammaticalRelation type for dep tree. So you can ignore coref_dependency_tree_datamode_vers_4.patch and only apply coref_dependency_tree_datamodel_vers_5.patch.

            Thanks

            cpetroaca Cristian Aurelian Petroaca added a comment - Hi Rupert, I added a new patch to the jira. Basically it's the same changes as the pervious one + added a missing GrammaticalRelation type for dep tree. So you can ignore coref_dependency_tree_datamode_vers_4.patch and only apply coref_dependency_tree_datamodel_vers_5.patch. Thanks

            Hi Cristian,

            I applied you patch with http://svn.apache.org/r1559644 to trunk. I will also merge it back to 0.12.

            As we are getting near to release 0.12 I would like to know if we could close this issue. This would allow us to have this as an official feature of 0.12.0.

            best
            Rupert

            rwesten Rupert Westenthaler added a comment - Hi Cristian, I applied you patch with http://svn.apache.org/r1559644 to trunk. I will also merge it back to 0.12. As we are getting near to release 0.12 I would like to know if we could close this issue. This would allow us to have this as an official feature of 0.12.0. best Rupert

            Since dependecy relation is tested and committed on the stanford-nlp project and the coref resolution is also finished and tested (not yet up for pull request though) I'd say that the datamodel is good as is so you can close it.

            cpetroaca Cristian Aurelian Petroaca added a comment - Since dependecy relation is tested and committed on the stanford-nlp project and the coref resolution is also finished and tested (not yet up for pull request though) I'd say that the datamodel is good as is so you can close it.
            rwesten Rupert Westenthaler added a comment - - edited

            Marking this as fixed. All changes where applied to trunk and the 0.12. releasing branch. Meaning that this will be also included in the 0.12.0 release.

            A big thanks to Cristian for adding this feature to Apache Stanbol!

            rwesten Rupert Westenthaler added a comment - - edited Marking this as fixed. All changes where applied to trunk and the 0.12. releasing branch. Meaning that this will be also included in the 0.12.0 release. A big thanks to Cristian for adding this feature to Apache Stanbol!

            People

              rwesten Rupert Westenthaler
              cpetroaca Cristian Aurelian Petroaca
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: