Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.12.0
Description
Extend the Stanbol NLP Processing API with annotations for co-reference resolution and dependency trees.
Also, add support for JSON Serialisation/Parsing for the co-reference and dependency tree annotations so that the RESTful NLP Analysis Service can provide co-reference information.
Attachments
Attachments
- coref_dependency_tree_datamodel.patch
- 59 kB
- Cristian Aurelian Petroaca
- coref_dependency_tree_datamodel_vers_2.patch
- 24 kB
- Cristian Aurelian Petroaca
- coref_dependency_tree_vers_3.patch
- 27 kB
- Cristian Aurelian Petroaca
- coref_dependency_tree_vers_3_fixed.patch
- 28 kB
- Cristian Aurelian Petroaca
- coref_dependency_tree_datamode_vers_4.patch
- 10 kB
- Cristian Aurelian Petroaca
- coref_dependency_tree_datamodel_vers_5.patch
- 14 kB
- Cristian Aurelian Petroaca
Issue Links
- is depended upon by
-
STANBOL-1121 Event extraction Enhancement Engine
- Open
Activity
Cristian, I will try to have a look at your patch until the first half of the coming week. If I do not write any comments until than feel free to send a reminder^^
Hi Cristian
Had a look at you patch and IMO it is very good quality. Because of this and because the patch only contains new features I decided to commit it directly to trunk. This should make further work more simple (no merges needed, you can use the default Stanbol Launcher for testing). However if you would prefer to continue in a branch just let me know and I will also apply the patch to the branch.
In the following a list of things I noticed while working through your patch:
1. Can you adapt the implementation of GrammaticalRelation to represent the hierarchy similar to the Pos enum. Using a static {} block to build up the transitive closure over parents instead of doing it in the #addParents() method. The getParents() method would then use the static EnumMap<GrammaticalRelation,EnumSet<GrammaticalRelation>>(GrammaticalRelation.class) instead of a local HashMap. I would like to use EnumSet/-Map implementations as they provide much better performance in contrast to HashSet/-Map implementations.
2. Annotations should be immutable. So I would like to remove all add** set** ... methods from CorefTag, DependencyFeature and GramaticalReleationTag. In addition Getter methods should return immutable wrappers. The best would be to use Collections.unmodifiableSet(..) when setting fields in the Constructor. I added according comments to the CorefTag
3. IMO DependencyFeatures is obsolete. That's because the Annotated interface already supports multiple values and DependencyFeatures is just a collection of multiple DependencyRelations. So instead of having a DependencyFeatures class I would suggest to just add multiple DependencyRelations values. In detail I suggest to
- rename DependencyRelation to DependencyRelationTag.
- DependencyRelationTag extends Tag<DependencyRelationTag>
- Change NlpAnnotations#DEPENDENCY_ANNOTATION to new Annotation<DependencyFeatures>(
"stanbol.enhancer.nlp.dependency", DependencyRelationTag.class); - Simple add multiple DependencyRelationTag Values to a token. This has also the advantage the iterations over those are sorted by the confidence of those.
- simple add a new DependencyRelationTag
Again thanks for the patch. It is great to getting Coref and dependendy tree support to the Stanbol NLP module!
applied the patch with http://svn.apache.org/r1532246 to trunk
Hi Rupert,
Working on the trunk is fine. I'll work on the changes you suggested and let you know when I have the patch ready.
Hi Rupert,
Made the changes you suggested with one comment at no 3: the DependencyRelation doesn't need to extend Tag because there would be nothing to store in its 'tag' since the grammatical relation tag is already stored in the GrammaticalRelationTag class. So I made it a simple class.
Hi Cristian, I would like to see a example usage of those annotations. IMO this should become much more clear as soon as you add unit tests.
For the sentence "Obama visited China." we have the dependency tree : nsubj(visited-2, Obama-1), root(ROOT-0, visited-2), dobj(visited-2, China-3)
Let's take the token "visited". We would have the following DependencyRelation objects created and attached as annotations:
GrammaticalRelationTag nSubjTag = new GrammaticalRelationTag("nsubj", GrammaticalRelation.NominalSubject);
DependencyRelation relation1 = new DependencyRelation(nSubjTag, false /*isDependent */, ObamaToken);
GrammaticalRelationTag rootTag = new GrammaticalRelationTag("root", GrammaticalRelation.Root);
DependencyRelation relation2 = new DependencyRelation(rootTag, true /* isDependent /, null / no token for root */)
GrammaticalRelationTag dObjTag = new GrammaticalRelationTag("dobj", GrammaticalRelation.DirectObject)
DependencyRelation relation3 = new DependencyRelation(dObjTag, false /* isDependent, ChinaToken);
The nSubjTag, rootTag and dObjTag are taken out of the TagSetRegistry in the Stanford NLP API following the examples from PosTag, etc.. That is actually the reason I created a GrammaticalRelationTag and I didn't put the 'tag' at DependencyRelation level.
By the way, looking at this hierarchy I think it should also be reflected in the json. So instead of the initial json :
{ "tag" : "nsubj", //type of relation - Stanford NLP notation "relationType" : 12, // type of relation - Stanbol NLP mapped value - ordinal number in enum Dependency "isDependent" : "true", "type" : "Token", // type of element with which this token is in relation "start" : 123, // start index of the relating token "end" : 130 // end index of the relating token }I would have :
{ "grammaticalRelation" :
{ "tag" : "nsubj", //type of relation - Stanford NLP notation "relationType" : 12, // type of relation - Stanbol NLP mapped value - ordinal number in enum Dependency },
"isDependent" : "true",
"type" : "Token", // type of element with which this token is in relation
"start" : 123, // start index of the relating token
"end" : 130 // end index of the relating token
}
Hi Cristian,
I have not missed it. I simple had no time to look at it.
BTW: I do not see the patch related to your commend from the 20.Okt. Maybe you have forgotten to attach it?
I didn't put it yet because I first wanted your feedback on my comments to point no 3 in case I'd need to change something.
I attached the changes to the Jira. I added a couple of Junit tests in DependencyRelationTest,java
Hi Rupert,
Did you get a chance to take a look at the changes? I completed the changes related to dependency tree parsing in the Stanford NLP project as well and I'd like to know that the datamodel code is correct before I make the pull request.
Hi Christian,
thanks for the remainder. I have reviewed and applied your patch with revision http://svn.apache.org/r1539574
NOTE: that I had to made a changes to your code because the SpanTypeEnum was moved to an own file.
The following files also need to be deleted because I removed the DependencyFeatures and they are not needed anymore. Don't know why they did not show up in the patch diff.
nlp-json/src/main/java/org/apache/stanbol/enhancer/nlp/json/valuetype/impl/DependencyFeaturesSupport.java
nlp/src/main/java/org/apache/stanbol/enhancer/nlp/dependency/DependencyFeatures.java
Hi Rupert,
I attached a new patch to the jira : coref_dependency_tree_vers_3.patch. It contains the following :
1. Changed the CorefTag to not extend Tag anymore because it doesn't actually need a tag. Renamed CorefTag and CorefTagSupport to CorefFeature and CorefFeatureSupport.
2. Added junit tests for CorefFeatureSupport and DependencyRelationSupport serialization and parse routines.
Again, the deleted CorefTag and CoefTagSupport classes appeared in the patch creation dialog but after I created the patch I could not see them in the patch diff.So please make sure you delete those as well.
Thanks,
Cristian
Hi Cristian,
after applying your patch I see failing Tests. See the following example.
testSerializationAndParse(org.apache.stanbol.enhancer.nlp.json.valuetype.CorefFeatureSupportTest) Time elapsed: 0.002 sec <<< FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.stanbol.enhancer.nlp.json.valuetype.CorefFeatureSupportTest.testSerializationAndParse(CorefFeatureSupportTest.java:66)
As far as I can see the serialized Text does match the 'jsonCorefCheckObama' and 'jsonCorefCheckHe' only if I remove all '\r' entries from the fields. This might be a OS specific Issue. The same is also true for the org.apache.stanbol.enhancer.nlp.json.valuetype.DependencyRelationSupportTest.
For now I removed all '\r' in my local version, but I think this will not solve this problem as this is most likely an OS specific issue. Can you please update the test so that it does not depend on the line separator(s) used by the OS
best
Rupert
Fixed in coref_dependency_tree_vers_3_fixed.patch. Should work fine on Unix based OS as well.
Hi Cristian,
Applied the fixed patch to trunk and merged it back to the 0.12 branch.
Some Notes:
I had to replace the equals implementation of DependencyRelation, because of several issues: (1) the call to super.equals is a very bad thing, because it checks for instance equality (2) possibility of NPE in equals checks. I replaced your version of hashCode and equals with the IDE generated versions.
While fixing the equals method I noticed that the NLP features you added do not validate parsed parameters. Is is intended that users can parse NULL for all the defined arguments? If not you should validate those and throw IllegalArgumentExceptions. This could also allow to write simpler hashcode and equals methods as one would not check for NULL values.
All fields in the CorefFeature and DependencyRelation classes should be non null except DependencyRelation.partner. I'll do the validations.
The last patch broke the build on Jenkins with the message
[ERROR] <https://builds.apache.org/job/stanbol-trunk-1.6/org.apache.stanbol$org.apache.stanbol.enhancer.nlp.json/ws/src/test/java/org/apache/stanbol/enhancer/nlp/json/valuetype/ValueTypeSupportTest.java>:[34,57] cannot find symbol
symbol : method lineSeparator()
location: class java.lang.System
The reason is that System#lineSeparator() was only added in JDK 7 and the build is an Java 6.
To fix this the code need to be adapted to use the System property instead.
I wasn't aware that the build doesn't use Java 7. My bad.
I'll provide a fixed patch tonight together with the rest of the validations.
Already changed with http://svn.apache.org/r1552247 (0.12) and http://svn.apache.org/r1552248 (trunk)
Hi Rupert,
I added a new patch to the jira. Basically it's the same changes as the pervious one + added a missing GrammaticalRelation type for dep tree. So you can ignore coref_dependency_tree_datamode_vers_4.patch and only apply coref_dependency_tree_datamodel_vers_5.patch.
Thanks
Hi Cristian,
I applied you patch with http://svn.apache.org/r1559644 to trunk. I will also merge it back to 0.12.
As we are getting near to release 0.12 I would like to know if we could close this issue. This would allow us to have this as an official feature of 0.12.0.
best
Rupert
Since dependecy relation is tested and committed on the stanford-nlp project and the coref resolution is also finished and tested (not yet up for pull request though) I'd say that the datamodel is good as is so you can close it.
Marking this as fixed. All changes where applied to trunk and the 0.12. releasing branch. Meaning that this will be also included in the 0.12.0 release.
A big thanks to Cristian for adding this feature to Apache Stanbol!
Changes needed for adding the coref and dependency trees datamodel and the respective json serialization/deserialization routines.