UIMA-3969: Add JSON Serialization for CASs and UIMA Descriptors

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.6.0SDK
    • Fix Version/s: 2.7.0SDK
    • Component/s: Core Java Framework
    • Labels:
      None

      Description

      Recent trends toward moving things into the cloud motivated me to consider what a JSON serialization of the CAS and descriptor metadata (more particularly, type systems) might look like.

      I've put up a Wiki page with some of the thoughts so far in this exploration, here: https://cwiki.apache.org/confluence/display/UIMA/JSON+serialization+for+UIMA

      I'm also fooling around with a proof-of-concept implementation, based on our current XMI serialization for the CAS, as well as our MetaDataObject_impl serialization for UIMA descriptors, in order to work out the details. There are additional nits (like how to configure things) not yet worked out.

      Comments and discussion appreciated; I've put this up as a Jira to record them together - but feel free to use email also for any comments you feel might be better being more ephemeral.

        Activity

        schor Marshall Schor added a comment -

        Jens Grivolla reports test failures on Mac OSX probably due to line-ending issues.

        schor Marshall Schor added a comment -

        Added code to canonicalize the tests for line-ending variations on different platforms. Jens, it would be great if you could confirm this fixes the tests on your Mac platform.
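A minimal sketch of what line-ending canonicalization in a test comparison could look like. The class and method names are hypothetical and are not taken from the uimaj test code; the actual fix may work differently.

```java
public class LineEndings {
    /** Converts Windows (\r\n) and bare carriage-return (\r) line endings to \n. */
    static String canonicalize(String text) {
        // Replace the two-character sequence first so it is not turned into two newlines.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String expected = "{\n  \"a\" : 1\n}\n";      // reference file with Unix endings
        String actual = "{\r\n  \"a\" : 1\r\n}\r\n";  // serializer output with Windows endings
        // Compare canonical forms so the test does not depend on the platform.
        System.out.println(canonicalize(expected).equals(canonicalize(actual))); // prints "true"
    }
}
```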

        jg Jens Grivolla added a comment -

        All of uimaj now builds with no errors on Mac OSX 10.9

        schor Marshall Schor added a comment -

        After writing several docs and emails about this, I am more motivated to refactor XmiCasSerialization into two parts - one related to Xmi, and the other to the generic support code that figures out the items to serialize. That would be the part reusable by JSON serialization, and would allow a better API - one for JSON and one for Xmi.

        Try this refactoring, and (assuming it works) also switch the exceptions reported by JSON serialization back to IOException (instead of wrapping them).

        mcmurry.andy@gmail.com Andy McMurry added a comment -

        Interesting thread in UIMA core about JSON serialization of CASs and descriptors.

        jtgreen John T Green added a comment -

        Very good

        On Wed, Aug 27, 2014 at 2:39 AM, AndyMC@apache.org (Andy McMurry) <

        schor Marshall Schor added a comment -

        During refactoring, I found some edge cases in serializing lists and arrays. One of these involves an ambiguity: if a feature value is [ 123, 456 ], this could be an array of numbers or an array of FsRefs. The context section has a list per type, under the key @featureRefs, which says which features ought to be considered as having FsRefs. That disambiguates this case. But it fails in the following case: the feature is a collection of integers, e.g., a list or array in UIMA, but because of other constraints it is being serialized as a separate Feature Structure. In this case the serialized value would be something like 1234 (which is to be interpreted as an FsRef, not as an integer). But @featureRefs doesn't list this feature as having FsRef values.

        To fix this, I'm thinking of augmenting @featureRefs with @featureRefsOnlyIfSingle, which would be a list of features that would normally be expected to have [ xxx, ] kinds of (non-FsRef) values, but if they have just a single number, then that number should be considered to be an FsRef.
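The rule proposed above can be sketched from a reader's point of view. This is only an illustration of the disambiguation logic: the class, the enum, and the method are hypothetical, and the two sets stand in for the per-type @featureRefs and @featureRefsOnlyIfSingle lists from the context section.

```java
import java.util.List;
import java.util.Set;

public class FeatureRefResolver {
    enum Kind { FS_REF, FS_REF_ARRAY, NUMBER, NUMBER_ARRAY }

    /**
     * Decides how a numeric feature value should be interpreted.
     * featureRefs:      features whose values are always FsRefs
     * refsOnlyIfSingle: features that normally hold arrays of plain numbers,
     *                   but whose single-number form is an FsRef
     */
    static Kind classify(String feature, Object value,
                         Set<String> featureRefs, Set<String> refsOnlyIfSingle) {
        boolean isArray = value instanceof List;
        if (featureRefs.contains(feature)) {
            return isArray ? Kind.FS_REF_ARRAY : Kind.FS_REF;
        }
        if (!isArray && refsOnlyIfSingle.contains(feature)) {
            // The collection was serialized as a separate FS; the number is its id.
            return Kind.FS_REF;
        }
        return isArray ? Kind.NUMBER_ARRAY : Kind.NUMBER;
    }

    public static void main(String[] args) {
        Set<String> refs = Set.of("head");           // hypothetical feature names
        Set<String> onlyIfSingle = Set.of("counts");
        System.out.println(classify("counts", List.of(123, 456), refs, onlyIfSingle)); // NUMBER_ARRAY
        System.out.println(classify("counts", 1234, refs, onlyIfSingle));              // FS_REF
    }
}
```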

        schor Marshall Schor added a comment -

        Another facet to this: embedding. The Xmi has embedding for non-shared lists and arrays. I don't know if XMI allows this (I know JSON does), but it's possible Xmi could support embedding for other non-shared objects, like user-defined feature structures. Example:

          <xyz:MyType xmi:id="382">
               <myFeat>
                     <xyz:EmbeddedType  xmi:id="404" .... />    <!-- a directly embedded FS, otherwise represented via an FsRef integer -->
               </myFeat>
          </xyz:MyType>
        

        Consider extending the current implementation to support embedding (at least for JSON) of non-shared FSs besides lists and arrays. Also consider making this configurable, that is, having a mode which turns off all embedding. This would make the representation more uniform, and perhaps easier to parse and handle (fewer cases to consider), at the cost of some extra bytes.
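To make the two modes concrete, here is a rough sketch that renders the example above as JSON either with the inner FS embedded or with an FsRef integer in its place. The Fs class, the toJson method, and the exact JSON layout (including the @id and @type keys) are assumptions for illustration, not the format the implementation produces.

```java
public class EmbeddingSketch {
    /** A toy feature structure with at most one FS-valued feature, "myFeat". */
    static final class Fs {
        final int id;
        final String type;
        final Fs myFeat; // may be null

        Fs(int id, String type, Fs myFeat) {
            this.id = id;
            this.type = type;
            this.myFeat = myFeat;
        }
    }

    /** Renders an FS; with embed=false the referenced FS appears only as its id. */
    static String toJson(Fs fs, boolean embed) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"@id\":").append(fs.id)
          .append(",\"@type\":\"").append(fs.type).append('"');
        if (fs.myFeat != null) {
            sb.append(",\"myFeat\":");
            if (embed) {
                sb.append(toJson(fs.myFeat, true)); // inline the non-shared FS
            } else {
                sb.append(fs.myFeat.id);            // FsRef; the FS itself is written elsewhere
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Fs outer = new Fs(382, "MyType", new Fs(404, "EmbeddedType", null));
        System.out.println(toJson(outer, true));
        System.out.println(toJson(outer, false));
    }
}
```

With embedding turned off, a reader only ever sees integers in FS-valued features, which is the "fewer cases to consider" benefit mentioned above.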

        schor Marshall Schor added a comment -

        While refactoring, I see there's a 0-overhead way to computationally know if feature structures have multiple references to them, independent of the setting of the <multipleReferencesAllowed> attribute on Features. So we don't need that flag to tell if some Array or List is multiply referenced or not. The existing code does this partially, already, and if it discovers a multiple reference to something that was being serialized as if it was not multiply referenced, it reports an error, and "truncates" the list (if it is a list); if it's an array, it loses the object sharing.

        I'm planning on a non-compatible change to switch this to using the actual knowledge of whether or not an FS is shared. This guarantees correct serialization (all sharing preserved), regardless of the correct or incorrect use of the <multipleReferencesAllowed> flag. This could cause some code that depended on the truncation, or upon the non-sharing (in the case of arrays), to now start to fail. Please post a comment if you know of issues this might cause.

        I'm thinking that if XmiCasSerialization is the only real use of this flag, then we could deprecate it.
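As an illustration of deciding sharing from the data rather than from the flag, the sketch below counts inbound references in a toy reference graph. It is not the zero-overhead mechanism mentioned above, which is not described in this thread; the class name, method name, and the map-based graph model are all assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SharedFsDetector {
    /**
     * Takes a map from FS id to the ids it references and returns
     * the ids that are referenced more than once.
     */
    static Set<Integer> findShared(Map<Integer, List<Integer>> refsById) {
        Map<Integer, Integer> inbound = new HashMap<>();
        for (List<Integer> targets : refsById.values()) {
            for (Integer target : targets) {
                inbound.merge(target, 1, Integer::sum); // count each reference
            }
        }
        Set<Integer> shared = new TreeSet<>();
        for (Map.Entry<Integer, Integer> e : inbound.entrySet()) {
            if (e.getValue() > 1) {
                shared.add(e.getKey()); // must be serialized separately to keep sharing
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, List.of(3));     // FS 1 refers to array 3
        graph.put(2, List.of(3, 4));  // FS 2 refers to array 3 and list 4
        System.out.println(findShared(graph)); // prints "[3]"
    }
}
```

Anything not in the returned set has at most one referrer and is a candidate for embedding.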

        renaudrichardet Renaud Richardet added a comment -

        Thanks Marshall Schor for this new functionality. I have started a prototype at https://github.com/renaud/uima_mongo, it can write (serialize) to Mongo, but not read yet (deserialize). Marshall Schor, do you plan to add this functionality?

        schor Marshall Schor added a comment -

        An updated version of the JSON serialization code is checked into trunk. Updated doc here: http://svn.apache.org/repos/asf/uima/uimaj/trunk/uima-docbook-references/src/docbook/ref.json.xml

        Main change: The variants on how to serialize now are cleaner. Serialized Feature Structures now include @id and @type features; these can be omitted if not wanted. The indexing by id or type is specified by a separate INDEX_ID or INDEX_TYPE spec.

        The code was refactored into a common part between JSON and XMI, and separate parts for those two formats. The JSON serialization was more closely aligned with the XMI style, including respecting existing id values that may have been deserialized into the CAS by a previous XmiDeserialization.

        Since the APIs have changed a bit, those who were brave enough to try the trunk version will need some slight updating (sorry about that!). And there is still no deserialization; that will not be done soon, I'm afraid, unless other hands jump in.

        Testing and feedback appreciated - it would be good to get this mostly "right" before releasing.

        schor Marshall Schor added a comment -

        Fix issues with byteArray serialization. This is done for JSON as "binary" data following JSON/Jackson conventions - so it is encoded as base64. The format should not have [] around the value. Also, add a @featureByeArrays to the @context.
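For reference, a small sketch of the base64 convention using only the JDK: the byte array becomes a single JSON string value with no [] around it. The class and method names are hypothetical; the real serializer relies on Jackson for this rather than on code like the following.

```java
import java.util.Arrays;
import java.util.Base64;

public class ByteArrayJson {
    /** Encodes bytes as a quoted base64 JSON string value (not a JSON array). */
    static String toJsonValue(byte[] bytes) {
        return "\"" + Base64.getEncoder().encodeToString(bytes) + "\"";
    }

    /** Decodes a quoted base64 JSON string value back into bytes. */
    static byte[] fromJsonValue(String jsonString) {
        String unquoted = jsonString.substring(1, jsonString.length() - 1);
        return Base64.getDecoder().decode(unquoted);
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        String json = toJsonValue(data);
        System.out.println(json);                                    // prints "AQID" including the quotes
        System.out.println(Arrays.equals(data, fromJsonValue(json))); // prints "true"
    }
}
```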

        Change the namespace support to only do namespaces where they're required, type by type. (Meaning some types may not use namespaces, and others will).

        schor Marshall Schor added a comment -

        Some users have expressed an interest in exploiting JSON's capabilities to have more embedding (we currently embed lists and arrays as feature values if they are marked in the type system as <multipleReferencesAllowed>false</multipleReferencesAllowed>).

        They would like this embedding to include FSs within other FSs (FSs means FeatureStructures). And they would like this done based on dynamically determining if the embed candidate is multiply-referenced or not. When I delve into this, I see some issues with supporting this and delta CAS formats.

        One approach is to drop delta CAS formats. I'm wondering if this might be reasonable, given that we have XMI serialization as an alternative (or various binary ones). I think the main motivation for JSON serialization is to connect the output of UIMA pipelines with non-UIMA web or cloud applications; this is probably a quite different model than the standard UIMA pipeline with remotes model, where the intent is to send a CAS to a remote, and have it be "returned" (often as a Delta).

        I'd like to hear the views of anyone listening on this trade-off between supporting dynamic embeddability and supporting delta-CAS formats.

        rec Richard Eckart de Castilho added a comment -

        Never needed delta-CAS so far. I suppose going for non-delta JSON sounds good. If anybody requires a delta-JSON, maybe that can be retrofitted as an option?

        schor Marshall Schor added a comment -

        Good point. I'm going to see about supporting both embeddable (dynamically determined) and non-embeddable (except for statically determined arrays/lists, like current XMI/XML) formats, under the control of some configuration which ensures that if delta serialization is being done, the non-embeddable form is mandatory.
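One way such a configuration rule might be expressed is sketched below. The class, the enum, and its constants are hypothetical names chosen for illustration and do not correspond to any existing UIMA API.

```java
public class JsonSerializerConfig {
    enum Embedding {
        DYNAMIC,     // embed any FS found to have a single reference
        STATIC_ONLY  // embed only arrays/lists marked non-shared in the type system
    }

    /** Returns the embedding mode actually used; delta serialization forces STATIC_ONLY. */
    static Embedding effectiveEmbedding(Embedding requested, boolean deltaSerialization) {
        return deltaSerialization ? Embedding.STATIC_ONLY : requested;
    }

    public static void main(String[] args) {
        System.out.println(effectiveEmbedding(Embedding.DYNAMIC, false)); // DYNAMIC
        System.out.println(effectiveEmbedding(Embedding.DYNAMIC, true));  // STATIC_ONLY
    }
}
```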

        schor Marshall Schor added a comment -

        Doing the next (and final) iteration of the design.

        schor Marshall Schor added a comment -

        See the docbook uima-references, chapter 9, for a description of the JSON implementation. I hope this is the last big change before the 2.6.1 release.


          People

          • Assignee: Marshall Schor
          • Reporter: Marshall Schor
          • Votes: 0
          • Watchers: 6
