Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10
    • Component/s: cli
    • Labels:
      None

      Description

      From the ApacheCon: CouchDB seems interested in Tika, and they'd like to see an option for producing JSON output from the Tika CLI.

      1. json_output_option.patch
        2 kB
        Selva Ganesan
      2. JSONHelper.java
        2 kB
        Selva Ganesan

        Activity

        Chris A. Mattmann added a comment -

        +1, thanks Nick!

        Nick Burch added a comment -

        I've taken Selva's patch and reworked it a little to fit into the current CLI model for output handlers.

        Going forward, we might want to push the JSON serialisation logic into the core, especially as we convert more things to be typed. One route for that might be Gson serialisers (for now we still do a bit of the work ourselves).

        Selva Ganesan added a comment -

        Wrote a quick JSON helper that does most of what Nick has mentioned. It uses Gson (Maven dependency: http://sites.google.com/site/gson/gson-user-guide/using-gson-with-maven2). I have attached the helper. It takes a Tika Metadata object and returns the JSON. Hope it helps.

        Nick Burch added a comment -

        The patch looks a good start (thanks Selva!). A couple of things we might want to tweak are:

        • I'm not sure that removing quotes from the values is correct; shouldn't we escape them instead?
        • Numbers could be output without quoting
        • If we have several values for one metadata field, we should probably output it as key:array rather than multiple key:value entries

        Some of these changes might be easier with a JSON library. Does anyone know if Jackson, for example, would help with them?
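
The three tweaks listed above (escape quotes rather than strip them, emit numbers unquoted, emit multi-valued fields as arrays) could be sketched roughly as follows. This is a minimal, hand-rolled illustration and not the attached JSONHelper.java or the committed code: the class name `MetadataJsonSketch`, its methods, and the use of a plain `Map<String, List<String>>` in place of Tika's Metadata class are all assumptions made so the example is self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika. */
final class MetadataJsonSketch {

    /** Wraps a string in double quotes, escaping characters JSON does not allow raw. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    // Remaining control characters become four-digit hex escapes.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Emits values that match the JSON number grammar bare; quotes everything else. */
    static String value(String s) {
        return s.matches("-?(0|[1-9]\\d*)(\\.\\d+)?([eE][+-]?\\d+)?") ? s : quote(s);
    }

    /** Serialises the map; a field with exactly one value is a scalar, otherwise an array. */
    static String toJson(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstKey = true;
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            if (!firstKey) sb.append(',');
            firstKey = false;
            sb.append(quote(e.getKey())).append(':');
            List<String> values = e.getValue();
            if (values.size() == 1) {
                sb.append(value(values.get(0)));
            } else {
                sb.append('[');
                for (int i = 0; i < values.size(); i++) {
                    if (i > 0) sb.append(',');
                    sb.append(value(values.get(i)));
                }
                sb.append(']');
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("title", List.of("He said \"hi\""));
        m.put("pages", List.of("12"));
        m.put("author", List.of("A", "B"));
        // prints {"title":"He said \"hi\"","pages":12,"author":["A","B"]}
        System.out.println(toJson(m));
    }
}
```

Even this small sketch has to decide what counts as a number and which characters need escaping, which is the kind of detail an existing JSON library would handle for us.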

        Chris A. Mattmann added a comment -

        Actually, I don't know Tatu, but I'll take your word for it.

        In any case, where do we stop? I mean, we maintain several custom outputs in the CLI, so why treat JSON any differently? Plus, someone has already done the work to implement the JSON output and submitted a patch. The patch is small enough that I'm tempted to just commit it as-is. But I'll wait for more feedback.

        Benson Margulies added a comment -

        I use Jackson all the time. It's not that big, and you know Tatu does a good job on it.

        On Sun, Apr 10, 2011 at 1:52 PM, Chris A. Mattmann (JIRA)

        Chris A. Mattmann added a comment -

        Eh, it's always a trade-off on stuff like this. If it's simple enough to maintain (along with the rest of our CLI output, seemingly), isn't it worth maintaining 30 lines of code rather than pulling in a full JSON lib? WDYT?

        Nick Burch added a comment -

        Rather than rolling our own JSON writer, it might be worth using an existing library for that?

        Selva Ganesan added a comment -

        Here is a patch I have been using for a while in my projects. Tested against the Tika dev branch with all the sample file types in the test suite. Feel free to include it if useful.

        Ingo Renner added a comment -

        This would also make it easier to use the output in other systems like CMSs.


          People

          • Assignee:
            Chris A. Mattmann
            Reporter:
            Jukka Zitting
          • Votes:
            1
            Watchers:
            2
