Solr / SOLR-1274

Provide multiple output formats in extract-only mode for tika handler

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: None
    • Labels:
      None

      Description

      The proposed feature is to accept a URL parameter in extract-only mode that specifies the output format. The parameter could simply overload the existing "ext.extract.only", so that one can optionally specify a format, e.g. false|true|xml|text, where true and xml give the same response (i.e. XML remains the default).
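
      A minimal sketch of how such an overloaded value could be interpreted. The class, enum, and method names are hypothetical and are not taken from the Solr source or from the attached patch; only the accepted values (false|true|xml|text) come from the description above.

```java
import java.util.Locale;

/** Hypothetical helper: interprets an overloaded extract-only parameter value. */
final class ExtractOnlyParam {

    /** The two output formats named in the proposal. */
    enum Format { XML, TEXT }

    /**
     * Returns null when extract-only mode is off, otherwise the requested format.
     * "true" and "xml" both map to XML, so XML stays the default.
     */
    static Format parse(String raw) {
        if (raw == null) {
            return null; // parameter absent: not in extract-only mode
        }
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "false":
                return null;
            case "true":
            case "xml":
                return Format.XML;
            case "text":
                return Format.TEXT;
            default:
                throw new IllegalArgumentException("Unsupported extract-only value: " + raw);
        }
    }

    public static void main(String[] args) {
        for (String value : new String[] {"true", "xml", "text", "false"}) {
            System.out.println(value + " -> " + parse(value));
        }
    }
}
```

      Rejecting unknown values instead of silently falling back to XML is a design choice made for this sketch; a lenient fallback would be equally easy to write.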

      I had been assuming that I could choose among possible tika output
      formats when using the extracting request handler in extract-only mode
      as if from the CLI with the tika jar:

      -x or --xml Output XHTML content (default)
      -h or --html Output HTML content
      -t or --text Output plain text content
      -m or --metadata Output only metadata

      However, looking at the docs and source, it seems that only the XML
      option is available; it is hard-coded in ExtractingDocumentLoader.java:

      serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
      

      Providing at least a plain-text response seems to work if you change the serializer to a TextSerializer (org.apache.xml.serialize.TextSerializer).
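
      The effect of swapping serializers can be illustrated with the JDK alone. The sketch below is not the Xerces TextSerializer and not the code in ExtractingDocumentLoader.java; it is a stand-in handler (TextCollector, a hypothetical name) that receives the same kind of SAX events a serializer would, but keeps only the character data and drops the markup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class TextOnlyHandlerDemo {

    /** Collects character data and ignores all element markup. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses the given XML and returns only its text content. */
    static String extractText(String xml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```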

      1. SOLR-1274.patch
        3 kB
        Peter Wolanin
      2. SOLR-1274.patch
        2 kB
        Peter Wolanin

          Activity

          Peter Wolanin added a comment -

          On solr-user, Yonik Seeley suggests waiting until his current changes in SOLR-284 are complete before starting on this issue.

          Noble Paul added a comment -

          We are in the process of a release. New feature requests are not generally entertained. Shall we move it to 1.5?

          Peter Wolanin added a comment -

          A minimal version of this would be pretty trivial as far as features go, and I'd thought Yonik was indicating on the e-mail list that it would be a reasonable follow-on to his last patch in the linked issue.

          Grant Ingersoll added a comment -

          Peter, if you have a patch, please add, otherwise I will mark as 1.5.

          FWIW, I think the Serializer approach is likely to only work for XML and Text. If you want HTML, etc., then we need to use a Transformer, which is what Tika CLI appears to be doing.
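
          A rough sketch of the Transformer approach using only the standard javax.xml.transform API. It is not the Tika CLI code and not the committed patch; the class and method names are hypothetical. An identity Transformer re-serializes the same input, and the output method ("xml", "html", or "text") is selected through OutputKeys.METHOD.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class OutputMethodDemo {

    /** Re-serializes the input with the given output method: "xml", "html", or "text". */
    static String render(String xml, String method) {
        try {
            // newTransformer() with no stylesheet yields an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, method);
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize as " + method, e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello Tika</p></body></html>";
        for (String method : new String[] {"xml", "html", "text"}) {
            System.out.println(method + ": " + render(xhtml, method));
        }
    }
}
```

          In a SAX pipeline the same output property can be set on the Transformer obtained from a TransformerHandler, which is one way a single code path could serve several formats.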

          Peter Wolanin added a comment -

          Here's a patch that's nearly there, but somehow I'm missing something in how Java behaves. The param is getting picked up, but this line never evaluates to true, even when the param is parsed correctly:

            if (extractFormat == "text") {
          

          If I set it to

            if (true) {
          

          I get the desired text-only output.

          Otis Gospodnetic added a comment -

          Try:

          if ("text".equals(extractFormat)) {
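
          For readers wondering why the original comparison fails: in Java, == on String operands compares object references, while equals() compares contents. A value built at runtime (such as a parsed request parameter) is normally a different object from the literal "text". The sketch below uses new String("text") to stand in for such a runtime value; the class and method names are illustrative only.

```java
final class StringEqualityDemo {

    /** Reference comparison: true only if both operands are the same object. */
    static boolean isTextByIdentity(String format) {
        return format == "text";
    }

    /** Content comparison; literal-first ordering is also safe when format is null. */
    static boolean isTextByValue(String format) {
        return "text".equals(format);
    }

    public static void main(String[] args) {
        // new String(...) creates a distinct object, as a value parsed from a request would be.
        String extractFormat = new String("text");

        System.out.println("==      : " + isTextByIdentity(extractFormat)); // false
        System.out.println("equals(): " + isTextByValue(extractFormat));    // true
        System.out.println("null    : " + isTextByValue(null));             // false, no exception
    }
}
```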
          

          Peter Wolanin added a comment -

          Well, indeed - something like that works better.

          Grant Ingersoll added a comment -

          I committed this patch, plus a test for it.

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee: Grant Ingersoll
            • Reporter: Peter Wolanin
            • Votes: 0
            • Watchers: 0
