TIKA-870: Allow calling parseToString with an additional maxStringLength parameter, so the limit can be set per call

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: None
    • Labels:
      None

      Description

      It would be great to be able to call parseToString with an additional maxStringLength parameter, instead of having to set it on the Tika instance. This allows the limit to be set per parse call. Sample code:

      public String parseToString(InputStream stream, Metadata metadata, int maxStringLength)
              throws IOException, TikaException {
          // Collects the extracted text and stops once maxStringLength is reached
          WriteOutContentHandler handler =
              new WriteOutContentHandler(maxStringLength);
          try {
              ParseContext context = new ParseContext();
              // Register the parser so it can also be used for embedded documents
              context.set(Parser.class, parser);
              parser.parse(
                      stream, new BodyContentHandler(handler), metadata, context);
          } catch (SAXException e) {
              // Reaching the write limit is expected; any other SAX failure is not
              if (!handler.isWriteLimitReached(e)) {
                  // This should never happen with BodyContentHandler...
                  throw new TikaException("Unexpected SAX processing failure", e);
              }
          } finally {
              stream.close();
          }
          // Text collected so far, truncated if the limit was hit
          return handler.toString();
      }
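
      The design idea behind the request, a per-call limit that overrides an instance-wide default without modifying the shared instance, can be sketched without Tika. This is a minimal illustration only: TextExtractor, readToString and defaultMaxLength are hypothetical names, and the class reads from a plain java.io.Reader rather than running a parser.

      ```java
      import java.io.IOException;
      import java.io.Reader;

      /** Hypothetical extractor (not a Tika class): a per-call limit next to an instance default. */
      final class TextExtractor {
          private final int defaultMaxLength;

          TextExtractor(int defaultMaxLength) {
              this.defaultMaxLength = defaultMaxLength;
          }

          /** Reads text using the instance-wide default limit. */
          String readToString(Reader reader) throws IOException {
              return readToString(reader, defaultMaxLength);
          }

          /** Reads text using a limit that applies to this call only; the instance is not modified. */
          String readToString(Reader reader, int maxLength) throws IOException {
              StringBuilder out = new StringBuilder();
              char[] buffer = new char[1024];
              // Close the reader when done, mirroring the finally block in the sample code
              try (Reader in = reader) {
                  int n;
                  // Never request more characters than the remaining budget allows
                  while (out.length() < maxLength
                          && (n = in.read(buffer, 0,
                                  Math.min(buffer.length, maxLength - out.length()))) != -1) {
                      out.append(buffer, 0, n);
                  }
              }
              return out.toString();
          }
      }
      ```

      Because the limit is an argument rather than state on the instance, two callers sharing one TextExtractor can use different limits without affecting each other.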
      
      1. TIKA-870.patch
        6 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Thanks Shay!

        mikemccand Michael McCandless added a comment -

        Patch, with the sample code plus a test case.

        The test case failed at first, i.e. the returned string was over the specified limit. I dug in and discovered that WriteOutContentHandler wasn't overriding (and so wasn't counting) ignorableWhitespace, so I added that override and now the test passes.

        I think it's ready...
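
        The kind of bug described above can be sketched with a plain SAX handler. This is not the patch itself and not Tika's WriteOutContentHandler; LimitedTextHandler is a hypothetical class built only on the JDK's org.xml.sax API. It routes both characters and ignorableWhitespace through the same counting method, so whitespace events cannot push the output past the limit. Treating a negative limit as "unlimited" is an assumption of this sketch.

        ```java
        import org.xml.sax.SAXException;
        import org.xml.sax.helpers.DefaultHandler;

        /** Hypothetical write-limited handler: counts every event that produces text. */
        final class LimitedTextHandler extends DefaultHandler {
            private final StringBuilder text = new StringBuilder();
            private final int limit; // negative means unlimited (assumption of this sketch)

            LimitedTextHandler(int limit) {
                this.limit = limit;
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                append(ch, start, length);
            }

            // Without this override, whitespace reported through ignorableWhitespace
            // would bypass the limit check entirely.
            @Override
            public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
                append(ch, start, length);
            }

            private void append(char[] ch, int start, int length) throws SAXException {
                int remaining = limit - text.length();
                if (limit < 0 || length <= remaining) {
                    text.append(ch, start, length);
                    return;
                }
                // Keep only what still fits, then signal that the limit was reached
                text.append(ch, start, remaining);
                throw new SAXException("Write limit of " + limit + " characters reached");
            }

            @Override
            public String toString() {
                return text.toString();
            }
        }
        ```

        A caller would catch the SAXException, much as the sample code in the description does, and still read the truncated text from toString().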

        mikemccand Michael McCandless added a comment -

        I think this makes sense.


          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            kimchy Shay Banon
          • Votes:
            0
            Watchers:
            0
