Tika / TIKA-1904

Tika 2.0 - Create Proxy Parser and Detectors


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • 2.0.0

    Description

      Several parsers and detectors instantiate parsers and detectors that live in different modules in Tika 2.0. As of now, these modules depend on other modules, including:
      tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, tika-parser-package-module
      tika-parser-ebook-module -> tika-parser-text-module
      tika-parser-journal-module -> tika-parser-pdf-module

      Many of these dependencies could be made optional by introducing proxy parsers and detectors. A proxy would enable the functionality when all of the dependencies are included in the project, but would not throw a ClassNotFoundException when the dependent module is not included (e.g. the parse function would do nothing).

      Example
      Currently, in ChmParser:

      private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException {// throws IOException
          InputStream stream = null;
          Metadata metadata = new Metadata();
          HtmlParser htmlParser = new HtmlParser();
          ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));// -1
          ParseContext parser = new ParseContext();
          try {
              stream = new ByteArrayInputStream(byteObject);
              htmlParser.parse(stream, handler, metadata, parser);
          } catch (SAXException e) {
              throw new RuntimeException(e);
          } catch (IOException e) {
              // Pushback overflow from tagsoup
          }
      }
      

      Instead, the HtmlParser could be proxied in the constructor:

      private final Parser htmlProxyParser;

      public ChmParser() {
          this.htmlProxyParser = new ParserProxy("org.apache.tika.parser.html.HtmlParser");
      }
      

      And parsePage would then delegate to the proxy:

      private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException {// throws IOException
          InputStream stream = null;
          Metadata metadata = new Metadata();
          ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));// -1
          ParseContext parser = new ParseContext();
          try {
              stream = new ByteArrayInputStream(byteObject);
              htmlProxyParser.parse(stream, handler, metadata, parser);
          } catch (SAXException e) {
              throw new RuntimeException(e);
          } catch (IOException e) {
              // Pushback overflow from tagsoup
          }
      }
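
The ParserProxy class itself is not shown above. As a rough sketch of how it might work, the snippet below resolves the delegate reflectively by class name and degrades to a no-op when the class cannot be loaded. Everything here is an assumption for illustration: MiniParser is a hypothetical one-method stand-in for Tika's Parser interface (so the sketch compiles without Tika on the classpath), UpperCaseParser plays the role of a parser from an optional module, and isAvailable() and ProxyDemo are invented helpers, not part of any proposed API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for Tika's Parser interface, reduced to one method.
interface MiniParser {
    void parse(InputStream stream, StringBuilder out) throws IOException;
}

// Plays the role of a parser that lives in an optional module.
class UpperCaseParser implements MiniParser {
    @Override
    public void parse(InputStream stream, StringBuilder out) throws IOException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        out.append(text.toUpperCase(Locale.ROOT));
    }
}

// Assumed shape of the proposed proxy: look up the delegate by name once,
// and fall back to doing nothing if the class is not on the classpath.
class ParserProxy implements MiniParser {
    private final MiniParser delegate; // null when the target class is absent

    ParserProxy(String className) {
        MiniParser resolved = null;
        try {
            Class<?> clazz = Class.forName(className);
            resolved = (MiniParser) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Optional module missing or unusable: remain a no-op proxy.
        }
        this.delegate = resolved;
    }

    boolean isAvailable() {
        return delegate != null;
    }

    @Override
    public void parse(InputStream stream, StringBuilder out) throws IOException {
        if (delegate != null) {
            delegate.parse(stream, out);
        }
    }
}

// Small driver: parse a string through a proxy for the given class name.
class ProxyDemo {
    static String run(String className, String input) {
        StringBuilder out = new StringBuilder();
        try {
            new ParserProxy(className)
                .parse(new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Calling ProxyDemo.run with the name of a class that is present delegates to it, while a name that cannot be resolved yields an empty result instead of an exception. Catching LinkageError alongside ReflectiveOperationException is a design choice in this sketch: it also covers the NoClassDefFoundError that can surface when the target class loads but one of its own dependencies is missing.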
      

          People

            bobpaulin Bob Paulin