Tika / TIKA-2900

Removing comments from *.docx, *.pdf files

Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.21
    • Fix Version/s: None
    • Component/s: app, example
    • Labels: None

    Description

      Hello,

      I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Some of these files contain comments, which Tika extracts and appends to the end of the output. Is there a way to exclude comments during text extraction?

      Here is the code I am using:

           StringBuilder fileContent = new StringBuilder();
           Parser parser = new AutoDetectParser();
           // HTML output; -1 disables the write limit
           ContentHandlerFactory factory = new BasicContentHandlerFactory(
                   BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
           //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
           RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
           Metadata metadata = new Metadata();

           // Office parser settings: SAX-based docx extractor, no deleted or
           // moved-from text, no headers and footers
           ParseContext parseContext = new ParseContext();
           OfficeParserConfig officeParserConfig = new OfficeParserConfig();
           officeParserConfig.setUseSAXDocxExtractor(true);
           officeParserConfig.setIncludeDeletedContent(false);
           officeParserConfig.setIncludeMoveFromContent(false);
           officeParserConfig.setIncludeHeadersAndFooters(false);
           parseContext.set(OfficeParserConfig.class, officeParserConfig);

           wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
           // The extracted content is read from the TIKA_CONTENT metadata key
           String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
              

            People

              Assignee: Dave Meikle (davemeikle)
              Reporter: Md (mdasadul)
              Votes: 0
              Watchers: 2
