Description
Hello,
I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Sometimes a file contains comments, and Tika extracts them and appends them to the end of the extracted text. Is there a way to exclude comments during text extraction?
Here is the code I am using:
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
Attachments