Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.21
None
None
Description
I am using Tika to extract content from *.docx and other files. I have noticed that Tika extracts the data points from charts and appends them at the end of the extracted content.
I am using the following code for extraction:
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
Please find the attached files for the input and the output from Tika.
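To help confirm that the trailing text really comes from embedded charts, here is a minimal sketch that lists the chart parts inside a .docx file using only the JDK. It does not use Tika. It assumes the charts are stored under the conventional word/charts/chartN.xml part names (the package relationships are what actually define the location, so this is a heuristic), and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: a .docx file is a ZIP package,
// so its chart parts can be listed with java.util.zip alone.
public class DocxChartParts {

    // Returns the entry names that look like chart parts.
    // Assumption: charts use the conventional word/charts/chartN.xml names.
    public static List<String> listChartParts(InputStream docx) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Skips word/charts/_rels/*.rels and chart style/color parts.
                if (name.startsWith("word/charts/chart") && name.endsWith(".xml")) {
                    parts.add(name);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the .docx file to inspect.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String part : listChartParts(in)) {
                System.out.println(part);
            }
        }
    }
}
```

If this prints one or more parts for the attached input file, the document contains embedded charts. Each chart part normally holds cached copies of the plotted values, which would be consistent with the points appearing in the extracted text.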