Description
I am using the following code to extract text from a .docx file:
AutoDetectParser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
Metadata metadata = new Metadata();
// The ParseContext carries the parser configuration into parse()
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
// Ask the Office parsers to leave out text removed via track changes
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
When I parse files that contain tracked changes, the output includes the deleted text along with the unchanged and inserted text. Is there a way to tell the parser to exclude the deleted text?
Here is an example:
Input text: This is a sample text. This part will be deleted. This is inserted.
Actual output: This is a sample text. This part will be deleted. This is inserted.
Desired output: This is a sample text. be deleted. This is inserted.
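To make the desired behavior concrete, here is a minimal sketch that does not use Tika at all. It relies on the fact that WordprocessingML stores tracked deletions as `w:delText` inside `w:del`, while unchanged and inserted text is stored in `w:t`. The class name `TrackChangesSketch`, the `SAMPLE` fragment, and the assumption that "This part will " is the tracked deletion (inferred from the difference between the input and the desired output) are illustrative only; a real document.xml contains much more markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TrackChangesSketch {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical fragment mirroring the example above:
    // "This part will " is assumed to be the tracked deletion.
    public static final String SAMPLE =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
            + "<w:del w:id=\"1\" w:author=\"a\"><w:r>"
            + "<w:delText xml:space=\"preserve\">This part will </w:delText>"
            + "</w:r></w:del>"
            + "<w:r><w:t xml:space=\"preserve\">be deleted. </w:t></w:r>"
            + "<w:ins w:id=\"2\" w:author=\"a\"><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
            + "</w:p>";

    /** Collects the text of w:t elements only, so w:delText content is skipped. */
    public static String visibleText(String wordprocessingXml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName and uri are populated
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(wordprocessingXml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(visibleText(SAMPLE));
    }
}
```

Running `main` on the sample fragment prints the desired output line shown above. This is the result I would expect from the parser when `setIncludeDeletedContent(false)` is in effect.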