Tika / TIKA-2593

docx with track change producing incorrect output

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17
    • Fix Version/s: None
    • Component/s: core, handler
    • Labels: None

    Description

      I am using the following code to extract text from a docx file:

      AutoDetectParser parser = new AutoDetectParser();
      ContentHandler contentHandler = new BodyContentHandler();
      InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
      Metadata metadata = new Metadata();
      ParseContext parseContext = new ParseContext();

      // Ask the Office parsers to leave out text deleted under track changes
      OfficeParserConfig officeParserConfig = new OfficeParserConfig();
      officeParserConfig.setIncludeDeletedContent(false);
      parseContext.set(OfficeParserConfig.class, officeParserConfig);

      parser.parse(inputStream, contentHandler, metadata, parseContext);
      System.out.println(contentHandler.toString());
      

      When I parse files that contain tracked revisions, the output includes all of the deleted text along with the unchanged and inserted text. Is there a way to tell the parser to exclude the deleted text?

      Here is an example:

      Input text: This is a sample text. This part will be deleted. This is inserted.

      Actual output: This is a sample text. This part will be deleted. This is inserted.

      Desired output: This is a sample text.  be deleted. This is inserted.
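
      To make the desired output concrete, here is a minimal sketch that uses only the JDK's SAX parser and no Tika classes. It relies on the WordprocessingML convention that text in a deleted run is stored in w:delText, while visible text (including inserted runs) is stored in w:t. The class name TrackChangeSketch, the method visibleText, and the SAMPLE fragment are hypothetical; the fragment is not taken from sample.docx and assumes that "This part will" is the deleted run, as the desired output suggests.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class TrackChangeSketch {

    // Namespace of the main WordprocessingML elements (w:p, w:r, w:t, w:del, ...)
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical document.xml fragment: one deleted run, one inserted run
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
        + "<w:del><w:r><w:delText>This part will</w:delText></w:r></w:del>"
        + "<w:r><w:t xml:space=\"preserve\"> be deleted. </w:t></w:r>"
        + "<w:ins><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
        + "</w:p>";

    /** Collects the text of w:t elements only, so w:delText content is skipped. */
    static String visibleText(String documentXml) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    private boolean inText; // true while inside a w:t element

                    @Override
                    public void startElement(String uri, String local, String qName,
                                             Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) inText = true;
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (W_NS.equals(uri) && "t".equals(local)) inText = false;
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) out.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML fragment", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(visibleText(SAMPLE));
    }
}
```

      The double space in the result comes from the spaces on either side of the removed run, which matches the desired output above.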

      Attachments

        1. sample.docx
          12 kB
          Md

          People

            Assignee: Unassigned
            Reporter: Md (mdasadul)