Tika / TIKA-3094

Apache Tika fails to extract text for pptx extension.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.24, 1.24.1
    • Fix Version/s: 1.25
    • Component/s: None
    • Labels: None

    Description

      This is a regression from Apache Tika 1.23. Text extraction for files with the .pptx extension, which worked with Apache Tika 1.23, no longer works in version 1.24.

      For the .ppt extension, it works fine in both 1.23 and 1.24.

       

      According to the release notes (https://tika.apache.org/1.24/index.html), POI was updated to 4.1.2. That might be the root cause of this problem. POI requires https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2, which I guess is not present in the bundle.
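      One way to test the missing-dependency guess is to ask the class loader directly. The following is a minimal sketch, not part of the original report: the class name `DependencyCheck` and the helper `isLoadable` are hypothetical, and the two class names checked are assumptions about what a .pptx parse needs (`com.zaxxer.sparsebits.SparseBitSet` from the SparseBitSet 1.2 artifact, and POI's `XMLSlideShow`). It uses only the JDK, so it should be run from inside the same bundle or classpath that Tika runs in.

```java
// Sketch: report whether classes that a .pptx parse is assumed to need can be loaded.
class DependencyCheck {

    // Returns true if the named class is visible to the class loader that loaded this class.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, DependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "com.zaxxer.sparsebits.SparseBitSet",         // assumed: from com.zaxxer:SparseBitSet:1.2
            "org.apache.poi.xslf.usermodel.XMLSlideShow"  // assumed: POI's .pptx model class
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

      If SparseBitSet is reported as missing while the POI class is found, that would be consistent with the guess above. It would not by itself prove that this is the cause of the failed extraction.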

       

       

      Attachments

        1. Sample PPT.pptx
          84 kB
          Abhishek Chauhan


          People

            Assignee: Bob Paulin (bob)
            Reporter: Abhishek Chauhan (abchauha)
            Votes: 0
            Watchers: 5
