TIKA-3094

Apache Tika fails to extract text for pptx extension.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.24, 1.24.1
    • Fix Version/s: 1.25
    • Component/s: None
    • Labels: None

      Description

      This is a regression from Apache Tika 1.23. Text extraction for the .pptx extension, which worked in 1.23, no longer works in 1.24.

      For the .ppt extension, extraction works fine in both 1.23 and 1.24.

      According to the release notes (https://tika.apache.org/1.24/index.html), POI was updated to 4.1.2, which might be the root cause of this problem. POI requires SparseBitSet 1.2 (https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2), which I suspect is not present in the bundle.
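
One way to test the suspicion above is to check whether the SparseBitSet class is visible on the classpath where extraction fails. The following is a minimal sketch, not part of the original report: the class name `SparseBitSetCheck` and the helper `isClassPresent` are hypothetical, and the fully qualified name `com.zaxxer.sparsebits.SparseBitSet` is assumed to be the class shipped in the com.zaxxer:SparseBitSet artifact.

```java
public class SparseBitSetCheck {
    // Assumed fully qualified name of the class in com.zaxxer:SparseBitSet.
    static final String SPARSE_BIT_SET = "com.zaxxer.sparsebits.SparseBitSet";

    /** Returns true if the named class is visible to this class's class loader. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, SparseBitSetCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable in this environment.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(SPARSE_BIT_SET + " present: " + isClassPresent(SPARSE_BIT_SET));
    }
}
```

The check only reflects the class loader of the code that runs it. In an OSGi deployment it would need to run inside the bundle in question to say anything about that bundle's contents.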


        Attachments

        1. Sample PPT.pptx
          84 kB
          Abhishek Chauhan

            People

            • Assignee: Bob Paulin (bob)
            • Reporter: Abhishek Chauhan (abchauha)
            • Votes: 0
            • Watchers: 5

              Dates

              • Created:
                Updated:
                Resolved: