TIKA-3657

Microsoft documents are not text parsed when running under Docker

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0, 2.2.1
    • Fix Version/s: None
    • Component/s: config, core, depedency
    • Labels: None

    Description

      We use EmbeddedDocumentExtractor, with this code:

      NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new NalyticsEmbeddedDocumentExtractor(this);
      this.context.set(EmbeddedDocumentExtractor.class, nalyticsEmbeddedDocumentExtractor);

      This all works fine for us and has been used in production for a few years. It also works under Tika 2.2.0 when running in development environments (Eclipse, Apache Tomcat). However, when running under Docker, the text within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under Docker, the Microsoft documents are fully parsed, so this problem was introduced in 2.2.0.

      Interestingly, I found that if anything at all is added to the context via context.set, the same problem occurs. The same problem also occurs if the standard Tika Embedded Document Extractor is used. Our Docker image contains our application's code, which uses Tika, as well as Apache DS. The problem occurs when running Docker on Ubuntu, Mac OS and Windows.
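
For readers unfamiliar with the registration pattern in the snippet above, here is a minimal sketch of a type-keyed context, assuming `context` is a Tika ParseContext. It deliberately avoids any Tika dependency: `TypedContext`, `Extractor` and `TypedContextDemo` are hypothetical stand-ins and not Tika's actual classes. It only illustrates how a value registered under a class key is later looked up, with a fallback when nothing was set. It does not attempt to reproduce or explain the Docker behaviour reported here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a type-keyed context; NOT Tika's ParseContext. */
final class TypedContext {
    private final Map<String, Object> entries = new HashMap<>();

    /** Registers the instance for the given type; a null value removes the entry. */
    <T> void set(Class<T> key, T value) {
        if (value == null) {
            entries.remove(key.getName());
        } else {
            entries.put(key.getName(), value);
        }
    }

    /** Returns the registered instance, or null if nothing was set for this type. */
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key.getName()));
    }

    /** Returns the registered instance, or the supplied fallback if none was set. */
    <T> T get(Class<T> key, T fallback) {
        T value = get(key);
        return value != null ? value : fallback;
    }
}

/** Hypothetical stand-in for an extractor interface. */
interface Extractor {
    String name();
}

final class TypedContextDemo {
    public static void main(String[] args) {
        TypedContext context = new TypedContext();
        Extractor standard = () -> "standard";
        Extractor custom = () -> "custom";

        // Nothing registered yet, so the lookup falls back to the standard extractor.
        System.out.println(context.get(Extractor.class, standard).name());

        // After registration, the same lookup returns the custom extractor.
        context.set(Extractor.class, custom);
        System.out.println(context.get(Extractor.class, standard).name());
    }
}
```

Running `TypedContextDemo` prints `standard` and then `custom`. In a real reproduction, the equivalent step would be comparing the parse output with and without the `context.set` call, in both the development environment and the Docker image.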

       

      Attachments

        1. POIFSContainerDetector.java
          16 kB
          Tim Barrett
        2. scenario traces.txt
          9 kB
          Tim Barrett
        3. tika-config.xml
          0.8 kB
          Tim Barrett

People

    • Assignee: Unassigned
    • Reporter: Tim Barrett (comcortim)
    • Votes: 0
    • Watchers: 2
