Tika / TIKA-3035

Tika-app --extract mode outputs to stderr instead of stdout


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.23
    • Fix Version/s: 1.24
    • Component/s: app
    • Labels:

      Description

      In version 1.23 of Tika I am noticing a problem with the extract functionality. When extracting embedded items from a file, the "Extracting ... to ..." messages are written to stderr instead of stdout.

      The problem is observed with the runnable jar `tika-app-1.23.jar`.

      Example to reproduce the problem:

      Here we extract the embedded files from testPDF_childAttachments.pdf and redirect standard error to /dev/null. Nothing is printed:

      $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z testPDF_childAttachments.pdf 2> /dev/null
      
      

      If I do not redirect stderr I see:

      $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z testPDF_childAttachments.pdf
      INFO  As a convenience, TikaCLI has turned on extraction of
      inline images for the PDFParser (TIKA-2374).
      Aside from the -z option, this is not the default behavior
      in Tika generally or in tika-server.
      Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
      See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
      for optional dependencies.Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
      you've excluded the TesseractOCRParser from the default parser.
      Tesseract may dramatically slow down content extraction (TIKA-2359).
      As of Tika 1.15 (and prior versions), Tesseract is automatically called.
      In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
      Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: org.xerial's sqlite-jdbc is not loaded.
      Please provide the jar on your classpath to parse sqlite files.
      See tika-parsers/pom.xml for the correct version.
      Extracting 'image0.jpg' (image/jpeg) to tika-test/out/3975acae-089c-43ae-a3bc-04e4987a0282-image0.jpg
      Extracting 'image1.tif' (image/tiff) to tika-test/out/8d11e4e3-735b-4b0b-9441-3ed4332c2f53-image1.tif
      WARN  No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
      Extracting 'Press Quality(1).joboptions' (text/plain) to tika-test/out/28c3fb48-30ea-403b-8a35-252c8f692305-Press Quality(1).joboptions
      Extracting 'Unit10.doc' (application/msword) to tika-test/out/008b9157-75f3-453b-bdfd-d5403c56891c-Unit10.doc
      

      With 1.22, the "Extracting ..." lines correctly appear on stdout when stderr is redirected:

      $ java -jar tika-app-1.22.jar --extract-dir=tika-test/out/ -z testPDF_childAttachments.pdf 2> /dev/null
      Extracting 'image0.jpg' (image/jpeg) to tika-test/out/4ec61a12-4e5f-4de3-bee8-fa15521c374a-image0.jpg
      Extracting 'image1.tif' (image/tiff) to tika-test/out/004fbeb5-4b0e-4d35-8c50-23a420dccc99-image1.tif
      Extracting 'Press Quality(1).joboptions' (text/plain) to tika-test/out/8f6174d1-f0c7-4143-990d-a922c2e9513a-Press Quality(1).joboptions
      Extracting 'Unit10.doc' (application/msword) to tika-test/out/b2508bee-745d-4051-b927-0f5c31b97c1e-Unit10.doc
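
To check which stream a given Tika version writes the "Extracting ..." lines to, something like the following could be used. It is only a sketch: `count_extract_lines` is a hypothetical helper, not part of Tika, and it assumes a POSIX shell with `mktemp` and `grep` available.

```shell
# Hypothetical helper (not part of Tika): run a command and report how many
# lines starting with "Extracting " arrive on stdout and on stderr.
count_extract_lines() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || { rm -f "$out_file"; return 1; }

    # Run the command given as arguments, keeping the two streams apart.
    "$@" >"$out_file" 2>"$err_file"

    # grep -c prints the number of matching lines (0 when there are none).
    echo "stdout: $(grep -c '^Extracting ' "$out_file")"
    echo "stderr: $(grep -c '^Extracting ' "$err_file")"

    rm -f "$out_file" "$err_file"
}
```

It would be called with the same command line as above, for example `count_extract_lines java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z testPDF_childAttachments.pdf`. Going by the output in this report, 1.23 should show all four lines under stderr and 1.22 should show them under stdout.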
      
      

       

       

        Attachments

        1. testPDF_childAttachments.pdf
          2.21 MB
          Soren Daugaard

              People

              • Assignee: Tim Allison (tallison)
              • Reporter: Soren Daugaard (sorend)
              • Votes: 0
              • Watchers: 6
