Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1421

Tika-Parsers tests fail on CentOS6 if tesseract isn't installed

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser
    • Labels:
    • Environment:

      CentOS6 AWS VM for DARPA Memex

      Description

      While testing TIKA-93 on CentOS6, I ran into some test failing issues on a 1.7-trunk fresh install of tika in tika-parsers:

      Running org.apache.tika.parser.chm.TestChmLzxcControlData
      Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
      Running org.apache.tika.parser.chm.TestChmBlockInfo
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
      Running org.apache.tika.parser.chm.TestChmItsfHeader
      Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
      Running org.apache.tika.parser.txt.TXTParserTest
      Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
      Running org.apache.tika.parser.txt.CharsetDetectorTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
      Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
      Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
      Running org.apache.tika.parser.image.PSDParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
      Running org.apache.tika.parser.image.ImageParserTest
      Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
      Running org.apache.tika.parser.image.ImageMetadataExtractorTest
      Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
      Running org.apache.tika.parser.image.MetadataFieldsTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
      Running org.apache.tika.parser.image.TiffParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
      Running org.apache.tika.parser.font.FontParsersTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
      Running org.apache.tika.parser.mp4.MP4ParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
      Running org.apache.tika.parser.mp3.Mp3ParserTest
      Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
      Running org.apache.tika.parser.mp3.MpegStreamTest
      Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
      Running org.apache.tika.parser.dwg.DWGParserTest
      Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
      Running org.apache.tika.parser.pkg.GzipParserTest
      Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
      Running org.apache.tika.parser.pkg.Seven7ParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
      Running org.apache.tika.parser.pkg.TarParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
      Running org.apache.tika.parser.pkg.Bzip2ParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
      Running org.apache.tika.parser.pkg.ArParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
      Running org.apache.tika.parser.pkg.ZipParserTest
      Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
      Running org.apache.tika.parser.video.FLVParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
      Running org.apache.tika.parser.solidworks.SolidworksParserTest
      Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
      Running org.apache.tika.parser.ibooks.iBooksParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
      Running org.apache.tika.parser.ParsingReaderTest
      Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
      Running org.apache.tika.parser.mail.RFC822ParserTest
      Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec <<< FAILURE!
      Running org.apache.tika.parser.mbox.MboxParserTest
      Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
      Running org.apache.tika.parser.mbox.OutlookPSTParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
      Running org.apache.tika.parser.jpeg.JpegParserTest
      Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
      Running org.apache.tika.parser.executable.ExecutableParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
      Running org.apache.tika.parser.rtf.RTFParserTest
      Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
      Running org.apache.tika.parser.fork.ForkParserIntegrationTest
      Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.322 sec
      Running org.apache.tika.parser.envi.EnviHeaderParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
      Running org.apache.tika.parser.AutoDetectParserTest
      Tests run: 22, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.439 sec <<< FAILURE!
      Running org.apache.tika.parser.epub.EpubParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
      Running org.apache.tika.parser.code.SourceCodeParserTest
      Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.069 sec
      Running org.apache.tika.parser.netcdf.NetCDFParserTest
      Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.125 sec
      Running org.apache.tika.parser.pdf.PDFParserTest
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
       INFO [main] (PDFParser.java:248) - Document is encrypted
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 5592
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 5592
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 12324
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 5969
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 5687
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
       WARN [main] (FontManager.java:312) - Font not found: Times New Roman
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
       WARN [main] (FontManager.java:312) - Font not found: Times New Roman
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 26441
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 5592
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 205317
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 8777
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 2314576
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 116
      ERROR [main] (NonSequentialPDFParser.java:1904) - Can't find the object xref at offset 5500
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 56931
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 51851
       INFO [main] (PDFParser.java:248) - Document is encrypted
       INFO [main] (PDFParser.java:248) - Document is encrypted
      Tests run: 27, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 14.305 sec <<< FAILURE!
      Running org.apache.tika.parser.RecursiveParserWrapperTest
      Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
      Running org.apache.tika.parser.prt.PRTParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
      Running org.apache.tika.parser.html.HtmlParserTest
      Tests run: 38, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.162 sec
      Running org.apache.tika.parser.mat.MatParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.543 sec
      Running org.apache.tika.parser.feed.FeedParserTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
      Running org.apache.tika.parser.ocr.TesseractOCRTest
      Tests run: 3, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 0.007 sec
      Running org.apache.tika.parser.odf.ODFParserTest
      Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.098 sec
      Running org.apache.tika.parser.hdf.HDFParserTest
       WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= *refno=53 tag= VG (1965) Vgroup length=34 class= Dim0.0 name= Longitude using data 52
       WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= *refno=55 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= Latitude using data 54
       WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= *refno=57 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim2 using data 56
       WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= *refno=59 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim3 using data 58
       WARN [main] (H4header.java:844) - data tag missing vgroup= 70 Sea Surface Temperature
       WARN [main] (H4header.java:844) - data tag missing vgroup= 73 Number of Observations per Bin
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.087 sec
      Running org.apache.tika.embedder.ExternalEmbedderTest
      Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
      Running org.apache.tika.mime.MimeTypesTest
      Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
      Running org.apache.tika.mime.TestMimeTypes
      Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.163 sec
      Running org.apache.tika.mime.MimeTypeTest
      Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
      Running org.apache.tika.detect.TestContainerAwareDetector
      Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.277 sec
      Running org.apache.tika.TestParsers
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 68229
       WARN [main] (PDFParser.java:757) - Count in xref table is 0 at offset 44785
       WARN [main] (FontManager.java:312) - Font not found: Times New Roman
      Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.2 sec
      
      Results :
      
      Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): Exception thrown: TIKA-198: Illegal IOException from org.apache.tika.parser.ocr.TesseractOCRParser@2657d8a0
        testInlineSelector(org.apache.tika.parser.pdf.PDFParserTest): expected:<2> but was:<0>
        testInlineConfig(org.apache.tika.parser.pdf.PDFParserTest): expected:<2> but was:<0>
        testEmbeddedFilesInChildren(org.apache.tika.parser.pdf.PDFParserTest): expected:<5> but was:<3>
      
      Tests in error: 
        testUnusualFromAddress(org.apache.tika.parser.mail.RFC822ParserTest): TIKA-198: Illegal IOException from org.apache.tika.parser.ocr.TesseractOCRParser@1574a7af
        testImages(org.apache.tika.parser.AutoDetectParserTest): TIKA-198: Illegal IOException from org.apache.tika.parser.ocr.TesseractOCRParser@107aac4a
      
      Tests run: 538, Failures: 4, Errors: 2, Skipped: 4
      

      I tried installing Tesseract here:

      http://pkgs.org/centos-6/naulinux-school-x86_64/tesseract-3.01-2.el6.x86_64.rpm.html

      However, installing that causes the other tests to pass, but the Tesseract ones to fail (I think there is something wrong with the English config and am looking into it).

        Attachments

          Activity

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              chrismattmann Chris A. Mattmann
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: