PDFBox / PDFBOX-568

testextract failure on Linux and Mac OS X

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 1.3.1
    • Component/s: Text extraction
    • Labels:
      None

      Description

      As discussed on the mailing list, the extraction test case seems to fail on non-Windows platforms.

      The troublesome test file is sample_fonts_solidconvertor.pdf, and the textextract.log file says the following (^@ is U+0000 and � is U+FFFD):

      Lines differ at index expected:46-253 actual:46-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 8 at actual line: 8
      expected line was: "@V@e^@r^@d^@a^@n^@a^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@ý^@ @t@e^@x^@t^@ @s@ ^A"
      actual line was: "@V@e^@r^@d^@a^@n^@a^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@�@ ^@t@e^@x^@t^@ @s@ ^A"
      Lines differ at index expected:4-253 actual:4-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 10 at actual line: 10
      expected line was: "AY^A~@ý^@á^@í^@é"
      actual line was: "AY^A~@�@�@�^@�"
      Lines differ at index expected:52-253 actual:52-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 11 at actual line: 11
      expected line was: "@S@a^@n^@s^@ @s@e^@r^@i^@f^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@ý^@ @t@e^@x^@t^@ @s@ ^A"
      actual line was: "@S@a^@n^@s^@ @s@e^@r^@i^@f^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@�@ ^@t@e^@x^@t^@ @s@ ^A"
      Lines differ at index expected:4-253 actual:4-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 13 at actual line: 13
      expected line was: "AY^A~@ý^@á^@í^@é"
      actual line was: "AY^A~@�@�@�^@�"
      Preparing to parse sample_fonts_solidconvertor.pdf for sorted test
      Lines differ at index expected:46-253 actual:46-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 8 at actual line: 8
      expected line was: "@V@e^@r^@d^@a^@n^@a^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@ý^@ @t@e^@x^@t^@ @s@ ^A"
      actual line was: "@V@e^@r^@d^@a^@n^@a^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@�@ ^@t@e^@x^@t^@ @s@ ^A"
      Lines differ at index expected:0-253 actual:0-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 10 at actual line: 10
      expected line was: "@á^@í^@é"
      actual line was: "@�@�@�@�"
      Lines differ at index expected:52-253 actual:52-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 11 at actual line: 11
      expected line was: "@S@a^@n^@s^@ @s@e^@r^@i^@f^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@ý^@ @t@e^@x^@t^@ @s@ ^A"
      actual line was: "@S@a^@n^@s^@ @s@e^@r^@i^@f^@:@ ^@T@o^@t^@o^@ @j@e^@ @p@o^@k^@u^@s^@n^@�@ ^@t@e^@x^@t^@ @s@ ^A"
      Lines differ at index expected:4-253 actual:4-65533
      FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected line: 13 at actual line: 13
      expected line was: "A~^AY@ý^@á^@í^@é"
      actual line was: "A~^AY@�@�@�^@�"
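
The log reports each mismatch as index-codepoint pairs: 253 is U+00FD (ý) and 65533 is U+FFFD, the replacement character. The issue does not state the root cause, so the following is only a sketch under one assumption: that a byte such as 0xFD ends up decoded with different charsets on different platforms. The class and method names are hypothetical and are not taken from the PDFBox test code; the message format simply mirrors the log above.

```java
import java.nio.charset.StandardCharsets;

public class DecodeMismatchDemo {

    /** Returns the first index at which the strings differ, or -1 if they are equal. */
    public static int firstDifference(String expected, String actual) {
        int shared = Math.min(expected.length(), actual.length());
        for (int i = 0; i < shared; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        return expected.length() == actual.length() ? -1 : shared;
    }

    /** Formats a mismatch in the same shape as the log lines quoted in this issue. */
    public static String describeDifference(String expected, String actual) {
        int i = firstDifference(expected, actual);
        if (i < 0 || i >= expected.length() || i >= actual.length()) {
            return "No differing character within the shared length";
        }
        return "Lines differ at index expected:" + i + "-" + (int) expected.charAt(i)
                + " actual:" + i + "-" + (int) actual.charAt(i);
    }

    public static void main(String[] args) {
        // "pokusn" followed by 0xFD: the byte is 'ý' in ISO-8859-1 but malformed in UTF-8.
        byte[] raw = {'p', 'o', 'k', 'u', 's', 'n', (byte) 0xFD};
        String expected = new String(raw, StandardCharsets.ISO_8859_1);
        // Java's UTF-8 decoder replaces the malformed byte with U+FFFD.
        String actual = new String(raw, StandardCharsets.UTF_8);
        System.out.println(describeDifference(expected, actual));
    }
}
```

Naming the charset explicitly, rather than relying on the platform default, is what keeps a comparison like this from depending on the operating system.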

        Activity

        Andreas Lehmkühler added a comment -

        Revision 992066 fixes the text extraction issue with sample_fonts_solidconvertor.pdf and cweb.pdf from our test arena.

        To achieve that I rearranged and improved the code concerning the encoding. The next step will hopefully be adding support for CID-coded fonts.

        Andreas Lehmkühler added a comment - edited

        Revision 902568 fixes a small typo that had the side effect of making the extract test pass every time.

        Thanks for the hint, Mykola.

        Mykola Gurov added a comment -

        The change in revision 889724 has suppressed test failures for all files. I guess this wasn't the intention?

        ~/pdfbox $ cat test/input/simple-openoffice.pdf.txt
        ##I am a simple pdf.
        ~/pdfbox $ echo garbage > test/input/simple-openoffice.pdf.txt
        ~/pdfbox $ cat test/input/simple-openoffice.pdf.txt
        garbage
        ~/pdfbox $ ant testextract
        Buildfile: build.xml
        ...
        testextract:
        [junit] Testsuite: org.apache.pdfbox.util.TestTextStripper
        [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 10.478 sec
        [junit]
        [junit] Testcase: testExtract took 10.474 sec

        BUILD SUCCESSFUL
        Total time: 22 seconds

        Jukka Zitting added a comment -

        In revision 889724 I added a special check in TestTextStripper.java to disable the test failure in the default build (no point in making the build fail for issues we already know about). Please make sure that this revision is reverted before resolving this issue as fixed!
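
A check of this kind could look roughly like the sketch below. It is not the code from TestTextStripper.java: the property name `example.extract.strict` and the class are hypothetical, chosen only to illustrate a mismatch that is recorded but fails the run only when strict mode is requested.

```java
public class GatedFailureDemo {

    // Hypothetical property name; the issue does not say how the real check is keyed.
    static final String STRICT_PROPERTY = "example.extract.strict";

    /** A recorded mismatch fails the run only when strict mode is enabled. */
    public static boolean shouldFail(boolean mismatchFound, boolean strict) {
        return mismatchFound && strict;
    }

    public static void main(String[] args) {
        // Boolean.getBoolean reads the system property and is false when it is unset.
        boolean strict = Boolean.getBoolean(STRICT_PROPERTY);

        String expectedText = "##I am a simple pdf.";
        String actualText = "garbage";
        boolean mismatchFound = !expectedText.equals(actualText);

        if (shouldFail(mismatchFound, strict)) {
            throw new AssertionError("Extraction output differs from the expected text");
        }
        System.out.println(mismatchFound
                ? "Mismatch logged, build not failed"
                : "Output matches");
    }
}
```

Running with `-Dexample.extract.strict=true` would turn the logged mismatch into a failure. With the gate left off, even a deliberately corrupted expected file goes unnoticed by the build.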


          People

          • Assignee: Unassigned
          • Reporter: Jukka Zitting
