PDFBox
  1. PDFBox
  2. PDFBOX-1512

TextPositionComparator is not compatible with Java 7

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.1
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Java 7

      Description

      The TextPostionCompartor causes the following exception running on Java 7: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison method violates its general contract!

      I think the problem is with this check:

      if ( yDifference < .1 ||
      (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
      (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))

      as it violates the contract requirement:

      The implementor must also ensure that the relation is transitive: ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.

      Finally, the implementor must ensure that compare(x, y)==0 implies that sgn(compare(x, z))==sgn(compare(y, z)) for all z.

      Java 7 now is strict and throws exceptions when the contract is violated.

      1. FOP-2252.pdf
        205 kB
        Tilman Hausherr
      2. illustration-of-inconsistent-sorting.png
        3 kB
        Hannes Erven
      3. immo-kurier_arsenal_93x62.pdf
        1.63 MB
        Hannes Erven
      4. TextPositionComparator.java
        3 kB
        Benjamin Papez
      5. Topo.pdf
        17 kB
        Maruan Sahyoun
      6. Topo.txt
        0.0 kB
        Maruan Sahyoun
      7. TopoContained.pdf
        19 kB
        Maruan Sahyoun
      8. TopoContained.txt
        0.0 kB
        Maruan Sahyoun
      9. TopoOverlap.pdf
        18 kB
        Maruan Sahyoun
      10. TopoOverlap.txt
        0.0 kB
        Maruan Sahyoun
      11. WFI_PDFParser_TextPostionComparator.txt
        3 kB
        SCHAEFER B.S.

        Issue Links

          Activity

          Hide
          Andreas Lehmkühler added a comment - - edited

          To avoid misunderstandings, IMHO the comparison itself isn't broken it works well, but it breaks the contract of the sort algorithm of the Collections framework.

          The issue is that PDFBox not only uses the x,y values of a text position. In some cases the context is taken into account if two positions are compared which are neighbors. So that there are cases where there same combination of x,y values may lead to different results if the sorting is done in another order.

          Saying that, it should be possible to simply replace the Collections.sort() call with our own sort implementation (e.g. based on quicksort) using the very same TestPositionComparator.

          Maybe there is some place for an improvement:
          The whole text is splitted into text postition, one for each character, so that we have to sort all single characters. The information of text chunks/whole words/lines of text got lost. We could preserve that information within the TextPosition (number of chunk/ index within the chunk) to simplify the comparison.

          Show
          Andreas Lehmkühler added a comment - - edited To avoid misunderstandings, IMHO the comparison itself isn't broken it works well, but it breaks the contract of the sort algorithm of the Collections framework. The issue is that PDFBox not only uses the x,y values of a text position. In some cases the context is taken into account if two positions are compared which are neighbors. So that there are cases where there same combination of x,y values may lead to different results if the sorting is done in another order. Saying that, it should be possible to simply replace the Collections.sort() call with our own sort implementation (e.g. based on quicksort) using the very same TestPositionComparator. Maybe there is some place for an improvement: The whole text is splitted into text postition, one for each character, so that we have to sort all single characters. The information of text chunks/whole words/lines of text got lost. We could preserve that information within the TextPosition (number of chunk/ index within the chunk) to simplify the comparison.
          Hide
          Maruan Sahyoun added a comment -

          A series of sample files model after the chart in http://en.wikipedia.org/wiki/Topological_sorting together with the text extraction done by Adobe Reader.

          Show
          Maruan Sahyoun added a comment - A series of sample files model after the chart in http://en.wikipedia.org/wiki/Topological_sorting together with the text extraction done by Adobe Reader.
          Hide
          Maruan Sahyoun added a comment - - edited

          I’d think that we can find a sorting algorithm which can handle such cases. Before that what would be the expectation of the sorting result looking at the drawing Hannes provided? Shall we look at inspecting the results of other tools such as Adobe Reader and replicate their behavior?

          [Update] Adobe Reader would extract the sample to C,A,B

          I’m willing to look into solving the issue but would like to have some input on the end result first.

          Maruan

          Show
          Maruan Sahyoun added a comment - - edited I’d think that we can find a sorting algorithm which can handle such cases. Before that what would be the expectation of the sorting result looking at the drawing Hannes provided? Shall we look at inspecting the results of other tools such as Adobe Reader and replicate their behavior? [Update] Adobe Reader would extract the sample to C,A,B I’m willing to look into solving the issue but would like to have some input on the end result first. Maruan
          Hide
          Hannes Erven added a comment -

          The issue here is that the ordering rules are flawed and produce circular ranks.

          If my understanding is correct, the current rules to compare two textpositions A,B are:

          1) if the bottom positions are very close or the elements overlap vertically, order by horizontal position (left first)

          2) else, order by vertical position (top first).

          Check out the drawing I'll attach and the following comparations a sorting algo would do:

          A-B: vertical overlap, rule 1, left wins, A<B
          B-C: no overlap, not close, rule 2, upper bottom wins, B<C
          A-C: vertical overlap, rule 1, left wins, C<A

          So in a nutshell, A comes before B; but C shall be inserted both after B and before A, which is inconsistent.

          Just by looking at the drawing I don't have any idea what a meaningful ordering of these boxes would be anyways...

          Show
          Hannes Erven added a comment - The issue here is that the ordering rules are flawed and produce circular ranks. If my understanding is correct, the current rules to compare two textpositions A,B are: 1) if the bottom positions are very close or the elements overlap vertically, order by horizontal position (left first) 2) else, order by vertical position (top first). Check out the drawing I'll attach and the following comparations a sorting algo would do: A-B: vertical overlap, rule 1, left wins, A<B B-C: no overlap, not close, rule 2, upper bottom wins, B<C A-C: vertical overlap, rule 1, left wins, C<A So in a nutshell, A comes before B; but C shall be inserted both after B and before A, which is inconsistent. Just by looking at the drawing I don't have any idea what a meaningful ordering of these boxes would be anyways...
          Hide
          John Hewson added a comment - - edited

          Some algorithms don't care, which may result in inconsistent ordering across multiple calls

          I don't see how this would happen unless the algorithm was randomised. For a given input the output should always be the same, regardless.

          But as you say the ordering is not sufficiently defined, so there may be more than one solution. Perhaps we need some more sophisticated rules for determining reading order? Off the top of my head, it seems like Topological sorting may be of relevance.

          Show
          John Hewson added a comment - - edited Some algorithms don't care, which may result in inconsistent ordering across multiple calls I don't see how this would happen unless the algorithm was randomised. For a given input the output should always be the same, regardless. But as you say the ordering is not sufficiently defined, so there may be more than one solution. Perhaps we need some more sophisticated rules for determining reading order? Off the top of my head, it seems like Topological sorting may be of relevance.
          Hide
          Hannes Erven added a comment -

          The issue is not related to a specific sorting algorithm. At the moment, the desired order of the elements is not sufficiently defined.

          Some algorithms don't care, which may result in inconsistent ordering across multiple calls, some algorithms (as those Java7 defaults to) detect this and throw an exception.

          I did try hacking the Comparator, but so far it doesn't pass the TextExtract test cases

          Show
          Hannes Erven added a comment - The issue is not related to a specific sorting algorithm. At the moment, the desired order of the elements is not sufficiently defined. Some algorithms don't care, which may result in inconsistent ordering across multiple calls, some algorithms (as those Java7 defaults to) detect this and throw an exception. I did try hacking the Comparator, but so far it doesn't pass the TextExtract test cases
          Hide
          John Hewson added a comment -

          Perhaps we should migrate away from using Collections.sort altogether and use some other sorting algorithm?

          Show
          John Hewson added a comment - Perhaps we should migrate away from using Collections.sort altogether and use some other sorting algorithm?
          Hide
          Florent Guillaume added a comment -

          Could someone knowledgeable please take a stab at this? Otherwise PDFBox quite often crashes without -Djava.util.Arrays.useLegacyMergeSort=true...

          Show
          Florent Guillaume added a comment - Could someone knowledgeable please take a stab at this? Otherwise PDFBox quite often crashes without -Djava.util.Arrays.useLegacyMergeSort=true ...
          Hide
          Tilman Hausherr added a comment -

          The attached file brings also the exception, although at a different place:
          java.lang.IllegalArgumentException: Comparison method violates its general contract!
          at java.util.TimSort.mergeLo(TimSort.java:747)
          at java.util.TimSort.mergeAt(TimSort.java:483)
          at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
          at java.util.TimSort.sort(TimSort.java:223)
          at java.util.TimSort.sort(TimSort.java:173)
          at java.util.Arrays.sort(Arrays.java:659)
          at java.util.Collections.sort(Collections.java:217)
          at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:542)
          at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:436)
          at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:359)
          at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:315)
          at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:158)
          at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:101)

          Show
          Tilman Hausherr added a comment - The attached file brings also the exception, although at a different place: java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeLo(TimSort.java:747) at java.util.TimSort.mergeAt(TimSort.java:483) at java.util.TimSort.mergeForceCollapse(TimSort.java:426) at java.util.TimSort.sort(TimSort.java:223) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:542) at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:436) at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:359) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:315) at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:158) at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:101)
          Hide
          Tilman Hausherr added a comment -

          I get the exception on page 11 of the file of PDFBOX-1730 (page 12 when looking at it with acrobat viewer, it's a table). It does not crash with the change. Sort by position is set to true.

          Show
          Tilman Hausherr added a comment - I get the exception on page 11 of the file of PDFBOX-1730 (page 12 when looking at it with acrobat viewer, it's a table). It does not crash with the change. Sort by position is set to true.
          Hide
          SCHAEFER B.S. added a comment - - edited

          As Andreas Lehmkühler pointed out, the problem lies in the check of overlays in the code, but changing the code breaks previous implementations.
          We did a try/catch around the sort to avoid the break of the sort algorithm
          <code>
          try

          { Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getInstance()); }

          catch (Exception ex)

          { // Sort algorithm break contract -> do sorting in safemode Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getSafeInstance()); }

          </code>

          and modified the compare method (have a look at attached WFI_PDFParser_TextPostionComparator.txt). Maybe this helps ...

          Show
          SCHAEFER B.S. added a comment - - edited As Andreas Lehmkühler pointed out, the problem lies in the check of overlays in the code, but changing the code breaks previous implementations. We did a try/catch around the sort to avoid the break of the sort algorithm <code> try { Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getInstance()); } catch (Exception ex) { // Sort algorithm break contract -> do sorting in safemode Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getSafeInstance()); } </code> and modified the compare method (have a look at attached WFI_PDFParser_TextPostionComparator.txt). Maybe this helps ...
          Hide
          Florent Guillaume added a comment -

          This really needs to be fixed... Crashing under Java 7 is not really a good thing.

          Show
          Florent Guillaume added a comment - This really needs to be fixed... Crashing under Java 7 is not really a good thing.
          Hide
          Hannes Erven added a comment -

          I'm not sure what Thiyaga's comment refers to (the problem itself or the workaround). I've just verified that the problem can be reproduced using Oracle's 1.7.0_21 .

          Show
          Hannes Erven added a comment - I'm not sure what Thiyaga's comment refers to (the problem itself or the workaround). I've just verified that the problem can be reproduced using Oracle's 1.7.0_21 .
          Hide
          Thiyagarajan Selvaraj added a comment -

          It is working fine with the Java build 1.7.0_17-b02.

          Show
          Thiyagarajan Selvaraj added a comment - It is working fine with the Java build 1.7.0_17-b02.
          Hide
          Hannes Erven added a comment -

          Another file that triggers exactly this issue.

          Show
          Hannes Erven added a comment - Another file that triggers exactly this issue.
          Hide
          Andreas Lehmkühler added a comment -

          Thanks for the hint, that's a good work around to avoid the exception. But in the long run we have to solve this issue somehow.

          Show
          Andreas Lehmkühler added a comment - Thanks for the hint, that's a good work around to avoid the exception. But in the long run we have to solve this issue somehow.
          Hide
          Li Xu added a comment -

          I think I've figured out the workaround. Oracle's doc is correct that you can indeed use the java.util.Arrays.useLegacyMergeSort override. However, I first tried it with System.setProperty() and it did not work. When I pass it to jvm as -Djava.util.Arrays.useLegacyMergeSort=true, I'm now able to avoid this error.

          Show
          Li Xu added a comment - I think I've figured out the workaround. Oracle's doc is correct that you can indeed use the java.util.Arrays.useLegacyMergeSort override. However, I first tried it with System.setProperty() and it did not work. When I pass it to jvm as -Djava.util.Arrays.useLegacyMergeSort=true, I'm now able to avoid this error.
          Hide
          Li Xu added a comment -

          http://www.oracle.com/technetwork/java/javase/compatibility-417013.html#source suggests that using the system property java.util.Arrays.useLegacyMergeSort would revert the sort behavior back to 1.6. I'm unable to make it work. Any of you tried?

          ================
          Area: API: Utilities
          Synopsis: Updated sort behavior for Arrays and Collections may throw an IllegalArgumentException
          Description: The sorting algorithm used by java.util.Arrays.sort and (indirectly) by java.util.Collections.sort has been replaced. The new sort implementation may throw an IllegalArgumentException if it detects a Comparable that violates the Comparable contract. The previous implementation silently ignored such a situation.
          If the previous behavior is desired, you can use the new system property, java.util.Arrays.useLegacyMergeSort, to restore previous mergesort behavior.
          Nature of Incompatibility: behavioral
          RFE: 6804124

          Show
          Li Xu added a comment - http://www.oracle.com/technetwork/java/javase/compatibility-417013.html#source suggests that using the system property java.util.Arrays.useLegacyMergeSort would revert the sort behavior back to 1.6. I'm unable to make it work. Any of you tried? ================ Area: API: Utilities Synopsis: Updated sort behavior for Arrays and Collections may throw an IllegalArgumentException Description: The sorting algorithm used by java.util.Arrays.sort and (indirectly) by java.util.Collections.sort has been replaced. The new sort implementation may throw an IllegalArgumentException if it detects a Comparable that violates the Comparable contract. The previous implementation silently ignored such a situation. If the previous behavior is desired, you can use the new system property, java.util.Arrays.useLegacyMergeSort, to restore previous mergesort behavior. Nature of Incompatibility: behavioral RFE: 6804124
          Hide
          Andreas Lehmkühler added a comment -

          I investigated some time into this and I still run into the exception. I guess the problem is, that we can't just compare the absolute position of the textpositions to be compared. In some case we have to take the context into account, e.g. if a text contains super- or subscript parts those characters have to be sorted using the x-coordinate and the fact the two characters belong together. In such cases the y-ccordinate doesn't matter. This can lead to the described exception depending on the starting point (number of textposition, starting order etc.)

          So I'm not sure if there is a solution, maybe we have to implement our own sort algo.

          Show
          Andreas Lehmkühler added a comment - I investigated some time into this and I still run into the exception. I guess the problem is, that we can't just compare the absolute position of the textpositions to be compared. In some case we have to take the context into account, e.g. if a text contains super- or subscript parts those characters have to be sorted using the x-coordinate and the fact the two characters belong together. In such cases the y-ccordinate doesn't matter. This can lead to the described exception depending on the starting point (number of textposition, starting order etc.) So I'm not sure if there is a solution, maybe we have to implement our own sort algo.
          Hide
          Andreas Lehmkühler added a comment -

          The pdf attached to PDFBOX-161 triggers the exception

          Show
          Andreas Lehmkühler added a comment - The pdf attached to PDFBOX-161 triggers the exception
          Hide
          Andreas Lehmkühler added a comment -

          Unfortunately your fix breaks the text extraction tests. I guess we have to dig deeper into it.

          Show
          Andreas Lehmkühler added a comment - Unfortunately your fix breaks the text extraction tests. I guess we have to dig deeper into it.
          Hide
          Andreas Lehmkühler added a comment -

          Thanks for your fix. Please attach either the pdf in question or provide as with a set of textpostion values so that we can create a unit test.

          Show
          Andreas Lehmkühler added a comment - Thanks for your fix. Please attach either the pdf in question or provide as with a set of textpostion values so that we can create a unit test.
          Hide
          Benjamin Papez added a comment -

          I modified the implementation and tried with the attached file, which no longer braught the exception.

          Show
          Benjamin Papez added a comment - I modified the implementation and tried with the attached file, which no longer braught the exception.

            People

            • Assignee:
              Unassigned
              Reporter:
              Benjamin Papez
            • Votes:
              12 Vote for this issue
              Watchers:
              20 Start watching this issue

              Dates

              • Created:
                Updated:

                Development