PDFBox
  1. PDFBox
  2. PDFBOX-1512

TextPositionComparator is not compatible with Java 7

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1, 2.0.0
    • Fix Version/s: 1.8.8, 2.0.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Java 7

      Description

      The TextPostionCompartor causes the following exception running on Java 7: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison method violates its general contract!

      I think the problem is with this check:

      if ( yDifference < .1 ||
      (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
      (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))

      as it violates the contract requirement:

      The implementor must also ensure that the relation is transitive: ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.

      Finally, the implementor must ensure that compare(x, y)==0 implies that sgn(compare(x, z))==sgn(compare(y, z)) for all z.

      Java 7 now is strict and throws exceptions when the contract is violated.

      1. TextPositionComparator.java
        3 kB
        Benjamin Papez
      2. immo-kurier_arsenal_93x62.pdf
        1.63 MB
        Hannes Erven
      3. WFI_PDFParser_TextPostionComparator.txt
        3 kB
        SCHAEFER B.S.
      4. FOP-2252.pdf
        205 kB
        Tilman Hausherr
      5. illustration-of-inconsistent-sorting.png
        3 kB
        Hannes Erven
      6. TopoContained.txt
        0.0 kB
        Maruan Sahyoun
      7. TopoContained.pdf
        19 kB
        Maruan Sahyoun
      8. TopoOverlap.txt
        0.0 kB
        Maruan Sahyoun
      9. TopoOverlap.pdf
        18 kB
        Maruan Sahyoun
      10. Topo.txt
        0.0 kB
        Maruan Sahyoun
      11. Topo.pdf
        17 kB
        Maruan Sahyoun
      12. quicksort.patch
        8 kB
        Uwe

        Issue Links

          Activity

          Hide
          Tilman Hausherr added a comment -

          I'll make a better solution when Bertrand Gillis has answered my question in PDFBOX-2397.

          Show
          Tilman Hausherr added a comment - I'll make a better solution when Bertrand Gillis has answered my question in PDFBOX-2397 .
          Hide
          ASF subversion and git services added a comment -

          Commit 1631449 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
          [ https://svn.apache.org/r1631449 ]

          PDFBOX-1512: use quicksort if run in an applet

          Show
          ASF subversion and git services added a comment - Commit 1631449 from Tilman Hausherr in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1631449 ] PDFBOX-1512 : use quicksort if run in an applet
          Hide
          ASF subversion and git services added a comment -

          Commit 1631448 from Tilman Hausherr in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1631448 ]

          PDFBOX-1512: use quicksort if run in an applet

          Show
          ASF subversion and git services added a comment - Commit 1631448 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1631448 ] PDFBOX-1512 : use quicksort if run in an applet
          Hide
          Andreas Lehmkühler added a comment -

          AFAIU PDFBOX-2397 the issue isn't limited to applets but to signed jars as used with java webstart as well. To answer your question, IMHO in such cases the default value for "useCustomQuicksort" should be "true" to be sure that everything works well.

          Show
          Andreas Lehmkühler added a comment - AFAIU PDFBOX-2397 the issue isn't limited to applets but to signed jars as used with java webstart as well. To answer your question, IMHO in such cases the default value for "useCustomQuicksort" should be "true" to be sure that everything works well.
          Hide
          Tilman Hausherr added a comment -

          There's one thing that needs to be decided that is related to PDFBOX-2397 (applet security): what will we do when the applet is running in a sandbox without permissions, i.e. what would be the default setting? Applets are considered "dangerous", so the few people who still use them will probably run a recent JRE version.

          Show
          Tilman Hausherr added a comment - There's one thing that needs to be decided that is related to PDFBOX-2397 (applet security): what will we do when the applet is running in a sandbox without permissions, i.e. what would be the default setting? Applets are considered "dangerous", so the few people who still use them will probably run a recent JRE version.
          Hide
          Andreas Lehmkühler added a comment -

          I've applied the Uwes Patch to the trunk and the 1.8 branch.

          Thanks for the contribution!

          Show
          Andreas Lehmkühler added a comment - I've applied the Uwes Patch to the trunk and the 1.8 branch. Thanks for the contribution!
          Hide
          ASF subversion and git services added a comment -

          Commit 1631170 from Andreas Lehmkühler in branch 'pdfbox/branches/1.8'
          [ https://svn.apache.org/r1631170 ]

          PDFBOX-1512: don't use Collections.sort for JDKs >= 1.7 to avoid an IllegalArgumentException as proposed by Uwe Pachler

          Show
          ASF subversion and git services added a comment - Commit 1631170 from Andreas Lehmkühler in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1631170 ] PDFBOX-1512 : don't use Collections.sort for JDKs >= 1.7 to avoid an IllegalArgumentException as proposed by Uwe Pachler
          Hide
          ASF subversion and git services added a comment -

          Commit 1631169 from Andreas Lehmkühler in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1631169 ]

          PDFBOX-1512: don't use Collections.sort for JDKs >= 1.7 to avoid an IllegalArgumentException as proposed by Uwe Pachler

          Show
          ASF subversion and git services added a comment - Commit 1631169 from Andreas Lehmkühler in branch 'pdfbox/trunk' [ https://svn.apache.org/r1631169 ] PDFBOX-1512 : don't use Collections.sort for JDKs >= 1.7 to avoid an IllegalArgumentException as proposed by Uwe Pachler
          Hide
          Tilman Hausherr added a comment -

          To the 21 watchers of this issue: is there anyone of you who can either 1) test the patch of Uwe in your own application, or if you don't build from source 2) are willing to test a 1.8 or a 2.0 snapshot? (if you respond please specify)

          I can commit the patch, but I'm not using the text extraction so much, and I prefer a real world test, even if Uwe has provided a test.
          I would count Andreas Lehmkühler remark of June ("it should be possible to simply...") as a +1.

          Show
          Tilman Hausherr added a comment - To the 21 watchers of this issue: is there anyone of you who can either 1) test the patch of Uwe in your own application, or if you don't build from source 2) are willing to test a 1.8 or a 2.0 snapshot? (if you respond please specify) I can commit the patch, but I'm not using the text extraction so much, and I prefer a real world test, even if Uwe has provided a test. I would count Andreas Lehmkühler remark of June ("it should be possible to simply...") as a +1.
          Hide
          Uwe added a comment - - edited

          @Tilman:
          The problem with the workaround is that it needs to be done outside the application, you can't just activate it around a PDFBox call. It also affects EVERY sort operation in the entire JVM.

          With a single JVM on a dedicated server this may not be a concern, but with larger installations I find the workaround a little too invasive.
          You may not even have control over the JVM in which the library is running, so you can't apply the workaround in this case.

          Show
          Uwe added a comment - - edited @Tilman: The problem with the workaround is that it needs to be done outside the application, you can't just activate it around a PDFBox call. It also affects EVERY sort operation in the entire JVM. With a single JVM on a dedicated server this may not be a concern, but with larger installations I find the workaround a little too invasive. You may not even have control over the JVM in which the library is running, so you can't apply the workaround in this case.
          Hide
          Luis Filipe Nassif added a comment -

          +1 to Uwe proposed approach.

          Show
          Luis Filipe Nassif added a comment - +1 to Uwe proposed approach.
          Hide
          Tilman Hausherr added a comment -

          I agree with Uwe (although I didn't test the patch), although the workaround mentioned before (-Djava.util.Arrays.useLegacyMergeSort=true) would do the same (i.e. use a sort algorithm that doesn't detect the comparator flaw).

          Show
          Tilman Hausherr added a comment - I agree with Uwe (although I didn't test the patch), although the workaround mentioned before (-Djava.util.Arrays.useLegacyMergeSort=true) would do the same (i.e. use a sort algorithm that doesn't detect the comparator flaw).
          Hide
          Uwe added a comment -

          > The infallibility of the previous author is probably not a safe assumption.
          I wouldn't go so far as to to assume the infallibility of any coder

          The current algo will have its weaknesses, no doubt. But devising one that works reliably may not be as easy as it seems.

          Why don't we:

          • Fix the released versions (1.7.x, 1.8.x) with the patch I provided? It's safe to do so, because it won't change the algorithm. It just makes it work on Java 1.7+
          • Fix the trunk (2.0+) properly, with a new comparator/algorithm

          This way, everyone wins:

          • Users of the released versions (like myself) get a quick fix that they desperately need.
          • PDFBox Developers get a chance to clean up this issue properly in the trunk, and on the same token improve the text extraction feature

          What do you think?

          Show
          Uwe added a comment - > The infallibility of the previous author is probably not a safe assumption. I wouldn't go so far as to to assume the infallibility of any coder The current algo will have its weaknesses, no doubt. But devising one that works reliably may not be as easy as it seems. Why don't we: Fix the released versions (1.7.x, 1.8.x) with the patch I provided? It's safe to do so, because it won't change the algorithm. It just makes it work on Java 1.7+ Fix the trunk (2.0+) properly, with a new comparator/algorithm This way, everyone wins: Users of the released versions (like myself) get a quick fix that they desperately need. PDFBox Developers get a chance to clean up this issue properly in the trunk, and on the same token improve the text extraction feature What do you think?
          Hide
          John Hewson added a comment -

          > I suspect that the person who wrote the comparator in the first place had a good reason to write it in the way it is now.

          The infallibility of the previous author is probably not a safe assumption. I've just noticed that PDFBOX-731 actually raises some of the shortcomings in the existing sort algorithm.

          Show
          John Hewson added a comment - > I suspect that the person who wrote the comparator in the first place had a good reason to write it in the way it is now. The infallibility of the previous author is probably not a safe assumption. I've just noticed that PDFBOX-731 actually raises some of the shortcomings in the existing sort algorithm.
          Hide
          Uwe added a comment -

          No idea, I'm simply a user who needs a fix
          The patch I wrote is as unintrusive as it can be. On the other hand, using a different comparator will certainly change the behaviour of the library. I suspect that the person who wrote the comparator in the first place had a good reason to write it in the way it is now.

          But you can try and run the unit tests; if they dont't fail with that comparator of yours then it'll have my vote.

          Show
          Uwe added a comment - No idea, I'm simply a user who needs a fix The patch I wrote is as unintrusive as it can be. On the other hand, using a different comparator will certainly change the behaviour of the library. I suspect that the person who wrote the comparator in the first place had a good reason to write it in the way it is now. But you can try and run the unit tests; if they dont't fail with that comparator of yours then it'll have my vote.
          Hide
          Uwe added a comment -

          You're right, Tom, in principle we'd need a different solution here to make the whole thing water tight (fix transitivity).

          But the intention for my patch wasn't to make the comparison perfect. The motivation was to simply extend what is the status quo on JDK 6 (an imperfect but working text stripper) to JDK 7 and onwards (status quo: unpredictably throwing exceptions). People have been living with the imperfect comparator for years in their solutions, so they won't notice a change - only that they can now use JDK 7 safely.

          Don't get me wrong; if you can provide a comparator that solves the problem, post it here, I'd love to fix this properly.

          However, I've been holding back migrating my project to JDK7 for over a year now because of this issue - I'd much rather implement an imperfect but working solution today than having to wait for another year for an ideal fix.

          Show
          Uwe added a comment - You're right, Tom, in principle we'd need a different solution here to make the whole thing water tight (fix transitivity). But the intention for my patch wasn't to make the comparison perfect. The motivation was to simply extend what is the status quo on JDK 6 (an imperfect but working text stripper) to JDK 7 and onwards (status quo: unpredictably throwing exceptions). People have been living with the imperfect comparator for years in their solutions, so they won't notice a change - only that they can now use JDK 7 safely. Don't get me wrong; if you can provide a comparator that solves the problem, post it here, I'd love to fix this properly. However, I've been holding back migrating my project to JDK7 for over a year now because of this issue - I'd much rather implement an imperfect but working solution today than having to wait for another year for an ideal fix.
          Hide
          Tom Crossland added a comment -

          Part of the issue here is that TextPositionComparator breaks the contract for Comparator. Specifically, it doesn't ensure that the ordering is transitive. The Comparator interface specifies that:

          The implementor must also ensure that the relation is transitive: ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.

          This is currently not the case due to the vertical position tolerance. So TextPositionComparator is not a valid Comparator and it shouldn't be used for sorting. Using a custom Quicksort probably won't help much, as the ordering of the results will depend on the order in which elements are compared.

          Show
          Tom Crossland added a comment - Part of the issue here is that TextPositionComparator breaks the contract for Comparator . Specifically, it doesn't ensure that the ordering is transitive. The Comparator interface specifies that: The implementor must also ensure that the relation is transitive: ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0 . This is currently not the case due to the vertical position tolerance. So TextPositionComparator is not a valid Comparator and it shouldn't be used for sorting. Using a custom Quicksort probably won't help much, as the ordering of the results will depend on the order in which elements are compared.
          Hide
          Uwe added a comment - - edited

          I haven't yet, same as you I'm waiting on Andreas' feedback.

          @Andreas: How did the tests go? Anything I can do to help?

          Show
          Uwe added a comment - - edited I haven't yet, same as you I'm waiting on Andreas' feedback. @Andreas: How did the tests go? Anything I can do to help?
          Hide
          Kirakozov Alexander added a comment -

          Hi, I have the same problem with java8. Any news?

          Show
          Kirakozov Alexander added a comment - Hi, I have the same problem with java8. Any news?
          Hide
          Andreas Lehmkühler added a comment -

          That's good news. I already tried that some time ago but I failed for some reason. I'll run some additional tests.

          Show
          Andreas Lehmkühler added a comment - That's good news. I already tried that some time ago but I failed for some reason. I'll run some additional tests.
          Hide
          Uwe added a comment -

          Adds a stock standard quicksort implementation and code to use it in PDFTextStripper if JDK version >= 1.7

          Show
          Uwe added a comment - Adds a stock standard quicksort implementation and code to use it in PDFTextStripper if JDK version >= 1.7
          Hide
          Uwe added a comment -

          Hi,

          I wrote a patch to implement quicksort, see quicksort.patch.

          Note: While the patch is written against the PDFBox 1.7 branch, it should be easy to apply on the trunk as well. The patch adds this:

          • A Quicksort implementation and unit test
          • A check if we're running on JDK6 or less: If yes: use Collections.sort(), if no, use the QuickSort.sort()

          The custom quicksort implementation is probably a fair bit slower than the one in the JDK (on my box the unit test TestTextStripper takes about 13s instead of 3 (4x), which is the reason for the JDK check. Another option would be to simply catch the exception and only run quicksort as a fallback in this case.

          Uwe

          Show
          Uwe added a comment - Hi, I wrote a patch to implement quicksort, see quicksort.patch. Note: While the patch is written against the PDFBox 1.7 branch, it should be easy to apply on the trunk as well. The patch adds this: A Quicksort implementation and unit test A check if we're running on JDK6 or less: If yes: use Collections.sort(), if no, use the QuickSort.sort() The custom quicksort implementation is probably a fair bit slower than the one in the JDK (on my box the unit test TestTextStripper takes about 13s instead of 3 (4x), which is the reason for the JDK check. Another option would be to simply catch the exception and only run quicksort as a fallback in this case. Uwe
          Hide
          Andreas Lehmkühler added a comment - - edited

          To avoid misunderstandings, IMHO the comparison itself isn't broken it works well, but it breaks the contract of the sort algorithm of the Collections framework.

          The issue is that PDFBox not only uses the x,y values of a text position. In some cases the context is taken into account if two positions are compared which are neighbors. So that there are cases where there same combination of x,y values may lead to different results if the sorting is done in another order.

          Saying that, it should be possible to simply replace the Collections.sort() call with our own sort implementation (e.g. based on quicksort) using the very same TestPositionComparator.

          Maybe there is some place for an improvement:
          The whole text is splitted into text postition, one for each character, so that we have to sort all single characters. The information of text chunks/whole words/lines of text got lost. We could preserve that information within the TextPosition (number of chunk/ index within the chunk) to simplify the comparison.

          Show
          Andreas Lehmkühler added a comment - - edited To avoid misunderstandings, IMHO the comparison itself isn't broken it works well, but it breaks the contract of the sort algorithm of the Collections framework. The issue is that PDFBox not only uses the x,y values of a text position. In some cases the context is taken into account if two positions are compared which are neighbors. So that there are cases where there same combination of x,y values may lead to different results if the sorting is done in another order. Saying that, it should be possible to simply replace the Collections.sort() call with our own sort implementation (e.g. based on quicksort) using the very same TestPositionComparator. Maybe there is some place for an improvement: The whole text is splitted into text postition, one for each character, so that we have to sort all single characters. The information of text chunks/whole words/lines of text got lost. We could preserve that information within the TextPosition (number of chunk/ index within the chunk) to simplify the comparison.
          Hide
          Maruan Sahyoun added a comment -

          A series of sample files model after the chart in http://en.wikipedia.org/wiki/Topological_sorting together with the text extraction done by Adobe Reader.

          Show
          Maruan Sahyoun added a comment - A series of sample files model after the chart in http://en.wikipedia.org/wiki/Topological_sorting together with the text extraction done by Adobe Reader.
          Hide
          Maruan Sahyoun added a comment - - edited

          I’d think that we can find a sorting algorithm which can handle such cases. Before that what would be the expectation of the sorting result looking at the drawing Hannes provided? Shall we look at inspecting the results of other tools such as Adobe Reader and replicate their behavior?

          [Update] Adobe Reader would extract the sample to C,A,B

          I’m willing to look into solving the issue but would like to have some input on the end result first.

          Maruan

          Show
          Maruan Sahyoun added a comment - - edited I’d think that we can find a sorting algorithm which can handle such cases. Before that what would be the expectation of the sorting result looking at the drawing Hannes provided? Shall we look at inspecting the results of other tools such as Adobe Reader and replicate their behavior? [Update] Adobe Reader would extract the sample to C,A,B I’m willing to look into solving the issue but would like to have some input on the end result first. Maruan
          Hide
          Hannes Erven added a comment -

          The issue here is that the ordering rules are flawed and produce circular ranks.

          If my understanding is correct, the current rules to compare two textpositions A,B are:

          1) if the bottom positions are very close or the elements overlap vertically, order by horizontal position (left first)

          2) else, order by vertical position (top first).

          Check out the drawing I'll attach and the following comparations a sorting algo would do:

          A-B: vertical overlap, rule 1, left wins, A<B
          B-C: no overlap, not close, rule 2, upper bottom wins, B<C
          A-C: vertical overlap, rule 1, left wins, C<A

          So in a nutshell, A comes before B; but C shall be inserted both after B and before A, which is inconsistent.

          Just by looking at the drawing I don't have any idea what a meaningful ordering of these boxes would be anyways...

          Show
          Hannes Erven added a comment - The issue here is that the ordering rules are flawed and produce circular ranks. If my understanding is correct, the current rules to compare two textpositions A,B are: 1) if the bottom positions are very close or the elements overlap vertically, order by horizontal position (left first) 2) else, order by vertical position (top first). Check out the drawing I'll attach and the following comparations a sorting algo would do: A-B: vertical overlap, rule 1, left wins, A<B B-C: no overlap, not close, rule 2, upper bottom wins, B<C A-C: vertical overlap, rule 1, left wins, C<A So in a nutshell, A comes before B; but C shall be inserted both after B and before A, which is inconsistent. Just by looking at the drawing I don't have any idea what a meaningful ordering of these boxes would be anyways...
          Hide
          John Hewson added a comment - - edited

          Some algorithms don't care, which may result in inconsistent ordering across multiple calls

          I don't see how this would happen unless the algorithm was randomised. For a given input the output should always be the same, regardless.

          But as you say the ordering is not sufficiently defined, so there may be more than one solution. Perhaps we need some more sophisticated rules for determining reading order? Off the top of my head, it seems like Topological sorting may be of relevance.

          Show
          John Hewson added a comment - - edited Some algorithms don't care, which may result in inconsistent ordering across multiple calls I don't see how this would happen unless the algorithm was randomised. For a given input the output should always be the same, regardless. But as you say the ordering is not sufficiently defined, so there may be more than one solution. Perhaps we need some more sophisticated rules for determining reading order? Off the top of my head, it seems like Topological sorting may be of relevance.
          Hide
          Hannes Erven added a comment -

          The issue is not related to a specific sorting algorithm. At the moment, the desired order of the elements is not sufficiently defined.

          Some algorithms don't care, which may result in inconsistent ordering across multiple calls, some algorithms (as those Java7 defaults to) detect this and throw an exception.

          I did try hacking the Comparator, but so far it doesn't pass the TextExtract test cases

          Show
          Hannes Erven added a comment - The issue is not related to a specific sorting algorithm. At the moment, the desired order of the elements is not sufficiently defined. Some algorithms don't care, which may result in inconsistent ordering across multiple calls, some algorithms (as those Java7 defaults to) detect this and throw an exception. I did try hacking the Comparator, but so far it doesn't pass the TextExtract test cases
          Hide
          John Hewson added a comment -

          Perhaps we should migrate away from using Collections.sort altogether and use some other sorting algorithm?

          Show
          John Hewson added a comment - Perhaps we should migrate away from using Collections.sort altogether and use some other sorting algorithm?
          Hide
          Florent Guillaume added a comment -

          Could someone knowledgeable please take a stab at this? Otherwise PDFBox quite often crashes without -Djava.util.Arrays.useLegacyMergeSort=true...

          Show
          Florent Guillaume added a comment - Could someone knowledgeable please take a stab at this? Otherwise PDFBox quite often crashes without -Djava.util.Arrays.useLegacyMergeSort=true ...
          Hide
          Tilman Hausherr added a comment -

          The attached file brings also the exception, although at a different place:
          java.lang.IllegalArgumentException: Comparison method violates its general contract!
          at java.util.TimSort.mergeLo(TimSort.java:747)
          at java.util.TimSort.mergeAt(TimSort.java:483)
          at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
          at java.util.TimSort.sort(TimSort.java:223)
          at java.util.TimSort.sort(TimSort.java:173)
          at java.util.Arrays.sort(Arrays.java:659)
          at java.util.Collections.sort(Collections.java:217)
          at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:542)
          at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:436)
          at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:359)
          at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:315)
          at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:158)
          at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:101)

          Show
          Tilman Hausherr added a comment - The attached file brings also the exception, although at a different place: java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeLo(TimSort.java:747) at java.util.TimSort.mergeAt(TimSort.java:483) at java.util.TimSort.mergeForceCollapse(TimSort.java:426) at java.util.TimSort.sort(TimSort.java:223) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:542) at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:436) at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:359) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:315) at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:158) at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:101)
          Hide
          Tilman Hausherr added a comment -

          I get the exception on page 11 of the file of PDFBOX-1730 (page 12 when looking at it with acrobat viewer, it's a table). It does not crash with the change. Sort by position is set to true.

          Show
          Tilman Hausherr added a comment - I get the exception on page 11 of the file of PDFBOX-1730 (page 12 when looking at it with acrobat viewer, it's a table). It does not crash with the change. Sort by position is set to true.
          Hide
          SCHAEFER B.S. added a comment - - edited

          As Andreas Lehmkühler pointed out, the problem lies in the check of overlays in the code, but changing the code breaks previous implementations.
          We did a try/catch around the sort to avoid the break of the sort algorithm
          <code>
          try

          { Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getInstance()); }

          catch (Exception ex)

          { // Sort algorithm break contract -> do sorting in safemode Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getSafeInstance()); }

          </code>

          and modified the compare method (have a look at attached WFI_PDFParser_TextPostionComparator.txt). Maybe this helps ...

          Show
          SCHAEFER B.S. added a comment - - edited As Andreas Lehmkühler pointed out, the problem lies in the check of overlays in the code, but changing the code breaks previous implementations. We did a try/catch around the sort to avoid the break of the sort algorithm <code> try { Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getInstance()); } catch (Exception ex) { // Sort algorithm break contract -> do sorting in safemode Collections.sort(textList, WFI_PDFParser_TextPositionComparator.getSafeInstance()); } </code> and modified the compare method (have a look at attached WFI_PDFParser_TextPostionComparator.txt). Maybe this helps ...
          Hide
          Florent Guillaume added a comment -

          This really needs to be fixed... Crashing under Java 7 is not really a good thing.

          Show
          Florent Guillaume added a comment - This really needs to be fixed... Crashing under Java 7 is not really a good thing.
          Hide
          Hannes Erven added a comment -

          I'm not sure what Thiyaga's comment refers to (the problem itself or the workaround). I've just verified that the problem can be reproduced using Oracle's 1.7.0_21 .

          Show
          Hannes Erven added a comment - I'm not sure what Thiyaga's comment refers to (the problem itself or the workaround). I've just verified that the problem can be reproduced using Oracle's 1.7.0_21 .
          Hide
          Thiyagarajan Selvaraj added a comment -

          It is working fine with the Java build 1.7.0_17-b02.

          Show
          Thiyagarajan Selvaraj added a comment - It is working fine with the Java build 1.7.0_17-b02.
          Hide
          Hannes Erven added a comment -

          Another file that triggers exactly this issue.

          Show
          Hannes Erven added a comment - Another file that triggers exactly this issue.
          Hide
          Andreas Lehmkühler added a comment -

          Thanks for the hint, that's a good work around to avoid the exception. But in the long run we have to solve this issue somehow.

          Show
          Andreas Lehmkühler added a comment - Thanks for the hint, that's a good work around to avoid the exception. But in the long run we have to solve this issue somehow.
          Hide
          Li Xu added a comment -

          I think I've figured out the workaround. Oracle's doc is correct that you can indeed use the java.util.Arrays.useLegacyMergeSort override. However, I first tried it with System.setProperty() and it did not work. When I pass it to jvm as -Djava.util.Arrays.useLegacyMergeSort=true, I'm now able to avoid this error.

          Show
          Li Xu added a comment - I think I've figured out the workaround. Oracle's doc is correct that you can indeed use the java.util.Arrays.useLegacyMergeSort override. However, I first tried it with System.setProperty() and it did not work. When I pass it to jvm as -Djava.util.Arrays.useLegacyMergeSort=true, I'm now able to avoid this error.
          Hide
          Li Xu added a comment -

          http://www.oracle.com/technetwork/java/javase/compatibility-417013.html#source suggests that using the system property java.util.Arrays.useLegacyMergeSort would revert the sort behavior back to 1.6. I'm unable to make it work. Any of you tried?

          ================
          Area: API: Utilities
          Synopsis: Updated sort behavior for Arrays and Collections may throw an IllegalArgumentException
          Description: The sorting algorithm used by java.util.Arrays.sort and (indirectly) by java.util.Collections.sort has been replaced. The new sort implementation may throw an IllegalArgumentException if it detects a Comparable that violates the Comparable contract. The previous implementation silently ignored such a situation.
          If the previous behavior is desired, you can use the new system property, java.util.Arrays.useLegacyMergeSort, to restore previous mergesort behavior.
          Nature of Incompatibility: behavioral
          RFE: 6804124

          Show
          Li Xu added a comment - http://www.oracle.com/technetwork/java/javase/compatibility-417013.html#source suggests that using the system property java.util.Arrays.useLegacyMergeSort would revert the sort behavior back to 1.6. I'm unable to make it work. Any of you tried? ================ Area: API: Utilities Synopsis: Updated sort behavior for Arrays and Collections may throw an IllegalArgumentException Description: The sorting algorithm used by java.util.Arrays.sort and (indirectly) by java.util.Collections.sort has been replaced. The new sort implementation may throw an IllegalArgumentException if it detects a Comparable that violates the Comparable contract. The previous implementation silently ignored such a situation. If the previous behavior is desired, you can use the new system property, java.util.Arrays.useLegacyMergeSort, to restore previous mergesort behavior. Nature of Incompatibility: behavioral RFE: 6804124
          Hide
          Andreas Lehmkühler added a comment -

          I investigated some time into this and I still run into the exception. I guess the problem is, that we can't just compare the absolute position of the textpositions to be compared. In some case we have to take the context into account, e.g. if a text contains super- or subscript parts those characters have to be sorted using the x-coordinate and the fact the two characters belong together. In such cases the y-ccordinate doesn't matter. This can lead to the described exception depending on the starting point (number of textposition, starting order etc.)

          So I'm not sure if there is a solution, maybe we have to implement our own sort algo.

          Show
          Andreas Lehmkühler added a comment - I investigated some time into this and I still run into the exception. I guess the problem is, that we can't just compare the absolute position of the textpositions to be compared. In some case we have to take the context into account, e.g. if a text contains super- or subscript parts those characters have to be sorted using the x-coordinate and the fact the two characters belong together. In such cases the y-ccordinate doesn't matter. This can lead to the described exception depending on the starting point (number of textposition, starting order etc.) So I'm not sure if there is a solution, maybe we have to implement our own sort algo.
          Hide
          Andreas Lehmkühler added a comment -

          The pdf attached to PDFBOX-161 triggers the exception

          Show
          Andreas Lehmkühler added a comment - The pdf attached to PDFBOX-161 triggers the exception
          Hide
          Andreas Lehmkühler added a comment -

          Unfortunately your fix breaks the text extraction tests. I guess we have to dig deeper into it.

          Show
          Andreas Lehmkühler added a comment - Unfortunately your fix breaks the text extraction tests. I guess we have to dig deeper into it.
          Hide
          Andreas Lehmkühler added a comment -

          Thanks for your fix. Please attach either the pdf in question or provide as with a set of textpostion values so that we can create a unit test.

          Show
          Andreas Lehmkühler added a comment - Thanks for your fix. Please attach either the pdf in question or provide as with a set of textpostion values so that we can create a unit test.
          Hide
          Benjamin Papez added a comment -

          I modified the implementation and tried with the attached file, which no longer braught the exception.

          Show
          Benjamin Papez added a comment - I modified the implementation and tried with the attached file, which no longer braught the exception.

            People

            • Assignee:
              Andreas Lehmkühler
              Reporter:
              Benjamin Papez
            • Votes:
              12 Vote for this issue
              Watchers:
              22 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development