SOLR-1786

Solr (trunk rev. 912116) suffers from PDFBOX-537 [Endless loop in org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary()] fixed in PDFbox 1.0?

    Details

      Description

      I tried indexing several thousand PDF documents but could not finish as Solr was falling into an endless loop for some of them, for instance: http://cdsweb.cern.ch/record/702585/files/sl-note-2000-019.pdf (the PDF seems OK).
      Can Solr start using PDFbox 1.0?
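
A per-document time limit is one way to keep a batch indexing run from stalling on a file that sends the parser into a loop. The sketch below is a minimal illustration using only the JDK. The class name `BoundedExtraction`, the `Status` enum, and the `Callable<String>` standing in for the actual extraction call are all assumptions made for this example; none of them are part of Solr, Tika, or PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedExtraction {

    /** Outcome of one extraction attempt. */
    public enum Status { OK, FAILED, TIMED_OUT }

    /**
     * Runs a (hypothetical) extraction task on a worker thread and waits at
     * most timeoutMillis for it. The Callable stands in for whatever code
     * hands the document to the parser.
     */
    public static Status runWithTimeout(Callable<String> extraction, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> result = executor.submit(extraction);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Status.OK;
        } catch (TimeoutException e) {
            // Interrupts the worker; see the note below about its limits.
            result.cancel(true);
            return Status.TIMED_OUT;
        } catch (ExecutionException e) {
            // The extraction threw an exception; record it and move on.
            return Status.FAILED;
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting.
            Thread.currentThread().interrupt();
            result.cancel(true);
            return Status.FAILED;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes, one that throws, and one that never returns in time.
        System.out.println(runWithTimeout(() -> "extracted text", 1000));
        System.out.println(runWithTimeout(() -> {
            throw new IllegalStateException("corrupt stream");
        }, 1000));
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never reached";
        }, 200));
    }
}
```

Note that `cancel(true)` only interrupts the worker thread. A parser loop that never checks the interrupt flag would keep consuming CPU, so running the extraction in a separate process may be the more robust choice for large batches.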

        Activity

        Hoss Man added a comment -

        The initial problem report was specifically about an endless loop that could be avoided by rolling back Tika. That endless loop was already fixed by upgrading Tika in Solr 3.1, which causes a hard error instead.

        (There are always going to be some files that can't be parsed, and since we're delegating to PDFBox (via Tika) it's not really something we can worry too much about).

        Hoss Man added a comment -

        Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

        Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

        Jan Høydahl added a comment -

        Perhaps same root cause as TIKA-617 ?

        Jan Høydahl added a comment -

        Tested the linked PDF file with tika-app-1.1-SNAPSHOT.jar and it does not parse, even though I gave it 2G of RAM:

        java -jar target/tika-app-1.1-SNAPSHOT.jar http://cdsweb.cern.ch/record/702585/files/sl-note-2000-019.pdf -m
        
        [...]
        <p>ERROR - Stop reading corrupt stream
        WARN - java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        	at java.util.ArrayList.get(ArrayList.java:322)
        	at org.apache.pdfbox.util.operator.Concatenate.process(Concatenate.java:47)
        	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
        	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
        	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
        	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
        	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:105)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
        	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
        	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
        WARN - java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        	at java.util.ArrayList.get(ArrayList.java:322)
        [...]
        WARN - Bad Dictionary Declaration org.apache.pdfbox.io.PushBackInputStream@7433b121
        WARN - Invalid dictionary, found: '￿' but expected: '/'
        Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@6db22920
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
        	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
        	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
        Caused by: java.lang.NullPointerException
        	at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:368)
        	at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
        	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
        	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
        	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
        	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
        	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
        	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:105)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        	... 5 more
        
        

        Trying to extract using PDFBox 1.7 also failed:

        java -Xmx3G -jar pdfbox-app-1.7.0-SNAPSHOT.jar ExtractText -debug sl-note-2000-019.pdf
        [...]
        ExtractText failed with the following exception:
        java.io.EOFException: Unexpected end of ZLIB input stream
        	at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
        	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
        	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:115)
        	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
        	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
        	at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
        	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
        
        

        So you should probably pursue this on the PDFBox mailing list/JIRA, and then let a possible fix bubble up through Tika to Solr.
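
When passing a failure like the ones above to an upstream project, the innermost exception is usually the part worth quoting, since the outer wrapper only says that something went wrong inside the parser. The following is a small JDK-only sketch of unwrapping a cause chain. The class `RootCause` and its methods are hypothetical helpers written for this illustration, and the exception built in `main` is a stand-in, not output from Tika or PDFBox.

```java
public class RootCause {

    /** Follows getCause() links to the innermost exception, with a hop limit as a guard. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        int hops = 0;
        while (current.getCause() != null && current.getCause() != current && hops++ < 64) {
            current = current.getCause();
        }
        return current;
    }

    /** One-line summary: root exception class, message, and top stack frame if any. */
    public static String summarize(Throwable t) {
        Throwable root = rootCause(t);
        StackTraceElement[] frames = root.getStackTrace();
        String where = frames.length > 0 ? frames[0].toString() : "unknown location";
        return root.getClass().getName() + ": " + root.getMessage() + " at " + where;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapped failure: a RuntimeException whose cause is a NullPointerException.
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser", new NullPointerException());
        System.out.println(summarize(wrapped));
    }
}
```

Grouping failed documents by such a summary can show whether a batch fails for one reason or several, which makes for a more focused upstream report.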

        Jan Iwaszkiewicz added a comment -

        Thanks. I'm quite sure it is fixed. Unfortunately, I don't work on the CDS project anymore, and we also decided not to use PDFBox for textification (pdftotext instead). Please try to textify/index the PDF linked above to verify.

        Simon Willnauer added a comment -

        Can we close this issue? Jan, can you confirm?

        Steve Rowe added a comment -

        Solr Cell upgraded to Tika 0.8, which included PDFbox 1.1.0, in the Solr 3.1 release.

        The Solr 3.5 release will include Tika 0.10, which includes PDFbox 1.6.0.

        Likely this problem has been addressed.

        Jan, can you test Solr 3.1+ to confirm?

        Robert Muir added a comment -

        3.4 -> 3.5

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Hoss Man added a comment -

        Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

        A unique token for finding these 240 issues in the future: hossversioncleanup20100527

        Hoss Man added a comment -

        Marking Fix for 1.5. We shouldn't release without either moving forward or rolling back the version we use.

        (FYI: our PDFBox dependency is based on the tika dependency)


          People

          • Assignee: Unassigned
          • Reporter: Jan Iwaszkiewicz
          • Votes: 0
          • Watchers: 3
