Tika / TIKA-1228

Embedded files not extracted properly from PDF

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
    • Environment:

      CentOS 6.5 VM

      Description

      According to the PDFBox example here:

      http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java

      the Tika PDF parser does not check for additional entries under the Kids node when the Names node does not exist.
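
The lookup order described above can be illustrated with a minimal, library-free sketch. This is not the Tika or PDFBox code: `NameTreeSketch`, `Node`, `names`, `kids`, and `collectEmbedded` are hypothetical names that model a name-tree node as a plain Java object, on the assumption (taken from the linked example) that a node without a Names entry keeps its entries one level down, under Kids.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NameTreeSketch {
    /** Hypothetical stand-in for a name-tree node; either field may be null. */
    static class Node {
        Map<String, byte[]> names; // name -> embedded file bytes, or null when absent
        List<Node> kids;           // child nodes, or null when absent
    }

    /**
     * Reads Names from the node itself; when Names is absent,
     * falls back to the Names of each direct kid (one level only).
     */
    static Map<String, byte[]> collectEmbedded(Node root) {
        Map<String, byte[]> found = new LinkedHashMap<>();
        if (root == null) {
            return found;
        }
        if (root.names != null) {
            found.putAll(root.names);
        } else if (root.kids != null) {
            for (Node kid : root.kids) {
                if (kid != null && kid.names != null) {
                    found.putAll(kid.names);
                }
            }
        }
        return found;
    }
}
```

A parser that stops after the first `if` would report no attachments for a document whose root node has only Kids, which is the behaviour reported here.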

        Activity

        Nick Burch added a comment -

        Do you have a file which shows up the problem? And if so, any chance you could write a short junit unit test that highlights the issue using it?

        Jason Sherman added a comment -

        Sorry about that. I meant to attach the file in the first place. This is a file I was given to use for development testing. I'll get a test written as soon as I can, probably later today. Thanks for the quick response.

        Tim Allison added a comment -

        Fixed in r1564042.

        Thank you, Jason Sherman, for reporting this and diagnosing the cause and solution for this bug!

        I'm resolving this for now. I'm waiting to hear back from users@pdfbox to see if we should search recursively for non-null attachment data. The example that you provided does show only checking the children. I'll reopen this issue if we need to switch to full recursion.

        Thank you, again.
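
The difference between checking only the direct children and searching the whole tree can be sketched without any PDF library. The names below (`RecursiveNameTreeSketch`, `Node`, `collectAll`, `maxDepth`) are hypothetical and are not part of Tika or PDFBox; the depth limit is an added assumption, meant to guard against malformed or cyclic trees.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecursiveNameTreeSketch {
    /** Hypothetical stand-in for a name-tree node; either field may be null. */
    static class Node {
        Map<String, byte[]> names; // name -> attachment bytes (a value may be null)
        List<Node> kids;           // child nodes, or null when absent
    }

    /** Collects every non-null attachment at any depth up to maxDepth (root is depth 0). */
    static Map<String, byte[]> collectAll(Node root, int maxDepth) {
        Map<String, byte[]> found = new LinkedHashMap<>();
        walk(root, 0, maxDepth, found);
        return found;
    }

    private static void walk(Node node, int depth, int maxDepth, Map<String, byte[]> found) {
        if (node == null || depth > maxDepth) {
            return;
        }
        if (node.names != null) {
            for (Map.Entry<String, byte[]> e : node.names.entrySet()) {
                if (e.getValue() != null) { // keep only entries that carry data
                    found.put(e.getKey(), e.getValue());
                }
            }
        }
        if (node.kids != null) {
            for (Node kid : node.kids) {
                walk(kid, depth + 1, maxDepth, found);
            }
        }
    }
}
```

With `maxDepth` set to 1 this behaves like a children-only check; a larger value gives the full recursion discussed above.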

        Jason Sherman added a comment (edited) -

        Thanks for the help. Another possibly related issue is:
        When I was stepping through the PDFBox code, line 286 in PDNameTreeNode threw an exception at run time, but the same expression evaluated without error in my debugger's evaluation dialog (IntelliJ 13):

        namesArray = (COSArray)((COSDictionary)((COSArray)node.getDictionaryObject(COSName.KIDS)).get(0)).getDictionaryObject(COSName.NAMES);

        Throws:
        org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSDictionary

        Do you want to pass that on to the pdfbox folks, or should I report it separately?

        Tim Allison added a comment -

        Not sure I understand. Is this the snippet that you refer to in PDNameTreeNode:

            public Map<String, COSObjectable> getNames() throws IOException
            {
                COSArray namesArray = (COSArray)node.getDictionaryObject( COSName.NAMES );
        

        The above throws a class cast exception, but the code that you show doesn't?

        Are you getting a class cast exception on the document that you submitted with this issue or is it a different document?

        Thank you, again.

        Jason Sherman added a comment -

        Tim,

        I saw you already added a test and fix to the codebase. Thanks! I'm going to clone it and use it if you don't mind.

        Jason

        Jason Sherman added a comment -

        Tim,

        Dang. During my troubleshooting, I first updated pdfbox to 1.8.3 and was using that source to step through the code. After the weirdness with the exception in code, but not in my expression evaluator, I reverted to the original tika code, but failed to revert the pdfbox code. I apologize for the confusion. Thanks again for your fast responses.

        Jason

        Tim Allison added a comment -

        Yes. That's the point of open source. Enjoy!

        Now that I'm looking at this issue again, I dragged out some of my pre-Tika code for PDF attachments, written against a different PDF library. It looks like the PDF files I was coding against could have the file name in a parent node and the actual bytes in a child or more distant descendant node.

        Will see if I can dig up the triggering files and see if Tika needs any more mods on PDF attachment extraction.

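        // Walks a file-specification dictionary. "F" may hold the file name (a string)
        // or the file bytes (a stream); "EF" may hold a nested dictionary to descend into.
        private MyPDFAttachment lookForByteStream(COSDictionary dict, MyPDFAttachment attach, int recursiveDepth){
            COSName fCOSName = COSName.create("F");
            COSName efCOSName = COSName.create("EF");
            COSObject fObj = dict.get(fCOSName);
            COSObject efObj = dict.get(efCOSName);
            if (null != fObj){
                if (fObj.getClass() == COSString.class){
                    // The name lives at this level; the bytes may be further down.
                    attach.setName(fObj.stringValue());
                } else if (fObj.getClass() == COSStream.class){
                    // Found the bytes: the attachment is complete.
                    attach.setBytes(((COSStream)fObj).getDecodedBytes());
                    return attach;
                }
            }
            if (null != efObj && efObj.getClass() == COSDictionary.class){
                // Descend into the nested dictionary. Note that recursiveDepth is
                // tracked but never checked against a limit in this snippet.
                return lookForByteStream((COSDictionary)efObj, attach, recursiveDepth + 1);
            }
            // No byte stream found on this path.
            return null;
        }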
        
        Tim Allison added a comment -

        Ok, to confirm, the PDNameTreeNode class cast exception is a non-issue?

        Thanks again.

        Jason Sherman added a comment (edited) -

        Correct. PDNameTreeNode class cast exception is a non-issue.


          People

          • Assignee: Unassigned
          • Reporter: Jason Sherman
          • Votes: 0
          • Watchers: 3