Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser
    • Environment: Java 6.0, Ubuntu

      Description

      Is there an estimated date for implementing default parsing of signed documents, such as signed PDF files (PKCS7, .p7s format)?

      1. PDF para teste indexação conteúdo.pdf
        19 kB
        Fausto Cruzeiro de Moraes
      2. PDFsigned.pdf.p7s
        23 kB
        Fausto Cruzeiro de Moraes

        Activity

        Jukka Zitting made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Jukka Zitting [ jukkaz ]
        Fix Version/s 1.2 [ 12320169 ]
        Fix Version/s 1.0 [ 12317967 ]
        Resolution Fixed [ 1 ]
        Jukka Zitting added a comment -

        In revision 1355724 I added a simple org.apache.tika.parser.crypto.Pkcs7Parser class that can parse the attached p7s file using Bouncy Castle.

        Nick Burch added a comment -

        I can't seem to find any information on how the PKCS7 wrapping takes place, nor how to unwrap it. Without knowing that, we can't write anything that uses BouncyCastle (or similar) to unpack it.

        Are you able to track down any information on how it's done?

        Fausto Cruzeiro de Moraes added a comment -

        Hi Nick!

        I have just attached two sample files, as you requested.

        Thank you very much!

        Fausto Cruzeiro de Moraes made changes -
        Field Original Value New Value
        Attachment PDF para teste indexação conteúdo.pdf [ 12526011 ]
        Attachment PDFsigned.pdf.p7s [ 12526012 ]
        Nick Burch added a comment -

        We still can't help you very much without a (small) sample file. Any chance you could upload one?

        If your PDFs really are wrapped in PKCS7, then we'll need something that unpacks the PKCS7 wrapper and, for signed files, triggers the recursing parser for the contents (signed files only initially, since there is no way yet to supply the private key for encrypted ones). I think BouncyCastle might help with this; it's worth a look to start with.

        In r1331634 I've added some mime magic for PKCS7 files. I'm not sure if it's quite right or not, but it seems OK for the few files I've tried. It'll need someone who knows the PKCS format (or maybe just DER encoding?) to be sure though. Ideally, we should distinguish between signed, encrypted and signed+encrypted, but I'm not sure how we do that...
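On telling signed, encrypted and signed+encrypted files apart: a PKCS#7 file starts with a ContentInfo structure whose first field is a content-type OID, and that OID differs per variant (1.2.840.113549.1.7.2 for signedData, .3 for envelopedData, .4 for signedAndEnvelopedData, .6 for encryptedData). The following is a minimal Java sketch of that check, assuming a binary DER/BER-encoded file (not PEM/base64). The class and method names (`Pkcs7Sniffer`, `contentType`) are hypothetical, and this is not the mime magic added in r1331634 nor any existing Tika code.

```java
final class Pkcs7Sniffer {
    // DER encoding of the OID prefix 1.2.840.113549.1.7 (tag 0x06, length 9);
    // the ninth OID byte, which selects the variant, follows this prefix.
    private static final byte[] OID_PREFIX = {
        0x06, 0x09, 0x2A, (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7, 0x0D, 0x01, 0x07
    };

    // Hypothetical helper: classifies the leading bytes of a file.
    static String contentType(byte[] der) {
        // ContentInfo is an ASN.1 SEQUENCE, so the first byte must be 0x30.
        if (der.length < 2 || der[0] != 0x30) {
            return "not-pkcs7";
        }
        int pos = 1;
        int first = der[pos++] & 0xFF;
        // Long-form length: the low 7 bits give the number of length bytes to skip.
        // Short form (< 0x80) and indefinite form (0x80) have no extra bytes.
        if (first > 0x80) {
            pos += first & 0x7F;
        }
        if (der.length < pos + OID_PREFIX.length + 1) {
            return "not-pkcs7";
        }
        for (int i = 0; i < OID_PREFIX.length; i++) {
            if (der[pos + i] != OID_PREFIX[i]) {
                return "not-pkcs7";
            }
        }
        switch (der[pos + OID_PREFIX.length]) {
            case 1: return "data";
            case 2: return "signed";
            case 3: return "enveloped";
            case 4: return "signed-and-enveloped";
            case 5: return "digested";
            case 6: return "encrypted";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        // Hand-built header: SEQUENCE, indefinite length, signedData OID.
        byte[] sample = {
            0x30, (byte) 0x80, 0x06, 0x09, 0x2A, (byte) 0x86, 0x48,
            (byte) 0x86, (byte) 0xF7, 0x0D, 0x01, 0x07, 0x02
        };
        System.out.println(contentType(sample)); // prints "signed"
    }
}
```

Only the first few bytes are inspected, so this fits a magic-style detector, but it says nothing about whether the rest of the structure is valid. A real unwrapping step would still need a CMS library such as Bouncy Castle.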

        Fausto Cruzeiro de Moraes added a comment -

        Hi Nick

        Do you have any tips or advice to help me with this?

        Thanks a lot

        Fausto Cruzeiro de Moraes added a comment -

        Hi Nick

        I am running Tika over two files: PDFnotsigned.pdf (original pdf document, application/pdf) and PDFsigned.pdf.p7s (digitally signed document, application/pkcs7-signature).

        1 - When running the statement: java -jar tika-app-1.0.jar -t PDFnotsigned.pdf > PDFnotsigned.pdf.txt, I get an output file with the expected content.

        2 - When running the statement: java -jar tika-app-1.0.jar -t PDFsigned.pdf > PDFsigned.pdf.txt, I get an output file with no content at all, just 0 KB.

        As far as I can tell, there is no default Tika filter for the application/pkcs7-signature MIME type...

        Thanks

        Nick Burch added a comment -

        Can you upload a small example file?

        When you try to detect it with Tika, what do you get? When you parse it, what do you get? And how do those two things differ from what you'd expect?

        Fausto Cruzeiro de Moraes added a comment -

        Hi Nick

        What I mean is that I really need to parse digitally signed (PKCS7, for example) PDF files, so that Jackrabbit 2.4.0 can extract and index their content.

        Thanks

        Nick Burch added a comment -

        Shortly after someone submits a patch for it! Unfortunately / fortunately (depending on your perspective), we're all volunteers here.

        In the meantime, it may help if you explain what doesn't work and/or what you'd expect to see.

        For example, I know we do support password-protected Microsoft Office and PDF files.

        Fausto Cruzeiro de Moraes created issue -

          People

          • Assignee: Jukka Zitting
          • Reporter: Fausto Cruzeiro de Moraes
          • Votes: 0
          • Watchers: 3

              Time Tracking

              • Original Estimate: 168h
              • Remaining Estimate: 168h
              • Time Spent: Not Specified
