TIKA-1663: Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      It might be useful to integrate Apache Commons Codec's DigestUtils and let users easily add MD5 or other supported hashes to the Metadata object.

      Anyone else find this of use?
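
A minimal sketch of the idea, not the eventual Tika implementation: it hashes a stream with the JDK's `MessageDigest` (swapped in here for Commons Codec's DigestUtils so the example has no dependencies) and stores the hex digest in a plain `Map` standing in for Tika's Metadata object. The class name and the `digest:MD5` key are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class DigestToMetadataSketch {

    // Reads the whole stream and returns its digest as lowercase hex.
    public static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for Tika's Metadata object (assumption for this sketch).
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        // "digest:MD5" is a hypothetical key name, not a confirmed Tika field.
        metadata.put("digest:MD5", hexDigest(new ByteArrayInputStream(content), "MD5"));
        System.out.println(metadata);
    }
}
```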

        Activity

        tallison@mitre.org Tim Allison added a comment -

        First draft of a DigestingParser. If anyone has a chance to check the mark/reset() stuff to see if there's a more efficient way of doing that, I'd greatly appreciate it.

        When this is ready, I'll add integration for tika-app, tika-server and tika-batch.
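
For readers unfamiliar with the mark/reset() concern, here is a rough sketch of the general technique, assuming a stream that supports marking: mark the stream, read it once to compute the digest, then reset it so a downstream parser can read the same bytes again. The class, method, and `markLimit` parameter are hypothetical; this is not the code from the patch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MarkResetDigestSketch {

    // Digests the stream, then rewinds it to where it started.
    // Fails if the stream holds more than markLimit bytes, because the
    // mark would no longer be valid beyond that point.
    public static byte[] digestAndRewind(InputStream in, String algorithm, int markLimit)
            throws IOException, NoSuchAlgorithmException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // Allow one extra byte so we can detect streams longer than markLimit.
        in.mark(markLimit + 1);
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while (total <= markLimit
                && (n = in.read(buffer, 0, Math.min(buffer.length, markLimit + 1 - total))) != -1) {
            md.update(buffer, 0, n);
            total += n;
        }
        in.reset();
        if (total > markLimit) {
            throw new IOException("stream is longer than the mark limit of " + markLimit);
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "hello world".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(content));
        byte[] digest = digestAndRewind(in, "SHA-256", 1024 * 1024);
        System.out.println("digest length: " + digest.length);
        // The stream is back at the start, so the first byte is readable again.
        System.out.println("first byte after rewind: " + (char) in.read());
    }
}
```

The cost of this approach is that up to `markLimit` bytes are buffered in memory; a larger input would need a different strategy, such as spooling to a temporary file.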

        tallison@mitre.org Tim Allison added a comment -

        r1687981.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #769 (See https://builds.apache.org/job/tika-trunk-jdk1.7/769/)
        TIKA-1663 add a DigestingParser (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1687981)

        • /tika/trunk/CHANGES.txt
        • /tika/trunk/tika-app/src/main/java/org/apache/tika/batch
        • /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/DigestingAutoDetectParserFactory.java
        • /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/builders
        • /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/builders/AppParserFactoryBuilder.java
        • /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
        • /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
        • /tika/trunk/tika-app/src/main/resources/tika-app-batch-config.xml
        • /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
        • /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
        • /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
        • /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java
        • /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java
        • /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/IParserFactoryBuilder.java
        • /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java
        • /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
        • /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java
        • /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml
        • /tika/trunk/tika-batch/src/test/java/org/apache/tika/parser/mock/MockParserFactory.java
        • /tika/trunk/tika-batch/src/test/resources/tika-batch-config-MockConsumersBuilder.xml
        • /tika/trunk/tika-batch/src/test/resources/tika-batch-config-broken.xml
        • /tika/trunk/tika-batch/src/test/resources/tika-batch-config-test.xml
        • /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java
        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/utils
        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/DetectorResource.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/LanguageResource.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/MetadataResource.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaDetectors.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaMimeTypes.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaParsers.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaUtils.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaVersion.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaWelcome.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java
        • /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/LanguageResourceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaDetectorsTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaParsersTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaVersionTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaWelcomeTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TranslateResourceTest.java
        • /tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
        tallison@mitre.org Tim Allison added a comment -

        For those curious, I found no speed hit from adding MD5 hashing to a batch run against the ~1 million documents in govdocs1. Admittedly, I didn't do thorough benchmarking, but the one digesting run I did with trunk was a little bit faster than the one non-digesting run, where "little bit faster" = "difference was small enough to be in the noise."

        thammegowda Thamme Gowda added a comment -

        Chris A. Mattmann, Tim Allison: we need a SHA digest of the raw content for the MEMEX project.
        I tried to enable the digesting parser by editing our config file:

        <properties>
            <parsers>
                <parser class="org.apache.tika.parser.DigestingParser">
                    <parser class="org.apache.tika.parser.DefaultParser">
                    </parser>
                </parser>
                .....
        

        This doesn't work, for the obvious reason that we haven't specified which digest algorithm to use.
        After checking https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java, I found that DigestingParser is a flexible framework and takes constructor args.

        So, I propose two options:
        1. We offer a few popular implementations, such as SHA and MD5 parsers, which don't need constructor args. This would let us activate them by editing the config XML file instead of source code.
        2. We enhance the Tika configuration framework and these flexible parsers to accept runtime arguments, so that both flexibility and ease of use are preserved. For instance, if we can supply the digest algorithm name from the config file and let the DigestingParser use it when it is instantiated, then we don't need to edit the source code of applications.

        <properties>
            <parsers>
                <parser class="org.apache.tika.parser.DigestingParser">
                    <args>
                        <digest>MD5</digest>
                    </args>
                    <parser class="org.apache.tika.parser.DefaultParser">
                    </parser>
                </parser>
                .....
        

        I vote for option 2; it is slightly more work, but I feel it is the way to go.
        I do not know whether Tika already supports option 2 by accepting runtime arguments from the config file.
        I faced a similar issue with NamedEntityParser, but found a workaround using System properties.
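
To illustrate the kind of System-property workaround mentioned above, here is a small sketch that picks the digest algorithm at runtime rather than in source code. The property name `example.digest.algorithm` and the class are invented for this example; nothing here is an existing Tika setting.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestAlgorithmFromProperty {

    // Hypothetical property name, chosen only for this sketch.
    public static final String PROPERTY = "example.digest.algorithm";

    // Looks up the algorithm name from a system property, falling back to a default.
    public static MessageDigest fromSystemProperty(String defaultAlgorithm)
            throws NoSuchAlgorithmException {
        String name = System.getProperty(PROPERTY, defaultAlgorithm);
        return MessageDigest.getInstance(name);
    }

    public static void main(String[] args) throws Exception {
        // Equivalent to passing -Dexample.digest.algorithm=SHA-256 on the command line.
        System.setProperty(PROPERTY, "SHA-256");
        MessageDigest md = fromSystemProperty("MD5");
        System.out.println(md.getAlgorithm() + " produces " + md.getDigestLength() + " bytes");
    }
}
```

A value read from a config file element such as `<digest>` could be passed to `MessageDigest.getInstance` in the same way once the configuration framework supports parameters.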

        tallison@mitre.org Tim Allison added a comment -

        Y, I much prefer #2. The parameter part will be solved by TIKA-1508. I haven't looked at that one in a while, though, but it is important. Any interest in contributing?

        Specifying which parser the digesting parser wraps...hmmm...y, it could be handled as you suggest.

        Nick Burch, for parser wrappers (where one parser takes a delegate), any recommendation on how to specify this via config currently?

        As a side note, beware of TIKA-1701 and truncated files!

        tallison@mitre.org Tim Allison added a comment -

        In tika-batch/tika-app, I did a not-so-great-workaround with an interface for a ParserFactory, and then I hardcoded a parser factory that wrapped a DigestingParser around the AutoDetectParser, and then wrapped all of that in a RecursiveParserWrapper...not happy with that and look forward to being able to configure this via the config file.
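
The shape of that workaround can be sketched with stand-in types. Everything below is hypothetical: `SimpleParser` and this `ParserFactory` are simplified interfaces written for the example, not Tika's classes. The point is only the hardcoded wrapping order, where each wrapper does its own step and then delegates.

```java
import java.util.ArrayList;
import java.util.List;

public class ParserFactorySketch {

    // Hypothetical stand-in for a parser: records which layers handled the input.
    public interface SimpleParser {
        void parse(String input, List<String> trace);
    }

    // Hypothetical factory interface, similar in spirit to the one described above.
    public interface ParserFactory {
        SimpleParser getParser();
    }

    // Builds a wrapper that records its label, then delegates to the wrapped parser.
    public static SimpleParser wrap(String label, SimpleParser delegate) {
        return (input, trace) -> {
            trace.add(label);
            delegate.parse(input, trace);
        };
    }

    // Hardcoded wiring: recursive wrapper -> digesting wrapper -> auto-detect parser.
    public static final ParserFactory HARDCODED = () -> {
        SimpleParser autoDetect = (input, trace) -> trace.add("auto-detect");
        SimpleParser digesting = wrap("digest", autoDetect);
        return wrap("recursive", digesting);
    };

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        HARDCODED.getParser().parse("some document", trace);
        System.out.println(trace);
    }
}
```

Because the wiring lives in code, changing the wrapping order or the digest means recompiling, which is the limitation a config-driven approach would remove.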

        thammegowda Thamme Gowda added a comment -

        Yes, I'd like to work on TIKA-1508, given a timeline of 6 to 8 days from now.

        chrismattmann Chris A. Mattmann added a comment -

        Thanks, Thamme Gowda and Tim Allison.

        gagravarr Nick Burch added a comment -

        The other parser decorators are specified with options inside the parent parser, e.g. mime includes or excludes are decorators given as options to the main parser. In some ways, this is quite nice, as you do the main definition on the thing that'll do the work, then the decorators after.

        One option, for the general case, would be to add additional decorators too, eg http://tika.apache.org/1.12/configuring.html#Configuring_Parsers becomes

            <parser class="org.apache.tika.parser.DefaultParser">
              <mime-exclude>image/jpeg</mime-exclude>
              <mime-exclude>application/pdf</mime-exclude>
              <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
              <decorator class="org.foo.bar.DecoratorWithEmojis"/>
              <decorator class="org.foo.bar.DecoratorWithHashing"/>
            </parser>
        

        For the specific case of the digester, it's a well-known thing, so we could give it custom tags. That would make things clearer, and would get round the parameter issue. One option is:

            <parser class="org.apache.tika.parser.DefaultParser">
              <mime-exclude>image/jpeg</mime-exclude>
              <mime-exclude>application/pdf</mime-exclude>
              <digest>MD5,SHA256</digest>
              <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
            </parser>
        

        The other, to keep it more in line with the mime includes/excludes, is:

            <parser class="org.apache.tika.parser.DefaultParser">
              <mime-exclude>image/jpeg</mime-exclude>
              <mime-exclude>application/pdf</mime-exclude>
              <digest>MD5</digest>
              <digest>SHA256</digest>
              <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
            </parser>
        

        What do people think?
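
From the consuming code's point of view, the two `<digest>` layouts above can be treated the same way. The sketch below assumes the element text values have already been extracted from the XML into a list, and shows one hypothetical way to normalise them into algorithm names; the class and method are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class DigestListSketch {

    // Accepts the text of one or more <digest> elements, each of which may
    // hold a single name or a comma-separated list, and returns the distinct
    // names in the order they first appear.
    public static List<String> parse(List<String> elementValues) {
        List<String> names = new ArrayList<>();
        for (String value : elementValues) {
            for (String token : value.split(",")) {
                String name = token.trim().toUpperCase(Locale.ROOT);
                if (!name.isEmpty() && !names.contains(name)) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Single element with a comma-separated list.
        System.out.println(parse(Arrays.asList("MD5,SHA256")));
        // Repeated elements, one name each.
        System.out.println(parse(Arrays.asList("MD5", "SHA256")));
    }
}
```

Both calls produce the same list, so the choice between the two layouts is mostly about config readability rather than parsing effort.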

        tallison@mitre.org Tim Allison added a comment - - edited

        Thank you, Nick. I somewhat prefer the first option (once we add the parameter setting). I'm hesitant to promote the DigestingParser to a special place, but I'm game if the community is.

        Oh, the other thing...I think I want to add options for encoding the digest bytes. CommonCrawl, for example, uses Base32 of SHA-1.
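
As an illustration of what an alternative encoding of the digest bytes looks like, here is a small sketch that Base32-encodes a SHA-1 digest using the RFC 4648 alphabet. The JDK has no built-in Base32 encoder, so a minimal unpadded one is written out here; the class and method names are hypothetical, and a real implementation would more likely use a library encoder.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Base32DigestSketch {

    private static final char[] ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".toCharArray();

    // Encodes bytes as Base32 (RFC 4648 alphabet) without '=' padding.
    public static String base32(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0; // bit accumulator
        int bits = 0;   // number of unread bits in the accumulator
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                sb.append(ALPHABET[(buffer >> (bits - 5)) & 0x1F]);
                bits -= 5;
            }
        }
        if (bits > 0) {
            // Left-align the remaining bits in a final 5-bit group.
            sb.append(ALPHABET[(buffer << (5 - bits)) & 0x1F]);
        }
        return sb.toString();
    }

    // Computes the SHA-1 digest of the content and returns it Base32-encoded.
    public static String sha1Base32(byte[] content) throws NoSuchAlgorithmException {
        return base32(MessageDigest.getInstance("SHA-1").digest(content));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Base32("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A SHA-1 digest is 20 bytes (160 bits), which divides evenly into 32 Base32 characters, so no padding is needed for that algorithm; the hex form of the same digest is 40 characters.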

        tallison@mitre.org Tim Allison added a comment -

        Nick Burch, am I right that we cannot do this now:

         <parser class="org.apache.tika.parser.DefaultParser">
        
            <decorator class="org.foo.bar.DecoratorWithEmojis"/>
         </parser>
        

          People

          • Assignee: Unassigned
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 0
          • Watchers: 5