  1. Tika
  2. TIKA-2298

Improve the object recognition parser so that it works without an external RESTful service setup

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14
    • Fix Version/s: 1.16
    • Component/s: parser
    • Flags:
      Patch

      Description

      When ObjectRecognitionParser was built to do image recognition, Java
      frameworks offered little support for it. All the popular neural network
      toolkits were written in C++ or Python. Since nothing ran within the JVM, we tried
      several ways to glue them to Tika (CLI, JNI, gRPC, REST).
      However, this is slowly changing. Deeplearning4j, the best-known
      neural network library for the JVM, now supports importing models that were
      pre-trained in Python/C++ based kits [5].

      Improvement:
      It would be nice to have an implementation of ObjectRecogniser that
      doesn't require any external setup (such as installing native libraries or
      starting REST services). Reasons: it is easier to distribute, and it cuts IO
      time.

          Activity

          asmehra95 Avtar Singh added a comment -

          I am not able to run the VGG16 model in DL4J.
          When I try to run the full-fledged model, I get this error:
          Exception in thread "main" java.lang.OutOfMemoryError: Cannot allocate new FloatPointer(138357544): totalBytes = 1G, physicalBytes = 2G
          at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:76)
          at org.nd4j.linalg.api.buffer.BaseDataBuffer.<init>(BaseDataBuffer.java:445)
          at org.nd4j.linalg.api.buffer.FloatBuffer.<init>(FloatBuffer.java:57)
          at org.nd4j.linalg.api.buffer.factory.DefaultDataBufferFactory.createFloat(DefaultDataBufferFactory.java:236)
          at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1301)
          at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1275)
          at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:252)
          at org.nd4j.linalg.cpu.nativecpu.NDArray.<init>(NDArray.java:109)
          at org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory.create(CpuNDArrayFactory.java:247)
          at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4768)
          at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4726)
          at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:3861)
          at org.deeplearning4j.nn.graph.ComputationGraph.init(ComputationGraph.java:342)
          at org.deeplearning4j.nn.graph.ComputationGraph.init(ComputationGraph.java:274)
          at org.deeplearning4j.nn.modelimport.keras.KerasModel.getComputationGraph(KerasModel.java:483)
          at org.deeplearning4j.nn.modelimport.keras.KerasModel.getComputationGraph(KerasModel.java:471)
          at org.deeplearning4j.nn.modelimport.keras.KerasModelImport.importKerasModelAndWeights(KerasModelImport.java:178)
          at modelImport.ModelImportConfig.main(ModelImportConfig.java:18)
          Caused by: java.lang.OutOfMemoryError: Native allocator returned address == 0
          at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:70)
          ... 17 more

          When I run the model marked 'NoTop', it says: Invalid configuration
          I found in the source code of the helper functions that the JSON file needs fixing.

          I am running on a 6th-gen i5 with 4 GB of RAM.
          I tried two operating systems: Ubuntu and Windows.
          Is there any way I can run it?
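
For reference, the size of the failed allocation can be estimated from the exception message. The following is a minimal sketch, assuming that the number in FloatPointer(138357544) is an element count of 4-byte floats; the class and method names are illustrative and are not part of DL4J or JavaCPP.

```java
public class AllocationEstimate {
    /** Bytes needed for a buffer holding the given number of 32-bit floats. */
    static long floatBufferBytes(long elements) {
        return elements * Float.BYTES;
    }

    public static void main(String[] args) {
        // Element count copied from the exception message (assumed to be a float count).
        long elements = 138_357_544L;
        long bytes = floatBufferBytes(elements);
        System.out.printf("%d bytes (about %.0f MiB)%n", bytes, bytes / (1024.0 * 1024.0));
    }
}
```

Under that assumption the single buffer is a little over 500 MiB of native memory, requested in addition to the JVM heap, which may explain why the native allocator gives up on a machine with 4 GB of RAM.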

          thammegowda Thamme Gowda added a comment -

          Avtar Singh
          Please share a link to your code and I will have a look at it.

          Could you also refer to my example code at https://github.com/USCDataScience/dl4j-kerasimport-examples/tree/master/dl4j-import-example and see what flags to pass to the importer (especially flags to disable further training)?

          A PR to that repo with your VGG16 example would be greatly appreciated!

          githubbot ASF GitHub Bot added a comment -

          GitHub user asmehra95 opened a pull request:

          https://github.com/apache/tika/pull/159

          fix for TIKA-2298 contributed by asmehra95

          I have imported the VGG16 model into Apache Tika using Deeplearning4j.
          This recogniser is used much like TensorFlowRESTrecogniser, but it doesn't require any external setup, such as running a REST service as TensorFlowRESTrecogniser does.
          You can read more about TensorFlowRESTrecogniser at https://wiki.apache.org/tika/TikaAndVision

          To use the DL4JImageRecogniser, set:
          the class param to org.apache.tika.parser.recognition.dl4j.DL4JImageRecogniser
          modelType to VGG16
          A sample configuration is given below for reference.
          <?xml version="1.0" encoding="UTF-8"?>
          <properties>
            <parsers>
              <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
                <mime>image/jpeg</mime>
                <params>
                  <param name="topN" type="int">5</param>
                  <param name="minConfidence" type="double">0.015</param>
                  <param name="class" type="string">org.apache.tika.parser.recognition.dl4j.DL4JImageRecogniser</param>
                  <param name="modelType" type="string">VGG16</param>
                </params>
              </parser>
            </parsers>
          </properties>
          Save the configuration at: tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest

          To run it, build the project, change to the project's root directory, and run the command:

          java -Xmx3G -jar tika-app/target/tika-app-1.14.jar --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml <path to your image file>

          -Xmx3G is required because the VGG16 model needs a lot of memory to run. If your system still cannot run it, try increasing the heap size further.

          On the first run, the model file is downloaded automatically by DL4J's helper functions and stored locally at .dl4j/trainedModels
          To speed up later runs, once the model has been loaded from the original hash files it is serialized and saved to disk at .dl4j/trainedModels/tikaPreprocessed, which significantly reduces
          resource usage (especially memory consumption) for future loads.
          For more details you can read this gist: https://gist.github.com/asmehra95/a16c49ec91f7f0d7b39c5bf6c2483e4d
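
The load-once-then-reuse behaviour described above can be sketched with plain Java file I/O. This is only an illustration of the caching pattern, not the code from the pull request: the class name, the `loadOrBuild` method, the byte-array "model", and the file name `vgg16.bin` are all hypothetical stand-ins for the real DL4J model and serializer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

public class ModelCache {
    /**
     * Returns the cached bytes if the cache file exists; otherwise builds the
     * model once, writes it to the cache file, and returns the fresh bytes.
     */
    static byte[] loadOrBuild(Path cacheFile, Supplier<byte[]> expensiveBuild) throws IOException {
        if (Files.exists(cacheFile)) {
            return Files.readAllBytes(cacheFile); // fast path for later runs
        }
        byte[] model = expensiveBuild.get(); // slow path, taken on the first run only
        Path parent = cacheFile.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(cacheFile, model);
        return model;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("model-cache");
        Path cache = dir.resolve("tikaPreprocessed").resolve("vgg16.bin"); // hypothetical file name
        byte[] first = loadOrBuild(cache, () -> new byte[] {1, 2, 3});
        // The second call must not rebuild: the supplier would throw if it were invoked.
        byte[] second = loadOrBuild(cache, () -> {
            throw new IllegalStateException("should not rebuild");
        });
        System.out.println(first.length == second.length);
    }
}
```

In the real recogniser the expensive step would be the Keras import and the cached artifact would be the serialized DL4J model, so the memory saving comes from skipping the import on later runs.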
          Issue Link:
          https://issues.apache.org/jira/browse/TIKA-2298

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/asmehra95/tika master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/159.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #159


          commit a5cd6f42dcded603f2b6de9476280c4bd95b6806
          Author: asmehra95 <asmehra95@gmail.com>
          Date: 2017-03-24T14:21:40Z

          Added dependencies for DL4JImageRecogniser parser

          commit f777f21b47c8d122e6b7a0819b44977f1d571c59
          Author: asmehra95 <asmehra95@gmail.com>
          Date: 2017-03-24T14:28:54Z

          Imported VGG16 model via deeplearning4j


          githubbot ASF GitHub Bot added a comment -

          asmehra95 commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-293141458

          Hello folks,
          I have fixed the formatting issues. @thammegowda, please review it and let me know if any changes are required.
          I have made it a little more customizable: you can now choose whether to save the model to disk.
          Saving a model to disk requires a lot of memory (around 500 MB), but it saves a lot of runtime memory once the model is saved.

          How to use:
          Add a field to the config file:
          <param name="serialize" type="string">no</param>
          It can be yes or no.

          Observations:
          When the model is loaded from disk:
          It requires only around 1200 MB of RAM to run.

          When the model is loaded from h5 files using the helper functions:
          It requires 2500 MB of RAM to run.

          I think we can distribute serialized models for VGG16 instead of the original hash files. Would that cause any problems, @saudet @agibsonccc? One more thing: the VGG16 model doesn't work completely offline. It connects to the internet after processing the image to decode the output. Can we make it entirely offline?

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          saudet commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-293142844

          /cc @turambar would know more

          githubbot ASF GitHub Bot added a comment -

          agibsonccc commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-293144691

          Not sure what you mean here. It needs to download the image weights once, not every time. You can try bundling the weights with the model if you want; alternatively, you can take the pretrained model, save it with DL4J, and bundle that with the jar.

          githubbot ASF GitHub Bot added a comment -

          asmehra95 commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-293166407

          @agibsonccc What I am saying is that, instead of downloading the image weights (h5 file), I could write functions that download the serialized model from my repo, because both are approximately the same size. The Tika user would load directly from this serialized model, not from the image weights.

          What I am unsure about is whether the serialized model would work on all platforms. Is there any platform dependency?
          The model will be serialized using
          ModelSerializer.writeModel(model, locationToSave, true);
          and loaded using
          model = ModelSerializer.restoreComputationGraph(locationToSave);

          Regarding the offline feature:

          When I try to decode predictions for an image while offline, it produces an error. Apparently it goes online for decoding.
          Here is the stack trace when offline:
          https://gist.github.com/asmehra95/ac8bcfffbc5c1932d38a034d9b486c99
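
One way the decoding step could work without a network connection is to ship the class labels with the application and map the output probabilities to labels locally. The sketch below assumes the labels are already available as an in-memory array in the same order as the network outputs; it does not reproduce DL4J's own decoder, and the class and method names are hypothetical. The `topN` and `minConfidence` arguments mirror the parameters of the same names in the sample configuration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OfflineDecoder {
    /**
     * Pairs each probability with its label, drops entries below minConfidence,
     * and returns at most topN entries ordered from most to least confident.
     */
    static List<Map.Entry<String, Double>> decode(double[] probs, String[] labels,
                                                  int topN, double minConfidence) {
        if (probs.length != labels.length) {
            throw new IllegalArgumentException("one label per output is required");
        }
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] >= minConfidence) {
                kept.add(new AbstractMap.SimpleImmutableEntry<>(labels[i], probs[i]));
            }
        }
        kept.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return new ArrayList<>(kept.subList(0, Math.min(topN, kept.size())));
    }

    public static void main(String[] args) {
        // Toy values for illustration only; a real model emits one probability per class.
        double[] probs = {0.70, 0.01, 0.20, 0.09};
        String[] labels = {"tabby", "broom", "tiger cat", "lynx"};
        for (Map.Entry<String, Double> p : decode(probs, labels, 2, 0.015)) {
            System.out.println(p.getKey() + " " + p.getValue());
        }
    }
}
```

With the labels bundled next to the model file, nothing in this step would need to reach the network.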

          githubbot ASF GitHub Bot added a comment -

          thammegowda commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-293358840

          @asmehra95 I appreciate your effort. Thanks for updating the code based on our review.

          1. I feel this PR should be raised against the `tika-dl` module being proposed in #165, so that we can isolate the DL4J dependencies in that module instead of `tika-parsers`. We have to wait until PR #165 is merged and then move your classes into the tika-dl module.
          2. I am not sure what's happening with the online/offline issue. It seems to me that a necessary file is missing (the Keras JSON model, the weights, or the labels), so it tries to download it from S3. I will have a closer look and report my findings.

          githubbot ASF GitHub Bot added a comment -

          asmehra95 commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-293461556

          @thammegowda Thank you for your comment.
          I will open a pull request once tika-dl is merged.

          githubbot ASF GitHub Bot added a comment -

          chrismattmann commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-300577734

          Guys, #165 is now committed, so can this be updated to live inside Tika-DL? @asmehra95 @thammegowda

          githubbot ASF GitHub Bot added a comment -

          asmehra95 commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-301683734

          Yes, sure! I am on it, @chrismattmann.
          I will raise the PR as soon as possible.

          githubbot ASF GitHub Bot added a comment -

          chrismattmann commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-302990927

          ping @asmehra95 any update?

          githubbot ASF GitHub Bot added a comment -

          asmehra95 opened a new pull request #182: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j into tika-dl
          URL: https://github.com/apache/tika/pull/182

          <b>Note:</b> This is a modified form of #159, which I raised earlier.
          I have imported the VGG16 model into the tika-dl module using Deeplearning4j.
          This recogniser is used in much the same way as TensorFlowRESTrecogniser, but it doesn't require any external setup, such as the running REST service that TensorFlowRESTrecogniser needs.
          You can read more about TensorFlowRESTrecogniser at https://wiki.apache.org/tika/TikaAndVision

          To use DL4JVGG16Net, set:
          the class param to org.apache.tika.dl.imagerec.DL4JVGG16Net
          modelType to VGG16
          A sample configuration is given below for reference.

          ```
          <?xml version="1.0" encoding="UTF-8"?>
          <properties>
            <parsers>
              <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
                <mime>image/jpeg</mime>
                <params>
                  <param name="topN" type="int">2</param>
                  <param name="minConfidence" type="double">0.015</param>
                  <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JVGG16Net</param>
                  <param name="modelType" type="string">VGG16</param>
                  <param name="serialize" type="string">yes</param>
                </params>
              </parser>
            </parsers>
          </properties>
          ```
          Save the configuration at your preferred location.
          A default one is provided at ``` tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml ```
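Before running Tika against a saved copy of the configuration, it can help to check which parameters the file actually carries. The following is a minimal sketch that uses only the JDK's built-in XML parser, not Tika's own configuration loader; the class name `ConfigParams` and the method `readParams` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists the <param> entries of a config file.
public class ConfigParams {

    /** Returns name -> value for every <param> element, in document order. */
    static Map<String, String> readParams(InputStream xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            NodeList nodes = doc.getElementsByTagName("param");
            Map<String, String> params = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element param = (Element) nodes.item(i);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
            return params;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read config", e);
        }
    }

    public static void main(String[] args) {
        // A trimmed-down copy of the sample configuration above.
        String xml = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.recognition.ObjectRecognitionParser\">"
                + "<params>"
                + "<param name=\"topN\" type=\"int\">2</param>"
                + "<param name=\"class\" type=\"string\">org.apache.tika.dl.imagerec.DL4JVGG16Net</param>"
                + "<param name=\"modelType\" type=\"string\">VGG16</param>"
                + "</params></parser></parsers></properties>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        readParams(in).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

To check a file on disk, pass `java.nio.file.Files.newInputStream(path)` to `readParams` instead of the in-memory stream.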

          To run it with the default configuration, build the project, change to the project's root directory, and run the command below. The `;` classpath separator shown is the Windows form; on Linux or macOS use `:` instead.

          ``` java -Xmx3G -cp ./tika-dl/target/tika-dl-1.15-SNAPSHOT-jar-with-dependencies.jar;tika-app/target/tika-app-1.15-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml tika-dl/src/test/resources/org/apache/tika/dl/imagerec/lion.jpg ```
          -Xmx3G is required because the VGG16 model needs a lot of memory to run.
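The Java classpath separator depends on the platform (`;` on Windows, `:` on Linux and macOS), so a command copied from one system may fail on another. As a rough sketch, the same command line can be assembled portably with `File.pathSeparator`; the class `TikaDlCommand` and its methods are hypothetical names used only for this illustration, and the jar paths are the ones quoted in the command above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: builds the run command for the current platform.
public class TikaDlCommand {

    /** Joins classpath entries with the separator of the current platform. */
    static String classpath(String... jars) {
        return String.join(File.pathSeparator, jars);
    }

    /** Builds the argument list for running TikaCLI with the given config and image. */
    static List<String> command(String config, String image) {
        return Arrays.asList(
                "java", "-Xmx3G",
                "-cp", classpath(
                        "tika-dl/target/tika-dl-1.15-SNAPSHOT-jar-with-dependencies.jar",
                        "tika-app/target/tika-app-1.15-SNAPSHOT.jar"),
                "org.apache.tika.cli.TikaCLI",
                "--config=" + config,
                image);
    }

    public static void main(String[] args) {
        List<String> cmd = command(
                "tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml",
                "tika-dl/src/test/resources/org/apache/tika/dl/imagerec/lion.jpg");
        // Print the command so it can be pasted into a shell.
        System.out.println(String.join(" ", cmd));
    }
}
```

The returned list can also be handed directly to `new ProcessBuilder(cmd)` if the command should be launched from Java rather than from a shell.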
          Observations:
          When loading the serialized model from disk:
          It requires only around 1200 MB of RAM to run.

          When the model is loaded from h5 files using helper functions:
          It requires 2500 MB of RAM to run the model (needed only once if serialization is set to yes).

          When the model runs, it automatically downloads the model file to .dl4j/trainedModels using DL4J's helper functions.
          To speed up future loads, once the model has been loaded from the original h5 files, it is serialized and saved to disk at .dl4j/trainedModels/tikaPreprocessed, which significantly reduces
          resource usage (especially memory consumption) on subsequent loads.
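The serialize-once behaviour described above follows a common load-or-cache pattern. The sketch below illustrates that pattern with plain Java serialization and a `float[]` standing in for model weights; it is not the code from the pull request and does not use DL4J's own model serializer. `ModelCache`, `loadOrBuild`, and the cache path are hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Supplier;

// Hypothetical illustration of the load-or-cache pattern; not the tika-dl implementation.
public class ModelCache {

    /**
     * Returns the object stored in cacheFile if it exists. Otherwise runs the
     * expensive loader once, writes the result to cacheFile, and returns it.
     */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T loadOrBuild(Path cacheFile, Supplier<T> slowLoader) {
        try {
            if (Files.exists(cacheFile)) {
                // Fast path: reuse the serialized copy from an earlier run.
                try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(cacheFile))) {
                    return (T) in.readObject();
                }
            }
            // Slow path: build the model, then persist it for future loads.
            T model = slowLoader.get();
            Path parent = cacheFile.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
                out.writeObject(model);
            }
            return model;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Path cache = Paths.get(System.getProperty("java.io.tmpdir"), "model-cache-sketch", "weights.bin");
        float[] weights = loadOrBuild(cache, () -> new float[] {0.1f, 0.2f, 0.3f});
        System.out.println("Loaded " + weights.length + " values from " + cache);
    }
}
```

On the second and later calls the loader is skipped entirely, which is the same trade the description makes: a one-time expensive import in exchange for cheaper loads afterwards.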
          Issue Link:
          https://issues.apache.org/jira/browse/TIKA-2298

          githubbot ASF GitHub Bot added a comment -

          chrismattmann commented on issue #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159#issuecomment-304461966

          superseded by #182

          githubbot ASF GitHub Bot added a comment -

          chrismattmann closed pull request #159: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j
          URL: https://github.com/apache/tika/pull/159

          githubbot ASF GitHub Bot added a comment -

          chrismattmann commented on issue #182: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j into tika-dl
          URL: https://github.com/apache/tika/pull/182#issuecomment-304462094

          frickin' awesome! I'm going to test this today @asmehra95

          githubbot ASF GitHub Bot added a comment -

          asmehra95 commented on issue #182: Creation of TIKA-2298 contributed by asmehra95- Import of vgg16 via Deeplearning4j into tika-dl
          URL: https://github.com/apache/tika/pull/182#issuecomment-304462311

          @thammegowda @chrismattmann awaiting review for this pull request...


            People

            • Assignee:
              Unassigned
              Reporter:
              asmehra95 Avtar Singh
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                Original Estimate - 672h
                Remaining:
                Remaining Estimate - 672h
                Logged:
                Time Spent - Not Specified
