Nutch / NUTCH-609

Allow Plugins to be Loaded from Jar File(s)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels:
      None
    • Environment:

      All

    • Patch Info:
      Patch Available

      Description

      Currently, plugins cannot be loaded from a jar file. Plugins must be unzipped into one or more directories specified by the plugin.folders config. I have been thinking about an extension to PluginRepository or PluginManifestParser (or both) that would allow plugins to be packaged into multiple independent jar files and placed on the classpath. The system would search the classpath for resources with the correct folder name and would load any plugins in those jars.

      This functionality would be very useful in making the nutch core more flexible in terms of packaging. It would also help with web applications where we don't want to have a plugins directory included in the webapp.

      My thoughts so far are to unzip those plugin jars into a common temp directory before loading. Another option is to use something like commons vfs to interact with the jar files. VFS essentially uses a disk-based temporary cache for jar files, so it is pretty much the same solution. What are everyone else's thoughts on this?

        Activity

        Lewis John McGibbney added a comment -

        There has not been any activity on this one for some time. Does anyone have any comments on the relevance of this with regard to the way Nutch is moving? I think it is outside the scope of the 1.4 release.

        Chris A. Mattmann added a comment -

        pushing this out per http://bit.ly/c7tBv9

        Sami Siren added a comment -

        pushing this to 1.1, feel free to put back if there is traction

        Sami Siren added a comment -

        I think the direction in general is good, I know that we are the only ones feeling bad with the current solution. We should perhaps put some more effort in thinking about a longer term solution for the next generation Nutch.

        Dennis Kubes added a comment -

        Changed to minor status. Other issues are more important currently.

        Andrzej Bialecki added a comment -

        "I figured that wasn't the right path to go down right now"

        Why? It seems like a much cleaner solution.

        Dennis Kubes added a comment -

        Rough first draft of the patch. After some research I determined that loading classes from a jar within a jar would require writing a custom classloader. I figured that wasn't the right path to go down right now, so I created a utility to manage the deletion of resources (files and folders) during shutdown, plus methods that allow plugins to be unzipped into a temporary folder (the system temp dir by default, although this is configurable). This patch takes any jar file on the classpath whose name ends in plugin.jar or plugins.jar and unzips its contents into the plugins temp directory. That directory is then added to the plugin folders and parsed as normal. The plugins temp dir is kept until the JVM shuts down, at which point it and all the resources it contains are deleted by a shutdown hook.

        Please let me know thoughts on this approach. I would still need to add unit tests and documentation for these classes and methods.
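
        The unzip-and-clean-up approach described above could look roughly like the following. This is a minimal sketch and not the attached patch: the class name PluginJarUnpacker, its method names, and the "nutch-plugins-" temp directory prefix are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collections;
import java.util.Comparator;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch or of the attached patch.
public class PluginJarUnpacker {

    /** Extracts every entry of the jar into a fresh temp directory and returns that directory. */
    public static Path unpack(Path jar) throws IOException {
        Path target = Files.createTempDirectory("nutch-plugins-").toAbsolutePath().normalize();
        // Remove the extracted files when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(target)));
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            for (JarEntry entry : Collections.list(jarFile.entries())) {
                Path out = target.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the temp directory.
                if (!out.startsWith(target)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    try (InputStream in = jarFile.getInputStream(entry)) {
                        Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
        return target;
    }

    /** Best-effort recursive delete: children are removed before their parent directories. */
    public static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException ignored) {
                    // Nothing useful can be done this late in shutdown.
                }
            });
        } catch (IOException ignored) {
            // Best effort only.
        }
    }
}
```

        The directory returned by unpack would then be appended to the configured plugin folders. Note that a shutdown hook does not run if the JVM is killed abruptly, so stale temp directories can still accumulate.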

        Dennis Kubes added a comment -

        Well, as it turns out I haven't found a way to put a jar file inside of another jar file on the classpath. So something like jar:file:/path/to/jar.jar!/containing.jar won't work. I will research a little more but it looks to me like even if we have each plugin in its own jar file, which I prefer, we would still need to unzip to a temporary directory to put them on the classpath. Does anybody know a different way to include a jar inside of another jar on the classpath?

        Chris A. Mattmann added a comment -

        "the downside to this is we could end up with a lot of jars, but currently we are ending up with a lot of folders so I don't know if that is a big difference. Thoughts?"

        +1 for this strategy, even if it means having more jar files to manage. Adopting this strategy would suggest a move to a more component-based build management system, e.g., a Maven-type one. I've been a proponent of using Maven in Nutch for a while, and I think moving the plugins to a .jar file format would ease their adoption as, say, remotely downloadable Maven-style plugins that Nutch would then rely upon. Then we could get out of the business of having to CM jar files, which I've never been a fan of.

        Dennis Kubes added a comment -

        Sorry, I should have been more clear. I know that it is possible to load the resources directly from the jar files; I just don't know how much work that is going to take. I agree that avoiding the unzipping of jar files into temp directories, and having to manage those directories for deletion, is the preferred solution.

        Another thing I was thinking of and would like to get thoughts on is a convention versus configuration solution. Instead of browsing jar files for named resources, and then having to deal with the contention issues between directories and resources in jars being named the same, what if we were to have plugin jar files named a given way, something like name-plugin.jar. For example the prefix urlfilter plugin would be named urlfilter-prefix-plugin.jar. There would be a single plugin per jar and each jar would be the root directory for its plugin. Then to find plugin jars we are just scanning the classpath for certain named jars. The downside to this is we could end up with a lot of jars, but currently we are ending up with a lot of folders so I don't know if that is a big difference. Thoughts?
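
        The naming-convention idea above can be sketched as a scan of the classpath string for jars whose file name ends in "-plugin.jar". The class PluginJarScanner and its methods are hypothetical names used only for illustration. The sketch also assumes that the java.class.path system property reflects the jars of interest, which may not hold inside a servlet container.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "<name>-plugin.jar" convention; not part of Nutch.
public class PluginJarScanner {

    /** Assumed suffix taken from the proposed convention. */
    static final String SUFFIX = "-plugin.jar";

    /** True if the file name has a non-empty plugin name followed by the suffix. */
    public static boolean isPluginJar(String fileName) {
        return fileName.endsWith(SUFFIX) && fileName.length() > SUFFIX.length();
    }

    /** Strips the suffix, e.g. "urlfilter-prefix-plugin.jar" becomes "urlfilter-prefix". */
    public static String pluginName(String fileName) {
        return fileName.substring(0, fileName.length() - SUFFIX.length());
    }

    /** Returns the classpath entries whose file name follows the convention. */
    public static List<File> findPluginJars(String classpath) {
        List<File> result = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue;
            }
            File file = new File(entry);
            if (isPluginJar(file.getName())) {
                result.add(file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Lists every plugin jar found on the classpath of the running JVM.
        for (File jar : findPluginJars(System.getProperty("java.class.path"))) {
            System.out.println(pluginName(jar.getName()) + " -> " + jar);
        }
    }
}
```

        Because only file names are inspected, no jar has to be opened during discovery, which is the main appeal of convention over configuration here.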

        Andrzej Bialecki added a comment -

        My point was that perhaps we can avoid unzipping. For example, URLClassLoader is able to load resources directly from jars using the bang-path notation, and it can also list resources at a specific path inside a jar.
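
        A minimal sketch of reading from a jar without unzipping it, under some assumptions: the class JarResourceReader and its methods are hypothetical, and the listing is done here through JarURLConnection and JarFile rather than through URLClassLoader itself, since that is the route I am certain exists. It needs Java 9 or later for InputStream.readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.JarURLConnection;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper; not part of Nutch.
public class JarResourceReader {

    /** Reads one resource straight out of the jar; returns null if it is absent. */
    public static String readResource(Path jar, String resourceName) throws IOException {
        URL[] urls = { jar.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(urls, null)) {
            // The returned URL uses the bang-path form: jar:file:/.../x.jar!/resourceName
            URL resource = loader.findResource(resourceName);
            if (resource == null) {
                return null;
            }
            try (InputStream in = resource.openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Lists the non-directory entries of the jar whose names start with the given prefix. */
    public static List<String> listEntries(Path jar, String prefix) throws IOException {
        URL url = URI.create("jar:" + jar.toUri() + "!/").toURL();
        JarURLConnection connection = (JarURLConnection) url.openConnection();
        connection.setUseCaches(false); // so that closing the JarFile below is safe
        List<String> names = new ArrayList<>();
        try (JarFile jarFile = connection.getJarFile()) {
            for (JarEntry entry : Collections.list(jarFile.entries())) {
                if (!entry.isDirectory() && entry.getName().startsWith(prefix)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```

        With helpers like these, a plugin.xml could be parsed directly from the stream, although classes inside jars nested within a plugin jar would still not be loadable this way.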

        Dennis Kubes added a comment -

        It may be able to, I will look into it. I am still in the early stages of this but have gotten it to unzip jars and load the plugins. One thing I am seeing is that the actual resources need to be around after the initial parse because the resources are lazily loaded.

        Andrzej Bialecki added a comment -

        What is the reason that the plugin classloader can't load resources directly from a jar file, without unzipping it first?


          People

          • Assignee:
            Dennis Kubes
            Reporter:
            Dennis Kubes
          • Votes:
            0
            Watchers:
            1
