SOLR-4916

Add support to write and read Solr index files and transaction log files to and from HDFS.
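As a sketch of what this feature enables: Solr is pointed at HDFS through a directory factory in solrconfig.xml. The fragment below is illustrative only; the namenode URI and Hadoop conf dir are placeholder values.

```xml
<!-- Illustrative sketch: switch the index directory to HDFS via the
     HdfsDirectoryFactory added by this issue. The namenode URI and
     Hadoop conf dir below are placeholder values. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
</directoryFactory>

<!-- The transaction log can likewise be kept on HDFS. -->
<updateLog class="solr.HdfsUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
```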

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, Trunk
    • Component/s: None
    • Labels: None

    Attachments
    1. SOLR-4916.patch
      601 kB
      Mark Miller
    2. SOLR-4916.patch
      600 kB
      Mark Miller
    3. SOLR-4916.patch
      602 kB
      Mark Miller
    4. SOLR-4916-ivy.patch
      7 kB
      Mark Miller
    5. SOLR-4916-move-MiniDfsCluster-deps-from-solr-test-framework-to-solr-core.patch
      30 kB
      Steve Rowe
    6. SOLR-4916-nulloutput.patch
      1 kB
      Uwe Schindler
    7. SOLR-4916-nulloutput.patch
      0.9 kB
      Uwe Schindler

      Issue Links

        Activity

        Transition: Open → Resolved
        Time In Source Status: 43d 23h 31m | Execution Times: 1 | Last Executer: Mark Miller | Last Execution Date: 25/Jul/13 14:31
        Mark Miller made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 4.4 [ 12324324 ]
        Fix Version/s 4.5 [ 12324743 ]
        Resolution Fixed [ 1 ]
        Steve Rowe made changes -
        Fix Version/s 4.5 [ 12324743 ]
        Fix Version/s 4.4 [ 12324324 ]
        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Mark Miller added a comment -

        Thanks Uwe - I'll relate these issues back to Apache Blur as well - perhaps it will help convince them of the benefits of collaborating on the HdfsDirectory in a Lucene module.

        ASF subversion and git services added a comment -

        Commit 1502169 from Uwe Schindler
        [ https://svn.apache.org/r1502169 ]

        Merged revision(s) 1502167 from lucene/dev/trunk:
        SOLR-4916: Fix bugs & usage of NullIndexOutput

        ASF subversion and git services added a comment -

        Commit 1502168 from Uwe Schindler
        [ https://svn.apache.org/r1502168 ]

        Merged revision(s) 1502167 from lucene/dev/trunk:
        SOLR-4916: Fix bugs & usage of NullIndexOutput

        ASF subversion and git services added a comment -

        Commit 1502167 from Uwe Schindler
        [ https://svn.apache.org/r1502167 ]

        SOLR-4916: Fix bugs & usage of NullIndexOutput

        Uwe Schindler made changes -
        Attachment SOLR-4916-nulloutput.patch [ 12591816 ]
        Uwe Schindler added a comment -

        Fixing another bug where the length was not updated. None of this is performance critical, as it only affects segments.gen.

        Will commit soon!

        Uwe Schindler made changes -
        Attachment SOLR-4916-nulloutput.patch [ 12591815 ]
        Uwe Schindler added a comment - edited

        Hi Mark,
        While reviewing the currently committed code I found a "bug":
        HdfsDirectory has a special case to prevent the segments.gen file from being written: it redirects the output to a NullIndexOutput. The bug is that HdfsDirectory has a static instance of this NullIndexOutput, but the static instance has state (file size, file position). Always returning the same instance carries that state across uses, which is not safe when used by different threads in parallel, so it could cause bugs. openOutput must return a new instance (which costs nothing, as it's a small object on the Eden heap only).

        See attached patch!
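The pattern Uwe describes can be sketched in miniature. This is not the real Lucene/Solr code; NullSink is a stand-in for NullIndexOutput, illustrating why a shared static instance leaks position state between "files" while a fresh instance per openOutput call does not.

```java
// Minimal sketch of the shared-static-instance bug (hypothetical names;
// NullSink stands in for NullIndexOutput).
class NullSink {
    private long pos;                    // mutable state: file position
    void writeByte(byte b) { pos++; }    // advances the shared position
    long position() { return pos; }
}

public class NullOutputDemo {
    // Buggy pattern: one static instance handed to every caller.
    static final NullSink SHARED = new NullSink();
    static NullSink buggyOpenOutput() { return SHARED; }

    // Fix: a fresh, short-lived instance per call; allocation is cheap.
    static NullSink fixedOpenOutput() { return new NullSink(); }

    public static void main(String[] args) {
        buggyOpenOutput().writeByte((byte) 1);
        // The next "file" inherits the leftover position from the first:
        System.out.println("shared pos: " + buggyOpenOutput().position());
        // A fresh instance always starts at position 0:
        System.out.println("fresh pos:  " + fixedOpenOutput().position());
    }
}
```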

        ASF subversion and git services added a comment -

        Commit 1502150 from Uwe Schindler
        [ https://svn.apache.org/r1502150 ]

        Merged revision(s) 1502147 from lucene/dev/trunk:
        SOLR-4916: Re-add missing log="download-only"

        ASF subversion and git services added a comment -

        Commit 1502148 from Uwe Schindler
        [ https://svn.apache.org/r1502148 ]

        Merged revision(s) 1502147 from lucene/dev/trunk:
        SOLR-4916: Re-add missing log="download-only"

        ASF subversion and git services added a comment -

        Commit 1502147 from Uwe Schindler
        [ https://svn.apache.org/r1502147 ]

        SOLR-4916: Re-add missing log="download-only"

        Uwe Schindler added a comment -

        Cool, thanks! I am glad easymock is gone, too.

        This is a good step forward toward better-structured test dependencies in Solr!

        ASF subversion and git services added a comment -

        Commit 1502114 from Steve Rowe
        [ https://svn.apache.org/r1502114 ]

        SOLR-4916: Move MiniDfsCluster test dependencies from solr test-framework to solr-core; download solr-core test dependencies to solr/core/test-lib/ instead of solr/core/lib/; download DIH test dependencies to solr/contrib/dataimporthandler/test-lib/ instead of [...]/lib/ (merged trunk r1502105)

        ASF subversion and git services added a comment -

        Commit 1502113 from Steve Rowe
        [ https://svn.apache.org/r1502113 ]

        SOLR-4916: Move MiniDfsCluster test dependencies from solr test-framework to solr-core; download solr-core test dependencies to solr/core/test-lib/ instead of solr/core/lib/; download DIH test dependencies to solr/contrib/dataimporthandler/test-lib/ instead of [...]/lib/ (merged trunk r1502105)

        ASF subversion and git services added a comment -

        Commit 1502105 from Steve Rowe
        [ https://svn.apache.org/r1502105 ]

        SOLR-4916: Move MiniDfsCluster test dependencies from solr test-framework to solr-core; download solr-core test dependencies to solr/core/test-lib/ instead of solr/core/lib/; download DIH test dependencies to solr/contrib/dataimporthandler/test-lib instead of [...]/lib/

        Mark Miller added a comment -

        +1, I think it's a step forward.

        Steve Rowe added a comment -

        > My best idea would be to add a second lib folder (test-framework/runtime-libs) that is not packed into the binary ZIP file distribution. It's easy to add: We can add a separate resolve with another target folder. In Maven it should also definitely not be listed as dependency for runtime, too!

        > My vote is to move the deps to solr-core.

        > +1. Like test-only in Maven, for IVY, I would put them into a separate config and store in a separate directory, so they are not packaged: solr/core/test-libs

        This patch moves the MiniDfsCluster dependencies from solr/test-framework/ivy.xml to solr/core/ivy.xml, using a separate Ivy configuration, and storing the deps in solr/core/test-lib/. I also took the opportunity to store other test jars there: easymock and its deps. As a result, we no longer need exceptions for these test-only deps when pulling from solr/core/lib/ to put into the war.

        The patch also gives DIH the same treatment for easymock -> solr/contrib/dataimporthandler/test-lib/ - previously DIH got that test dep via solr/core/lib/.

        The patch includes Ant/Ivy, IntelliJ, Maven, and Eclipse support for the dependency moves. I successfully ran Solr tests under each of those, except Eclipse, which I don't use.

        I want to include this change in 4.4, so that we don't ship Maven config for solr test-framework with dependencies on solr-core-only test deps.
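The separate-configuration approach described above can be sketched roughly like this in an ivy.xml. The module coordinates, revision, and retrieve pattern here are illustrative placeholders, not the committed patch.

```xml
<!-- Illustrative sketch of a test-only Ivy configuration; coordinates
     and revision are placeholders, not the actual SOLR-4916 patch. -->
<ivy-module version="2.0">
  <info organisation="org.apache.solr" module="core"/>
  <configurations>
    <conf name="default"/>
    <conf name="test" description="test-only deps, retrieved to test-lib/"/>
  </configurations>
  <dependencies>
    <dependency org="org.apache.hadoop" name="hadoop-hdfs"
                rev="2.0.5-alpha" conf="test->default"/>
  </dependencies>
</ivy-module>
```

A separate `<ivy:retrieve conf="test" pattern="test-lib/[artifact]-[revision].[ext]"/>` in the build would then keep these jars out of lib/ and out of the packaged artifacts.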

        ASF subversion and git services added a comment -

        Commit 1500135 from Steve Rowe
        [ https://svn.apache.org/r1500135 ]

        SOLR-4916: IntelliJ configuration (merged trunk r1497105)

        Uwe Schindler added a comment -

        > My vote is to move the deps to solr-core.

        +1. Like test-only in Maven, for IVY, I would put them into a separate config and store in a separate directory, so they are not packaged: solr/core/test-libs

        Steve Rowe added a comment -

        > I guess the main use case is for downstream projects to have the ability to filter out these dependencies and avoid pulling down the test time dependencies - but it seems we would care about that in the maven shadow build, not here - we don't publish based on the ivy files right?

        > In that case, it would seem we should simply do the same thing as with some of the other jars in core that are excluded from the webapp - exclude them in the build.xml and have the maven build treat them as part of a test configuration?

        > Steve Rowe, does any of that make any sense?

        Yes, it does - if these deps were moved to solr-core and declared in the maven conf as test scope, they would not be pulled in as transitive deps by consumers of the solr-core artifact.

        > My best idea would be to add a second lib folder (test-framework/runtime-libs) that is not packed into the binary ZIP file distribution. It's easy to add: We can add a separate resolve with another target folder. In Maven it should also definitely not be listed as dependency for runtime, too!

        If we leave the deps where they are now, on test-framework (which I don't think we should do, since these are really only solr-core deps), then they could be declared optional in the maven conf, but then all consumers that need these deps would need to declare them; so, at least in the maven config, there is zero point in keeping them as deps of test-framework.

        My vote is to move the deps to solr-core.
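On the Maven side, the test-scope arrangement discussed here looks roughly like the fragment below in the solr-core POM. The coordinates and version are illustrative; the point is that `<scope>test</scope>` keeps the dep out of consumers' transitive graphs.

```xml
<!-- Illustrative sketch: a test-scoped dependency on solr-core, so
     consumers of the solr-core artifact never pull it in transitively.
     Coordinates/version are placeholders. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.0.5-alpha</version>
  <classifier>tests</classifier>
  <scope>test</scope>
</dependency>
```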

        ASF subversion and git services added a comment -

        Commit 1499847 from Mark Miller
        [ https://svn.apache.org/r1499847 ]

        SOLR-4916: Fix NOTICE - take Solr entry out of Lucene section

        Mark Miller added a comment -

        > My best idea would be to add a second lib folder (test-framework/runtime-libs) that is not packed into the binary ZIP file distribution. It's easy to add: We can add a separate resolve with another target folder. In Maven it should also definitely not be listed as dependency for runtime, too!

        I crossposted with this, so I had not read it yet. That's fine with me if Robert Muir is fine with it.

        ASF subversion and git services added a comment -

        Commit 1499842 from Mark Miller
        [ https://svn.apache.org/r1499842 ]

        SOLR-4916: Fix NOTICE - take Solr entry out of Lucene section

        Uwe Schindler added a comment -

        I also don't want the files in the distribution ZIP, and currently they are listed in the distribution ZIP's test-framework folder (and this is why the smoke tester fails)! Because of that I proposed to have a separate lib folder that is never zipped (to WAR but also not to bin-tgz/bin-zip).

        Mark Miller added a comment -

        Well that's no help - even if all of this is tied up to the classpaths used, it doesn't seem to be a mechanism for shielding modules from each other AFAICT. I guess the main use case is for downstream projects to have the ability to filter out these dependencies and avoid pulling down the test time dependencies - but it seems we would care about that in the maven shadow build, not here - we don't publish based on the ivy files right?

        In that case, it would seem we should simply do the same thing as with some of the other jars in core that are excluded from the webapp - exclude them in the build.xml and have the maven build treat them as part of a test configuration?

        Steve Rowe, does any of that make any sense?

        Uwe Schindler added a comment -

        I don't think this solution is very nice unless we have the "conf" really working. Currently all JARs are copied to lib folder ignoring the conf="..." attribute and we have to filter them ourselves (see your patch where you exclude from WAR file).

        In this case (without ivy:cachepath/ivy:cachefileset) I would prefer the current solution.

        IMHO, the problem with the release smoker is more the fact that it "checks too much". The smoke tester should only deny javax classes in official lib folders and JAR files, not in test dependencies. I have really no problem with having the test dependencies in Solr's test-framework, of course not in Lucene's test-framework. In other lib dirs we also have transitive non-compile-time dependencies.

        My best idea would be to add a second lib folder (test-framework/runtime-libs) that is not packed into the binary ZIP file distribution. It's easy to add: We can add a separate resolve with another target folder. In Maven it should also definitely not be listed as dependency for runtime, too!

        Mark Miller added a comment -

        Duh, of course the contribs are also Ivy modules that depend on core... I'll mess around and see if I can get this working nicely...

        Mark Miller made changes -
        Attachment SOLR-4916-ivy.patch [ 12590914 ]
        Mark Miller added a comment -

        I don't really know ivy, but here is a patch that moves dfsminicluster dependencies from test-framework to core. I'm not really sure if the private conf stuff is working or not - I don't think we have another module that depends on core to check with...

        ASF subversion and git services added a comment -

        Commit 1499473 from Mark Miller
        [ https://svn.apache.org/r1499473 ]

        SOLR-4916: Merge out separate hdfs solrconfig.xml

        ASF subversion and git services added a comment -

        Commit 1499472 from Mark Miller
        [ https://svn.apache.org/r1499472 ]

        SOLR-4916: Merge out separate hdfs solrconfig.xml

        Mark Miller made changes -
        Fix Version/s 5.0 [ 12321664 ]
        Fix Version/s 4.4 [ 12324324 ]
        Mark Miller added a comment -

        I still need to toss together an initial bit of doc for this.

        ASF subversion and git services added a comment -

        Commit 1498714 from Mark Miller
        [ https://svn.apache.org/r1498714 ]

        SOLR-4916: Maven configuration for the new HDFS deps

        ASF subversion and git services added a comment -

        Commit 1498713 from Mark Miller
        [ https://svn.apache.org/r1498713 ]

        SOLR-4916: Fix test to close properly

        ASF subversion and git services added a comment -

        Commit 1498712 from Mark Miller
        [ https://svn.apache.org/r1498712 ]

        SOLR-4916: Do not run hdfs tests on FreeBSD because they do not play nice with blackhole

        ASF subversion and git services added a comment -

        Commit 1498711 from Mark Miller
        [ https://svn.apache.org/r1498711 ]

        SOLR-4916: Do not run hdfs tests on Windows as it requires cygwin

        ASF subversion and git services added a comment -

        Commit 1498710 from Mark Miller
        [ https://svn.apache.org/r1498710 ]

        SOLR-4916: add assume false to test for java 8

        ASF subversion and git services added a comment -

        Commit 1498707 from Mark Miller
        [ https://svn.apache.org/r1498707 ]

        SOLR-4916: Update NOTICE file and remove log4j from test-framework dependencies

        ASF subversion and git services added a comment -

        Commit 1498702 from Mark Miller
        [ https://svn.apache.org/r1498702 ]

        SOLR-4916: Add support to write and read Solr index files and transaction log files to and from HDFS.

        The Heavy Commit Tag Bot added a comment -

        [trunk commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1497563

        SOLR-4916: Maven configuration for the new HDFS deps

        Robert Muir added a comment -

        Permissions can be done in Java 6 too:
        File.canRead/canExecute/canWrite/setReadable/setExecutable/setWritable

        I don't understand why user groups should be necessary.
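A small sketch of the Java 6 File permission API Robert refers to: toggling the write bit on a temp file without shelling out to chmod. The class and method names here are illustrative wrappers; only the java.io.File calls are the real API.

```java
import java.io.File;
import java.io.IOException;

// Sketch: query and toggle permission bits with the Java 6 File API.
public class PermDemo {
    // Turns the write bit off and back on for a temp file, then reports
    // whether the file is writable afterwards. IOExceptions are treated
    // as failure so callers need no checked-exception handling.
    static boolean writableAfterReset() {
        try {
            File f = File.createTempFile("perm-demo", ".tmp");
            f.setWritable(false);   // analogous to chmod u-w
            f.setWritable(true);    // analogous to chmod u+w
            boolean writable = f.canWrite();
            f.delete();
            return writable;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("writable after reset: " + writableAfterReset());
    }
}
```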

        Andrzej Bialecki added a comment -

        Don't shoot the messenger: I'm just reporting what's already there, and I agree it's somewhat crazy, but some information was not available in pure Java < 7, for example file permissions and user groups.

        Mark Miller made changes -
        Link This issue is related to HADOOP-9643 [ HADOOP-9643 ]
        Robert Muir added a comment -

        I agree, this is crazy. With Java 6 you can implement all of these commands in pure Java.

        Uwe Schindler added a comment - edited

        > Uwe, these shell commands are used because Hadoop has to run on Java 6. In addition to 'df' it uses 'whoami' and 'ls'.

        • whoami: System.getProperty("user.name")
        • ls: WTF??
        • df: new File(path).getFreeSpace(); to list all mount points, File#listRoots()
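Uwe's replacements can be shown in a few lines. The wrapper class and method names below are illustrative; the System.getProperty and java.io.File calls are the actual Java 6 API.

```java
import java.io.File;

// Pure-Java equivalents for the shell commands Hadoop forks out to,
// per Uwe's list: 'whoami' and 'df' (both available since Java 6).
public class NoShellDemo {
    // 'whoami' equivalent: the current user name.
    static String whoami() { return System.getProperty("user.name"); }

    // 'df' equivalent for a single path: bytes free on its partition.
    static long freeBytes(String path) { return new File(path).getFreeSpace(); }

    public static void main(String[] args) {
        System.out.println("user: " + whoami());
        System.out.println("free: " + freeBytes("."));
        // File.listRoots() enumerates the mount points / drive roots.
        for (File root : File.listRoots()) {
            System.out.println(root + " total=" + root.getTotalSpace());
        }
    }
}
```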
        Uwe Schindler made changes -
        Link This issue is broken by HADOOP-9643 [ HADOOP-9643 ]
        Mark Miller added a comment -

        5x: r1497468 Fix test to close properly
        URL: http://svn.apache.org/r1497468

        Andrzej Bialecki added a comment -

        Uwe, these shell commands are used because Hadoop has to run on Java 6. In addition to 'df' it uses 'whoami' and 'ls'.

        > There is for sure a config option, let me revisit the source code

        I wish it were so ... I use a set of AspectJ hacks to remove this dependency from Hadoop binaries to run tests on Windows.

        Mark Miller added a comment -

        5x: r1497458 Do not run hdfs tests on FreeBSD because they do not play nice with blackhole
        URL: http://svn.apache.org/r1497458

        Uwe Schindler added a comment -

        We have one more failing windows test, maybe some unclosed file:

        [junit4:junit4] Suite: org.apache.solr.store.blockcache.BlockDirectoryTest
        [junit4:junit4]   2> 544841 T1593 oassb.BlockDirectory.<init> Block cache on write is disabled
        [junit4:junit4]   2> 544842 T1593 oassb.BlockDirectory.<init> Block cache on read is disabled
        [junit4:junit4]   2> NOTE: reproduce with: ant test -Dtestcase=BlockDirectoryTest -Dtests.method=testEOF -Dtests.seed=D041A49D09140737 -Dtests.slow=true -Dtests.locale=cs_CZ -Dtests.timezone=US/Arizona -Dtests.file.encoding=Cp1252
        [junit4:junit4] ERROR   0.11s J2 | BlockDirectoryTest.testEOF <<<
        [junit4:junit4]    > Throwable #1: java.io.IOException: Unable to delete file: .\org.apache.solr.store.hdfs.HdfsDirectory-1372353019215\normal\test.eof
        [junit4:junit4]    >    at __randomizedtesting.SeedInfo.seed([D041A49D09140737:412AE6954B30A14B]:0)
        [junit4:junit4]    >    at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1919)
        [junit4:junit4]    >    at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1399)
        [junit4:junit4]    >    at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1331)
        [junit4:junit4]    >    at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1910)
        [junit4:junit4]    >    at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1399)
        [junit4:junit4]    >    at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1331)
        [junit4:junit4]    >    at org.apache.solr.store.blockcache.BlockDirectoryTest.tearDown(BlockDirectoryTest.java:118)
        [junit4:junit4]    >    at java.lang.Thread.run(Thread.java:724)
        
        Uwe Schindler added a comment -

        I would prefer to disable this functionality in Mini-Slowdoop-Cluster. There is for sure a config option, let me revisit the source code. There is no reason to run DF from Java, you can do all this with Java 7's Path API. And for tests, disk free is not a problem at all, so could be left out.
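
        As a sketch of that Java 7 route (illustrative only - not what Hadoop actually does), the NIO.2 FileStore API provides everything 'df' does:

```java
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NioDfSketch {
    public static void main(String[] args) throws Exception {
        // Free/usable space for the filesystem holding the working directory
        FileStore store = Files.getFileStore(Paths.get("."));
        System.out.println(store.name() + " usable=" + store.getUsableSpace());

        // Roughly what 'df' prints: one line per mounted filesystem
        for (FileStore fs : FileSystems.getDefault().getFileStores()) {
            System.out.println(fs.name() + " type=" + fs.type()
                + " total=" + fs.getTotalSpace());
        }
    }
}
```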

        The execute permission is also enabled for the JRE crash tester. It spawns another JRE, so it needs execute permission (this is why it is there). The policy file is there to not allow any test to work outside the test sandbox and e.g. modify files in the checkout. If you can execute commands, this cannot be checked anymore - that was the reason why I wanted to disable this.

        Mark Miller added a comment -

        5x r1497451 Do not run hdfs tests on Windows as it requires cygwin
        URL: http://svn.apache.org/r1497451

        Mark Miller added a comment -

        We can also not check that the shell call does not do any harm outside the java sandbox.

        We simply use a java client when running in Solr. This is just for tests - to run an hdfs filesystem to test against. To run an hdfs env we have to run what the Apache Hadoop project makes.

        As a side note, I think it's way overkill to ban shelling out in Lucene/Solr. Chill out policeman

        Uwe Schindler added a comment -

        It should not call "df" at all (also not on unix!). This is not good and platform independent at all. We can also not check that the shell call does not do any harm outside the java sandbox.

        I was about to remove execute permissions from the policy file. Currently this test is (fortunately) the only one calling shell commands!
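
        For context, the execute permission under discussion is a standard Java security-policy grant; a minimal illustrative policy entry looks like:

```
grant {
  // allows spawning external processes (needed by the JRE crash tester)
  permission java.io.FilePermission "<<ALL FILES>>", "execute";
};
```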

        Mark Miller added a comment -

        It looks like we will have to ignore hdfs tests on Windows for now - running on Windows requires Cygwin - it seems the current Windows failures happen while trying to make 'df' shell calls from the NameNode.

        Jack Krupansky added a comment - - edited

        To be clear, this specific Jira is only about reading and writing of internal Solr files from HDFS, not indexing of user data from HDFS, correct?

        In other words, this would not provide support for reading from HDFS by the "stream.file" and "stream.url" update parameters, correct?

        But, external file fields would be covered, correct? As well as all other Solr "conf" configuration files (like stopwords)?

        Mark Miller added a comment -

        There are some jenkins/test issues to look into.

        Mark Miller added a comment -

        5x: r1497159 add assume false to test for java 8

        Uwe Schindler added a comment -

        Thanks!

        Mark Miller added a comment -

        5x: r1497133 Update NOTICE file and remove log4j from test-framework dependencies

        Mark Miller added a comment -

        You missed to add the new dependencies to NOTICE.txt.

        Yeah, I only added blur - I'll look at what else should be added.

        Is there no solution for the jetty 6.1.26 classes?

        I'm not really concerned about it atm - it's a dependency that the tests have for running the namenode - it's only a test framework dependency, and all of the package names are different than in later versions of jetty, so they don't overlap.

        Uwe Schindler added a comment -

        Mark: You missed to add the new dependencies to NOTICE.txt. This is especially important for CDDL files. They have to be listed in NOTICE.txt

        Is there no solution for the jetty 6.1.26 classes? Can we add easymocks to work around the outdated jetty version, if it's not actually used, just referred to? The existence of jetty 6.1.26 is horrible; I would have -1'd it if I had seen it earlier.

        Mark Miller added a comment -

        The commit-tag-bot user cannot currently log in - JIRA wants him to solve a captcha to log in, but doesn't seem to accept any of my answers. Bleh.

        Committed:
        5x: 1497072

        Steve Rowe added a comment -

        Who knows what evil lurks in the heart of Solr? The shadow maven build knows...

        Mark Miller added a comment -

        Okay, now that SOLR-4926 is squared away, I am ready to commit this.

        I'm sure that it's going to make the shadow maven build angry.

        Mark Miller made changes -
        Attachment SOLR-4916.patch [ 12588686 ]
        Mark Miller added a comment -

        New Patch:

        • Merged up to trunk.
        • Consolidates hadoop versions in ivy files like is already done with jetty - set the version in one spot in the ivy file.
        • Works around random test fail caused by https://issues.apache.org/jira/browse/HADOOP-9643
        Mark Miller added a comment - - edited

        It doesn't greatly affect other parts of Solr, it's not some big experimental change, so I intend to first commit to 5x and see how jenkins likes things and then backport to 4.x.

        A lot of the core changes for this have slowly gone into 4.x long ago - including issues around making custom Directories first class in Solr and other little changes.

        This builds to run against Apache Hadoop 2.0.5-alpha. I don't suspect that will be easily 'pluggable', but it will be easy enough to change the ivy files to point to another Hadoop distro, fix any compile time errors (if there are any), run the tests, and build Solr.

        Because our dependency is on client code that talks to hdfs, I suspect that it will work fine as is with most distros based on the same version of Apache Hadoop - and probably other versions as well in many cases.

        Jack Krupansky added a comment -

        Is this intended to be a 5.0-only feature, or 4.x or maybe 4.5 or maybe even 4.4?

        Aren't there a lot of different distributions of Hadoop? So, when Andrzej mentions this patch adding Hadoop as a core Solr dependency, what exactly will that dependency be? 1.2.0? Or, will the Hadoop release be pluggable/configurable?

        Mark Miller made changes -
        Attachment SOLR-4916.patch [ 12588429 ]
        Mark Miller added a comment -

        Patch to trunk.

        Mark Miller added a comment -

        I think this is a pretty solid base to iterate on, so I'd like to commit before long to minimize the cost of keeping this set of changes in sync. I'll upload a patch updated to trunk in a bit.

        Mark Miller added a comment -

        The Patch:

        • An HdfsDirectory implementation that uses a BlockDirectory to cache (read/write) hdfs blocks.

        The default index codec currently supports append only filesystems, so impl is fairly straightforward and effective. It would be interesting if we could easily tell if a codec was append only.

        • An HdfsDirectoryFactory to hook this into Solr.

        Now that Directory is a first class citizen in Solr, this allows pretty much everything to work on hdfs with few other tweaks, including Replication.
        Adds a new option to DirectoryFactory to let Searchers explicitly reserve commit points - no delete-on-last-close like Unix, and no delete-while-in-use failures like Windows.

        • An HdfsUpdateLog that allows writing the transaction log to hdfs as well.

        I talked to Yonik a while back and I think we are in agreement that we don't want to currently support making a pluggable UpdateLog - so this one is built in and triggers on using an hdfs:// prefixed update log path.

        • An HdfsLockFactory.

        Simple impl to write lock files to hdfs rather than the local filesystem.

        • SOLR-4655 Overseer should assign node names

        Includes the work for SOLR-4566 - while a good general improvement, this is also important for this patch because we use the node name in hdfs paths - if a different machine takes over for that path, it's awkward to have the address for another machine as part of it.

        • Tests

        There are a few new tests specifically written for HDFS. There are also a bunch of new tests that simply run the current pertinent SolrCloud tests against hdfs. Because the SolrCloud tests are already so long, on a slower machine this can greatly increase the test run time. It's almost no noticeable slowdown on my 6 core machine, but it's pretty awful on my 2 core machine. To deal with this, in my patch I have made the tests that are functionally equivalent to current tests but run against hdfs only run nightly.
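
        The block-cache idea behind the HdfsDirectory/BlockDirectory item above can be sketched as a toy LRU cache of fixed-size file blocks (a hedged illustration - names and sizes are hypothetical, not Solr's actual code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BlockCacheSketch {
    static final int BLOCK_SIZE = 8192; // hypothetical block size

    private final Map<String, byte[]> blocks;

    public BlockCacheSketch(final int maxBlocks) {
        // access-order LinkedHashMap gives simple LRU eviction
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                return size() > maxBlocks;
            }
        };
    }

    // key is (file name, block index); a read at offset N maps to block N / BLOCK_SIZE
    public byte[] get(String file, long blockIndex) {
        return blocks.get(file + "@" + blockIndex);
    }

    public void put(String file, long blockIndex, byte[] block) {
        blocks.put(file + "@" + blockIndex, block);
    }
}
```

        Reads check the cache first and only hit HDFS on a miss; the real BlockDirectory additionally caches on the write path when "block cache on write" is enabled.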

        Mark Miller added a comment -

        Thanks for taking a look AB!

        Re. Hadoop dependencies: the patch adds a hard dependency on Hadoop and its dependencies directly to Solr core. I wonder if it's possible to refactor it so that it could be optional and the functionality itself moved to contrib/ - this way only users who want to use HdfsDirectory would need Hadoop deps.

        Yeah, I don't really believe in Solr contribs - they are not so useful IMO - it's a pain to actually pull them out and it has to be done after the fact. Given that the size of the dependencies is such a small percentage of the current size, that we don't want to support the UpdateLog as actually pluggable, and that it would be nice for hdfs to be supported out of the box just like the local filesystem, I don't see being a contrib as much of a win. It saves a few megabytes when we are already well over 100 - and that's if you are willing to pull it apart after you download it. From what I've seen, even with the huge extract contrib, most people don't bother repackaging. It's hard to imagine they would for a few megabytes.

        Cache and BlockCache implementation

        We have done some casual benchmarking - loading tweets at a high rate of speed while sending queries at a high rate of speed with 1 second NRT - essentially the worst case NRT scenario. By and large, performance has been similar to local filesystem performance. We will likely share some numbers when we have some less casual results. You do of course have to warm up the block cache before it really kicks in.

        In terms of impl, as I mentioned, the orig HdfsDirectory comes from the Blur guys - we tried not to change it too much currently - not until we figure out if we might evolve it with them in the future - eg as a Lucene module or something.

        Andrzej Bialecki added a comment -

        Mark, this functionality looks very cool!

        Re. Hadoop dependencies: the patch adds a hard dependency on Hadoop and its dependencies directly to Solr core. I wonder if it's possible to refactor it so that it could be optional and the functionality itself moved to contrib/ - this way only users who want to use HdfsDirectory would need Hadoop deps.

        Re. Cache and BlockCache implementation - I did something similar in Luke's FsDirectory, where I decided to use Ehcache, although that implementation was read-only so it was much simpler. Performance improvements for repeated searches were of course dramatic, not so much for unique queries though. Do you have some preliminary benchmarks for this implementation, how much slower is the indexing / searching? Anyway, doing an Ehcache-based implementation of Cache with your patch seems straightforward, too.

        There's very little javadoc / package docs for the new public classes and packages.

        What are HdfsDirectory.LF_EXT and getNormalNames() for?

        Mark Miller added a comment - - edited

        FYI - this patch incorporates SOLR-4655 as mentioned above.

        Mark Miller made changes -
        Attachment SOLR-4916.patch [ 12587702 ]
        Mark Miller added a comment -

        A first patch - more commentary to come.

        Mark Miller made changes -
        Field Original Value New Value
        Link This issue incorporates SOLR-4655 [ SOLR-4655 ]
        Mark Miller added a comment -

        I've figured out what was going on with the jetty issue. All tests are currently passing - I'll post my first patch tomorrow.

        Mark Miller added a comment -

        For the first issue (the 2 shard split test fails):

        The change that I think actually caused this to start failing is when Shalin made it so that if waitForState timed out, it failed the split. I was missing one small piece from SOLR-4655 where we set the correct coreNodeName for the subshard when we waitForState - with that change, the shard split tests are passing.

        For initial issues, that only leaves problem one (jetty and the cp issue) to deal with.

        Mark Miller added a comment -

        The second issue is actually a fail in TestCloudManagedSchemaAddField (or it also can happen in TestCloudManagedSchema). It seems to depend on the luck of the classpath - both tests pass when run in eclipse for me.

        Unfortunately, the issue seems to be a jar hell issue. Some hdfs test classes we need require Jetty 6.1.26 on the test classpath. Previously, probably because of a lot of package name changes in Jetty from 6 to 8, none of the tests had a problem with both 6.1.26 and 8.1.10 on the classpath. It seems one or both of these tests do have a problem, though.

        Not sure what the solution might be yet.

        Mark Miller added a comment -

        I'm working on getting a first rev for this in line with trunk and 4x. Patrick Hunt and I did the bulk of the implementation while Greg Chanan added the support for talking to a kerberos'd hdfs.

        We borrowed the HdfsDirectory from Apache Blur and contributed some code back to them. Initially, it's in Solr because that was easiest and satisfied my initial goals - however, someone that knows Lucene modules a bit better than me might want to take on moving it later on, with the idea that we might even collaborate with the Apache Blur guys on it in one location.

        I'm close to having an initial patch. I have to take care of 2 test fails that I think are related to the changes in SOLR-4655, and investigate a test fail in TestCloudManagedSchema.

        Mark Miller created issue -

          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 2
          • Watchers: 17