Solr
  1. Solr
  2. SOLR-7928

Improve CheckIndex to work against HdfsDirectory

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: hdfs
    • Labels:
      None

      Description

      CheckIndex is very useful for testing an index for corruption. However, it can only work with an index on an FSDirectory, meaning that if you need to check an Hdfs Index, then you have to download it to local disk (which can be very large).

      We should have a way to natively check index on hdfs for corruption.

      1. SOLR-7928.patch
        35 kB
        Gregory Chanan
      2. SOLR-7928.patch
        33 kB
        Mike Drob
      3. SOLR-7928.patch
        33 kB
        Gregory Chanan
      4. SOLR-7928.patch
        33 kB
        Gregory Chanan
      5. SOLR-7928.patch
        33 kB
        Mike Drob
      6. SOLR-7928.patch
        17 kB
        Mike Drob
      7. SOLR-7928.patch
        17 kB
        Mike Drob
      8. SOLR-7928.patch
        17 kB
        Mike Drob

        Issue Links

          Activity

          Hide
          Mike Drob added a comment -

          Attached is one possible approach. This adds a new command line tool, CheckHdfsIndex that builds the HdfsDirectory before passing it to the CheckIndex internals. This approach avoids adding a dependency from lucene to hdfs, leaving it only in solr, which may be more desirable than adding direct support in CheckIndex.

          Show
          Mike Drob added a comment - Attached is one possible approach. This adds a new command line tool, CheckHdfsIndex that builds the HdfsDirectory before passing it to the CheckIndex internals. This approach avoids adding a dependency from lucene to hdfs, leaving it only in solr, which may be more desirable than adding direct support in CheckIndex.
          Hide
          Mike Drob added a comment -

          Bump.

          Show
          Mike Drob added a comment - Bump.
          Hide
          Robert Muir added a comment -

          I'm against this. Its a mistake CheckIndex is not a final class.

          Just use its API, please, no need to subclass.

            /** Create a new CheckIndex on the directory. */
            public CheckIndex(Directory dir) throws IOException {
          
          Show
          Robert Muir added a comment - I'm against this. Its a mistake CheckIndex is not a final class. Just use its API, please, no need to subclass. /** Create a new CheckIndex on the directory. */ public CheckIndex(Directory dir) throws IOException {
          Hide
          Mike Drob added a comment -

          Robert, thanks for looking. I'll put up a new patch without subclassing shortly. I can also make CheckIndex final in my patch, or save that for a separate issue to minimize/contain the changeset.

          As I'm working on this, it looks like I'm going to have to increase the visibility on a bunch of things in CheckIndex to avoid code duplication. Let me know if you think this is going to be a problem.

          Show
          Mike Drob added a comment - Robert, thanks for looking. I'll put up a new patch without subclassing shortly. I can also make CheckIndex final in my patch, or save that for a separate issue to minimize/contain the changeset. As I'm working on this, it looks like I'm going to have to increase the visibility on a bunch of things in CheckIndex to avoid code duplication. Let me know if you think this is going to be a problem.
          Hide
          Mike Drob added a comment -

          Just took another look at this, the patch is still good against current trunk. Does anybody have any thoughts on it?

          Show
          Mike Drob added a comment - Just took another look at this, the patch is still good against current trunk. Does anybody have any thoughts on it?
          Hide
          Gregory Chanan added a comment -

          Looks good. Some questions/comments:

           // Need to be public for CheckHdfsIndex
              public String indexPath = null;
              public String dirImpl = null;
              public PrintStream out = null;
          

          You just need to read these publicly right? Perhaps just write public accessors? I'm also not sure why you want to possibly modify "out", but not a big deal.

          Testing of the HdfsCheckIndex looks pretty minimal...can we reuse TestCheckIndex in some way? I'm thinking like changing each test in there to just take a directory that you pass in. In lucene we use newDirectory, in your test we use an HdfsDirectory. Thoughts?

             try (Directory dir = directory; CheckIndex checker = new CheckIndex(dir)) {
                opts.out = System.out;
                return checker.doCheck(opts);
              }
          

          Any plans to write a MapReduce Tool to do this?

          Show
          Gregory Chanan added a comment - Looks good. Some questions/comments: // Need to be public for CheckHdfsIndex public String indexPath = null ; public String dirImpl = null ; public PrintStream out = null ; You just need to read these publicly right? Perhaps just write public accessors? I'm also not sure why you want to possibly modify "out", but not a big deal. Testing of the HdfsCheckIndex looks pretty minimal...can we reuse TestCheckIndex in some way? I'm thinking like changing each test in there to just take a directory that you pass in. In lucene we use newDirectory, in your test we use an HdfsDirectory. Thoughts? try (Directory dir = directory; CheckIndex checker = new CheckIndex(dir)) { opts.out = System .out; return checker.doCheck(opts); } Any plans to write a MapReduce Tool to do this?
          Hide
          Mike Drob added a comment -

          You just need to read these publicly right? Perhaps just write public accessors?

          Done.

          Testing of the HdfsCheckIndex looks pretty minimal...can we reuse TestCheckIndex in some way? I'm thinking like changing each test in there to just take a directory that you pass in. In lucene we use newDirectory, in your test we use an HdfsDirectory. Thoughts?

          So... this is a good idea in theory, but in practice it gets really difficult to do. TestCheckIndex isn't visible from the Solr test classes unless we start publishing Lucene test artifacts, which I don't think we want to do. I think we can get away with minimal testing here because we aren't changing any of the functionality, and that's all covered in the Lucene test suite. For our purposes, I think it is enough to establish that if you have an HDFS cluster, you can point this tool at it, and it will run.

          Any plans to write a MapReduce Tool to do this?

          Sure, after this gets committed I'll open up a new JIRA and we can discuss there.

          Show
          Mike Drob added a comment - You just need to read these publicly right? Perhaps just write public accessors? Done. Testing of the HdfsCheckIndex looks pretty minimal...can we reuse TestCheckIndex in some way? I'm thinking like changing each test in there to just take a directory that you pass in. In lucene we use newDirectory, in your test we use an HdfsDirectory. Thoughts? So... this is a good idea in theory, but in practice it gets really difficult to do. TestCheckIndex isn't visible from the Solr test classes unless we start publishing Lucene test artifacts, which I don't think we want to do. I think we can get away with minimal testing here because we aren't changing any of the functionality, and that's all covered in the Lucene test suite. For our purposes, I think it is enough to establish that if you have an HDFS cluster, you can point this tool at it, and it will run. Any plans to write a MapReduce Tool to do this? Sure, after this gets committed I'll open up a new JIRA and we can discuss there.
          Hide
          Mike Drob added a comment -

          New patch that addresses a few of Greg's concerns.

          Show
          Mike Drob added a comment - New patch that addresses a few of Greg's concerns.
          Hide
          Uwe Schindler added a comment -

          TestCheckIndex isn't visible from the Solr test classes unless we start publishing Lucene test artifacts, which I don't think we want to do.

          You could make an abstract TestCheckIndexBase in Lucene's test framework.

          Show
          Uwe Schindler added a comment - TestCheckIndex isn't visible from the Solr test classes unless we start publishing Lucene test artifacts, which I don't think we want to do. You could make an abstract TestCheckIndexBase in Lucene's test framework.
          Hide
          Mike Drob added a comment -

          You could make an abstract TestCheckIndexBase in Lucene's test framework.

          Good idea. Patch attached that does this.

          Show
          Mike Drob added a comment - You could make an abstract TestCheckIndexBase in Lucene's test framework. Good idea. Patch attached that does this.
          Hide
          Gregory Chanan added a comment -

          Only one nit. I can change on commit:

          private static BaseTestCheckIndex testCheckIndex;
          

          doesn't look like this needs to be static.

          Show
          Gregory Chanan added a comment - Only one nit. I can change on commit: private static BaseTestCheckIndex testCheckIndex; doesn't look like this needs to be static.
          Hide
          Gregory Chanan added a comment -

          Looks like this conflicts with LUCENE-6866. I attached a patch with a rebase plus the static change.

          Show
          Gregory Chanan added a comment - Looks like this conflicts with LUCENE-6866 . I attached a patch with a rebase plus the static change.
          Hide
          Mike Drob added a comment -

          The rebase looks good to me, thanks for taking care of that.

          Show
          Mike Drob added a comment - The rebase looks good to me, thanks for taking care of that.
          Hide
          Mike Drob added a comment -

          Gregory Chanan - What are your thoughts on this?

          Show
          Mike Drob added a comment - Gregory Chanan - What are your thoughts on this?
          Hide
          Gregory Chanan added a comment -

          Updated patch.

          Uses HdfsTestUtil for getting the configuration.
          Adds some javadoc that was failing on precommit.

          CheckHdfsIndexTest.testDeletedDocs fails pretty regularly for me, though never if I just run it alone. That suggests prior tests are leaving crud around in the Hdfs Directory. Perhaps we should create a new directory for each test?

          Show
          Gregory Chanan added a comment - Updated patch. Uses HdfsTestUtil for getting the configuration. Adds some javadoc that was failing on precommit. CheckHdfsIndexTest.testDeletedDocs fails pretty regularly for me, though never if I just run it alone. That suggests prior tests are leaving crud around in the Hdfs Directory. Perhaps we should create a new directory for each test?
          Hide
          Mike Drob added a comment -

          Modified the patch to delete the directory between every test. I wasn't getting the same failures as you were, Greg, but I suspect that this will take care of them and is much faster than the alternative approach of starting up a new DFS Cluster each time.

          Added a couple more javadocs in addition to what you had. precommit is still failing on build/docs/core/org/apache/lucene/index/CheckIndex.Options.html missing Constructors: Options--, but I'm not sure how to cleanly address that. There is no constructor defined, so the docs must be picking up the default one, which we can't document since it doesn't exist...

          Show
          Mike Drob added a comment - Modified the patch to delete the directory between every test. I wasn't getting the same failures as you were, Greg, but I suspect that this will take care of them and is much faster than the alternative approach of starting up a new DFS Cluster each time. Added a couple more javadocs in addition to what you had. precommit is still failing on build/docs/core/org/apache/lucene/index/CheckIndex.Options.html missing Constructors: Options-- , but I'm not sure how to cleanly address that. There is no constructor defined, so the docs must be picking up the default one, which we can't document since it doesn't exist...
          Hide
          Gregory Chanan added a comment -

          Here's a patch that passes precommit. Added an explicit 0-arg constructor (there are other place sin the code that do that for javadoc). Also added a package-info.html.

          Let me know if this looks good to you Mike Drob and I'll commit it.

          Show
          Gregory Chanan added a comment - Here's a patch that passes precommit. Added an explicit 0-arg constructor (there are other place sin the code that do that for javadoc). Also added a package-info.html. Let me know if this looks good to you Mike Drob and I'll commit it.
          Hide
          Mike Drob added a comment -

          LGTM!

          Show
          Mike Drob added a comment - LGTM!
          Hide
          ASF subversion and git services added a comment -

          Commit 1717340 from gchanan@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1717340 ]

          SOLR-7928: Improve CheckIndex to work against HdfsDirectory

          Show
          ASF subversion and git services added a comment - Commit 1717340 from gchanan@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1717340 ] SOLR-7928 : Improve CheckIndex to work against HdfsDirectory
          Hide
          ASF subversion and git services added a comment -

          Commit 1717342 from gchanan@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1717342 ]

          SOLR-7928: Improve CheckIndex to work against HdfsDirectory

          Show
          ASF subversion and git services added a comment - Commit 1717342 from gchanan@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1717342 ] SOLR-7928 : Improve CheckIndex to work against HdfsDirectory
          Hide
          Gregory Chanan added a comment -

          Committed to 5.5 and trunk, thanks Mike!

          Show
          Gregory Chanan added a comment - Committed to 5.5 and trunk, thanks Mike!
          Hide
          ASF subversion and git services added a comment -

          Commit 1717367 from gchanan@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1717367 ]

          SOLR-7928: Add eol-style native

          Show
          ASF subversion and git services added a comment - Commit 1717367 from gchanan@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1717367 ] SOLR-7928 : Add eol-style native
          Hide
          ASF subversion and git services added a comment -

          Commit 1717368 from gchanan@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1717368 ]

          SOLR-7928: Add eol-style native

          Show
          ASF subversion and git services added a comment - Commit 1717368 from gchanan@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1717368 ] SOLR-7928 : Add eol-style native

            People

            • Assignee:
              Gregory Chanan
              Reporter:
              Mike Drob
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development