  1. Lucene - Core
  2. LUCENE-1020

Basic tool for checking & repairing an index

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This has been requested a number of times on the mailing lists. Most
      recently here:

      http://www.gossamer-threads.com/lists/lucene/java-user/53474

      I think we should provide a basic tool out of the box.

      1. LUCENE-1020.take2.patch
        13 kB
        Michael McCandless
      2. LUCENE-1020.patch
        13 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Attached patch: another rev of this tool, with a few minor additions.

        I plan to commit in a day or two.

        mikemccand Michael McCandless added a comment -

        Attached patch.

        I created a first cut at this. It takes the path to the index, opens
        it, and steps through all segments, scanning terms, freq, prox, fields,
        norms, stored fields and term vectors. If it detects anything
        inconsistent and you specified "-fix" on the command line, it writes
        a new segments file that does not reference the bad segments.
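
        As a rough illustration of the "-fix" behavior described above, here is
        a minimal, self-contained sketch. It is not the code in the patch and
        does not use any Lucene API: the class SegmentRepairSketch, the
        SegmentResult type, and the segment names and doc counts are all
        hypothetical. It only shows the decision: without "-fix" every segment
        stays referenced, and with "-fix" the failed ones are left out, so
        their documents are no longer part of the index.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentRepairSketch {

    /** Hypothetical per-segment check result; not a Lucene class. */
    static final class SegmentResult {
        final String name;
        final int docCount;
        final boolean healthy;

        SegmentResult(String name, int docCount, boolean healthy) {
            this.name = name;
            this.docCount = docCount;
            this.healthy = healthy;
        }
    }

    /** Names of the segments a rewritten segments file would still reference. */
    static List<String> segmentsToKeep(List<SegmentResult> results, boolean fix) {
        List<String> keep = new ArrayList<>();
        for (SegmentResult r : results) {
            // Without -fix the index is left untouched, so nothing is dropped.
            if (!fix || r.healthy) {
                keep.add(r.name);
            }
        }
        return keep;
    }

    /** Number of documents that stop being referenced once bad segments are dropped. */
    static int docsLost(List<SegmentResult> results, boolean fix) {
        if (!fix) {
            return 0;
        }
        int lost = 0;
        for (SegmentResult r : results) {
            if (!r.healthy) {
                lost += r.docCount;
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Made-up data: pretend the middle segment failed its checks.
        List<SegmentResult> results = new ArrayList<>();
        results.add(new SegmentResult("_a", 10, true));
        results.add(new SegmentResult("_b", 20, false));
        results.add(new SegmentResult("_c", 30, true));

        System.out.println(segmentsToKeep(results, false)); // [_a, _b, _c]
        System.out.println(segmentsToKeep(results, true));  // [_a, _c]
        System.out.println(docsLost(results, true));        // 20
    }
}
```

        The docsLost figure is the reason for the backup warning below:
        dropping a segment drops every document stored in it.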

        WARNING: this is all brand new code. Be very careful when trying it.
        Make a full backup copy of your index first!

        It also prints useful details about the index (e.g. "roughly" what
        version of Lucene produced it), which can be used as diagnostics
        when debugging problems with an index.

        Below is the output on a healthy index. On an unhealthy index, the
        tool prints 'FAILED' for one or more of the segments and then prints
        the full exception (the reason). Nothing is done to the index
        unless you specify the '-fix' command-line option.

        Healthy index output:

        Opening index @ contrib/benchmark/work/index

        Segments file=segments_3 numSegments=6 version=FORMAT_SHARED_DOC_STORE [Lucene 2.3]
        1 of 6: name=_l docCount=9039
        compound=false
        numFiles=11
        size (MB)=44.276
        docStoreOffset=0
        docStoreSeEgment=_0
        no deletions
        test: open reader.........OK
        test: fields, norms.......OK [3 fields]
        test: terms, freq, prox...OK [391050 terms; 6573991 terms/docs pairs; 20476680 tokens]
        test: stored fields.......OK [27117 total field count; avg 3 fields per doc]
        test: term vectors........OK [18078 total vector count; avg 2 term/freq vector fields per doc]

        2 of 6: name=_16 docCount=9193
        compound=false
        numFiles=11
        size (MB)=44.743
        docStoreOffset=9039
        docStoreSeEgment=_0
        no deletions
        test: open reader.........OK
        test: fields, norms.......OK [3 fields]
        test: terms, freq, prox...OK [391013 terms; 6619615 terms/docs pairs; 20746479 tokens]
        test: stored fields.......OK [27579 total field count; avg 3 fields per doc]
        test: term vectors........OK [18386 total vector count; avg 2 term/freq vector fields per doc]

        3 of 6: name=_1a docCount=3686
        compound=false
        numFiles=11
        size (MB)=11.797
        docStoreOffset=18232
        docStoreSeEgment=_0
        no deletions
        test: open reader.........OK
        test: fields, norms.......OK [3 fields]
        test: terms, freq, prox...OK [164885 terms; 1866591 terms/docs pairs; 5047412 tokens]
        test: stored fields.......OK [11058 total field count; avg 3 fields per doc]
        test: term vectors........OK [5953 total vector count; avg 1.615 term/freq vector fields per doc]

        4 of 6: name=_1f docCount=3987
        compound=false
        numFiles=11
        size (MB)=11.851
        docStoreOffset=21918
        docStoreSeEgment=_0
        no deletions
        test: open reader.........OK
        test: fields, norms.......OK [3 fields]
        test: terms, freq, prox...OK [159546 terms; 1804415 terms/docs pairs; 5199299 tokens]
        test: stored fields.......OK [11961 total field count; avg 3 fields per doc]
        test: term vectors........OK [7547 total vector count; avg 1.893 term/freq vector fields per doc]

        5 of 6: name=_1l docCount=838
        compound=false
        numFiles=11
        size (MB)=3.143
        docStoreOffset=28712
        docStoreSeEgment=_0
        no deletions
        test: open reader.........OK
        test: fields, norms.......OK [3 fields]
        test: terms, freq, prox...OK [68824 terms; 436884 terms/docs pairs; 1281678 tokens]
        test: stored fields.......OK [2514 total field count; avg 3 fields per doc]
        test: term vectors........OK [1617 total vector count; avg 1.93 term/freq vector fields per doc]

        6 of 6: name=_1m docCount=450
        compound=false
        numFiles=11
        size (MB)=2.165
        docStoreOffset=29550
        docStoreSeEgment=_0
        no deletions
        test: open reader.........OK
        test: fields, norms.......OK [3 fields]
        test: terms, freq, prox...OK [53147 terms; 278659 terms/docs pairs; 877940 tokens]
        test: stored fields.......OK [1350 total field count; avg 3 fields per doc]
        test: term vectors........OK [895 total vector count; avg 1.989 term/freq vector fields per doc]

        No problems were detected with this index.
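
        For anyone who wants to post-process output like the above, here is a
        small sketch that collects the names of segments with a FAILED line.
        It makes assumptions: the segment header format is taken from the
        sample above, but no failing run is shown in this issue, so the shape
        of a FAILED line is a guess (any line containing "FAILED" after a
        segment header counts). The class name CheckOutputScan is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CheckOutputScan {

    // Matches headers such as "1 of 6: name=_l docCount=9039".
    private static final Pattern SEGMENT_HEADER =
            Pattern.compile("^\\s*(\\d+) of (\\d+): name=(\\S+) docCount=(\\d+)");

    /** Returns, in order of appearance, the segments that reported a FAILED test. */
    static Set<String> failedSegments(List<String> lines) {
        Set<String> failed = new LinkedHashSet<>();
        String current = null; // segment whose block we are currently inside
        for (String line : lines) {
            Matcher m = SEGMENT_HEADER.matcher(line);
            if (m.find()) {
                current = m.group(3);
                continue;
            }
            // Assumption: a failing test line contains the literal text FAILED.
            if (current != null && line.contains("FAILED")) {
                failed.add(current);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Made-up input: the second segment's FAILED line is invented.
        List<String> lines = Arrays.asList(
                "1 of 2: name=_a docCount=10",
                "test: open reader.........OK",
                "2 of 2: name=_b docCount=20",
                "test: open reader.........FAILED");
        System.out.println(failedSegments(lines)); // [_b]
    }
}
```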


          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless