Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1654

Include diagnostics per-segment when writing a new segment

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be very helpful if each segment in an index included
      diagnostic information, such as the current version of Lucene.

      EG, in LUCENE-1474 this would be very helpful to see if certain
      segments were written under 2.4.0.

      We can start with just the current version.

      We could also consider making this extensible, so you could provide
      your own arbitrary diagnostics, but SegmentInfo/s is not public so I
      think such an API would be "one-way" in that you'd have to use
      CheckIndex to check on it later. Or we could wait on such extensibility
      until we provide some consistent way to access per-segment details
      in the index.

      1. LUCENE-1654.patch
        107 kB
        Michael McCandless
      2. LUCENE-1654.patch
        105 kB
        Michael McCandless

        Activity

        Hide
        earwin Earwin Burrfoot added a comment - - edited

        Let's have string key-value pairs per-segment, per-commit, per-index?
        Lucene can store debugging data there, user can add something if he needs.

        We can design some unified metadata interface available through IndexReader and won't expose SegmentInfo (doing this with our current back-compat policies is akin to suicide)

        Show
        earwin Earwin Burrfoot added a comment - - edited Let's have string key-value pairs per-segment, per-commit, per-index? Lucene can store debugging data there, user can add something if he needs. We can design some unified metadata interface available through IndexReader and won't expose SegmentInfo (doing this with our current back-compat policies is akin to suicide)
        Hide
        mikemccand Michael McCandless added a comment -

        Let's have string key-value pairs per-segment, per-commit, per-index?

        This would be great, but it's quite a bit more work than having IW include its own diagnostics per-segment.

        Can you open a separate issue for that?

        Show
        mikemccand Michael McCandless added a comment - Let's have string key-value pairs per-segment, per-commit, per-index? This would be great, but it's quite a bit more work than having IW include its own diagnostics per-segment. Can you open a separate issue for that?
        Hide
        earwin Earwin Burrfoot added a comment -

        We can start with string key-value pairs per-segment that are created only from inside Lucene and leave everything else for the latter.

        What I'm trying to avoid is writing a partial solution, then a more generic one (with different implementation) and having to keep that partial solution around because it was already released. I.e. you're going to extend segment file format, right? Let it be key-value pairs from the start, instead of a single Int that is predetermined to be Lucene's version.

        Show
        earwin Earwin Burrfoot added a comment - We can start with string key-value pairs per-segment that are created only from inside Lucene and leave everything else for the latter. What I'm trying to avoid is writing a partial solution, then a more generic one (with different implementation) and having to keep that partial solution around because it was already released. I.e. you're going to extend segment file format, right? Let it be key-value pairs from the start, instead of a single Int that is predetermined to be Lucene's version.
        Hide
        mikemccand Michael McCandless added a comment -

        OK I'll start with that.

        Show
        mikemccand Michael McCandless added a comment - OK I'll start with that.
        Hide
        mikemccand Michael McCandless added a comment -

        OK I added per-segment private storage of diagnostics within IW. It
        records the current Lucene version plus OS & Java version details, and
        the reason for the segment (flush or merge) as a Map<String, String>.

        I also changed the commitUserData to be a Map<String,String> too.
        This API has not yet been released so we are free to change it.

        CheckIndex now prints both the commit level and per-segment level
        maps.

        I added Constants.LUCENE_VERSION to record the current version. I try
        to look up the impl version from the manifest and use that, else I
        fallback to a String constant (now 2.9-dev). I also added a unit test
        that asserts this value matches what's in common-build.xml.

        When I commit I'll update ReleaseTodo on the wiki to remember to
        update this value.

        I plan to commit in a day or two.

        Show
        mikemccand Michael McCandless added a comment - OK I added per-segment private storage of diagnostics within IW. It records the current Lucene version plus OS & Java version details, and the reason for the segment (flush or merge) as a Map<String, String>. I also changed the commitUserData to be a Map<String,String> too. This API has not yet been released so we are free to change it. CheckIndex now prints both the commit level and per-segment level maps. I added Constants.LUCENE_VERSION to record the current version. I try to look up the impl version from the manifest and use that, else I fallback to a String constant (now 2.9-dev). I also added a unit test that asserts this value matches what's in common-build.xml. When I commit I'll update ReleaseTodo on the wiki to remember to update this value. I plan to commit in a day or two.
        Hide
        earwin Earwin Burrfoot added a comment -

        Let's use Collections.EMPTY_MAP instead of new HashMap() for empty maps?
        If I correctly read the patch, you retained HasUserData/CommitUserData in file format description, while it was totally removed from the code (except for back-compat read).

        Show
        earwin Earwin Burrfoot added a comment - Let's use Collections.EMPTY_MAP instead of new HashMap() for empty maps? If I correctly read the patch, you retained HasUserData/CommitUserData in file format description, while it was totally removed from the code (except for back-compat read).
        Hide
        mikemccand Michael McCandless added a comment -

        Excellent catches Earwin, thanks! New patch attached.

        Show
        mikemccand Michael McCandless added a comment - Excellent catches Earwin, thanks! New patch attached.
        Hide
        jasonrutherglen Jason Rutherglen added a comment -

        This seems useful for including debug info?

        Show
        jasonrutherglen Jason Rutherglen added a comment - This seems useful for including debug info?
        Hide
        mikemccand Michael McCandless added a comment -

        This seems useful for including debug info?

        You mean making the API public?

        It's the reading side of that API that makes me nervous (SegmentInfo/s is not public).

        Or do you think it'd be useful to allow one to "setDiagnostics(Map<String,String>)" in IW, but only see such diagnostics on running CheckIndex? That'd be a much smaller change.

        Show
        mikemccand Michael McCandless added a comment - This seems useful for including debug info? You mean making the API public? It's the reading side of that API that makes me nervous (SegmentInfo/s is not public). Or do you think it'd be useful to allow one to "setDiagnostics(Map<String,String>)" in IW, but only see such diagnostics on running CheckIndex? That'd be a much smaller change.
        Hide
        earwin Earwin Burrfoot added a comment -

        It's the reading side of that API that makes me nervous (SegmentInfo/s is not public).

        No-no-no, don't open it up, it's suicidal

        We can theoretically allow access to per-segment data from segment readers, as they have 1-1 relation.
        So, when I finish always-use-MSR patch, you should be able to get per-commit data from MSR and per-segment from SR using the same API. Sounds somewhat dirty, but would work well... needs more thought.

        Show
        earwin Earwin Burrfoot added a comment - It's the reading side of that API that makes me nervous (SegmentInfo/s is not public). No-no-no, don't open it up, it's suicidal We can theoretically allow access to per-segment data from segment readers, as they have 1-1 relation. So, when I finish always-use-MSR patch, you should be able to get per-commit data from MSR and per-segment from SR using the same API. Sounds somewhat dirty, but would work well... needs more thought.
        Hide
        jasonrutherglen Jason Rutherglen added a comment -

        You mean making the API public?

        I was thinking it could be (not sure) useful for debugging or
        assertions in LUCENE-1313. I wasn't sure if this was an intended
        use or would simply be extra noise?

        Show
        jasonrutherglen Jason Rutherglen added a comment - You mean making the API public? I was thinking it could be (not sure) useful for debugging or assertions in LUCENE-1313 . I wasn't sure if this was an intended use or would simply be extra noise?
        Hide
        mikemccand Michael McCandless added a comment -

        OK let's hold off on the public API.

        Jason can you give an example of how you'd use this in LUCENE-1313? I don't see the connection. The initial intention was exactly cases like LUCENE-1474 where I'd really like to know which Lucene version wrote a given segment, in helping to debug.

        Show
        mikemccand Michael McCandless added a comment - OK let's hold off on the public API. Jason can you give an example of how you'd use this in LUCENE-1313 ? I don't see the connection. The initial intention was exactly cases like LUCENE-1474 where I'd really like to know which Lucene version wrote a given segment, in helping to debug.
        Hide
        thetaphi Uwe Schindler added a comment - - edited

        I found a bug for indexes with the string-based userdata:

              if (format <= FORMAT_USER_DATA) {
                if (format <= FORMAT_DIAGNOSTICS) {
                  userData = input.readStringStringMap();
                } else {
                  userData = Collections.EMPTY_MAP;
                  if (0 != input.readByte()) {
                    userData.put("userData", input.readString());
                  }
                }
              } else {
                userData = Collections.EMPTY_MAP;
              }
        

        But the empty maps of Collections are read-only, so put() throws UOE.

        I fix this and commit: userData should be a singleton map, if 0!=readByte(), otherwise EMPTY_MAP.

        Show
        thetaphi Uwe Schindler added a comment - - edited I found a bug for indexes with the string-based userdata: if (format <= FORMAT_USER_DATA) { if (format <= FORMAT_DIAGNOSTICS) { userData = input.readStringStringMap(); } else { userData = Collections.EMPTY_MAP; if (0 != input.readByte()) { userData.put( "userData" , input.readString()); } } } else { userData = Collections.EMPTY_MAP; } But the empty maps of Collections are read-only, so put() throws UOE. I fix this and commit: userData should be a singleton map, if 0!=readByte(), otherwise EMPTY_MAP.
        Hide
        mikemccand Michael McCandless added a comment -

        Woops – thanks Uwe!

        Show
        mikemccand Michael McCandless added a comment - Woops – thanks Uwe!

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development