Lucene - Core
  1. Lucene - Core
  2. LUCENE-3760

Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spinoff from Ryan's dev thread "DR.getCommitUserData() vs DR.getIndexCommit().getUserData()"... these methods are confusing/dups right now.

      1. LUCENE-3760.patch
        24 kB
        Michael McCandless
      2. LUCENE-3760.patch
        9 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch.

        I renamed static DR.getCurrentVersion -> DR.getLastCommitVersion,
        static DR.getCommitUserData -> DR.getLastCommitUserData, removed
        DR.getCommitUserData (it's redundant w/
        DR.getIndexCommit().getUserData()), and cleaned up the javadocs.

        Show
        Michael McCandless added a comment - Patch. I renamed static DR.getCurrentVersion -> DR.getLastCommitVersion, static DR.getCommitUserData -> DR.getLastCommitUserData, removed DR.getCommitUserData (it's redundant w/ DR.getIndexCommit().getUserData()), and cleaned up the javadocs.
        Hide
        Uwe Schindler added a comment -

        Thanks, nice cleanup

        +1 no complaints from the DR refactorer

        Show
        Uwe Schindler added a comment - Thanks, nice cleanup +1 no complaints from the DR refactorer
        Hide
        Robert Muir added a comment -

        Just curiosity: what are the high-level use-cases for these static methods?

        Also:

        • (as mentioned on the list), I like the improvements to the javadocs,
          but shouldn't we do the same for getLastCommitUserData(Directory)? (if we keep it)
        • Should we fix the getCurrentXXX methods in SegmentInfos.java for consistency?
        Show
        Robert Muir added a comment - Just curiosity: what are the high-level use-cases for these static methods? Also: (as mentioned on the list), I like the improvements to the javadocs, but shouldn't we do the same for getLastCommitUserData(Directory)? (if we keep it) Should we fix the getCurrentXXX methods in SegmentInfos.java for consistency?
        Hide
        Ryan McKinley added a comment -

        thanks mike – much more clear

        Show
        Ryan McKinley added a comment - thanks mike – much more clear
        Hide
        Michael McCandless added a comment -

        what are the high-level use-cases for these static methods?

        Well, they let you get the version and commit userData w/o having to open the full index... it's a nice "freedom" (much lower cost than opening a reader), but how applications actually make use of that freedom specifically, I'm not sure?

        (as mentioned on the list), I like the improvements to the javadocs,
        but shouldn't we do the same for getLastCommitUserData(Directory)? (if we keep it)

        Woops, I'll fix...

        Should we fix the getCurrentXXX methods in SegmentInfos.java for consistency?

        Ahh right I will...

        Show
        Michael McCandless added a comment - what are the high-level use-cases for these static methods? Well, they let you get the version and commit userData w/o having to open the full index... it's a nice "freedom" (much lower cost than opening a reader), but how applications actually make use of that freedom specifically, I'm not sure? (as mentioned on the list), I like the improvements to the javadocs, but shouldn't we do the same for getLastCommitUserData(Directory)? (if we keep it) Woops, I'll fix... Should we fix the getCurrentXXX methods in SegmentInfos.java for consistency? Ahh right I will...
        Hide
        Robert Muir added a comment -

        Well, they let you get the version and commit userData w/o having to open the full index... it's a nice "freedom" (much lower cost than opening a reader), but how applications actually make use of that freedom specifically, I'm not sure?

        If we don't know of or cannot imagine any real use cases, then I propose we make them deprecated and package private.

        If someone has a real use case: then they can still do it (we didnt hurt them). If no one speaks up for a whole release cycle, they are gone.

        Show
        Robert Muir added a comment - Well, they let you get the version and commit userData w/o having to open the full index... it's a nice "freedom" (much lower cost than opening a reader), but how applications actually make use of that freedom specifically, I'm not sure? If we don't know of or cannot imagine any real use cases, then I propose we make them deprecated and package private. If someone has a real use case: then they can still do it (we didnt hurt them). If no one speaks up for a whole release cycle, they are gone.
        Hide
        Michael McCandless added a comment -

        Patch, folding on Robert's suggestions...

        Show
        Michael McCandless added a comment - Patch, folding on Robert's suggestions...
        Hide
        Eks Dev added a comment -

        I have a use case for CommitUserData and I think standard solr DIH could benefit from it as well.
        I use it to persist current status of the max Document id (user id, not lucene docid) to know what I have indexed so far (all update commands are stored in the database and have simple incrementing counter). This makes incremental update process "restart- and rollback- safe" as it gets written on lucene commit and read on startup. I do not index this field (not to pollute term dictionary) and I need only to keep max value of it.

        I find it hugely useful, but if you have better ideas on how to safely persist max/min value of the field I am all ears.

        Last time I checked, solr DIH used its own file in cfg directory to persist max(timestamp), which is kind of risky as it is not in sync with lucene commit point under all scenarios.

        I think I even opened an isue on solr jira to expose "user commit data" feature to solr, but I am missing good ideas on how to expose it to solr users (max/min/avg field tracking maybe)...

        Cheers, eks

        Show
        Eks Dev added a comment - I have a use case for CommitUserData and I think standard solr DIH could benefit from it as well. I use it to persist current status of the max Document id (user id, not lucene docid) to know what I have indexed so far (all update commands are stored in the database and have simple incrementing counter). This makes incremental update process "restart- and rollback- safe" as it gets written on lucene commit and read on startup. I do not index this field (not to pollute term dictionary) and I need only to keep max value of it. I find it hugely useful, but if you have better ideas on how to safely persist max/min value of the field I am all ears. Last time I checked, solr DIH used its own file in cfg directory to persist max(timestamp), which is kind of risky as it is not in sync with lucene commit point under all scenarios. I think I even opened an isue on solr jira to expose "user commit data" feature to solr, but I am missing good ideas on how to expose it to solr users (max/min/avg field tracking maybe)... Cheers, eks
        Hide
        Michael McCandless added a comment -

        Eks, do you mean you use the IR/DR static methods to look at the last commit's userData?

        Or that you use commit userData from an already opened IndexReader...?

        Show
        Michael McCandless added a comment - Eks, do you mean you use the IR/DR static methods to look at the last commit's userData? Or that you use commit userData from an already opened IndexReader...?
        Hide
        Eks Dev added a comment -

        whoops, before putting mouth to action, one should use brain... just quickly skimmed over this issue and stumbled on " ...If no one speaks up for a whole release cycle, they are gone..." out of context, so I concluded user data is gone.

        Of course, they do not have to be static, I read it only on restart so even if I do not need open IR, it is not an issue to open it once...
        sorry for the noise

        Show
        Eks Dev added a comment - whoops, before putting mouth to action, one should use brain... just quickly skimmed over this issue and stumbled on " ...If no one speaks up for a whole release cycle, they are gone..." out of context, so I concluded user data is gone. Of course, they do not have to be static, I read it only on restart so even if I do not need open IR, it is not an issue to open it once... sorry for the noise
        Hide
        Uwe Schindler added a comment -

        +1 to commit

        Show
        Uwe Schindler added a comment - +1 to commit
        Hide
        Shai Erera added a comment -

        Sorry guys for not following the issue more closely. We use IndexReader.getCommitUserData in many places across our code.

        One place is when we addIndexes(Directory). We have a need to merge the input Directory's commit data with the target index's commit data. Since we use the optimized writer.addIndexes method, it is a waste to open an IndexReader to retrieve the commit data.

        Another place is where we just need to read the commit data to extract information, without actually needing an IndexReader instance.

        In short, this method is very useful to us, and I don't see the advantage of making it deprecated. I don't mind if we move it to a DirectoryUtil class if it bothers anyone that it's on IndexReader.

        We don't use IndexReader.getVersion so I don't mind if it becomes deprecated, but I wonder, if there's no real advantage, what's the harm of keeping it?

        Show
        Shai Erera added a comment - Sorry guys for not following the issue more closely. We use IndexReader.getCommitUserData in many places across our code. One place is when we addIndexes(Directory). We have a need to merge the input Directory's commit data with the target index's commit data. Since we use the optimized writer.addIndexes method, it is a waste to open an IndexReader to retrieve the commit data. Another place is where we just need to read the commit data to extract information, without actually needing an IndexReader instance. In short, this method is very useful to us, and I don't see the advantage of making it deprecated. I don't mind if we move it to a DirectoryUtil class if it bothers anyone that it's on IndexReader. We don't use IndexReader.getVersion so I don't mind if it becomes deprecated, but I wonder, if there's no real advantage, what's the harm of keeping it?
        Hide
        Uwe Schindler added a comment -

        You don't need to open IR:

        • use the static IR.listCommits(Directory dir)
        • take the last IndexCommit in the list (the newest one)
        • call getCommitUserData()
        Show
        Uwe Schindler added a comment - You don't need to open IR: use the static IR.listCommits(Directory dir) take the last IndexCommit in the list (the newest one) call getCommitUserData()
        Hide
        Uwe Schindler added a comment - - edited

        By the way: Why is the return value of listCommits() Collection<IndexCommit> and not List<IndexCommit>. There is only one implementation, that already uses a List, so for things like getting newest Commit quickly, it might be better to have a List or SortedSet.

        Change is easy (maybe in separate issue). In 4.0 we should change this (3.x is also possible because of backwards using return type overloading).

        Show
        Uwe Schindler added a comment - - edited By the way: Why is the return value of listCommits() Collection<IndexCommit> and not List<IndexCommit>. There is only one implementation, that already uses a List, so for things like getting newest Commit quickly, it might be better to have a List or SortedSet. Change is easy (maybe in separate issue). In 4.0 we should change this (3.x is also possible because of backwards using return type overloading).
        Hide
        Shai Erera added a comment -

        The method that you proposes is not very efficient – why do I need to list all commits just to retrieve the latest's commitData?

        By the way: Why is the return value of listCommits() Collection<IndexCommit> and not List<IndexCommit>

        It is like that for API back-compat. The return value of a method is part of its signature. I don't remember on which issue, but I wanted to do that once, and discovered that it breaks either the backwards tests, or current running code (with the new .jar). That's why the method documents that in practice it returns a sorted List.

        Show
        Shai Erera added a comment - The method that you proposes is not very efficient – why do I need to list all commits just to retrieve the latest's commitData? By the way: Why is the return value of listCommits() Collection<IndexCommit> and not List<IndexCommit> It is like that for API back-compat. The return value of a method is part of its signature. I don't remember on which issue, but I wanted to do that once, and discovered that it breaks either the backwards tests, or current running code (with the new .jar). That's why the method documents that in practice it returns a sorted List.
        Hide
        Uwe Schindler added a comment -

        But we can change that in 4.0?

        Show
        Uwe Schindler added a comment - But we can change that in 4.0?
        Hide
        Uwe Schindler added a comment -

        The method that you proposes is not very efficient – why do I need to list all commits just to retrieve the latest's commitData?

        If you only have 1 commit it does not matter.

        Show
        Uwe Schindler added a comment - The method that you proposes is not very efficient – why do I need to list all commits just to retrieve the latest's commitData? If you only have 1 commit it does not matter.
        Hide
        Shai Erera added a comment -

        But we can change that in 4.0?

        Yes, in 4.0, DirReader.listCommits returns List<>.

        If you only have 1 commit it does not matter.

        Yes, that's right. But I cannot guarantee that (because we sometimes maintain few index commits). What's wrong with just undeprecating the method? It's not like it was deprecated for technical reasons. As said above, nobody thought of a useful scenario to use it, so it was deprecated. But I have a useful scenario.

        Do you agree that I simply undeprecate them?

        Show
        Shai Erera added a comment - But we can change that in 4.0? Yes, in 4.0, DirReader.listCommits returns List<>. If you only have 1 commit it does not matter. Yes, that's right. But I cannot guarantee that (because we sometimes maintain few index commits). What's wrong with just undeprecating the method? It's not like it was deprecated for technical reasons. As said above, nobody thought of a useful scenario to use it, so it was deprecated. But I have a useful scenario. Do you agree that I simply undeprecate them?
        Hide
        Robert Muir added a comment -

        I don't think we should: I don't understand the use case behind these static methods.

        Is it really faster? Do you have any benchmarks showing the gain?

        Show
        Robert Muir added a comment - I don't think we should: I don't understand the use case behind these static methods. Is it really faster? Do you have any benchmarks showing the gain?
        Hide
        Shai Erera added a comment -

        You mean if IndexReader.getCommitUserData() is faster than IndexReader.open().getIndexCommit().getUserData()? Do I really need to benchmark this !?

        The static method just reads the SegmentInfos and returns the commitData. Opening an IndexReader at least opens its all sub-segments, loads norms, deleted docs and what not.

        Look, it's not like I'm trying to add a new API. It existed for a long time, and it happens to be that I'm using it (and maybe others who didn't speak up). So why deprecate it?

        I'm also using it in the SearcherTaxoManager (not yet uploaded to LUCENE-3786) to inspect the commit data on the taxonomy directory before I create a new DirTaxoReader, which is very expensive. So why nuke it?

        Show
        Shai Erera added a comment - You mean if IndexReader.getCommitUserData() is faster than IndexReader.open().getIndexCommit().getUserData()? Do I really need to benchmark this !? The static method just reads the SegmentInfos and returns the commitData. Opening an IndexReader at least opens its all sub-segments, loads norms, deleted docs and what not. Look, it's not like I'm trying to add a new API. It existed for a long time, and it happens to be that I'm using it (and maybe others who didn't speak up). So why deprecate it? I'm also using it in the SearcherTaxoManager (not yet uploaded to LUCENE-3786 ) to inspect the commit data on the taxonomy directory before I create a new DirTaxoReader, which is very expensive. So why nuke it?
        Hide
        Robert Muir added a comment -

        You mean if IndexReader.getCommitUserData() is faster than IndexReader.open().getIndexCommit().getUserData()? Do I really need to benchmark this !?

        Nope, but thats not really the question. I mean in the full picture of the application, is there another way?

        The static method just reads the SegmentInfos and returns the commitData. Opening an IndexReader at least opens its all sub-segments, loads norms, deleted docs and what not.

        wait, i dont think thats totally true. Opening the static method may or may not also read in fieldinfos (as they are tied to segmentinfos in trunk currently). Opening an indexreader doesnt load norms until you ask for them, for example.

        Look, it's not like I'm trying to add a new API. It existed for a long time, and it happens to be that I'm using it (and maybe others who didn't speak up). So why deprecate it?

        Well I thought the API was confusing, especially the overloaded names, but when looking into it, I started to question why it needs to exist.

        I think this is ok: we can't keep adding and adding, I think its good to try to hide some of this stuff.

        Especially in Lucene's trunk: lots of things in o.a.l.index (SegmentInfos, public ctors to SegmentReader, etc) don't really need to be public anymore IMO,
        some of this was expert workarounds for the fact we didn't have an extensible codecs package. We shouldn't try to support 2 "apis" from this perspective,
        accessing parts of the index through expert methods and at the same time having a codec api.

        I'm also using it in the SearcherTaxoManager (not yet uploaded to LUCENE-3786) to inspect the commit data on the taxonomy directory before I create a new DirTaxoReader, which is very expensive. So why nuke it?

        Well its still available in 3.x as deprecated methods... were you planning on making a patch against 3.x for LUCENE-3786 initially?

        here's an idea: how about we defer this discussion until we see your patch? If it makes sense for LUCENE-3786, then we have a valid use case?

        Show
        Robert Muir added a comment - You mean if IndexReader.getCommitUserData() is faster than IndexReader.open().getIndexCommit().getUserData()? Do I really need to benchmark this !? Nope, but thats not really the question. I mean in the full picture of the application, is there another way? The static method just reads the SegmentInfos and returns the commitData. Opening an IndexReader at least opens its all sub-segments, loads norms, deleted docs and what not. wait, i dont think thats totally true. Opening the static method may or may not also read in fieldinfos (as they are tied to segmentinfos in trunk currently). Opening an indexreader doesnt load norms until you ask for them, for example. Look, it's not like I'm trying to add a new API. It existed for a long time, and it happens to be that I'm using it (and maybe others who didn't speak up). So why deprecate it? Well I thought the API was confusing, especially the overloaded names, but when looking into it, I started to question why it needs to exist. I think this is ok: we can't keep adding and adding, I think its good to try to hide some of this stuff. Especially in Lucene's trunk: lots of things in o.a.l.index (SegmentInfos, public ctors to SegmentReader, etc) don't really need to be public anymore IMO, some of this was expert workarounds for the fact we didn't have an extensible codecs package. We shouldn't try to support 2 "apis" from this perspective, accessing parts of the index through expert methods and at the same time having a codec api. I'm also using it in the SearcherTaxoManager (not yet uploaded to LUCENE-3786 ) to inspect the commit data on the taxonomy directory before I create a new DirTaxoReader, which is very expensive. So why nuke it? Well its still available in 3.x as deprecated methods... were you planning on making a patch against 3.x for LUCENE-3786 initially? here's an idea: how about we defer this discussion until we see your patch? If it makes sense for LUCENE-3786 , then we have a valid use case?
        Hide
        Shai Erera added a comment -

        I mean in the full picture of the application, is there another way?

        One of my use cases is a distributed system which receives 'delta' indexes that need to be merged with a bigger one. There, I open FSDirectory and use IndexWriter.addIndexes(dir). I don't need to load the IndexReader. What I do need though is read some time stamps and other metadata that I put in the index's commit data and merge that with the target index's commit data.

        Opening an indexreader doesnt load norms until you ask for them, for example

        This is from SegmentReader's ctor. I followed IndexReader.open() and reached it, without any 'ifs' (unless I missed something):

              instance.core = new SegmentCoreReaders(instance, dir, si, readBufferSize, termInfosIndexDivisor);
              if (doOpenStores) {
                instance.core.openDocStores(si);
              }
              instance.loadDeletedDocs();
              instance.openNorms(instance.core.cfsDir, readBufferSize);
        

        It does load the norms and deleted docs (and maybe even doc stores).

        Well its still available in 3.x as deprecated methods... were you planning on making a patch against 3.x for LUCENE-3786 initially?

        There are two questions here:

        (1) Was I planning to support SearcherTaxoManager in 3.x – YES
        (2) Was I planning to start with a patch against 3.x – Not Necessarily.

        here's an idea: how about we defer this discussion until we see your patch? If it makes sense for LUCENE-3786, then we have a valid use case?

        Why? Isn't the scenario that I described above enough? Maybe you tell me – if all you need to do is addIndexes(Directory) and merge the Directory's commitUserData, would you agree to open an IndexReader just for reading the commit data? Don't you think it's too much?

        Show
        Shai Erera added a comment - I mean in the full picture of the application, is there another way? One of my use cases is a distributed system which receives 'delta' indexes that need to be merged with a bigger one. There, I open FSDirectory and use IndexWriter.addIndexes(dir). I don't need to load the IndexReader. What I do need though is read some time stamps and other metadata that I put in the index's commit data and merge that with the target index's commit data. Opening an indexreader doesnt load norms until you ask for them, for example This is from SegmentReader's ctor. I followed IndexReader.open() and reached it, without any 'ifs' (unless I missed something): instance.core = new SegmentCoreReaders(instance, dir, si, readBufferSize, termInfosIndexDivisor); if (doOpenStores) { instance.core.openDocStores(si); } instance.loadDeletedDocs(); instance.openNorms(instance.core.cfsDir, readBufferSize); It does load the norms and deleted docs (and maybe even doc stores). Well its still available in 3.x as deprecated methods... were you planning on making a patch against 3.x for LUCENE-3786 initially? There are two questions here: (1) Was I planning to support SearcherTaxoManager in 3.x – YES (2) Was I planning to start with a patch against 3.x – Not Necessarily. here's an idea: how about we defer this discussion until we see your patch? If it makes sense for LUCENE-3786 , then we have a valid use case? Why? Isn't the scenario that I described above enough? Maybe you tell me – if all you need to do is addIndexes(Directory) and merge the Directory's commitUserData, would you agree to open an IndexReader just for reading the commit data? Don't you think it's too much?
        Hide
        Shai Erera added a comment -

        You also made that comment:

        If someone has a real use case: then they can still do it (we didnt hurt them). If no one speaks up for a whole release cycle, they are gone.

        I can still use them in 3.x, but won't in trunk (they're gone from IR there already). And I did speak up before the release .

        Show
        Shai Erera added a comment - You also made that comment: If someone has a real use case: then they can still do it (we didnt hurt them). If no one speaks up for a whole release cycle, they are gone. I can still use them in 3.x, but won't in trunk (they're gone from IR there already). And I did speak up before the release .
        Hide
        Robert Muir added a comment -

        It does load the norms and deleted docs (and maybe even doc stores).

        norms are lazy-loaded and cached, when you first ask for them. That just opens the file.

        But still, i dont get it, if you are really worried about these costs, whats wrong with Uwe's suggested method?

        Show
        Robert Muir added a comment - It does load the norms and deleted docs (and maybe even doc stores). norms are lazy-loaded and cached, when you first ask for them. That just opens the file. But still, i dont get it, if you are really worried about these costs, whats wrong with Uwe's suggested method?
        Hide
        Uwe Schindler added a comment -

        I can still use them in 3.x, but won't in trunk (they're gone from IR there already)

        IR does no longer know Directory at all, I think your speak about DirectoryReader in trunk. But there you still can use listCommist()[last]. And thats cheap, just because it creates an ArrayList it is not expensive?

        Show
        Uwe Schindler added a comment - I can still use them in 3.x, but won't in trunk (they're gone from IR there already) IR does no longer know Directory at all, I think your speak about DirectoryReader in trunk. But there you still can use listCommist() [last] . And thats cheap, just because it creates an ArrayList it is not expensive?
        Hide
        Shai Erera added a comment -

        You're right Uwe, I can do that. But it's still expensive to list all the commits if one just wants the latest's commitData.

        Show
        Shai Erera added a comment - You're right Uwe, I can do that. But it's still expensive to list all the commits if one just wants the latest's commitData.
        Hide
        Uwe Schindler added a comment -

        In that case, SegmentInfos is public, so you can use this expert API and read the segmentinfos yourself and ask for anything you want. The code in DR.getCommitUserData is only like 3 lines.

        Show
        Uwe Schindler added a comment - In that case, SegmentInfos is public, so you can use this expert API and read the segmentinfos yourself and ask for anything you want. The code in DR.getCommitUserData is only like 3 lines.
        Hide
        Shai Erera added a comment -

        Yes, I know that I can write it. But it means I'll need to write it in every project that I work on. And so will other people. I really don't get it – what's the problem with having such utility method. If we move it to a DirectoryUtils, will that be better? If we move it to DirectoryReader (on trunk) will that be better?

        I don't understand since when we decided to drop API that our users need, because some people "don't understand the scenario". Is there a technical reason to remove the method? Is it badly implemented? Is it buggy? Is it hard to maintain?

        Show
        Shai Erera added a comment - Yes, I know that I can write it. But it means I'll need to write it in every project that I work on. And so will other people. I really don't get it – what's the problem with having such utility method. If we move it to a DirectoryUtils, will that be better? If we move it to DirectoryReader (on trunk) will that be better? I don't understand since when we decided to drop API that our users need, because some people "don't understand the scenario". Is there a technical reason to remove the method? Is it badly implemented? Is it buggy? Is it hard to maintain?
        Hide
        Shai Erera added a comment -

        Resolving back ... looks like I'm the only one that it bothers.

        Show
        Shai Erera added a comment - Resolving back ... looks like I'm the only one that it bothers.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development