Lucene - Core
  1. Lucene - Core
  2. LUCENE-3433

Random access non RAM resident IndexDocValues (CSF)

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      There should be a way to get specific IndexDocValues by going through the Directory rather than loading all of the values into memory.

      1. LUCENE-3433.patch
        192 kB
        Simon Willnauer
      2. LUCENE-3433.patch
        182 kB
        Simon Willnauer
      3. LUCENE-3433.patch
        135 kB
        Simon Willnauer
      4. LUCENE-3433.patch
        29 kB
        Simon Willnauer
      5. sorted_source.patch
        49 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        yonik, do we really need random access? we already have an enum for on disk lookups which is essentially a DocIDSetIterator.

        Show
        Simon Willnauer added a comment - yonik, do we really need random access? we already have an enum for on disk lookups which is essentially a DocIDSetIterator.
        Hide
        Robert Muir added a comment -

        Simon: I think it would be a really nice feature.

        Imagine someone that has a large collection and maybe quite a few docvalues fields:
        it would be nice if you have a fixed int64 docvalues for it to just be a mmap'ed readLong() for example.

        I think in a lot of cases the memory resident random-access that we have now is going to be great, e.g.
        if someone wants to use it for a scoring factor and its going to be super-hot.

        but in other more advanced cases (especially if you have many of them or access is relatively infrequent)
        it would be a nice option to have available where currently you have to deal with the downsides of
        'fake terms' with a payload or stored fields.

        Show
        Robert Muir added a comment - Simon: I think it would be a really nice feature. Imagine someone that has a large collection and maybe quite a few docvalues fields: it would be nice if you have a fixed int64 docvalues for it to just be a mmap'ed readLong() for example. I think in a lot of cases the memory resident random-access that we have now is going to be great, e.g. if someone wants to use it for a scoring factor and its going to be super-hot. but in other more advanced cases (especially if you have many of them or access is relatively infrequent) it would be a nice option to have available where currently you have to deal with the downsides of 'fake terms' with a payload or stored fields.
        Hide
        Jason Rutherglen added a comment -

        This is somewhat funny, as it seems the opinion has changed on MMap'ing and the potential for page faults:

        http://www.lucidimagination.com/search/document/8951a336dffa9535/storing_and_loading_the_fst_directly_from_disk#8951a336dffa9535

        Show
        Jason Rutherglen added a comment - This is somewhat funny, as it seems the opinion has changed on MMap'ing and the potential for page faults: http://www.lucidimagination.com/search/document/8951a336dffa9535/storing_and_loading_the_fst_directly_from_disk#8951a336dffa9535
        Hide
        Robert Muir added a comment -

        no opinion has changed: traversing on disk automata in the way you describe is inefficient.

        here I describe accessing a fixed long value: 1 seek per access.

        how many seeks will an fst lookup take? hint: its more than 1!

        Show
        Robert Muir added a comment - no opinion has changed: traversing on disk automata in the way you describe is inefficient. here I describe accessing a fixed long value: 1 seek per access. how many seeks will an fst lookup take? hint: its more than 1!
        Hide
        Jason Rutherglen added a comment -

        Here's another thread discussing MMap'ing and field caches, where the consensus is against it:

        http://www.lucidimagination.com/search/document/70623ef5879bca38/fst_and_fieldcache#45006a7fe2847c09 posted in "1969-12-31 19:00"

        Show
        Jason Rutherglen added a comment - Here's another thread discussing MMap'ing and field caches, where the consensus is against it: http://www.lucidimagination.com/search/document/70623ef5879bca38/fst_and_fieldcache#45006a7fe2847c09 posted in "1969-12-31 19:00"
        Hide
        Robert Muir added a comment -

        this isn't the field cache.

        Show
        Robert Muir added a comment - this isn't the field cache.
        Hide
        Simon Willnauer added a comment -

        I looked at the code and figured that there is actually not much difference implementation wise to random access vs. advance(). I basically renamed int advance(int) to int seek(int) which has similar semantics as advance has but can do seek forward and backwards. I think we can go with such an approach since we still have advance etc. and implementation wise there is not much overhead.

        Show
        Simon Willnauer added a comment - I looked at the code and figured that there is actually not much difference implementation wise to random access vs. advance(). I basically renamed int advance(int) to int seek(int) which has similar semantics as advance has but can do seek forward and backwards. I think we can go with such an approach since we still have advance etc. and implementation wise there is not much overhead.
        Hide
        Simon Willnauer added a comment -

        any comments on the patch here? If nobody objects I am going to commit this later today.

        Show
        Simon Willnauer added a comment - any comments on the patch here? If nobody objects I am going to commit this later today.
        Hide
        Chris Male added a comment -

        I don't have any particular comments on the implementation, but I really like the seek() API, its nice and clear so +1 to committing.

        Show
        Chris Male added a comment - I don't have any particular comments on the implementation, but I really like the seek() API, its nice and clear so +1 to committing.
        Hide
        Michael McCandless added a comment -

        Do we have any DV enum impls that cannot move backwards? If not, maybe we should just remove advance() and only have seek()?

        Show
        Michael McCandless added a comment - Do we have any DV enum impls that cannot move backwards? If not, maybe we should just remove advance() and only have seek()?
        Hide
        Robert Muir added a comment -

        Maybe we should take a look at the bigger picture: perhaps DocValues should really just be
        a ByteBuffer where offsets are docids, maybe even just using ByteBuffer etc API?

        Personally i think the seek() versus advance() is overkill, I think having DISI-like enums
        doesn't really make sense for CSF (which is random access by definition), it only
        needs get(int) apis.

        Show
        Robert Muir added a comment - Maybe we should take a look at the bigger picture: perhaps DocValues should really just be a ByteBuffer where offsets are docids, maybe even just using ByteBuffer etc API? Personally i think the seek() versus advance() is overkill, I think having DISI-like enums doesn't really make sense for CSF (which is random access by definition), it only needs get(int) apis.
        Hide
        Simon Willnauer added a comment -

        Maybe we should take a look at the bigger picture: perhaps DocValues should really just be

        a ByteBuffer where offsets are docids, maybe even just using ByteBuffer etc API?

        this looks like a very big change to me. you should open an issue and uplaod a patch so we can see how that looks like for variable bytes etc.

        Personally i think the seek() versus advance() is overkill, I think having DISI-like enums

        doesn't really make sense for CSF (which is random access by definition), it only
        needs get(int) apis.

        I figured adding this is a quick win. I'd rather refine the API in a different issue?

        Show
        Simon Willnauer added a comment - Maybe we should take a look at the bigger picture: perhaps DocValues should really just be a ByteBuffer where offsets are docids, maybe even just using ByteBuffer etc API? this looks like a very big change to me. you should open an issue and uplaod a patch so we can see how that looks like for variable bytes etc. Personally i think the seek() versus advance() is overkill, I think having DISI-like enums doesn't really make sense for CSF (which is random access by definition), it only needs get(int) apis. I figured adding this is a quick win. I'd rather refine the API in a different issue?
        Hide
        Robert Muir added a comment -

        this looks like a very big change to me. you should open an issue and uplaod a patch so we can see how that looks like for variable bytes etc.

        Well this is just discussion? Basically I'm trying to shoot for some simplification here.

        As far as variable bytes, maybe DocValues shouldn't actually implement variable bytes,
        but instead is used only to implement the docid -> offset part. In other words, the variable bytes
        is basically like stored fields today, and docvalues just does the FDX part. This means
        docvalues doesnt need a variable bytes impl at all, you just use a Long impl.

        For the "FDT" part of variable bytes we could have something else? then this would simplify
        DocValues some more and narrow its scope.

        Show
        Robert Muir added a comment - this looks like a very big change to me. you should open an issue and uplaod a patch so we can see how that looks like for variable bytes etc. Well this is just discussion? Basically I'm trying to shoot for some simplification here. As far as variable bytes, maybe DocValues shouldn't actually implement variable bytes, but instead is used only to implement the docid -> offset part. In other words, the variable bytes is basically like stored fields today, and docvalues just does the FDX part. This means docvalues doesnt need a variable bytes impl at all, you just use a Long impl. For the "FDT" part of variable bytes we could have something else? then this would simplify DocValues some more and narrow its scope.
        Hide
        Simon Willnauer added a comment -

        Well this is just discussion? Basically I'm trying to shoot for some simplification here.

        to this seems like a major rewrite and far more than this issue, no? Can you create a new issue for this & the discussion

        Show
        Simon Willnauer added a comment - Well this is just discussion? Basically I'm trying to shoot for some simplification here. to this seems like a major rewrite and far more than this issue, no? Can you create a new issue for this & the discussion
        Hide
        Robert Muir added a comment -

        I don't know, to me it seems to fit this issue perfectly.

        If we are looking at CSF as a random-access per-doc value, then I'm just suggesting perhaps
        the APIs should reflect its random-access-ness... just a get(int docid) basically.

        we dont have to do this immediately, we can add seek to the enums like you did for now,
        though i'm slightly concerned this would make it more difficult to nuke the enums later if
        we decide to try to simplify the APIs much later down the road, but this isn't a big deal.

        Show
        Robert Muir added a comment - I don't know, to me it seems to fit this issue perfectly. If we are looking at CSF as a random-access per-doc value, then I'm just suggesting perhaps the APIs should reflect its random-access-ness... just a get(int docid) basically. we dont have to do this immediately, we can add seek to the enums like you did for now, though i'm slightly concerned this would make it more difficult to nuke the enums later if we decide to try to simplify the APIs much later down the road, but this isn't a big deal.
        Hide
        Uwe Schindler added a comment -

        I agree, the DocValues should be designed to be random access from the beginning, Iterators just look wrong. We can supply them on top, but the basic API should look like ByteBuffer, FloatBuffer, IntBuffer,...

        Show
        Uwe Schindler added a comment - I agree, the DocValues should be designed to be random access from the beginning, Iterators just look wrong. We can supply them on top, but the basic API should look like ByteBuffer, FloatBuffer, IntBuffer,...
        Hide
        Yonik Seeley added a comment -

        A faster short-term "expert level" API for fixed width values would be to just allow the retrieval of the underlying IndexInput (and return null if not supported so we don't box ourselves in). Then a getLong() for example, would go right to ByteBuffer.getLong() (for the mmap directory) and avoid extra checking and copying to another BytesRef.

        Anyway, it would be nice to eventually have something closer to the metal in the spirit of the bulk-api branch that doesn't impose per-call bounds checking and other book keeping like advance() currently does (I can see Uwe cringing now

        Aside: in FixedStraightBytesEnum, I see:

              if (target >= maxDoc || size == 0) {
        

        The size==0 check shouldn't be needed, right?

        Show
        Yonik Seeley added a comment - A faster short-term "expert level" API for fixed width values would be to just allow the retrieval of the underlying IndexInput (and return null if not supported so we don't box ourselves in). Then a getLong() for example, would go right to ByteBuffer.getLong() (for the mmap directory) and avoid extra checking and copying to another BytesRef. Anyway, it would be nice to eventually have something closer to the metal in the spirit of the bulk-api branch that doesn't impose per-call bounds checking and other book keeping like advance() currently does (I can see Uwe cringing now Aside: in FixedStraightBytesEnum, I see: if (target >= maxDoc || size == 0) { The size==0 check shouldn't be needed, right?
        Hide
        Andrzej Bialecki added a comment -

        Random access is also important if we want to use DocValues instead of norms.

        Show
        Andrzej Bialecki added a comment - Random access is also important if we want to use DocValues instead of norms.
        Hide
        Simon Willnauer added a comment -

        Random access is also important if we want to use DocValues instead of norms.

        we have random access already for in memory source.

        To simplify this here a bit we could remove ValuesEnum and use Source also for on disk access. The source API is already pretty much like ByteBuffer etc but provides the flexibility we need on the impl level.

        Show
        Simon Willnauer added a comment - Random access is also important if we want to use DocValues instead of norms. we have random access already for in memory source. To simplify this here a bit we could remove ValuesEnum and use Source also for on disk access. The source API is already pretty much like ByteBuffer etc but provides the flexibility we need on the impl level.
        Hide
        Simon Willnauer added a comment -

        This patch removes all uses of ValuesEnum and adds a IndexDocValues#getDirectSource() which provides direct access without loading into memory. This patch also removes all bounds checks and improves on disk access on several variants.

        its still a bit rough but reflects the idea. However, this patch also removes the sorted variant for efficiency reasons. Currently those variant require a lot of memory during merge as well as a sep. interface (sortedSource) which adds another complexity to the API. With this patch we only have a single API (Source) which is a good start to further improve this.

        Show
        Simon Willnauer added a comment - This patch removes all uses of ValuesEnum and adds a IndexDocValues#getDirectSource() which provides direct access without loading into memory. This patch also removes all bounds checks and improves on disk access on several variants. its still a bit rough but reflects the idea. However, this patch also removes the sorted variant for efficiency reasons. Currently those variant require a lot of memory during merge as well as a sep. interface (sortedSource) which adds another complexity to the API. With this patch we only have a single API (Source) which is a good start to further improve this.
        Hide
        Robert Muir added a comment -

        I like the direction of this patch!

        Show
        Robert Muir added a comment - I like the direction of this patch!
        Hide
        Michael McCandless added a comment -

        However, this patch also removes the sorted variant for efficiency reasons.

        Wait, we lose SortedSource entirely? Or just for getDirectSource?

        Can we break out that lossage as a separate issue? (I don't think we
        should drop it?). Or is it somehow coupled to removal of the enums?

        This value type is useful for apps that want to sort by string field,
        without paying the high uninversion-on-reader-init that you now pay
        (to use FieldCache). Grouping should also cutover to this (away
        from FieldCache).

        Show
        Michael McCandless added a comment - However, this patch also removes the sorted variant for efficiency reasons. Wait, we lose SortedSource entirely? Or just for getDirectSource? Can we break out that lossage as a separate issue? (I don't think we should drop it?). Or is it somehow coupled to removal of the enums? This value type is useful for apps that want to sort by string field, without paying the high uninversion-on-reader-init that you now pay (to use FieldCache). Grouping should also cutover to this (away from FieldCache).
        Hide
        Robert Muir added a comment -

        This value type is useful for apps that want to sort by string field,
        without paying the high uninversion-on-reader-init that you now pay
        (to use FieldCache). Grouping should also cutover to this (away
        from FieldCache).

        I think this is bogus? If you have high-turnaround/NRT you are gonna pay the price to uninvert somewhere,
        so it doesnt matter where you do it.

        If you don't, then you shouldnt be closing your IndexReader and reopening the same one again.

        +1 to drop SortedSource

        Show
        Robert Muir added a comment - This value type is useful for apps that want to sort by string field, without paying the high uninversion-on-reader-init that you now pay (to use FieldCache). Grouping should also cutover to this (away from FieldCache). I think this is bogus? If you have high-turnaround/NRT you are gonna pay the price to uninvert somewhere, so it doesnt matter where you do it. If you don't, then you shouldnt be closing your IndexReader and reopening the same one again. +1 to drop SortedSource
        Hide
        Simon Willnauer added a comment -

        Wait, we lose SortedSource entirely? Or just for getDirectSource?

        SortedSource or rather its impls. are hairy and horribly inefficient during merge. They only support 2GB max. and load everything into heap memory during merge. I think we can make this more efficient but as long as it is using so much memory we shouldn't release this. I am going to create another issue we can pick up later with a patch that contains the sorted source impls we have today. we can then pick it up from there.

        Show
        Simon Willnauer added a comment - Wait, we lose SortedSource entirely? Or just for getDirectSource? SortedSource or rather its impls. are hairy and horribly inefficient during merge. They only support 2GB max. and load everything into heap memory during merge. I think we can make this more efficient but as long as it is using so much memory we shouldn't release this. I am going to create another issue we can pick up later with a patch that contains the sorted source impls we have today. we can then pick it up from there.
        Hide
        Michael McCandless added a comment -

        SortedSource or rather its impls. are hairy and horribly inefficient during merge.

        Well let's address that ("remove SortedSource") under a different issue? I don't think we should remove them here because it's unrelated (there are other great simplifications here).

        I'm all for simplifying, but not if we lose good functionality (cut into the bone).

        Loading everything into heap can be solved, but, I think is relatively low priority because that's how the search app would use sorted DV at search time anyway. Ie, an app sets this on their "country" field because they know they want to sort by "country" at search time, and they will load the DV into RAM at search time. We only hold this RAM allocated while we merge the one field, then it's freed, so if multiple DV sort fields are used, merging takes less RAM than searching. I think the merge RAM usage is relatively minor.

        The SortedSource has a simple API (just like FC.DocTermsIndex): getOrd(int docID), getValue(int ord). I don't see why this is adding complexity to DV.

        Show
        Michael McCandless added a comment - SortedSource or rather its impls. are hairy and horribly inefficient during merge. Well let's address that ("remove SortedSource") under a different issue? I don't think we should remove them here because it's unrelated (there are other great simplifications here). I'm all for simplifying, but not if we lose good functionality (cut into the bone). Loading everything into heap can be solved, but, I think is relatively low priority because that's how the search app would use sorted DV at search time anyway. Ie, an app sets this on their "country" field because they know they want to sort by "country" at search time, and they will load the DV into RAM at search time. We only hold this RAM allocated while we merge the one field, then it's freed, so if multiple DV sort fields are used, merging takes less RAM than searching. I think the merge RAM usage is relatively minor. The SortedSource has a simple API (just like FC.DocTermsIndex): getOrd(int docID), getValue(int ord). I don't see why this is adding complexity to DV.
        Hide
        Robert Muir added a comment -

        I agree with simon here, can't we spin off a different issue for these?

        Seems like this is really a higher level thing that belongs on a fieldcache issue, in my opinion docvalues should just be per-document values...
        (i think its great if we can use it under the covers to speed up some use case: not sure one exists here...)

        I would rather we simplify docvalues behind the scenes, allow it to grow and mature iteratively on its own without shoving fieldcache into it.

        Show
        Robert Muir added a comment - I agree with simon here, can't we spin off a different issue for these? Seems like this is really a higher level thing that belongs on a fieldcache issue, in my opinion docvalues should just be per-document values... (i think its great if we can use it under the covers to speed up some use case: not sure one exists here...) I would rather we simplify docvalues behind the scenes, allow it to grow and mature iteratively on its own without shoving fieldcache into it.
        Hide
        Simon Willnauer added a comment -

        here is a new patch containing a bunch of cleanup and javadocs.

        I also attached a patch that adds back the sorted source so we can spin off a new issue and make them efficient without writing it from the scratch.

        this one seems close / ready.

        Show
        Simon Willnauer added a comment - here is a new patch containing a bunch of cleanup and javadocs. I also attached a patch that adds back the sorted source so we can spin off a new issue and make them efficient without writing it from the scratch. this one seems close / ready.
        Hide
        Michael McCandless added a comment -

        I agree with simon here, can't we spin off a different issue for these?

        +1: I agree removal of SortedSource is unrelated to this issue. We
        should discuss it under a new issue (it's obviously contentious), and
        for this issue do the nice cleanups we all agree on. It shouldn't be
        removed under this one.

        One shouldn't have to pay such a high price (uninversion on searcher
        startup) to sort or group by a string field, which we do today. It's
        silly to re-invert on every searcher startup when we can sort once
        during indexing and record that in the doc values, and SortedSource
        gives us that.

        Besides the merge RAM usage (which I think is minor) is there a
        technical/code complexity reason that SortedSource should be removed?
        Does it somehow require the enums or something? I'm trying to
        understand how/why it suddenly got coupled into this issue...

        I think sorting and grouping by string fields are first class
        functions for Lucene.

        Show
        Michael McCandless added a comment - I agree with simon here, can't we spin off a different issue for these? +1: I agree removal of SortedSource is unrelated to this issue. We should discuss it under a new issue (it's obviously contentious), and for this issue do the nice cleanups we all agree on. It shouldn't be removed under this one. One shouldn't have to pay such a high price (uninversion on searcher startup) to sort or group by a string field, which we do today. It's silly to re-invert on every searcher startup when we can sort once during indexing and record that in the doc values, and SortedSource gives us that. Besides the merge RAM usage (which I think is minor) is there a technical/code complexity reason that SortedSource should be removed? Does it somehow require the enums or something? I'm trying to understand how/why it suddenly got coupled into this issue... I think sorting and grouping by string fields are first class functions for Lucene.
        Hide
        Michael McCandless added a comment -

        I also attached a patch that adds back the sorted source so we can spin off a new issue and make them efficient without writing it from the scratch.

        Simon, can you invert this patch, and open a new issue for removing
        SortedSource?

        Show
        Michael McCandless added a comment - I also attached a patch that adds back the sorted source so we can spin off a new issue and make them efficient without writing it from the scratch. Simon, can you invert this patch, and open a new issue for removing SortedSource?
        Hide
        Martijn van Groningen added a comment -

        I think sorting and grouping by string fields are first class functions for Lucene.

        And faceting too!

        Maybe we should have DocTermIndex that is independent of source and have impls for DV and impls for indexed values.
        Maybe the name DocTermIndex doesn't make sense then, because it suggests that values come from the inverted index which might not be the case.

        Show
        Martijn van Groningen added a comment - I think sorting and grouping by string fields are first class functions for Lucene. And faceting too! Maybe we should have DocTermIndex that is independent of source and have impls for DV and impls for indexed values. Maybe the name DocTermIndex doesn't make sense then, because it suggests that values come from the inverted index which might not be the case.
        Hide
        Robert Muir added a comment -

        I think sorting and grouping by string fields are first class
        functions for Lucene.

        I disagree: if you aren't sorting by score, then go use a database.

        Show
        Robert Muir added a comment - I think sorting and grouping by string fields are first class functions for Lucene. I disagree: if you aren't sorting by score, then go use a database.
        Hide
        Simon Willnauer added a comment -

        Simon, can you invert this patch, and open a new issue for removing SortedSource?

        actually my plan was to have one iterface for now and then open an issue to add back the SortedSource with an impl that we all agree on. Currently, the sorted variants are somewhat flaky and heavy I think we should simply remove it here and then go and work out a plan how to implement this. The technical reason here is simply to rethink the interface, we now have one which is simple so let see what we can do to make this work with sorted variants.

        simon

        Show
        Simon Willnauer added a comment - Simon, can you invert this patch, and open a new issue for removing SortedSource? actually my plan was to have one iterface for now and then open an issue to add back the SortedSource with an impl that we all agree on. Currently, the sorted variants are somewhat flaky and heavy I think we should simply remove it here and then go and work out a plan how to implement this. The technical reason here is simply to rethink the interface, we now have one which is simple so let see what we can do to make this work with sorted variants. simon
        Hide
        Robert Muir added a comment -

        +1: I agree removal of SortedSource is unrelated to this issue. We
        should discuss it under a new issue (it's obviously contentious), and
        for this issue do the nice cleanups we all agree on. It shouldn't be
        removed under this one.

        Thats not what i meant by spin off a different issue, i think we should
        spin off a different issue to add back SortedSource.

        Docvalues really needs to be simplified, Simon has done just that, and I think
        its great as a part of that that it focuses on what it should be, thats per-document values,
        not being some precomputed FieldCache.

        Show
        Robert Muir added a comment - +1: I agree removal of SortedSource is unrelated to this issue. We should discuss it under a new issue (it's obviously contentious), and for this issue do the nice cleanups we all agree on. It shouldn't be removed under this one. Thats not what i meant by spin off a different issue, i think we should spin off a different issue to add back SortedSource. Docvalues really needs to be simplified, Simon has done just that, and I think its great as a part of that that it focuses on what it should be, thats per-document values, not being some precomputed FieldCache.
        Hide
        Mark Miller added a comment -

        I agree removal of SortedSource is unrelated to this issue. We
        should discuss it under a new issue (it's obviously contentious), and
        for this issue do the nice cleanups we all agree on. It shouldn't be
        removed under this one.

        +1 - something contentious should not be removed in an unrelated issue like this. If it's already in, but some want it out, let's make an a new issue to discuss. Once something is in, there should be a clear and dedicated issue discussing it's removal if there is dispute. I don't agree with simply pulling it and putting the onus on those who want it to make an issue to get it back in.

        Show
        Mark Miller added a comment - I agree removal of SortedSource is unrelated to this issue. We should discuss it under a new issue (it's obviously contentious), and for this issue do the nice cleanups we all agree on. It shouldn't be removed under this one. +1 - something contentious should not be removed in an unrelated issue like this. If it's already in, but some want it out, let's make an a new issue to discuss. Once something is in, there should be a clear and dedicated issue discussing it's removal if there is dispute. I don't agree with simply pulling it and putting the onus on those who want it to make an issue to get it back in.
        Hide
        Simon Willnauer added a comment -

        +1 - something contentious should not be removed in an unrelated issue like this. If it's already in, but some want it out, let's make an a new issue to discuss. Once something is in, there should be a clear and dedicated issue discussing it's removal if there is dispute. I don't agree with simply pulling it and putting the onus on those who want it to make an issue to get it back in.

        there is no dispute here If you'd have looked at the API and the code you'd know what I talk about though. We cut over to a new api where sorted source doesn't fit in nicely. We had ValuesEnum used for merging as the LCD between SortedSource and Source. Now we only have Source as a RandomAccess API. To keep this at a reasonable size we should try to add the missing part in a different issue. We should also rethink how to merge sortd sources since they are quite mem heavy (I think this is a potential issue). Adding back SortedSource is going to be tough without a new issue since lots of stuff has changed. To make this dev process easier backing out and re-adding seems best to me. Don't worry we gonna add it back though.

        Show
        Simon Willnauer added a comment - +1 - something contentious should not be removed in an unrelated issue like this. If it's already in, but some want it out, let's make an a new issue to discuss. Once something is in, there should be a clear and dedicated issue discussing it's removal if there is dispute. I don't agree with simply pulling it and putting the onus on those who want it to make an issue to get it back in. there is no dispute here If you'd have looked at the API and the code you'd know what I talk about though. We cut over to a new api where sorted source doesn't fit in nicely. We had ValuesEnum used for merging as the LCD between SortedSource and Source. Now we only have Source as a RandomAccess API. To keep this at a reasonable size we should try to add the missing part in a different issue. We should also rethink how to merge sortd sources since they are quite mem heavy (I think this is a potential issue). Adding back SortedSource is going to be tough without a new issue since lots of stuff has changed. To make this dev process easier backing out and re-adding seems best to me. Don't worry we gonna add it back though.
        Hide
        Mark Miller added a comment -

        there is no dispute here

        If there is no dispute, what exactly is Mike talking about?

        Show
        Mark Miller added a comment - there is no dispute here If there is no dispute, what exactly is Mike talking about?
        Hide
        Robert Muir added a comment -

        Here is the way i see the problem:

        • Currently docvalues is a bit confusing... I think a lot of this is due to the current API
        • no offense to simon but i think in a way this forces him to feel responsible for doing all work on it. The complexity makes it hard for others to get involved.
        • with this patch, the api becomes a lot simpler: i'm sure its not perfect but the API seems to correspond to what DV does, at least it makes sense to me.

        can we temporarily drop SortedSource, open a new issue to add it back (mark it blocker for 4.0 even?!). This way, we can rethink how to implement this functionality (maybe it doesnt even belong as docvalues but something on top of it, or something else entirely).

        Show
        Robert Muir added a comment - Here is the way i see the problem: Currently docvalues is a bit confusing... I think a lot of this is due to the current API no offense to simon but i think in a way this forces him to feel responsible for doing all work on it. The complexity makes it hard for others to get involved. with this patch, the api becomes a lot simpler: i'm sure its not perfect but the API seems to correspond to what DV does, at least it makes sense to me. can we temporarily drop SortedSource, open a new issue to add it back (mark it blocker for 4.0 even?!). This way, we can rethink how to implement this functionality (maybe it doesnt even belong as docvalues but something on top of it, or something else entirely).
        Hide
        Simon Willnauer added a comment -

        here we go.. the make everybody happy patch! I added SortedSource back and integrated it into the Source pattern for random access. we now have an entirely disk resident SortedSource impl for both variants and a single interface in the first place. SortedSource instance can be obtained via Source#asSortedSource() which returns null if the source is not sorted. With this random access DirectSortedSource we can also improve the merging for sorted sources which was one of my major issues here.

        Show
        Simon Willnauer added a comment - here we go.. the make everybody happy patch! I added SortedSource back and integrated it into the Source pattern for random access. we now have an entirely disk resident SortedSource impl for both variants and a single interface in the first place. SortedSource instance can be obtained via Source#asSortedSource() which returns null if the source is not sorted. With this random access DirectSortedSource we can also improve the merging for sorted sources which was one of my major issues here.
        Hide
        Michael McCandless added a comment -

        Thanks Simon! I'll look through the patch... it's a great cleanup.

        Show
        Michael McCandless added a comment - Thanks Simon! I'll look through the patch... it's a great cleanup.
        Hide
        Michael McCandless added a comment -

        Patch is awesome Simon; thank you.

        Only thing I noticed: can you fix SortedSource.numOrds back to .getValueCount?

        +1 to commit!

        Show
        Michael McCandless added a comment - Patch is awesome Simon; thank you. Only thing I noticed: can you fix SortedSource.numOrds back to .getValueCount? +1 to commit!
        Hide
        Simon Willnauer added a comment -

        Only thing I noticed: can you fix SortedSource.numOrds back to .getValueCount?

        yeah thats the better name though - no idea why I changed that though.... anyway I will commit this soon if nobody objects.

        simon

        Show
        Simon Willnauer added a comment - Only thing I noticed: can you fix SortedSource.numOrds back to .getValueCount? yeah thats the better name though - no idea why I changed that though.... anyway I will commit this soon if nobody objects. simon
        Hide
        Chris Male added a comment -

        Having the simpler interface seems like a great achievement. I love that we can keep SortedSource.

        +1 to committing.

        Show
        Chris Male added a comment - Having the simpler interface seems like a great achievement. I love that we can keep SortedSource. +1 to committing.
        Hide
        Simon Willnauer added a comment -

        committed in rev. 1179970

        Show
        Simon Willnauer added a comment - committed in rev. 1179970

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development