Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Lucene Fields:
      New

      Description

      There are parts of Lucene that could potentially be sped up if computations were offloaded from the CPU to the GPU(s). With commodity GPUs now carrying as much as 12GB of high-bandwidth RAM, we might be able to leverage them to speed up parts of Lucene (indexing, search).

      The first thing that comes to mind is spatial filtering, which is traditionally known to be a good candidate for GPU-based speedup (especially when complex polygons are involved). In the past, Mike McCandless has mentioned that "both initial indexing and merging are CPU/IO intensive, but they are very amenable to soaking up the hardware's concurrency."
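
      As a purely illustrative, standalone sketch (not Lucene code; the kernel name, the flat point/vertex arrays, and the tiny test in main() are all assumptions), point-in-polygon filtering maps naturally onto a GPU: one thread tests one point against the polygon using the classic ray-casting rule.

      // Hypothetical standalone CUDA sketch: one thread per point, ray-casting test.
      // Compile with: nvcc -o pip pip.cu   (error checks and cudaFree omitted for brevity)
      #include <cstdio>
      #include <cuda_runtime.h>

      __global__ void pointInPolygon(const float* px, const float* py, int numPoints,
                                     const float* vx, const float* vy, int numVerts,
                                     char* inside) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= numPoints) return;
          float x = px[i], y = py[i];
          bool in = false;
          // Count how often a horizontal ray from (x, y) crosses a polygon edge.
          for (int a = 0, b = numVerts - 1; a < numVerts; b = a++) {
              if (((vy[a] > y) != (vy[b] > y)) &&
                  (x < (vx[b] - vx[a]) * (y - vy[a]) / (vy[b] - vy[a]) + vx[a]))
                  in = !in;
          }
          inside[i] = in ? 1 : 0;
      }

      int main() {
          // Tiny smoke test: unit-square polygon, one point inside and one outside.
          float hvx[] = {0, 1, 1, 0}, hvy[] = {0, 0, 1, 1};
          float hpx[] = {0.5f, 2.0f}, hpy[] = {0.5f, 2.0f};
          char hin[2];
          float *dpx, *dpy, *dvx, *dvy; char *din;
          cudaMalloc(&dpx, sizeof(hpx)); cudaMalloc(&dpy, sizeof(hpy));
          cudaMalloc(&dvx, sizeof(hvx)); cudaMalloc(&dvy, sizeof(hvy));
          cudaMalloc(&din, sizeof(hin));
          cudaMemcpy(dpx, hpx, sizeof(hpx), cudaMemcpyHostToDevice);
          cudaMemcpy(dpy, hpy, sizeof(hpy), cudaMemcpyHostToDevice);
          cudaMemcpy(dvx, hvx, sizeof(hvx), cudaMemcpyHostToDevice);
          cudaMemcpy(dvy, hvy, sizeof(hvy), cudaMemcpyHostToDevice);
          pointInPolygon<<<1, 32>>>(dpx, dpy, 2, dvx, dvy, 4, din);
          cudaMemcpy(hin, din, sizeof(hin), cudaMemcpyDeviceToHost);
          printf("inside: %d %d\n", hin[0], hin[1]);   // expected: 1 0
          return 0;
      }

      With complex polygons (many vertices) and millions of indexed points, the per-point work may grow large enough to amortize the copy-to-GPU overhead; that is exactly what would need to be measured.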

      I'm opening this issue as an exploratory task, suitable for a GSoC project. I volunteer to mentor any GSoC student willing to work on this over the summer.

        Activity

        qwerty123 vikash added a comment -

        Hi, I am willing to work on this.

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        Here are some ideas on things to start out with:

        1. Copy over and index lots of points and their corresponding docids to the GPU as an offline, one-time operation. Then, given a query point, return the top-n nearest indexed points.
        2. Copy over and index lots of points and their corresponding docids to the GPU as an offline, one-time operation. Then, given a polygon (complex shape), return all points that lie inside the polygon.

        In both cases, compare performance against the existing Lucene spatial search. One would need to choose the most suitable algorithm for doing these as efficiently as possible. Any GPGPU API (OpenCL, CUDA) can be used for the initial exploration; a brute-force sketch of the first experiment follows below.
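
        For illustration only, here is a minimal, standalone CUDA sketch of the first experiment, assuming the points and doc ids have already been copied to flat device arrays (the names, the toy data, and the Thrust sort used for the final top-n selection are assumptions of this sketch, not Lucene APIs):

        // Hypothetical brute-force k-NN sketch: one thread computes the squared
        // distance from the query point to one indexed point; ranking is done by
        // sorting doc ids by distance with Thrust (ships with the CUDA toolkit).
        #include <cstdio>
        #include <cuda_runtime.h>
        #include <thrust/device_vector.h>
        #include <thrust/sort.h>

        __global__ void squaredDistances(const float* x, const float* y, int numDocs,
                                         float qx, float qy, float* dist) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < numDocs) {
                float dx = x[i] - qx, dy = y[i] - qy;
                dist[i] = dx * dx + dy * dy;   // squared distance is enough for ranking
            }
        }

        int main() {
            const int numDocs = 5, topN = 2;
            float hx[numDocs] = {0, 1, 2, 3, 4};
            float hy[numDocs] = {0, 1, 2, 3, 4};
            int hid[numDocs] = {10, 11, 12, 13, 14};   // pretend Lucene doc ids

            // One-time "index" upload: coordinates and doc ids live on the GPU.
            thrust::device_vector<float> dx(hx, hx + numDocs), dy(hy, hy + numDocs);
            thrust::device_vector<int> docIds(hid, hid + numDocs);
            thrust::device_vector<float> dist(numDocs);

            // Per-query work: compute all distances, then rank.
            int threads = 256, blocks = (numDocs + threads - 1) / threads;
            squaredDistances<<<blocks, threads>>>(
                thrust::raw_pointer_cast(dx.data()), thrust::raw_pointer_cast(dy.data()),
                numDocs, 2.5f, 2.5f, thrust::raw_pointer_cast(dist.data()));
            thrust::sort_by_key(dist.begin(), dist.end(), docIds.begin());

            for (int i = 0; i < topN; i++)
                printf("rank %d -> doc %d\n", i, (int) docIds[i]);
            return 0;
        }

        A full sort is obviously wasteful for small n; a real benchmark would compare this against a proper GPU top-k selection and against Lucene's points-based spatial search.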

        David Smiley, Karl Wright, Nicholas Knize, Michael McCandless, given your depth and expertise in this area, do you have any suggestions? Is there any other area of Lucene that comes to mind which would be easiest to start with, in terms of exploring GPU-based parallelization?

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        Another experiment that, I think, is worth trying out:

        • Benchmarking an aggregation over a DocValues field (e.g. sqrt(), haversine distance, etc.) and comparing it against the same computation executed on the GPU (a minimal sketch follows below). This could potentially speed up the scoring of results.

        For reference, PostgreSQL seems to have seen speedups in some areas (especially aggregations over column-oriented fields): https://www.slideshare.net/kaigai/gpgpu-accelerates-postgresql
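
        To make the idea concrete, here is a minimal, standalone CUDA sketch of such an aggregation (the flat lat/lon arrays merely stand in for a numeric DocValues column; the kernel name, the toy data, and the atomicAdd-based reduction are assumptions of this sketch, not Lucene code):

        // Hypothetical sketch: sum of haversine distances from every doc to a
        // query point, i.e. a full-column aggregation over "DocValues-like" data.
        #include <cstdio>
        #include <cuda_runtime.h>

        __device__ float haversineKm(float lat1, float lon1, float lat2, float lon2) {
            const float d = 0.017453292519943295f;   // degrees -> radians
            const float R = 6371.0f;                 // mean Earth radius in km
            float dLat = (lat2 - lat1) * d, dLon = (lon2 - lon1) * d;
            float a = sinf(dLat * 0.5f) * sinf(dLat * 0.5f) +
                      cosf(lat1 * d) * cosf(lat2 * d) * sinf(dLon * 0.5f) * sinf(dLon * 0.5f);
            return 2.0f * R * asinf(sqrtf(a));
        }

        __global__ void sumHaversine(const float* lat, const float* lon, int numDocs,
                                     float qLat, float qLon, float* sum) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < numDocs)
                atomicAdd(sum, haversineKm(lat[i], lon[i], qLat, qLon));   // crude reduction
        }

        int main() {
            const int numDocs = 3;
            float hLat[numDocs] = {52.52f, 48.85f, 40.71f};   // Berlin, Paris, New York
            float hLon[numDocs] = {13.40f,  2.35f, -74.01f};
            float *dLat, *dLon, *dSum, hSum = 0.0f;
            cudaMalloc(&dLat, sizeof(hLat)); cudaMalloc(&dLon, sizeof(hLon));
            cudaMalloc(&dSum, sizeof(float));
            cudaMemcpy(dLat, hLat, sizeof(hLat), cudaMemcpyHostToDevice);
            cudaMemcpy(dLon, hLon, sizeof(hLon), cudaMemcpyHostToDevice);
            cudaMemcpy(dSum, &hSum, sizeof(float), cudaMemcpyHostToDevice);
            sumHaversine<<<1, 32>>>(dLat, dLon, numDocs, 51.51f, -0.13f, dSum);   // query: London
            cudaMemcpy(&hSum, dSum, sizeof(float), cudaMemcpyDeviceToHost);
            printf("total distance to query point: %.1f km\n", hSum);
            return 0;
        }

        The per-element work here is tiny, so the host-to-device copy cost is likely to dominate; that cost is precisely what the benchmark needs to measure.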

        dsmiley David Smiley added a comment -

        I have a question for us all: (a) could whatever comes of this actually be contributed to Lucene itself, given the likelihood of requiring native OS bindings (let's presume in spatial-extras, as it seems to be the only module that can have an external dependency)? And (b) does that matter for GSoC or for the expectations of the contributor? If (a) is a "no", we need to be honest up front with the contributor. I know that in the past Solr has been denied off-heap filters that would have required a non-pure-Java approach. A native binding would be another degree of impurity.

        mikemccand Michael McCandless added a comment -

        Maybe even the basic hit scoring that e.g. BooleanScorer does with disjunctions of high-frequency terms would be amenable to GPU acceleration? Today BooleanScorer processes a whole window of hits at once, doing fairly simple math (the Similarity methods) on each.
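
        As a minimal, standalone illustration of that idea (not Lucene's Similarity API; the flat arrays, the parameter values, and the textbook BM25 formula are assumptions of this sketch), scoring a window of hits could be expressed as one GPU thread per hit:

        // Hypothetical sketch: textbook BM25 scoring for one window of candidate
        // hits, assuming term frequencies and doc lengths were copied to the GPU.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void bm25Window(const float* termFreq, const float* docLen,
                                   int windowSize, float idf, float avgDocLen,
                                   float k1, float b, float* scores) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= windowSize) return;
            float tf = termFreq[i];
            float norm = k1 * (1.0f - b + b * docLen[i] / avgDocLen);
            scores[i] = idf * tf * (k1 + 1.0f) / (tf + norm);
        }

        int main() {
            const int windowSize = 4;
            float hTf[windowSize]  = {1, 3, 0, 7};        // term frequency per hit
            float hLen[windowSize] = {80, 120, 95, 300};  // document length per hit
            float hScores[windowSize];
            float *dTf, *dLen, *dScores;
            cudaMalloc(&dTf, sizeof(hTf)); cudaMalloc(&dLen, sizeof(hLen));
            cudaMalloc(&dScores, sizeof(hScores));
            cudaMemcpy(dTf, hTf, sizeof(hTf), cudaMemcpyHostToDevice);
            cudaMemcpy(dLen, hLen, sizeof(hLen), cudaMemcpyHostToDevice);
            bm25Window<<<1, 32>>>(dTf, dLen, windowSize, /*idf=*/2.2f,
                                  /*avgDocLen=*/120.0f, /*k1=*/1.2f, /*b=*/0.75f, dScores);
            cudaMemcpy(hScores, dScores, sizeof(hScores), cudaMemcpyDeviceToHost);
            for (int i = 0; i < windowSize; i++)
                printf("hit %d -> %.3f\n", i, hScores[i]);
            return 0;
        }

        The open question is whether the arithmetic per hit is heavy enough to outweigh copying the window's term frequencies and length norms across the PCIe bus on every scoring call.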

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        David Smiley, that is a very important question. AFAIK, there is no Apache-compatible GPGPU framework. Both OpenCL and CUDA are likely incompatible with the Apache license (I am not fully sure). I see that jCUDA, which is a wrapper around the native libraries, is MIT licensed.

        If there are benefits to using GPGPU processing, my thought is that we can put all the necessary plumbing into our codebase to offload processing to a plugin, whereby the user can plug in the actual GPU kernels from outside the Lucene distribution (in case those kernels violate any licensing restrictions we have). If there are clear benefits to speeding things up with a GPU, it would not be the end of the world for the end user if that code comes from outside the Apache distribution.

        If (a) is a "no", we need to be honest up front with the contributor.

        That is a good point, and we can document this clearly.

        thetaphi Uwe Schindler added a comment - edited

        Hi,
        in general, including CUDA in Lucene may be a good idea, but I see no real possibility to do this inside Lucene core or any other module. My idea would be to add some abstraction to the relevant parts of Lucene and make it easier to "plug in" different implementations. That code could then also be hosted outside Lucene (if licensing is a problem), anywhere on GitHub.

        We should still keep the following in mind: Mike's example looks "simple" as a quick test to see if there are gains, but making the whole thing ready for commit or bundling in any project in/outside Lucene is a whole different story. Currently BooleanScorer calls a lot of classes, e.g. the BM25 or TF-IDF similarity, to do the calculations that could possibly be parallelized. But to move all of this to CUDA, you would have to add "plugin points" everywhere and change APIs completely. It is also hard to test, because none of our Jenkins servers has a GPU! Also, for ordinary users of Lucene, this could be a huge problem if we add native stuff into Lucene that they may never use. Because of that it MUST BE SEPARATED from Lucene core. Completely...

        IMHO, if I wanted to focus solely on GPU parallelization, I'd create a completely new search engine in C, like CLucene. The current iterator-based approaches are not easy to transform or plug into CUDA...

        For the GSoC project, we should make clear to the GSoC student that this is just a project to "explore" GPU acceleration and whether it brings any performance gains - I doubt that, because the call overhead between Java and CUDA is way too high, in contrast to Postgres where everything is plain C/C++. The results would then be used to plan and investigate ways to include this in Lucene as "plugin points" (e.g., as SPI modules).


          People

          • Assignee: Unassigned
          • Reporter: ichattopadhyaya Ishan Chattopadhyaya
          • Votes: 0
          • Watchers: 7
