  1. Solr
  2. SOLR-8492

Add LogisticRegressionQuery and LogitStream



    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: streaming expressions
    • Labels:


      This ticket is to add a new query called a LogisticRegressionQuery (LRQ).

      The LRQ extends AnalyticsQuery (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html) and returns a DelegatingCollector that implements a Stochastic Gradient Descent (SGD) optimizer for Logistic Regression.
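
      As a rough illustration of the per-shard work described above, the following is a minimal sketch of one SGD pass for logistic regression over in-memory rows. It is not the code from the patch: the class name SgdPassSketch, the method names, the bias term in weights[0], and the fixed learning rate are all assumptions made for the example. The real collector would read feature values from index fields rather than from arrays.

```java
import java.util.Arrays;

/** Hypothetical sketch (not from the patch): one SGD pass of logistic regression. */
class SgdPassSketch {

  /** Logistic (sigmoid) function mapping a raw score to a probability. */
  static double sigmoid(double z) {
    return 1.0 / (1.0 + Math.exp(-z));
  }

  /**
   * Runs one pass over the rows, updating a copy of the weights after every row.
   * Assumption: weights[0] is a bias term and weights[i + 1] pairs with features[i].
   */
  static double[] onePass(double[][] features, int[] outcomes,
                          double[] weights, double learningRate) {
    double[] w = Arrays.copyOf(weights, weights.length);
    for (int row = 0; row < features.length; row++) {
      // Linear score for this row.
      double z = w[0];
      for (int i = 0; i < features[row].length; i++) {
        z += w[i + 1] * features[row][i];
      }
      // Difference between the observed outcome (0 or 1) and the prediction.
      double error = outcomes[row] - sigmoid(z);
      // Gradient step on the bias and on each feature weight.
      w[0] += learningRate * error;
      for (int i = 0; i < features[row].length; i++) {
        w[i + 1] += learningRate * error * features[row][i];
      }
    }
    return w;
  }
}
```

      In this sketch the input weights are left untouched and the updated copy is returned, which mirrors the idea of a shard receiving weights, refining them locally, and handing the result back.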

      This ticket also adds the LogitStream which leverages Streaming Expressions to provide iteration over the shards. Each call to LogitStream.read() calls down to the shards and executes the LogisticRegressionQuery. The model data is collected from the shards and the weights are averaged and sent back to the shards with the next iteration. Each call to read() returns a Tuple with the averaged weights and error from the shards. With this approach the LogitStream streams the changing model back to the client after each iteration.
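
      The averaging step described above can be sketched as follows. This is an assumption-laden illustration rather than the patch's code: WeightAveragingSketch and average are hypothetical names, and a plain unweighted mean is assumed (the ticket does not say whether shards are weighted by document count).

```java
import java.util.List;

/** Hypothetical sketch (not from the patch): element-wise mean of per-shard weights. */
class WeightAveragingSketch {

  /** Returns the unweighted element-wise average of the weight vectors. */
  static double[] average(List<double[]> shardWeights) {
    if (shardWeights.isEmpty()) {
      throw new IllegalArgumentException("at least one shard result is required");
    }
    int n = shardWeights.get(0).length;
    double[] avg = new double[n];
    for (double[] w : shardWeights) {
      if (w.length != n) {
        throw new IllegalArgumentException("all shards must return the same number of weights");
      }
      for (int i = 0; i < n; i++) {
        avg[i] += w[i];
      }
    }
    for (int i = 0; i < n; i++) {
      avg[i] /= shardWeights.size();
    }
    return avg;
  }
}
```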

      The LogitStream will return the EOF Tuple when it reaches the defined maxIterations. When sent as a Streaming Expression to the Stream handler this provides parallel iterative behavior. This same approach can be used to implement other parallel iterative algorithms.
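
      The read-until-maxIterations behavior can be sketched independently of Solr. The classes below are hypothetical stand-ins, not Solr's TupleStream or Tuple API: ModelTuple is an invented holder, and the iterationStep function stands in for "send the weights to the shards, run the query, and average the results".

```java
import java.util.function.UnaryOperator;

/** Hypothetical sketch (not Solr's API): one model tuple per read(), then EOF. */
class IterativeStreamSketch {

  /** Invented tuple type: either an EOF marker or the weights after an iteration. */
  static final class ModelTuple {
    final boolean eof;
    final int iteration;
    final double[] weights;

    ModelTuple(boolean eof, int iteration, double[] weights) {
      this.eof = eof;
      this.iteration = iteration;
      this.weights = weights;
    }
  }

  private final int maxIterations;
  // Stand-in for one distributed round: shard calls plus averaging.
  private final UnaryOperator<double[]> iterationStep;
  private double[] weights;
  private int iteration = 0;

  IterativeStreamSketch(double[] initialWeights, int maxIterations,
                        UnaryOperator<double[]> iterationStep) {
    this.weights = initialWeights.clone();
    this.maxIterations = maxIterations;
    this.iterationStep = iterationStep;
  }

  /** Runs one iteration and returns the new model, or EOF once maxIterations is reached. */
  ModelTuple read() {
    if (iteration >= maxIterations) {
      return new ModelTuple(true, iteration, null);
    }
    weights = iterationStep.apply(weights);
    iteration++;
    return new ModelTuple(false, iteration, weights.clone());
  }
}
```

      A client would call read() in a loop and stop at the first tuple whose eof flag is set, seeing the model change after each iteration along the way.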

      The initial patch has a test which simply tests the mechanics of the iteration. More work will need to be done to ensure the SGD is properly implemented. The distributed approach of the SGD will also need to be reviewed.

      This implementation is designed for use cases with a small number of features because each feature is its own discrete field.

      An implementation which supports a higher number of features would be possible by packing features into a byte array and storing as binary DocValues.
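
      One way to picture the packing idea is the sketch below. The ticket does not specify an encoding, so the layout here (fixed-width 8-byte doubles in ByteBuffer's default big-endian order) and the names FeaturePackingSketch, pack, and unpack are assumptions for illustration only. Storing the resulting bytes as binary DocValues is outside the scope of the example.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch (not from the patch): pack a feature vector into one byte array. */
class FeaturePackingSketch {

  /** Encodes each feature as an 8-byte double, in order. */
  static byte[] pack(double[] features) {
    ByteBuffer buf = ByteBuffer.allocate(features.length * Double.BYTES);
    for (double f : features) {
      buf.putDouble(f);
    }
    return buf.array();
  }

  /** Decodes a byte array produced by pack back into the feature vector. */
  static double[] unpack(byte[] bytes) {
    if (bytes.length % Double.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Double.BYTES);
    }
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    double[] features = new double[bytes.length / Double.BYTES];
    for (int i = 0; i < features.length; i++) {
      features[i] = buf.getDouble();
    }
    return features;
  }
}
```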

      This implementation is designed to support a large sample set. With a large number of shards, a sample set numbering in the billions may be possible.

      Sample Streaming Expression syntax:

      logit(collection1, features="a,b,c,d,e,f", outcome="x", maxIterations="80")


        1. logit.csv
          29 kB
          Joel Bernstein
        2. SOLR-8492.diff
          73 kB
          Cao Manh Dat
        3. SOLR-8492.diff
          37 kB
          Cao Manh Dat
        4. SOLR-8492.patch
          74 kB
          Cao Manh Dat
        5. SOLR-8492.patch
          74 kB
          Cao Manh Dat
        6. SOLR-8492.patch
          35 kB
          Cao Manh Dat
        7. SOLR-8492.patch
          36 kB
          Cao Manh Dat
        8. SOLR-8492.patch
          36 kB
          Joel Bernstein
        9. SOLR-8492.patch
          30 kB
          Cao Manh Dat
        10. SOLR-8492.patch
          28 kB
          Cao Manh Dat
        11. SOLR-8492.patch
          30 kB
          Joel Bernstein




              • Assignee:
                jbernste Joel Bernstein
              • Votes:
                1
              • Watchers:
                6


                • Created: