Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: None
    • Labels: None

      Description

      This ticket is to track a "replacement" for the StatsComponent. The AnalyticsComponent supports the following features (an example request is sketched after the lists below):

      • All functionality of StatsComponent (SOLR-4499)
      • Field Faceting (SOLR-3435)
        • Support for limit
        • Sorting (by bucket name or any stat in the bucket)
        • Support for offset
      • Range Faceting
        • Supports all options of standard range faceting
      • Query Faceting (SOLR-2925)
      • Ability to use overall/field facet statistics as input to range/query faceting (i.e., calculate a min/max date and then facet over that range)
      • Support for more complex aggregate/mapping operations (SOLR-1622)
        • Aggregations: min, max, sum, sum-of-square, count, missing, stddev, mean, median, percentiles
        • Operations: negation, abs, add, multiply, divide, power, log, date math, string reversal, string concat
        • Easily pluggable framework to add additional operations
      • New / cleaner output format

      Outstanding Issues:

      • Multi-value field support for stats (supported for faceting)
      • Multi-shard support (may not be possible for some operations, e.g., median)
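
      As a rough illustration only, here is what a request to the component could look like through SolrJ, using the olap.* parameter style that appears later in this thread (the collection name, field name, and statistic names here are assumptions for the example, not a spec):

          import org.apache.solr.client.solrj.SolrServer;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.client.solrj.response.QueryResponse;
          import org.apache.solr.common.params.ModifiableSolrParams;

          public class AnalyticsRequestSketch {
            public static void main(String[] args) throws Exception {
              // Hypothetical single-node setup; the component is switched on per request with olap=true.
              SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

              ModifiableSolrParams params = new ModifiableSolrParams();
              params.set("q", "*:*");
              params.set("olap", "true");
              // One named analytics request ("req1") computing two statistics over the "price" field.
              params.set("olap.req1.statistic.stat1", "sum(price)");
              params.set("olap.req1.statistic.stat2", "median(price)");

              QueryResponse rsp = server.query(params);
              // The statistics come back under a "stats" section of the response
              // (see the example XML response quoted further down in this thread).
              System.out.println(rsp.getResponse().get("stats"));
            }
          }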
      Attachments

      1. Search Analytics Component.pdf
        19 kB
        Steven Bower
      2. solr_analytics-2013.10.04-2.patch
        521 kB
        Steven Bower
      3. SOLR-5302.patch
        546 kB
        Erick Erickson
      4. SOLR-5302.patch
        529 kB
        Steven Bower
      5. SOLR-5302.patch
        523 kB
        Erick Erickson
      6. SOLR-5302.patch
        515 kB
        Steven Bower
      7. Statistical Expressions.pdf
        13 kB
        Steven Bower

        Issue Links

          Activity

          Steven Bower added a comment -

          Related tickets

          Steven Bower added a comment -

          Initial patch, please review/comment. Additionally PDF exports of some docs for using the component

          Uwe Schindler added a comment -

          Hi,
          thanks for the patch! We also got your iCLA.

          Could you please remove this from every license header?:

          + * Copyright 2013 Bloomberg Finance L.P.
          + *
          

          Uwe

          Shawn Heisey added a comment -

          I love new functionality. Thank you for all the time and effort!

          I was going to suggest that you just replace the existing StatsComponent rather than create a new component, but as I look a little bit into things, it looks like it might not be a new component from the user/admin perspective, just the code perspective. I haven't looked in-depth, but I do see a new class in the patch, so I'm slightly confused. That confusion may clear up after I've looked deeper.

          Side note, and most likely not your fault at all: Your PDF text is invisible in my in-browser PDF viewer. Windows 8 Pro, Firefox 24.0. Everything is fine if downloaded and opened in Adobe Reader. I think this is probably using the PDF viewer built into Windows 8, which sucks.

          Steven Bower added a comment -

          We originally had this code integrated into the stats component, but we wanted to change the output format, which made that a bit more complex. It can easily go back in and replace it. Also, I'm not wedded to "olap=true" for turning it on; it was just better than a shortened version of "analytics".

          Robert Muir added a comment -

          Can we remove all the Class.equals/isAssignableFrom stuff?

          We should instead use proper FieldType methods... only use instanceof when absolutely necessary, and only instanceof, and please open an issue when it is necessary, because it means Solr is broken. Using instanceof, isAssignableFrom, Class.equals, etc. completely breaks Solr's pluggability in increasingly bogus ways.
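
          To make the concern concrete, a rough sketch of the pattern being objected to and the general shape of the fix. The classes below are hypothetical stand-ins, not the actual Solr FieldType/ValueSource classes or the code in the patch:

            // Hypothetical stand-ins, not the real Solr classes.
            abstract class FieldTypeSketch {
              // Capability method: each type declares what it supports.
              abstract boolean isNumeric();
            }

            class TrieDoubleSketch extends FieldTypeSketch {
              @Override boolean isNumeric() { return true; }
            }

            class CustomMoneySketch extends FieldTypeSketch {
              @Override boolean isNumeric() { return true; } // a plugin type still participates
            }

            public class PluggabilitySketch {
              public static void main(String[] args) {
                FieldTypeSketch ft = new CustomMoneySketch();

                // Anti-pattern: concrete-class checks silently exclude plugin types.
                boolean numericByClass = ft.getClass().equals(TrieDoubleSketch.class)
                    || TrieDoubleSketch.class.isAssignableFrom(ft.getClass());

                // Preferred shape: ask the type for its capability instead of its class.
                boolean numericByCapability = ft.isNumeric();

                System.out.println(numericByClass + " vs " + numericByCapability); // prints: false vs true
              }
            }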

          Houston Putman added a comment (edited) -

          The FieldType methods only apply when you're working with actual fields, though. I think we also use the Class.equals stuff with ValueSource classes...

          Yeah, I just checked: we use it to check the type (numeric, string, or date) of the value source or function. So we need to make a fix for that.

          Steven Bower added a comment -

          Uwe Schindler Sent a mail over to our legal folks, as this is what they instructed me to do. Will follow up and resolve.

          Uwe Schindler added a comment (edited) -

          Hi Steven,

          I refer to this one: http://www.apache.org/legal/src-headers.html

          Source File Headers for Code Developed at the ASF

          This section refers only to works submitted directly to the ASF by the copyright owner or owner's agent.
          If the source file is submitted with a copyright notice included in it, the copyright owner (or owner's agent) must either:

          • remove such notices, or
          • move them to the NOTICE file associated with each applicable project release, or
          • provide written permission for the ASF to make such removal or relocation of the notices.

          Each source file should include the following license header – note that there should be no copyright notice in the header:

                 Licensed to the Apache Software Foundation (ASF) under one
                 or more contributor license agreements.  See the NOTICE file
                 distributed with this work for additional information
                 regarding copyright ownership.  The ASF licenses this file
                 to you under the Apache License, Version 2.0 (the
                 "License"); you may not use this file except in compliance
                 with the License.  You may obtain a copy of the License at
          
                   http://www.apache.org/licenses/LICENSE-2.0
          
                 Unless required by applicable law or agreed to in writing,
                 software distributed under the License is distributed on an
                 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
                 KIND, either express or implied.  See the License for the
                 specific language governing permissions and limitations
                 under the License.
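
          As a concrete illustration of the placement, the header simply replaces any per-company copyright block at the top of each contributed source file; the package and class below are hypothetical placeholders, not files from the patch:

            /*
             * Licensed to the Apache Software Foundation (ASF) under one
             * or more contributor license agreements.  See the NOTICE file
             * ... (full text exactly as quoted above) ...
             * under the License.
             */
            package org.apache.solr.analytics.example; // hypothetical package

            /** Placeholder class; it exists only to show where the header goes. */
            public class HeaderPlacementSketch {
            }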
          
          Steven Bower added a comment -

          Updated patch:

          • Updated license comment to remove copyrights
          • Added copyright notice to NOTICE.txt
          • Cleaned up lots of Javadoc warnings
          • Cleaned up some exception handling
          Steven Bower added a comment -

          Removed original patch file as it contained incorrect copyright headers

          Yonik Seeley added a comment -

          Sweet... nice work guys!

          Implementation details are just that. But perhaps we should land this on trunk and let the interface "bake" so it doesn't accidentally get released early in a 4x release?
          On a quick scroll through, it looks like mostly new files, which is great (i.e. it won't complicate the backporting/merging of other solr features from 4x to trunk)

          Steven Bower added a comment -

          Yup... we intentionally laid it out so that there is very little (only 2 files) that needs to change in order to merge this in. Would love for this to end up on trunk. We are actively working on this as well, adding new functionality, performance tuning, etc. If I had commit access to trunk I'd gladly keep it up to date, merged with the latest, as well as keep up patch releases for 4.x (as that is what we are currently deploying against in our production environment).

          Erick Erickson added a comment -

          I've assigned this to myself to commit, etc. I'll need all the help anyone wants to lend as far as the technical details are concerned; this is a lot of code in places I'm not all that familiar with, and like everyone else I have faaaar too many things on my plate.

          Steven & co:

          A couple of procedural details:

          1> There's no need at all to remove old patches when you put new ones up, in fact it's preferable to leave the old ones there. Just name them all SOLR-5302.patch. The newest version will be in blue and all the older versions will be gray and they're listed in date order so it's really easy to know the order and look at changes version-to-version should that be necessary.

          2> At some point when we're in some agreement (very soon I hope!), I'll commit the patch to trunk where we can bang on it a while before merging into 4x. I'll try to turn any new patches around in a day or less when we get to that point. I'm weaseling here since I'll be traveling for 10 days or so starting this weekend, otherwise I should be faster....

          I applied the patch to trunk and there are two issues:

          3a> A couple of files have this: "import org.apache.mahout.math.Arrays;", and as far as I can tell it is only used for the toString operation in error messages. The code compiles if we just use the java.util.Arrays import. I'd rather not introduce a new dependency, so how about switching to java.util.Arrays?

          3b> Trying to run the tests on trunk gives this error: "dynamicField can not have a default value: *_i" (there are a couple of others). See SOLR-5227: CHANGES.txt says that setting the default and required options was silently ignored anyway as of 4.5, and it now emits an init error. Removing the default assignments gets us past the initialization error, but then several tests fail, stack trace at the end (TRUNK); I haven't pursued it yet:

          Thanks loads for taking this all on and contributing it back! I'll do my best to get it into the code base as fast as possible. And the patch comes with documentation too! How cool is that!

          Erick

          java.lang.NullPointerException
          at __randomizedtesting.SeedInfo.seed([22E0CD041D7B8CF3]:0)
          at org.apache.solr.analytics.util.valuesource.MissFieldSource.description(MissFieldSource.java:52)
          at org.apache.lucene.queries.function.ValueSource.toString(ValueSource.java:58)
          at org.apache.solr.analytics.statistics.StatsCollectorSupplierFactory.create(StatsCollectorSupplierFactory.java:159)
          at org.apache.solr.analytics.accumulator.BasicAccumulator.<init>(BasicAccumulator.java:60)
          at org.apache.solr.analytics.accumulator.BasicAccumulator.create(BasicAccumulator.java:84)
          at org.apache.solr.analytics.request.AnalyticsStats.execute(AnalyticsStats.java:82)
          at org.apache.solr.handler.component.AnalyticsComponent.process(AnalyticsComponent.java:44)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:209)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1861)
          at org.apache.solr.util.TestHarness.query(TestHarness.java:291)
          at org.apache.solr.util.TestHarness.query(TestHarness.java:273)
          at org.apache.solr.analytics.NoFacetTest.beforeClass(NoFacetTest.java:103)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559)
          at com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79)
          at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:677)
          at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:693)
          at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
          at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
          at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
          at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
          at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
          at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
          at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
          at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
          at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:43)
          at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
          at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
          at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
          at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
          at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
          at java.lang.Thread.run(Thread.java:722)

          David Arthur added a comment -

          Would this support performing analytics on the score? For example, I'd like to roll up results by some fields and sum the scores.

          Awesome patch by the way.

          Houston Putman added a comment -

          Not currently, but I think it would be very easy to add that functionality. We tried to make adding new features as painless as possible.

          Erick Erickson added a comment -

          BTW, I'd guess we should keep additional enhancements out of this patch and add them in as new JIRAs, perhaps linking them back here unless they're totally painless....

          Steven Bower added a comment -

          Erick Erickson Will check out trunk tonight and apply/test. Also will start creating linked sub-tickets for the requests here, as well as the laundry list of things we plan on adding over time.

          Steven Bower added a comment -

          Erick Erickson per 3a) that was just Eclipse being bad at choosing the right package; will clean up. 3b) will require a bit more work, as we added quite a bit of code to work around not having missing values for docValues. The SOLR-5227 fix is a great improvement and will greatly simplify our code.

          Steven Bower added a comment (edited) -

          Patch updated to include

          • Cleaned up imports of Arrays class to use java.util.Arrays
          • Added support for "missing" numeric doc values
          • Removed the "defaultIsMissing" stuff as it's no longer needed
          • Now works against 4.5.0 (and trunk, although I've not explicitly tested)
          Saar Carmi added a comment -

          Thanks for that functionality!

          In the attachment "Search Analytics Component.pdf" there is a broken link "This shows how to add the functionality yourself".
          Where can I find that one?

          Also, the link to the PowerPoint earlier in that PDF is broken.

          Steven Bower added a comment -

          That part of the doc is a bit rough... I'll try to post it shortly. The pptx I'll need to have reviewed before I can post it, as it contained some internal stuff. Hopefully I will get this up next week.

          Alexander Koval added a comment (edited) -

          Do you plan to add support for local params for excluding some tagged filters? See SOLR-3177.

          Andrew Psaltis added a comment -

          Steven Bower This is great; we have been playing around with this against Solr 4.5. What would it take to implement pivot faceting, so that a defined stat could be applied across multiple dimensions? Can you point me in the right direction to do this?

          Houston Putman added a comment (edited) -

          Andrew Psaltis, I would look at FacetingAccumulator and the fieldFacetAccumulator and implement something similar to those. I don't know much about pivot faceting, but from what I can tell it is nested field facets. The FacetingAccumulator acts like a wrapper on the BasicAccumulator to add functionality for Facets; so I would add another BasicAccumulator wrapper that deals with pivoting.

          When the functionality is there, you will want to make a PivotFacetingRequest class, and look at the AnalyticsRequestFactory, AnalyticsStats and AnalyticsRequest to make sure your pivot params get parsed correctly and computed.
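
          As a very rough structural sketch of the wrapper idea described above (none of these types or signatures are the real patch APIs; they only illustrate the "decorate the basic accumulator" shape):

            // Hypothetical interfaces only; the real classes are BasicAccumulator,
            // FacetingAccumulator, etc., whose actual signatures differ.
            interface AccumulatorSketch {
              void collect(int doc);
            }

            class BasicAccumulatorSketch implements AccumulatorSketch {
              @Override public void collect(int doc) { /* compute the requested stats */ }
            }

            // FacetingAccumulator-style wrapper: delegates stat collection and adds
            // per-bucket bookkeeping. A pivot version would nest this one more level,
            // keeping one child accumulator per (outer bucket, inner bucket) pair.
            public class PivotFacetingAccumulatorSketch implements AccumulatorSketch {
              private final AccumulatorSketch delegate;

              public PivotFacetingAccumulatorSketch(AccumulatorSketch delegate) {
                this.delegate = delegate;
              }

              @Override public void collect(int doc) {
                // 1) resolve the outer facet bucket for this doc
                // 2) resolve the inner facet bucket for this doc
                // 3) route to the per-bucket-pair child, which ultimately wraps a basic accumulator
                delegate.collect(doc);
              }
            }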

          Steven Bower added a comment -

          Added a sub-task for pivot faceting... shouldn't be too difficult to add.

          Erick Erickson added a comment -

          OK, I'm not quite sure how to proceed given the size of this patch. What do people think about this as a way forward?

          I'll do all the pre-commit/ant testing stuff, basically the secretarial work involved in committing this to trunk. Since this is a new component, it's at least somewhat isolated from other bits of the code. I'll let it bake for a while in trunk and then merge into 4x. Since we just put 4.5.1 out (well, Mark did), if sometime a week or so after it's committed to trunk I merge it to 4x, there'll be substantial time to bake there before any 4.6 goes out.

          Of course I'll look it over, but given the size it'll be mostly a surface level look-over. Anyone who wants to delve into details is more than welcome to...

          How does that sound?

          Erick Erickson added a comment -

          Please apply my updated version of the patch or make the same changes before making a new one or I'll have to re-do some work.

          NOTE: This is against trunk!

          Working with pre-commit:

          Changes I had to make:

          A couple of files were indented with tabs. Since they're new files, I just reformatted them.

          The forbidden-API checks failed on several files, mostly requiring either Scanners to have "UTF-8" specified or String.toLowerCase to have Locale.ROOT, and such-like.
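
          For anyone hitting the same precommit failures, the fixes are of this shape (the file name below is a hypothetical example):

            import java.io.File;
            import java.io.FileNotFoundException;
            import java.util.Locale;
            import java.util.Scanner;

            public class ForbiddenApiFixSketch {
              public static void main(String[] args) throws FileNotFoundException {
                // Forbidden: new Scanner(file) relies on the platform default charset.
                // Fix: name the charset explicitly.
                Scanner in = new Scanner(new File("some-test-file.txt"), "UTF-8");

                // Forbidden: s.toLowerCase() is locale-sensitive (the "Turkish i" problem).
                // Fix: pin the locale.
                String s = in.hasNextLine() ? in.nextLine() : "";
                System.out.println(s.toLowerCase(Locale.ROOT));

                in.close();
              }
            }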

          I did most of this on the plane ride home, and I must admit it's annoying to have precommit fail because I don't have internet connectivity; there must be a build flag somewhere.

          These files have missing javadocs
          [exec] missing: org.apache.solr.analytics.accumulator
          [exec] missing: org.apache.solr.analytics.accumulator.facet
          [exec] missing: org.apache.solr.analytics.expression
          [exec] missing: org.apache.solr.analytics.plugin
          [exec] missing: org.apache.solr.analytics.request
          [exec] missing: org.apache.solr.analytics.statistics
          [exec] missing: org.apache.solr.analytics.util
          [exec] missing: org.apache.solr.analytics.util.valuesource
          [exec]
          [exec] Missing javadocs were found!

          Tests failing, and a JVM crash to boot.

          FieldFacetExtrasTest fails with "unknown field int_id". There's nothing in schema-docValues.xml that would map to that field; did it get changed? Is this a difference between trunk and 4x?

          • org.apache.solr.analytics.NoFacetTest (suite)
            [junit4] - org.apache.solr.analytics.facet.FieldFacetExtrasTest (suite)
            [junit4] - org.apache.solr.analytics.expression.ExpressionTest (suite)
            [junit4] - org.apache.solr.analytics.AbstractAnalyticsStatsTest.initializationError
            [junit4] - org.apache.solr.analytics.util.valuesource.FunctionTest (suite)
            [junit4] - org.apache.solr.analytics.facet.AbstractAnalyticsFacetTest.initializationError
            [junit4] - org.apache.solr.analytics.facet.FieldFacetTest (suite)
            [junit4] - org.apache.solr.analytics.facet.QueryFacetTest.queryTest
            [junit4] - org.apache.solr.analytics.facet.RangeFacetTest (suite)
          Nelson Gonzalez Gonzalez added a comment -

          I apologize if Solr JIRA is not the place for this kind of question, but I really need help with the Analytics Component. I am working on a project where we need to compute some stats that are impossible using the standard StatsComponent. I applied the patch to Solr 4.5.1 and it worked, but with SolrCloud it didn't, so I have a question:

          Does the Analytics Component support SolrCloud? I could not get the component working on a SolrCloud cluster, but it worked on a single Solr instance.

          Erick Erickson added a comment -

          NOTE: I'm pretty sure the JVM crash I'm seeing is unrelated to this patch, it shows up in other places and appears to be a Java problem...

          Steven Bower added a comment -

          Nelson Gonzalez Gonzalez Can you provide more detail with regard to your issue? Is your SolrCloud setup sharded? If it is sharded, the Analytics component will not currently work, as some of the statistics (median/percentile) cannot easily be computed across shards.

          Steven Bower added a comment -

          Been away this last week at Lucene rev, so I haven't had a chance to look at these issues...

          My guess is a schema change on trunk with regard to the _id fields. Will take a look and adjust.

          The rest should get fixed up next week when I'm back in the States.

          Nelson Gonzalez Gonzalez added a comment (edited) -

          Yes, my SolrCloud is sharded; I split the index into two shards. I applied the patch to Solr 4.5.1 and tried to test the Analytics component with the standard Solr example (exampledocs folder).
          I have two shards (shard1, shard2) with 32 documents (exampledocs post.jar).

          For example when I execute the following query in a single solr instance (no shards) it returns stats:

          http://localhost:8983/solr/select?q=*:*&olap=true&olap.req1.statistic.stat1=sum(price)

          Response:

          <lst name="stats">
          <lst name="req1">
          <double name="stat1">5251.270030975342</double>
          </lst>
          </lst>

          If I try to execute the same query on a sharded SolrCloud (two shards), it returns NO stats result; it does NOT throw any exception, it simply does NOT return any stat result.

          If I execute the same query in the sharded environment (SolrCloud) with the parameter distrib=false, it does return stats:

          http://localhost:8983/solr/select?q=*:*&olap=true&olap.req1.statistic.stat1=sum(price)&distrib=false

          It is as if the Analytics Component does not support distributed queries (sharded SolrCloud).

          I see there are some methods in the AnalyticsComponent class that are not implemented:

          modifyRequest, handleResponses, finishStage

          I think the unimplemented methods are the reason why the Analytics component is not returning data in a sharded SolrCloud environment, but I am not sure.

          Please, I really need help, because I am close to the deadline of a project and need to decide whether to continue with the Analytics component or choose another approach to work around a big issue ("stats.facet does not support multivalued fields", related to this JIRA ticket: https://issues.apache.org/jira/browse/SOLR-1782); we have our index split into many shards using SolrCloud features of Solr 4.4.0.
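
          For reference, those are the SearchComponent hooks Solr calls during distributed requests. A hedged sketch of what wiring them up could look like for the subset of statistics that merge cleanly (e.g. sums/counts); this is not the actual patch code, and the "stats"/"req1"/"stat1" keys simply mirror the example request above:

            import java.io.IOException;

            import org.apache.solr.common.util.NamedList;
            import org.apache.solr.handler.component.ResponseBuilder;
            import org.apache.solr.handler.component.SearchComponent;
            import org.apache.solr.handler.component.ShardRequest;
            import org.apache.solr.handler.component.ShardResponse;

            public class DistributedAnalyticsSketch extends SearchComponent {

              @Override
              public void prepare(ResponseBuilder rb) throws IOException { /* no-op for the sketch */ }

              @Override
              public void process(ResponseBuilder rb) throws IOException {
                // Per-shard computation would run here, exactly as in the single-node case.
              }

              @Override
              public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
                // Merge the per-shard results; only some stats (sum, count, min, max) merge this way.
                double mergedSum = 0.0;
                for (ShardResponse srsp : sreq.responses) {
                  NamedList<?> shardStats = (NamedList<?>) srsp.getSolrResponse().getResponse().get("stats");
                  if (shardStats == null) continue;
                  NamedList<?> req1 = (NamedList<?>) shardStats.get("req1");
                  if (req1 != null && req1.get("stat1") instanceof Double) {
                    mergedSum += (Double) req1.get("stat1"); // sums merge; medians/percentiles do not
                  }
                }
                rb.rsp.add("stats.merged.sum", mergedSum);
              }

              @Override
              public void finishStage(ResponseBuilder rb) { /* final response shaping would go here */ }

              @Override
              public String getDescription() { return "sketch of distributed analytics merging"; }

              @Override
              public String getSource() { return "sketch"; }
            }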

          Erick Erickson added a comment -

          [~smb-solr] Steve:

          Don't mean to hassle you on this, just a ping to make sure I'm not dropping the ball here. I know how much stuff stacks up when you're away for a week!

          FYI, though, I'll be out of internet range most of December, so I'd like to get this committed to trunk by Thanksgiving if possible, then to 4x before I leave. Otherwise I'll have to hand it off to someone else to commit.

          I really appreciate all the work that went into this and your willingness to contribute it!

          Erick

          Steven Bower added a comment -

          Been a bit tied up... should have this good to go by mid-day today.

          Erick Erickson added a comment -

          NP, like I said I'm just ensuring that I haven't dropped the ball....

          Steven Bower added a comment -

          New patch attached addressing the schema and Javadoc issues.

          For the schema I just added a new one, schema-analytics.xml. I must have missed this file in my first pass, as I had made changes to schema-docValues.xml; now it's a separate file and should avoid any confusion/changes in the future.

          Added package.html files for the missing Javadocs.

          Erick Erickson added a comment -

          OK, I'll try this out this evening at latest. If it passes precommit and test I'll put it up on trunk and we can go from there.

          Erick Erickson added a comment -

          Some of the tests were failing because, at least on my Mac, the paths to the various files in test-files weren't correct when I ran "ant test" from the shell. Current implementation succeeds both from the shell and IntelliJ.

          We'll see if this breaks on Jenkins.

          Meanwhile, committing on trunk. Anyone who wants to give it a whirl, feedback greatly appreciated!

          Awesome stuff, Steven! Here I thought I had a big patch at 150K or so!

          This has considerable documentation; anyone want to volunteer to incorporate it into the Wiki?

          ASF subversion and git services added a comment -

          Commit 1543651 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1543651 ]

          SOLR-5302 Analytics component. Checking in to trunk, we'll let it back then port to 4x

          Erick Erickson added a comment -

          So I can't type in comments correctly. The SVN comment should be "we'll let it bake then"....

          I won't be able to do anything with this after 5-Dec for a month or so. How long do people think it needs to bake before committing to 4x? We just cut 4.6, so there's some time to bake before the next Solr release, I should think, especially with the holidays coming up.

          What do people think? I'll put a note in my calendar to put it up 1-Dec unless
          a> there are problems found
          or
          b> people object
          or
          c> consensus is reached that this should be done sooner.

          Steven Bower added a comment -

          One thing I'll try to do shortly is to make this fail better and/or add support for multi-shard environments. Some things can be handled similarly to the stats component, but some things (median, etc.) can't.

          Is there a generally accepted approach to handling components that are not multi-shard compliant?

          Steven Bower added a comment -

          Btw... Erick, thanks for putting in all the work you did on this!

          Erick Erickson added a comment (edited) -

          bq: thanks for putting in all the work you did on this!

          It's a very small fraction of the work you did!

          About extending to multi-shard environments, let's open up a new JIRA for that, it'll make tracking and reconciling all this easier.

          Erick Erickson added a comment -

          Test failure on Jenkins, doesn't reproduce for me though. Noticed one failure was a 32 bit client and thought that might be relevant but it happens on a 64 bit client too.

          ant test -Dtestcase=NoFacetTest -Dtests.method=stddevTest -Dtests.seed=8DD436C49013B770 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ms_MY -Dtests.timezone=Europe/Sofia -Dtests.file.encoding=UTF-8

          and did not reproduce the problem. Tried running the suite with -Dtests.iters=100 and that succeeded too. Also tried one of the other failures for this case,

          ant test -Dtestcase=HardAutoCommitTest -Dtests.method=testCommitWithin -Dtests.seed=E7BA795017967CA6 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=iw -Dtests.timezone=Australia/Tasmania -Dtests.file.encoding=UTF-8

          and that succeeds as well.

          Maybe an environment issue and/or some kind of precision problem? Here's the test in question:
          //Float
          Double floatResult = (Double)getStatResult(response, "str", "double", "float_fd");
          Double floatTest = (Double)calculateNumberStat(floatTestStart, "stddev");
          assertTrue(Math.abs(floatResult-floatTest)<.00000000001);

          Erick Erickson added a comment -

          Rats. Forgot to mention this JIRA in the commit I just did. r-1543796

          Following Dawid's suggestion, I changed the test to:

          assertTrue("Oops: (double raws) " + Double.doubleToRawLongBits(floatResult) + " - "
          + Double.doubleToRawLongBits(floatTest) + " < " + Double.doubleToRawLongBits(.00000000001) +
          " Calculated diff " + Double.doubleToRawLongBits(floatResult - floatTest),
          Math.abs(floatResult - floatTest) < .00000000001);

          to give us the raw data to help figure out what's going on.

          David Smiley added a comment -

          Minor observation: your assertTrue should probably be written as an assertEquals between floatResult, floatTest, and the given delta. JUnit would have told you the values even without adding a message (I believe).
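
          A minimal self-contained sketch of that suggestion (the values are stand-ins for what getStatResult(...) and calculateNumberStat(...) return in the real test):

            import static org.junit.Assert.assertEquals;

            import org.junit.Test;

            public class DeltaAssertSketch {
              @Test
              public void stddevWithinTolerance() {
                // Stand-in values; the real test computes these from the response and the raw data.
                double floatResult = 1.2345678901234;
                double floatTest   = 1.2345678901233;

                // assertEquals(expected, actual, delta) reports both values on failure,
                // so no hand-built message is needed.
                assertEquals(floatTest, floatResult, 0.00000000001);
              }
            }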

          Otis Gospodnetic added a comment -

          This ticket is to track a "replacement" for the StatsComponent.

          Is there anything StatsComponent does that this Analytics Component does not or cannot/will not do? If not, should StatsComponent be deprecated?

          Multi-shard support (may not be possible for some operations, eg median)

          See https://www.google.com/search?q=qdigest

          Erick Erickson added a comment -

          [~steven bower] I've created a new Solr JIRA for fixing any test errors, I've seen two so far that may well be environment sensitivities. Let's collect any test fixes in SOLR-5448, that'll make it easier to merge into 4.x.

          Also, what do people think about just closing the other JIRAs linked to this one about improvements to stats component? And Otis' question about whether to just deprecate the stats component is a good one. I suppose if we decide to deprecate the old stats component, it answers the question about closing JIRAs related to it.

          Nim Lhûg added a comment -

          Erick Erickson Has this been merged into the 4.6 RCs? If so, I would like to run it through our application tests in place of the StatsComponent. We use the StatsComponent for a lot of our heavy lifting, and having it deprecated (or replaced) makes me a bit nervous. Mostly because this patch is very complex and I haven't had much time to test it.

          Nim Lhûg added a comment -

          Additionally, StatsComponent has SolrJ integration (FieldStatsInfo classes etc). Analytics Component doesn't seem to have any SolrJ sugar yet (unless I overlooked a patch somewhere?). Might be a bit too soon to deprecate StatsComponent.

          Erick Erickson added a comment -

          Nim:

          Good point about SolrJ, although there's nothing magic about SolrJ integration, you can always use params.

          No, it hasn't been folded into any 4x code yet, we're letting it bake for a while on trunk. If all goes well, I'll back-port into 4x in early Dec.
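
          As a rough illustration of "you can always use params", here is a SolrJ sketch that sets the analytics parameters directly on the query. It assumes the 4.x SolrJ API (HttpSolrServer/SolrQuery), and the parameter names are only illustrative, following the o.<request>.s.<statistic> pattern that appears later in this thread rather than any documented contract:

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.client.solrj.response.QueryResponse;

          public class RawParamsAnalyticsExample {
            public static void main(String[] args) throws Exception {
              HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
              SolrQuery query = new SolrQuery("*:*");
              // No FieldStatsInfo-style sugar yet, so pass the component's params as-is.
              query.set("olap", true);                    // illustrative parameter name
              query.set("o.req1.s.mean", "mean(price)");  // illustrative statistic expression
              QueryResponse rsp = server.query(query);
              System.out.println(rsp.getResponse());
            }
          }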

          Nim Lhûg added a comment -

          Ok. Refactoring our application to use the new component (with or without SolrJ sugar) would be a pretty substantial effort, but I'll try to squeeze it in at some point this month so I can give it a bit of a test.

          David Smiley added a comment -

          Fear not Nim (and others), Solr almost never changes the request/response params/format. It may have happened before but it is so rare that I simply can't recall the last time it has. Stuff gets deprecated but sticks around forever. Backwards compatibility is kept to extremely high standards here (good for you, sucks for us committers). Instead of removal, I suspect at some point in the future, the Stats' implementation would get replaced by a proxy implementation that uses the new code in this Analytics component. And that is not an option until this Analytics component does everything Stats does (e.g. distributed-mode).

          Steven Bower added a comment -

          Erick Erickson will continue looking at the new ticket/test issues... should hopefully get it sorted today...

          Also regarding the interface to the analytics component.. I am not really wedded to the current interface.. it loosely follows some of the structure of the Stats component but obviously had to diverge for new functionality... We also built an XML-based format (the code is actually in the patch) for specifying analytics requests.. I'll take a look at the solrj stuff too, because internally all the parameters passed in the URI are turned into an object model (AnalyticsRequest and subordinate classes) which could easily be moved into the SolrJ side to make things cleaner.. I'll try to look at that this week as well.. but if people have suggestions with regard to input/output format I'm open to making some changes.

          Otis Gospodnetic added a comment -

          Steven Bower - just linked MAHOUT-1361 which you may want to look at.

          Erick Erickson added a comment -

          Steven Bower (or anyone else, for that matter, who likes this kind of thing).
          Here's the bits from the enhanced error output:

          You may want to watch SOLR-5488, let's move over there for the test fixes.

          1 tests failed.
          FAILED: org.apache.solr.analytics.NoFacetTest.stddevTest

          Error Message:
          Oops: (double raws) 4631318898052956160 - 4628496337733101339 < 4442235333156365461 Calculated diff 4625071700926640586

          Stack Trace:
          java.lang.AssertionError: Oops: (double raws) 4631318898052956160 - 4628496337733101339 < 4442235333156365461 Calculated diff 4625071700926640586
          at __randomizedtesting.SeedInfo.seed([94AAF7392EB49CCD:916AB78C1ED798C2]:0)
          at org.junit.Assert.fail(Assert.java:93)
          at org.junit.Assert.assertTrue(Assert.java:43)
          at org.apache.solr.analytics.NoFacetTest.stddevTest(NoFacetTest.java:227)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
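
          For anyone poking at this, the raw longs in that message can be decoded back into doubles with a few lines of plain Java (the values below are copied verbatim from the failure above):

          public class DecodeRawBits {
            public static void main(String[] args) {
              // floatResult, floatTest, the allowed delta, and the calculated diff, in message order
              long[] raws = {4631318898052956160L, 4628496337733101339L,
                             4442235333156365461L, 4625071700926640586L};
              for (long bits : raws) {
                System.out.println(bits + " -> " + Double.longBitsToDouble(bits));
              }
            }
          }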

          Erick Erickson added a comment -

          I have a time constraint here. See comments for SOLR-5488. The short form is I have to be done with this no later than tomorrow (Tuesday) night. I've outlined several options at SOLR-5488, let me know what people think the best thing to do is. Please comment on SOLR-5488.

          Erick Erickson added a comment -

          I'm going to commit this to 4x this afternoon unless there are objections. The test that fails isn't a regression, so... I don't like putting code in 4x that has a sporadic test failure, but life isn't always tidy and I have a time limit.

          Or a committer can volunteer to take this over until I can work with it again.

          Uwe Schindler added a comment -

          Hi Erick,
          Can you simply add an @Ignore to the test with a message mentioning the issue?

          Robert Muir added a comment -

          I don't understand why a JIRA issue can have a time limit. Maybe this wasn't ready for trunk yet and should be iterated on in a branch?

          I don't think unstable stuff should be backported to 4.x!!!

          Erick Erickson added a comment -

          The JIRA doesn't have a time limit, I do. All I'm doing here is trying to ensure that people don't expect me to do anything with this in the near future and leaving a paper trail that lets someone else pick it up in my absence. And letting folks know the current state in case they do want to pick it up.

          I'm fine with leaving it as it is. I've listed the merges that need to happen if someone wants to merge this all in to 4x when appropriate. If there's a fix for the test problem, then there'll be at least one more merge than I've listed of course.

          I'm also trying to NOT check things in the night before I leave on vacation...

          Elran Dvir added a comment -

          I saw in the documentation that "Unique count" is supported among other statistical expressions.
          What about the unique values themselves? (as described in SOLR-5428)

          Thanks.

          Steven Bower added a comment -

          Hadn't thought about unique values, but in principle it's pretty straightforward, as we hold on to all the values since we need them to count things...

          Maybe create a new ticket for that...

          I will also think a bit more because often what is wanted is not always a stat but the ability to transform/reduce the set of values coming back... as in the case of distinct values... of course you can solve that now with faceting..

          Michel Lemay added a comment -

          Few problems found in patch as of 19/Nov/13 18:46

          • Sorting on bucket name (asc or desc) does not work
          • Some functions produce errors. For example, &o.req1.s.value1=int(sum(myfield)) will produce the following error: <str name="msg">int does not have the correct number of arguments.</str>
          • Output of not-so-large values is in scientific notation (StatsComponent outputs the same value correctly); see the short note after this list
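
          A short note on the scientific-notation point: Java's Double.toString switches to scientific notation for magnitudes of 10^7 and above, which would explain the behavior if the component writes values out with the default double formatting (an assumption on my part). A tiny standalone illustration:

          public class SciNotationDemo {
            public static void main(String[] args) {
              double v = 12345678.9;
              System.out.println(Double.toString(v));       // 1.23456789E7 (scientific notation)
              System.out.println(String.format("%.2f", v)); // 12345678.90  (plain decimal)
            }
          }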
          Houston Putman added a comment -

          1) I don't think that sorting on bucket name is supported. If you want the buckets sorted, you have to choose a statistic to sort on.

          2) int() is not a function. It was used in an early implementation, but I don't think that it is still around anywhere in the code. Was the expression "&o.req1.s.value1=int(sum(myfield))" in an actual test?

          Michel Lemay added a comment - - edited

          1) In my opinion, it should be implemented to parallel the SimpleFacet feature list. Also, it's mentioned in the description of this feature request: "The AnalyticsComponent supports the following features:
          All functionality of StatsComponent (...Sorting (bucket name or any stat in the bucket..."
          However, I'm not sure if I misunderstood what 'bucket name' refers to. Is it the facet display value itself or olap.req.statistics.<name>?

          2) I derived it from an example found at the bottom of this document: https://issues.apache.org/jira/secure/attachment/12606794/Statistical%20Expressions.pdf
          Note, double() also has the same behavior.
          Surprisingly, add(..,..) returns a double even if both parameters are integers. That's why I wanted to cast back to int.

          Houston Putman added a comment -

          1) Bucket name refers to the facet display value. This should be added to the new features ticket, and shouldn't be too hard to implement.

          2) Good catch, that should be updated. Every expression or function that operates on numeric values will return a double. This is mainly for simplicity, because it would get ugly pretty quickly otherwise.

          Mehmet Erkek added a comment -

          This looks like a great feature. Many thanks to those who were involved in it. We will try to use it.
          Maybe this is not the best place to ask; if so, I apologize. However, I have two questions:
          1- Which release includes this functionality?
          2- How can we use it? The attached PDF has broken links and does not seem to have detailed info on how to use it. I'd appreciate it if you could share some more info.
          Thanks.

          Erick Erickson added a comment -

          Right, the user's list is probably a better forum...

          1> This functionality is only in trunk (the future 5x). There's an occasional test failure that we want to fix before we fold it in to 4x. 4.7 would be the earliest this would go into a released version.

          2> The PDFs are fine as far as I can tell, you need to download them rather than open them in place.

          Mehmet Erkek added a comment - - edited

          Thank you Erick.
          1) Any idea when the future 5x could be released?
          2) This is the link I meant in the PDF: https://cms.prod.bloomberg.com/team/display/fdns/Search+Analytics+Component

          Shawn Heisey added a comment -

          This is link I meant in the pdf: https://cms.prod.bloomberg.com/team/display/fdns/Search+Analytics+Component

          If I had to guess, I would say that is an internal website for Bloomberg, something that only employees can get to. If they intend it for public consumption, they'll need to publish the data on a public website and fix the links in the PDF.

          Any idea when the future 5x could be released?

          Quick answer: 5.0 is many months away. It's impossible to give any kind of release date prediction. Hopefully this particular feature will end up in a 4.x release, once Erick (or another committer) has the time to devote to giving the code a thorough review.

          Longer answer:

          At this time, nobody has come up with a timeframe for Solr 5.0. Once somebody decides we're going to begin the process and agrees to be the release manager, a LOT has to happen, and there's really no way to make it happen quickly.

          Even if we began the 5.0 release process tomorrow and everything were to be extremely smooth, I don't think you'd even see a 5.0-ALPHA release for a few months. We can't begin the release process that soon, so it's going to be even longer. One of the big items still left to do is to embed the HTTP server layer and make Solr into a standalone application.

          I wasn't involved with the development when 4.0 was released, so I don't know how much time passed between the beginning of the 4.0 release process and 4.0-ALPHA, but I can tell you that there were three months between 4.0-ALPHA and 4.0-FINAL.

          Mehmet Erkek added a comment -

          Thanks Shawn. Nice answer. I think we need this component sooner. In that case, my question here is: is there anything we can do to help get this feature included in one of the 4.x versions?

          Erick Erickson added a comment -

          Mehmet:

          We're still trying to track down what's behind the test failures, that effort is being tracked in SOLR-5488. That discussion shows a way to reproduce the test failures we see, albeit intermittently.

          You could certainly help if you can
          1> reproduce the problem. Note the discussion at SOLR-5488 about
          ant test -Dtestcase=ExpressionTest -Dtests.iters=10000
          2> figure out why/create a patch.

          and/or

          3> exercise trunk as much as possible to see that it all works.

          Let's move the rest of the discussion over to SOLR-5488 though, this JIRA is gated by that one.

          Pete added a comment -

          Hi,

          I am new to this. Can I apply patch 5302 to Apache Solr version 4.6.1?

          Will I be able to replicate the facet.pivot behavior using the analytics component?

          For example, I would like to get the sum of the field "price" for each "manu", split by "instock" true and false.

          Many thanks

          Steven Bower added a comment -

          The patch is for trunk.. in general it should apply pretty cleanly, as most of the code is in a separate package.. but there are a few files/APIs that have changed, so it's not likely to apply completely cleanly.. I am going to be working on moving to 4.6.x this week, so maybe I'll try to make a clean 4.6.1 patch.. currently pivot facet behavior is not supported, however it's totally doable, just need to do the work...

          Erick Erickson added a comment -

          We seem to have stabilized this in trunk, so we need to back-port it to 4x when we get any outstanding interface issues resolved. See SOLR-5963.

          Grant Ingersoll added a comment -

          Does this work in SolrCloud mode? It seems to be the case that it doesn't. I really don't think we should put in new functionality like this without it supporting SolrCloud.

          Erick Erickson added a comment -

          Well, if it doesn't function in distributed mode, it seems we have two choices:
          1> pull it out of trunk
          2> put it into 4x and iterate.

          If we go with <1>, it seems best if I created an "uber patch" that preserves the work so far (including all the test stabilization updates) and attach that to a new JIRA. This would be both SOLR-5302 and SOLR-5488.

          Ryan McKinley added a comment -

          option 2 seems better – it will be easier to improve without dealing with massive patches. We could mark as experimental and change the format if necessary for distributed search.

          I really don't think we should put in new functionality like this without it supporting SolrCloud.

          I am all for aiming to have distributed search supported everywhere, but I don't think that should be a blocker.

          Steven Bower added a comment -

          Grant Ingersoll I agree that the ideal should be to have everything work in distributed mode (makes things way less confusing for people). However substantial work would be needed to make this functionality work in a multi-shard environment.. We'd essentially need a generic distributed map-reduce implementation that could run inside a query. +1 for that... This is because some of the stats are not easily computed without knowing all the values in one place (eg median/percentiles).

          I believe that there is substantial value in what exists in this patch, and that we should continue work into the future to design/implement multi-shard support for analytics.
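
          To make the median point concrete, here is a toy, self-contained illustration (synthetic numbers, nothing to do with the component's API) of why per-shard medians can't simply be merged into a global median:

          import java.util.Arrays;

          public class MedianMergeSketch {
            static double median(double[] v) {
              double[] s = v.clone();
              Arrays.sort(s);
              int n = s.length;
              return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
            }

            public static void main(String[] args) {
              double[] shard1 = {1, 2, 3};           // per-shard median: 2
              double[] shard2 = {4, 100, 200, 300};  // per-shard median: 150
              double[] all = {1, 2, 3, 4, 100, 200, 300};
              System.out.println("average of shard medians: " + (median(shard1) + median(shard2)) / 2.0); // 76.0
              System.out.println("true global median:       " + median(all));                             // 4.0
            }
          }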

          Grant Ingersoll added a comment -

          I don't agree. Distributed is and should be the default mode we do everything in going forward and if we don't account for it up front, then we end up making all kinds of compromises on it and/or it takes years to get done (just look at MLT). I can almost guarantee you the first question on the list once this is released is "how come it doesn't work in distributed". This is not a case of the "perfect being the enemy of the good enough", but a case of missing the fact that the usage of distributed is the world we live in and so this patch only serves those going backwards and not those going forward.

          It would be one thing if this issue had a plan for what can be distributed and what can't and an approach outlined such that it could be implemented sooner rather than later, but that doesn't appear to be the case, AFAICT. For instance, some of the stats that can't be easily distributed do have approximations that can be.

          We'd essentially need a generic distributed map-reduce implementation that could run inside a query. +1 for that.

          See https://issues.apache.org/jira/browse/SOLR-5069.

          Steven Bower added a comment -

          If someone wants to do that work that's great.. I don't have plans to work on multi-shard at the moment (this will change in the future) as I just don't have a use-case for it... we will though.. If someone wants to pick it up I'd gladly assist...

          I understand the intention to have everything cloud compatible.. The reality is that many components suffer from inconsistencies when in cloud mode (MLT, all the join work being done in Solr, FieldCollapsing, etc..) I think it should be the intention to make things work in cloud mode, however some use-cases don't really make sense in distributed mode when you look at the cost of the implementation.. we can do analytics very quickly in Solr with this component, but doing this as a map-reduce/distributed implementation may prove to be prohibitively time consuming at query time and thus may not ever get used in distributed configurations..

          Anyway I'd like to see this get in prior to supporting multi-node as it will probably be a long while before the infrastructure is in place to support it (ie the map-reduce ticket)

          Grant Ingersoll added a comment -

          How about as a compromise, we make this a contrib and make it fail fast in the sharded case so that we can move forward? In the meantime, a couple of engineers here at Lucid are looking at the distributed case.

          Hoss Man added a comment -

          How about as a compromise, we make this a contrib and make it fail fast in the sharded case

          +1

          This functionality is really cool, and I think for the people with single-node setups who want it and can take advantage of it we should absolutely make it available – but I share Grant's general concerns about adding new "built-in" functionality that doesn't work at all in distrib mode. I'd hate to see people try this out in the example and think it's great and will solve all their problems, but then get confused/disappointed/angry when it does nothing useful in their multi-node setup.

          As an optional contrib, we can make the inform(SolrCore) method check for SolrCloud mode and fail fast (and likewise, for old-school pre-SolrCloud manually managed multi-shard setups we can have distributedProcess fail fast at request time).

          (Note: I still have some other concerns about the general user API – see comments SOLR-5963, where arguably this whole discussion should have taken place)
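
          Purely as a rough sketch of that fail-fast idea (not the actual component code): it assumes the component implements SolrCoreAware, and the CoreContainer accessor shown matches the 4.x API and may differ in other versions.

          import org.apache.solr.common.SolrException;
          import org.apache.solr.core.SolrCore;
          import org.apache.solr.handler.component.SearchComponent;
          import org.apache.solr.util.plugin.SolrCoreAware;

          public abstract class CloudCheckingAnalyticsComponent extends SearchComponent implements SolrCoreAware {
            @Override
            public void inform(SolrCore core) {
              // Refuse to load under SolrCloud, since distributed mode is not supported yet.
              if (core.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
                throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                    "AnalyticsComponent does not work in SolrCloud mode yet");
              }
            }
          }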

          Pradeep added a comment -

          I do see that this issue was fixed in 5.0. Does this support distributed mode also?

          David Arthur added a comment -

          Pradeep, see previous discussion - no support for distributed (aka, "cloud") yet.

          Chris Hostetter, Grant Ingersoll, Steven Bower, in distributed mode, would this component work when talking to individual shards directly? If that's the case, then (for some stats) the end user can do a final roll-up themselves.

          Anirudha added a comment - - edited

          Yes. In SolrCloud mode, you can currently use this component when talking to individual shards directly, or if you have only one shard.

          Shalin Shekhar Mangar added a comment -

          Has anybody given any thought to how this might use the new AnalyticsQuery? Is the AnalyticsQuery framework powerful enough to make this component cloud-aware?

          Joel Bernstein added a comment -

          I'm fairly certain all the functionality in the AnalyticsComponent could be implemented as an AnalyticsQuery. Any functions that could be distributed would have MergeStrategy implementations as well.

          Erick Erickson added a comment -

          Shalin:

          It's actually a somewhat different problem I think. We're thinking of pulling this out of 5.x and going with the analytics framework instead, but haven't quite reached consensus on that. The big consideration here is that making this work distributed seems like a big task. Using the pluggable framework seems like it would be easier to build up as necessary.

          We really need to figure it out soon....

          Yonik Seeley added a comment -

          We're thinking of pulling this out of 5.x and going with the analytics framework instead, but haven't quite reached consensus on that.

          I didn't realize that... can you point me at the discussion?

          Steven Bower added a comment -

          Making the types of expressions the analytics framework supports work distributed is hard, period, regardless of the framework.. (eg median, percentiles, etc..) unless you accept some error rate... can someone point me to the "analytics framework" that is being talked about..?

          Erick Erickson added a comment -

          bq: I didn't realize that... can you point me at the discussion?

          I mis-stated that severely, my apologies. What I should have said is more along the lines that I don't quite know what to do with back-porting the analytics stuff to 4.x. Or whether we should. It's quite a bit of code, the interface is complex, and it doesn't play nice in distributed mode. I believe there are functions that simply won't work distributed. And maybe can't.

          Then there's the pluggable analytics framework that's been recently added. I really wonder whether the right thing to do long-term is to pull this out of 5x and port as much as possible into the pluggable analytics framework piecemeal as necessary, stealing as much as possible and supporting what can be supported in distributed mode. That still leaves the question of what to do with functions that are inherently difficult/impossible to support in sharded environments...

          See SOLR-5963 for some of the other discussion about whether to move this to a contrib rather than have it be in the mainline code. My concern is that if we move it to a contrib, it'll just be code that languishes, especially given the distributed limitations. Would it just be better to use the pluggable framework? It seems to me that the use-case for single-shard analytics is becoming less compelling, but that may be a misperception on my part.

          Don't want it to seem like there's any decision here, more like I don't want to introduce this much code into the mainline tree if it doesn't have wide applicability, and I think the lack of distributed support severely limits how widely it applies.

          That said, I'm not dogmatically opposed either. But I'd like some sense of what others think about it.

          Steven Bower added a comment -

          I think moving to contrib is probably the right thing at this point...

          David Smiley added a comment -

          I think moving to contrib is probably the right thing at this point...

          +1

          Joel Bernstein added a comment -

          Steven,

          Erick is talking about the AnalyticsQuery API in Solr 4.9 (http://heliosearch.org/solrs-new-analyticsquery-api/), which is a plugin point for custom analytics. Its design allows developers to plug in custom analytic Collectors inline with the flow of the search.

          Porting all the functions from the AnalyticsComponent to be AnalyticsQuery implementations and then adding distributed support (where possible) would take some serious thought and effort.

          For a near term solution, that makes all the functions available, I think the best option is the contrib approach.
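
          Purely as a hedged sketch (signatures taken from the linked blog post and may not match the shipped API exactly), a trivial counting statistic expressed through that plugin point might look roughly like this:

          import java.io.IOException;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.solr.handler.component.ResponseBuilder;
          import org.apache.solr.search.AnalyticsQuery;
          import org.apache.solr.search.DelegatingCollector;

          public class CountingAnalyticsQuery extends AnalyticsQuery {
            @Override
            public DelegatingCollector getAnalyticsCollector(final ResponseBuilder rb, IndexSearcher searcher) {
              return new DelegatingCollector() {
                private int count;

                @Override
                public void collect(int doc) throws IOException {
                  count++;
                  super.collect(doc);  // keep delegating so the normal result set is unaffected
                }

                @Override
                public void finish() throws IOException {
                  rb.rsp.add("analytics.count", count);  // surface the statistic in the response
                  super.finish();                        // let the base class propagate finish()
                }
              };
            }
          }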


            People

            • Assignee:
              Erick Erickson
              Reporter:
              Steven Bower
            • Votes:
              21
              Watchers:
              47
