Solr
  1. Solr
  2. SOLR-6666

Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: Schema and Analysis, update
    • Labels:
      None
    • Environment:

      Linux, Solr 4.8, Schema with 70 fields and more than 500 specific CopyFields for dynamic fields, but without wildcards (the fields are dynamic, the copy directive is not)

      Description

      Result:
      After applying a fix for this issue, tests which we conducted show more than 40 percent improvement on our insertion performance.

      Explanation:

      Using JVM profiler, we found a CPU "bottleneck" during Solr indexing process. This bottleneck can be found at org.apache.solr.schema.IndexSchema, in the following method, "getCopyFieldsList()":

      getCopyFieldsList()
      final List<CopyField> result = new ArrayList<>();
          for (DynamicCopy dynamicCopy : dynamicCopyFields) {
            if (dynamicCopy.matches(sourceField)) {
              result.add(new CopyField(getField(sourceField), dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
            }
          }
          List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
          if (null != fixedCopyFields) {
            result.addAll(fixedCopyFields);
          }
      

      This function tries to find for an input source field all its copyFields (All its destinations which Solr need to move this field).
      As you can probably note, the first part of the procedure is the procedure most “expensive” step (takes O( n ) time while N is the size of the "dynamicCopyFields" group).
      The next part is just a simple "hash" extraction, which takes O(1) time.

      Our schema contains over then 500 copyFields but only 70 of then are "indexed" fields.
      We also have one dynamic field with a wildcard ( * ), which "catches" the rest of the document fields.
      As you can conclude, we have more than 400 copyFields that are based on this dynamicField but all, except one, are fixed (i.e. does not contain any wildcard).

      From some reason, the copyFields registration procedure defines those 400 fields as "DynamicCopyField " and then store them in the “dynamicCopyFields” array,
      This step makes getCopyFieldsList() very expensive (in CPU terms) without any justification: All of those 400 copyFields are not glob and therefore do not need any complex pattern matching to the input field. They all can be store at the "fixedCopyFields".
      Only copyFields with asterisks need this "special" treatment and they are (especially on our case) pretty rare.

      Therefore, we created a patch which fix this problem by changing the registerCopyField() procedure.
      Test which we conducted show that there is no change in the Indexing results. Moreover, the fix still successfully passes the class unit tests (i.e. IndexSchemaTest.java).

      1. SOLR-6666.patch
        11 kB
        Erick Erickson
      2. SOLR-6666.patch
        11 kB
        Erick Erickson
      3. SOLR-6666.patch
        4 kB
        Elran Dvir
      4. SOLR-6666.patch
        3 kB
        Liram Vardi

        Activity

        Hide
        Erick Erickson added a comment -

        Liram:

        Good stuff!
        Could you attach the patch to this JIRA? That'll make it easiest for someone to pick up.
        Here's a hint on how: http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

        Thanks!

        Show
        Erick Erickson added a comment - Liram: Good stuff! Could you attach the patch to this JIRA? That'll make it easiest for someone to pick up. Here's a hint on how: http://wiki.apache.org/solr/HowToContribute#Working_With_Patches Thanks!
        Hide
        Liram Vardi added a comment -

        Thanks
        The patch is attached.

        Show
        Liram Vardi added a comment - Thanks The patch is attached.
        Hide
        Erick Erickson added a comment -

        I'll take a look, but I won't have any serious tim until this weekend, so if anyone wants to pick this up feel free.

        Show
        Erick Erickson added a comment - I'll take a look, but I won't have any serious tim until this weekend, so if anyone wants to pick this up feel free.
        Hide
        Liram Vardi added a comment -

        Hi all,
        Did anyone have a chance to take a look on this?
        Thanks

        Show
        Liram Vardi added a comment - Hi all, Did anyone have a chance to take a look on this? Thanks
        Hide
        Erick Erickson added a comment -

        Liram:

        Sorry, it's been on my list for a bit but I haven't gotten to it yet. Over Thanksgiving, I promise.

        Erick

        Show
        Erick Erickson added a comment - Liram: Sorry, it's been on my list for a bit but I haven't gotten to it yet. Over Thanksgiving, I promise . Erick
        Hide
        Erick Erickson added a comment -

        And he breaks his promise again. Sorry, came down with something this weekend and got swamped. This week fer sure (he says again).

        Show
        Erick Erickson added a comment - And he breaks his promise again. Sorry, came down with something this weekend and got swamped. This week fer sure (he says again).
        Hide
        Erick Erickson added a comment -

        The problem here is that if you add this copyField directive to schema.xml
        <copyField source="fail_dynamic" dest="dynamic_*"/>
        the schema won't load with the patch. It fails with a message
        about the source field needing an asterisk if the destination
        has one. Other tests have this pattern and fail BTW, see:
        TestFieldCollectionResource, TestManagedSotpFilterFactory
        and TestManagedSynonymFileFactory.

        The fail_dynamic field "fulfills" this requirement since it is actually a
        match for "*_dynamic"

        So are you saying that if you have
        <field name="one"... />
        <field name="two".../>

        and a copyField of
        <copyField source="one" dest="two" /> that bogus logic happens because
        it matches a dynamic field?

        Or are your source fields "explicit", but only really instantiated by matching a
        dynamic field so there's no corresponding <field> definition?

        If it's the former, then it seems that doing a test way up top similar to this:
        if (destSchemaField != null && sourceSchemaField != null) { // Source and destination are explicit
        List<CopyField> copyFieldList = copyFieldsMap.get(source);
        if (copyFieldList == null)

        { copyFieldList = new ArrayList<>(); copyFieldsMap.put(source, copyFieldList); }

        copyFieldList.add(new CopyField(sourceSchemaField, destSchemaField, maxChars));
        incrementCopyFieldTargetCount(destSchemaField);
        return;
        }
        (and maybe taking this out from the end of the method?) would catch your case.

        It's certainly an open question whether this is the way it "should" be of course. I don't
        quite know if there are shortcuts we could take that would satisfy both situations, i.e.
        shortcut non-asterisk source fields in copyField directives that happen to be instantiations
        of dynamic fields while still respecting all the ways a field could get into the "explicit" field
        ("fail_dynamic" above).

        It's also possible that the test that blows up above is too restrictive, I'm not prepared
        to say one way or another. But I can't commit this without getting a resolution to that question.

        Under any circumstances, it seems that beefing up the IndexSchemaTest would be a good thing,
        on a quick look they aren't all that comprehensive.

        Show
        Erick Erickson added a comment - The problem here is that if you add this copyField directive to schema.xml <copyField source="fail_dynamic" dest="dynamic_*"/> the schema won't load with the patch. It fails with a message about the source field needing an asterisk if the destination has one. Other tests have this pattern and fail BTW, see: TestFieldCollectionResource, TestManagedSotpFilterFactory and TestManagedSynonymFileFactory. The fail_dynamic field "fulfills" this requirement since it is actually a match for "*_dynamic" So are you saying that if you have <field name="one"... /> <field name="two".../> and a copyField of <copyField source="one" dest="two" /> that bogus logic happens because it matches a dynamic field? Or are your source fields "explicit", but only really instantiated by matching a dynamic field so there's no corresponding <field> definition? If it's the former, then it seems that doing a test way up top similar to this: if (destSchemaField != null && sourceSchemaField != null) { // Source and destination are explicit List<CopyField> copyFieldList = copyFieldsMap.get(source); if (copyFieldList == null) { copyFieldList = new ArrayList<>(); copyFieldsMap.put(source, copyFieldList); } copyFieldList.add(new CopyField(sourceSchemaField, destSchemaField, maxChars)); incrementCopyFieldTargetCount(destSchemaField); return; } (and maybe taking this out from the end of the method?) would catch your case. It's certainly an open question whether this is the way it "should" be of course. I don't quite know if there are shortcuts we could take that would satisfy both situations, i.e. shortcut non-asterisk source fields in copyField directives that happen to be instantiations of dynamic fields while still respecting all the ways a field could get into the "explicit" field ("fail_dynamic" above). It's also possible that the test that blows up above is too restrictive, I'm not prepared to say one way or another. But I can't commit this without getting a resolution to that question. Under any circumstances, it seems that beefing up the IndexSchemaTest would be a good thing, on a quick look they aren't all that comprehensive.
        Hide
        Liram Vardi added a comment -

        Hi Erick,

        Thanks you for the comprehensive review for this post.

        Indeed, my patch causes a loading failure and also to TestFieldCollectionResource to fail, when we are using the “fail_dynamic” example.
        (TestManagedSotpFilterFactory and TestManagedSynonymFileFactory failed on my environment regardless the patch.)
        Although, based on the following wiki, I was sure that this example is invalid.
        https://cwiki.apache.org/confluence/display/solr/Copying+Fields
        However, based on your explanation I tried to find a combine solution which satisfies those two cases, as you said.

        The case that I am trying to solve is the case that source is not explicit (which means that does not have a field definition and it is only instantiated by matching a dynamic field – the second case that you described on your response).
        So, let make things a bit more ordered.
        Let us assume three possible types for each copyfield, source and destination:
        1) Explicit – the field has explicit “field” definition.
        2) Glob – The field contains an asterisks on its copyField definition and it matches to one (or more) of the fields definitions (dynamic or static).
        3) Reference – the copy field references to some dynamic field, but it is without any asterisks.

        Each copyfield’s source and destination belongs to one of those types.
        When Solr reads the schema, it divides the copy fields eventually to two groups: fixedCopyFields and to dynamicCopyFields.
        As I explained before, the “fixedCopyFields” is much less expensive than the “dynamicCopyFields”.
        Now, let define the following decision table:

        Case Source Destination Decision
        1 Explicit Explicit fixedCopyFields
        2 Explicit Glob Error!
        3 Explicit Reference dynamicCopyFields
        4 Glob Explicit dynamicCopyFields
        5 Glob Glob dynamicCopyFields
        6 Glob Reference dynamicCopyFields
        7 Reference Explicit fixedCopyFields
        8 Reference Glob dynamicCopyFields
        9 Reference Reference dynamicCopyFields

        As you can see, until today only for case “1” (source and destination are explicit), Solr put those copy fields on the “static” hash.
        On the next version of patch SOLR-6666, I did a refectory on the “if” statement which divides those copyfields.
        At the previous version of the patch, the code throw exception on case 8 (.i.e fail_dynamic example).
        Now after the refectory, case “8” is legal again and case “7”, which is the one that I am trying to solve, sends those copyfields to the “fixedCopyFields” map.

        Another open question is if cases 3 and 9 need also to stay as “DynamicCopyFields” or can we make the update more efficient by moving those also to the “static” map… But currently the patch does not change this.

        The second version of the patch is attached.

        Thanks!

        Show
        Liram Vardi added a comment - Hi Erick, Thanks you for the comprehensive review for this post. Indeed, my patch causes a loading failure and also to TestFieldCollectionResource to fail, when we are using the “fail_dynamic” example. (TestManagedSotpFilterFactory and TestManagedSynonymFileFactory failed on my environment regardless the patch.) Although, based on the following wiki, I was sure that this example is invalid. https://cwiki.apache.org/confluence/display/solr/Copying+Fields However, based on your explanation I tried to find a combine solution which satisfies those two cases, as you said. The case that I am trying to solve is the case that source is not explicit (which means that does not have a field definition and it is only instantiated by matching a dynamic field – the second case that you described on your response). So, let make things a bit more ordered. Let us assume three possible types for each copyfield, source and destination: 1) Explicit – the field has explicit “field” definition. 2) Glob – The field contains an asterisks on its copyField definition and it matches to one (or more) of the fields definitions (dynamic or static). 3) Reference – the copy field references to some dynamic field, but it is without any asterisks. Each copyfield’s source and destination belongs to one of those types. When Solr reads the schema, it divides the copy fields eventually to two groups: fixedCopyFields and to dynamicCopyFields. As I explained before, the “fixedCopyFields” is much less expensive than the “dynamicCopyFields”. Now, let define the following decision table: Case Source Destination Decision 1 Explicit Explicit fixedCopyFields 2 Explicit Glob Error! 3 Explicit Reference dynamicCopyFields 4 Glob Explicit dynamicCopyFields 5 Glob Glob dynamicCopyFields 6 Glob Reference dynamicCopyFields 7 Reference Explicit fixedCopyFields 8 Reference Glob dynamicCopyFields 9 Reference Reference dynamicCopyFields As you can see, until today only for case “1” (source and destination are explicit), Solr put those copy fields on the “static” hash. On the next version of patch SOLR-6666 , I did a refectory on the “if” statement which divides those copyfields. At the previous version of the patch, the code throw exception on case 8 (.i.e fail_dynamic example). Now after the refectory, case “8” is legal again and case “7”, which is the one that I am trying to solve, sends those copyfields to the “fixedCopyFields” map. Another open question is if cases 3 and 9 need also to stay as “DynamicCopyFields” or can we make the update more efficient by moving those also to the “static” map… But currently the patch does not change this. The second version of the patch is attached. Thanks!
        Hide
        Erick Erickson added a comment -

        Hmmm, I like the way you've broken things out, it makes the code easier to follow. It gave me a headache looking the original code before either patch.

        We have three other test failures, it's always best to run 'ant test' before putting up a patch. That said, I think the one I'm seeing (there are three, but they're all the same problem) is the following:

        Use account "steve_rowe" instead I'm particularly interested in what you think here.

        The trunk code returns this fragment in TestCopyFieldCollectionResource.java

        { "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i", "dest":"title"}

        ,

        whereas the patched code returns:

        { "source":"src_sub_no_ast_i", "dest":"title"}

        ,

        The schema.xml file (if I've got the right one) has this line:
        <copyField source="src_sub_no_ast_i" dest="title"/>

        Like I said, the original code hurt my head, I suspect it was just wrong. Steve, do you have any comments here? Or am I mis-interpreting things?

        The attached patch fixes these three problems, I'll run the whole test suite again too.

        Show
        Erick Erickson added a comment - Hmmm, I like the way you've broken things out, it makes the code easier to follow. It gave me a headache looking the original code before either patch. We have three other test failures, it's always best to run 'ant test' before putting up a patch. That said, I think the one I'm seeing (there are three, but they're all the same problem) is the following: Use account "steve_rowe" instead I'm particularly interested in what you think here. The trunk code returns this fragment in TestCopyFieldCollectionResource.java { "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i", "dest":"title"} , whereas the patched code returns: { "source":"src_sub_no_ast_i", "dest":"title"} , The schema.xml file (if I've got the right one) has this line: <copyField source="src_sub_no_ast_i" dest="title"/> Like I said, the original code hurt my head, I suspect it was just wrong. Steve, do you have any comments here? Or am I mis-interpreting things? The attached patch fixes these three problems, I'll run the whole test suite again too.
        Hide
        Steve Rowe added a comment -

        +1 for Erick's patch.

        Separately, I've long thought that iterating through each dynamic copy field looking for matches was non-optimal - I think this could be converted into one or more regular expressions, like maybe one for prefix globs and one for suffix globs.

        We have three other test failures, it's always best to run 'ant test' before putting up a patch. That said, I think the one I'm seeing (there are three, but they're all the same problem) is the following:

        Steve Rowe I'm particularly interested in what you think here.

        The trunk code returns this fragment in TestCopyFieldCollectionResource.java

        { "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i", "dest":"title"}

        ,
        whereas the patched code returns:

        { "source":"src_sub_no_ast_i", "dest":"title"}

        ,
        The schema.xml file (if I've got the right one) has this line:

        <copyField source="src_sub_no_ast_i" dest="title"/>

        Yeah the sourceDynamicBase is determined at schema parsing time and stored in a DynamicCopy instance. While the optimization included in the patch removes this capability – DynamicCopy is not used in copyFieldsMap – the idea of a single source "base" is wrongheaded for at least two reasons: a) copyField source globs can match multiple dynamicField-s (as well as non-dynamic fields for that matter); and b) it's not always possible to identify the source field(s) at schema parsing time - see e.g. SOLR-4650. Anyway, there is no real loss here in dropping support for sourceDynamicBase.

        Like I said, the original code hurt my head, I suspect it was just wrong. Steve, do you have any comments here? Or am I mis-interpreting things?

        The code isn't changed much, really? Can you say why you think it was (is?) wrong? The code is trying to handle lots of cases - see the table on SOLR-4567. That said, I think the code for handling dynamic copyField sources is more complicated than it needs to be: except for cases 14-16 in the above-linked table (copyField-s with a no-glob-dynamic-field-reference source and a dynamic field destination), it's not necessary to verify at schema parsing time that any (dynamic) field(s) are matched by copy field glob sources - see SOLR-4650 for a concrete example. I think sourceDynamicBase should go away entirely.

        Show
        Steve Rowe added a comment - +1 for Erick's patch. Separately, I've long thought that iterating through each dynamic copy field looking for matches was non-optimal - I think this could be converted into one or more regular expressions, like maybe one for prefix globs and one for suffix globs. We have three other test failures, it's always best to run 'ant test' before putting up a patch. That said, I think the one I'm seeing (there are three, but they're all the same problem) is the following: Steve Rowe I'm particularly interested in what you think here. The trunk code returns this fragment in TestCopyFieldCollectionResource.java { "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i", "dest":"title"} , whereas the patched code returns: { "source":"src_sub_no_ast_i", "dest":"title"} , The schema.xml file (if I've got the right one) has this line: <copyField source= "src_sub_no_ast_i" dest= "title" /> Yeah the sourceDynamicBase is determined at schema parsing time and stored in a DynamicCopy instance. While the optimization included in the patch removes this capability – DynamicCopy is not used in copyFieldsMap – the idea of a single source "base" is wrongheaded for at least two reasons: a) copyField source globs can match multiple dynamicField-s (as well as non-dynamic fields for that matter); and b) it's not always possible to identify the source field(s) at schema parsing time - see e.g. SOLR-4650 . Anyway, there is no real loss here in dropping support for sourceDynamicBase . Like I said, the original code hurt my head, I suspect it was just wrong. Steve, do you have any comments here? Or am I mis-interpreting things? The code isn't changed much, really? Can you say why you think it was (is?) wrong? The code is trying to handle lots of cases - see the table on SOLR-4567 . That said, I think the code for handling dynamic copyField sources is more complicated than it needs to be: except for cases 14-16 in the above-linked table (copyField-s with a no-glob-dynamic-field-reference source and a dynamic field destination), it's not necessary to verify at schema parsing time that any (dynamic) field(s) are matched by copy field glob sources - see SOLR-4650 for a concrete example. I think sourceDynamicBase should go away entirely.
        Hide
        Erick Erickson added a comment -

        bq: The code isn't changed much, really? Can you say why you think it was (is?) wrong?

        This was just referring to the test case I referenced where even though there was
        a copy that looked "explicit", although based on a dynamic field. But if I'm finally understanding,
        the problem the old code was fine, just in this case we can take a shortcut.

        My comment about the code hurting my head was mostly that I didn't have a framework
        for trying to figure out what it was doing, I really wish I'd known about the table at SOLR-4567.
        when I was looking at the original patch to help organize things.

        In case you can't tell, I'm doing some new year's cleanup of JIRAs I assigned to myself....

        Thanks! I'll commit this shortly

        Show
        Erick Erickson added a comment - bq: The code isn't changed much, really? Can you say why you think it was (is?) wrong? This was just referring to the test case I referenced where even though there was a copy that looked "explicit", although based on a dynamic field. But if I'm finally understanding, the problem the old code was fine, just in this case we can take a shortcut. My comment about the code hurting my head was mostly that I didn't have a framework for trying to figure out what it was doing, I really wish I'd known about the table at SOLR-4567 . when I was looking at the original patch to help organize things. In case you can't tell, I'm doing some new year's cleanup of JIRAs I assigned to myself.... Thanks! I'll commit this shortly
        Hide
        Erick Erickson added a comment -

        Same patch with CHANGES.txt added.

        Show
        Erick Erickson added a comment - Same patch with CHANGES.txt added.
        Hide
        ASF subversion and git services added a comment -

        Commit 1649657 from Erick Erickson in branch 'dev/trunk'
        [ https://svn.apache.org/r1649657 ]

        SOLR-6666: Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

        Show
        ASF subversion and git services added a comment - Commit 1649657 from Erick Erickson in branch 'dev/trunk' [ https://svn.apache.org/r1649657 ] SOLR-6666 : Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
        Hide
        ASF subversion and git services added a comment -

        Commit 1649681 from Erick Erickson in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1649681 ]

        SOLR-6666: Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

        Show
        ASF subversion and git services added a comment - Commit 1649681 from Erick Erickson in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1649681 ] SOLR-6666 : Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
        Hide
        Erick Erickson added a comment -

        Thanks Liram and a big thanks to Steve for reviewing.

        Show
        Erick Erickson added a comment - Thanks Liram and a big thanks to Steve for reviewing.
        Hide
        Liram Vardi added a comment -

        You are welcome!
        Thanks

        Show
        Liram Vardi added a comment - You are welcome! Thanks
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Erick Erickson
            Reporter:
            Liram Vardi
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development