Solr / SOLR-1931

Schema Browser does not scale with large indexes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: web gui
    • Labels:
      None

      Description

      The Schema Browser JSP by default causes the Luke handler to "scan the world". In large indexes this makes the UI useless.

      On an index with 64M documents and 8 GB of disk space, the Schema Browser took 6 minutes to open and hogged all disk I/O, making Solr useless.

      Attachments

      1. SOLR-1931-3x.patch
        24 kB
        Erick Erickson
      2. SOLR-1931-3x.patch
        20 kB
        Erick Erickson
      3. SOLR-1931-3x.patch
        15 kB
        Erick Erickson
      4. SOLR-1931-trunk.patch
        33 kB
        Erick Erickson
      5. SOLR-1931-trunk.patch
        32 kB
        Erick Erickson
      6. SOLR-1931-trunk.patch
        32 kB
        Erick Erickson
      7. SOLR-1931-trunk.patch
        16 kB
        Erick Erickson

          Activity

          Yonik Seeley added a comment -

          This is correctly marked as "Improvement", but I don't see any proposal.

          Some off the top of my head:

          • don't start scanning by default... have a "START" button or something
          • have a progress bar
          • make it cancelable

          But since it looks like the schema browser goes through a request handler (luke), things like cancellation look hard. Any ideas?

          Lance Norskog added a comment -

          The UI should change to a drill-down UI that starts at a fast summary of the index schema and fields, then drills down to deep scans of different fields.

          LukeRequestHandler shall default to "no fields" instead of "all fields".
          LukeRequestHandler shall have a new call that returns a list of fields and nothing else.

          The Ajax UI would fetch the list of fields and then fetch individual fields as it does now.
          The Ajax UI would include a button that delves deeper into fields, doing the various scans it does now.

          Hoss Man added a comment -

          bq: LukeRequestHandler shall default to "no fields" instead of "all fields".

          I don't really see a need to change the default on LukeRequestHandler – if you are hitting it directly you should really know what you're doing. Improving how the schema browser uses Luke seems orthogonal to that (but I could be convinced otherwise).

          bq: LukeRequestHandler shall have a new call that returns a list of fields and nothing else.

          Isn't this essentially what "/admin/luke?numTerms=0" does? All of the other data returned when numTerms=0 is essentially free.

          If you have a huge number of dynamicFields, then it still might be slow – but if you are in that situation the schema browser isn't ever going to be of much use to you unless we add some more comprehensive UI elements for "searching" the list of fields. (The existing "/admin/luke?show=schema" call can be used as the source of this data ... the schema browser already uses that to build up the list of fieldtypes and dynamic fields.)

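          A minimal sketch of the cheap field-list call described above, assuming a default single-core Solr at localhost:8983 (the URL, core layout, and wt parameter are illustrative):

          import java.io.BufferedReader;
          import java.io.InputStreamReader;
          import java.net.HttpURLConnection;
          import java.net.URL;

          public class LukeFieldList {
            public static void main(String[] args) throws Exception {
              // numTerms=0 skips the expensive top-terms scan, so even on a very
              // large index this returns the field list almost immediately.
              URL url = new URL("http://localhost:8983/solr/admin/luke?numTerms=0&wt=json");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              BufferedReader in = new BufferedReader(
                  new InputStreamReader(conn.getInputStream(), "UTF-8"));
              try {
                String line;
                while ((line = in.readLine()) != null) {
                  System.out.println(line);
                }
              } finally {
                in.close();
              }
            }
          }
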
          Lance Norskog added a comment -

          This is my test index:
          65M documents.
          Two text fields, 'text0' and 'text1', with 10M and 14M unique terms respectively.
          Several more string fields with 1 to 10 unique terms: 'protocol' has 4 unique values.
          No dynamic fields.

          • numTerms=0
            • Returns immediately with the field list.
          • numTerms=10
            • 130-160 seconds
          • numTerms=10&fl=protocol
            • 45 seconds
          • numTerms=10&fl=text0
            • 60 seconds
          • numTerms=10&fl=text1
            • 60 seconds
          • show=schema
            • 18 seconds after above warmup queries

          These numbers are consistent, run multiple times against the same index load, in various orders.

          Given the above numbers, the commands should be:

          • to get a list of fixed fields
            • numTerms=0
          • to find dynamic fields
            • show=schema
          • to find unique terms for a field
            • allow user to choose between
              • numTerms=X&fl=field
              • facet call (see the example after this comment)

          It needs a new show=schema option that does not scan for dynamic fields. That would be called on page open, then the individual fields can have drill-downs and there can be a 'scan for dynamic fields' button that does the current show=schema scan.

          Does this make sense?

          Other possible features:

          • info on segments
            • separate the above results by segment?
          • shortest, longest, mean, standard deviation of text field lengths
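          As a concrete example of the "facet call" alternative listed above (field name and limit are illustrative), the top 10 terms for 'text0' could come from either of:

          /admin/luke?numTerms=10&fl=text0
          /select?q=*:*&rows=0&facet=true&facet.field=text0&facet.limit=10

          The facet request counts documents per term and benefits from Solr's caches, so on a warmed index it is often the cheaper of the two.
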
          Hoss Man added a comment -

          Anything with numTerms != 0 is likely to be slow, and the more fields you ask for the slower it will be.

          This one though seems like we should fix it...

          bq: show=schema – 18 seconds after above warmup queries

          ...that's really weird. I haven't looked at the code, but skimming the show=schema output again I realized that for some reason that mode computes the total number of terms in the index for you (which must be a full walk of the TermEnum) ... so we should definitely fix that up ... counting the terms is expensive enough that it should definitely require its own param.

          bq: It needs a new show=schema option that does not scan for dynamic fields

          ...I'm not following you there. Listing the dynamic field declarations should be dirt cheap. (Listing the "real" fields that result from those dynamic fields could be costly if you have millions of them, but that doesn't seem to be what you're referring to ... like I said though: if you are in that boat, the schema browser is never going to be useful for you.)

          Lance Norskog added a comment -

          bq: ...that's really weird. I haven't looked at the code, but skimming the show=schema output again I realized that for some reason that mode computes the total number of terms in the index for you (which must be a full walk of the TermEnum) ... so we should definitely fix that up ... counting the terms is expensive enough that it should definitely require its own param.

          Cool! That's one down.

          bq: ...I'm not following you there. Listing the dynamic field declarations should be dirt cheap. (Listing the "real" fields that result from those dynamic fields could be costly if you have millions of them, but that doesn't seem to be what you're referring to ... like I said though: if you are in that boat, the schema browser is never going to be useful for you.)

          This was me being confused. Never mind.

          The drill-down interface should be able to ask for the number of terms for a field. The above enhancement to show=schema (with a per-field option) supplies everything needed for a full drill-down index scanner that scales up to zillions of docs.

          Lance Norskog added a comment -

          I can't find where the index is scanned. If any of you can point me in the right direction, please do.

          Hoss Man added a comment -

          bq: I can't find where the index is scanned

          Take a look at getIndexInfo and its "countTerms" boolean.

          By the looks of how it's used, it seems like the assumption was that if the numTerms param != 0, then you're clearly okay with it taking some time, so it might as well compute the total number of terms in the index for you – but the default value of numTerms is "20", so in the "/admin/luke?show=schema" case, it computes the total number of terms even though it isn't finding top terms for any field.

          If we just add an explicit "countTerms" boolean param, that should make it possible to speed things up a lot.

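          For context, the expensive work being gated here is, roughly, a full walk of the TermEnum (a sketch against the 3.x Lucene API, not the actual getIndexInfo code):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.index.TermEnum;

          // What countTerms=true would pay for: enumerate every term in the
          // index just to count them – I/O proportional to the unique term count.
          static int countAllTerms(IndexReader reader) throws IOException {
            int numTerms = 0;
            TermEnum terms = reader.terms();
            try {
              while (terms.next()) {
                numTerms++;
              }
            } finally {
              terms.close();
            }
            return numTerms;
          }
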
          Erick Erickson added a comment -

          On the trunk (4.x) version, a suggestion from Muir is quoted below. I haven't looked at this yet, but being able to get some approximation back from Luke quickly would be a big help. Maybe we can make this happen on trunk?

          The use-case I'm interested in is the one in which we're really only looking for outrageous numbers of unique terms. Having unique terms per segment would go a long way towards that use-case.

          *******
          Is it really necessary to see the 'top level' number of distinct terms summed across all segments? Maybe it's good enough to list the information on a per-segment basis. Then it would always be instant-fast: you would just use the FieldsEnum API to list all the fields, and for each field call .terms() and then Terms.getUniqueTermCount().

          Note: getUniqueTermCount won't work (returns -1) for any 3.x segments that haven't yet been upgraded to the 4.0 format. The old 3.x format only allows you to get the uniqueTermCount across all fields in the segment (Fields.getUniqueTermCount), because fields are not clearly separated.

          Otis Gospodnetic added a comment -

          Is that actually true? What if one is looking at a completely optimized index?

          Erick Erickson added a comment -

          bq: What if one is looking at a completely optimized index?

          I wondered about that myself, and I suspect this would work just as you indicate: optimizing the index would get you the exact unique counts for all the fields. Which conveniently leaves it up to the user to decide just how necessary getting exact information is....

          Here's a code snippet from Muir (thanks!!!) that we should preserve, 4.x only.
          new ReaderUtil.Gather(reader) {
            @Override
            protected void add(int base, IndexReader r) throws IOException {
              System.out.println("segment: " + r.toString());
              FieldsEnum e = r.fields().iterator();
              String field;
              while ((field = e.next()) != null) {
                System.out.println("\t" + field + ": " + e.terms().getUniqueTermCount());
              }
            }
          }.run();

          segment: _34(4.0):C1802000/89498
          body: 4886489
          date: 136729
          datenum: 631685
          group100: 100
          group100K: 100000
          group10K: 10000
          group1M: 999999
          groupblock: 180200
          groupend: 1
          id: 1802000
          timesecnum: 73524
          title: 139038
          titleTokenized: 73144
          segment: _67(4.0):C1802000/89561
          body: 4985143

          Erick Erickson added a comment -

          Well, there are a couple of issues here. I've attached patches for trunk and 3x for consideration.

          I fixed a structural flaw that traversed all the terms in all the fields twice, once to get the total number of terms across all the fields and once to get the individual counts.

          But that's not where the bulk of the time gets spent. It turns out that getting the count of documents in which each term appears is the culprit. These two lines are executed for each field:

          Query q = new TermRangeQuery(fieldName, null, null, false, false);
          TopDocs top = searcher.search(q, 1);

          and top.totalHits is reported. I have an index with 99M documents, mostly integer data, that takes 360 seconds to return data when the above is executed and 150 seconds without. Both versions traverse all the terms once, so these times would be greater without the patch due to the second traversal.

          So the attached patches default to NOT doing the above and there's a new parameter reportDocCount that can be set to true to collect that information. What do people think? And is there a better way to get the count of documents in which the term appears? And do any alternate methods respect deleted docs like this one does?

          I tried spinning through using TermDocs (3.6) but soon realized that the people who wrote TermRangeQuery probably got there first.

          So I guess my real question is whether people object to the change in behavior, that users must explicitly request doc counts. Which also means that the admin/schema browser doesn't report this by default and I haven't made it optional from that interface. I'm not inclined to since that interface is going away, but if people feel strongly I might be persuaded. That info is available by admin/luke?fl=myfield&reportDocCount=true in a less painful fashion for a particular field anyway.

          Along the way I alphabetized the fields without my other kludge of putting comparators in other classes. I'll kill that JIRA if this one goes forward.

          Note that this still doesn't scale all that well, on my test index it's still a 5 minute wait. But then I guess that this kind of data gathering will take time by its nature.

          If nobody objects, I'll commit this early next week after I've had a chance to put it down for a while and look at it with fresh eyes and do some more testing. I think there are some inefficiencies in the single pass that I can wring out (about 30 seconds is spent just gathering the data in the single term-enumeration loop).

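          For the record, the TermDocs route mentioned above would look roughly like this (a hypothetical 3.x sketch, not from the patch); it visits the same postings the TermRangeQuery does, which is why it is no cheaper:

          // Count docs containing at least one term in fieldName by walking the
          // field's terms and postings. TermDocs skips deleted docs, so this
          // respects deletions just as the range-query approach does.
          // (Imports: org.apache.lucene.index.*, org.apache.lucene.util.OpenBitSet)
          OpenBitSet docs = new OpenBitSet(reader.maxDoc());
          TermEnum te = reader.terms(new Term(fieldName, ""));
          TermDocs td = reader.termDocs();
          try {
            do {
              Term t = te.term();
              if (t == null || !t.field().equals(fieldName)) break;
              td.seek(te);
              while (td.next()) {
                docs.set(td.doc());
              }
            } while (te.next());
          } finally {
            td.close();
            te.close();
          }
          long docCount = docs.cardinality();
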
          Yonik Seeley added a comment -

          bq: And is there a better way to get the count of documents in which the term appears?

          In which any term for the field appears? In trunk, there is Terms.getDocCount()

          Erick Erickson added a comment -

          Thanks Robert and Yonik for pointing me at the new 4x capabilities, they make a huge difference. But you knew that.

          The killer for 3.x was getting the document counts via a range query; I don't think there's a good way to get the counts and not pay the penalty, so there's a new parameter recordDocCounts.

          Here's my latest and close-to-last cut at this, both for 3x and 4x.

          The data set is 89M documents; times are in seconds.

          • 3.5 (unpatched)
            • 637 – getting doc counts
          • 3.x with this patch
            • 552 – getting doc counts
            • 53 – stats without doc counts, but histogram etc. (no option to do this before)
          • 4.x, original
            • 450 or so, as I remember – getting doc counts, histograms, etc.
          • 4.x with patch (histograms still work)
            • 158 – getting the doc counts the old way (range queries). I mean, you guys said ranges were going to be faster.
            • 39 – getting the doc counts with terms.getDocCount() (including histograms)

          Here's my proposal, I'll probably commit this next weekend at the latest unless there are objections:

          1> I'll let these stew for a couple of days, and look them over again. Anyone who wants to look too, please feel free.

          2> Live with getting the doc counts in 4x including the deleted docs and remove the reportDocCounts parameter (it'll live in 3.6 and other 3x versions). I think the performance is fine without carrying that kind of kludgy option forward. I could be persuaded otherwise, but an optimized index will take care of the counting of deleted documents problem if anyone really cares.

          Robert Muir added a comment -

          Why is it still 39 seconds? Shouldn't tools like this just use statistics, and not enumerate terms or anything else by default, so that they return instantly?

          It's 4.0, why not just break backwards compatibility and make it fast?

          Instead of doing enumerations and stuff, you could display all of the Terms-level statistics per segment per field:

          • uniqueTermCount (# of terms)
          • sumDocFreq (# of postings/term-doc mappings)
          • sumTotalTermFreq (# of positions/tokens)
          • docCount (# of documents with at least one posting for the field)

          This would all be basically instantaneous and would give a more thorough picture of the performance characteristics of the index (e.g. how many positions).
          You could compute derived stats like average field length, too.

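          A sketch of that per-segment dump, reusing the ReaderUtil.Gather pattern from the snippet above with the four statistics listed (trunk API of the time; treat the exact signatures as approximate):

          new ReaderUtil.Gather(reader) {
            @Override
            protected void add(int base, IndexReader r) throws IOException {
              System.out.println("segment: " + r.toString());
              FieldsEnum e = r.fields().iterator();
              String field;
              while ((field = e.next()) != null) {
                Terms t = e.terms();
                // All four are stored statistics – no term enumeration needed.
                System.out.println("\t" + field
                    + " uniqueTerms=" + t.getUniqueTermCount()
                    + " sumDocFreq=" + t.getSumDocFreq()
                    + " sumTotalTermFreq=" + t.getSumTotalTermFreq()
                    + " docCount=" + t.getDocCount());
                // Derived stats are free too, e.g. average field length:
                // (double) t.getSumTotalTermFreq() / t.getDocCount()
              }
            }
          }.run();
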
          Erick Erickson added a comment -

          bq: why is it still 39 seconds?

          Histograms and collecting the top N terms by frequency. Still gotta spin through all the terms to collect either statistic. Take that bit out and the response is less than 0.5 seconds.

          39 seconds isn't bad at all for an index this size, and one can still specify particular fields of interest if the index is more complex than this one. I can probably be argued out of their importance although it'll take a little doing. This is really for, from my perspective, troubleshooting at a high level and that information is valuable.

          Besides, I told you I had to look it over after a while. I just saw something horribly trivial that cuts it down to 15 seconds. There's a loop where, after the histo stuff is collected, we test to see if the current term frequency is above the threshold of the already-collected items.... and changing it from

          if (freq < tiq.minfreq) continue;
          to, essentially,
          if (freq <= tiq.minfreq) continue;

          means that the pathological case of inserting every last <uniqueKey> in the tracking priority queue doesn't happen. Siiigggh.

          Oh, and the patch I'll attach in a couple of minutes actually compiles. I half cleaned up the stupid recordDocCount parameter by removing the definition, but not getting it from the parameters. Fella has to go to sleep more often.

          Also, this index is a little peculiar in that many of the fields have only a very few values so YMMV.

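          To make the off-by-one concrete, a minimal illustrative version of the bounded top-N queue (class and method names are made up, not from the patch):

          import java.util.PriorityQueue;

          // Using <= rather than < against the current minimum means a term that
          // merely ties the minimum is skipped instead of inserted-then-evicted –
          // the pathological uniqueKey case, where every term has freq == 1.
          class TopFreqs {
            private final int capacity;
            private final PriorityQueue<Long> queue = new PriorityQueue<Long>();

            TopFreqs(int capacity) { this.capacity = capacity; }

            void offer(long freq) {
              if (queue.size() >= capacity) {
                if (freq <= queue.peek()) return; // a tie with the minimum can't improve the top N
                queue.poll();                     // evict the current minimum
              }
              queue.add(freq);
            }
          }
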
          Erick Erickson added a comment -

          A trunk patch that, you know, actually compiles, mea culpa.

          Also reduces the 4x time down to 15 seconds after fixing a stupid oversight. Really gotta let this stew for a while and look at it with less-tired eyes.

          Erick Erickson added a comment -

          Final patches attached. All honor unto whoever wrote the tests for the binary writers; I discovered that a TreeMap is unacceptable. In other words, all the tests pass now.

          Unless there are objections, I intend to commit these tomorrow or Friday.

          Erick Erickson added a comment -

          Fixed in:
          trunk: 1227924
          3.x: 1227926


            People

            • Assignee:
              Erick Erickson
            • Reporter:
              Lance Norskog
            • Votes:
              0
            • Watchers:
              0
