Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today we use the following procedure:

      • track HashSet<Long> uniqueValues, until it exceeds 256 unique values.
      • convert to array, sort and assign ordinals to each one
      • create encoder map (HashMap<Long,Integer>) to encode each value.

      This results in each value being hashed twice (once for deduplication, once for encoding). But the vast majority of the time people will just be using single-byte norms, and a simple array is enough for that range.
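      The current procedure described above can be sketched roughly as follows. This is an illustration of the steps listed (class and method names are hypothetical, not the actual Lucene internals):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;

// Illustrative sketch of the ordinal-assignment procedure described above.
class NormOrdinals {
    static HashMap<Long, Integer> buildEncoder(long[] values) {
        // 1. Track unique values (each value is hashed once here).
        HashSet<Long> uniqueValues = new HashSet<>();
        for (long v : values) {
            uniqueValues.add(v);
        }
        // 2. Convert to array, sort, and assign ordinals.
        long[] sorted = new long[uniqueValues.size()];
        int i = 0;
        for (long v : uniqueValues) {
            sorted[i++] = v;
        }
        Arrays.sort(sorted);
        // 3. Build the encoder map (each value is hashed again on lookup).
        HashMap<Long, Integer> encoder = new HashMap<>();
        for (int ord = 0; ord < sorted.length; ord++) {
            encoder.put(sorted[ord], ord);
        }
        return encoder;
    }

    public static void main(String[] args) {
        long[] norms = {5, 3, 5, 9, 3};
        HashMap<Long, Integer> enc = buildEncoder(norms);
        System.out.println(enc.get(3L) + " " + enc.get(5L) + " " + enc.get(9L)); // → 0 1 2
    }
}
```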

        Activity

        Robert Muir added a comment -

        Attached is a patch that speeds it up, but I'm not happy with the complexity.

        I benchmarked by indexing geonames with every field as an indexed TextField with norms (160 segments), then timed merging all of these:

        SM 0 [Mon Jun 30 14:19:47 EDT 2014; Lucene Merge Thread #1]: 1533 msec to merge norms [1340190 docs]
        SM 0 [Mon Jun 30 14:19:59 EDT 2014; Lucene Merge Thread #1]: 1603 msec to merge norms [1509620 docs]
        SM 0 [Mon Jun 30 14:20:11 EDT 2014; Lucene Merge Thread #0]: 2432 msec to merge norms [1380799 docs]
        SM 0 [Mon Jun 30 14:20:13 EDT 2014; Lucene Merge Thread #1]: 3043 msec to merge norms [1601868 docs]
        SM 0 [Mon Jun 30 14:20:25 EDT 2014; Lucene Merge Thread #0]: 1785 msec to merge norms [1819675 docs]
        SM 0 [Mon Jun 30 14:21:19 EDT 2014; Lucene Merge Thread #0]: 8900 msec to merge norms [8330469 docs]
        
        SM 0 [Mon Jun 30 14:22:15 EDT 2014; Lucene Merge Thread #1]: 1119 msec to merge norms [1340190 docs]
        SM 0 [Mon Jun 30 14:22:26 EDT 2014; Lucene Merge Thread #1]: 1214 msec to merge norms [1509620 docs]
        SM 0 [Mon Jun 30 14:22:37 EDT 2014; Lucene Merge Thread #0]: 1110 msec to merge norms [1380799 docs]
        SM 0 [Mon Jun 30 14:22:38 EDT 2014; Lucene Merge Thread #1]: 1284 msec to merge norms [1601868 docs]
        SM 0 [Mon Jun 30 14:22:49 EDT 2014; Lucene Merge Thread #0]: 1335 msec to merge norms [1819675 docs]
        SM 0 [Mon Jun 30 14:23:41 EDT 2014; Lucene Merge Thread #0]: 6834 msec to merge norms [8330469 docs]
        

        Comparing the other values (e.g. time to merge postings/stored fields) between the two runs, there wasn't much noise, so I think removing all the hashing helps.
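        For the common single-byte case mentioned in the description, the hashing can be avoided entirely with a plain 256-entry table. A minimal sketch of the idea (not the committed patch; assumes non-negative single-byte values for simplicity):

```java
// Sketch of an array-based fast path for single-byte norms: deduplication
// and ordinal assignment without any hashing. Illustrative only.
class ByteNormTable {
    // Returns an ordinal for each byte value that occurs in norms.
    static int[] buildOrdinals(long[] norms) {
        // Mark which of the 256 possible byte values occur.
        boolean[] seen = new boolean[256];
        for (long v : norms) {
            seen[(int) v & 0xFF] = true; // assumes non-negative values
        }
        // Assign ordinals in sorted order with a single linear scan.
        int[] ordinals = new int[256];
        int ord = 0;
        for (int b = 0; b < 256; b++) {
            if (seen[b]) {
                ordinals[b] = ord++;
            }
        }
        return ordinals;
    }

    public static void main(String[] args) {
        int[] ords = buildOrdinals(new long[]{5, 3, 5, 9, 3});
        System.out.println(ords[3] + " " + ords[5] + " " + ords[9]); // → 0 1 2
    }
}
```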

        Adrien Grand added a comment - edited

        The patch looks good to me. I think the complexity is OK; I was just a bit confused about why the size was stored as a short when looking at NormsMap out of context. Maybe we could just have a comment about this limitation?

        Robert Muir added a comment -

        I'll try to add an assert as well.
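        The kind of comment and assert being discussed might look something like this. This is a hypothetical sketch; NormsMap's actual fields and layout in the patch may differ:

```java
// Hypothetical sketch of the commented limitation and assert discussed above.
class NormsMap {
    // Size is stored as a short: this map is only ever used while there are
    // at most 256 unique values, so a short is always large enough.
    private short size;

    void incrementSize() {
        assert size < 256 : "NormsMap is only used for up to 256 unique values";
        size++;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        NormsMap map = new NormsMap();
        for (int i = 0; i < 10; i++) {
            map.incrementSize();
        }
        System.out.println(map.size()); // → 10
    }
}
```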

        ASF subversion and git services added a comment -

        Commit 1607074 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1607074 ]

        LUCENE-5797: Optimize norms merging

        ASF subversion and git services added a comment -

        Commit 1607080 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1607080 ]

        LUCENE-5797: Optimize norms merging


          People

          • Assignee:
            Unassigned
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            3
