Lucene - Core
LUCENE-2023

Improve performance of SmartChineseAnalyzer

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I've noticed SmartChineseAnalyzer is a bit slow compared to, say, CJKAnalyzer on Chinese text.

      This patch improves the internal hhmm implementation.
      Time to index my Chinese corpus is 75% of the previous time.

      1. LUCENE-2023.patch
        82 kB
        Robert Muir
      2. LUCENE-2023.patch
        76 kB
        Robert Muir
      3. LUCENE-2023.patch
        38 kB
        Robert Muir
      4. LUCENE-2023.patch
        23 kB
        Robert Muir
      5. LUCENE-2023.patch
        18 kB
        Robert Muir
      6. LUCENE-2023.patch
        14 kB
        Robert Muir
      7. LUCENE-2023.patch
        10 kB
        Robert Muir
      8. LUCENE-2023.patch
        10 kB
        Robert Muir

        Activity

        DM Smith added a comment -

        If we have a 2.9.2 release, can this be there too?

        Robert Muir added a comment -

        DM, personally I don't see the slowness as a bug... though I am very aware that word segmentation speed is really important to Chinese users.

        But if all the other devs jumped up and said it's really critical to be in a bugfix release, I'd backport it to Java 1.4.

        In my opinion it's just an improvement.

        DM Smith added a comment -

        I fully understand that at some point you "just say no."

        I don't think it warrants a new bug fix release, but if there is one then it would be a "nice to have" iff the backport is a trivial effort.

        That said, a 25% improvement is substantial.

        Mark Miller added a comment -

        How long are you stuck on 1.4, DM? You guys are gonna upgrade those Macs someday, aren't you? Or is it a matter of supporting that platform for your users?

        IMO, it wouldn't be horrible to stretch easily backported optimizations considering this is the last 1.4 release - as long as someone is willing to do it - and we have another release.

        Robert Muir added a comment -

        Well, I'm more than willing to do the backport, if it will be used.

        For now, maybe someone can take a look and double-check the patch for trunk; smartcn is always dangerous territory.

        (For the record, the big bottleneck was the HashMap usage in these graphs; it's rather silly since it's indexed by int, you know the number of values up front, etc.)
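        The HashMap-to-array change described above can be sketched like this (an illustrative class, not the actual patch): when the keys are ints and the maximum key is known at construction time, a plain array avoids boxing and hash lookups entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the actual patch): when a table is keyed by int
// and the key range is known at construction time, a plain array replaces a
// HashMap<Integer, List<T>>, avoiding boxing and hash lookups entirely.
class IntKeyedTable<T> {
    private final List<T>[] table;

    @SuppressWarnings("unchecked")
    IntKeyedTable(int size) {
        // the number of values is known up front, so no resizing is needed
        table = new List[size];
    }

    void add(int key, T value) {
        if (table[key] == null) {
            table[key] = new ArrayList<T>();
        }
        table[key].add(value);
    }

    List<T> get(int key) {
        return table[key]; // null means "no entries for this key"
    }
}
```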

        DM Smith added a comment -

        Thanks, Mark.

        I'm stuck with 1.4 for the near future. I wish people would upgrade their MacOS! I'll have a release soon and then another in 4-6 months. The release after that will be Java 5 and will probably follow shortly after.

        I'm willing to do the work, especially if it is trivial and it is something I use. Perhaps it would be best to create a separate issue targeting 2.9.2 (or whatever the next release might be), link it to the original(s), mark it as an "Improvement" and attach appropriate patch(es). Maybe one issue for all improvements, or one per "parent" issue? Then if there is a bugfix that necessitates such a release, the patches would be applied for that release.

        DM Smith added a comment -

        Robert,
        You have in BigramDictionary:

          public boolean isToExist(int to) {
            return to < tokenPairListTable.length && tokenPairListTable[to] != null;
          }
        

        And you call it in:

          public void addSegTokenPair(SegTokenPair tokenPair) {
            final int to = tokenPair.to;
            if (!isToExist(to)) {
              ArrayList<SegTokenPair> newlist = new ArrayList<SegTokenPair>();
              newlist.add(tokenPair);
              tokenPairListTable[to] = newlist;
              tableSize++;
            } else {
              List<SegTokenPair> tokenPairList = tokenPairListTable[to];
              tokenPairList.add(tokenPair);
            }
          }
        

        The check in addSegTokenPair assumes that isToExist(to) returns false only when "to" is in bounds, because "tokenPairListTable[to]" will throw an ArrayIndexOutOfBoundsException otherwise. Is it an invariant that tokenPair.to will always be in bounds?

        The array in SegGraph does the same thing.

        The former implementation did not have this issue.

        Other than that, it looks good.

        Robert Muir added a comment -

        Hi DM,

        I think the bounds checks are actually redundant: in both situations, the bounds are calculated up front in the constructor.

        Is it an invariant that tokenPair.to will always be in bounds?

        Yes, in this case.

        The reason I did this for isToExist, etc. is because those methods are public... but this stuff is package-private anyway, so maybe I should delete the bounds checks altogether?

        Robert Muir added a comment -

        Changed these redundant bounds checks to assertions, as DM observed.

        Robert Muir added a comment -

        Updated patch; shaves off another 5%.

        My avg indexing throughput:

        • cjkanalyzer: 3447k/s
        • orig smartcn: 1357k/s
        • patched smartcn: 1965k/s

        There are serious memory consumption problems in the n^2 part of the algorithm (BiSegGraph); I will see about improving it more.

        Robert Muir added a comment -

        Created a generic graph that is reusable, used by both SegGraph and BiSegGraph.
        This cleans up the code a lot and prevents billions of ArrayLists from being created in n^2 style.
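        The reusable-graph idea can be sketched as follows (the names are illustrative, not the actual SegGraph/BiSegGraph API): grow the adjacency lists once, then clear them between inputs instead of recreating every ArrayList per document.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reusable-graph idea (names are illustrative, not the actual
// SegGraph/BiSegGraph API): grow the adjacency lists once, then reset()
// between inputs instead of recreating every ArrayList per document.
class ReusableGraph<T> {
    private final List<List<T>> adjacency = new ArrayList<>();

    List<T> edges(int node) {
        while (adjacency.size() <= node) {
            adjacency.add(new ArrayList<T>()); // allocated once, then reused
        }
        return adjacency.get(node);
    }

    void addEdge(int from, T edge) {
        edges(from).add(edge);
    }

    // clears contents but keeps the lists, so the next input reuses them
    void reset() {
        for (List<T> list : adjacency) {
            list.clear();
        }
    }
}
```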

        Robert Muir added a comment -

        In BiSegGraph, a char[] was being created in n^2 fashion for each edge (SegTokenPair), even though it's only used for weight calculation.
        Instead, add methods to BigramDictionary to get the frequency of a bigram, getFrequency(char left[], char right[]), without this silliness.

        New figures are:

        • cjkanalyzer: 3447k/s
        • orig smartcn: 1357k/s
        • patched smartcn: 2125k/s
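        The getFrequency(char[], char[]) idea can be sketched like this (the backing map and rolling hash here are illustrative only; the real BigramDictionary has its own hash table): hashing the two token arrays in sequence means no concatenated char[] has to be allocated per edge.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (the real BigramDictionary has its own hash
// table): hashing the two token arrays in sequence means no concatenated
// char[] has to be allocated per edge just to look up a bigram's frequency.
// Hash collisions across the token boundary are ignored for brevity.
class BigramFreq {
    private final Map<Long, Integer> freqs = new HashMap<>();

    private static long hash(char[] left, char[] right) {
        long h = 1;
        for (char c : left) {
            h = h * 31 + c; // rolling hash over the first token
        }
        for (char c : right) {
            h = h * 31 + c; // continued over the second token, no copy made
        }
        return h;
    }

    void put(char[] left, char[] right, int freq) {
        freqs.put(hash(left, right), freq);
    }

    int getFrequency(char[] left, char[] right) {
        Integer f = freqs.get(hash(left, right));
        return f == null ? 0 : f;
    }
}
```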
        Robert Muir added a comment -

        Question: the smartcn internals are package-private (and marked experimental to boot).
        I'd like to keep this clean, and there's some unused stuff that could now be deprecated or removed.

        Should this be 3.0 or 3.1? Should I deprecate or clean house (since it's experimental and package-private)?

        Thanks!

        DM Smith added a comment -

        Internals are internals. Anyone digging into smartcn's internals should be hog-tied and whipped. They can change at any moment and without warning. (IMHO)

        Robert Muir added a comment -

        Anyone digging into smartcn's internals should be hog-tied and whipped

        oh no, I think I am in trouble then!

        Robert Muir added a comment -

        I think this BiSegGraph might be a lot better as a 2D array of weights instead of a complex object graph.
        I'll try it and see what happens.
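        The 2D-array idea might look roughly like this (the dimensions and names are illustrative): edge weights between token positions live in one dense double[][] instead of an object graph of pair objects.

```java
// Sketch of the 2D-array idea (dimensions and names are illustrative):
// edge weights between token positions live in one dense double[][] instead
// of an object graph of pair objects, so there is nothing to chase or box.
class WeightGraph {
    private final double[][] weight; // weight[from][to]; 0.0 means no edge

    WeightGraph(int numTokens) {
        weight = new double[numTokens][numTokens];
    }

    void setWeight(int from, int to, double w) {
        weight[from][to] = w;
    }

    double getWeight(int from, int to) {
        return weight[from][to];
    }
}
```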

        Robert Muir added a comment -

        Latest iteration; gets rid of SegTokenPair/PathNode.
        BiSegGraph still isn't as simple or efficient as it should be,
        but my indexing speed is up to 2400k/s.

        Robert Muir added a comment -

        Refactored a lot of this analyzer:

        • move hhmm-specific stuff (like WordType, CharType, Utility) into the hhmm package
        • move/remove tokenfilter-specific stuff (like lowercasing, full-width conversion) out of the hhmm package (uses LowerCaseFilter, adds FullWidthFilter)
        • remove the stopwords list; it was full of various punctuation, all of which got converted by "SegTokenFilter" into a comma anyway. Instead, just don't emit punctuation.

        To me, this refactoring makes the analyzer easier to debug. It also happens to improve performance (up to 2500k/s now).

        Robert Muir added a comment -

        Fix WordTokenFilter to use Version: if it's not going to output delimiters through StopFilter and then remove them, it needs to adjust posInc (depending on version).
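        The posInc rule can be sketched outside Lucene's TokenStream API as follows (the Token class and the filter are simplified stand-ins): when a delimiter token is dropped rather than emitted, its position increment is folded into the next emitted token so downstream consumers still see the positional gap.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch, outside Lucene's TokenStream API (Token and the filter
// are stand-ins): when a delimiter token is dropped instead of emitted, its
// position increment is folded into the next emitted token, so consumers
// still see the positional gap the delimiter left behind.
class Token {
    final String text;
    final int posInc; // position increment relative to the previous token

    Token(String text, int posInc) {
        this.text = text;
        this.posInc = posInc;
    }
}

class DelimiterDroppingFilter {
    static List<Token> filter(List<Token> in) {
        List<Token> out = new ArrayList<>();
        int pending = 0; // increments accumulated from dropped delimiters
        for (Token t : in) {
            if (",".equals(t.text)) {
                pending += t.posInc; // remember the hole the delimiter leaves
            } else {
                out.add(new Token(t.text, t.posInc + pending));
                pending = 0;
            }
        }
        return out;
    }
}
```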

        Robert Muir added a comment -

        If no one objects, I'd rather work this into 3.1, along with some refactoring of this code.

        DM Smith, if this causes you a problem, I would rather just upload a Java 1.4 patch for you to improve your performance than slip it into 3.0. There was already a bug in 2.9 in this analyzer, so I don't want to introduce a new one without having a lot of time to play with this code.

        Michael McCandless added a comment -

        I think given that this package is brand new, and in contrib (no back compat promise unless explicitly stated), and Robert has some solid improvements, it should be fine to make non-back-compatible changes here.

        Also I think we should only do this for 3.1? This is a big change... and it breaks back compat. It shouldn't be backported to past releases?

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes: 0
          • Watchers: 0
