Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: None
    • Labels:
      None

      Description

      The Token class should be more friendly to classes not in it's package:
      1) add setTermText()
      2) remove final from class and toString()
      3) add clone()

      Support for (1):
      TokenFilters in the same package as Token are able to do things like
      "t.termText = t.termText.toLowerCase();" which is more efficient, but more importantly less error prone. Without the ability to change only the term text, a new Token must be created, and one must remember to set all the properties correctly. This exact issue caused this bug:
      http://issues.apache.org/jira/browse/LUCENE-437

      Support for (2):
      Removing final allows one to subclass Token. I didn't see any performance impact after removing final.
      I can go into more detail on why I want to subclass Token if anyone is interested.

      Support for (3):

      • support for a synonym TokenFilter, where one needs to make two tokens from one (same args that support (1), and esp important if instance is a subclass of Token).
      1. SpeedTest.java
        0.6 kB
        Yonik Seeley
      2. TokenSpeed.java
        3 kB
        Yonik Seeley
      3. yonik_Token.txt
        2 kB
        Yonik Seeley

        Activity

        Hide
        Erik Hatcher added a comment -

        Yes, please elaborate on why you need to subclass Token.

        Show
        Erik Hatcher added a comment - Yes, please elaborate on why you need to subclass Token.
        Hide
        Yonik Seeley added a comment -

        Mostly to convey information across TokenFilters, and the single type string isn't sufficient.
        For exampe, I'd like to have an int or long that can be used as 32 or 64 independent flags.

        In general, having attributes you can dynamically attach to tokens allows to you decompose token filters to more basic functions and thus gives greater power to filter chains.

        Some use cases I can think of:

        • conditionals... mark tokens in one filter and conditionally act on them in another.
        • decouple the marking of tokens from the transformation of tokens... one could have a
          TokenMatcherFilter that would tag certain tokens that matched a regex, for example.
        • protected tokens: mark certain words as "do not change", "do not stem", "do not lowercase" for instance.
        • mark tokens that are split from a larger token (for example when a camelCase filter splits "fooBar" into "foo Bar") so they may be treated differently by other filters
        • performance (hey it comes for free). You can do things like StandardTokenFilter, which checks the type of the token and doesn't have to re-parse every single token.

        I've already had to implement TokenFilter functionality (protected tokens, token splitting and combining) where I've had to stuff more functionallity than I'd like into a single filter because of then inability of one filter to provide more info to another.

        So I think there's a strong case for being able to dynamically add attributes (set bit flags) on a token. I planned on subclassing Token to achieve that. But because I don't know what other people may need/want in the future, making it so one can provide extensions to Token via inheritance seems like a good thing.

        Show
        Yonik Seeley added a comment - Mostly to convey information across TokenFilters, and the single type string isn't sufficient. For exampe, I'd like to have an int or long that can be used as 32 or 64 independent flags. In general, having attributes you can dynamically attach to tokens allows to you decompose token filters to more basic functions and thus gives greater power to filter chains. Some use cases I can think of: conditionals... mark tokens in one filter and conditionally act on them in another. decouple the marking of tokens from the transformation of tokens... one could have a TokenMatcherFilter that would tag certain tokens that matched a regex, for example. protected tokens: mark certain words as "do not change", "do not stem", "do not lowercase" for instance. mark tokens that are split from a larger token (for example when a camelCase filter splits "fooBar" into "foo Bar") so they may be treated differently by other filters performance (hey it comes for free). You can do things like StandardTokenFilter, which checks the type of the token and doesn't have to re-parse every single token. I've already had to implement TokenFilter functionality (protected tokens, token splitting and combining) where I've had to stuff more functionallity than I'd like into a single filter because of then inability of one filter to provide more info to another. So I think there's a strong case for being able to dynamically add attributes (set bit flags) on a token. I planned on subclassing Token to achieve that. But because I don't know what other people may need/want in the future, making it so one can provide extensions to Token via inheritance seems like a good thing.
        Hide
        Mark Harwood added a comment -

        >>Mostly to convey information across TokenFilters

        Could you not achieve this communication using ThreadLocals?

        Do we know what the performance cost is of making Token non-final?

        Show
        Mark Harwood added a comment - >>Mostly to convey information across TokenFilters Could you not achieve this communication using ThreadLocals? Do we know what the performance cost is of making Token non-final?
        Hide
        Yonik Seeley added a comment -

        > Could you not achieve this communication using ThreadLocals?
        Yes, but it would make me queasy
        Performance would also be less than desirable because you would have to stick every token in a HashMap in that thread local and then retrieve them again in the next TokenFilter.

        > Do we know what the performance cost is of making Token non-final?
        That is the key question. There is no reason not to make it non-final if there isn't a performance impact.

        In my general experience with recent hotspot JVMs, final on classes almost never make a difference. "final" applied to variables, however, can still make a big difference. I think the JVMs can tell if there are any loaded subclasses, and if not, treat it as final.

        In the context of indexing, I couldn't detect any difference in speed. I'll try and isolate it more.

        Show
        Yonik Seeley added a comment - > Could you not achieve this communication using ThreadLocals? Yes, but it would make me queasy Performance would also be less than desirable because you would have to stick every token in a HashMap in that thread local and then retrieve them again in the next TokenFilter. > Do we know what the performance cost is of making Token non-final? That is the key question. There is no reason not to make it non-final if there isn't a performance impact. In my general experience with recent hotspot JVMs, final on classes almost never make a difference. "final" applied to variables, however, can still make a big difference. I think the JVMs can tell if there are any loaded subclasses, and if not, treat it as final. In the context of indexing, I couldn't detect any difference in speed. I'll try and isolate it more.
        Hide
        Mark Harwood added a comment -

        Yes it is slower/uglier. I was hoping you could spare the CPU cycles and solve your specific communication problem this ThreadLocal way rather than inflicting extra CPU cycles on everyone else
        Would be interesting to know exactly what removing final is going to cost.

        Show
        Mark Harwood added a comment - Yes it is slower/uglier. I was hoping you could spare the CPU cycles and solve your specific communication problem this ThreadLocal way rather than inflicting extra CPU cycles on everyone else Would be interesting to know exactly what removing final is going to cost.
        Hide
        Yonik Seeley added a comment -

        As expected, I haven't been able to find a performance difference after applying this patch with 1.4.2 or 1.5 JVMs.

        java5 -server 10000000 TokCreate
        time=10636
        time=10636
        time=10575
        time=10705
        time=10706

        java5 -server 10000000 TokCreate // Token patch applied (non-final)
        time=10595
        time=10516
        time=10575
        time=10515
        time=10525

        java5 -server 2000000 TokChain
        time=9484
        time=9473
        time=9514
        time=9494
        time=9502

        java5 -server 2000000 TokChain // Token patch applied (non-final)
        time=9494
        time=9434
        time=9444
        time=9443
        time=9444

        java4 -server 10000000 TokCreate
        time=9294
        time=9323
        time=9274
        time=9344
        time=9384

        java4 -server 10000000 TokCreate // Token patch applied (non-final)
        time=9254
        time=9233
        time=9283
        time=9283
        time=9224

        java4 -server 2000000 TokChain
        time=7150
        time=7171
        time=7171
        time=7120
        time=7160

        java4 -server 2000000 TokChain // Token patch applied (non-final)
        time=7101
        time=7150
        time=7140
        time=7160
        time=7110

        java5 -client 10000000 TokCreate
        time=23384
        time=23253
        time=23534

        java5 -client 10000000 TokCreate // Token patch applied (non-final)
        time=23373
        time=23694
        time=23384

        Show
        Yonik Seeley added a comment - As expected, I haven't been able to find a performance difference after applying this patch with 1.4.2 or 1.5 JVMs. java5 -server 10000000 TokCreate time=10636 time=10636 time=10575 time=10705 time=10706 java5 -server 10000000 TokCreate // Token patch applied (non-final) time=10595 time=10516 time=10575 time=10515 time=10525 java5 -server 2000000 TokChain time=9484 time=9473 time=9514 time=9494 time=9502 java5 -server 2000000 TokChain // Token patch applied (non-final) time=9494 time=9434 time=9444 time=9443 time=9444 java4 -server 10000000 TokCreate time=9294 time=9323 time=9274 time=9344 time=9384 java4 -server 10000000 TokCreate // Token patch applied (non-final) time=9254 time=9233 time=9283 time=9283 time=9224 java4 -server 2000000 TokChain time=7150 time=7171 time=7171 time=7120 time=7160 java4 -server 2000000 TokChain // Token patch applied (non-final) time=7101 time=7150 time=7140 time=7160 time=7110 java5 -client 10000000 TokCreate time=23384 time=23253 time=23534 java5 -client 10000000 TokCreate // Token patch applied (non-final) time=23373 time=23694 time=23384
        Hide
        Yonik Seeley added a comment -

        token performance tests

        Show
        Yonik Seeley added a comment - token performance tests
        Hide
        Doug Cutting added a comment -

        +1 This sounds like a good change.

        Show
        Doug Cutting added a comment - +1 This sounds like a good change.
        Hide
        Yonik Seeley added a comment -

        committed.

        Show
        Yonik Seeley added a comment - committed.
        Hide
        Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.

        Show
        Michael McCandless added a comment - Closing all issues that were resolved for 2.1.

          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            3 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development