Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Introducing HTMLStripCharFilter:

      • move html strip logic from HTMLStripReader to HTMLStripCharFilter
      • make HTMLStripReader deprecated
      • make HTMLStrip*TokenizerFactory deprecated
      1. SOLR-1343.patch
        158 kB
        Koji Sekiguchi

        Activity

        Koji Sekiguchi added a comment -

        I'll commit in a few days if there are no objections.

        Shalin Shekhar Mangar added a comment -

        Koji, what is the advantage of the HTMLStripCharFilter over HTMLStripReader?

        Koji Sekiguchi added a comment -

        Koji, what is the advantage of the HTMLStripCharFilter over HTMLStripReader?

        Good question, Shalin.
        Since LUCENE-1466 was committed, all tokenizers can read chars from a CharFilter rather than a Reader, so I'd like to replace Readers like this one with CharFilters. The obvious advantages are:

        1. We can use an arbitrary tokenizer, e.g. CJKTokenizer.
        2. We can use a chain of CharFilters. For example, we can strip HTML tags and then normalize chars before the tokenizer runs.
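
The chaining idea in the two points above can be illustrated with a small, self-contained sketch. It deliberately does not use the Lucene or Solr classes: `STRIP_HTML`, `FOLD_ACCENTS`, `applyChain`, and `whitespaceTokenize` are hypothetical stand-ins invented for this example, and the regex-based tag stripping is far cruder than what HTMLStripCharFilter does. A real CharFilter also corrects token offsets back to the original input, which this sketch ignores.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustration only: plain string transforms standing in for char filters.
final class CharFilterChainSketch {

    // Hypothetical stand-in for an HTML-stripping stage: each tag becomes a space.
    static final UnaryOperator<String> STRIP_HTML =
            s -> s.replaceAll("<[^>]*>", " ");

    // Hypothetical stand-in for a normalizing stage: decompose, then drop combining marks.
    static final UnaryOperator<String> FOLD_ACCENTS =
            s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");

    // Run the stages in order; the output of one stage feeds the next.
    static String applyChain(String input, List<UnaryOperator<String>> chain) {
        String text = input;
        for (UnaryOperator<String> stage : chain) {
            text = stage.apply(text);
        }
        return text;
    }

    // Any tokenizer can sit after the chain; this one simply splits on whitespace.
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String html = "<p>Caf\u00e9 <b>menu</b></p>";
        String filtered = applyChain(html, Arrays.asList(STRIP_HTML, FOLD_ACCENTS));
        System.out.println(whitespaceTokenize(filtered)); // prints [Cafe, menu]
    }
}
```

Because the stages are independent of the tokenizer, swapping `whitespaceTokenize` for a different tokenizer requires no change to the filtering stages, which is the point of the first advantage.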
        Koji Sekiguchi added a comment -

        Committed revision 802263.

        Jason Rutherglen added a comment -

        I'm seeing a bug related to this patch going in. It's been hard
        to track down and I'm dealing with a JVM bug at the same time,
        so I haven't had time to write a test case yet.

        In summary, I reverted to the previous classes and the indexing
        goes back to normal.

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Shem M added a comment -

        Is there a reason why the filter replaces inline tags like <b> or <i> with a space?
        I see from the code that it wasn't like this in the past:
        //break;//was
        //return whitespace from

        It makes life a lot harder when I have, for example, this text:
        Some t<b>ex</b>t here
        and I want to find "text"
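
The effect described in this comment can be reproduced with a minimal sketch. It is not the HTMLStripCharFilter code: `replaceTagsWithSpace`, `removeTags`, and `tokens` are hypothetical helpers that use a simple regex and whitespace splitting, purely to show how the two strategies change the resulting tokens.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: compares two ways of handling tags before tokenizing.
final class InlineTagSketch {

    // Strategy A: every tag is replaced by a single space.
    static String replaceTagsWithSpace(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Strategy B: every tag is removed without leaving anything behind.
    static String removeTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Whitespace tokenizer used for both strategies.
    static List<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String inline = "Some t<b>ex</b>t here";
        System.out.println(tokens(replaceTagsWithSpace(inline))); // prints [Some, t, ex, t, here]
        System.out.println(tokens(removeTags(inline)));           // prints [Some, text, here]

        String blocks = "<p>one</p><p>two</p>";
        System.out.println(tokens(replaceTagsWithSpace(blocks))); // prints [one, two]
        System.out.println(tokens(removeTags(blocks)));           // prints [onetwo]
    }
}
```

In this sketch, removing tags keeps "text" searchable but glues "one" and "two" together when block-level tags are adjacent, so neither strategy is right for every tag. Whether that trade-off is the reason for the filter's behavior is not stated in this issue.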


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi
          • Votes:
            0
            Watchers:
            0
