SOLR-814

Add new Japanese Hiragana Filter and Factory

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Japanese Hiragana and Katakana characters can be easily converted to one another. This filter normalizes all Hiragana characters to their Katakana counterparts, so that text can be indexed and searched using either script.
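
      A minimal sketch of the normalization described above, written in Python for illustration only. It is not the attached patch, and the function name `hiragana_to_katakana` is hypothetical. It relies on the fact that the Unicode Hiragana letters U+3041 to U+3096 sit exactly 0x60 code points below their Katakana counterparts; all other characters are passed through unchanged.

      ```python
      # Unicode Hiragana letters run from U+3041 (ぁ) to U+3096 (ゖ).
      HIRAGANA_START = 0x3041
      HIRAGANA_END = 0x3096
      # The matching Katakana letter is always 0x60 code points higher.
      KATAKANA_OFFSET = 0x60


      def hiragana_to_katakana(text: str) -> str:
          """Replace every Hiragana letter with its Katakana counterpart."""
          out = []
          for ch in text:
              cp = ord(ch)
              if HIRAGANA_START <= cp <= HIRAGANA_END:
                  out.append(chr(cp + KATAKANA_OFFSET))
              else:
                  # Katakana, Kanji, ASCII, etc. are left untouched.
                  out.append(ch)
          return "".join(out)


      if __name__ == "__main__":
          print(hiragana_to_katakana("となりのトトロ"))
      ```

      Applying the same function at index time and at query time would make a title indexed in one script match a query typed in the other.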

        Activity

        Koji Sekiguchi added a comment -

        Using CharFilter can solve this problem in a flexible manner.
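
        One way to read this suggestion is to express the Hiragana-to-Katakana conversion as character mapping rules rather than as a dedicated token filter. The sketch below generates such rules in Python. It assumes the `"source" => "target"` rule syntax used by Solr's mapping char filter files; the function name `hiragana_mapping_rules` is hypothetical, and how the resulting file is wired into a schema is not covered here.

        ```python
        def hiragana_mapping_rules() -> list[str]:
            """Build one mapping rule per Hiragana letter (U+3041 to U+3096)."""
            rules = []
            for cp in range(0x3041, 0x3097):
                hiragana = chr(cp)
                katakana = chr(cp + 0x60)  # Katakana sits 0x60 code points higher
                rules.append(f'"{hiragana}" => "{katakana}"')
            return rules


        if __name__ == "__main__":
            # Print the rules so they can be redirected into a mapping file.
            print("\n".join(hiragana_mapping_rules()))
        ```

        Keeping the conversion in a mapping file leaves the choice of which characters to fold up to the schema developer.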

        Todd Feak added a comment -

        Yes, they are used differently.

        However, a word written in Hiragana is the same word when written in Katakana. Same meaning. Furthermore, it's not always cut and dried which to use. For example, a movie title may be written in Hiragana or Katakana, depending on the director's preference. The user (searcher) may not remember the director's preference, and so may search using the other script. Without this normalization they would get a search miss.

        I don't doubt your experience at Ultraseek, but this feature was explicitly asked for by Japanese (native speaking) engineers at Sony. I just (literally) double checked with a couple of onsite native speaking Japanese engineers and both agree that this is useful, at least for our searches.

        I would say that it should be up to the schema developer as to whether this functionality is useful or not for their situation. Either way, I offer it up to the community for their decision.

        Walter Underwood added a comment -

        This seems like a bad idea. Hiragana and katakana are used quite differently in Japanese. They are not interchangeable.

        I was the engineer for Japanese support in Ultraseek for years and even visited our distributor there, but no one ever asked for this feature. They asked for a lot of things, but never this.

        It is very useful, maybe essential, to normalize full-width and half-width versions of hiragana, katakana, and ASCII.
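
        As an illustration of the width normalization mentioned above, the following Python sketch uses Unicode NFKC normalization from the standard library. This is one possible approach, not something proposed in this issue, and the function name `normalize_width` is hypothetical. Note that NFKC also folds other compatibility characters (ligatures, superscripts, and so on), which may be more than a given index wants.

        ```python
        import unicodedata


        def normalize_width(text: str) -> str:
            """Fold half-width Katakana and full-width ASCII to their standard forms."""
            # NFKC applies compatibility decomposition followed by composition,
            # which maps width variants onto the regular characters.
            return unicodedata.normalize("NFKC", text)


        if __name__ == "__main__":
            print(normalize_width("ｶﾀｶﾅ ＳＯＬＲ"))
        ```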

        Todd Feak added a comment -

        Attached a patch containing the Filter and Factory, with unit tests for both.


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Todd Feak
          • Votes:
            1
            Watchers:
            1
