Lucene - Core / LUCENE-6837

Add N-best output capability to JapaneseTokenizer

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.3
    • Fix Version/s: 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Japanese morphological analyzers often generate mis-segmented tokens. N-best output reduces the impact of mis-segmentation on search results. N-best output is more meaningful than character N-grams, and it also increases hit counts.

      If you use N-best output, you can get decompounded tokens (ex: "シニアソフトウェアエンジニア" =>

      {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}

      ) and overlapping tokens (ex: "数学部長谷川" =>

      {"数学", "部", "部長", "長谷川", "谷川"}

      ), depending on the dictionary and N-best parameter settings.
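The overlap in the second example can be made concrete by locating each n-best token's character offsets in the surface string. A small self-contained sketch (plain String.indexOf over the example from the description, not the tokenizer itself) prints each token's [start, end) span, showing that "部"/"部長" and "長谷川"/"谷川" occupy intersecting spans:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NBestOverlap {
    public static void main(String[] args) {
        String surface = "数学部長谷川";
        String[] nBest = {"数学", "部", "部長", "長谷川", "谷川"};

        // Record each token's [start, end) character offsets in the surface string.
        Map<String, int[]> spans = new LinkedHashMap<>();
        for (String token : nBest) {
            int start = surface.indexOf(token);
            spans.put(token, new int[] {start, start + token.length()});
        }

        // "部" [2,3) overlaps "部長" [2,4); "長谷川" [3,6) overlaps "谷川" [4,6).
        for (Map.Entry<String, int[]> e : spans.entrySet()) {
            int[] s = e.getValue();
            System.out.println(e.getKey() + " -> [" + s[0] + ", " + s[1] + ")");
        }
    }
}
```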

      Attachments

      1. LUCENE-6837.patch
        51 kB
        Christian Moen
      2. LUCENE-6837.patch
        51 kB
        KONNO, Hiroharu
      3. LUCENE-6837.patch
        51 kB
        Christian Moen
      4. LUCENE-6837.patch
        42 kB
        Christian Moen
      5. LUCENE-6837.patch
        32 kB
        KONNO, Hiroharu
      6. LUCENE-6837 for 5.4.zip
        7.84 MB
        Ippei UKAI

        Activity

        hkonno KONNO, Hiroharu added a comment - LUCENE-6837.patch
        koji Koji Sekiguchi added a comment -

        We have our own morphological analyzer with n-best output.

        If nobody takes this, I'll assign it to myself.
        cm Christian Moen added a comment -

        Thanks. I've had a very quick look at the code and have some comments and questions. I'm happy to take care of this, Koji.
        mikemccand Michael McCandless added a comment -

        Wow, adding n-best to the best-first Viterbi search is not easy!
        cm Christian Moen added a comment -

        Thanks a lot for this, Konno-san. Very nice work! I like the idea of calculating the n-best cost using examples.

        Since search mode and extended mode solve a similar problem, I'm wondering if it makes sense to introduce n-best as a separate mode in itself. In your experience developing the feature, do you think it makes sense to use it with search and extended mode?

        I think I'm in favour of supporting it for all the modes, even though it perhaps makes the most sense for normal mode. The reason for this is to make sure that the entire API for JapaneseTokenizer is functional for all the tokenizer modes.

        I'll add a few tests and I'd like to commit this soon.
        hkonno KONNO, Hiroharu added a comment -

        Thank you for your kind evaluation.

        Because the difference between N-best output and search-mode output is quite big, I agree with your opinion to support N-best for all modes.
        cm Christian Moen added a comment -

        I've attached a new patch with some minor changes:

        • Made the System.out.printf calls subject to VERBOSE being true
        • Introduced RuntimeException to deal with the initialization error cases
        • Renamed the new parameters to nBestCost and nBestExamples
        • Added additional javadoc here and there to document the new functionality

        I'm planning on running some stability tests with the new tokenizer parameters next.
        cm Christian Moen added a comment -

        Tokenizing Japanese Wikipedia seems fine with nBestCost set, but it seems like random-blasting doesn't pass.

        Konno-san, may I trouble you to look into why testRandomHugeStrings fails with the latest patch?

        The test basically does random-blasting with nBestCost set to 2000. I think it's a good idea that we fix this before we commit. I believe it's easily reproducible, but I used

        ant test -Dtestcase=TestJapaneseTokenizer -Dtests.method=testRandomHugeStrings -Dtests.seed=99EB179B92E66345 -Dtests.slow=true -Dtests.locale=sr_CS -Dtests.timezone=PNT -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

        in my environment.
        hkonno KONNO, Hiroharu added a comment -

        Hi Christian,

        I found my mistake, and I've updated the patch file (only +1 line).
        I thought that the result of Set<Integer>.toArray() was sorted, but it is not.
        I added an explicit sort and it seems fine.
        Please try it.
        Thanks.
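The pitfall Konno-san describes is easy to reproduce: a HashSet makes no ordering guarantee, so positions collected in a Set must be sorted explicitly before any code relies on ascending order. A minimal self-contained illustration of the one-line fix (the variable names are hypothetical, not the actual patch code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SortedPositions {
    public static void main(String[] args) {
        // Positions collected during processing, in arrival order.
        Set<Integer> positions = new HashSet<>();
        positions.add(300);
        positions.add(7);
        positions.add(120);

        // toArray() reflects HashSet iteration order, which is unspecified.
        Integer[] pos = positions.toArray(new Integer[0]);

        // The fix: sort explicitly before relying on ascending order.
        Arrays.sort(pos);
        System.out.println(Arrays.toString(pos)); // prints [7, 120, 300]
    }
}
```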
        cm Christian Moen added a comment -

        Thanks a lot, Konno-san. Things look good. My apologies that I couldn't look into this earlier.

        I've attached a new patch where I've included your fix and also renamed some methods. I think it's getting ready...
        jira-bot ASF subversion and git services added a comment -

        Commit 1717713 from Christian Moen in branch 'dev/trunk'
        [ https://svn.apache.org/r1717713 ]

        Add n-best output to JapaneseTokenizer (LUCENE-6837)
        mikemccand Michael McCandless added a comment -

        Christian Moen are you planning to backport this for 5.5?
        ippei Ippei UKAI added a comment - edited

        I wanted to give this feature a try, and would like to share what I did so you can try it too.

        Attached is a slightly modified version of JapaneseTokenizer from r1717713, compiled for use with Lucene/Solr 5.4. The modified classes are in a separate package so they do not conflict with the existing ones.

        solrconfig.xml:

        <lib dir="${solr.solr.home}/lib/LUCENE-6837" />

        (if you place the files under $SOLR_HOME/lib/LUCENE-6837)

        schema.xml:

                <tokenizer class="org.apache.lucene.analysis.ja6837.JapaneseTokenizerFactory"
                   mode="NORMAL"
                   discardPunctuation="true"
                   nBestExamples="/シニアソフトウェアエンジニア-ソフトウェア/数学部長谷川-長谷川/"
                />
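Judging from the configuration above, nBestExamples appears to be a slash-delimited list of surface-token pairs, each joined by a hyphen ("/surface-expectedToken/.../"), from which the tokenizer derives the cost needed for the expected token to appear in the n-best output. A small self-contained sketch of parsing that apparent format (the pair-splitting logic is my reading of the example string, not the shipped parser):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NBestExamplesFormat {
    // Parse "/surface-token/surface-token/" into surface -> expected-token pairs.
    static Map<String, String> parse(String examples) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String entry : examples.split("/")) {
            if (entry.isEmpty()) continue;     // skip the leading/trailing slashes
            int dash = entry.indexOf('-');     // ASCII '-', not the katakana ー
            pairs.put(entry.substring(0, dash), entry.substring(dash + 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String examples = "/シニアソフトウェアエンジニア-ソフトウェア/数学部長谷川-長谷川/";
        // Each surface form maps to the token expected in its n-best output.
        parse(examples).forEach((surface, token) ->
            System.out.println(surface + " should yield " + token));
    }
}
```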
        cm Christian Moen added a comment -

        Hello Mike,

        Yes, I'd like to backport this to 5.5.
        janhoy Jan Høydahl added a comment -

        This could be resolved with fix version 6.0, right?

          People

          • Assignee:
            cm Christian Moen
            Reporter:
            hkonno KONNO, Hiroharu
          • Votes: 3
          • Watchers: 14
