SOLR-4412: LanguageIdentifier lcmap for language field

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.4, 5.0
    • Component/s: contrib - LangId
    • Labels:
      None

      Description

      For some languages, a detector reports sub-language codes: LangDetect, for example, returns zh-tw or zh-cn for Chinese, whereas the Tika detector returns only zh. Today you can use lcmap to collapse such codes into one, e.g. langid.map.lcmap=zh-cn:zh zh-tw:zh, but that mapping only affects field-name mapping; the value written to langField is not changed.

      We need an option for langField as well.

      1. SOLR-4412.patch
        8 kB
        Jan Høydahl

        Activity

        Jan Høydahl added a comment -

        Proposal is a new option:

        langid.lcmap
        

        Same syntax as langid.map.lcmap. If set, will affect both langField and field-name mappings. So if you want these to be different, specify both. This way the API is backwards compatible.
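A minimal sketch of the proposed semantics, not the code from the patch. The class and method names (LcMapSketch, parseLcMap, langFieldValue, fieldNameCode) and the code zh_hant are hypothetical. It assumes the mapping is a whitespace-separated list of from:to pairs, as in the example in the description, and that when langid.map.lcmap is also given it is used on its own for field-name mapping instead of being merged with langid.lcmap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LcMapSketch {

    /** Parses a spec such as "zh-cn:zh zh-tw:zh" into a lookup table. */
    static Map<String, String> parseLcMap(String spec) {
        Map<String, String> map = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return map; // option not set: no mapping
        }
        for (String pair : spec.trim().split("\\s+")) {
            String[] kv = pair.split(":");
            if (kv.length == 2) {
                map.put(kv[0], kv[1]);
            } // malformed pairs are skipped in this sketch
        }
        return map;
    }

    /** Value written to langField: the detected code after langid.lcmap. */
    static String langFieldValue(String detected, Map<String, String> lcmap) {
        return lcmap.getOrDefault(detected, detected);
    }

    /**
     * Code used for field-name mapping. Assumption: langid.map.lcmap,
     * when set, is used instead of langid.lcmap; otherwise langid.lcmap applies.
     */
    static String fieldNameCode(String detected,
                                Map<String, String> lcmap,
                                Map<String, String> mapLcmap) {
        Map<String, String> effective = mapLcmap.isEmpty() ? lcmap : mapLcmap;
        return effective.getOrDefault(detected, detected);
    }

    public static void main(String[] args) {
        Map<String, String> lcmap = parseLcMap("zh-cn:zh zh-tw:zh");
        Map<String, String> unset = parseLcMap(null);

        // Only langid.lcmap set: both outputs collapse to the same code.
        System.out.println(langFieldValue("zh-tw", lcmap));
        System.out.println(fieldNameCode("zh-tw", lcmap, unset));

        // Both options set: the two outputs may differ.
        System.out.println(fieldNameCode("zh-tw", lcmap, parseLcMap("zh-tw:zh_hant")));
    }
}
```

Because the existing langid.map.lcmap option keeps its meaning when langid.lcmap is absent, configurations written before this change would behave as they did.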

        Jan Høydahl added a comment -

        First patch (git diff format)

        ASF subversion and git services added a comment -

        Commit 1498959 from janhoy@apache.org
        [ https://svn.apache.org/r1498959 ]

        SOLR-4412: LanguageIdentifier lcmap for language field

        ASF subversion and git services added a comment -

        Commit 1498961 from janhoy@apache.org
        [ https://svn.apache.org/r1498961 ]

        SOLR-4412: LanguageIdentifier lcmap for language field (merge from trunk)

        Jan Høydahl added a comment -

        Committed to trunk and 4.x

        Jan Høydahl added a comment -

        Updated https://wiki.apache.org/solr/LanguageDetection#langid.lcmap

        Not yet updated the new Confluence docs. TODO

        Jack Krupansky added a comment - - edited

        From the original generic description, I got the impression that this issue would cover BOTH language identifier processors, but the final patch covers only one of them - it doesn't add the feature uniformly to the Tika Language Identifier update processor.

        Was this intentional or simply an oversight?

        If intentional, what is the reasoning?

        And the wiki update does not mention that the new feature covers only one of the two implementations, even though the wiki in general covers both implementations.

        Jan Høydahl added a comment -

        Please explain why you believe it does not apply to Tika langid as well. The changes are only in common base class, not in the two specialized implementations.
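To illustrate the point about the base class, here is a simplified sketch with hypothetical stand-in classes (LangIdBase, TikaLikeDetector, LangDetectLikeDetector); these are not the real Solr processors, and the detectors are stubs that return fixed codes. The mapping step lives only in the base class, so every subclass picks it up without any change of its own.

```java
import java.util.HashMap;
import java.util.Map;

class BaseClassSketch {
    public static void main(String[] args) {
        Map<String, String> lcmap = new HashMap<>();
        lcmap.put("zh-cn", "zh");
        lcmap.put("zh-tw", "zh");

        // Both stand-in implementations go through the same base-class mapping.
        System.out.println(new TikaLikeDetector(lcmap).resolve("some text"));
        System.out.println(new LangDetectLikeDetector(lcmap).resolve("some text"));
    }
}

/** Hypothetical common base class: owns the language-code mapping. */
abstract class LangIdBase {
    private final Map<String, String> lcmap;

    LangIdBase(Map<String, String> lcmap) {
        this.lcmap = lcmap;
    }

    /** Detector-specific part, implemented by each subclass. */
    abstract String detect(String text);

    /** Shared part: applies the mapping to whatever the subclass detected. */
    final String resolve(String text) {
        String code = detect(text);
        return lcmap.getOrDefault(code, code);
    }
}

/** Stub standing in for a detector that reports only the base language. */
final class TikaLikeDetector extends LangIdBase {
    TikaLikeDetector(Map<String, String> lcmap) {
        super(lcmap);
    }

    @Override
    String detect(String text) {
        return "zh"; // fixed value for illustration only
    }
}

/** Stub standing in for a detector that reports sub-language codes. */
final class LangDetectLikeDetector extends LangIdBase {
    LangDetectLikeDetector(Map<String, String> lcmap) {
        super(lcmap);
    }

    @Override
    String detect(String text) {
        return "zh-cn"; // fixed value for illustration only
    }
}
```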

        Jack Krupansky added a comment -

        Thanks for the clarification - I mistook the very long base class name for one of the implementations. So, it looks fine.

        Hoss Man added a comment -

        "Not yet updated the new Confluence docs. TODO"

        I took a stab at this based on what I understood of the issue and your MoinMoin edits, please review...

        https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=32604265&selectedPageVersions=2&selectedPageVersions=1

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee: Jan Høydahl
          • Reporter: Jan Høydahl
          • Votes: 0
          • Watchers: 4