Solr / SOLR-4412

LanguageIdentifier lcmap for language field

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.4, Trunk
    • Component/s: contrib - LangId
    • Labels: None

      Description

      For some languages, the detector will detect sub-languages, such as LangDetect detecting zh-tw or zh-cn for Chinese. Tika detector only detects zh. Today you can use lcmap to map these two into one code, e.g. langid.map.lcmap=zh-cn:zh zh-tw:zh. But the langField output is not changed.

      We need an option for langField as well.

      1. SOLR-4412.patch
        8 kB
        Jan Høydahl

        Activity

        Jan Høydahl created issue -
        Jan Høydahl made changes -
        Description: inline code markup changed from {{{...}}} to {{...}}; wording unchanged.
        Jan Høydahl added a comment -

        The proposal is a new option:

        langid.lcmap

        It uses the same syntax as langid.map.lcmap. If set, it affects both the langField value and the field-name mappings, so if you want the two to differ, specify both options. This keeps the API backwards compatible.
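
The proposed behaviour can be sketched in a few lines of Java. This is only an illustration of the semantics described above, not the code from SOLR-4412.patch: the class name LcmapSketch and its method names are hypothetical, and the fallback rule (field-name mapping uses langid.map.lcmap when given, otherwise langid.lcmap) is an assumption based on the comment's wording.

```java
import java.util.HashMap;
import java.util.Map;

public class LcmapSketch {

    /** Parses a space-separated list of from:to pairs, e.g. "zh-cn:zh zh-tw:zh". */
    public static Map<String, String> parse(String spec) {
        Map<String, String> map = new HashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return map; // option not set: no mappings
        }
        for (String pair : spec.trim().split("\\s+")) {
            String[] kv = pair.split(":");
            if (kv.length == 2) {
                map.put(kv[0], kv[1]);
            }
        }
        return map;
    }

    /** Value written to langField: the detected code mapped through langid.lcmap. */
    public static String langFieldValue(String detected, String lcmap) {
        return parse(lcmap).getOrDefault(detected, detected);
    }

    /**
     * Code used for field-name mapping. Assumption: langid.map.lcmap wins
     * when it is specified, otherwise langid.lcmap applies here as well.
     */
    public static String fieldNameCode(String detected, String lcmap, String mapLcmap) {
        String effective = (mapLcmap != null) ? mapLcmap : lcmap;
        return parse(effective).getOrDefault(detected, detected);
    }

    public static void main(String[] args) {
        String lcmap = "zh-cn:zh zh-tw:zh";
        // Only langid.lcmap set: both outputs use the same mapping.
        System.out.println(langFieldValue("zh-tw", lcmap));
        System.out.println(fieldNameCode("zh-tw", lcmap, null));
        // Both options set: field-name mapping may differ from langField.
        System.out.println(fieldNameCode("zh-tw", lcmap, "zh-tw:cjk zh-cn:cjk"));
    }
}
```

With only langid.lcmap set, both the langField value and the field-name code for zh-tw become zh. Setting langid.map.lcmap as well lets the field-name mapping diverge, which is why existing configurations that use only langid.map.lcmap would keep their current behaviour under this reading.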

        Robert Muir made changes -
        Fix Version/s 4.3 [ 12324128 ]
        Fix Version/s 5.0 [ 12321664 ]
        Fix Version/s 4.2 [ 12323893 ]
        Jan Høydahl added a comment -

        First patch (git diff format)

        Jan Høydahl made changes -
        Attachment SOLR-4412.patch [ 12573301 ]
        Jan Høydahl made changes -
        Assignee Jan Høydahl [ janhoy ]
        Uwe Schindler made changes -
        Fix Version/s 4.4 [ 12324324 ]
        Fix Version/s 4.3 [ 12324128 ]
        Jan Høydahl made changes -
        Comment [ Commit 1498948 from janhoy@apache.org
        [ https://svn.apache.org/r1498948 ]

        SOLR-4412: Added comments about variant to schema.xml ]
        Jan Høydahl made changes -
        Comment [ Commit 1498951 from janhoy@apache.org
        [ https://svn.apache.org/r1498951 ]

        SOLR-4412: Added comments about variant to schema.xml (merge from trunk) ]
        Jan Høydahl made changes -
        Fix Version/s 5.0 [ 12321664 ]
        ASF subversion and git services added a comment -

        Commit 1498959 from janhoy@apache.org
        [ https://svn.apache.org/r1498959 ]

        SOLR-4412: LanguageIdentifier lcmap for language field

        ASF subversion and git services added a comment -

        Commit 1498961 from janhoy@apache.org
        [ https://svn.apache.org/r1498961 ]

        SOLR-4412: LanguageIdentifier lcmap for language field (merge from trunk)

        Jan Høydahl added a comment -

        Committed to trunk and 4.x

        Jan Høydahl made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Jan Høydahl added a comment -

        Updated https://wiki.apache.org/solr/LanguageDetection#langid.lcmap

        Not yet updated the new Confluence docs. TODO

        Jack Krupansky added a comment (edited) -

        From the original generic description, I got the impression that this issue would cover BOTH language identifier processors, but the final patch covers only one of them - it doesn't add the feature uniformly to the Tika Language Identifier update processor.

        Was this intentional or simply an oversight?

        If intentional, what is the reasoning?

        And the wiki update does not mention that the new feature covers only one of the two implementations, even though the wiki in general covers both implementations.

        Jan Høydahl added a comment -

        Please explain why you believe it does not apply to Tika langid as well. The changes are only in the common base class, not in the two specialized implementations.

        Jack Krupansky added a comment -

        Thanks for the clarification - I mistook the very long base class name for one of the implementations. So, it looks fine.

        Jan Høydahl made changes -
        Issue Type Bug [ 1 ] Improvement [ 4 ]
        Hoss Man added a comment -

        "Not yet updated the new Confluence docs. TODO"

        I took a stab at this based on what I understood of the issue and your MoinMoin edits; please review:

        https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=32604265&selectedPageVersions=2&selectedPageVersions=1

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues

        Steve Rowe made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Jan Høydahl
          • Reporter: Jan Høydahl
          • Votes: 0
          • Watchers: 4
