Solr / SOLR-1979

Create LanguageIdentifierUpdateProcessor

    Details

      Description

      Language identification from document fields, and mapping of field names to language-specific fields based on detected language.

      Wrap the Tika LanguageIdentifier in an UpdateProcessor.

      See user documentation at http://wiki.apache.org/solr/LanguageDetection
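
The flow described above (detect a language from selected document fields, then map those fields to language-specific names) can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches. The class `LangIdSketch`, the `process` method, the toy detector, and the `<field>_<lang>` naming scheme are all assumptions made for the example. The `detector` argument stands in for a real identifier such as the Tika LanguageIdentifier, whose API is not reproduced here. The parameter names `inputFields`, `outputField`, and `fallback` follow the configuration shown in the issue history.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class LangIdSketch {

    /**
     * Joins the text of the input fields, asks the detector for a language code,
     * stores that code in outputField, and renames each input field to
     * "<field>_<lang>". The renaming scheme is an assumption for illustration.
     */
    static Map<String, Object> process(Map<String, Object> doc,
                                       List<String> inputFields,
                                       String outputField,
                                       String fallback,
                                       Function<String, Optional<String>> detector) {
        // Concatenate the values of all input fields that are present.
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }

        // Use the fallback language when the detector returns nothing.
        String lang = detector.apply(text.toString().trim()).orElse(fallback);

        // Work on a copy so the caller's document is left untouched.
        Map<String, Object> result = new LinkedHashMap<>(doc);
        result.put(outputField, lang);
        for (String field : inputFields) {
            Object value = result.remove(field);
            if (value != null) {
                result.put(field + "_" + lang, value);
            }
        }
        return result;
    }

    /** Hypothetical toy detector; a real one would rely on statistical language profiles. */
    static Optional<String> toyDetector(String text) {
        String padded = " " + text.toLowerCase() + " ";
        if (padded.contains(" the ")) return Optional.of("en");
        if (padded.contains(" og ")) return Optional.of("no");
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("name", "Fisken og havet");
        doc.put("subject", "Norsk tekst");
        System.out.println(process(doc, List.of("name", "subject"),
                "language_s", "en", LangIdSketch::toyDetector));
    }
}
```

Passing the detector as a function keeps the sketch independent of any particular language identification library. In an actual update processor the same steps would run once per incoming document before it is indexed.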

      1. SOLR-1979-branch_3x.patch (54 kB, Jan Høydahl)
      2. SOLR-1979.patch (35 kB, Jan Høydahl)
      3. SOLR-1979.patch (51 kB, Grant Ingersoll)
      4. SOLR-1979.patch (59 kB, Grant Ingersoll)
      5. SOLR-1979.patch (59 kB, Grant Ingersoll)
      6. SOLR-1979.patch (51 kB, Jan Høydahl)
      7. SOLR-1979.patch (58 kB, Jan Høydahl)
      8. SOLR-1979.patch (51 kB, Jan Høydahl)
      9. SOLR-1979.patch (48 kB, Jan Høydahl)
      10. SOLR-1979.patch (50 kB, Jan Høydahl)
      11. SOLR-1979.patch (54 kB, Jan Høydahl)
      12. SOLR-1979.patch (54 kB, Jan Høydahl)
      13. SOLR-1979.patch (55 kB, Jan Høydahl)
      14. SOLR-1979.patch (57 kB, Jan Høydahl)
      15. SOLR-1979.patch (56 kB, Jan Høydahl)

          Activity

          Jan Høydahl created issue -
          Jan Høydahl made changes -
          Field Original Value New Value
          Description
          Original Value:
          We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.

          To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html] in an UpdateProcessor. The processor should be configured like this:

          {{monospaced}}
            <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <str name="inputFields">title,teaser,body</str>
              <str name="isoOutputField">language</str>
              <str name="fullOutputField">language_display</str>
            </processor>
          {{monospaced}}
          New Value:
          We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.

          To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html] in an UpdateProcessor. The processor should be configured like this:

          {code:xml}
            <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <str name="inputFields">title,teaser,body</str>
              <str name="isoOutputField">language</str>
              <str name="fullOutputField">language_display</str>
            </processor>
          {code}
          Jan Høydahl made changes -
          Link This issue is duplicated by SOLR-2244 [ SOLR-2244 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12465289 ]
          Jan Høydahl made changes -
          Description
          Original Value:
          We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.

          To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html] in an UpdateProcessor. The processor should be configured like this:

          {code:xml}
            <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <str name="inputFields">title,teaser,body</str>
              <str name="isoOutputField">language</str>
              <str name="fullOutputField">language_display</str>
            </processor>
          {code}
          New Value:
          We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.

          To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:

          {code:xml}
            <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <str name="inputFields">name,subject</str>
              <str name="outputField">language_s</str>
              <str name="idField">id</str>
              <str name="fallback">en</str>
            </processor>
          {code}

          The processor then reads the text from the inputFields (name and subject), performs language identification, and writes the ISO code of the detected language to the outputField. If no language is detected, the fallback language is used.
          Grant Ingersoll made changes -
          Assignee Grant Ingersoll [ gsingers ]
          Grant Ingersoll made changes -
          Attachment SOLR-1979.patch [ 12465342 ]
          Grant Ingersoll made changes -
          Attachment SOLR-1979.patch [ 12465351 ]
          Grant Ingersoll made changes -
          Attachment SOLR-1979.patch [ 12465378 ]
          Jan Høydahl made changes -
          Assignee Grant Ingersoll [ gsingers ] Jan Høydahl [ janhoy ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12483437 ]
          Jan Høydahl made changes -
          Description
          Original Value:
          We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.

          To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:

          {code:xml}
            <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <str name="inputFields">name,subject</str>
              <str name="outputField">language_s</str>
              <str name="idField">id</str>
              <str name="fallback">en</str>
            </processor>
          {code}

          The processor then reads the text from the inputFields (name and subject), performs language identification, and writes the ISO code of the detected language to the outputField. If no language is detected, the fallback language is used.
          New Value:
          Language identification from document fields, and mapping of field names to language-specific fields based on detected language.

          Wrap the Tika LanguageIdentifier in an UpdateProcessor.
          Jan Høydahl made changes -
          Labels UpdateProcessor
          Fix Version/s 3.4 [ 12316683 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12483881 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12489184 ]
          Jan Høydahl made changes -
          Fix Version/s 3.5 [ 12317876 ]
          Fix Version/s 3.4 [ 12316683 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12494027 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12494038 ]
          Jan Høydahl made changes -
          Fix Version/s 4.0 [ 12314992 ]
          Description
          Original Value:
          Language identification from document fields, and mapping of field names to language-specific fields based on detected language.

          Wrap the Tika LanguageIdentifier in an UpdateProcessor.
          New Value:
          Language identification from document fields, and mapping of field names to language-specific fields based on detected language.

          Wrap the Tika LanguageIdentifier in an UpdateProcessor.

          See user documentation at http://wiki.apache.org/solr/LanguageDetection
          Jan Høydahl made changes -
          Component/s contrib - LangId [ 12315701 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12495061 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12495073 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12495328 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979.patch [ 12497090 ]
          Jan Høydahl made changes -
          Attachment SOLR-1979-branch_3x.patch [ 12497882 ]
          Attachment SOLR-1979.patch [ 12497883 ]
          Jan Høydahl made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Jan Høydahl
            • Votes:
              8
              Watchers:
              10
