Solr / SOLR-2244

Add Language Identification support

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      For starters, Tika has language identification capabilities that we can likely leverage. More broadly, we should make it easier for people to plug language identification into the indexing process.

      1. solr2244.patch
        106 kB
        Tommaso Teofili

        Issue Links

          Activity

          Grant Ingersoll created issue -
          Tommaso Teofili added a comment -

          Cool, this would be a nice feature.

          Tommaso Teofili added a comment -

          I've made a patch that uses the Tika 0.8 language identification feature inside an UpdateRequestProcessor.
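
          As a rough illustration of the idea (not the attached patch), the sketch below models a processor that inspects each document before it is indexed and stores a detected language code in a separate field. Everything here is an assumption made for the example: `LanguageDetector`, `StopwordDetector`, and `LanguageIdProcessor` are hypothetical names, the document is a plain `Map`, and the toy stopword counter merely stands in for a real detector such as Tika's. It does not use the Solr or Tika APIs.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical stand-in for a real language detector such as Tika's. */
interface LanguageDetector {
    /** Returns a language code, or null when no guess can be made. */
    String detect(String text);
}

/** Toy detector: picks the language whose stopwords occur most often. */
final class StopwordDetector implements LanguageDetector {
    // TreeMap keeps iteration order deterministic.
    private final Map<String, Set<String>> stopwords = new TreeMap<>();

    StopwordDetector() {
        stopwords.put("en", Set.of("the", "and", "is", "of"));
        stopwords.put("it", Set.of("il", "che", "una", "di"));
    }

    @Override
    public String detect(String text) {
        // Split on anything that is not a letter; lowercase independent of the JVM locale.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : stopwords.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best; // null when no stopword matched at all
    }
}

/** Hypothetical, simplified analogue of an update processor. */
final class LanguageIdProcessor {
    private final LanguageDetector detector;
    private final String sourceField;
    private final String languageField;

    LanguageIdProcessor(LanguageDetector detector, String sourceField, String languageField) {
        this.detector = detector;
        this.sourceField = sourceField;
        this.languageField = languageField;
    }

    /** Adds the detected language to the document; leaves it untouched when detection fails. */
    void processAdd(Map<String, Object> doc) {
        Object value = doc.get(sourceField);
        if (value == null) {
            return;
        }
        String language = detector.detect(value.toString());
        if (language != null) {
            doc.put(languageField, language);
        }
    }
}
```

          Keeping the detector behind a small interface is one way to leave room for other language identification tools later; a real implementation would plug into Solr's update chain rather than operate on a bare map.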

          Tommaso Teofili made changes -
          Field Original Value New Value
          Attachment solr2244.patch [ 12460210 ]
          Grant Ingersoll added a comment -

          Cool, I will check it out.

          Grant Ingersoll made changes -
          Assignee Grant Ingersoll [ gsingers ]
          Grant Ingersoll added a comment -

          I'm going to suggest that we rename contrib/extraction to contrib/tika and roll all of these things under one area. That way we don't have to muck with libraries, etc.

          Heck, it might even make sense at this point to just move it into core.

          Tommaso Teofili added a comment -

          "I'm going to suggest that we rename contrib/extraction to contrib/tika and roll all of these things under one area. That way we don't have to muck with libraries, etc."

          Nice suggestion.

          "Heck, it might even make sense at this point to just move it into core."

          +1

          Robert Muir added a comment -

          "Heck, it might even make sense at this point to just move it into core."

          That is a non-option until SOLR-2088 is fixed. Solr "core" should work on Turkish computers, too.

          Jan Høydahl made changes -
          Link This issue duplicates SOLR-1979 [ SOLR-1979 ]
          Jan Høydahl added a comment -

          There is already an issue specifying this functionality: SOLR-1979. I also have a patch that is not yet uploaded.

          Tommaso Teofili added a comment -

          Thanks for the notification, Jan. My patch is very simple and straightforward, so feel free to integrate it with yours or modify it.

          Grant Ingersoll added a comment -

          I'm going to move forward with this patch, since I don't see one for SOLR-1979.

          I'm going to keep it in contrib/langid, but have it use the Tika libs from contrib/extraction, so that we won't have to package them twice. I don't really like renaming contrib/extraction to contrib/tika, since the name would no longer make the functionality clear, and we may also have other language identification tools in the future.

          Jan Høydahl added a comment -

          Added my patch to SOLR-1979. It differs from this patch in that it is based on contrib/extraction, is configured in-line instead of through its own config file, and has a fallback configuration.
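
          To make the in-line configuration and fallback idea concrete, here is a minimal sketch. It is not taken from either patch: the class `LangIdSettings`, the parameter names `sourceField`, `languageField`, and `fallback`, and their default values are all invented for illustration, and the detector is passed in as a plain function so the example stays independent of Tika and Solr.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical settings holder; the parameter names are invented for this sketch. */
final class LangIdSettings {
    final String sourceField;
    final String languageField;
    final String fallbackLanguage;

    private LangIdSettings(String sourceField, String languageField, String fallbackLanguage) {
        this.sourceField = sourceField;
        this.languageField = languageField;
        this.fallbackLanguage = fallbackLanguage;
    }

    /** Reads settings from in-line parameters, using assumed defaults for missing ones. */
    static LangIdSettings fromParams(Map<String, String> params) {
        return new LangIdSettings(
            params.getOrDefault("sourceField", "text"),
            params.getOrDefault("languageField", "language"),
            params.getOrDefault("fallback", "en"));
    }

    /** Returns the detector's answer, or the fallback language when the detector returns null. */
    String resolve(Function<String, String> detector, String text) {
        String detected = detector.apply(text);
        return detected != null ? detected : fallbackLanguage;
    }
}
```

          With settings passed as parameters like this, a separate configuration file is unnecessary, and documents whose language cannot be detected still receive a predictable value.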

          Grant Ingersoll added a comment -

          Actually, I'm going to switch back to SOLR-1979, as it is a superset of this patch. I should have a patch up shortly.

          Grant Ingersoll made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes:
              1
              Watchers:
              4
