SOLR-16010

langid should include all required Tika dependencies

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib - LangId
    • Labels: None

    Description

      Currently, the langid module requires the extraction module to be loaded in order to work. It isn't clear whether what is included in the extraction module even meets langid's needs (for example, tika-langdetect isn't included in the extraction module).

      ➜  solr git:(SOLR-15989) find solr/packaging/build/solr-10.0.0-SNAPSHOT/ -name '*tika*.jar'
      solr/packaging/build/solr-10.0.0-SNAPSHOT/modules/langid/lib/tika-core-1.27.jar
      solr/packaging/build/solr-10.0.0-SNAPSHOT/modules/extraction/lib/tika-parsers-1.27.jar
      solr/packaging/build/solr-10.0.0-SNAPSHOT/modules/extraction/lib/tika-java7-1.27.jar
      solr/packaging/build/solr-10.0.0-SNAPSHOT/modules/extraction/lib/tika-xmp-1.27.jar
      solr/packaging/build/solr-10.0.0-SNAPSHOT/modules/extraction/lib/vorbis-java-tika-0.8.jar
      solr/packaging/build/solr-10.0.0-SNAPSHOT/modules/extraction/lib/tika-core-1.27.jar
      

      This came out of a discussion in SOLR-15989 - https://github.com/apache/solr/pull/621#discussion_r806083202
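
The jar listing in the description can be grouped by module to make it easier to see which module ships which Tika artifact. A minimal sketch, assuming the same `modules/<module>/lib/` layout that appears in the find output; the function name `tika_jars_by_module` is hypothetical and not part of Solr:

```shell
# Hypothetical helper (not part of Solr): print "<module> <jar>" for every
# jar with "tika" in its name under <dist>/modules/<module>/lib/.
tika_jars_by_module() {
  dist_dir="$1"
  find "$dist_dir/modules" -path '*/lib/*' -name '*tika*.jar' 2>/dev/null |
    sed -E 's#.*/modules/([^/]+)/lib/([^/]+)$#\1 \2#' |
    sort
}
```

It might be called as `tika_jars_by_module solr/packaging/build/solr-10.0.0-SNAPSHOT`, using the build directory from the description.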

          Activity

            janhoy Jan Høydahl added a comment -

            Currently these jars are packaged in langid/lib:

            jsonic-1.2.7.jar
            langdetect-1.1-20120112.jar
            opennlp-tools-1.9.1.jar
            solr-langid-10.0.0-SNAPSHOT.jar
            tika-core-1.27.jar 

            I think this is enough, since Tika 1.x has not yet split language detection out of core the way 2.x has?

            krisden Kevin Risden added a comment -

            tika-langdetect exists in Tika 1.x - https://mvnrepository.com/artifact/org.apache.tika/tika-langdetect/1.28.1

            It seems to me that, at the least, the langid module needs to be tested to make sure it all works, with any jars/documentation updated as needed. Maybe nothing is needed.

            janhoy Jan Høydahl added a comment -

            I just tested langid on main:

            SOLR_MODULES=langid bin/solr start -c
            bin/solr create -c test
            curl -X POST -H 'Content-type:application/json' -d '{"add-updateprocessor":
              {"name": "langid",
              "class": "org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory",
              "langid.fl": "title",
              "langid.langField":"language_s"}
            }' http://localhost:8983/solr/test/config
            # Post some docs

            This works, although the old TikaLanguageIdentifier is not very good: it needs a lot of text to detect anything. LangDetectLanguageIdentifier is better. Both work with our current 9.0 lib/ folder, so there is no need for further Tika dependencies or any dependency on the extraction module.
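
The "# Post some docs" step is not spelled out in the comment. A minimal sketch of what it might look like, assuming the `test` collection and the `langid` processor name from the commands in the comment, and assuming the named processor is applied per request through the `processor` update parameter. The sample documents and the helper function names are invented for illustration:

```shell
# Collection URL taken from the commands in the comment.
SOLR_URL="http://localhost:8983/solr/test"

# Hypothetical sample documents; ids and titles are made up.
sample_docs() {
  cat <<'EOF'
[
  {"id": "1", "title": "This is a fairly long English title so that the detector has enough text to work with"},
  {"id": "2", "title": "Dette er en ganske lang norsk tittel slik at detektoren har nok tekst å arbeide med"}
]
EOF
}

# Index the sample documents, asking Solr to run the "langid" processor
# (assumption: a processor added via add-updateprocessor is selected with
# the "processor" request parameter).
post_docs() {
  sample_docs | curl -X POST -H 'Content-type:application/json' \
    --data-binary @- "$SOLR_URL/update?processor=langid&commit=true"
}

# Fetch the documents back with the field named in langid.langField.
show_languages() {
  curl "$SOLR_URL/select?q=*:*&fl=id,title,language_s"
}
```

With Solr running as in the comment, calling `post_docs` and then `show_languages` should make it possible to see whether `language_s` was populated. To try the LangDetect-based detector instead, the same Config API call could presumably use `org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory` as the class.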

            krisden Kevin Risden added a comment -

            Awesome, thanks for checking, janhoy!


            People

              Assignee: Unassigned
              Reporter: Kevin Risden (krisden)