Tika
  1. Tika
  2. TIKA-490

Support for adding language profiles dynamically

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.8
    • Component/s: languageidentifier
    • Labels:
      None

      Description

      Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.

      It would be better to make this configurable, so you could add your own languages without recompiling.

      Suggested approach:
      Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
      Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!

      Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

      1. TIKA-490.patch
        8 kB
        Jan Høydahl
      2. TIKA-490.patch
        8 kB
        Jan Høydahl
      3. TIKA-490.Mattmann.082210.patch.txt
        10 kB
        Chris A. Mattmann
      4. TIKA-490.Mattmann.082210.2.patch.txt
        11 kB
        Chris A. Mattmann
      5. TIKA-490.janhoy.082310.patch
        12 kB
        Jan Høydahl
      6. TIKA-490.janhoy.082310.patch
        12 kB
        Jan Høydahl
      7. TIKA-490.janhoy.092010.patch
        14 kB
        Jan Høydahl
      8. TIKA-490.janhoy.092010.patch
        14 kB
        Jan Høydahl

        Activity

        Hide
        Jan Høydahl added a comment -

        Here is my first patch (against current trunk 987074) which adds a properties file for languages list, and initializes profiles in constructor. I also made the new initProfiles() public so that it is possible to programmatically re-load the profiles if you have changed the property file, without restarting the JavaVM.

        I also changed the URL for ISO639 to a more complete version, and extracted some strings into constants.

        Also added a JDK logger to add some output. Did not use log4j or slf4j because those would require new dependencies and hurt backwards drop-in compatibility. We now log what languages get initalized and from which property file, which will aid in debugging.

        All existing tests PASS. No API changes are done which breaks backward compatibility.

        It would be great if someone could test it and commit it in time for 0.8, as I suspect to use it with Solr with some custom language profiles.

        Show
        Jan Høydahl added a comment - Here is my first patch (against current trunk 987074) which adds a properties file for languages list, and initializes profiles in constructor. I also made the new initProfiles() public so that it is possible to programmatically re-load the profiles if you have changed the property file, without restarting the JavaVM. I also changed the URL for ISO639 to a more complete version, and extracted some strings into constants. Also added a JDK logger to add some output. Did not use log4j or slf4j because those would require new dependencies and hurt backwards drop-in compatibility. We now log what languages get initalized and from which property file, which will aid in debugging. All existing tests PASS. No API changes are done which breaks backward compatibility. It would be great if someone could test it and commit it in time for 0.8, as I suspect to use it with Solr with some custom language profiles.
        Hide
        Jan Høydahl added a comment -

        Changed initProfiles() to a static method, and also made the Properties object static. Re-ran tests.

        Show
        Jan Høydahl added a comment - Changed initProfiles() to a static method, and also made the Properties object static. Re-ran tests.
        Hide
        Chris A. Mattmann added a comment -

        Hi Jan,

        I don't get the point of the override properties part. What does it buy us? The way you've set it up, it also loads from the classpath just like the language identifier properties proper file, so, it shouldn't be any more arduous to just mod that file if necessary (since they both are classpath loaded).

        Let me know what you think. I've reviewed the rest of the patch it looks good and I'm ready to commit it, sans the override part.

        Cheers,
        Chris

        Show
        Chris A. Mattmann added a comment - Hi Jan, I don't get the point of the override properties part. What does it buy us? The way you've set it up, it also loads from the classpath just like the language identifier properties proper file, so, it shouldn't be any more arduous to just mod that file if necessary (since they both are classpath loaded). Let me know what you think. I've reviewed the rest of the patch it looks good and I'm ready to commit it, sans the override part. Cheers, Chris
        Hide
        Jan Høydahl added a comment -

        (Adding same comment to this issue as I sent to mailing list)

        My rationale for the override part is as follows:

        The default properties file will be embedded within tika-xx.jar. I assume most people are not keen to unpack and repack JARs to make a config change. We COULD put a similar named properties file at another location, but then the user needs to make sure that location is EARLIER in classpath than the JAR file. In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure this, and to avoid any confusion about class-loader peculiarities, it's more straight-forward to look for an override file.

        Take Solr as an example. The user would then put the override properties file along with his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/

        Show
        Jan Høydahl added a comment - (Adding same comment to this issue as I sent to mailing list) My rationale for the override part is as follows: The default properties file will be embedded within tika-xx.jar. I assume most people are not keen to unpack and repack JARs to make a config change. We COULD put a similar named properties file at another location, but then the user needs to make sure that location is EARLIER in classpath than the JAR file. In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure this, and to avoid any confusion about class-loader peculiarities, it's more straight-forward to look for an override file. Take Solr as an example. The user would then put the override properties file along with his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/
        Hide
        Chris A. Mattmann added a comment -

        Interesting. What why couldn't we just name the file the same thing, then? Would this be putting it up as a gamble to the Classloader?

        Show
        Chris A. Mattmann added a comment - Interesting. What why couldn't we just name the file the same thing, then? Would this be putting it up as a gamble to the Classloader?
        Hide
        Chris A. Mattmann added a comment -
        • see my updated patch. I've replaced all logging in the patch with TikaExceptions and proper handling. We don't want to have any logging in tika-core. If logging needs to happen it can happen downstream in the clients of the functionality provided by core. I also decided I'm fine with the override file notion (thanks for the explanation, Jan!). If there are no objections I'd like to commit this in the next 24 hours.
        Show
        Chris A. Mattmann added a comment - see my updated patch. I've replaced all logging in the patch with TikaExceptions and proper handling. We don't want to have any logging in tika-core. If logging needs to happen it can happen downstream in the clients of the functionality provided by core. I also decided I'm fine with the override file notion (thanks for the explanation, Jan!). If there are no objections I'd like to commit this in the next 24 hours.
        Hide
        Jan Høydahl added a comment -

        Either solution is OK by me.

        But from my experience classpath problems in application servers can cause a headache, and therefore I tried to be as explicit as possible.

        It would of course work to put a tika.language.properties file on classpath as well, if you know it's earlier in classpath than the tika jar.

        Or we could name the jar-internal properties file tika.language.default.properties, and let people override through tika.language.properties

        I'll leave the decision up to you. Remove the override if you feel strong about it.

        Show
        Jan Høydahl added a comment - Either solution is OK by me. But from my experience classpath problems in application servers can cause a headache, and therefore I tried to be as explicit as possible. It would of course work to put a tika.language.properties file on classpath as well, if you know it's earlier in classpath than the tika jar. Or we could name the jar-internal properties file tika.language.default.properties, and let people override through tika.language.properties I'll leave the decision up to you. Remove the override if you feel strong about it.
        Hide
        Chris A. Mattmann added a comment -
        • updated patch. I decided against throwing TikaExceptions, and just decided to throw an IOException if there was a problem loading the props file containing the profile default information. This makes the real exception the one that's exposed to the consumer of this API, which makes more sense. I don't like returning null in ProfilingWriter#getLanguage, but I couldn't think of anything more elegant at this time. So, I documented that there's a possibility that that method could return null now, so a caller should plan accordingly.

        If there are no objections, I'll commit this in the next 24 hrs.

        Show
        Chris A. Mattmann added a comment - updated patch. I decided against throwing TikaExceptions, and just decided to throw an IOException if there was a problem loading the props file containing the profile default information. This makes the real exception the one that's exposed to the consumer of this API, which makes more sense. I don't like returning null in ProfilingWriter#getLanguage, but I couldn't think of anything more elegant at this time. So, I documented that there's a possibility that that method could return null now, so a caller should plan accordingly. If there are no objections, I'll commit this in the next 24 hrs.
        Hide
        Jan Høydahl added a comment -

        Been thinking a bit more, and attaching a (in my eyes) approved patch.

        Those who override the config, they need better logging, else they're in the dark as to what happened.
        Since we cannot use a log framework, I've added a way to detect and fetch errors through the API.

        Also I've removed the throw of IOException in constructor. Since there is only a risk of that particular exception being thrown if the property file does not exist, in my opinion this will NEVER happen. So we should not force people to catch it. Also, it is better to keep backward API compatibility for all the existing applications out there. Users that override config will not either get the IOException since - if their tika.language.override.properties does not exist, tika will load the built-in one. And if the property file exists but is faulty, you'll not get IOException either. If there are errors in the properties file, you can test that through the new methods, and even ask for a list of languages that are successfully initialized.

        Changes in the attached patch:

        • public static boolean hasErrors()
        • public static String getErrors()
        • public static Set<String> getSupportedLanguages()
        • Constructor no longer throws IOException
        • Added better comments to public methods, including a warning for isReasonablyCertain() and short texts

        Sample output from a test application:
        Supported languages: [is, da, it, no, hu, th, de, el, fi, pt, pl, sv, fr, en, ru, et, es, nl]
        Number of languages supported: 18
        Has errors? true
        Language xx (Unknown) not initialized. Message: Failed trying to load language profile for language "xx". Error: null

        Show
        Jan Høydahl added a comment - Been thinking a bit more, and attaching a (in my eyes) approved patch. Those who override the config, they need better logging, else they're in the dark as to what happened. Since we cannot use a log framework, I've added a way to detect and fetch errors through the API. Also I've removed the throw of IOException in constructor. Since there is only a risk of that particular exception being thrown if the property file does not exist, in my opinion this will NEVER happen. So we should not force people to catch it. Also, it is better to keep backward API compatibility for all the existing applications out there. Users that override config will not either get the IOException since - if their tika.language.override.properties does not exist, tika will load the built-in one. And if the property file exists but is faulty, you'll not get IOException either. If there are errors in the properties file, you can test that through the new methods, and even ask for a list of languages that are successfully initialized. Changes in the attached patch: public static boolean hasErrors() public static String getErrors() public static Set<String> getSupportedLanguages() Constructor no longer throws IOException Added better comments to public methods, including a warning for isReasonablyCertain() and short texts Sample output from a test application: Supported languages: [is, da, it, no, hu, th, de, el, fi, pt, pl, sv, fr, en, ru, et, es, nl] Number of languages supported: 18 Has errors? true Language xx (Unknown) not initialized. Message: Failed trying to load language profile for language "xx". Error: null
        Hide
        Chris A. Mattmann added a comment -

        Hi Jan:

        I like your approach! Definitely a nice set of improvements. I noticed though in the patch there are still places (the unit tests) where we check for Exceptions – this isn't needed anymore since your updates obviate any exception being thrown. No worries, I'll address this when I commit your patch. Looks good to me. I plan to commit this sometime tonight or tomorrow.

        Cheers,
        Chris

        Show
        Chris A. Mattmann added a comment - Hi Jan: I like your approach! Definitely a nice set of improvements. I noticed though in the patch there are still places (the unit tests) where we check for Exceptions – this isn't needed anymore since your updates obviate any exception being thrown. No worries, I'll address this when I commit your patch. Looks good to me. I plan to commit this sometime tonight or tomorrow. Cheers, Chris
        Hide
        Jan Høydahl added a comment -

        Yepp, I saw that.

        This patch removes the try's from test.
        I also removed the call to initProfiles() from constructor, as we now call it statically, and we no longer throw IOException in constructor, so we don't need it there for that purpose.

        I also took the freedom of adding a simple getter
        public double getDistance()
        I know that it is not directly related to this issue, so feel free to discard it and I will create a new ticket fir it.
        But given its low risk and high gain... at least until TIKA-369 is implemented.
        it would let the consumer decide what is certain enough for his use case. by e.g. computing normalized distance based on text length.
        An alternative is to mark it as deprecated and semi-internal, as the numerical value will probably change with TIKA-369.

        Show
        Jan Høydahl added a comment - Yepp, I saw that. This patch removes the try's from test. I also removed the call to initProfiles() from constructor, as we now call it statically, and we no longer throw IOException in constructor, so we don't need it there for that purpose. I also took the freedom of adding a simple getter public double getDistance() I know that it is not directly related to this issue, so feel free to discard it and I will create a new ticket fir it. But given its low risk and high gain... at least until TIKA-369 is implemented. it would let the consumer decide what is certain enough for his use case. by e.g. computing normalized distance based on text length. An alternative is to mark it as deprecated and semi-internal, as the numerical value will probably change with TIKA-369 .
        Hide
        Jukka Zitting added a comment -

        I wonder why we'd need these extra properties files in the first place. As an alternative, how about if we just added an optional Map<String, LanguageProfile> argument to the LanguageIdentifier constructors, so applications that want to add new languages or customize the default profiles can easily do so. Such an approach would be much more flexible and less error-prone (no need for extra logging) than using properties files, and an application that needs runtime configuration of language profiles can still implement its own properties files for this purpose.

        Show
        Jukka Zitting added a comment - I wonder why we'd need these extra properties files in the first place. As an alternative, how about if we just added an optional Map<String, LanguageProfile> argument to the LanguageIdentifier constructors, so applications that want to add new languages or customize the default profiles can easily do so. Such an approach would be much more flexible and less error-prone (no need for extra logging) than using properties files, and an application that needs runtime configuration of language profiles can still implement its own properties files for this purpose.
        Hide
        Chris A. Mattmann added a comment -

        Hey Jukka:

        +1, let me take a crack at addressing your issues. I think we can make this look like the way that the Parsers are dynamically loaded, including the API support for it.

        Cheers,
        Chris

        Show
        Chris A. Mattmann added a comment - Hey Jukka: +1, let me take a crack at addressing your issues. I think we can make this look like the way that the Parsers are dynamically loaded, including the API support for it. Cheers, Chris
        Hide
        Jan Høydahl added a comment -

        Jukka,

        Your idea for programmatic adding of profiles is good for some integration scenarios.

        My original intent with filing this issue however, was to support the case where an (end) user or integrator needs to add support for a new langue on the fly without programming.

        For me as a Search engine Consultant I can then add say Sami Language support to Solr extracting request handler by placing one property file and one Language profile file on the classpath. With a constructor only this would require code change to Solr.

        So we can add constructor but please don't remove property file.

        Show
        Jan Høydahl added a comment - Jukka, Your idea for programmatic adding of profiles is good for some integration scenarios. My original intent with filing this issue however, was to support the case where an (end) user or integrator needs to add support for a new langue on the fly without programming. For me as a Search engine Consultant I can then add say Sami Language support to Solr extracting request handler by placing one property file and one Language profile file on the classpath. With a constructor only this would require code change to Solr. So we can add constructor but please don't remove property file.
        Hide
        Jan Høydahl added a comment -

        Updated patch which adds ability to load your own map of LanguageProfile-s programmatically. I chose to do that not via a constructor, but rather a new signature for for initProfiles(), which will replace the profiles map with the one provided.

        The public API after this patch will be:
        // Existing
        public LanguageIdentifier(LanguageProfile profile)
        public LanguageIdentifier(String content)
        public String getLanguage()
        public boolean isReasonablyCertain()
        public String toString()

        // New from patch
        public static void initProfiles()
        public static void initProfiles(Map<String, LanguageProfile> profilesMap)
        public static void clearProfiles()
        public static void addProfile(String language, LanguageProfile profile)
        public static Set<String> getSupportedLanguages()
        public static boolean hasErrors()
        public static String getErrors()

        None of the original signatures are changed.
        You can now initialize language profiles in these ways:
        A) Loading tika.language.properties and profiles from jar file (default, equivalent to v0.7)
        B) Loading tika.language.override.properties from classpath and profiles from classpath (nice for integrators)
        C) Dynamic from client code, using addProfile and/or initProfiles(Map<String, LanguageProfile>)

        PS: I removed the getDistance() method again, since it's exposing internal value. Already available through toString() for those who need it. Also removed the "force" parameter to initProfiles() as it's not needed - we always want to init from scratch.

        Show
        Jan Høydahl added a comment - Updated patch which adds ability to load your own map of LanguageProfile-s programmatically. I chose to do that not via a constructor, but rather a new signature for for initProfiles(), which will replace the profiles map with the one provided. The public API after this patch will be: // Existing public LanguageIdentifier(LanguageProfile profile) public LanguageIdentifier(String content) public String getLanguage() public boolean isReasonablyCertain() public String toString() // New from patch public static void initProfiles() public static void initProfiles(Map<String, LanguageProfile> profilesMap) public static void clearProfiles() public static void addProfile(String language, LanguageProfile profile) public static Set<String> getSupportedLanguages() public static boolean hasErrors() public static String getErrors() None of the original signatures are changed. You can now initialize language profiles in these ways: A) Loading tika.language.properties and profiles from jar file (default, equivalent to v0.7) B) Loading tika.language.override.properties from classpath and profiles from classpath (nice for integrators) C) Dynamic from client code, using addProfile and/or initProfiles(Map<String, LanguageProfile>) PS: I removed the getDistance() method again, since it's exposing internal value. Already available through toString() for those who need it. Also removed the "force" parameter to initProfiles() as it's not needed - we always want to init from scratch.
        Hide
        Jan Høydahl added a comment -

        Replacing with correct patch

        Show
        Jan Høydahl added a comment - Replacing with correct patch
        Hide
        Jan Høydahl added a comment -

        Chris, Jukka,

        Are we approaching a solution on this one?

        Show
        Jan Høydahl added a comment - Chris, Jukka, Are we approaching a solution on this one?
        Hide
        Chris A. Mattmann added a comment -

        Hi Jan:

        I've been a bit bogged down lately but plan on looking at this later in the week and trying to resolve it then.

        Cheers,
        Chris

        Show
        Chris A. Mattmann added a comment - Hi Jan: I've been a bit bogged down lately but plan on looking at this later in the week and trying to resolve it then. Cheers, Chris
        Hide
        Chris A. Mattmann added a comment -
        • I applied the latest patch from Jan in r1029556. Ultimately we should move to a service-provider like mechanism on this though the existing patch is quite close to that mechanism in the end. It would be nice to just have language profiles picked up dynamically on the classpath, but this is good enough for now. Thanks Jan!
        Show
        Chris A. Mattmann added a comment - I applied the latest patch from Jan in r1029556. Ultimately we should move to a service-provider like mechanism on this though the existing patch is quite close to that mechanism in the end. It would be nice to just have language profiles picked up dynamically on the classpath, but this is good enough for now. Thanks Jan!

          People

          • Assignee:
            Chris A. Mattmann
            Reporter:
            Jan Høydahl
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 24h
              24h
              Remaining:
              Remaining Estimate - 24h
              24h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development