Tika / TIKA-1723

Integrate language-detector into Tika

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: languageidentifier
    • Labels:
      None
    • Flags:
      Patch

      Description

      The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs. 13), and is more accurate than the built-in language detector.

      This is a stab at integrating it, with some initial findings. There are a number of issues this raises, especially if Chris A. Mattmann moves forward with turning language detection into a pluggable extension point.

      I'll add comments with results below.

      1. TIKA-1723.patch
        74 kB
        Ken Krugler
      2. TIKA-1723-2.patch
        109 kB
        Ken Krugler
      3. TIKA-1723-3.patch
        110 kB
        Ken Krugler
      4. TIKA-1723v2.patch
        770 kB
        Tim Allison

          Activity

          kkrugler Ken Krugler added a comment -

          Part of this work is looking to make the API for language detection more generic - currently it's tightly coupled to the existing internal implementation.

          For example, a LanguageProfile is used for both the target language model and what's built from character statistics, but this isn't how it always works for other detectors.

          And LanguageProfile exposes public details about the implementation, e.g. DEFAULT_NGRAM_LENGTH is a public constant.

          I've created an abstract LanguageDetector class plus a few new concrete classes, and have integrated language-detector using these.

          But in order to not break compatibility with existing users, I've left the current implementation in place. If the patch looks promising, I could turn those into facades for the new implementation, and mark them as deprecated.
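
           For illustration, a minimal sketch of the shape such an abstract base class might take (method names here are assumptions, not the committed API):

           import java.io.IOException;
           import java.util.List;

           // Sketch of a generic detector base class; assumes a LanguageResult
           // value type. Names are illustrative, not the committed API.
           public abstract class LanguageDetector {

               // Load whatever models this implementation needs.
               public abstract LanguageDetector loadModels() throws IOException;

               // Feed text to the detector; may be called repeatedly.
               public abstract void addText(CharSequence text);

               // All candidate languages, best match first.
               public abstract List<LanguageResult> detectAll();

               // Convenience: just the best candidate, or null if none.
               public LanguageResult detect() {
                   List<LanguageResult> results = detectAll();
                   return results.isEmpty() ? null : results.get(0);
               }
           }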

          kkrugler Ken Krugler added a comment -

          The above work added the language-detector dependency to tika-core, which is where the current language support lives...but that doesn't feel right.

          I think we should have a tika-language jar, which could then also include whatever pluggable scaffolding was developed by Chris A. Mattmann.

          kkrugler Ken Krugler added a comment -

          There are a number of TODO comments in the code, many around open design decisions.

           I also haven't included test data for all 70 languages that language-detector supports, as that doesn't feel like a good use of my time.

          tallison@mitre.org Tim Allison added a comment - - edited

          I've only taken a brief look, but I think that moving to an abstract LanguageDetector is great!

           1. On the confidence scores...I suspect that different detectors will have different underlying distributions. Should we have the Tika wrapper for each detector munge the confidence scores so that they are normally distributed around .5 (or something), or should we let each detector determine "high" or "medium"? We could still include the raw confidence scores, but add a variable to LanguageResult for "high/medium/low".
           2. Should we add a setPriors(Map<String,Float> langPriors) method to LanguageDetector? Some implementations may or may not use it (similar to what you're doing with mixedLanguages and shortText).
          3. Could we rename LangDetector to something like OptimaizeLangDetector so that in the future if we integrate cybozulab's langdetect, we won't have confusion?
          4. Should we take this opportunity to create a new tika-langdetect module?

           On a related note, as I was looking at ProfilingHandler and ProfilingWriter, I'm wondering whether it's possible to include the language detection in the handler component instead of the writer component. I think this would allow easier dual language detection and content handling. The goal would be something like what Chris A. Mattmann added to tika-server: wrap ToXMLHandler (or friend) in a LanguageDetectorHandler...the XMLHandler would write the chars to the specified outputstream, and the LanguageDetectorHandler would compute the language detection stats.
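
           A rough sketch of that wrapping, assuming Tika's ContentHandlerDecorator and an abstract LanguageDetector like the one sketched above (the details are assumptions, not a worked-out design):

           import org.apache.tika.sax.ContentHandlerDecorator;
           import org.xml.sax.ContentHandler;
           import org.xml.sax.SAXException;

           // Hypothetical decorator: forwards all SAX events to the wrapped handler
           // (e.g. a ToXMLContentHandler writing to an OutputStream) while also
           // feeding the character content to a language detector.
           public class LanguageDetectorHandler extends ContentHandlerDecorator {

               private final LanguageDetector detector;

               public LanguageDetectorHandler(ContentHandler wrapped, LanguageDetector detector) {
                   super(wrapped);
                   this.detector = detector;
               }

               @Override
               public void characters(char[] ch, int start, int length) throws SAXException {
                   super.characters(ch, start, length);             // normal content handling
                   detector.addText(new String(ch, start, length)); // accumulate detection stats
               }

               public LanguageResult getDetectedLanguage() {
                   return detector.detect();
               }
           }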

          If there's an obvious way to do this now, please let me know. If not, I can try to implement this with our current language detector and then your Optimaize wrapper and all others that we choose could use that?

          tallison@mitre.org Tim Allison added a comment -

          This is a very, very rough patch that breaks out a new tika-langdetect module. The idea is to follow the pattern of tika-translate – only the smallest footprint necessary in core. For now, there's quite a bit of duplication of test resources, etc. Eventually we could move Tika's current langdetect implementation into this module and leave the scaffolding behind, but probably not until 2.0.

          I wouldn't propose doing this all in one patch (integrating Optimaize and doing all of the internal restructuring), but I wanted to continue roughing out where we might be headed. If this basically looks good, I'll open a separate ticket for that, and then Ken Krugler, we could add this.

          If Chris A. Mattmann or others want to add configurability for language id through TikaConfig, I think that'd be great and something along these lines would be a good start.

          kkrugler Ken Krugler added a comment -

          Hi Tim - thanks for the fast review.

           1. Re confidence scores...yes, they'll have different ranges & meanings for their raw scores. That's why I'd put the comment into LanguageResult about these being normalized to conform to the range constants defined previously. But I like your idea better - call this a "rawScore", and have a separate enumerated confidence value (LOW, MED, HIGH). I'll go make that change (see the sketch after this list).

           2. Re setPriors - I haven't seen a case where it's necessary to dynamically change the a priori probabilities when using language-detector, so I'd propose having an alternative loadModels(Map<String, Float>). This way the detector could load a different model depending on the probability (as an example). But having a separate call to set the probabilities is also possible. Though in that case, what if the set of languages doesn't match what was previously loaded? Throw an error?

          3. Re OptimaizeLangDetector - sure, makes sense to rename it.
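
           One possible shape for that change, sketched with assumed names (not the committed code):

           // Raw, detector-specific score plus a coarse enumerated confidence.
           public class LanguageResult {

               public enum Confidence { NONE, LOW, MEDIUM, HIGH }

               private final String language;   // e.g. "en", "zh-CN"
               private final Confidence confidence;
               private final float rawScore;    // detector-specific; not comparable across detectors

               public LanguageResult(String language, Confidence confidence, float rawScore) {
                   this.language = language;
                   this.confidence = confidence;
                   this.rawScore = rawScore;
               }

               public String getLanguage() { return language; }
               public Confidence getConfidence() { return confidence; }
               public float getRawScore() { return rawScore; }

               public boolean isReasonablyCertain() { return confidence == Confidence.HIGH; }
           }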

          kkrugler Ken Krugler added a comment -

          Hi Tim - re putting language detection into the handler. I'd been thinking about how best to add language attributes to the XHTML being generated by the parsers, as I think that's the right way to handle multi-lingual documents (I assume that's what you mean by "dual language detection").

          The problem is that you'd want the output to be hierarchical, in that <html lang=xx xml:lang=xx> is where you'd want to specify the "primary" language for the document, and then only add the lang attributes to elements where it's different.
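
           Concretely, the hierarchical form would look something like this (a hand-written example, not actual parser output):

           <html lang="en" xml:lang="en">   <!-- "primary" language for the document -->
             <body>
               <p>Most of the document is in English.</p>
               <p lang="fr" xml:lang="fr">Ce paragraphe est en français.</p>   <!-- tagged only where it differs -->
             </body>
           </html>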

          But that would require deferring the output of all XHTML until after the document had been processed, or processing it twice, which seems ugly. So the other solution would be to add language tags at every opportunity (any element that supports the lang attribute). Though you'd only have to do this if the language was different from the enclosing element's language. But you'd want to process each chunk of text individually, e.g. you wouldn't know in advance if there's a list with a different language for each item.

          Which means this is getting pretty complicated.

          kkrugler Ken Krugler added a comment -

          I've also been thinking about how to use lang=xx and xml:lang=xx attributes in (X)HTML docs - e.g. should we key off that and skip language detection to improve efficiency (and hopefully accuracy)?

          And what about Content-Language in the response header, if that's provided with metadata?
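
           A small sketch of that short-circuit, assuming Tika's Metadata class and the Content-Language key from HttpHeaders (the class and method are hypothetical):

           import org.apache.tika.metadata.HttpHeaders;
           import org.apache.tika.metadata.Metadata;

           // Trust an explicit Content-Language (or a lang attribute that a parser
           // copied into metadata) and skip statistical detection when present.
           public class DeclaredLanguageCheck {

               public static String getDeclaredLanguage(Metadata metadata) {
                   String declared = metadata.get(HttpHeaders.CONTENT_LANGUAGE);
                   if (declared != null && !declared.isEmpty()) {
                       return declared; // e.g. "en-US" - skip detection entirely
                   }
                   return null; // caller falls back to running a detector
               }
           }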

          tallison@mitre.org Tim Allison added a comment -

           Agreed on complexity of multilingual lang id. You would definitely want to do a two-step process, I would think, from a statistical perspective as well as from the perspective you point out: majority lang for the doc, etc.

          This was in the back of my mind as an "eventually, wouldn't this be nice," but I think we're still a good way away from that.

          Apologies for my lack of clarity, what I meant by "dual language detection and content handling" was: allow for identification of the overall language of the document at the same time that you are handling/writing out regular content, say, to an outputstream or a byte buffer via the usual ToTextHandler (or friend). I realize that you can probably do this via some kind of wrapping of the writer, but it seems like we might want to move this into the handler.

          tallison@mitre.org Tim Allison added a comment -

          My personal preference would be to add to whatever metadata we have about the document, not overwrite it. We might use that information in the priors for the doc.

           So, again, my personal preference would be to apply the "added by Tika" prefix (TikaCoreProperties.TIKA_META_PREFIX) to any metadata that we compute via lang id.

          kkrugler Ken Krugler added a comment - - edited

           Version 2 of my patch (not to be confused with Tim's patch, which is about moving this code into a new tika-langdetect module).

          kkrugler Ken Krugler added a comment -

          New patch which uses Locale to handle language names (language tags).

          kkrugler Ken Krugler added a comment -

          Hi Tim - I just attached a new version of my patch, which addresses the issues raised as per above. Still some TODOs, but I'd like to get this committed so you could do the move to tika-langdetect. Let me know what you think.

          tallison@mitre.org Tim Allison added a comment - - edited

          Ken,
          This looks great. And, yes, I wouldn't want anyone to confuse your patch2 with my horrible mess.
          To confirm, is this the overall goal:

          1. Make language detection configurable via TikaConfig
          2. Create a separate package tika-lang-detect (or similar) and put various language detection implementations/dependencies there including Tika's legacy detection code and Optimaize?
          3. Make Optimaize the default language detector in tika-app and tika-server
          4. Add other lang detectors as desired to the new package
          5. Deprecate and then eventually remove ProfilingHandler and ProfilingWriter

          If everyone is ok with committing the patch as is and then doing some fairly substantial moving next week (or so) into the new package, then, y, go for it.

          I'm excited to try out Optimaize. Thank you for the integration!

          chrismattmann Chris A. Mattmann added a comment -

           Ken, this is great work. My +1 to move forward on it. I should have time early next week to pick up and help work on this more in terms of refactoring into another tika module, etc.

          tallison@mitre.org Tim Allison added a comment -

          I forgot to mention that we'll need to modify the Tika bundle to get this to work via OSGi.

          kkrugler Ken Krugler added a comment -

          Regarding the current detection code...

          I'm going to propose that we leave it in tika-core, w/deprecation annotations, unless someone can come up with a good reason why we'd want to have it available via the new API.

          tallison@mitre.org Tim Allison added a comment -

          Makes sense. I proposed moving it over just so that we didn't lose our investment in that code, but if Optimaize or another lang-detect package blows it out of the water, then it makes sense to abandon it.

           Are you generally in agreement with the overall way ahead above (with the exception of handling of legacy code)?

          Should we remove legacy detection code in 2.0?

          kkrugler Ken Krugler added a comment -

          Hi Tim,

           1. Not sure about "Make language detection configurable via TikaConfig". Doesn't that get into issues with the classloader, etc.? In any case, I assume that's something Chris A. Mattmann will address in a separate issue, re making the language detection pluggable (see the discovery sketch after this list).

          2. Yes re separate package, without porting current detection code.

          3. Yes re making Optimaize the default detector (though this is more about #1 above). So currently it would be "the only detector", at least for the new API.

          4. I think so, though there's a philosophical issue here...should we just have one built-in implementation, and assume that any others will be separate plug-ins implemented by somebody else?

          5. Yes re getting rid of legacy code in 2.0 (including current detection code/data & ProfilingXXX classes)
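
           For the discovery side of pluggability, one common pattern is java.util.ServiceLoader; a hypothetical sketch, assuming the abstract LanguageDetector from earlier:

           import java.util.ServiceLoader;

           // Hypothetical discovery: implementations would register themselves in
           // META-INF/services/<fully-qualified LanguageDetector name>, and the
           // first provider found on the classpath wins.
           public class LanguageDetectorLoader {

               public static LanguageDetector getDefaultDetector() {
                   for (LanguageDetector detector : ServiceLoader.load(LanguageDetector.class)) {
                       return detector;
                   }
                   throw new IllegalStateException("No LanguageDetector implementation found");
               }
           }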

          kkrugler Ken Krugler added a comment -

          Biggest remaining issue before I commit is how to deal with language names (aka language tags). I've got a LanguageNames class (probably should be renamed to LanguageTags) that wraps some of Java's Locale object, to help with handling conversion between strings and formal locales, and doing fuzzy comparison. But some of what should be in that class requires functionality not provided by Locale (e.g. what's the suppress-script setting for a locale?), and other functionality requires some decision making. For example, if you request 'zh' as one of the language profiles, and the detector has zh-Latn-CN, then is that a match, and thus pinyin (e.g. "beijing") gets flagged as Chinese?
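
           To make the Locale limitations concrete, a small demo (the fuzzy-match question maps onto RFC 4647 filtering; comments show expected behavior):

           import java.util.Arrays;
           import java.util.List;
           import java.util.Locale;

           public class LanguageTagDemo {
               public static void main(String[] args) {
                   // Parsing a BCP 47 tag works fine...
                   Locale pinyin = Locale.forLanguageTag("zh-Latn-CN");
                   System.out.println(pinyin.getLanguage()); // zh
                   System.out.println(pinyin.getScript());   // Latn
                   System.out.println(pinyin.getCountry());  // CN
                   // ...but Locale exposes no suppress-script data at all.

                   // RFC 4647 basic filtering treats the range "zh" as matching
                   // "zh-Latn-CN" - exactly the pinyin question above.
                   List<Locale.LanguageRange> requested = Locale.LanguageRange.parse("zh");
                   List<String> models = Arrays.asList("zh-Latn-CN", "en", "de");
                   System.out.println(Locale.filterTags(requested, models)); // only the zh-Latn-CN model matches
               }
           }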

          tallison@mitre.org Tim Allison added a comment -

          Great. Thank you.

          1. ...Doesn't that get into issues with the classloader, etc? In any case, I assume that's something Chris A. Mattmann will address in a separate issue, re making the language detection pluggable.

          Y, and y. It'll be possible, but it'll take some work in a separate issue.

          4. I think so, though there's a philosophical issue here...should we just have one built-in implementation, and assume that any others will be separate plug-ins implemented by somebody else?

           Once we go the route of pluggability, we may as well add a wrapper for cybozu's in the tika-lang-detect module...I think. We could cut down on some configuration in the Solr config with more configuration on our side. Wait... But seriously, I think we should add it, eventually.

          tallison@mitre.org Tim Allison added a comment -

          Y, I agree...that's a potential mess/challenge/opportunity. We might want to see how Solr's handling that now.

           This may be overkill, but do we need a separate object for this: LanguageSpec that includes language, extlang, script, region, variant, extension, and private-use?

          all nullable except language

          For loadModels and hasModels, we could require an exact match.

          We could add a getMatchingModels that would return a set of models that match the non-null items?

          A LanguageResult would have a LanguageSpec object instead of a String and be the best effort parse of whatever the underlying lang id'er said.

          Again, this might just be too much...
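
           If it were pursued, a stripped-down version might look like this (purely hypothetical, showing only a few of the proposed fields):

           import java.util.Objects;

           // Hypothetical BCP 47-style spec: everything nullable except language.
           public class LanguageSpec {
               private final String language; // required, e.g. "zh"
               private final String script;   // e.g. "Latn", or null
               private final String region;   // e.g. "CN", or null

               public LanguageSpec(String language, String script, String region) {
                   this.language = Objects.requireNonNull(language, "language is required");
                   this.script = script;
                   this.region = region;
               }

               // A spec matches another if every non-null field agrees - the
               // "match the non-null items" behavior proposed for getMatchingModels.
               public boolean matches(LanguageSpec other) {
                   return language.equals(other.language)
                           && (script == null || script.equals(other.script))
                           && (region == null || region.equals(other.region));
               }
           }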

          tallison@mitre.org Tim Allison added a comment -

          Come on over to the 2.x branch, the water is fine. Plenty of freedom to break things there. I still don't have a good solution, though, to the complexity you raise above.

          kkrugler Ken Krugler added a comment -

           Tim Allison I must admit, focusing on this change in 2.0, and not worrying about the backwards compatibility stuff (if that's OK), would be nice. Or would we still want to keep around the old language detector API? I'm hoping the answer is no.

          tallison@mitre.org Tim Allison added a comment -

           Agreed on the ease of building the new language detection framework in 2.0.

           Given Mike's comparison of Tika and langdetect here, even though it is now dated, I'd be willing to put our language detector in mothballs in 2.x (i.e. leave it in 1.x, and if we need to resurrect it we can). That said, I didn't write that code, and I know that Toke Eskildsen on TIKA-1549 has since dramatically improved our speed.

          This is certainly a large enough issue to invite feedback from the entire community. Do we want to drop our language detection code in 2.x? Or is there a good reason to keep it?

          kkrugler Ken Krugler added a comment -

          Good idea re gathering input - I just emailed the dev list.

          kkrugler Ken Krugler added a comment -

          OK, I've committed this code to a new tika-langdetect module in the 2.x branch. Next steps are to remove the old support, and then fix up everything that breaks.

          chrismattmann Chris A. Mattmann added a comment -

          This is now done, Ken's Optimaize langdetect, N-gram langdetect and Text.jl from MIT are all now integrated:

          LMC-053601:tika1.13 mattmann$ git commit -m "Resolve conflicts in CHANGES.txt"
          [master 2caf3da] Resolve conflicts in CHANGES.txt
          LMC-053601:tika1.13 mattmann$ git push -u origin master
          Counting objects: 477, done.
          Delta compression using up to 8 threads.
          Compressing objects: 100% (237/237), done.
          Writing objects: 100% (477/477), 113.91 KiB | 0 bytes/s, done.
          Total 477 (delta 134), reused 320 (delta 67)
          remote: tika git commit: Resolve conflicts in CHANGES.txt
          remote: tika git commit: Update with information about TIKA-1872, TIKA-1696 and TIKA-1723.
          remote: tika git commit: Merge branch 'TIKA-1872'
          remote: tika git commit: Merge branch 'TIKA-1872' of https://github.com/trevorlewis/tika into TIKA-1872
          remote: tika git commit: Updated TextLangDetector and fixed build errors
          remote: tika git commit: Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tika into TIKA-1872
          remote: tika git commit: Depend on 1.13-SNAPSHOT, not 2.0.
          remote: tika git commit: Merge branch 'TIKA-1872' of https://github.com/trevorlewis/tika into TIKA-1872
          remote: tika git commit: Added missing license headers
          remote: tika git commit: Add missing license headers
          remote: tika git commit: fix for TIKA-1872 contributed by trevorlewis
          remote: tika git commit: Make detector "discoverable", use that everywhere
          remote: tika git commit: Move base lang detect classes to core
          remote: tika git commit: Remove built-in lang detector
          remote: tika git commit: Add tika-langdetect dependency in other modules
          remote: tika git commit: Add project.build.sourceEncoding to properties
          remote: tika git commit: Roll in new lang detect support in new module
          remote: tika git commit: Add missing dependency on tika-test-resources
          To https://git-wip-us.apache.org/repos/asf/tika.git
             c9d508d..2caf3da  master -> master
          Branch master set up to track remote
          

          Thanks Ken Krugler and Trevor Lewis!


            People

            • Assignee:
              kkrugler Ken Krugler
            • Reporter:
              kkrugler Ken Krugler
            • Votes:
              1
            • Watchers:
              6
