Solr / SOLR-1979

Create LanguageIdentifierUpdateProcessor

    Details

      Description

      Language identification from document fields, and mapping of field names to language-specific fields based on detected language.

      Wrap the Tika LanguageIdentifier in an UpdateProcessor.

      See user documentation at http://wiki.apache.org/solr/LanguageDetection
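
      A rough sketch of the idea described above (not the committed patch): an UpdateRequestProcessor that wraps Tika's LanguageIdentifier. The "text" input field and "language_s" output field are assumptions for illustration only.

        import java.io.IOException;

        import org.apache.solr.common.SolrInputDocument;
        import org.apache.solr.update.AddUpdateCommand;
        import org.apache.solr.update.processor.UpdateRequestProcessor;
        import org.apache.tika.language.LanguageIdentifier;

        // Sketch only: detect the language of one field and store it on the document.
        public class TikaLanguageIdentifierProcessorSketch extends UpdateRequestProcessor {

          public TikaLanguageIdentifierProcessorSketch(UpdateRequestProcessor next) {
            super(next);
          }

          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object text = doc.getFieldValue("text");          // field to detect on (assumed name)
            if (text != null) {
              LanguageIdentifier id = new LanguageIdentifier(text.toString());
              if (id.isReasonablyCertain()) {
                doc.setField("language_s", id.getLanguage()); // detected code, e.g. "en"
              }
            }
            super.processAdd(cmd);                            // pass the document down the chain
          }
        }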

      1. SOLR-1979-branch_3x.patch
        54 kB
        Jan Høydahl
      2. SOLR-1979.patch
        35 kB
        Jan Høydahl
      3. SOLR-1979.patch
        51 kB
        Grant Ingersoll
      4. SOLR-1979.patch
        59 kB
        Grant Ingersoll
      5. SOLR-1979.patch
        59 kB
        Grant Ingersoll
      6. SOLR-1979.patch
        51 kB
        Jan Høydahl
      7. SOLR-1979.patch
        58 kB
        Jan Høydahl
      8. SOLR-1979.patch
        51 kB
        Jan Høydahl
      9. SOLR-1979.patch
        48 kB
        Jan Høydahl
      10. SOLR-1979.patch
        50 kB
        Jan Høydahl
      11. SOLR-1979.patch
        54 kB
        Jan Høydahl
      12. SOLR-1979.patch
        54 kB
        Jan Høydahl
      13. SOLR-1979.patch
        55 kB
        Jan Høydahl
      14. SOLR-1979.patch
        57 kB
        Jan Høydahl
      15. SOLR-1979.patch
        56 kB
        Jan Høydahl

          Activity

          Chris A. Mattmann added a comment -

          I would look at the Language Identifier in Tika (which is based on the Nutch work) as it is likely to be the one that is more maintained going forward IMHO...

          Jan Høydahl added a comment -

          I have implemented a first-shot patch using the Tika LanguageIdentifier. It is unfortunately quite limited in features, and for short text segments isReasonablyCertain() always returns false. Also, the number of languages supported is still quite low. But it works as a start, and then we can focus on improving the Tika code in future releases.

          I plan on putting the patch in contrib/extraction, since it depends on Tika. If I put it relative to main, Solr will not compile unless you put the Tika jar in lib. Agree?

          Jan Høydahl added a comment -

          First raw patch implementing language identification.

          Grant Ingersoll added a comment -

          See http://wiki.apache.org/solr/LanguageDetection for the start of documentation.

          isReasonablyCertain() always returns false

          See TIKA-568.

          Jan Høydahl added a comment -

          Simply allowing the threshold for isReasonablyCertain() to be set is probably not enough to get robust detection. This is because the distance measure is very sensitive to the length of the profiles in use. Thus, it is a bit dangerous to expose getDistance() as in TIKA-568, because that distance measure is kind of an internal value, not very normalized, and is bound to change in future versions of Tika.

          See TIKA-369 and TIKA-496.

          I think the right way to go is solving these two issues first. By fixing getDistance() so that it is not biased towards profile length, we can make a new isReasonablyCertain() implementation that takes into account the relative distance between the first and second candidate languages...

          Jan Høydahl added a comment -

          The idField input parameter is just used for decent logging if detection fails. It would be more elegant to get the id field name automatically through SolrCore...

          Robert Muir added a comment -

          cause that distance measure is kind of an internal value, not very normalized and is bound to change in future versions of TIKA.

          we can make a new isReasonablyCertain() implementation taking into account the relative distance between first and second candidate languages...

          I don't follow the logic: if it's not very normalized then it seems like this approach doesn't tell you anything... language 1 could be uncertain,
          and language 2 just completely uncertain, but that tells you nothing: isn't it like trying to determine whether a good Lucene search result score is "certainly a hit", and not really the right way to go?

          For example: consider the case where the language isn't supported at all by Tika (I don't see a list of supported languages anywhere, by the way!).
          It would be good for us to know that the detection is uncertain at all... how relatively uncertain it is with regard to the next language is not very important.

          I think it's also important that we be able to get this uncertainty (or whatever it is) in a way that is agnostic of the implementation.
          For example, we should be able to somehow think of chaining detectors...

          It's really important to "cheat" and not use heuristics for languages that don't need them.
          For example, disregarding some strange theoretical/historical cases, you can simply look at the Unicode properties
          in the document to determine that it's in the Greek language, as it's basically the only modern language using the Greek alphabet.

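          A minimal sketch of that script-based shortcut, using only the JDK's Character.UnicodeBlock; the majority threshold is an arbitrary illustrative choice, not anything from Tika or the patch:

            // Sketch: classify text as Greek purely from Unicode block membership.
            public final class GreekScriptHeuristic {

              public static boolean looksGreek(String text) {
                int greek = 0, letters = 0;
                for (int i = 0; i < text.length(); ) {
                  int cp = text.codePointAt(i);
                  if (Character.isLetter(cp)) {
                    letters++;
                    Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
                    if (block == Character.UnicodeBlock.GREEK
                        || block == Character.UnicodeBlock.GREEK_EXTENDED) {
                      greek++;
                    }
                  }
                  i += Character.charCount(cp);
                }
                return letters > 0 && greek > letters / 2;  // mostly Greek letters => call it Greek
              }
            }
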
          Grant Ingersoll added a comment -

          I took Jan's and Tommaso's patches and reworked them a bit. It seems to me that there isn't much point in merely identifying the language if you aren't going to do something about it. So, this patch builds on what Jan and Tommaso did and then will remap the input fields to new per language fields (note, we could make this optional). I also tried to standardize the input parameters a bit. I dropped the outputField setting and a number of other settings and I made the language detection to be per input field. The basic gist of it is that if you input two fields: name, subject, it will detect the language of each field and then attempt to map them to a new field. The new field is made by concatenating the original field name with "_" + the ISO 639 code. For example, if en is the detected language, then the new field for name would be name_en. If that field doesn't exist, it will fall back to the original field (i.e. name).

          Left to do:

          1. Fix the tests. I don't like how we currently test UpdateProcessorChains. It should not require writing your own little piece of update mechanism. You should be able to simply set up the appropriate configuration, hook it into an update handler and then hit that update handler.
          2. Need to check the license headers, builds, etc.
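
          A rough sketch of the remapping rule described above (append "_" plus the detected code; fall back to the original field when the schema has no such field). IndexSchema.getFieldOrNull is an existing Solr method; the rest is illustrative and not taken from the patch.

            import org.apache.solr.common.SolrInputDocument;
            import org.apache.solr.schema.IndexSchema;

            public final class FieldLanguageMapperSketch {

              // e.g. fieldName "name" + langCode "en" -> "name_en"
              public static void remap(SolrInputDocument doc, IndexSchema schema,
                                       String fieldName, String langCode) {
                String mapped = fieldName + "_" + langCode;
                if (schema.getFieldOrNull(mapped) == null) {
                  return;                                // no target field: leave the value where it is
                }
                Object value = doc.getFieldValue(fieldName);
                if (value != null) {
                  doc.removeField(fieldName);
                  doc.addField(mapped, value);
                }
              }
            }
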
          Robert Muir added a comment -

          We really need to not be using ISO 639-1 here.

          For example, it's not expressive enough to differentiate between Simplified and Traditional Chinese, yet SmartChineseAnalyzer only works on Simplified.

          I would like to see RFC 3066 instead

          Grant Ingersoll added a comment -

          I would like to see RFC 3066 instead

          Yeah, that makes sense, however, I believe Tika returns 639. (Tika doesn't recognize Chinese yet at all). One approach is we could normalize, I suppose. Another is to fix Tika. I'd really like to see Tika support more languages, too.

          Longer term, I'd like to not do the fieldName_LangCode thing at all and instead let the user supply a string that could have variable substitution if they want, something like fieldName_${langCode}, or it could be ${langCode}_fieldName, or it could just be another literal.

          Grant Ingersoll added a comment -

          Another thought here is that, over time, this class becomes a base class and it becomes easy to replace the language detection piece; that way one gets all the infrastructure of this class, but can plug in their own detection. In fact, I'm going to do that right now.

          Yonik Seeley added a comment -

          The new field is made by concatenating the original field name with "_" + the ISO 639 code.

          This could be problematic given a large set of language codes since they could collide with existing dynamic field definitions.
          Perhaps something with "text" in the name also?

          Perhaps fieldName_${langCode}Text

          Examples:
          name_enText
          name_frText

          It would probably also be nice to be able to map a number of languages to a single field.... say you have a single analyzer that can handle CJK, then you may want that whole collection of languages mapped to a single _cjk field.

          And just because you can detect a language doesn't mean you know how to handle it differently... so also have an optional catchall that handles all languages not specifically mapped.

          Robert Muir added a comment -

          Yeah, that makes sense, however, I believe Tika returns 639.

          Right, but 639 is just a subset of 3066 etc.

          So, ignore what Tika does; its 639 identifiers are also valid 3066.

          Our API should be at least 3066; Java 7/ICU already support BCP47 locale identifiers, so you get the normalization there for free.

          It would probably also be nice to be able to map a number of languages to a single field.... say you have a single analyzer that can handle CJK, then you may want that whole collection of languages mapped to a single _cjk field.

          And just because you can detect a language doesn't mean you know how to handle it differently... so also have an optional catchall that handles all languages not specifically mapped.

          Both of these are good reasons why we must avoid 639-1.
          We should be able to use things like macrolanguages and undetermined language.

          Jan Høydahl added a comment -

          @Robert: Yes, there must be a way to tell whether or not the language even has a profile, through some well defined method. It's not important HOW we improve detection certainty, but comparing the top n distances could help. I'm also a fan of including other metrics than profile similarity if that can help, however for unique scripts that will automatically be covered by profile similarity. Detailed solution discussions should continue in TIKA-369.

          Macro languages: See TIKA-493

          It makes sense to allow for detecting languages outside 639-1, and I believe RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 2-letter code for a language it will be used. 639-1 is what "everyone" already knows.

          In general, improvements should be done in Tika space, then use those in Solr, thus building one strong language detection library.

          @Grant: I actually planned to do the regEx based field name mapping in a separate UpdateProcessor, to make things more flexible. Example:

           
            <processor class="org.apache.solr.update.processor.LanguageFieldMapperUpdateProcessor">
              <str name="languageField">language</str>
              <str name="fromRegEx">(.*?)_lang</str>
              <str name="toRegEx">$1_$lang</str>
              <str name="notSupportedLanguageToRegEx">$1_t</str>
              <str name="supportedLanguages">de,en,fr,it,es,nl</str>
            </processor>
          

          Your thought of allowing language detection for individual fields in one go is also interesting. I'd love to see metadata support in SolrInputDocument, so that one processor could annotate a @language on the fields analyzed. Then the next processor could act on that metadata to rename the field...

          @Yonik: By allowing regex naming of field names, we give users a generic tool to avoid field name clashes by picking the pattern. Mapping multiple languages to the same suffix also makes sense.

          Grant Ingersoll added a comment -

          @Grant: I actually planned to do the regEx based field name mapping in a separate UpdateProcessor, to make things more flexible

          I don't really see that it makes it any more flexible. If it were a general purpose mapper, maybe, but since it is tied to the language field, why not just put it in the language processor? I've already got the method that chooses the output field as a protected method. With that, one would merely need to extend it to provide an alternate mapping to what you have proposed.

          Grant Ingersoll added a comment -

          Here's a patch that passes the tests. Note, I modified the Solr base test case to have some new methods to properly call update handlers and then validate the results.

          Grant Ingersoll added a comment -

          Note, the patch still needs more tests and needs to check headers, etc. as well as the better field mapping and the proper language support that Robert is talking about.

          Robert Muir added a comment -

          It makes sense to allow for detecting languages outside 639-1, and I believe RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 2-letter code for a language it will be used. 639-1 is what "everyone" already knows.

          In general, improvements should be done in Tika space, then use those in Solr, thus building one strong language detection library.

          Yes they do; the 639-1 codes that Tika outputs are also valid BCP47 codes.

          But in Solr, when designing up front, I was just saying we shouldn't limit any abstract portion to 639-1 when another implementation might support 3066 or BCP47... we should make sure we allow that.

          Grant Ingersoll added a comment -

          but in solr, when designing up front, i was just saying we shouldn't limit any abstract portion to 639-1 when another implementation might support 3066 or BCP47... we should make sure we allow that.

          Agreed. The only thing we are doing now is using the language that the language detector returns as part of the field name. Both of these steps are easily overridable. Both also rely on those fields existing.

          This could be problematic given a large set of language codes since they could collide with existing dynamic field definitions.

          Yonik, I wasn't planning on relying on dynamic fields necessarily. It may make sense to have users either predeclare the variations.

          All in all, I would like to see Solr have better support for languages in both the schema and the config. In my experience, in apps that have to support a lot of languages, there is a lot of redundancy in both the schema and the config.

          Robert Muir added a comment -

          Agreed.The only thing we are doing now is using the language that the language detector returns as part of the field name. Both of these steps are easily overridable. Both also rely on those fields existing.

          "Easily overridable" does not solve the problem!

          Please don't commit this; it's so easy to just change the code, variable names, and documentation here to say these interfaces are BCP47 language ids.

          We should not be using 639-1 codes in any APIs!!!!!!!

          Robert Muir added a comment -

          Both also rely on those fields existing.

          I don't think this check should be at "runtime" either.

          What if you are indexing lots of documents and suddenly you encounter a Thai document (or one mis-detected as Thai!) and the whole thing fails?

          Can't we validate the output mapping (and log it!) at initialization time?

          Yonik Seeley added a comment -

          Yonik, I wasn't planning on relying on dynamic fields necessarily. It may make sense to have users either predeclare the variations.

          Sure, but the problem was the ease with which a generated field of originalname_${langcode} could clash with existing fields (regardless of whether they are dynamic fields) due to there being many different language codes.

          If we use regex naming as Jan suggests (or another configurable mechanism) then the issue comes down to what we configure by default or by example.

          Jan Høydahl added a comment -

          @Grant: "I dropped the outputField setting and a number of other settings"

          There should be a way to output the language for the whole document to some field as some applications need to filter on language.

          I like making most things configurable, but with good defaults that fit most needs. The default could be to detect a document-wide language from all input fields and output this to a "language_s" field, unless you specify params docLangInputFields=f1,f2.. and docLangOutputField=nn. Likewise, make it easy to disable field renaming.

          Grant Ingersoll added a comment -

          There should be a way to output the language for the whole document to some field as some applications need to filter on language.

          There is. It's the langField.

          Can't we validate the output mapping (and log it!) at initialization time?

          To some extent, but users can also pass it in.

          We should not be using 639-1 codes in any APIs!!!!!!!

          I'll update the patch.

          Grant Ingersoll added a comment -

          Removes mentions of ISO 639.

          Erik Hatcher added a comment -

          In skimming the current patch, it looks like fields get mapped no matter what. What if I just want the language detected and added as another field, but no field mapping desired? (one might have decent enough analysis already on a general "title" field for example that it doesn't need to be mapped to anything else at all)

          Also, if there are multiple input fields, the current patch would create multiple language field values requiring that field to be multi-valued. Is the goal here to identify a single language for a document? Or a separate language value for each of the input fields (which seems odd to me)?

          On field name mapping, maybe we want to have a generic concept of a FieldNameMapper such that various implementations could be plugged in rather than having to subclass the update processor?

          Yonik Seeley added a comment -

          In skimming the current patch, it looks like fields get mapped no matter what. What if I just want the language detected and added as another field, but no field mapping desired?

          Yeah, that's sort of in line with my:

          And just because you can detect a language doesn't mean you know how to handle it differently... so also have an optional catchall that handles all languages not specifically mapped.

          So for all unmapped languages, you may want to map to a single generic field, or not map at all (leave field as is).
          I guess it also depends on the general strategy... if you are detecting language on the "body" field, are we using a copyField type approach and only storing the body field while indexing as body_enText, or are we moving the field from "body" to "body_enText"?

          Also, if there are multiple input fields, the current patch would create multiple language field values requiring that field to be multi-valued. Is the goal here to identify a single language for a document?

          I could see both making sense.

          Grant Ingersoll added a comment -

          So for all unmapped languages, you may want to map to a single generic field, or not map at all (leave field as is).

          It currently leaves it in the original field.

          Also, if there are multiple input fields, the current patch would create multiple language field values requiring that field to be multi-valued. Is the goal here to identify a single language for a document? Or a separate language value for each of the input fields (which seems odd to me)?

          Current patch requires multivalued language field. I figure the main thing you want the lang. field for is faceting and filtering, but it can be changed. As for the broader goal, I think it makes sense to detect languages per field and not per document. In other words, you can have multiple languages in a single document.

          Erik Hatcher added a comment -

          If a list of fields (by name) is mapped into a corresponding parallel identified language code field, do we leave it up to search clients to also know the list of field names to jive a field (say title) with its identified language?

          A language field shouldn't have to be multivalued - it just doesn't match the domain model of many search applications where there will only ever be one and only one language per document.

          Erik Hatcher added a comment -

          Oh, and don't get me wrong, I get the multivalued language per document need too, here. Anyway, it'll be easy enough to add support for this to be controlled through configuration. In single language per doc mode, basically concatenate all of the fields specified, detect on that, and map into a single-valued language field. Language-per-field I get too, of course... it just depends on the domain being modeled, and in my experience I've seen apps designed both ways. Neither way is the one true way, it just depends.

          And of course Muir is smirking and saying "heck, you have multiple languages within a field often too, so we need to account for that somehow too". But probably not here, yet.

          Jan Høydahl added a comment -

          Allow for both a "language" field and a "languages" (multivalued) field.
          If fields are mapped, the new name reflects the language, so I don't know if we need a field->lang mapping.
          However, have you considered extending the document model to allow metadata per field? Then @language would be a valid field metadata, mostly as a means for later processing to pick up and act on. This could be a valuable mechanism for other inter-processor communication, as well as for passing info between document-centric processing and analysis.

          Tommaso Teofili added a comment -

          However, have you considered extending the document model to allow metadata per field? Then @language would be a valid field metadata, mostly as a means for later processing to pick up and act on. This can be a valuable mechanism for other inter processor communication as well as to pass info between document centric processing and Analysis.

          I've also thought about this option and it sounds somewhat reasonable, but I think it would be a very large change to the API; so from one point of view I like the idea, but from another standpoint I think it could lead to a proliferation of @metadata.
          So in the end I don't have a strong opinion on that, but I also have to say that I've seen such customizations in a production environment to leverage per-field metadata.

          Regarding per-field and per-document language fields, I think a document language field could be handled with two fixed strategies/policies (which can also be extended), as sketched below:

          1. restrictive strategy - if different languages end up mapped into the document language field, then say the document language is, for example, "x-unspecified"
          2. simple strategy - map all the retrieved languages (per field) into the document language field as different values (so multivalued="true")
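
          A small sketch of the two policies, assuming the per-field detections have already been collected; the class and method names are illustrative:

            import java.util.LinkedHashSet;
            import java.util.List;
            import java.util.Set;

            public final class DocumentLanguagePolicy {

              // Restrictive: one value, or "x-unspecified" when the fields disagree.
              public static String restrictive(List<String> fieldLanguages) {
                Set<String> distinct = new LinkedHashSet<String>(fieldLanguages);
                return distinct.size() == 1 ? distinct.iterator().next() : "x-unspecified";
              }

              // Simple: keep every detected language (destined for a multiValued field).
              public static Set<String> simple(List<String> fieldLanguages) {
                return new LinkedHashSet<String>(fieldLanguages);
              }
            }
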
          Grant Ingersoll added a comment -

          I'm going to be out of pocket for the next week. If someone can put the field mapping stuff up, then I think we will have the basis for a good first pass at this, which we can then iterate on. I also think we need to get together and add a bunch more languages to Tika b/c it is pretty unacceptable to not have, at a minimum, support for the big Asian languages of CJK.

          Robert Muir added a comment -

          I also think we need to get together and add a bunch more languages to Tika b/c it is pretty unacceptable to not have, at a minimum, support for the big Asian languages of CJK.

          What languages does Tika support in its identifier? I couldn't find an actual list, only a ref to Europarl (http://www.statmt.org/europarl/); is it just those languages?

          Also, are there docs on what's necessary (legally and technically) to contribute a new profile... is just recording n-grams from Creative Commons text acceptable?

          Grant Ingersoll added a comment -

          Have a look at http://tika.apache.org/0.8/detection.html

          Really, though, you need to dig into the Tika class: LanguageIdentifier. Adding languages, AFAICT, involves building the model accordingly and then letting Tika know about it via a properties file.

          Robert Muir added a comment -

          Have a look at http://tika.apache.org/0.8/detection.html

          That page does not have a list of languages.

          Grant Ingersoll added a comment -

          Sorry, you are right. See http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties

          name.da=Danish
          name.de=German
          name.et=Estonian
          name.el=Greek
          name.en=English
          name.es=Spanish
          name.fi=Finnish
          name.fr=French
          name.hu=Hungarian
          name.is=Icelandic
          name.it=Italian
          name.nl=Dutch
          name.no=Norwegian
          name.pl=Polish
          name.pt=Portuguese
          name.ru=Russian
          name.sv=Swedish
          name.th=Thai

          Kind of random that Thai is thrown in there!

          Robert Muir added a comment -

          Kind of random that Thai is thrown in there!

          I agree; I tend to detect Thai by the characters being between U+0E00 and U+0E7F.

          Anyway, if we add more languages it would be good if one of us could document the process, because many important ones are missing.

          Jan Høydahl added a comment -

          Discussion on the process for adding language profiles to Tika should be continued in TIKA-546.

          I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491, TIKA-492.

          Robert Muir added a comment -

          I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491 TIKA-492

          Did you plan to also upgrade Tika from 639-1 for the Sami languages? The only 639-1 code I see is "se", but this seems to be appropriate only for North Sami.

          Jan Høydahl added a comment -

          >>I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491 TIKA-492
          >Did you plan to also upgrade tika from 639-1 for the Sami languages? the only 639-1 code i see is "se" but this seems to be appropriate only for North Sami.

          Exactly. That's one example which will need a wider range of codes. I was planning to use 639-2 for those that do not have a 2-letter code, but BCP47 it will be now (although the end result may be more or less the same).

          We also need to detect whether a language is part of a macrolanguage, and add both to the multivalued languages field, because it should be possible to filter on Norwegian (no) without specifying both nn and nb, and also on Sami (smi) without specifying all of the specific languages.

          Robert Muir added a comment -

          We also need to detect whether a language is part of a macro language, and add both to languages multivalue field, because it should be possible to filter on Norwegian (no) without specifying both nn and nb, and also for sami (smi) without specifying all of the specific languages.

          macrolangs: http://www.sil.org/iso639-3/iso-639-3-macrolanguages_20100128.tab
          collections: http://www.loc.gov/standards/iso639-5/iso639-5.tab.txt

          Lance Norskog added a comment -

          A use case for multi-language fields: PDFs with different languages in different columns.

          Lance Norskog added a comment -

          About Thai: there is a lot of South and East Asian language text out there written in phonetic US-ASCII, especially older pre-Unicode text. Samples of these texts from different languages have n-gram profiles just as distinct as those of the European languages.

          Erik Hatcher added a comment -

          What about leveraging payloads (we can output term|payload strings to the payload field type) for associating languages with fields?

          Grant Ingersoll added a comment -

          What about leveraging payloads (we can output term|payload strings to the payload field type) for associating languages with fields?

          Yeah, that could be used with mixed language text (or a marker token).

          Jan, do you have any updates to the patch? I'd like to move forward with the basic functionality at least, but I still think we need the field mapping stuff, or we should punt all field mapping stuff to another processor. WDYT?

          Jan Høydahl added a comment -

          Jan, do you have any updates to the patch? I'd like to move forward with the basic functionality at least, but I still think we need the field mapping stuff, or we should punt all field mapping stuff to another processor. WDYT?

          I don't have any updates.

          Keep it basic in the first version. Allow for per-document and per-field detection.

          Make field-mapping configurable and optional (default off), allowing people to chain in their own mapper downstream if they choose.

          Mixed-language per field is a different beast and should be dealt with later. It probably requires analysis changes as well if we want analyzers to pick up the language from payloads or something.

          My 2 cents

          Tommaso Teofili added a comment -

          Keep it basic in first version. Allow for per-document and per-field detection. Make field-mapping configurable and optional (default off), allowing people to chain in their own mapper downstream if they choose.

          I agree, this sounds good for a basic implementation.

          Jan Høydahl added a comment -

          Continuing on this implementing the ideas above...

          Jan Høydahl added a comment -

          New version. Example of accepted params:

           <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
             <defaults>
               <str name="langid">true</str>
               <str name="langid.fl">title,subject,text,keywords</str>
               <str name="langid.langField">language_s</str>
               <str name="langid.langsField">languages</str>
               <str name="langid.overwrite">false</str>
               <float name="langid.threshold">0.5</float>
               <str name="langid.whitelist">no,en,es,dk</str>
               <str name="langid.map">true</str>
               <str name="langid.map.fl">title,text</str>
               <bool name="langid.map.overwrite">false</bool>
               <bool name="langid.map.keepOrig">false</bool>
               <bool name="langid.map.individual">false</bool>
               <str name="langid.map.individual.fl"></str>
               <str name="langid.fallbackFields">meta_content_language,lang</str>
               <str name="langid.fallback">en</str>
             </defaults>
           </processor>
          

          The only mandatory parameter is langid.fl.
          To enable field name mapping, set langid.map=true. The processor will then map field names for all fields listed in langid.fl. If the set of fields to map differs from langid.fl, supply langid.map.fl; those fields will then be renamed with a language suffix equal to the language detected from the langid.fl fields.

          If you need the language detected separately for each field, set langid.map.individual=true. The supplied fields will then be renamed according to the language detected for each field individually. If the set of fields to detect individually differs from the already supplied langid.fl or langid.map.fl, supply langid.map.individual.fl. The fields listed in langid.map.individual.fl will then be detected individually, while the rest of the mapping fields are mapped according to the global document language.
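
          For illustration only (hypothetical field values, not taken from the patch or its tests), a minimal sketch of what the mapping does, assuming langid.fl=title,text, langid.map=true and langid.langField=language_s. An input document such as

           <doc>
             <field name="title">Bonjour le monde</field>
             <field name="text">Ceci est un document en français.</field>
           </doc>

          would, if French is detected, be indexed roughly as

           <doc>
             <field name="language_s">fr</field>
             <field name="title_fr">Bonjour le monde</field>
             <field name="text_fr">Ceci est un document en français.</field>
           </doc>

          (with langid.map.keepOrig=false, the original title and text fields are replaced by the renamed ones).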

          Jan Høydahl added a comment -

          One question regarding the JUnit test: I now use

          assertU(commit());
          

          How can I add update request params to this commit? To select another update chain from different tests, I'd like to add update params on the fly, e.g.:

          assertU(commit(), "update.chain=langid2");
          
          Jan Høydahl added a comment -

          Fixed threshold so that Tika distance 0.1 gives certainty 0.5 and distance 0.02 gives certainty 0.9. The default threshold of 0.5 now works pretty well, at least for the tests...

          New parameters:
          Field name mapping is now configurable to a user-defined pattern, so to map ABC_title to title_<lang>, you set:

          &langid.map.pattern=ABC_(.*)
          &langid.map.replace=$1_{lang}
          

          A parameter to map multiple detected language codes to the same field suffix. For example, to map Japanese, Korean and Chinese texts to a field *_cjk, do:

          langid.map.lcmap=jp:cjk zh:cjk ko:cjk

          Turn off validation of field names against the schema (useful if you want to rename or delete fields later in the update chain):

          &langid.enforceSchema=false
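
          As a sketch only (not from the patch; the field names are hypothetical and the exact parameter types, str vs. bool, are assumed), the new parameters could be combined in the processor's defaults block roughly like this:

           <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
             <defaults>
               <str name="langid.fl">title,text</str>
               <str name="langid.langField">language_s</str>
               <str name="langid.map">true</str>
               <!-- rename e.g. ABC_title to title_<lang> -->
               <str name="langid.map.pattern">ABC_(.*)</str>
               <str name="langid.map.replace">$1_{lang}</str>
               <!-- index Japanese, Korean and Chinese content into the same *_cjk fields -->
               <str name="langid.map.lcmap">jp:cjk zh:cjk ko:cjk</str>
               <!-- don't validate renamed fields against the schema -->
               <bool name="langid.enforceSchema">false</bool>
             </defaults>
           </processor>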

          Other changes
          Removed the default for langField, i.e. if langField is not specified, the detected language will not be written anywhere. A typical minimal config for only detecting the language and writing it to a field is now:

          <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
             <defaults>
               <str name="langid.fl">title,subject,text,keywords</str>
               <str name="langid.langField">language_s</str>
             </defaults>
          </processor>
          

          Also added multiple other languages to the tests.

          Jan Høydahl added a comment -

          This has been tested on a real dataset of several hundred thousand documents, including HTML, office documents and multiple other formats, and it works well.

          I'd like some more pairs of eyes on this however.

          One thing which is less than perfect is that the threshold conversion from Tika currently parses out the (internal) distance value from a String, in the absence of a getDistance() method (TIKA-568). This is a bit of a hack, but I argue it's a beneficial one, since we can now configure langid.threshold to something meaningful for our own data instead of the preset binary isReasonablyCertain(). As we also normalize to a value between 0 and 1, we abstract away the Tika implementation detail and are free to use any improved distance measures from Tika in the future, e.g. as a result of TIKA-369, or even plug in a non-Tika identifier or a hybrid solution.
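
          A minimal sketch of the kind of normalization involved (not the patch code; the actual formula may differ). The linear mapping below simply reproduces the two reference points mentioned earlier, Tika distance 0.1 giving certainty 0.5 and distance 0.02 giving certainty 0.9:

           // Sketch only: turn Tika's internal distance into a 0-1 certainty that can be
           // compared against the langid.threshold parameter. The constants are chosen to
           // match the two reference points above; the real patch may use another curve.
           public class CertaintySketch {
               static double certainty(double tikaDistance) {
                   double c = 1.0 - 5.0 * tikaDistance;
                   return Math.max(0.0, Math.min(1.0, c));
               }

               public static void main(String[] args) {
                   System.out.println(certainty(0.10)); // 0.5
                   System.out.println(certainty(0.02)); // 0.9
               }
           }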

          Jan Høydahl added a comment -

          Updated to latest trunk, simplified build file, added clean target

          Jan Høydahl added a comment -

          Moving to 3.5

          Lance Norskog added a comment -

          I'm impressed! This is a lot of work and empirical testing for a difficult problem.

          Comments:
          There are a few parameters that are true/false, but in the future you might want a third answer. It might be worth making the decision via a keyword so you can add new keywords later.

          About the multiple languages in one field problem: you can't solve everything at once. The other document analysis components like UIMA should be able to identify parts of documents, and then you use this on one part at a time. This is the point of a modular toolkit: you combine the tools to solve advanced problems.

          Jan Høydahl added a comment -

          An updated documentation of the Processor is now at http://wiki.apache.org/solr/LanguageDetection

          @Lance: What params were on your mind as candidates for keyword instead of true/false, and for what potential future reasons?

          Markus Jelsma added a comment -

          Hi Jan,

          Can we also use the mapping feature without detection? Our detection is done in a Nutch cluster, so we have already identified many millions of docs.

          Thanks

          Jan Høydahl added a comment -

          @Markus: Sure. If you put your pre-known language code in the field configured as langid.langField and use langid.overwrite=false, you will get that behavior.

          Markus Jelsma added a comment -

          Hi. This is not what I understood from reading the wiki doc. Will the update processor skip detection with these settings? It's rather costly on many docs.

          Anyway, this is great work already, thanks!

          Jan Høydahl added a comment -

          Yep, it will skip detection if the field defined in langid.langField is not empty and langid.overwrite=false
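
          A sketch of such a configuration (field names here are hypothetical, and the parameter types are assumed): documents arriving with their language already set in language_s keep it, detection is skipped, and the mapping is driven by the existing value.

           <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
             <defaults>
               <str name="langid.fl">title,text</str>
               <str name="langid.langField">language_s</str>
               <str name="langid.overwrite">false</str>
               <str name="langid.map">true</str>
             </defaults>
           </processor>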

          Jan Høydahl added a comment -

          Patch updated to fit the new directory structure; updated comments to point to the Wiki doc.

          Also optimized the regex handling, now pre-compiling patterns instead of using String.replace directly.

          Jan Høydahl added a comment -

          New patch with these improvements:

          • Now also allows config at first level, without <lst name="default">
          • Added langid to example schema (commented out), so it is really easy to demonstrate
          Jan Høydahl added a comment -

          Any changes you'd like before committing this? Lance, what config param changes did you have in mind?

          Jan Høydahl added a comment -

          Some further improvements:

          • The default fallback language, if none is set, is now "" to avoid a NullPointerException
          • All individually detected languages are now added to the "langsField" array
          • More tests
          Jan Høydahl added a comment -

          Added link to Wiki in example update chain in solrconfig

          Jan Høydahl added a comment -

          Question: Since I plan to commit this to both 3.x and 4.x, I will be adding the CHANGES entry under the 3.5 section, also on TRUNK. I know there has been some discussion around where to log changes, but as long as 4.0 is not released before 3.5, it will always be true that the feature was released in 3.5 and exists in all later revisions, no?

          Jan Høydahl added a comment -

          Fixed a java.lang.IndexOutOfBoundsException bug in resolveLanguage() when no languages are detected. Added more corner-case tests.
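
          A minimal sketch of the kind of guard this implies (not the actual patch code; the method signature and fallback handling here are assumptions):

           // Sketch only: avoid indexing into an empty detection result and fall back to
           // the configured langid.fallback language instead.
           static String resolveLanguage(java.util.List<String> detectedLanguages, String fallbackLanguage) {
               if (detectedLanguages == null || detectedLanguages.isEmpty()) {
                   return fallbackLanguage; // this path previously triggered the IndexOutOfBoundsException
               }
               return detectedLanguages.get(0);
           }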

          Jan Høydahl added a comment -

          New patch:

          • Added contrib folders to eclipse dot.classpath
          • Added javadoc entries to build.xml
          • Fixed Javadoc errors
          • Upgraded test case to use schema v1.4
          Jan Høydahl added a comment -

          Added final patches which will be committed now.

          Jan Høydahl added a comment -

          Finally committed this long-lived issue

          Mark Miller added a comment -

          Nice! Great feature to get in - thanks guys.

          T Jake Luciani added a comment -

          The build on the 3x branch is still failing because solr/contrib/langid/src/java/overview.html was only committed to trunk. This file needs to be added to branch_3x as well.

          Jan Høydahl added a comment -

          Fixed overview.html in branch

          Uwe Schindler added a comment -

          Bulk close after 3.5 is released


            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Jan Høydahl
            • Votes:
              8
              Watchers:
              12
