Lucene - Core / LUCENE-5468

Hunspell very high memory use when loading dictionary

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 4.8, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The Hunspell stemmer requires gigantic (for the task) amounts of memory to load dictionary/rules files.
      For example, loading a 4.5 MB Polish dictionary (with an empty index!) will cause the whole core to crash with various out-of-memory errors unless you set the max heap size close to 2 GB or more.
      By comparison, Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).

      Sample error log entries:
      http://pastebin.com/fSrdd5W1
      http://pastebin.com/Lmi0re7Z
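      For reference, a minimal sketch of the load path involved. This is not taken from the report: the class names and constructor shapes are assumptions based on the 4.8-era org.apache.lucene.analysis.hunspell API discussed in the comments below (earlier releases used HunspellDictionary with a Version argument), and the file names are placeholders. The point is that essentially all of the memory is spent while the .aff/.dic pair is parsed into the in-memory Dictionary; the filter itself only consults it per token.

        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.hunspell.Dictionary;
        import org.apache.lucene.analysis.hunspell.HunspellStemFilter;

        class HunspellLoadSketch {
          static TokenStream polishStemmer(TokenStream input) throws Exception {
            try (InputStream affix = Files.newInputStream(Paths.get("pl_PL.aff"));
                 InputStream words = Files.newInputStream(Paths.get("pl_PL.dic"))) {
              // Parsing the affix rules and word list happens here; this is where
              // the memory described in this report is consumed.
              Dictionary dictionary = new Dictionary(affix, words);
              return new HunspellStemFilter(input, dictionary);
            }
          }
        }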

      1. patch.txt
        15 kB
        Robert Muir
      2. LUCENE-5468.patch
        239 kB
        Robert Muir

        Activity

        Torsten Krah added a comment -

        Just for interest - are multiple dictionaries still supported with this change (after reading all the comments it's not clear whether it was dropped or not)?
        This option is nice to have, because you can make local modifications and can update the main dictionary from upstream (LibreOffice etc.) without the need for merging or anything like that.
        If not - is there already a ticket to get this working again?

        Lukas Vlcek added a comment -

        Good points, I did not know that about the German dictionary. From this perspective my suggestion sounds really hack-ish and should be left out.

        [...] even a 6-year-old kid can use it.

        I am always amazed about what kids these days can achieve...

        Thanks for your time Robert!

        Robert Muir added a comment -

        I fully understand YPOW. The question of responsibility is important. But if I consider that a workaround like lowercasing for an optional second pass could be easier than telling the user to set up a complicated analysis chain (or employ an external system), then I believe it might make sense to make a qualified exception.

        This responsibility is really important though. Maybe you should break away from the Czech dictionary and look at the others before you decide that it's "easiest" here. For example, the German dictionary has lots of complex casing rules encoded in the affix file itself for decompounding purposes. This feature is already plenty complicated. If you can do ANYTHING, and I mean ANYTHING, outside of it in any way, we should keep it out of here.

        As you pointed out, there are CL tools for this but I simply did not want to learn them (I did not feel like a wizard). And the good question is whether Lucene should be able to provide an API that could be used for this task. At the end of the day, Lucene is said to be an IR library and has language analysis capabilities, so why not? But I am fine with leaving this feature out for now. Just wanted to explain some of my motivations for this feature.

        Because it's an IR library, not a tool for building lexical resources. We just don't have the resources to "compete" with that, we don't have people that need it, and why waste our time when there are perfectly good tools available? I don't know why you refuse to "learn" the hunspell tools, they are trivial to learn!

        Besides the commandline tools, quick searches reveal GUI tools too, such as http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html. Quote from the page: "My tool is so intuitive that even a 6-year-old kid can use it."

        I don't think such work should be duplicated inside the Apache Lucene project.

        Lukas Vlcek added a comment - edited

        Hi Robert,

        I created a new ticket LUCENE-5484 for distinct recursion levels per pre/suffix rules.

        There may not be, but it's about where the responsibility should be. It's more than the first token in sentences: named entities etc. are involved too. If you want to get this right, yes, you need a more sophisticated analysis chain! That being said, I'm not against your 80/20 heuristic, I'm just not sure how 80/20 it is.

        I fully understand YPOW. The question of responsibility is important. But if I consider that a workaround like lowercasing for an optional second pass could be easier than telling the user to set up a complicated analysis chain (or employ an external system), then I believe it might make sense to make a qualified exception. Heh...

        But seriously. How about I open a ticket for this to float the idea around? WDYT?

        I would like to try to implement it as well (if no one else does), though I will not get to it soon. As for the 80/20 aspect, the good thing about this feature is that it could be measured (precision, recall, ...). And maybe only implementing this feature could tell us whether it is useful or not.

        [...] if you are smart enough to make a custom dictionary, I don't think I need to baby such users around and make them comfortable by duplicating command line tools they can install themselves in java [...]

        Short: I agree

        Long: Creating a new dictionary is very hard. It is for wizards... but the thing here, Robert, is that creating a new dictionary from scratch is something completely different from extending an existing dictionary. At least average users (like me) can probably hardly do the former but can relatively easily do the latter. The former involves creating the affix rules; the latter means using the given affix rules and building on top of them.

        When I was trying to extend an existing dictionary, I in fact had to do the following:
        1) identify words that were missing in the dict file (or files)
        2) assign some of the existing rules to each of them
        3) verify #2 was done right

        As for 1), that is easy (the only trick when creating a new file with missing words is to stick to the encoding defined in the .aff file).
        As for 2), that is harder, but in my case I was building on top of a relatively large dictionary, so I could bet on the fact that the language's morphology had already been covered well in the affix rules (so I assumed I was not introducing words with new/unique morphology to the dictionary). So in fact, instead of trying to understand the rules (see my note about this below), I searched for words that should have similar morphological features and used their rules (for example, if I were to add the word "fail" I would search for "sail" and use the same rules).
        As for 3), this in fact means expanding a token in root form according to all possible valid rules and checking that it all makes sense. As you pointed out, there are CL tools for this but I simply did not want to learn them (I did not feel like a wizard). And the good question is whether Lucene should be able to provide an API that could be used for this task. At the end of the day, Lucene is said to be an IR library and has language analysis capabilities, so why not? But I am fine with leaving this feature out for now. Just wanted to explain some of my motivations for this feature.
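        To make step 2) concrete, a small hypothetical sketch of writing the supplemental .dic file described above. The flag string "JS", the word "fail" and the file name custom.dic are made up for illustration; the idea is simply to copy the affix flags from a morphologically similar entry in the main dictionary (e.g. a line such as "sail/JS") and to write the file in the encoding declared by the .aff file.

          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.util.Arrays;
          import java.util.List;

          class WriteCustomDic {
            public static void main(String[] args) throws Exception {
              // Flags copied from a similar existing entry such as "sail/JS" (hypothetical).
              List<String> entries = Arrays.asList("fail/JS");
              // The first line of a .dic file is the (approximate) entry count.
              String content = entries.size() + "\n" + String.join("\n", entries) + "\n";
              // Must match the encoding declared in the .aff file; UTF-8 is only assumed here.
              Files.write(Paths.get("custom.dic"), content.getBytes(StandardCharsets.UTF_8));
            }
          }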


        Note:
        As for understanding the affix rules - this is probably a complex topic and I have not had time to dig deep enough to say anything qualified about it yet. However, as far as I understand, the various *spell systems have various limitations. For example, in the case of the Czech dictionary, it is an ispell dictionary, which allowed only a limited number of affix rules (that is what I understood from a conversation with an author of the Czech dictionary). Which means that if the number of rules is limited, then what we see shipped in the .aff file is more the result of some preprocessing that takes a set of rules understandable to humans and produces a more compact set that might not be easily understood by humans.

        But this is an unrelated topic, except that it illustrates the situation of an average user who just wants to add some new words to an existing dictionary and does not have the capacity to become an expert on ispell (or myspell, or aspell, ... you name it).

        Robert Muir added a comment -

        Lowercasing in Hunspell - Robert, when you think about it, there is really no simple solution to this using the existing Lucene analysis flow AFAIK. If you apply lowercase BEFORE Hunspell, you lose the option to correctly stem the uppercased token (if there is any record for it in the dictionary). If you apply it after Hunspell, you have the problem with the first token in sentences (in most cases). The other option is (as you mentioned) to employ some more sophisticated analysis chain (but is there anything suitable in Lucene out of the box, or do I have to go down the road of setting up a complex language library or framework?)
        So the option to allow lowercasing for a second pass is IMO a nice compromise that can help a lot with really minimal effort (and it is also easy to explain to users what it does and when to use it). It is not a perfect solution but may be good enough in the 80/20 sense.

        There may not be, but it's about where the responsibility should be. It's more than the first token in sentences: named entities etc. are involved too. If you want to get this right, yes, you need a more sophisticated analysis chain! That being said, I'm not against your 80/20 heuristic, I'm just not sure how 80/20 it is.

        Getting all inflections - yes, there are CL tools for this. But this is really more about user-experience comfort, and again, it is easy to explain how to use it and what it does, and users do not have to mess with CL tools and things like that. Not sure how hard it would be to implement this with what is in Hunspell now.
        Also, one thing is a CL tool used against some dictionary files; another thing is using Lucene code on the dictionary loaded into memory by Lucene. If there are issues in the code, these two approaches can give different results (yes, they should be the same...)

        On this one I honestly do disagree. I don't mean to sound rude, but if you are smart enough to make a custom dictionary, I don't think I need to baby such users around and make them comfortable by duplicating command-line tools they can install themselves in Java. The tools provided by hunspell are the best here, and if someone is making a custom dictionary they already need to be digging into these tools/docs to know what they are doing. I don't see the value in duplicating this stuff and providing morphological generation and other super-advanced esoteric stuff, when there are more basic things needed (like decomposition). As far as things differing, those are bugs that should be fixed...

        Lukas Vlcek added a comment -

        OK, I will open a new ticket for the recursionCap tomorrow (it is late on my end now).

        Just a couple of quick comments on my two other suggestions:

        Lowercasing in Hunspell - Robert, when you think about it, there is really no simple solution to this using the existing Lucene analysis flow AFAIK. If you apply lowercase BEFORE Hunspell, you lose the option to correctly stem the uppercased token (if there is any record for it in the dictionary). If you apply it after Hunspell, you have the problem with the first token in sentences (in most cases). The other option is (as you mentioned) to employ some more sophisticated analysis chain (but is there anything suitable in Lucene out of the box, or do I have to go down the road of setting up a complex language library or framework?)
        So the option to allow lowercasing for a second pass is IMO a nice compromise that can help a lot with really minimal effort (and it is also easy to explain to users what it does and when to use it). It is not a perfect solution but may be good enough in the 80/20 sense.

        Getting all inflections - yes, there are CL tools for this. But this is really more about user-experience comfort, and again, it is easy to explain how to use it and what it does, and users do not have to mess with CL tools and things like that. Not sure how hard it would be to implement this with what is in Hunspell now.
        Also, one thing is a CL tool used against some dictionary files; another thing is using Lucene code on the dictionary loaded into memory by Lucene. If there are issues in the code, these two approaches can give different results (yes, they should be the same...)

        Chris Male added a comment -

        Yeah I guess. We can go over that in a new issue.

        Robert Muir added a comment -

        Chris, but Lukas has a real use case and it's probably like 5 total lines of code to split that out? I dunno, it seems fine to me.

        Chris Male added a comment -

        I don't think we should make the recursionCap any more complex. I put it in there simply to prevent languages from getting into infinite loops.

        Robert Muir added a comment -

        OK, I was a little confused. I thought perhaps you were referring to the previous discussion above about removing things.

        I just want to make it clear I kept all the additional options we already had!

        So what I am proposing is having an option to set recursionCap separately for prefix and suffix. In the case of the Czech dict I would say: you can apply only one prefix rule and only one suffix rule (meaning you can NEVER apply two prefix rules or two affix rules).

        +1, can you open an issue for this?

        As for ignoreCase - how does it work if the dictionary contains terms like "Xx" and "xx" and each is allowed to use a different set of rules? I need to distinguish between them.

        Right, that's why it does nothing by default.

        But on the other hand, if the dictionary contains only "yy" but I get "Yy" as input (because it was the first word of the sentence), would it be able to process it correctly and still distinguish between "Xx" and "xx"?

        In my opinion, this is not the responsibility of this filter (it simply has ignoreCase on or off). This has more to do with your analysis chain? So if you want to put a lowercase filter first always, that's one approach. If you want to use some rule/heuristic for sentence tokenization or other fancy stuff, you can selectively lowercase and get what you want. But this filter knows nothing about that.

        I think it would not be hard to expose such an API and I believe users would appreciate this when constructing custom dictionaries (I tried that and I was missing such a feature; sure, I can implement it myself, but I believe having it in Solr and Elasticsearch would be great; definitely this is not useful for the indexing process, but as part of tuning your dictionary this would be helpful).

        Why not just use the hunspell command-line tools like 'unmunch', 'analyze', etc. for that?
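        A minimal sketch of the "lowercase filter first always" chain mentioned above, in case it helps illustrate the approach. It assumes a Dictionary has already been built, and uses 5.x-style (Version-less) constructors; exact class locations and signatures differ between releases, so treat this as an assumption rather than the project's recommended setup.

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.Tokenizer;
          import org.apache.lucene.analysis.core.LowerCaseFilter;
          import org.apache.lucene.analysis.hunspell.Dictionary;
          import org.apache.lucene.analysis.hunspell.HunspellStemFilter;
          import org.apache.lucene.analysis.standard.StandardTokenizer;

          // assumes: Dictionary dictionary = ... (loaded from .aff/.dic as elsewhere in this issue)
          Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
              Tokenizer tokenizer = new StandardTokenizer();
              TokenStream chain = new LowerCaseFilter(tokenizer); // fold case before stemming
              chain = new HunspellStemFilter(chain, dictionary);
              return new TokenStreamComponents(tokenizer, chain);
            }
          };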

        Lukas Vlcek added a comment -

        Robert,

        I did not check the latest code so please forgive my ignorance but let me try to explain:

        recursionCap does not distinguish between how many prefix and suffix rules were applied, does it? It just counts the total. If I set recursionCap to 1, it actually includes all the following options:

        • 2 prefix rules, 0 suffix rules
        • 1 prefix rule, 1 suffix rule
        • 0 prefix rules, 2 suffix rules

        This may not play well with some affix rule dictionaries. For example, the Czech dictionary is constructed in such a way that only one suffix rule can be applied, otherwise the filter can generate irrelevant tokens. So the recursionCap MUST be set to 0.
        However, if this recursion level is consumed on removal of a prefix, then it cannot continue and also manipulate the suffix. So what I am proposing is having an option to set recursionCap separately for prefix and suffix. In the case of the Czech dict I would say: you can apply only one prefix rule and only one suffix rule (meaning you can NEVER apply two prefix rules or two affix rules).

        As for ignoreCase - how does it work if the dictionary contains terms like "Xx" and "xx" and each is allowed to use a different set of rules? I need to distinguish between them. But on the other hand, if the dictionary contains only "yy" but I get "Yy" as input (because it was the first word of the sentence), would it be able to process it correctly and still distinguish between "Xx" and "xx"?

        As for the last feature, I probably confused you. What I am looking for is not the output of all possible root words for a given term but all possible inflections for a given (root) word. For example: the input is "tell" and, based on the loaded dictionary, the output would be ["tell","tells","telling", ...]. I think it would not be hard to expose such an API and I believe users would appreciate this when constructing custom dictionaries (I tried that and I was missing such a feature; sure, I can implement it myself, but I believe having it in Solr and Elasticsearch would be great; definitely this is not useful for the indexing process, but as part of tuning your dictionary this would be helpful).

        Robert Muir added a comment -

        All 3 of these options are still supported by both the filter/dictionary and the factory. Look at 'recursionCap', 'ignoreCase', and dictionaries being a List<InputStream>. And by default it outputs all terms (unless you supply longestMatch=true). So I'm not really sure what is needed here?
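        As an illustration, a sketch of those options in code. The constructor and parameter shapes below are inferred from this comment ('ignoreCase', dictionaries as a List<InputStream>, longest-match output) and may not match a given release exactly; affix, mainDic, customDic and input are placeholder streams, not names from the patch.

          // One affix file, several .dic streams (e.g. the upstream dictionary plus local additions).
          List<InputStream> dics = Arrays.asList(mainDic, customDic);
          Dictionary dictionary = new Dictionary(affix, dics, /* ignoreCase = */ false);
          // dedup repeated stems, and emit only the longest stem instead of every candidate.
          TokenStream stems = new HunspellStemFilter(input, dictionary, /* dedup = */ true, /* longestOnly = */ true);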

        Lukas Vlcek added a comment -

        Amazing improvement!

        While we are on Hunspell, I would like to make a proposal for additional enhancements, but first I would like to ask whether you would be interested in seeing such improvements in the code. If so, I would be happy to open a new ticket for this.

        1) AFAIR the Hunspell token filter has an option to set up the level of recursion, originally hardcoded to 2 if I am not mistaken. But the level of recursion counts both prefix and suffix rules - meaning if it is set to 2 and 1 prefix rule is applied, then we can only apply 2-1 suffix rules. What I would like to propose is adding an option to explicitly specify the recursion level for prefix rules and for suffix rules separately. This probably depends a lot on how the affix rules are constructed, but I can clearly see this would help in the case of the Czech dictionary - hopefully it might be found useful for other languages too.

        2) Case sensitivity is a tricky part. The Czech dictionary is case sensitive and can deliver very nice results, but users cannot always fully benefit from this. The biggest problem I remember is tokens at the beginning of sentences: they start with capitals and thus may not be found in a dict where only the lowercased variation is recorded.
        I was thinking that one useful solution to this issue could be adding an option to lowercase a given token if it hasn't been found in the dict and making a second pass through the filter with the lowercased token (it is costly, but it would be optional, so the user is the one to decide whether this is worth the indexing time).

        3) Also, it would be really useful if the Hunspell token filter provided an option to output all terms that result from applying the relevant rules to an input token (so in essence quite the opposite transformation to what is used during stemming). Such functionality would be useful if users want to add a custom extension to an existing dictionary (having an option to load several dict files is really useful IMO) and want to check that they constructed valid rules for specific words. Having Lucene directly support this via an exposed API would be great I think (especially when thinking about later applications in Solr and Elasticsearch).

        ASF subversion and git services added a comment -

        Commit 1572774 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1572774 ]

        LUCENE-5468: reduce RAM usage of hunspell

        ASF subversion and git services added a comment -

        Commit 1572754 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1572754 ]

        LUCENE-5468: reduce RAM usage of hunspell

        Michael McCandless added a comment -

        Definitely +1 to commit this and worry about speedups separately!

        Robert Muir added a comment -

        I'm not sure it matters here, but do you handle the FST Builder returning
        null for the built FST (when there was nothing added)? Just a common
        gotchya...

        Do you have any sense of how the lookup speed changed?

        Many dictionaries have either no prefixes or no suffixes: the code comment below this also answers your other question about NULL FST I think.
        Admittedly it's probably no faster now, but it can be faster if we make the Stemmer smarter when walking the possibilities in the word:

          // TODO: this is pretty stupid, considering how the stemming algorithm works
          // we can speed it up to be significantly faster!
          IntsRef lookupAffix(FST<IntsRef> fst, char word[], int offset, int length) {
            if (fst == null) {
              return null;
            }
        

        Given the fact that this thing sometimes takes 100 MB per field and makes it nearly unusable, I made such larger changes a TODO for a separate issue?

        Michael McCandless added a comment -

        These are incredible reductions on RAM usage from cutting over to
        FSTs. And it's nice that you are using IntSequenceOutputs, and
        that you are now able to load dictionaries that failed before!

        I'm not sure it matters here, but do you handle the FST Builder returning
        null for the built FST (when there was nothing added)? Just a common
        gotchya...

        Do you have any sense of how the lookup speed changed?

        Chris Male added a comment -

        Awesome, sounds like a great addition then.

        Robert Muir added a comment -

        No, but when testing relevance, outputting all the stems leads to slower indexing and a much larger index, and significantly impacts precision for some languages.

        So after reading CLEF experiments done with Hungarian (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.8036&rep=rep1&type=pdf), where they suggest a simple disambiguation heuristic (shortest stem for most aggressive), I experimented with the opposite, and found it was quite useful.

        Chris Male added a comment -

        Is the longestOnly option a standard Hunspell thing? (more a question of general interest)

        Robert Muir added a comment -

        I think the change is ready. There are other improvements that can be done (for example, maybe an option for the factory to cache these things in case you use the same ones across multiple fields, more efficient affix handling against the FST, and so on), but those would be better as different issues I think?

        Here is a patch (from diff-sources); sorry it's not so useful, as I renamed some things. I tried making one from svn diff after reintegration, but it was equally useless. If you want, you can also review my commits to the branch on this issue.

        Here is the CHANGES entry:

        API Changes:

        • LUCENE-5468: Move offline Sort (from suggest module) to OfflineSort. (Robert Muir)

        Optimizations:

        • LUCENE-5468: HunspellStemFilter uses 10 to 100x less RAM. It also loads
          all known openoffice dictionaries without error, and supports an additional
          longestOnly option for a less aggressive approach. (Robert Muir)
        ASF subversion and git services added a comment -

        Commit 1572727 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1572727 ]

        LUCENE-5468: add additional change

        ASF subversion and git services added a comment -

        Commit 1572724 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1572724 ]

        LUCENE-5468: fix precommit+test

        ASF subversion and git services added a comment -

        Commit 1572718 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1572718 ]

        LUCENE-5468: hunspell2 -> hunspell (with previous options and tests)

        Robert Muir added a comment -

        I have the previous options added back locally too, so I will fix up tests and so on and just copy over the old filter and make a patch.

        Chris Male added a comment -

        Those are some pretty amazing reductions, well done!

        Robert Muir added a comment -

        I am finished compressing for now. I think it's pretty reasonable across all the languages.

        I will clean up and try to add back the multiple dictionary/ignore-case stuff and tidy some other things.

        dict old RAM new RAM
        af_ZA.zip 18 MB 917.1 KB
        ak_GH.zip 1.5 MB 103.2 KB
        bg_BG.zip FAIL 465.7 KB
        ca_ANY.zip 28.9 MB 675.4 KB
        ca_ES.zip 15.1 MB 639.8 KB
        cop_EG.zip 2.1 MB 144.5 KB
        cs_CZ.zip 50.4 MB 1.5 MB
        cy_GB.zip FAIL 627.4 KB
        da_DK.zip FAIL 669.8 KB
        de_AT.zip 1.3 MB 123.9 KB
        de_CH.zip 12.6 MB 725.4 KB
        de_DE.zip 12.6 MB 726 KB
        de_DE_comb.zip 102.2 MB 4.2 MB
        de_DE_frami.zip 20.9 MB 1023.5 KB
        de_DE_neu.zip 101.5 MB 4.2 MB
        el_GR.zip 74.3 MB 1 MB
        en_AU.zip 8.1 MB 521 KB
        en_CA.zip 9.8 MB 450.5 KB
        en_GB-oed.zip 8.2 MB 526.6 KB
        en_GB.zip 8.3 MB 527.3 KB
        en_NZ.zip 8.4 MB 532.4 KB
        eo.zip 4.9 MB 310.5 KB
        eo_EO.zip 4.9 MB 310.5 KB
        es_AR.zip 14.8 MB 734.9 KB
        es_BO.zip 14.8 MB 735 KB
        es_CL.zip 14.7 MB 734.9 KB
        es_CO.zip 14.3 MB 722.1 KB
        es_CR.zip 14.8 MB 733.9 KB
        es_CU.zip 14.7 MB 732.8 KB
        es_DO.zip 14.7 MB 731.9 KB
        es_EC.zip 14.8 MB 733.5 KB
        es_ES.zip 15.1 MB 743 KB
        es_GT.zip 14.8 MB 734.5 KB
        es_HN.zip 14.8 MB 735.2 KB
        es_MX.zip 14.3 MB 723.8 KB
        es_NEW.zip 15.5 MB 768.5 KB
        es_NI.zip 14.8 MB 734.5 KB
        es_PA.zip 14.8 MB 733.8 KB
        es_PE.zip 14.2 MB 721.3 KB
        es_PR.zip 14.7 MB 732.4 KB
        es_PY.zip 14.8 MB 734.1 KB
        es_SV.zip 14.8 MB 733.6 KB
        es_UY.zip 14.8 MB 736.9 KB
        es_VE.zip 14.3 MB 722.7 KB
        et_EE.zip 53.6 MB 473.6 KB
        fo_FO.zip 18.6 MB 517.9 KB
        fr_FR-1990_1-3-2.zip 14 MB 526.7 KB
        fr_FR-classique_1-3-2.zip 14 MB 539.2 KB
        fr_FR_1-3-2.zip 14.5 MB 550.4 KB
        fy_NL.zip 4.2 MB 265.6 KB
        ga_IE.zip 14 MB 460.6 KB
        gd_GB.zip 2.7 MB 143.1 KB
        gl_ES.zip FAIL 479.4 KB
        gsc_FR.zip FAIL 1.3 MB
        gu_IN.zip 20.3 MB 947 KB
        he_IL.zip 53.3 MB 539.2 KB
        hi_IN.zip 2.7 MB 169 KB
        hil_PH.zip 3.4 MB 197 KB
        hr_HR.zip 29.7 MB 573 KB
        hu_HU.zip FAIL 1.2 MB
        hu_HU_comb.zip FAIL 5.4 MB
        ia.zip 4.9 MB 222.9 KB
        id_ID.zip 3.9 MB 226.3 KB
        it_IT.zip 15.3 MB 612.9 KB
        ku_TR.zip 1.6 MB 118.7 KB
        la.zip 5.1 MB 199.3 KB
        lt_LT.zip 15 MB 682.5 KB
        lv_LV.zip 36.3 MB 763.9 KB
        mg_MG.zip 2.9 MB 163.8 KB
        mi_NZ.zip FAIL 191.4 KB
        mk_MK.zip FAIL 469.1 KB
        mos_BF.zip 13.3 MB 242.2 KB
        mr_IN.zip FAIL 147.7 KB
        ms_MY.zip 4.1 MB 226.9 KB
        nb_NO.zip 22.9 MB 1.2 MB
        ne_NP.zip 5.5 MB 328.1 KB
        nl_NL.zip 22.9 MB 1.1 MB
        nl_med.zip 1.2 MB 92.3 KB
        nn_NO.zip 16.5 MB 914 KB
        nr_ZA.zip 3.1 MB 203.3 KB
        ns_ZA.zip 1.7 MB 118 KB
        ny_MW.zip FAIL 101.8 KB
        oc_FR.zip 9.1 MB 401.5 KB
        pl_PL.zip 43.9 MB 1.7 MB
        pt_BR.zip FAIL 2.1 MB
        pt_PT.zip 5.8 MB 379.4 KB
        ro_RO.zip 5.1 MB 256.3 KB
        ru_RU.zip 21.7 MB 882 KB
        ru_RU_ye.zip 43.7 MB 1.5 MB
        ru_RU_yo.zip 21.7 MB 897.3 KB
        rw_RW.zip 1.6 MB 102.3 KB
        sk_SK.zip 25.1 MB 1.2 MB
        sl_SI.zip 38.3 MB 604 KB
        sq_AL.zip 28.9 MB 581.7 KB
        ss_ZA.zip 3.1 MB 208.5 KB
        st_ZA.zip 1.7 MB 118.7 KB
        sv_SE.zip 9.5 MB 535.4 KB
        sw_KE.zip 6.3 MB 318.2 KB
        tet_ID.zip 2 MB 124.5 KB
        th_TH.zip FAIL 409.6 KB
        tl_PH.zip 2.6 MB 148.7 KB
        tn_ZA.zip 1.5 MB 93.7 KB
        ts_ZA.zip 1.6 MB 113.1 KB
        uk_UA.zip 17.6 MB 979.1 KB
        ve_ZA.zip FAIL 140.9 KB
        vi_VN.zip 1.7 MB 85.8 KB
        xh_ZA.zip 3 MB 191.1 KB
        zu_ZA.zip 24.5 MB 827.1 KB
        sq_AL.zip 28.9 MB 581.7 KB
        ss_ZA.zip 3.1 MB 208.5 KB
        st_ZA.zip 1.7 MB 118.7 KB
        sv_SE.zip 9.5 MB 535.4 KB
        sw_KE.zip 6.3 MB 318.2 KB
        tet_ID.zip 2 MB 124.5 KB
        th_TH.zip FAIL 409.6 KB
        tl_PH.zip 2.6 MB 148.7 KB
        tn_ZA.zip 1.5 MB 93.7 KB
        ts_ZA.zip 1.6 MB 113.1 KB
        uk_UA.zip 17.6 MB 979.1 KB
        ve_ZA.zip FAIL 140.9 KB
        vi_VN.zip 1.7 MB 85.8 KB
        xh_ZA.zip 3 MB 191.1 KB
        zu_ZA.zip 24.5 MB 827.1 KB
        Robert Muir added a comment -

        I am finished compressing for now. I think it's pretty reasonable across all the languages. I will clean up and try to add back the multiple dictionary/ignore case stuff and clean up some other things.

        dict old RAM new RAM
        af_ZA.zip 18 MB 917.1 KB
        ak_GH.zip 1.5 MB 103.2 KB
        bg_BG.zip FAIL 465.7 KB
        ca_ANY.zip 28.9 MB 675.4 KB
        ca_ES.zip 15.1 MB 639.8 KB
        cop_EG.zip 2.1 MB 144.5 KB
        cs_CZ.zip 50.4 MB 1.5 MB
        cy_GB.zip FAIL 627.4 KB
        da_DK.zip FAIL 669.8 KB
        de_AT.zip 1.3 MB 123.9 KB
        de_CH.zip 12.6 MB 725.4 KB
        de_DE.zip 12.6 MB 726 KB
        de_DE_comb.zip 102.2 MB 4.2 MB
        de_DE_frami.zip 20.9 MB 1023.5 KB
        de_DE_neu.zip 101.5 MB 4.2 MB
        el_GR.zip 74.3 MB 1 MB
        en_AU.zip 8.1 MB 521 KB
        en_CA.zip 9.8 MB 450.5 KB
        en_GB-oed.zip 8.2 MB 526.6 KB
        en_GB.zip 8.3 MB 527.3 KB
        en_NZ.zip 8.4 MB 532.4 KB
        eo.zip 4.9 MB 310.5 KB
        eo_EO.zip 4.9 MB 310.5 KB
        es_AR.zip 14.8 MB 734.9 KB
        es_BO.zip 14.8 MB 735 KB
        es_CL.zip 14.7 MB 734.9 KB
        es_CO.zip 14.3 MB 722.1 KB
        es_CR.zip 14.8 MB 733.9 KB
        es_CU.zip 14.7 MB 732.8 KB
        es_DO.zip 14.7 MB 731.9 KB
        es_EC.zip 14.8 MB 733.5 KB
        es_ES.zip 15.1 MB 743 KB
        es_GT.zip 14.8 MB 734.5 KB
        es_HN.zip 14.8 MB 735.2 KB
        es_MX.zip 14.3 MB 723.8 KB
        es_NEW.zip 15.5 MB 768.5 KB
        es_NI.zip 14.8 MB 734.5 KB
        es_PA.zip 14.8 MB 733.8 KB
        es_PE.zip 14.2 MB 721.3 KB
        es_PR.zip 14.7 MB 732.4 KB
        es_PY.zip 14.8 MB 734.1 KB
        es_SV.zip 14.8 MB 733.6 KB
        es_UY.zip 14.8 MB 736.9 KB
        es_VE.zip 14.3 MB 722.7 KB
        et_EE.zip 53.6 MB 473.6 KB
        fo_FO.zip 18.6 MB 517.9 KB
        fr_FR-1990_1-3-2.zip 14 MB 526.7 KB
        fr_FR-classique_1-3-2.zip 14 MB 539.2 KB
        fr_FR_1-3-2.zip 14.5 MB 550.4 KB
        fy_NL.zip 4.2 MB 265.6 KB
        ga_IE.zip 14 MB 460.6 KB
        gd_GB.zip 2.7 MB 143.1 KB
        gl_ES.zip FAIL 479.4 KB
        gsc_FR.zip FAIL 1.3 MB
        gu_IN.zip 20.3 MB 947 KB
        he_IL.zip 53.3 MB 539.2 KB
        hi_IN.zip 2.7 MB 169 KB
        hil_PH.zip 3.4 MB 197 KB
        hr_HR.zip 29.7 MB 573 KB
        hu_HU.zip FAIL 1.2 MB
        hu_HU_comb.zip FAIL 5.4 MB
        ia.zip 4.9 MB 222.9 KB
        id_ID.zip 3.9 MB 226.3 KB
        it_IT.zip 15.3 MB 612.9 KB
        ku_TR.zip 1.6 MB 118.7 KB
        la.zip 5.1 MB 199.3 KB
        lt_LT.zip 15 MB 682.5 KB
        lv_LV.zip 36.3 MB 763.9 KB
        mg_MG.zip 2.9 MB 163.8 KB
        mi_NZ.zip FAIL 191.4 KB
        mk_MK.zip FAIL 469.1 KB
        mos_BF.zip 13.3 MB 242.2 KB
        mr_IN.zip FAIL 147.7 KB
        ms_MY.zip 4.1 MB 226.9 KB
        nb_NO.zip 22.9 MB 1.2 MB
        ne_NP.zip 5.5 MB 328.1 KB
        nl_NL.zip 22.9 MB 1.1 MB
        nl_med.zip 1.2 MB 92.3 KB
        nn_NO.zip 16.5 MB 914 KB
        nr_ZA.zip 3.1 MB 203.3 KB
        ns_ZA.zip 1.7 MB 118 KB
        ny_MW.zip FAIL 101.8 KB
        oc_FR.zip 9.1 MB 401.5 KB
        pl_PL.zip 43.9 MB 1.7 MB
        pt_BR.zip FAIL 2.1 MB
        pt_PT.zip 5.8 MB 379.4 KB
        ro_RO.zip 5.1 MB 256.3 KB
        ru_RU.zip 21.7 MB 882 KB
        ru_RU_ye.zip 43.7 MB 1.5 MB
        ru_RU_yo.zip 21.7 MB 897.3 KB
        rw_RW.zip 1.6 MB 102.3 KB
        sk_SK.zip 25.1 MB 1.2 MB
        sl_SI.zip 38.3 MB 604 KB
        sq_AL.zip 28.9 MB 581.7 KB
        ss_ZA.zip 3.1 MB 208.5 KB
        st_ZA.zip 1.7 MB 118.7 KB
        sv_SE.zip 9.5 MB 535.4 KB
        sw_KE.zip 6.3 MB 318.2 KB
        tet_ID.zip 2 MB 124.5 KB
        th_TH.zip FAIL 409.6 KB
        tl_PH.zip 2.6 MB 148.7 KB
        tn_ZA.zip 1.5 MB 93.7 KB
        ts_ZA.zip 1.6 MB 113.1 KB
        uk_UA.zip 17.6 MB 979.1 KB
        ve_ZA.zip FAIL 140.9 KB
        vi_VN.zip 1.7 MB 85.8 KB
        xh_ZA.zip 3 MB 191.1 KB
        zu_ZA.zip 24.5 MB 827.1 KB
        ASF subversion and git services added a comment -

        Commit 1572666 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1572666 ]

        LUCENE-5468: convert affixes to FST

        ASF subversion and git services added a comment -

        Commit 1572660 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1572660 ]

        LUCENE-5468: encode affix data as 8 bytes per affix, before cutting over to FST

        ASF subversion and git services added a comment -

        Commit 1572643 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1572643 ]

        LUCENE-5468: don't create unnecessary objects

        ASF subversion and git services added a comment -

        Commit 1571844 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571844 ]

        LUCENE-5468: make Affix fixed-width

        ASF subversion and git services added a comment -

        Commit 1571807 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571807 ]

        LUCENE-5468: Stem -> CharsRef

        ASF subversion and git services added a comment -

        Commit 1571802 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571802 ]

        LUCENE-5468: remove redundant 'append' in Affix

        ASF subversion and git services added a comment -

        Commit 1571788 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571788 ]

        LUCENE-5468: deduplicate patterns used by affix condition check

        ASF subversion and git services added a comment -

        Commit 1571356 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571356 ]

        LUCENE-5468: sort dictionary data with offline sorter

        ASF subversion and git services added a comment -

        Commit 1571321 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571321 ]

        LUCENE-5468: factor OfflineSorter out of suggest

        Robert Muir added a comment -

        I brought the previous FST patch up to speed, and then built a test to parse many dictionaries and compare memory (a small measurement sketch follows the table below). When it says FAIL, that's because the current code can't parse the dictionary (I fixed all the issues here).

        In general, RAM use is better, but in some cases it's still bad because of how the affixes are represented. I still haven't removed my TreeMap yet either (I wanted to have a way to test all the dictionaries like this before really locking things down).

        dict old RAM new RAM
        af_ZA.zip 18 MB 899 KB
        ak_GH.zip 1.5 MB 71 KB
        bg_BG.zip FAIL 1.1 MB
        ca_ANY.zip 28.9 MB 1.2 MB
        ca_ES.zip 15.1 MB 1.2 MB
        cop_EG.zip 2.1 MB 489.3 KB
        cs_CZ.zip 50.4 MB 2.8 MB
        cy_GB.zip FAIL 1.6 MB
        da_DK.zip FAIL 750.8 KB
        de_AT.zip 1.3 MB 293.1 KB
        de_CH.zip 12.6 MB 895.6 KB
        de_DE.zip 12.6 MB 895 KB
        de_DE_comb.zip 102.2 MB 4.8 MB
        de_DE_frami.zip 20.9 MB 1.2 MB
        de_DE_neu.zip 101.5 MB 4.8 MB
        el_GR.zip 74.3 MB 1.1 MB
        en_AU.zip 8.1 MB 1.2 MB
        en_CA.zip 9.8 MB 436.7 KB
        en_GB-oed.zip 8.2 MB 1.2 MB
        en_GB.zip 8.3 MB 1.2 MB
        en_NZ.zip 8.4 MB 1.2 MB
        eo.zip 4.9 MB 1.3 MB
        eo_EO.zip 4.9 MB 1.3 MB
        es_AR.zip 14.8 MB 3.9 MB
        es_BO.zip 14.8 MB 3.9 MB
        es_CL.zip 14.7 MB 3.9 MB
        es_CO.zip 14.3 MB 3.8 MB
        es_CR.zip 14.8 MB 3.9 MB
        es_CU.zip 14.7 MB 3.9 MB
        es_DO.zip 14.7 MB 3.9 MB
        es_EC.zip 14.8 MB 3.9 MB
        es_ES.zip 15.1 MB 4.1 MB
        es_GT.zip 14.8 MB 3.9 MB
        es_HN.zip 14.8 MB 3.9 MB
        es_MX.zip 14.3 MB 3.8 MB
        es_NEW.zip 15.5 MB 4.2 MB
        es_NI.zip 14.8 MB 3.9 MB
        es_PA.zip 14.8 MB 3.9 MB
        es_PE.zip 14.2 MB 3.8 MB
        es_PR.zip 14.7 MB 3.9 MB
        es_PY.zip 14.8 MB 3.9 MB
        es_SV.zip 14.8 MB 3.9 MB
        es_UY.zip 14.8 MB 3.9 MB
        es_VE.zip 14.3 MB 3.8 MB
        et_EE.zip 53.6 MB 5.9 MB
        fo_FO.zip 18.6 MB 485.7 KB
        fr_FR-1990_1-3-2.zip 14 MB 636.4 KB
        fr_FR-classique_1-3-2.zip 14 MB 743.1 KB
        fr_FR_1-3-2.zip 14.5 MB 755.2 KB
        fy_NL.zip 4.2 MB 272.8 KB
        ga_IE.zip 14 MB 674.8 KB
        gd_GB.zip 2.7 MB 111 KB
        gl_ES.zip FAIL 1.2 MB
        gsc_FR.zip FAIL 1.4 MB
        gu_IN.zip 20.3 MB 914.9 KB
        he_IL.zip 53.3 MB 1.8 MB
        hi_IN.zip 2.7 MB 136.9 KB
        hil_PH.zip 3.4 MB 164.8 KB
        hr_HR.zip 29.7 MB 564.8 KB
        hu_HU.zip FAIL 17.6 MB
        hu_HU_comb.zip FAIL 19.9 MB
        ia.zip 4.9 MB 211.9 KB
        id_ID.zip 3.9 MB 218.4 KB
        it_IT.zip 15.3 MB 1.6 MB
        ku_TR.zip 1.6 MB 147.6 KB
        la.zip 5.1 MB 2.5 MB
        lt_LT.zip 15 MB 2.8 MB
        lv_LV.zip 36.3 MB 1.9 MB
        mg_MG.zip 2.9 MB 131.7 KB
        mi_NZ.zip FAIL 171.2 KB
        mk_MK.zip FAIL 436.9 KB
        mos_BF.zip 13.3 MB 210 KB
        mr_IN.zip FAIL 115.5 KB
        ms_MY.zip 4.1 MB 221.6 KB
        nb_NO.zip 22.9 MB 1.4 MB
        ne_NP.zip 5.5 MB 495.6 KB
        nl_NL.zip 22.9 MB 1.1 MB
        nl_med.zip 1.2 MB 60.2 KB
        nn_NO.zip 16.5 MB 1 MB
        nr_ZA.zip 3.1 MB 171.1 KB
        ns_ZA.zip 1.7 MB 85.8 KB
        ny_MW.zip FAIL 69.6 KB
        oc_FR.zip 9.1 MB 690.5 KB
        pl_PL.zip 43.9 MB 4.9 MB
        pt_BR.zip FAIL 3.9 MB
        pt_PT.zip 5.8 MB 773.4 KB
        ro_RO.zip 5.1 MB 226.2 KB
        ru_RU.zip 21.7 MB 1.4 MB
        ru_RU_ye.zip 43.7 MB 1.6 MB
        ru_RU_yo.zip 21.7 MB 1.4 MB
        rw_RW.zip 1.6 MB 70.1 KB
        sk_SK.zip 25.1 MB 2.3 MB
        sl_SI.zip 38.3 MB 806.6 KB
        sq_AL.zip 28.9 MB 654.6 KB
        ss_ZA.zip 3.1 MB 176.3 KB
        st_ZA.zip 1.7 MB 86.5 KB
        sv_SE.zip 9.5 MB 668.8 KB
        sw_KE.zip 6.3 MB 286 KB
        tet_ID.zip 2 MB 92.4 KB
        th_TH.zip FAIL 377.4 KB
        tl_PH.zip 2.6 MB 116.5 KB
        tn_ZA.zip 1.5 MB 61.6 KB
        ts_ZA.zip 1.6 MB 81 KB
        uk_UA.zip 17.6 MB 3 MB
        ve_ZA.zip FAIL 108.8 KB
        vi_VN.zip 1.7 MB 53.6 KB
        xh_ZA.zip 3 MB 158.9 KB
        zu_ZA.zip 24.5 MB 13.5 MB
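        For illustration, the comparison above amounts to loading each dictionary and asking RamUsageEstimator how large the resulting object graph is. Below is a minimal sketch of such a harness, not the actual test from the patch; loadDictionary() is a hypothetical placeholder for whichever implementation (the old HunspellDictionary or the new FST-based one) is being measured.

        import java.io.File;
        import java.io.FileInputStream;
        import java.io.InputStream;
        import org.apache.lucene.util.RamUsageEstimator;

        public class DictRamCheck {
          public static void main(String[] args) throws Exception {
            File dir = new File(args[0]); // directory of extracted .aff/.dic pairs, e.g. pl_PL.aff + pl_PL.dic
            for (File aff : dir.listFiles()) {
              if (!aff.getName().endsWith(".aff")) continue;
              File dic = new File(dir, aff.getName().replace(".aff", ".dic"));
              try (InputStream a = new FileInputStream(aff); InputStream d = new FileInputStream(dic)) {
                Object dictionary = loadDictionary(a, d); // hypothetical: build the implementation under test
                System.out.println(aff.getName() + " "
                    + RamUsageEstimator.humanReadableUnits(RamUsageEstimator.sizeOf(dictionary)));
              } catch (Exception e) {
                System.out.println(aff.getName() + " FAIL"); // dictionary could not be parsed
              }
            }
          }

          // Placeholder: wire up either the old HunspellDictionary constructor or the new
          // FST-based dictionary here, depending on which version is being measured.
          static Object loadDictionary(InputStream affix, InputStream dic) throws Exception {
            throw new UnsupportedOperationException("plug in the dictionary implementation under test");
          }
        }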
        ASF subversion and git services added a comment -

        Commit 1571137 from Robert Muir in branch 'dev/branches/lucene5468'
        [ https://svn.apache.org/r1571137 ]

        LUCENE-5468: commit current state

        Chris Male added a comment -

        Sounds good

        Robert Muir added a comment -

        Well, I don't want the whole issue to get hung up on that stuff. Basically I'm working on a number of changes (especially tests, to ensure the stuff is really working correctly). If we want, we can just lay down my new files on top of the existing stuff, or we can keep it/deprecate it, whatever we want to do.

        I just want to make some progress on a few improvements I've been investigating to try to make this thing more usable.

        Chris Male added a comment -

        Multiple dictionaries was never in the original design either. Having an efficient and usable design seems to be of higher priority so +1 to not forking and doing this in place.

        Robert Muir added a comment -

        I don't think we should let some esoteric options like multiple dictionaries keep this stuff unusable.

        So I'm happy to just fork the entire stuff into a different package (hunspell2 or something), so we have a reasonably efficient version that doesn't have these esoteric options. The old stuff can stay as is, I do not care.

        Mathias H. added a comment -

        I now solved the problem in my special case. I wrote a custom TokenFilterFactory that wraps the DictionaryCompoundWordTokenFilterFactory / HunspellStemFilterFactory and caches the factories, so they will be reused across indexes and fieldtypes.
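        A rough sketch of that caching idea, assuming a hypothetical DictionaryCache class keyed by the affix/dic resource paths (this is not the factory Mathias wrote, just an illustration of sharing one parsed dictionary per file pair):

        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        // Hypothetical sketch: share one loaded dictionary per (affix, dic) resource pair
        // instead of re-parsing it for every field/index that references the same files.
        public final class DictionaryCache<D> {
          private final ConcurrentMap<String, D> cache = new ConcurrentHashMap<String, D>();

          public D get(String affixPath, String dicPath, Loader<D> loader) throws Exception {
            String key = affixPath + "|" + dicPath;
            D dict = cache.get(key);
            if (dict == null) {
              D loaded = loader.load(affixPath, dicPath); // parse only on first use
              D existing = cache.putIfAbsent(key, loaded);
              dict = existing != null ? existing : loaded;
            }
            return dict;
          }

          // Callback that performs the actual (expensive) dictionary parse.
          public interface Loader<D> {
            D load(String affixPath, String dicPath) throws Exception;
          }
        }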

        Mathias H. added a comment -

        Dictionaries with the same file location should be shared across all fields and all indexes. This would minimize the problem if you're using multiple indexes.

        Currently I can't use Solr because I have 10 indexes with 5 fields each, and a DictionaryCompoundWordTokenFilterFactory is assigned to each field. So the dictionary will be loaded 50 times. This is too much for my RAM.

        Robert Muir added a comment -

        You can always sort inside your application if you're not sure if the words come or not in sorted order, Maciej

        Well someone has to sort to 'test' any dictionary customizations with hunspell's tools anyway.

        So I assume people are already doing 'sort foo.dic my_foo_customizations.dic > combined.dic' and then using 'analyze'
        and other commands to test... otherwise how are they testing their customizations?!

        Robert Muir added a comment -

        Note: in some cases we will still have to use the throwaway TreeMap or something similar, like the patch I uploaded does.

        But we could then know these two cases up front (sketched below):

        • someone enables ignoreCase=true
        • the binary sort order of the charset != UTF-8 binary order
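        A sketch of what such an up-front check might look like; the charset whitelist here is an assumption for illustration, not code from the patch:

        import java.nio.charset.Charset;
        import java.nio.charset.StandardCharsets;

        // Hypothetical check: can .dic entries be streamed straight into the FST builder
        // (which requires sorted input), or is a throwaway sort/TreeMap pass needed first?
        final class SortCheck {
          static boolean needsSortPass(boolean ignoreCase, Charset dicCharset) {
            if (ignoreCase) {
              return true; // lowercasing entries can put them out of the file's sorted order
            }
            // Charsets whose binary order matches Unicode/UTF-8 order can be trusted as-is;
            // anything more exotic is handled conservatively with the extra sort pass.
            return !(StandardCharsets.UTF_8.equals(dicCharset)
                || StandardCharsets.US_ASCII.equals(dicCharset)
                || StandardCharsets.ISO_8859_1.equals(dicCharset));
          }
        }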
        Dawid Weiss added a comment -

        You can always sort inside your application if you're not sure if the words come or not in sorted order, Maciej. Lucene/Solr now even has an on-disk merge sort which you can use for large(r) data sets – this code lives alongside FSTCompletion in trunk.
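        The on-disk merge sort mentioned here is what this issue later factors out as OfflineSorter (see the commits above). A minimal usage sketch against the 4.x File-based API, with the exact signatures taken as an assumption; the input file holds length-prefixed entries as written by OfflineSorter.ByteSequencesWriter:

        import java.io.File;
        import org.apache.lucene.util.BytesRef;
        import org.apache.lucene.util.OfflineSorter;

        // Sketch: sort the raw word list on disk before building the FST, so even huge
        // dictionaries never need to be held (or sorted) entirely in the Java heap.
        public class SortWordsExample {
          public static void main(String[] args) throws Exception {
            File unsorted = new File(args[0]); // length-prefixed BytesRef entries, one per word
            File sorted = new File(args[1]);
            OfflineSorter sorter = new OfflineSorter(); // default comparator = byte order
            sorter.sort(unsorted, sorted);
            // Read the entries back in sorted order and feed them to the FST builder.
            OfflineSorter.ByteSequencesReader reader = new OfflineSorter.ByteSequencesReader(sorted);
            BytesRef scratch = new BytesRef();
            while (reader.read(scratch)) {
              // builder.add(...) would go here
            }
            reader.close();
          }
        }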

        Chris Male added a comment -

        I don't see any problem mandating that overrides/customizations adhere to a sorted order. I don't think we can assume custom dictionaries are going to be small - there's nothing in the APIs which forces that. Using FSTs gives us the performance benefit we're seeking in this issue, so I think the small sacrifice is worth the huge benefit.

        Robert Muir added a comment -

        Also, it's required by the hunspell format itself. So this is not crazy to enforce.

        Maciej Lisiewski added a comment -

        What I was trying to say is that the custom dictionaries are small enough to be loaded and sorted in memory before building FST.

        Robert Muir added a comment -

        Those overrides/customizations will be tiny when compared to the main dictionary - is the sorting really an issue here?

        It doesn't matter here: our FST requires that it be built in order. It doesn't matter if even one single word is out of order.

        Because of this, we can't build the data structure efficiently.

        Maciej Lisiewski added a comment -

        Those overrides/customizations will be tiny when compared to the main dictionary - is the sorting really an issue here?
        Simple example: default PL dictionary is close to 200k words. Largest custom dictionaries (legal, military, medical) will be 5-10k words (I'm basing those estimates on the best sources that I have found to generate those dictionaries from). In most cases we should expect <1k words.

        Robert Muir added a comment -

        at least the local override/customizations files can surely require sorted order?

        Jan Høydahl added a comment -

        Background for supporting multiple dictionaries is here: http://code.google.com/p/lucene-hunspell/issues/detail?id=4 and is invaluable for adding local customizations or overrides without touching the official dictionaries.

        Dawid Weiss added a comment -

        Looks good to me from looking at the diff. Btw., we really should pull out the getOutputForInput(FST, input) logic currently present in lookupOrd somewhere where it's reusable – I've seen it in a few places (or needed it a few times)...

        Robert Muir added a comment -

        Makes me wonder whether there are other files / datastructures in analysis factories that are in the same boat?

        Maybe synonyms too? I dunno, it just seems like if factories implement ResourceLoaderAware,
        instead of calling init() and inform() on all of them, they should be able to parse
        their params in init(), override equals/hashCode based on their parameters, and some mechanism
        would then just reuse existing ones instead of creating duplicates.

        Chris Male added a comment -

        Hey, patch looks cool Robert.

        we allow multiple dictionary files... is this really needed?

        I don't think so.

        Solr should never instantiate more than one of the same dictionary across different fields (that's a factory issue, I'm not going to deal with it here, but it's just stupid if the factory does this)

        That's a really good point actually. Makes me wonder whether there are other files / datastructures in analysis factories that are in the same boat?

        Robert Muir added a comment -

        Here's a patch cutting this thing over to use less RAM once it's started, but it probably uses more initially when parsing, mainly because we cannot guarantee the input is in sorted order. I think we should fix that, so that jumping through hoops is the exception rather than the rule:

        • we allow multiple dictionary files... is this really needed?
        • if you use ignoreCase it means entries can be out of sorted order too.
        • in some strange encodings the order in the original file could differ from binary order.

        The building could just do the 2-phase thing it does now for the crazy cases and be efficient for the 90% case if we clean up (a small FST-building sketch follows this comment).

        The remaining problems:

        • fix the existing confusion in the dictionary API (like multiple input files) so that most of the time we can rely upon sorted order.
        • Solr should never instantiate more than one of the same dictionary across different fields (that's a factory issue, I'm not going to deal with it here, but it's just stupid if the factory does this).
        • anything in the patch with nocommit, TODO, or bogus should be fixed.
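        As a concrete illustration of the sorted-order requirement, here is a minimal sketch of building a word FST with the FST API of that era. It is an illustration only, not the patch's code; Builder.add() rejects out-of-order input, which is exactly why the sorting question above matters.

        import org.apache.lucene.util.IntsRef;
        import org.apache.lucene.util.fst.Builder;
        import org.apache.lucene.util.fst.FST;
        import org.apache.lucene.util.fst.PositiveIntOutputs;
        import org.apache.lucene.util.fst.Util;

        // Build a word -> ordinal FST. The input MUST arrive in sorted order;
        // the builder throws if a word sorts before the previous one.
        public class BuildWordFst {
          public static FST<Long> build(Iterable<String> sortedWords) throws Exception {
            PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
            Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE4, outputs);
            IntsRef scratch = new IntsRef();
            long ord = 1; // start at 1; 0 is the "no output" value for PositiveIntOutputs
            for (String word : sortedWords) {
              builder.add(Util.toUTF32(word, scratch), ord++); // each word must sort after the previous one
            }
            return builder.finish();
          }
        }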
        Robert Muir added a comment -

        I'm working on a quick 80/20 stab here. I think it will help a lot.

        Dawid Weiss added a comment -

        You're probably right – my opinion was based on my inspection of hunspell's source code that I did once or twice in the past – I remember there's logic to perform more advanced stuff than dictionary lookup, but I never got the full picture of whether or how it's used.

        Robert Muir added a comment -

        Marcin Miłkowski has a set of scripts for that and he, as far as I recall, used aspell/ ispell to "dump" all of their forms by feeding the input dictionary basically. I think hunspell provides more intelligent handling of words outside of the dictionary so there's value in it that morfologik doesn't have.

        I think what you describe is essentially, at a high level, exactly what the hunspell filter does. Theoretically there is more intelligent handling possible (correcting spelling), but this isn't implemented, it's not interesting for search anyway for the most part, and there is definitely no OOV mechanism.

        Dawid Weiss added a comment -

        I must disappoint you here – morfologik simply compiles a list of inflected-base-tag triples, it has no logic for generating these forms from lexical flags/ base dictionaries. Marcin Miłkowski has a set of scripts for that and he, as far as I recall, used aspell/ ispell to "dump" all of their forms by feeding the input dictionary basically. I think hunspell provides more intelligent handling of words outside of the dictionary so there's value in it that morfologik doesn't have.

        Robert Muir added a comment -

        Yeah, but the HunspellDictionary really is ridiculous if you try to use a large dictionary with it;
        even without cutting over to an FST it could probably be improved.

        For minority languages without really nice dictionaries it probably doesn't matter much, but for
        the languages with really nice dictionaries you also tend to have language-specific options available.

        Just another crazy idea: I don't know how much of morfologik is dependent upon Polish itself, but
        if it already knows how to compile ispell/hunspell into an efficient form and work with it, maybe
        we should just be seeing if we can 'generalize' that and work it from that angle.

        Dawid Weiss added a comment -

        You know what they say these days – just buy more RAM and get rid of the problem by covering it with money.

        Robert Muir added a comment -

        As for Hunspell IMHO 2GB heap just to load dictionary makes it borderline unusable for some languages.

        Right, but honestly the original motivation was to get something up quickly when you have no other choice: for minority languages, etc.

        Maciej Lisiewski added a comment - - edited

        The last time I checked, Morfologik was just mentioned as a possible new stemmer - I have used it before and I prefer it to Stempel/Hunspell, so I guess this solves my problem for now, thanks.

        As for Hunspell IMHO 2GB heap just to load dictionary makes it borderline unusable for some languages.

        Chris Male added a comment -

        +1 to your idea Robert. I've been thinking along the same lines that FSTs might help us out here.

        Dawid Weiss added a comment -

        Morfologik will be exactly the same size in memory as its unzipped dictionary, so about 1.8MB + 3.5MB if you use both pl (morfologik) and pl-sgjp (morfeusz) dictionaries. These are fixed dictionaries (that is unknown words won't be stemmed) but the coverage is decent for contemporary Polish.

        If you explain what you're trying to do/ achieve then perhaps we'll be able to give you some more hints.

        Robert Muir added a comment -

        By comparison Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).

        I imagine Stempel's Trie is good, but have you also compared Morfologik (http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/morfologik/)?
        Its precompiled FST might be the most space-efficient for Polish.

        But really I think Hunspell's dictionary structure should be more efficient: we could build the FST on the fly (if case-insensitive mode is off). But when
        this is on, entries must be merged.

        Instead it might be better for the hunspell stuff to support loading FSTs (where we would do any case-sensitivity tweaking/merging of entries, then build the FST).
        It might be possible to re-use some of the same code from SOLR-2888 that does a similar thing to build a suggester FST.

        In my opinion it's worth it to build the FST not just for the words, but also for the affixes (in some files these are humongous too!).

        For Lucene I think we would just allow HunspellDictionary to also be instantiated from these FST input streams. The Solr factory / configuration would need
        to be tweaked to make this easy and intuitive.
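        A rough sketch of the "instantiate from a precompiled FST" idea; the File-based save/read helpers and the Long outputs are assumptions for illustration, and a real integration would also need to cover the affix data:

        import java.io.File;
        import org.apache.lucene.util.fst.FST;
        import org.apache.lucene.util.fst.PositiveIntOutputs;

        // Sketch: compile the dictionary once (offline), persist the FST, and let the
        // analyzer load the precompiled file instead of re-parsing .dic/.aff at startup.
        public class PrecompiledFstExample {
          public static void save(FST<Long> fst, File file) throws Exception {
            fst.save(file); // write the compiled automaton to disk
          }

          public static FST<Long> load(File file) throws Exception {
            return FST.read(file, PositiveIntOutputs.getSingleton()); // outputs must match the writer
          }
        }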


          People

          • Assignee: Unassigned
          • Reporter: Maciej Lisiewski
          • Votes: 1
          • Watchers: 6
