Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently the suggester api and especially TermFreqIterator don't play that nice with BytesRef and other paradigms we use in lucene, further the java iterator pattern isn't that useful when it gets to work with TermsEnum, BytesRef etc. We should try to clean up this api step by step moving over to BytesRef including the Lookup class and its interface...

      1. LUCENE-3807.patch
        67 kB
        Simon Willnauer
      2. LUCENE-3807.patch
        63 kB
        Simon Willnauer
      3. LUCENE-3807.patch
        68 kB
        Simon Willnauer
      4. LUCENE-3807.patch
        68 kB
        Simon Willnauer
      5. LUCENE-3807.patch
        56 kB
        Simon Willnauer
      6. LUCENE-3807.patch
        75 kB
        Simon Willnauer
      7. LUCENE-3807.patch
        80 kB
        Simon Willnauer
      8. LUCENE-3807.patch
        12 kB
        Simon Willnauer
      9. LUCENE-3807.patch
        16 kB
        Simon Willnauer
      10. LUCENE-3807_3x.patch
        180 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        here is a first patch extracting a BytesRefIterator as a common interface directly shared with TermsEnum as a first step. The semantics are taken from TermsEnum directly so nothing changes for TermsEnum and its consumers. The simple next method makes the most of the TermFreqIters trivial.

        comments welcome...

        Show
        Simon Willnauer added a comment - here is a first patch extracting a BytesRefIterator as a common interface directly shared with TermsEnum as a first step. The semantics are taken from TermsEnum directly so nothing changes for TermsEnum and its consumers. The simple next method makes the most of the TermFreqIters trivial. comments welcome...
        Hide
        Dawid Weiss added a comment -

        +1.

        Show
        Dawid Weiss added a comment - +1.
        Hide
        Robert Muir added a comment -

        I like the patch, but only one thing (its fine to commit it as-is though, we can solve this on another issue, i just couldnt help but notice)

        I don't think we should have the BufferedTermFreqIteratorWrapper/etc and the SortedTermFreqIterator marker interface needs to be fixed.

        Here are the problems:

        • Marker interface SortedTermFreqIterator doesn't tell you if its UTF-8 or UTF-16 order. Its implemented by two classes: SortedTermFreqIteratorWrapper,
          which sorts in UTF-16 order, and HighFrequencyDictionary, which returns terms from the index (so UTF-8 order). The problem is that classes
          that rely upon sorted order like JaSpell/TST are likely broken already. Fortunately FST/WFST always do their own sort.
        • Buffering in RAM is not ideal. Instead I think all of these classes should be using our Sort anyway which can spill to disk.

        For now could we put the BytesRefList in the suggest package since its only used there? we might not need it after we clean up
        this sorting stuff in some future issue.

        Also I don't think we should factor out the BytesRefIterator. I seriously think its a bad idea to tie our core index Terms enumeration API
        with the spellcheck API at this time, it would make it hard to change in the future if we need, especially with spellcheck being... needing work

        Show
        Robert Muir added a comment - I like the patch, but only one thing (its fine to commit it as-is though, we can solve this on another issue, i just couldnt help but notice) I don't think we should have the BufferedTermFreqIteratorWrapper/etc and the SortedTermFreqIterator marker interface needs to be fixed. Here are the problems: Marker interface SortedTermFreqIterator doesn't tell you if its UTF-8 or UTF-16 order. Its implemented by two classes: SortedTermFreqIteratorWrapper, which sorts in UTF-16 order, and HighFrequencyDictionary, which returns terms from the index (so UTF-8 order). The problem is that classes that rely upon sorted order like JaSpell/TST are likely broken already. Fortunately FST/WFST always do their own sort. Buffering in RAM is not ideal. Instead I think all of these classes should be using our Sort anyway which can spill to disk. For now could we put the BytesRefList in the suggest package since its only used there? we might not need it after we clean up this sorting stuff in some future issue. Also I don't think we should factor out the BytesRefIterator. I seriously think its a bad idea to tie our core index Terms enumeration API with the spellcheck API at this time, it would make it hard to change in the future if we need, especially with spellcheck being... needing work
        Hide
        Robert Muir added a comment -

        Again i just wanted to mention, i think its fine to commit as-is.

        Consider my comments thinking-out-loud.

        Show
        Robert Muir added a comment - Again i just wanted to mention, i think its fine to commit as-is. Consider my comments thinking-out-loud.
        Hide
        Simon Willnauer added a comment -

        I committed the patch to trunk and moved the BytesRefList to the suggest package (pkg private) for now. I keep this issue open as I have more iterations / patches. I will backport before I close this issue ie. batch port to 3.x all commits related to this issue. commit revision is 1291418

        Show
        Simon Willnauer added a comment - I committed the patch to trunk and moved the BytesRefList to the suggest package (pkg private) for now. I keep this issue open as I have more iterations / patches. I will backport before I close this issue ie. batch port to 3.x all commits related to this issue. commit revision is 1291418
        Hide
        Robert Muir added a comment -

        I think we should go with a simpler solution for the Buffering/SortedIterator... this has caused a few recent test fails.

        (latest: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12476/)

        If we are going to replace it anyway with the Sort that spills to disk, cant we just use a simple treemap?

        Show
        Robert Muir added a comment - I think we should go with a simpler solution for the Buffering/SortedIterator... this has caused a few recent test fails. (latest: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12476/ ) If we are going to replace it anyway with the Sort that spills to disk, cant we just use a simple treemap?
        Hide
        Simon Willnauer added a comment -

        just drop it I will replace that stuff with on disk sort tomorrow anyway

        Show
        Simon Willnauer added a comment - just drop it I will replace that stuff with on disk sort tomorrow anyway
        Hide
        Simon Willnauer added a comment -

        I think we should go with a simpler solution for the Buffering/SortedIterator... this has caused a few recent test fails.

        those failures have been fixed..

        Show
        Simon Willnauer added a comment - I think we should go with a simpler solution for the Buffering/SortedIterator... this has caused a few recent test fails. those failures have been fixed..
        Hide
        Simon Willnauer added a comment -

        next patch...

        • cut over Lookup API to use CharSequence instead of string. this allows us to use CharsRef and saves object creation.
        • removed Sorted marker interface. BytesRefIterator has a getComparator() that can be null to indicate that there is no specified sort order. That way we can compare the Comparator<BytesRef> instance to see if the sort order matches what the impl expects.
        • added a IOException to Dictionary#getWordsIterator() removing tons of try/catch throw RTE
        • renamed freq to weight
        • weight (ie freq) now returns a long instead of a float discouraging floating point number as weights

        next steps would be cleaning up the in-memory sorting stuff and use the sorter which goes to disk to do the actual sorting internally if needed

        comments welcome

        Show
        Simon Willnauer added a comment - next patch... cut over Lookup API to use CharSequence instead of string. this allows us to use CharsRef and saves object creation. removed Sorted marker interface. BytesRefIterator has a getComparator() that can be null to indicate that there is no specified sort order. That way we can compare the Comparator<BytesRef> instance to see if the sort order matches what the impl expects. added a IOException to Dictionary#getWordsIterator() removing tons of try/catch throw RTE renamed freq to weight weight (ie freq) now returns a long instead of a float discouraging floating point number as weights next steps would be cleaning up the in-memory sorting stuff and use the sorter which goes to disk to do the actual sorting internally if needed comments welcome
        Hide
        Robert Muir added a comment -

        I agree with removing IOException from getComparator.

        I agree also with moving weight to a long... though currently the e.g. solr integrations take floats as input,
        so this really needs to be listed as a backwards break since it will directly affect users.

        I agree with removing the sorted marker interface: its not useful since you don't know the order.

        However, I don't think we should add the charsref methods... I think Bytes/Ints/CharsRef should have parallel
        apis and someone can just call unicodeutil: in general these are reference classes not stringbuffers and
        we shouldn't encourage abuse via sugar apis. I already have an issue open for fixing, cleaning up,
        and making those APIs consistent.

        I don't think we should add a generics parameter V to Lookup, especially if LookupResult itself is still
        wired to float. I do think suggesters should be able to return additional data but this needs more thought:
        its necessary to actually get the additional data to them.

        Show
        Robert Muir added a comment - I agree with removing IOException from getComparator. I agree also with moving weight to a long... though currently the e.g. solr integrations take floats as input, so this really needs to be listed as a backwards break since it will directly affect users. I agree with removing the sorted marker interface: its not useful since you don't know the order. However, I don't think we should add the charsref methods... I think Bytes/Ints/CharsRef should have parallel apis and someone can just call unicodeutil: in general these are reference classes not stringbuffers and we shouldn't encourage abuse via sugar apis. I already have an issue open for fixing, cleaning up, and making those APIs consistent. I don't think we should add a generics parameter V to Lookup, especially if LookupResult itself is still wired to float. I do think suggesters should be able to return additional data but this needs more thought: its necessary to actually get the additional data to them.
        Hide
        Robert Muir added a comment -

        I found another bug (added test and committed fix in r1291826).
        This would cause an exception in solr if someone called build but the field was empty.

        I think before proceeding with refactoring, we really need to beef up tests.
        I'll help with this.

        Show
        Robert Muir added a comment - I found another bug (added test and committed fix in r1291826). This would cause an exception in solr if someone called build but the field was empty. I think before proceeding with refactoring, we really need to beef up tests. I'll help with this.
        Hide
        Simon Willnauer added a comment -

        new patch cleaning up the api a little more according to roberts comments. I had some leftover CharsRef uses in there and I removed the generics on Lookup. This patch also removes all the additional methods on CharsRef and adds some randomization for CharSequences..

        Show
        Simon Willnauer added a comment - new patch cleaning up the api a little more according to roberts comments. I had some leftover CharsRef uses in there and I removed the generics on Lookup. This patch also removes all the additional methods on CharsRef and adds some randomization for CharSequences..
        Hide
        Simon Willnauer added a comment -

        I think before proceeding with refactoring, we really need to beef up tests.

        I will help too while I think the patch is ready... we can go with iterations?

        Show
        Simon Willnauer added a comment - I think before proceeding with refactoring, we really need to beef up tests. I will help too while I think the patch is ready... we can go with iterations?
        Hide
        Simon Willnauer added a comment -

        so this really needs to be listed as a backwards break since it will directly affect users.

        I will note this in the CHANGES once we are done

        Show
        Simon Willnauer added a comment - so this really needs to be listed as a backwards break since it will directly affect users. I will note this in the CHANGES once we are done
        Hide
        Robert Muir added a comment -

        I like this latest patch!

        I didnt test it, but i think its fine to move forward.

        I just mainly want to address the issues of whole suggesters that are not tested
        at all in this module (LUCENE-3813) before we did too much more.

        Show
        Robert Muir added a comment - I like this latest patch! I didnt test it, but i think its fine to move forward. I just mainly want to address the issues of whole suggesters that are not tested at all in this module ( LUCENE-3813 ) before we did too much more.
        Hide
        Dawid Weiss added a comment - - edited

        Sorry for being late, work. I like the patch. Comments:

        +    public Comparator<BytesRef> getComparator() {
        +      return null;
        +    }
        

        This shows up in a number of places. I have mixed feelings about certain types having a comparator and others not having it, but it's minor.

        BufferingTermFreqIteratorWrapper is a nuisance (buffers in memory). It would be nicer to have a sort on disk if something doesn't support sorted iteration order.

        I also wonder float -> long = 4 -> 8 bytes... would this count as an incompatible API change (because what used to work for a given amount of RAM won't work anymore – BufferingTermFreqIteratorWrapper again)?

        +      if (l1 < l2) {
        +        aStop = l1;
        +      } else {
        +        aStop = l2;
        +      }
        

        if I remember correctly Math.min/max are intrinsics, so you can afford to be explicit

        Why not a specific covariant here?

        -  public Float get(String key) {
        +  public Object get(CharSequence key) {
        

        This doesn't seem necessary (lookup accepts a CharSequence?).

        @@ -199,7 +199,7 @@ public class LookupBenchmarkTest extends LuceneTestCase {
                 public Integer call() throws Exception {
                   int v = 0;
                   for (String term : input) {
        -            v += lookup.lookup(term, onlyMorePopular, num).size();
        +            v += lookup.lookup(new CharsRef(term), onlyMorePopular, num).size();
        

        I like the rest, including the CharSequenceish evilness of bytesToCharSequence

        Show
        Dawid Weiss added a comment - - edited Sorry for being late, work. I like the patch. Comments: + public Comparator<BytesRef> getComparator() { + return null; + } This shows up in a number of places. I have mixed feelings about certain types having a comparator and others not having it, but it's minor. BufferingTermFreqIteratorWrapper is a nuisance (buffers in memory). It would be nicer to have a sort on disk if something doesn't support sorted iteration order. I also wonder float -> long = 4 -> 8 bytes... would this count as an incompatible API change (because what used to work for a given amount of RAM won't work anymore – BufferingTermFreqIteratorWrapper again)? + if (l1 < l2) { + aStop = l1; + } else { + aStop = l2; + } if I remember correctly Math.min/max are intrinsics, so you can afford to be explicit Why not a specific covariant here? - public Float get(String key) { + public Object get(CharSequence key) { This doesn't seem necessary (lookup accepts a CharSequence?). @@ -199,7 +199,7 @@ public class LookupBenchmarkTest extends LuceneTestCase { public Integer call() throws Exception { int v = 0; for (String term : input) { - v += lookup.lookup(term, onlyMorePopular, num).size(); + v += lookup.lookup(new CharsRef(term), onlyMorePopular, num).size(); I like the rest, including the CharSequenceish evilness of bytesToCharSequence
        Hide
        Simon Willnauer added a comment -

        This shows up in a number of places. I have mixed feelings about certain types having a comparator and others not having it, but it's minor.

        well this an indicator if we know something about the order or not. if you get null there is not order specified...

        BufferingTermFreqIteratorWrapper is a nuisance (buffers in memory). It would be nicer to have a sort on disk if something doesn't support sorted iteration order.

        this is the ultimate goal... see my comment above (

         next steps would be cleaning up the in-memory sorting stuff and use the sorter which goes to disk to do the actual sorting internally if needed 

        )

        I also wonder float -> long = 4 -> 8 bytes... would this count as an incompatible API change (because what used to work for a given amount of RAM won't work anymore – BufferingTermFreqIteratorWrapper again)?

        see above

        if I remember correctly Math.min/max are intrinsics, so you can afford to be explicit

        I will upload a new patch - we use this in BytesRefComp too, I think its safe to fix

        This doesn't seem necessary (lookup accepts a CharSequence?).

        right - leftover from an old iteration

        Why not a specific covariant here?

        that would be Float get(String) or Float get(CharSequence) or...

        Show
        Simon Willnauer added a comment - This shows up in a number of places. I have mixed feelings about certain types having a comparator and others not having it, but it's minor. well this an indicator if we know something about the order or not. if you get null there is not order specified... BufferingTermFreqIteratorWrapper is a nuisance (buffers in memory). It would be nicer to have a sort on disk if something doesn't support sorted iteration order. this is the ultimate goal... see my comment above ( next steps would be cleaning up the in-memory sorting stuff and use the sorter which goes to disk to do the actual sorting internally if needed ) I also wonder float -> long = 4 -> 8 bytes... would this count as an incompatible API change (because what used to work for a given amount of RAM won't work anymore – BufferingTermFreqIteratorWrapper again)? see above if I remember correctly Math.min/max are intrinsics, so you can afford to be explicit I will upload a new patch - we use this in BytesRefComp too, I think its safe to fix This doesn't seem necessary (lookup accepts a CharSequence?). right - leftover from an old iteration Why not a specific covariant here? that would be Float get(String) or Float get(CharSequence) or...
        Hide
        Simon Willnauer added a comment -

        updated patch with davids suggestions

        Show
        Simon Willnauer added a comment - updated patch with davids suggestions
        Hide
        Simon Willnauer added a comment -

        following up on this, here is another patch.
        I cleaned up the APIs even further removing get / add from Lookup which is not really used. TST and Jaspell still support add and get is still implemented on all others but its not part of the interface. The main thing that buggs me with add is that its inconsistent with build since it allows arbitrary output while build uses a Number.
        Once we need this we can add it back to the interface.

        TST and Jaspell are now optimized to not necessarily create new Float objects all the time but share instances if the incoming weight allows it. I also create Number instances based on the range which might safe some more memory while minor IMO.

        LookupResult now also returns a long instead of a float to be consistent with the TermFreqIterator.

        SortedTermFreqIteratorWrapper is now based on the On-Disk sort. I extended the sorter impl a little to work with Comparator<BytesRef> and fixed the TODO by using BytesRefList with ByteBlockPool internally. WFSTCompletionLookup now also uses the SortedTFIteratorWrapper to remove duplicated code.

        Show
        Simon Willnauer added a comment - following up on this, here is another patch. I cleaned up the APIs even further removing get / add from Lookup which is not really used. TST and Jaspell still support add and get is still implemented on all others but its not part of the interface. The main thing that buggs me with add is that its inconsistent with build since it allows arbitrary output while build uses a Number. Once we need this we can add it back to the interface. TST and Jaspell are now optimized to not necessarily create new Float objects all the time but share instances if the incoming weight allows it. I also create Number instances based on the range which might safe some more memory while minor IMO. LookupResult now also returns a long instead of a float to be consistent with the TermFreqIterator. SortedTermFreqIteratorWrapper is now based on the On-Disk sort. I extended the sorter impl a little to work with Comparator<BytesRef> and fixed the TODO by using BytesRefList with ByteBlockPool internally. WFSTCompletionLookup now also uses the SortedTFIteratorWrapper to remove duplicated code.
        Hide
        Simon Willnauer added a comment -

        more cleanups... I made BytesRefSorter using BytesRefList (we should maybe rename to bytesrefarray) and in turn use BytesRefIterator instead of the stupid java iterator. So all those classes should be consistent now.

        Show
        Simon Willnauer added a comment - more cleanups... I made BytesRefSorter using BytesRefList (we should maybe rename to bytesrefarray) and in turn use BytesRefIterator instead of the stupid java iterator. So all those classes should be consistent now.
        Hide
        Dawid Weiss added a comment -

        I looked at your patch briefly, Simon. Random notes:

        • I'd change currentElement into lastElement or something like that. Otherwise it looks odd to me in the code, as in:
               throw new IndexOutOfBoundsException("index " + pos
                   + " must be less than the size: " + currentElement);
          
        • typo in "orderdEntries".
        • I'm very likely paranoid but I'd stick to just one class for storing these:
          protected Number weightAsNumber(long weight) {
          

          Since these are objects the memory gain will most likely be obscured by object alignments and object overhead itself and the downside is that you're using an interface with all call sites that will very likely become megamorphic (so no chances to inline anything). I don't know if it's worth the effort.

        I didn't have time to think much about changes to the functional logic; I don't think there were any (and if there were, they should be covered by tests?).

        Show
        Dawid Weiss added a comment - I looked at your patch briefly, Simon. Random notes: I'd change currentElement into lastElement or something like that. Otherwise it looks odd to me in the code, as in: throw new IndexOutOfBoundsException( "index " + pos + " must be less than the size: " + currentElement); typo in "orderdEntries". I'm very likely paranoid but I'd stick to just one class for storing these: protected Number weightAsNumber( long weight) { Since these are objects the memory gain will most likely be obscured by object alignments and object overhead itself and the downside is that you're using an interface with all call sites that will very likely become megamorphic (so no chances to inline anything). I don't know if it's worth the effort. I didn't have time to think much about changes to the functional logic; I don't think there were any (and if there were, they should be covered by tests?).
        Hide
        Simon Willnauer added a comment -

        merged with trunk.

        I added a changes.txt to contrib and fixed the issues david raised.

        I think this is ready. once this is in I will backport this to 3x and close the issue

        Show
        Simon Willnauer added a comment - merged with trunk. I added a changes.txt to contrib and fixed the issues david raised. I think this is ready. once this is in I will backport this to 3x and close the issue
        Hide
        Robert Muir added a comment -

        i did a quick review: looks good to me

        Show
        Robert Muir added a comment - i did a quick review: looks good to me
        Hide
        Simon Willnauer added a comment -

        committed to trunk.. I think we are ready to backport. I will do that in the next days

        Show
        Simon Willnauer added a comment - committed to trunk.. I think we are ready to backport. I will do that in the next days
        Hide
        Robert Muir added a comment -

        Somewhere along the line we got load(File), load(InputStream), store(File), store(InputStream)... can we improve this?

        I think its confusing when writing a new suggester to have all these hooks.

        load(File) takes a directory name, so a suggester is free to use multiple files (I am working on one that might use 2 files).
        so then what is load(InputStream) ?!

        Show
        Robert Muir added a comment - Somewhere along the line we got load(File), load(InputStream), store(File), store(InputStream)... can we improve this? I think its confusing when writing a new suggester to have all these hooks. load(File) takes a directory name, so a suggester is free to use multiple files (I am working on one that might use 2 files). so then what is load(InputStream) ?!
        Hide
        Dawid Weiss added a comment -

        I added methods that used streams. They stored the underlying FST. I personally didn't like the "must point to a directory" approach.

        The File(dir) methods got inherited from the original Lookup interface I think. Those saving/reading from streams were not in the interface so they were specific to a concrete implementation. Don't know how it looks now.

        Show
        Dawid Weiss added a comment - I added methods that used streams. They stored the underlying FST. I personally didn't like the "must point to a directory" approach. The File(dir) methods got inherited from the original Lookup interface I think. Those saving/reading from streams were not in the interface so they were specific to a concrete implementation. Don't know how it looks now.
        Hide
        Robert Muir added a comment -

        Right I'm not concerned about concrete impls. Concrete can do whatever it wants

        In the actual lookup interface, load/store from stream was added underneath this issue:

        http://svn.apache.org/viewvc/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/suggest/Lookup.java?r1=1236012&r2=1291418&diff_format=h

        I'm not opposed to this change: but if we are going to require that implementations only use a single file,
        then we should remove the File ones and just let caller deal with the inputstream... does that make sense?

        Currently its impossible to tell which one will be called!

        Show
        Robert Muir added a comment - Right I'm not concerned about concrete impls. Concrete can do whatever it wants In the actual lookup interface, load/store from stream was added underneath this issue: http://svn.apache.org/viewvc/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/suggest/Lookup.java?r1=1236012&r2=1291418&diff_format=h I'm not opposed to this change: but if we are going to require that implementations only use a single file, then we should remove the File ones and just let caller deal with the inputstream... does that make sense? Currently its impossible to tell which one will be called!
        Hide
        Dawid Weiss added a comment -

        If you say you already have an implementation that could use more than one file then it'd be silly to enforce a single file format. I agree too many options is not good, in particular it's painful for potential implementors.

        Show
        Dawid Weiss added a comment - If you say you already have an implementation that could use more than one file then it'd be silly to enforce a single file format. I agree too many options is not good, in particular it's painful for potential implementors.
        Hide
        Robert Muir added a comment -

        Well I switched to a single file already, so ignore that implementation

        But my concern is mostly about API confusion for implementing suggesters: if we require a single file then lets just have InputStream.
        Otherwise, lets remove InputStream from Lookup, (of course a specific concrete Impl that only uses one file is still free to provide that)

        Show
        Robert Muir added a comment - Well I switched to a single file already, so ignore that implementation But my concern is mostly about API confusion for implementing suggesters: if we require a single file then lets just have InputStream. Otherwise, lets remove InputStream from Lookup, (of course a specific concrete Impl that only uses one file is still free to provide that)
        Hide
        Dawid Weiss added a comment -

        +1 for streams. They're a lot more flexible than Files.

        Show
        Dawid Weiss added a comment - +1 for streams. They're a lot more flexible than Files.
        Hide
        Simon Willnauer added a comment -

        the main reason why I moved this to the interface was that I needed to do some extra processing in the stream ie. calc a checksum and I wanted to write stuff directly to HDFS etc. I think we should go for streams or we pass in a lucene directory while I think unless really needed we should stick with a stream.

        Show
        Simon Willnauer added a comment - the main reason why I moved this to the interface was that I needed to do some extra processing in the stream ie. calc a checksum and I wanted to write stuff directly to HDFS etc. I think we should go for streams or we pass in a lucene directory while I think unless really needed we should stick with a stream.
        Hide
        Simon Willnauer added a comment -

        here is a patch that removes the file based store / load methods. Once thing I don't like about this is that we use one an the same file name in solr to store stuff. We could possibly fix this and make it bw compatible by pulling a file name from the factory for instance but even that is kind of flaky. Maybe somebody has a better idea.

        Yet I think one of the biggest issues here is that we don't really have a header on the actual implementation. Ie you could simply load a FST from a FST suggester into WFST but the results would be bogus. I think we should add real headers to the files to fail early and give good error messages.

        Show
        Simon Willnauer added a comment - here is a patch that removes the file based store / load methods. Once thing I don't like about this is that we use one an the same file name in solr to store stuff. We could possibly fix this and make it bw compatible by pulling a file name from the factory for instance but even that is kind of flaky. Maybe somebody has a better idea. Yet I think one of the biggest issues here is that we don't really have a header on the actual implementation. Ie you could simply load a FST from a FST suggester into WFST but the results would be bogus. I think we should add real headers to the files to fail early and give good error messages.
        Hide
        Simon Willnauer added a comment -

        here is a patch against 3.x

        Show
        Simon Willnauer added a comment - here is a patch against 3.x
        Hide
        Simon Willnauer added a comment -

        here is a new patch removing the file based store / load methods and adding a fileName method to solrs factories to maintain compatibility. I will commit this soon.

        Show
        Simon Willnauer added a comment - here is a new patch removing the file based store / load methods and adding a fileName method to solrs factories to maintain compatibility. I will commit this soon.
        Hide
        Simon Willnauer added a comment -

        backported committed patches to 3.x in rev 1297946

        Show
        Simon Willnauer added a comment - backported committed patches to 3.x in rev 1297946
        Hide
        Simon Willnauer added a comment -

        I committed the last patch to trunk and backported to 3x.

        thanks guys

        Show
        Simon Willnauer added a comment - I committed the last patch to trunk and backported to 3x. thanks guys
        Hide
        Robert Muir added a comment -

        Thanks Simon! hard work but I expect there is more.

        Hopefully this module will grow into something much bigger...

        Show
        Robert Muir added a comment - Thanks Simon! hard work but I expect there is more. Hopefully this module will grow into something much bigger...

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development