Lucene - Core
  1. Lucene - Core
  2. LUCENE-4012

Make all query classes serializable with Jackson, and provide a trivial query parser to consume them

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I started off on LUCENE-4004 wanting to use DisjunctionMaxQuery via a parser. However, this wasn't really because I thought that human beans should be improvisationally composing such thing. My real goal was to concoct a query tree over here, and then serialize it to send to Solr over there.

      It occurs to me that if the Xml parser is pretty good for this, JSON would be better. It further occurs to me that the query classes may already all work with Jackson, and, if they don't, the required tweaks will be quite small. By allowing Jackson to write out class names as needed, you get the ability to serialize any query, so long as the other side has the classes in class path. A trifle verbose, but not as verbose as XML, and furthermore squishable (though not in a URL) via SMILE or BSON.

      So, the goal of this JIRA is to accumulate tweaks to the query classes to make them more 'bean pattern'. An alternative would be Jackson annotations. However, I suspect that folks would be happier to minimize the level of coupling here; in the extreme, the trivial parser could live in contrib if no one wants a dependency, even optional, on Jackson itself.

      1. bq.patch
        0.6 kB
        Benson Margulies

        Activity

        Hide
        Robert Muir added a comment -

        I think if it works for you, just iterate in your github and ping the issue when you make progress?
        Otherwise we worry too much about where/how the code sits when that isn't really so important
        at this stage.

        As far as final integration, I think there are a number of ways to do it but its really
        unrelated to making progress here. One suggestion might be to split queryparser/ module
        like analyzers/ so we have:

        • classic/ [including things based on it: complexPhrase, ext, analyzing]
        • flexible/ [including precedence which is based on it]
        • xml/
        • json/

        This could probably make things less confusing, as currently queryparser/ is a mix of
        different frameworks with different dependencies (e.g. xml depends on queries/ and sandbox/,
        but the others dont, and json will depend on jackson and maybe other stuff, etc, etc).

        And then finally, probably a followup issue to do solr integration (i'm more fuzzy on that).

        Show
        Robert Muir added a comment - I think if it works for you, just iterate in your github and ping the issue when you make progress? Otherwise we worry too much about where/how the code sits when that isn't really so important at this stage. As far as final integration, I think there are a number of ways to do it but its really unrelated to making progress here. One suggestion might be to split queryparser/ module like analyzers/ so we have: classic/ [including things based on it: complexPhrase, ext, analyzing] flexible/ [including precedence which is based on it] xml/ json/ This could probably make things less confusing, as currently queryparser/ is a mix of different frameworks with different dependencies (e.g. xml depends on queries/ and sandbox/, but the others dont, and json will depend on jackson and maybe other stuff, etc, etc). And then finally, probably a followup issue to do solr integration (i'm more fuzzy on that).
        Hide
        Benson Margulies added a comment -

        Right. The tiny patch is an example of the sort of change, and I think it's always good to have 'getters'.

        In fact, I don't even need that patch for the code to work, but i thought it was a good idea.

        I'll continue to set this up. Let me know when/how/if you want it shaped to go into the tree somewhere.

        Show
        Benson Margulies added a comment - Right. The tiny patch is an example of the sort of change, and I think it's always good to have 'getters'. In fact, I don't even need that patch for the code to work, but i thought it was a good idea. I'll continue to set this up. Let me know when/how/if you want it shaped to go into the tree somewhere.
        Hide
        Robert Muir added a comment -

        The query integration at a glance looks very non-invasive from my perspective!

        I think we should pursue this? We might run into some tricky parts but we could
        have this component as a separate pluggable module with probably minimal changes
        to the core Query apis right?

        Show
        Robert Muir added a comment - The query integration at a glance looks very non-invasive from my perspective! I think we should pursue this? We might run into some tricky parts but we could have this component as a separate pluggable module with probably minimal changes to the core Query apis right?
        Hide
        Benson Margulies added a comment -

        git://github.com/bimargulies/lucene-json-qp.git

        Rob, here you will find the promised look at how this works. No QP yet, that's the easy part.

        Apologies for the Maven project, but that's what I do quickly.

        No patches to core required at all yet. if you run the unit test, you will see what the json looks like. Concise, no surprise, it's not.

        Show
        Benson Margulies added a comment - git://github.com/bimargulies/lucene-json-qp.git Rob, here you will find the promised look at how this works. No QP yet, that's the easy part. Apologies for the Maven project, but that's what I do quickly. No patches to core required at all yet. if you run the unit test, you will see what the json looks like. Concise, no surprise, it's not.
        Hide
        Robert Muir added a comment -

        My only comment on that: it would help perpetuate this inverted 'disable/omit' stuff
        that i just cant stand. I really wish BQ had 'enableCoord' in the ctor with true as
        the default: of course a horrible thing to try to fix but we were able to fix this
        for omitTF etc so I'm not yet losing hope.

        Show
        Robert Muir added a comment - My only comment on that: it would help perpetuate this inverted 'disable/omit' stuff that i just cant stand. I really wish BQ had 'enableCoord' in the ctor with true as the default: of course a horrible thing to try to fix but we were able to fix this for omitTF etc so I'm not yet losing hope.
        Hide
        Benson Margulies added a comment -

        Add accessor to BooleanQuery.java to allow serializing 'disable-coord'.

        Show
        Benson Margulies added a comment - Add accessor to BooleanQuery.java to allow serializing 'disable-coord'.
        Hide
        Robert Muir added a comment -

        Yeah a sketch (maybe just Term and Boolean or something ?) would be cool, maybe my paranoia
        is unjustified... then we could see what it actually would need to look like and think about
        the backwards-compatibility/API costs etc would be.

        Show
        Robert Muir added a comment - Yeah a sketch (maybe just Term and Boolean or something ?) would be cool, maybe my paranoia is unjustified... then we could see what it actually would need to look like and think about the backwards-compatibility/API costs etc would be.
        Hide
        Benson Margulies added a comment -

        Rob,

        These are not Java annotations, they are specific to Jackson. (Though Jackson will pay attention to the JAX-B annotations in Java 6, I fear that this is a cure worse than the disease.) So hanging them on the real classes makes for a hard dependency on Jackson to compile, and I'm not sure myself that this is a good idea in the middle of Lucene. If I stick to the mixin approach, no annotation are needed at all.

        As for the rest, I think that we agree. The point of the annotations is to only serialize the stuff that makes up the plausible, stable, sort of attributes that you would support in a parser syntax. (though I suppose someone might want to SolrCloudify some really expert complexity), and the selective annotation avoids the rest.

        So I'm going to sketch something out and we'll all see what it looks like.

        Show
        Benson Margulies added a comment - Rob, These are not Java annotations, they are specific to Jackson. (Though Jackson will pay attention to the JAX-B annotations in Java 6, I fear that this is a cure worse than the disease.) So hanging them on the real classes makes for a hard dependency on Jackson to compile, and I'm not sure myself that this is a good idea in the middle of Lucene. If I stick to the mixin approach, no annotation are needed at all. As for the rest, I think that we agree. The point of the annotations is to only serialize the stuff that makes up the plausible, stable, sort of attributes that you would support in a parser syntax. (though I suppose someone might want to SolrCloudify some really expert complexity), and the selective annotation avoids the rest. So I'm going to sketch something out and we'll all see what it looks like.
        Hide
        Robert Muir added a comment -

        First, Jackson is far, far, less implementation-sensitive than the built-in Java serialization. It looks at the public API of constructors, getters, and setters. How likely are you to change the API in a way that invalidates the idea that a TermQuery consists of a term name and a value?

        TermQuery currently has a bogus constructor today:

        /** Expert: constructs a TermQuery that will use the
          *  provided docFreq instead of looking up the docFreq
          *  against the searcher. */
        public TermQuery(Term t, int docFreq) {
        

        Its bogus because some scoring models may not use docFreq at all (e.g. language modelling): I know why it exists (for one contrib query), I think we have a plan to fix it, but its just not yet done... like many things

        But thats an example of one I think is fair to remove in a minor release: besides being bogus, its documented as Expert,
        so its fair game.

        I'm just bringing this up as a counterexample... I thought serialization was harmless before myself
        until I tried to make useful changes (like yanking vector-space model out of the scoring system) and was
        totally blocked by serialization. So I'm gonna be suspicious of the word no matter what

        If it were up to me, I'd start by annotating the actually classes themselves with Jackson @nnotations for this purpose, and only create mixins when and if an incompatible change happened, but I'm not really opinionated.

        Are these annotations in java 6 API? Annotations are no different than other pieces of code, they are an additional
        responsibility. As far as @Deprecated, we fail the build if we add @deprecated javadocs but forget it, as far as
        @Override, well nothing automated yet, but we should be somehow enforcing that as well. Besides the fact we don't
        want to add any dependencies to the lucene core, how would we know such annotations were correct?

        Would it be sensible to start to concoct a contrib module with all the jackson in it, accompanied by benign 'add a few getters and setters' classes to the queries?

        Don't let me get in your way: I'm just the devil's advocate when i hear serialization. As far as adding getters and setters
        to queries I'm not sure if they are benign or not without us looking at them. For example, AutomatonQuery and all of its
        subclasses: Wildcard, Regexp, etc are totally immutable once created. This is for good reasons so I don't think we should
        add any setters to these classes, and the stuff it stores internally shouldn't be accessible via a getter.

        Show
        Robert Muir added a comment - First, Jackson is far, far, less implementation-sensitive than the built-in Java serialization. It looks at the public API of constructors, getters, and setters. How likely are you to change the API in a way that invalidates the idea that a TermQuery consists of a term name and a value? TermQuery currently has a bogus constructor today: /** Expert: constructs a TermQuery that will use the * provided docFreq instead of looking up the docFreq * against the searcher. */ public TermQuery(Term t, int docFreq) { Its bogus because some scoring models may not use docFreq at all (e.g. language modelling): I know why it exists (for one contrib query), I think we have a plan to fix it, but its just not yet done... like many things But thats an example of one I think is fair to remove in a minor release: besides being bogus, its documented as Expert, so its fair game. I'm just bringing this up as a counterexample... I thought serialization was harmless before myself until I tried to make useful changes (like yanking vector-space model out of the scoring system) and was totally blocked by serialization. So I'm gonna be suspicious of the word no matter what If it were up to me, I'd start by annotating the actually classes themselves with Jackson @nnotations for this purpose, and only create mixins when and if an incompatible change happened, but I'm not really opinionated. Are these annotations in java 6 API? Annotations are no different than other pieces of code, they are an additional responsibility. As far as @Deprecated, we fail the build if we add @deprecated javadocs but forget it, as far as @Override, well nothing automated yet, but we should be somehow enforcing that as well. Besides the fact we don't want to add any dependencies to the lucene core, how would we know such annotations were correct? Would it be sensible to start to concoct a contrib module with all the jackson in it, accompanied by benign 'add a few getters and setters' classes to the queries? Don't let me get in your way: I'm just the devil's advocate when i hear serialization. As far as adding getters and setters to queries I'm not sure if they are benign or not without us looking at them. For example, AutomatonQuery and all of its subclasses: Wildcard, Regexp, etc are totally immutable once created. This is for good reasons so I don't think we should add any setters to these classes, and the stuff it stores internally shouldn't be accessible via a getter.
        Hide
        Benson Margulies added a comment -

        Now you've got me thinking. Several points:

        First, Jackson is far, far, less implementation-sensitive than the built-in Java serialization. It looks at the public API of constructors, getters, and setters. How likely are you to change the API in a way that invalidates the idea that a TermQuery consists of a term name and a value?

        Second, this is not about long-term persistence. I'd be happy to document it as 'do not expect to move one of these across versions.' It's most obvious application is to move arbitrary queries around in SolrCloud.

        However, thirdly, you want it independent? I can make it independent. Here's how. Let's assume that you are willing to state the same sort of invariants for a query object that are embodied in parser support. Once you make a parser for something, you've promised that it has (at least) a particular set of inputs until the next major version, yes?

        So, for each Query object class, make a Jackson mixin class, that maps the 'official inputs' to the current collection of setters/constructor arguments. Want to re-wrangle the query classes? If you actually change the constructors/setters that you would use in a query parser for a particular query, you then have to change the mixin class.

        If it were up to me, I'd start by annotating the actually classes themselves with Jackson @nnotations for this purpose, and only create mixins when and if an incompatible change happened, but I'm not really opinionated.

        Would it be sensible to start to concoct a contrib module with all the jackson in it, accompanied by benign 'add a few getters and setters' classes to the queries?

        Show
        Benson Margulies added a comment - Now you've got me thinking. Several points: First, Jackson is far, far, less implementation-sensitive than the built-in Java serialization. It looks at the public API of constructors, getters, and setters. How likely are you to change the API in a way that invalidates the idea that a TermQuery consists of a term name and a value? Second, this is not about long-term persistence. I'd be happy to document it as 'do not expect to move one of these across versions.' It's most obvious application is to move arbitrary queries around in SolrCloud. However, thirdly, you want it independent? I can make it independent. Here's how. Let's assume that you are willing to state the same sort of invariants for a query object that are embodied in parser support. Once you make a parser for something, you've promised that it has (at least) a particular set of inputs until the next major version, yes? So, for each Query object class, make a Jackson mixin class, that maps the 'official inputs' to the current collection of setters/constructor arguments. Want to re-wrangle the query classes? If you actually change the constructors/setters that you would use in a query parser for a particular query, you then have to change the mixin class. If it were up to me, I'd start by annotating the actually classes themselves with Jackson @nnotations for this purpose, and only create mixins when and if an incompatible change happened, but I'm not really opinionated. Would it be sensible to start to concoct a contrib module with all the jackson in it, accompanied by benign 'add a few getters and setters' classes to the queries?
        Hide
        Robert Muir added a comment -

        My first question is, what about backward compatibility requirements? The problem
        is if people start using such a structure ('remote api') that depends upon
        the structure of our java source code, then they will be upset if we break it.

        If we totally change a Query's API, does that push all the responsibility of the
        API designer to deal with serialization backwards compat? APIs are difficult enough
        as-is just for java consumers.

        This seems similar to the java serialization issue (which we removed for this reason).
        Can't the serialization be totally independent?

        Show
        Robert Muir added a comment - My first question is, what about backward compatibility requirements? The problem is if people start using such a structure ('remote api') that depends upon the structure of our java source code, then they will be upset if we break it. If we totally change a Query's API, does that push all the responsibility of the API designer to deal with serialization backwards compat? APIs are difficult enough as-is just for java consumers. This seems similar to the java serialization issue (which we removed for this reason). Can't the serialization be totally independent?

          People

          • Assignee:
            Unassigned
            Reporter:
            Benson Margulies
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development