Lucene - Core
  1. Lucene - Core
  2. LUCENE-4906

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient...

      Or, if (Rob's idea we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet.

      1. LUCENE-4906.patch
        8 kB
        Michael McCandless
      2. LUCENE-4906.patch
        48 kB
        Luca Cavanna
      3. LUCENE-4906.patch
        9 kB
        Michael McCandless

        Activity

        Hide
        Luca Cavanna added a comment -

        Sounds interesting. Is anybody working on this already? I'd like to volunteer.
        What do you have in mind exactly? Now the format method returns a string. What would you like to see as output instead?

        Show
        Luca Cavanna added a comment - Sounds interesting. Is anybody working on this already? I'd like to volunteer. What do you have in mind exactly? Now the format method returns a string. What would you like to see as output instead?
        Hide
        Robert Muir added a comment -

        Luca i dont think anyone is working on it.

        I tried about a week ago to do this, but I couldn't figure out how to do it without making the api really ugly: I think thats the challenging part.

        Since doing this custom formatting is an expert thing, I think its important that its done in such a way that simple ordinary highlighting use cases don't become more complicated.

        Show
        Robert Muir added a comment - Luca i dont think anyone is working on it. I tried about a week ago to do this, but I couldn't figure out how to do it without making the api really ugly: I think thats the challenging part. Since doing this custom formatting is an expert thing, I think its important that its done in such a way that simple ordinary highlighting use cases don't become more complicated.
        Hide
        Luca Cavanna added a comment -

        I see! If you couldn't make it how can I make it?

        But the idea is that you could have some kind of object as output instead of a string, like for example an array of tokens plus maybe some more information?

        It would avoid to parse again the string output and somehow re-analyze the text as needed to have a snippet that we could provide as output directly?

        Show
        Luca Cavanna added a comment - I see! If you couldn't make it how can I make it? But the idea is that you could have some kind of object as output instead of a string, like for example an array of tokens plus maybe some more information? It would avoid to parse again the string output and somehow re-analyze the text as needed to have a snippet that we could provide as output directly?
        Hide
        Robert Muir added a comment -

        Well i didnt try many possibilities, just played for an hour. So more exploration and ideas would be good.

        but here was my high level idea:
        If we look at PostingsHighlighter, it has 'simple' methods and 'advanced' methods:

        /** simple */
        String[] highlight(String field, Query query, IndexSearcher searcher, TopDocs topDocs);
        /** advanced */
        Map<String,String[]> highlightFields(String fieldsIn[], Query query, IndexSearcher searcher, int[] docidsIn, int maxPassagesIn[]);
        

        I think its important that the simple methods remain simple... but it would be good for 'advanced' users to have formatters that return maybe some more complex stuff.

        I think there are a few ways that could be done:

        1. internally formatter returns Object, and the simple methods call toString()
        2. generics Unfortunately to me generics are never simple...
        3. bite the bullet and change all method signatures to return some kind of different "result" structure. Maybe this could be done in a way that even simplifies the 'simple' signatures, e.g. maybe its confusing that this thing returns an array in parallel to your topdocs today..
        4. other ideas...

        In general I also think a nice baby step to making this more flexible is LUCENE-4896. This would be a good one to do first, and i think its much easier and an obvious win!

        Show
        Robert Muir added a comment - Well i didnt try many possibilities, just played for an hour. So more exploration and ideas would be good. but here was my high level idea: If we look at PostingsHighlighter, it has 'simple' methods and 'advanced' methods: /** simple */ String [] highlight( String field, Query query, IndexSearcher searcher, TopDocs topDocs); /** advanced */ Map< String , String []> highlightFields( String fieldsIn[], Query query, IndexSearcher searcher, int [] docidsIn, int maxPassagesIn[]); I think its important that the simple methods remain simple... but it would be good for 'advanced' users to have formatters that return maybe some more complex stuff. I think there are a few ways that could be done: internally formatter returns Object, and the simple methods call toString() generics Unfortunately to me generics are never simple... bite the bullet and change all method signatures to return some kind of different "result" structure. Maybe this could be done in a way that even simplifies the 'simple' signatures, e.g. maybe its confusing that this thing returns an array in parallel to your topdocs today.. other ideas... In general I also think a nice baby step to making this more flexible is LUCENE-4896 . This would be a good one to do first, and i think its much easier and an obvious win!
        Hide
        Luca Cavanna added a comment -

        Sure, I hadn't seen that issue yet but I was about to propose the same looking at the code.

        Thanks for your insight!
        I thought about generics too, but then we'd have to be really careful otherwise the generics policeman jumps in

        I'll play around with some ideas and post the results here as soon as I have something.

        Show
        Luca Cavanna added a comment - Sure, I hadn't seen that issue yet but I was about to propose the same looking at the code. Thanks for your insight! I thought about generics too, but then we'd have to be really careful otherwise the generics policeman jumps in I'll play around with some ideas and post the results here as soon as I have something.
        Hide
        Michael McCandless added a comment -

        Here's a simple patch, implementing Robs #1 idea (PassageFormatter.format returns Object, and then add an expert PostingsHighlighter.highlightFieldsAsObjects).

        The change seems minimal and seems to work (I added a basic test) ...

        Show
        Michael McCandless added a comment - Here's a simple patch, implementing Robs #1 idea (PassageFormatter.format returns Object, and then add an expert PostingsHighlighter.highlightFieldsAsObjects). The change seems minimal and seems to work (I added a basic test) ...
        Hide
        Luca Cavanna added a comment -

        Hi Mike,
        I had a look at your patch, looks good to me. Being able to get back arbitrary objects is a great improvement.

        The only thing I would love to improve here is the need to cast the returned Objects to the type that the custom PassageFormatter uses.

        We could work around this using generics, but the fact that the PassageFormatter can vary per field makes it harder. The only way I see to work around this is to prevent the PassageFormatter from returning different types of objects per field. That would mean that even though every field can have his own PassageFormatter, they all must return the same type. It kinda makes sense to me since I wouldn't want to have heterogeneous types in the Map<Integer, Object>, but that is something that's currently possible. What do you think?

        Show
        Luca Cavanna added a comment - Hi Mike, I had a look at your patch, looks good to me. Being able to get back arbitrary objects is a great improvement. The only thing I would love to improve here is the need to cast the returned Objects to the type that the custom PassageFormatter uses. We could work around this using generics, but the fact that the PassageFormatter can vary per field makes it harder. The only way I see to work around this is to prevent the PassageFormatter from returning different types of objects per field. That would mean that even though every field can have his own PassageFormatter, they all must return the same type. It kinda makes sense to me since I wouldn't want to have heterogeneous types in the Map<Integer, Object>, but that is something that's currently possible. What do you think?
        Hide
        Shai Erera added a comment -

        I am not sure we should complicate the API? Performance-wise this casting will have marginal effects (if at all), considering it's done, usually, for the top-N documents only for which you compute highlighting...

        Generics, while sometimes are great, some other times actually limit the API. It doesn't feel like PostHighlighter is a good place for generics.

        Show
        Shai Erera added a comment - I am not sure we should complicate the API? Performance-wise this casting will have marginal effects (if at all), considering it's done, usually, for the top-N documents only for which you compute highlighting... Generics, while sometimes are great, some other times actually limit the API. It doesn't feel like PostHighlighter is a good place for generics.
        Hide
        Luca Cavanna added a comment -

        I don't see why adding generics would complicate or limit the API. To me it would make it simpler and nicer (not a big change in terms of api itself though).

        Attaching a patch with my thoughts to make more concrete what I had in mind, regardless of whether it will be integrated or not.

        It's backwards compatible (even though the class is marked experimental): we have an abstract postings highlighter class that does most of the work and returns arbitrary objects (uses generics in order to do so). The PostingsHighlighter is its natural extension that returns String snippets.

        I updated Mike's new test according to my changes. It should make it easier to understand what's needed to work with arbitrary objects in terms of code using this approach.

        I find it more explicit that if you want to extend the abstract one you have to declare what type the formatter is supposed to return, which makes it more explicit and avoids any cast.

        Limitations with this approach:
        1) as mentioned before (to me it's more of a benefit) there cannot be heterogeneous types returned by the same highlighter instance.
        2) generics don't play well with arrays, thus all the highlight methods that returned arrays are still in the subclass that returns string snippets to keep backwards compatibility. Moving them to the base class would most likely require to return List<FormattedPassage> instead (not backward compatible).

        I haven't updated the javadoc yet, but if you like my approach I can go ahead with it.

        I would love to hear what you guys think about it. Generics can be scary... but useful sometimes too

        Show
        Luca Cavanna added a comment - I don't see why adding generics would complicate or limit the API. To me it would make it simpler and nicer (not a big change in terms of api itself though). Attaching a patch with my thoughts to make more concrete what I had in mind, regardless of whether it will be integrated or not. It's backwards compatible (even though the class is marked experimental): we have an abstract postings highlighter class that does most of the work and returns arbitrary objects (uses generics in order to do so). The PostingsHighlighter is its natural extension that returns String snippets. I updated Mike's new test according to my changes. It should make it easier to understand what's needed to work with arbitrary objects in terms of code using this approach. I find it more explicit that if you want to extend the abstract one you have to declare what type the formatter is supposed to return, which makes it more explicit and avoids any cast. Limitations with this approach: 1) as mentioned before (to me it's more of a benefit) there cannot be heterogeneous types returned by the same highlighter instance. 2) generics don't play well with arrays, thus all the highlight methods that returned arrays are still in the subclass that returns string snippets to keep backwards compatibility. Moving them to the base class would most likely require to return List<FormattedPassage> instead (not backward compatible). I haven't updated the javadoc yet, but if you like my approach I can go ahead with it. I would love to hear what you guys think about it. Generics can be scary... but useful sometimes too
        Hide
        Michael McCandless added a comment -

        Thanks Luca, that patch is nice.

        I don't mind the generics ... but I do think this is added complexity (now there are two public classes, and <T> throughout) for what is a supremely expert use case. This is why I opted initially for the Object + super-expert users simply cast it: it's minimal added complexity. It also matches the current approach on LUCENE-5133 (though I had started with a class there to break each snippet into the match and non-match parts).

        Unfortunately, highlighting APIs seem to quickly become complex / hard to use, and one of the awesome things about this highlighter is its API is very simple compared to the others ...

        Show
        Michael McCandless added a comment - Thanks Luca, that patch is nice. I don't mind the generics ... but I do think this is added complexity (now there are two public classes, and <T> throughout) for what is a supremely expert use case. This is why I opted initially for the Object + super-expert users simply cast it: it's minimal added complexity. It also matches the current approach on LUCENE-5133 (though I had started with a class there to break each snippet into the match and non-match parts). Unfortunately, highlighting APIs seem to quickly become complex / hard to use, and one of the awesome things about this highlighter is its API is very simple compared to the others ...
        Hide
        Luca Cavanna added a comment -

        Hi Mike,
        I definitely agree that highlighting api should be simple and the postings highlighter is probably the only one that's really easy to use.

        On the other hand, I think it's good to make explicit that if you use a Formatter<YourObject>, YourObject is what you're going to get back from the highlighter. People using the string version wouldn't notice the change, while advanced users would have to extend the base class and get type safety too, that in my opinion makes it clearier and easier. Using Object feels to me a little old-fashioned and bogus, but again that's probably me

        I do trust your experience though. If you think the object version is better that's fine with me. What I care about is that this improvement gets committed soon, since it's a really useful one

        Thanks a lot for sharing your ideas

        Show
        Luca Cavanna added a comment - Hi Mike, I definitely agree that highlighting api should be simple and the postings highlighter is probably the only one that's really easy to use. On the other hand, I think it's good to make explicit that if you use a Formatter<YourObject>, YourObject is what you're going to get back from the highlighter. People using the string version wouldn't notice the change, while advanced users would have to extend the base class and get type safety too, that in my opinion makes it clearier and easier. Using Object feels to me a little old-fashioned and bogus, but again that's probably me I do trust your experience though. If you think the object version is better that's fine with me. What I care about is that this improvement gets committed soon, since it's a really useful one Thanks a lot for sharing your ideas
        Hide
        Luca Cavanna added a comment -

        One more thing: re-reading Robert's previous comments, I find also interesting the idea he had about changing the api to return a proper object instead of the Map<String, String[]>, or the String[] for the simplest methods. I wonder if it's worth to address this as well in this issue, or if the current api is clear enough in your opinion. Any thoughts?

        Show
        Luca Cavanna added a comment - One more thing: re-reading Robert's previous comments, I find also interesting the idea he had about changing the api to return a proper object instead of the Map<String, String[]>, or the String[] for the simplest methods. I wonder if it's worth to address this as well in this issue, or if the current api is clear enough in your opinion. Any thoughts?
        Hide
        Michael McCandless added a comment -

        I think the challenge here is that these are not just "advanced" uses;
        they are super expert uses, and I don't feel like that justifies the
        added cost of generics for normal users.

        There are definitely times when generics make sense but I don't think
        this case applies ...

        I agree the Object approach is rather old fashioned ... but it should
        still work for these super expert cases. So, it's not ideal, but it's
        a step forward at least (progress not perfection) ... I'd like to
        commit the Object approach so we move forward.

        If future use cases emerge that makes the generics use-case more
        common we can always re-visit this (this API is experimental; we are
        free to change it), so none of this is set in stone ...

        Show
        Michael McCandless added a comment - I think the challenge here is that these are not just "advanced" uses; they are super expert uses, and I don't feel like that justifies the added cost of generics for normal users. There are definitely times when generics make sense but I don't think this case applies ... I agree the Object approach is rather old fashioned ... but it should still work for these super expert cases. So, it's not ideal, but it's a step forward at least (progress not perfection) ... I'd like to commit the Object approach so we move forward. If future use cases emerge that makes the generics use-case more common we can always re-visit this (this API is experimental; we are free to change it), so none of this is set in stone ...
        Hide
        Luca Cavanna added a comment -

        No problem, thanks Mike!

        Show
        Luca Cavanna added a comment - No problem, thanks Mike!
        Hide
        Luca Cavanna added a comment -

        How about committing this? Would be great to have it with the next release!

        Show
        Luca Cavanna added a comment - How about committing this? Would be great to have it with the next release!
        Hide
        Michael McCandless added a comment -

        How about committing this? Would be great to have it with the next release!

        Thanks for the reminder Luca! This had fallen off the event horizon of my TODO list ...

        But, I don't think we should put this in 4.5 at the last minute; I'll commit for 4.6.

        Show
        Michael McCandless added a comment - How about committing this? Would be great to have it with the next release! Thanks for the reminder Luca! This had fallen off the event horizon of my TODO list ... But, I don't think we should put this in 4.5 at the last minute; I'll commit for 4.6.
        Hide
        Michael McCandless added a comment -

        New patch, removing nocommits; I think it's ready.

        Show
        Michael McCandless added a comment - New patch, removing nocommits; I think it's ready.
        Hide
        ASF subversion and git services added a comment -

        Commit 1522619 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1522619 ]

        LUCENE-4906: PostingsHighlighter: add expert API to render highlights to Object

        Show
        ASF subversion and git services added a comment - Commit 1522619 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1522619 ] LUCENE-4906 : PostingsHighlighter: add expert API to render highlights to Object
        Hide
        ASF subversion and git services added a comment -

        Commit 1522630 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1522630 ]

        LUCENE-4906: PostingsHighlighter: add expert API to render highlights to Object

        Show
        ASF subversion and git services added a comment - Commit 1522630 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1522630 ] LUCENE-4906 : PostingsHighlighter: add expert API to render highlights to Object
        Hide
        Luca Cavanna added a comment -

        Thanks Mike!

        Show
        Luca Cavanna added a comment - Thanks Mike!
        Hide
        Robert Muir added a comment -

        Can we fix the visibility of this method to be protected?

        Show
        Robert Muir added a comment - Can we fix the visibility of this method to be protected?
        Hide
        Michael McCandless added a comment -

        Can we fix the visibility of this method to be protected?

        So that apps that want to use it must subclass PostingsHighlighter? OK.

        Show
        Michael McCandless added a comment - Can we fix the visibility of this method to be protected? So that apps that want to use it must subclass PostingsHighlighter? OK.
        Hide
        ASF subversion and git services added a comment -

        Commit 1523225 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1523225 ]

        LUCENE-4906: make expert Object method protected

        Show
        ASF subversion and git services added a comment - Commit 1523225 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1523225 ] LUCENE-4906 : make expert Object method protected
        Hide
        ASF subversion and git services added a comment -

        Commit 1523226 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1523226 ]

        LUCENE-4906: make expert Object method protected

        Show
        ASF subversion and git services added a comment - Commit 1523226 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1523226 ] LUCENE-4906 : make expert Object method protected
        Hide
        Robert Muir added a comment -

        Well I think its good for a few reasons:
        1. even though protected still keeps it in javadocs, users who use an IDE and type "highlighter." and wait for autocomplete see less methods in their API. so its still less overwhelming here. This is the most important benefit.
        2. encourages users to e.g. implement their public own method with a proper return value (e.g. JsonObject or whatever it is you are doing). This way there is just one cast from object and its contained inside their custom Highlighter, otherwise there rest of their app is type safe.

        and I think the functionality of highlighting to something other than string is sufficiently custom that its not really an imposition. Are there other expert methods in PH that belong in the same category?

        Show
        Robert Muir added a comment - Well I think its good for a few reasons: 1. even though protected still keeps it in javadocs, users who use an IDE and type "highlighter." and wait for autocomplete see less methods in their API. so its still less overwhelming here. This is the most important benefit. 2. encourages users to e.g. implement their public own method with a proper return value (e.g. JsonObject or whatever it is you are doing). This way there is just one cast from object and its contained inside their custom Highlighter, otherwise there rest of their app is type safe. and I think the functionality of highlighting to something other than string is sufficiently custom that its not really an imposition. Are there other expert methods in PH that belong in the same category?
        Hide
        ASF subversion and git services added a comment -

        Commit 1556786 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1556786 ]

        LUCENE-4906, LUCENE-5376: using the expert 'render to Object' APIs in PostingsHighlighter to render directly to JSONArray in lucene server

        Show
        ASF subversion and git services added a comment - Commit 1556786 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1556786 ] LUCENE-4906 , LUCENE-5376 : using the expert 'render to Object' APIs in PostingsHighlighter to render directly to JSONArray in lucene server

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development