Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: LARQ 1.0.0
    • Fix Version/s: Jena 2.11.0
    • Component/s: LARQ
    • Labels:
      None
    • Environment:

      Fuseki

Description

      In previous versions the LARQ score seemed to be normalized to the range [0, 1]. In LARQ 1.0.0 some scores can be higher than 1.

      Normalized scores are needed to filter SPARQL results (so that only items above a certain quality are shown).

Activity

        laotao created issue -
        Paolo Castagna added a comment -

        Thanks Tao.

        All searches call the IndexLARQ.search(...) [1]
        There is a getMaxScore method in Lucene's TopDocs [2] which we can use to normalize scores for the same query.

        [1] http://svn.apache.org/repos/asf/incubator/jena/Jena2/LARQ/trunk/src/main/java/org/apache/jena/larq/IndexLARQ.java
        [2] http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/TopDocs.html#getMaxScore%28%29
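        As a sketch of the normalization Paolo describes, dividing each raw score by the query's maximum score (what TopDocs.getMaxScore() exposes) maps results into [0, 1]. This is illustrative only, not actual LARQ/IndexLARQ code:

```python
# Illustrative sketch, not LARQ code: normalize per-query scores by the
# maximum score of the result set, as TopDocs.getMaxScore() would allow.

def normalize_scores(hits):
    """hits: list of (doc_id, raw_score) pairs from a single query."""
    if not hits:
        return []
    max_score = max(score for _, score in hits)
    if max_score <= 0:
        return [(doc, 0.0) for doc, _ in hits]
    return [(doc, score / max_score) for doc, score in hits]
```

        Note this only makes scores comparable within a single query's result set, not across queries.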

        Stephen Allen added a comment - edited

        I haven't really had a chance to use LARQ much, but I'm not sure that normalizing the scores is necessarily the best thing to do. It makes a bunch of assumptions (that the underlying data isn't changing, that there is a linear relationship between scores, that scores mean something across queries, etc.). Importantly, as the scores between different queries are not related to each other, an arbitrary value for the FILTER clause doesn't make sense. If the result of your query was a bunch of really bad matches, but they all had the same score, then they'd show up as 100% relevance, and then pass your filter (see [1]).

        Instead, I think you should use ORDER BY on the score, and then maybe LIMIT the results to a subset. Or if you really must have a normalized result, then retrieve all the results and calculate the normalized score in your application (although I encourage you not to). You could even achieve this in the query itself with the use of aggregation and subqueries. More info about scoring at [2].

        [1] http://wiki.apache.org/lucene-java/ScoresAsPercentages
        [2] http://lucene.apache.org/core/3_6_0/scoring.html
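        A small illustration of the pitfall Stephen describes (hypothetical scores, not LARQ output): normalizing by the per-query maximum makes even uniformly poor matches look like 100% relevance and pass a fixed threshold, while sorting and limiting sidesteps the problem:

```python
def filter_normalized(hits, threshold):
    """Normalize scores by the per-query maximum, then apply a fixed filter."""
    max_score = max(score for _, score in hits)
    return [(doc, score / max_score) for doc, score in hits
            if score / max_score >= threshold]

def top_k(hits, k):
    """ORDER BY score descending, LIMIT k -- done client-side."""
    return sorted(hits, key=lambda h: h[1], reverse=True)[:k]

# Three equally poor matches: after normalization each one scores 1.0,
# so all of them pass a 0.5 "quality" filter.
bad_hits = [("a", 0.01), ("b", 0.01), ("c", 0.01)]
print(filter_normalized(bad_hits, 0.5))  # all three survive the filter
print(top_k(bad_hits, 2))                # at least bounded to 2 results
```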

        laotao added a comment - edited

        Raw Lucene scores (normalized or not) don't really reflect the absolute similarity between a query and the results. Maybe the TF-IDF algorithm is not appropriate for calculating these similarities for RDF literals, because they are usually short compared to typical (web) documents. Have you considered other algorithms, e.g. minimal edit distance?

        Another way to improve the search, I think, is to take the underlying ontology constructs into account. For example, when an exact basic pattern match has an owl:differentFrom relationship with a Lucene match, the similarity score of the latter should be cut significantly (even to zero, so that the Lucene match is abandoned). This is important because many resources which are owl:differentFrom each other can be very similar, literally.
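        For reference, the minimal edit distance laotao mentions can be sketched with the standard dynamic-programming (Levenshtein) computation. This is a generic illustration, not something LARQ implements:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimal (Levenshtein) edit distance: the cheapest sequence of
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]
```

        An absolute (rather than query-relative) similarity in [0, 1] could then be derived as 1 - dist / max(len(a), len(b)).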

        Paolo Castagna added a comment -

        > Instead, I think you should use ORDER BY on the score, and then maybe LIMIT the results to a subset.

        Hi Stephen, thanks for adding your comments. And, yes, this is what I was trying to argue on the jena-users mailing list. We both pointed at http://wiki.apache.org/lucene-java/ScoresAsPercentages.

        However, the behaviour of LARQ has changed with the latest release: LARQ no longer reports normalised scores. This is better; however, it breaks compatibility with the past (I don't think it's a problem, since probably only a few people are actually using LARQ in any serious/production environment. I'm ready to be proven wrong on this, if that's not the case). In particular, LARQ's documentation says you can limit the number of matches using:

        • ?lit pf:textMatch ( '+text' 100 ) . # Limit to at most 100 hits
        • ?lit pf:textMatch ( '+text' 0.5 ) . # Limit to Lucene scores of 0.5 and over.
        • ?lit pf:textMatch ( '+text' 0.5 100 ) . # Limit to scores of 0.5 and limit to 100 hits

        I think we should just allow for (and this is my favourite choice):

        • ?lit pf:textMatch ( '+text' 100 ) . # Limit to at most 100 hits

        If we are happy with this, I can close this issue as "Won't fix", explaining why. I can then open another issue to remove the ability to limit results by score.

        Or, less work (I am happy with this option as well): we just change the documentation appropriately, specifying that the score is not normalised and varies query by query (and in future it might change as/if we add new indexing systems, such as Solr, ElasticSearch, etc.).

        Lao, using ontology constructs to improve search results is a very interesting topic, but not quite relevant to this issue. Here we are not trying to develop a better scoring system for LARQ. We are discussing whether we should return normalised or non normalised scores to the users. Non normalised scores cause a small issue only when people try to limit the number of matches via ?lit pf:textMatch ( '+text' 0.5 ).

        Lao, Stephen (others?) what do you think?

        Rob Vesse added a comment -

        Paolo, I don't think there is any reason to not allow users to limit by score.

        Regardless of the perceived value of Lucene scores, it's a useful feature to have whether or not the scores are normalized. Just because the scores aren't normalized doesn't mean a user won't want to limit by the score.

        Paolo Castagna added a comment -

        > Just because the scores aren't normalized doesn't mean a user won't want to limit by the score

        In my opinion, offering the feature of 'limiting matches by score' implies the use of 'normalized scores'.
        If scores are not normalized, how are you supposed to know which value to set? 42? 8? 3? 2.5?
        It depends on the query as well... and on the data: delete a document, run the same query, and the scores will be different.

        Rob Vesse added a comment -

        > In my opinion, offering the feature of 'limiting matches by score' => use of 'normalized scores'.

        That's not my opinion: dotNetRDF supports a full text search extension that follows the LARQ-style syntax, and the scores were never normalized with Lucene.Net (I'm using 2.9.x currently, which has behaviour pretty close to 3.0 as I understand it), yet I still supported the ability to limit by score.

        > If scores are not normalized how you are supposed to know which value to set? 42? 8? 3? 2.5?

        Just because you don't necessarily know what value to set doesn't mean we should remove an existing feature. This feature has been available to users for some time and is also supported by other systems like dotNetRDF which follow the LARQ style syntax with the intention of fostering some community standardisation around full text search in SPARQL syntax.

        The fact that the limit is subjective is not necessarily a bad thing. In the same way I can limit overall results by adding a LIMIT clause I can limit just my full text results by adding either a score/hit limit and refine that on the fly depending on how many results I'm getting/wanting.

        Paolo Castagna added a comment -

        Hi Rob, ok. So, I propose to close this issue as 'Won't Fix' (I'll do that in one or two days, to see if others want to share their point of view).
        Then, I'll add a note to the LARQ documentation clarifying that scores are not normalized, and I'll change the examples which might lead users to think they are.

        Lao, if you really want normalised scores you'll need to implement that client side, as Stephen has already suggested.

        Good news, very little to do.

        Andy Seaborne added a comment -

        Overtaken by jena-text

        Andy Seaborne made changes -
        Field Original Value New Value
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s Jena 2.11.0 [ 12324437 ]
        Resolution Not A Problem [ 8 ]
        Andy Seaborne made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

People

  • Assignee:
    Unassigned
  • Reporter:
    laotao
  • Votes:
    0
  • Watchers:
    1
