Solr
  1. Solr
  2. SOLR-651

A SearchComponent for fetching TF-IDF values

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: None
    • Labels:
      None

      Description

      A SearchComponent that can return TF-IDF vector for any given document in the SOLR index
      Query : A Document Number / a query identifying a Document
      Response : A Map of term vs.TF-IDF value of every term in the Selected
      Document
      Why ?

      Most of the Machine Learning Algorithms work on TFIDF representation of
      documents, hence adding a Request Handler proving the TFIDF representation
      will pave the way for incorporating Learning Paradigms to SOLR framework.

      1. SOLR-651.patch
        27 kB
        Grant Ingersoll
      2. SOLR-651.patch
        26 kB
        Grant Ingersoll
      3. SOLR-651.patch
        26 kB
        Grant Ingersoll
      4. SOLR-651.patch
        23 kB
        Grant Ingersoll
      5. SOLR-651.patch
        17 kB
        Grant Ingersoll
      6. SOLR-651.patch
        16 kB
        Grant Ingersoll
      7. SOLR-651-fixes.patch
        3 kB
        Yonik Seeley

        Activity

        Show
        Grant Ingersoll added a comment - See http://lucene.markmail.org/message/k7mr6d6qpulnizgb?q=Solr+TFIDF
        Hide
        Grant Ingersoll added a comment -

        Here's a first crack at this. It still needs more unit tests to exercise the various combination of options, but I think it is a reasonable first crack at the idea.

        Questions to be answered/things to still do:
        1. How do people like the format for output? It's basically broken down by doc, then field, then term, then term information, See the unit tests for some samples
        2. Would be good to have a more efficient lookup for IDF. At a minimum, a cache of IDF values would be useful, but the memory would need to be controlled. Lucene may do some caching under the hood, so that should be investigated more
        3. It relies on the query component doing it's thing. That is, you send in a query, start and rows, and this component just loops over the doc list and fetches. I could see a case for doing things separately, but that seems like duplication. People using this can just send explicit queries designed for this Component.
        4. Probably needs some error handling for documents that don't have term vectors, but haven't tested yet.

        Show
        Grant Ingersoll added a comment - Here's a first crack at this. It still needs more unit tests to exercise the various combination of options, but I think it is a reasonable first crack at the idea. Questions to be answered/things to still do: 1. How do people like the format for output? It's basically broken down by doc, then field, then term, then term information, See the unit tests for some samples 2. Would be good to have a more efficient lookup for IDF. At a minimum, a cache of IDF values would be useful, but the memory would need to be controlled. Lucene may do some caching under the hood, so that should be investigated more 3. It relies on the query component doing it's thing. That is, you send in a query, start and rows, and this component just loops over the doc list and fetches. I could see a case for doing things separately, but that seems like duplication. People using this can just send explicit queries designed for this Component. 4. Probably needs some error handling for documents that don't have term vectors, but haven't tested yet.
        Hide
        Noble Paul added a comment -

        The output looks fine . may be we can improve on iterations. We should add the uniqueKey value if it exists . That is the only way we the user can correlate it with the actual results

        <lst name="doc-3">
        <str name="<unique-key-field-name><unique-key-value></str>
        <lst name="cat">
        ....
        

        another observation is this gets a searcher and does not do a decref() after the use. The operations must be put in

        try{
        //the operations
        }finally{ 
        searcher.decref();
        }
        

        A suggestion on the name why not make it all lowercase. tvrh instead of tvRH

        Most of the users may not need the "results" so it must be able to be used separately as a requesthandler register it as "/tvrh"

        Show
        Noble Paul added a comment - The output looks fine . may be we can improve on iterations. We should add the uniqueKey value if it exists . That is the only way we the user can correlate it with the actual results <lst name= "doc-3" > <str name="<unique-key-field-name><unique-key-value></str> <lst name= "cat" > .... another observation is this gets a searcher and does not do a decref() after the use. The operations must be put in try { //the operations } finally { searcher.decref(); } A suggestion on the name why not make it all lowercase. tvrh instead of tvRH Most of the users may not need the "results" so it must be able to be used separately as a requesthandler register it as "/tvrh"
        Hide
        Grant Ingersoll added a comment -

        Addresses Noble's thoughts.

        Show
        Grant Ingersoll added a comment - Addresses Noble's thoughts.
        Hide
        Grant Ingersoll added a comment -

        Here's a start at making this support distributed. Still needs testing. I'm not sure if I'm doing the distributed right, but there ain't a whole lot of docs on it just yet, so I'm going based off of what I see in the other components. I'm especially not clear if I am understanding the stages correctly.

        Also, would be handy if there was a better way of testing the distributed stuff. So far, I call directly into the component to call distributedProcess, but would also be nice to have a harness that does what TestDistributedSearch does (i.e. setup a couple of Jetty instances and actually run them)

        Show
        Grant Ingersoll added a comment - Here's a start at making this support distributed. Still needs testing. I'm not sure if I'm doing the distributed right, but there ain't a whole lot of docs on it just yet, so I'm going based off of what I see in the other components. I'm especially not clear if I am understanding the stages correctly. Also, would be handy if there was a better way of testing the distributed stuff. So far, I call directly into the component to call distributedProcess, but would also be nice to have a harness that does what TestDistributedSearch does (i.e. setup a couple of Jetty instances and actually run them)
        Hide
        Shalin Shekhar Mangar added a comment -

        Grant, would it be better to handle the distributed case through another issue and commit the fully baked patch for a single server? A lot of people can start using it immediately. Distributed Search is still relatively rare though it is certainly picking up the pace.

        Show
        Shalin Shekhar Mangar added a comment - Grant, would it be better to handle the distributed case through another issue and commit the fully baked patch for a single server? A lot of people can start using it immediately. Distributed Search is still relatively rare though it is certainly picking up the pace.
        Hide
        Vaijanath N. Rao added a comment -

        Hi Grant,

        I have applied the patch and to start with it is really good. I just have few suggestions.

        a) if one set qt=tvrh tv=true does not make sense as the user is anyway requesting term vectors.
        b) Repeating uniqueKeyFieldName for every record does not add any value.
        c) Ideally what I would have liked is something below.
        <response>
        <lst name="termVectors">
        <lst name="firstdoc">
        <str name="term1">term1tf-idf</str>
        <str name="term2">term1tf-idf</str>
        ....
        </lst>
        <lst name="seconddoc">
        <str name="term1">term1tf-idf</str>
        ...
        </lst>
        ....
        </lst>
        </response>

        Having said all this, the above patch helps us a lot in terms of using solr for document clustering.

        --Thanks and Regards
        Vaijanath

        Show
        Vaijanath N. Rao added a comment - Hi Grant, I have applied the patch and to start with it is really good. I just have few suggestions. a) if one set qt=tvrh tv=true does not make sense as the user is anyway requesting term vectors. b) Repeating uniqueKeyFieldName for every record does not add any value. c) Ideally what I would have liked is something below. <response> <lst name="termVectors"> <lst name="firstdoc"> <str name="term1">term1tf-idf</str> <str name="term2">term1tf-idf</str> .... </lst> <lst name="seconddoc"> <str name="term1">term1tf-idf</str> ... </lst> .... </lst> </response> Having said all this, the above patch helps us a lot in terms of using solr for document clustering. --Thanks and Regards Vaijanath
        Hide
        Grant Ingersoll added a comment -

        Vajijanath

        Having said all this, the above patch helps us a lot in terms of using solr for document clustering.

        Please see SOLR-769.

        Show
        Grant Ingersoll added a comment - Vajijanath Having said all this, the above patch helps us a lot in terms of using solr for document clustering. Please see SOLR-769 .
        Hide
        Grant Ingersoll added a comment -

        a) if one set qt=tvrh tv=true does not make sense as the user is anyway requesting term vectors.

        qt=tvrh is setting the request handler. tv=true is turning on the TermVectorComponent (TVC). Those are two separate actions. The TVC can be added to any ReqHandler.

        b) Repeating uniqueKeyFieldName for every record does not add any value.

        Good point. I will move that out.

        c) Ideally what I would have liked is something below.
        <response>
        <lst name="termVectors">
        <lst name="firstdoc">
        <str name="term1">term1tf-idf</str>
        <str name="term2">term1tf-idf</str>
        ....

        Can you make a case for that? I have a couple of issues with it. First, the term is incorrectly typed. Presumably TF-IDF is a double. Second, it requires Solr to do TF*IDF for every term, when not everyone will want that, thus it would be a wasted calculation. I suppose it could be an option to have Solr do it, though.

        Are you proposing not to return the other info or is this just in the case where tf = true and idf = true?

        Show
        Grant Ingersoll added a comment - a) if one set qt=tvrh tv=true does not make sense as the user is anyway requesting term vectors. qt=tvrh is setting the request handler. tv=true is turning on the TermVectorComponent (TVC). Those are two separate actions. The TVC can be added to any ReqHandler. b) Repeating uniqueKeyFieldName for every record does not add any value. Good point. I will move that out. c) Ideally what I would have liked is something below. <response> <lst name="termVectors"> <lst name="firstdoc"> <str name="term1">term1tf-idf</str> <str name="term2">term1tf-idf</str> .... Can you make a case for that? I have a couple of issues with it. First, the term is incorrectly typed. Presumably TF-IDF is a double. Second, it requires Solr to do TF*IDF for every term, when not everyone will want that, thus it would be a wasted calculation. I suppose it could be an option to have Solr do it, though. Are you proposing not to return the other info or is this just in the case where tf = true and idf = true?
        Hide
        Vaijanath N. Rao added a comment -

        Hi Grant,
        I agree with the first point.

        For the third point
        The assumption is if someone has asked for TF-IDF representation, one would not expect the entire document to be back. So if the request is qt=trvh&tf.tv=true&tv=true or tf=true one would expect only to get the termvectors component.

        I agree with you regarding the tf-idf computation, just returning tf is more than sufficient.

        --Thanks and Regards
        Vaijanath

        Show
        Vaijanath N. Rao added a comment - Hi Grant, I agree with the first point. For the third point The assumption is if someone has asked for TF-IDF representation, one would not expect the entire document to be back. So if the request is qt=trvh&tf.tv=true&tv=true or tf=true one would expect only to get the termvectors component. I agree with you regarding the tf-idf computation, just returning tf is more than sufficient. --Thanks and Regards Vaijanath
        Hide
        Grant Ingersoll added a comment -

        Adds in All parameter to get tf, idf, offsets, and positions. Moves uniqueFieldKeyName out of the loop.

        Show
        Grant Ingersoll added a comment - Adds in All parameter to get tf, idf, offsets, and positions. Moves uniqueFieldKeyName out of the loop.
        Hide
        Grant Ingersoll added a comment -

        The assumption is if someone has asked for TF-IDF representation, one would not expect the entire document to be back. So if the request is qt=trvh&tf.tv=true&tv=true or tf=true one would expect only to get the termvectors component.

        OK, I think I understand. You're suggesting that if one only wants one component (i.e. TF), that we could flatten the structure a bit such that instead of:

        <lst name="display">
        <int name="freq">2</int>
        </lst>
        

        we just do:

        <int name="display">2</int>
        

        The former is slightly more verbose, but then it only requires people to have one approach to handling the various options, whereas the latter approach, while more compact, requires people to have two ways of handling the output. Is my understanding correct?

        Show
        Grant Ingersoll added a comment - The assumption is if someone has asked for TF-IDF representation, one would not expect the entire document to be back. So if the request is qt=trvh&tf.tv=true&tv=true or tf=true one would expect only to get the termvectors component. OK, I think I understand. You're suggesting that if one only wants one component (i.e. TF), that we could flatten the structure a bit such that instead of: <lst name= "display" > < int name= "freq" >2</ int > </lst> we just do: < int name= "display" >2</ int > The former is slightly more verbose, but then it only requires people to have one approach to handling the various options, whereas the latter approach, while more compact, requires people to have two ways of handling the output. Is my understanding correct?
        Hide
        Grant Ingersoll added a comment -

        Fix bug with position output.

        Show
        Grant Ingersoll added a comment - Fix bug with position output.
        Hide
        Vaijanath N. Rao added a comment -

        Hi Grant,

        I think my understanding is slightly different, let me try to clarify them both.

        If the user has asked for tf=true he is expecting term frequency value
        so the output would be
        <int name="display">2</int>

        If the user has asked for tf=true&idf=true user implies give me both computation
        so the output would be
        <float name="display">1.0</float>

        I thought of this output and hence thought this representation would be ideal. But I think I might been missing something that you have though about.

        --Thanks and Regards
        Vaijanath N. Rao

        Show
        Vaijanath N. Rao added a comment - Hi Grant, I think my understanding is slightly different, let me try to clarify them both. If the user has asked for tf=true he is expecting term frequency value so the output would be <int name="display">2</int> If the user has asked for tf=true&idf=true user implies give me both computation so the output would be <float name="display">1.0</float> I thought of this output and hence thought this representation would be ideal. But I think I might been missing something that you have though about. --Thanks and Regards Vaijanath N. Rao
        Hide
        Grant Ingersoll added a comment -

        If the user has asked for tf=true&idf=true user implies give me both computation
        so the output would be
        <float name="display">1.0</float>

        Ah, so you are expecting Solr to do the computation tf*idf when you specify those two options, right? My interpretations (and implmentation) is that it means return those two values in the response. See http://wiki.apache.org/solr/TermVectorComponentExampleOptions

        I suppose we could add something like &tf-idf=true in which case Solr does the computation. Otherwise, I think it is useful to be able to deliver them separately, as I think it gives the most flexibility. Would adding a tf-idf option suit your needs? Such that, when it is true, Solr returns that computation?

        Show
        Grant Ingersoll added a comment - If the user has asked for tf=true&idf=true user implies give me both computation so the output would be <float name="display">1.0</float> Ah, so you are expecting Solr to do the computation tf*idf when you specify those two options, right? My interpretations (and implmentation) is that it means return those two values in the response. See http://wiki.apache.org/solr/TermVectorComponentExampleOptions I suppose we could add something like &tf-idf=true in which case Solr does the computation. Otherwise, I think it is useful to be able to deliver them separately, as I think it gives the most flexibility. Would adding a tf-idf option suit your needs? Such that, when it is true, Solr returns that computation?
        Hide
        Grant Ingersoll added a comment -

        Added tf_idf calculation option. I'm going to commit today.

        Show
        Grant Ingersoll added a comment - Added tf_idf calculation option. I'm going to commit today.
        Hide
        Grant Ingersoll added a comment -

        Committed revision 707399.

        Show
        Grant Ingersoll added a comment - Committed revision 707399.
        Hide
        Yonik Seeley added a comment -

        Is there a reason that this component asks for the latest searcher from the core instead of getting the one bound to the SolrQueryRequest? Assuming it's just a bug... patch attached.

        Show
        Yonik Seeley added a comment - Is there a reason that this component asks for the latest searcher from the core instead of getting the one bound to the SolrQueryRequest? Assuming it's just a bug... patch attached.
        Hide
        Yonik Seeley added a comment - - edited

        Some random thoughts on this patch:

        • Adding the uniqueKeyFieldName seems out of place.... it's just one element of the schema and it doesn't seem like it belongs in this component.
        • How about using the "id" as the key, as is done in other places like highlighting.
          So instead of
            <lst name="doc-170">
              <str name="uniqueKey">3007WFP</str>
              <lst name="cat">
          	<lst name="electronics"/>
          	<lst name="monitor"/>
             </lst>
           </lst>
          

          it could look like

            <lst name="3007WFP">
              <lst name="cat">
          	<lst name="electronics"/>
          	<lst name="monitor"/>
             </lst>
           </lst>
          
        • It doesn't seem like we should link the ability to return term vectors with term vectors being stored. Like highlighting, they should be used when available for speed, but stored fields should also be possible. It's fine for the impl of that to wait, but perhaps the interface should support that via a tv.fl parameter. update: just looked at the code again, and I see there is a tv.fl param.... so I guess the only discussion point is if the default is right (all fields with term vectors stored).
        • "idf" actually isn't the idf, it's the doc freq that is being returned. The label should probably be changed to "df"
        • instead of "freq", how about just using the shorter and well-known "tf"?
        • the docs say that tf_idf "Calculates tf*idf for each term.", but the code is actually returning "freq"/"idf" (but the idf is actually a df, so it is a straight tf * idf). But this doesn't seem that useful because the user could trivially do tf/df themselves. What would seem useful is to get the actual scoring tf-idf (via the Similarity). For better language mappings, I think we should avoid dashes in parameter names too.... perhaps tv.tfidf or tv.tf_idf?
        Show
        Yonik Seeley added a comment - - edited Some random thoughts on this patch: Adding the uniqueKeyFieldName seems out of place.... it's just one element of the schema and it doesn't seem like it belongs in this component. How about using the "id" as the key, as is done in other places like highlighting. So instead of <lst name= "doc-170" > <str name= "uniqueKey" >3007WFP</str> <lst name= "cat" > <lst name= "electronics" /> <lst name= "monitor" /> </lst> </lst> it could look like <lst name= "3007WFP" > <lst name= "cat" > <lst name= "electronics" /> <lst name= "monitor" /> </lst> </lst> It doesn't seem like we should link the ability to return term vectors with term vectors being stored. Like highlighting, they should be used when available for speed, but stored fields should also be possible. It's fine for the impl of that to wait, but perhaps the interface should support that via a tv.fl parameter. update: just looked at the code again, and I see there is a tv.fl param.... so I guess the only discussion point is if the default is right (all fields with term vectors stored). "idf" actually isn't the idf, it's the doc freq that is being returned. The label should probably be changed to "df" instead of "freq", how about just using the shorter and well-known "tf"? the docs say that tf_idf "Calculates tf*idf for each term.", but the code is actually returning "freq"/"idf" (but the idf is actually a df, so it is a straight tf * idf). But this doesn't seem that useful because the user could trivially do tf/df themselves. What would seem useful is to get the actual scoring tf-idf (via the Similarity). For better language mappings, I think we should avoid dashes in parameter names too.... perhaps tv.tfidf or tv.tf_idf?
        Hide
        Grant Ingersoll added a comment -

        Is there a reason that this component asks for the latest searcher from the core instead of getting the one bound to the SolrQueryRequest? Assuming it's just a bug... patch attached

        Nope. Go ahead and commit.

        Show
        Grant Ingersoll added a comment - Is there a reason that this component asks for the latest searcher from the core instead of getting the one bound to the SolrQueryRequest? Assuming it's just a bug... patch attached Nope. Go ahead and commit.
        Hide
        Grant Ingersoll added a comment -
        1. Adding the uniqueKeyFieldName seems out of place.... it's just one element of the schema and it doesn't seem like it belongs in this component.
        2. How about using the "id" as the key, as is done in other places like highlighting.

        That's fine. I think my thinking was that by using a "constant" for the name, then one could ask explicitly for that property in the NamedList. That is namedList.getVal("uniqueKey");

        It doesn't seem like we should link the ability to return term vectors with term vectors being stored. Like highlighting, they should be used when available for speed, but stored fields should also be possible. It's fine for the impl of that to wait, but perhaps the interface should support that via a tv.fl parameter. update: just looked at the code again, and I see there is a tv.fl param.... so I guess the only discussion point is if the default is right (all fields with term vectors stored).

        That's reasonable. We can open a separate issue for it if anyone wants it.

        1. "idf" actually isn't the idf, it's the doc freq that is being returned. The label should probably be changed to "df"
        2. instead of "freq", how about just using the shorter and well-known "tf"?
        3. the docs say that tf_idf "Calculates tf*idf for each term.", but the code is actually returning "freq"/"idf" (but the idf is actually a df, so it is a straight tf * idf). But this doesn't seem that useful because the user could trivially do tf/df themselves. What would seem useful is to get the actual scoring tf-idf (via the Similarity). For better language mappings, I think we should avoid dashes in parameter names too.... perhaps tv.tfidf or tv.tf_idf?

        All fine as well. I just added the tf*idf computation in as a based on Vaijanath's comments. I'll update these and the wiki.

        Show
        Grant Ingersoll added a comment - Adding the uniqueKeyFieldName seems out of place.... it's just one element of the schema and it doesn't seem like it belongs in this component. How about using the "id" as the key, as is done in other places like highlighting. That's fine. I think my thinking was that by using a "constant" for the name, then one could ask explicitly for that property in the NamedList. That is namedList.getVal("uniqueKey"); It doesn't seem like we should link the ability to return term vectors with term vectors being stored. Like highlighting, they should be used when available for speed, but stored fields should also be possible. It's fine for the impl of that to wait, but perhaps the interface should support that via a tv.fl parameter. update: just looked at the code again, and I see there is a tv.fl param.... so I guess the only discussion point is if the default is right (all fields with term vectors stored). That's reasonable. We can open a separate issue for it if anyone wants it. "idf" actually isn't the idf, it's the doc freq that is being returned. The label should probably be changed to "df" instead of "freq", how about just using the shorter and well-known "tf"? the docs say that tf_idf "Calculates tf*idf for each term.", but the code is actually returning "freq"/"idf" (but the idf is actually a df, so it is a straight tf * idf). But this doesn't seem that useful because the user could trivially do tf/df themselves. What would seem useful is to get the actual scoring tf-idf (via the Similarity). For better language mappings, I think we should avoid dashes in parameter names too.... perhaps tv.tfidf or tv.tf_idf? All fine as well. I just added the tf*idf computation in as a based on Vaijanath's comments. I'll update these and the wiki.
        Hide
        Grant Ingersoll added a comment -

        I committed:

        freq -> tf
        idf -> df
        tf-idf -> tf_idf

        See the Wiki page: http://wiki.apache.org/solr/TermVectorComponent

        Show
        Grant Ingersoll added a comment - I committed: freq -> tf idf -> df tf-idf -> tf_idf See the Wiki page: http://wiki.apache.org/solr/TermVectorComponent
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Noble Paul
          • Votes:
            2 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development