SOLR-9252

Feature selection and logistic regression on text

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: None
    • Fix Version/s: 6.2
    • Component/s: search, SolrCloud, SolrJ
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:

      Description

      This ticket adds two new streaming expressions: features and train.

      These two functions work together to train a logistic regression model on text, from a training set stored in a SolrCloud collection.

      The syntax is as follows:

      train(collection1, q="*:*",
            features(collection1, 
                     q="*:*",  
                     field="body", 
                     outcome="out_i", 
                     positiveLabel=1, 
                     numTerms=100),
            field="body",
            outcome="out_i",
            maxIterations=100)
      

      The features function extracts the feature terms from a training set, using information gain to score the terms (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
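
      For reference, the standard information gain computation for a term over a binary outcome looks like the sketch below (illustrative Java; the exact computation in the patch may differ):

      // Binary entropy H(p), in bits.
      static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * (Math.log(p) / Math.log(2)) - (1 - p) * (Math.log(1 - p) / Math.log(2));
      }

      // IG(class, term) = H(class) - [P(term) * H(class|term) + P(!term) * H(class|!term)]
      static double infoGain(int numDocs, int numPositiveDocs,
                             int docsWithTerm, int positiveDocsWithTerm) {
        double hClass = entropy((double) numPositiveDocs / numDocs);
        double pTerm = (double) docsWithTerm / numDocs;
        double hTerm = docsWithTerm == 0
            ? 0.0 : entropy((double) positiveDocsWithTerm / docsWithTerm);
        int docsWithoutTerm = numDocs - docsWithTerm;
        double hNoTerm = docsWithoutTerm == 0
            ? 0.0 : entropy((double) (numPositiveDocs - positiveDocsWithTerm) / docsWithoutTerm);
        return hClass - (pTerm * hTerm + (1 - pTerm) * hNoTerm);
      }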

      The train function uses the extracted features to train a logistic regression model on a text field in the training set.

      For both features and train, the training set is defined by a query. The doc vectors in the train function use tf-idf to represent the terms in the document. The idf is calculated for the specific training set, allowing multiple training sets to be stored in the same collection without polluting the idf.
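
      As a rough sketch (the exact tf-idf variant used by the patch isn't shown here), each doc vector entry is computed along these lines, with docFreq and numDocs taken from the training set rather than the whole collection:

      // tf-idf weight for one term in one doc vector (one common variant).
      // docFreq and numDocs are computed over the training set only, so the
      // idf stays specific to that training set.
      static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
        double idf = Math.log((double) numDocs / docFreq) + 1.0;
        return termFreqInDoc * idf;
      }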

      In the train function, a batch gradient descent approach is used to iteratively train the model.
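
      A single batch gradient descent iteration for logistic regression looks roughly like this (a minimal sketch; the names and the exact update rule in the patch may differ):

      // One batch gradient descent step over all doc vectors.
      // docs: tf-idf doc vectors, outcomes: 0/1 labels, alpha: learning rate.
      static double[] iterate(double[][] docs, int[] outcomes,
                              double[] weights, double alpha) {
        double[] gradient = new double[weights.length];
        for (int d = 0; d < docs.length; d++) {
          double dot = 0.0;
          for (int j = 0; j < weights.length; j++) {
            dot += weights[j] * docs[d][j];
          }
          double predicted = 1.0 / (1.0 + Math.exp(-dot)); // logistic function
          double error = predicted - outcomes[d];
          for (int j = 0; j < weights.length; j++) {
            gradient[j] += error * docs[d][j]; // accumulate over the full batch
          }
        }
        double[] updated = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
          updated[j] = weights[j] - alpha * gradient[j]; // step against the gradient
        }
        return updated;
      }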

      Both the features and the train functions are embedded in Solr using the AnalyticsQuery framework, so only the model is transported across the network with each iteration.

      Both the features and the models can be stored in a SolrCloud collection. Using this approach, Solr can hold millions of models, which can be selectively deployed. For example, a model could be trained for each user to personalize ranking and recommendations.

      Below is the final iteration of a model trained on the Enron Ham/Spam dataset. The model includes the terms and their idfs and weights, as well as a classification evaluation describing the accuracy of the model on the training set.

      {
      			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 
4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
      			"alpha_d": 7.150861416624748E-4,
      			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
      			"iteration_i": 100,
      			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 
0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
      			"field_s": "body",
      			"trueNegative_i": 3570,
      			"falseNegative_i": 35,
      			"falsePositive_i": 75,
      			"error_d": 176.8112932306374,
      			"truePositive_i": 1381,
      			"id": "model_100"
      		}
      
      1. SOLR-9252.patch
        88 kB
        Cao Manh Dat
      2. SOLR-9252.patch
        87 kB
        Joel Bernstein
      3. SOLR-9252.patch
        88 kB
        Joel Bernstein
      4. SOLR-9252.patch
        93 kB
        Cao Manh Dat
      5. SOLR-9252.patch
        92 kB
        Cao Manh Dat
      6. SOLR-9252.patch
        90 kB
        Cao Manh Dat
      7. SOLR-9252.patch
        87 kB
        Cao Manh Dat
      8. SOLR-9252.patch
        87 kB
        Cao Manh Dat
      9. SOLR-9252.patch
        74 kB
        Cao Manh Dat
      10. SOLR-9252.patch
        75 kB
        Cao Manh Dat
      11. SOLR-9299-1.patch
        7 kB
        Cao Manh Dat


          Activity

          Cao Manh Dat added a comment -

          Enron mail dataset

          Cao Manh Dat added a comment -

          Initial patch.

          Joel Bernstein added a comment - edited

          This is an exciting patch!

          I closed out SOLR-9186 so work can focus on this patch.

          I'll open another ticket describing a broader framework for optimizing, storing and deploying AI models within Streaming Expression framework and link it to this ticket.

          Cao Manh Dat added a comment - edited

          Updated patch. I changed the feature selection formulation to the correct one (https://en.wikipedia.org/wiki/Information_gain_in_decision_trees). Here are the test results for the new formulation (https://docs.google.com/spreadsheets/d/1BRjFgZDiJPBT51kggcCznoK0ES1-N-RbOIJaoDT3qgM/edit?usp=sharing).

          I think the patch is ready now.

          Joel Bernstein added a comment - edited

          Cao Manh Dat, I have the patch applied and have begun the review.

          I've started with the FeaturesSelectionStream and IGainTermsQParserPlugin. I'll need more time and some collaboration to review the math. But I can say now that the mechanics of feature selection look very good. The use of the streaming framework and analytics query is really nice.

          The one thing that we'll want to do is put some thought into how the features can be stored and retrieved.

          Currently it looks like there is a tuple for each term/score pair. I think this works well using the update() function to send the tuples to another collection for storage. A few minor things to consider:

          1) Should we use a field type postfix (term_s, score_f) to ensure that fields are indexed properly in another collection?

          2) We'll need to add some kind of feature set ID so the feature set can be retrieved later. Each tuple will then be tagged with the feature set ID. Possibly adding a featureSet parameter to the stream makes sense for this.

          3) We can also add a unique ID which will be used for the unique ID for each tuple in the index. We could concat the term with the feature set ID to make the unique ID.
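
          For illustration, with a featureSet parameter along those lines, persisting the features could look something like this (update() is the existing streaming expression; the featuresCollection destination and the exact parameters are hypothetical):

          update(featuresCollection, batchSize=50,
                 features(collection1,
                          q="*:*",
                          featureSet="first",
                          field="body",
                          outcome="out_i",
                          numTerms=100))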

          Cao Manh Dat added a comment - edited

          This is a good change. I think we should store the index of terms (from 0 -> number of terms) as well, so we can retrieve terms in sorted order.

          Cao Manh Dat added a comment -

          Hi Joel,

          Should we add a postfix for the textlogitstream output too? Do you think we should define a standard schema for all ML stream output?

          Joel Bernstein added a comment -

          I think we should postfix the textlogitstream also.

          A standard schema would be nice, but I don't know if it will be possible. For example, the logit() model is pretty different from the tlogit() model.

          I'll provide some more feedback on textlogitstream shortly.

          Joel Bernstein added a comment -

          Ok I reviewed the TextLogitStream and it looks great! The ClassificationEvaluation is really nice.

          Really the whole patch looks very good.

          What I need to do now is test with a few different data sets. This will verify the results that Cao Manh Dat has been getting. It will also test out the mechanics of running the functions and storing and retrieving the features and models.

          Cao Manh Dat added a comment -

          Thanks for the review, I will upload an updated patch shortly.

          In ML we deal a lot with numbers and arrays of numbers, so I think we can use dynamic fields to define a standard schema.
          For example:

          *_i : int
          *_is : array of int
          ...
          
          Cao Manh Dat added a comment -

          Updated patch. This patch includes:

          • Support for storing the textLogit & featureSelection output by using updateStream.
          • The textLogit model now supports exact idfs by using SOLR-9243.
          Joel Bernstein added a comment - edited

          I just reviewed the latest patch and it looks good. One implementation detail:

          The terms component also returns the numDocs now that SOLR-9193 has been committed. So you can retrieve the numDocs along with the doc frequencies by adding the terms.stats param.
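
          For example, a request along these lines should return the doc frequencies together with numDocs (an illustrative URL; the exact handler path and params depend on how the terms component is configured):

          /solr/training/terms?terms.fl=body&terms.stats=true&terms.limit=10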

          And one question about the use of tf-idf:

          You're using tf-idf for the doc vectors which seems like a good idea. Is this a typical approach for text regression or is this something you decided to do because we have access to these types of stats in the index?

          Cao Manh Dat added a comment -

          Thanks, that seems like a good improvement; I will update the patch soon.

          In general, TF-IDF is a good/standard way to represent documents for classification. We could use TF only, but it won't be as good as TF-IDF, and a nice thing about Solr is that we can get the IDF of terms very quickly.

          Cao Manh Dat added a comment -

          Updated patch based on Joel Bernstein's feedback about numDocs().

          Cao Manh Dat added a comment -

          Updated the stream expression tests.

          Joel Bernstein added a comment -

          I've been working with the latest patch. After putting the enron1.zip file in place, all the test methods in StreamExpressionTest pass on their own. But if you run the entire StreamExpressionTest you get failures. I'm investigating this now and will update the ticket when I've got it resolved. The latest run had the following failures, but different ones fail on each run:

          [junit4] Tests with failures [seed: F53E526DA62A037F]:
          [junit4] - org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testFeaturesSelectionStream
          [junit4] - org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testUpdateStream

          Joel Bernstein added a comment - edited

          One of the things I've been thinking about is the function names. I think we can shorten the featureSelection function to just be features.

          I think we could change the tlogit function to train. So the syntax would look like this:

          train(collection1, q="*:*",
                features(collection1, 
                         q="*:*",  
                         field="tv_text", 
                         outcome="out_i", 
                         positiveLabel=1, 
                         numTerms=100),
                field="tv_text",
                outcome="out_i",
                maxIterations=100)
          

          In the future both the features and the train functions can have a parameter for setting the algorithm. The default algorithm in the initial release will be information gain for feature selection, and logistic regression for training.

          Cao Manh Dat added a comment -

          +1
          That will help make the expressions cleaner.

          Cao Manh Dat added a comment -

          It turns out that the cause of the test failures is that we create a temporary collection and do not delete it, so an exception is thrown when we try to create the same temporary collection in another test.

          David Smiley added a comment -

          Can the point of this be explained in layman's terms? I am not familiar with logistic regression and how it relates to search.

          Joel Bernstein added a comment - edited

          This is part of the larger ticket SOLR-9258, which will provide more context.

          Here are some specifics about this ticket:

          Logistic regression is a machine learning classification algorithm.

          It's binary, so it's used to determine if something belongs to a class or not.

          With logistic regression you train a model using a training data set. And then use that model to classify other documents.

          This ticket trains a logistic regression model on text. So it builds a model based on the terms in the documents. New documents can then be classified based on the terms in the documents.

          The terms in the document are known as features.

          The first step in the process is feature selection, which is to select the important terms from the training set that will be used to build the model. This ticket uses an algorithm called Information Gain to select the features.

          The next step is to train a model based on those features. This ticket uses Stochastic Gradient Descent to train a logistic regression model over the training set. Stochastic Gradient Descent is an iterative approach.

          Both the features and the model can then be stored in a SolrCloud collection.
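
          To make that concrete, applying a stored model to a new document amounts to a dot product and the logistic function, roughly like this sketch (illustrative names, not the code in the patch):

          // Classify a document's tf-idf vector with a trained model.
          static boolean classify(double[] docVector, double[] weights) {
            double dot = 0.0;
            for (int i = 0; i < weights.length; i++) {
              dot += weights[i] * docVector[i];
            }
            double p = 1.0 / (1.0 + Math.exp(-dot)); // logistic (sigmoid) function
            return p >= 0.5; // positive class if the probability crosses 0.5
          }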

          Joel Bernstein added a comment -

          One of the things that's really interesting about this ticket is that the algorithms used rely heavily on the statistics in the index. So, a search engine is really the best possible place to be doing machine learning on text documents. Logistic Regression is the easiest algorithm to get started with, but other algorithms will likely follow.

          Joel Bernstein added a comment -

          Here is an example of how you could use the models in a search context.

          1) Identify a training set for each user based on usage logs. This could involve what the user has viewed before, or liked. To increase the size of the training set you could use graph queries to find documents that co-occur in the same session most frequently with documents that user has viewed or liked.

          2) Optimize a model for the specific training set and store the model in a SolrCloud collection.

          3) Use the model either in the re-ranker to boost documents based on the score from the model, or as part of an alerting engine to push documents the user might be interested in.

          4) Background daemons could run in Solr to build models for users. This would result in possibly millions of models, which is fine, because the models are simply stored in a SolrCloud collection.

          David Smiley added a comment -

          So is a "model" ultimately a Document? I'm guessing so since you mentioned putting them in Solr and having millions of them.

          Are there intermediate steps where data is put into Solr that isn't a model? You mentioned feature selection but it's not clear if that is materialized into Solr data or if it's used purely in-memory transiently.

          Thanks for your explanation.

          Joel Bernstein added a comment -

          Yes, the model can be saved as a document. The model contains the features that were used to create it, and the associated weights for each feature.

          Feature selection can be done as a separate step and stored in the index. Feature selection takes time, and it's likely users will want to view the features that were extracted from the training data. Also, features could be used for other purposes, as they are really just a list of terms that provide the most "information" about a training set. So it would be useful to store them.

          The training function reads the features as a stream, so they can either be a stored feature set, or generated on the fly.

          Joel Bernstein added a comment - edited

          I've been working with the latest patch. I started with the featureSelection function.

          I found that as I increased numTerms, terms were dropping off the list. I'm not exactly sure why that is.

          Using the Enron data set:

          numTerms=5:

          {"result-set":{"docs":[
          {"index_i":1,"featureSet_s":"first","id":"first_1","score_f":0.07897711944252839,"term_s":"daren"},
          {"index_i":2,"featureSet_s":"first","id":"first_2","score_f":0.08489573252924343,"term_s":"hpl"},
          {"index_i":3,"featureSet_s":"first","id":"first_3","score_f":0.09142281976072042,"term_s":"cc"},
          {"index_i":4,"featureSet_s":"first","id":"first_4","score_f":0.09565949144858465,"term_s":"2000"},
          {"index_i":5,"featureSet_s":"first","id":"first_5","score_f":0.11833764555978427,"term_s":"enron"},
          {"EOF":true,"RESPONSE_TIME":12}]}}
          

          numTerms:10

          {"result-set":{"docs":[
          {"index_i":1,"featureSet_s":"first","id":"first_1","score_f":0.06381886672504677,"term_s":"hou"},
          {"index_i":2,"featureSet_s":"first","id":"first_2","score_f":0.06554238725349948,"term_s":"ect"},
          {"index_i":3,"featureSet_s":"first","id":"first_3","score_f":0.06622407002267094,"term_s":"pm"},
          {"index_i":4,"featureSet_s":"first","id":"first_4","score_f":0.06679642321097634,"term_s":"thanks"},
          {"index_i":5,"featureSet_s":"first","id":"first_5","score_f":0.0679334895610123,"term_s":"forwarded"},
          {"index_i":6,"featureSet_s":"first","id":"first_6","score_f":0.06883768842689886,"term_s":"gas"},
          {"index_i":7,"featureSet_s":"first","id":"first_7","score_f":0.07775307852726465,"term_s":"http"},
          {"index_i":8,"featureSet_s":"first","id":"first_8","score_f":0.07897711944252839,"term_s":"daren"},
          {"index_i":9,"featureSet_s":"first","id":"first_9","score_f":0.08489573252924343,"term_s":"hpl"},
          {"index_i":10,"featureSet_s":"first","id":"first_10","score_f":0.09142281976072042,"term_s":"cc"},
          {"EOF":true,"RESPONSE_TIME":12}]}}
          

          Notice that enron had the highest score in the first result set but is missing from the second result set.

          Also, in the code below it's taking the highest score from the shards for a term rather than combining the scores. Is that the preferred approach for distributed IGain?

          for (Future<NamedList<Double>> getTopTermsCall : callShards(getShardUrls())) {
            NamedList<Double> shardTopTerms = getTopTermsCall.get();
            for (int i = 0; i < shardTopTerms.size(); i++) {
              String term = shardTopTerms.getName(i);
              double score = shardTopTerms.getVal(i);
              // Keep the highest score seen for the term across all shards.
              if (!termScores.containsKey(term) || termScores.get(term) < score) {
                termScores.put(term, score);
              }
            }
          }
          
          Joel Bernstein added a comment -

          It looks like the final sort in the FeatureSelectionStream is ascending when it should be descending.

          Cao Manh Dat added a comment - edited

          You are absolutely right! The fix for the problem should be:

          // Sort the term/score entries in descending order of score.
          st.sorted(Map.Entry.comparingByValue((c1, c2) -> c2.compareTo(c1)))
            .forEachOrdered(e -> result.put(e.getKey(), e.getValue()));
          

          Also, in the code below it's taking the highest score from the shards for a term rather than combining the scores. Is that the preferred approach for distributed IGain?

          It's just my current approach, because we don't have any paper about distributed IGain. I will do some more tests to check both approaches.

          Cao Manh Dat added a comment - edited

          I'm having second thoughts about changing tlogit to a train function, because different algorithms have different sets of parameters. For example, tlogit and logit have totally different parameters. I think we should change featuresSelection to features but keep tlogit as it is.

          Joel Bernstein, +1 for summing up the igain scores from all shards, so we can get the best terms across all shards. This is not yet proven, because it is based on a lot of assumptions about how documents, classes, and terms are distributed, but I think it will be good enough for most cases. If you don't have any comments, I will submit a fixed patch soon.
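
          Illustratively, summing instead of keeping the max would reduce the merge loop body to something like this (assuming termScores is a java.util.Map<String, Double>):

          // Sum each term's igain score across shards instead of keeping the max.
          termScores.merge(term, score, Double::sum);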

          Joel Bernstein added a comment -

          Ok, I made the change with the sort locally and now the terms are not dropping off as numTerms changes. I'll keep testing.

          Joel Bernstein added a comment - edited

          Ok, I reviewed the TextLogisticRegressionCollector. I think we're going to have to change this implementation. I thought I saw an older version of this that was using the finish() method to perform the logit, but in the current version it's doing this in the collect() method. This is going to have trouble scaling, and it also requires term vectors, which we don't want to have to use.

          I think the approach to take is to collect the matching bitset in the collect() method.

          Then in the finish() method the logic is:

          1) Get the top level TermsEnum
          2) Iterate the features and seek into the terms enum
          3) For each feature iterate the DocsEnum and compare to the matching docs bitset to build the doc vectors.

          With this approach the doc vectors are a multi-dimensional array:

          doc0-> [featureValue, featureValue, featureValue]
          doc1-> [featureValue, featureValue, featureValue]
          ...

          With this approach we'll have to hold all the doc vectors in memory at once. So if you have hundreds of features and millions of records in the training set, you'll need a large cluster to do the work.

          We can also add a randomized approach to this so that not every doc vector is calculated on each iteration.
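
          A rough sketch of that finish() logic, assuming Lucene's TermsEnum/PostingsEnum APIs (the method name, the docToRow mapping, and the raw-tf weighting are illustrative, not from the patch; uses org.apache.lucene.index.{LeafReader, Terms, TermsEnum, PostingsEnum}, org.apache.lucene.search.DocIdSetIterator, and org.apache.lucene.util.{BytesRef, FixedBitSet}):

          // Build doc vectors for the matching training docs in finish().
          // matching: bitset collected in collect(); features: selected terms.
          static double[][] docVectors(LeafReader reader, String field,
                                       String[] features, FixedBitSet matching)
              throws IOException {
            // Map each matching docId to a row in the vector array.
            Map<Integer, Integer> docToRow = new HashMap<>();
            int rows = 0;
            for (int doc = 0; doc < matching.length(); doc++) {
              if (matching.get(doc)) {
                docToRow.put(doc, rows++);
              }
            }
            double[][] vectors = new double[rows][features.length];
            Terms terms = reader.terms(field);
            if (terms == null) {
              return vectors;
            }
            TermsEnum termsEnum = terms.iterator();
            for (int f = 0; f < features.length; f++) {
              if (!termsEnum.seekExact(new BytesRef(features[f]))) {
                continue; // feature term absent from this reader
              }
              PostingsEnum postings = termsEnum.postings(null, PostingsEnum.FREQS);
              int doc;
              while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                Integer row = docToRow.get(doc);
                if (row != null) {
                  vectors[row][f] = postings.freq(); // raw tf; scale by idf for tf-idf
                }
              }
            }
            return vectors;
          }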

          Cao Manh Dat added a comment -

          Updated patch. This patch changes a few points:

          • Do the training in the finish() method. It's much faster than the previous approach (thanks Joel Bernstein).
          • Change featuresSelection to features.
          • FeaturesSelectionStream now sums up term scores from all shards.
          Joel Bernstein added a comment -

          Ok, I have the patch running and it looks great.

          I have the following expression running:

          train(training, 
                  features(training, q="*:*", featureSet="first", field="body", outcome="out_i", numTerms=200), 
                  q="*:*", 
                  name="model", 
                  field="body", 
                  outcome="out_i", 
                  maxIterations=100)
          

          In the patch train is still the function name in the /stream handler. But we can make a final decision on this before committing.

          The accuracy seems to be 98% on the Enron training data with this patch. Here is the final model:

          {
          			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 
4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
          			"alpha_d": 7.150861416624748E-4,
          			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
          			"iteration_i": 100,
          			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 
0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
          			"field_s": "body",
          			"trueNegative_i": 3570,
          			"falseNegative_i": 35,
          			"falsePositive_i": 75,
          			"error_d": 176.8112932306374,
          			"truePositive_i": 1381,
          			"id": "model_100"
          		}
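The 98% training-set accuracy quoted above follows directly from the confusion-matrix fields in the model:

accuracy = (truePositive_i + trueNegative_i) / total
         = (1381 + 3570) / (1381 + 3570 + 75 + 35)
         = 4951 / 5061
         ≈ 0.978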
          
Joel Bernstein added a comment - edited

Ok, here is my thinking on train versus tlogit:

          The train function would initially map directly to the TextLogitStream. We can document that train is a text logistic regression model trainer in the first release.

As we add more algorithms, the train function will map to the TrainStream. The TrainStream won't have any implementation of its own; it will simply be a facade for different training algorithms. The TrainStream will have a parameter called algorithm, which it will use to select the stream implementation, such as TextLogitStream. The underlying implementation will handle the parameters; all the TrainStream will do is instantiate the algorithm and run it.

          Sample syntax:

          train(collection, 
                features(...), 
                algorithm="tlogit", 
                q="*:*", ....)
          
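To make the facade idea concrete, here is a minimal sketch (everything below is hypothetical except TextLogitStream; the real facade would be wired through the streaming-expression parser rather than a plain parameter map):

import java.util.Map;

// Illustrative only: the facade selects a concrete trainer from the
// algorithm parameter and passes the remaining parameters straight
// through, so adding a new algorithm never adds a new function name.
interface TrainingStream {
  void run();
}

class TextLogitStreamSketch implements TrainingStream {
  private final Map<String, String> params;
  TextLogitStreamSketch(Map<String, String> params) { this.params = params; }
  public void run() { /* batch gradient descent over the training set */ }
}

class TrainStreamSketch {
  static TrainingStream select(Map<String, String> params) {
    String algorithm = params.getOrDefault("algorithm", "tlogit");
    switch (algorithm) {
      case "tlogit":
        // the underlying implementation interprets q, field, outcome, etc.
        return new TextLogitStreamSketch(params);
      default:
        throw new IllegalArgumentException("Unknown training algorithm: " + algorithm);
    }
  }
}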

We can use the same facade approach for the classify and features functions.

The documentation can describe how to call train with different algorithms.

I like this approach because it provides three very easy-to-understand functions: train, classify and features.

It also stops the explosion of functions that would occur when we have multiple classify, train and features algorithms.

Cao Manh Dat added a comment

+1, that makes sense!

Joel Bernstein added a comment - edited

          I've been doing a final review of the patch. I have a question about the use of numDocs and docFreq in IGainTermsQParserPlugin.

Currently the numDocs and docFreq for the entire index are used instead of calculating these values specifically for the training set.

I've been testing with an index which only contains the training set. In this case it doesn't matter, because the numDocs and docFreq for the index are the same as for the training set.

But in scenarios where IGain is run on a slice of a larger index, does it make sense to calculate numDocs and docFreq for the training set? Or is there value in using the global numDocs and docFreq in this scenario?

Also, is the use case that we always load the training set into its own collection? If that's the case then we could drop the q parameter.

Joel Bernstein added a comment

          New patch with all tests passing

Cao Manh Dat added a comment

+1, these should be positive examples vs. the training set!

Cao Manh Dat added a comment

          Hi Joel Bernstein
The latest patch seems to be missing:

          • Necessary qparser in QParserPlugin.
          • Test for tlogit expression (inside StreamingExpressionTest)
Joel Bernstein added a comment

          Yeah, I just realized my last patch wasn't correct.

          New patch coming up shortly.

Joel Bernstein added a comment

          New patch adding the idfs to the features.

Joel Bernstein added a comment - edited

          Ok, just added a new patch which I believe was generated properly. I removed my last patch which was not generated properly.

The new patch calculates the idf from the training set instead of using the global docFreq and numDocs. This is done in the IGainTermsQParserPlugin, and the idf is now emitted with the Term in the FeatureSelectionStream.

          The TextLogitStream has been adjusted to use the idf provided with the features.
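Sketched out, the change amounts to this (illustrative only; the actual counting happens inside the IGainTermsQParserPlugin collector, and the exact idf formula in the patch may differ):

// Both counts are restricted to documents matched by the training-set query,
// so the resulting idf is independent of whatever else lives in the index.
static double trainingSetIdf(int numDocsInTrainingSet, int docFreqInTrainingSet) {
  return Math.log((double) numDocsInTrainingSet / (docFreqInTrainingSet + 1));
}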

Cao Manh Dat added a comment

+1, the new patch looks good!

Joel Bernstein added a comment - edited

          Cao Manh Dat, a couple of questions about the tests:

1) In the feature selection test there isn't an assertion for the order of terms. But it seems there could be one, because the results are ordered by score and the score appears to be deterministic. Should we add an assertion on the order of the terms?

2) In the text logit stream test, how did you choose the values for the test records?

// first feature is the bias value
Double[] testRecord = {1.0, 1.17, 0.691, 0.0, 0.0};
double d = sum(multiply(testRecord, lastWeightsArray)); // dot product with the final model weights
double prob = sigmoid(d);
assertEquals(prob, 1.0, 0.1); // this record should classify as positive

// first feature is the bias value
Double[] testRecord2 = {1.0, 0.0, 0.0, 1.17, 0.691};
d = sum(multiply(testRecord2, lastWeightsArray));
prob = sigmoid(d);
assertEquals(prob, 0, 0.1); // this record should classify as negative

          It would probably be good to document the values for the test records.
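For reference, the helpers that snippet relies on would look roughly like this (typical definitions sketched from the usage above; the actual test utilities may differ):

// Element-wise product of the test record and the final model weights.
private static double[] multiply(Double[] record, double[] weights) {
  double[] out = new double[record.length];
  for (int i = 0; i < record.length; i++) {
    out[i] = record[i] * weights[i];
  }
  return out;
}

// Summing the element-wise products yields the dot product d.
private static double sum(double[] values) {
  double total = 0.0;
  for (double v : values) {
    total += v;
  }
  return total;
}

// Standard logistic function: maps d to a probability in (0, 1).
private static double sigmoid(double x) {
  return 1.0 / (1.0 + Math.exp(-x));
}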

Joel Bernstein added a comment

          New patch that asserts the order of terms in the feature selection test.

          Also removes the terms parameter from the TextLogitStream and requires a features stream.

Cao Manh Dat added a comment

Updated patch which corrects the test for TextLogitStream.

Joel Bernstein, in this patch the testRecord is built from a string.

Joel Bernstein added a comment - edited

          This looks great, thanks for adding this.

          I've got a commit ready to push out that doesn't include this patch, but we can work it into a follow-up commit.

ASF subversion and git services added a comment

          Commit 87938e00e9f1006801fbf0e8c0d7b2a84b5eda48 in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=87938e0 ]

          SOLR-9252: Feature selection and logistic regression on text

ASF subversion and git services added a comment

          Commit 73de207201f43b1d8d3f3623dd12dd0ae2f9605c in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=73de207 ]

          SOLR-9252: Pre-commit fixes

ASF subversion and git services added a comment

          Commit e38d6d535c38c2d679cd9b0302fb96a75eda19c9 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e38d6d5 ]

          SOLR-9252: Feature selection and logistic regression on text

          Conflicts:
          solr/core/src/java/org/apache/solr/handler/StreamHandler.java

ASF subversion and git services added a comment

          Commit 728b4fbcdcf3682b2b1d571d088c0fbb78850606 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=728b4fb ]

          SOLR-9252: Pre-commit fixes

ASF subversion and git services added a comment

          Commit 2c4542ea0204f8cb3a966fc697651226e09d2ee5 in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2c4542e ]

          SOLR-9252: Update CHANGES.txt

ASF subversion and git services added a comment

          Commit f8cf9a7bf2f69094e0c20b97e53de46c870df490 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8cf9a7 ]

          SOLR-9252: Update CHANGES.txt

Joel Bernstein added a comment

          Cao Manh Dat, thanks for all your work on this ticket! It looks really good.

          I'll make the last change to the test case and close this ticket out later in the week.

Shalin Shekhar Mangar added a comment - edited

          Nice! Great to see this land in Solr. One question though – In IGainTermsCollector, both positiveSet and negativeSet are kept around and used while iterating the postingsEnum. Is that to handle deleted docs? If not, isn't any doc not in the positiveSet automatically in the negativeSet?

Joel Bernstein added a comment - edited

          Thanks!

The negative set is needed because we're calculating idf specific to the training set, rather than using the global idf for the index. Originally we were using the idf for the full index and the negative set was not needed.

This will allow us to have multiple training sets in the same collection without polluting each other's idf.
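Put differently (a sketch of the bookkeeping, not the actual collector code, and one common idf formulation rather than necessarily the patch's exact formula):

trainingSetSize  = |positiveSet| + |negativeSet|
docFreq_ds(term) = docs in positiveSet ∪ negativeSet that contain term
idf_ds(term)     = log(trainingSetSize / docFreq_ds(term))

Without the negative set there would be no way to count trainingSetSize or docFreq_ds over just the training documents, and the numbers would have to fall back to index-wide values.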

Cao Manh Dat added a comment

          Thanks for all your work to review this ticket, too!

Shalin Shekhar Mangar added a comment

          Makes sense. Thanks for the explanation Joel.

Joel Bernstein added a comment

While testing with different feature selection approaches, I ran across what I believe is a bug in TextLogisticRegressionQParserPlugin.

If a document doesn't contain any of the features, a doc vector isn't created for that document.

          So that document is skipped while optimizing the model.

          I'm not sure if this is the correct behavior.

Cao Manh Dat added a comment

In that case, I think we should ignore these documents in the training/classify step.

Joel Bernstein added a comment

          Ok, then we can leave it as is.

Cao Manh Dat added a comment

I mean we should ignore those documents inside the training for loop.

          So it will be

          for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
            ...
          }
          

          to

          for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
  double[] vector = entry.getValue();
  if (isZeros(vector)) continue;
            ...
          }
          

Because we can have identical zero vectors with different labels (both positive and negative).
          I will submit a patch soon to include this change and regularization.

Cao Manh Dat added a comment

A minor patch:

• In the training step, ignore documents that don't have any of the given features.
• Add regularization for logit (http://www.holehouse.org/mlclass/07_Regularization.html); see the sketch below.
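A minimal sketch of how the two changes might combine in one training iteration (illustrative names, not the patch itself):

// One batch-gradient-descent step with L2 regularization. All-zero doc
// vectors are skipped: they carry no feature signal and can appear with
// both labels, so they would only add noise to the gradient.
static void iterate(double[][] docs, int[] labels, double[] weights,
                    double alpha, double lambda) {
  double[] gradient = new double[weights.length];
  int counted = 0;
  for (int i = 0; i < docs.length; i++) {
    if (isZeros(docs[i])) continue;                 // ignore featureless docs
    double d = 0.0;
    for (int j = 0; j < weights.length; j++) {
      d += weights[j] * docs[i][j];
    }
    double error = sigmoid(d) - labels[i];          // gradient of the log loss
    for (int j = 0; j < weights.length; j++) {
      gradient[j] += error * docs[i][j];
    }
    counted++;
  }
  if (counted == 0) return;
  for (int j = 0; j < weights.length; j++) {
    double reg = (j == 0) ? 0.0 : lambda * weights[j]; // bias term not regularized
    weights[j] -= alpha * (gradient[j] / counted + reg);
  }
}

static boolean isZeros(double[] vector) {
  for (double v : vector) {
    if (v != 0.0) return false;
  }
  return true;
}

static double sigmoid(double x) {
  return 1.0 / (1.0 + Math.exp(-x));
}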

Kevin Risden added a comment

          Joel Bernstein - Should this ticket still be open? Looks like there were commits to master and branch_6x?

Joel Bernstein added a comment

Cao Manh Dat added an improved test case which I was planning to commit, but haven't gotten to it yet. We could resolve this ticket and create a new ticket with the latest patch as a starting point.

Jeroen Steggink added a comment - edited

          This would be great, as the regularization makes the training way more useful.

Joel Bernstein added a comment - edited

          I think the latest patches on this ticket have fallen through the cracks.

          Let's close out this ticket and open a new one for Cao Manh Dat's latest work.

Joel Bernstein added a comment

          SOLR-9816 has been opened. We can add the latest patches from this ticket when we're ready to work on it.


People

• Assignee: Joel Bernstein
• Reporter: Cao Manh Dat
• Votes: 0
• Watchers: 8
