  1. Solr
  2. SOLR-9252

Feature selection and logistic regression on text

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: None
    • Fix Version/s: 6.2
    • Component/s: search, SolrCloud, SolrJ
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:

      Description

      This ticket adds two new streaming expressions: features and train.

      These two functions work together to train a logistic regression model on text, from a training set stored in a SolrCloud collection.

      The syntax is as follows:

      train(collection1, q="*:*",
            features(collection1, 
                     q="*:*",  
                     field="body", 
                     outcome="out_i", 
                     positiveLabel=1, 
                     numTerms=100),
            field="body",
            outcome="out_i",
            maxIterations=100)
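
One way to submit the expression above is through the collection's /stream request handler, passing it in the expr parameter. The sketch below only builds the request URL and does not send it. The host, port, and helper name (build_stream_url) are assumptions for illustration and are not part of this ticket.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The streaming expression from this ticket, unchanged apart from line layout.
EXPR = """train(collection1, q="*:*",
      features(collection1, q="*:*", field="body", outcome="out_i",
               positiveLabel=1, numTerms=100),
      field="body", outcome="out_i", maxIterations=100)"""


def build_stream_url(base_url, collection, expr):
    """Return a URL that passes `expr` to the collection's /stream handler.

    `base_url` (for example "http://localhost:8983/solr") is an assumed
    deployment detail; adjust it for the cluster in use.
    """
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})


url = build_stream_url("http://localhost:8983/solr", "collection1", EXPR)

# Decoding the query string gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["expr"][0]
```

The resulting URL can then be requested with any HTTP client; the response is a stream of tuples, one model per iteration.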
      

      The features function extracts the feature terms from a training set, using information gain to score each term. See http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
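
To illustrate the scoring idea, here is a minimal sketch of information gain for a binary "term present / term absent" feature against a binary outcome. It is not the code from the patch: the function names, the in-memory document representation (a set of terms per document), and the tie-breaking rule are all assumptions made for the example.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a pos/neg count pair."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """IG = H(outcome) - H(outcome | term present or absent).

    docs   : list of sets of terms (assumed representation)
    labels : list of 0/1 outcomes, 1 being the positive label
    """
    with_pos = with_neg = without_pos = without_neg = 0
    for terms, label in zip(docs, labels):
        if term in terms:
            with_pos += label
            with_neg += 1 - label
        else:
            without_pos += label
            without_neg += 1 - label
    n = len(docs)
    n_with = with_pos + with_neg
    n_without = without_pos + without_neg
    prior = entropy(with_pos + without_pos, with_neg + without_neg)
    conditional = (n_with / n) * entropy(with_pos, with_neg) \
        + (n_without / n) * entropy(without_pos, without_neg)
    return prior - conditional


def top_terms(docs, labels, num_terms):
    """Return the num_terms highest-scoring terms (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:num_terms]


# Toy training set: two positive and two negative documents.
docs = [{"viagra", "free"}, {"free", "meeting"}, {"gas", "meeting"}, {"viagra", "click"}]
labels = [1, 0, 0, 1]
selected = top_terms(docs, labels, 2)
```

In this toy set "viagra" and "meeting" each split the outcomes perfectly, so they score highest, while "free" appears equally in both classes and scores zero.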

      The train function uses the extracted features to train a logistic regression model on a text field in the training set.

      For both features and train, the training set is defined by a query. The doc vectors in the train function use tf-idf to represent the terms in the document. The idf is calculated for the specific training set, so multiple training sets can be stored in the same collection without polluting each other's idf.
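
The point about idf can be sketched as follows: document frequencies are counted only over the documents matched by the training query, not over the whole collection. The exact tf and idf formulas used by the patch are not stated in this ticket, so the smoothed idf below and the raw term count used as tf are assumptions, as are the function names.

```python
import math
from collections import Counter


def training_set_idf(train_docs, terms):
    """Compute idf per feature term using only the training-set documents.

    train_docs : list of token lists (the documents matched by the query)
    terms      : the feature terms selected earlier
    Uses log((N + 1) / (df + 1)), an assumed smoothing choice.
    """
    n = len(train_docs)
    idf = {}
    for term in terms:
        df = sum(1 for doc in train_docs if term in doc)
        idf[term] = math.log((n + 1) / (df + 1))
    return idf


def tfidf_vector(doc_tokens, terms, idf):
    """Represent one document as a tf-idf vector over the feature terms."""
    counts = Counter(doc_tokens)
    return [counts[t] * idf[t] for t in terms]


train_docs = [["gas", "meter", "gas"], ["gas", "deal"], ["free", "deal"]]
terms = ["gas", "deal", "free"]
idf = training_set_idf(train_docs, terms)
vector = tfidf_vector(train_docs[0], terms, idf)
```

Because the idf dictionary is built from train_docs alone, a second training set in the same collection would get its own idf values from its own query.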

      In the train function a batch gradient descent approach is used to iteratively train the model.
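
As a rough illustration of batch gradient descent for logistic regression, the sketch below accumulates the gradient over the entire training set before each weight update. It runs in a single process on in-memory vectors; the learning rate, the bias handling, and the function names are assumptions and do not describe the patch itself.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Probability of the positive outcome; the last weight is the bias."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def train_logistic(vectors, outcomes, max_iterations=100, alpha=0.1):
    """Batch gradient descent: one weight update per pass over all documents."""
    n_features = len(vectors[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * (n_features + 1)
        for x, y in zip(vectors, outcomes):
            err = predict(weights, x) - y
            for j, xi in enumerate(x):
                gradient[j] += err * xi
            gradient[-1] += err  # bias term
        # Update only after the whole batch has been seen.
        weights = [w - alpha * g for w, g in zip(weights, gradient)]
    return weights


# Toy tf-idf vectors over two feature terms.
vectors = [[2.0, 0.0], [1.5, 0.0], [0.0, 1.0], [0.0, 2.0]]
outcomes = [1, 1, 0, 0]
weights = train_logistic(vectors, outcomes, max_iterations=200)
```

Each pass over the training set corresponds to one iteration, which is why a model can be emitted per iteration as in the output shown later in this description.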

      Both the features and train functions are embedded in Solr using the AnalyticsQuery framework, so only the model is transported across the network with each iteration.

      Both the features and the models can be stored in a SolrCloud collection. With this approach Solr can hold millions of models, which can be selectively deployed. For example, a model could be trained for each user to personalize ranking and recommendations.

      Below is the final iteration of a model trained on the Enron Ham/Spam dataset. The model includes the terms with their idfs and weights, as well as a classification evaluation describing the accuracy of the model on the training set.

      {
      			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
      			"alpha_d": 7.150861416624748E-4,
      			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
      			"iteration_i": 100,
      			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 
0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 
1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
      			"field_s": "body",
      			"trueNegative_i": 3570,
      			"falseNegative_i": 35,
      			"falsePositive_i": 75,
      			"error_d": 176.8112932306374,
      			"truePositive_i": 1381,
      			"id": "model_100"
      		}
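
The four evaluation counts in the model above can be turned into the usual summary metrics. This is a small sketch using the standard definitions; the helper name is hypothetical and only the four counts are taken from the model document.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Standard accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from the "model_100" document above.
model = {
    "truePositive_i": 1381,
    "trueNegative_i": 3570,
    "falsePositive_i": 75,
    "falseNegative_i": 35,
}

metrics = evaluation_metrics(
    model["truePositive_i"],
    model["trueNegative_i"],
    model["falsePositive_i"],
    model["falseNegative_i"],
)
```

For this model that works out to roughly 97.8% accuracy, 94.8% precision, and 97.5% recall, all measured on the training set itself.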
      

        Attachments

        1. SOLR-9252.patch
          88 kB
          Cao Manh Dat
        2. SOLR-9252.patch
          87 kB
          Joel Bernstein
        3. SOLR-9252.patch
          88 kB
          Joel Bernstein
        4. SOLR-9252.patch
          93 kB
          Cao Manh Dat
        5. SOLR-9252.patch
          92 kB
          Cao Manh Dat
        6. SOLR-9252.patch
          90 kB
          Cao Manh Dat
        7. SOLR-9252.patch
          87 kB
          Cao Manh Dat
        8. SOLR-9252.patch
          87 kB
          Cao Manh Dat
        9. SOLR-9252.patch
          74 kB
          Cao Manh Dat
        10. SOLR-9252.patch
          75 kB
          Cao Manh Dat
        11. SOLR-9299-1.patch
          7 kB
          Cao Manh Dat

              People

              • Assignee:
                joel.bernstein Joel Bernstein
                Reporter:
                caomanhdat Cao Manh Dat