Description
This ticket adds two new streaming expressions: features and train.
These two functions work together to train a logistic regression model on text, from a training set stored in a SolrCloud collection.
The syntax is as follows:
train(collection1, q="*:*", features(collection1, q="*:*", field="body", outcome="out_i", positiveLabel=1, numTerms=100), field="body", outcome="out_i", maxIterations=100)
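As an illustration of how such an expression might be submitted, the sketch below builds a request URL for Solr's /stream handler, which takes the expression in the expr parameter. The host, port, and the helper name build_stream_url are assumptions made for this example; the sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# The expression from the ticket, split across lines only for readability.
EXPR = (
    'train(collection1, q="*:*", '
    'features(collection1, q="*:*", field="body", outcome="out_i", '
    'positiveLabel=1, numTerms=100), '
    'field="body", outcome="out_i", maxIterations=100)'
)


def build_stream_url(expr, collection="collection1",
                     base="http://localhost:8983/solr"):
    """Return a /stream URL carrying the expression (hypothetical helper).

    The base URL is an assumption; substitute the address of a real node.
    """
    return f"{base}/{collection}/stream?{urlencode({'expr': expr})}"


url = build_stream_url(EXPR)
print(url)
```

The URL can then be fetched with any HTTP client.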
The features function extracts feature terms from a training set, using information gain to score the terms (see http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
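To make the scoring idea concrete, here is a minimal sketch of information gain for a binary outcome, treating each term as present or absent in a document. It is not the code in the patch: the function names and the toy data are hypothetical, and the actual implementation may differ in details such as smoothing or tie-breaking.

```python
import math


def entropy(labels):
    """Binary entropy, in bits, of a list of 0/1 labels."""
    if not labels:
        return 0.0
    h = 0.0
    for count in (sum(labels), len(labels) - sum(labels)):
        if count:
            p = count / len(labels)
            h -= p * math.log2(p)
    return h


def information_gain(docs, term):
    """docs is a list of (set_of_terms, label) pairs; label 1 is positive."""
    with_term = [label for terms, label in docs if term in terms]
    without_term = [label for terms, label in docs if term not in terms]
    n = len(docs)
    # Entropy of the outcome after splitting on presence of the term.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy([label for _, label in docs]) - conditional


def top_terms(docs, num_terms):
    """Return the num_terms highest-scoring terms (ties broken by name)."""
    vocab = set().union(*(terms for terms, _ in docs))
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, t), t))
    return ranked[:num_terms]


# Toy training set, invented for this example.
docs = [
    ({"viagra", "free"}, 1),
    ({"viagra", "click"}, 1),
    ({"meter", "gas"}, 0),
    ({"gas", "free"}, 0),
]
print(top_terms(docs, 2))
```

In this toy set "viagra" and "gas" each split the outcome perfectly, while "free" appears equally in both classes and therefore carries no information.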
The train function uses the extracted features to train a logistic regression model on a text field in the training set.
For both features and train, the training set is defined by a query. The document vectors in the train function use tf-idf to represent the terms in each document. The idf is calculated over the specific training set, so multiple training sets can be stored in the same collection without polluting each other's idf.
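The sketch below illustrates the per-training-set idf idea: document frequencies are counted only over the documents that belong to one training set, so another training set in the same collection would not affect them. The formula log(N / df) and the function names are assumptions for illustration; the ticket does not state the exact tf-idf variant used.

```python
import math
from collections import Counter


def training_set_idfs(docs, feature_terms):
    """Compute idf over the documents of ONE training set only.

    docs is a list of token lists; log(N / df) is an assumed formula.
    """
    n = len(docs)
    idfs = {}
    for term in feature_terms:
        df = sum(1 for tokens in docs if term in tokens)
        idfs[term] = math.log(n / df) if df else 0.0
    return idfs


def doc_vector(tokens, feature_terms, idfs):
    """Represent a document as tf * idf over the extracted feature terms."""
    counts = Counter(tokens)
    return [counts[term] * idfs[term] for term in feature_terms]


# Toy training set, invented for this example.
training_docs = [
    ["gas", "meter", "gas"],
    ["gas", "deal"],
    ["viagra", "free"],
    ["free", "deal"],
]
features = ["gas", "viagra"]
idfs = training_set_idfs(training_docs, features)
vector = doc_vector(training_docs[0], features, idfs)
print(idfs, vector)
```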
In the train function a batch gradient descent approach is used to iteratively train the model.
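A minimal sketch of batch gradient descent for logistic regression follows. Each iteration sums the gradient over the whole training set before updating the weights once. The bias term at index 0, the learning rate, the zero initialization, and the function names are assumptions for illustration and are not taken from the patch.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Probability of the positive outcome; weights[0] is an assumed bias."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train_logistic(vectors, outcomes, alpha=0.1, max_iterations=100):
    """Batch gradient descent: one weight update per pass over all docs."""
    weights = [0.0] * (len(vectors[0]) + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(vectors, outcomes):
            error = predict(weights, x) - y
            gradient[0] += error
            for j, xi in enumerate(x, start=1):
                gradient[j] += error * xi
        # Update only after the full training set has been seen.
        weights = [w - alpha * g for w, g in zip(weights, gradient)]
    return weights


# Toy document vectors and outcomes, invented for this example.
vectors = [[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
outcomes = [0, 0, 1, 1]
weights = train_logistic(vectors, outcomes)
print([round(predict(weights, x), 3) for x in vectors])
```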
Both the features and train functions are embedded in Solr using the AnalyticsQuery framework, so only the model is transported across the network with each iteration.
Both the features and the models can be stored in a SolrCloud collection. Using this approach, Solr can hold millions of models, which can be selectively deployed. For example, a model could be trained for each user to personalize ranking and recommendations.
Below is the final iteration of a model trained on the Enron Ham/Spam dataset. The model includes the terms with their idfs and weights, as well as a classification evaluation describing the accuracy of the model on the training set.
{ "idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744], "alpha_d": 7.150861416624748E-4, "terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", 
"gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"], "iteration_i": 100, "weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, 
-1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 
0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395], "field_s": "body", "trueNegative_i": 3570, 
"falseNegative_i": 35, "falsePositive_i": 75, "error_d": 176.8112932306374, "truePositive_i": 1381, "id": "model_100" }
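The four evaluation counts in the model can be turned into standard summary metrics. The short sketch below does this for the counts reported above; the function name evaluate is hypothetical, and these metrics describe performance on the training set only.

```python
def evaluate(tp, tn, fp, fn):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from the model document above.
metrics = evaluate(tp=1381, tn=3570, fp=75, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With 5061 training documents in total, this works out to roughly 97.8% accuracy, 94.8% precision, and 97.5% recall.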