SOLR-9252

Feature selection and logistic regression on text

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: None
    • Fix Version/s: 6.2
    • Component/s: search, SolrCloud, SolrJ
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:

      Description

      This ticket adds two new streaming expressions: features and train.

      These two functions work together to train a logistic regression model on text, from a training set stored in a SolrCloud collection.

      The syntax is as follows:

      train(collection1, q="*:*",
            features(collection1, 
                     q="*:*",  
                     field="body", 
                     outcome="out_i", 
                     positiveLabel=1, 
                     numTerms=100),
            field="body",
            outcome="out_i",
            maxIterations=100)
      

      The features function extracts the feature terms from a training set, using information gain to score the terms (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
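
      For reference, the standard information gain computation for a term over a binary outcome looks like the sketch below (illustrative Java; the exact computation in the patch may differ):

      // Binary entropy H(p), in bits.
      static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * (Math.log(p) / Math.log(2)) - (1 - p) * (Math.log(1 - p) / Math.log(2));
      }

      // IG(class, term) = H(class) - [P(term) * H(class|term) + P(!term) * H(class|!term)]
      static double infoGain(int numDocs, int numPositiveDocs,
                             int docsWithTerm, int positiveDocsWithTerm) {
        double hClass = entropy((double) numPositiveDocs / numDocs);
        double pTerm = (double) docsWithTerm / numDocs;
        double hTerm = docsWithTerm == 0
            ? 0.0 : entropy((double) positiveDocsWithTerm / docsWithTerm);
        int docsWithoutTerm = numDocs - docsWithTerm;
        double hNoTerm = docsWithoutTerm == 0
            ? 0.0 : entropy((double) (numPositiveDocs - positiveDocsWithTerm) / docsWithoutTerm);
        return hClass - (pTerm * hTerm + (1 - pTerm) * hNoTerm);
      }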

      The train function uses the extracted features to train a logistic regression model on a text field in the training set.

      For both features and train, the training set is defined by a query. The doc vectors in the train function use tf-idf to represent the terms in the document. The idf is calculated for the specific training set, allowing multiple training sets to be stored in the same collection without polluting the idf.
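
      As a rough sketch (the exact tf-idf variant used by the patch isn't shown here), each doc vector entry is computed along these lines, with docFreq and numDocs taken from the training set rather than the whole collection:

      // tf-idf weight for one term in one doc vector (one common variant).
      // docFreq and numDocs are computed over the training set only, so the
      // idf stays specific to that training set.
      static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
        double idf = Math.log((double) numDocs / docFreq) + 1.0;
        return termFreqInDoc * idf;
      }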

      In the train function, a batch gradient descent approach is used to iteratively train the model.
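
      A single batch gradient descent iteration for logistic regression looks roughly like this (a minimal sketch; the names and the exact update rule in the patch may differ):

      // One batch gradient descent step over all doc vectors.
      // docs: tf-idf doc vectors, outcomes: 0/1 labels, alpha: learning rate.
      static double[] iterate(double[][] docs, int[] outcomes,
                              double[] weights, double alpha) {
        double[] gradient = new double[weights.length];
        for (int d = 0; d < docs.length; d++) {
          double dot = 0.0;
          for (int j = 0; j < weights.length; j++) {
            dot += weights[j] * docs[d][j];
          }
          double predicted = 1.0 / (1.0 + Math.exp(-dot)); // logistic function
          double error = predicted - outcomes[d];
          for (int j = 0; j < weights.length; j++) {
            gradient[j] += error * docs[d][j]; // accumulate over the full batch
          }
        }
        double[] updated = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
          updated[j] = weights[j] - alpha * gradient[j]; // step against the gradient
        }
        return updated;
      }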

      Both the features and the train functions are embedded in Solr using the AnalyticsQuery framework, so only the model is transported across the network with each iteration.

      Both the features and the models can be stored in a SolrCloud collection. Using this approach, Solr can hold millions of models, which can be selectively deployed. For example, a model could be trained for each user to personalize ranking and recommendations.

      Below is the final iteration of a model trained on the Enron Ham/Spam dataset. The model includes the terms and their idfs and weights, as well as a classification evaluation describing the accuracy of the model on the training set.

      {
      			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 
4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
      			"alpha_d": 7.150861416624748E-4,
      			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
      			"iteration_i": 100,
      			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 
0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
      			"field_s": "body",
      			"trueNegative_i": 3570,
      			"falseNegative_i": 35,
      			"falsePositive_i": 75,
      			"error_d": 176.8112932306374,
      			"truePositive_i": 1381,
      			"id": "model_100"
      		}
      
      1. SOLR-9252.patch
        88 kB
        Cao Manh Dat
      2. SOLR-9252.patch
        87 kB
        Joel Bernstein
      3. SOLR-9252.patch
        88 kB
        Joel Bernstein
      4. SOLR-9252.patch
        93 kB
        Cao Manh Dat
      5. SOLR-9252.patch
        92 kB
        Cao Manh Dat
      6. SOLR-9252.patch
        90 kB
        Cao Manh Dat
      7. SOLR-9252.patch
        87 kB
        Cao Manh Dat
      8. SOLR-9252.patch
        87 kB
        Cao Manh Dat
      9. SOLR-9252.patch
        74 kB
        Cao Manh Dat
      10. SOLR-9252.patch
        75 kB
        Cao Manh Dat
      11. SOLR-9299-1.patch
        7 kB
        Cao Manh Dat


          Activity

          Cao Manh Dat added a comment -

          Enron mail dataset

          Cao Manh Dat added a comment -

          Initial patch.

          Joel Bernstein added a comment - edited

          This is an exciting patch!

          I closed out SOLR-9186 so work can focus on this patch.

          I'll open another ticket describing a broader framework for optimizing, storing and deploying AI models within Streaming Expression framework and link it to this ticket.

          Cao Manh Dat added a comment - edited

          Updated patch. I changed the feature selection formulation to the correct one (https://en.wikipedia.org/wiki/Information_gain_in_decision_trees). Here are the test results for the new formulation (https://docs.google.com/spreadsheets/d/1BRjFgZDiJPBT51kggcCznoK0ES1-N-RbOIJaoDT3qgM/edit?usp=sharing).

          I think the patch is ready now.

          Joel Bernstein added a comment - edited

          Cao Manh Dat, I have the patch applied and have begun the review.

          I've started with the FeaturesSelectionStream and IGainTermsQParserPlugin. I'll need more time and some collaboration to review the math. But I can say now that the mechanics of feature selection look very good. The use of the streaming framework and analytics query is really nice.

          The one thing that we'll want to do is put some thought into how the features can be stored and retrieved.

          Currently it looks like there is a tuple for each term/score pair. I think this works well using the update() function to send the tuples to another collection for storage. A few minor things to consider:

          1) Should we use a field type postfix (term_s, score_f) to ensure that fields are indexed properly in another collection?

          2) We'll need to add some kind of feature set ID so the feature set can be retrieved later. Each tuple will then be tagged with the feature set ID. Possibly adding a featureSet parameter to the stream makes sense for this.

          3) We can also add a unique ID which will be used for the unique ID for each tuple in the index. We could concat the term with the feature set ID to make the unique ID.
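
          For illustration, with a featureSet parameter along those lines, persisting the features could look something like this (update() is the existing streaming expression; the featuresCollection destination and the exact parameters are hypothetical):

          update(featuresCollection, batchSize=50,
                 features(collection1,
                          q="*:*",
                          featureSet="first",
                          field="body",
                          outcome="out_i",
                          numTerms=100))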

          Cao Manh Dat added a comment - edited

          This is a good change. I think we should store the index of terms (from 0 -> number of terms) as well, so we can retrieve terms in sorted order.

          Cao Manh Dat added a comment -

          Hi Joel,

          Should we add a postfix for the textlogitstream output too? Do you think we should define a standard schema for all ML stream output?

          Joel Bernstein added a comment -

          I think we should postfix the textlogitstream also.

          A standard schema would be nice, but I don't know if it will be possible. For example, the logit() model is pretty different from the tlogit() model.

          I'll provide some more feedback on textlogitstream shortly.

          Joel Bernstein added a comment -

          Ok I reviewed the TextLogitStream and it looks great! The ClassificationEvaluation is really nice.

          Really the whole patch looks very good.

          What I need to do now is test with a few different data sets. This will verify the results that Cao Manh Dat has been getting. It will also test out the mechanics of running the functions and storing and retrieving the features and models.

          Cao Manh Dat added a comment -

          Thanks for the review, I will upload an updated patch shortly.

          In ML we deal a lot with numbers and arrays of numbers, so I think we can use dynamic fields to define a standard schema.
          For example:

          *_i : int
          *_is : array of int
          ...
          
          Cao Manh Dat added a comment -

          Updated patch. This patch includes:

          • Support for storing the textLogit & featureSelection output by using updateStream.
          • The textLogit model now supports exact idfs by using SOLR-9243.
          Joel Bernstein added a comment - edited

          I just reviewed the latest patch and it looks good. One implementation detail:

          The terms component also returns the numDocs now that SOLR-9193 has been committed. So you can retrieve the numDocs along with the doc frequencies by adding the terms.stats param.
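
          For example, a request along these lines should return the doc frequencies together with numDocs (an illustrative URL; the exact handler path and params depend on how the terms component is configured):

          /solr/training/terms?terms.fl=body&terms.stats=true&terms.limit=10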

          And one question about the use of tf-idf:

          You're using tf-idf for the doc vectors which seems like a good idea. Is this a typical approach for text regression or is this something you decided to do because we have access to these types of stats in the index?

          Cao Manh Dat added a comment -

          Thanks, that seems like a good improvement; I will update the patch soon.

          In general, TF-IDF is a good/standard way to represent documents for classification. We could use TF only, but it won't be as good as TF-IDF, and a nice thing about Solr is that we can get the IDF of terms very quickly.

          Cao Manh Dat added a comment -

          Updated patch based on Joel Bernstein's feedback about numDocs().

          Cao Manh Dat added a comment -

          Updated the stream expression tests.

          Joel Bernstein added a comment -

          I've been working with the latest patch. After putting the enron1.zip file in place, all the test methods in StreamExpressionTest pass on their own. But if you run the entire StreamExpressionTest you get failures. I'm investigating this now and will update the ticket when I've got it resolved. The latest run had the following failures, but different ones fail on each run:

          [junit4] Tests with failures [seed: F53E526DA62A037F]:
          [junit4] - org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testFeaturesSelectionStream
          [junit4] - org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testUpdateStream

          Joel Bernstein added a comment - edited

          One of the things I've been thinking about is the function names. I think we can shorten the featureSelection function to just be features.

          I think we could change the tlogit function to train. So the syntax would look like this:

          train(collection1, q="*:*",
                features(collection1, 
                         q="*:*",  
                         field="tv_text", 
                         outcome="out_i", 
                         positiveLabel=1, 
                         numTerms=100),
                field="tv_text",
                outcome="out_i",
                maxIterations=100)
          

          In the future both the features and the train functions can have a parameter for setting the algorithm. The default algorithm in the initial release will be information gain for feature selection, and logistic regression for training.

          Cao Manh Dat added a comment -

          +1
          That will help make the expressions cleaner.

          Cao Manh Dat added a comment -

          It turns out that the cause of the test failures is that we create a temporary collection and do not delete it, so an exception is thrown when we try to create the same temporary collection in another test.

          David Smiley added a comment -

          Can the point of this be explained in layman's terms? I am not familiar with logistic regression and how it relates to search.

          Joel Bernstein added a comment - edited

          This is part of the larger ticket SOLR-9258, which will provide more context.

          Here are some specifics about this ticket:

          Logistic regression is a machine learning classification algorithm.

          It's binary, so it's used to determine if something belongs to a class or not.

          With logistic regression you train a model using a training data set. And then use that model to classify other documents.

          This ticket trains a logistic regression model on text. So it builds a model based on the terms in the documents. New documents can then be classified based on the terms in the documents.

          The terms in the document are known as features.

          The first step in the process is feature selection, which is to select the important terms from the training set that will be used to build the model. This ticket uses an algorithm called Information Gain to select the features.

          The next step is to train a model based on those features. This ticket uses Stochastic Gradient Descent to train a logistic regression model over the training set. Stochastic Gradient Descent is an iterative approach.

          Both the features and the model can then be stored in a SolrCloud collection.
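
          To make that concrete, applying a stored model to a new document amounts to a dot product and the logistic function, roughly like this sketch (illustrative names, not the code in the patch):

          // Classify a document's tf-idf vector with a trained model.
          static boolean classify(double[] docVector, double[] weights) {
            double dot = 0.0;
            for (int i = 0; i < weights.length; i++) {
              dot += weights[i] * docVector[i];
            }
            double p = 1.0 / (1.0 + Math.exp(-dot)); // logistic (sigmoid) function
            return p >= 0.5; // positive class if the probability crosses 0.5
          }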

          Joel Bernstein added a comment -

          One of the things that's really interesting about this ticket is that the algorithms used rely heavily on the statistics in the index. So, a search engine is really the best possible place to be doing machine learning on text documents. Logistic Regression is the easiest algorithm to get started with, but other algorithms will likely follow.

          Joel Bernstein added a comment -

          Here is an example of how you could use the models in a search context.

          1) Identify a training set for each user based on usage logs. This could involve what the user has viewed before, or liked. To increase the size of the training set you could use graph queries to find documents that co-occur in the same session most frequently with documents that user has viewed or liked.

          2) Optimize a model for the specific training set and store the model in a SolrCloud collection.

          3) Use the model either in the re-ranker to boost documents based on the score from the model, or as part of an alerting engine to push documents the user might be interested in.

          4) Background daemons could run in Solr to build models for users. This would result in possibly millions of models, which is fine, because the models are simply stored in a SolrCloud collection.

          David Smiley added a comment -

          So is a "model" ultimately a Document? I'm guessing so since you mentioned putting them in Solr and having millions of them.

          Are there intermediate steps where data is put into Solr that isn't a model? You mentioned feature selection but it's not clear if that is materialized into Solr data or if it's used purely in-memory transiently.

          Thanks for your explanation.

          Joel Bernstein added a comment -

          Yes, the model can be saved as a document. The model contains the features that were used to create it, and the associated weights for each feature.

          Feature selection can be done as a separate step and stored in the index. Feature selection takes time, and it's likely users will want to view the features that were extracted from the training data. Also, features could be used for other purposes, as they are really just a list of terms that provide the most "information" about a training set. So it would be useful to store them.

          The training function reads the features as a stream, so they can either be a stored feature set, or generated on the fly.

          Joel Bernstein added a comment - edited

          I've been working with the latest patch. I started with the featureSelection function.

          I found that as I increased numTerms, terms were dropping off the list. I'm not exactly sure why that is.

          Using the Enron data set:

          numTerms=5:

          {"result-set":{"docs":[
          {"index_i":1,"featureSet_s":"first","id":"first_1","score_f":0.07897711944252839,"term_s":"daren"},
          {"index_i":2,"featureSet_s":"first","id":"first_2","score_f":0.08489573252924343,"term_s":"hpl"},
          {"index_i":3,"featureSet_s":"first","id":"first_3","score_f":0.09142281976072042,"term_s":"cc"},
          {"index_i":4,"featureSet_s":"first","id":"first_4","score_f":0.09565949144858465,"term_s":"2000"},
          {"index_i":5,"featureSet_s":"first","id":"first_5","score_f":0.11833764555978427,"term_s":"enron"},
          {"EOF":true,"RESPONSE_TIME":12}]}}
          

          numTerms:10

          {"result-set":{"docs":[
          {"index_i":1,"featureSet_s":"first","id":"first_1","score_f":0.06381886672504677,"term_s":"hou"},
          {"index_i":2,"featureSet_s":"first","id":"first_2","score_f":0.06554238725349948,"term_s":"ect"},
          {"index_i":3,"featureSet_s":"first","id":"first_3","score_f":0.06622407002267094,"term_s":"pm"},
          {"index_i":4,"featureSet_s":"first","id":"first_4","score_f":0.06679642321097634,"term_s":"thanks"},
          {"index_i":5,"featureSet_s":"first","id":"first_5","score_f":0.0679334895610123,"term_s":"forwarded"},
          {"index_i":6,"featureSet_s":"first","id":"first_6","score_f":0.06883768842689886,"term_s":"gas"},
          {"index_i":7,"featureSet_s":"first","id":"first_7","score_f":0.07775307852726465,"term_s":"http"},
          {"index_i":8,"featureSet_s":"first","id":"first_8","score_f":0.07897711944252839,"term_s":"daren"},
          {"index_i":9,"featureSet_s":"first","id":"first_9","score_f":0.08489573252924343,"term_s":"hpl"},
          {"index_i":10,"featureSet_s":"first","id":"first_10","score_f":0.09142281976072042,"term_s":"cc"},
          {"EOF":true,"RESPONSE_TIME":12}]}}
          

          Notice that enron had the highest score in the first result set but is missing from the second result set.

          Also, in the code below it's taking the highest score from the shards for a term rather than combining the scores. Is that the preferred approach for distributed IGain?

          for (Future<NamedList<Double>> getTopTermsCall : callShards(getShardUrls())) {
            NamedList<Double> shardTopTerms = getTopTermsCall.get();
            for (int i = 0; i < shardTopTerms.size(); i++) {
              String term = shardTopTerms.getName(i);
              double score = shardTopTerms.getVal(i);
              // Keep the highest score seen for the term across all shards.
              if (!termScores.containsKey(term) || termScores.get(term) < score) {
                termScores.put(term, score);
              }
            }
          }
          
          Joel Bernstein added a comment -

          It looks like the final sort in the FeatureSelectionStream is ascending when it should be descending.

          Cao Manh Dat added a comment - edited

          You are absolutely right! The fix for the problem should be:

          // Sort the term/score entries in descending order of score.
          st.sorted(Map.Entry.comparingByValue((c1, c2) -> c2.compareTo(c1)))
            .forEachOrdered(e -> result.put(e.getKey(), e.getValue()));
          

          Also, in the code below it's taking the highest score from the shards for a term rather than combining the scores. Is that the preferred approach for distributed IGain?

          It's just my current approach, because we don't have any paper about distributed IGain. I will do some more tests to check both approaches.

          Cao Manh Dat added a comment - edited

          I'm having second thoughts about changing tlogit to a train function, because different algorithms have different sets of parameters. For example, tlogit and logit have totally different parameters. I think we should change featuresSelection to features but keep tlogit as it is.

          Joel Bernstein, +1 for summing up the igain scores from all shards, so we can get the best terms across all shards. This is not yet proven, because it is based on a lot of assumptions about how documents, classes, and terms are distributed, but I think it will be good enough for most cases. If you don't have any comments, I will submit a fixed patch soon.
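
          Illustratively, summing instead of keeping the max would reduce the merge loop body to something like this (assuming termScores is a java.util.Map<String, Double>):

          // Sum each term's igain score across shards instead of keeping the max.
          termScores.merge(term, score, Double::sum);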

          Joel Bernstein added a comment -

          Ok, I made the change with the sort locally and now the terms are not dropping off as numTerms changes. I'll keep testing.

          Joel Bernstein added a comment - edited

          Ok, I reviewed the TextLogisticRegressionCollector. I think we're going to have to change this implementation. I thought I saw an older version of this that was using the finish() method to perform the logit, but in the current version it's doing this in the collect() method. This is going to have trouble scaling, and it also requires term vectors, which we don't want to have to use.

          I think the approach to take is to collect the matching bitset in the collect() method.

          Then in the finish() method the logic is:

          1) Get the top level TermsEnum
          2) Iterate the features and seek into the terms enum
          3) For each feature iterate the DocsEnum and compare to the matching docs bitset to build the doc vectors.

          With this approach the doc vectors are a multi-dimensional array:

          doc0-> [featureValue, featureValue, featureValue]
          doc1-> [featureValue, featureValue, featureValue]
          ...

          With this approach we'll have to hold all the doc vectors in memory at once. So if you have hundreds of features and millions of records in the training set, you'll need a large cluster to do the work.

          We can also add a randomized approach to this so that not every doc vector is calculated on each iteration.
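
          A rough sketch of that finish() logic, assuming Lucene's TermsEnum/PostingsEnum APIs (the method name, the docToRow mapping, and the raw-tf weighting are illustrative, not from the patch; uses org.apache.lucene.index.{LeafReader, Terms, TermsEnum, PostingsEnum}, org.apache.lucene.search.DocIdSetIterator, and org.apache.lucene.util.{BytesRef, FixedBitSet}):

          // Build doc vectors for the matching training docs in finish().
          // matching: bitset collected in collect(); features: selected terms.
          static double[][] docVectors(LeafReader reader, String field,
                                       String[] features, FixedBitSet matching)
              throws IOException {
            // Map each matching docId to a row in the vector array.
            Map<Integer, Integer> docToRow = new HashMap<>();
            int rows = 0;
            for (int doc = 0; doc < matching.length(); doc++) {
              if (matching.get(doc)) {
                docToRow.put(doc, rows++);
              }
            }
            double[][] vectors = new double[rows][features.length];
            Terms terms = reader.terms(field);
            if (terms == null) {
              return vectors;
            }
            TermsEnum termsEnum = terms.iterator();
            for (int f = 0; f < features.length; f++) {
              if (!termsEnum.seekExact(new BytesRef(features[f]))) {
                continue; // feature term absent from this reader
              }
              PostingsEnum postings = termsEnum.postings(null, PostingsEnum.FREQS);
              int doc;
              while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                Integer row = docToRow.get(doc);
                if (row != null) {
                  vectors[row][f] = postings.freq(); // raw tf; scale by idf for tf-idf
                }
              }
            }
            return vectors;
          }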

          Cao Manh Dat added a comment -

          Updated patch. This patch changes a few points:

          • Do the training in the finish() method. It's much faster than the previous approach (thanks Joel Bernstein).
          • Change featuresSelection to features.
          • FeaturesSelectionStream now sums up term scores from all shards.
          Joel Bernstein added a comment -

          Ok, I have the patch running and it looks great.

          I have the following expression running:

          train(training, 
                  features(training, q="*:*", featureSet="first", field="body", outcome="out_i", numTerms=200), 
                  q="*:*", 
                  name="model", 
                  field="body", 
                  outcome="out_i", 
                  maxIterations=100)
          

          In the patch train is still the function name in the /stream handler. But we can make a final decision on this before committing.

          The accuracy seems to be 98% on the Enron training data with this patch. Here is the final model:

          {
          			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 
4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
          			"alpha_d": 7.150861416624748E-4,
          			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
          			"iteration_i": 100,
          			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 
0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
          			"field_s": "body",
          			"trueNegative_i": 3570,
          			"falseNegative_i": 35,
          			"falsePositive_i": 75,
          			"error_d": 176.8112932306374,
          			"truePositive_i": 1381,
          			"id": "model_100"
          		}
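The 98% training-set accuracy quoted above follows directly from the confusion-matrix fields in the model:

accuracy = (truePositive_i + trueNegative_i) / total
         = (1381 + 3570) / (1381 + 3570 + 75 + 35)
         = 4951 / 5061
         ≈ 0.978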
          
Joel Bernstein added a comment - edited

Ok, here is my thinking on train versus tlogit:

          The train function would initially map directly to the TextLogitStream. We can document that train is a text logistic regression model trainer in the first release.

As we add more algorithms, the train function will map to the TrainStream. The TrainStream won't have any implementation of its own; it will simply be a facade for different training algorithms. The TrainStream will have a parameter called algorithm, which it will use to select the stream implementation, such as TextLogitStream. The underlying implementation will handle the parameters; all the TrainStream will do is instantiate the algorithm and run it.

          Sample syntax:

          train(collection, 
                features(...), 
                algorithm="tlogit", 
                q="*:*", ....)
          
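To make the facade idea concrete, here is a minimal sketch (everything below is hypothetical except TextLogitStream; the real facade would be wired through the streaming-expression parser rather than a plain parameter map):

import java.util.Map;

// Illustrative only: the facade selects a concrete trainer from the
// algorithm parameter and passes the remaining parameters straight
// through, so adding a new algorithm never adds a new function name.
interface TrainingStream {
  void run();
}

class TextLogitStreamSketch implements TrainingStream {
  private final Map<String, String> params;
  TextLogitStreamSketch(Map<String, String> params) { this.params = params; }
  public void run() { /* batch gradient descent over the training set */ }
}

class TrainStreamSketch {
  static TrainingStream select(Map<String, String> params) {
    String algorithm = params.getOrDefault("algorithm", "tlogit");
    switch (algorithm) {
      case "tlogit":
        // the underlying implementation interprets q, field, outcome, etc.
        return new TextLogitStreamSketch(params);
      default:
        throw new IllegalArgumentException("Unknown training algorithm: " + algorithm);
    }
  }
}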

We can use the same facade approach for the classify and features functions.

The documentation can describe how to call train with different algorithms.

I like this approach because it provides three very easy-to-understand functions: train, classify and features.

It also stops the explosion of functions that would occur when we have multiple classify, train and features algorithms.

Cao Manh Dat added a comment

+1, that makes sense!

Joel Bernstein added a comment - edited

          I've been doing a final review of the patch. I have a question about the use of numDocs and docFreq in IGainTermsQParserPlugin.

Currently the numDocs and docFreq for the entire index are used instead of calculating these values specifically for the training set.

I've been testing with an index which only contains the training set. In this case it doesn't matter, because the numDocs and docFreq for the index are the same as for the training set.

But in scenarios where IGain is run on a slice of a larger index, does it make sense to calculate numDocs and docFreq for the training set? Or is there value in using the global numDocs and docFreq in this scenario?

Also, is the use case that we always load the training set into its own collection? If that's the case then we could drop the q parameter.

Joel Bernstein added a comment

          New patch with all tests passing

Cao Manh Dat added a comment

+1, these should be positive examples vs. the training set!

Cao Manh Dat added a comment

          Hi Joel Bernstein
The latest patch seems to be missing:

          • Necessary qparser in QParserPlugin.
          • Test for tlogit expression (inside StreamingExpressionTest)
Joel Bernstein added a comment

          Yeah, I just realized my last patch wasn't correct.

          New patch coming up shortly.

Joel Bernstein added a comment

          New patch adding the idfs to the features.

Joel Bernstein added a comment - edited

          Ok, just added a new patch which I believe was generated properly. I removed my last patch which was not generated properly.

The new patch calculates the idf from the training set instead of using the global docFreq and numDocs. This is done in the IGainTermsQParserPlugin, and the idf is now emitted with the Term in the FeatureSelectionStream.

          The TextLogitStream has been adjusted to use the idf provided with the features.
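Sketched out, the change amounts to this (illustrative only; the actual counting happens inside the IGainTermsQParserPlugin collector, and the exact idf formula in the patch may differ):

// Both counts are restricted to documents matched by the training-set query,
// so the resulting idf is independent of whatever else lives in the index.
static double trainingSetIdf(int numDocsInTrainingSet, int docFreqInTrainingSet) {
  return Math.log((double) numDocsInTrainingSet / (docFreqInTrainingSet + 1));
}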

Cao Manh Dat added a comment

+1, the new patch looks good!

Joel Bernstein added a comment - edited

          Cao Manh Dat, a couple of questions about the tests:

1) In the feature selection test there isn't an assertion for the order of terms. But it seems there could be one, because the results are ordered by score and the score appears to be deterministic. Should we add an assertion on the order of the terms?

2) In the text logit stream test, how did you choose the values for the test records?

// first feature is the bias value
Double[] testRecord = {1.0, 1.17, 0.691, 0.0, 0.0};
double d = sum(multiply(testRecord, lastWeightsArray)); // dot product with the final model weights
double prob = sigmoid(d);
assertEquals(prob, 1.0, 0.1); // this record should classify as positive

// first feature is the bias value
Double[] testRecord2 = {1.0, 0.0, 0.0, 1.17, 0.691};
d = sum(multiply(testRecord2, lastWeightsArray));
prob = sigmoid(d);
assertEquals(prob, 0, 0.1); // this record should classify as negative

          It would probably be good to document the values for the test records.
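For reference, the helpers that snippet relies on would look roughly like this (typical definitions sketched from the usage above; the actual test utilities may differ):

// Element-wise product of the test record and the final model weights.
private static double[] multiply(Double[] record, double[] weights) {
  double[] out = new double[record.length];
  for (int i = 0; i < record.length; i++) {
    out[i] = record[i] * weights[i];
  }
  return out;
}

// Summing the element-wise products yields the dot product d.
private static double sum(double[] values) {
  double total = 0.0;
  for (double v : values) {
    total += v;
  }
  return total;
}

// Standard logistic function: maps d to a probability in (0, 1).
private static double sigmoid(double x) {
  return 1.0 / (1.0 + Math.exp(-x));
}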

Joel Bernstein added a comment

          New patch that asserts the order of terms in the feature selection test.

          Also removes the terms parameter from the TextLogitStream and requires a features stream.

Cao Manh Dat added a comment

Updated patch which corrects the test for TextLogitStream.

Joel Bernstein, in this patch the testRecord is built from a string.

Joel Bernstein added a comment - edited

          This looks great, thanks for adding this.

          I've got a commit ready to push out that doesn't include this patch, but we can work it into a follow-up commit.

ASF subversion and git services added a comment

          Commit 87938e00e9f1006801fbf0e8c0d7b2a84b5eda48 in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=87938e0 ]

          SOLR-9252: Feature selection and logistic regression on text

ASF subversion and git services added a comment

          Commit 73de207201f43b1d8d3f3623dd12dd0ae2f9605c in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=73de207 ]

          SOLR-9252: Pre-commit fixes

ASF subversion and git services added a comment

          Commit e38d6d535c38c2d679cd9b0302fb96a75eda19c9 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e38d6d5 ]

          SOLR-9252: Feature selection and logistic regression on text

          Conflicts:
          solr/core/src/java/org/apache/solr/handler/StreamHandler.java

ASF subversion and git services added a comment

          Commit 728b4fbcdcf3682b2b1d571d088c0fbb78850606 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=728b4fb ]

          SOLR-9252: Pre-commit fixes

ASF subversion and git services added a comment

          Commit 2c4542ea0204f8cb3a966fc697651226e09d2ee5 in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2c4542e ]

          SOLR-9252: Update CHANGES.txt

ASF subversion and git services added a comment

          Commit f8cf9a7bf2f69094e0c20b97e53de46c870df490 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8cf9a7 ]

          SOLR-9252: Update CHANGES.txt

Joel Bernstein added a comment

          Cao Manh Dat, thanks for all your work on this ticket! It looks really good.

          I'll make the last change to the test case and close this ticket out later in the week.

Shalin Shekhar Mangar added a comment - edited

          Nice! Great to see this land in Solr. One question though – In IGainTermsCollector, both positiveSet and negativeSet are kept around and used while iterating the postingsEnum. Is that to handle deleted docs? If not, isn't any doc not in the positiveSet automatically in the negativeSet?

Joel Bernstein added a comment - edited

          Thanks!

The negative set is needed because we're calculating idf specific to the training set, rather than using the global idf for the index. Originally we were using the idf for the full index and the negative set was not needed.

This will allow us to have multiple training sets in the same collection without polluting each other's idf.
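Put differently (a sketch of the bookkeeping, not the actual collector code, and one common idf formulation rather than necessarily the patch's exact formula):

trainingSetSize  = |positiveSet| + |negativeSet|
docFreq_ds(term) = docs in positiveSet ∪ negativeSet that contain term
idf_ds(term)     = log(trainingSetSize / docFreq_ds(term))

Without the negative set there would be no way to count trainingSetSize or docFreq_ds over just the training documents, and the numbers would have to fall back to index-wide values.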

Cao Manh Dat added a comment

          Thanks for all your work to review this ticket, too!

Shalin Shekhar Mangar added a comment

          Makes sense. Thanks for the explanation Joel.

Joel Bernstein added a comment

While testing with different feature selection approaches, I ran across what I believe is a bug in TextLogisticRegressionQParserPlugin.

If a document doesn't contain any of the features, a doc vector isn't created for that document.

          So that document is skipped while optimizing the model.

          I'm not sure if this is the correct behavior.

Cao Manh Dat added a comment

In that case, I think we should ignore these documents in the training/classify step.

Joel Bernstein added a comment

          Ok, then we can leave it as is.

Cao Manh Dat added a comment

I mean we should ignore those documents inside the training for loop.

          So it will be

          for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
            ...
          }
          

          to

          for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
  double[] vector = entry.getValue();
  if (isZeros(vector)) continue;
            ...
          }
          

Because we can have identical zero vectors with different labels (both positive and negative).
          I will submit a patch soon to include this change and regularization.

Cao Manh Dat added a comment

A minor patch:

• In the training step, ignore documents that don't have any of the given features.
• Add regularization for logit (http://www.holehouse.org/mlclass/07_Regularization.html); see the sketch below.
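A minimal sketch of how the two changes might combine in one training iteration (illustrative names, not the patch itself):

// One batch-gradient-descent step with L2 regularization. All-zero doc
// vectors are skipped: they carry no feature signal and can appear with
// both labels, so they would only add noise to the gradient.
static void iterate(double[][] docs, int[] labels, double[] weights,
                    double alpha, double lambda) {
  double[] gradient = new double[weights.length];
  int counted = 0;
  for (int i = 0; i < docs.length; i++) {
    if (isZeros(docs[i])) continue;                 // ignore featureless docs
    double d = 0.0;
    for (int j = 0; j < weights.length; j++) {
      d += weights[j] * docs[i][j];
    }
    double error = sigmoid(d) - labels[i];          // gradient of the log loss
    for (int j = 0; j < weights.length; j++) {
      gradient[j] += error * docs[i][j];
    }
    counted++;
  }
  if (counted == 0) return;
  for (int j = 0; j < weights.length; j++) {
    double reg = (j == 0) ? 0.0 : lambda * weights[j]; // bias term not regularized
    weights[j] -= alpha * (gradient[j] / counted + reg);
  }
}

static boolean isZeros(double[] vector) {
  for (double v : vector) {
    if (v != 0.0) return false;
  }
  return true;
}

static double sigmoid(double x) {
  return 1.0 / (1.0 + Math.exp(-x));
}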

Kevin Risden added a comment

          Joel Bernstein - Should this ticket still be open? Looks like there were commits to master and branch_6x?

Joel Bernstein added a comment

Cao Manh Dat added an improved test case which I was planning to commit, but haven't gotten to it yet. We could resolve this ticket and create a new ticket with the latest patch as a starting point.

Jeroen Steggink added a comment - edited

          This would be great, as the regularization makes the training way more useful.

Joel Bernstein added a comment - edited

          I think the latest patches on this ticket have fallen through the cracks.

          Let's close out this ticket and open a new one for Cao Manh Dat's latest work.

Joel Bernstein added a comment

          SOLR-9816 has been opened. We can add the latest patches from this ticket when we're ready to work on it.


People

• Assignee: Joel Bernstein
• Reporter: Cao Manh Dat
• Votes: 0
• Watchers: 8
