Details
Question
Status: Resolved
Major
Resolution: Invalid
2.3.2
None
Description
I am trying to plot the feature importances of a random forest classifier together with the column names. I am using Spark 2.3.2 and PySpark.
The input X is sentences, and I am using TF-IDF (HashingTF + IDF) plus StringIndexer to generate the feature vectors.
I have included all the stages in a Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, IDF, IndexToString, RegexTokenizer, StringIndexer
# Tokenize the raw text, hash the tokens into term-frequency vectors, then rescale with IDF.
regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col, outputCol="words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
# Index the string labels, and map predictions back to the original label strings.
indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels)
feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100)
pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
model = pipeline.fit(df)
Generating the feature importances as
rdc = model.stages[-2]
print (rdc.featureImportances)
So far so good, but when I try to map the feature importances to the feature columns as below
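from itertools import chain
# Collect (index, name) pairs from the feature column's ML attribute metadata.
attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values())))
# Keep only the features with a non-zero importance.
[(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]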
I get a KeyError on ml_attr:
KeyError: 'ml_attr'
I then printed the metadata dictionary,
print (df_pred.schema["featurescol"].metadata)
and it is empty: {}
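A defensive version of the lookup might look like the sketch below. It assumes the metadata follows the ml_attr / attrs layout used in the snippet above, and it falls back to the feature index when no names are recorded. The helper name name_importances and the "f<idx>" placeholder labels are hypothetical, not part of Spark.

```python
def name_importances(metadata, importances):
    """Pair each non-zero importance with a feature name, or an index label."""
    # Attributes are expected under metadata["ml_attr"]["attrs"], grouped by
    # attribute type. Both keys may be absent, so use .get() instead of [].
    groups = metadata.get("ml_attr", {}).get("attrs", {})
    names = {
        attr["idx"]: attr["name"]
        for group in groups.values()
        for attr in group
        if "name" in attr
    }
    # "f<idx>" is a hypothetical placeholder label, not a Spark convention.
    return [
        (names.get(idx, "f%d" % idx), score)
        for idx, score in enumerate(importances)
        if score
    ]
```

With the objects from this report, the call would presumably be name_importances(df_pred.schema["featurescol"].metadata, rdc.featureImportances.toArray()). With an empty metadata dictionary it returns index-based labels instead of raising KeyError; it cannot recover names that were never stored.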
Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?
Thanks