How to restrict the columns to be passed to final classifier in PMML Pipeline

Question

I am building an XGBoost PMML model using scikit-learn and sklearn2pmml.
I have some numerical, some categorical, and some datetime columns, from which I create new features inside the pipeline. When I try to train the model, it fails because the original categorical features are also passed to the final classifier by default. Is there any way to restrict the features by specifying the feature names?

Akshay Tilekar · Accepted Answer

After a lot of digging and some help from the sklearn2pmml creator, I
managed to filter the final columns passed to the classifier.
Note: here recorder is a DataFrameMapper object, and categoricalCols is the list of original categorical column names to drop.

1. Getting the categorical column indexes.
cat_cols = [recorder.transformed_names_.index(c) for c in categoricalCols if c in recorder.transformed_names_]

2. Adding a ColumnTransformer that drops those columns by index and passes the rest through.
from sklearn.compose import ColumnTransformer
from sklearn2pmml.pipeline import PMMLPipeline
import xgboost as xgb

pipeline = PMMLPipeline([
    ("mapper", recorder),
    ("select", ColumnTransformer([("drop", "drop", cat_cols)], remainder='passthrough')),
    ("classifier", xgb.XGBClassifier())
])

3. Fitting the pipeline to the data.
pipeline.fit(X_train, y_train)

4. Creating the PMML file from the pipeline.
from sklearn2pmml import sklearn2pmml

out_file = "XGBoost.pmml"
sklearn2pmml(pipeline, out_file)
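
To make step 1 easier to check in isolation, here is a minimal sketch of the index lookup in plain Python. The column names and the helper drop_indexes are hypothetical and only for illustration; in the real pipeline the name list would come from recorder.transformed_names_ after the mapper has been fitted.

```python
def drop_indexes(transformed_names, columns_to_drop):
    """Return the positions of columns_to_drop within transformed_names.

    Names that do not appear in transformed_names are skipped silently,
    mirroring the `if c in ...` guard used in step 1.
    """
    return [
        transformed_names.index(c)
        for c in columns_to_drop
        if c in transformed_names
    ]


# Hypothetical mapper output: originals plus one derived feature.
transformed_names = ["age", "income", "city", "gender", "signup_month"]

# Hypothetical list of original categorical columns; "country" is not
# present in the mapper output, so it is ignored.
categorical_cols = ["city", "gender", "country"]

cat_cols = drop_indexes(transformed_names, categorical_cols)

# Columns that would remain after the "drop" step with remainder='passthrough'.
kept = [name for i, name in enumerate(transformed_names) if i not in cat_cols]

print(cat_cols)
print(kept)
```

The resulting cat_cols list is what gets passed to the "drop" entry of the ColumnTransformer in step 2. Because missing names are skipped rather than raising an error, it can be worth printing cat_cols once to confirm that every column you meant to drop was actually found.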
