Using SMOTENC in a pipeline

Question

I am trying to figure out the appropriate way to build a pipeline to train a model which includes using the SMOTENC algorithm:

Given that k-nearest neighbors and Euclidean distance are used, should the data be normalized (i.e., each input vector scaled individually to unit norm) prior to applying SMOTENC in the pipeline?

Can the algorithm handle missing values? If data imputation and outlier removal based on median and percentile values are performed before SMOTENC rather than after it, wouldn't this bias the imputation and the percentiles?

Can SMOTENC be applied after one-hot encoding, by declaring the resulting binary columns as categorical features?

When the pipeline is included in a cross-validation scheme, will the data balancing be applied only to the imbalanced training folds, or also to the test fold?

Here is what my pipeline currently looks like:
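from imblearn.pipeline import Pipeline as Pipeline_imb
from imblearn.over_sampling import SMOTENC
from scipy.stats.mstats import winsorize  # assumed source of winsorize
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Boolean mask marking which columns are categorical (list elided here)
categorical_features_bool = [True, True, ……. False, False]

# RANDOM_STATE_GRID is defined elsewhere
smt = SMOTENC(categorical_features=categorical_features_bool,
              random_state=RANDOM_STATE_GRID,
              k_neighbors=10,
              n_jobs=-1)

preprocess_pipeline = ColumnTransformer(
    transformers=[
        # Clip the top 2% of values in the listed columns
        ('Winsorize',
         FunctionTransformer(winsorize, validate=False,
                             kw_args={'limits': [0, 0.02], 'inplace': False, 'axis': 0}),
         ['feat_1', 'Feat_2']),
        # Median imputation, plus indicator columns for missing values
        ('num_impute',
         SimpleImputer(strategy='median', add_indicator=True),
         ['feat_10', 'Feat_15']),
    ],
    remainder='passthrough',  # pass through features not listed
    n_jobs=-1,
    verbose=False
)

Model = LogisticRegression()

model_pipeline = Pipeline_imb([
    ('preprocessing', preprocess_pipeline),
    ('smt', smt),
    ('Std', StandardScaler()),
    ('classifier', Model)
])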

Jacques Wainer · Accepted Answer

The usual normalisation for Euclidean distance is NOT to scale each input to unit length, but to scale each column to mean 0 and variance 1. Scaling each individual sample is possible, but it is not common.
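To make the difference concrete, here is a minimal sketch contrasting the two kinds of scaling with scikit-learn. The toy array is made up for illustration; `StandardScaler` works column-wise, while `Normalizer` works row-wise.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data: two columns on very different scales (values are made up).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 6000.0]])

# Column-wise standardisation: each feature gets mean 0 and variance 1.
X_std = StandardScaler().fit_transform(X)

# Row-wise normalisation: each sample is rescaled to unit L2 norm.
X_unit = Normalizer(norm="l2").fit_transform(X)

col_means = X_std.mean(axis=0)              # per-column means after standardising
col_stds = X_std.std(axis=0)                # per-column standard deviations
row_norms = np.linalg.norm(X_unit, axis=1)  # per-row lengths after normalising
```

After standardisation both columns contribute comparably to a Euclidean distance. After row normalisation the second column still dominates each sample, which is one reason it is the less common choice here.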

On missing values: I don't know.

The whole point of SMOTENC is not to do the one-hot encoding. One-hot encoding is a way to transform categorical data into numeric data (on multiple dimensions) for algorithms that cannot deal with categorical data. So my suggestion is not to convert the categorical columns, and to let SMOTENC deal with them.

The Pipeline in imblearn does the right thing: it only applies the oversampling (or other imbalance strategies) on the training set, not on the test set. See this question on Stack Overflow: https://stackoverflow.com/questions/63520908/does-imblearn-pipeline-turn-off-sampling-for-testing
