How do we standardize arrays with NaN?

Data Science Asked by uharsha33 on February 8, 2021

So far I have used StandardScaler() to standardize data, but it doesn't work with NaNs. None of the other scalers I know of (MinMaxScaler, RobustScaler, MaxAbsScaler) work with NaNs either. Are there other methods?

My search turned up a solution:

df['col']=(df['col']-df['col'].min())/(df['col'].max()-df['col'].min())

But this only works with pandas DataFrames (which have column names). Is there a way to add column headers to the matrix?

import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'sepal_length': [3.4, 4.5, 3.5],
                     'sepal_width': [1.2, 1, 2],
                     'petal_length': [5.5, 4.5, 4.7],
                     'petal_width': [1.2, 1, 3],
                     'species': ['setosa', 'virginica', 'setosa']})

#Shuffle the data and reset the index
from sklearn.utils import shuffle
data = shuffle(data).reset_index(drop = True)  

#Create Independent and dependent matrices
X = data.iloc[:, [0, 1, 2, 3]].values 
y = data.iloc[:, 4].values

#train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1, random_state = 0)


# Insert missing values at random
prop = int(X_train.size * 0.5)   # number of cells to replace (~50% of the array)
prop1 = int(X_test.size * 0.5)

a = [random.choice(range(X_train.shape[0])) for _ in range(prop)]   # random row indices for X_train
b = [random.choice(range(X_train.shape[1])) for _ in range(prop)]   # random column indices for X_train
c = [random.choice(range(X_test.shape[0])) for _ in range(prop1)]   # random row indices for X_test
d = [random.choice(range(X_test.shape[1])) for _ in range(prop1)]   # random column indices for X_test
X_train[a, b] = np.nan
X_test[c, d] = np.nan

This is where I get the error: "Input contains NaN, infinity or a value too large for dtype('float64')".

from sklearn.preprocessing import StandardScaler #importing the library that does feature scaling

sc_X = StandardScaler() # created an object with the scaling class

X_train = sc_X.fit_transform(X_train)  # Here we fit and transform the X_train matrix
X_test = sc_X.transform(X_test)

4 Answers

You can use sklearn.preprocessing.Imputer.

Demo:

import numpy as np
from sklearn import datasets as ds
from sklearn.model_selection import train_test_split

# load the Iris data set
data = ds.load_iris()

X = data.data
y = data.target

# artificially set 33% of the values in X to NaN
X.ravel()[np.random.choice(X.size, int(X.size * .33), replace=False)] = np.nan

yields:

In [137]: X
Out[137]:
array([[5.1, 3.5, nan, nan],
       [nan, 3. , 1.4, 0.2],
       [4.7, nan, 1.3, 0.2],
       ...,
       [6.5, 3. , nan, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , nan, 1.8]])

Now we can impute and standardize it:

from sklearn.preprocessing import Imputer, StandardScaler

imp = Imputer(strategy="mean", axis=0)
scale = StandardScaler()

In [139]: X_new = scale.fit_transform(imp.fit_transform(X))

result (the near-zero entries such as -1.37e-15 are imputed column means, which standardize to approximately zero):

In [160]: X_new
Out[160]:
array([[-1.03733263e+00,  1.22587069e+00, -1.37398311e-15, -3.17837019e-16],
       [ 1.18191646e-15, -5.32987255e-02, -1.43399195e+00, -1.35522269e+00],
       [-1.56962048e+00,  2.27226133e-15, -1.49587065e+00, -1.35522269e+00],
       ...,
       [ 8.25674859e-01, -5.32987255e-02, -1.37398311e-15,  1.22131653e+00],
       [ 4.26458969e-01,  9.70036804e-01,  1.04115598e+00,  1.65073974e+00],
       [ 2.72430791e-02, -5.32987255e-02, -1.37398311e-15,  9.35034396e-01]])

Demo2, using Pipeline:

from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

#...    

estimator = Pipeline([("impute", Imputer(strategy="mean", axis=0)),
                      ("scale", StandardScaler()),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])

estimator.fit(X_train, y_train)
#...    
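
Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. A minimal sketch of the same pipeline against the current API (assuming scikit-learn >= 0.20, where sklearn.impute.SimpleImputer is available):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# SimpleImputer replaces Imputer; it has no axis argument and always
# imputes column-wise.
estimator = Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler()),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])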

Answered by MaxU on February 8, 2021

Working with NaNs is always a bit difficult. It may be worth enriching the NaN values instead of discarding them, for example by imputing each missing entry with the average of that feature within a group such as an age class (see the sketch below). If only a few records contain NaNs, you could simply drop them (pandas dropna).
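
A minimal sketch of that kind of group-wise imputation with pandas (the column names age_group and income are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age_group': ['20-29', '20-29', '30-39', '30-39'],
                   'income':    [30000, np.nan, 52000, np.nan]})

# Fill each NaN with the mean income of its age group rather than
# the global mean.
df['income'] = df.groupby('age_group')['income'].transform(
    lambda s: s.fillna(s.mean()))

# Or, if only a few records are affected, drop them instead:
# df = df.dropna()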

Answered by MBDev on February 8, 2021

Standardizing (subtracting the mean and dividing by the standard deviation for each column) can be done using numpy:

Xz = (X - np.nanmean(X, axis=0))/np.nanstd(X, axis=0) 

where X is a matrix (containing NaNs), and Xz is the standardized version of X. Hope this helps.

EDITED:

For a test/training scenario, the mean and std could be stored in respective variables:

m         = np.nanmean(X_train, axis=0)
s         = np.nanstd(X_train, axis=0)
X_train_z = (X_train - m)/s 
X_test_z  = (X_test - m)/s

Answered by Jonathan Foldager on February 8, 2021

This is no longer the case; as of sklearn 0.20.0, missing values are ignored in such preprocessors' fit and silently passed along in their transform:
https://scikit-learn.org/stable/whats_new/v0.20.html#id37 (fourth bullet)
https://github.com/scikit-learn/scikit-learn/issues/10404
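
With a recent version, the scalers therefore accept NaNs directly. A quick check (assuming scikit-learn >= 0.20):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# NaNs are ignored when fitting the per-column mean/std and are
# passed through unchanged by transform.
print(StandardScaler().fit_transform(X))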

Answered by Ben Reiniger on February 8, 2021
