String Values in a data frame in Pandas

Question

Suppose I have a data frame like this:
Hospital_name    State    Employees    ......
Fortis           Delhi    5000         ......
AIIMS            Delhi    1000000      ......
SuperSpeciality  Chennai  1000         ......

Now I want to use this data frame to build a machine learning model for predictive analysis. For that, I must convert the strings to float values. Also, some entries in the Hospital_name and State columns contain 'NAN' values. In such a case, how should I prepare my data for building a model in Keras?

Hima Varsha · Accepted Answer

To convert from string to float in pandas (assuming you want to convert Employees and your data frame is called df), you can use the following. Note that apply returns a new Series, so the result has to be assigned back:
df['Employees'] = df['Employees'].apply(float)

You have not said much about your input and expected output, so I will assume that the hospital name, or whichever column feeds your model, can be NaN. If such a column is a model input, you would probably want to remove those rows, because extracting features from 'nan' does not make sense. If the NaN values only occur in peripheral features, it might be alright to keep the rows. In that case, if you wish to convert them into blanks, use (with numpy imported as np):
df = df.replace(np.nan, ' ')

Otherwise, if you wish to remove those rows, you can first check which values are NaN.
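As a rough sketch of that check: the toy frame below is made up to resemble the one in the question, and it assumes the missing entries might be stored as the literal string "NAN" rather than as real NaN values. If your data already holds real NaN, the replace step is simply a no-op.

```python
import numpy as np
import pandas as pd

# Toy frame modelled on the question; the values are made up for illustration.
df = pd.DataFrame({
    "Hospital_name": ["Fortis", "NAN", "SuperSpeciality"],
    "State": ["Delhi", "Delhi", np.nan],
    "Employees": ["5000", "1000000", "1000"],
})

# Assumption: some missing entries are the literal string "NAN".
# Turn them into real NaN so pandas recognises them as missing.
df = df.replace("NAN", np.nan)

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Drop only the rows where the assumed model inputs are missing.
clean = df.dropna(subset=["Hospital_name", "State"])
```

Using subset keeps rows whose NaN values sit in columns you do not care about.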

user666 · Answer

A more direct way of converting Employees to float:
df.Employees = df.Employees.astype(float)
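One caveat: astype(float) raises an error if any entry cannot be parsed as a number. If that is a possibility in your data, pd.to_numeric with errors="coerce" is an alternative that turns unparseable entries into NaN instead. A small sketch, where the "unknown" value is invented for the example:

```python
import pandas as pd

# "unknown" is a made-up bad value to show the coercion.
employees = pd.Series(["5000", "1000000", "unknown"])

# Unparseable strings become NaN instead of raising an error.
as_float = pd.to_numeric(employees, errors="coerce")
```

The NaN values produced this way can then be handled like any other missing data.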

You didn't specify what you wanted to do with NaNs, but you can replace them with a different value (int or string) using:
df = df.fillna(value_to_fill)

If you want to drop the rows that contain NaN, use:
df = df.dropna()
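To make the difference concrete, here is a small sketch on invented data. It also shows that fillna accepts a dict, so each column can get its own fill value. The choice of "Unknown" and of the median is only an illustration, not a recommendation for your model.

```python
import numpy as np
import pandas as pd

# Made-up data with one incomplete row.
df = pd.DataFrame({
    "State": ["Delhi", np.nan, "Chennai"],
    "Employees": [5000.0, np.nan, 1000.0],
})

# Fill each column with its own value: a placeholder string for State,
# the column median for Employees.
filled = df.fillna({"State": "Unknown", "Employees": df["Employees"].median()})

# Alternatively, drop every row that contains at least one NaN.
dropped = df.dropna()
```

Both calls return a new data frame and leave df unchanged, which is why the answer above assigns the result back.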

Emre · Answer

The best way to deal with types is to specify them when ingesting the file:
pandas.read_csv(file_name, dtype={"Employees": float})
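A self-contained sketch of this, reading from an in-memory string instead of a real file. The CSV content is made up, and passing na_values=["NAN"] is an assumption about how the missing entries are spelled in your file.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the real file.
csv_text = """Hospital_name,State,Employees
Fortis,Delhi,5000
AIIMS,NAN,1000000
SuperSpeciality,Chennai,1000
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Employees": float},  # parse Employees as float on load
    na_values=["NAN"],           # treat the string "NAN" as missing
)
```

With this, Employees is already numeric and the "NAN" entries are real NaN values by the time the data frame exists.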

What you do with the missing data in Keras is up to you. It depends on your plan, so please elaborate on what you intend to do.

smw · Answer

I don't understand why you would map the strings to floats. I would suggest using one hot encoding to categorize the strings with a Boolean 1 or 0.
In pandas this would be:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

You can also use pd.get_dummies(s, dummy_na=True) to deal with the NaN values; this adds an extra indicator column for missing entries.
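Applied to a data frame like the one in the question, this could look roughly as follows. The data is invented, and converting to a float32 array at the end is just one assumed way of getting something a Keras model can consume.

```python
import numpy as np
import pandas as pd

# Made-up data: one categorical column with a missing value, one numeric column.
df = pd.DataFrame({
    "State": ["Delhi", "Delhi", np.nan],
    "Employees": [5000.0, 1000000.0, 1000.0],
})

# One-hot encode State only; dummy_na=True adds an indicator column for NaN.
encoded = pd.get_dummies(df, columns=["State"], dummy_na=True, dtype=float)

# A plain numeric array, which is the usual input format for a model.
X = encoded.to_numpy(dtype="float32")
```

A column with many distinct values, such as Hospital_name, would produce one dummy column per value, so it is worth checking whether it is useful as a feature at all.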
