TransWikia.com

String Values in a data frame in Pandas

Data Science Asked on January 15, 2021

Suppose I have a data frame like this :

Hospital_name    State    Employees    ......
Fortis           Delhi    5000         ......
AIIMS            Delhi    1000000      ......
SuperSpeciality  Chennai  1000         ......

Now I want to use this data frame to build a machine learning model for predictive analysis. For that, I must convert the strings to float values. Also, some entries in the Hospital_name and State columns contain NaN values. In that case, how should I prepare my data for building a model in Keras?

4 Answers

To convert from string to float in pandas (assuming you want to convert Employees and the data frame is loaded as df), you can use:

df['Employees'] = df['Employees'].apply(float)

You have not given enough information about your input and expected output, so let us assume that the hospital name, or whichever column is an input to your model, is NaN. You would want to remove such rows from the dataset, because extracting features from NaN doesn't make sense. If the NaNs only occur in peripheral features, keeping the rows might be alright. In that case, if you wish to convert them to blanks, use:

import numpy as np
df = df.replace(np.nan, ' ')

Otherwise, if you wish to remove those rows, you can check for NaN with df.isnull() and drop them with df.dropna().
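As a rough illustration of checking for NaN and removing the affected rows, here is a minimal sketch. The sample values are made up to mirror the question's columns, and the choice to drop only on Hospital_name is an assumption about which column feeds the model.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({
    "Hospital_name": ["Fortis", np.nan, "SuperSpeciality"],
    "State": ["Delhi", "Delhi", np.nan],
    "Employees": ["5000", "1000000", "1000"],
})

# Count the missing values in each column.
nan_counts = df.isna().sum()

# Drop only the rows where Hospital_name is missing;
# rows with a missing State are kept.
cleaned = df.dropna(subset=["Hospital_name"])
```

Using subset keeps rows whose only missing values are in columns you consider peripheral.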

Correct answer by Hima Varsha on January 15, 2021

A more direct way of converting Employees to float.

df.Employees = df.Employees.astype(float)

You didn't specify what you wanted to do with NaNs, but you can replace them with a different value (int or string) using:

df = df.fillna(value_to_fill)

If you want to drop rows that contain NaN, use:

df = df.dropna()
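To see the two options side by side, here is a small sketch on made-up data. The "Unknown" placeholder is only an assumed fill value; pick whatever makes sense for your model.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({
    "Hospital_name": ["Fortis", np.nan, "SuperSpeciality"],
    "State": ["Delhi", "Delhi", np.nan],
    "Employees": ["5000", "1000000", "1000"],
})

# Convert the numeric strings to floats.
df["Employees"] = df["Employees"].astype(float)

# Option 1: fill missing values, with one placeholder per column.
filled = df.fillna({"Hospital_name": "Unknown", "State": "Unknown"})

# Option 2: drop every row that has at least one NaN.
dropped = df.dropna()
```

Passing a dict to fillna lets each column get its own replacement value instead of one value for the whole frame.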

Answered by user666 on January 15, 2021

The best way to deal with types is to specify them when ingesting the file:

pandas.read_csv(file_name, dtype={"Employees": float})

What you do with the missing data in Keras is up to you, since it depends on what you plan to do with the model. Feel free to elaborate on that.
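A self-contained sketch of the dtype approach, reading from an in-memory string instead of a real file. The CSV text is an assumption reconstructed from the sample rows in the question.

```python
import io

import pandas as pd

# Hypothetical CSV content based on the rows shown in the question.
csv_text = """Hospital_name,State,Employees
Fortis,Delhi,5000
AIIMS,Delhi,1000000
SuperSpeciality,Chennai,1000
"""

# Ask pandas to parse Employees as float while reading.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Employees": float})
```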

Answered by Emre on January 15, 2021

I don't understand why you would map the strings to floats. I would suggest using one-hot encoding, which represents each string category with a Boolean 1 or 0.

In pandas this would be:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

You can also use pd.get_dummies(s, dummy_na=True) to deal with the NaN values.
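Applied to a data frame like the one in the question, this could look roughly as follows. The sample values and the final conversion to a float32 array (a common input format for Keras) are assumptions, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in the string columns.
df = pd.DataFrame({
    "Hospital_name": ["Fortis", "AIIMS", np.nan],
    "State": ["Delhi", np.nan, "Chennai"],
    "Employees": [5000.0, 1000000.0, 1000.0],
})

# One-hot encode the string columns; dummy_na=True adds an extra
# indicator column per encoded column that marks missing values.
encoded = pd.get_dummies(
    df,
    columns=["Hospital_name", "State"],
    dummy_na=True,
    dtype=float,
)

# Plain numeric array that can be passed to a model.
features = encoded.to_numpy(dtype="float32")
```

Every row then has exactly one indicator set per encoded column, so no NaN remains in the resulting array.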

Answered by smw on January 15, 2021
