TransWikia.com

Using pandas, check a column for matching text and update new column if TRUE

Data Science Asked by RustyNails on July 31, 2021

My objective: using pandas, check a column for matching text (a partial, not exact, match) and set a new column accordingly.

I created a data frame from a CSV file and check the values of one column, COLUMN_to_Check, for a matching text pattern such as ‘PEA’. Depending on whether the pattern matches, a new column in the data frame should be set to YES or NO.

I have the following data in file DATA2.csv

ASSIGNMENT,Open date,Resolved date,COLUMN_to_Check,NUMBER,Open Time,RESOLVED_GROUP,RESOLVED_TIME,SUBCATEGORY
GBL_IS_GRC_PROCESSCONTROL,3/1/2017 13:39,11/1/2017 13:09,APAC_LT-ERP-FICO-BOKADABISH_PRD,IM-17-001200,3/1/2017 13:39,GBL_GSO_MQG,11/1/2017 13:09,Security (breach or weakness)
RSP_SERVICEDESK,12/1/2017 0:08,12/1/2017 0:27,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006462,12/1/2017 0:08,RSP_SERVICEDESK,12/1/2017 0:27,failure
RSP_SERVICEDESK,10/1/2017 5:27,12/1/2017 0:52,APAC_LT-ERP-SUPPLY-PEA_PRD,IM-17-004667,10/1/2017 5:27,RSP_PCS_INCIDENTS,12/1/2017 0:52,failure
RSP_SERVICEDESK,12/1/2017 2:35,12/1/2017 3:03,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006483,12/1/2017 2:35,RSP_SERVICEDESK,12/1/2017 3:03,access
RSP_SAP_BI,10/1/2017 21:04,12/1/2017 6:01,APAC_LT-ERP-SALES-PEA_PRD,IM-17-005498,10/1/2017 21:04,RSP_SAP_SALES,12/1/2017 6:01,SAP Sales

I am using this code:

import pandas as pd

df=pd.read_csv('DATA2.csv')

Search_for_These_values = ['PEA', 'DEF', 'XYZ'] #creating list

pattern = '|'.join(Search_for_These_values)     # joining list for comparison

IScritical=df['COLUMN_to_Check'].str.contains(pattern)
for CHECK in IScritical:
    if not CHECK:
        print CHECK
        df['NEWcolumn']='NO'
    else:
        print CHECK
        df['NEWcolumn']='YES'

df.to_csv('OUPUT.csv')

Printing the value of ‘CHECK’ returns the correct values, i.e., the first row returns False.

C:\Users\ME\Documents\SandBox (master)
λ python numpytest_pub.py
False
True
True
True
True

But the output CSV file shows every value of ‘NEWcolumn’ as ‘YES’. The value in row 0 should be ‘NO’, because ‘COLUMN_to_Check’ in that row does not match the pattern.

,ASSIGNMENT,Open date,Resolved date,COLUMN_to_Check,NUMBER,Open Time,RESOLVED_GROUP,RESOLVED_TIME,SUBCATEGORY,NEWcolumn
0,GBL_IS_GRC_PROCESSCONTROL,3/1/2017 13:39,11/1/2017 13:09,APAC_LT-ERP-FICO-BOKADABISH_PRD,IM-17-001200,3/1/2017 13:39,GBL_GSO_MQG,11/1/2017 13:09,Security (breach or weakness),YES
1,RSP_SERVICEDESK,12/1/2017 0:08,12/1/2017 0:27,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006462,12/1/2017 0:08,RSP_SERVICEDESK,12/1/2017 0:27,failure,YES
2,RSP_SERVICEDESK,10/1/2017 5:27,12/1/2017 0:52,APAC_LT-ERP-SUPPLY-PEA_PRD,IM-17-004667,10/1/2017 5:27,RSP_PCS_INCIDENTS,12/1/2017 0:52,failure,YES
3,RSP_SERVICEDESK,12/1/2017 2:35,12/1/2017 3:03,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006483,12/1/2017 2:35,RSP_SERVICEDESK,12/1/2017 3:03,access,YES
4,RSP_SAP_BI,10/1/2017 21:04,12/1/2017 6:01,APAC_LT-ERP-SALES-PEA_PRD,IM-17-005498,10/1/2017 21:04,RSP_SAP_SALES,12/1/2017 6:01,SAP Sales,YES

I sense that something is missing in the CHECK part, but I cannot figure out what. Can anyone help?

Let me know if the question needs rephrasing for better understanding or future community use.

4 Answers

df['NEWcolumn']='NO' sets the whole column to the value 'NO', not just the current row. Each pass through the loop overwrites the entire column, so what you see is the result for the last row of your table, spread over the whole column.
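
To make the overwriting visible, here is a minimal sketch that reproduces the behaviour on a tiny frame. The three sample values and the lower-case variable names are made up for illustration; they only stand in for the data in DATA2.csv.

```python
import pandas as pd

# Hypothetical three-row frame standing in for DATA2.csv.
df = pd.DataFrame(
    {"COLUMN_to_Check": ["FICO-BOKADABISH", "SALES-PEA", "SUPPLY-PEA"]}
)
is_critical = df["COLUMN_to_Check"].str.contains("PEA")

for check in is_critical:
    # Assigning a scalar overwrites every row, not just the current one.
    df["NEWcolumn"] = "YES" if check else "NO"

# Only the last iteration survives, so every row carries its value.
overwritten = df["NEWcolumn"].tolist()
print(overwritten)
```

The first row does not match, yet it ends up with the same value as the last row, which is what the output file in the question shows.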

Here is a way to achieve what you want:

# Assigning through .loc avoids chained indexing, which may not update df
df.loc[IScritical, 'NEWcolumn'] = 'YES'
df.loc[~IScritical, 'NEWcolumn'] = 'NO'

See https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking
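
As an alternative to masking, numpy.where can build the whole column in one step from the boolean Series. This is a different technique from the one described in this answer, shown only as a sketch; the two-row frame is a made-up stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the COLUMN_to_Check data.
df = pd.DataFrame(
    {
        "COLUMN_to_Check": [
            "APAC_LT-ERP-FICO-BOKADABISH_PRD",
            "APAC_LT-ERP-SALES-PEA_PRD",
        ]
    }
)
pattern = "PEA|DEF|XYZ"

is_critical = df["COLUMN_to_Check"].str.contains(pattern)

# np.where picks "YES" where the mask is True and "NO" elsewhere.
df["NEWcolumn"] = np.where(is_critical, "YES", "NO")
print(df)
```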

Answered by Matthias Berth on July 31, 2021

You can use the IScritical Series you created directly:

import pandas as pd

df=pd.read_csv('DATA2.csv')

Search_for_These_values = ['PEA', 'DEF', 'XYZ'] #creating list

pattern = '|'.join(Search_for_These_values)     # joining list for comparison

IScritical=df['COLUMN_to_Check'].str.contains(pattern)

df['NEWcolumn'] = IScritical.replace((True,False), ('YES','NO'))
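
One thing to watch with the joined pattern: str.contains treats it as a regular expression by default, and missing values in the column produce missing results rather than False. The sketch below assumes you want literal matching and want missing values counted as no match; the search terms and sample values are hypothetical.

```python
import re
import pandas as pd

# Hypothetical terms; "A.B" contains a regex metacharacter.
search_terms = ["PEA", "A.B"]

# Escape each term so it is matched literally, then join with "|".
pattern = "|".join(re.escape(term) for term in search_terms)

values = pd.Series(["SALES-PEA_PRD", "AXB_PRD", "A.B_PRD", None])

# na=False makes missing values count as "no match".
is_critical = values.str.contains(pattern, na=False)
print(is_critical.tolist())
```

Without re.escape, the unescaped dot in "A.B" would also match "AXB_PRD".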

Answered by michaelg on July 31, 2021

You simply need to do:

df['NEWcolumn'] = df['COLUMN_to_Check'].str.contains(pattern)
df['NEWcolumn'] = df['NEWcolumn'].map({True: 'YES', False: 'NO'})

Answered by Suresh Kasipandy on July 31, 2021

You could first add the column and default the value to 'NO' and then update the dataframe with .loc:

df['NEWcolumn']='NO'
df.loc[df['COLUMN_to_Check'].str.contains(pattern), 'NEWcolumn'] = 'YES'
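
For a self-contained run of this approach, here is a sketch that reads a trimmed copy of the sample data from a string instead of DATA2.csv and writes the result to an in-memory buffer. Keeping only two of the original columns and passing index=False (to drop the unnamed index column from the output) are my assumptions, not part of the question.

```python
import io
import pandas as pd

# Trimmed stand-in for DATA2.csv: only two of the original columns are kept.
csv_text = """NUMBER,COLUMN_to_Check
IM-17-001200,APAC_LT-ERP-FICO-BOKADABISH_PRD
IM-17-006462,APAC_LT-ERP-SALES-PEA_PRD
IM-17-004667,APAC_LT-ERP-SUPPLY-PEA_PRD
"""
df = pd.read_csv(io.StringIO(csv_text))

pattern = "|".join(["PEA", "DEF", "XYZ"])

# Default every row to NO, then flip only the matching rows to YES.
df["NEWcolumn"] = "NO"
df.loc[df["COLUMN_to_Check"].str.contains(pattern), "NEWcolumn"] = "YES"

# Write to a buffer here; pass a file name instead to write a CSV file.
out = io.StringIO()
df.to_csv(out, index=False)
result = out.getvalue()
print(result)
```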

Answered by Bas on July 31, 2021
