Combine data from multiple rows into a single row with new headers in Pandas

Stack Overflow Asked by giantg2 on December 10, 2020

I am trying to combine multiple rows that share the same VAERS_ID into a single row, with the vaccine columns numbered per source row (VAX_TYPE_1, VAX_TYPE_2, and so on). My current code is below. Is there a better way to do this? The code is extremely slow, even when I process the files concurrently using multiprocessing, and I'm not sure what I can do to speed it up. I think it takes a few hours to run over the 30 years of VAERS data.

Sample Input:

VAERS_ID,VAX_TYPE,VAX_MANU,VAX_LOT,VAX_DOSE_SERIES,VAX_ROUTE,VAX_SITE,VAX_NAME
794159,DTAPIPV,GLAXOSMITHKLINE BIOLOGICALS,G9P35,1,IM,LL,DTAP + IPV (KINRIX)
794159,MMRV,MERCK and CO. INC.,R015744,1,SC,LL,MEASLES + MUMPS + RUBELLA + VARICELLA (PROQUAD)

Sample Output:

VAERS_ID,VAX_TYPE_1,VAX_MANU_1,VAX_LOT_1,VAX_DOSE_SERIES_1,VAX_ROUTE_1,VAX_SITE_1,VAX_NAME_1,VAX_TYPE_2,VAX_MANU_2,VAX_LOT_2,VAX_DOSE_SERIES_2,VAX_ROUTE_2,VAX_SITE_2,VAX_NAME_2
794159,DTAPIPV,GLAXOSMITHKLINE BIOLOGICALS,G9P35,1,IM,LL,DTAP + IPV (KINRIX),MMRV,MERCK and CO. INC.,R015744,1,SC,LL,MEASLES + MUMPS + RUBELLA + VARICELLA (PROQUAD)

import pandas as pd

def combineVaxRecords(file):
    print('processing ' + file)
    headers = ['VAX_TYPE_1', 'VAX_MANU_1', 'VAX_LOT_1', 'VAX_DOSE_SERIES_1','VAX_ROUTE_1', 'VAX_SITE_1', 'VAX_NAME_1',
               'VAX_TYPE_2', 'VAX_MANU_2', 'VAX_LOT_2', 'VAX_DOSE_SERIES_2','VAX_ROUTE_2', 'VAX_SITE_2', 'VAX_NAME_2',
               'VAX_TYPE_3', 'VAX_MANU_3', 'VAX_LOT_3', 'VAX_DOSE_SERIES_3','VAX_ROUTE_3', 'VAX_SITE_3', 'VAX_NAME_3',
               'VAX_TYPE_4', 'VAX_MANU_4', 'VAX_LOT_4', 'VAX_DOSE_SERIES_4','VAX_ROUTE_4', 'VAX_SITE_4', 'VAX_NAME_4',
               'VAX_TYPE_5', 'VAX_MANU_5', 'VAX_LOT_5', 'VAX_DOSE_SERIES_5','VAX_ROUTE_5', 'VAX_SITE_5', 'VAX_NAME_5',
               'VAX_TYPE_6', 'VAX_MANU_6', 'VAX_LOT_6', 'VAX_DOSE_SERIES_6','VAX_ROUTE_6', 'VAX_SITE_6', 'VAX_NAME_6']

    dfOut = pd.DataFrame(columns=headers)
    # skip malformed lines (on_bad_lines needs pandas >= 1.3; older versions used error_bad_lines=False)
    df = pd.read_csv(file, engine='python', on_bad_lines='skip')
          
    # get a unique list of the IDs
    idList = list(df['VAERS_ID'])
    idList = list(dict.fromkeys(idList))

    inRows = pd.DataFrame()
    # for each record, write the row if it's the only one found for that ID. Otherwise combine the rows
    for record in idList:
        inRows = df.loc[df['VAERS_ID'] == record]
        outRow = {'VAERS_ID': record}
        count = 1
        for index, row in inRows.iterrows():
            if count > 6:
                print('error - more than 6 vaccines for this id ' + str(record))

            # map the current record to the numbered columns of the combined record
            strCount = str(count)
            outRow['VAX_TYPE_' + strCount] = row['VAX_TYPE']
            outRow['VAX_MANU_' + strCount] = row['VAX_MANU']
            outRow['VAX_LOT_' + strCount] = row['VAX_LOT']
            outRow['VAX_DOSE_SERIES_' + strCount] = row['VAX_DOSE_SERIES']
            outRow['VAX_ROUTE_' + strCount] = row['VAX_ROUTE']
            outRow['VAX_SITE_' + strCount] = row['VAX_SITE']
            outRow['VAX_NAME_' + strCount] = row['VAX_NAME']

            # advance to the next numbered column group for this ID
            count += 1

        # write the combined row to the output DataFrame (DataFrame.append was removed in pandas 2.0)
        dfOut = pd.concat([dfOut, pd.DataFrame([outRow])], ignore_index=True)
            
    #change to new dataframe   
    dfOut.set_index("VAERS_ID", inplace=True)
    dfOut.to_csv(file)
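
One possible way to speed this up, offered as a sketch rather than a tested fix for the full VAERS files: the loop above filters the whole DataFrame once per ID and grows the output one row at a time, and both get slower as the file grows. A vectorized alternative is to number the rows within each VAERS_ID with groupby().cumcount() and then pivot that number into the columns with unstack(). The names combine_vax_records, VAX_COLS and the helper column n are illustrative and not part of the original code. The sketch assumes the input has exactly the columns shown in the sample input, and it does not cap the result at 6 vaccines per ID.

```python
import pandas as pd

# Columns that repeat once per vaccine (taken from the sample input header).
VAX_COLS = ['VAX_TYPE', 'VAX_MANU', 'VAX_LOT', 'VAX_DOSE_SERIES',
            'VAX_ROUTE', 'VAX_SITE', 'VAX_NAME']


def combine_vax_records(df):
    """Return one row per VAERS_ID with columns VAX_TYPE_1 ... VAX_NAME_n."""
    df = df.copy()

    # Number the rows inside each VAERS_ID group: 1, 2, 3, ...
    df['n'] = df.groupby('VAERS_ID').cumcount() + 1

    # Pivot the per-ID row number into the columns.
    # The result has MultiIndex columns of the form (field, n).
    wide = df.set_index(['VAERS_ID', 'n'])[VAX_COLS].unstack('n')

    # Reorder so all fields of vaccine 1 come first, then vaccine 2, and so on.
    max_n = int(df['n'].max())
    ordered = pd.MultiIndex.from_tuples(
        [(col, n) for n in range(1, max_n + 1) for col in VAX_COLS]
    )
    wide = wide.reindex(columns=ordered)

    # Flatten (field, n) into names such as VAX_TYPE_1.
    wide.columns = [f'{col}_{n}' for col, n in wide.columns]
    return wide
```

With this helper, the per-file work would reduce to reading the CSV, calling combine_vax_records(df), and writing the result with to_csv; VAERS_ID is already the index of the returned DataFrame. IDs with fewer vaccines than the maximum get empty (NaN) cells in the unused column groups, and the output rows are sorted by VAERS_ID rather than kept in file order.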
