TransWikia.com

Python: Unwanted Truncation of strings in list after pd.DataFrame()

Stack Overflow Asked by santma on November 17, 2021

The Data

I have data taken from a web scraper that I am trying to clean. For each webpage scraped, I exported a CSV consisting of one row and 10-14 columns.

Input:

"Featured Snippet" | title | misc. content | misc. content | misc. content | website | page title | "Feedback" | "About snippets"

The misc. content cells vary from csv to csv. Sometimes there are two, three, or four. What I am trying to do is combine these middle columns into a single string.

Output:

filename | website | page title | title | content

So, my code imports each CSV in a for loop as a pandas DataFrame. It extracts the second column for the title, then flips the DataFrame to extract the 3rd-to-last column for the website, the 4th-to-last for the page title, and everything from the 5th column of the flipped DataFrame onward for the content. The content therefore includes extra data (the title and "Featured Snippet"), but that's OK because I can clean it in Excel later. It also gets the filename as a value. It puts all these values for each CSV into lists, which I combine into a DataFrame at the end.
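The extraction steps described above can be sketched without flipping the frame, using negative positions and pulling each cell out as a plain Python value rather than a formatted string. This is only a minimal sketch: the function name `extract_fields` and the sample row are made up for illustration, and the column positions simply follow the code in the question (3rd-to-last for the website, 4th-to-last for the page title).

```python
import pandas as pd


def extract_fields(df):
    """Pull title, website, page title and joined content out of a one-row frame."""
    row = df.iloc[0]            # first data row as a Series
    title = row.iloc[1]         # second column
    website = row.iloc[-3]      # 3rd-to-last column
    pagetitle = row.iloc[-4]    # 4th-to-last column
    # Everything before the last four columns, minus empty cells,
    # joined last-to-first to mirror the flipped order used in the question.
    parts = row.iloc[:-4].dropna().astype(str)
    content = " // ".join(parts.iloc[::-1])
    return {"title": title, "website": website,
            "pagetitle": pagetitle, "content": content}


# Hypothetical one-row frame standing in for one scraped CSV.
sample = pd.DataFrame([[
    "Featured Snippet", "Some title", "m" * 120, None, "misc b",
    "placeholder A", "placeholder B", "Feedback", "About snippets",
]])
fields = extract_fields(sample)
print(len(fields["content"]))
```

Because `row.iloc[...]` returns the stored value itself, nothing here passes through the display formatter, so the joined string keeps its full length.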

Code

import glob
import os

import pandas as pd

files = sorted(glob.glob('*.csv'))

filenames = []
websites = []
pagetitles = []
titles = []
contents = []

for f in files: 

    df = pd.read_csv(f,index_col=False)
    df = df[0:1]
    
    title = df.iloc[:,1]
    title = title.to_string(index = False)
    titles.append(title)
    
    df_flipped = df.iloc[:, ::-1]

    website = df_flipped.iloc[:,2]
    website = website.to_string(index = False)
    websites.append(website)
    
    pagetitle =  df_flipped.iloc[:,3]  
    pagetitle = pagetitle.to_string(index = False)
    pagetitles.append(pagetitle)
    
    content = df_flipped.iloc[:,4:]
    content = content.dropna(axis = 1)
    
    content = content.apply(lambda row: ' // '.join(row.values.astype(str)), axis=1)        
    contents.append(content)
    
    filename = os.path.splitext(str(f))[0]
    filenames.append(filename)



snippet_data = pd.DataFrame(list(zip(filenames, websites, pagetitles, titles, contents)))
snippet_data.to_csv('datasets/black-friday-snippets.csv')      

My Problem

I’ve actually done everything I wanted to do, but my content keeps getting truncated. I’ve tried a billion variations of the .join function, tried converting the content into a bunch of different datatypes, and I’ve already tried about 3904312590781038941 different ways of this:

pd.set_option('display.max_columns', 50000000)
pd.set_option('display.width', 1500000000)

Also, I’ve written a lot of similar code and never had this problem.

Clues

  1. I am using Spyder, and when I open up the content variable, I have to double click on the row to see the full content.

  2. content is a Series, and contents is a list of Series. Likewise, when I open the contents variable, I have to double click on the cell to see the full text.

  3. Just to @#$# with my head even more, it shows the truncated version when I try print(content).

  4. It truncates after pd.DataFrame(), but since it also truncates with the print() function, I have no idea exactly why or how to avoid this.

  5. Yes, I tried pd.set_option(blah blah blah). Maybe I’m not using it right.
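One reading of these clues, offered as a sketch rather than a confirmed diagnosis: the stored strings are intact, and only their printed form is shortened. In pandas, display.max_columns and display.width control how many columns are shown and how wide the printout is, while the length of each cell's text is limited by a separate option, display.max_colwidth (50 characters by default). The Series below is made up to stand in for one joined content cell.

```python
import pandas as pd

# Made-up stand-in for one joined content cell (120 characters long).
long_text = "x" * 120
content = pd.Series([long_text])

# The options tried in the question do not lift the per-cell limit.
# max_colwidth is pinned to 50 (the pandas default) to keep the demo self-contained.
with pd.option_context("display.max_colwidth", 50,
                       "display.max_columns", 500,
                       "display.width", 1500):
    shown_default = str(content)

# display.max_colwidth is the option that limits each cell's text;
# None means "no limit" (this assumes pandas 1.0 or later).
with pd.option_context("display.max_colwidth", None):
    shown_full = str(content)

# The stored value itself was never shortened.
stored = content.iloc[0]

print(long_text in shown_default)
print(long_text in shown_full)
print(len(stored))
```

As far as I know, Series.to_string() goes through the same formatter as print(), which would explain why the values built with to_string() in the loop come out shortened as well.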

One Answer

OK, so I figured this one out by putting pd.options.display.max_colwidth = 500 in the for loop right after pd.read_csv().

So it goes:

import glob
import os

import pandas as pd

files = sorted(glob.glob('*.csv'))

filenames = []
websites = []
pagetitles = []
titles = []
contents = []

for f in files: 

    df = pd.read_csv(f,index_col=False)
    df = df[0:1]

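    # Raise the per-cell display limit (default: 50 characters) so that
    # to_string() and print() keep up to 500 characters of each value.
    pd.options.display.max_colwidth = 500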
    
    title = df.iloc[:,1]
    title = title.to_string(index = False)
    titles.append(title)
    
    df_flipped = df.iloc[:, ::-1]

    website = df_flipped.iloc[:,2]
    website = website.to_string(index = False)
    websites.append(website)
    
    pagetitle =  df_flipped.iloc[:,3]  
    pagetitle = pagetitle.to_string(index = False)
    pagetitles.append(pagetitle)
    
    content = df_flipped.iloc[:,4:]
    content = content.dropna(axis = 1)
    
    content = content.apply(lambda row: ' // '.join(row.values.astype(str)), axis=1)        
    contents.append(content)
    
    filename = os.path.splitext(str(f))[0]
    filenames.append(filename)



snippet_data = pd.DataFrame(list(zip(filenames, websites, pagetitles, titles, contents)))
snippet_data.to_csv('datasets/black-friday-snippets.csv')    

Answered by santma on November 17, 2021
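A possible refinement of the fix above, sketched under the assumption of pandas 1.0 or later: display options are global, so the setting does not need to be repeated on every loop iteration, and pd.option_context can scope it to just the code that formats strings. The sketch also checks that to_csv() writes the stored value rather than its display form. The sample text is made up.

```python
import io

import pandas as pd

long_text = "y" * 300  # made-up long cell value
before = pd.get_option("display.max_colwidth")

# to_string() formats for display, so run it with the per-cell limit lifted;
# the context manager restores the previous setting on exit.
with pd.option_context("display.max_colwidth", None):
    as_text = pd.Series([long_text]).to_string(index=False)

after = pd.get_option("display.max_colwidth")

# to_csv() writes stored values, so a plain string cell is written in full
# regardless of the display options.
buffer = io.StringIO()
pd.DataFrame({"content": [long_text]}).to_csv(buffer, index=False)

print(long_text in as_text)
print(before == after)
print(long_text in buffer.getvalue())
```

If the lists hold plain strings (for example via .iloc[0]) instead of to_string() output or whole Series objects, the display limit should not come into play when the final DataFrame is written out.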
