TransWikia.com

Using BeautifulSoup to get the text after a strong tag, when that text is not within a ... itself

Stack Overflow Asked by papelr on December 20, 2021

I’m scraping this website: https://eintaxid.com/companies/a/?page=1

I’ve successfully extracted the company name:

r = requests.get('https://eintaxid.com/companies/a/?page=1')
soup = BeautifulSoup(r.content, "html.parser")

# outputs all strong tags
for tag in soup.find_all('strong'):
    print(tag.text)

I can easily isolate the company names doing this:

r = requests.get('https://eintaxid.com/companies/a/?page=1')
soup = BeautifulSoup(r.content, "html.parser")
table = soup.find_all('strong')

comp_list = []

# loop to extract just the company names from the strong tags, then using the a tags
for j in table:
    td = j.find_all(['a'])
    row = [i.text for i in td]
    comp_list.append(row)

# puts company names into a pandas df
comp_list = list(filter(lambda x: len(x) > 0, comp_list))
comp_list = pd.DataFrame(comp_list, columns = ['Company']).reset_index(drop = True)
comp_list

I can’t for the life of me extract the EIN numbers, though. The <strong>EIN Number:</strong> is plain to see, and I can extract that from the first code chunk above. But how do I get the actual number? The 98-1455367 as seen in the following screenshot?

Screenshot of page html

For reference, I’m going to put the EIN number next to each company in a pandas df – but can’t really do that until I’ve extracted the EIN number itself.

One Answer

Referring to the docs you might want to use the next_sibling of your tag, catch the strong tag first, then get the next item from the context:

strong_element.next_sibling    # contains "EIN number"

"Sibling" in this context is the next node, not the next element/tag. Your element's next node is a text node, so you get the text you want.

Answered by Asiri H. on December 20, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP