
How to scrape all topics from Twitter

Stack Overflow Asked by Suraj Subramanian on January 1, 2022

All of Twitter's topics can be found at this link.
I would like to scrape all of them, along with each subcategory inside.

BeautifulSoup doesn’t seem to be useful here, since the page is rendered with JavaScript. I tried using Selenium, but I don’t know how to match the XPaths that appear after clicking a main category.

from selenium import webdriver
from selenium.common import exceptions

url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()

# Absolute XPath copied from the browser -- brittle, breaks whenever the markup changes
main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')

topics = {}
for main_topic in main_topics[2:]:  # skip the first two spans (header text)
    print(main_topic.text.strip())
    topics[main_topic.text.strip()] = {}

I know I can click a main category with main_topics[3].click(), but I don’t know how to (maybe recursively) click through them until only the entries with a Follow button on the right are left.

2 Answers

To scrape all the main topics, e.g. Arts & culture, Business & finance, etc., using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using XPATH and text attribute:

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
  • Using XPATH and get_attribute():

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
  • Console Output:

    ['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
    

To scrape all the main and sub topics using Selenium and WebDriverWait, you can use the following locator strategy:

  • Using XPATH and get_attribute("textContent"):

    driver.get("https://twitter.com/i/flow/topics_selector")
    elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
    for element in elements:
        element.click()
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
    driver.quit()
    
  • Console Output:

    ['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
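
Since the second snippet returns main and sub topics in one flat list, a little post-processing can regroup them into the nested structure the question asks for. A minimal sketch, assuming each main topic is immediately followed by its sub topics (which matches the ordering of the textContent output above); the sample data is a shortened version of the console outputs:

```python
def group_topics(flat, main_topics):
    """Regroup a flat list of topic labels into {main: [subs]}.

    Assumes `flat` lists each main topic followed by its sub topics.
    """
    grouped, current = {}, None
    for label in flat:
        if label in main_topics:
            current = label
            grouped[current] = []
        elif current is not None:
            grouped[current].append(label)
    return grouped

# Shortened, illustrative slice of the console output above
flat = ['Arts & culture', 'Animation', 'Art', 'Business & finance',
        'Cryptocurrencies', 'Careers', 'Education']
mains = ['Arts & culture', 'Business & finance', 'Careers']
print(group_topics(flat, mains))
# {'Arts & culture': ['Animation', 'Art'],
#  'Business & finance': ['Cryptocurrencies'],
#  'Careers': ['Education']}
```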
    

Answered by DebanjanB on January 1, 2022

Take a look at how XPath works. You can just use '//element[@attribute="foo"]' and you don't have to write out the whole path. Be careful: both the main topics and the sub topics (which become visible after clicking the main topics) have the same class name, and that was causing the error. Here's how I was able to click the subtopics, though I'm sure there's a better way:
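The relative-XPath idea can be tried offline without a browser: Python's standard-library ElementTree supports a small subset of XPath, including `//`-style descendant searches with attribute predicates. The HTML below is illustrative, not Twitter's actual markup:

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div><div role="button"><span>Arts and culture</span></div></div>
  <div><div role="button"><span>Business and finance</span></div></div>
  <div><span>Done</span></div>
</body></html>
"""

root = ET.fromstring(html)
# Relative path: any <div role="button"> anywhere in the tree,
# instead of spelling out the full /html/body/... chain.
labels = [el.findtext("span") for el in root.findall(".//div[@role='button']")]
print(labels)  # ['Arts and culture', 'Business and finance']
```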

I found the topics elements using:

topics = WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@class="css-901oao r-13gxpu9 r-1qd0xha r-1b6yd1w r-1vr29t4 r-ad9z0x r-bcqeeo r-qvutc0"]'))
    )

Then I created an empty list:

main_topics = []

Then I looped through topics, appended each element.text to the main_topics list, and clicked each element to reveal its sub topics.

for topic in topics:
    main_topics.append(topic.text)
    topic.click()

Then I created a new variable called sub_topics (it now holds all the elements from the opened topics):

sub_topics = WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//span[@class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0"]'))
    )

Then I created two more lists:

subs_list = []

skip_these_words = ["Done", "Follow your favorite Topics", "You’ll see top Tweets about them in your timeline. Don’t see your favorite Topics yet? New Topics are added every week.", "Follow"]

Then I looped through sub_topics with an if statement that appends element.text to subs_list only if it is in neither the main_topics nor the skip_these_words list. I did this to filter out the main topics and the unnecessary text at the top, since all these elements share the same class name. Finally, each sub topic is clicked:

for sub in sub_topics:
    if sub.text not in main_topics and sub.text not in skip_these_words:
        subs_list.append(sub.text)
        sub.click()
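
The filtering step itself is plain Python, so it can be sanity-checked without a browser. A minimal sketch with illustrative sample strings standing in for the scraped element texts:

```python
def filter_sub_topics(texts, main_topics, skip_these_words):
    """Keep only labels that are neither main topics nor page chrome."""
    return [t for t in texts
            if t not in main_topics and t not in skip_these_words]

# Illustrative stand-ins for the scraped element texts
texts = ["Follow your favorite Topics", "Music", "Pop", "Rock",
         "Follow", "Done"]
mains = ["Music"]
skip = ["Done", "Follow your favorite Topics", "Follow"]
print(filter_sub_topics(texts, mains, skip))  # ['Pop', 'Rock']
```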

There are also a few more hidden sub-sub topics. See if you can click through the remaining sub-sub topics, then find each Follow button element and click it.

Answered by bendub89 on January 1, 2022
