TransWikia.com

How to avoid getting broken words while webcrawling

Stack Overflow Asked by data_minD on February 10, 2021

I’m trying to web crawl movie titles from this website: https://www.the-numbers.com/market/2019/top-grossing-movies

And keep getting broken word like "John Wick: Chapter 3 — ".

this is the picture:
enter image description here

This is the code:

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
                  headers={'User-Agent':'Mozilla/5.0'})  
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
for i in range(len(movie_list)):
  print(movie_list[i].text)

And these are the outputs:

Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 — Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
Pokémon: Detective Pikachu
Once Upon a Time…in Hollywo…

I want to know why I keep getting these broken words and how to fix this!

2 Answers

You need to handle the strings like this, the solution code is:

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
                  headers={'User-Agent':'Mozilla/5.0'})  
html = BeautifulSoup(raw.text, "lxml")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"


import unicodedata
for i in range(len(movie_list)):
    movie_name = movie_list[i].text
    print(unicodedata.normalize('NFKD', movie_name).encode('ascii', 'ignore').decode())

The output is like this:

Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
.
.

Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...

Answered by Anup Tiwari on February 10, 2021

Due to this page is server-render, you could request those page separately when the title getting broken.(Also don't forget to get the title by regex, because the title of its page contain the publication date.)

Try code below:

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
    raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
    raw.encoding = 'utf-8'
    html = BeautifulSoup(raw.text, "html.parser")
    print(html.select_one("#main > div > h1").text)

That's gave me:

Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....

Answered by jizhihaoSAMA on February 10, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP