
Unable to parse the links of different cases from next pages using requests

Stack Overflow Asked by robots.txt on November 29, 2021

I've created a script to parse the links of different cases that show up after selecting an option from a dropdown on a webpage. This is the website link, and Probate is the option that should be chosen from the dropdown titled Case Type, located at the top right, before hitting the search button. All the other options should be left as they are.

The script can parse the links of different cases from the first page flawlessly. However, I can't make the script go on to the next pages to collect links from there as well.

This is how the next pages appear at the bottom of the results:
[screenshot: pagination bar]

And this is how the dropdown should look when the option is chosen:
[screenshot: Case Type dropdown with Probate selected]

This is what I've tried so far:

import requests
from bs4 import BeautifulSoup

link = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    # Collect every form field (including the hidden ASP.NET state fields) for the POST payload
    payload = {i['name']:i.get('value','') for i in soup.select('input[name],select')}
    # Override the dropdown fields for the Probate search
    for k,v in payload.items():
        if k.endswith('ComboBox_case_type'):
            payload[k] = "Probate"
        elif k.endswith('ComboBox_case_type_VI'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_case_type$DDD$L'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_town$DDD$L'):
            payload[k] = "%"

    r = s.post(link,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    for pk_id in soup.select("a.dxeHyperlink_Youthful[href*='Q_PK_ID']"):
        print(pk_id.get("href"))

How can I collect the links of different cases from the next pages using requests?

PS: I'm not after any Selenium-related solution.

2 Answers

First, examine the network requests in Dev Tools (press F12 in Chrome) and monitor the payload. There are bits of data missing from your request.

The form data is missing because it is added by JavaScript when the user clicks on a page number. Once the form data has been set, JavaScript executes the following:

xmlRequest.open("POST", action, true);
xmlRequest.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=utf-8");
xmlRequest.send(postData);

So all you need to do is emulate that in your Python script, although it looks like the paging functionality only requires two additional values: __CALLBACKID and __CALLBACKPARAM.

In the following example, I've scraped the first 4 pages (note: the first post is just the landing page):

import requests
from bs4 import BeautifulSoup
link = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    r = s.get(link)
    r.raise_for_status()
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name],select')}
    for k,v in payload.items():
        if k.endswith('ComboBox_case_type'):
            payload[k] = "Probate"
        elif k.endswith('ComboBox_case_type_VI'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_case_type$DDD$L'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_town$DDD$L'):
            payload[k] = "%"

    page_id_list = ['PN0', 'PN1', 'PN2', 'PN3'] # TODO: This is a proof of concept. You need to refactor the code. Perhaps scrape the page ids from the paging html (see the sketch after the output below).

    for page_id in page_id_list:
        # Add 2 post items. This is required for the ASP.NET GridView AJAX postback event.
        payload['__CALLBACKID'] = 'ctl00$ContentPlaceHolder1$ASPxGridView_search'
        # TODO: you might want to examine "__CALLBACKPARAM" across multiple pages. However it looks like it works by swapping the PageID (e.g. PN1, PN2)
        payload['__CALLBACKPARAM'] = 'c0:KV|151;["5798534","5798533","5798532","5798531","5798529","5798519","5798518","5798517","5798515","5798514","5798512","5798503","5798501","5798496","5798495"];CR|2;{};GB|20;12|PAGERONCLICK3|' + page_id + ';'
        
        r = s.post(link, data=payload)
        r.raise_for_status()
        soup = BeautifulSoup(r.text,"lxml")
        for pk_id in soup.select("a.dxeHyperlink_Youthful[href*='Q_PK_ID']"):
            print(pk_id.get("href"))

Output:

WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798668
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798588
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798584
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798573
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798572
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798570
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798569
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798568
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798566
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798564
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798560
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798552
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798542
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798541
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798535
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798534
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798533
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798532
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798531
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798529
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798519
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798518
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798517
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798515
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798514
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798512
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798503
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798501
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798496
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798495
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798494
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798492
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798485
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798480
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798479
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798476
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798475
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798474
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798472
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798471
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798470
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798469
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798466
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798463
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798462
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798460
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798459
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798458
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798457
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798455
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798454
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798453
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798452
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798449
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798448
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798447
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798446
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798445
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798444
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798443
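
As a follow-up to the TODO in the code above, one way to avoid hardcoding page_id_list is to pull the PN tokens out of the pager markup of the first response. This is only a sketch: the PAGERONCLICK pattern is inferred from the __CALLBACKPARAM value shown above and should be verified against the actual HTML.

import re

def extract_page_ids(soup):
    # Scan the page for pager tokens such as "PN0", "PN1", ... that the
    # grid's PAGERONCLICK callback expects (pattern assumed, verify it).
    html = str(soup)
    tokens = re.findall(r"PAGERONCLICK\d*\|(PN\d+)", html)
    if not tokens:
        # Fall back to bare PN tokens if the onclick pattern is not present.
        tokens = re.findall(r"\bPN\d+\b", html)
    # De-duplicate while keeping numeric page order.
    return sorted(set(tokens), key=lambda t: int(t[2:]))

# e.g. replace the hardcoded list with: page_id_list = extract_page_ids(soup)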

While the solution can be achieved using Requests, it can be temperamental. Selenium is usually a better approach.

Answered by Greg on November 29, 2021

This code works, but it uses Selenium instead of requests.

You need to install the Selenium Python library and download geckodriver. If you do not want geckodriver in c:/program, you have to change executable_path= to the path where your geckodriver is. You may also want to make the sleep times shorter, but the site loads so slowly (for me) that I had to set long sleep times so the page loads correctly before trying to read from it.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

link = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(link)
# Open the "Case Type" dropdown
dropdown = driver.find_element_by_css_selector('#ContentPlaceHolder1_ASPxSplitter1_ASPxComboBox_case_type_B-1')
dropdown.click()
time.sleep(0.5)
# Pick the "Probate" item from the dropdown list
cases = driver.find_elements_by_css_selector('.dxeListBoxItem_Youthful')
for case in cases:
    if case.text == 'Probate':
        time.sleep(5)
        case.click()
        time.sleep(5)
search = driver.find_element_by_css_selector('#ContentPlaceHolder1_ASPxSplitter1_ASPxButton_search')
search.click()
# Scrape each results page, then click the "next page" arrow until it no longer exists
while True:
    time.sleep(15)
    soup = BeautifulSoup(driver.page_source,"lxml")
    for pk_id in soup.select("a.dxeHyperlink_Youthful[href*='Q_PK_ID']"):
        print(pk_id.get("href"))
    next = driver.find_elements_by_css_selector('.dxWeb_pNext_Youthful')
    if len(next) > 0:
        next[0].click()
    else:
        break
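
If the fixed sleeps turn out to be flaky, an alternative is to wait for the results grid explicitly with WebDriverWait instead of sleeping. This is only a sketch: the loading-panel class name is an assumption and should be checked in the page source; the case-link selector is the one used above. Note also that in Selenium 4 the find_element_by_* helpers were removed in favour of find_element(By.CSS_SELECTOR, ...).

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_grid(driver, timeout=30):
    # Wait until the grid's loading panel (class name assumed) has gone
    # and at least one case link is present, instead of sleeping blindly.
    wait = WebDriverWait(driver, timeout)
    wait.until(EC.invisibility_of_element_located(
        (By.CSS_SELECTOR, '.dxgvLoadingPanel_Youthful')))
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "a.dxeHyperlink_Youthful[href*='Q_PK_ID']")))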

Answered by UWTD TV on November 29, 2021
