wget - how to reject string from downloading html pages

Unix & Linux Asked by speld_rwong on November 9, 2021

I am using the following wget command and it downloads the required files I need except for one thing…

wget -U "Mozilla/5.0" --wait=3 --load-cookies cookies.txt --timestamping --recursive --level=2 --convert-links --no-parent --page-requisites --adjust-extension --max-redirect=0 --exclude-directories=blog --reject "*per_page=18.html" --reject "*per_page=36.html" (url here)

I want to download files like these:

a1546997.html

But I don’t want to download files like these:

a1546997.html?pwd=&per_page=36.html

I cannot figure out how to make wget reject the HTML pages that have the extra query string at the end.

The main problem is that wget gets stuck retrying and times out on the second type of link because they don't go anywhere, and then the wget client gets banned.

Any suggestions?

2 Answers

What I would do, pragmatic approach ahead:

wget ....
rename 's/\.html\?.*/.html/' *.html*

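This is Perl's rename command, not the util-linux one, which does not take a substitution expression. The dot and the question mark are escaped so that they match literally.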
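
If Perl's rename is not installed, the same cleanup can be sketched in plain POSIX shell. This is only an illustration under stated assumptions, not part of the answer itself: the function name strip_query_suffix is made up, and it uses shell parameter expansion instead of a regex. Like the rename approach, it runs after the download, so wget still requests the unwanted URLs. It skips a file when the clean name already exists, so a page downloaded as a1546997.html is not overwritten.

```shell
#!/bin/sh
# Hypothetical helper: for every file in the given directory (default: .)
# whose name contains ".html?", cut the name at the first "?".
strip_query_suffix() {
    dir=${1:-.}
    for f in "$dir"/*.html\?*; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # Remove the longest suffix that starts with a literal "?".
        target=${f%%\?*}
        # Do not clobber a page that already exists under its clean name.
        if [ -e "$target" ]; then
            echo "skip: $target already exists" >&2
            continue
        fi
        mv -- "$f" "$target"
    done
}
```

Calling strip_query_suffix with no argument works on the current directory. Note that the directory path itself must not contain a question mark, since the name is cut at the first one.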

Answered by Gilles Quenot on November 9, 2021

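Try wget's --reject-regex switch. Unlike --reject, whose patterns do not see the query string, it is matched against the complete URL. To skip every URL that contains a question mark, you could probably do something like:

wget --recursive --no-parent --reject-regex '[?]' url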
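
Before pointing wget at the site, a pattern can be checked offline against a few sample URLs. wget's default regex type is POSIX, so grep -E is a reasonable stand-in for a quick preview, though not a guarantee of identical behaviour. The function name would_reject and the example.com URLs below are placeholders, not anything from the question. The pattern '[?]' is a bracket expression that matches one literal question mark anywhere in the URL.

```shell
#!/bin/sh
# Hypothetical helper: prints "reject" when the URL contains a literal "?",
# otherwise "keep". Inside brackets the "?" is not a quantifier.
would_reject() {
    if printf '%s\n' "$1" | grep -Eq '[?]'; then
        echo reject
    else
        echo keep
    fi
}

would_reject 'https://example.com/a1546997.html'                          # prints: keep
would_reject 'https://example.com/a1546997.html?pwd=&per_page=36.html'    # prints: reject
```

If rejecting every query string turns out to be too broad, a narrower pattern such as 'per_page=' would target only the pagination links from the question.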

Answered by gabriel on November 9, 2021
