Download entire website

Question:
How do I download an entire external website?

An example is shown below.

This first example uses a username and password.

wget -r --http-user=jeremy@la-maison.co.uk --http-password=lilli 'http://www.tradefairinternational.com/cgi-bin/bb000001.pl?ACTINIC_REFERR...'

The next example deals with sites that block download managers.

Wget gives you the ability to download entire sites. However, many sites do not want you to download their entire site. In an attempt to prevent this, they check what web browser is being used. In the event they detect you are not using a web browser they refuse your connection or send a blank page. You might get a message like:

Sorry, but the download manager you are using to view this site is not supported. We do not support use of such download managers as flashget, go!zilla, or getright

To get around this, wget has a very useful -U (--user-agent) option. Passing -U followed by a browser name (here, Mozilla) makes wget tell the site that you are using a commonly accepted browser:

wget -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html

In addition, two further important command line options are --limit-rate= and --wait=. Add --wait=20 to pause 20 seconds between retrievals; this reduces the chance that the site owner manually adds you to a blacklist. The value of --limit-rate is in bytes per second by default; append K to set it in KB/s. Example:

wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html

A web-site owner will probably get upset if you attempt to download their entire site using a simple wget -r http://foo.bar command. However, the owner is much less likely to notice you if you limit the download transfer rate and pause between fetching files.
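The price of being polite is time, and it can be estimated with simple arithmetic. The following is a minimal sketch, not part of wget: the helper name estimate_minutes and the site size (300 files, 50 MB in total) are made up for illustration, and it assumes roughly one pause per file plus transfer at the full rate cap.

```shell
#!/bin/sh
# Rough lower bound on download time. The helper name and the figures
# passed to it are hypothetical.
# Usage: estimate_minutes FILES WAIT_SECONDS TOTAL_KB RATE_KB_PER_S
estimate_minutes() {
    files=$1
    wait_seconds=$2
    total_kb=$3
    rate_kb=$4
    pause=$(( files * wait_seconds ))    # about one pause per file
    transfer=$(( total_kb / rate_kb ))   # seconds spent at the rate cap
    echo $(( (pause + transfer) / 60 ))  # whole minutes, rounded down
}

# 300 files, --wait=20, 51200 KB in total, --limit-rate=20K
estimate_minutes 300 20 51200 20
```

With these assumed figures the script prints 142, so such a download would take well over two hours. Most of that comes from the pauses, so a lower --wait value shortens the run far more than a higher --limit-rate would.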

Use --no-parent

--no-parent is a very handy option that guarantees wget will never ascend to the parent directory during a recursive download, so it does not fetch anything from the folders above the folder you want to acquire. Use this to make sure wget does not fetch more than it needs to if you just want to download the files in a folder.

wget --wait=20 --limit-rate=20K -r -p --no-parent -U Mozilla http://www.stupidsite.com/restricedplace.html
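The effect of --no-parent can be pictured as a prefix check on each link. The sketch below is a simplified model of that rule, not wget's actual code: the function is_below and the example.com URLs are hypothetical, and it assumes the starting point is a directory URL ending in a slash.

```shell
#!/bin/sh
# Simplified model of the --no-parent rule (not wget's implementation):
# a link is followed only if its URL starts with the starting directory.
# is_below and the example.com URLs are hypothetical.
is_below() {
    start_dir=$1
    url=$2
    case "$url" in
        "$start_dir"*) return 0 ;;   # at or below the starting directory
        *) return 1 ;;               # parent or sibling directory: skipped
    esac
}

start=http://www.example.com/docs/manual/
for link in \
    http://www.example.com/docs/manual/chapter1.html \
    http://www.example.com/docs/manual/img/fig1.png \
    http://www.example.com/docs/index.html
do
    if is_below "$start" "$link"; then
        echo "fetch $link"
    else
        echo "skip  $link"
    fi
done
```

In this model the first two links are fetched and the third is skipped, because it sits in the parent folder /docs/ rather than under /docs/manual/.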

Original source http://linuxreviews.org/quicktips/wget/
