
Offline Browsing in Linux: wget and some tricks

Ever since I joined Hostgator.com, I’ve been learning a lot of Linux in the hopes of switching my career over to it, hopefully something forensics related.

So my new dilemma was downloading a website for offline browsing. I went on the hunt for an offline browsing tool for Linux… and found that I could just use wget to mirror a whole website.

For example, say I want to make a copy of blackberrysimunlockcode.com. Here’s how:

wget -m http://blackberrysimunlockcode.com

Here the -m option tells wget to mirror the website. This is the basic command. But say I need some advanced options. What do I do?
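For reference, -m is just shorthand for a bundle of options, so per the wget manual the command above is equivalent to:

# -r recursive, -N timestamping, -l inf infinite depth, --no-remove-listing keeps FTP listings
wget -r -N -l inf --no-remove-listing http://blackberrysimunlockcode.com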

I was trying to grab all the script files off a website to save for later study, but all wget was downloading was the index.html and robots.txt.
The robots.txt file was blocking the wget user agent. To confirm this, I used wget’s debug option:

wget -m -d http://blackberrysimunlockcode.com

You’ll get something like:

Not following http://blackberrysimunlockcode.com/privacy.shtml because robots.txt forbids it.

or

Rejecting path sh/eg/talk.sh.txt because of rule `sh'

or

no-follow in index.html
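The debug output is pretty noisy, so to pick out just the robots-related lines you can pipe it through grep (wget logs to stderr, hence the 2>&1):

wget -m -d http://blackberrysimunlockcode.com 2>&1 | grep -i robots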

I tried using the option --user-agent "Mozilla"… no luck.
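That attempt looked roughly like this:

wget -m --user-agent="Mozilla" http://blackberrysimunlockcode.com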

I tried adding the following to my .wgetrc:

## Local settings (for a user to set in his $HOME/.wgetrc). It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##

header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
header = Accept-Encoding: gzip,deflate
header = Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
header = Keep-Alive: 300
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
referer = http://www.google.com

…Still no luck.

The trick is to use the option -e robots=off
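Since -e just executes the string as if it were a line of .wgetrc, you can also make the setting permanent by dropping it into ~/.wgetrc:

robots = off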

So my new command became:

wget -m -k -e robots=off -w 2 --random-wait -U "Mozilla" -np http://blackberrysimunlockcode.com

Here’s what the options do:

-m mirrors the website
-k converts links for offline viewing, so you don’t get sent to www.blackberrysimunlockcode.com/sh/eg instead of your local copy at /sh/eg (absolute vs relative)
-e executes a .wgetrc-style command, in this case robots=off, which tells wget to ignore robots.txt
-w 2 sets the wait time between requests to 2 seconds so you don’t overload the server and get your IP blocked
--random-wait varies that wait randomly (between 0.5 and 1.5 times the -w value) so the requests look less mechanical
-U sets the user agent string
-np (no parent) stops wget from ascending into parent directories; without it, a link to a parent page could end up crawling the whole website
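And going back to my original goal of grabbing just the script files: wget’s -A option takes a comma-separated accept list, so something like this (the extension list here is just a guess at what counts as a “script file” for you) keeps only the matching files:

# HTML pages are still fetched so wget can follow links, but non-matching files are deleted afterward
wget -m -k -e robots=off -w 2 --random-wait -U "Mozilla" -np -A "*.sh,*.js,*.txt" http://blackberrysimunlockcode.com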