Ever since I joined Hostgator.com, I’ve been learning a lot of Linux in the hope of moving my career toward Linux, hopefully something forensics-related.
My latest dilemma was how to download a website for offline browsing. I went hunting for an offline browser for Linux and found that wget can mirror a whole website.
For example, say I want to make a copy of blackberrysimunlockcode.com. Here’s how:
wget -m http://blackberrysimunlockcode.com
Here the -m option tells wget to mirror the website. That’s the basic command. But what if I need some more advanced options?
I was trying to grab all the script files off a website to study later, but all wget downloaded was index.html and robots.txt.
It turned out the robots.txt file was blocking the wget user agent. To confirm this, I used wget’s debug option:
wget -m -d http://blackberrysimunlockcode.com
You’ll get something like:
Not following http://blackberrysimunlockcode.com/privacy.shtml because robots.txt forbids it.
Rejecting path sh/eg/talk.sh.txt because of rule `sh'
no-follow in index.html
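On a bigger site the debug output scrolls by fast, so it can help to send it to a log file and pull out just the robots-related lines. Here’s a rough sketch of that. The function name robots_lines and the log file name wget-debug.log are placeholders I made up, and the patterns only cover the three kinds of messages quoted above, so your wget version may word things differently.

```shell
# Print only the lines of a wget debug log that mention robots/nofollow blocking.
# robots_lines is a made-up helper name; $1 is the path to the log file.
# The patterns match the three message types quoted above and nothing else.
robots_lines() {
    grep -E 'robots\.txt forbids|because of rule|no-follow in' "$1"
}

# Usage (needs network access, so left commented out):
#   wget -m -d -o wget-debug.log http://blackberrysimunlockcode.com
#   robots_lines wget-debug.log
```

The -o option writes wget’s messages to the named file instead of the terminal, which keeps the filtering step separate from the crawl itself.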
I tried using the option --user-agent "Mozilla"... no luck.
I tried adding the following to .wgetrc:
## Local settings (for a user to set in his $HOME/.wgetrc). It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to “normal” users.
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
header = Accept-Encoding: gzip,deflate
header = Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
header = Keep-Alive: 300
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
referer = http://www.google.com
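…Still no luck. That makes sense in hindsight: the server wasn’t turning wget away. wget reads robots.txt itself and obeys it, no matter what user agent or headers it sends.
The trick is to use the option -e robots=off.
So my new command became: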
wget -m -k -e robots=off -w 2 --random-wait -U "Mozilla" -np http://blackberrysimunlockcode.com
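If you’d rather not retype all those flags, the same settings can probably live in a wgetrc file instead. This is only a sketch: write_mirror_wgetrc and the file name mirror.wgetrc are names I made up, and I’m assuming your wget honors the WGETRC environment variable and these wgetrc commands (they are listed in the GNU Wget manual, but check your version).

```shell
# Write a wgetrc file holding the same settings as the command-line flags above.
# write_mirror_wgetrc is a hypothetical helper; $1 is the path to write to.
write_mirror_wgetrc() {
    cat > "$1" <<'EOF'
# same as: -m -k -e robots=off -w 2 --random-wait -U "Mozilla" -np
mirror = on
convert_links = on
robots = off
wait = 2
random_wait = on
user_agent = Mozilla
no_parent = on
EOF
}

# Usage (needs network access, so left commented out):
#   write_mirror_wgetrc ./mirror.wgetrc
#   WGETRC=./mirror.wgetrc wget http://blackberrysimunlockcode.com
```

Keeping this in a separate file rather than in ~/.wgetrc means robots=off only applies when you ask for it, which fits the warning in the sample comments about knowing what you are doing.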
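Here’s what the options do:
-m mirrors the website (recursive download with timestamping and unlimited depth)
-k converts the links in the downloaded pages to point at your local copies, so offline browsing doesn’t send you to www.blackberrysimunlockcode.com/sh/eg instead of the local /sh/eg (relative vs absolute)
-e executes a .wgetrc-style command, here robots=off, which tells wget to ignore robots.txt and the nofollow hint
-w 2 sets the wait between requests to 2 seconds so you don’t overload the server and get your IP blocked
--random-wait varies that wait randomly between 0.5 and 1.5 times the -w value (1 to 3 seconds here), so the requests look less mechanical
-U sets the user agent
-np no parent: wget never climbs above the directory you started in. Without it, if the current subdirectory/page links to parent pages, it might crawl the whole website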
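The short flags are hard to remember a month later, so here is the same command spelled out with wget’s long option names. It’s a minimal sketch: mirror_cmd is a helper name I invented, and it only prints the command rather than running it, so you can look it over first.

```shell
# Print (not run) the mirror command using long option names for readability.
# mirror_cmd is a hypothetical helper: $1 = URL, $2 = wait in seconds (default 2).
mirror_cmd() {
    printf 'wget --mirror --convert-links --execute robots=off --wait=%s --random-wait --user-agent=Mozilla --no-parent %s\n' "${2:-2}" "$1"
}

mirror_cmd http://blackberrysimunlockcode.com
```

Once the printed line looks right, copy and paste it into the terminal to start the actual crawl.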