2017-03-11 10:40:18 Uchenna
How to set up a web crawler with bypassing abilities using Tor on Python.
Before we get to the nerdy parts, we might want to just enjoy the Tor web browser for anonymous web browsing on Ubuntu (not crawling). This comes in handy when we want to bypass some ehmmm.... let's call them "walls" :). I assume you did not install Tor before stumbling across this post. To install Tor Browser, enter the following commands one by one:
sudo add-apt-repository ppa:webupd8team/tor-browser
sudo apt-get update
sudo apt-get install tor-browser
However, if you have tried a bunch of stuff and ended up with some Tor-related "residues" on your system, or just want to make a clean installation, don't worry.. I gat you! Run
sudo apt-get remove --purge tor-browser
before running the other commands above.
Now to our main BUSINESS
Some of us “no funners” want to just get the crawling going up in here! No time to enjoy the graphical interface. Straight up!
We install Tor by running the following commands
sudo apt-get update
sudo apt-get install tor
After the installation, you will get some notices (warnings) one of which is :
“This is experimental software. Do not rely on it for strong anonymity”
so be smart! Anyway, you will also notice that the SOCKS listener is open on port 9050. We will need to enable the ControlPort listener on port 9051. But first let's just restart Tor to make sure we are good
sudo /etc/init.d/tor restart
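If you would rather confirm the listener from Python than read the service output, here is a minimal sketch that uses only the standard library. The helper name port_is_open is my own invention, and 127.0.0.1:9050 assumes the default SOCKS address mentioned above.

```python
import socket
from contextlib import closing


def port_is_open(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # closing() keeps this working on Python 2.7, where sockets are
        # not context managers.
        with closing(socket.create_connection((host, port), timeout=timeout)):
            return True
    except (socket.error, socket.timeout):
        # Connection refused, unreachable host, or no answer in time.
        return False
```

Calling port_is_open() should give True once Tor is running. The same helper can be pointed at 9051 later with port_is_open(port=9051). Note that it only proves that something is listening, not that the something is Tor.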
Ok, we are good! Now, to enable the ControlPort listener on port 9051, we need to edit the torrc file. But before that, since we are all serious about privacy and security here, we don’t want any random access to our ControlPort listener. For those that don’t know the function of the ControlPort listener:
The ControlPort listener is like “headphones”: Tor listens on it for any communication from applications talking to the Tor controller. OK, it’s more like Tor’s mobile phone.
With that said, we need to create a password to protect our ControlPort listener. You can use any password, but torrc expects its hash rather than the plain text, so hash it with
tor --hash-password SUPERPASSWORD
Copy the hashed password which should look something like
16:42C310FDE560C60B60A03F324522868EBFE70065D143B3ADE654BEAF31
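A hash that loses a character while being copied will not work, so a quick shape check before pasting can save a debugging session. The sketch below assumes the usual output format of tor --hash-password: the prefix 16: followed by 58 hex digits (an 8-byte salt, 1 indicator byte and a 20-byte SHA-1 digest). The function name looks_like_tor_hash is hypothetical, and the check says nothing about whether the hash matches your password.

```python
import re

# "16:" followed by 29 bytes written as 58 hex digits.
_TOR_HASH_RE = re.compile(r"^16:[0-9A-Fa-f]{58}$")


def looks_like_tor_hash(value):
    """Return True if value has the shape of a Tor hashed control password."""
    return bool(_TOR_HASH_RE.match(value.strip()))
```

For example, looks_like_tor_hash("SUPERPASSWORD") is False, because that is the plain password and not its hash.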
Hey! Better not use my password! I will use vi, my favorite editor. You can use any editor you want
sudo vi /etc/tor/torrc
Then we insert into the file
ControlPort 9051
HashedControlPassword 16:42C310FDE560C60B60A03F324522868EBFE70065D143B3ADE654BEAF31
save it
:wq!
Restart again
sudo /etc/init.d/tor restart
Now you should notice the control listener open on 9051
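To confirm that the password protection actually works, and not only that the port is open, one option is to speak a tiny bit of the Tor control protocol over a plain socket. This is a minimal sketch under my own assumptions: the helper name control_auth_ok is made up, the port is the 9051 configured above, and the argument is the plain password you hashed (not the hash itself). It sends the AUTHENTICATE command and treats a reply starting with 250 as success.

```python
import socket
from contextlib import closing


def control_auth_ok(password, host="127.0.0.1", port=9051, timeout=5.0):
    """Return True if the control port accepts the plain-text password.

    Raises socket.error if nothing is listening on host:port.
    """
    # The password travels as a quoted string, so escape \ and ".
    quoted = password.replace("\\", "\\\\").replace('"', '\\"')
    command = 'AUTHENTICATE "%s"\r\n' % quoted
    with closing(socket.create_connection((host, port), timeout=timeout)) as conn:
        conn.sendall(command.encode("ascii"))
        reply = conn.recv(1024)
    # "250 OK" means success; a wrong password gets a 515 reply instead.
    return reply.startswith(b"250")
```

With the setup above, control_auth_ok("SUPERPASSWORD") should return True, and any other password should return False.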
Hurray!
if (all worked well to this point) {
    My congratulations!
} else {
    check what went wrong carefully and retry!
}
Step1.return() // end of step 1
requesocks is a Python module that provides us with SOCKS proxy support and makes working with HTTP requests less cumbersome. If interested, you can check out its page at https://pypi.python.org/pypi/requesocks.
sudo pip install requesocks
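To give an idea of how requesocks is typically used, here is a minimal sketch that sends a request through the Tor SOCKS listener on port 9050. The function names tor_proxies and fetch_through_tor are my own, and this is not necessarily how the attached crawler does it. requesocks is imported inside the function so that the file still loads on machines where the module is not installed yet.

```python
TOR_SOCKS_URL = "socks5://127.0.0.1:9050"  # Tor's default SOCKS listener


def tor_proxies(socks_url=TOR_SOCKS_URL):
    """Build the proxies mapping that routes http and https via Tor."""
    return {"http": socks_url, "https": socks_url}


def fetch_through_tor(url, socks_url=TOR_SOCKS_URL):
    """Fetch url through the Tor SOCKS proxy and return the body text."""
    import requesocks  # third-party, installed with pip as shown above

    session = requesocks.session()
    session.proxies = tor_proxies(socks_url)
    return session.get(url).text
```

Fetching any "what is my IP" page with fetch_through_tor should then show a Tor exit address instead of your own.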
stem is a Python controller library that enables us to interact with Tor.
sudo pip install stem
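This is where the ControlPort and its password pay off. A minimal sketch, assuming the port 9051 and the plain password from the setup above: it asks Tor for a fresh identity by sending the NEWNYM signal through stem. The function name renew_tor_identity and the 10 second pause are my own choices. Tor may rate-limit NEWNYM requests, so asking too often will not give a new circuit each time.

```python
import time


def renew_tor_identity(password, control_port=9051, wait_seconds=10):
    """Ask Tor for new circuits, then pause before the next request."""
    # Imported here so the file loads even before stem is installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        # Plain-text password, not the 16:... hash stored in torrc.
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
    # Give Tor a moment to build fresh circuits.
    time.sleep(wait_seconds)
```

Calling renew_tor_identity("SUPERPASSWORD") between two page fetches is one way to get a different exit IP address for the second fetch, though Tor does not guarantee it.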
We need beautifulsoup to make sense of all the jargon (bunch of text) we will be getting from the crawler results
sudo pip install beautifulsoup4
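As a small taste of what beautifulsoup does for a crawler, the sketch below pulls every link out of a page so the crawler knows where to go next. The function name extract_links is hypothetical, and the built-in "html.parser" backend is my own pick so that no extra parser needs installing.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7


def extract_links(html, base_url):
    """Return the absolute URL of every <a href=...> found in html."""
    # Imported here so the file loads even before bs4 is installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links such as "/about" against the page URL.
        links.append(urljoin(base_url, anchor["href"]))
    return links
```

Feeding it the text of a fetched page together with that page's URL gives back a list of absolute links to queue up for the next crawl.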
Congratulations! Now we are ready to write our crawlers
Attached to this post is a web crawler on my website. Watch as the Tor IP address changes at every crawl. Do not forget to remove the .txt extension (rename to ucaku_crawl.py)
N.B.: This tutorial was done using Python 2.7
© 2017 UCAKU, Inc. All rights reserved.