Setting Up A Web Bot (With Bypassing Ability) Using Tor

2017-03-11 10:40:18 Uchenna

How to set up a web crawler with bypassing abilities using Tor and Python.

STEP 0

Before we get to the nerdy parts, let's first just enjoy the Tor Browser for anonymous web browsing on Ubuntu (no crawling yet). This comes in handy when we want to bypass some, ehmmm, let's call them "walls" :). I assume you had not installed Tor before stumbling across this post. To install Tor Browser, enter the following commands one by one:

sudo add-apt-repository ppa:webupd8team/tor-browser

sudo apt-get update

sudo apt-get install tor-browser

However, if you have already tried a bunch of things and ended up with some Tor-related "residue" on your system, or you just want to make a clean installation, don't worry, I gat you! Run

sudo apt-get remove --purge tor-browser

before running the commands above.

Now to our main BUSINESS

Some of us “no funners” want to just get the crawling going up in here! No time to enjoy the graphical interface. Straight up!

STEP 1 (INSTALLING TOR):

We install Tor by running the following commands

sudo apt-get update

sudo apt-get install tor

After the installation, you will get some notices (warnings), one of which is:

“This is experimental software. Do not rely on it for strong anonymity”
so be smart! Anyway, you will also notice that the SOCKS listener is open on port 9050. We will need to enable the ControlPort listener on port 9051. But first, let's just restart Tor to make sure we are good

sudo /etc/init.d/tor restart
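
If you would rather confirm from Python that the SOCKS listener on 9050 is really up, a minimal sketch along these lines should do. It assumes Tor runs on the same machine (127.0.0.1), and port_is_open is a hypothetical helper name, not something Tor ships.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an error code otherwise,
        # instead of raising for an ordinary refused connection
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


if __name__ == "__main__":
    # Assumption: Tor is running locally with its default SOCKS port
    print("SOCKS listener on 9050 open: %s" % port_is_open("127.0.0.1", 9050))
```

The same helper can be pointed at port 9051 once the control port has been enabled.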


OK, we are good! Now, to enable the ControlPort listener on port 9051, we need to edit the torrc file. But before that, since we are all serious about privacy and security here, we don’t want just anyone to access our ControlPort listener. For those who don’t know what the ControlPort listener does:

The ControlPort listener is like the “headphones” Tor uses to listen for any communication from applications talking to the Tor controller. OK, it’s more like Tor’s mobile phone.
With that said, we need to create a password to protect our control port (not the SOCKS port). You can use any password, but torrc stores it as a hash rather than as plain text, so generate the hash with

tor --hash-password SUPERPASSWORD

Copy the hashed password, which should look something like

16:42C310FDE560C60B60A03F324522868EBFE70065D143B3ADE654BEAF31

Hey! Better not use my password! I will use vi, my favorite editor, but you can use any editor you want

sudo vi /etc/tor/torrc

Then we insert into the file

ControlPort 9051
HashedControlPassword 16:42C310FDE560C60B60A03F324522868EBFE70065D143B3ADE654BEAF31
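
A hash that lost a character during copy and paste is an easy way to lock yourself out, so here is a small sketch that reads the two settings back out of torrc-style text and sanity-checks the hash. The names read_control_settings and looks_like_tor_hash are hypothetical, and the "16: plus 58 hex digits" pattern is an assumption based on the example hash in this post, not an official rule. Option names are matched with exact case to keep the sketch short.

```python
import re

# Assumption: `tor --hash-password` prints "16:" followed by 58 hex digits,
# which is the shape of the example hash in this post.
HASH_PATTERN = re.compile(r"^16:[0-9A-Fa-f]{58}$")

WANTED = ("ControlPort", "HashedControlPassword")


def read_control_settings(torrc_text):
    """Collect ControlPort and HashedControlPassword from torrc contents."""
    settings = {}
    for line in torrc_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)  # option name, then the rest of the line
        if len(parts) == 2 and parts[0] in WANTED:
            settings[parts[0]] = parts[1].strip()
    return settings


def looks_like_tor_hash(value):
    """Rough shape check only; it cannot tell whether the hash is valid."""
    return bool(HASH_PATTERN.match(value))


if __name__ == "__main__":
    sample = (
        "ControlPort 9051\n"
        "HashedControlPassword "
        "16:42C310FDE560C60B60A03F324522868EBFE70065D143B3ADE654BEAF31\n"
    )
    found = read_control_settings(sample)
    print(found.get("ControlPort"))
    print(looks_like_tor_hash(found.get("HashedControlPassword", "")))
```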

Save the file and quit

:wq!

Restart again

sudo /etc/init.d/tor restart

Now you should notice the control listener open on port 9051

Hurray!

If ( all worked well to this point){

My congratulations!

}else{

check what went wrong carefully and retry!

}

Step1.return() // end of step 1

STEP 2 (INSTALLING REQUESOCKS AND STEM):

requesocks is a Python module that gives us SOCKS proxy support and makes working with HTTP requests less cumbersome. If interested, you can check out its page at https://pypi.python.org/pypi/requesocks.

sudo pip install requesocks
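
To give an idea of how requesocks gets pointed at Tor, here is a rough sketch, assuming Tor's SOCKS listener is on 127.0.0.1:9050 as set up in step 1. The helper names tor_proxies and fetch_through_tor are hypothetical, and this is not the crawler attached to this post. requesocks is imported inside the function so the proxy helper still works on a machine where it is not installed.

```python
TOR_SOCKS_PORT = 9050  # the SOCKS port from step 1


def tor_proxies(port=TOR_SOCKS_PORT, host="127.0.0.1"):
    """Build a proxies dict that sends HTTP and HTTPS traffic through Tor."""
    proxy = "socks5://%s:%d" % (host, port)
    return {"http": proxy, "https": proxy}


def fetch_through_tor(url):
    """Download url through the local Tor SOCKS proxy and return the body."""
    import requesocks  # needs `pip install requesocks` (Python 2.7)

    session = requesocks.session()
    session.proxies = tor_proxies()
    return session.get(url).content


if __name__ == "__main__":
    print(tor_proxies())
```

Calling fetch_through_tor on a "what is my IP" style page is a quick way to see that the request really left through Tor.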

stem is a Python controller library that lets us talk to Tor through its control port.

sudo pip install stem 
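
This is where the ControlPort and password from step 1 pay off. Below is a minimal sketch of asking Tor for a fresh identity with stem, assuming the control port is 9051 and that you pass the plain password (SUPERPASSWORD in my example), not the hash. The function names request_new_circuit and renew_tor_identity are hypothetical; the first one takes the controller as an argument so it can be tried out without a running Tor.

```python
def request_new_circuit(controller, password, newnym_signal):
    """Authenticate on the control port, then send the given signal."""
    controller.authenticate(password=password)
    controller.signal(newnym_signal)


def renew_tor_identity(password, control_port=9051):
    """Open the local control port and ask Tor for a new identity."""
    # Imported here so the helper above can be used without stem installed
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        request_new_circuit(controller, password, Signal.NEWNYM)


if __name__ == "__main__":
    # Assumption: replace with the plain password you hashed in step 1
    renew_tor_identity("SUPERPASSWORD")
```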

STEP 3 (INSTALLING BEAUTIFULSOUP)

We need BeautifulSoup to make sense of all the jargon (the pile of HTML text) we will be getting back from the crawler

sudo pip install beautifulsoup4
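
As a taste of what BeautifulSoup does with that pile of text, here is a small sketch that pulls every link out of a page. The function name extract_links is hypothetical, and I am assuming the built-in "html.parser" backend so nothing else needs installing.

```python
def extract_links(html):
    """Return the href of every <a> tag in the page that has one."""
    from bs4 import BeautifulSoup  # needs `pip install beautifulsoup4`

    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    page = '<p><a href="/about">About</a> <a>no link</a></p>'
    print(extract_links(page))
```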

Congratulations! Now we are ready to write our crawlers

Attached to this post is a web crawler on my website. Watch as the Tor IP address changes at every crawl. Do not forget to remove the .txt extension (rename the file to ucaku_crawl.py).
N.B.: This tutorial was done using Python 2.7
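
For a feel of the "new IP at every crawl" idea, here is a rough sketch of the loop. It is not the attached ucaku_crawl.py. The name crawl_with_rotation is hypothetical, and fetch and renew are whatever functions you plug in (for example, one that downloads through the SOCKS port and one that signals the control port). The 10 second pause is my assumption, to give Tor time to switch circuits.

```python
import time


def crawl_with_rotation(urls, fetch, renew, wait_seconds=10, sleep=time.sleep):
    """Fetch each URL, asking for a new Tor identity between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            renew()              # ask Tor for a new identity
            sleep(wait_seconds)  # assumption: pause while Tor switches circuits
        results.append((url, fetch(url)))
    return results


if __name__ == "__main__":
    # Stand-in functions so the loop can be tried without Tor running
    def fake_fetch(url):
        return "body of " + url

    def fake_renew():
        print("asking Tor for a new identity")

    pages = ["http://example.com/a", "http://example.com/b"]
    for url, body in crawl_with_rotation(pages, fake_fetch, fake_renew, wait_seconds=0):
        print("%s -> %s" % (url, body))
```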
