Tor Project: Alexa Top Sites Captcha and Block Monitoring

Hey all,
This is Apratim from India. I'm a final-year student, and I'm excited to be part of GSoC 2021.

About My Project

My project focuses on tracking the Alexa Top 500 websites to build a detailed picture of which sites return CAPTCHAs, limit functionality, or block Tor clients. It aims to do so by fetching web pages over a period of time from both Tor clients and regular (mainstream) browsers. More details about the project can be found here, and blog posts can be seen here.
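A minimal sketch of that dual-fetch idea, assuming a local Tor daemon listening on the default SOCKS port 9050 and the `requests` library with SOCKS support installed (`pip install requests[socks]`); this function is illustrative, not the project's actual code:

```python
import requests

# Tor's default SOCKS port; the "socks5h" scheme makes DNS resolution
# happen through Tor as well, not on the local machine.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch(url, use_tor=False, timeout=30):
    """Fetch a page directly or through Tor; returns (status_code, body)."""
    proxies = TOR_PROXIES if use_tor else None
    resp = requests.get(url, proxies=proxies, timeout=timeout)
    return resp.status_code, resp.text

# Comparing fetch(url) with fetch(url, use_tor=True) over time reveals
# sites that block, rate-limit, or CAPTCHA Tor clients.
```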

My Time Zone

Indian Standard Time (UTC +5:30)

Getting in Touch

You can reach me here:
IRC: _ranchak_ (OFTC) in the Tor channels (#tor, #tor-dev)
GitHub
LinkedIn

Looking forward to learning lots!

Community Bonding Period:

It's been a while since my previous update. I asked @downey and @pili if that was okay, and they were fine with a blog post summarizing the past two weeks as a whole :slightly_smiling_face:

Communication:

Earlier this week I was asked to create issues on gitlab.tpo, which gets me used to GitLab (I'm new to it) and also helps in tracking the project. I also planned weekly meetings with my mentor to give them updates. In addition, I try to stay active on IRC to help with conversations and answer questions from members, or even random people looking for help.

Work:

I first tried installing Captcha Monitor with the help of my mentor, but the Tor part of the application didn't work on my laptop, so I couldn't connect to Tor through the application. Since it didn't work, I was advised not to spend too much time there, and instead to write my own scripts and start experimenting, which benefits both the project and me.

Meanwhile, researching the problem helped me find some interesting results and corner cases that let me tackle it in a different way. I changed my flowchart and overall approach; it's not complete yet, but it looks better than the flowchart I mentioned before:

[Flowchart attachments: Version 1, Version 2]

Problems:

While experimenting I noticed that my ISP injects advertisements into HTTP websites (DNS poisoning). I took this to IRC and got suggestions such as running my own recursor, or using DoT (DNS over TLS) / DoH (DNS over HTTPS). I had already been using Cloudflare's DNS on my system, but that didn't seem to help; I may have to try setting the router's DNS to Cloudflare's or Google's instead. I also tried using the network tab to find the injected site and setting a firewall (ufw) denial for its IP address, but that didn't work either. For the time being I'm using NextDNS, which minimizes the problem even if it doesn't solve it 100%. I also noticed that waiting about 45 seconds redirects back to the original website, so if none of the above works for me, I'll fall back to that.
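One way to cross-check the system resolver against an independent one is Cloudflare's DNS-over-HTTPS JSON API. A sketch, where the "poisoned" heuristic is my own assumption, not part of the project code:

```python
import json
import urllib.request

# Cloudflare's DoH endpoint; answers come back as JSON when we send
# the "application/dns-json" accept header.
DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"

def doh_resolve(name, rtype="A"):
    """Resolve a name over HTTPS, bypassing the ISP's resolver."""
    req = urllib.request.Request(
        f"{DOH_ENDPOINT}?name={name}&type={rtype}",
        headers={"accept": "application/dns-json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        answers = json.load(resp).get("Answer", [])
    return [a["data"] for a in answers]

def looks_poisoned(system_ips, doh_ips):
    """Crude heuristic: the system answers share nothing with the DoH answers."""
    return bool(doh_ips) and not (set(system_ips) & set(doh_ips))
```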

If you have any suggestions, I'd love to give them a try!

Ahead:

I plan to install Captcha Monitor locally and also write scripts for the dashed boxes (which haven't been built yet).

Community Bonding Period: [Week 3]

This week had some crests and troughs. Due to a cyclone I lost my internet for a few days, which delayed the installation process. Even after getting back online, my connection was still too unreliable to run the Docker containers, and the tests kept failing. As suggested by my mentor I tried getting a server, but neither my card nor any of my relatives' cards worked. Thankfully, my mentor provided me with a server, and the system is working.

I ended this week on a good note, with a PR and a lot of knowledge of the codebase. I'm looking forward to shaping the application bit by bit with small PRs.

Coding Period: [Week 4]

This week my main task was to start writing analyser.py, the logic part of my code. I had some trouble integrating it and adding tests with the current codebase, so we discussed it and decided to write the logic part first, think about integrating it with the present codebase later, and then move on to integration and unit testing. The logic works as of now, but there is always room for the improvements mentioned below.

Insights:

At present I'm checking the reliability of the modules. For example:

Cloudflare blocks the requests library, so requests isn't suited here. I read that this is because the request's User-Agent header identifies it as Python, which gets it marked as a bot. I changed the User-Agent but the result was the same, so that isn't much use in this case. However, for specific cases like mastercard, where there is a 3xx (redirect) status, requests returns results easily.
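For reference, this is roughly what that User-Agent experiment looks like; the browser UA string below is just an example, not the header set the project actually uses:

```python
import requests

# requests identifies itself as "python-requests/<version>" by default,
# which bot-detection services like Cloudflare can flag.
default_ua = requests.utils.default_user_agent()

# An example browser-style User-Agent to try instead (an assumption,
# not the project's actual headers):
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) "
        "Gecko/20100101 Firefox/78.0"
    ),
}

# Fetching with the spoofed UA is a one-liner; as noted above, Cloudflare
# still blocked it in my tests, so the UA alone is not the whole story.
# resp = requests.get("https://example.com", headers=BROWSER_HEADERS)
```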

Also, I've been asked not to use selenium-wire, so I'll be changing that soon.

So I also plan on making a checker method that would do the following:

def check():
    # Non-Tor:
    if request_module is blocked:
        # Just to be cautious
        check HAR
        if HAR.first() returns 4xx or 5xx:
            go with request
        elif HAR.first() returns 0:
            "No case found till now"
        else:
            go with HAR.first()

    # Tor:
    if request_module is blocked:
        check HAR
        if HAR.first() returns 3xx or 4xx or 5xx:
            go with request
        elif HAR.first() returns 0:
            "check for captcha and warnings"
            pass
        else:
            go with HAR.first()

I hope this will make the code a bit more reliable. Discussion is needed here, because this is just my current thinking.
Also, HAR.first() means the status code of the first request sent to the server, generally the index page.
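A minimal sketch of what HAR.first() could look like, assuming the HAR export is the standard JSON structure with log.entries in request order (the function name here is hypothetical):

```python
def har_first_status(har):
    """Status code of the first request in a HAR export; 0 if no entries."""
    entries = har.get("log", {}).get("entries", [])
    if not entries:
        return 0
    return entries[0]["response"]["status"]

# A tiny HAR-shaped dict for illustration: the index page got a 403,
# a later sub-request got a 200.
sample_har = {"log": {"entries": [
    {"response": {"status": 403}},
    {"response": {"status": 200}},
]}}
# har_first_status(sample_har) → 403
```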

The above is written with the assumption that the requests library is being blocked by a bot-detection tool like Cloudflare. But it makes things difficult in cases where only the requests library is blocked.

For example, cloudflare.com returns a CAPTCHA for Tor users:

but the requests library gets a 403 error (which looks a bit different in the HAR export), which says it is blocked.
Similar cases occur for https://hugedomains.com and https://brandbucket.com.

On the other hand, requests does provide reliable information on websites like mastercard.de that are blocked outright, which is why I'm also checking the non-Tor requests results to round out the picture.

Week ahead:

I still plan to optimize and correct some of the markings and improve the tests, since the HARExportTrigger I'm using sometimes glitches while exporting the HAR data from the browser. Also, the GDPR_match_list I've used currently contains only a few keywords for testing purposes; I plan to extend that list too.
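The match-list check itself is simple substring matching; a sketch, where the keywords below are placeholders rather than the project's actual GDPR_match_list:

```python
# Placeholder keywords; the real GDPR_match_list differs.
GDPR_MATCH_LIST = ["gdpr", "cookie consent", "data protection"]

def match_keywords(page_text, keywords=GDPR_MATCH_LIST):
    """Return the keywords that appear in the page text (case-insensitive)."""
    text = page_text.lower()
    return [kw for kw in keywords if kw in text]
```

Extending the list then only means adding strings, while the matching logic stays untouched.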

Coding Period: [Week 5]

I implemented the logic part of analyser.py and wrote scripts to add its results to the database. In short, I implemented and integrated analyser.py into Captcha Monitor.
The details are as follows:

Implemented the logic for the CAPTCHA check (analyser.py).
Changed a bit of the logic and added flags and integer results (dom_analyse, captcha_checker, check) to insert into the DB.
Inserted the data from the analyser into the DB.
Added an analyser_completed table to models.py with fields such as captcha_checker, dom_analyse, and status_check, so the details can be read back from the database entries.
Wrote tests for analyser.py.
I personally feel more tests could be added to get better coverage of the code.
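A rough sketch of the shape of that table, using stdlib sqlite3 purely for illustration; the real models.py may use an ORM, and the column types here are assumptions:

```python
import sqlite3

# In-memory DB just to illustrate the analyser_completed shape.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analyser_completed (
        id INTEGER PRIMARY KEY,
        captcha_checker INTEGER,  -- 1 if a CAPTCHA was detected
        dom_analyse INTEGER,      -- result flag from the DOM analysis
        status_check INTEGER      -- flag derived from the status codes
    )
    """
)
conn.execute(
    "INSERT INTO analyser_completed (captcha_checker, dom_analyse, status_check) "
    "VALUES (?, ?, ?)",
    (1, 0, 1),
)
row = conn.execute(
    "SELECT captcha_checker, dom_analyse, status_check FROM analyser_completed"
).fetchone()
# row == (1, 0, 1)
```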

The Week Ahead:

This week I'll be experimenting with and trying to implement the Consensus Module based on the Senser paper. The basic idea of that paper is that by using multiple proxies one can get a rough estimate of a website's behavior, which we can then use to check whether the website is blocked or not.

Using this approach, one can compare the results generated by the Consensus Module with the results we get through the exit relays, thereby getting better results than the stop words or filter list we are using at present. For the Consensus Module to work, we need to use several different proxies or VPNs to get a rough idea.
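The comparison step could be sketched as a majority vote over status codes; the function names and the voting rule here are my assumptions about how such a module might work, not the paper's exact method:

```python
from collections import Counter

def consensus_status(statuses):
    """Majority status code among proxy fetches; None if there are no fetches."""
    if not statuses:
        return None
    return Counter(statuses).most_common(1)[0][0]

def exit_disagrees(proxy_statuses, exit_status):
    """True if the Tor exit's result differs from the proxy consensus."""
    expected = consensus_status(proxy_statuses)
    return expected is not None and exit_status != expected
```

If most proxies see a 200 but the exit relay sees a 403, that is a hint the site treats Tor differently rather than being down for everyone.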

Week Ahead:

I will be working and experimenting with different VPNs and proxies, checking which suits this case best, and then implementing it.