Tor Project: Alexa Top Sites Captcha and Block Monitoring


Hey all,
This is Apratim from India. I’m a final-year student, and I’m excited to be part of GSoC 2021.

About My Project

My project focuses on tracking the Alexa Top 500 websites to gain detailed knowledge of which sites return captchas, limit functionality, or block Tor clients. The project aims to do so by fetching web pages over a period of time from both Tor clients and regular (non-Tor) browsers. More details about the project can be found here, and blog posts can be seen here.

My Time Zone

Indian Standard Time (UTC +5:30)

Getting in Touch

You can reach me here:
IRC: _ranchak_ (OFTC) in the Tor channels (#tor, #tor-dev)
GitHub
LinkedIn

Looking forward to learning lots!

Community Bonding Period:

It’s been quite a while since my previous update. I asked @downey and @pili if it was okay, and they were fine with a single blog post summarizing the past two weeks as a whole :slightly_smiling_face:

Communication:

Earlier this week I was asked to create issues on gitlab.tpo, which familiarizes me with GitLab (I’m new to it) and at the same time helps track the project. I also planned weekly meetings with my mentor to give them updates, and I try to stay active on IRC to help with conversations and answer questions from members or anyone else looking for help.

Work:

I first tried installing Captcha Monitor with the help of my mentor, but the Tor part of the application didn’t work on my laptop, so I couldn’t connect to Tor through the application. Since it didn’t work, I was advised not to spend too much time there but instead to write my own scripts and start experimenting, which would greatly benefit both the project and me.

Meanwhile, researching the problem helped me find some interesting results and corner cases that let me tackle it in a different way. I changed my flowchart and approach; it’s not yet complete, but it looks better than the flowchart I mentioned before:

[Flowchart images: Version 1, Version 2]

Problems:

While experimenting I noticed that my ISP injects advertisements (DNS poisoning) into HTTP websites. I took this to IRC and got suggestions like running my own recursor or using DoT (DNS over TLS)/DoH (DNS over HTTPS). Earlier I was using Cloudflare’s DNS on my system, but that didn’t seem to work; maybe I’ll have to try setting the router’s DNS to Cloudflare’s or Google’s. I also tried opening the network tab to find the injected site and set a firewall denial (ufw) for its IP address, but that didn’t work either. For the time being I’m using NextDNS, which minimizes the problem for me even if it doesn’t work 100%. I also noticed that waiting about 45 seconds redirects back to the original website, so if none of the above works for me, I’ll go with the latter.
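For reference, the DoH approach suggested on IRC can be sketched in a few lines against Cloudflare’s JSON API (the function names are mine; this bypasses the ISP’s resolver entirely):

```python
import json
import urllib.request

DOH_URL = "https://cloudflare-dns.com/dns-query"  # Cloudflare's JSON DoH endpoint

def parse_answers(doc):
    """Pull the answer records out of a dns-json response document."""
    return [record["data"] for record in doc.get("Answer", [])]

def resolve_doh(name, rtype="A"):
    """Resolve `name` over DNS-over-HTTPS instead of the system resolver."""
    req = urllib.request.Request(
        f"{DOH_URL}?name={name}&type={rtype}",
        headers={"accept": "application/dns-json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_answers(json.load(resp))
```

Since the lookup travels over HTTPS, the ISP’s resolver never sees the query, so it cannot substitute its own records.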

Also, if you have any suggestions, I’d love to give them a try!

Ahead:

I plan to locally install Captcha Monitor and also write scripts for the dashed boxes (which haven’t been made yet).

Community Bonding Period: [Week 3]

This week was a week of crests and troughs. Due to a cyclone I didn’t have internet for a few days, which delayed the installation process. Even once back online, my connection was still too unreliable to run the Docker containers, and the tests kept failing. As suggested by my mentor I tried opting for a server, but neither my card nor any of my relatives’ cards worked. Thankfully, my mentor provided me with a server, and the system is working.

I ended this week on a good note, with a PR and lots of knowledge of the codebase. Looking ahead to working on small PRs that will shape the application bit by bit.

Coding Period: [Week 4]

This week my main task was to start writing analyser.py, which will be the logic part of my code. I had some trouble integrating it and adding tests with the current codebase, so we discussed it and concluded that I should write the logic part first, think about integrating it with the present codebase later, and then move on to integration and unit testing. The logic works as of now, but there is always scope for the improvements mentioned below.

Insights:

At present I’m checking the reliability of the modules. For example:

Cloudflare blocks the requests library, so requests isn’t suited here. I read that this is because the User-Agent header of the request identifies it as Python, which gets marked as a bot. I changed the User-Agent, but the result was the same, so that isn’t much use in this case. However, for specific cases like mastercard, where there is a 3xx status (redirect), it returns results easily.
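The User-Agent experiment can be reproduced with the standard library just as well, since it is the default header, not the library itself, that gets flagged (a sketch; the UA string is only an example):

```python
import urllib.request

# urllib sends "Python-urllib/3.x" by default, much like requests sends
# "python-requests/x.y.z"; both are trivially flagged as bots.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"

def build_request(url):
    """Build a GET request that presents a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
```

As noted above, swapping the header alone wasn’t enough for Cloudflare, which presumably fingerprints more than the User-Agent.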

Also, I’ve been asked not to use selenium-wire, which I’ll be changing soon.

So I also plan on writing a checker method that would check the following:

def check():
    # Non-Tor:
    if request_module is blocked:
        # Just to be cautious, double-check against the HAR export
        if HAR.first() returns 4xx or 5xx:
            go with request
        elif HAR.first() returns 0:
            # no such case found till now
        else:
            go with HAR.first()

    # Tor:
    if request_module is blocked:
        if HAR.first() returns 3xx or 4xx or 5xx:
            go with request
        elif HAR.first() returns 0:
            # check for captcha and warnings
            pass
        else:
            go with HAR.first()

I hope this will make the code a bit more reliable. Discussion is needed here, because this is just my current thinking.
Also, HAR.first() means the status code of the first request sent to the server, generally the fetch of the index page.

The above is written keeping in mind that the requests library is blocked by a bot-detection tool like Cloudflare. But it makes things difficult in cases where only the requests library is blocked.
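For concreteness, getting that first status code out of a HAR export is simple, since HAR is just JSON (the helper name is mine):

```python
import json

def har_first_status(har_text):
    """Status code of the first request in a HAR export; generally this
    corresponds to the fetch of the index page itself."""
    entries = json.loads(har_text).get("log", {}).get("entries", [])
    if not entries:
        return 0  # nothing was captured: the "returns 0" branch above
    return entries[0]["response"]["status"]
```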

For example, cloudflare.com returns a captcha for Tor users:

but the requests library gets a 403 error (which looks a bit different when I inspect the HAR export), which says it is blocked.
Similar cases: https://hugedomains.com, https://brandbucket.com.

And it does provide reliable information on websites like mastercard.de that are blocked, hence I’m also checking the non-Tor results for requests to make the picture more convincing.

Week ahead:

I still plan to optimize and correct some markings and to improve the tests, as the HARExportTrigger I’m using glitches at times while exporting the HAR data from the browser. Also, I’ve used a GDPR_match_list which currently contains only a few keywords for testing purposes; I plan to extend that list too.


Coding Period: [Week 5]

I implemented the logic part of analyser.py and wrote scripts to add its results into the database. Basically, I implemented and integrated analyser.py into Captcha Monitor.
The details are as follows:

  • Implemented the logic for captcha detection (analyser.py).
  • Changed a bit of the logic and added flags and integer results (dom_analyse, captcha_checker, check) to insert into the DB.
  • Data from the analyser needs to be inserted into the DB.
  • Added an analyser_completed table to models.py with columns such as captcha_checker, dom_analyse, status_check so the details can be read back from the database entries.
  • Wrote tests for analyser.py.

I personally feel more tests could be added to get better coverage of the code.
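As a sketch of what that table and the insertion path might look like (Captcha Monitor has its own models.py; the plain-SQLite version below is only illustrative, with the column names taken from the list above and the types being my assumption):

```python
import sqlite3

# Rough shape of the analyser_completed table; the column names follow the
# post, while the types and the id column are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS analyser_completed (
    id INTEGER PRIMARY KEY,
    captcha_checker INTEGER,
    dom_analyse INTEGER,
    status_check INTEGER
)
"""

def insert_result(conn, captcha_checker, dom_analyse, status_check):
    """Store one analyser verdict for later querying."""
    conn.execute(
        "INSERT INTO analyser_completed (captcha_checker, dom_analyse, status_check)"
        " VALUES (?, ?, ?)",
        (captcha_checker, dom_analyse, status_check),
    )
    conn.commit()
```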

The Week Ahead:

This week I’ll be experimenting with and trying to implement the Consensus Module, based on the Senser paper. The basic idea of the paper is that, using multiple proxies, one can get a rough estimate of a website and its behavior, which can then be used to check whether the website is blocked or not.

Using this approach, one can compare the website versions gathered by the Consensus Module with the versions we get via the exit relays, thereby getting better results than the stop words or filter lists we are using as of now. For the Consensus Module to work, we need to use several different proxies or VPNs to get a rough idea.
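The comparison can be sketched as a simple majority vote (all names are mine, and a “fingerprint” here stands in for whatever page signature the real module ends up using):

```python
from collections import Counter

def consensus(proxy_fingerprints):
    """Majority vote over page fingerprints fetched through several proxies.

    `proxy_fingerprints` maps proxy name -> fingerprint of the page it saw
    (e.g. a hash of the DOM structure). Returns the most common fingerprint
    and the fraction of proxies that agreed on it.
    """
    counts = Counter(proxy_fingerprints.values())
    fingerprint, votes = counts.most_common(1)[0]
    return fingerprint, votes / len(proxy_fingerprints)

def looks_blocked(exit_relay_fp, proxy_fingerprints, threshold=0.6):
    """Flag an exit-relay fetch that disagrees with a confident consensus."""
    expected, agreement = consensus(proxy_fingerprints)
    # Only trust the vote when the proxies mostly agree among themselves.
    return agreement >= threshold and exit_relay_fp != expected
```

The threshold guards against dynamic sites where the proxies themselves disagree, in which case no verdict should be drawn.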

Week Ahead:

I will be experimenting with the different VPNs and proxies, checking which suits best in this case, and then implementing it.


Coding Period: [Week 6]

This week I experimented with different VPNs and proxies and discussed the trial implementation of the Consensus Module with my mentor. The initial phase looked like this:

But then I found that free proxies are easily available and updated regularly, with less cumbersome integration, so I dropped the above plan.

Finally, after discussion, I settled on implementing the logic below:

This is the tabular representation, for which I will implement the code and continue with the further construction of the Consensus Module.

I also had the previously completed modules reviewed, along with how they work on real-world data, and received positive feedback.

Week Ahead:

  • If things go as planned, I’ll start pushing the scripts for this module to the codebase.
  • If not, I’ll have to tweak the logic to get better results.

Coding Period: [Week 7]

This week I wrote a bare-minimum script to see if the said logic works as intended, and found that it fits well. Since the above logic is a spin-off of the paper, I decided to rename the Consensus Module to Consensus Lite, and as advised by my mentors I also tried contacting the paper’s author, Micah Sherr, for his suggestions on what more I could add.

As for the code, I merged a few MRs into the current codebase, related to adding proxies and the modules that use them.

Week Ahead:

This week I’ll further integrate the logic portion, and I hope to get advice from Micah Sherr regarding the project.


Coding Period: [Week 8]

For this week, I added the Consensus Lite module and its tests. We also discussed the pros and cons of adding the full Consensus Module: until now I had focused on comparing the architecture of websites rather than their content, because a content-based approach (which is what the Consensus Module takes) is difficult for dynamic websites (like Reddit, which might show different posts to different users based on interests, region, and its recommendation system) and for geolocation-based websites.

Based on that, I figured that running and testing for myself how correctly Captcha Monitor works right now would answer:

  • Does it work as intended?
  • If not, where does it lack and how could it be improved?
  • Would the Consensus Module improve the accuracy in this case?

and it might answer more unknown questions.
That said, I ran a few tests (with ~120 websites), and so far it works well. Certainly, the more tests, the stronger the assertion.

Meanwhile, I’m improving the documentation and a few other things so that they help someone understand my thinking and my way of approaching the problem. This might also help in identifying errors, both by the community:

Given enough eyeballs, all bugs are shallow
Linus’s Law

and by me: Rubber Duck Debugging

Week Ahead:

This week I plan to look into the errors and how they could be fixed, and also to have a final discussion on whether the Consensus Module is needed.


Coding Period: [Week 9]

The updates are a bit late this week due to some college work.
This week I ran Captcha Monitor end-to-end for the first time and found it running with some issues. The status_checker wasn’t working as intended, so I debugged it, and there was a small discussion about the present approach.
There was also an issue with the GDPR-removal code for https://nordstrom.com/. So the testing was indeed helpful for running checks on a few websites, and apart from small issues the overall test went well (the modules were working).

Week Ahead:

I will be working on the metrics and the different graph representations of the modules, to further identify the data points.


Coding Period: [Week 10]

This week, my main goal was to focus on the metrics, but thankfully I caught some bugs in my code and wrote a patch for them. Also, since the application runs quite a few fetchers (browsers: Tor Browser, Chrome, Brave, Firefox, Opera), the server I use starts to lag after a certain time. To work around this temporarily, I’ve written some tests which cover, of course, only a small part of the actual data points I would receive, but are still good enough to generate the metrics and, over time, the graphs.

I also learnt to use GitLab more productively: creating more tickets, using them for discussions, and setting up a progress tracker, rather than only opening MRs.

Discussions related to the Metrics: #88
Ticket for the known errors I face: #94

Further, I implemented a graph related to relay IDs and its extensions, which wasn’t planned and had been a backlog item. For now it’s just in a primitive stage, which I’ll improve before asking the #tor-dev community for more reviews.

Week Ahead:

I plan to create more graphs, and to extend issue #94 to get an idea of the different edge cases where my proposed logic would fail, and thereby minimize the errors.


Coding Period: [Week 11]

This week I spent time working on my backlog, focusing on the metrics part. More details can be found here:

This MR basically contains the logic for creating the graphs, taking data from the DB.

As of now I’ve made a page (basically just a bare HTML page to render the graphs) with the different plots (blocked, partially blocked, not blocked, etc.) for the different relays:

Further, I’m working on a search bar that lets you search by relay fingerprint and then redirects to this page (an individual page per relay, to give more details).
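The aggregation behind those plots can be sketched as a grouping pass over the DB rows (the row shape and verdict labels here are simplifications of my own, not Captcha Monitor’s schema):

```python
from collections import Counter, defaultdict

def per_relay_counts(rows):
    """Group fetch verdicts by exit-relay fingerprint, ready for plotting.

    `rows` is an iterable of (relay_fingerprint, verdict) pairs, where a
    verdict is one of "blocked", "partial", "ok".
    """
    counts = defaultdict(Counter)
    for fingerprint, verdict in rows:
        counts[fingerprint][verdict] += 1
    return counts
```

Each per-relay Counter then maps directly onto one bar group or pie chart on the relay’s page.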

Week Ahead:

Work on the dashboard and refactor earlier code (focused on optimization).

Coding Period: [Week 12]

This week I refined the code and wrote a module that automatically generates web pages per relay fingerprint, each consisting of the data and metrics for that fingerprint, tuned for better visualization of the data.

I then created the dashboard, which shows the overview for all relays, and further:

  • Improves the way the output is produced, enabling reuse of the data produced by Captcha Monitor.
  • Improves the execution time of render_dashboard.py, which earlier took a couple of minutes to get data for a single relay and now takes seconds.

Dashboard:

Further, running the code on the server, I noticed that the Docker container running tor is reported unhealthy, though the logs show it working fine. I suspect the container’s HEALTHCHECK has some flaws.
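One way to make the check honest would be to probe the SOCKS port directly rather than trusting whatever the current HEALTHCHECK does (a sketch; 9050 is tor’s default SOCKS port and may not match the container’s actual config):

```python
import socket

def tor_socks_alive(host="127.0.0.1", port=9050, timeout=5.0):
    """Liveness probe for the tor container: can we open a TCP connection
    to the SOCKS port? A HEALTHCHECK wrapper would exit 0 on True, 1 on False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A TCP connect only proves the port is listening, not that circuits build; a stricter probe could make a request through the SOCKS proxy.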

Week Ahead:

  • Try to make the Docker container work.
  • Integrate the generated data into the frontend part.
  • Work on the module: graphs according to website IDs.
  • Take suggestions from the community.

Coding Period: [Week 13]

This week, I started off by making the container work (unhealthy to healthy). That done, I found another bug that prevented the container from being killed on my system, which resulted in immense storage usage. So I started reading about Docker to fix the problem.

I also finished the logic part of the graphs according to website IDs and will hopefully add it during this week.

Simultaneously, I’m also going through this issue:

Week Ahead:

  • Raise tickets for further improvements and work to be done.
  • Submit for the final evaluation.