Tor Project: Cloudflare CAPTCHA Monitoring

Hi all,
I’m Barkin, and I will be working on the “Cloudflare CAPTCHA Monitoring” project as part of the Tor Project with my mentors @Georg and @arma. I’m going to post updates about my progress under this topic. You can find more detailed information about my project here.

I’m super excited about participating in GSoC 2020!

Hi!

Welcome aboard. We are excited as well to see this issue tackled. If there are questions, let us know.

Community Bonding Period - Week 1

This week (or rather, in the last 3 days) I spent my time updating the GSoC page on the Tor Project’s trac and talking to my mentors about the future planning of my project. They asked me to create a “core page” on the Tor Project’s trac explaining my project, so we can direct people to that page if they want to learn more about the project or check the code. I started working on that page as well.

IRC is not something new to me, but I have never used it actively. I once read an article that compared IRC to a crowded bar, where everyone talks to everyone loudly. This week, I realized how true that is. I also observed that people tag their messages in different ways (like using numbers or letters) to make it easier for others to reply using these tags. It is a different communication style than I am used to, but I am getting used to it.

I am looking forward to finishing the finals for my university classes and starting to work on the code!

Community Bonding Period - Week 2

This week I expanded my knowledge about receiving IRC messages 24/7, even while my computer is turned off. This was an important issue for me to solve since IRC doesn’t buffer incoming messages. Instead, users need to run a special system (such as a bouncer) that buffers these messages, so that they can be read once the user is back online.

Meanwhile, I started talking to the OONI people about improving my project and benefiting from their past experience in this field.

Finally, I sent a semi-formal introduction to my project to the tor-dev mailing list and asked for feedback from the community. I waited until the wiki article was finalized before sending this email because I thought it would be more meaningful to have the project’s documentation attached to it. I also brainstormed about various approaches for getting feedback and involving the community in the idea development process.

Once again, I am looking forward to finishing the finals for my university classes and starting to work on the code!

Community Bonding Period - Week 3

This week I created the parent and child trac tickets for my project’s milestones. These trac tickets will help me keep the community informed about my progress and gather feedback while I keep track of what I do each week.

I already received some feedback about what I have done so far on ticket #33010. This week, I also spent my time incorporating this feedback into my project and the milestones.

I additionally worked on getting an SSL certificate for my IRC bouncer setup, because it is 2020 and encryption is a must. I discovered that Let’s Encrypt issues SSL certificates at no cost, and I got one for my IRC bouncer server.

Community Bonding Period - Week 4

This week I restructured the preliminary code I had written previously, so that it works as explained in my project diagram below. The changes I implemented also made it possible to easily download the database.

Later, I worked on making the code reliable, because it wasn’t always working in “headless” mode. There was an undocumented dependency problem in the tor-browser-selenium library I was using: I needed to install the Firefox browser to use the library reliably. I don’t think the issue is about having Firefox itself installed; I suspect it is about a component that the Firefox package installs. It took a long time to figure this out, and I will raise this issue in the library’s GitHub repository to investigate it further with the maintainers.
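For context, driving Tor Browser with tor-browser-selenium looks roughly like this (a minimal sketch; the bundle path is a placeholder):

from tbselenium.tbdriver import TorBrowserDriver

# Placeholder path to an extracted Tor Browser Bundle
TBB_PATH = "/path/to/tor-browser_en-US"

# Open Tor Browser, load a page through Tor, and print its title
with TorBrowserDriver(TBB_PATH) as driver:
    driver.load_url("https://check.torproject.org", wait_for_page_body=True)
    print(driver.title)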

I also added the functionality to add new headers to the requests and to save the HTTP response headers. The original selenium library doesn’t offer this functionality, so I needed to find another way to interact with the headers. I ended up using selenium-wire, an extension of the original selenium library, which allowed me to interact with the HTTP headers.
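Here is a small sketch of the kind of thing selenium-wire enables; the header name and value are placeholders, and this uses the library’s request interceptor mechanism, so my exact code may differ:

from seleniumwire import webdriver

def interceptor(request):
    # Add a custom header to every outgoing request (placeholder values)
    request.headers["X-Example"] = "captcha-monitor"

driver = webdriver.Firefox()
driver.request_interceptor = interceptor
driver.get("https://example.com")

# Inspect the captured response headers
for request in driver.requests:
    if request.response:
        print(request.url, dict(request.response.headers))
driver.quit()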

So, I spent this week trying to get a very basic working version of the project, and I did it! Now I will extend it and complete it during the coding phase. I think it has been a great community bonding period, which helped me get feedback from the community and prepare for the coding phase.

Coding Phase - Week 5

This week I spent my time getting the first version of the system up and running. I deployed a continuously running instance of my code to my server, and I connected the database to the dashboard. I also worked on adding a few meaningful graphs to the dashboard.

I communicated with my mentors to make sure that I am on track and to get feedback on the dashboard. Based on the feedback I received, I will update the dashboard and the way I collect data with my code.

Meanwhile, I integrated Tor Stem into the system, and now I can specify a Tor exit node for testing purposes. I also merged the code that I have been restructuring into master, and I updated the README file to reflect the changes. Now, I’m working on integrating the Cloudflare API, and I plan to finish implementing it this weekend.
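For illustration, pinning measurements to a specific exit relay with Stem looks roughly like this (the fingerprint and port below are placeholders):

import stem.process

# Launch a Tor process that only exits through the given relay
# (the fingerprint here is a placeholder, not a real relay)
tor_process = stem.process.launch_tor_with_config(
    config={
        "SocksPort": "9250",
        "ExitNodes": "$0000000000000000000000000000000000000000",
    }
)
# ... perform the measurement through socks5://127.0.0.1:9250 ...
tor_process.kill()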

As you may have noticed, I love flowcharts & diagrams, and I made another one to explain the current state of my code :slight_smile: Actually, the code doesn’t ask for these details “step by step”; I enter all of them at once at the beginning. That being said, I believe breaking the process down into smaller steps helps us humans better understand what is going on.

Coding Phase - Week 6

For the first time, I encountered problems with the speed of my code, and I’m glad that it happened, because it pushed me to learn how to make it run faster. As a part of my project, I need to perform daily measurements on the Tor exit relays, and there are many of them. I repeat the exact same measurement over and over again and compare the results. At this scale, every extra second in an individual measurement adds roughly 25 minutes to the overall execution time (25 minutes is about 1,500 seconds, so that corresponds to roughly 1,500 individual measurements per pass).

When I first started, the total execution time was well over 100 hours, and there are only 24 hours in a day. This week I worked on implementing a worker pool to run many operations in parallel. The worker pool helped me reduce the total execution time significantly (down to 40 hours), but that is still not enough. Later, I started looking at similar projects like exitmap to see how they handle measurements. This was helpful as well, and I applied what I learned from these projects, but I still need to reduce the total execution time a lot.

The biggest bottleneck is the web browser itself. Currently, every individual measurement takes ~10 seconds, and ~7 seconds of that is the web browser starting up. I hope to cut the individual measurement time down to ~5 seconds. If that is not possible, I will try to find ways to run more workers in parallel more efficiently.

Coding Phase - Week 7

This week I spent my time parallelizing CAPTCHA Monitor using processes on the host machine. Previously I was using Docker swarm to replicate instances of the code, but it turned out to be slow and memory hungry. Instead, I used Python’s multiprocessing library to replicate the workers. I needed to make a few changes to the architecture to make this happen: I separated the code that manages Tor and Tor Browser from the main program loop. Now, the main program loop creates instances of that code in separate processes and makes sure that they keep running (see the sketch below). Using the updated code, I started collecting data once more. Every day I collect data for a different metric.
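A minimal sketch of this worker layout (the worker count and job list are placeholders, and the real fetch logic is omitted):

import multiprocessing

def worker(job_queue):
    # Each worker would manage its own Tor + Tor Browser instance
    while True:
        job = job_queue.get()
        if job is None:  # sentinel value: shut down
            break
        print(f"measuring {job}")  # placeholder for the real fetch

if __name__ == "__main__":
    job_queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(job_queue,))
               for _ in range(8)]  # placeholder worker count
    for w in workers:
        w.start()
    for url in ["https://example.com"] * 20:  # placeholder job list
        job_queue.put(url)
    for _ in workers:  # one sentinel per worker
        job_queue.put(None)
    for w in workers:
        w.join()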

The next step was to display the collected data in a dashboard. You might remember that I already mentioned a dashboard and posted a screenshot of it. Actually, that was the second dashboard solution I tried. In the very beginning, I tried using Grafana. It is a really neat open-source dashboard solution, and it has well-designed layout options. These are all great features, but Grafana is geared towards time series data, like the temperature of a CPU or the RAM usage of a computer, so its data sources and backend are designed for that kind of data. It also doesn’t provide much flexibility for data manipulation: Grafana wants to display what the database query returns directly on the dashboard. Unfortunately, I needed more flexibility in the way I process data, and I sometimes needed to combine multiple queries. Still, I used Grafana for a while to see if I was wrong, and I wasn’t.

I did further research and found Metabase, another open-source dashboard solution. As opposed to Grafana, Metabase had all the flexibility I needed in the backend to process data before showing it on the dashboard. I really liked using Metabase, but it had a lot of flaws on the frontend. For example, some of the graphs were clipped for no reason, and there was no option to fix that. It was also consuming a lot of memory on my VPS, and I thought I could spend that memory on data collection rather than on the dashboard for no solid reason.

So, I ended up building my own dashboard using Node.js, Bootstrap, Chart.js, and Express.js:


I used what I learned during my weeks of dashboard searching to create something simple and elegant. I used Node.js & Express.js on the backend to create an API, and Bootstrap & Chart.js on the frontend for displaying data. The cool thing is that I can process the data the way I want on the backend and send it to the dashboard through the API. If I don’t like anything about the frontend, I can just change it! Sure, I could make changes to the other open-source dashboard solutions as well, but I would need to go through an unnecessary number of steps to achieve that. Also, I can now use the same backend API for other purposes: I was already planning to have an API for third parties to fetch data from the system, and there I have it!

Finally, I spent some time moving my project to the Tor Project’s new GitLab server. Previously, the code, issue tracker, and wiki page were all in different locations. Now, they are all unified in the same place. GitLab also has a lot of extra productivity tools, and I can’t wait to use them. Here is the new home for my code: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor

Coding Phase - Week 8

This week started with an unexpected issue: the CAPTCHA rates I was getting were very high compared to what Tor Browser users experience in real life. After investigating, I realized that the seleniumwire library I used to capture HTTP headers was causing this issue. Interestingly, this was the case only with Tor; I wasn’t getting high CAPTCHA rates when I used seleniumwire over my regular internet connection. Clearly, using seleniumwire and Tor together triggers something on the Cloudflare side. I think they might be detecting the increased latency or the changed TLS fingerprint.

Anyway, I stopped using seleniumwire because it was affecting the results negatively. Instead, I started using the HTTP-Header-Live addon to capture the headers. The addon starts automatically with the browser and captures the headers inside the browser, without touching the traffic itself. Once the page has completely loaded, the addon writes the headers to a text file in JSON format. Later, my code reads this file and saves the results. It is not the most elegant way to solve this problem, but I had to use this method since the elegant one (seleniumwire) caused problems.
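The file-reading side boils down to something like this (the export path and the polling logic are assumptions for illustration):

import json
import os
import time

# Placeholder path where the addon is assumed to write its JSON export
EXPORT_PATH = "/tmp/http_header_live_export.json"

# Wait until the addon has written the file, then read the headers
while not os.path.exists(EXPORT_PATH):
    time.sleep(0.5)

with open(EXPORT_PATH) as f:
    captured_headers = json.load(f)
print(captured_headers)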

Here is a sample of the code I used to connect Tor Browser to the Tor network via seleniumwire. Feel free to do further testing if this issue sounds interesting to you.
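A simplified sketch of such a setup, using seleniumwire’s standard Firefox driver rather than the full Tor Browser wiring, and assuming Tor’s SOCKS port is listening on 9050:

from seleniumwire import webdriver

# Route all of the driver's traffic through the local Tor SOCKS port
options = {
    "proxy": {
        "http": "socks5://127.0.0.1:9050",
        "https": "socks5://127.0.0.1:9050",
    }
}

driver = webdriver.Firefox(seleniumwire_options=options)
driver.get("https://check.torproject.org")

for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)
driver.quit()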

After solving this unexpected problem, I worked on adding support for older versions of the browsers. Now, the -b or --browser_version flag can be used to provide the exact browser version. The code doesn’t automatically download that version of the browser yet, but that could be a nice future addition.

I also realized that Cloudflare injects code that wasn’t part of the original page. For example, here is the original code:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Hello world!</title>
    </head>
    <body>
        Hello world!
    </body>
</html>

Here is the version Cloudflare serves:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Hello world!</title>
    </head>
    <body>
        Hello world!
        <script defer="" src="https://static.cloudflareinsights.com/beacon.min.js" data-cf-beacon="{&quot;rayId&quot;:&quot;5a974a483cf0b6cc&quot;,&quot;version&quot;:&quot;2020.5.1&quot;,&quot;si&quot;:10}"></script>
    </body>
</html>

So, I decided to detect these kinds of changes as well by hashing the page. Now, the system automatically takes the MD5 hash of the page contents and compares it with the hash of the original. If there is a change, it saves that change as well.
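In Python, the check boils down to something like this (the page contents below are made up for illustration):

import hashlib

def page_hash(html):
    # Hash the page contents so that any modification changes the digest
    return hashlib.md5(html.encode("utf-8")).hexdigest()

# Made-up contents for illustration
original_html = "<html><body>Hello world!</body></html>"
served_html = original_html.replace("</body>", "<script></script></body>")

if page_hash(served_html) != page_hash(original_html):
    print("the served page differs from the original")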

Additionally, I created a new section called ‘Measurement Search’ for showing the individual measurements that feed the graphs. It also enables users to perform custom queries on the data using the search box:


Coding Phase - Week 9

This week I worked on solving the memory leak problem. I found the root cause and stopped the leak. I was using a timeout function while fetching the pages. It turns out the timeout value I used was shorter than it should have been, and the timeout function wasn’t sending the right signals to properly kill the browser instances. So, I increased the timeout and added the right calls to shut down the browser instances properly. This solved the memory leak issue.
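The pattern looks roughly like this (a sketch; the timeout value is a placeholder, and driver stands for whatever browser instance is in use):

import signal

class FetchTimeout(Exception):
    pass

def _on_timeout(signum, frame):
    raise FetchTimeout("page fetch took too long")

def fetch_with_timeout(driver, url, seconds=60):
    # Arm a generous alarm, and always shut the browser down afterwards
    signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(seconds)
    try:
        driver.get(url)
    except FetchTimeout:
        pass  # record the failure instead of leaking a browser process
    finally:
        signal.alarm(0)  # cancel the alarm
        driver.quit()    # properly terminate the browser instance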

Next, I worked on the algorithm that decides which test to run for each exit relay. The algorithm compiles a list of measurements and checks whether a given relay has completed all of them. If the measurements are not complete, the algorithm assigns one of the uncompleted measurements to the exit relay; if the relay has completed all measurements, the algorithm refreshes the oldest one. I plan to take this algorithm one step further by adding priorities to the measurements, so that the more important measurements are performed more frequently than others.
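In Python terms, the decision looks like this (a sketch of the logic, not the actual implementation):

def next_measurement(all_measurements, completed):
    # completed maps a measurement name to the timestamp of its last run
    remaining = [m for m in all_measurements if m not in completed]
    if remaining:
        # assign one of the uncompleted measurements first
        return remaining[0]
    # everything is done: refresh the oldest measurement
    return min(completed, key=completed.get)

# Made-up usage: "chromium" has never been measured, so it runs next
print(next_measurement(["tor_browser", "firefox", "chromium"],
                       {"tor_browser": 1594000000, "firefox": 1594100000}))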

Finally, I worked on annotating the data with CAPTCHA Monitor’s version numbers. The main problem was that I didn’t have properly defined versions, so I first needed to define the versions using the merge requests I had made. After that, I added the code that attaches the version information to the results.

With the completion of this week, I have completed a full month of coding, and it has been great so far. I managed to stick to the timeline I set and released a fully working version of the system. I haven’t encountered a lot of CAPTCHAs with my system so far. Over the next weeks, I will be working on expanding the modules to track other metrics and on testing more websites for CAPTCHAs.

Coding Phase - Week 10

This week I spent a lot of time working on the dashboard, adding new graphs and features. That said, I needed to add a few new features to the backend to support the updates in the dashboard. First, I used the freely available GeoLite2 database to get the physical locations (country and continent) of the exit relays, and I recorded that information in the database.
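Looking up a relay’s location with the geoip2 reader goes roughly like this (the database path and the IP address are placeholders):

import geoip2.database

# Open a local copy of the free GeoLite2 country database
with geoip2.database.Reader("/path/to/GeoLite2-Country.mmdb") as reader:
    response = reader.country("203.0.113.1")  # placeholder relay address
    print(response.country.iso_code, response.continent.code)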

Next, I used the geolocation information to create graphs showing how the exit relays that get CAPTCHAs are distributed around the world, as seen below. An overwhelming majority of them are in the US and Germany, but this was expected, since most of the exit relays are also located in these two countries.

Later, I worked on creating the “Relay Search” section shown below. It aims to show the measurements, CAPTCHA rates, and other statistics for a given exit relay.

Next, I added a “JavaScript Required” warning, shown below, which pops up if JavaScript is disabled in the web browser. I use a few JavaScript libraries to create the graphs, and JavaScript is required to see them. That said, I host all of the JavaScript code on my own server, so no third-party server interaction is needed while visiting the dashboard.

Finally, I created an onion service for the dashboard and added the meta tag that advertises the dashboard’s onion address. Now, the dashboard can also be reached through http://5yalu72ryu4xu457kmcze5kxb4on6xh2vkom35jnu4s3respg7hsguqd.onion/

Coding Phase - Week 11

This week I worked on tasks that I hadn’t planned or expected to work on. I first finished the “Relay Search” section by adding the code for calculating the CAPTCHA probabilities. After this, I wanted to work on the “Experiment Search” section of the dashboard, but I realized that I had forgotten to put the Chromium fetchers in production. So, I spent some time updating the old Chromium fetcher to make it compatible with the HTTP-Header-Live based system, and I put it in production. My mentor, Roger, also suggested that I add a new fetcher based on Brave browser’s “Tor Tabs”. I figured out how to make it happen and spent time experimenting with it.

Additionally, I found more URLs to track by simply going through the Alexa Top 500 list. I identified a lot of websites that show CAPTCHAs to Tor users and put these sites into the system. Meanwhile, I realized that the SQLite database I was using couldn’t handle the increased load: it was frequently locking itself because of the concurrent connections. Thus, I decided to switch to a more robust solution, PostgreSQL, and I worked on adapting the parts of the code that interface with the database.
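For illustration, writing a result through psycopg2 looks roughly like this (the connection details and table schema are placeholders, not the project’s actual schema):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(host="localhost", dbname="captcha_monitor",
                        user="monitor", password="changeme")

# Concurrent writers like this are what kept locking up SQLite
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO results (url, is_captcha_found) VALUES (%s, %s)",
        ("https://example.com", True),
    )
conn.close()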

Finally, I presented my project at the Tor Metrics team meeting. I got good feedback, and the Metrics team liked the project. I hope and plan to integrate my project into Tor’s metrics dashboard to reach a broader audience.

Coding Phase - Week 12

Until this point, the “API” was a small part of the dashboard code, and it wasn’t a real API. This week I spent time creating a fully-fledged and properly documented API, which is located at api.captcha.wtf

The new API can perform complex filtering on the data at the database level before transmitting anything to users. The old version literally transferred all of the rows in the database in a single call. This worked OK when there was less data in the database. Now, being able to filter results before fetching the data saves a lot of bandwidth and processing power.

Also, the documentation framework I used shows example API calls, and I think it is pretty useful for people who are interested in using the API to learn how it works.

The other significant “achievement” was getting Brave Browser’s “Private Window with Tor” to work with my current system. Last week, I thought I could easily add it to the system, but later I realized that the chromedriver I was using is not capable of sending keyboard shortcuts, which are the only way to open a “Private Window with Tor” in Brave Browser. After thinking about it some more, I decided to send the keyboard shortcut to the Brave Browser process directly instead of going through a high-level tool like chromedriver, and it worked! Sometimes you need to think outside of the box to make things work :slight_smile: The rest of the code wasn’t that difficult after completing this small step.
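For illustration, here is one way to pull this off; I’m not reproducing my exact code, and the tool, shortcut, and timing below are assumptions:

import subprocess
import time

import pyautogui  # one way to send OS-level key events

# Launch Brave, wait for the window to appear, then open a
# "Private Window with Tor" via its Alt+Shift+N keyboard shortcut,
# which chromedriver cannot send on its own
brave = subprocess.Popen(["brave-browser"])
time.sleep(5)  # crude wait; a real implementation would poll for the window
pyautogui.hotkey("alt", "shift", "n")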

Coding Phase - Week 13

This week I released CAPTCHA Monitor v0.2.0, which mostly contains changes from last week. I wrote the changelog and made changes to the README file. This release contains major changes (like the migration to PostgreSQL), so the installation steps needed to be updated as well.

After finalizing this release, I started working on a design document for the dashboard. I contacted a few people in addition to my mentors, and they will give me feedback once I’m done with the document. The document mainly aims at the reproducibility of the graphs, and it makes sure that the graphs I’m designing are correct and make sense. Also, trying to build the dashboard without a solid plan had turned into a nightmare, since there are too many variables attached to a single measurement. So, this design document will help me be more organized and not waste time on graphs that would turn out to be useless.

While working on the design document, I realized that my API didn’t offer a proper way to count the number of rows that a query returns, which is essential for computing statistics for the dashboard. So, I added a meta parameter called count_only to the API calls. When this parameter is set to 1, the API returns only the number of rows that match the requested criteria. This simple but effective change saves a lot of bandwidth and decreases API call durations dramatically in cases where megabytes of HTML data are not needed for the operation.
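A hypothetical call (the endpoint path and the filter field name are made up; count_only is the new parameter):

import requests

# Ask only for the number of matching rows, not the rows themselves
response = requests.get(
    "https://api.captcha.wtf/results",  # placeholder endpoint path
    params={"is_captcha_found": 1, "count_only": 1},
)
print(response.json())  # just a count instead of megabytes of HTML data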

I also worked on switching the Dockerfile to a non-root user and decreasing the Docker image size. By default, the active user in Docker containers is root, which might pose security issues. Thus, I worked on creating a new non-root user and running CAPTCHA Monitor as that user.

Finally, I completed my second GSoC evaluation and finalized the second coding stage. My experience has been amazing so far, and I’m on track with the coding. The dashboard and visualizations turned out to be more problematic than I expected, but I will overcome that with the help of the design document I’m working on.

Coding Phase - Week 14

This week I started by working on the dashboard design document. I expected to finish it quickly, but it turned out to be a more complex task than I imagined. I wrote a very basic draft and asked for feedback from people on IRC. Dennis Jackson was particularly helpful with his feedback. He suggested that I consider the exit probabilities while calculating the CAPTCHA rates. This was a critical suggestion, because larger exit relays with larger exit probabilities have a greater effect on the network, so we care about the larger exit relays more than the smaller ones. That said, I didn’t know how the exit probabilities were calculated, and I needed to read the dir-spec document and lots of source code to figure it out. My lack of knowledge about the inner technical details of Tor was another issue. I had to ask a lot of questions on IRC, and thanks to everyone who replied, I managed to understand these details and finish the dashboard design document. I am still receiving feedback on the content of the document, and I will probably work on it next week to incorporate what I receive.

Meanwhile, I worked on refactoring the dashboard code and came across the Tor Styleguide for building Tor-themed websites. I also worked on importing this style guide into my code and making the dashboard compliant with it.

Coding Phase - Week 15

I started this week by experimenting with D3.js as a replacement for the Chart.js library. I use Chart.js to generate the graphs on the dashboard, but for some reason, the graphs it produced looked very blurry in Tor Browser, which bothered me a lot. D3.js is actually intended for drawing SVG objects driven by data, and some people use it for drawing SVG graphs. D3.js gives the user a lot of freedom, but this freedom comes at a cost: the user needs to code every detail of a graph. Chart.js, on the other hand, is designed only for creating graphs; users just feed it data, and it works. Still, I was really unhappy with the blurry look in Tor Browser, so I tried drawing the graphs with D3.js. It was a horrible experience, because I had to spend time coding the functions that create the graph itself, and all my attempts looked pretty bad. I couldn’t manage to align the elements of the graph properly. I have mild OCD, and that disorganized look drove me crazy.

So, I abandoned the idea of using D3.js and tried fixing the blur issue instead. It turned out to be trivial to fix. I learned that Chart.js uses the device’s pixel density to scale what it draws. As expected, Tor Browser doesn’t leak this information to JavaScript libraries, so Chart.js was assuming the wrong pixel density for my screen (I use a retina screen, where four physical pixels form one virtual pixel). I fixed the issue by simply overriding the default value with a higher one, and voilà!

Later, I finished the dashboard design document itself and coded the functions I described in it. I created a parser for the Tor consensus documents. The class I created uses the parsed consensus information to calculate the exit weight of each exit relay and computes a weighted CAPTCHA rate using those exit weights. Now, I have the backend for generating the data for the graphs. The next step is generating the graphs themselves using Chart.js and the data produced by the backend. I plan to finish that part of the project next week and conclude the GSoC period. That said, I will keep working on the project and improving it further.
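The weighting itself boils down to something like this (a sketch, not the actual class; the numbers are made up):

def weighted_captcha_rate(measurements):
    # measurements: (exit_probability, got_captcha) pairs, one per exit relay
    total = sum(weight for weight, _ in measurements)
    captcha = sum(weight for weight, got in measurements if got)
    return captcha / total if total else 0.0

# Made-up numbers: two of three relays got a CAPTCHA, but the weighting
# counts the high-probability relays more heavily
rate = weighted_captcha_rate([(0.02, True), (0.01, False), (0.005, True)])
print(f"weighted CAPTCHA rate: {rate:.1%}")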

Coding Phase - Week 16

Here we are, the final coding week! I started this week by refactoring the backend code that produces the data feeding the graphs. I also worked on writing docstrings for the undocumented parts of the codebase. Then, on Tuesday, I was hospitalized and had surgery :worried: The surgery was totally unplanned and unfortunately delayed my plans for this week. Afterwards, I was still in pain and decided to do work that required less brainpower while I tried to recover, so I worked on revising the frontend layout I planned to implement.

I’m feeling better now, and I will try to catch up during the weekend. I am planning to keep working on my project after GSoC anyway, and this week’s delay shouldn’t be a problem.

Stay safe, people!

Final Evaluations - Week 17

In the final week of the program, I worked on compiling the final report for my project. I had already contacted my mentors in the previous weeks about what to include in the report and where to place it. We settled on creating a new wiki page in the code repository. I could have used the existing home page, since it already explains the project in detail, but I didn’t want to expand the home page any further and give first-time visitors a long, boring page to read. Instead, I keep using the home page as a list of pointers to my project’s other resources. Later, I checked the final report examples from previous years, which gave me a good sense of what to include in the report. I also looked at the project reports that Tor people wrote for previous sponsor-funded projects. These professional reports gave me a different perspective on what to include in my report. I also drafted my final review on the GSoC page, and I will submit it over the weekend.

I want to thank everyone, starting with my mentors Georg and Roger, for helping me succeed in this project. My mentors tirelessly helped me figure out the pieces of this puzzle, and they taught me more than just technical knowledge. This opportunity exposed me to a “culture” that I had never seen before; simple things like using IRC and email formatting are a part of this “learning experience” as well. I feel genuinely integrated into the Tor community. Next, I want to thank DIAL for acting as the umbrella organization and letting the Tor Project participate in GSoC. During the application period, I was sad when I couldn’t find the Tor Project in the “organizations” list, but later I discovered what DIAL does, and my emotions flipped quickly! Thank you, really :heart:

Finally, I want to thank Google for organizing GSoC. Yes, these are all open-source projects, and everyone is welcome to join, but there is a barrier for newcomers like me. I tried contributing to Project Jupyter in the past, and I wasn’t successful, due to a lack of structure in my contributions. It is especially difficult to create a plan and a structure when you don’t know what you are doing. GSoC, however, pushes people to make proper plans and build structure with the help of mentors. I learned a lot about project management during GSoC. This year I will be working on a capstone project in my final year at university, and I can already see myself using these skills to plan it.

Bonus:
I have been working on creating a new mascot/logo for my project in my free time, and I think it is ready to be shown. It is inspired by the iconic I'm not a robot button in CAPTCHA tests, and it depicts a confused & angry robot that failed the CAPTCHA test.

I have two versions; the green one uses the Tor Project color palette. I can’t decide which one to use :slight_smile:
