
Weeks #05-06

HTTP logs

When I ran the command for the access logs last week, the output didn't look long enough to analyze yet, so I waited a few more days — I now have 465 entries, which feels substantial enough to maybe start seeing patterns.

I pasted the logs into a spreadsheet and started sorting and cleaning the data. The first thing I noticed was that all the logs were from today (I'm writing this on Oct 3rd). It could have been that I only asked for the most recent ones, but I asked for 500 and got 465, so I assumed that was all there was.

Did some digging and found out there are more log files! Apparently my logs are being "rotated", which means they are periodically archived and replaced with new ones. This sounds like good practice...

So I've decided to take a look at those files too. I tried opening the .gz files the same way and I think it cursed my terminal. But then I found out about the zcat command, and since the data didn't seem too messy I just used that and cleaned it up in the spreadsheet.
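
For reference, here's a minimal Python sketch of how the current log plus the rotated .gz archives could be read in one go instead of zcat-ing each file. It assumes the default Nginx location (/var/log/nginx/), which might differ on another setup:

    import glob
    import gzip

    # Collect the current access log plus any rotated .gz archives.
    # Assumes the default Nginx location (/var/log/nginx/); adjust if needed.
    lines = []
    for path in sorted(glob.glob("/var/log/nginx/access.log*")):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as f:
            lines.extend(f.readlines())

    print(len(lines), "log entries collected")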

Link to spreadsheet (NYU login required)

So I'm looking at four days of logs: September 28th & 29th, and October 2nd and 3rd (earliest and latest). At first glance it seems like there is a similar number of entries each day, ranging from 280 to 486, but mostly 400+. Most have a uniform structure, but every now and then there are weird gibberish ones. The structure should be (there's a small parsing sketch after the status code notes below):

  1. IP address of the requester
  2. Identity (usually not used)
  3. Timestamp (UTC!)
  4. Request method and path*
  5. HTTP status code**
  6. Size of response in bytes (that my server sent back)
  7. Referrer (the page the client was on when making the request)
  8. User-agent (the browser making the request — looks like when it's missing that means the request came from an automated program, like a web crawler or bot)

*HTTP methods include:

  • GET: Retrieve data (read-only requests).
  • POST: Send data to create/update resources (submit forms, upload files).
  • PUT: Replace an entire resource.
  • PATCH: Partially update a resource.
  • DELETE: Remove a resource.
  • HEAD: Retrieve metadata about a resource.

**Common status codes include:

  • 1xx codes are Informational Responses (100 is continue, 101 is switching protocols, e.g. from HTTP to WebSockets).
  • 2xx codes are Successful Responses (200 is OK, usually the result of a GET or a completed POST; 201 is OK + something was created, usually the result of a POST or PUT; 202 is accepted but not yet completed; 203 is "Non-Authoritative Information" - the request was successful but the response is coming through a proxy or a cache and not the original server; 204 is success but no content to send back, usually the result of a DELETE).
  • 3xx codes are Redirection Messages (301 is moved permanently and future requests should point there; 302 is temporarily available somewhere else but future requests should still go to the original one; 303 is "see other", telling the client to fetch the result from a different URL, usually after a POST; 304 is "not modified" so the client can use a cached version; 307 is a temporary redirect).
  • 4xx codes are Client Error Responses (400 is bad request; 401 is unauthorized (needs authentication / login); 403 is forbidden; 404 is the beloved "not found"; 405 is method not allowed; 408 is request timed out; 409 is conflict, usually seen with versioning; 429 is too many requests - rate limiting).
  • 5xx codes are Server Error Responses (500 is internal server error; 501 is not implemented - the server can't support the functionality required; 502 is bad gateway — as we've seen in class — an invalid response came from an upstream server; 503 is service unavailable — overloaded or undergoing maintenance; 504 is gateway timeout).
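
And here's the parsing sketch mentioned above: a rough Python take on pulling those fields out of a line in this format. The sample entry is made up for illustration, and the truly gibberish lines simply won't match:

    import re

    # Rough regex for Nginx's default "combined" log format; not bulletproof,
    # just enough to pull out the fields listed above from a well-formed line.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
        r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
    )

    # Made-up sample line in the same shape as my logs.
    line = ('143.110.222.166 - - [03/Oct/2024:12:00:00 +0000] '
            '"GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0"')

    match = LOG_PATTERN.match(line)
    if match:
        print(match.group("ip"), match.group("status"), match.group("request"))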

As for my logs — it seems like there's round-the-clock activity; something is happening every few minutes or so. The intervals do not seem to be consistent though. A lot of the logs are kind of "lumped" — they took place at the same time or immediately one after the other.
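
To put numbers on that "lumped" feeling, one could compute the gaps between consecutive timestamps. A small sketch, using made-up timestamps in the log's format:

    from datetime import datetime

    # Timestamps in the log look like "28/Sep/2024:19:18:03 +0000".
    # These values are made up for illustration.
    timestamps = [
        "03/Oct/2024:12:00:01 +0000",
        "03/Oct/2024:12:00:02 +0000",
        "03/Oct/2024:12:07:45 +0000",
    ]

    fmt = "%d/%b/%Y:%H:%M:%S %z"
    parsed = sorted(datetime.strptime(t, fmt) for t in timestamps)

    # Seconds between consecutive requests; "lumped" entries show up as 0s and 1s.
    gaps = [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]
    print(gaps)  # [1.0, 463.0]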

Overall I'm looking at 1585 entries from 366 unique addresses, which means 1219 of the requests were repeats from addresses that had already appeared.

This dense chart shows all IP addresses that requested access to my server multiple times. I've eliminated anything below 3 requests to make it somewhat more legible.
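
The counting itself is simple enough that it could also be done outside the spreadsheet. A sketch, assuming the IP column has already been pulled out of the parsed logs (the values here are just illustrative):

    from collections import Counter

    # ips = the IP column from the parsed logs; these values are illustrative.
    ips = ["143.110.222.166", "154.213.184.15", "143.110.222.166"]

    counts = Counter(ips)
    print(len(counts), "unique addresses out of", len(ips), "entries")

    # Same cutoff as the chart: only addresses with 3+ requests.
    repeat_offenders = {ip: n for ip, n in counts.items() if n >= 3}
    print(sorted(repeat_offenders.items(), key=lambda kv: kv[1], reverse=True))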

The winner, IP address 143.110.222.166, which requested access 88 times over these 4 days, points to... Digital Ocean! Which is where my server is hosted — so I'm assuming this probably means they are just monitoring all addresses hosted on their servers, but... how can I know it's not someone else who also has a server through Digital Ocean? I used conditional formatting to paint all cells with this IP address the same color so I can scan through the sheet more easily. All requests from this address used the GET HTTP method (just getting information), and the "1.1" at the end of all entries apparently just means HTTP version 1.1.
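
One rough way to check is a reverse DNS lookup on the address, though that only tells you whose network the address sits on: a server rented from Digital Ocean would still come back looking like Digital Ocean. A sketch:

    import socket

    # Reverse DNS: who does this address resolve back to?
    # This shows the hosting provider's name, not whoever rents the machine.
    try:
        hostname, _, _ = socket.gethostbyaddr("143.110.222.166")
        print(hostname)
    except socket.herror:
        print("no reverse DNS record")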

Noteworthy side quest: I just wanted to quickly look at one entry that looked a bit suspicious:

154.213.184.15 - - [28/Sep/2024:19:18:03 +0000] "POST /cgi-bin/.%%%%32%%65/.%%%%32%%65/.%%%%32%%65/.%%%%32%%65/.%%%%32%%65/bin/sh HTTP/1.1" 400 166 "-" "-"

It only occurred 4 times, but it used the POST method and the path was /cgi-bin/..., which really stood out. The status code was 400 = Bad Request — my server rejected the request — and the response was quite small, only 166 bytes, meaning it could possibly just be an error message. The IP address leads to a seemingly random location in the Netherlands, and the registered organization is "AS51396 Pfcloud UG", which is just a domain registrar / server hosting service, so I guess that's a dead end. But I did try to look into what the weird path of the POST request means, and apparently "This is an attempted path traversal and command injection attack targeting the /cgi-bin/ directory, where CGI scripts are often placed".

From what I gather (through some Googling and asking ChatGPT), a Path Traversal Attack is where an attacker tries to access files or directories outside the web root directory by manipulating the file path in a request. They can use sequences like "../", which means "go one directory up", to sort of "escape" the intended directory and access files elsewhere on the server. A Command Injection Attack is when an attacker is able to run arbitrary commands on the server by exploiting vulnerabilities in the way user input is handled. The "%" followed by numbers apparently represents encoded characters — indicating an attempt to bypass security filters or exploit vulnerabilities to access hidden files. The path ends with "/bin/sh", which is a shell executable — possibly suggesting an attempt to run a shell command on the server. For example, if an application takes user input as a file name and passes it to the shell to open it ("cat userinput.txt"), the attacker might modify the input to "userinput.txt; rm -rf /", which could delete all files on the server. Yikes!
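
To see the trick in action: the widely documented form of this encoding is ".%%32%65", where "%32" is the character "2" and "%65" is "e", so one round of decoding gives ".%2e" and a second round gives ".." (the entry in my log has a few extra % signs, but the idea is the same). A quick check in Python:

    from urllib.parse import unquote

    segment = ".%%32%65"        # doubly encoded ".."
    once = unquote(segment)     # ".%2e"
    twice = unquote(once)       # ".."
    print(once, twice)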

I asked ChatGPT some more niche questions, and it sounds like this request was blocked by Nginx rather than by my firewall. It said it's most likely Nginx's default behavior for handling invalid requests with unusual encoding such as this. It said UFW would mainly help by restricting access to certain ports, but wouldn't directly cause the 400 response code from Nginx.

I also learned that these kinds of attacks are pretty common, but they vary in sophistication and are generally considered basic to moderately advanced. Path traversal attacks usually rely on poor input validation, while command injection attacks usually target old applications that pass user input directly into system commands (this is not commonplace any more) and are considered more dangerous, because a successful one could give the attacker full control over the server.

I did look at other GET and POST requests targeting CGI (Common Gateway Interface) scripts — these are attempts to communicate directly with a script on my server. I didn't see any other potentially malicious attempts there, though I did realize some of my data is only partial and somehow got lost along the way (this is fixable, but it might be time to leave this rabbit hole).

So back to the other logs:

The second most frequently occurring IP address (64 times) leads to the Netherlands as well, to a company called Amarutu Technology LTD, which in turn points to another company that performs cybersecurity operations. The third leads to a similar company in Warsaw, and the fourth is a cloud hosting provider in Berlin whose first Google results are absolutely terrible reviews.

I continued by skimming through the data and looking for points of interest. One that stood out is pictured below:

This IP address requested access 44 times, all in the course of 13 seconds. A bit much? 

There were 3 POST requests (note that they came first too) and the rest were GET. The last POST asked for this: hello.world?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp: and received a 404 (not found) in response. The "hello world" part is what caught my eye, as well as the fact that there were so many different requests for different things — not that I would know what's legit and what isn't, but I can see this is different from most entries and it feels kind of brute-forced, so I want to look into it (also, "hello world", "test", "tests", "admin", "backup", "testing", "src", "eval", "cms", "blog" etc. sure make this look suspicious).

I will spare you the details, but this was another attempted attack — namely a Remote File Inclusion attack — trying to exploit PHP vulnerabilities by turning on "allow_url_include" (which could let the attacker include external PHP files that would execute code on my server) and using "auto_prepend_file", which would execute PHP code embedded in the request body. It seems like the attack failed because I don't have a file called "hello.world" on my server, which is kind of silly, isn't it? But all the other GET requests suggest an automated attack just trying to access my server through known vulnerabilities / misconfigurations. All the requests relate to either PHP frameworks or common paths for website management.
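
Since the patterns repeat, a crude keyword filter over the request lines is probably enough to surface most of these probes. A sketch, with an illustrative keyword list drawn from the entries above:

    # Illustrative keyword list, drawn from the probe-like entries I saw.
    SUSPICIOUS = ["cgi-bin", "/bin/sh", "allow_url_include", "auto_prepend_file",
                  "eval", "admin", "backup", "cms"]

    def looks_like_probe(request_line: str) -> bool:
        """True if the request line contains any of the probe keywords."""
        lowered = request_line.lower()
        return any(term in lowered for term in SUSPICIOUS)

    print(looks_like_probe('POST /hello.world?%ADd+allow_url_include%3d1 HTTP/1.1'))  # True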

Upon closer inspection I realized there were so many of these! Different IPs but all the same patterns, the same methods, the same lame generalized brute-force requests trying to exploit known vulnerabilities. I guess they sometimes work, if there are this many of them.

For now I haven't looked into the user-agent part of the logs yet, but I remember this particular detail wasn't very useful or reliable when I was accessing it in one of my projects last year. Most browsers and devices seemed to be kind of "hiding" behind the same Mozilla / like-Gecko definition. So it really wasn't as exciting as one would think...

I had one entry from an IP address that belonged to Palo Alto Networks (a major cybersecurity company) and they left a message, which was nice? — "Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com".

It made me wonder about the rights we have on the internet — like, we have a virtual server running, and we have the right to request that it not be scanned — i.e. visited? (do we?) but it's already been scanned by that time? Can the whole web just be scanned like this? What does it mean to opt out at this point? What's the significance of being included in the scans or not (besides search engines' crawlers)? And what about the "dark web"? Is it a completely different infrastructure?

Last class I briefly mentioned how the brain only uses 20 watts:
"The human brain is an amazingly energy-efficient device. In computing terms, it can perform the equivalent of an exaflop — a billion-billion (1 followed by 18 zeros) mathematical operations per second — with just 20 watts of power. In comparison, one of the most powerful supercomputers in the world, the Oak Ridge Frontier, has recently demonstrated exaflop computing. But it needs a million times more power — 20 megawatts — to pull off this feat."

A couple of sources that mention this fact and seem trustworthy: