Suricata ruleset to prevent AI scraping?

I have an issue with regular scraping attempts against my public git forge causing CPU spikes on my server.

They are clearly malicious: they ignore robots.txt and come from randomly changing, globally distributed IPs, which makes normal IP-based blocking very labor-intensive.

I was wondering if anyone knows of a collaborative effort to blocklist these scrapers via a Suricata ruleset or something similar?
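For the minority of bots that do announce themselves, I imagine such a ruleset would contain entries roughly like this (just a sketch with a placeholder sid; "GPTBot" is one real AI-crawler User-Agent, and of course Suricata can only see the User-Agent on traffic it can actually inspect, i.e. plain HTTP or decrypted TLS):

alert http any any -> $HOME_NET any (msg:"Known AI crawler User-Agent (GPTBot)"; flow:to_server,established; http.user_agent; content:"GPTBot"; nocase; classtype:policy-violation; sid:1000001; rev:1;)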

Thanks.

We used to see that a lot, so we decided to implement some throttling in Apache for certain clients.
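For reference, mod_evasive is one way to rate-limit abusive clients in Apache (it returns 403 once a client exceeds the configured request rate). A sketch only; the directive names are real, but the numbers are purely illustrative:

<IfModule mod_evasive24.c>
    DOSHashTableSize    3097
    DOSPageCount        10
    DOSPageInterval     1
    DOSSiteCount        100
    DOSSiteInterval     1
    DOSBlockingPeriod   60
</IfModule>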

You might also want to check out the IP blocklists, as they have a few scanners listed. However, this probably does not fall into any of the usual categories.
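If you do end up maintaining your own list of offending networks, an ipset keeps that manageable on a plain Linux box without piling up individual firewall rules. A sketch (203.0.113.0/24 is a documentation range standing in for a real offender):

ipset create scrapers hash:net
ipset add scrapers 203.0.113.0/24
iptables -I INPUT -m set --match-set scrapers src -j DROP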


I think a Suricata ruleset might be too taxing on your resources.

What type of scraping did you detect? Is it a particular bot or scraper you can identify by User-Agent, a service like Kimono or Scrapinghub, a Selenium/PhantomJS screen scraper, or a simple shell script using wget or curl?
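If you are not sure, pulling the User-Agent column out of the web server's access log usually gives a quick picture. For example (assuming the common "combined" log format, where the agent is the sixth quote-delimited field; adjust the path for your setup):

awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20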

Would blocking IP access from AWS, Google Cloud, or other VPS providers help?

I know there is a Shodan scanner blocklist already included in the IPFire IP Address Blocklist.


My separate firewall mostly idles and is already running Suricata with other rulesets, so I doubt performance would be an issue.

It’s a mixed set of suspicious IPs, mostly from Chinese-owned datacenters around the world. They mostly do not use common identifiers, so simple rulesets based on those will not work.

Other git forge hosters like codeberg.org seem to have resorted to an ever-growing, manually curated IP blocklist.


https://blog.uberspace.de/2024/08/bad-robots/

Another data point (in German).


Isn’t that a Python web UI application?

So there is no Apache or Nginx, and no .htaccess, so it gets indexed.

The only sites that are ignored are PHP-based sites. That is why they have to provide a sitemap.xml file, etc.

But just so you know, most robots ignore robots.txt and use the HTML meta tag instead. Google uses both, but it will index the site and only read robots.txt last if the HTML meta tag is not set. robots.txt only works with a few crawlers, but all legitimate ones obey the HTML meta tag.
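For reference, the two mechanisms look like this (GPTBot is just one example of an AI-crawler token):

# robots.txt - crawl directive per user agent
User-agent: GPTBot
Disallow: /

<!-- per-page HTML meta tag -->
<meta name="robots" content="noindex, nofollow">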

What are you referring to?

Git forge.
But I wonder if the traffic is from others running the git forge app, executing programs from the git forge as Platform-as-a-Service (PaaS) serverless computing.

After all, the git forge server runs these PaaS instances.

Git forge is a generic term for web services like GitLab and Gitea/Forgejo.

So are you using gitweb, self-hosted with Apache or Nginx?

And do you want a region block, or to just block all VPS providers, which will also block anyone who is using a VPN or running a web host on their own internet connection?

No, I am hosting a public Forgejo instance that is regularly hit by what appear to be AI bot scrapers. This seems to be a common issue these days for many such code forges, like Codeberg.org and others.

I think it would be a great idea if there were a shared Suricata ruleset to filter such scraping attempts out at the firewall level, as these bots do not respect robots.txt etc. and often rotate through IP addresses from globally distributed data centers to circumvent basic IP or geo-blocking.
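Since most of these bots do not identify themselves, I imagine such a shared ruleset would have to combine known AI-crawler User-Agent matches with simple rate-based rules, roughly like this sketch (placeholder sid and threshold values; for actual blocking it would be a drop rule in IPS mode, and again this only works on traffic Suricata can inspect):

alert http any any -> $HOME_NET any (msg:"Possible scraper - high HTTP request rate from one source"; flow:to_server,established; threshold: type both, track by_src, count 300, seconds 60; classtype:attempted-recon; sid:1000002; rev:1;)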

Quite a few VPNs operate this way, and it could all be from the same person: since the software doesn’t establish a session, anything from a few requests up to every single HTML response can come from a different IP address.

Looking at this, it looks like a Rust/C++ implementation, which raises the question of whether it scales and multithreads well. It would have to be a powerful server (or even a cluster) to deal with this if the client load is high. I’m not a C++ coder, so I can’t tell you whether they properly sanitise input to prevent cross-site scripting and server-side code execution. I hear it’s the thing for small businesses to host one, but they are more resource-hungry than just setting up a secure FTP server or a web server with zip or tar files to download.

The difference is that every GitLab client request executes a copy of GitLab, compared to an Apache server running in the background and serving all clients. GitHub is actually larger scale because they are running a full web server environment.

There are other types of servers that work like GitLab but are written in Erlang on purpose, because they will deliberately crash the client instance if someone tries an attack, especially with buffer-overflow methods; the attacker then has to close their browser to disconnect from the crashed instance, so automated attack scripts are impossible to deploy on those platforms. I’m really surprised they didn’t use that language for GitLab, since the style of code they chose is very similar to Erlang and could easily integrate website front ends with Elixir and Phoenix.

The only thing I will stress is that all web methods above should always use TLS/SSL encryption for security reasons.