I got into self-hosting this year when I wanted to run my own website on an old recycled ThinkPad. I spent a lot of time learning about ufw, reverse proxies, security-header hardening, and fail2ban.

Despite all that, I still had a problem with bots knocking on my ports and spamming my logs. I tried some hackery to get fail2ban to read Caddy's logs, but that didn't work for me. I nearly gave up and went with Cloudflare like half the internet does, but my stubbornness about open-source self-hosting, plus this year's Cloudflare outages, encouraged me to try alternatives.

Coinciding with that, I'd been seeing this thing more and more in the places I frequent, like Codeberg. This is Anubis, a proxy-style firewall that forces the browser to complete a proof-of-work challenge (plus some other clever checks) before letting a client through, which stops bots from knocking. I got interested and started thinking about beefing up my security.

I’m here to tell you to try it if you have a public-facing site and want to break away from Cloudflare. It was VERY easy to install and configure with a Caddyfile on Debian under systemd. Within an hour it had filtered multiple bots, and so far the knocks seem to have slowed down.

https://anubis.techaro.lol/

My bot-spam woes have been seriously mitigated, if not completely eradicated. I’m very happy with tonight’s little security-upgrade project; it took no more than an hour of my time to install and read through the documentation. Current chain: Caddy reverse proxy -> Anubis -> services.
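Roughly, the Caddyfile side of that chain looks like this. This is a minimal sketch, not my exact config: the domain and ports are placeholders, and Anubis's own listen/target addresses are set in its environment file per the docs, so adjust to your install:

```caddyfile
example.com {
	# Caddy terminates TLS and hands every request to Anubis,
	# which is assumed here to be listening on localhost:8923.
	reverse_proxy localhost:8923
}
```

Anubis itself is then configured (via BIND and TARGET in its environment file) to listen on that port and forward clients that pass the check on to the actual service, e.g. something on localhost:3000.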

A good place to start for the install is here:

https://anubis.techaro.lol/docs/admin/native-install/

  • Arghblarg@lemmy.ca · 12 hours ago

    I have a script that watches Apache or Caddy logs for poison-link hits and a set of bot user agents, adding the offending IPs to an ipset blacklist and blocking them with iptables. I should polish it up for others to try. My list of unique IPs is well over 10k in just a few days.
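    Something along those lines can be sketched in a few lines of shell. To be clear, this is a hypothetical reconstruction, not my actual script; the user-agent list and log path are placeholders:

```shell
#!/bin/sh
# Hypothetical sketch: pull client IPs whose user agent matches a bot
# pattern out of an Apache-style combined-format access log, ready to
# feed into an ipset blacklist that iptables drops.

BOT_PATTERN='GPTBot|Bytespider|PetalBot|Amazonbot'   # example UAs, extend as needed

# Print unique client IPs (field 1) from lines whose UA matches the pattern.
extract_bot_ips() {
    grep -E "$BOT_PATTERN" "$1" | awk '{ print $1 }' | sort -u
}

# One-time firewall setup (needs root), then feed IPs in periodically:
#   ipset create blacklist hash:ip
#   iptables -I INPUT -m set --match-set blacklist src -j DROP
#   extract_bot_ips /var/log/apache2/access.log | while read -r ip; do
#       ipset add blacklist "$ip" -exist
#   done
```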

    git repos seem to be real bait for these damn AI scrapers.

    • JustTesting@lemmy.hogru.ch · 4 hours ago

      This is the way. I also have rules for hits to URLs that should never be hit without a referer, with some threshold to account for a user hitting F5, plus a whitelist of real users (ones that got a 200 on a login endpoint). The Huawei and Tencent crawlers mostly have fake user agents and no referer. Another thing crawlers don’t do is caching: a real user would never download the same .js file hundreds of times in an hour, since all their devices’ browsers would have cached it. There are quite a lot of these kinds of patterns that can be used to block bots. It just takes watching the logs a bit to spot them.
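      The repeated-download pattern in particular is easy to check with a little awk. A minimal sketch, assuming a combined-format access log where field 1 is the client IP and field 7 is the request path; the file name and threshold below are placeholders:

```shell
#!/bin/sh
# Hypothetical sketch: a browser caches a static asset, so an IP that
# re-fetches the same file many times is likely a crawler.
repeat_fetchers() {
    log="$1"; path="$2"; max="$3"
    awk -v p="$path" -v max="$max" \
        '$7 == p { n[$1]++ } END { for (ip in n) if (n[ip] > max) print ip }' \
        "$log"
}

# Usage: repeat_fetchers access.log /static/app.js 100
```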

      Then there’s rate limiting and banning IPs that hit the rate limit regularly. Use nginx as a reverse proxy, set rate limits for the URLs where it makes sense (with some burst allowance), and ban IPs that got rate-limited more than x times in the past y hours, based on the rate-limit message in nginx’s error.log. It might need some fine-tuning to get the thresholds right, but it can catch some very spammy bots. It doesn’t help with those that crawl from hundreds of IPs but only use each IP once an hour, though.
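      A minimal nginx sketch of that rate-limit setup; the zone name, rate, burst, and upstream port are placeholder values to tune for your own traffic:

```nginx
# In the http {} block: one rate-limit state bucket per client IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;

    location /search/ {
        # Allow short bursts; over-limit requests get a 503.
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://127.0.0.1:3000;
    }
}
```

Rejected requests produce a "limiting requests" line in error.log, which is what a cron job or fail2ban filter can count per IP to decide on bans.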

      Ban based on bot user agents, for those that set one. Sure, in theory robots.txt is the way to deal with well-behaved crawlers, but if it’s your homelab and you just don’t want any crawlers, you might as well block them in the firewall the first time you see them.
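      As a lighter-weight alternative to a firewall drop, the same idea can be sketched at the proxy layer with an nginx map; the user-agent names here are just examples:

```nginx
# In the http {} block: classify user agents (example bot names only).
map $http_user_agent $is_bot {
    default                                   0;
    ~*(GPTBot|Bytespider|PetalBot|Amazonbot)  1;
}

server {
    listen 80;

    # Flagged user agents get a flat 403 before reaching the backend.
    if ($is_bot) {
        return 403;
    }

    # ... the rest of the site config
}
```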

      Download abuse-IP lists nightly and ban those; that’s around 60k abusive IPs gone. At that point you probably need to use nftables directly, rather than iptables or ufw, for its sets, since 60k individual rules would be a bad idea.
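      A sketch of what that looks like as an nftables ruleset; the table and set names are placeholders, and in practice the elements line would be generated nightly from the downloaded list:

```nft
table inet filter {
    set abuse_ips {
        type ipv4_addr
        flags interval
        # Generated from the downloaded abuse list:
        elements = { 198.51.100.7, 203.0.113.0/24 }
    }

    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @abuse_ips drop
    }
}
```

Loaded with `nft -f abuse.nft`. A set lookup stays fast no matter how many elements it holds, unlike 60k linearly-evaluated iptables rules.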

      There are lists of all datacenter IP ranges out there that you could block as well, though that’s a pretty nuclear option, so better make sure the traffic you want is whitelisted. E.g. for Lemmy, you can fetch a list of the IPs of all other instances nightly, so you don’t accidentally block them. Lemmy traffic is very spammy…

      There’s so much that can be done with fail2ban and a bit of scripting and filter-writing.
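      As one example of such a filter, a hypothetical fail2ban filter that bans on bot user agents in an access log (modeled on the stock apache-badbots filter; the filter name, log path, UA list, and times are all placeholders):

```ini
# /etc/fail2ban/filter.d/botua.local
[Definition]
failregex = ^<HOST> -.*"(?:GET|POST|HEAD).*HTTP.*" .* "[^"]*(?:GPTBot|Bytespider|PetalBot)[^"]*"$

# /etc/fail2ban/jail.d/botua.local
[botua]
enabled  = true
port     = http,https
filter   = botua
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400
```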

        • JustTesting@lemmy.hogru.ch · 22 minutes ago

          You mean the referer part? Of course you don’t want it on all URLs, and there are some legitimate cases. I have it on specific URLs where a missing referer is highly unlikely, not on every URL, e.g. a direct link to a single comment in Lemmy. Plus whitelisting for logged-in users, plus a threshold, like more than 3 hits an hour, before a ban.

          It’s a pretty consistent bot pattern: they go to some sub-sub-page with no referer and no prior traffic from that IP, then no other traffic from that IP for a while (since they cycle through IPs on each request), but you get a ton of these requests across all the IPs they use. It was one of the most common patterns I saw when I followed the logs for a while.