I got into the self-hosting scene this year when I wanted to run my own website on an old recycled ThinkPad. I spent a lot of time learning about ufw, reverse proxies, security header hardening, and fail2ban.

Despite all that, I still had a problem with bots knocking on my ports and spamming my logs. I tried some hackery to get fail2ban to read Caddy's logs, but that didn't work for me. I nearly gave up and went with Cloudflare like half the internet does, but my stubbornness about open-source self-hosting, plus the recent Cloudflare outages this year, encouraged me to try alternatives.

Around the same time, I kept seeing this thing pop up in places I frequent, like Codeberg: Anubis, a proxy-style firewall that forces the browser client to complete a proof-of-work check, along with some other clever tricks to stop bots from knocking. I got interested and started thinking about beefing up my security.

I'm here to tell you to try it if you have a public-facing site and want to break away from Cloudflare. It was VERY easy to install and configure with a Caddyfile on a Debian system managed by systemctl. Within an hour it had filtered multiple bots, and so far the knocks seem to have slowed down.

https://anubis.techaro.lol/

My bot-spam woes seem to be seriously mitigated, if not completely eradicated. I'm very happy with tonight's little security upgrade project, which took no more than an hour of installing and reading through documentation. The current chain is: Caddy reverse proxy -> Anubis -> services.
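
For anyone wondering what that chain looks like concretely, here's a minimal sketch rather than my exact config; the domain, ports, and env file path are illustrative, and the BIND/TARGET variables come from the native install guide, so check it for the current names and defaults.

    # Caddyfile: Caddy terminates TLS and hands everything to Anubis
    example.com {
        reverse_proxy localhost:8923   # Anubis's listen address (illustrative port)
    }

    # /etc/anubis/website.env (read by the systemd unit from the native install guide)
    # BIND is where Anubis listens for traffic from Caddy;
    # TARGET is the actual service Anubis sits in front of
    BIND=:8923
    TARGET=http://localhost:3000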

A good place to start for the install is here:

https://anubis.techaro.lol/docs/admin/native-install/

    • url@feddit.fr · 21 minutes ago

      Did I forget to mention that it doesn't work without JS, which I keep disabled?

  • quick_snail@feddit.nl · 18 minutes ago

    Kinda sucks how it makes websites inaccessible to folks who have to disable JavaScript for security.

  • A_norny_mousse@feddit.org · 33 minutes ago

    At the time of commenting, this post is 8h old. I read all the top comments, many of them critical of Anubis.

    I run a small website and don’t have problems with bots. Of course I know what a DDOS is - maybe that’s the only use case where something like Anubis would help, instead of the strictly server-side solution I deploy?

    I use CrowdSec (it seems to work with caddy btw). It took a little setting up, but it does the job.
    (I think it’s quite similar to fail2ban in what it does, plus community-updated blocklists)
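
    Roughly, the setup looked something like this; the collection and bouncer names are from memory, so double-check them against the CrowdSec hub before copying:

    # agent from CrowdSec's own Debian repo, a scenario collection for the web server,
    # and a firewall bouncer that turns CrowdSec decisions into actual blocks
    sudo apt install crowdsec
    sudo cscli collections install crowdsecurity/caddy
    sudo apt install crowdsec-firewall-bouncer-iptables
    sudo cscli decisions list    # see who is currently banned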

    Am I missing something here? Why wouldn’t that be enough? Why do I need to heckle my visitors?

    Despite all that, I still had a problem with bots knocking on my ports and spamming my logs.

    By the time Anubis gets to work, the knocking has already happened, so I don't really understand this argument.

    If the system is set up to reject a certain type of request, these are microsecond transactions that do no harm (DDoS excepted).

    • SmokeyDope@piefed.social (OP) · 4 hours ago

      If CrowdSec works for you that's great, but it's also a corporate product whose premium subscription tier starts at $900/month, so it's not exactly a pure self-hosted solution.

      I'm not a hypernerd; I'm still figuring all this out among the myriad of possible solutions with different complexity and setup times. All the self-hosters in my internet circle started adopting Anubis, so I wanted to try it. Anubis was relatively plug-and-play, with prebuilt packages and great install documentation.

      Allow me to expand on the problem I was having. It wasn't just that I was getting a knock or two; I was getting 40 knocks every few seconds, scraping every page and probing for a bunch of paths that don't exist on my site but would be exploit points on unsecured production VPS systems.

      On a computational level, the constant network activity of scrapers downloading web pages, zip files, and images pollutes traffic. Anubis stops this by trapping them on a landing page that transmits very little information from the server side. Since a bot will hammer that Anubis page 40 times over a single open connection before it gives up, this cuts down the overall network activity and data transferred (which is often metered and billed), as well as the log noise.

      And this isn't all or nothing. You don't have to pester all your visitors, only those with sketchy clients. Anubis uses a weighted scoring system that grades how legitimate a browser client looks. Most regular connections get through without triggering anything; weird connections get various grades of checks depending on how sketchy they are. Some checks don't require proof of work or JavaScript.

      On a psychological level, it gives me a bit of relief knowing that the bots are getting properly sinkholed and that I'm wasting the compute of some asshole trying to find exploits in my system to expand their botnet. And a bit of pride knowing I did this myself, on my own hardware, without having to cop out to a corporate product.

      It's nice that people of different skill levels and philosophies have options to work with, and one tool can often complement another. Anubis did what I wanted: it keeps bots from wasting my network bandwidth and gives me peace of mind where before I had no protection, all while staying unnoticeable for most people, because I can configure it not to heckle every client every 5 minutes like some sites do.

      • A_norny_mousse@feddit.org · 25 minutes ago

        If CrowdSec works for you that's great, but it's also a corporate product

        It’s also fully FLOSS with dozens of contributors (not to speak of the community-driven blocklists). If they make money with it, great.

        not exactly a pure self-hosted solution.

        Why? I host it, I run it. It’s even in Debian Stable repos, but I choose their own more up-to-date ones.

        All the self-hosters in my internet circle started adopting Anubis, so I wanted to try it. Anubis was relatively plug-and-play, with prebuilt packages

        Yeah…

        Allow me to expand on the problem I was having. It wasn't just that I was getting a knock or two; I was getting 40 knocks every few seconds, scraping every page and probing for a bunch of paths that don't exist on my site but would be exploit points on unsecured production VPS systems.

        1. Again, a properly set up WAF will deal with this pronto
        2. You should not have exploit points in unsecured production systems, full stop.

        On a computational level, the constant network activity of scrapers downloading web pages, zip files, and images pollutes traffic. Anubis stops this by trapping them on a landing page that transmits very little information from the server side.

        1. And instead you leave the computations to your clients. Which becomes a problem on slow hardware.
        2. Again, with a properly set up WAF there’s no “traffic pollution” or “downloading of zip files”.

        Anubis uses a weighted scoring system that grades how legitimate a browser client looks.

        And apart from the user agent and a few other responses, all of which are easily spoofed, this means "do some JavaScript stuff on the local client" (there's a link to an article here somewhere that explains this well), which is much less trivial than you make it sound.

        Also, I use one of those less-than-legit, weird and non-regular browsers, and I am being punished by tools like this.

  • 0_o7@lemmy.dbzer0.com · 8 hours ago

    I don’t mind Anubis but the challenge page shouldn’t really load an image. It’s wasting extra bandwidth for nothing.

    Just parse the challenge and move on.

      • Voroxpete@sh.itjust.works · 43 minutes ago

        It’s actually a brilliant monetization model. If you want to use it as is, it’s free, even for large corporate clients.

        If you want to get rid of the puppygirls though, that’s when you have to pay.

    • Kilgore Trout@feddit.it · 7 hours ago

      It’s a palette of 10 colours. I would guess it uses an indexed colorspace, reducing the size to a minimum.
      edit: 28 KB on disk

      • CameronDev@programming.dev · 6 hours ago

      An HTTP GET request is a few hundred bytes. The response is 28 KB; that's roughly a 280x amplification. If a large botnet wanted to denial-of-service an Anubis-protected site, requesting that image could be enough.

      Ideally, Anubis should serve as little data as possible until the PoW is completed. Caching the PoW algorithm (and the image) on a CDN would also mitigate the issue.

        • teolan@lemmy.world · 5 hours ago

          The whole point of Anubis is to avoid having to go through a CDN to withstand scraping botnets.

          • CameronDev@programming.dev · 3 hours ago

            I dunno if that's true; nothing in the docs indicates that it's explicitly anti-CDN. And using a CDN for a static JavaScript resource and an image isn't the same as running the entire site through a CDN proxy.

  • sudo@programming.dev · 11 hours ago

    I've repeatedly stated this before: Proof-of-Work bot management is only Proof-of-JavaScript bot management. It is nothing for a headless browser to bypass. Proof of JavaScript does work and will stop the vast majority of bot traffic; that's how Anubis actually works. You don't need to punish actual users by abusing their CPU. PoW is a far higher cost on your actual users than on the bots.

    Last I checked, Anubis has a JavaScript-less strategy called "Meta Refresh". It first serves you a blank HTML page with a <meta> tag instructing the browser to refresh and load the real page. I highly advise using the Meta Refresh strategy; it should be the default.
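
    For reference, the mechanism is just the standard HTML refresh directive, something like the line below; the exact markup Anubis emits and how it validates the follow-up request are details I'm not certain of:

    <!-- tells the browser to reload the page after 1 second, no JavaScript required -->
    <meta http-equiv="refresh" content="1">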

    I'm glad someone is finally making an open-source and self-hostable bot-management solution. And I don't give a shit about the cat-girls, nor should you. But Techaro admitted they had little idea what they were doing when they started and went for the "nuclear option". Fuck Proof of Work. It was a dead-on-arrival idea decades ago. Techaro should strip it from Anubis.

    I haven't caught up with what's new in Anubis, but if they want stricter bot management, they should check for actual graphics acceleration.

    • rtxn@lemmy.world · 32 minutes ago

      PoW is a far higher cost on your actual users than on the bots.

      That sentence tells me that you either don’t understand or consciously ignore the purpose of Anubis. It’s not to punish the scrapers, or to block access to the website’s content. It is to reduce the load on the web server when it is flooded by scraper requests. Bots running headless Chrome can easily solve the challenge, but every second a client is working on the challenge is a second that the web server doesn’t have to waste CPU cycles on serving clankers.

      POW is an inconvenience to users. The flood of scrapers is an existential threat to independent websites. And there is a simple fact that you conveniently ignored: it fucking works.

    • SmokeyDope@piefed.social (OP) · 10 hours ago

      Something that hasn't been mentioned much in discussions about Anubis is that it has a graded tier system for how sketchy a client is, changing the kind of challenge it serves based on a weighted scoring system.

      The default bot policies it ships with pass squeaky-clean regular clients straight through, slightly weighted clients/IPs get the meta-refresh challenge, and it's only at the moderate-suspicion level that the JavaScript Proof of Work kicks in. The weight triggers for these levels, the challenge action, and how long a client's validation lasts are all configurable.
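
      Roughly, the tiering in the policy file looks something like this. I'm paraphrasing from memory of the thresholds docs, so the field names and numbers are illustrative; check the current schema before copying anything:

      # map a client's accumulated suspicion weight to an action (values illustrative)
      thresholds:
        - name: minimal-suspicion        # squeaky-clean clients pass straight through
          expression: weight <= 0
          action: ALLOW
        - name: mild-suspicion           # slightly weird clients get the no-JS meta refresh
          expression:
            all:
              - weight > 0
              - weight < 10
          action: CHALLENGE
          challenge:
            algorithm: metarefresh
            difficulty: 1
        - name: moderate-suspicion       # only here does the JavaScript proof of work kick in
          expression: weight >= 10
          action: CHALLENGE
          challenge:
            algorithm: fast
            difficulty: 2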

      It seems to me that the sites that heavy-hand the proof of work for every client, with validity that only lasts 5 minutes, are the ones giving Anubis a bad rap. The default bot policy settings Anubis ships with don't trigger PoW on the regular Firefox Android clients I've tried, including hardened IronFox, while other sites show the finger-wag on every connection no matter what.

      It's understandable why some choose strict policies, but they give the impression that this is the only way it should be done, which is overkill. I'm glad there are config options to mitigate the impact on the normal user experience.

    • ___qwertz___@feddit.org · 5 hours ago

      Funnily enough, PoW was a hot topic in academia around the late 90s / early 2000s, and it's somewhat clear that the author of Anubis has not read much of that discussion.

      There was a paper called "Proof of work does not work" (or similar; I can't be bothered to look it up) arguing that PoW cannot work for spam protection, because you have to support low-powered consumer devices while also blocking spammers with heavy hardware, and that is a very valid concern. Then there was a paper arguing that PoW can still work, as long as you scale the difficulty so that a legit user (e.g. sending only one email) gets a low difficulty, while a spammer (sending thousands of emails) gets a high difficulty.

      The idea of blocking known bad actors is actually used in email quite a lot, in the form of DNS block lists (DNSBLs) such as Spamhaus (this has nothing to do with PoW, but such a distributed list could be used to determine PoW difficulty).

      Anubis, on the other hand, does nothing like that, and a bot developed to pass Anubis would do so trivially.

      Sorry for the long text.

      • Flipper@feddit.org · 4 hours ago

        At least in the beginning, the scrapers just used curl with a different user agent. Forcing them to use a headless browser is already a 100x increase in resources for them. That in itself is a small victory, and so far it is working beautifully.

  • non_burglar@lemmy.world · 14 hours ago

    Anubis is an elegant solution to the AI bot scraper issue; I just wish the solution to everything wasn't spending compute everywhere. In a world where we need to rethink our energy consumption and generation, even on clients, this is a stupid use of computing power.

    • Leon@pawb.social · 13 hours ago

      It also doesn't function without JavaScript. If you're security- or privacy-conscious, chances are not zero that you have JS disabled, in which case this presents a roadblock.

      On the flip side of things, if you are a creator and you’d prefer to not make use of JS (there’s dozens of us) then forcing people to go through a JS “security check” feels kind of shit. The alternative is to just take the hammering, and that feels just as bad.

      No hate on Anubis. Quite the opposite, really. It just sucks that we need it.

      • SmokeyDope@piefed.social (OP) · 13 hours ago

        There's a challenge option that doesn't require JavaScript. The responsibility lies on site owners to configure it properly, IMO, though you can make the argument that it's not the default, I guess.

        https://anubis.techaro.lol/docs/admin/configuration/challenges/metarefresh

        From the docs on the Meta Refresh method:

        Meta Refresh (No JavaScript)

        The metarefresh challenge sends a browser a much simpler challenge that makes it refresh the page after a set period of time. This enables clients to pass challenges without executing JavaScript.

        To use it in your Anubis configuration:

        # Generic catchall rule
        - name: generic-browser
          user_agent_regex: >-
            Mozilla|Opera
          action: CHALLENGE
          challenge:
            difficulty: 1 # Number of seconds to wait before refreshing the page
            algorithm: metarefresh # Specify a non-JS challenge method
        

        This is not enabled by default while this method is tested and its false positive rate is ascertained. Many modern scrapers use headless Google Chrome, so this will have a much higher false positive rate.

        • z3rOR0ne@lemmy.ml · 9 hours ago

          Yeah, I actually use the NoScript extension, and I refuse to whitelist sites unless I'm very certain I trust them.

          I run into Anubis checks all the time, and while I appreciate the software, having to keep temporarily whitelisting these sites does get cumbersome. I hope they make this no-JS implementation the default soon.

      • cecilkorik@piefed.ca · 13 hours ago

        if you are a creator and you’d prefer to not make use of JS (there’s dozens of us) then forcing people to go through a JS “security check” feels kind of shit. The alternative is to just take the hammering, and that feels just as bad.

        I'm with you here. I come from an older time on the Internet. I'm not much of a creator, but I do have websites, and unlike many self-hosters I think that, in the spirit of the internet, they should be open to the public as a matter of principle, not cowering away for my own private use behind some encrypted VPN. I want them to be shared. Sometimes that means taking a hammering. It's fine. It's nothing that's going to end the world if it goes down or goes away, and I try not to make a habit of being so irritating that anyone would have much legitimate reason to target me.

        I don't like any of these sorts of protections that put the burden onto legitimate users. I get that this is the reality we live in, but I reject that reality and substitute my own. I understand that some people need to block that sort of traffic in order to limit and justify the very real costs of providing services for free on the Internet, and Anubis does its job for that. But I'm not one of those people. It has yet to cost me a cent above what I've already decided to pay, and until it does, I have the freedom to adhere to my principles on this.

        To paraphrase another great movie: why should any legitimate user be inconvenienced when the bots are the ones who suck? I refuse to punish the wrong party.

      • Nate Cox@programming.dev · 13 hours ago

        I feel comfortable hating on Anubis for this. The compute cost per validation is vanishingly small to someone with the existing budget to run a cloud scraping farm; it's just another cost of doing business.

        The cost to actual users, though, particularly lower-income users who may not have compute power to spare, is annoyingly large. There are plenty of complaints out there about Anubis being painfully slow on old or underpowered devices.

        Some of us do actually prefer to use the internet minus JS, too.

        Plus the minor irritation of having anime catgirls suddenly be a part of my daily browsing.

    • cadekat@pawb.social · 13 hours ago

      Scarcity is what powers this type of challenge: you have to prove you spent a certain amount of electricity in exchange for access to the site, and because electricity isn’t free, this imposes a dollar cost on bots.

      You could skip the detour through hashes/electricity and do something with a proof-of-stake cryptocurrency, and just pay for access. The site owner actually gets compensated instead of burning dead dinosaurs.

      Obviously there are practical roadblocks to this today that a JavaScript proof-of-work challenge doesn’t face, but longer term…

      • Nate Cox@programming.dev · 13 hours ago

        The cost here only really impacts regular users, too. The type of users you actually want to block have budgets which easily allow for the compute needed anyways.

        • chicken@lemmy.dbzer0.com · 11 hours ago

          I think maybe they wouldn't if they're trying to scale their operation to scanning millions of sites and your site is just one of them.

          • cadekat@pawb.social · 11 hours ago

            Yeah, exactly. A regular user isn’t going to notice an extra few cents on their electricity bill (boiling water costs more), but a data centre certainly will when you scale up.

  • Arghblarg@lemmy.ca · 11 hours ago

    I have a script that watches Apache or Caddy logs for poison-link hits and a set of bot user agents, adds the offending IPs to an ipset blacklist, and blocks them with iptables (rough idea sketched below). I should polish it up for others to try. My list of unique IPs is well over 10k after just a few days.

    git repos seem to be real bait for these damn AI scrapers.
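
    Not the actual script, but the rough idea looks like this; the log path, poison URL, and set name are placeholders:

    # one-time setup: a hash set for offenders, dropped early in the INPUT chain
    ipset create scraper_blacklist hash:ip timeout 86400   # entries expire after a day
    iptables -I INPUT -m set --match-set scraper_blacklist src -j DROP

    # tail the access log (Apache-style format; Caddy's JSON logs would need jq instead)
    # and ban any client that requests the poison link
    tail -F /var/log/apache2/access.log \
      | grep --line-buffered '/poison-trap-url' \
      | awk '{ print $1 }' \
      | while read -r ip; do
          ipset add scraper_blacklist "$ip" -exist
        done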

    • JustTesting@lemmy.hogru.ch · 4 hours ago

      This is the way. I also have rules for hits to URLs that should never be requested without a referer, with some threshold to account for a user hitting F5, plus a whitelist of real users (ones that got a 200 on a login endpoint). The Huawei and Tencent crawlers mostly have fake user agents and no referer. Another thing crawlers don't do is caching: a real user would never download the same .js file hundreds of times in an hour, since their browsers would have cached it. There are quite a lot of these kinds of patterns that can be used to block bots; it just takes watching the logs a bit to spot them.

      Then there's rate limiting and banning IPs that hit the rate limit regularly. Use nginx as a reverse proxy, set rate limits for URLs where it makes sense (with some burst allowance), and ban IPs that get rate-limited more than x times in the past y hours, based on the rate-limit message in the nginx error.log. It might need some fine-tuning to get the thresholds right, but it can catch some very spammy bots. It doesn't help with crawlers that rotate through hundreds of IPs and only use each one once an hour, though.
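
      A minimal sketch of that nginx side; the zone name, rate, burst, and path are placeholders to tune:

      # http {} block: 10 requests/second per client IP, state kept in a 10 MB zone
      limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

      # server {} block: apply it to the spammy locations, allowing short bursts
      location /git/ {
          limit_req zone=perip burst=20 nodelay;
          limit_req_status 429;             # rejections also show up in error.log
          proxy_pass http://127.0.0.1:3000;
      }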

      Ban based on bot user agents, for those that set one. Sure, theoretically robots.txt is the way to deal with well-behaved crawlers, but if it's your homelab and you just don't want any crawlers, you might as well block them in the firewall the first time you see them.

      Downloading abuse IP lists nightly and banning those takes care of around 60k abusive IPs. At that point you probably need to use nftables sets directly rather than iptables or ufw, since having 60k individual rules would be a bad idea.
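
      Something along these lines, with the table/set names and list location as placeholders:

      # one-time: a dedicated table with an input hook and an interval set for CIDR ranges
      nft add table inet blocklist
      nft add chain inet blocklist input '{ type filter hook input priority -10 ; policy accept ; }'
      nft add set inet blocklist abuse_ips '{ type ipv4_addr ; flags interval ; }'
      nft add rule inet blocklist input ip saddr @abuse_ips drop

      # nightly (cron): reload the set from the downloaded list, one IP/CIDR per line;
      # for really big lists it's faster to generate an "add element" file and load it with nft -f
      nft flush set inet blocklist abuse_ips
      while read -r net; do
        nft add element inet blocklist abuse_ips "{ $net }"
      done < /var/lib/blocklists/abuse-ips.txt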

      There are also lists of all datacenter IP ranges out there that you could block, though that's a pretty nuclear option, so make sure the traffic you want is whitelisted. E.g. for Lemmy, you can fetch a list of the IPs of all other instances nightly so you don't accidentally block them. Lemmy traffic is very spammy…

      There's so much that can be done with fail2ban and a bit of scripting and filter writing.
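
      For example, pairing fail2ban with the nginx rate limit above might look roughly like this (fail2ban ships a stock nginx-limit-req filter if I remember right, so check filter.d first; paths and thresholds are placeholders):

      # /etc/fail2ban/filter.d/nginx-req-limit.conf
      [Definition]
      failregex = limiting requests, excess:.* by zone.*client: <HOST>

      # /etc/fail2ban/jail.d/nginx-req-limit.conf
      [nginx-req-limit]
      enabled  = true
      filter   = nginx-req-limit
      logpath  = /var/log/nginx/error.log
      # "more than x times in the past y hours"
      findtime = 3600
      maxretry = 10
      bantime  = 86400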

    • merc@sh.itjust.works · 9 hours ago

      The front page of the website is excellent. It describes what the software does and covers its feature set in quick, simple terms.

      I can't tell you how many times I've gone to a website for some open-source software and had no idea what it was or what it was trying to do. They often dive deep into the 300 different ways of installing it, tell you what the current version is and what features it has over the last version, but they just assume you know the basics.

    • Cyberflunk@lemmy.world · 12 hours ago

      Thank you! This needed to be said.

      • This post is a bit critical of a small well-intentioned project, so I felt obliged to email the maintainer to discuss it before posting it online. I didn't hear back.

      I used to watch the dev on Mastodon; they seemed pretty radicalized about killing AI and anyone who uses it (kidding!!). I'm not even surprised you didn't hear back.

      Great take on the software. As far as I can tell, Playwright still works and completes the unit of work; at scale Anubis still seems to work if you have popular content, but it hasn't stopped me from using Claude Code + virtual browsers.

      I'm not actively testing it, though. I'm probably very wrong about a few things, but I know Anubis isn't hindering my personal scraping. It does fuck up the Perplexity and ChatGPT bots, which is fun to see.

      Good luck, blue team!