AI went nuts on my website and generated a $155 excessive bandwidth bill

limelight79@lemmy.world · 29 days ago

AI went nuts on my website and generated a $155 excessive bandwidth bill

artyom@piefed.social · edit-2 29 days ago

What you are experiencing is the unfortunate reality of hosting any kind of site on the open internet in the AI era. You can’t do it without implementing some sort of bot detection and rate limiting or your site will either be DDOS’d or you’ll incurr insane fees from your provider.

The bot suggested things like “robots.txt”,

You can do that but they will ignore it.

I’ll re-enable the site in a few days after the dust settles.

They’ll just attack again.

It’s clearly a bug.

It’s not a bug. This is very common practice these days.

My provider recommended implementing Cloudflare, which initially irritated me, until I realized there was a free tier.

Please consider Anubis instead.

gladflag@lemmy.ml · 29 days ago

TBH it feels like a bug if they’re redownloading the same images again and again.

DreamButt@lemmy.world · 29 days ago

Assuming A) honest intentions and B) they give a fuck

OpenAi isn’t exactly known for either

leds@feddit.dk · 29 days ago

I’m wondering, are they intentionally trying to kill the open web? Make small websites give up and then AI has monopoly on useful information?

jollyrogue@lemmy.ml · 27 days ago

Yes. This is it.

One of the great things about Web3 and AI, for corps, is forcing decentralized systems into centralized platforms, limiting hosting access to people who have money, and limiting competition to companies which have the capital to invest in mitigations, or the money to pay for exceptions.

sexhaver87@sh.itjust.works · edit-2 28 days ago

Their intentions remain unclear, however given their CEO’s desire for unchecked mass-scale absolute power, I’d bet on this!

e: all this is in addition to the data they collect via their web crawling, the bugs resulting in this behavior and its effects are either happy accidents or intentional malware, right now depending on your distaste for the company. Ultimately none of this is set in stone until the psychotic criminals at OpenAI get audited or jailed.

artyom@piefed.social · 29 days ago

I would agree except they do the same thing to thousands (millions?) of sites across the web every day. Google will scrape your site as well but they manage to do it on a way that doesn’t absolutely destroy it.

limelight79@lemmy.world · 29 days ago

Yeah exactly. I want people to be able to find the info, that’s the whole point. Legitimate search engines, even Bing, are fine.

DaPorkchop_@lemmy.ml · 28 days ago

I beg to differ, a few months ago my site was getting absolutely hammered by GoogleBot with hundreds of requests per second, faster than my server could keep up with - to the point that the entire apache daemon kept locking up.

wonderingwanderer@sopuli.xyz · 29 days ago

Sounds like a class action lawsuit

artyom@piefed.social · edit-2 29 days ago

Good luck, they are firmly in the pocket of the federal govt at this point. They’re allowed to do whatever they want because our entire economy hinges on allowing them to do so.

AsgerFD@programming.dev · edit-2 29 days ago

There’s also Iocaine ~~as an alternative to Anubis~~. I’ve not tried Anubis nor Iocaine myself though.

db0@lemmy.dbzer0.com · 29 days ago

Iocaine is not an alternative to anubis. It’s a different tool in the toolbox and can be used along with it, but it has a different purpose. Something like haphash is an anubis alternative

NotMyOldRedditName@lemmy.world · edit-2 29 days ago

Even if not using cloudflare / others for bot protection, setting up your images and other static content to be served from a CDN can help.

You can set that up with cloudflare and others as well.

CallMeAnAI@lemmy.world · edit-2 29 days ago

Removed by mod

artyom@piefed.social · 29 days ago

It is not at all 30 years old. Search engines never DDOS’d your site to death. Only these shitty AI scrapers do that. That’s why everyone wants their site scraped by Google while spending inordinate amounts of time and money to block AI scraper bots.

CallMeAnAI@lemmy.world · 29 days ago

Removed by mod

artyom@piefed.social · 29 days ago

Oh, honey…

elbarto777@lemmy.world · edit-2 11 days ago

deleted by creator

dependencyinjection@discuss.tchncs.de · 29 days ago

You keep saying this but that’s just not the case. Robots.txt would be respected by older bots and they wouldn’t DDOS your site either.

artyom@piefed.social · 29 days ago

I said it once. And “older bots” are not the problem. AI crawlers are the problem. Robots.txt was originally created with the primary intent of keeping websites off of search engines. Server load wasn’t even really an issue.

CallMeAnAI@lemmy.world · 29 days ago

Removed by mod

dependencyinjection@discuss.tchncs.de · 29 days ago

Read the room bro.

halcyoncmdr@lemmy.world · edit-2 29 days ago

File in small claims court for the fees and time if they refuse or don’t respond. OpenAI isn’t going to bother sending a representative for such a small amount.

hansolo@lemmy.today · 29 days ago

OP, please consider this. It’s likely to actually work.

tate@lemmy.sdf.org · 29 days ago

Once you win a small claim it is up to you to collect. They will never manage to collect.

Nollij@sopuli.xyz · 29 days ago

Something to remember is that small claims is very cheap, and accessible for the average person. It’s something like $35 filing, and they can’t even send their lawyers. You need to do some research and bring all sorts of documentation to support your claims, but it’s not meant to be intimidating.

Once you win, you can enlist the police to help you enforce the judgment. See what Warren and Maureen Nyerges did to Bank of America in 2011.

Yes, you will probably need additional judgments to enforce the original one that they will ignore, but you can keep getting attorneys fees added to the total.

toiletobserver@lemmy.world · 29 days ago

When they collect via sheriff enforcement, do something funny like seizing any and all Ethernet terminations for bulk resale.

halcyoncmdr@lemmy.world · 29 days ago

halcyoncmdr@lemmy.world · 29 days ago

You just go back to the court showing they’re not paying the court mandated restitution.

Yes it takes time, yes it will probably cost more in time alone than the $155 issue that started it. But you can get increased penalties awarded for failure to pay.

Small claims courts really don’t like big businesses ignoring the little man.

limelight79@lemmy.world · 28 days ago

Hopefully they just pay the bill or at least negotiate something. Any sign of good faith would be welcome.

I doubt they will, and maybe I will file in small claims if they don’t.

UpperBroccoli@lemmy.blahaj.zone · 29 days ago

I have experienced something similar. I run a small forum for a computer games series, a series I myself have not been interested in a long time. I am just running it because the community has no other place to go, and they seem to really enjoy it.

A few months ago, I received word from them that the forum barely responded anymore. I checked it out and noticed there were several hundred active connections at any time, something we have never seen before. After checking the whois info on the IPs, I realized they were all connected to meta, google, apple, microsoft and other AI companies.

It felt like a coordinated DDoS attack and certainly had almost the same effect. Now, I have a hosting contract where I pay a flat monthly fee for a complete server and any traffic going through it, so it was not a problem financially speaking, but those AI bots made the server almost unusable. Naturally, I went ahead and blocked all the crawler IPs that I could find, and that relieved the pressure a lot, but I still keep finding new ones.

Fuck all of those companies, fuck the lot of them. All they do is rob and steal and plunder, and leave charred ruins. And for what? Fan fiction. Unbelievable.

Dave.@aussie.zone · 29 days ago

Maybe it’s time to implement an AI tarpit. Each response for a request from a particular IP address or range takes double the time of the previous, with something like a 30 second cool down window before response time halves.

Would stop AI scrapers in their tracks, but it wouldn’t hurt normal users too much.

Maybe I should start looking into it a bit more 🤔

limelight79@lemmy.world · 29 days ago

Apparently my phpbb forum served as a nice tar pit. The only thing I can figure is that they neglected to take session IDs into account, so they assumed every url was a different page.

gothic_lemons@lemmy.world · 29 days ago

Not an expert or anything but could a script be made that feeds a bot an endless steam of unique tinyurls that points to images openai pays to host?

edit-2 29 days ago

Could you run a script that presents the AI bots with alternative believable but incorrect text based information? That would be a great way to fight back.

You could even implement an AI to rewrite your content with intentional errors so you don’t have to generate the misinformation yourself. Sounds like a great use for AI.

Cypher@lemmy.world · 29 days ago

Nepenthes already does a better job of this than what you’re proposing and doesn’t require AI.

https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/

29 days ago

Nice

indomara@lemmy.world · 29 days ago

Thank you for continuing that site. We donate every month for a similar forum that we never use. I love that these places still exist, and appreciate those who help run them.

4am@lemmy.zip · 29 days ago

Send an invoice to OpenAI for abusing your EULA and demand payment. Report them to all three credit bureaus when they don’t. Encourage others to do the same.

IphtashuFitz@lemmy.world · 29 days ago

Hell. I’d look into taking them to small claims court if they don’t pay the invoice. If that became common practice then OpenAI may actually do something about it.

cally [he/they]@pawb.social · 29 days ago

Consider looking into Anubis or Iocaine. I have heard about them and apparently they’re pretty helpful, though I don’t self-host my own website so take this with a grain of salt.

Bortcorns4Jeezus@lemmy.world · 29 days ago

+1 for Iocaine

pedroapero@lemmy.ml · edit-2 28 days ago

There’s a similar Caddy module too: https://github.com/jasonlovesdoggo/caddy-defender
Beware, locaine will actually worsen the issue by generating infinite bandwidth traffic.

[deleted]@piefed.world · 29 days ago

This isn’t a bug, this is how AI is designed to work and it is absolutely terrible foe the web. If it was actually designed well it would use robots.txt (it doesn’t care) and cache common query results but instead it sends out fresh queries and pulls down data over and over again just in case something changed.

It is malicious and should be treated as such, but it isn’t.

adamth0@lemmy.world · 29 days ago

Where robots.txt has failed for me in the past, I have added dummy paths to it (and other similar paths hidden in html or in JS variables) which, upon being visited, cause the offending IP to be blocked.
Eg, I’ll add a /blockmeplease/ reference in robots.txt, and when anything visits that path, its IP, User-Agent, etc get recorded and it gets its IP blocked automatically.

perviouslyiner@lemmy.world · edit-2 28 days ago

https://www.jwz.org/robots.txt is a useful template that results in fail2ban “running full tilt”, but apparently the AI companies will slurp that shut up enough to cause more technical problems - that site seems to have some practical advice on handling this sort of thing without giving control to cloudflare

limelight79@lemmy.world · 28 days ago

Nice. I love it.

drunkpostdisaster@lemmy.world · 29 days ago

It gets worse every fucking day

FalschgeldFurkan@lemmy.world · 28 days ago

That shit cannot be legal. It’s like DDoS but without getting the target offline… I hope this all works out for you, and that you get OpenAI to pay for it.

(Why are these asshats calling themselves “open” anyways when they are clearly not?)

limelight79@lemmy.world · 28 days ago

They did get the target offline!

Echo Dot@feddit.uk · 29 days ago

The robots.txt scene is a waste of time. I’ve had arguments with people about this for about 10 years now and they still seem to think it’s some god level solution.

By all means make use of it as some callers do pay attention but you can just download a basic boilerplate and use that, there is zero pointing customising it beyond that unless you find that the basic template causes a problem for your specific configuration. Lots of the bots simply ignore it, this has been a problem for years and has only got worse in the AI era.

Cloudflare probably would stop most of the problems of course the other option is to just rate limit the site in general it sounds like you probably don’t need anything particularly complicated since it doesn’t seem like the site is hugely active.

limelight79@lemmy.world · edit-2 28 days ago

I don’t know how often they reread the robots.txt file, so that might be a factor, too. I put the gptbot in it right away, but it kept going. It’s possible that it doesn’t reread robots.txt on a regular basis. Of course we wouldn’t want them to continuously read that file in case it changes, but it should reread it at some point.

Personally I’d be horrified if a program I wrote did this.

Anyone else remember the old Netgear router story, where they hard coded a DNS server at the University of Wisconsin (iirc)? I couldn’t find the story after a quick search, but basically they sent a ton of traffic to the university’s dns server. To their credit, Netgear took responsibility and compensated the university for the traffic and server costs incurred.

Edit: Found the Netgear router story! It’s still on their site, and it’s still a really interesting read. Also, it was NTP, not DNS.

minorkeys@lemmy.world · 29 days ago

Get used to it, there won’t be AI regulation until Trump and MAGA are gone.

MinnesotaGoddam@lemmy.world · 29 days ago

there might be regulation protecting ai

perviouslyiner@lemmy.world · 28 days ago

“A criminal is trying to block these requests from accessing their website!”

Ensign_Crab@lemmy.world · 28 days ago

There won’t be regulation until after the bubble pops, regardless of who is in office.

nublug@piefed.blahaj.zone · edit-2 29 days ago

try out crowdsec, it’s a modern alternative to fail2ban that crowdsources ip blocks from its users with similar setups (optional to contribute to). in the first day i had it set up it had blocked over 50k attempts, mostly scraping and enumeration but also some known http exploit attempts and bruteforcing.

you get 3 blocklists with a free acct so sort the blocklists on their site by size and get the three biggest and you’ll block the vast majority before crowdsec even has to evaluate rules. only like 100 or so or mine have been blocked dynamically by crowdsec, the rest of the now 200k or so total have been those blocklists.

edit: tho i don’t know how much control you have on your hosting service, whether you can install something like this or only plugin things they have integrated into the service themselves.

limelight79@lemmy.world · 29 days ago

Yeah that’s kind of the issue with some of the solutions that have been offered, I can’t install new software on the server.

SLVRDRGN@lemmy.world · edit-2 28 days ago

Robots.txt is a standard, developed in 1994, that relies on voluntary compliance.

Voluntary compliance is conforming to a rule, without facing negative consequences if not complying.

Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them.

This is all from Wikipedia’s entry on Robots.txt.
I don’t get how we only have voluntary protocols for things like this at this point in 2025 AD…

limelight79@lemmy.world · 28 days ago

Yeah that’s part of why I was so frustrated with the answer from OpenAI about it. I don’t think I mentioned it in the writeup, but I actually did modify robots.txt on Jan 1 to block OpenAI’s bot, and it didn’t stop. In fairness, there’s probably some delay before it re-reads the file, but who knows how long it would have taken for the bot to re-read it and stop flooding the site (assuming it obeys at all) - and it still would have been sucking data until that point.

I also didn’t mention that the support bot gave me the wrong URL for the robots.txt info on their site. I pointed it out and it gave me the correct link. So, it HAD the correct link and still gave me the wrong one! Supporters say, “Oh, yeah, you have to point out its errors!” Why the fuck would I want to argue with it? Also, I’m asking questions because I don’t know the answer! If I knew the correct answer, why would I be asking?

In the abstract, I see the possibilities of AI. I get what they’re trying to do, and I think there may be some value to AI in the future for some applications. But right now they’re shoveling shit at all of us and ripping content creators off.

gwl [he/him]@lemmy.blahaj.zone · 28 days ago

answer from OpenAI

There’s your problem, you’re trusting the blind idiot

limelight79@lemmy.world · 28 days ago

Where did I say I trusted them? Seriously, please, so I can fix that.

gwl [he/him]@lemmy.blahaj.zone · 27 days ago

Well, your first step was using it at all

limelight79@lemmy.world · 27 days ago

What were my other options? That’s the only option for contacting them I could find on their site.

gkak.laₛ@lemmy.zip · 29 days ago

For self-hosting there is go-away and anubis

Also, at least on my server, the LLM bots have correct User-Agents set (they identify themselves) so you could block them from that as well

(If you are able to manage it yourself, self-hosting on a VPS instead of a web hosting provider would also have the benefit of costing around €80/year with ~20TB/month; phpBB should be easy to set up with containers (I don’t know about migrating your existing data though))

limelight79@lemmy.world · 29 days ago

Your vps hosting is much cheaper than ours!

gkak.laₛ@lemmy.zip · 28 days ago

Oh it’s just Hetzner 😅

(referral link gives you a coupon! Sorry for the plug 😔 but you may need it if you’d like to try them out)

limelight79@lemmy.world · 28 days ago

I was like, damn $38/month! Then I looked at the company I use now, and their cheapest vps is $22. They’re a lot cheaper than I remembered.

But they also have less disc space than my shared hosting.

Huh now you have me looking. The price of the additional disc space I’d need is still less than I’m currently paying. I’d just about break even if I matched my current allocation of disc space, but I don’t use anywhere near that much. And it would do email, instead of being charged separately. Hmmmmmmmmmm!

gkak.laₛ@lemmy.zip · 28 days ago

$22

(I was researching providers last year, and even $22 is still kind of expensive compared to others 😅

storage

For data storage (not rootfs) I have their plan with €3.5/1TB/month; it’s a network drive that you can mount on the VPS with sshfs, webdav, samba etc, so you could check if your provider has something like that as well!

email

Oh for email I recommend mailu.io! It has a docker-compose generator, so the set up is pretty straightforward! I however have it on a separate VPS so it won’t be affected when I’m trying to deploy other services etc (e.g. the configuration of the reverse proxy and certificates), and with webmail and spam filtering etc it’s quite a few containers anyway 😅 so I’d like them to have dedicated resources and not be affected by the resources used by the other services

limelight79@lemmy.world · 28 days ago

You’re right, I’m doing some looking and …definitely some less expensive options out there.

I’m always a little nervous because I switched to another company years ago, and it was a mess - frequent down time, missed connections, etc. I switched back to my current host within a year.

I could also switch to AWS. I already have an account (I use their S3 glacier service for backups). Advantage would be that I’d pay for exactly what I use…

gkak.laₛ@lemmy.zip · edit-2 27 days ago

It doesn’t have to be FAANG for it to be a good service 😅 (you unfortunately probably just found an outlier 🥲) In fact, AWS is one of the most expensive, and when I looked into them (I think it was AWS) the server options were way too complicated 😅 And I couldn’t find advantages to most use cases compared to others to justify paying so much more

Hetzner I think is one of the largest providers in Europe, Scaleway is very good as well (although still a bit expensive but with more flexibility but you may not always need it), I remember some friends and our local hackerspace’s server used Leaseweb a while ago and didn’t have any complaints, and I’ve heard good words about Ionos as well.

For storage there’s Backblaze, Wasabi, DigitalOcean (and again Scaleway and Hetzner); also TIL https://www.s3compare.io/

There are so many good options out there 😅

pay for exactly what I use

You can still easily spin/scale up/down instances on most providers: Hetzner and Scaleway have APIs and CLI utilities, and they charge by the hour. But even more importantly, even if you use fewer resources, for me I found that e.g. having a VM off 50% of the time on AWS was still much more expensive than 100% on Hetzer 😅

limelight79@lemmy.world · 27 days ago

Yeah, I was mostly kidding about AWS. It really is complicated, and there are bandwidth fees and storage fees and this fee and that fee… I also don’t want to be worrying about how much each month is costing me like that. I’d rather pay a fixed rate and not worry about it.

I’m a little hesitant on a European provider only because practically all of my users are in North America. I know it would work fine, but it makes sense to me to locate where I am and where the users are.

My contract with this company ends in July, so I’ve made a note in my calendar for June to find a VPS host. Thanks for the insight, you definitely have me thinking. I had no idea I could get a VPS less expensively than I’m currently paying…

Greg Clarke@lemmy.ca · 29 days ago

Cloudflare also has caching on the free tier which will reduce these kinds of AI attacks

limelight79@lemmy.world · 29 days ago

Yeah that’s what I’ve set up. I haven’t turned the site on yet, I want to leave it off all day tomorrow so that the logs will show nothing, then when I restart it, I can watch the logs for requests from the AI bots.