Too many guests

Started by LilianaUwU, May 05, 2025, 10:52:22 AM

LilianaUwU

Quote from: freebrickproductions on May 08, 2025, 05:43:17 PM
IIRC, there are services out there that should help hinder/limit the number of requests from bots/scrapers?
I know Cloudflare has a DDOS protection thing that's basically a captcha.
"Volcano with no fire... Not volcano... Just mountain."
—Mr. Thwomp

My pronouns are she/her. Also, I'm an admin on the AARoads Wiki.

vdeane

I saw a mention of feeding the bots nonsense instead of the legitimate pages: https://marcusb.org/hacks/quixotic.html

Also something about responding to them with HTTP code 429 instead of real pages.

Quote from: LilianaUwU on May 08, 2025, 06:04:05 PM
Quote from: freebrickproductions on May 08, 2025, 05:43:17 PM
IIRC, there are services out there that should help hinder/limit the number of requests from bots/scrapers?
I know Cloudflare has a DDOS protection thing that's basically a captcha.
Cloudflare caused the redirect loop.  It might still be worth attempting, however, so long as all existing redirects are disabled for the attempt.  From what I read, it seemed like the issue was because the existing redirects would trigger the Cloudflare redirect, which would trigger the existing redirect, etc.
Please note: All comments here represent my own personal opinion and do not reflect the official position of NYSDOT or its affiliates.

ClassicHasClass

I don't know if you have access to the network stack, but the only way I solved this on my own sites was IP-blocking a lot of networks. For several days I blocked the whole of 47.*.*.* until the site calmed down, and then cut the netblock down to the remnant segments that kept popping up. It's not enough to block them at the web server - you have to block them at the firewall or IP filter level.
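For anyone who wants to see what a netblock rule like that covers, here is a minimal Python sketch of the matching logic only. It assumes 47.*.*.* corresponds to the CIDR range 47.0.0.0/8; the function name and the sample addresses are made up for illustration, and a real block would live in the firewall or IP filter, not in application code.

```python
import ipaddress

# Assumption: "47.*.*.*" from the post written as a CIDR range.
BLOCKED_NETWORKS = [ipaddress.ip_network("47.0.0.0/8")]


def is_blocked(ip, networks=BLOCKED_NETWORKS):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    # Hypothetical sample addresses, not taken from any real log.
    for sample in ("47.12.34.56", "48.0.0.1"):
        print(sample, "blocked" if is_blocked(sample) else "allowed")
```

Once the traffic calms down, the single /8 entry could be swapped for a list of narrower networks, which is the "cut the netblock down" step described above.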

The problem with feeding them nonsense is they will simply download the nonsense. Half of these are from Chinese startups anyway. They don't know if they're getting Shakespeare or sh... well, you know.

Scott5114

#28
Quote from: Alex on May 08, 2025, 04:36:16 PM
I have consulted AI to try to find a solution, but found nothing.

The conspiratorially-minded would be quick to point out that the bots are there in the first place because they're feeding the forum's contents to the AI, so of course the AI wouldn't tell you how to stop it...

Quote from: ClassicHasClass on May 08, 2025, 09:17:44 PM
The problem with feeding them nonsense is they will simply download the nonsense. Half of these are from Chinese startups anyway. They don't know if they're getting Shakespeare or sh... well, you know.

The goal of feeding them nonsense would be that it would make the "throw a million bots at the problem" approach ultimately give them a garbage result, so they are motivated to slow down to something more like the pace of a search engine crawler. But that is a long-term approach that doesn't solve the problem of traffic coming in so fast and furious that it causes the server to keel over.
uncontrollable freak sardine salad chef

edwaleni

I don't know the details of your hosting arrangements, but dealing with 'GET scraping' usually requires some kind of rule that limits the number of requests per minute or per hour by IP or IP block.

These are usually implemented in the load balancer for your web services provider, or a policy enforcement appliance your host provider may use.
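As a rough illustration of such a rule, the sketch below counts requests per IP in a sliding window and answers with HTTP 429 (Too Many Requests) once the limit is passed. The class name, the limit of 60 requests, and the 60-second window are assumptions chosen for the example, not settings from any particular load balancer or appliance.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window request counter keyed by client IP (illustrative only)."""

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def status_for(self, ip, now=None):
        """Return 200 if the request is allowed, 429 if the IP is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200


if __name__ == "__main__":
    limiter = PerIpRateLimiter(max_requests=2, window_seconds=60)
    # Hypothetical client address; the third request inside the window is refused.
    for second in (0, 1, 2):
        print(second, limiter.status_for("203.0.113.7", now=second))
```

The same idea can be keyed by IP block instead of single address by passing the network prefix as the key.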

If you are on a basic hosting plan, where it is up to the web admin to sort it out on the host that is the target of the requests, you could use a reverse proxy.

You can install nginx or Apache, set up a reverse proxy on your web server, and set up rules.
It will add some overhead, but it can't be as bad as dealing with the scraping issue.

Just some thoughts.

vdeane

Quote from: ClassicHasClass on May 08, 2025, 09:17:44 PM
The problem with feeding them nonsense is they will simply download the nonsense. Half of these are from Chinese startups anyway. They don't know if they're getting Shakespeare or sh... well, you know.
Wouldn't it still at least keep the forum database from getting hammered since they wouldn't get the real content?  And quite frankly, if they're essentially DDOSing the sites they crawl, they deserve to have their AI ruined by garbage data.
Please note: All comments here represent my own personal opinion and do not reflect the official position of NYSDOT or its affiliates.

hotdogPi

I was under the impression that the bots aren't actually causing any issues. We've had 90000 guests at times in the last few days, but the issues with the forum randomly going down and search not working unless you do a hyper-specific advanced search are caused by a PHP version mismatch, not the bots.

For example, I'm seeing 82,187 guests now and no slowdowns.
Clinched

Traveled, plus
US 13, 50
MA 22, 35, 40, 53, 79, 107, 109, 126, 138, 141, 159
NH 27, 78, 111A(E); CA 90; NY 366; GA 42, 140; FL A1A, 7; CT 32, 320; VT 2A, 5A; PA 3, 51, 60, WA 202; QC 162, 165, 263; 🇬🇧A100, A3211, A3213, A3215, A4222; 🇫🇷95 D316

Lowest untraveled: 36

Scott5114

Quote from: vdeane on May 09, 2025, 12:43:17 PM
Quote from: ClassicHasClass on May 08, 2025, 09:17:44 PM
The problem with feeding them nonsense is they will simply download the nonsense. Half of these are from Chinese startups anyway. They don't know if they're getting Shakespeare or sh... well, you know.
Wouldn't it still at least keep the forum database from getting hammered since they wouldn't get the real content?  And quite frankly, if they're essentially DDOSing the sites they crawl, they deserve to have their AI ruined by garbage data.

It would presumably keep mysqld from getting hammered, but httpd would still have to serve the requests, and it would still use a portion of the server's bandwidth.

Quote from: hotdogPi on May 09, 2025, 12:49:03 PM
I was under the impression that the bots aren't actually causing any issues. We've had 90000 guests at times in the last few days, but the issues with the forum randomly going down and search not working unless you do a hyper-specific advanced search are caused by a PHP version mismatch, not the bots.

For example, I'm seeing 82,187 guests now and no slowdowns.

Alex can correct me if I'm wrong, but I believe the forum is randomly going down because it's running out of memory due to all of the requests.
uncontrollable freak sardine salad chef

Alex

Quote from: Scott5114 on May 09, 2025, 09:59:46 PM
Quote from: vdeane on May 09, 2025, 12:43:17 PM
Quote from: ClassicHasClass on May 08, 2025, 09:17:44 PM
The problem with feeding them nonsense is they will simply download the nonsense. Half of these are from Chinese startups anyway. They don't know if they're getting Shakespeare or sh... well, you know.
Wouldn't it still at least keep the forum database from getting hammered since they wouldn't get the real content?  And quite frankly, if they're essentially DDOSing the sites they crawl, they deserve to have their AI ruined by garbage data.

It would presumably keep mysqld from getting hammered, but httpd would still have to serve the requests, and it would still use a portion of the server's bandwidth.

Quote from: hotdogPi on May 09, 2025, 12:49:03 PM
I was under the impression that the bots aren't actually causing any issues. We've had 90000 guests at times in the last few days, but the issues with the forum randomly going down and search not working unless you do a hyper-specific advanced search are caused by a PHP version mismatch, not the bots.

For example, I'm seeing 82,187 guests now and no slowdowns.

Alex can correct me if I'm wrong, but I believe the forum is randomly going down because it's running out of memory due to all of the requests.

That is exactly what the problem is, the RAM is being tapped out. You can see where the server restarts in the memory usage graph.



The rest of the site certainly is not drawing this kind of traffic, especially considering that there have been no content updates for months and that WordPress is increasingly making the browsing experience a test of patience.

rschen7754

Quote from: ClassicHasClass on May 08, 2025, 09:17:44 PM
I don't know if you have access to the network stack, but the only way I solved this on my own sites was IP-blocking a lot of networks. For several days I blocked the whole of 47.*.*.* until the site calmed down, and then cut the netblock down to the remnant segments that kept popping up. It's not enough to block them at the web server - you have to block them at the firewall or IP filter level.

The problem with feeding them nonsense is they will simply download the nonsense. Half of these are from Chinese startups anyway. They don't know if they're getting Shakespeare or sh... well, you know.

What I have done on the AARoads Wiki side is feed suspicious IPs into https://ipinfo.io/. If it comes up "hosting" or "cloud", I block the entire range on the firewall. It usually takes care of the problem, at least for a few days.

NJRoadfan

New problem: Users aren't able to log in. I got a report from another user here on another forum that he wasn't able to log in. When I attempted to do so on another computer, an error "Sorry Guest, you are banned from using this forum!" pops up instead of the login screen. I realize this may be part of the DDoS mitigation measures, but maybe they are TOO good.

Rothman

Quote from: NJRoadfan on May 11, 2025, 10:10:33 PM
New problem: Users aren't able to log in. I got a report from another user here on another forum that he wasn't able to log in. When I attempted to do so on another computer, an error "Sorry Guest, you are banned from using this forum!" pops up instead of the login screen. I realize this may be part of the DDoS mitigation measures, but maybe they are TOO good.

Maybe they were banned...
Please note: All comments here represent my own personal opinion and do not reflect the official position(s) of NYSDOT.

Scott5114

Sounds like their IP address shifted to a range that had formerly been used by spammers or a banned user.

Without knowing the IP address (or even the username) of the person affected, though, I wouldn't know what to unban.
uncontrollable freak sardine salad chef

74/171FAN

It may be a general issue.  I clicked "Log In" from my phone, and could not even attempt to log in before seeing the banned message.
I am now a PennDOT employee.  My opinions/views do not necessarily reflect the opinions/views of PennDOT.

Travel Mapping: https://travelmapping.net/user/?units=miles&u=markkos1992
Mob-Rule:  https://mob-rule.com/user/markkos1992

Alex

This is because one of the htaccess rules I placed yesterday to mitigate the 100,000-plus bot GET requests included the login page:

RewriteCond %{QUERY_STRING} (action=reminder|action=login|action=register|action=printpage|action=profile|action=search) [NC]

I'll remove action=login from this and see if the flood gates open again.
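To see which query strings a condition like that catches, here is a small Python sketch that rebuilds the same alternation and tests a few strings against it. It only approximates Apache's matching (Python's re module with IGNORECASE standing in for the [NC] flag), and the sample query strings are invented for illustration.

```python
import re

# The action names listed in the RewriteCond above.
ACTIONS = ["reminder", "login", "register", "printpage", "profile", "search"]


def build_pattern(actions):
    """Build the unanchored, case-insensitive alternation used by the rule."""
    body = "|".join("action=" + name for name in actions)
    return re.compile("(" + body + ")", re.IGNORECASE)


def is_matched(query_string, pattern):
    """True if the rule's condition would apply to this query string."""
    return pattern.search(query_string) is not None


original = build_pattern(ACTIONS)
without_login = build_pattern([name for name in ACTIONS if name != "login"])

if __name__ == "__main__":
    # Hypothetical query strings.
    for qs in ("action=login", "ACTION=Search;params=x", "topic=123.0"):
        print(qs, is_matched(qs, original), is_matched(qs, without_login))
```

Because the pattern is unanchored, it also matches any longer value that starts the same way, so the login entry catches more than the login page itself.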

Alex

And to show that the bot GET requests are not stopping, this error again appears from SMF:

UPDATE smf_log_activity
SET
hits = 1,
most_on = 88168
WHERE date = '2025-05-12'

This is because that column of the SQL database table is defined as SMALLINT(5), which even with the UNSIGNED attribute is capped at 65,535. The fact that the number needs to reach 88,168 is indicative of how much bad attention this forum is garnering.
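The arithmetic behind that cap, as a quick Python check: a MySQL SMALLINT is 2 bytes (the 5 in parentheses is only a display width), so the largest unsigned value is 2^16 - 1. The comparison with MEDIUMINT is included only to show the size of the gap; it is not a statement about what was actually changed on the forum, and the helper name is made up.

```python
def unsigned_max(bits):
    """Largest value an unsigned integer column of this width can hold."""
    return 2 ** bits - 1


SMALLINT_BITS = 16   # MySQL SMALLINT is 2 bytes
MEDIUMINT_BITS = 24  # MySQL MEDIUMINT is 3 bytes

most_on = 88168  # value from the failing UPDATE above

smallint_limit = unsigned_max(SMALLINT_BITS)
mediumint_limit = unsigned_max(MEDIUMINT_BITS)
fits_smallint = most_on <= smallint_limit
fits_mediumint = most_on <= mediumint_limit
overflow_by = most_on - smallint_limit

if __name__ == "__main__":
    print("SMALLINT UNSIGNED max:", smallint_limit)
    print("MEDIUMINT UNSIGNED max:", mediumint_limit)
    print("most_on fits SMALLINT:", fits_smallint)
    print("most_on fits MEDIUMINT:", fits_mediumint)
    print("over the SMALLINT cap by:", overflow_by)
```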

To that end I have tried a number of efforts to block these requests and otherwise mitigate them. Went to SMF forums and tried out .htaccess code from developers there. Manually edited some of the SMF scripts directly to remove guest email access options, such as restricting the RemindMe() function to only members. Activated the firewall through Plesk, set up Monitoring via Grafana, increased Fail2Ban rules for blocking IP addresses, etc.

I confirmed that the htaccess restrictions I enabled yesterday were working, as the number of slow.log entries from the forum decreased dramatically. Nonetheless, the bots are still somewhat hammering the Forum, as the memory usage is still somewhat high.



Keep in mind that I am not a software developer, and a lot of what I am trying I am either learning on the fly or relying upon AI and SMF forum posts for advice.

Unfortunately, if this keeps up, I will be forced to take the Forum down, as the increasing memory usage of the GET requests results in slow-loading scripts, time-outs, gateway errors, and "this site is offline" messages. Been trying to finish this 4-month-long reprogramming of the back-end of AARoads, and lately the Forum issues have consistently taken me away from completing the debugging and data entry for that.

Rothman

Can't complain.  It's been a good run.
Please note: All comments here represent my own personal opinion and do not reflect the official position(s) of NYSDOT.

Scott5114

We have a solution in place on the wiki that we're testing out. (Go check out https://wiki.aaroads.com/wiki/Chickasaw_Turnpike or something to see it in action—you'll have to look quick though because every time I've seen it it just flashes on the screen for a second or so.) If that quells the bot flood over there, then we can probably use the same thing on the forum. Last I heard the CPU on the wiki server was holding steady at 3% or so, so that's a good sign.
uncontrollable freak sardine salad chef

SEWIGuy

I hope this solution works, but if it doesn't, would simply not allowing any new registrants work?

Scott5114

Quote from: SEWIGuy on May 12, 2025, 09:29:31 AM
I hope this solution works, but if it doesn't, would simply not allowing any new registrant work?

The problem isn't registrants, it's that AI companies are having 101,235 computers (actual number, not an exaggeration) try to load the site at the same time. The server, understandably, cannot cope with that. Traditionally, you could put together a file called robots.txt that would essentially serve as a gentleman's agreement for how bots should interact with the site to prevent this from happening, but the AI companies just blatantly disrespect the rules outlined in that. I guess the $$$$ they're theoretically making (not actually making, mind you) is more important to them than whether the website they're ingesting is able to stay online while they're doing it.
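For readers who haven't seen one, here is a sketch of how a well-behaved crawler would read a robots.txt, using Python's standard urllib.robotparser. The file contents, the bot names, and the forum.example URL are all invented for the example. Nothing in the mechanism forces a crawler to run a check like this, which is why it only works as a gentleman's agreement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out entirely, everyone else
# is asked to wait 10 seconds between requests and to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

TOPIC_URL = "https://forum.example/index.php?topic=1.0"

ai_bot_allowed = parser.can_fetch("ExampleAIBot", TOPIC_URL)
other_bot_allowed = parser.can_fetch("SomeSearchBot", TOPIC_URL)
other_bot_delay = parser.crawl_delay("SomeSearchBot")

if __name__ == "__main__":
    print("ExampleAIBot may fetch topic:", ai_bot_allowed)
    print("SomeSearchBot may fetch topic:", other_bot_allowed)
    print("SomeSearchBot crawl delay (s):", other_bot_delay)
```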

This is not just a problem for our site, either—it's happening Internet-wide.
uncontrollable freak sardine salad chef

ClassicHasClass

Quote from: Scott5114 on May 11, 2025, 10:28:25 PM
Sounds like their IP address shifted to a range that had formerly been used by spammers or a banned user.

Without knowing the IP address (or even the username) of the person affected, though, I wouldn't know what to unban.

It was me. Haven't been banned yet.  :bigass:

There was an article in the Register recently about the IETF coming up with a different take on robots.txt to address AI crawlers, but that just seems like rearranging the deck chairs while the ship gets bombarded. I guess we'll find out in August. https://www.theregister.com/2025/04/09/ietf_ai_preferences_working_group/

Henry

I hate that this is happening, especially the problem from yesterday falsely informing us that we were banned from the forum (myself included). That being said, I hope that the admins find a solution that will keep the site running smoothly, because I'd hate to lose the biggest part of my roadgeeking life forever.
Go Cubs Go! Go Cubs Go! Hey Chicago, what do you say? The Cubs are gonna win today!

LilianaUwU

It's fucked that there is no definitive solution short of burning down the data centers that host the bots.

(Somehow this originally posted in the Buc-ee's thread below even though I had clicked on this one?)
"Volcano with no fire... Not volcano... Just mountain."
—Mr. Thwomp

My pronouns are she/her. Also, I'm an admin on the AARoads Wiki.

Rothman

Makes me wonder what I'd save from the forum.  Truth be told, I'm not so sure if anything.
Please note: All comments here represent my own personal opinion and do not reflect the official position(s) of NYSDOT.

vdeane

Quote from: Rothman on May 12, 2025, 10:07:51 PM
Makes me wonder what I'd save from the forum.  Truth be told, I'm not so sure if anything.
I'd say Alanland, but we did that already.
Please note: All comments here represent my own personal opinion and do not reflect the official position of NYSDOT or its affiliates.