If you have a website and look at your logs, I'm sure you have noticed, on more than one occasion, that someone is scanning or scraping your website. On one of my sites, I regularly see bots trying these URLs:
Yes, apparently they're looking for WordPress exploits. PHP seems very popular. Unfortunately (for them) my site is not PHP based, so they just get a 404 Not Found error. And my logs fill up with garbage requests, which hides errors that I actually care about.
Some legitimate companies also run bots that scrape websites. ZoomInfo.com is one of them. Unfortunately, many of these bots are badly behaved: they hit sites at high speed, or they request URLs they scraped earlier but mangle the path along the way. Sometimes there is an extra "/" at the end of the path, the query string is missing, they use GET instead of POST, or they don't honor rel="nofollow". The list goes on. Others, like ZoomInfo.com, just scrape your site looking for email addresses and contact information. But do they really need to do this EVERY day? When only 2% of requests to your site are from REAL users, you really start hating the other 98%.
Sufficiently motivated, I set out to end this. And it was actually very easy (thanks, .NET Core). I implemented some middleware that inspects each request, looking at the URL path and the User-Agent header. With a simple JSON file, I can generate a 401 Unauthorized or 200 OK response for any request that matches the data found in the JSON file. Here is what a sample JSON file would look like:
In this example, all requests to URLs that end in .php, .asp or .aspx will be rejected with 401 Unauthorized. Any request whose User-Agent header contains ZoominfoBot or mj12bot.com will be rejected the same way.
There is also the ability to return 200 OK. In this example, any request from StatusPie.com, which uses StatusPieSiteCheckBot in the User-Agent header, will receive a 200 OK response. Don't worry, this bot will only scan your site if you actually request it to do so. By returning 200 OK directly from this middleware, we skip the rest of the pipeline and shorten the work needed to process the request. This bot just makes sure the site responds. The contents of the response don't matter.
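The matching rules described above can be sketched in a few lines of C#. This is only an illustration, not the YetaWF implementation: the class names `BlockSettings` and `RequestFilter`, the JSON property names, and the order in which the rules are checked are all assumptions, and it assumes `System.Text.Json` is available (.NET Core 3.0 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical settings shape; the property names are assumptions,
// not the format used by the real middleware.
public sealed class BlockSettings
{
    public List<string> BlockedPathSuffixes { get; set; } = new List<string>();
    public List<string> BlockedUserAgents { get; set; } = new List<string>();
    public List<string> OkUserAgents { get; set; } = new List<string>();
}

public static class RequestFilter
{
    // Hypothetical JSON mirroring the rules described in the text.
    public const string SampleJson = @"{
  ""BlockedPathSuffixes"": ["".php"", "".asp"", "".aspx""],
  ""BlockedUserAgents"": [""ZoominfoBot"", ""mj12bot.com""],
  ""OkUserAgents"": [""StatusPieSiteCheckBot""]
}";

    // Parses the settings; falls back to empty settings if the JSON is "null".
    public static BlockSettings Load(string json) =>
        JsonSerializer.Deserialize<BlockSettings>(json) ?? new BlockSettings();

    // Returns the status code to answer with immediately,
    // or null if the request should continue down the pipeline.
    public static int? Decide(BlockSettings settings, string path, string userAgent)
    {
        path = path ?? "";
        userAgent = userAgent ?? "";

        // Known "is the site up?" bots get a cheap 200 OK (checked first by assumption).
        if (settings.OkUserAgents.Any(ua =>
                userAgent.Contains(ua, StringComparison.OrdinalIgnoreCase)))
            return 200;

        // Paths ending in a blocked suffix (.php, .asp, ...) are rejected.
        if (settings.BlockedPathSuffixes.Any(suffix =>
                path.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)))
            return 401;

        // User agents containing a blocked token are rejected.
        if (settings.BlockedUserAgents.Any(ua =>
                userAgent.Contains(ua, StringComparison.OrdinalIgnoreCase)))
            return 401;

        return null;
    }
}
```

With settings loaded from `RequestFilter.SampleJson`, `Decide(settings, "/wp-login.php", "Mozilla/5.0")` yields 401, `Decide(settings, "/", "StatusPieSiteCheckBot/1.0")` yields 200, and `Decide(settings, "/about", "Mozilla/5.0")` yields null, so the request carries on as usual.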
The source code for the middleware can be found at https://github.com/YetaWF/YetaWF-CoreComponents/blob/master/Core/Startup/MVC6/BlockRequestMiddleware.cs. Because it is part of the YetaWF Web Framework, it needs slight modifications to be used stand-alone. The main dependencies are the location of the JSON file and a few YetaWF framework methods, which should be quite straightforward to replace.
As usual, the middleware is registered in your Configure method as follows and should be added very early in the pipeline:
// request blocking middleware
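For readers who want a stand-alone starting point, here is a minimal sketch of what such a middleware and its registration could look like. It is not the YetaWF code: the class name `RequestBlockingMiddleware` is hypothetical, and the rules are hard-coded here for brevity where the real middleware reads them from the JSON file.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

// Hypothetical stand-in for the real BlockRequestMiddleware.
public sealed class RequestBlockingMiddleware
{
    // Hard-coded for illustration only; normally loaded from the JSON file.
    private static readonly string[] BlockedSuffixes = { ".php", ".asp", ".aspx" };
    private static readonly string[] BlockedAgents = { "ZoominfoBot", "mj12bot.com" };
    private static readonly string[] OkAgents = { "StatusPieSiteCheckBot" };

    private readonly RequestDelegate _next;

    public RequestBlockingMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        string path = context.Request.Path.Value ?? "";
        string userAgent = context.Request.Headers["User-Agent"].ToString();

        if (OkAgents.Any(a => userAgent.Contains(a, StringComparison.OrdinalIgnoreCase)))
        {
            // Answer right away; the response body does not matter to this bot.
            context.Response.StatusCode = StatusCodes.Status200OK;
            return;
        }

        if (BlockedSuffixes.Any(s => path.EndsWith(s, StringComparison.OrdinalIgnoreCase)) ||
            BlockedAgents.Any(a => userAgent.Contains(a, StringComparison.OrdinalIgnoreCase)))
        {
            // Short-circuit: nothing further down the pipeline runs.
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        await _next(context);
    }
}

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Registered first so blocked requests never reach the rest of the pipeline.
        app.UseMiddleware<RequestBlockingMiddleware>();

        // ... the rest of your pipeline (static files, routing, MVC, ...) goes here.
        app.Run(async ctx => await ctx.Response.WriteAsync("Hello"));
    }
}
```

Placing the registration ahead of everything else is the point of the design: a rejected request costs one path check and one header lookup, and never touches routing, session handling, or your application logging.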