自己写了若干爬虫, 但是自己的网站也有人爬, 呵呵, 这里介绍一种Nginx反爬.我在阿里云只开放80端口, 所有一般端口都通过Nginx进行反向代理. 通过Nginx, 我们还可以拦截大部分爬虫.
然后我们再给自己的网站加上HTTPS支持.
我的系统如下:
jinhan@jinhan-chen-110:~/book/Obiwan/bin$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
安装(如果有apache服务器, 建议卸载了, 或者改Nginx的默认端口):
sudo apt-get install nginx
此时已经开启了80端口, 并且配置处在etc/nginx
lsof -i:80
cd /etc/nginx
将配置放于conf.d/*
server{
listen 80;
server_name php.lenggirl.com;
charset utf-8;
access_log /data/logs/nginx/www.lenggirl.com.log;
#error_log /data/logs/nginx/www.lenggirl.com.err;
location / {
root /data/www/php/blog;
index index.html index.php;
#访问路径的文件不存在则重写URL转交给ThinkPHP处理
if (!-e $request_filename) {
rewrite ^/(.*)$ /index.php/$1 last;
break;
}
}
## Images and static content is treated different
location ~* ^.+.(jpg|jpeg|gif|css|png|js|ico|xml)$ {
access_log off;
expires 30d;
root /data/www/php/blog;
}
location ~\.php/?.*$ {
root /data/www/php/blog;
fastcgi_pass 127.0.0.1:9000;
fastcgi_index index.php;
#加载Nginx默认"服务器环境变量"配置
include fastcgi.conf;
#设置PATH_INFO并改写SCRIPT_FILENAME,SCRIPT_NAME服务器环境变量
set $fastcgi_script_name2 $fastcgi_script_name;
if ($fastcgi_script_name ~ "^(.+\.php)(/.+)$") {
set $fastcgi_script_name2 $1;
set $path_info $2;
}
fastcgi_param PATH_INFO $path_info;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name2;
fastcgi_param SCRIPT_NAME $fastcgi_script_name2;
}
}
通过server_name, 用域名访问, 全部会到80端口, 根据域名会转发到8080
域名请A记录到该机器IP地址.
vim /etc/nginx/conf.d/www.lenggirl.com.conf
server{
listen 80;
# 本地测试时可以将域名改为: 127.0.0.1
server_name www.lenggirl.com;
charset utf-8;
access_log /root/logs/nginx/www.lenggirl.com.log;
#error_log /data/logs/nginx/www.lenggirl.com.err;
location / {
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://localhost:8080;
# 这个就是反爬虫文件了
include /etc/nginx/anti_spider.conf;
}
}
日志文件要先建立:
sudo mkdir -p /root/logs/nginx
查看配置是否无误, 并重启:
sudo nginx -t
sudo service nginx restart
sudo nginx -s reload
访问127.0.0.1会发现502错误, 因为8080端口我们没开! 此时访问localhost会发现, 这时Nginx欢迎页面出来了, 这是默认80端口页面!
增加反爬虫配额文件:
sudo vim /etc/nginx/anti_spider.conf
#禁止Scrapy等工具的抓取
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#禁止指定UA及UA为空的访问
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) {
return 403;
}
#禁止非GET|HEAD|POST方式的抓取
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
#屏蔽单个IP的命令是
#deny 123.45.6.7
#封整个段即从123.0.0.1到123.255.255.254的命令
#deny 123.0.0.0/8
#封IP段即从123.45.0.1到123.45.255.254的命令
#deny 124.45.0.0/16
#封IP段即从123.45.6.1到123.45.6.254的命令是
#deny 123.45.6.0/24
# 以下IP皆为流氓
deny 58.95.66.0/24;
在网站配置server段中都插入include /etc/nginx/anti_spider.conf, 见上文. 你可以在默认的80端口配置上加上此句:sudo vim sites-available/default
重启:
sudo nginx -s reload
爬虫UA常见:
FeedDemon 内容采集
BOT/0.1 (BOT for JCE) sql注入
CrawlDaddy sql注入
Java 内容采集
Jullo 内容采集
Feedly 内容采集
UniversalFeedParser 内容采集
ApacheBench cc攻击器
Swiftbot 无用爬虫
YandexBot 无用爬虫
AhrefsBot 无用爬虫
YisouSpider 无用爬虫(已被UC神马搜索收购,此蜘蛛可以放开!)
jikeSpider 无用爬虫
MJ12bot 无用爬虫
ZmEu phpmyadmin 漏洞扫描
WinHttp 采集cc攻击
EasouSpider 无用爬虫
HttpClient tcp攻击
Microsoft URL Control 扫描
YYSpider 无用爬虫
jaunty wordpress爆破扫描器
oBot 无用爬虫
Python-urllib 内容采集
Indy Library 扫描
FlightDeckReports Bot 无用爬虫
Linguee Bot 无用爬虫
使用curl -A 模拟抓取即可,比如:
# -A表示User-Agent
# -X表示方法: POST/GET
# -I表示只显示响应头部
curl -X GET -I -A 'YYSpider' localhost
HTTP/1.1 403 Forbidden
Server: nginx/1.10.3 (Ubuntu)
Date: Fri, 08 Dec 2017 10:07:15 GMT
Content-Type: text/html
Content-Length: 178
Connection: keep-alive
模拟UA为空的抓取:
curl -I -A ' ' localhost
模拟百度蜘蛛的抓取:
curl -I -A 'Baiduspider' localhost
见: https://certbot.eff.org/#ubuntuxenial-other
首先执行:
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:certbot/certbot
apt-get update
apt-get install python-certbot-nginx
然后执行这个按说明操作:
certbot --authenticator standalone --installer nginx --pre-hook "nginx -s stop" --post-hook "nginx"
之前/etc/nginx/conf.d/www.lenggirl.com.conf文件变更如下:
server{
listen 80;
# 本地测试时可以将域名改为: 127.0.0.1
server_name www.lenggirl.com;
charset utf-8;
access_log /root/logs/nginx/www.lenggirl.com.log;
#error_log /data/logs/nginx/www.lenggirl.com.err;
location / {
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://localhost:4000;
# 这个就是反爬虫文件了
# include /etc/nginx/anti_spider.conf;
#
}
listen 443 ssl; # managed by Certbot
ssl_certificate /etc/letsencrypt/live/www.lenggirl.com/fullchain.pem; # managed by Certbot
ssl_certificate_key /etc/letsencrypt/live/www.lenggirl.com/privkey.pem; # managed by Certbot
include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
if ($scheme != "https") {
return 301 https://$host$request_uri;
} # managed by Certbot
}