Writing a Crawler with the Scrapy Framework
Preparation
- Installation
pip3 install scrapy --user
- CSS selectors: see the CSS selector notes
- XPath selectors: see "Extracting data with Scrapy: the XPath selector"
Debugging
Use scrapy shell <url> to debug interactively (a sample session is shown below).
- Configure the default request headers
>>> settings.DEFAULT_REQUEST_HEADERS = {
...     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
...     'Accept-Language': 'en',
...     'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 '
...                   '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
... }
- Use fetch(url) to download a test page; the result is parsed into response
- Use response.css to inspect the page
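A minimal interactive session looks like this; the URL and selectors are placeholders, not taken from a real site:

$ scrapy shell 'https://demo.com/p/'
>>> fetch('https://demo.com/p/')                      # re-fetch the page (e.g. after changing settings)
>>> response.css('title::text').get()                 # first match, or None
>>> response.css('article > a::attr(href)').getall()  # all matches as a list
>>> response.xpath('//article/a/@href').getall()      # the same query with an XPath selector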
Creating a project
scrapy startproject projectname
Then create a spider inside the project: scrapy genspider demo demo.com
Creating a spider
$ scrapy startproject Demo
New Scrapy project 'Demo', using template directory '~/.local/lib/python3.6/site-packages/scrapy/templates/project', created in:
~/python/crawler/scrapy/Demo
You can start your first spider with:
    cd Demo
    scrapy genspider example example.com
$ cd Demo
$ scrapy genspider demo demo.com
$ tree
.
├── Demo
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── demo.py
│       └── __init__.py
└── scrapy.cfg
4 directories, 11 files
- items.py: data model definitions
- middlewares.py: middleware file, where all middlewares are configured
- pipelines.py: pipeline file, used to process the scraped output
- settings.py: project settings
- demo.py: the spider itself
- scrapy.cfg: the top-level Scrapy configuration file, generated automatically by Scrapy
Configuration
Configuration file: settings.py
LOG_LEVEL = 'WARNING'            # log output level
ROBOTSTXT_OBEY = False           # do not obey robots.txt
FEED_EXPORT_ENCODING = 'utf-8'   # export non-ASCII (e.g. Chinese) text correctly
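These options, together with the default request headers used in the shell session above, can be kept permanently in settings.py. A minimal sketch (the user-agent string is only an example):

# settings.py (excerpt)
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
FEED_EXPORT_ENCODING = 'utf-8'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}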
Spider implementation
The generated spider file looks like this:
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['demo.com']
    start_urls = ['http://demo.com/']

    def parse(self, response):
        pass
Every spider class must inherit from scrapy.Spider, and name, start_urls and the parse method must be declared in each spider class (see the Spider documentation for details). A spider needs to either define start_urls or override the start_requests method; in both cases the result must be iterable.
import scrapy
from scrapy import Request

from haosu.items import HaosuItem   # defined in items.py (project name assumed)

# headers reused for every request (same values as DEFAULT_REQUEST_HEADERS above)
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}

class Demo(scrapy.Spider):
    name = "demo"
    allowed_domains = ["demo.com"]

    def start_requests(self):
        yield Request(url='https://demo.com/p/', headers=headers, callback=self.parse_rank)

    def parse_rank(self, response):
        for item in response.css('article>a'):
            name = item.css('.post-card-title::text').get()
            link = item.css('::attr(href)').get()
            print("\033[1;31m[%s]: %s\033[0m\n" % (name, link))
            yield Request(url=link, headers=headers, callback=self.parse_one)
        # follow the pagination link, if any
        next_page = response.css('ol>li.next').css('a::attr(href)').get()
        if next_page:
            print("%s" % next_page)
            yield Request(next_page, callback=self.parse_rank)

    def parse_one(self, response):
        # yield one item per image found in the post body
        for src in response.css('div#post[role=main] p img::attr(src)').getall():
            print("\033[1;32m%s\n\033[0m" % src)
            item = HaosuItem()
            item['src'] = src
            yield item
Every callback that parses a response should return (yield) Items or Requests.
Item definition
Define the data model in items.py:
import scrapy


class HaosuItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
Populate the item and yield it, as shown in parse_one above.
Image download
Scrapy provides a built-in item pipeline (ImagesPipeline) for downloading images:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class HaosuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # request every URL stored in the item's src field
        yield scrapy.Request(item['src'])
The pipeline takes the download URL from the item and fetches it; enable it in settings.py:
ITEM_PIPELINES = {
    'haosu.pipelines.HaosuPipeline': 5,
}
IMAGES_STORE = 'image'
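ImagesPipeline needs Pillow to be installed, and by default it stores images under IMAGES_STORE/full/ with a SHA1 hash as the filename. To keep the original filename, file_path can be overridden; a rough sketch (signature as in recent Scrapy versions, not part of the original project):

import os
from urllib.parse import urlparse

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class HaosuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # keep the last path component of the image URL as the filename
        return os.path.basename(urlparse(request.url).path)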
Data export
Built-in tools
$ scrapy crawl spider -o data.json
$ scrapy crawl spider -o data.csv
$ scrapy crawl spider -o data.xml
$ scrapy crawl spider -s FEED_URI='/home/user/folder/mydata.csv' -s FEED_FORMAT=csv
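In newer Scrapy versions (2.1 and later) FEED_URI/FEED_FORMAT are deprecated in favour of the FEEDS setting; a sketch of the equivalent configuration in settings.py:

# settings.py (Scrapy >= 2.1)
FEEDS = {
    'data.json': {'format': 'json', 'encoding': 'utf8'},
    'data.csv': {'format': 'csv'},
}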
pipeline
from scrapy.exporters import CsvItemExporter


class CsvPipeline(object):
    def __init__(self):
        self.file = open('databases.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        print("%s" % item.__class__)
        self.exporter.export_item(item)
        return item
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
    'satellites.pipelines.SatellitesPipeline': 300,
    'satellites.pipelines.CsvPipeline': 500,
}
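The numbers are priorities in the 0-1000 range; pipelines with lower values run first. The same exporter pattern works for other formats, for example a hypothetical JsonLinesPipeline built on scrapy's JsonLinesItemExporter:

from scrapy.exporters import JsonLinesItemExporter


class JsonLinesPipeline(object):
    def __init__(self):
        self.file = open('databases.jl', 'wb')
        self.exporter = JsonLinesItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item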
Using classmethod
Use a classmethod (from_crawler) to read the database name from the settings and create different databases:
def __init__(self, mysql_dbname):
    # create the database named in the settings if it does not exist yet
    sql = "CREATE DATABASE if not exists " + mysql_dbname
    self.cursor = self.connect.cursor()
    self.cursor.execute(sql)

@classmethod
def from_crawler(cls, crawler):
    # called by Scrapy when the pipeline is instantiated; gives access to the settings
    settings = crawler.settings
    mysql_dbname = settings.get('MYSQL_DBNAME')
    print("%s" % mysql_dbname)
    return cls(mysql_dbname)
Storing data with pymysql
import pymysql

self.connect = pymysql.connect(
    host=settings.MYSQL_HOST,
    user=settings.MYSQL_USER,
    passwd=settings.MYSQL_PASSWD,
    charset='utf8',
    use_unicode=True)

# create the database and switch to it
sql = "CREATE DATABASE if not exists " + mysql_dbname
self.cursor = self.connect.cursor()
self.cursor.execute(sql)
self.cursor.execute("USE %s;" % mysql_dbname)

# create satellite table
sql = """CREATE TABLE if not exists satellites(
    id bigint(20) NOT NULL AUTO_INCREMENT,
    position varchar(20),
    area varchar(30),
    band varchar(10),
    name varchar(100),
    time varchar(20),
    PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;"""
self.cursor.execute(sql)
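A process_item that inserts the scraped fields into this table could look like the following sketch; the item field names (position, area, band, name, time) are assumed to mirror the table columns, which the original does not show:

def process_item(self, item, spider):
    # parameterised INSERT; pymysql escapes the values for us
    sql = """INSERT INTO satellites (position, area, band, name, time)
             VALUES (%s, %s, %s, %s, %s)"""
    self.cursor.execute(sql, (
        item.get('position'),
        item.get('area'),
        item.get('band'),
        item.get('name'),
        item.get('time'),
    ))
    self.connect.commit()
    return item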
Running the spider
$ scrapy crawl demo
Problems
Garbled page content
Setting Accept-Encoding to gzip,deflate in DEFAULT_REQUEST_HEADERS makes the returned pages come back garbled: this header declares that the client accepts compressed data and will decompress it itself, so the raw compressed bytes end up in the response.
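In other words, leave Accept-Encoding out of the default headers and let Scrapy handle compression; a sketch of the corrected setting:

# settings.py -- do not set 'Accept-Encoding' here
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}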
robots.txt
If requests are being filtered out because of robots.txt, disable the check in settings.py:
ROBOTSTXT_OBEY = False