
Scrapy Usage

Writing crawlers with the Scrapy crawler framework.

Preparation

Debugging

Use scrapy shell <url> to debug interactively:

  1. Configure the default request headers:
    >>> settings.DEFAULT_REQUEST_HEADERS = {
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'Accept-Language': 'en',
     'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
     (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
  2. Use the fetch(url) command to fetch the page under test; the result is parsed into response
  3. Use response.css to inspect the page (a sample session follows this list)
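
A sample debugging session might look like the following (a sketch; the URL and CSS selectors are placeholders rather than a real site):

$ scrapy shell 'https://demo.com/p/'
>>> fetch('https://demo.com/p/')                       # re-fetch with the current settings
>>> response.status                                    # 200 if the request succeeded
>>> response.css('article>a::attr(href)').getall()     # extract candidate links
>>> view(response)                                     # open the downloaded page in a browser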

Creating a Project

  1. scrapy startproject projectname creates a new project
  2. scrapy genspider demo demo.com creates a new spider inside it
$ scrapy startproject Demo
New Scrapy project 'Demo', using template directory '~/.local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    ~/python/crawler/scrapy/Demo

You can start your first spider with:
    cd Demo
    scrapy genspider example example.com

$ cd Demo
$ scrapy genspider demo demo.com
$ tree
.
├── Demo
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── demo.py
│       └── __init__.py
└── scrapy.cfg

4 directories, 11 files
  • items.py: data model (Item) definitions
  • middlewares.py: middleware file, where all middlewares are configured
  • pipelines.py: pipeline file, used to process the scraped output
  • settings.py: project settings
  • demo.py: the spider itself
  • scrapy.cfg: the top-level Scrapy configuration file, generated automatically by Scrapy

Configuration

The settings live in settings.py; commonly changed options (combined in the snippet after this list):

  1. LOG_LEVEL = 'WARNING' sets the logging output level
  2. ROBOTSTXT_OBEY = False stops Scrapy from honouring robots.txt
  3. FEED_EXPORT_ENCODING = 'utf-8' keeps Chinese text readable in exported feeds
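
Put together in settings.py, these are simply:

# settings.py
LOG_LEVEL = 'WARNING'            # only log warnings and above
ROBOTSTXT_OBEY = False           # do not obey robots.txt
FEED_EXPORT_ENCODING = 'utf-8'   # keep non-ASCII (e.g. Chinese) readable in exports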

Spider Implementation

The generated spider file looks like this:

# -*- coding: utf-8 -*-
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['demo.com']
    start_urls = ['http://demo.com/']

    def parse(self, response):
        pass

Every spider class must inherit from scrapy.Spider, and name, start_urls, and the parse method are the members each spider is expected to declare. See the Spider documentation for details.

Either define start_urls or override the start_requests method; in both cases the result must be iterable.

import scrapy
from scrapy import Request

# HaosuItem is defined in items.py (see "Data Definition" below)
from ..items import HaosuItem

# default request headers; the User-Agent is the one used in the shell example above
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}


class Demo(scrapy.Spider):
    name = "demo"
    allowed_domains = ["demo.com"]

    def start_requests(self):
        # replaces start_urls: yield the initial Request(s) manually
        yield Request(url='https://demo.com/p/', headers=headers, callback=self.parse_rank)

    def parse_rank(self, response):
        for item in response.css('article>a'):
            name = item.css('.post-card-title::text').get()
            link = item.css('::attr(href)').get()
            print("\033[1;31m[%s]: %s\033[0m\n" % (name, link))
            yield Request(url=link, headers=headers, callback=self.parse_one)

        # follow pagination if a "next" link exists
        next_page = response.css('ol>li.next').css('a::attr(href)').get()
        if next_page:
            print("%s" % next_page)
            yield Request(next_page, callback=self.parse_rank)

    def parse_one(self, response):
        # collect every image URL inside the article body
        for src in response.css('div#post[role=main] p img::attr(src)').getall():
            print("\033[1;32m%s\n\033[0m" % src)
            item = HaosuItem()
            item['src'] = src
            yield item

All callbacks that do parsing should yield Items or Requests.

Data Definition

The data model is defined in items.py:

import scrapy

class HaosuItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()

The item is populated and yielded as shown in parse_one above.

Image Downloads

Scrapy provides a built-in item pipeline (ImagesPipeline) for downloading images:

from scrapy.pipelines.images import ImagesPipeline
import scrapy

class HaosuPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # request the image URL stored in the item; ImagesPipeline handles the download
        yield scrapy.Request(item['src'])

The pipeline takes the download URL from the item and fetches it. Enable it in settings.py:

ITEM_PIPELINES = {
    'haosu.pipelines.HaosuPipeline': 5,
}
IMAGES_STORE = 'image'   # directory where the downloaded images are saved
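
By default the images are written below IMAGES_STORE/full/ with a hash of the URL as the file name. If more readable names are wanted, the pipeline's file_path method can be overridden; a minimal sketch (the exact signature varies slightly between Scrapy versions, and reusing the URL's basename is just an example):

import os
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class HaosuPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *args, **kwargs):
        # keep the original file name from the URL instead of the default hash
        return os.path.basename(request.url)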

Data Export

Built-in tools

$ scrapy crawl spider -o data.json
$ scrapy crawl spider -o data.csv
$ scrapy crawl spider -o data.xml
$ scrapy crawl spider -s FEED_URI='/home/user/folder/mydata.csv' -s FEED_FORMAT=csv
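
In newer Scrapy releases (2.1 and later) the FEED_URI/FEED_FORMAT pair is superseded by a single FEEDS setting; roughly:

# settings.py (Scrapy >= 2.1)
FEEDS = {
    'data.csv': {'format': 'csv', 'encoding': 'utf8'},
}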

pipeline

from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):

    def __init__(self):
        self.file = open('databases.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        print("%s" % item.__class__)
        self.exporter.export_item(item)
        return item

Enable the pipeline in settings.py; the numbers are priorities between 0 and 1000, and pipelines with smaller values run first:

ITEM_PIPELINES = {
    'satellites.pipelines.SatellitesPipeline': 300,
    'satellites.pipelines.CsvPipeline': 500,
}

Using classmethod

Use a classmethod (from_crawler) to read the database name from the settings, so different databases can be created:

# both methods live inside a pipeline class; self.connect is the pymysql
# connection created as shown in the next section
def __init__(self, mysql_dbname):
    # create the database named in the settings if it does not exist yet
    sql = "CREATE DATABASE if not exists " + mysql_dbname
    self.cursor = self.connect.cursor()
    self.cursor.execute(sql)

@classmethod
def from_crawler(cls, crawler):
    # Scrapy calls from_crawler when building the pipeline; read settings here
    settings = crawler.settings
    mysql_dbname = settings.get('MYSQL_DBNAME')
    print("%s" % mysql_dbname)
    return cls(mysql_dbname)

Storing data with pymysql

# inside the pipeline's __init__, with settings obtained from crawler.settings
import pymysql

self.connect = pymysql.connect(
    host=settings.get('MYSQL_HOST'),
    user=settings.get('MYSQL_USER'),
    passwd=settings.get('MYSQL_PASSWD'),
    charset='utf8',
    use_unicode=True)

# create the database if needed and switch to it
sql = "CREATE DATABASE if not exists " + mysql_dbname
self.cursor = self.connect.cursor()
self.cursor.execute(sql)
self.cursor.execute("USE %s;" % mysql_dbname)

# create satellite table
sql = """CREATE TABLE if not exists satellites(
    id bigint(20) NOT NULL AUTO_INCREMENT,
    position varchar(20),
    area varchar(30),
    band varchar(10),
    name varchar(100),
    time varchar(20),
    PRIMARY KEY (id)
    ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;"""
self.cursor.execute(sql)
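
The snippet above only creates the database and the table. A process_item that inserts each scraped item could look roughly like this (a sketch; it assumes the item carries position, area, band, name, and time fields matching the columns):

def process_item(self, item, spider):
    # parameterised INSERT avoids manual quoting/escaping
    sql = """INSERT INTO satellites (position, area, band, name, time)
             VALUES (%s, %s, %s, %s, %s)"""
    self.cursor.execute(sql, (item['position'], item['area'],
                              item['band'], item['name'], item['time']))
    self.connect.commit()
    return item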

Running the Spider

$ scrapy crawl demo
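
A spider can also be started from a plain Python script instead of the command line; a minimal sketch using Scrapy's CrawlerProcess (the script is assumed to sit next to scrapy.cfg):

# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project settings
process.crawl('demo')                             # spider name, same as scrapy crawl demo
process.start()                                   # blocks until crawling finishes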

Problems

Garbled page content

Setting Accept-Encoding to gzip,deflate in DEFAULT_REQUEST_HEADERS caused the returned pages to be garbled.

This header tells the server that compressed responses are accepted, so the body then has to be decompressed by hand.
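
One simple way around it, assuming the default HttpCompressionMiddleware is enabled, is to leave Accept-Encoding out of the custom headers and let Scrapy negotiate and decompress the content itself:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # no 'Accept-Encoding' here: HttpCompressionMiddleware adds it and
    # transparently decompresses gzip/deflate responses
}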

robots.txt

ROBOTSTXT_OBEY = False

Scrapy Architecture

Ref

  1. Scrapy简明教程
  2. Scrapy 2.0 documentation