Using Cookies in Scrapy


Python has an excellent crawling framework in Scrapy: the architecture is clean, the design is elegant, and just about every customization point a crawler could need has a matching extension mechanism. Most websites use cookies to record identifying information about their visitors. Each request carries that identifier back to the server, letting the backend recognize individual users and apply authentication, anti-crawling measures, rate limiting, and more. For a crawler, then, knowing how to simulate cookies and use them to "convince" the server is an essential step. This post describes how to work with cookies in Scrapy.

Cookie handling in Scrapy is implemented as a downloader middleware, specifically the CookiesMiddleware class. If each request should keep its own independent cookie session, just create requests the way the official documentation shows (COOKIES_ENABLED must be set to True in the settings file):

for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse_page)

If instead all requests should share the same cookies, you can do the following:

for url in urls:
    yield scrapy.Request(url, meta={'cookiejar': 0}, callback=self.parse_page)

Here every request uses cookiejar number 0. With cookies enabled, Scrapy both reads cookies from the cookiejar and sets them on the request headers, and writes any cookies the server sends back in a response into the cookiejar.
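Note that the cookiejar meta key does not propagate on its own: if requests generated inside a callback should keep using the same jar, the official documentation has you pass it along explicitly, roughly like this (the URL and the parse_other_page callback are just placeholders inside your spider):

def parse_page(self, response):
    # Reuse whichever cookiejar the originating request was assigned,
    # so follow-up requests keep sharing the same session cookies.
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)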

A common scenario looks like this:

  1. We first send a request without cookies to target URL A.
  2. The server notices the request carries no cookie and 302-redirects it to an authorization page B.
  3. Page B writes the authorization cookie back to us and 302-redirects the request back to URL A. This new request reaches the server with the cookie attached and successfully fetches the page at URL A.
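Both halves of this dance are handled by stock downloader middlewares that are on by default; as a minimal reminder (the values below are already the defaults), the relevant settings are:

# settings.py -- both default to True, shown here only for clarity
COOKIES_ENABLED = True    # CookiesMiddleware: store cookies from responses, attach them to requests
REDIRECT_ENABLED = True   # RedirectMiddleware: follow the 302s in steps 2 and 3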

Scrapy follows redirects automatically, so everything looks perfect: with cookies enabled, the sequence above should complete on its own and we should end up with the content of URL A. In practice, however, you will find that step 3 never executes! The culprit is Scrapy's dupefilter mechanism: a URL that has already been crawled during a run is not crawled again. Even though the first fetch of URL A merely got redirected, Scrapy still counts URL A as visited and will not send the cookie-carrying request of step 3. The relevant logic lives in Scrapy's RFPDupeFilter:

def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)

def request_fingerprint(self, request):
    return request_fingerprint(request)

The fingerprint method derives a signature for a request, and that signature is what deduplication keys on. Reading request_fingerprint, it turns out the function accepts an include_headers argument (a list) naming which header keys should be folded into the fingerprint:

def request_fingerprint(request, include_headers=None):
    """
    Return the request fingerprint.
    The request fingerprint is a hash that uniquely identifies the resource the
    request points to. For example, take the following two urls:
    http://www.example.com/query?id=111&cat=222
    http://www.example.com/query?cat=222&id=111
    Even though those are two different URLs both point to the same resource
    and are equivalent (ie. they should return the same response).
    Another example are cookies used to store session ids. Suppose the
    following page is only accesible to authenticated users:
    http://www.example.com/members/offers.html
    Lot of sites use a cookie to store the session id, which adds a random
    component to the HTTP Request and thus should be ignored when calculating
    the fingerprint.
    For this reason, request headers are ignored by default when calculating
    the fingeprint. If you want to include specific headers use the
    include_headers argument, which is a list of Request headers to include.
    """
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
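To see why this matters for our redirect problem, here is a small sanity check (the session id value is made up): by default, two requests to the same URL collapse to one fingerprint even if only one of them carries a Cookie header, while passing include_headers=['Cookie'] keeps them distinct.

from scrapy import Request
from scrapy.utils.request import request_fingerprint

bare = Request("http://www.example.com/members/offers.html")
authed = Request("http://www.example.com/members/offers.html",
                 headers={'Cookie': 'sessionid=abc123'})  # made-up session id

# Headers are ignored by default, so both requests look identical to the dupefilter
assert request_fingerprint(bare) == request_fingerprint(authed)

# Folding the Cookie header into the hash separates them
assert (request_fingerprint(bare, include_headers=['Cookie'])
        != request_fingerprint(authed, include_headers=['Cookie']))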

Unfortunately, Scrapy does not expose a setting for choosing how the fingerprint is computed, so we have to implement a dupefilter subclass that adds the Cookie header to the fingerprint:

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class CookieRFPDupeFilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        # Include the Cookie header so the redirected, cookie-carrying request
        # is no longer treated as a duplicate of the original cookieless one.
        return request_fingerprint(request, include_headers=['Cookie'])

Finally, point the DUPEFILTER_CLASS setting at the new dupefilter, and the redirect-plus-cookie problem is solved:

DUPEFILTER_CLASS = 'scrapy_malong.dupfilters.CookieRFPDupeFilter'