urllib is Python's built-in HTTP request library. It is not as feature-rich as the third-party requests module, but because it ships with the standard library, it is convenient and quite capable for simple scenarios.

It provides the following modules:

  1. urllib.request — request module (for opening and reading URLs)

  2. urllib.error — exception module (containing the exceptions raised by urllib.request)

  3. urllib.parse — URL parsing module (for parsing URLs)

  4. urllib.robotparser — robots.txt module (for parsing robots.txt files)

HTTP Requests: Request

First, let's look at the most commonly used part, urllib.request, which issues an HTTP request the way a browser would. The parameters of urlopen() and of the Request class are as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, context=None)
# url: the URL to open
# data: content of type bytes, submitted as a POST form; the standard format is application/x-www-form-urlencoded
# timeout: request timeout, e.g. timeout=1 (times out automatically after 1 s)
# cafile and capath: CA certificate and the path to CA certificates
# context: ssl settings
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
# url, data: same as in urlopen
# headers: the request headers; they can also be modified afterwards with the add_header() method
# origin_req_host: the host name or IP address of the original requester
# unverifiable: whether the request is unverifiable; defaults to False
# method: the HTTP request method
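
To make these parameters concrete, here is a minimal sketch (the httpbin.org URL, the form field, and the header value are placeholders of my own, not part of the original examples) that sends a POST body with a timeout and adjusts the headers via add_header():

from urllib import request, parse

# Encode a form as application/x-www-form-urlencoded bytes for the data parameter
payload = bytes(parse.urlencode({'q': 'python'}), encoding='utf-8')

# Build a Request first, then adjust its headers with add_header()
req = request.Request('http://httpbin.org/post', data=payload, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0')

# urlopen accepts either a URL string or a Request object; timeout is in seconds
with request.urlopen(req, timeout=5) as resp:
    print(resp.status)  # expect 200 if httpbin.org is reachable
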
import urllib.request

def getPage(url):
    page = urllib.request.urlopen(url)
    print(page)
    ## The return value is an http.client.HTTPResponse object
    print(type(page))
    print(page.status)
    print(page.getheaders())
    ## Read the page source and decode it as utf-8
    return page.read().decode('utf-8')

## Test with a web page
getPage("https://docs.python.org/3/library/urllib.html")
# <http.client.HTTPResponse object at 0x0000000004E56748>
# <class 'http.client.HTTPResponse'>
# 200
# [('Connection', 'close'), ('Content-Length', '9530'), ('Server', 'nginx'), ('Content-Type', 'text/html')...
# '\n<!DOCTYPE html>\n\n<html xmlns="http://www.w3.org/1999/xhtml">\n <h...

## Test with an image
getPage("https://cdn.pixabay.com/photo/2020/09/23/19/58/halloween-5596921__340.jpg")
# HTTP Error 403: Forbidden

The request is rejected with 403 Forbidden. In that case we need to use urllib.request.Request to fake a browser request header, as follows:

headers = {
    'USER-AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

def getPage(url):
    page = urllib.request.Request(url, headers=headers)
    print(page)
    print(type(page))
    page = urllib.request.urlopen(page)
    print(page.getheaders())
    return page.read()

Passing the image URL above to this version of getPage now produces:

# <urllib.request.Request object at 0x000000000512CD08>
# <class 'urllib.request.Request'>
# [('Content-Type', 'image/jpeg'), ('Content-Length', '27784'), ('Connection', 'close'), ('Set-Cookie', '__cfduid=d40565824d114929f762f0330755fd5c91604544653; ...
# b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C...

However, fetching an image from pixiv, such as https://i.pximg.net/img-original/img/2020/10/28/00/04/57/85281729_p0.jpg, still returns HTTP Error 403: Forbidden. Here we also need to add a Referer header for pixiv; see carry_1024's article: https://blog.csdn.net/ycarry2017/article/details/79599539

# If the site's certificate is reported as untrusted, simply skip verification
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import urllib.request
## Add the request header, this time via an opener
opener = urllib.request.build_opener()
opener.addheaders = [('Referer', "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=60541651")]
urllib.request.install_opener(opener)

url = "https://i.pximg.net/img-original/img/2020/10/28/00/04/57/85281729_p0.jpg"
## Copy a network object denoted by a URL to a local file.
urllib.request.urlretrieve(url, "D://acg-girl.jpg")
# ('D://acg-girl.jpg', <http.client.HTTPMessage at 0x535de48>)

## After adding the Referer entry to headers, getPage also works
headers = {
    'USER-AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'Referer': "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=60541651"
}

URL Parsing: Parse

Next, let's look at urllib's parse module, which makes manipulating URLs very convenient.

The parse module supports the following URL schemes:

file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.

In other words, it supports almost every common internet scheme.


from urllib.parse import urlparse as pr
from urllib.parse import urlunparse as upr

# scheme://netloc/path;parameters?query#fragment
result = pr('http://www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt')
result.hostname
# 'www.xiami.com'
print(type(result), result)
# <class 'urllib.parse.ParseResult'>
# ParseResult(scheme='http', netloc='www.xiami.com', path='/play', \
# params='', query='ids=/song/playlist/id/1/type/9', fragment='loadedt')
[print(result[i]) for i in range(len(result))]
# http
# www.xiami.com
# /play
#
# ids=/song/playlist/id/1/type/9
# loadedt

## To get the correct netloc value, the url must start with //, otherwise the host ends up in path
print(pr('www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt', scheme="https"))
# ParseResult(scheme='https', netloc='', path='www.xiami.com/play',\
# params='', query='ids=/song/playlist/id/1/type/9', fragment='loadedt')
print(pr('https://www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt', scheme="http", allow_fragments=False))
# ParseResult(scheme='https', netloc='www.xiami.com', path='/play', \
# params='', query='ids=/song/playlist/id/1/type/9#loadedt', fragment='')

## Reassemble the pieces produced by urlparse() back into a url
data = [result.scheme, result.netloc, result.path, result.params, result.query, result.fragment]
print(upr(data))
# http://www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt

urlsplit is similar to urlparse, but it does not include params.

from urllib.parse import urlsplit as sp
from urllib.parse import urlunsplit as usp

# scheme://netloc/path?query#fragment
result = sp('http://www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt')
print(type(result),result)
# <class 'urllib.parse.SplitResult'>
# SplitResult(scheme='http', netloc='www.xiami.com', path='/play', \
# query='ids=/song/playlist/id/1/type/9', fragment='loadedt')
data = [result.scheme, result.netloc, result.path, result.query,result.fragment]
print(usp(data))
# http://www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt

The urljoin function builds an absolute URL. If the url argument is itself an absolute URL (i.e. it starts with // or scheme://), its hostname and scheme take precedence in the result.

from urllib.parse import urljoin as jo

print(jo("http://www.xiami.com/","play?ids=/song/playlist/id/1/type/9#loadedt"))
print(jo("http://www.xiami.com/play?ids=/song/playlist/","play?ids=/song/playlist/id/1/type/9#loadedt"))
print(jo("http:","//www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt"))
# All three calls print the same URL:
# http://www.xiami.com/play?ids=/song/playlist/id/1/type/9#loadedt

Other helpers: urlencode, parse_qs, quote, and unquote.

from urllib.parse import urlencode, parse_qs, quote, unquote

params = {
    'tn': 'baidu',
    'wd': 'google chrome',
}
base_url = 'http://www.baidu.com/s?'
base_url + urlencode(params)
# 'http://www.baidu.com/s?tn=baidu&wd=google+chrome'

print(parse_qs(urlencode(params)))
# {'tn': ['baidu'], 'wd': ['google chrome']}


'https://www.baidu.com/s?wd=' + quote("百度")
# 'https://www.baidu.com/s?wd=%E7%99%BE%E5%BA%A6'

url = 'https://www.baidu.com/s?wd=%E7%99%BE%E5%BA%A6'
print(unquote(url))
# https://www.baidu.com/s?wd=百度

Error Handling: Error

When we make HTTP requests, we inevitably run into errors such as 404, connection timeouts, or refused connections. urllib's error module lets us handle these exceptions.
Exception handling mainly involves two classes: urllib.error.URLError and urllib.error.HTTPError.

  • URLError is the base class of the urllib.error exceptions and catches exceptions raised by urllib.request. It has a single attribute, reason, which gives the cause of the error.

    from urllib import request, error

    url = "https://www.google.com"
    try:
        response = request.urlopen(url)
    except error.URLError as e:
        print(e.reason)

    # [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
  • HTTPError handles errors from HTTP and HTTPS requests specifically and has three attributes:

    • code: the status code returned by the HTTP request.
    • reason: same as in the parent class, the cause of the error.
    • headers: the response headers returned by the HTTP request.
    from urllib import request, error

    url = "https://www.google.com"
    try:
        response = request.urlopen(url)
    except error.HTTPError as e:
        print('code:', e.code, '\n')
        print('reason:', e.reason, '\n')
        print('headers:', e.headers, '\n')
    # google.com times out here, raising a URLError that "except HTTPError" does not catch:
    # TimeoutError Traceback (most recent call last)
    # ...
    # URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

Here we create a getInfo function that wraps the request and handles the exceptions.

import socket
from urllib import request, parse, error

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Host': 'httpbin.org'
}
form_data = {
    'words1': 'you\'re a miracle',
    'words2': 'what do you fear'
}

def getInfo(url, data="", headers={}, method="GET", timeout=1):
    try:
        dat = bytes(parse.urlencode(data), encoding='utf8')
        req = request.Request(url=url, data=dat, headers=headers, method=method)
        req = request.urlopen(req, timeout=timeout)
        print(req.read().decode('utf-8'))
    except error.HTTPError as e:
        print(e.reason, e.code, e.headers, sep='\n')
    except error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
        else:
            pass
getInfo("http://httpbin.org/post", form_data, headers, "POST", 5)
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "words1": "you're a miracle",
#     "words2": "what do you fear"
#   },
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Connection": "close",
#     "Content-Length": "49",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
#   },
#   "json": null,
#   "origin": "183.246.20.118",
#   "url": "http://httpbin.org/post"
# }

getInfo('http://httpbin.org/get')
# {
#   "args": {},
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Connection": "close",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "Python-urllib/3.6"
#   },
#   "origin": "183.246.20.118",
#   "url": "http://httpbin.org/get"
# }

getInfo('http://httpbin.org/get',timeout=.1)
# TIME OUT
getInfo('http://httpbin.org/index.htm')
# NOT FOUND
# 404
# Connection: close
# Server: meinheld/0.6.1
# Date: Sun, 11 Mar 2018 06:25:37 GMT
# Content-Type: text/html
# Content-Length: 233
# Access-Control-Allow-Origin: *
# Access-Control-Allow-Credentials: true
# X-Powered-By: Flask
# X-Processed-Time: 0
# Via: 1.1 vegur

Handler

If we need to add a proxy to a request or handle Cookies, we need Handler and OpenerDirector.

https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler

A Handler can take care of various aspects of a request (HTTP, HTTPS, FTP, etc.). Its common classes include:

  • HTTPDefaultErrorHandler: handles HTTP response errors.

  • HTTPRedirectHandler: handles HTTP redirects.

  • HTTPCookieProcessor(cookiejar=None): handles Cookies in HTTP requests.

  • ProxyHandler(proxies=None): sets a proxy.

  • HTTPPasswordMgr: manages passwords; it maintains a table of usernames and passwords.

  • HTTPBasicAuthHandler: handles login authentication, usually together with HTTPPasswordMgr.

  • HTTPPasswordMgrWithDefaultRealm

  • HTTPPasswordMgrWithPriorAuth

  • ...

As for OpenerDirector: the urlopen() method we used earlier is in fact an Opener that urllib provides for us.

An opener object is created with build_opener(handler). To make a custom opener the global default, use install_opener(opener).
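
As a rough sketch of how these pieces fit together (the cookie jar and the 127.0.0.1 proxy address below are placeholders, not values taken from this article), several handlers can be passed to a single build_opener() call and registered globally with install_opener(), after which plain urlopen() calls are routed through them:

import http.cookiejar
import urllib.request

# Combine a cookie processor and a proxy handler into one opener
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
    urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})  # placeholder proxy
)

# After install_opener(), urlopen() uses this opener by default
urllib.request.install_opener(opener)
# response = urllib.request.urlopen('http://httpbin.org/get')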

Using a Proxy

Some sites throttle request frequency or block your IP outright. In that case a proxy helps us get around the restriction by making the remote server believe the requests come from different users.

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener, install_opener

url = "http://tieba.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}


proxy_handler = ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})

opener = build_opener(proxy_handler)
try:
    # urllib.request.install_opener(opener)
    # request = urllib.request.Request(url=url, headers=headers)
    # response = urllib.request.urlopen(request)
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

We can also use the requests module, which handles proxies more conveniently.

import requests

proxies = {
    'http': 'url',
    'https': 'url'
}

## http://user:password@host:port
proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port'
}

requests.get("https://www.baidu.com", proxies=proxies)

Authentication: Auth

Some sites require you to log in before you can browse their pages, which is where authentication comes in.

We can instantiate a username/password manager with HTTPPasswordMgrWithDefaultRealm(), add the credentials with add_password(), build a handler with HTTPBasicAuthHandler(), obtain an opener object with build_opener(), and finally issue the request with the opener's open() method.

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

url = "http://tieba.baidu.com/"
user = 'user'
password = 'password'
pwdmgr = HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, user, password)
auth_handler = HTTPBasicAuthHandler(pwdmgr)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Cookies

If the requested page requires authentication every time, we can use Cookies to log in automatically and skip the repeated login step.

To capture Cookies, instantiate a cookie object with http.cookiejar.CookieJar(), build a handler with urllib.request.HTTPCookieProcessor, and then call the opener's open() method.

The following example requests Baidu, grabs the Cookies, and saves them to a file:

import http.cookiejar
import urllib.request

url = 'http://www.baidu.com'
fileName = 'cookie.txt'

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
print(response)
# <http.client.HTTPResponse object at 0x04D421F0>

for item in cookie:
    print(item.name + "=" + item.value)
# BAIDUID=7A55D7DB4ECB570361D1D1186DD85275:FG=1
# ...

f = open(fileName, 'a')
for item in cookie:
    f.write(item.name + " = " + item.value + '\n')
f.close()

You can also build the cookie jar with http.cookiejar.LWPCookieJar or http.cookiejar.MozillaCookieJar:

filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename) # cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open(url)
cookie.save(ignore_discard=True, ignore_expires=True)
## LWP-Cookies-2.0
# Set-Cookie3: BAIDUID="990E47C14A144D813BB6629BEA0D1BEF:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-03-29 08:56:02Z"; version=0
# ...

Then we can load the cookie file back in:

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
# <!DOCTYPE html>
# <!--STATUS OK-->
# ...

Robots

First, here is an example robots.txt, using Taobao's as a sample:

https://www.taobao.com/robots.txt

User-agent:  Baiduspider
Allow: /article
Allow: /oshtml
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Disallow: /

User-agent: Bingbot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Disallow: /

User-Agent: 360Spider
Allow: /article
Allow: /oshtml
Disallow: /

User-Agent: Yisouspider
Allow: /article
Allow: /oshtml
Disallow: /

User-Agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Disallow: /

User-Agent: Yahoo! Slurp
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Disallow: /

User-Agent: *
Disallow: /

To parse robots.txt, we use RobotFileParser.

from urllib.robotparser import RobotFileParser

url = "http://httpbin.org/robots.txt"

rp = RobotFileParser(url)
rp.read()
print(rp.can_fetch('*', 'http://httpbin.org/deny'))
print(rp.can_fetch('*', "http://httpbin.org/image"))
# False
# True
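
RobotFileParser can also parse rules you already have as text via its parse() method, which lets us check the Taobao rules listed above without fetching them again. A small sketch (the rule snippet is abridged from the listing above, and MyCrawler is a made-up user agent):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Baiduspider
Allow: /article
Disallow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # feed the rules as a list of lines
print(rp.can_fetch('Baiduspider', 'https://www.taobao.com/article'))  # True: /article is allowed
print(rp.can_fetch('MyCrawler', 'https://www.taobao.com/article'))    # False: falls under "User-agent: *"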

FAQ

HTTP Error 403: Forbidden

  1. Wrap the request so that it looks like a browser request:

    headers = {
        'USER-AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
  2. Lower the request frequency: pause briefly between requests, or send the requests from different IPs (see the sketch after this list).

  3. Paste the request URL into a browser to check it; if it is still 403 Forbidden, verify that the URL is correct and not out of date.

  4. Reference: python 爬虫禁止访问解决方法(403) (working around 403 errors when crawling): https://blog.csdn.net/u011808673/article/details/80609221
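
For point 2, a minimal rate-limiting sketch (the URL list and the one-second delay are arbitrary choices):

import time
import urllib.request

urls = ['http://httpbin.org/get'] * 3  # placeholder URLs

for u in urls:
    with urllib.request.urlopen(u, timeout=5) as resp:
        print(u, resp.status)
    time.sleep(1)  # pause briefly between requests to keep the frequency low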

REFERENCES