baseUrl = "http://mwhls.top/page/1"
def askUrl(baseUrl):
    html = ""
    try:
        response = urllib.request.urlopen(baseUrl)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html</code></pre>

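Code explanation

First, define the string html to hold the HTML data fetched next.

Then fetch the data inside a try block.

urllib.request.urlopen(baseUrl) opens the page at baseUrl and returns a response object for it.

Next, read() reads the response body as bytes, decode("utf-8") decodes it, and the result is stored in html.

If the request fails, the except clause handles it: it prints the HTTP status code (code) and the failure reason (reason) whenever the exception carries them.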
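
To show why the except clause checks both attributes, here is a minimal sketch that builds the two exception types by hand instead of sending a request. describe_error is a hypothetical helper, not part of the original code, and the URL and messages are made up for illustration. It relies on urllib.error.HTTPError being a subclass of URLError that carries a code, while a plain URLError only carries a reason.

```python
import io
import urllib.error

def describe_error(e):
    """Summarize a URLError the same way the except clause inspects it."""
    parts = []
    if hasattr(e, "code"):      # only HTTPError has a status code
        parts.append(f"code={e.code}")
    if hasattr(e, "reason"):    # both URLError and HTTPError have a reason
        parts.append(f"reason={e.reason}")
    return ", ".join(parts)

# Hand-built exceptions for illustration; no request is sent.
network_error = urllib.error.URLError("connection refused")
http_error = urllib.error.HTTPError(
    "http://example.invalid/", 418, "I'm a teapot", None, io.BytesIO(b"")
)

print(describe_error(network_error))
print(describe_error(http_error))
```

A failure that never reaches the server (DNS, refused connection) prints only a reason, while an HTTP error response prints both fields.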

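Handling poor network conditions

If a page takes unacceptably long to load, the request has to be abandoned before it causes problems downstream.

Use the timeout parameter to cap the maximum wait time. Example: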

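import socket
import urllib.error
import urllib.request

try:
    # 0.01 s is deliberately too short, so this request is expected to time out
    response = urllib.request.urlopen("http://mwhls.top", timeout=0.01)
    print(response.read().decode("utf-8"))
except (urllib.error.URLError, socket.timeout) as e:
    print("time out!")

The timeout parameter is given in seconds, not milliseconds, so timeout=0.01 means 10 ms. A timeout while connecting is reported as a URLError, while a timeout while waiting for or reading the response can surface as socket.timeout, which is why both are caught.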
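
The timeout handling can be wrapped in a small function, sketched below. fetch_with_timeout and its timeout_s parameter are hypothetical names, not part of the original code, and returning None on failure is an assumed convention. The value is passed straight to urlopen, which interprets it in seconds.

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout_s):
    """Return the decoded page, or None if the request fails or times out.

    timeout_s is in seconds, matching urlopen's timeout parameter.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout) as e:
        # URLError covers connection failures and HTTP errors;
        # socket.timeout covers a stall while waiting for the response.
        print("request failed:", e)
        return None
```

A call such as fetch_with_timeout("http://mwhls.top", 5) would wait at most about five seconds per blocking network operation before giving up.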

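Sites with anti-scraping measures

For a site without anti-scraping measures, the code above is already enough to fetch the HTML.

For a site that does block scrapers, such as Douban, a direct request does not work.

Change baseUrl above to the Douban Top 250 movies page:

https://movie.douban.com/top250?start=0

The request comes back with a 418 error, which roughly means the site has identified the client as a scraper.

So we need to pose as a normal user.

For Douban, supplying a User-Agent header is enough.

Wrap the URL and the header together in a Request object, and Douban can be accessed normally.

With baseUrl changed and the head and request lines added, the request goes through: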

import urllib.error
import urllib.request

baseUrl = "https://movie.douban.com/top250?start=1"

def askUrl(baseUrl):
    head = {    # headers sent with the request to mimic a browser
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    }
    # Bundle the URL and the headers into a single Request object
    request = urllib.request.Request(baseUrl, headers=head)

    html = ""
    try:
        # Open the Request (not the bare URL) so the headers are actually sent
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html

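By wrapping requests this way, we can also make them appear to come from different browsers and receive the HTML that each browser would get.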
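
As a rough sketch of switching browsers, the User-Agent can be picked from a small table before the Request is built. USER_AGENTS and buildRequest are hypothetical names, and the mobile string is an assumed example rather than something from the original post. No request is sent here; the code only builds the Request and inspects the header it would carry.

```python
import urllib.request

# Illustrative User-Agent strings: "desktop" is the one used earlier in this
# post, "mobile" is an assumed example of a phone browser.
USER_AGENTS = {
    "desktop": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
    "mobile": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1",
}

def buildRequest(url, browser="desktop"):
    """Build a Request that presents itself as the chosen browser."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENTS[browser]})

request = buildRequest("https://movie.douban.com/top250?start=0", "mobile")
# Request stores header names capitalized, so the lookup key is "User-agent"
print(request.get_header("User-agent"))
```

The resulting object can be passed to urllib.request.urlopen in place of the bare URL, exactly as request is in askUrl.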