BeautifulSoup Parsers Comparison

Introduction

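Lately I have been writing a lot of web scrapers in Python, and I got curious about how BeautifulSoup's different parsers actually differ, so I wrote this post to keep notes.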

import requests
from bs4 import BeautifulSoup

url = "https://google.com/"

resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")
# or
soup = BeautifulSoup(resp.text, "lxml")

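Scraping tutorials online often use one of the two forms above. The only difference is html.parser versus lxml: this argument tells BeautifulSoup which parser to use for the HTML. But how many parsers does BeautifulSoup actually support, and how do they differ? A quick Google search turned up a StackOverflow post and the BeautifulSoup documentation.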

Below is a comparison of the different parsers.

tl;dr
Fastest: lxml
Best compatibility (most lenient with broken HTML): html5lib
Otherwise: html.parser
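
The tl;dr above can be turned into a small helper. This is only a sketch: pick_parser and its prefer_lenient parameter are hypothetical names made up for illustration, not part of BeautifulSoup. It uses the standard library's importlib.util.find_spec to check whether lxml or html5lib is installed, and falls back to html.parser otherwise.

```python
import importlib.util


def pick_parser(prefer_lenient=False):
    """Return a BeautifulSoup parser name following the tl;dr rule.

    Hypothetical helper for illustration only; not part of bs4.
    """
    # find_spec returns None when a top-level package is not installed.
    has_lxml = importlib.util.find_spec("lxml") is not None
    has_html5lib = importlib.util.find_spec("html5lib") is not None

    # Compatibility matters most: use html5lib if it is available.
    if prefer_lenient and has_html5lib:
        return "html5lib"
    # Otherwise prefer speed: use lxml if it is available.
    if has_lxml:
        return "lxml"
    # Built into Python, so it is always a safe fallback.
    return "html.parser"


if __name__ == "__main__":
    print(pick_parser())
    print(pick_parser(prefer_lenient=True))
```

The returned string can then be passed as the second argument, for example BeautifulSoup(resp.text, pick_parser()).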

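Parser	Pros	Cons
html.parser	Built into Python, no extra installation needed	Average speed and compatibility
lxml	Fast	Requires extra installation (C dependency)
html5lib	Best compatibility; works with all Python versions	Slow; requires extra installation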
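
To see what the compatibility column means in practice, one option is to feed the same broken snippet to each parser and compare the results. The sketch below is only an illustration: parse_with_each and BROKEN_HTML are hypothetical names, and the snippet <a></p> is borrowed from the parser-differences section of the BeautifulSoup documentation. Parsers that are not installed are reported as None instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: a closing </p> with no matching opening tag.
BROKEN_HTML = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and return the serialized trees.

    Hypothetical helper for illustration; maps parser name to the resulting
    HTML string, or to None when that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, output in parse_with_each(BROKEN_HTML).items():
        shown = output if output is not None else "not installed"
        print(f"{name:12} -> {shown}")
```

According to the BeautifulSoup documentation, html.parser simply drops the stray </p> and yields <a></a>, lxml wraps the result in html and body tags, and html5lib additionally adds a head and pairs the </p> with an opening <p>. The exact output may vary between versions.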

References
