Beautiful Soup, an allusion to the Mock Turtle’s song in Chapter 10 of Lewis Carroll’s Alice’s Adventures in Wonderland, is a Python library for quickly pulling data out of HTML and XML files by navigating, searching, and modifying the parse tree.

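As a minimal sketch of what that looks like in practice (using the stdlib-backed html.parser):

```python
from bs4 import BeautifulSoup

# Parse a small fragment and pull data out of the tree.
soup = BeautifulSoup("<p class='title'><b>Hello</b></p>", "html.parser")
print(soup.p.b.string)   # Hello
print(soup.p["class"])   # ['title']
```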

Installation

First, install Beautiful Soup from your package manager of choice:

apt-get install python3-bs4 ## or 
pip3 install beautifulsoup4 ## or
pip install beautifulsoup4

Then add the parsers:

apt-get install python-lxml ## or
pip3 install lxml
apt-get install python-html5lib ## or
pip3 install html5lib

## if installation fails on Linux, install the build dependencies first:
sudo apt-get install -y python3-dev build-essential libssl-dev libffi-dev libxml2 libxml2-dev libxslt1-dev zlib1g-dev
## upgrade
pip3 install --upgrade beautifulsoup4
pip list|grep "beautifulsoup4"
# beautifulsoup4 4.9.1

python -V
# Python 3.7.0

## The BeautifulSoup class lives in the bs4 module
## import bs4
from bs4 import BeautifulSoup

## Test: locate the Bing background <div> (getPage is a helper returning the page HTML)
tree = BeautifulSoup(getPage("https://www.bing.com/"), "lxml")
tree.div.select('#bgDiv') # the image itself is JS rendered
# [<div data-minhdhor="" data-minhdver="" data-priority="0" id="bgDiv"></div>]

Parsers

Bs4 supports four parsers, which differ considerably in behavior and speed.

Parser      | Usage                                | Speed
html.parser | BeautifulSoup(markup, "html.parser") | decent
lxml        | BeautifulSoup(markup, "lxml")        | very fast
lxml-xml    | BeautifulSoup(markup, "xml")         | very fast (XML)
html5lib    | BeautifulSoup(markup, "html5lib")    | very slow (HTML5)
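Since lxml is an optional dependency, a small helper can prefer it and fall back to the bundled parser when it is missing (make_soup is a hypothetical helper name, not part of bs4):

```python
from bs4 import BeautifulSoup

def make_soup(markup):
    """Prefer lxml when it is installed; fall back to the stdlib-backed parser."""
    try:
        import lxml  # noqa: F401
        return BeautifulSoup(markup, "lxml")
    except ImportError:
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>hi</p>")
print(soup.p.text)  # hi
```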

Differences between parsers

## Create a BeautifulSoup object
BeautifulSoup("<a></p>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a/>
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
BeautifulSoup("<a></p>", "html.parser")
# <a></a>
html_doc = """
<!DOCTYPE html>
<!-- This is a html file. -->
<html lang="en"><head><meta charset="UTF-8" /><title>Alice in Wonderland </title></head>
<body>
<h1>Alice in Wonderland</h1>
<p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p>
<pre>start <span>initialize the story</span><i> end</i></pre>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

<p class="title" id= "alice">
<b>The Alice's story!!!</b>
<p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
<p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
<p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
<p class="storymore">......</p>
</p>
</html>
"""

## let us make a soup
soup = BeautifulSoup(html_doc, "xml")
lsoup = BeautifulSoup(html_doc, "lxml")

## This will prettify the html document.
print(soup.prettify())
## str(soup) gives a non-pretty print
print(lsoup)
<!DOCTYPE html>
<!-- This is a html file. --><html lang="en"><head><meta charset="utf-8"/><title>Alice in Wonderland </title></head>
<body>
<h1>Alice in Wonderland</h1>
<p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p>
<pre>start <span>initialize the story</span><i> end</i></pre>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<p class="title" id="alice">
<b>The Alice's story!!!</b>
</p><p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
<p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
<p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
<p class="storymore">......</p>
</body></html>

Basic Usage

We’ll explore the Alice in Wonderland document above, parsed with both the lxml and xml parsers:

soup.title
# <title>Alice in Wonderland </title>
soup.title.name
# 'title'
soup.title.string
soup.title.text #== soup.title.getText()
# 'Alice in Wonderland '
soup.title.parent
soup.head
# <head><meta charset="UTF-8"/><title>Alice in Wonderland </title></head>
soup.title.parent.name
# 'head'
soup.head.contents
# [<meta charset="UTF-8"/>, <title>Alice in Wonderland </title>]
soup.head.contents[1].contents[0]
soup.head.contents[1].string
# 'Alice in Wonderland '
soup.head.meta["charset"]
# "UTF-8"

soup.p['class'] #== soup.p.get('class')
# 'title'
soup.a['href']
# 'http://example.com/elsie'
soup.p.attrs
# {'class': 'title', 'id': 'dormouse'}

Find

To find the elements you want in the html document, these are the methods available:

  • find_all( name , attrs , recursive , text , **kwargs )
  • find( name , attrs , recursive , text , **kwargs )
  • find_parents( name , attrs , recursive , text , **kwargs )
  • find_parent( name , attrs , recursive , text , **kwargs )
  • find_next_siblings( name , attrs , recursive , text , **kwargs )
  • find_next_sibling( name , attrs , recursive , text , **kwargs )
  • find_previous_siblings( name , attrs , recursive , text , **kwargs )
  • find_previous_sibling( name , attrs , recursive , text , **kwargs )
  • find_all_next( name , attrs , recursive , text , **kwargs )
  • find_next( name , attrs , recursive , text , **kwargs )
  • find_all_previous( name , attrs , recursive , text , **kwargs )
  • find_previous( name , attrs , recursive , text , **kwargs )
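Each plural method has a singular counterpart: find() returns the first match (or None), while find_all() always returns a list. A quick sketch:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")
print(doc.find("p"))               # the first <p>
print(doc.find_all("p", limit=1))  # a list containing that same first <p>
print(doc.find("div"))             # None -- find_all would give []
```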
import re
len(soup.find_all('a'))
# 6
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
# <a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>,
# <a href="http://example.com/tears" id="link5">The Pool of Tears</a>,
# <a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]
soup.find(id="link3")["href"]
# 'http://example.com/tillie'

soup.find_all("p", class_="title") # class is a reserved keyword, hence class_
soup.find_all("p", "title")
# [<p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p>,
# <p class="title" id="alice">
# <b>The Alice's story!!!</b>
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
# <p class="storymore">......</p>
# </p>]
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(["h1", "pre"])
# [<h1>Alice in Wonderland</h1>,
# <pre>start <span>initialize the story</span><i> end</i></pre>]
## recursive=False considers only direct children for find_all() and find()
soup.html.find_all("title", recursive=False)
# []
soup.head.find_all("title", recursive=False)
# [<title>Alice in Wonderland </title>]


soup.find(class_="story").find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all("a", attrs={"class": "sister"})
soup.find_all("a",class_="sister")
soup.find_all("a",class_=re.compile("sister"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

tSoup = BeautifulSoup("<div>title<p>p</p></div><div>title<div>div</div></div>", "lxml")
tSoup.div.contents
# ['title', <p>p</p>]
tSoup.div.find_all(text=True)
tSoup.div.find_all(string=True)
tSoup.div(text=True)
tSoup.div(string=True)
# ['title', 'p']

for tag in tSoup.find_all('div'):
    print(tag(text=True))
# ['title', 'p']
# ['title', 'div']
# ['div']
for tag in tSoup.find_all('div'):
    print(tag(text=False))
# [<p>p</p>]
# [<div>div</div>]
# []
for tag in tSoup.find_all('div'):
    print(tag.text)
# titlep
# titlediv
# div
for tag in tSoup.find_all('div'):
    print(tag.string)
# None
# None
# div

name_soup = BeautifulSoup('<input name="email" id="email"/>', "lxml")
# <html><body><input id="email" name="email"/></body></html>
name_soup.input["name"]
# 'email'
name_soup.find_all(name='email') # the name argument matches tag names, not the "name" attribute
# []
name_soup.find_all(id="email")
# [<input id="email" name="email"/>]
name_soup.find_all(attrs={"name": "email","id":"email"})
# [<input id="email" name="email"/>]

We can also search with regular expressions, using re.compile from the built-in re module.

soup.find(text=re.compile("sisters")) # text containing "sisters"
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'
soup.find_all(text=re.compile("A"))
# ['Alice in Wonderland ',
# 'Alice in Wonderland',
# "The Alice's story!!!",
# 'A Caucus-Race and a Long Tale']
soup.find_all(href=re.compile("elsie"), id='link1')
soup.find_all(href="http://example.com/elsie", id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
# b
for tag in soup.find_all(re.compile(r"\w{2}t")):
    print(tag.name)
# meta
# title

Looping over the results:

for link in soup.find_all('a'):
    print(link.get('href'))
for link in soup.find_all('a'):
    print(link['href'])
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
# http://example.com/rabbit
# http://example.com/tears
# http://example.com/tale

for text in soup.find_all("p"):
    print(text.text)
for text in soup.body.find_all("p", recursive=True):
    print(text.get_text())
## xml
# The Dormouse's story!!!
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...

# The Alice's story!!!
# 1.Down the Rabbit-Hole
# 2.The Pool of Tears
# 3.A Caucus-Race and a Long Tale
# ......

# 1.Down the Rabbit-Hole
# 2.The Pool of Tears
# 3.A Caucus-Race and a Long Tale
# ......

## lxml
# The Dormouse's story!!!
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...

# The Alice's story!!!

# 1.Down the Rabbit-Hole
# 2.The Pool of Tears
# 3.A Caucus-Race and a Long Tale
# ......

for text in soup.find_all("p"):
    print(text.string)
# The Dormouse's story!!!
# None
# ...
# None
# None
# None
# None
# ......

Attributes

tree = BeautifulSoup('<b class="boldest">Extremely bold</b>',"lxml")
print(tree.name)
# [document]
tag = tree.b
print(type(tag)) # <class 'bs4.element.Tag'>
print(type(tag.string)) #<class 'bs4.element.NavigableString'>
tag.name = "blockquote"
print(tag) # <blockquote class="boldest">Extremely bold</blockquote>
print(tag['class']) # ['boldest']
print(tag.attrs) # {'class': 'boldest'}
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

## tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

print(tree)
## xml
# <?xml version="1.0" encoding="utf-8"?>
# <blockquote>No longer bold</blockquote>

## lxml/html5lib
# <html><body><blockquote>No longer bold</blockquote></body></html>

## html.parser
# <blockquote>No longer bold</blockquote>
new_tag = tree.new_tag("a", href="http://www.example.com")
tag.append(new_tag)
new_tag.string = "Link text."
print(tag)
# <blockquote>No longer bold<a href="http://www.example.com">Link text.</a></blockquote>
new_tag = tree.new_tag("div",id="hi")
new_tag.string = "Don't"
tag.a.string.insert_before(new_tag) # == tag.a.insert(0, new_tag)
print(tag)
# <blockquote>No longer bold<a href="http://www.example.com"><div id="hi">Don't</div>Link text.</a></blockquote>

new_tag = tree.new_tag("i")
new_tag.string = "Nothing"
tag.a.replace_with(new_tag)
print(tag)
# <blockquote>No longer bold<i>Nothing</i></blockquote>

from bs4 import Comment
new_comment = tree.new_string("Nice to see you.", Comment) # <class 'bs4.element.Comment'>
tag.i.insert_after(new_comment)
tag.append("ever")
tag.name="p"
print(tag)
# <p>No longer bold<i>Nothing</i><!--Nice to see you.-->ever</p>
tag.i.insert(1,new_comment)
print(tag)
# <p>No longer bold<i>Nothing<!--Nice to see you.--></i>ever</p>
tag.find_all(text=lambda text:isinstance(text, Comment))
# ['Nice to see you.']
tag.clear() # removes the contents
print(tag)
# <p></p>

tag.string="I wish I was bold"
tag.string.wrap(tree.new_tag("b",id="id1 id2"))
tag.b['class'] = ['class1', 'class2']
print(tag)
# <p><b class="class1 class2" id="id1 id2">I wish I was bold</b></p>
## select() takes a CSS selector string
tag.select("b.class1.class2")
# [<b class="class1 class2" id="id1 id2">I wish I was bold</b>]
print(tag.b["class"]) # rel, rev, accept-charset, headers, and accesskey are list
# ['class1', 'class2']
print(tag.b["id"])
# id1 id2
print(tag.b.get_attribute_list("id"))
# ['id1 id2']

# Note: If you parse a document as XML, there are no multi-valued attributes:

tag.b.unwrap()
print(tag)
# <p>I wish I was bold</p>

tag = tree.p.extract() # removes the tag from the tree and returns it
print(tree)
# <html><body></body></html>
print(tag)
# <p>I wish I was bold</p>

tree.body.append(tree.new_tag("a",href="#"))
tree.a.string="Alice"
print(tree)
# <html><body><a href="#">Alice</a></body></html>
tree.a.replace_with(tree.new_tag("p", id="alice"))
print(tree)
# <html><body><p id="alice"/></body></html>
tag = tree.p.decompose() # completely destroyed; decompose() returns None
print(tag)
# None
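The earlier note that XML-parsed documents have no multi-valued attributes can be checked directly (the xml parser requires lxml):

```python
from bs4 import BeautifulSoup

markup = '<p class="a b"></p>'
# HTML parsers treat class as a multi-valued attribute...
print(BeautifulSoup(markup, "html.parser").p["class"])  # ['a', 'b']
# ...while the xml parser keeps it as a single string
print(BeautifulSoup(markup, "xml").p["class"])          # 'a b'
```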

Relationship, Contents

Results differ considerably between parsers (indexing, "\n" text nodes, etc.); for convenience, only the xml/lxml parsers are used here.

for i, child in enumerate(soup.p.children):
    print(i, child)
# 0 <b>The Dormouse's story!!!</b>
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
len(list(lsoup.descendants))
len(list(soup.descendants))
# 63
list(soup.children)==soup.contents # children = contents
# True
for child in soup.descendants: # all of a tag's children, recursively
    print(child)

list(soup.stripped_strings)
# ['Alice in Wonderland',
# 'Alice in Wonderland',
# "The Dormouse's story!!!",
# 'start',
# 'initialize the story',
# 'end',
# 'Once upon a time there were three little sisters; and their names were',
# 'Elsie',
# ',',
# 'Lacie',
# 'and',
# 'Tillie',
# ';\nand they lived at the bottom of a well.',
# '...',
# "The Alice's story!!!",
# '1.',
# 'Down the Rabbit-Hole',
# '2.',
# 'The Pool of Tears',
# '3.',
# 'A Caucus-Race and a Long Tale',
# '......']
for i, parent in enumerate(list(soup.body.parent)):
    print(i, parent)
# 0 <head><meta charset="UTF-8"/><title>Alice in Wonderland </title></head>
# 1

# 2 <body>
# <h1>Alice in Wonderland</h1>
# <p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p>
# <pre>start <span>initialize the story</span><i> end</i></pre>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# <p class="title" id="alice">
# <b>The Alice's story!!!</b>
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
# <p class="storymore">......</p>
# </p>
# </body>
print(list(enumerate(soup.a.next_siblings)))
# [(0, ',\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' and\n'),
# (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, ';\nand they lived at the bottom of a well.')]
soup.h1.next_element
# 'Alice in Wonderland'
list(soup.a.next_siblings)
# [',\n',
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# ' and\n',
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
# ';\nand they lived at the bottom of a well.']
list(soup.find(id="alice").next_siblings)
# ['\n']
soup.find("a", id="link3").next_element
# 'Tillie'
soup.h1.next_element
# 'Alice in Wonderland'
soup.h1.previous_element
# '\n'
soup.find(string="Lacie").find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find(string="Lacie").find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
soup.a.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find("p", "story").find_next_sibling("p")
# <p class="story">...</p>
soup.a.find_all_next(string=True)
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
# ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']
soup.a.find_next("p")
# <p class="story">...</p>
soup.h1.find_next("i")
# <i> end</i>
soup.find(id="link3").find_all_previous("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
for p in lsoup.find_all(name='p'):
    print(p.find_all(name='a'))
    for a in p.find_all(name='a'):
        print(a.string)
# []
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Elsie
# Lacie
# Tillie
# []
# []
# [<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>]
# Down the Rabbit-Hole
# [<a href="http://example.com/tears" id="link5">The Pool of Tears</a>]
# The Pool of Tears
# [<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]
# A Caucus-Race and a Long Tale
# []

Functions

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>,
# <p class="story">...</p>,
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>,
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>,
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>,
# <p class="storymore">......</p>]
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
# <a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>,
# <a href="http://example.com/tears" id="link5">The Pool of Tears</a>,
# <a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>,
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>,
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(text=is_the_only_string_within_a_tag)
# ['Alice in Wonderland ',
# 'Alice in Wonderland',
# "The Dormouse's story!!!",
# 'initialize the story',
# ' end',
# 'Elsie',
# 'Lacie',
# 'Tillie',
# '...',
# "The Alice's story!!!",
# 'Down the Rabbit-Hole',
# 'The Pool of Tears',
# 'A Caucus-Race and a Long Tale',
# '......']

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find(id="alice").find_all(surrounded_by_strings):
    print(tag.name)
# b
# p
# a
# p
# a
# p
# a
# p

CSS

CSS selectors such as "." (class) and "#" (id) work with the lxml parser; the xml parser does not support them.
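A few common selector patterns on a small fragment (select() also works with html.parser, since CSS support comes from the soupsieve package):

```python
from bs4 import BeautifulSoup

frag = BeautifulSoup('<div id="a"><p class="x">1</p><p class="x y">2</p></div>',
                     "html.parser")
print([t.text for t in frag.select("div#a p.x")])  # ['1', '2']
print([t.text for t in frag.select("p.x.y")])      # ['2']
```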

## Find tags by CSS ID:
print(soup.select("a#link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select('p[id="alice"]')[0].get_text("|",strip=True))
# The Alice's story!!!|1.|Down the Rabbit-Hole|2.|The Pool of Tears|3.|A Caucus-Race and a Long Tale|......

## Find tags by CSS class:
print(soup.select(".sister")) # lxml
print(soup.select("[class~=sister]"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select_one("[class~=sister]")) #lxml
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.select('.story2')) # lxml
# [<p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>]

## Find the siblings of tags:
print(soup.select('#link1 + .sister')) #lxml
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select("#link1 ~ .sister")) #lxml
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## Find <a> tags by href:
print(soup.select('a[href]')==soup.find_all("a"))
# True

soup.select('a[href*="http://example.com/t"]') # attribute substring match
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
# <a href="http://example.com/tears" id="link5">The Pool of Tears</a>,
# <a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]
soup.select('a[href$="t"]')
# [<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>]

## Find elements by pseudo-selector:
print(soup.select("p:nth-of-type(4)"))
# [<p class="title" id="alice">
# <b>The Alice's story!!!</b>
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
# <p class="storymore">......</p>
# </p>]

## lxml
# [<p class="title" id="alice">
# <b>The Alice's story!!!</b>
# </p>]
print(soup.select("p:nth-of-type(5)"))
print(soup.select("#alice > p:nth-of-type(1)")) # xml parser; with lxml this returns []
print(soup.select("p > a:nth-of-type(2)"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Input & Output

soup = BeautifulSoup(open("Alice.html"))
with open("Alice.html") as fp:
    soup = BeautifulSoup(fp)

## The default is formatter="minimal": just enough escaping for Beautiful Soup to generate valid HTML/XML.
print(soup.prettify(formatter="minimal"))
## formatter="html" converts Unicode characters to HTML entities whenever possible;
## formatter=None makes Beautiful Soup not modify strings at all on output.
print(soup.prettify(formatter="html"))

def uppercase(str):
    return str.upper()

print(soup.prettify(formatter=uppercase))
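A quick sketch of what formatter="html" does to a non-ASCII character:

```python
from bs4 import BeautifulSoup

# With formatter="html", Unicode characters become named HTML entities where possible.
snippet = BeautifulSoup("<p>café</p>", "html.parser")
print(snippet.prettify(formatter="html"))
# the é is emitted as &eacute;
```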

Parsing only part of a document

from bs4 import SoupStrainer
only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(text=is_short_string)

print(BeautifulSoup(html_doc, "lxml", parse_only=only_a_tags).prettify())
# <!DOCTYPE html>
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# <a href="http://example.com/rabbit" id="link4">
# Down the Rabbit-Hole
# </a>
# <a href="http://example.com/tears" id="link5">
# The Pool of Tears
# </a>
# <a href="http://example.com/tale" id="link6">
# A Caucus-Race and a Long Tale
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# <!DOCTYPE html>
# start
# end
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
# 1.
# 2.
# 3.
# ......
