Beautiful Soup, an allusion to the Mock Turtle's song in Chapter 10 of Lewis Carroll's Alice's Adventures in Wonderland, is a Python library for quickly pulling data out of HTML and XML files by navigating, searching, and modifying the parse tree, and it is very convenient to use.
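As a quick taste of those three operations before we install anything, here is a minimal sketch on a throwaway snippet (the snippet and the name doc are placeholders, not part of the Alice document used below):

from bs4 import BeautifulSoup

doc = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
doc.p.b                 # navigate: <b>world</b>
doc.find("b").text      # search:   'world'
doc.b.string = "soup"   # modify the tree in place
print(doc)              # <p>Hello <b>soup</b></p>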
Installation
First, we get our Beautiful Soup from the Python distributors:
apt-get install python3-bs4
## or pip3 install beautifulsoup4
## or pip install beautifulsoup4
Then we add the ingredients, the parsers, to the soup:
apt-get install python-lxml
## or pip3 install lxml
apt-get install python-html5lib
## or pip3 install html5lib
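Which parser you feed the soup matters: different parsers build different trees from the same (especially malformed) markup. A quick comparison sketch, with the caveat that the exact output can shift slightly between parser versions:

from bs4 import BeautifulSoup

print(BeautifulSoup("<a></p>", "lxml"))
# <html><body><a></a></body></html>
print(BeautifulSoup("<a></p>", "html5lib"))
# <html><head></head><body><a><p></p></a></body></html>
print(BeautifulSoup("<a></p>", "html.parser"))
# <a></a>

With the soup and its ingredients in place, here is the document we will cook with throughout this post: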
html_doc = """ <!DOCTYPE html> <!-- This is a html file. --> <html lang="en"><head><meta charset="UTF-8" /><title>Alice in Wonderland </title></head> <body> <h1>Alice in Wonderland</h1> <p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p> <pre>start <span>initialize the story</span><i> end</i></pre> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> <p class="title" id= "alice"> <b>The Alice's story!!!</b> <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p> <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p> <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p> <p class="storymore">......</p> </p> </html> """
from bs4 import BeautifulSoup

## let us make a soup
soup = BeautifulSoup(html_doc, "xml")
lsoup = BeautifulSoup(html_doc, "lxml")
## This will prettify the html document.
print(soup.prettify())
## str(soup) gives the non-pretty version.
print(lsoup)
<!DOCTYPE html>
<!-- This is a html file. --><html lang="en"><head><meta charset="utf-8"/><title>Alice in Wonderland </title></head>
<body>
<h1>Alice in Wonderland</h1>
<p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p>
<pre>start <span>initialize the story</span><i> end</i></pre>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<p class="title" id="alice">
<b>The Alice's story!!!</b>
</p><p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
<p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
<p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
<p class="storymore">......</p>
</body></html>
Basic Usage
We'll work with the Alice in Wonderland document above, parsed with both the lxml and xml parsers.
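A few one-liners against the lxml soup show the basic attribute-style access that everything below builds on (a short sketch; the values follow from html_doc above):

lsoup.title
# <title>Alice in Wonderland </title>
lsoup.title.string
# 'Alice in Wonderland '
lsoup.h1.name
# 'h1'
lsoup.a['href']
# 'http://example.com/elsie'
lsoup.p['class']
# ['title']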
## tSoup is not defined above; judging from the outputs below it was presumably built
## from a small nested snippet like this (a reconstruction, not part of the original post):
tSoup = BeautifulSoup("<div>title<p>p</p></div><div>title<div>div</div></div>", "lxml")

for tag in tSoup.find_all('div'):
    print(tag(text=True))
# ['title', 'p']
# ['title', 'div']
# ['div']

for tag in tSoup.find_all('div'):
    print(tag(text=False))
# [<p>p</p>]
# [<div>div</div>]
# []

for tag in tSoup.find_all('div'):
    print(tag.text)
# titlep
# titlediv
# div

for tag in tSoup.find_all('div'):
    print(tag.string)
# None
# None
# div
import re

soup.find(text=re.compile("sisters"))     # find text containing "sisters"
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

soup.find_all(text=re.compile("A"))
# ['Alice in Wonderland ',
#  'Alice in Wonderland',
#  "The Alice's story!!!",
#  'A Caucus-Race and a Long Tale']

soup.find_all(href=re.compile("elsie"), id='link1')
soup.find_all(href="http://example.com/elsie", id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
# b

for tag in soup.find_all(re.compile(r"\w{2}t")):
    print(tag.name)
# meta
# title
for link in soup.find_all('a'):
    print(link.get('href'))
for link in soup.find_all('a'):
    print(link['href'])
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
# http://example.com/rabbit
# http://example.com/tears
# http://example.com/tale
for text in soup.find_all("p"): print(text.text) for text in soup.body.find_all("p",recursive=True): print(text.get_text()) ## xml # The Dormouse's story!!! # Once upon a time there were three little sisters; and their names were # Elsie, # Lacie and # Tillie; # and they lived at the bottom of a well. # ...
# The Alice's story!!! # 1.Down the Rabbit-Hole # 2.The Pool of Tears # 3.A Caucus-Race and a Long Tale # ......
# 1.Down the Rabbit-Hole # 2.The Pool of Tears # 3.A Caucus-Race and a Long Tale # ......
## lxml # The Dormouse's story!!! # Once upon a time there were three little sisters; and their names were # Elsie, # Lacie and # Tillie; # and they lived at the bottom of a well. # ...
# The Alice's story!!!
# 1.Down the Rabbit-Hole # 2.The Pool of Tears # 3.A Caucus-Race and a Long Tale # ......
for text in soup.find_all("p"): print(text.string) The Dormouse's story!!! # None # ... # None # None # None # None # ......
from bs4 import Comment

## tree and tag are not defined above; judging from the printed results they were
## presumably built from a snippet like this (a reconstruction, not in the original post):
tree = BeautifulSoup("<b>No longer bold<i>Nothing</i></b>", "lxml")
tag = tree.b

new_comment = tree.new_string("Nice to see you.", Comment)
type(new_comment)
# <class 'bs4.element.Comment'>
tag.i.insert_after(new_comment)
tag.append("ever")
tag.name = "p"
print(tag)
# <p>No longer bold<i>Nothing</i><!--Nice to see you.-->ever</p>
tag.i.insert(1, new_comment)    # an element already in the tree is moved, not copied
print(tag)
# <p>No longer bold<i>Nothing<!--Nice to see you.--></i>ever</p>
tag.find_all(text=lambda text: isinstance(text, Comment))
# ['Nice to see you.']
tag.clear()    # removes the contents
print(tag)
# <p></p>
tag.string="I wish I was bold" tag.string.wrap(tree.new_tag("b",id="id1 id2")) tag.b['class'] = ['class1', 'class2'] print(tag) # <p><b class="class1 class2" id="id1 id2">I wish I was bold</b></p> ## select用于传入一个字符串作为CSS选择器 tag.select("b.class1.class2") # [<b class="class1 class2" id="id1 id2">I wish I was bold</b>] print(tag.b["class"]) # rel, rev, accept-charset, headers, and accesskey are list # ['class1', 'class2'] print(tag.b["id"]) # id1 id2 print(tag.b.get_attribute_list("id")) # ['id1 id2']
# Note: If you parse a document as XML, there are no multi-valued attributes:
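## A quick illustration of that note, using a throwaway snippet (xml_soup and html_soup
## are names picked for this sketch, not part of html_doc):
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']     # the whole attribute comes back as a single string
# 'body strikeout'
html_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
html_soup.p['class']    # parsed as HTML, class is multi-valued
# ['body', 'strikeout']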
tag.b.unwrap()
print(tag)
# <p>I wish I was bold</p>
for i, child in enumerate(soup.p.children):
    print(i, child)
# 0 <b>The Dormouse's story!!!</b>

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]

len(list(lsoup.descendants))
len(list(soup.descendants))
# 63

list(soup.children) == soup.contents    # children == contents
# True

for child in soup.descendants:    # all of a tag's children, recursively
    print(child)
list(soup.stripped_strings)
# ['Alice in Wonderland',
#  'Alice in Wonderland',
#  "The Dormouse's story!!!",
#  'start',
#  'initialize the story',
#  'end',
#  'Once upon a time there were three little sisters; and their names were',
#  'Elsie',
#  ',',
#  'Lacie',
#  'and',
#  'Tillie',
#  ';\nand they lived at the bottom of a well.',
#  '...',
#  "The Alice's story!!!",
#  '1.',
#  'Down the Rabbit-Hole',
#  '2.',
#  'The Pool of Tears',
#  '3.',
#  'A Caucus-Race and a Long Tale',
#  '......']

for i, parent in enumerate(list(soup.body.parent)):
    print(i, parent)
# 0 <head><meta charset="UTF-8"/><title>Alice in Wonderland </title></head>
# 1
# 2 <body>
# <h1>Alice in Wonderland</h1>
# <p class="title" id="dormouse"><b>The Dormouse's story!!!</b></p>
# <pre>start <span>initialize the story</span><i> end</i></pre>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# <p class="title" id="alice">
# <b>The Alice's story!!!</b>
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
# <p class="storymore">......</p>
# </p>
# </body>

print(list(enumerate(soup.a.next_siblings)))
# [(0, ',\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' and\n'),
#  (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, ';\nand they lived at the bottom of a well.')]

soup.h1.next_element
# 'Alice in Wonderland'

list(soup.a.next_siblings)
# [',\n',
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  ' and\n',
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
#  ';\nand they lived at the bottom of a well.']

list(soup.find(id="alice").next_siblings)
# ['\n']

soup.find("a", id="link3").next_element
# 'Tillie'

soup.h1.next_element
# 'Alice in Wonderland'
soup.h1.previous_element
# '\n'

soup.find(string="Lacie").find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find(string="Lacie").find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>

soup.a.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find("p", "story").find_next_sibling("p")
# <p class="story">...</p>

soup.a.find_all_next(string=True)
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
#  ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']

soup.a.find_next("p")
# <p class="story">...</p>

soup.h1.find_next("i")
# <i> end</i>

soup.find(id="link3").find_all_previous("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

for p in lsoup.find_all(name='p'):
    print(p.find_all(name='a'))
    for a in p.find_all(name='a'):
        print(a.string)
# []
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Elsie
# Lacie
# Tillie
# []
# []
# [<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>]
# Down the Rabbit-Hole
# [<a href="http://example.com/tears" id="link5">The Pool of Tears</a>]
# The Pool of Tears
# [<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]
# A Caucus-Race and a Long Tale
# []
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>,
#  <p class="story">...</p>,
#  <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>,
#  <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>,
#  <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>,
#  <p class="storymore">......</p>]

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
#  <a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>,
#  <a href="http://example.com/tears" id="link5">The Pool of Tears</a>,
#  <a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
#  <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>,
#  <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>,
#  <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# ['Alice in Wonderland ',
#  'Alice in Wonderland',
#  "The Dormouse's story!!!",
#  'initialize the story',
#  ' end',
#  'Elsie',
#  'Lacie',
#  'Tillie',
#  '...',
#  "The Alice's story!!!",
#  'Down the Rabbit-Hole',
#  'The Pool of Tears',
#  'A Caucus-Race and a Long Tale',
#  '......']
from bs4 import NavigableString

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))
for tag in soup.find(id="alice").find_all(surrounded_by_strings):
    print(tag.name)
# b
# p
# a
# p
# a
# p
# a
# p
CSS
CSS selectors that use "." (class) and "#" (id) work with the lxml parser; the xml parser does not support them, because class and id carry no special meaning in XML.
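To see that against our document, class and id selectors behave as expected on the lxml soup; a small check (on the xml soup these selectors would be expected to come back empty, though the exact behavior depends on the soupsieve version in use):

lsoup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
lsoup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]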
## Find tags by CSS ID:
print(soup.select("a#link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select('p[id="alice"]')[0].get_text("|", strip=True))
# The Alice's story!!!|1.|Down the Rabbit-Hole|2.|The Pool of Tears|3.|A Caucus-Race and a Long Tale|......
## Find the a tags by href:
print(soup.select('a[href]') == soup.find_all("a"))
# True
soup.select('a[href*="http://example.com/t"]')    # attribute value contains the substring
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
#  <a href="http://example.com/tears" id="link5">The Pool of Tears</a>,
#  <a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a>]
soup.select('a[href$="t"]')    # attribute value ends with "t"
# [<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a>]
## Find elements with a pseudo-class selector:
print(soup.select("p:nth-of-type(4)"))
# [<p class="title" id="alice">
# <b>The Alice's story!!!</b>
# <p class="story1">1.<a href="http://example.com/rabbit" id="link4">Down the Rabbit-Hole</a></p>
# <p class="story2">2.<a href="http://example.com/tears" id="link5">The Pool of Tears</a></p>
# <p class="story3">3.<a href="http://example.com/tale" id="link6">A Caucus-Race and a Long Tale</a></p>
# <p class="storymore">......</p>
# </p>]
soup = BeautifulSoup(open("Alice.html"))
with open("Alice.html") as fp:
    soup = BeautifulSoup(fp)

## The default is formatter="minimal"; strings are processed just enough for Beautiful Soup to generate valid HTML/XML.
print(soup.prettify(formatter="minimal"))
## formatter="html" converts Unicode characters to HTML entities whenever possible;
## with formatter=None, Beautiful Soup will not modify strings at all on output.
print(soup.prettify(formatter="html"))
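A formatter can also be a function, which Beautiful Soup calls on each string and attribute value as it writes the output. A small sketch against the in-memory lsoup from earlier (uppercase_all is just an illustrative name):

def uppercase_all(s):
    # called once per string/attribute value on output
    return s.upper()

print(lsoup.p.prettify(formatter=uppercase_all))
# every string and attribute value in the first <p> is upper-cased on output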