Level Up Your Beautifulsoup4 Skills with 5 Practical Examples
Beautifulsoup4 is a popular Python library for web scraping, allowing users to extract and manipulate data from HTML and XML documents. In this article, we will look at 5 practical examples to level up your Beautifulsoup4 skills, covering different use cases to help you get the most out of this versatile library.
1. Extracting All Links from a Web Page
One common task when scraping a website is to extract all the links present on a web page. Beautifulsoup4 makes this easy with the find_all() method.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Collect the href attribute of every <a> tag that has one
links = [a['href'] for a in soup.find_all('a', href=True)]
for link in links:
    print(link)
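Note that the href values you collect are often relative (for example, /about). If you need absolute URLs, the standard library's urllib.parse.urljoin() can resolve them against the page URL. A minimal sketch, reusing the soup and url from above:

from urllib.parse import urljoin

# Resolve relative hrefs (e.g. "/about") against the page URL
absolute_links = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
for link in absolute_links:
    print(link)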
2. Scraping Tables and Exporting to CSV
If you need to extract tabular data from a web page and save it as a CSV file, Beautifulsoup4 can help you with that too.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/table"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

table = soup.find("table")
header = [th.text.strip() for th in table.find_all("th")]
# Skip rows without <td> cells (such as the header row), which would
# otherwise produce empty lines in the CSV
rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

with open("output.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(rows)
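If the table is well-formed, pandas offers a convenient shortcut: read_html() parses every <table> on a page into a list of DataFrames (using an HTML parser such as lxml or Beautifulsoup4 under the hood). A minimal sketch, assuming the first table on the page is the one you want:

import pandas as pd

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html("https://example.com/table")
tables[0].to_csv("output.csv", index=False)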
3. Scraping Multiple Pages
Often, you'll need to scrape information from multiple pages. A simple loop over page numbers, stopping as soon as a page comes back empty, handles most pagination schemes.
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/posts?page="
page_num = 1

while True:
    url = base_url + str(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    posts = soup.find_all("div", class_="post")
    if not posts:
        break  # an empty page means we've gone past the last one

    for post in posts:
        title = post.find("h2").text.strip()
        content = post.find("p").text.strip()
        print(f"Title: {title}\nContent: {content}\n")

    page_num += 1
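Not every site paginates with a numeric query parameter; many expose a "next" link instead. Here is a minimal sketch of following such links until none remain (the rel="next" markup and the page structure are assumptions about the target site), with a short pause between requests to avoid hammering the server:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com/posts"  # hypothetical starting page
while url:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for post in soup.find_all("div", class_="post"):
        print(post.find("h2").text.strip())
    # Follow the <a rel="next"> link if present, otherwise stop
    next_link = soup.find("a", rel="next")
    url = urljoin(url, next_link["href"]) if next_link else None
    time.sleep(1)  # be polite: pause between requests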
4. Scraping Data with Dynamic Loading
When a website loads data dynamically using JavaScript, Beautifulsoup4 alone may not be enough. In this case, you can use Selenium to load the JavaScript content and then feed it to Beautifulsoup4.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://example.com/dynamic-content"
driver = webdriver.Firefox()
driver.get(url)

# Wait up to 10 seconds for the JavaScript-rendered element to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "dynamic-data")))

soup = BeautifulSoup(driver.page_source, "html.parser")
data = soup.find("div", id="dynamic-data").text.strip()
print(data)
driver.quit()
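Launching a visible browser window is unnecessary on a server or in CI. Firefox can run headless through Selenium's options API; a minimal sketch:

from selenium import webdriver

# Run Firefox without opening a visible window
options = webdriver.FirefoxOptions()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)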
5. Handling Errors and Timeouts
To make your scraping more robust, it's essential to handle errors and timeouts. You can use try and except blocks along with time.sleep() to manage these scenarios.
import requests
import time
from bs4 import BeautifulSoup

url = "https://example.com"

for _ in range(5):  # retry up to 5 times
    try:
        response = requests.get(url, timeout=5)  # give up on the request after 5 seconds
        response.raise_for_status()  # raise on HTTP error codes (4xx/5xx)
        break
    except requests.exceptions.RequestException:  # also covers Timeout, which subclasses it
        time.sleep(2)  # wait before retrying
else:
    # The loop never hit `break`, so all five attempts failed
    raise SystemExit("Failed to fetch the URL")

soup = BeautifulSoup(response.content, "html.parser")
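A fixed two-second pause works for quick scripts, but retry loops are often better served by exponential backoff, doubling the delay after each failure. A minimal sketch of wrapping this into a reusable helper (the function name and defaults are just for illustration):

import time
import requests

def fetch_with_backoff(url, retries=5, base_delay=1.0):
    """Fetch a URL, doubling the delay after each failed attempt."""
    delay = base_delay
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # out of retries; let the caller handle the error
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s, ...

response = fetch_with_backoff("https://example.com")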
With these 5 practical examples, you're well on your way to becoming a Beautifulsoup4 expert. By mastering these techniques, you'll be able to tackle a wide range of web scraping tasks efficiently and effectively.