Efficiently Summarize Website Content Using Python, Requests, BeautifulSoup4, and Gemini Flash 1.5
Felipe Gonzalez
CTO
Efficiently extracting and summarizing the information on websites is crucial for many applications, such as content curation and data analysis. In this blog post, we will learn how to summarize a website using Python's requests and beautifulsoup4 libraries together with Gemini Flash 1.5, which offers strong summarization capabilities.
Why Gemini and not GPT-4o mini?
Gemini 1.5 Flash is an AI model developed by Google that generates human-like text responses, much like GPT-4o mini. The key difference is the context window: Gemini 1.5 Flash accepts up to 1 million tokens, while GPT-4o mini accepts 128k tokens. The larger window lets Gemini take in a long webpage in a single request, which helps it produce more coherent and contextually relevant responses when a page contains extensive content.
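To make the context-window point concrete, here is a rough sketch for checking whether a page's text is likely to fit in a given window. The ratio of 4 characters per token is a common rule of thumb and an assumption here, not the model's real tokenizer, and the function names are hypothetical.

```python
import math

# Assumption: roughly 4 characters per token for English text.
# This is a rule of thumb, not an exact tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate for the given text."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int) -> bool:
    """Check whether the estimated token count fits in the window."""
    return estimate_tokens(text) <= context_window
```

Under this heuristic, a page with one million characters is estimated at 250,000 tokens, which fits comfortably in a 1,000,000-token window.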
Prerequisites
Before starting the implementation, make sure the required libraries are installed on your system. You can install them by running:
pip install requests beautifulsoup4
Setting Up
We will work with three components:
- requests: fetches the HTML content of websites.
- beautifulsoup4: parses the HTML and extracts its text.
- Gemini Flash 1.5: summarizes the text extracted from the website.
We'll begin by importing the required libraries and defining a GeminiAI class to interact with the Gemini Flash 1.5 API.
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class GeminiAI:
    """Simplified Gemini AI service interface."""

    model: str

    def __post_init__(self) -> None:
        # Reuse a single HTTP session for all API calls.
        self._session: requests.Session = requests.Session()

    def _request(self, method: str, path: str, **kwargs) -> dict:
        headers = kwargs.pop('headers', {})
        headers.update({'Content-Type': 'application/json'})
        # <so-secret> is a placeholder for your API key.
        response = self._session.request(
            method=method,
            url=f'https://generativelanguage.googleapis.com/v1/models/{self.model}{path}?key=<so-secret>',
            headers=headers,
            **kwargs,
        )
        response.raise_for_status()
        return response.json()

    def _get_model_config(self) -> dict:
        return {
            'generationConfig': {
                'temperature': 0.5,
                'topP': 0.8,
                'topK': 10,
            }
        }

    def generate_content(self, messages: list[dict]) -> str:
        data = {
            'contents': messages,
            **self._get_model_config(),
        }
        request_response = self._request(
            method='POST',
            path=':generateContent',
            json=data,
        )
        # Return the text of the first part of the first candidate.
        return request_response['candidates'][0]['content']['parts'][0]['text']
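The chained lookup at the end of `generate_content` assumes the response always contains at least one candidate with a text part. As a hedged sketch, the hypothetical helper below (`extract_text` is not part of the original class) reads the same fields defensively. It assumes, without this post confirming it, that a response without candidates may carry a `promptFeedback` field and that a candidate may carry a `finishReason`.

```python
def extract_text(response: dict) -> str:
    """Return the text of the first candidate, or raise a descriptive error."""
    candidates = response.get('candidates') or []
    if not candidates:
        # Assumption: a blocked prompt is reported under 'promptFeedback'.
        feedback = response.get('promptFeedback', {})
        raise ValueError(f'No candidates returned; feedback: {feedback}')

    parts = candidates[0].get('content', {}).get('parts') or []
    texts = [part['text'] for part in parts if 'text' in part]
    if not texts:
        # Assumption: 'finishReason' explains why no text was produced.
        reason = candidates[0].get('finishReason', 'unknown')
        raise ValueError(f'Candidate has no text; finish reason: {reason}')

    # Join all text parts instead of reading only the first one.
    return ''.join(texts)
```

With a helper like this, a missing field surfaces as a `ValueError` with context rather than a bare `KeyError` or `IndexError`.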
Step 1: Fetching the Website Content
We will use the `requests` library to fetch the HTML content of a website. Let's begin by specifying the URL and retrieving its content.
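def fetch_website_content(url: str) -> str:
    # Default to HTTP when the URL has no scheme.
    if not url.startswith(('http://', 'https://')):
        url = f'http://{url}'
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # response.text is the decoded body (str); response.content is raw bytes.
    return response.text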
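The scheme check above can be pulled into its own helper so it is easy to test without touching the network. The sketch below is one possible version, not part of the original code: `normalize_url` and its `default_scheme` parameter are hypothetical names, and the empty-host check is an added assumption about what counts as a usable URL.

```python
from urllib.parse import urlparse


def normalize_url(url: str, default_scheme: str = 'http') -> str:
    """Add a scheme when it is missing and reject URLs without a host."""
    url = url.strip()
    if not url.startswith(('http://', 'https://')):
        url = f'{default_scheme}://{url}'
    # urlparse splits the URL; netloc holds the host part.
    if not urlparse(url).netloc:
        raise ValueError(f'URL has no host: {url!r}')
    return url
```

For example, `normalize_url('example.com')` returns `'http://example.com'`, and passing `default_scheme='https'` prefers HTTPS for bare domains.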
Step 2: Parsing the HTML Content
Next, we will use BeautifulSoup to parse the HTML content and extract the main textual information from the website.
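def clean_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, 'html.parser')
    body_content = soup.find('body')
    # Fall back to the whole document when the page has no <body> tag.
    if body_content is None:
        body_content = soup
    return body_content.get_text(separator='\n', strip=True)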
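Text extracted this way often contains repeated navigation labels and uneven whitespace. As an optional extra step, the sketch below tidies the text before it is sent to the model. `tidy_text` and its `max_chars` parameter are hypothetical, and dropping consecutive duplicate lines is a design choice made for this example rather than something the original pipeline does.

```python
from typing import Optional


def tidy_text(text: str, max_chars: Optional[int] = None) -> str:
    """Drop blank and consecutively repeated lines, then optionally truncate."""
    cleaned = []
    for line in text.splitlines():
        line = ' '.join(line.split())  # collapse runs of whitespace
        if not line:
            continue  # skip blank lines
        if cleaned and cleaned[-1] == line:
            continue  # skip a line identical to the previous one
        cleaned.append(line)
    result = '\n'.join(cleaned)
    if max_chars is not None:
        result = result[:max_chars]
    return result
```

It can be applied to the return value of `clean_html`, for instance `tidy_text(clean_html(html_content))`.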
Step 3: Summarizing the Text Content
With the extracted text content, we can now use Gemini Flash 1.5 to generate a summary. The function below ties the previous steps together and calls our simplified Gemini AI interface.
SCRAPING_PROMPT = '''
Using the following text content of a website, generate a comprehensive description in one paragraph.
Include the company name, sector, mission and vision, and the products and services offered, if they exist.
Ensure the response is clear, coherent, and well-structured, keeping it under 1000 characters.
'''


def summarize_webpage(url: str) -> str:
    html_content = fetch_website_content(url)
    cleaned_content = clean_html(html_content)
    # GeminiAI requires a model name; 'gemini-1.5-flash' is assumed here
    # as the API model ID for Gemini 1.5 Flash.
    geminiai = GeminiAI(model='gemini-1.5-flash')
    messages = [
        {
            'role': 'model',
            'parts': [
                {
                    'text': SCRAPING_PROMPT,
                }
            ],
        },
        {'role': 'user', 'parts': [{'text': cleaned_content}]},
    ]
    return geminiai.generate_content(messages)
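The prompt asks for a summary under 1000 characters, but a model may not always respect a length instruction. As a hedged sketch, the hypothetical helper below (`truncate_summary` is not part of the original code) enforces the limit after the fact, cutting at the last sentence end that fits when one is available.

```python
def truncate_summary(summary: str, limit: int = 1000) -> str:
    """Trim a summary to at most `limit` characters, preferring a sentence end."""
    summary = summary.strip()
    if len(summary) <= limit:
        return summary
    clipped = summary[:limit]
    # Cut after the last '. ' inside the limit, if there is one.
    last_stop = clipped.rfind('. ')
    if last_stop != -1:
        return clipped[:last_stop + 1]
    # No sentence boundary found: fall back to a hard cut.
    return clipped.rstrip()
```

It can wrap the final call, for instance `truncate_summary(summarize_webpage(url))`, where `url` is whichever site you want to summarize.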
This guide covers the basics of using Python to fetch, parse, and summarize website content. With the `requests` and `beautifulsoup4` libraries, as well as the Gemini Flash 1.5 API, you can create a robust solution for website summarization.
Happy coding!