Efficiently Summarize Website Content Using Python, Requests, BeautifulSoup4, and Gemini Flash 1.5

Felipe Gonzalez

CTO
2 min read.

Extracting and summarizing website content efficiently matters for many applications, such as content curation and data analysis. In this blog post, we will learn how to summarize a website using Python's requests and beautifulsoup4 libraries together with Gemini Flash 1.5, whose large context window makes it well suited to summarizing long pages.

Why Gemini and not GPT-4o mini?

Gemini 1.5 Flash is an AI model developed by Google that generates human-like text responses, much like GPT-4o mini. The main difference for this task is the context window: 1 million tokens for Gemini 1.5 Flash, compared with 128k tokens for GPT-4o mini. This lets Gemini take in a long webpage in a single request, which helps it produce more coherent and contextually relevant summaries when a page has extensive content.
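
To make the context window comparison concrete, here is a rough sketch that estimates whether a page's text would fit in a given window. The ratio of 4 characters per token is a rule of thumb assumed here for English text, not the tokenizer of either model, and `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
# Rough sketch: estimate whether text fits in a model's context window.
# ASSUMPTION: about 4 characters per token, a rule of thumb for English text.
# Real token counts depend on each model's tokenizer.
CHARS_PER_TOKEN = 4

# Published context window sizes, in tokens.
GEMINI_15_FLASH_WINDOW = 1_000_000
GPT_4O_MINI_WINDOW = 128_000


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate: len(text) / 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, window: int) -> bool:
    """Return True if the estimated token count is within the window."""
    return estimate_tokens(text) <= window
```

Under this estimate, a page with 2 million characters of text (about 500k tokens) would fit in the larger window but not in the smaller one.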

Prerequisites

Before starting the implementation, make sure the required libraries are installed on your system. You can install them by running:

pip install requests beautifulsoup4

Setting Up

We are going to work with three components:

  • requests: This library fetches the HTML content of websites.
  • beautifulsoup4: This library parses the HTML and extracts its text.
  • Gemini Flash 1.5: This model summarizes the text extracted from the website.


We'll begin by importing the required libraries and setting up a GeminiAI class to interact with the Gemini Flash 1.5 API.

from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class GeminiAI:
    """Simplified Gemini AI service interface."""

    model: str

    def __post_init__(self) -> None:
        self._session: requests.Session = requests.Session()

    def _request(self, method: str, path: str, **kwargs) -> dict:
        # Always send JSON, while keeping any headers the caller passed in.
        headers = kwargs.pop('headers', {})
        headers.update({'Content-Type': 'application/json'})

        # <so-secret> is a placeholder: replace it with your own API key.
        response = self._session.request(
            method=method,
            url=f'https://generativelanguage.googleapis.com/v1/models/{self.model}{path}?key=<so-secret>',
            headers=headers,
            **kwargs,
        )
        response.raise_for_status()
        return response.json()

    def _get_model_config(self) -> dict:
        return {
            'generationConfig': {
                'temperature': 0.5,
                'topP': 0.8,
                'topK': 10,
            }
        }

    def generate_content(self, messages: list[dict]) -> str:
        data = {
            'contents': messages,
            **self._get_model_config(),
        }
        request_response = self._request(
            method='POST',
            path=':generateContent',
            json=data,
        )
        return request_response['candidates'][0]['content']['parts'][0]['text']
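
Two details of the class above deserve care in real use: the API key should not live in source code, and a response may come back without any candidates (for example, when a prompt is blocked). The following is a minimal sketch of both ideas. The environment variable name `GEMINI_API_KEY` and the helpers `build_url`, `get_api_key`, and `extract_text` are assumptions for illustration, not part of the original class.

```python
import os

API_ROOT = 'https://generativelanguage.googleapis.com/v1/models'


def build_url(model: str, path: str) -> str:
    """Build the endpoint URL without embedding the key in it."""
    return f'{API_ROOT}/{model}{path}'


def get_api_key(env_var: str = 'GEMINI_API_KEY') -> str:
    """Read the key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable first.')
    return key


def extract_text(response: dict) -> str:
    """Return the first candidate's text, or '' if the response has none."""
    candidates = response.get('candidates') or []
    if not candidates:
        return ''
    parts = candidates[0].get('content', {}).get('parts') or []
    # Join all text parts of the first candidate into one string.
    return ''.join(part.get('text', '') for part in parts)
```

With these helpers, `_request` could pass the key as `params={'key': get_api_key()}` to `self._session.request` instead of formatting it into the URL string.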
        

Step 1: Fetching the Website Content

We will use the `requests` library to fetch the HTML content of a website. Let's begin by specifying the URL and retrieving its content.

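def fetch_website_content(url: str) -> bytes:
    # Default to http:// when the URL has no scheme.
    if not url.startswith(('http://', 'https://')):
        url = f'http://{url}'

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Return the raw bytes so BeautifulSoup can detect the page encoding itself.
    return response.content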
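
The scheme check in `fetch_website_content` can be isolated and tested without any network access. The sketch below does that with the standard library's `urllib.parse` and also rejects input that has no host. `normalize_url` and its `default_scheme` parameter are hypothetical names introduced for illustration; the default of `http` mirrors the function above.

```python
from urllib.parse import urlsplit


def normalize_url(url: str, default_scheme: str = 'http') -> str:
    """Add a scheme when it is missing and reject URLs without a host."""
    url = url.strip()
    if not url.startswith(('http://', 'https://')):
        url = f'{default_scheme}://{url}'
    # urlsplit puts the host part in .netloc; an empty value means no host.
    if not urlsplit(url).netloc:
        raise ValueError(f'No host found in URL: {url!r}')
    return url
```

Passing `default_scheme='https'` is a reasonable choice when the target sites are known to support it.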

Step 2: Parsing the HTML Content

Next, we will use BeautifulSoup to parse the HTML content and extract the main textual information from the website.

def clean_html(html_content: str | bytes) -> str:
    soup = BeautifulSoup(html_content, 'html.parser')
    # Fall back to the whole document when the page has no <body> tag.
    body_content = soup.find('body') or soup
    return body_content.get_text(separator='\n', strip=True)
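
The text returned by `get_text` can still contain many short, repeated lines such as menu items and footer links. Since a smaller input usually means a faster and cheaper request, it can help to tidy the text before sending it. The sketch below is one possible approach using only the standard library; `tidy_text` and its `max_chars` limit are hypothetical and not part of the original pipeline.

```python
def tidy_text(text: str, max_chars: int = 200_000) -> str:
    """Drop blank and repeated lines, then cap the total length."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        line = ' '.join(line.split())  # collapse runs of whitespace
        if not line or line in seen:
            continue  # skip empty lines and exact repeats
        seen.add(line)
        kept.append(line)
    return '\n'.join(kept)[:max_chars]
```

It could be applied to the result of `clean_html` before the text is passed to the model.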

Step 3: Summarizing the Text Content

With the text extracted, we can now use Gemini Flash 1.5 to generate a summary. The function below summarizes the content through our simplified Gemini AI interface.

SCRAPPING_PROMPT = '''
Using the following text extracted from a website, generate a comprehensive description in one paragraph.
Include the company name, sector, mission and vision, and the products and services offered, if they exist.
Ensure the response is clear, coherent, and well-structured, keeping it under 1000 characters.
'''

def summarize_webpage(url: str) -> str:
    html_content = fetch_website_content(url)
    cleaned_content = clean_html(html_content)

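    # The dataclass requires a model name; this is the API identifier for Gemini 1.5 Flash.
    geminiai = GeminiAI(model='gemini-1.5-flash')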
    # Send the instructions and the page text together as one user turn;
    # the 'model' role is meant for the model's own replies.
    messages = [
        {
            'role': 'user',
            'parts': [
                {'text': SCRAPPING_PROMPT},
                {'text': cleaned_content},
            ],
        },
    ]
    return geminiai.generate_content(messages)
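
The prompt asks for fewer than 1000 characters, but a model may not always respect a length instruction. A small post-check is one way to guard against that. The sketch below trims an overlong summary at the last sentence end that fits; `enforce_limit` is a hypothetical helper, and treating a period followed by a space as a sentence boundary is a simplifying assumption.

```python
def enforce_limit(summary: str, limit: int = 1000) -> str:
    """Trim summary to at most `limit` characters, preferring a sentence end."""
    summary = summary.strip()
    if len(summary) <= limit:
        return summary
    clipped = summary[:limit]
    # Look for the last sentence boundary inside the clipped text.
    cut = clipped.rfind('. ')
    if cut == -1:
        return clipped.rstrip()  # no boundary found: hard cut
    return clipped[:cut + 1]  # keep the period, drop what follows
```

It could wrap the call from the function above, for example `enforce_limit(summarize_webpage('example.com'))`.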

This guide covers the basics of using Python to fetch, parse, and summarize website content. With the `requests` and `beautifulsoup4` libraries and the Gemini Flash 1.5 API, you have a solid starting point for website summarization.

Happy coding!


Written by Felipe Gonzalez

A technology visionary, Felipe leads the company’s technological strategy and innovation. With a deep expertise in software development, system architecture, and emerging technologies, he is dedicated to aligning technology initiatives with business goals.