Efficiently Summarize Website Content Using Python, Requests, BeautifulSoup4, and Gemini Flash 1.5

Extracting and summarizing website content efficiently is useful for many applications, such as content curation and data analysis. In this blog post, we will learn how to summarize a website using Python's `requests` and `beautifulsoup4` libraries along with Gemini Flash 1.5, whose large context window makes it well suited to summarizing long pages.

Why Gemini and not GPT-4o mini?

Gemini 1.5 Flash is an AI model developed by Google that generates human-like text responses, much like GPT-4o mini. The key difference is its context window: 1 million tokens, compared with 128k tokens for GPT-4o mini. This lets Gemini take in a long webpage in a single request, which helps it produce coherent and contextually relevant responses when a page contains extensive content.
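
To get a rough feel for what a context window means for a scraped page, the sketch below estimates a token count from character length. The four-characters-per-token ratio is only a common rule of thumb for English text (an assumption, not a figure from either vendor), and `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
import math

# Assumption: roughly 4 characters per token for English text.
# Real tokenizers vary, so treat the result as an estimate only.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate based on character count."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_tokens: int, reserve_tokens: int = 2_000) -> bool:
    """Check whether the text likely fits, leaving room for the prompt and the reply."""
    return estimate_tokens(text) + reserve_tokens <= context_tokens


page_text = 'word ' * 200_000  # about one million characters
print(estimate_tokens(page_text))             # rough token estimate
print(fits_in_context(page_text, 128_000))    # a smaller window
print(fits_in_context(page_text, 1_000_000))  # a larger window
```

For a very long page like this one, the estimate exceeds the smaller window but stays well inside the larger one.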

Prerequisites

Before starting the implementation, make sure the required libraries are installed on your system. You can install them by running:

pip install requests beautifulsoup4

Setting Up

We will work with three main components:

requests: This library fetches the HTML content of a website.
beautifulsoup4: This library parses the HTML and extracts its text.
Gemini Flash 1.5: This model summarizes the text extracted from the website.

We'll begin by importing the required libraries and setting up the Gemini class to interact with the Gemini Flash 1.5 API.

from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class GeminiAI:
    """Simplified Gemini AI service interface."""

    model: str

    def __post_init__(self) -> None:
        self._session: requests.Session = requests.Session()

    def _request(self, method: str, path: str, **kwargs) -> dict:
        headers = kwargs.pop('headers', {})
        headers.update({'Content-Type': 'application/json'})

        response = self._session.request(
            method=method,
            url=f'https://generativelanguage.googleapis.com/v1/models/{self.model}{path}?key=<so-secret>',
            headers=headers,
            **kwargs,
        )
        response.raise_for_status()
        return response.json()

    def _get_model_config(self) -> dict:
        return {
            'generationConfig': {
                'temperature': 0.5,
                'topP': 0.8,
                'topK': 10,
            }
        }

    def generate_content(self, messages: list[dict]) -> str:
        data = {
            'contents': messages,
            **self._get_model_config(),
        }
        request_response = self._request(
            method='POST',
            path=':generateContent',
            json=data,
        )
        return request_response['candidates'][0]['content']['parts'][0]['text']
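
The class above embeds a key placeholder in the URL and indexes straight into the response. As a sketch only, the two hypothetical helpers below read the key from an environment variable (the name `GEMINI_API_KEY` is an assumption) and extract the text defensively, assuming the response has the same `candidates`, `content`, `parts` nesting that `generate_content` relies on.

```python
import os

BASE_URL = 'https://generativelanguage.googleapis.com/v1/models'


def build_endpoint(model: str, path: str, env_var: str = 'GEMINI_API_KEY') -> str:
    """Build the request URL, reading the API key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f'Set the {env_var} environment variable first.')
    return f'{BASE_URL}/{model}{path}?key={api_key}'


def extract_text(response: dict) -> str:
    """Join the text parts of the first candidate, or return '' if there are none."""
    candidates = response.get('candidates') or []
    if not candidates:
        return ''
    parts = candidates[0].get('content', {}).get('parts', [])
    return ''.join(part.get('text', '') for part in parts)


# Hand-written sample mirroring the shape the class expects (not real API output).
sample = {'candidates': [{'content': {'parts': [{'text': 'A short summary.'}]}}]}
print(extract_text(sample))
print(repr(extract_text({})))
```

Returning an empty string for a response without candidates is one possible design choice; raising an exception would be equally reasonable if the caller should stop on an empty reply.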

Step 1: Fetching the Website Content

We will use the `requests` library to fetch the HTML content of a website. Let's begin by specifying the URL and retrieving its content.

def fetch_website_content(url: str) -> str:
    if not url.startswith(('http://', 'https://')):
        url = f'http://{url}'

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text  # decoded text, matching the annotated str return type
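
The `startswith` check above accepts any string once a scheme is prepended. A slightly stricter sketch, using only the standard library's `urllib.parse`, could validate the result before any request is made. `normalize_url` is a hypothetical helper, and defaulting to `https://` is a choice made here for illustration; the function above defaults to `http://`.

```python
from urllib.parse import urlsplit


def normalize_url(url: str, default_scheme: str = 'https') -> str:
    """Add a scheme when it is missing and reject values without a usable host."""
    url = url.strip()
    if not url.lower().startswith(('http://', 'https://')):
        url = f'{default_scheme}://{url}'

    parts = urlsplit(url)
    if not parts.netloc or ' ' in parts.netloc:
        raise ValueError(f'Not a usable URL: {url!r}')
    return url


print(normalize_url('example.com'))
print(normalize_url('  http://example.com/about  '))
```

Failing early with a `ValueError` keeps malformed input from reaching the network call at all.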

Step 2: Parsing the HTML Content

Next, we will use BeautifulSoup to parse the HTML content and extract the main textual information from the website.

def clean_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, 'html.parser')
    # Fall back to the whole document when the page has no <body> tag.
    body_content = soup.find('body') or soup
    return body_content.get_text(separator='\n', strip=True)
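
Text pulled from a full page body often repeats navigation and footer lines. One possible post-processing step, sketched below with the standard library only, drops empty and duplicate lines and caps the length before the text is sent to the model. `compact_text` and its `max_chars` default are illustrative assumptions, not part of the original pipeline.

```python
def compact_text(text: str, max_chars: int = 20_000) -> str:
    """Drop blank and repeated lines, then cap the total length."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        line = ' '.join(line.split())  # collapse runs of whitespace
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return '\n'.join(kept)[:max_chars]


raw = 'Home\nAbout\n\nWe   build  tools.\nHome\nAbout\n'
print(compact_text(raw))
```

In this sample the repeated `Home` and `About` lines are kept only once, in their original order.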

Step 3: Summarizing the Text Content

With the text content extracted, we can now use Gemini Flash 1.5 to generate a summary. The function below sends the prompt and the cleaned text through our simplified Gemini AI interface.

SCRAPPING_PROMPT = '''
Using the following text content of a website, generate a comprehensive description in one paragraph.
Include the company name, sector, mission and vision, and the products and services offered, if they exist.
Ensure the response is clear, coherent, and well-structured, keeping it under 1000 characters.
'''

def summarize_webpage(url: str) -> str:
    html_content = fetch_website_content(url)
    cleaned_content = clean_html(html_content)

    geminiai = GeminiAI(model='gemini-1.5-flash')  # model is a required dataclass field
    messages = [
        {
            'role': 'model',
            'parts': [
                {
                    'text': SCRAPPING_PROMPT,
                }
            ],
        },
        {'role': 'user', 'parts': [{'text': cleaned_content}]},
    ]
    return geminiai.generate_content(messages)
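
The prompt asks the model to stay within roughly 1000 characters, but a model may not always comply. A small guard such as the hypothetical `enforce_limit` below could trim an overlong summary at the last sentence boundary that fits. Treating the last period as the sentence end is a simplifying assumption.

```python
def enforce_limit(summary: str, max_chars: int = 1000) -> str:
    """Trim the summary to at most max_chars, preferring to cut at a sentence end."""
    summary = summary.strip()
    if len(summary) <= max_chars:
        return summary

    clipped = summary[:max_chars]
    last_period = clipped.rfind('.')
    if last_period > 0:
        return clipped[: last_period + 1]
    return clipped.rstrip()


long_summary = 'First sentence. Second sentence that runs on and on'
print(enforce_limit(long_summary, max_chars=30))
```

It could wrap the return value of the function above, for example `enforce_limit(summarize_webpage(url))`.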

This guide covers the basics of using Python to fetch, parse, and summarize website content. With the `requests` and `beautifulsoup4` libraries and the Gemini Flash 1.5 API, you can build a compact website summarization pipeline.

Happy coding!
