How to Integrate Scrapy with Huey in Django: A Step-by-Step Guide

Felipe Gonzalez, CTO

2 min read

Scrapy is a framework for extracting data from websites. Huey is a lightweight task queue that lets you run work in the background. In this post, I will show you how to run Scrapy spiders from Huey tasks in a Django project.

1. Create a Django Project

First, set up a Django project that uses Scrapy and Huey. If you don't have the packages installed yet, install them with the following command:

pip install django scrapy huey

Create a new Django project with the following command:

django-admin startproject myproject

Create a new Django app with the following command:

cd myproject
python manage.py startapp myapp
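
Huey's Django integration discovers tasks in the tasks.py module of each installed app, so register both huey.contrib.djhuey and your app in INSTALLED_APPS. A minimal sketch of the relevant settings.py section:

# settings.py
INSTALLED_APPS = [
    # ... Django's default apps ...
    'huey.contrib.djhuey',  # Provides the run_huey command and task autodiscovery
    'myapp',
]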

2. Create a Scrapy Spider

Next, create a Scrapy project with the following command:

scrapy startproject myspider

Create a new spider with the following command:

cd myspider
scrapy genspider myspider mydomain.com

Edit the spider file with the following code:

import scrapy


class MySpiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        # Extract whatever your use case needs; a minimal example:
        yield {'url': response.url, 'title': response.css('title::text').get()}
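
Before wiring the spider into Django, you can verify it works on its own by running scrapy crawl myspider from the Scrapy project directory.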

3. Configure Scrapy in Django

Add a Scrapy configuration to the Django settings.py file with the following code:

# settings.py
USER_AGENT_SCRAPY = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

SCRAPY_SETTINGS = {
    # Project-specific pipeline that stores scraped items (see the sketch below)
    'ITEM_PIPELINES': {'scraping.pipelines.WebContentResultPipeline': 400},
    'USER_AGENT': USER_AGENT_SCRAPY,
    # Run Scrapy's Twisted reactor on top of asyncio
    'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    'LOG_LEVEL': 'ERROR',
    'LOG_FORMAT': '%(levelname)s: %(message)s',
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3,
    # Be polite: limit concurrency and space out requests
    'CONCURRENT_REQUESTS': 4,
    'DOWNLOAD_DELAY': 2,
    'DOWNLOADER_MIDDLEWARES': {
        # Replace Scrapy's built-in retry middleware with a project-specific one
        'scraping.downloadermiddlewares.CustomRetryMiddleware': 550,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    },
}
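
The ITEM_PIPELINES entry above points at a project-specific pipeline class. As a rough sketch of what such a pipeline might look like (the WebContentResult model, its fields, and the myapp import path are hypothetical stand-ins for your own code):

# scraping/pipelines.py
from myapp.models import WebContentResult  # Hypothetical Django model


class WebContentResultPipeline:
    def process_item(self, item, spider):
        # Persist each scraped item through the Django ORM
        WebContentResult.objects.create(
            url=item.get('url', ''),
            title=item.get('title', ''),
        )
        return item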

4. Configure Huey in Django

Then add a Huey configuration to the same settings.py file with the following code:

# settings.py
# REDIS_HOST, REDIS_PORT, and REDIS_DB must be defined; adjust to your setup.
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0

HUEY = {
    'utc': False,
    'connection': {'url': f'redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_DB}'},
    'immediate': False,  # Send tasks through the queue instead of running them inline
}
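
For local development you can set 'immediate': True, which makes Huey execute tasks synchronously in-process (using in-memory storage by default), so a running Redis server isn't required while you iterate.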

5. Create a Huey Task

Create a Huey task with the following code:

# tasks.py
from multiprocessing import Process

import scrapy
from django.conf import settings
from huey.contrib.djhuey import db_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Spider1(scrapy.Spider): ...


class Spider2(scrapy.Spider): ...


def run_spider(spider_class_list: list[type[scrapy.Spider]]) -> None:
    """Run multiple Scrapy spiders in a dedicated process.

    Args:
        spider_class_list: A list of Scrapy spider classes to crawl.
    """

    def start_crawler():
        # Initialize Scrapy settings
        scrapy_settings = get_project_settings()
        scrapy_settings.update(settings.SCRAPY_SETTINGS)

        # Create and start the CrawlerProcess
        process = CrawlerProcess(settings=scrapy_settings, install_root_handler=False)
        for spider_class in spider_class_list:
            process.crawl(spider_class)
        process.start()

    # Run the crawler in a separate process
    crawler_process = Process(target=start_crawler)
    crawler_process.start()
    crawler_process.join()


@db_task()
def run_spiders_task():
    spider_class_list = [Spider1, Spider2]
    run_spider(spider_class_list=spider_class_list)

Here is the tricky part. The run_spider function runs the Scrapy spiders in a separate OS process. This isolation is necessary because Scrapy's Twisted reactor can only be started once per process and can conflict with other event loops, such as the one used by asynchronous task managers like Huey. run_spider takes a list of Scrapy spider classes, spawns a child process, and starts a CrawlerProcess there. The run_spiders_task Huey task simply calls run_spider with the spiders to run. In this example we run two spiders, Spider1 and Spider2; you can add more to the list as needed.
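
If you want the spiders to run on a schedule rather than on demand, Huey's periodic tasks fit naturally here. A minimal sketch, assuming a nightly 3:00 AM schedule (the task name is hypothetical):

# tasks.py (continued)
from huey import crontab
from huey.contrib.djhuey import db_periodic_task


@db_periodic_task(crontab(hour='3', minute='0'))
def run_spiders_nightly():
    # Reuse the same process-isolated runner on a nightly schedule
    run_spider(spider_class_list=[Spider1, Spider2])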

6. Run the Huey Consumer

Start the Huey consumer, which listens for queued tasks, with the following command:

python manage.py run_huey
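
Starting the consumer only processes work that has been queued; the task itself still has to be enqueued. Calling the decorated function schedules it for execution, for example from a view, a management command, or the Django shell (assuming tasks.py lives in myapp):

# e.g. in a view or in `python manage.py shell`
from myapp.tasks import run_spiders_task

run_spiders_task()  # Enqueues the task; the Huey consumer picks it up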

Conclusion

Integrating Scrapy with Huey in a Django project lets you run web scraping efficiently in the background. The key is the run_spider function, which executes the spiders in a separate process and so avoids conflicts between Scrapy's Twisted reactor and Huey's consumer. The run_spiders_task Huey task queues that work, and a single call can crawl several spiders at once.

This approach offers a robust solution for managing web scraping tasks within a Django project, leveraging the strengths of Scrapy for data extraction and Huey for background task management. By following these steps, you can streamline your data extraction processes and enhance the efficiency of your Django applications.

Happy coding!


Written by Felipe Gonzalez


A technology visionary, Felipe leads the company’s technological strategy and innovation. With a deep expertise in software development, system architecture, and emerging technologies, he is dedicated to aligning technology initiatives with business goals.
