How to Integrate Scrapy with Huey in Django: A Step-by-Step Guide
Felipe Gonzalez
CTO
Scrapy is a framework that helps you extract data from websites. Huey is a lightweight task queue that lets you run tasks in the background. In this post, I will show you how to use Scrapy with Huey in Django projects.
1. Create a Django Project
First, create a Django project with Scrapy and Huey. If you don't have them installed, you can install them (along with the redis client, which the Huey configuration below relies on) with the following command:
pip install django scrapy huey redis
Create a new Django project with the following command:
django-admin startproject myproject
Create a new Django app with the following command:
cd myproject
python manage.py startapp myapp
2. Create a Scrapy Spider
Create a Scrapy project with the following command:
scrapy startproject myspider
Create a new spider inside it with the following command:
cd myspider
scrapy genspider myspider mydomain.com
Edit the spider file with the following code:
import scrapy

class MySpiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        ...  # Extract data from the response here
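The parse method is left empty above. As a minimal illustration of what it might contain (the CSS selectors below are placeholders, not taken from any real site), a parse implementation typically yields scraped items and follows links:

import scrapy

class MySpiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        # Yield one item per heading found on the page (placeholder selector)
        for title in response.css('h1::text').getall():
            yield {'title': title}
        # Follow the pagination link, if one exists (placeholder selector)
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)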
3. Configure Scrapy in Django
Create a Scrapy configuration in the settings.py file with the following code:
# settings.py
USER_AGENT_SCRAPY = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

SCRAPY_SETTINGS = {
    'ITEM_PIPELINES': {'scraping.pipelines.WebContentResultPipeline': 400},
    'USER_AGENT': USER_AGENT_SCRAPY,
    'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    'LOG_LEVEL': 'ERROR',
    'LOG_FORMAT': '%(levelname)s: %(message)s',
    # Retry failed requests up to 3 times
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3,
    # Throttle crawling to stay polite
    'CONCURRENT_REQUESTS': 4,
    'DOWNLOAD_DELAY': 2,
    'DOWNLOADER_MIDDLEWARES': {
        # Swap the stock retry middleware for a custom one
        'scraping.downloadermiddlewares.CustomRetryMiddleware': 550,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    },
}
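The settings above reference a custom item pipeline (scraping.pipelines.WebContentResultPipeline) and a custom retry middleware (scraping.downloadermiddlewares.CustomRetryMiddleware) whose implementations are not shown in this post. Here is a minimal sketch of what they might look like; the class bodies are assumptions, not the actual implementation:

# scraping/pipelines.py (hypothetical sketch)
class WebContentResultPipeline:
    """Receives every item yielded by the spiders."""

    def process_item(self, item, spider):
        # Persist or transform the scraped item here (e.g. save it to the database)
        return item

# scraping/downloadermiddlewares.py (hypothetical sketch)
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class CustomRetryMiddleware(RetryMiddleware):
    """Extends Scrapy's built-in retry logic; registered at priority 550
    while the stock RetryMiddleware is disabled (set to None) above."""

    def process_response(self, request, response, spider):
        # Add custom retry conditions here, then fall back to the default behavior
        return super().process_response(request, response, spider)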
4. Configure Huey in Django
Create a Huey configuration in the settings.py file with the following code:
# settings.py
import os

# Redis connection details; adjust these defaults to your environment
REDIS_HOST = os.getenv('REDIS_HOST', 'localhost')
REDIS_PORT = os.getenv('REDIS_PORT', '6379')
REDIS_DB = os.getenv('REDIS_DB', '0')

HUEY = {
    'utc': False,
    'connection': {'url': f'redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_DB}'},
    'immediate': False,  # Queue tasks for the consumer instead of running them inline
}
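For the run_huey management command used in step 6 to be available, Huey's Django integration also needs to be registered in INSTALLED_APPS:

# settings.py
INSTALLED_APPS = [
    # ... Django and project apps ...
    'huey.contrib.djhuey',
]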
5. Create a Huey Task
Create a Huey task with the following code:
# tasks.py
from multiprocessing import Process

import scrapy
from django.conf import settings
from huey.contrib.djhuey import db_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Spider1(scrapy.Spider): ...
class Spider2(scrapy.Spider): ...


def run_spider(spider_class_list: list[type[scrapy.Spider]]) -> None:
    """Run multiple Scrapy spiders in a dedicated child process.

    Args:
        spider_class_list: A list of Scrapy spider classes.

    Returns:
        None
    """
    def start_crawler():
        # Initialize the Scrapy project settings and overlay the Django-side overrides
        scrapy_settings = get_project_settings()
        scrapy_settings.update(settings.SCRAPY_SETTINGS)
        # Create the CrawlerProcess, schedule every spider, and start the reactor
        process = CrawlerProcess(settings=scrapy_settings, install_root_handler=False)
        for spider_class in spider_class_list:
            process.crawl(spider_class)
        process.start()  # Blocks until all spiders have finished

    # Run the crawler in a separate process so the Twisted reactor
    # never has to start (or restart) inside the Huey worker itself
    crawler_process = Process(target=start_crawler)
    crawler_process.start()
    crawler_process.join()


@db_task()
def run_spiders_task():
    spider_class_list = [Spider1, Spider2]
    run_spider(spider_class_list=spider_class_list)
Here is the tricky part: the spiders must run in a separate process. This is necessary because Scrapy runs on Twisted's reactor, which cannot be restarted once it has stopped and which can conflict with other asynchronous task managers like Huey; isolating each crawl in its own process avoids both problems. The run_spider function takes a list of Scrapy spider classes and runs them all in a single child process, and the run_spiders_task Huey task calls it with that list. In this example we run two spiders, Spider1 and Spider2; you can add more to the list if needed.
6. Run the Huey Task
Start the Huey consumer, which listens for queued tasks and executes them (make sure your Redis server is running first), with the following command:
python manage.py run_huey
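With the consumer running, enqueueing a crawl is a single function call from anywhere in your Django code. Assuming the tasks module lives in myapp (adjust the import to your project layout):

# e.g. in a view, a management command, or the Django shell
from myapp.tasks import run_spiders_task

run_spiders_task()  # Returns immediately; the Huey consumer runs the spiders in the background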
Conclusion
Integrating Scrapy with Huey in a Django project allows you to perform web scraping tasks efficiently in the background. By running the Scrapy spiders in a separate process, you avoid conflicts between Scrapy's Twisted reactor and Django's and Huey's own event loops, ensuring smooth operation. The run_spider function, which takes a list of Scrapy spider classes and executes them in a separate process, is the crux of this integration. The run_spiders_task Huey task then calls this function, allowing multiple spiders to run concurrently in one background job.
This approach offers a robust solution for managing web scraping tasks within a Django project, leveraging the strengths of Scrapy for data extraction and Huey for background task management. By following these steps, you can streamline your data extraction processes and enhance the efficiency of your Django applications.
Happy coding!
Written by Felipe Gonzalez
A technology visionary, Felipe leads the company’s technological strategy and innovation. With deep expertise in software development, system architecture, and emerging technologies, he is dedicated to aligning technology initiatives with business goals.