GPT-Driven General Web Crawler
Language models led by GPT have completely changed the way crawlers are written. Previously, each website's crawler required special configuration or handling (since every website has its own structure) to extract the desired information. With GPT, it becomes possible for a single crawler to extract the desired information from virtually any website. For this purpose, I wrote a general crawler that uses GPT to extract information during the crawling process and open-sourced it on GitHub.
1 Introduction
GPT-Web-Crawler is a web crawler based on Python and Puppeteer. It can crawl web pages and extract their content, including each page's title, URL, keywords, description, all text content, all images, and screenshots. It is very easy to use: only a few lines of code are needed to crawl web pages and extract content from them, which makes it well suited for people who are not familiar with web crawling but want to extract content from web pages.
The output of the crawler can be a JSON file, which can be easily converted to a CSV file, imported into a database, or used to build an AI agent.
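As a sketch of the JSON-to-CSV conversion mentioned above, the snippet below uses only the standard library. The record fields mirror the NoobSpider columns described later; the inline JSON string stands in for a real output file and is an illustrative assumption:

```python
import csv
import json

# Stand-in for the crawler's JSON output; field names follow the
# NoobSpider columns (title, url, keywords, description, body).
records = json.loads(
    '[{"title": "Example", "url": "https://example.com",'
    ' "keywords": "demo", "description": "A page", "body": "Hello"}]'
)

# Write one CSV row per crawled page, with a header row.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```

The same list of dicts can just as easily be inserted into a database or fed to a downstream AI agent.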
2 Getting Started
Step 1. Install the package.
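A typical from-source installation looks like the following. The repository URL is an assumption inferred from the project name; check the author's GitHub page for the exact location:

```shell
# Clone the repository and install its Python dependencies.
git clone https://github.com/Tim-Saijun/gpt-web-crawler.git
cd gpt-web-crawler
pip install -r requirements.txt
```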
Step 2. Copy config_template.py, rename the copy to config.py, and edit it to set your OpenAI API key and other settings if you want ProSpider to use AI to extract content from web pages. If you do not need AI extraction, you can leave config.py unchanged.
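A sketch of what config.py might contain; the variable name is an assumption, and the real config_template.py in the repository is authoritative:

```python
# config.py -- hypothetical sketch; the field name is an assumption.
# Only needed if you use ProSpider's AI extraction.
OPENAI_API_KEY = "sk-your-key-here"  # your OpenAI API key
```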
Step 3. Run the following code to start a crawler.
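A minimal sketch of starting a crawl. The import path and keyword arguments are assumptions based on the project's README, so check the repository for the exact API; the guard keeps the sketch importable even when the package is not installed:

```python
# Assumed import path; verify against the repository.
try:
    from gpt_web_crawler import run_spider, NoobSpider
except ImportError:  # package not installed; keep the sketch importable
    run_spider = NoobSpider = None

if run_spider:
    run_spider(NoobSpider,
               max_page_count=10,                    # crawl at most 10 pages
               start_urls="https://www.example.com", # where crawling begins
               output_file_name="output.json")       # JSON output file
```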
3 Crawlers
The code above uses NoobSpider. This package provides four types of crawlers, each extracting different content from web pages. The table below shows their differences.
| Crawler Type | Description | Returned Content |
| --- | --- | --- |
| NoobSpider | Extracts basic webpage information | title, URL, keywords, description, body (all text content of the webpage) |
| CatSpider | Extracts webpage information plus a screenshot | title, URL, keywords, description, body, screenshot_path (path to the saved screenshot) |
| ProSpider | Extracts basic information and uses AI to extract content | title, URL, keywords, description, body, ai_extract_content (main text extracted by GPT) |
| LionSpider | Extracts basic information and all images | title, URL, keywords, description, body, directory (directory of all images on the webpage) |
3.1 CatSpider
CatSpider is a crawler that takes screenshots of web pages. It builds on NoobSpider and uses Puppeteer to simulate browser operations, capturing a screenshot of the entire webpage and saving it as an image. Because of this, you need to install Puppeteer before using CatSpider.
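After installing Puppeteer (a Node.js library, typically installed with `npm install puppeteer`; see the repository's README for the exact setup), running CatSpider looks much like the earlier example. The import path and keyword arguments are assumptions mirroring the NoobSpider sketch:

```python
# Assumed import path; verify against the repository.
try:
    from gpt_web_crawler import run_spider, CatSpider
except ImportError:  # package not installed; keep the sketch importable
    run_spider = CatSpider = None

if run_spider:
    run_spider(CatSpider,
               max_page_count=5,                     # crawl at most 5 pages
               start_urls="https://www.example.com", # where crawling begins
               output_file_name="output.json")       # rows include screenshot_path
```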