GPT-Powered General-Use Web Crawler
Language models such as GPT have fundamentally changed how crawlers are written. Previously, each site's unique structure often required site-specific configuration or processing to extract the desired information. With GPT, a single crawler can plausibly extract the information it needs from almost any site. To this end, I wrote a general-purpose crawler that uses GPT to extract information during crawling and open-sourced it on GitHub.
1 Introduction
GPT-Web-Crawler is a web crawler based on Python and Puppeteer that can crawl web pages and extract content from them (including page titles, URLs, keywords, descriptions, all text content, all images, and screenshots). It is easy to use: a few lines of code are enough to crawl web pages and extract content, which makes it well suited to people who are new to web crawling but want to extract content from web pages.
The crawler writes its output to a JSON file, which can easily be converted into a CSV file, imported into a database, or used to build an AI agent.
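As a rough illustration of the JSON-to-CSV conversion, here is a minimal sketch using only the Python standard library. It assumes the output file holds a JSON array of page records and that the keys are spelled `title`, `url`, `keywords`, `description`, and `body`; both the file layout and the key spelling are assumptions, and `json_to_csv` is a hypothetical helper, not part of the package.

```python
import csv
import json
from pathlib import Path

# Assumed key names for the basic fields a crawler returns; the exact
# spelling in a real output file may differ.
FIELDS = ["title", "url", "keywords", "description", "body"]


def json_to_csv(json_path, csv_path, fields=FIELDS):
    """Convert a JSON array of page records into a CSV file.

    Returns the number of records written.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not listed in `fields`.
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            row = {}
            for name in fields:
                value = record.get(name, "")  # missing keys become empty cells
                if isinstance(value, (list, tuple)):
                    # Flatten list values (e.g. keywords) into one cell.
                    value = ", ".join(str(item) for item in value)
                row[name] = value
            writer.writerow(row)
    return len(records)
```

Calling `json_to_csv("output.json", "output.csv")` (file names chosen for illustration) would write one CSV row per crawled page. Fields not listed in `FIELDS` are dropped, so pass a longer `fields` list to keep crawler-specific columns.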
2 Getting Started
Step 1. Install the package.
Step 2. Copy config_template.py and rename the copy to config.py. If you want ProSpider to use AI to extract content from web pages, edit config.py to set your OpenAI API key and other settings. If you do not need AI extraction, you can leave config.py unchanged.
Step 3. Execute the following code to launch a crawler.
3 Crawlers
The code above uses NoobSpider. The package provides four types of crawlers, which differ in what they can extract from web pages. The following table shows their differences.
Crawler Type | Description | Returned Content |
---|---|---|
NoobSpider | Crawls basic webpage information | title, URL, keywords, description, body (all text content of the webpage) |
CatSpider | Crawls webpages and takes screenshots | title, URL, keywords, description, body (all text content of the webpage), screenshot_path (path of the screenshot) |
ProSpider | Crawls basic information and uses AI to extract content | title, URL, keywords, description, body (all text content of the webpage), ai_extract_content (text content extracted by GPT) |
LionSpider | Crawls basic information and extracts all images | title, URL, keywords, description, body (all text content of the webpage), directory (directory of all images on the webpage) |
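
The table above can also be read as a small data structure: every crawler returns the same basic fields, and three of them add one extra field each. The sketch below encodes that and checks a record against it. The lowercase key spellings (for example `url`) and the helpers `expected_fields` and `missing_fields` are assumptions made for illustration; they are not part of the package.

```python
# Fields every crawler type returns, per the table (key spelling assumed).
BASE_FIELDS = ("title", "url", "keywords", "description", "body")

# Extra field each crawler type adds on top of the basic fields.
EXTRA_FIELDS = {
    "NoobSpider": (),
    "CatSpider": ("screenshot_path",),
    "ProSpider": ("ai_extract_content",),
    "LionSpider": ("directory",),
}


def expected_fields(crawler_type):
    """Return the tuple of field names a crawler type is expected to produce."""
    return BASE_FIELDS + EXTRA_FIELDS[crawler_type]


def missing_fields(crawler_type, record):
    """Return the expected field names that are absent from a record dict."""
    return [name for name in expected_fields(crawler_type) if name not in record]
```

A check like `missing_fields("CatSpider", record)` could be run on each item of the JSON output before importing it into a database, so that incomplete records are caught early.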
3.1 Cat Spider
Cat Spider is a crawler that can take screenshots of web pages. It builds on Noob Spider and uses Puppeteer to simulate browser operations, capturing a screenshot of the entire webpage and saving it as an image. You therefore need to install Puppeteer before using Cat Spider.