Crawler-OpenSource-List
Crawler OpenSource List | An index of open-source crawler frameworks
Platform
- 2019-SpiderFlow : A new-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.
- 2020-Crawlab : Distributed web crawler admin platform for spiders management regardless of languages and frameworks.
Framework
Node
- x-ray : The next web scraper. See through the noise.
- headless-chrome-crawler : Distributed crawler powered by Headless Chrome.
- apify-js : Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs, with headless Chrome and Puppeteer among other tools.
- 2021-Crawlee : A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
Python
- 2018-Scrapy : Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
- 2019-pyspider : A powerful spider (web crawler) system in Python.
- Photon : Incredibly fast crawler which extracts URLs, emails, files, website accounts and much more.
- Gerapy : Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.
Golang
- 2015-go_spider : An awesome Go concurrent crawler (spider) framework. The crawler is flexible and modular. It can easily be extended into a customized crawler, or you can use only the default crawl components.
- 2017-Colly : Lightning Fast and Elegant Scraping Framework for Gophers.
- 2018-ferret : ferret is a web scraping system aiming to simplify data extraction from the web for things like UI testing, machine learning and analytics.
- 2019-Hakrawler : Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
- 2022-katana : A next-generation crawling and spidering framework.
Java
- Crawler4j : crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.
- 2015-WebMagic : A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It can simplify the development of a specific crawler.
Content Analysis
- Fathom : A framework for extracting meaning from web pages.
- unicaps : A unified Python API for CAPTCHA solving services.
Visual Config
- 2023-EasySpider : A visual, no-code web crawler/spider for designing and running crawl tasks graphically.