Crawler-OpenSource-List

Crawler OpenSource List | An index of open-source crawler frameworks

Platform

  • 2019-SpiderFlow : A new-generation crawler platform that defines crawler workflows graphically, so a crawler can be built without writing code.

  • 2020-Crawlab : Distributed web crawler admin platform for spiders management regardless of languages and frameworks.

Framework

Node

  • x-ray : The next web scraper. See through the noise.

  • headless-chrome-crawler : Distributed crawler powered by Headless Chrome.

  • apify-js : Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.

  • 2021-Crawlee : A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

Python

  • 2018-Scrapy : Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

  • 2019-pyspider : A powerful spider (web crawler) system in Python.

  • Photon : Incredibly fast crawler which extracts URLs, emails, files, website accounts and much more.

  • Gerapy : Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.

Golang

  • 2015-go_spider : An awesome Go concurrent crawler (spider) framework. The crawler is flexible and modular. It can easily be extended into a customized crawler, or you can use only the default crawl components.

  • 2017-Colly : Lightning Fast and Elegant Scraping Framework for Gophers.

  • 2018-ferret : ferret is a web scraping system aiming to simplify data extraction from the web for things such as UI testing, machine learning and analytics.

  • 2019-Hakrawler : Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.

  • 2022-katana : A next-generation crawling and spidering framework.

Java

  • Crawler4j : crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.

  • 2015-WebMagic : A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It can simplify the development of a specific crawler.

Content Analysis

  • Fathom : A framework for extracting meaning from web pages.

  • unicaps : A unified Python API for CAPTCHA solving services.

Visual Config

  • 2023-EasySpider : A visual no-code web crawler/spider for designing and running crawl tasks graphically, without writing code.