# Web Crawler
by wangrongjia
Crawl web pages and convert them to Markdown files. Supports websites that require login via cookies, including Twitter/X, Reddit, Zhihu, and more.
## Features
- One-Click Crawling: Crawl any web page and save it as a Markdown file with a single click
- Smart Content Extraction: Automatically extracts title, main content, images, and metadata
- Login Support: Configure cookies for websites that require authentication
- Proxy Support: Built-in proxy configuration for accessing international websites
- Dynamic Content: Uses Playwright for JavaScript-heavy sites (Twitter/X, etc.)
- Specialized Optimizations: Custom extractors for popular platforms:
  - Twitter/X: Tweets with images and author info
  - Reddit: Posts with automatic title extraction from URL
  - Zhihu: Q&A content with image lazy-loading support
  - V2EX: Forum posts with replies
- Obsidian Properties: Saves source URL and timestamp as file properties
- Auto Link Insertion: Optionally inserts links to the created file in your current editor
## Usage

### Basic Usage
1. Via Command Palette (Ctrl/Cmd+P)
   - Type `Web Crawler: 爬取网页内容` (Crawl web page content)
   - Enter the URL
   - The plugin will crawl and save the page
2. Via Ribbon Icon
   - Click the link icon in the left ribbon
   - Enter the URL
   - The content will be saved to your vault
3. From Editor
   - Use `Web Crawler: 爬取网页内容并插入链接` (Crawl web page content and insert link)
   - The plugin will insert a link to the created file at your current cursor position
## Configuration

### Proxy Settings

Go to Settings → Community Plugins → Web Crawler Plugin → Options:
- Use System Proxy: Enable if your system has a proxy configured
- Proxy Server: Manually configure a proxy (e.g., `http://127.0.0.1:7890`)
- Quick Setup: Choose from presets:
  - Clash Verge - HTTP (127.0.0.1:7897)
  - Clash - HTTP (127.0.0.1:7890)
  - V2RayN - HTTP (127.0.0.1:10809)
  - And more...
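A proxy string like the ones above is just a URL, so splitting it into the host/port pair a proxy agent needs is a one-liner with Node's standard `URL` class. A minimal sketch (hypothetical helper, not the plugin's actual code):

```typescript
// Hypothetical helper: split a proxy URL such as "http://127.0.0.1:7890"
// into the pieces a proxy agent typically needs. Not the plugin's actual code.
function parseProxy(proxy: string): { protocol: string; host: string; port: number } {
  const u = new URL(proxy); // throws on malformed input
  return {
    protocol: u.protocol.replace(":", ""),
    host: u.hostname,
    port: Number(u.port),
  };
}
```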
### Login Configuration (for websites requiring authentication)
For sites like Twitter/X, Zhihu, or private forums:
1. Scroll to the "Login Configuration" section
2. Click "Add Login Configuration"
3. Fill in:
   - URL Pattern: Match pattern (e.g., `https://twitter.com/*`, `https://www.zhihu.com/*`)
   - Cookies: Your cookie string from browser DevTools:
     1. Open DevTools (F12) in your browser
     2. Go to the Network tab
     3. Refresh the page
     4. Find any request and copy the `Cookie` header value
     - Format: `key1=value1; key2=value2; key3=value3`
   - User-Agent (optional): Custom user agent string
4. Save settings
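A cookie string in that format splits cleanly into key/value pairs on `;` and the first `=`. A minimal sketch of the parsing (hypothetical helper, not the plugin's actual implementation):

```typescript
// Hypothetical helper: parse a "key1=value1; key2=value2" cookie string
// into a lookup object. Not the plugin's actual implementation.
function parseCookieString(cookie: string): Record<string, string> {
  const jar: Record<string, string> = {};
  for (const part of cookie.split(";")) {
    const eq = part.indexOf("="); // split on the FIRST "=": values may contain "="
    if (eq === -1) continue;      // skip malformed fragments
    const key = part.slice(0, eq).trim();
    const value = part.slice(eq + 1).trim();
    if (key) jar[key] = value;
  }
  return jar;
}
```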
Note: Only cookies are supported. Username/password authentication is not available.
### Save Path
Configure where to save crawled content (default: WebCrawler folder in your vault).
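Saving a crawled page means turning its title into a legal file name inside that folder. A sketch of what such a step could look like (hypothetical helper; the character set to strip is an assumption, not taken from the plugin):

```typescript
// Hypothetical sketch: build a vault-relative path for a crawled page,
// stripping characters that are problematic in Obsidian file names.
// The exact character set here is an assumption for illustration.
function buildSavePath(folder: string, title: string): string {
  const safe = title.replace(/[\\/:*?"<>|#^[\]]/g, " ").trim().slice(0, 80);
  return `${folder.replace(/\/+$/, "")}/${safe || "Untitled"}.md`;
}
```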
## Advanced Features

### Twitter/X Support
For Twitter/X posts, the plugin uses a local Playwright server:
1. Start the local server (one-time setup):

   ```bash
   node server.cjs
   ```

2. Configure a proxy in the plugin settings (Twitter/X requires a VPN)
3. Crawl tweets:
   - Extracts tweet text, author info, and images
   - Generates a filename from the tweet content
   - Saves images in high resolution
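On the high-resolution point: Twitter serves image variants via the `name` query parameter on `pbs.twimg.com` URLs, so requesting the original is a URL rewrite. A sketch under that assumption (hypothetical helper, not the plugin's actual code):

```typescript
// Hypothetical sketch: rewrite a pbs.twimg.com image URL to request the
// original-resolution variant (the "name" query parameter selects the size).
function toHighResTwitterImage(src: string): string {
  const u = new URL(src);
  if (u.hostname === "pbs.twimg.com") u.searchParams.set("name", "orig");
  return u.toString();
}
```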
### V2EX Forum
- Automatically detects and includes replies
- Clean formatting for forum discussions
### Custom Extractors
The plugin uses smart content detection:
- Article tags (`<article>`, `<main>`)
- Common content class names
- Fallback to body content
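That fallback chain amounts to a selector-priority lookup: try each selector in order and keep the first non-empty match. A minimal sketch (hypothetical helper; the class-name selectors are assumptions, and the real extractor is more involved):

```typescript
// Hypothetical sketch of the fallback chain: try <article>, then <main>,
// then common content class names (assumed examples), then <body>.
// MinimalDoc stands in for a parsed DOM document.
interface MinimalDoc {
  querySelector(selector: string): { innerHTML: string } | null;
}

const CONTENT_SELECTORS = ["article", "main", ".post-content", ".article-content", "body"];

function extractContent(doc: MinimalDoc): string {
  for (const selector of CONTENT_SELECTORS) {
    const el = doc.querySelector(selector);
    if (el && el.innerHTML.trim().length > 0) return el.innerHTML;
  }
  return "";
}
```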
## Output Format
Crawled content is saved with Obsidian properties recording the source URL (来源) and crawl timestamp (时间):

```markdown
---
来源: https://example.com/article
时间: 2026-01-06 10:30:45
---

# Article Title

Article content goes here...
```
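A note in that shape can be assembled from a small template. A sketch (hypothetical helper, assuming the 来源/时间 property names shown above):

```typescript
// Hypothetical sketch: build a Markdown note with Obsidian properties.
// Property names 来源 (source) and 时间 (time) follow the example above.
function buildNote(url: string, crawledAt: string, title: string, body: string): string {
  const frontmatter = ["---", `来源: ${url}`, `时间: ${crawledAt}`, "---"].join("\n");
  return `${frontmatter}\n\n# ${title}\n\n${body}\n`;
}
```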
## Settings
| Setting | Description |
|---|---|
| Save Path | Folder to save crawled files (relative to vault root) |
| Use System Proxy | Use system/browser proxy settings |
| Proxy Server | Manual proxy configuration |
| Include Replies | Include forum replies (V2EX, etc.) |
| Login Configs | Cookie configurations for private websites |
## Development

### Building

```bash
npm install
npm run build
```

### Development Mode

```bash
npm run dev
```

### Linting

```bash
npm run lint
```
## Changelog

### Version 1.0.0
- Initial release
- Support for basic web crawling
- Twitter/X, Reddit, Zhihu, V2EX optimizations
- Proxy and login configuration
- Obsidian properties support
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
BSD 0-Clause License - see LICENSE for details.
Copyright (C) 2020-2025 by Dynalist Inc.
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
## Acknowledgments
- Built with Obsidian API
- Uses Turndown for HTML to Markdown conversion
- Uses Playwright for dynamic content
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Note: This plugin is not affiliated with or endorsed by Obsidian.