Web Crawler Plugin

by wangrongjia

An Obsidian plugin that crawls web pages and converts them to Markdown files. Supports websites that require authentication via cookies, including Twitter/X, Reddit, Zhihu, and more.

🌟 Features

  • One-Click Crawling: Crawl any web page and save it as a Markdown file with a single click
  • Smart Content Extraction: Automatically extracts title, main content, images, and metadata
  • Login Support: Configure cookies for websites that require authentication
  • Proxy Support: Built-in proxy configuration for accessing international websites
  • Dynamic Content: Uses Playwright for JavaScript-heavy sites (Twitter/X, etc.)
  • Specialized Optimizations: Custom extractors for popular platforms:
    • Twitter/X: Tweets with images and author info
    • Reddit: Posts with automatic title extraction from URL
    • Zhihu: Q&A content with image lazy-loading support
    • V2EX: Forum posts with replies
  • Obsidian Properties: Saves source URL and timestamp as file properties
  • Auto Link Insertion: Optionally inserts links to the created file in your current editor

📸 Usage

Basic Usage

  1. Via Command Palette (Ctrl/Cmd+P)

    • Type Web Crawler: ็ˆฌๅ–็ฝ‘้กตๅ†…ๅฎน
    • Enter the URL
    • The plugin will crawl and save the page
  2. Via Ribbon Icon

    • Click the link icon in the left ribbon
    • Enter the URL
    • The content will be saved to your vault
  3. From Editor

    • Use Web Crawler: ็ˆฌๅ–็ฝ‘้กตๅ†…ๅฎนๅนถๆ’ๅ…ฅ้“พๆŽฅ
    • The plugin will insert a link to the created file in your current editor position

Configuration

Proxy Settings

Go to Settings → Community Plugins → Web Crawler Plugin → Options:

  1. Use System Proxy: Enable if your system has a proxy configured
  2. Proxy Server: Manually configure proxy (e.g., http://127.0.0.1:7890)
  3. Quick Setup: Choose from presets:
    • Clash Verge - HTTP (127.0.0.1:7897)
    • Clash - HTTP (127.0.0.1:7890)
    • V2RayN - HTTP (127.0.0.1:10809)
    • And more...
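As a sketch, each preset above reduces to a single proxy URL. The field and function names below are illustrative assumptions, not the plugin's actual settings schema:

```typescript
// Hypothetical shape of the proxy settings; the plugin's real schema may differ.
interface ProxySettings {
  useSystemProxy: boolean;
  proxyServer: string; // manual entry, e.g. "http://127.0.0.1:7890" for Clash
}

// Resolve which proxy a crawl request would use. A manual entry wins;
// "system" means defer to the OS/browser proxy; null means direct connection.
function resolveProxy(s: ProxySettings): string | null {
  const manual = s.proxyServer.trim();
  if (manual !== "") return manual;
  return s.useSystemProxy ? "system" : null;
}
```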

Login Configuration (for websites requiring authentication)

For sites like Twitter/X, Zhihu, or private forums:

  1. Scroll to "Login Configuration" section
  2. Click "Add Login Configuration"
  3. Fill in:
    • URL Pattern: Match pattern (e.g., https://twitter.com/*, https://www.zhihu.com/*)
    • Cookies: Your cookie string from browser DevTools
      • Open DevTools (F12) in your browser
      • Go to Network tab
      • Refresh the page
      • Find any request and copy the Cookie header value
      • Format: key1=value1; key2=value2; key3=value3
    • User-Agent (optional): Custom user agent string
  4. Save settings

Note: Only cookies are supported. Username/password authentication is not available.
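The two pieces of a login configuration, the `key1=value1; key2=value2` cookie string and the wildcard URL pattern, can be handled as sketched below. These helper names are illustrative, not the plugin's actual API:

```typescript
// Parse a "key1=value1; key2=value2" cookie string (the format described
// above) into a name→value map.
function parseCookies(cookieString: string): Record<string, string> {
  const jar: Record<string, string> = {};
  for (const pair of cookieString.split(";")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue;
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim(); // cookie values may themselves contain "="
    if (name) jar[name] = value;
  }
  return jar;
}

// Match a URL against a wildcard pattern such as "https://twitter.com/*".
function matchesPattern(url: string, pattern: string): boolean {
  // Escape regex metacharacters except "*", then turn "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
  return regex.test(url);
}
```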

Save Path

Configure where to save crawled content (default: WebCrawler folder in your vault).

🚀 Advanced Features

Twitter/X Support

For Twitter/X posts, the plugin uses a local Playwright server:

  1. Start the local server (one-time setup):

    node server.cjs
    
  2. Configure proxy in plugin settings (Twitter/X requires VPN)

  3. Crawl tweets:

    • Extracts tweet text, author info, images
    • Generates filename from tweet content
    • Images saved in high resolution
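The README only says a filename is generated from the tweet content; as an illustrative sketch (not the plugin's actual naming rules), reducing free-form text to a safe Markdown filename might look like:

```typescript
// Turn free-form tweet text into a safe, bounded Markdown filename.
// Purely illustrative; the plugin's real naming logic is not documented here.
function toFilename(text: string, maxLength = 50): string {
  const cleaned = text
    .replace(/https?:\/\/\S+/g, "")      // drop embedded links
    .replace(/[\\/:*?"<>|#^[\]]/g, " ")  // strip characters invalid in filenames/Obsidian links
    .replace(/\s+/g, " ")                // collapse whitespace
    .trim();
  const base = cleaned.slice(0, maxLength).trim();
  return (base || "untitled") + ".md";
}
```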

V2EX Forum

  • Automatically detects and includes replies
  • Clean formatting for forum discussions

Custom Extractors

The plugin uses smart content detection:

  • Article tags (<article>, <main>)
  • Common content class names
  • Fallback to body content
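The fallback chain above can be sketched as a selector cascade. The selector list and function below are assumptions for illustration, not the plugin's exact heuristics:

```typescript
// Minimal structural view of a DOM document, so the sketch stays self-contained.
interface DocLike {
  querySelector(selector: string): { textContent: string } | null;
}

// Ordered from most specific to most generic; illustrative only.
const CONTENT_SELECTORS = ["article", "main", ".post-content", ".article-content", "body"];

// Try progressively more generic selectors until one yields non-empty text.
function pickContent(doc: DocLike): string {
  for (const sel of CONTENT_SELECTORS) {
    const el = doc.querySelector(sel);
    if (el && el.textContent.trim() !== "") return el.textContent.trim();
  }
  return "";
}
```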

📦 Output Format

Crawled content is saved with Obsidian properties:

---
来源: https://example.com/article
时间: 2026-01-06 10:30:45
---

# Article Title

Article content goes here...
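The properties block above can be reproduced with a small formatter. The function name and date handling here are illustrative assumptions; the keys 来源 (source) and 时间 (time) follow the example output:

```typescript
// Build the YAML properties block for a crawled page, matching the
// example output's 来源 (source URL) and 时间 (timestamp) keys.
function buildFrontmatter(sourceUrl: string, crawledAt: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const d = crawledAt;
  const stamp =
    `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())} ` +
    `${pad(d.getHours())}:${pad(d.getMinutes())}:${pad(d.getSeconds())}`;
  return `---\n来源: ${sourceUrl}\n时间: ${stamp}\n---\n`;
}
```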

โš™๏ธ Settings

| Setting | Description |
| --- | --- |
| Save Path | Folder to save crawled files (relative to vault root) |
| Use System Proxy | Use system/browser proxy settings |
| Proxy Server | Manual proxy configuration |
| Include Replies | Include forum replies (V2EX, etc.) |
| Login Configs | Cookie configurations for private websites |

๐Ÿ› ๏ธ Development

Building

npm install
npm run build

Development Mode

npm run dev

Linting

npm run lint

๐Ÿ“ Changelog

Version 1.0.0

  • Initial release
  • Support for basic web crawling
  • Twitter/X, Reddit, Zhihu, V2EX optimizations
  • Proxy and login configuration
  • Obsidian properties support

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

BSD 0-Clause License - see LICENSE for details.

Copyright (C) 2020-2025 by Dynalist Inc.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

๐Ÿ™ Acknowledgments

📧 Support


Note: This plugin is not affiliated with or endorsed by Obsidian.
