WeChat Official Account Image Crawler: A Development Guide
I. Technology Selection Comparison
| Approach | Pros | Cons |
| --- | --- | --- |
| Requests + BeautifulSoup | Lightweight and fast | Cannot handle dynamically loaded content |
| Direct API calls | Stable and reliable | |
| mitmdump packet capture | Can capture encrypted data streams | Complex to configure |
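The table only names the mitmdump route, so here is a minimal addon sketch for concreteness (my own illustration, not from the original article). It assumes the device's traffic is already routed through mitmproxy and that WeChat images are served from mmbiz.qpic.cn with an `image/*` Content-Type; the filename scheme is arbitrary.

```python
# save_wx_images.py -- run with: mitmdump -s save_wx_images.py
import os

from mitmproxy import http

SAVE_DIR = 'wx_images'
os.makedirs(SAVE_DIR, exist_ok=True)
counter = 0


def response(flow: http.HTTPFlow) -> None:
    """Save image responses from the WeChat image CDN as they pass through the proxy."""
    global counter
    content_type = flow.response.headers.get('content-type', '')
    if 'mmbiz.qpic.cn' in flow.request.pretty_host and content_type.startswith('image/'):
        counter += 1
        ext = content_type.split('/')[-1].split(';')[0]  # e.g. 'image/jpeg' -> 'jpeg'
        with open(f'{SAVE_DIR}/capture_{counter}.{ext}', 'wb') as f:
            f.write(flow.response.content)
```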
II. Basic Version (Static Pages)
1. Environment Setup
```bash
pip install requests beautifulsoup4
```
2. Core Code
```python
import os

import requests
from bs4 import BeautifulSoup


def download_images(url, save_dir='wx_images'):
    os.makedirs(save_dir, exist_ok=True)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # WeChat articles lazy-load images: the real URL sits in the data-src attribute
    img_tags = soup.find_all('img', {'data-src': True})

    for idx, img in enumerate(img_tags):
        # Un-escape HTML-encoded ampersands (&amp; -> &) in the image URL
        img_url = img['data-src'].replace('&amp;', '&')
        try:
            img_data = requests.get(img_url).content
            with open(f'{save_dir}/image_{idx+1}.jpg', 'wb') as f:
                f.write(img_data)
            print(f'Downloaded image {idx+1}')
        except Exception as e:
            print(f'Download failed: {str(e)}')


if __name__ == '__main__':
    article_url = 'https://mp.weixin.qq.com/s/xxxxxxxx'
    download_images(article_url)
```
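The core code saves every file as `.jpg`. WeChat image URLs usually carry a `wx_fmt` query parameter naming the real format (png, gif, jpeg); treating that as an assumption, a small helper like the hypothetical `guess_extension` below could pick a better extension:

```python
from urllib.parse import parse_qs, urlparse


def guess_extension(img_url, default='jpg'):
    """Derive a file extension from the wx_fmt query parameter, falling back to a default."""
    query = parse_qs(urlparse(img_url).query)
    fmt = query.get('wx_fmt', [default])[0]
    return 'jpg' if fmt == 'jpeg' else fmt
```

Inside `download_images`, the save path would then become `f'{save_dir}/image_{idx+1}.{guess_extension(img_url)}'`.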
III. Advanced Version (Dynamic Loading)
1. Environment Setup
```bash
pip install selenium webdriver-manager
```
2. Automated Crawling Code
```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By


def dynamic_crawler(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    driver.get(url)

    # Scroll down a few times so lazy-loaded images get requested
    for _ in range(3):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)

    images = driver.find_elements(By.XPATH, '//img[@data-src]')
    img_urls = [img.get_attribute('data-src') for img in images]
    driver.quit()
    return img_urls
```
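The fixed `time.sleep(2)` pauses work, but they waste time on fast pages and can be too short on slow ones. An optional refinement (not part of the original code) is an explicit wait; the sketch below blocks until at least one lazy-loaded image tag is present, with `wait_for_images` being a hypothetical helper name:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_images(driver, timeout=10):
    """Block until at least one <img data-src> element has been added to the DOM."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, '//img[@data-src]'))
    )
```

It could be called once after `driver.get(url)` and again after each scroll, replacing or shortening the sleeps.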
IV. Key Feature Extensions
1. Batch Crawling Multiple Articles
```python
article_list = [
    'https://mp.weixin.qq.com/s/xxxx1',
    'https://mp.weixin.qq.com/s/xxxx2'
]

os.makedirs('wx_images', exist_ok=True)

for art_idx, article in enumerate(article_list, start=1):
    # dynamic_crawler() returns the image URLs, so fetch and save them directly
    img_urls = dynamic_crawler(article)
    for idx, img_url in enumerate(img_urls, start=1):
        img_data = requests.get(img_url).content
        with open(f'wx_images/article{art_idx}_image{idx}.jpg', 'wb') as f:
            f.write(img_data)
```
2. Proxy Configuration (to Avoid Bans)
```python
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get(url, proxies=proxies)
```
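Passing `proxies=` on every call gets repetitive. A `requests.Session` can hold the proxy and header settings for all requests; a minimal sketch, with the proxy addresses kept as the placeholder values from above:

```python
import requests

session = requests.Session()
session.proxies.update({
    'http': 'http://10.10.1.10:3128',   # placeholder addresses, replace with real proxies
    'https': 'http://10.10.1.10:1080',
})
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})

# Then use session.get() wherever requests.get() appears, e.g.
# response = session.get(article_url)
```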
V. Complete Project Structure
```text
wx_image_crawler/
├── crawler.py          # Main program
├── config.yaml         # Proxy/URL configuration
├── requirements.txt    # Dependencies
└── wx_images/          # Image storage directory
```
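Since `config.yaml` is listed as holding the proxy and URL settings, `crawler.py` would need to read it, for example with PyYAML (`pip install pyyaml`). The key names below are illustrative only; the original article does not define the file's schema:

```python
import yaml  # requires: pip install pyyaml

# Hypothetical config.yaml layout:
#   proxies:
#     http: http://10.10.1.10:3128
#     https: http://10.10.1.10:1080
#   articles:
#     - https://mp.weixin.qq.com/s/xxxx1
#     - https://mp.weixin.qq.com/s/xxxx2

with open('config.yaml', encoding='utf-8') as f:
    config = yaml.safe_load(f)

proxies = config.get('proxies', {})
article_list = config.get('articles', [])
```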
The code above is sample code only; it is not a working product and is intended purely for learning.
Project source code: GitHub repository