WeChat Official Account Image Scraper: Development Guide

I. Comparison of Technical Approaches

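Approach                  Advantages                           Disadvantages
Requests+BeautifulSoup    Lightweight and fast                 Cannot handle dynamically loaded content
Direct API calls          Stable and reliable                  -
mitmdump packet capture   Can capture encrypted data streams   Complex to configure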
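One rough way to choose between the static and dynamic approaches is to check whether the raw HTML of an article already contains `<img>` tags with a `data-src` attribute. The sketch below uses only the standard library. The names `LazyImageCounter` and `static_html_has_images` are hypothetical helpers introduced for illustration, and the check is a heuristic rather than a guarantee.

```python
from html.parser import HTMLParser


class LazyImageCounter(HTMLParser):
    """Counts <img> tags that carry a data-src attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == 'img' and any(name == 'data-src' for name, _ in attrs):
            self.count += 1


def static_html_has_images(html_text):
    """Return True if the raw HTML already lists images via data-src."""
    counter = LazyImageCounter()
    counter.feed(html_text)
    return counter.count > 0
```

If this returns False for a page fetched with plain `requests`, the images are probably injected by JavaScript and a browser-based approach is more likely to work.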

II. Basic Implementation (Static Pages)

1. Environment Setup

pip install requests beautifulsoup4

2. Core Code

import os

import requests
from bs4 import BeautifulSoup


def download_images(url, save_dir='wx_images'):
    # Create the output directory
    os.makedirs(save_dir, exist_ok=True)

    # Fetch the article page
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Parse the image links (only <img> tags that have a data-src attribute)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img', {'data-src': True})

    # Download each image
    for idx, img in enumerate(img_tags, start=1):
        img_url = img['data-src'].replace('&amp;', '&')
        try:
            img_resp = requests.get(img_url, headers=headers, timeout=10)
            img_resp.raise_for_status()
            with open(os.path.join(save_dir, f'image_{idx}.jpg'), 'wb') as f:
                f.write(img_resp.content)
            print(f'Downloaded image {idx}')
        except (requests.RequestException, OSError) as e:
            print(f'Download failed: {e}')


if __name__ == '__main__':
    article_url = 'https://mp.weixin.qq.com/s/xxxxxxxx'  # Replace with the target article URL
    download_images(article_url)
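The code above saves every file with a `.jpg` extension, even when the image is a PNG or GIF. A minimal sketch of picking the extension from the URL follows. It assumes the image URL carries a `wx_fmt` query parameter naming the format; that parameter, the `KNOWN_FORMATS` mapping, and the helper name `guess_extension` are assumptions for illustration and should be checked against real article URLs.

```python
from urllib.parse import urlparse, parse_qs

# Assumed mapping from wx_fmt values to file extensions
KNOWN_FORMATS = {'jpeg': 'jpg', 'jpg': 'jpg', 'png': 'png', 'gif': 'gif', 'webp': 'webp'}


def guess_extension(img_url, default='jpg'):
    """Pick a file extension from the wx_fmt query parameter, if present."""
    query = parse_qs(urlparse(img_url).query)
    fmt = query.get('wx_fmt', [''])[0].lower()
    # Fall back to the default when the parameter is missing or unknown
    return KNOWN_FORMATS.get(fmt, default)
```

The result could then be used when building the file name, for example `f'image_{idx}.{guess_extension(img_url)}'`.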

III. Advanced Implementation (Dynamic Loading)

1. Environment Setup

pip install selenium webdriver-manager 

2. Automated Crawling Code

import time

from selenium import webdriver
from selenium.webdriver.common.by import By


def dynamic_crawler(url):
    # Configure a headless browser
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Hide the browser window
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Simulate scrolling so that dynamically loaded content is triggered
        for _ in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

        # Extract the image links
        images = driver.find_elements(By.XPATH, '//img[@data-src]')
        img_urls = [img.get_attribute('data-src') for img in images]
    finally:
        # Always close the browser, even if loading or parsing fails
        driver.quit()

    return img_urls
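The fixed three scrolls above may be too few for long articles and wasteful for short ones. One possible alternative, sketched below, keeps scrolling until the page height stops growing. The helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical; the only driver method it relies on is `execute_script`, which the code above already uses.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    for rounds in range(1, max_rounds + 1):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # Give lazy-loaded content time to appear
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            return rounds  # Nothing new was loaded, so stop early
        last_height = new_height
    return max_rounds
```

Inside `dynamic_crawler`, a call such as `scroll_until_stable(driver)` could stand in for the `for _ in range(3)` loop. The `max_rounds` cap keeps the loop from running forever on pages that grow without end.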

IV. Key Feature Extensions

1. Batch Crawling of Multiple Articles

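import os
import requests

article_list = [
    'https://mp.weixin.qq.com/s/xxxx1',
    'https://mp.weixin.qq.com/s/xxxx2'
]

for n, article in enumerate(article_list, start=1):
    # dynamic_crawler returns image URLs, so save them directly here
    save_dir = f'wx_images/article_{n}'
    os.makedirs(save_dir, exist_ok=True)
    for idx, img_url in enumerate(dynamic_crawler(article), start=1):
        img_data = requests.get(img_url, timeout=10).content
        with open(f'{save_dir}/image_{idx}.jpg', 'wb') as f:
            f.write(img_data)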
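When crawling many articles, the list returned by `dynamic_crawler` may contain the same image more than once, or entries that are empty because an attribute could not be read. A small sketch of cleaning the list before downloading is shown below; `unique_urls` is a hypothetical helper name.

```python
def unique_urls(urls):
    """Drop empty entries and duplicates while keeping the original order."""
    seen = set()
    result = []
    for url in urls:
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

For example, `unique_urls(dynamic_crawler(article))` would give each image URL at most once, in page order.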

2. Proxy Configuration (to Avoid Bans)

proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}

response = requests.get(url, proxies=proxies)
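A single fixed proxy can itself get blocked, so one common variation is to rotate through several. The sketch below builds a `proxies` dictionary in the same shape as above, cycling through a list on each call. The helper name `make_proxy_rotator` is hypothetical, and any proxy addresses passed in are placeholders to be replaced with real ones.

```python
from itertools import cycle


def make_proxy_rotator(proxy_urls):
    """Return a function that yields the next proxies dict on each call."""
    if not proxy_urls:
        raise ValueError('proxy_urls must not be empty')
    pool = cycle(proxy_urls)

    def next_proxies():
        proxy = next(pool)
        # Use the same proxy for both plain and TLS traffic
        return {'http': proxy, 'https': proxy}

    return next_proxies
```

A call such as `requests.get(url, proxies=next_proxies(), timeout=10)` would then use a different proxy on each request.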

V. Complete Project Structure

wx_image_crawler/
├── crawler.py         # Main program
├── config.yaml        # Proxy/URL configuration
├── requirements.txt   # Dependencies
└── wx_images/         # Image storage directory

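The code above is for illustration only; it is not a complete, working program and is intended solely for learning.
Project source code download: GitHub repository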