Baidu Tieba: images related to "我愛妳" ("I love you")

Target URL

521 "I love you" movie screenshots

Inspecting the page structure

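The page is laid out roughly like a forum, and I found that BeautifulSoup's find_all('img') should be enough to locate the image files.

Let's give it a try.

The result is a long list of ....., and each entry has a "src" in it. That src is the URL I mainly want to extract.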

import requests
from bs4 import BeautifulSoup

url = 'https://tieba.baidu.com/p/3108805355?pn=2'
savepath = r'D:\python\movieImgs'
html = requests.get(url)
bs = BeautifulSoup(html.text, 'html.parser')
# Collect every <img> tag on the page
d_post = bs.find_all('img')
print(d_post)
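
To see what "a list of img tags, each with a src" looks like without hitting the network, here is a small sketch. It is not the BeautifulSoup call used in this post: it swaps in the standard library's HTMLParser so it runs with no extra packages, and the HTML snippet, the class name ImgSrcCollector and the function collect_img_sources are all made up for illustration.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a forum page; it is not copied from Tieba.
SAMPLE_HTML = """
<div class="post">
  <img class="avatar" src="http://example.com/avatar.png">
  <img class="photo" src="https://imgsa.example.com/pic/0001.jpg">
  <img class="spacer">
</div>
"""


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (None when it is missing)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def collect_img_sources(html_text):
    """Hypothetical helper: return the src of each <img> in document order."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    return collector.sources


print(collect_img_sources(SAMPLE_HTML))
```

The third tag has no src at all, which is why the list can contain None and why the later steps need some filtering.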

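Next, pull the URL out of each src and add it to a list; otherwise, fetching images straight from the results would just hand back one raw string after another.

Some of the URLs in there can't be fetched with requests, so I also filter them with endswith and startswith.

In practice that means keeping only URLs that start with https://imgsa and end with .jpg; anything starting with http://.... or not ending in .jpg gets skipped.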

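import requests
from bs4 import BeautifulSoup

url = 'https://tieba.baidu.com/p/3108805355?pn=2'
savepath = r'D:\python\movieImgs'
html = requests.get(url)
bs = BeautifulSoup(html.text, 'html.parser')
d_post = bs.find_all('img')
imgsList = []
for img in d_post:
    # img['src'] raises KeyError on a tag without a src attribute; .get() returns None instead
    imgs = img.get('src')
    # Keep only the .jpg files whose URL starts with https://imgsa
    if str(imgs).startswith('https://imgsa') and str(imgs).endswith('.jpg'):
        imgsList.append(imgs)
for j in imgsList:
    print(j)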
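
The filter itself can be tried out on plain strings, with no page download involved. This is only a sketch: the helper name is_wanted_image and the example URLs are invented for illustration, and the prefix and suffix defaults simply mirror the startswith/endswith check above.

```python
def is_wanted_image(src, prefix="https://imgsa", suffix=".jpg"):
    """Return True only for string URLs with the expected prefix and suffix."""
    # A missing src shows up as None, so check the type before calling str methods
    return isinstance(src, str) and src.startswith(prefix) and src.endswith(suffix)


# Invented URLs covering each case the filter has to deal with
candidates = [
    "https://imgsa.example.com/pic/0001.jpg",  # kept
    "http://example.com/pic/0002.jpg",         # wrong prefix
    "https://imgsa.example.com/icon.png",      # wrong extension
    None,                                      # <img> without a src
]

wanted = [src for src in candidates if is_wanted_image(src)]
print(wanted)
```

Checking isinstance first avoids turning None into the string 'None', which is what str(imgs) does in the snippet above (harmless there, since 'None' fails both tests anyway).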

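The next step is to write a download function that does the actual downloading.

Because the URL ends with the page number, I write a for loop that runs through the pages.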
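
As a quick sketch of the page loop on its own: the helper name page_urls is made up, and the only thing taken from this post is the thread URL and its pn= parameter. Note that the page number has to be converted to text (here via an f-string) before it is glued onto the URL.

```python
BASE_URL = "https://tieba.baidu.com/p/3108805355"


def page_urls(first_page, last_page):
    """Hypothetical helper: build one URL per page, both ends inclusive."""
    return [f"{BASE_URL}?pn={page}" for page in range(first_page, last_page + 1)]


for url in page_urls(1, 3):
    print(url)
```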

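def DownloadIMGS(imgsurl, savepath, start=0):
    # Make sure the folder exists before the loop so every image, including the first, gets written
    os.makedirs(savepath, exist_ok=True)
    i = start
    for j in imgsurl:
        getUrl = requests.get(j)
        getUrl.raise_for_status()
        print("%s....connected and downloaded" % j)
        # The with block closes the file by itself, so no explicit close() is needed
        with open(os.path.join(savepath, str(i) + '.jpg'), 'wb') as files:
            for d in getUrl.iter_content(10240):
                files.write(d)
        i += 1
    # Return the next unused file number so a later call can continue the numbering
    return i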
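
Two things are easy to get wrong in a download function like this: the folder has to exist before the first file is written, and the file numbering has to carry on from page to page or later pages overwrite earlier ones. Here is a minimal offline sketch of just that bookkeeping. The names save_images, fetch and fake_fetch are hypothetical, and the "download" is faked so nothing touches the network.

```python
import os
import tempfile


def save_images(image_urls, save_dir, fetch, start_index=0):
    """Write each fetched image to save_dir as <n>.jpg and return the next free index."""
    # Create the folder first so the very first image is not skipped
    os.makedirs(save_dir, exist_ok=True)
    index = start_index
    for url in image_urls:
        data = fetch(url)  # fetch is any callable that returns bytes for a URL
        with open(os.path.join(save_dir, f"{index}.jpg"), "wb") as fh:
            fh.write(data)
        index += 1
    return index


def fake_fetch(url):
    """Stand-in for a real HTTP request: just returns the URL text as bytes."""
    return url.encode("utf-8")


# Simulate two pages of results going into a folder that does not exist yet
demo_dir = os.path.join(tempfile.mkdtemp(), "movieImgs")
next_index = save_images(["page1-a", "page1-b"], demo_dir, fake_fetch)
next_index = save_images(["page2-a"], demo_dir, fake_fetch, start_index=next_index)
print(sorted(os.listdir(demo_dir)))
```

For real downloads, fetch could be something like lambda url: requests.get(url).content, though for large files the chunked iter_content loop used in this post is the gentler option.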

The complete code is as follows:

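import requests, os
from bs4 import BeautifulSoup
import time

def DownloadIMGS(imgsurl, savepath, start=0):
    # Make sure the folder exists before the loop so every image, including the first, gets written
    os.makedirs(savepath, exist_ok=True)
    i = start
    for j in imgsurl:
        getUrl = requests.get(j)
        getUrl.raise_for_status()
        print("%s....connected and downloaded" % j)
        with open(os.path.join(savepath, str(i) + '.jpg'), 'wb') as files:
            for d in getUrl.iter_content(10240):
                files.write(d)
        i += 1
        time.sleep(2)  # pause between requests
    # Return the next unused file number so the next page continues the numbering
    return i


savepath = r'D:\python\movieImgs'
count = 0
for j in range(1, 4):
    # str(j) must be called, not quoted, or every request goes to "?pn=str(j)"
    url = 'https://tieba.baidu.com/p/3108805355?pn=' + str(j)
    html = requests.get(url)
    bs = BeautifulSoup(html.text, 'html.parser')
    d_post = bs.find_all('img')
    imgsList = []
    for img in d_post:
        imgs = img.get('src')
        if str(imgs).startswith('https://imgsa') and str(imgs).endswith('.jpg'):
            imgsList.append(imgs)
    # Download this page's images, continuing the file numbering from the previous page
    count = DownloadIMGS(imgsList, savepath, count)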

😄 All downloaded!
