Scraping Tmall Phone Reviews
Published: 2019-06-14


import json
import time

import requests
from bs4 import BeautifulSoup

tm_headers = {
    "scheme": "https",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Content-Type": "text/html",
}

# Placeholder: the original script passes `db` around but never defines it.
# Replace this with your own database handle before storing rows.
db = None


def req(url, headers):
    # Fetch a URL and return the parsed response, or None on any failure.
    soup = None
    try:
        content = requests.get(url, headers=headers, timeout=2)
        code = content.status_code
        if code == 200:
            soup = BeautifulSoup(content.text, "html.parser")
    except Exception as e:
        print("get url error, url: {0}, error: {1}".format(url, e))
    return soup


def get_phone_list():
    # Collect the product URLs from the phone listing page.
    phone_list = []
    list_url = "https://shouji.tmall.com/?spm=a222t.8063993.a2226c3nav.5.7b8f4da0yjyxC3&acm=lb-zebra-155904-807029.1003.4.767290&scm=1003.4.lb-zebra-155904-807029.OTHER_14592967254716_767290#J_floor12"
    soup = req(list_url, tm_headers)
    if soup is None:
        return phone_list
    txt = soup.find_all("li", class_="focus-")
    # The last five entries are skipped, as in the original script.
    for i in txt[:-5]:
        a = i.find("a")
        name = i.find("h3").get_text()
        href = a.get("href")
        if name != "":
            phone_list.append({"url": "https:" + href, "name": name})
    return phone_list


def create_deltail_url(url, page=1, itemid=None, sellerid=None):
    # Build the review URL. The review API turned out to need two ids,
    # itemid and sellerid; sellerid has to be read from the detail page.
    if itemid is None and sellerid is None:
        itemid = url.split("id=")[-1].split("&")[0]
        soup = req(url, tm_headers)
        txt = soup.find_all("meta")[-1].get("content")
        sellerid = txt.split("userid=")[-1].replace(";", "")
    # Built outside the `if` so that later pages, which pass both ids in,
    # also get a URL (the original raised UnboundLocalError here).
    comment_json_url = "https://rate.tmall.com/list_detail_rate.htm?itemId={0}&sellerId={1}&currentPage={2}".format(itemid, sellerid, page)
    return comment_json_url, itemid, sellerid


def get_deltail(db, comment_json_url, itemid, sellerid, name):
    # Call the review API and read one page of reviews.
    # Returns the last page number, or None if the response is unusable.
    pagenum = None
    comment_data = req(comment_json_url, tm_headers)
    if comment_data is not None:
        count = 1
        # Retry up to four more times until the payload contains "paginator".
        while "paginator" not in str(comment_data) and count < 5:
            comment_data = req(comment_json_url, tm_headers)
            count += 1
            time.sleep(1)
        try:
            # Drop the 15-character prefix in front of the JSON object.
            comment_str = str(comment_data)[15:]
            comment_json = json.loads(comment_str)
        except Exception as e:
            return None
        rateList = comment_json["rateList"]
        for item in rateList:
            data = {}
            data["itemid"] = itemid
            data["usernick"] = item["displayUserNick"]
            data["comment_content"] = item["rateContent"]
            data["comment_date"] = item["rateDate"]
            data["sellerid"] = sellerid
            # insert db (left unimplemented in the original)
        pagenum = comment_json["paginator"]["lastPage"]
    return pagenum


if __name__ == "__main__":
    phone_list = get_phone_list()
    for phone_url in phone_list:
        name = phone_url["name"]
        url = phone_url["url"]
        print("Start scraping: {0} phone, page: {1}".format(name, 1))
        comment_json_url, itemid, sellerid = create_deltail_url(url)
        pagenum = get_deltail(db, comment_json_url, itemid, sellerid, name)
        if pagenum is not None:
            page = 2
            # Assumes lastPage is the number of the final page, so it is
            # included here (the original used `<` and skipped it).
            while page <= pagenum:
                print("Start scraping: {0} phone, page: {1}".format(name, page))
                comment_json_url, itemid, sellerid = create_deltail_url(phone_url["url"], page, itemid, sellerid)
                get_deltail(db, comment_json_url, itemid, sellerid, name)
                page += 1
                time.sleep(2)
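
The script pulls the item id out of the URL with `split("id=")` and strips a fixed 15-character prefix before calling `json.loads`. Both steps are fragile: a later parameter whose name ends in `id=` (for example `skuId=`) changes what `split("id=")[-1]` returns, and the fixed offset breaks if the prefix length changes. Below is a minimal sketch of a more tolerant approach that uses only the standard library. The helper names `extract_item_id` and `parse_wrapped_json` are hypothetical and not part of the original script, and the sample inputs in the usage note are made up for illustration.

```python
import json
from urllib.parse import urlparse, parse_qs


def extract_item_id(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    # urlparse also handles protocol-relative links such as "//host/path?id=1".
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    return values[0] if values else None


def parse_wrapped_json(text):
    """Parse a JSON object that is surrounded by extra text.

    Takes everything from the first "{" to the last "}" and tries to decode
    it. Returns None if no object is found or decoding fails.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except ValueError:
        return None
```

For example, `extract_item_id("//detail.tmall.com/item.htm?id=12345&skuId=678")` returns `"12345"`, where the `split`-based expression would return `"678"`. `parse_wrapped_json` could stand in for the `[15:]` slice in `get_deltail`, assuming the response is a single JSON object with some wrapper text around it.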

 

Reposted from: https://www.cnblogs.com/dockers/p/7767914.html
