京东爬虫词云数据挖掘

python

发布日期: 2019-01-26

文章字数: 1.6k

阅读时长: 7 分

阅读次数:

问题陈述

近年来，网上购物已经越来越流行，人们对于评论一个商品的行为习惯也多多少少受到商家的影响。

好评请求

类似于这样的好评请求，好心的消费者由于见惯了，渐渐的就形成了好评的习惯。一味好评，会拉高商品的好评率，但是这对于新客户来说是非常具有误导性的，可能会让他们买了不好的产品。
京东的评价机制，没有评价的话，默认好评，结果不显示。

默认好评

有些顾客即使说出了差评，但是在打分上还是会给5星好评。

评分评价不一致

我们迫切的需要一款可以查明真实评价的渠道。

需要的信息

京东商品爬取，需要以下信息，①商品的名称；②商品的价格；③商品的评论；④商品的好评率、中评率和差评率。

接口描述	URL	格式	备注
抓取价格	https://p.3.cn/prices/mgets?skuIds={skuid}	json	skuid是商品的id，也就是https://item.jd.com/7029545.html 商品网址的数字部分
抓取名称	https://item.jd.com/{skuid}.html	html	skuid是商品的id，也就是https://item.jd.com/7029545.html 商品网址的数字部分
抓取评论	https://club.jd.com/comment/productPageComments.action?productId={skuid}&score=0&sortType=6&page={i}&pageSize=10&isShadowSku=0&fold=1	json	skuid是商品的id，也就是https://item.jd.com/7029545.html 商品网址的数字部分;i是评论的页数，从零开始递增

爬虫实现

爬取所有商品的基本信息

按照关键字搜索特定商品，或直接爬取https://www.jd.com/allSort.aspx

def get_cellphones_list(page):
    header = {'Connection': 'close',
              'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'}
    url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=" + str(
        page) + "&s=1&click=0"
    # 请求腾讯新闻的URL，获取其text文本
    wbdata = requests.get(url, headers=header).text
    # 对获取到的文本进行解析
    soup = BeautifulSoup(wbdata, 'lxml')
    # 从解析文件中通过select选择器定位指定的元素，返回一个列表
    cellphones = soup.select("li.gl-item > div.gl-i-wrap > div.p-img > a")
    cell_id = soup.find_all("li", class_="gl-item")

    ret_list = []
    # 对返回的列表进行遍历
    for (a_tags, li_tags) in zip(cellphones, cell_id):
        # 提取出标题和链接信息
        skuid = li_tags.get("data-sku")
        link = a_tags.get("href")
        url = "https://item.jd.com/" + str(skuid) + ".html"
        try:
            html = requests.get(url, headers=header).text
        except:
            print(time.asctime(time.localtime(time.time())) + ": Connection refused by the server..")
            print("Let me sleep for 5 seconds")
            print("ZZzzzz...")
            time.sleep(5)
            print(time.asctime(time.localtime(time.time())) + ": Was a nice sleep, now let me continue...")
            continue
        product_name_whole_html = BeautifulSoup(html, 'lxml')
        # 从解析文件中通过select选择器定位指定的元素，返回一个列表
        product_name_html = product_name_whole_html.select("div.p-name > a")
        product_name = product_name_html[0].get_text()

        data = {
            'skuid': skuid,
            'product_name': product_name,
            'link': link
        }
        ret_list.append(data)
    return ret_list


def get_cellphone_list_whole_jd():
    i = 0
    while i < 200:
        list_data = get_cellphones_list(i)
        for p in list_data:
            # 获取数据库链接
            connection = pymysql.connect(host="localhost", user="root", passwd="123123", db="jd", port=3306,
                                         charset="utf8")
            try:
                # 获取会话指针
                with connection.cursor() as cursor:
                    # 创建sql语句
                    sql = "insert into jd.jd_product (`skuid`, `product_name`,`link`) values (%s,%s,%s)"

                    # 执行sql语句
                    cursor.execute(sql, (p["skuid"], p["product_name"], p["link"]))

                    # 提交数据库
                    connection.commit()
            finally:
                connection.close()
        print(time.asctime(time.localtime(time.time())) + ": 手机信息第" + str(i + 1) + "页已存入数据库")
        i += 1

爬取所有商品的基本信息的流程图

爬取某商品所有评论

爬取某商品所有评论的流程图

为保证爬取顺利，开启三个ip进行爬取：

domains = ["http://jdapi.tellyouwhat.cn", "https://club.jd.com","http://jdapi2.tellyouwhat.cn"]

每次从中任取一个域名进行组合

def crawlProductComment(url, skuid):
    # 读取原始数据(注意选择gbk编码方式)
    try:
        html = urllib.request.urlopen(url).read().decode('gbk', 'ignore')
    except Exception:
        return -1

    data = json.loads(html)

    if len(data['comments']) == 0:
        print(time.asctime(time.localtime(time.time())) + "\nSeems no more data to crawl")
        return -1

    for comment in data['comments']:
        product_name = comment['referenceName']
        comment_time = comment['creationTime']
        content = comment['content']
        # 数据库操作
        # 获取数据库链接
        connection = pymysql.connect(host="localhost", user="root", passwd="123123", db="jd", port=3306, charset="utf8")
        try:
            # 获取会话指针
            with connection.cursor() as cursor:
                # 创建sql语句
                sql = "insert into jd.jd_items (`skuid`, `productName`,`commentTime`,`content`) values (%s,%s,%s,%s)"
                # 执行sql语句
                cursor.execute(sql, (skuid, product_name, comment_time, content))

                # 提交数据库
                connection.commit()
        finally:
            connection.close()
    return 1

爬评论的过程

爬评论的结果

爬取某商品的价格

def get_product_price(skuid, url="https://p.3.cn/prices/mgets?skuIds="):
    price_json = urllib.request.urlopen(url + str(skuid))
    data = json.load(price_json)
    price = data[0]['p']
    return price

对评论进行情感分析

def process(skuid):
    sum_sentiment = 0
    good_counter = 0
    just_so_so_counter = 0
    bad_counter = 0
    datas = read_data(skuid)

    comments_concat = ""
    for data in datas:
        # print(data[3])
        comments_concat += data[3].replace("hellip", "").replace("rdquo", "")
        sentiment = SnowNLP(data[3]).sentiments
        if sentiment > 0.8:
            good_counter += 1
        elif sentiment > 0.4:
            just_so_so_counter += 1
        else:
            bad_counter += 1
        sum_sentiment += sentiment
        # print(data[3])
        # print(sentiment)
    print("-----------------共计" + str(len(datas)) + "条评论------------------------")
    print("-----------------0.8以上" + str(good_counter) + "条评论------------------------")
    print("-----------------0.4-0.8  " + str(just_so_so_counter) + "条评论------------------------")
    print("-----------------0.4以下" + str(bad_counter) + "条评论------------------------")
    print("average sentiment is {}".format(sum_sentiment / len(datas)))

评论性质	情感分析得分区间
好评	0.8~1.0
中评	0.4~0.8
差评	0~0.4

情感分析结果

官网好评达到99%，然而我们利用情感分析得出的结果是76%。

绘制词云

def draw_wordcloud(text):
    font_path = "‪C:\\Windows\\Fonts\\1.ttf"
    weight = 1000
    height = 1000
    bg_img = imread("jdlogo.png")
    wordcloud = WordCloud(font_path=font_path,
                          width=weight,
                          height=height,
                          mask=bg_img).generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

坚果 3手机jd评论词云

结论

对京东商品的评论分析效果还是很不错的，能找出某商品是否值得购买，对好评率会不会有误导性能提出较为明确的判断依据。

改进建议：引入多线程和异步HttpGet，以提升爬虫效率和稳定性，进一步一体化程序，使之成为一项webservice，可以通过网页直接查询该商品是否值得买。

代码可以在这里https://github.com/HarborZeng/crawler_JD_what_worthy_buying下载到

转载规则

《利用python写一个爬虫爬取京东商品评论分析情感倾向并绘制词云》由 Harbor Zeng 采用知识共享署名 4.0 国际许可协议进行许可。

操作系统期末考试复习资料卷子

[文件] 计算机操作系统.pdf https://pan.wps.cn/l/scnt5b3 [文件] 操作系统精髓与设计原理(第5版).pdf https://pan.wps.cn/l/sxextb8 [文件] 操作系统原理及应用（LINUX）.pdf https://pan.wps.cn/l/s4qbu9e [文件] 操作系统——内核与设计原理.pdf https://pan.

2019-01-26 考试

西北大学复习软件工程操作系统期末

新版中日交流标准日本语单词

包含1~48课的所有单词录音，提供下载