Crawling IPO Application Status Data, pyecharts and Table Generation (Part 1)

Result Page

First, here is the URL of the final result: cnVar.cn

Download

fundviz/IPOStatus

Motivation

Recently a colleague of mine brought up the IPO status figures he had seen on a WeChat public account and said he was curious where the data came from. So I combed through the entire CSRC website and eventually found a rather inconspicuous page: 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况 (Administrative Licensing Items: the Issuance Supervision Department's IPO review workflow and applicant company status). After some digging I confirmed that this page is the source of the weekly IPO status data. That would normally have been the end of it, but since the data is published as Excel sheets and is not very intuitive to read, I decided to turn it into charts.

Steps

Crawl the page data and Excel files -> read the Excel files and merge them into summary statistics -> convert the resulting table to Markdown (convenient for displaying on HEXO).
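Before looking at each module, here is a minimal sketch of how main.py could wire these three stages together. Only data_crawler.parse() is defined in this post; stages 2 and 3 (handled by datatomd.py and generator.py in later parts) appear only as commented placeholders, and the directory constants simply mirror the tree in the next section, so treat the exact names and paths as assumptions rather than the author's actual script.

# main.py -- a minimal orchestration sketch, not the author's exact script
from processing import data_crawler

URL = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"
DIRTH_DATA = 'processing/data/IPOstatus/data/'
DIRTH_TERMINATION = 'processing/data/IPOstatus/termination/'
DIRTH_MD = 'processing/data/IPOstatus/md/'

if __name__ == '__main__':
    # Stage 1: crawl the page text, download the weekly xls files and write index.md
    data_crawler.parse(URL, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)
    # Stage 2 (later post): read and merge the xls files -- a function in datatomd.py
    # Stage 3 (later post): render the pyecharts chart to graph.html -- a function in generator.py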

Directory Structure

+--main.py
+--processing
|      +--data
|      |      +--graph.html
|      |      +--index.md
|      |      +--IPOstatus
|      |      |      +--data
|      |      |      |      +--20180727.xls
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      |      |      +--md
|      |      |      |      +--20180727.md
|      |      |      |      +--20180803.md
|      |      |      |      +--20180810.md
|      |      |      |      +--20180817.md
|      |      |      |      +--20180824.md
|      |      |      +--stat.csv
|      |      |      +--termination
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      +--datatomd.py
|      +--data_crawler.py
|      +--generator.py
|      +--__init__.py

Data Crawling

Analyzing the page 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况 shows that the CSRC updates the data on this page every week without changing the page URL, which saves the crawler some work: urllib and BeautifulSoup are all we need. Start by creating a new Python file, data_crawler.py, and paste in the code below. It consists of two functions; calling parse() from main.py ultimately produces a Markdown file named index.md.

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
#"""
import urllib
from bs4 import BeautifulSoup
import re
import os
import csv

#DIRTH_DATA = './data/'
#DIRTH_TERMINATION = './termination/'
#DIRTH_MD = './md/'
#url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"

# Save the crawled content to a file
def save_file(content, filename):
    f = open(filename, 'w',encoding="utf-8")
    f.write(content)
    f.close()

# Parse the page with urllib and BeautifulSoup
def parse(url,DIRTH_DATA,DIRTH_TERMINATION,DIRTH_MD):
    try:
        html = urllib.request.urlopen(url)
        soup = BeautifulSoup(html, 'lxml')

        title = soup.title.string[8:]

        #description & statistics
        text = ""
        stat = []
        for p in soup.find_all(name='p')[-1]:
            text += str(p).replace("<span>","").replace("</span>","")
        description = (text[4:])    
        stat = re.findall(r'\d+', description)  # get all the stat numbers such as total firms, passed and failed ones
        stat = stat[3:]
        #file links
        links = ""
        for link in soup.find_all(re.compile("^a"))[:3]:
            links += "[{}]({}{})\n".format(link.string,url[:-21],link['href'][1:])

        #date    
        date = soup.select("#headContainer span")[2].string
        date = date.replace('年','').replace('月','').replace('日','')

        #generate markdown as output file
        markdown = """---\ntitle: 首次公开发行股票申请企业情况\ncomment: false\ndate: \n---\n"""
        markdown += """\n{}\n{}\n<iframe src = "graph.html" width="1200px"  height="3000px" frameborder=0 marginheight=0  marginwidth=0 scrolling="no"></iframe> \n
        """.format( description,  links)

        if not os.path.exists(DIRTH_DATA + date +".xls"):
            #save md
            file_name = DIRTH_MD + date + '.md'
            save_file(markdown,file_name)
            save_file(markdown,'processing/data/index.md')

            #download xls file
            status_name = DIRTH_DATA + date + '.xls'
            status_file = soup.find_all(re.compile("^a"))[1]
            status_file = url[:-21] + status_file['href'][1:]
            urllib.request.urlretrieve(status_file, status_name)

            termination_name = DIRTH_TERMINATION + date + '.xls'
            termination_file = soup.find_all(re.compile("^a"))[2]
            termination_file = url[:-21] + termination_file['href'][1:]
            urllib.request.urlretrieve(termination_file, termination_name)

            #append stat to csv file
            stat.insert(0,date)
            with open('processing/data/IPOstatus/stat.csv', 'a', newline='') as f:  # newline='' avoids blank rows on Windows
                writer = csv.writer(f)
                writer.writerow(stat)

        else:
            print('数据已于'+ date +'更新,无需再次操作!')

    except urllib.error.URLError as e:
        print(e)

#parse(url, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)
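As a quick standalone test of the crawler (without going through main.py), the commented-out constants above can be filled in and parse() called directly. The paths below are examples that follow the directory tree shown earlier, not necessarily the author's exact configuration:

if __name__ == '__main__':
    # example paths mirroring the project layout; adjust to your own setup
    url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"
    DIRTH_DATA = 'processing/data/IPOstatus/data/'
    DIRTH_TERMINATION = 'processing/data/IPOstatus/termination/'
    DIRTH_MD = 'processing/data/IPOstatus/md/'
    parse(url, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)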

柯西君_BingWong | 2018-08-26