Result page
First, here is the URL where you can see the final result: cnVar.cn
Download
Background
Recently a colleague of mine brought up the weekly IPO review status he had seen on a WeChat official account, and said he was curious about where the data came from. So I combed through the entire CSRC website and eventually found a rather inconspicuous page: 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况 (Administrative Licensing: the Issuance Supervision Department's IPO review workflow and applicant-company status). After some digging I confirmed that this page is the source of the weekly IPO issuance figures. That could have been the end of it, but the data is published as Excel files and is not very intuitive to read, so I decided to turn it into charts.
Steps
Crawl the page data and the Excel files -> read the Excel files and merge them into one summary table -> convert that table to markdown (so it can be displayed on HEXO). A sketch of how main.py might wire these steps together follows the directory structure below.
Directory structure
+--main.py
+--processing
| +--data
| | +--graph.html
| | +--index.md
| | +--IPOstatus
| | | +--data
| | | | +--20180727.xls
| | | | +--20180803.xls
| | | | +--20180810.xls
| | | | +--20180817.xls
| | | | +--20180824.xls
| | | +--md
| | | | +--20180727.md
| | | | +--20180803.md
| | | | +--20180810.md
| | | | +--20180817.md
| | | | +--20180824.md
| | | +--stat.csv
| | | +--termination
| | | | +--20180803.xls
| | | | +--20180810.xls
| | | | +--20180817.xls
| | | | +--20180824.xls
| +--datatomd.py
| +--data_crawler.py
| +--generator.py
| +--__init__.py
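To see how the pieces fit together, here is a minimal sketch of what main.py could look like. The parse() call matches the function defined in data_crawler.py in the next section; the directory constants and the commented-out calls into datatomd.py and generator.py are assumptions inferred from the layout above, not the author's exact code.

# main.py -- minimal sketch of the pipeline wiring (assumed, not the original file)
from processing import data_crawler
# from processing import datatomd, generator  # step 2/3 entry points are not shown in this post

# page that the CSRC updates in place every week (taken from data_crawler.py below)
url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"

# target directories, assumed from the directory structure above
DIRTH_DATA = 'processing/data/IPOstatus/data/'
DIRTH_TERMINATION = 'processing/data/IPOstatus/termination/'
DIRTH_MD = 'processing/data/IPOstatus/md/'

if __name__ == '__main__':
    # step 1: crawl the page, download the xls files and generate index.md
    data_crawler.parse(url, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)
    # steps 2/3 (hypothetical names): merge the xls files and render the table/charts
    # datatomd.run()
    # generator.run()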
Data crawling
Looking at the page 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况, you can see that the CSRC updates the data on this page every week without changing the page URL, which saves the crawler some work. urllib and BeautifulSoup are enough to get the job done. First, create a new Python file, data_crawler.py, and paste in the code below. It consists of two functions; calling parse() from main.py ultimately generates a markdown file named index.md.
# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018
@author: 柯西君_BingWong
"""
import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import re
import os
import csv

#DIRTH_DATA = './data/'
#DIRTH_MD = './md/'
#url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"

# save the crawled content to a file
def save_file(content, filename):
    f = open(filename, 'w', encoding="utf-8")
    f.write(content)
    f.close()

# fetch the page with urllib and parse it with BeautifulSoup
def parse(url, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD):
    try:
        html = urllib.request.urlopen(url)
        soup = BeautifulSoup(html, 'lxml')
        title = soup.title.string[8:]  # page title (not used further below)
        # description & statistics
        text = ""
        stat = []
        for p in soup.find_all(name='p')[-1]:
            text += str(p).replace("<span>", "").replace("</span>", "")
        description = text[4:]
        stat = re.findall(r'\d+', description)  # get all the stat numbers such as total firms, passed and failed ones
        stat = stat[3:]
        # file links
        links = ""
        for link in soup.find_all(re.compile("^a"))[:3]:
            links += "[{}]({}{})\n".format(link.string, url[:-21], link['href'][1:])
        # date
        date = soup.select("#headContainer span")[2].string
        date = date.replace('年', '').replace('月', '').replace('日', '')
        # generate markdown as the output file
        markdown = """---\ntitle: 首次公开发行股票申请企业情况\ncomment: false\ndate: \n---\n"""
        markdown += """\n{}\n{}\n<iframe src = "graph.html" width="1200px" height="3000px" frameborder=0 marginheight=0 marginwidth=0 scrolling="no"></iframe> \n
""".format(description, links)
        if not os.path.exists(DIRTH_DATA + date + ".xls"):
            # save the markdown files
            file_name = DIRTH_MD + date + '.md'
            save_file(markdown, file_name)
            save_file(markdown, 'processing/data/index.md')
            # download the xls files
            status_name = DIRTH_DATA + date + '.xls'
            status_file = soup.find_all(re.compile("^a"))[1]
            status_file = url[:-21] + status_file['href'][1:]
            urllib.request.urlretrieve(status_file, status_name)
            termination_name = DIRTH_TERMINATION + date + '.xls'
            termination_file = soup.find_all(re.compile("^a"))[2]
            termination_file = url[:-21] + termination_file['href'][1:]
            urllib.request.urlretrieve(termination_file, termination_name)
            # append the statistics to the csv file
            stat.insert(0, date)
            with open('processing/data/IPOstatus/stat.csv', 'a', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(stat)
        else:
            # the data for this date has already been fetched, nothing to do
            print('数据已于' + date + '更新,无需再次操作!')
    except urllib.error.URLError as e:
        print(e)

#parse(url, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)
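A side note on how the download links are built: url[:-21] simply strips the trailing "/t20180324_335702.htm" (21 characters) from the page URL, and link['href'][1:] drops the leading dot from the relative hrefs, so concatenating the two yields the absolute file URL. If the length of the page's file name ever changes, urllib.parse.urljoin is a more robust way to resolve the same links; a minimal sketch, reusing the url and soup variables from parse() above:

from urllib.parse import urljoin

# equivalent to url[:-21] + link['href'][1:], but independent of the file-name length
for link in soup.find_all(re.compile("^a"))[:3]:
    file_url = urljoin(url, link['href'])  # resolves the relative href against the page URL
    print(link.string, file_url)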