[Python Is Really Powerful] Elegantly Scraping Paginated Web Data with yield

Ihavegotyou

Blog categories: python, yield


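When scraping web pages with Python, you often run into paginated data. On some sites the next-page button carries an actual link; on others, pagination is handled by JavaScript. Either way, the code has to parse the content of each page and also collect the URL of the next page. How can this be written elegantly in Python, or in other words, more Pythonically? A generator is a good fit: it yields the content of each page as soon as it is fetched and follows the next-page link itself, so the caller only has to loop over the results.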
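Before the real code, the core idea can be sketched in a few lines. This is a minimal Python 3 illustration, not the author's implementation: FAKE_SITE, fetch and iter_pages are hypothetical names, and the in-memory dictionary stands in for real HTTP requests and HTML parsing.

```python
# Hypothetical in-memory "site": each URL maps to (page content, next URL or None).
FAKE_SITE = {
    "/list?page=1": ("rows 1-10", "/list?page=2"),
    "/list?page=2": ("rows 11-20", "/list?page=3"),
    "/list?page=3": ("rows 21-25", None),
}


def fetch(url):
    """Stand-in for an HTTP request plus extraction of the next-page link."""
    return FAKE_SITE[url]


def iter_pages(start_url):
    """Yield the content of every page, following next-page links until none is left."""
    url = start_url
    while url is not None:
        content, url = fetch(url)
        yield content


# The caller never deals with pagination; it only loops over the pages.
pages = list(iter_pages("/list?page=1"))
print(pages)
```

The pagination logic lives in one place, and the parsing code that consumes the pages stays free of any URL bookkeeping.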

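Code excerpts for both cases follow. They are excerpts only: names such as curr_session, headers, cookies, TIMEOUT, save_html_content and LOG are defined elsewhere in the original scripts.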

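# Case 1: the next-page button carries a real link (Python 2 code)
def get_next_page(obj):
    '''Yield the content of each page, starting from a URL or from page content.'''
    content = None
    e_next_page = None
    error_occurred = False
    for retry2 in xrange(3):  # up to three attempts
        try:
            if isinstance(obj, basestring):
                # obj is a URL: download the page and save a copy
                resp = curr_session.get(obj, timeout=TIMEOUT, headers=headers,
                                        cookies=cookies, allow_redirects=True)
                content = resp.content
                save_html_content(obj, content)
            else:
                # obj is already page content
                content = obj
            soup = BeautifulSoup(content, features='html5lib', from_encoding="utf8")
            e_next_page = soup.find('a', text="下頁")  # the "next page" link
            error_occurred = False
            break
        except Exception:
            error_occurred = True
            time.sleep(2)
    if error_occurred:
        # Every attempt failed: yield whatever was downloaded, if anything, and stop
        if content is not None:
            yield content
        return
    if e_next_page:
        next_url = "http://www.etnet.com.hk" + e_next_page.get('href')
        time.sleep(2)
        yield content
        # Recurse into the next page and re-yield everything it produces
        for i in get_next_page(next_url):
            yield i
    else:
        yield content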
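The recursive re-yield at the end of this function (loop over the sub-generator and yield each item) is the Python 2 idiom. In Python 3.3 and later, `yield from` expresses the same delegation in one line. The sketch below compares the two forms on a hypothetical chain of pages; NEXT_PAGE and both function names are made up for illustration.

```python
# Hypothetical page chain used only for illustration.
NEXT_PAGE = {"p1": "p2", "p2": "p3", "p3": None}


def pages_with_loop(page):
    """Recursive generator that re-yields with an explicit loop, as in the post."""
    yield page
    nxt = NEXT_PAGE[page]
    if nxt:
        for item in pages_with_loop(nxt):
            yield item


def pages_with_yield_from(page):
    """Same traversal, delegating to the sub-generator with `yield from`."""
    yield page
    nxt = NEXT_PAGE[page]
    if nxt:
        yield from pages_with_yield_from(nxt)


print(list(pages_with_loop("p1")))
print(list(pages_with_yield_from("p1")))
```

Both forms nest one generator per page, so a very long listing could approach Python's recursion limit; a plain loop over URLs avoids that if it ever becomes a concern.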

# Case 2: the next-page control has no usable link, so the URL is rebuilt
# with an incremented page number (Python 2 code)
def get_next_page(obj, page=1):
    '''Yield the content of each page, starting from a URL or from page content.'''
    content = None
    e_next_page = None
    error_occurred = False
    for retry2 in xrange(3):  # up to three attempts
        try:
            if isinstance(obj, basestring):
                # obj is a URL: download the page and save a copy
                resp = curr_session.get(obj, timeout=TIMEOUT, headers=headers,
                                        cookies=cookies, allow_redirects=True)
                content = resp.content
                save_html_content(obj, content)
                hrefs = re.findall('industrysymbol=.*&market_id=[^;]+', content)
                if page == 1 and "sh=" not in obj and hrefs:
                    # The first URL has no "sh=" parameter: rebuild it from the
                    # query string found in the page (with sh=0) and start over
                    reset_url = ("http://www.aastocks.com/tc/cnhk/market/industry"
                                 "/sector-industry-details.aspx?%s&page=1" %
                                 (hrefs[0].replace('sh=1', 'sh=0').replace('&page=', '')
                                  .replace("'", '').split()[0]))
                    for next_page in get_next_page(reset_url):
                        yield next_page
                    return
            else:
                # obj is already page content
                content = obj
            soup = BeautifulSoup(content, features='html5lib', from_encoding="utf8")
            e_next_page = soup.find('td', text="下一頁 ")  # the "next page" cell
            error_occurred = False
            break
        except Exception:
            error_occurred = True
            LOG.error(traceback.format_exc())
            time.sleep(2)
    if error_occurred:
        # Every attempt failed: yield whatever was downloaded, if anything, and stop
        if content is not None:
            yield content
        return
    next_url = None
    if e_next_page:
        hrefs = re.findall('industrysymbol=.*&market_id=[^;]+', content)
        if hrefs:
            next_url = ("http://www.aastocks.com/tc/cnhk/market/industry/sector-industry"
                        "-details.aspx?%s&page=%d" %
                        (hrefs[0].replace('sh=1', 'sh=0')
                         .replace('&page=', '').replace("'", '').split()[0], page+1))
    if next_url:
        time.sleep(2)
        yield content
        for next_page in get_next_page(next_url, page+1):
            yield next_page
    else:
        # No next page, or no query string to build its URL from
        yield content
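Both functions wrap the download in a three-attempt retry loop with a two-second pause. That part can be pulled out into a small helper, which keeps the generator itself focused on pagination. The following Python 3 sketch is one possible shape, not the author's code: fetch_with_retry, flaky_fetch and the example URL are hypothetical, and catching IOError is an assumption about which errors are worth retrying.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times; return its result, or None if all fail."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except IOError:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(delay)
    return None


# Hypothetical flaky fetcher: fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>page</html>"


# The sleep function is injected so the demonstration does not actually wait.
result = fetch_with_retry(flaky_fetch, "http://example.com/list", sleep=lambda s: None)
print(result, len(calls))
```

Returning None on total failure lets the calling generator decide whether to stop or to skip the page, instead of yielding content that may never have been assigned.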

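# Usage: walk through every listing URL and parse each page as the generator yields it
for curr_href in e_href:
    # Pause for a random interval before each listing
    retry_interval = random.randint(MIN_INTERVAL_SECONDS_FOR_RETRIEVING,
                                    MAX_INTERVAL_SECONDS_FOR_RETRIEVING)
    time.sleep(retry_interval)
    contents = get_next_page(curr_href)
    for content in contents:
        get_page_data(content)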

Posted: 2017-08-29 16:41
Category: Programming Languages
