Web crawler: scraping book information from allitebooks.com: scraping book details and ISBN codes from allitebooks.com, from backslash112...
Published: 2019-06-18


from urllib.request import urlopen  # Python 3; the original imported urlopen from urllib2 (Python 2)
from bs4 import BeautifulSoup


# Get the next page url from the current page url
def get_next_page_url(url):
    page = urlopen(url)
    soup_page = BeautifulSoup(page, 'lxml')
    page.close()
    # Get current page and next page tag
    current_page_tag = soup_page.find(class_="current")
    next_page_tag = current_page_tag.find_next_sibling()
    # Check if the current page is the last one
    if next_page_tag is None:
        next_page_url = None
    else:
        next_page_url = next_page_tag['href']
    return next_page_url


# Get the book detail urls by page url
def get_book_detail_urls(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    page.close()
    urls = []
    book_header_tags = soup.find_all(class_="entry-title")
    for book_header_tag in book_header_tags:
        urls.append(book_header_tag.a['href'])
    return urls


# Get the book detail info (title and ISBN) by book detail url
def get_book_isbn(url):
    page = urlopen(url)
    book_isbn_soup = BeautifulSoup(page, 'lxml')
    page.close()
    title_tag = book_isbn_soup.find(class_="single-title")
    title = title_tag.string
    # The ISBN value sits in the sibling tag right after the "ISBN-10:" label
    isbn_key_tag = book_isbn_soup.find(text="ISBN-10:").parent
    isbn_tag = isbn_key_tag.find_next_sibling()
    isbn = isbn_tag.string.strip()  # Remove the whitespace with the strip method
    return {'title': title, 'ISBN': isbn}


def start():
    url = "http://www.allitebooks.com/certification/"
    book_info_list = []

    # Process one listing page, then recurse into the next page if there is one
    def next_page(page_url):
        book_detail_urls = get_book_detail_urls(page_url)
        for book_detail_url in book_detail_urls:  # print all books' ISBNs one by one
            # print(book_detail_url)
            book_info = get_book_isbn(book_detail_url)
            print(book_info)  # print title and ISBN
            book_info_list.append(book_info)
        next_page_url = get_next_page_url(page_url)
        if next_page_url is not None:
            next_page(next_page_url)
        else:
            return 0

    next_page(url)
    return book_info_list  # hand the collected results back to the caller


if __name__ == '__main__':
    start()
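The `next_page` helper above calls itself once per listing page, so a category with very many pages could in principle run into Python's recursion limit. As a rough sketch (not part of the original post), the same walk can be written as a loop. The function name `crawl`, its parameters, the `max_pages` cap, and the fake two-page site below are all hypothetical and exist only so the idea can be tried without network access.

```python
def crawl(start_url, get_detail_urls, get_info, get_next_url, max_pages=1000):
    """Walk listing pages in a loop, collecting one info dict per book."""
    results = []
    seen = set()  # listing pages already visited, to guard against cycles
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        for detail_url in get_detail_urls(url):
            results.append(get_info(detail_url))
        url = get_next_url(url)  # None signals the last page
    return results


# Offline demo with a fake two-page site (hypothetical data, no network access).
fake_pages = {
    "page1": (["book-a", "book-b"], "page2"),
    "page2": (["book-c"], None),
}

demo = crawl(
    "page1",
    get_detail_urls=lambda u: fake_pages[u][0],
    get_info=lambda d: {"title": d, "ISBN": None},
    get_next_url=lambda u: fake_pages[u][1],
)
print([book["title"] for book in demo])  # ['book-a', 'book-b', 'book-c']
```

With the real scraper, the three callables would presumably be `get_book_detail_urls`, `get_book_isbn`, and `get_next_page_url` from the listing above, for example `crawl("http://www.allitebooks.com/certification/", get_book_detail_urls, get_book_isbn, get_next_page_url)`.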

 

Reposted from: https://www.cnblogs.com/XinZhou-Annie/p/7148047.html
