A crawler based on Ruby Mechanize

scholltop

Blog categories: crawler, Ruby


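  # Crawl the three-level category menu of the target site with Mechanize
  # (requires the mechanize gem) and store it as ConsumableCategory records.
  def self.sang_carwler
    agent = Mechanize.new
    # Level-2 categories to import; all others are skipped.
    wanted = ["生化试剂", "分子生物学", "蛋白质科学", "抗体", "细胞生物学"]
    # Level 1: a fixed root category ("生命科学", life sciences).
    cc1 = ConsumableCategory.find_or_create_by(name: "生命科学", parent_id: 0)
    menus = agent.get("https://www.XXX.com").search("li.subnav_cat dl.sub_dl")
    menus.each do |e1|
      # Level 2: text of the first link under .sub_dt (empty string if missing).
      catalog_name2 = e1.at(".sub_dt a")&.children&.first&.text.to_s.strip
      next unless wanted.include?(catalog_name2)
      cc2 = ConsumableCategory.find_or_create_by(name: catalog_name2, parent_id: cc1.id)
      # Level 3: every link under .sub_dd of this level-2 entry.
      e1.search(".sub_dd a").each do |e2|
        catalog_name3 = e2.children.first&.text.to_s.strip
        next if catalog_name3.empty?
        ConsumableCategory.find_or_create_by(name: catalog_name3, parent_id: cc2.id)
      end
    end
  end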

－－－－－－－－－－－－－－

  # Query the site's JSON API instead of scraping HTML (requires the httparty gem).
  # titles: names of the top-level categories to import.
  def self.guoyao_crawler(titles = [])
    skipped = ["通用试剂", "高纯试剂", "色谱应用", "分析标准品"]
    menu_resp = HTTParty.get("https://XXX/reagent-front/indexApi/goodsClassList")
    menu_resp["data"].each do |catalog_1|
      next unless titles.include?(catalog_1["gcName"])
      chemical = catalog_1["gcName"] == "化学试剂" # chemical reagents
      cc1 = ConsumableCategory.find_or_create_by(name: catalog_1["gcName"], parent_id: 0)
      Array(catalog_1["classList"]).each do |catalog_2|
        # Some sub-categories of chemical reagents are deliberately left out.
        next if chemical && skipped.include?(catalog_2["gcName"])
        cc2 = ConsumableCategory.find_or_create_by(name: catalog_2["gcName"], parent_id: cc1.id)
        Array(catalog_2["classList"]).each do |catalog_3|
          cc3 = ConsumableCategory.find_or_create_by(name: catalog_3["gcName"], parent_id: cc2.id)
          # Only chemical reagents have their goods list imported.
          next unless chemical
          page_no = 1
          loop do
            # Passing the parameters via query: lets HTTParty URL-encode them.
            chemical_resp = HTTParty.get("https://XXX/reagent-front/goodsApi/getGoodsList",
                                         query: { pageSize: 100, pageNo: page_no, searchType: "gcIdSearch", keyword: catalog_3["gcId"] })
            data = Array(chemical_resp["data"]).first
            goods_list = data ? Array(data["listApiGoods"]) : []
            # Stop when pageCount is 0; the empty-page check guards against an endless loop.
            break if data.nil? || data["pageCount"].to_i <= 0 || goods_list.empty?
            goods_list.each do |goods|
              MenuChemical.find_or_create_by(consumable_category_id: cc3.id, cas: goods["casIndexNo"])
            end
            page_no += 1
          end
        end
      end
    end
  end
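
The paging loop in guoyao_crawler can be tried out without any network access. Below is a minimal sketch of the same "keep fetching until a page comes back empty" pattern. The names fetch_all_pages, fetch_page and max_pages are hypothetical and not part of the original code; fetch_page stands in for the HTTParty.get call, and max_pages is an extra safety cap assumed here.

```ruby
# Sketch only: fetch_page is a hypothetical stand-in for the real HTTP request.
# It takes a page number and returns an array of items (empty or nil when done).
def fetch_all_pages(fetch_page, max_pages: 1000)
  results = []
  (1..max_pages).each do |page_no|
    items = fetch_page.call(page_no)
    # Stop on the first empty page, or when max_pages is reached.
    break if items.nil? || items.empty?

    results.concat(items)
  end
  results
end

# Fake data source: three pages of CAS-like strings, then nothing.
fake_pages = { 1 => %w[64-17-5 67-56-1], 2 => %w[67-64-1], 3 => %w[71-43-2] }
fake_fetch = ->(page_no) { fake_pages.fetch(page_no, []) }

all_items = fetch_all_pages(fake_fetch)
puts all_items.inspect
```

Capping the number of pages means a misbehaving endpoint that never returns an empty page cannot keep the crawler running forever.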

－－－－－－－－－－－－－－

    # Crawl category listings, product pages and AJAX prices into ReagentCategory
    # (requires the mechanize and httparty gems; present? comes from ActiveSupport).
    # sleep_second: pause before each product request, to go easy on the site.
    def self.worm(sleep_second = 0.15)
      agent = Mechanize.new
      per_page = 15 # products per listing page
      expected_header = ["货号", "规格", "库存", "价格", "数量"]
      ['nav-1', 'nav-2', 'nav-3', 'nav-4'].each do |nav|
        agent.get("http://www.xxxx/zh_cn/").search("li.level0.#{nav} ul li ul li a").each do |link1|
          category_url = link1['href']
          fenlei = link1.children.children.text # category name
          puts "#{category_url} #{fenlei}"
          product_list = agent.get(category_url)
          total_count = product_list.search(".toolbar-number").last&.children&.text.to_i
          total_page = (total_count.to_f / per_page).ceil
          (1..total_page).each do |page|
            product_links = agent.get("#{category_url}?p=#{page}").search('div.actions-primary a')
            break if product_links.empty?
            product_links.each do |product_link|
              sleep sleep_second
              product_url = product_link['href']
              product_page = agent.get(product_url)
              product_no = product_url.gsub('http://www.xxx/zh_cn/', '').gsub('.html', '').upcase
              cas = product_page.search("#product_addtocart_form > div.product-shop > div.product-info > span:nth-child(2) > a").children.text
              header = product_page.search("#super-product-table thead tr").children.select { |c| c.name == 'th' }.map { |th| th.children.text.to_s.strip }
              # Only parse tables that have the expected column layout.
              next unless header == expected_header
              product_page.search("#super-product-table tbody tr").each do |tr|
                tds = tr.children.select { |c| c.name == 'td' }
                # First cell: product number, then "-", then package size with unit.
                package_unit = tds[0].children.text.to_s.strip.gsub("#{product_no}-", '')
                package = package_unit.to_f
                unit = package_unit.slice(/[a-zA-Z]+/)&.downcase # nil when no unit is present
                purity = tds[1].children.text.to_s.strip # the "规格" (specification) column
                stock = tds[2].children.text.to_s.strip
                ajax_price_id = tds[3].attributes['attr'].value
                # The price arrives via AJAX as an HTML fragment; keep only the digits
                # and treat the last two as the decimal part.
                price = 0
                response = HTTParty.post("http://www.xxx/zh_cn/catalogb/ajax/price", body: {ajax_price_id => ajax_price_id})
                price = Nokogiri::HTML(JSON.parse(response.parsed_response)[ajax_price_id]).search("p span.price").last.text.gsub(/[^0-9]/,'').to_f / 100 if response&.parsed_response.present?
                ReagentCategory.create(name: 'ald', fenlei: fenlei, product_no: product_no, cas: cas, package: package, unit: unit, stock: stock, price: price, purity: purity, ajax_price_id: ajax_price_id, vendor_id: VENDOR_ID, company_id: COMPANY_ID)
              end
            end
          end
        end
      end
    end
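
The small calculations buried inside worm (page count, package size and unit, price) are easier to check when pulled out into plain functions. The sketch below does that using only the Ruby standard library. The helper names page_count, parse_package and parse_price, and the sample inputs, are hypothetical and not from the original code. Like the original, parse_price assumes the price label always carries exactly two decimal places.

```ruby
# Sketch only: helper names and sample values are illustrative assumptions.

# Number of listing pages needed for total_count products.
def page_count(total_count, per_page = 15)
  (total_count.to_f / per_page).ceil
end

# Split a table cell such as "A1234-500g" into [amount, unit],
# given the product number it starts with. unit is nil if no letters follow.
def parse_package(cell_text, product_no)
  package_unit = cell_text.strip.sub("#{product_no}-", '')
  [package_unit.to_f, package_unit[/[a-zA-Z]+/]&.downcase]
end

# Keep only the digits of a price label and treat the last two as decimals.
def parse_price(label)
  label.gsub(/[^0-9]/, '').to_f / 100
end

puts page_count(31)
puts parse_package("A1234-2.5KG", "A1234").inspect
puts parse_price("1,234.50")
```

Using a float division with ceil gives the same result as the modulo check in the original while reading more directly as "round up".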


2018-12-20 13:09