
Web Scraper: Kun-logistics Shipping Rate Query

By xiaoyongsheng (xiaoyongsheng@hotmail.com)

Foreword

Why this post: during my internship, I watched a colleague collecting data by hand. Nearly a hundred ports, endless copy-and-paste, pure manual labor. After teasing him mercilessly, I promised to write him a scraper to cut down the drudgery and, with luck, add a couple of years to his life. Hence this post.

Project Structure

  • spider
    • pic: images used to generate the readme;
    • readme: explains the overall project structure and the workflow of the code;
    • result: generated results are stored here;
    • source: the code and the port mapping table live here; the port mapping table needs to be filled in, and the token in the code must be refreshed before each run;

Exploring the Scraping Process

The goal is to fetch freight-rate information for the relevant ports from Kun-logistics.

Solving the single-port case first, using NINGBO -> HAMBURG as an example

Fetching the quote page

First, enter NINGBO -> HAMBURG on the query page and click search. The details of that request can then be captured:

request headers:

GET /portal/fclratequery/form/index/?pol=CNNGB&fd=DEHAM&loadingtype=FCL HTTP/1.1  
Host: portal.kun-logistics.com  
Connection: keep-alive  
Cache-Control: max-age=0  
Upgrade-Insecure-Requests: 1  
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36  
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8  
Referer: https://portal.kun-logistics.com/portal/form/home/index/  
Accept-Encoding: gzip, deflate, sdch, br  
Accept-Language: zh-CN,zh;q=0.8  
Cookie: ASP.NET_SessionId=qlxcrlplrvfgkc24p2mn500z; uid=6ad87ee9-13b0-4d18-906a-d0a7dcae1322  

In fact, simply requesting the URL https://portal.kun-logistics.com/portal/fclratequery/form/index/?pol=CNNGB&fd=DEHAM&loadingtype=FCL returns the quote page. The catch is that the quotes span multiple pages: this request only returns the first page, while the remaining pages are loaded via AJAX and have to be requested separately. So the next step is to capture the request fired when clicking "next page" and compare it with the first one.
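As a minimal sketch of that first-page request, assuming only the requests library (the variable names here are illustrative, not from the project):

import requests

# query parameters observed above: pol = port of loading,
# fd = final destination, loadingtype = loading type
FIRST_PAGE_URL = (
    "https://portal.kun-logistics.com/portal/fclratequery/form/index/"
    "?pol=CNNGB&fd=DEHAM&loadingtype=FCL"
)

response = requests.get(FIRST_PAGE_URL)
response.raise_for_status()                       # fail loudly on HTTP errors
first_page_html = response.content.decode('utf-8')
print(first_page_html[:200])                      # peek at the returned HTML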

Analyzing the pagination requests

page2 request headers:

POST https://portal.kun-logistics.com/portal/fclratequery/form/ajax HTTP/1.1  
Host: portal.kun-logistics.com  
Connection: keep-alive  
Content-Length: 194  
Accept: */*  
Origin: https://portal.kun-logistics.com  
X-Requested-With: XMLHttpRequest  
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36  
Content-Type: application/json; charset=UTF-8  
Referer: https://portal.kun-logistics.com/portal/fclratequery/form/index/?pol=CNNGB&fd=DEHAM&loadingtype=FCL  
Accept-Encoding: gzip, deflate, br  
Accept-Language: zh-CN,zh;q=0.8  
Cookie: ASP.NET_SessionId=qlxcrlplrvfgkc24p2mn500z; uid=6ad87ee9-13b0-4d18-906a-d0a7dcae1322  
  
{"key":"getlist","data":"{\"carrierid\":\"\",\"sailingdate\":\"\",\"portofloadingcode\":\"CNNGB\",\"finaldestinationcode\":\"DEHAM\",\"loadingtype\":\"FCL\",\"page\":\"2\"}","xc_token":"xTMMxL"}  

page3 / page4 request headers:

The requests for pages 3 and 4 are identical to the page 2 request; only the "page" field inside the payload changes:

{"key":"getlist","data":"{\"carrierid\":\"\",\"sailingdate\":\"\",\"portofloadingcode\":\"CNNGB\",\"finaldestinationcode\":\"DEHAM\",\"loadingtype\":\"FCL\",\"page\":\"3\"}","xc_token":"xTMMxL"}

{"key":"getlist","data":"{\"carrierid\":\"\",\"sailingdate\":\"\",\"portofloadingcode\":\"CNNGB\",\"finaldestinationcode\":\"DEHAM\",\"loadingtype\":\"FCL\",\"page\":\"4\"}","xc_token":"xTMMxL"}

Now things are clear. The next-page request differs from the first-page request as follows:
- the next-page request is a POST, while the first page is a GET;
- the request URLs differ: the next-page request goes straight to the AJAX endpoint;
- the next-page request carries a payload specifying the port of loading, final destination, carrier, page number, xc_token, and so on (see the sketch after this list);
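One subtlety worth showing: the data field of the payload is not a nested JSON object but a JSON-encoded string, so the body is effectively encoded twice. A minimal sketch, with the field values taken from the captured page-2 request:

import json

# inner query object; serialized to a string before being embedded
inner = {
    "carrierid": "",
    "sailingdate": "",
    "portofloadingcode": "CNNGB",
    "finaldestinationcode": "DEHAM",
    "loadingtype": "FCL",
    "page": "2",
}

payload = {
    "key": "getlist",
    "data": json.dumps(inner),   # first encoding: inner object -> string
    "xc_token": "xTMMxL",        # dynamic token captured from DevTools
}

body = json.dumps(payload)       # second encoding: whole payload -> request body
print(body)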

With that, the approach is clear. The main flow:
1. determine the port of loading, the final destination, and the loading type;
2. build the first-page URL, fetch the first page, and parse out the total number of pages;
3. update the URL and payload for each page, send the requests one by one, and parse each response;
4. merge the quotes from all the pages into one result;

Code Implementation

import pandas as pd
import requests
import json
from bs4 import BeautifulSoup


def get_page_cnt(SITE_URL):
    """Get the total number of result pages."""
    try:
        first_response = requests.get(url=SITE_URL)
    except requests.exceptions.ConnectionError:
        # requests raises its own ConnectionError, not the builtin;
        # retry once on a dropped connection
        first_response = requests.get(url=SITE_URL)
    first_page = first_response.content.decode('utf-8')
    soup = BeautifulSoup(first_page, "lxml")
    # the page count is the second-to-last token in the text of the second
    # class-less <span>; this selector is brittle and breaks if the layout changes
    PAGE_CNT = soup.find_all('span', class_=None)[1].get_text().split()[-2]
    return int(PAGE_CNT)


def parse_page(soup):
    """Parse one result page into a DataFrame."""
    page_return = soup.find_all('tr', class_="tr-bd")  # one row per quote
    page_result = list()
    for item in page_return:
        result = dict()
        result['pol'] = item.find('div', class_='pol').get_text()
        result['pod'] = item.find('div', class_='fd').get_text()
        result['carrier'] = item.find(
            'div', class_='t-muted carrier').get_text()
        result['schedule'] = item.find('div', class_='sailingsche').get_text()
        result['pot'] = item.find_all('td')[2].find_all('div')[0].get_text()
        result['date'] = item.find('div', class_='tt').get_text(strip=True)
        # each td-price cell: the <h5> is the price column name, the <div> its value
        for i in item.find_all('td', class_='td-price'):
            result[i.find('h5').get_text()] = i.find('div').get_text()
        result['start_date'] = item.find_all('td', class_=None)[-2].get_text()
        result['end_date'] = item.find_all('td', class_=None)[-1].get_text()
        page_result.append(result)
    return pd.DataFrame(page_result)


def update_headers_pageCnt(POD, POD_TYPE):
    """Get the request headers and page count for this port.

    Params:
        POD: standardized code for the pod;
        POD_TYPE: loading type, e.g. 'FCL';

    Return:
        headers: request headers;
        page_cnt: number of result pages for this port;
    """

    # the port of loading is fixed to NINGBO (CNNGB); only the destination varies
    SITE_URL = "https://portal.kun-logistics.com/portal/fclratequery/form/index/?pol=CNNGB&fd=" + \
        POD + "&loadingtype=" + POD_TYPE

    headers = {
        "Cookie": "ASP.NET_SessionId=qlxcrlplrvfgkc24p2mn500z; uid=6ad87ee9-13b0-4d18-906a-d0a7dcae1322",
        "Content-Type": "application/json; charset=UTF-8",
        "Host": "portal.kun-logistics.com",
        "Referer": SITE_URL,
        "Accept": "*/*",
        "X-Requested-With": "XMLHttpRequest",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.8"
    }

    PAGE_CNT = get_page_cnt(SITE_URL)
    return headers, PAGE_CNT


def update_payload(Payload_page, POD, POD_TYPE):
    """Build the payload for the page you want to crawl.

    Params:
        Payload_page: int, page number;

    Return:
        Payload: payload dict.

    """
    # the "data" field is itself a JSON-encoded string, hence the escaped quotes
    Payload_data = "{\"carrierid\":\"\",\"sailingdate\":\"\",\"portofloadingcode\":\"CNNGB\",\"finaldestinationcode\":\"" + \
        POD + "\",\"loadingtype\":\"" + POD_TYPE + \
        "\",\"page\":\"" + str(Payload_page) + "\"}"
    Payload = {
        "key": "getlist",
        "data": Payload_data,
        # Payload_xc_token is a module-level global defined in __main__
        "xc_token": Payload_xc_token
    }
    return Payload


def get_pod_result(POD, POD_TYPE):
    """Fetch and merge all quote pages for one destination port."""
    headers, PAGE_CNT = update_headers_pageCnt(POD=POD, POD_TYPE=POD_TYPE)
    page_frames = []

    for Payload_page in range(1, PAGE_CNT + 1):
        Payload = update_payload(Payload_page, POD=POD, POD_TYPE=POD_TYPE)
        # PAGE_URL is the AJAX endpoint, a module-level global set in __main__
        response = requests.post(
            url=PAGE_URL, data=json.dumps(Payload), headers=headers)
        one_page = response.content.decode('utf-8')
        soup = BeautifulSoup(one_page, "lxml")
        page_frames.append(parse_page(soup))
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    if not page_frames:
        return pd.DataFrame()
    return pd.concat(page_frames, ignore_index=True)


if __name__ == '__main__':
    PAGE_URL = "https://portal.kun-logistics.com/portal/fclratequery/form/ajax"
    PORT_PATH = 'POD.xlsx'
    OUTPUT_PATH = 'OUTPUT.xlsx'

    # dynamic token; copy a fresh one from DevTools before each run (see below)
    Payload_xc_token = "ztMp8r"

    pod_data = pd.read_excel(PORT_PATH).dropna()
    print("【INFO】searching for {0} ports' prices".format(pod_data.shape[0]))

    price_frames = []
    for pod, pod_type in zip(pod_data['CODE'], pod_data['TYPE']):
        pod_result = get_pod_result(POD=pod, POD_TYPE=pod_type)
        print("【RESULT】{0} has {1} records!".format(pod, pod_result.shape[0]))
        price_frames.append(pod_result)
    ALL_PRICE = pd.concat(price_frames, ignore_index=True) if price_frames else pd.DataFrame()

    if ALL_PRICE.shape[0]:
        print("【FINALLY】total records: {0}".format(ALL_PRICE.shape[0]))
        ALL_PRICE.to_excel(OUTPUT_PATH)
    else:
        print("【WARN】TRY ANOTHER TOKEN PLEASE!")

Notes on Running the Code

  • Payload_xc_token is a dynamic value, so you have to look it up yourself: query any port, e.g. HAMBURG, open the page in Chrome, press F12, and debug as follows:
    • click Network;
    • click "next page";
    • an entry named ajax appears under Name on the left; double-click it;
    • click Headers on the right and find Request Payload; the value of xc_token there is the Payload_xc_token we need, so copy it over;

  • Getting the standardized port codes
    Collect these yourself and add them to the Excel file, as follows:
    • open the Kun-logistics home page;
    • enter the port of loading and the destination port in the search box, then click search;
    • the URL in Chrome's address bar contains the information we need (a small parsing sketch follows this list):

      URL = https://portal.kun-logistics.com/portal/fclratequery/form/index/?pol=CNNGB&fd=CIABJ&loadingtype=FCL

      pol=CNNGB

      fd=CIABJ (this is the destination port)

      loadingtype=FCL (likely the loading type; presumably FCL stands for Full Container Load)

    • what we need are fd and loadingtype; add them to the corresponding columns of pod.xlsx;
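The fd and loadingtype parameters can also be pulled out of a copied URL with the standard library rather than by eye. A minimal sketch, using the example URL above:

from urllib.parse import urlparse, parse_qs

url = ("https://portal.kun-logistics.com/portal/fclratequery/form/index/"
       "?pol=CNNGB&fd=CIABJ&loadingtype=FCL")

params = parse_qs(urlparse(url).query)   # each value comes back as a list
fd = params["fd"][0]                     # destination port code
loadingtype = params["loadingtype"][0]   # loading type
print(fd, loadingtype)                   # -> CIABJ FCL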