Scraping Grades from the CUMT Academic Affairs System

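I haven't been too busy lately, so I rewrote the code from back when I had just started learning python and failed to scrape my grades from the academic affairs system. The rewrite ended up taking a while: I considered many approaches, tried quite a few of them, and picked up some knowledge along the way, so here is a summary.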

First, open the login page of the CUMT academic affairs system and take a look at how the page works.

CUMT academic affairs system

Analyzing the page source

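Log in through the URL below with the browser's F12 developer tools open, and you can see (as in the screenshot below) that the submitted form contains a csrftoken as well as the password in encrypted form. That means simply POSTing the username and password will not log you in.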

http://202.119.200.202/jwglxt/xtgl/login_slogin.html

[Screenshot: form data of the login request in the F12 developer tools]

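Right-click and view the page source, and you can see that the password field uses autocomplete="off" to stop the browser from auto-filling the password. That may be why splinter failed with an error whenever it reached the password field. Or maybe not, since selenium can still type into it without trouble...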

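You can also see that the form carries a csrftoken, which guards against CSRF attacks. This is why my earlier plan failed: I had wanted to log in on the main page, save the cookies, and then jump straight to the grades page I had found. So the only option left is to request the grades page directly, get redirected to the login page, and scrape the grades once the login succeeds.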

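Finding the grades page is just a matter of digging through the requests layer by layer with Chrome's built-in F12 tools, so I won't describe it in detail. The URL is as follows.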

http://202.119.206.62/jwglxt/cjcx/cjcx_cxDgXscj.html?doType=query&gnmkdm=N305005


<div class="row sl_log_bor4">
			<div class="col-sm-8 hidden-xs sl_log_lf">
				<img class="img-responsive" src="http://202.119.206.62:80/zftal-ui-v5-1.0.2/assets/images/login_bg_pic.jpg" />
			</div>
			<div class="col-sm-4 sl_log_rt">
				<form class="form-horizontal" role="form" action="/jwglxt/xtgl/login_slogin.html" method="post">
                    <!-- csrftoken used to prevent CSRF -->
				<input type="hidden" id="csrftoken" name="csrftoken" value="d6f6b735-2438-4476-a520-a4a7a237d110,d6f6b73524384476a520a4a7a237d110"/>
					<h5>用户登录</h5>
					<!-- prevent the browser from auto-filling the password -->
					<input type="text" style="display: none;" autocomplete="off"/>
					<input type="password" style="display: none;" autocomplete="off"/>
					<!-- prevent the browser from auto-filling the password: end -->
					
					
						<p style="display: none;" id="tips" class="bg_danger sl_danger">
						</p>
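
Since the csrftoken sits in a hidden input in the page source, it can be read without a browser. The following is only a minimal sketch using Python's standard html.parser; the class and function names are made up for illustration, and it assumes the token is always in an input whose id is csrftoken, as in the snippet above.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Remember the value of <input id="csrftoken"> while parsing."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags end up here too
        attributes = dict(attrs)
        if tag == "input" and attributes.get("id") == "csrftoken":
            self.token = attributes.get("value")


def extract_csrftoken(html):
    """Return the csrftoken value found in html, or None if there is none."""
    parser = CsrfTokenParser()
    parser.feed(html)
    return parser.token


# Sample input copied from the login page source shown above
sample = ('<form><input type="hidden" id="csrftoken" name="csrftoken" '
          'value="d6f6b735-2438-4476-a520-a4a7a237d110,d6f6b73524384476a520a4a7a237d110"/></form>')
print(extract_csrftoken(sample))
```

The token changes with every page load, so it would have to be extracted from the same response whose cookies are used for the login POST.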

Analyzing the encryption algorithm

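Of the scripts the page loads, the first four are the JS used to encrypt the password, and login.js handles the login itself. Open it and you can see the algorithm used on the password. It first defines two variables, modulus and exponent, which are the values needed to build the public key for RSA encryption. They are fetched from the server with a timestamped request, and the _t parameter in the login URL below is likewise produced by a JS function: it is the number of milliseconds between the current time and midnight on 1970/1/1. As a result, the ciphertext for the same password is not the same from one login to the next.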

http://jwxt.cumt.edu.cn/jwglxt/xtgl/login_slogin.html?language=zh_CN&_t=1530780180937
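
For reference, the same millisecond timestamp is easy to produce in Python. This is a small sketch rather than anything taken from the site: the helper names are hypothetical, and BASE assumes that the JS variable _path resolves to /jwglxt on this host.

```python
import time
from urllib.parse import urlencode

# Assumption: _path in the site's JS corresponds to "/jwglxt" on this host
BASE = "http://jwxt.cumt.edu.cn/jwglxt"


def now_ms():
    """Milliseconds since 1970-01-01 00:00 UTC, like JS new Date().getTime()."""
    return int(time.time() * 1000)


def login_url(ts_ms=None):
    """Login page URL with the _t timestamp parameter."""
    ts = now_ms() if ts_ms is None else ts_ms
    return BASE + "/xtgl/login_slogin.html?" + urlencode({"language": "zh_CN", "_t": ts})


def public_key_url(ts_ms=None):
    """URL that login.js requests to obtain modulus and exponent."""
    ts = now_ms() if ts_ms is None else ts_ms
    return BASE + "/xtgl/login_getPublicKey.html?" + urlencode({"time": ts})


print(login_url(1530780180937))
```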

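I originally wanted to work through this JS encryption and reimplement it in python, so that I could log in by POSTing the username, the encrypted password, and the csrftoken value taken from the page source. Unfortunately, after some analysis I still could not get it working, so I'm leaving that as a to-do to fill in later!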

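Roughly, the algorithm first obtains modulus and exponent, converts them from base64 to hexadecimal with b64tohex, and builds an RSA public key from them. The password is then encrypted with that public key, which yields the ciphertext as a hex string. Finally, hex2b64 converts that hex string to base64, and the result is the encrypted password that gets submitted.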

Encryption code

var modulus, exponent;
$.getJSON(_path + "/xtgl/login_getPublicKey.html?time=" + new Date().getTime(), function(data) {
    modulus = data["modulus"];
    exponent = data["exponent"];
});

var rsaKey = new RSAKey();
rsaKey.setPublic(b64tohex(modulus), b64tohex(exponent));
var enPassword = hex2b64(rsaKey.encrypt($("#mm").val()));
$("#mm").val(enPassword);
$("#hidMm").val(enPassword);

Below are the functions it relies on, which I extracted from those four files. I'll implement them in python when I get the chance.

// Set the public key fields N and e from hex strings
function RSASetPublic(N,E) {
    if(N != null && E != null && N.length > 0 && E.length > 0) {
        this.n = parseBigInt(N,16);
        this.e = parseInt(E,16);
    }
    else
        alert("Invalid RSA public key");
}

// Return the PKCS#1 RSA encryption of "text" as an even-length hex string
function RSAEncrypt(text) {
    var m = pkcs1pad2(text,(this.n.bitLength()+7)>>3);
    if(m == null) return null;
    var c = this.doPublic(m);
    if(c == null) return null;
    var h = c.toString(16);
    if((h.length & 1) == 0) return h; else return "0" + h;
}

function RSAKey() {
    this.n = null;
    this.e = 0;
    this.d = null;
    this.p = null;
    this.q = null;
    this.dmp1 = null;
    this.dmq1 = null;
    this.coeff = null;
}
// public
RSAKey.prototype.setPublic = RSASetPublic;
RSAKey.prototype.encrypt = RSAEncrypt;

var b64map="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
var b64pad="=";

function hex2b64(h) {
    var i;
    var c;
    var ret = "";
    for(i = 0; i+3 <= h.length; i+=3) {
        c = parseInt(h.substring(i,i+3),16);
        ret += b64map.charAt(c >> 6) + b64map.charAt(c & 63);
    }
    if(i+1 == h.length) {
        c = parseInt(h.substring(i,i+1),16);
        ret += b64map.charAt(c << 2);
    }
    else if(i+2 == h.length) {
        c = parseInt(h.substring(i,i+2),16);
        ret += b64map.charAt(c >> 2) + b64map.charAt((c & 3) << 4);
    }
    while((ret.length & 3) > 0) ret += b64pad;
    return ret;
}

// convert a base64 string to hex
function b64tohex(s) {
    var ret = ""
    var i;
    var k = 0; // b64 state, 0-3
    var slop;
    for(i = 0; i < s.length; ++i) {
        if(s.charAt(i) == b64pad) break;
        v = b64map.indexOf(s.charAt(i));
        if(v < 0) continue;
        if(k == 0) {
            ret += int2char(v >> 2);
            slop = v & 3;
            k = 1;
        }
        else if(k == 1) {
            ret += int2char((slop << 2) | (v >> 4));
            slop = v & 0xf;
            k = 2;
        }
        else if(k == 2) {
            ret += int2char(slop);
            ret += int2char(v >> 2);
            slop = v & 3;
            k = 3;
        }
        else {
            ret += int2char((slop << 2) | (v >> 4));
            ret += int2char(v & 0xf);
            k = 0;
        }
    }
    if(k == 1)
        ret += int2char(slop << 2);
    return ret;
}
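
As a rough Python counterpart to the JavaScript above, the steps (base64 to integer, PKCS#1 padding, modular exponentiation, hex to base64) might look like the sketch below. It uses only the standard library and has not been checked against the real server. It assumes that pkcs1pad2 performs standard PKCS#1 v1.5 type 2 padding on the UTF-8 bytes of the password, which is what the comment on RSAEncrypt suggests, and the function names are my own.

```python
import base64
import secrets


def b64_to_int(value_b64):
    """Base64 string -> big-endian integer (the job of b64tohex + parseBigInt in the JS)."""
    return int.from_bytes(base64.b64decode(value_b64), "big")


def pkcs1_pad(message, key_bytes):
    """PKCS#1 v1.5 type 2 block: 00 02 <nonzero random bytes> 00 <message>."""
    if len(message) > key_bytes - 11:
        raise ValueError("message too long for this key size")
    pad_len = key_bytes - len(message) - 3
    padding = bytes(secrets.randbelow(255) + 1 for _ in range(pad_len))
    return b"\x00\x02" + padding + b"\x00" + message


def encrypt_password(password, modulus_b64, exponent_b64):
    """Return the base64 ciphertext of password under the given public key."""
    n = b64_to_int(modulus_b64)
    e = b64_to_int(exponent_b64)
    key_bytes = (n.bit_length() + 7) >> 3
    m = int.from_bytes(pkcs1_pad(password.encode("utf-8"), key_bytes), "big")
    c = pow(m, e, n)
    # Like RSAEncrypt: hex string padded to an even length, then hex2b64
    h = format(c, "x")
    if len(h) % 2:
        h = "0" + h
    return base64.b64encode(bytes.fromhex(h)).decode("ascii")
```

Because the padding is random, calling encrypt_password twice with the same password and key gives different ciphertexts, which matches the behaviour seen in the browser.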

After all this analysis I ended up choosing selenium, an automated testing tool. selenium+PhantomJS is said to be a killer combination for scraping.

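I went with selenium+firefox, which first requires downloading a geckodriver.exe version that matches the browser. The approach is still to log in through a simulated browser, save the cookies, and then scrape the grades.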

Scraper code

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import json

driver = webdriver.Firefox()
session = requests.session()


def get_cookie():
    # Requesting the grades page while logged out redirects to the login page
    driver.get('http://202.119.206.62/jwglxt/cjcx/cjcx_cxDgXscj.html?doType=query&gnmkdm=N305005&queryModel.showCount=200')
    driver.find_element(By.ID, 'yhm').send_keys('08133xxx')  # username field
    driver.find_element(By.ID, 'mm').send_keys('XXXXXXX')  # password field
    driver.find_element(By.ID, 'dl').click()  # login button
    cook = driver.get_cookies()
    # Join all cookies into a single Cookie header value
    cookie = '; '.join(item['name'] + '=' + item['value'] for item in cook)
    return cookie


def get_score(cookie):
    headers = {
        'cookie': cookie,
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
    }
    url = 'http://202.119.206.62/jwglxt/cjcx/cjcx_cxDgXscj.html?doType=query&gnmkdm=N305005&queryModel.showCount=200'
    r = session.get(url, headers=headers)
    return r.text

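# I wrote this function long ago when I had just started learning python, and it is ugly,
# but once you have the grades you can analyse them however you like.
def analyse(text):
    json_dict = json.loads(text)
    json_cj = json_dict['items']  # one entry per course
    a = 0
    # First pass: print passed courses (kcmc is the course name, bfzcj the score)
    for cj in json_cj:
        cjj = cj['bfzcj']
        cjj = int(cjj)
        if cjj >= 60:
            print('Course:', cj['kcmc'], ' ', 'Score:', cj['bfzcj'])
            a += 1
    print('Total:', a, 'courses')
    print(' ')
    # Second pass: print failed courses
    for cj in json_cj:
        cjj = cj['bfzcj']
        cjj = int(cjj)
        if cjj < 60:
            print('Failed course:', cj['kcmc'], 'Failing score:', cj['bfzcj'])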


if __name__ == '__main__':
    cookie = get_cookie()
    text = get_score(cookie)
    analyse(text)
    driver.quit()

The result of running it is shown in the figure below.

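After rambling through all that code, the final solution turns out to be this simple... as long as it gets the data, right? The underlying principle is still the cookie (client side) and the session (server side).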
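
To make the cookie side of this concrete, here is one possible way to hand the browser's cookies to requests by loading them into the session's cookie jar instead of building a header string by hand. It is only a sketch: the helper name is hypothetical, and the JSESSIONID cookie in the example is made-up data in the shape that driver.get_cookies() returns.

```python
import requests


def session_from_browser_cookies(browser_cookies):
    """Copy Selenium-style cookie dicts into a fresh requests session."""
    session = requests.Session()
    for item in browser_cookies:
        session.cookies.set(
            item["name"],
            item["value"],
            domain=item.get("domain", ""),
            path=item.get("path", "/"),
        )
    return session


# Hypothetical cookie list, shaped like the output of driver.get_cookies()
example = [{"name": "JSESSIONID", "value": "abc123", "domain": "202.119.206.62", "path": "/"}]
s = session_from_browser_cookies(example)
print(s.cookies.get("JSESSIONID"))
```

Requests made through a session built this way carry the stored cookies automatically, so the server can associate them with the login that happened in the browser.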

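When you need to scrape content that is generated dynamically by JS, trace the requests layer by layer until you find the page that actually returns the data, then scrape that.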
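
In this case the grades URL returns JSON, so once it is found the parsing is short. The sketch below is one way to split the courses into passed and failed as a pure function. The field names items, kcmc, and bfzcj come from the scraper code above, while the function name, the sample data, and the use of float for the score are my own assumptions.

```python
import json


def split_grades(text, pass_mark=60):
    """Return (passed, failed) lists of (course name, score) from the JSON text."""
    items = json.loads(text)["items"]
    passed, failed = [], []
    for item in items:
        score = float(item["bfzcj"])
        target = passed if score >= pass_mark else failed
        target.append((item["kcmc"], score))
    return passed, failed


# Made-up sample in the same shape as the real response
sample = json.dumps({"items": [
    {"kcmc": "Calculus", "bfzcj": "85"},
    {"kcmc": "Physics", "bfzcj": "52"},
]})
passed, failed = split_grades(sample)
print("passed:", passed)
print("failed:", failed)
```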