Building an automated YouTube downloader in Python (with code)
2021-01-12 19:50
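Contents: 2. Calling the decryption function (i. Analysis; ii. First extract the js part; iii. Take the first decryption call as the one we use; iv. Execute with execjs; 2. alert does not exist; v. Putting the code together); 3. Analyzing the decrypted result (ii. Parse the JSON; iii. Get the download URL); 4. Full code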
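In accordance with savefrom's terms:
This example and tutorial are for learning and discussion only. All rights belong to savefrom.net.
The final code plus comments comes to roughly 100 lines. The code on GitHub is authoritative (bugs may be fixed there); this article only walks through the details.
Project address
Approach
Workflow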
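1. POST
Following step one of the approach, we first need a POST request to fetch the encrypted js field. The author uses the third-party requests library for this; for background on web crawlers, see the author's earlier article.
i. First, set up the headers for the POST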
    # Set the headers, or the website will not return any information.
    # You may need to change the cookies here.
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
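The cookie part may need to change, and it is best to copy the values from your own browser. The meaning of each header is outside the scope of this article; a search engine will tell you more.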
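Since the cookie value is easiest to copy straight from the browser's developer tools, a small helper can turn that raw string into a dict. This is only a sketch: `parse_cookie_header` is a hypothetical helper name that is not part of the original project, and the sample string is shortened from the cookie shown above.

```python
def parse_cookie_header(raw):
    """Split a raw Cookie header string ("a=1; b=2") into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only, because a value may itself contain "=".
        key, _, value = pair.partition("=")
        cookies[key] = value
    return cookies


# Shortened sample taken from the cookie string in the headers above.
raw_cookie = "lang=en; country=CN; x-requested-with=; sfHelperDist=72"
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

A dict like this could be passed to requests through its `cookies=` argument instead of hard-coding a `cookie` header, which makes it easier to swap in fresh values.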
ii. Then set up the parameters
    # set the parameter, we can get from chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
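Here the sf_url field is the URL of the YouTube video we want to download; the other parameters stay the same.
iii. Finally, send the POST request with the requests library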
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers, data=kv)
    r.raise_for_status()
Note that the argument is data=kv, so the fields are sent in the form-encoded request body.
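The reason `data=kv` matters: requests sends a dict passed as `data` in the request body, form-encoded, which matches the `content-type: application/x-www-form-urlencoded` header above, whereas `params=kv` would put the fields in the URL query string. The standard-library sketch below only illustrates roughly what such a body looks like. `build_form_body` is a hypothetical helper; requests does this encoding for you.

```python
from urllib.parse import urlencode, parse_qs


def build_form_body(video_url):
    """Return the form-encoded body for the parameters used in this article."""
    kv = {"sf_url": video_url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    return urlencode(kv)


body = build_form_body("https://www.youtube.com/watch?v=YPvtz1lHRiw")
print(body)

# Decoding the body again recovers the original fields, including empty ones.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["sf_url"][0])
```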
iv. Wrap it in a function
    import requests


    def gethtml(url):
        # Set the headers, or the website will not return any information.
        # You may need to change the cookies here.
        headers = {
            "cache-Control": "no-cache",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                      "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "content-type": "application/x-www-form-urlencoded",
            "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                      "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                      "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                      "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                      "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                      "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
            "origin": "https://en.savefrom.net",
            "pragma": "no-cache",
            "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
            "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
            "sec-ch-ua-mobile": "?0",
            "sec-fetch-dest": "iframe",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
        # Set the parameters; they can be read from Chrome's developer tools.
        kv = {"sf_url": url,
              "sf_submit": "",
              "new": "1",
              "lang": "en",
              "app": "",
              "country": "cn",
              "os": "Windows",
              "browser": "Chrome"}
        # Do the POST request
        r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers, data=kv)
        r.raise_for_status()
        # Return the response text
        return r.text
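2. Calling the decryption function
i. Analysis
The difficulty here is executing JavaScript code from Python. Solutions available online include PyV8 and others; this article uses execjs. In the approach section we saw that the last few lines of the js part are the decryption call, so we only need to run the whole script once in execjs and then evaluate the decryption call separately.
ii. First extract the js part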
    # Target URL (the YouTube address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # Get the response text
    reo = gethtml(url)
    # Strip the code before and after the script: we only need the javascript part,
    # which is where the encrypted information is stored.
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
A regular expression would also work here, but since the author is not yet fluent with regular expressions, split is used directly.
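For comparison, a regex-based version of the same extraction might look like the sketch below. It is an alternative to the author's split approach, not the author's code: `extract_first_script` is a hypothetical helper, and `sample_html` is an invented stand-in for the real response. The non-greedy `(.*?)` with `re.S` captures everything up to the first closing tag, which matches what the two split calls return.

```python
import re

# re.S lets "." match newlines, since the script spans many lines.
SCRIPT_RE = re.compile(r'<script type="text/javascript">(.*?)</script>', re.S)


def extract_first_script(html):
    """Return the body of the first inline text/javascript script block."""
    match = SCRIPT_RE.search(html)
    if match is None:
        raise ValueError("no inline text/javascript script block found")
    return match.group(1)


# Invented sample input; the real page is much larger.
sample_html = ('<html><script type="text/javascript">var a = 1;\nvar b = 2;</script>'
               '<script type="text/javascript">var c = 3;</script></html>')
js = extract_first_script(sample_html)
print(js)
```

Unlike the split version, this raises a clear error when the page contains no matching script block, instead of an IndexError.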
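iii. Take the first decryption call as the one we use
If you fetch the results for several different videos, you will find that the decryption function is different every time, but it always sits at a fixed line position (the code below takes the third line from the end).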
    # Split into lines (this helps us find the decrypt call in the last few lines)
    reA = reo.split("\n")
    # Get the decrypt statement
    name = reA[len(reA) - 3].split(";")[0] + ";"
So name holds our decryption statement (not the best variable name, haha).
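The line-picking step can be tried on a toy input. In the sketch below, `pick_decrypt_statement` is a hypothetical helper and `sample_js` is an invented stand-in for the real script, whose contents change on every request; only the indexing logic mirrors the code above.

```python
def pick_decrypt_statement(js, offset_from_end=3):
    """Return the first statement on the line `offset_from_end` lines from the end."""
    lines = js.split("\n")
    if len(lines) < offset_from_end:
        raise ValueError("script has fewer lines than expected")
    # Keep only the text before the first ";" and put the ";" back.
    return lines[len(lines) - offset_from_end].split(";")[0] + ";"


# Invented sample: the trailing "" reproduces a script that ends with a newline.
sample_js = "\n".join([
    "(function(){",
    "  /* obfuscated body */",
    "})();",
    'var out = decode("abc", 3); other();',
    "last();",
    "",
])
statement = pick_decrypt_statement(sample_js)
print(statement)
```

Note that cutting at the first `;` is fragile: it would truncate the statement if an argument string contained a semicolon.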
iv. Execute with execjs
    import execjs

    # Use execjs to execute the js code
    ct = execjs.compile(reo)
    # Do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
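Taking only the part after = and dropping the semicolon means we evaluate just the function call, without the assignment. Of course, executing the assignment and decryption first and then reading the value would also work.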
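One caveat with splitting on every `=` is that the call is cut short whenever an argument itself contains `=`. A slightly more defensive variant, sketched below with a hypothetical helper `call_expression` and an invented sample statement, splits only on the first `=` and strips only the trailing semicolon.

```python
def call_expression(statement):
    """Return the right-hand side of `var x = <call>;` without the trailing ';'."""
    _, sep, rhs = statement.partition("=")
    if not sep:
        raise ValueError("statement contains no assignment")
    return rhs.strip().rstrip(";").strip()


# Invented sample whose argument contains "=".
sample_statement = 'var out = decode("a=b", 3);'
expr = call_expression(sample_statement)
print(expr)

# For comparison, splitting on every "=" loses part of the call.
naive = sample_statement.split("=")[1].replace(";", "")
print(naive)
```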
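But we immediately hit an error (if only it were that simple).
1. this, i.e. the window variable, does not exist
If memory serves, the error mentions this or $b. The author tried removing every this, and also tried wrapping everything in a class (so that this would refer to that class), but neither worked. It turns out that the npm package jsdom can simulate the window variable inside execjs (there is probably a better way), so we need to install npm and, through it, jsdom, and then rewrite the code above: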
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # Use execjs to execute the js code; cwd is the output of `npm root -g`
    # (the path of npm's global modules on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')
Here the cwd argument is the output of npm root -g, i.e. the path of npm's global modules directory, and addition is the snippet that simulates window.
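Rather than hard-coding the modules path, it could be looked up at runtime by running the same `npm root -g` command. The sketch below is an assumption-laden illustration, not part of the original project: `find_npm_global_root` is a hypothetical helper, it assumes `npm` is on the PATH, and it returns None when npm cannot be found or the command fails.

```python
import shutil
import subprocess


def find_npm_global_root():
    """Return the output of `npm root -g`, or None if npm is unavailable."""
    # shutil.which also resolves npm.cmd on Windows via PATHEXT.
    npm = shutil.which("npm")
    if npm is None:
        return None
    try:
        result = subprocess.run([npm, "root", "-g"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True,
                                timeout=60)
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


npm_root = find_npm_global_root()
print(npm_root)
```

The returned string could then be passed as the `cwd` argument in place of the literal path.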
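But then we run into the next error.
2. alert does not exist
This error occurs because calling alert under execjs is meaningless: there is no browser to show the popup. Moreover, alert is normally provided by window, and we have supplied our own window. So we need to override alert before the code runs (in effect, define our own alert).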
    # Override the alert function: the script calls it in one place, and alerting
    # is meaningless under execjs. Without the override, the code raises an error.
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
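The effect of this string patch is easy to check on a toy input. In the sketch below, `patch_alert` is a hypothetical wrapper and `sample_js` is an invented one-line script; the replacement itself is the same as above, with an added check so that a change in the site's script layout fails loudly instead of silently skipping the patch.

```python
MARKER = "(function(){"
ALERT_STUB = "this.alert=function(){};"


def patch_alert(js):
    """Insert a no-op alert definition right after every `(function(){`."""
    if MARKER not in js:
        raise ValueError("marker not found; the script layout may have changed")
    return js.replace(MARKER, MARKER + "\n" + ALERT_STUB)


# Invented sample script.
sample_js = '(function(){alert("hi");})();'
patched = patch_alert(sample_js)
print(patched)
```

Like the original one-liner, this patches every occurrence of the marker, not just the first.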
v. Putting the code together
    import execjs

    # Target URL (the YouTube address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # Get the response text
    reo = gethtml(url)
    # Strip the code before and after the script: we only need the javascript part,
    # which is where the encrypted information is stored.
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # Override the alert function: the script calls it in one place, and alerting
    # is meaningless under execjs. Without the override, the code raises an error.
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # Split into lines (this helps us find the decrypt call in the last few lines)
    reA = reo.split("\n")
    # Get the decrypt statement
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # Add jsdom to the execjs context because the script needs it
    # (maybe there is a solution without jsdom, but I have no idea)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # Use execjs to execute the js code; cwd is the output of `npm root -g`
    # (the path of npm's global modules on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # Do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
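3. Analyzing the decrypted result
i. Extract the key JSON
After running the part above, the decrypted result is stored in text. As we saw in the approach section, what really matters to us is the JSON passed to window.parent.sf.videoResult.show(), so we use a regular expression to pull out that JSON.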
    import re

    # Get the JSON result: a raw string avoids invalid-escape warnings, and
    # group(1) is the text between "show(" and ");;"
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
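To see what the pattern captures, it can be run against a small stand-in string. The sketch below is illustrative only: `extract_result_json` is a hypothetical helper and `sample_text` is invented, since the real decrypted text is far longer. It uses capture group 1 directly and raises a clear error when nothing matches.

```python
import re

# Non-greedy capture of everything between "show(" and the first ");;".
SHOW_RE = re.compile(r'show\((.*?)\);;', re.I | re.M)


def extract_result_json(text):
    """Return the JSON text passed to show(...) in the decrypted script."""
    match = SHOW_RE.search(text)
    if match is None:
        raise ValueError("no show(...) call found in the decrypted text")
    return match.group(1)


# Invented sample of the decrypted output.
sample_text = ('window.parent.sf.videoResult.show('
               '{"url": [{"url": "https://example.com/video.mp4", "quality": "360"}]});;')
payload = extract_result_json(sample_text)
print(payload)
```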
ii. Parse the JSON
Python has many libraries that can parse JSON; here the author uses the json library (remember to import it).
    # use `json` to load json
    j = json.loads(result)
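iii. Get the download URL
Now for the last step. From the approach section and a JSON formatting tool, we can see that j["url"][num]["url"] is the download link, where num selects the video format we want (different resolutions and types).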
    # Select the video format. In this case num = 1 means the video is
    #   - 360p   known from j["url"][num]["quality"]
    #   - MP4    known from j["url"][num]["type"]
    #   - audio  known from j["url"][num]["audio"]
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
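Since the position of each format in the list may vary between videos, one option is to select by the `quality` and `type` fields instead of a fixed index, and then stream the file to disk. The sketch below is an assumption-based illustration: `pick_format` and `download` are hypothetical helpers, the sample entries and their value formats (such as "360" and "MP4") are invented, and the download uses the standard library's urllib rather than anything from the original project.

```python
import shutil
import urllib.request


def pick_format(entries, quality, file_type):
    """Return the first entry matching quality and type (case-insensitive), else None."""
    for entry in entries:
        if (str(entry.get("quality", "")).lower() == quality.lower()
                and str(entry.get("type", "")).lower() == file_type.lower()):
            return entry
    return None


def download(url, path):
    """Stream the response body at `url` into the file at `path`."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# Invented sample shaped like the j["url"] list described above.
sample = {"url": [
    {"url": "https://example.com/720.mp4", "quality": "720", "type": "MP4", "audio": True},
    {"url": "https://example.com/360.mp4", "quality": "360", "type": "MP4", "audio": True},
]}
chosen = pick_format(sample["url"], "360", "mp4")
downurl = chosen["url"]
print(downurl)
# download(downurl, "video.mp4")  # left commented out: the sample URL is not real
```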
4. Full code
    # -*- coding: utf-8 -*-
    # @Time: 2021/1/10
    # @Author: Eritque arcus
    # @File: Youtube.py
    # @License: MIT
    # @Environment:
    #           - windows 10
    #           - python 3.6.2
    # @Dependence:
    #           - jsdom in npm(windows also can use)
    #           - requests, execjs, re, json in python
    import requests
    import execjs
    import re
    import json


    def gethtml(url):
        # Set the headers, or the website will not return any information.
        # You may need to change the cookies here.
        headers = {
            "cache-Control": "no-cache",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                      "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "content-type": "application/x-www-form-urlencoded",
            "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                      "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                      "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                      "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                      "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                      "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
            "origin": "https://en.savefrom.net",
            "pragma": "no-cache",
            "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
            "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
            "sec-ch-ua-mobile": "?0",
            "sec-fetch-dest": "iframe",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
        # Set the parameters; they can be read from Chrome's developer tools.
        kv = {"sf_url": url,
              "sf_submit": "",
              "new": "1",
              "lang": "en",
              "app": "",
              "country": "cn",
              "os": "Windows",
              "browser": "Chrome"}
        # Do the POST request
        r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers, data=kv)
        r.raise_for_status()
        # Return the response text
        return r.text


    if __name__ == '__main__':
        # Target URL (the YouTube address)
        url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
        # Get the response text
        reo = gethtml(url)
        # Strip the code before and after the script: we only need the javascript part,
        # which is where the encrypted information is stored.
        reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
        # Override the alert function: the script calls it in one place, and alerting
        # is meaningless under execjs. Without the override, the code raises an error.
        reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
        # Split into lines (this helps us find the decrypt call in the last few lines)
        reA = reo.split("\n")
        # Get the decrypt statement
        name = reA[len(reA) - 3].split(";")[0] + ";"
        # Add jsdom to the execjs context because the script needs it
        # (maybe there is a solution without jsdom, but I have no idea)
        addition = """
        const jsdom = require("jsdom");
        const { JSDOM } = jsdom;
        const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
        window = dom.window;
        document = window.document;
        XMLHttpRequest = window.XMLHttpRequest;
        """
        # Use execjs to execute the js code; cwd is the output of `npm root -g`
        # (the path of npm's global modules on your computer)
        ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
        # Do the decryption
        text = ct.eval(name.split("=")[1].replace(";", ""))
        # Get the JSON result: group(1) is the text between "show(" and ");;"
        result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
        # Use `json` to load the JSON
        j = json.loads(result)
        # Select the video format. In this case num = 1 means the video is
        #   - 360p   known from j["url"][num]["quality"]
        #   - MP4    known from j["url"][num]["type"]
        #   - audio  known from j["url"][num]["audio"]
        num = 1
        downurl = j["url"][num]["url"]
        # do some download
        # thanks :)
        # - EOF -
- 102 lines in total
- Development environment
    # @Environment:
    #           - windows 10
    #           - python 3.6.2
- Dependencies
    # @Dependence:
    #           - jsdom in npm(windows also can use)
    #           - requests, execjs, re, json in python
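Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Author: https://www.cnblogs.com/Eritque-arcus/ or https://blog.csdn.net/qq_40832960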