Building an automated YouTube downloader in Python: the code

2021-01-12 19:50

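Contents
  • 2. Calling the decryption function
  • i. Analysis
  • ii. Extract the JavaScript part first
  • iii. Take the first decryption call as the one we use
  • iv. Run it with execjs
  • v. Putting the code together
  • 3. Analysing the decrypted result
  • 4. Full code
  • Per savefrom's terms
    This example and tutorial are for learning and discussion only; all rights belong to savefrom.net.
    The final code plus comments comes to roughly 100 lines. Treat the code on GitHub as authoritative (bugs may be fixed there); this article only explains it step by step.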

    Project address

    GitHub repository

    Approach

    Building an automated YouTube downloader in Python: the approach

    Walkthrough

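    1. POST

    Following the first step of the approach, we first need a POST request to fetch the encrypted JavaScript. I use the third-party requests library for this; for background on web scraping, see my earlier articles.

    i. First, format the headers of the POST request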

    # set the headers, or the website will not return any information
    # you may need to change the cookie values
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}

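    The cookie value may need changing; ideally use the one from your own browser. What each header means is outside the scope of this article; a search engine will tell you.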
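Swapping in your own cookie is easier if the raw string copied from the browser is first split into a dictionary. The sketch below is only an illustration: the helper names parse_cookie_header and build_cookie_header are hypothetical (they are not part of the original project), and the edited value is made up.

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a raw Cookie header value ("a=1; b=2") into a name -> value dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue
        # partition keeps everything after the first "=", so values may contain "="
        name, _, value = part.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies


def build_cookie_header(cookies: dict) -> str:
    """Join a name -> value dict back into a Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# A shortened cookie string in the same shape as the one in the headers above.
raw = "lang=en; country=CN; x-requested-with=; sfHelperDist=72"
cookies = parse_cookie_header(raw)
cookies["country"] = "US"  # hypothetical edit, for illustration only
rebuilt = build_cookie_header(cookies)
print(rebuilt)
```

The rebuilt string can then be assigned to headers["cookie"] before sending the request.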

    ii. Then format the parameters

    # set the form parameters; they can be copied from the request shown in Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}

    The sf_url field is the URL of the YouTube video we want to download; the other parameters never change.

    iii. Finally, send the POST request with the requests library

    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()

    Note that the parameters are passed as data=kv, so they are sent as a form-encoded request body.
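To see roughly what a form-encoded body looks like, the standard library's urlencode can be applied to a shortened version of kv. This is a sketch only: it does not call requests, and it assumes that requests encodes a dict passed as data in an equivalent way.

```python
from urllib.parse import parse_qs, urlencode

# A shortened version of kv; the remaining fields are encoded the same way.
kv = {
    "sf_url": "https://www.youtube.com/watch?v=YPvtz1lHRiw",
    "sf_submit": "",
    "new": "1",
    "lang": "en",
}

# Form encoding percent-escapes ":", "/", "?" and "=" inside the values.
body = urlencode(kv)
print(body)

# Decoding the body gives the original fields back (empty values included).
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["sf_url"][0])
```

This matches the content-type header set earlier, application/x-www-form-urlencoded.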

    iv. Wrap it in a function

    import requests


    def gethtml(url):
        # set the headers, or the website will not return any information
        # you may need to change the cookie values
        headers = {
            "cache-Control": "no-cache",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                      "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "content-type": "application/x-www-form-urlencoded",
            "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                      "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                      "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                      "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                      "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                      "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
            "origin": "https://en.savefrom.net",
            "pragma": "no-cache",
            "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
            "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
            "sec-ch-ua-mobile": "?0",
            "sec-fetch-dest": "iframe",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
        # set the form parameters; they can be copied from the request shown in Chrome
        kv = {"sf_url": url,
              "sf_submit": "",
              "new": "1",
              "lang": "en",
              "app": "",
              "country": "cn",
              "os": "Windows",
              "browser": "Chrome"}
        # do the POST request
        r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                          data=kv)
        r.raise_for_status()
        # get the result
        return r.text
    

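    2. Calling the decryption function

    i. Analysis

    The difficulty here is running JavaScript code from Python. Solutions found online include PyV8 and others; this article uses execjs. In the approach section we saw that the last few lines of the JavaScript call the decryption function, so we only need to run the whole script once in execjs and then evaluate the decryption call on its own.

    ii. Extract the JavaScript part first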

    # target URL (the YouTube address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the response text
    reo = gethtml(url)
    # strip everything before and after the script (the information is stored, encrypted, in the JavaScript part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]

    A regular expression would also work here, but as I am not yet fluent in regular expressions I simply used split.
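For comparison, here is a sketch of the regular-expression alternative. The helper name extract_script and the sample HTML are made up for illustration; the pattern assumes the opening tag is written exactly as in the split call.

```python
import re

# re.S lets "." match newlines, because the script spans many lines.
SCRIPT_RE = re.compile(r'<script type="text/javascript">(.*?)</script>', re.S)


def extract_script(html: str) -> str:
    """Return the body of the first inline text/javascript block."""
    match = SCRIPT_RE.search(html)
    if match is None:
        raise ValueError("no inline text/javascript block found")
    return match.group(1)


# Made-up page, only to show that the first matching block is returned.
sample = (
    '<html><script type="text/javascript">var a = 1;\nvar b = 2;</script>'
    '<script type="text/javascript">var c = 3;</script></html>'
)
script = extract_script(sample)
print(script)
```

Unlike indexing into the result of split, this version raises a descriptive error when the page contains no such block.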

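    iii. Take the first decryption call as the one we use

    If you fetch the results for several different videos, you will see that the decryption function is different each time, but it always sits at the same line position.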

    # split into lines (this helps us find the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption call (first statement on the third-from-last line)
    name = reA[-3].split(";")[0] + ";"

    So name holds our decryption call (not the best variable name, haha).
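The line selection can be tried on a made-up script. Everything in this sketch is hypothetical (the JavaScript lines, the names decode and out, and both helper functions); it only mirrors the indexing above and the right-hand-side extraction that the execjs step uses.

```python
def decrypt_statement(script: str) -> str:
    """Return the first statement on the third-from-last line of the script."""
    lines = script.split("\n")
    if len(lines) < 3:
        raise ValueError("script is too short to contain the expected line")
    return lines[-3].split(";")[0] + ";"


def right_hand_side(statement: str) -> str:
    """Drop the 'x = ' assignment prefix and the trailing semicolon."""
    # partition splits on the first "=" only, so an "=" inside the call survives
    _, _, expression = statement.partition("=")
    return expression.strip().rstrip(";")


# Made-up stand-in for the real script; the trailing "" models a final newline.
script = "\n".join([
    "var helper = function(s){ return s; };",
    "var out = decode('abc'); other();",
    "/* trailing line */",
    "",
])

name = decrypt_statement(script)
expression = right_hand_side(name)
print(name)
print(expression)
```

Because the position is counted from the end, a change in the site's script layout would make this pick the wrong line, so checking the extracted statement by eye is worthwhile.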

    iv. Run it with execjs

    # use execjs to compile the JavaScript code
    ct = execjs.compile(reo)
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

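    Taking only what follows = and dropping the semicolon means we evaluate just the function call, without the assignment; running the assignment plus decryption first and then reading the value would also work.
    But this fails straight away (if only it were that simple).

    1. this, i.e. the window variable, does not exist

    If I remember correctly, the error concerns this or $b. I tried removing every this, and wrapping everything in a class (so that this would refer to that class), but neither worked. I then found that npm has a jsdom package that can emulate the window variable under execjs (there is probably a better way), so we need to install npm and its jsdom package, and then rewrite the code above.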

    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code; cwd is the output of `npm root -g` (the npm modules path on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')

    where

    • cwd is the output of npm root -g, i.e. the path of npm's modules directory
    • addition is used to emulate window
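Rather than hard-coding the path, the value for cwd could be looked up at run time. This is a sketch under assumptions: the helper name npm_global_root is hypothetical, npm must be on PATH, and the function simply returns None when the lookup fails.

```python
import shutil
import subprocess
from typing import Optional


def npm_global_root() -> Optional[str]:
    """Return the output of `npm root -g`, or None if it cannot be obtained."""
    npm = shutil.which("npm")  # should also resolve npm.cmd on Windows
    if npm is None:
        return None
    try:
        result = subprocess.run(
            [npm, "root", "-g"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


node_modules = npm_global_root()
print(node_modules)
```

If the result is not None, it can be passed as the cwd argument of execjs.compile in place of the literal path.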

    But then we run into the next error.

    2. alert does not exist

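    This error arises because calling alert under execjs is pointless: there is no browser to show the popup. Moreover, alert is normally defined on window, and we have supplied our own window, so we must override alert at the start of the code (in effect, define our own alert).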

    # override the alert function: the script calls it in one place, and alerting
    # is meaningless under execjs; without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
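One caveat is that str.replace does nothing, silently, when the marker is absent. The sketch below wraps the same replacement in a hypothetical helper, patch_alert, that fails loudly instead; the sample script is made up.

```python
MARKER = "(function(){"
STUB = "(function(){\nthis.alert=function(){};"


def patch_alert(script: str) -> str:
    """Insert a no-op alert right after every '(function(){' in the script."""
    if MARKER not in script:
        raise ValueError("marker not found; the script layout may have changed")
    return script.replace(MARKER, STUB)


# Made-up one-line script that calls alert inside the wrapper function.
sample = "(function(){alert('hi');})();"
patched = patch_alert(sample)
print(patched)
```

Like the original line, this patches every occurrence of the marker, not only the first.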

    v. Putting the code together

    # target URL (the YouTube address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the response text
    reo = gethtml(url)
    # strip everything before and after the script (the information is stored, encrypted, in the JavaScript part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override the alert function: the script calls it in one place, and alerting
    # is meaningless under execjs; without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps us find the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption call
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # add jsdom to the execjs context because the code uses window (maybe there is a solution without jsdom, but I have no idea)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code; cwd is the output of `npm root -g` (the npm modules path on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
    

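    3. Analysing the decrypted result

    i. Extract the key JSON

    After running the code above, the decrypted result is stored in text. As the approach section showed, what really matters to us is the JSON passed to window.parent.sf.videoResult.show(), so we extract that JSON with a regular expression.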

    # get the result as a JSON string (raw string so that \( is a valid escape; group(1) is the text inside show(...))
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)

    ii. Parse the JSON

    Python has many libraries that can parse JSON; here I use the json library (remember to import it).

    # use `json` to load json
    j = json.loads(result)
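The two steps can be tried end to end on a made-up string. The sample text below is hypothetical and far smaller than the real decrypted output; only the show(...);; wrapper and the url, quality and type keys follow what this article describes. Note that this sketch uses the re.S flag, which the original code does not, so that a payload spanning several lines would still match.

```python
import json
import re

# Hypothetical decrypted text; the real payload is much larger.
text = (
    'window.parent.sf.videoResult.show({"url": [{"url": '
    '"https://example.com/a.mp4", "quality": "720", "type": "mp4"}]});;'
)

# Capture only the argument of show(...), up to the first ");;".
match = re.search(r'show\((.*?)\);;', text, re.S)
if match is None:
    raise ValueError("no show(...) call found in the decrypted text")

j = json.loads(match.group(1))
quality = j["url"][0]["quality"]
print(quality)
```

Checking the match for None before calling group gives a clearer error than the AttributeError raised by chaining the calls.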

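    iii. Get the download address

    Now for the last step. From the approach section and a JSON formatter we can see that j["url"][num]["url"] is the download link, where num selects the video format we want (resolution and type).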

    # the selection of the video (in this case, num=1 means the video is
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"])
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
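Since the position of a given format in the list may vary between videos, selecting by the quality and type fields could be more robust than a fixed num. This is a sketch with made-up entries; the helper name pick_format is hypothetical, and the exact values the site returns (for example "360" versus "360p") are an assumption to check against real output.

```python
def pick_format(entries, quality, file_type):
    """Return the first entry whose quality and type match, else None."""
    for entry in entries:
        same_quality = str(entry.get("quality")) == str(quality)
        same_type = str(entry.get("type", "")).lower() == file_type.lower()
        if same_quality and same_type:
            return entry
    return None


# Made-up entries shaped like j["url"]; real values may differ.
entries = [
    {"url": "https://example.com/720.mp4", "quality": "720", "type": "MP4"},
    {"url": "https://example.com/360.mp4", "quality": "360", "type": "MP4"},
    {"url": "https://example.com/360.webm", "quality": "360", "type": "WEBM"},
]

chosen = pick_format(entries, 360, "mp4")
downurl = chosen["url"] if chosen else None
print(downurl)
```

With real data, entries would be j["url"] from the previous step.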

    4. Full code

    # -*- coding: utf-8 -*-
    # @Time: 2021/1/10
    # @Author: Eritque arcus
    # @File: Youtube.py
    # @License: MIT
    # @Environment:
    #           - windows 10
    #           - python 3.6.2
    # @Dependence:
    #           - jsdom in npm(windows also can use)
    #           - requests, execjs, re, json in python
    import requests
    import execjs
    import re
    import json


    def gethtml(url):
        # set the headers, or the website will not return any information
        # you may need to change the cookie values
        headers = {
            "cache-Control": "no-cache",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                      "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "content-type": "application/x-www-form-urlencoded",
            "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                      "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                      "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                      "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                      "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                      "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
            "origin": "https://en.savefrom.net",
            "pragma": "no-cache",
            "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
            "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
            "sec-ch-ua-mobile": "?0",
            "sec-fetch-dest": "iframe",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
        # set the form parameters; they can be copied from the request shown in Chrome
        kv = {"sf_url": url,
              "sf_submit": "",
              "new": "1",
              "lang": "en",
              "app": "",
              "country": "cn",
              "os": "Windows",
              "browser": "Chrome"}
        # do the POST request
        r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                          data=kv)
        r.raise_for_status()
        # get the result
        return r.text


    if __name__ == '__main__':
        # target URL (the YouTube address)
        url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
        # get the response text
        reo = gethtml(url)
        # strip everything before and after the script (the information is stored, encrypted, in the JavaScript part)
        reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
        # override the alert function: the script calls it in one place, and alerting
        # is meaningless under execjs; without the override the code raises an error
        reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
        # split into lines (this helps us find the decryption call in the last few lines)
        reA = reo.split("\n")
        # get the decryption call
        name = reA[len(reA) - 3].split(";")[0] + ";"
        # add jsdom to the execjs context because the code uses window (maybe there is a solution without jsdom, but I have no idea)
        addition = """
        const jsdom = require("jsdom");
        const { JSDOM } = jsdom;
        const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
        window = dom.window;
        document = window.document;
        XMLHttpRequest = window.XMLHttpRequest;
        """
        # use execjs to execute the js code; cwd is the output of `npm root -g` (the npm modules path on your computer)
        ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
        # do the decryption
        text = ct.eval(name.split("=")[1].replace(";", ""))
        # get the result as a JSON string (group(1) is the text inside show(...))
        result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
        # use `json` to load json
        j = json.loads(result)
        # the selection of the video (in this case, num=1 means the video is
        # - 360p, known from j["url"][num]["quality"]
        # - MP4, known from j["url"][num]["type"]
        # - audio, known from j["url"][num]["audio"])
        num = 1
        downurl = j["url"][num]["url"]
        # do some download
        # thanks :)
        # - EOF -
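The listing stops at "do some download". One way that step might look is sketched below. It is an assumption, not the author's code: it uses the standard library's urllib instead of requests so that it can be exercised against a local file:// URL, the helper name download is hypothetical, and a real link may expire or require extra request headers.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(downurl: str, target: Path, chunk_size: int = 64 * 1024) -> int:
    """Stream downurl into target and return the number of bytes written."""
    with urllib.request.urlopen(downurl) as response, open(target, "wb") as out:
        # copy in chunks so a large video never sits in memory all at once
        shutil.copyfileobj(response, out, chunk_size)
    return target.stat().st_size


# Self-check against a local file:// URL so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"fake video bytes")
    copied = Path(tmp) / "video.mp4"
    size = download(source.as_uri(), copied)
    data = copied.read_bytes()

print(size)
```

In the script above, the call would be download(downurl, Path("video.mp4")), with the file name chosen to match the selected type.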
    
    • 102 lines in total
    • Development environment
    # @Environment:
    #           - windows 10
    #           - python 3.6.2
    • Dependencies
    # @Dependence:
    #           - jsdom in npm (also works on Windows)
    #           - requests, execjs, re, json in python
    -end-

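    For web scraping
    Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA licence. When reposting, please include a link to the original source and this notice.
    Author: https://www.cnblogs.com/Eritque-arcus/ https://blog.csdn.net/qq_40832960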
