
Cracking Douyin's Font-Based Anti-Scraping to Get Basic User Data

2020-04-19 22:08

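Background

Douyin's web pages expose a user's nickname, Douyin ID, signature, following count, follower count, likes received, post count, and liked-video count.

To prevent scraping, Douyin renders every numeric value with glyphs from a custom icon font. If you fetch the page source directly, each digit appears as a character reference containing letters (such as &#xe603;) rather than as a plain digit.

This article shows how to work out which character corresponds to which digit.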
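
To make the obfuscation concrete, here is a minimal Python sketch. The markup string is an assumed sample, modeled on the <i class="icon iconfont follow-num"> elements that the regular expressions later in this article match; it is not copied from a live page.

```python
import html
import re

# Assumed sample of one obfuscated digit as it might appear in the page source.
sample = '<i class="icon iconfont follow-num"> &#xe603; </i>'

# Pull out the character reference between the tags.
entity = re.search(r"<i[^>]*>\s*(&#x[0-9a-fA-F]+;)\s*</i>", sample).group(1)

# html.unescape turns the reference into a single character.
char = html.unescape(entity)
code_point = ord(char)

# The code point lies in the Unicode Private Use Area (U+E000..U+F8FF),
# so it carries no meaning without the site's own font.
in_private_use_area = 0xE000 <= code_point <= 0xF8FF

print(entity, hex(code_point), in_private_use_area)
```

Because the code point is private-use, only the downloaded font can say which digit it draws, which is why the next section inspects the font file.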

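Analyzing the font file

1. Open a Douyin web page and watch the network requests: the page requests a .woff font file. Copy that URL and download the font file directly.

2. Use fontTools, a Python package, to inspect the font's character-to-glyph mapping.

Install fontTools with:

pip install fontTools

Convert the font file to XML with fontTools: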

from fontTools.ttLib import TTFont
font = TTFont(r'/Users/linchen/Downloads/iconfont_9eb9a50.woff')
font.saveXML('/Users/linchen/Downloads/font.xml')

The resulting XML file (excerpt; only the GlyphOrder and cmap tables are needed):

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="\x00\x01\x00\x00" ttLibVersion="4.1">

<GlyphOrder>
<!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
<GlyphID id="0" name="glyph00000"/>
<GlyphID id="1" name="x"/>
<GlyphID id="2" name="num_"/>
<GlyphID id="3" name="num_1"/>
<GlyphID id="4" name="num_2"/>
<GlyphID id="5" name="num_3"/>
<GlyphID id="6" name="num_4"/>
<GlyphID id="7" name="num_5"/>
<GlyphID id="8" name="num_6"/>
<GlyphID id="9" name="num_7"/>
<GlyphID id="10" name="num_8"/>
<GlyphID id="11" name="num_9"/>
</GlyphOrder>

············

<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x78" name="x"/><!-- LATIN SMALL LETTER X -->
<map code="0xe602" name="num_"/><!-- ???? -->
<map code="0xe603" name="num_1"/><!-- ???? -->
<map code="0xe604" name="num_2"/><!-- ???? -->
<map code="0xe605" name="num_3"/><!-- ???? -->
<map code="0xe606" name="num_4"/><!-- ???? -->
<map code="0xe607" name="num_5"/><!-- ???? -->
<map code="0xe608" name="num_6"/><!-- ???? -->
<map code="0xe609" name="num_7"/><!-- ???? -->
<map code="0xe60a" name="num_8"/><!-- ???? -->
<map code="0xe60b" name="num_9"/><!-- ???? -->
<map code="0xe60c" name="num_4"/><!-- ???? -->
<map code="0xe60d" name="num_1"/><!-- ???? -->
<map code="0xe60e" name="num_"/><!-- ???? -->
<map code="0xe60f" name="num_5"/><!-- ???? -->
<map code="0xe610" name="num_3"/><!-- ???? -->
<map code="0xe611" name="num_2"/><!-- ???? -->
<map code="0xe612" name="num_6"/><!-- ???? -->
<map code="0xe613" name="num_8"/><!-- ???? -->
<map code="0xe614" name="num_9"/><!-- ???? -->
<map code="0xe615" name="num_7"/><!-- ???? -->
<map code="0xe616" name="num_1"/><!-- ???? -->
<map code="0xe617" name="num_3"/><!-- ???? -->
<map code="0xe618" name="num_"/><!-- ???? -->
<map code="0xe619" name="num_4"/><!-- ???? -->
<map code="0xe61a" name="num_2"/><!-- ???? -->
<map code="0xe61b" name="num_5"/><!-- ???? -->
<map code="0xe61c" name="num_8"/><!-- ???? -->
<map code="0xe61d" name="num_9"/><!-- ???? -->
<map code="0xe61e" name="num_7"/><!-- ???? -->
<map code="0xe61f" name="num_6"/><!-- ???? -->
</cmap_format_4>

············

</cmap>

············

</ttFont>
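
The cmap table in this XML can also be read programmatically rather than by eye. The following is a minimal sketch that uses only Python's standard library and a shortened copy of the excerpt above; the helper name parse_cmap and the variable names are illustrative and are not part of fontTools.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML excerpt above (three map entries only,
# XML declaration and sfntVersion attribute left out for brevity).
TTX_SAMPLE = """<ttFont ttLibVersion="4.1">
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x78" name="x"/>
<map code="0xe602" name="num_"/>
<map code="0xe603" name="num_1"/>
</cmap_format_4>
</cmap>
</ttFont>
"""


def parse_cmap(ttx_xml):
    """Return {code point as a '0x...' string: glyph name} from a ttx XML dump."""
    root = ET.fromstring(ttx_xml)
    mapping = {}
    # Each <map> element inside <cmap> pairs one code point with one glyph name.
    for entry in root.find("cmap").iter("map"):
        mapping[entry.get("code")] = entry.get("name")
    return mapping


code_to_glyph = parse_cmap(TTX_SAMPLE)
print(code_to_glyph)
```

This only recovers the code point to glyph name relation. Which digit each glyph actually draws still has to be established separately.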

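3. Look up what each glyph draws

Upload the font to an online font editor (for example https://font.qqe2.com/) to see which digit each glyph actually draws.

4. The final mapping

Combining the two relations (code point to glyph name from the cmap table, and glyph name to visible digit from the font editor) gives the final mapping: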

private static Map<String, String> analyCode = new HashMap<>(0);
static {
    analyCode.put("0xe602", "1");
    analyCode.put("0xe603", "0");
    analyCode.put("0xe604", "3");
    analyCode.put("0xe605", "2");
    analyCode.put("0xe606", "4");
    analyCode.put("0xe607", "5");
    analyCode.put("0xe608", "6");
    analyCode.put("0xe609", "9");
    analyCode.put("0xe60a", "7");
    analyCode.put("0xe60b", "8");
    analyCode.put("0xe60c", "4");
    analyCode.put("0xe60d", "0");
    analyCode.put("0xe60e", "1");
    analyCode.put("0xe60f", "5");
    analyCode.put("0xe610", "2");
    analyCode.put("0xe611", "3");
    analyCode.put("0xe612", "6");
    analyCode.put("0xe613", "7");
    analyCode.put("0xe614", "8");
    analyCode.put("0xe615", "9");
    analyCode.put("0xe616", "0");
    analyCode.put("0xe617", "2");
    analyCode.put("0xe618", "1");
    analyCode.put("0xe619", "4");
    analyCode.put("0xe61a", "3");
    analyCode.put("0xe61b", "5");
    analyCode.put("0xe61c", "7");
    analyCode.put("0xe61d", "8");
    analyCode.put("0xe61e", "9");
    analyCode.put("0xe61f", "6");
}
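
How the two relations combine can be sketched in Python. The glyph-to-digit table below is what the final mapping above implies for each glyph name (in practice it is read off the font editor by eye). The cmap excerpt covers only the first ten private-use code points, and all names are illustrative.

```python
# Code point -> glyph name, copied from the first ten private-use cmap entries.
CODE_TO_GLYPH = {
    "0xe602": "num_", "0xe603": "num_1", "0xe604": "num_2", "0xe605": "num_3",
    "0xe606": "num_4", "0xe607": "num_5", "0xe608": "num_6", "0xe609": "num_7",
    "0xe60a": "num_8", "0xe60b": "num_9",
}

# Glyph name -> digit the glyph visually draws, as seen in the font editor.
GLYPH_TO_DIGIT = {
    "num_": "1", "num_1": "0", "num_2": "3", "num_3": "2", "num_4": "4",
    "num_5": "5", "num_6": "6", "num_7": "9", "num_8": "7", "num_9": "8",
}

# Compose the two relations: code point -> glyph name -> digit.
code_to_digit = {
    code: GLYPH_TO_DIGIT[glyph] for code, glyph in CODE_TO_GLYPH.items()
}

for code, digit in code_to_digit.items():
    print(code, "->", digit)
```

Note that the glyph names cannot be trusted as digits: num_7 draws a 9 and num_ draws a 1, which is why the visual check in step 3 is needed.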

Worked example

Code:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The members below belong inside a class; the class declaration is omitted from this listing.

/**
* Mapping table: the analyCode map and its static initializer are exactly
* the ones listed in step 4 above and are not repeated here.
*/
/**
* Regular expressions
*/
private static final Pattern PATTERN_NICKNAME = Pattern.compile("<p class=\"nickname\">(.*?)<");
private static final Pattern PATTERN_SIGNATURE = Pattern.compile("<p class=\"signature\">([\\S\\s]*?)<");

private static final Pattern PATTERN_ID = Pattern.compile("<p class=\"shortid\">抖音ID:(.*?)<");
private static final Pattern PATTERN_ID_BLOCK = Pattern.compile("<p class=\"shortid\">([\\S\\s]*?)</p>");
private static final Pattern PATTERN_ICON_FONT = Pattern.compile("<i class=\"icon iconfont \"> (.*?) </i>");

private static final Pattern PATTERN_FOCUS_BLOCK = Pattern.compile("<span class=\"focus block\"><span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_FANS_BLOCK = Pattern.compile("<span class=\"follower block\"><span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_LIKE_NUM_BLOCK = Pattern.compile("<span class=\"liked-num block\"><span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_FOLLOW_NUM = Pattern.compile("<i class=\"icon iconfont follow-num\"> (.*?) </i>|\\.|w ");

private static final Pattern PATTERN_POST_BLOCK = Pattern.compile("<div class=\"user-tab active tab get-list\" data-type=\"post\">作品<span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_LIKE_BLOCK = Pattern.compile("<div class=\"like-tab tab get-list\" data-type=\"like\">喜欢<span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_TAB_NUM = Pattern.compile("<i class=\"icon iconfont tab-num\"> (.*?) </i>");

/**
* Extract a basic profile field with a regular expression
*
* @param homepageHtml HTML source of the page
* @param pattern      regular expression for the field
* @return the first captured group, trimmed, or an empty string when nothing matches
* @author LKET
* @date 2019/11/21 3:51 PM
*/
private static String getUserInfo(String homepageHtml, Pattern pattern) {
    String info = "";
    Matcher matcher = pattern.matcher(homepageHtml);
    if (matcher.find()) {
        info = matcher.group(1).trim();
    }
    return info;
}
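
For readers following along in Python, the same lookup can be sketched as below. The sample HTML is an assumption shaped to fit PATTERN_NICKNAME; the function name get_user_info is illustrative.

```python
import re

# Same expression as PATTERN_NICKNAME in the Java listing.
PATTERN_NICKNAME = re.compile(r'<p class="nickname">(.*?)<')


def get_user_info(homepage_html, pattern):
    """Return the first captured group, stripped, or "" when nothing matches."""
    match = pattern.search(homepage_html)
    return match.group(1).strip() if match else ""


# Assumed markup, not copied from a live page.
sample_html = '<p class="nickname"> 罗志祥 </p>'

print(get_user_info(sample_html, PATTERN_NICKNAME))
print(repr(get_user_info("<div></div>", PATTERN_NICKNAME)))
```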

/**
* Decode an obfuscated number with regular expressions
*
* @param homepageHtml HTML source of the page
* @param blockPattern regular expression for the enclosing block
* @param numPattern   regular expression for the individual glyph elements
* @return java.lang.StringBuilder
* @author LKET
* @date 2019/11/21 3:51 PM
*/
private static StringBuilder getTrueNum(String homepageHtml, Pattern blockPattern, Pattern numPattern) {
    StringBuilder trueNum = new StringBuilder();
    Matcher matcherBlock = blockPattern.matcher(homepageHtml);
    if (matcherBlock.find()) {
        Matcher matcherNumList = numPattern.matcher(matcherBlock.group(1));
        while (matcherNumList.find()) {
            // A match containing an <i> tag is an obfuscated digit; anything else is a literal "." or "w"
            if (matcherNumList.group(0).contains("<i")) {
                // "&#xe602;" becomes "0xe602", the key format used in analyCode
                String code = matcherNumList.group(1).replace("&#", "0").replace(";", "");
                String number = analyCode.get(code);
                // Show an unmapped code point as "?" instead of appending the text "null"
                trueNum.append(number != null ? number : "?");
            } else {
                trueNum.append(matcherNumList.group(0));
            }
        }
    }
    return trueNum;
}
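
The decoding step can be tried in isolation with the Python sketch below. It reuses the alternation from PATTERN_FOLLOW_NUM and the first ten entries of the final mapping; the sample markup is an assumption built to spell 3743.3w, and the names decode_number and glyph are illustrative.

```python
import re

# First ten entries of the article's final mapping (code point -> digit).
ANALY_CODE = {
    "0xe602": "1", "0xe603": "0", "0xe604": "3", "0xe605": "2", "0xe606": "4",
    "0xe607": "5", "0xe608": "6", "0xe609": "9", "0xe60a": "7", "0xe60b": "8",
}

# Same alternation as PATTERN_FOLLOW_NUM: an obfuscated digit,
# a literal ".", or the "w" suffix followed by a space.
NUM_PATTERN = re.compile(r'<i class="icon iconfont follow-num"> (.*?) </i>|\.|w ')


def decode_number(block_html):
    """Turn a block of obfuscated glyph elements into readable text."""
    parts = []
    for match in NUM_PATTERN.finditer(block_html):
        if match.group(0).startswith("<i"):
            # "&#xe604;" becomes "0xe604", the key format of ANALY_CODE.
            code = match.group(1).replace("&#", "0").replace(";", "")
            # Unknown code points are shown as "?".
            parts.append(ANALY_CODE.get(code, "?"))
        else:
            parts.append(match.group(0).strip())
    return "".join(parts)


def glyph(code):
    """Build one assumed glyph element for the given hex code point."""
    return f'<i class="icon iconfont follow-num"> &#x{code}; </i>'


# Assumed markup for the value 3743.3w.
sample = (glyph("e604") + glyph("e60a") + glyph("e606") + glyph("e604")
          + "." + glyph("e604") + "w ")

print(decode_number(sample))  # prints 3743.3w
```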

/**
* Fetch a Douyin user's basic data
*/
public static void main(String[] args) {
    try {
        // Request the user's profile page and get its HTML.
        // doGet (an HTTP GET helper) is not defined in this listing.
        String homepageHtml = doGet("https://www.iesdouyin.com/share/user/76725372134?utm_campaign=client_share&app=aweme&utm_medium=ios&tt_from=copy&utm_source=copy");
        System.out.println(homepageHtml);
        // Print the data (a Douyin ID is either plain text or obfuscated digits, so the two cases are handled separately)
        String nickname = getUserInfo(homepageHtml, PATTERN_NICKNAME);
        System.out.println("Nickname: " + nickname);
        String id = getUserInfo(homepageHtml, PATTERN_ID);
        if (id.isEmpty()) {
            id = getTrueNum(homepageHtml, PATTERN_ID_BLOCK, PATTERN_ICON_FONT).toString();
        }
        System.out.println("Douyin ID: " + id);
        String signature = getUserInfo(homepageHtml, PATTERN_SIGNATURE);
        System.out.println("Signature: " + signature);
        StringBuilder focusNum = getTrueNum(homepageHtml, PATTERN_FOCUS_BLOCK, PATTERN_FOLLOW_NUM);
        System.out.println("Following: " + focusNum);
        StringBuilder fansNum = getTrueNum(homepageHtml, PATTERN_FANS_BLOCK, PATTERN_FOLLOW_NUM);
        System.out.println("Followers: " + fansNum);
        StringBuilder likeNumNum = getTrueNum(homepageHtml, PATTERN_LIKE_NUM_BLOCK, PATTERN_FOLLOW_NUM);
        System.out.println("Likes received: " + likeNumNum);
        StringBuilder postNum = getTrueNum(homepageHtml, PATTERN_POST_BLOCK, PATTERN_TAB_NUM);
        System.out.println("Posts: " + postNum);
        StringBuilder likeNum = getTrueNum(homepageHtml, PATTERN_LIKE_BLOCK, PATTERN_TAB_NUM);
        System.out.println("Liked: " + likeNum);
    } catch (Exception e) {
        System.out.println(e);
    }
}
Output for two users (the w suffix stands for 万, units of 10,000):

Nickname: 罗志祥
Douyin ID: ShowLoGNF
Signature: 抖音沙雕谁最强 就是本人罗志祥
Following: 1
Followers: 3743.3w
Likes received: 31061.5w
Posts: 239
Liked: 656

Nickname: 仙女酵母
Douyin ID: 1602606308
Signature: 三界接线员等你来打call
wb:@仙女酵母
个人wx:xnjmxiu8034
Following: 22
Followers: 1416.6w
Likes received: 15264.5w
Posts: 227
Liked: 154