
Python + TensorFlow + pygame: Defeating Any Font-Based Anti-Scraping Scheme

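2019-05-16 13:32
Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice. Original link: https://blog.csdn.net/Mr_bai_404/article/details/90257165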

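What is font-based anti-scraping?

Every character on a computer can be represented by a Unicode code point, and a font file can be understood as a mapping from Unicode code points to glyph shapes. That mapping is what turns a character into a shape that people can read. The font file is therefore the key to font-based anti-scraping, because it decides which shape (that is, which visible character) each Unicode code point is rendered as.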

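1. Analyzing the anti-scraping effect

We take Maoyan as an example:

As the second figure shows, Maoyan obfuscates its digits. In a reference such as `&#xe309;`, `&#x` marks a hexadecimal value and `e309` is the Unicode value. The first figure compares the text rendered with the page's font file against the default rendering. A scraper can only capture the raw Unicode references or their meaningless default rendering; recovering the real digits requires the font file.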
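To make the relationship concrete, here is a minimal sketch that decodes one such reference with Python's standard `html` module. The sample string is an assumption built from the `e309` value mentioned above; a real page will contain different values.

```python
import html

raw = "&#xe309;"              # what the scraper sees in the page source
char = html.unescape(raw)     # one character from the Unicode Private Use Area
code_point = ord(char)        # the number the font file maps to a glyph

print(hex(code_point), code_point)
```

Without the page's font, that character has no meaningful shape, which is why the scraped text looks like garbage.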

 

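2. Obtaining the font file

As the figure above shows, the font file is the base64 data between the two arrows. We only need to decode it and save it locally.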
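How the base64 payload is pulled out of the page is not shown in the text. The sketch below assumes the font is embedded in the page's CSS as a `data:` URI inside `url(...)`; the function name `extract_embedded_font` and the sample CSS are hypothetical, and a real page may embed or link the font differently.

```python
import base64
import re

def extract_embedded_font(page_source):
    """Return the decoded bytes of the first base64 data URI, or None."""
    # Assumption: the font appears as  url(data:<mime>;...;base64,<payload>)
    match = re.search(r"data:[^,)]*;base64,([A-Za-z0-9+/=]+)", page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))

# Hypothetical CSS fragment; the payload is just the four signature bytes.
sample_css = ("@font-face { font-family: demo; "
              "src: url(data:application/font-woff;charset=utf-8;base64,d09GRg==) "
              "format('woff'); }")
font_bytes = extract_embedded_font(sample_css)
print(font_bytes)
```

The returned bytes can then be written to disk exactly as the code that follows does with its hard-coded string.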

[code]import base64
font_face='d09GRgABAAAAAAggAAsAAAAAC7gAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABHU1VCAAABCAAAADMAAABCsP6z7U9TLzIAAAE8AAAARAAAAFZW7ldeY21hcAAAAYAAAAC8AAACTC79iqhnbHlmAAACPAAAA5EAAAQ0l9+jTWhlYWQAAAXQAAAALwAAADYVJRd8aGhlYQAABgAAAAAcAAAAJAeKAzlobXR4AAAGHAAAABIAAAAwGhwAAGxvY2EAAAYwAAAAGgAAABoGmAWgbWF4cAAABkwAAAAfAAAAIAEZADxuYW1lAAAGbAAAAVcAAAKFkAhoC3Bvc3QAAAfEAAAAXAAAAI8LScOueJxjYGRgYOBikGPQYWB0cfMJYeBgYGGAAJAMY05meiJQDMoDyrGAaQ4gZoOIAgCKIwNPAHicY2Bk0mWcwMDKwMHUyXSGgYGhH0IzvmYwYuRgYGBiYGVmwAoC0lxTGBwYKr4/Z9b5r8MQw6zDcAUozAiSAwDoeQvweJzFkrENg0AMRf8FQgKkSJkhsgM1EhKDsAATpKLMFJSpMgcjAOIKKJAoKMm/M00kaBOf3kn+PtmWfQCOABxyJy6g3lAw9qKqrO4gsLqLB/0brlTOyOuq8dusS3TRx0M0plM562Xhi/3Ililm3DomEuLA/lyc4MFnTenE28n0A1P/K/1tF3s/Vy8k+QpbrCvB6I0vcJJoM8HsvEsEs3NdCJwz+ljgxDFEAmePMRW4BUylYP7NrAUEH1DvQih4nD2Ty28aVxTG7x0ixsEY4zKPgBNgGDwzgG3G88LAGAgYEj8pNoMxToixEkLcJnGtOHUSq03oQ0qq/gHpplIX3URdZJ9KVbNqU7Ve9A+o1G13rZSNhXtn7DCLK50rnfN95/fNBRCA43+ABAiAAZCQScJPCAB9mHnAI+w34ABDADAqo0J5RCZZkh+xwULvV1i63GrV/3pRgYc9sfLiCN39eNJ3/B8GsD8AC+Jooh/KUgYmMnAGKrwdt7MhTlU0WfJDknBBNsRzPFQ4NmQnCZqStK8GdTGa5l12HHrj44n1h59uze7p6ftlQ9EcsLMyna5Fog/KP+jqWEb1aaMDZ+xRn+/x9u0vFr7uPvvOmIwbML243lwuRWJr4J0f2EN+gmAcbcTxSAy34y5IZpA16sQF8pRApmiKhoTpWFMVLmSH3zjJsBINRmnnUHBDXjtIXc/febZY+MjQVGfvOV/ktEr5fhWjFHqMDiQvrGpTk9124d7Mt68OmyviZLX3ZtyINZbm1moAWlzOYD+DwCkVzVob5z0MyeCnXkw+iMiXjjktV68VYgVitQiv9/7mg7Ns80my+PHWTGbgdTG/9bzGBRxwp/oTRT+5uXl5TZtuAGDrsyeQygQAHpO1NdfEbUejUZ0gKFlKaOaCNoKiUaGdVC8/3Hm1u50vdv+8mCuJeUVkmUL74vnQWCgSlMlI9ZMK/FzYfv/W3cWOQF3LXz3I6K1S83slGww0C7neU75IeEiCf7xS6XM/OvUCPErCXBEBRm5M+uaeyImkuXkO5RH1+jrLu+nzbrfTNXqjfFMvNSoPV6PCo/AEbHXnl6sb0Zx+O9vml1fn629e3tuDm+mUnAcW13/hMdKJ9blaESdoK0ULsWaF64dI2QKBAue7w5e0jMFHdF/Y4UquZzV51lF3J1PVlDSlSlPZS0871w7O/rKQrx3wgmMJpmfEbCY/3IhP+c7VNxeo4Sulq5/tNKxnYh17yIMTvRR2xAVxVUP7JmS4Vw92hLnpUWEwiYl+3W2EJK9I93veop4wAGMkg/jYzJZ3f0UGSqekcJcNh297/KBjVEhyqTIZWdCzi7Bxdv/3fSZGFERBot8bqFYDfm88rgbF+QvTN+bmS472rV1jYkmiswIzcY4e6mseY6+BByWiMiSabMdZU9WUjsNDtjAre7wDG3DEHUj7cwx2xyiGWw8e5RofRNv6/t3kFQ6A/wHUBeDCAAAAeJxjY
GRgYABiXqF/ofH8Nl8ZuFkYQOAm08tHCPr/GxYGpvNALgcDE0gUACrwCzkAeJxjYGRgYNb5r8MQw8IAAkCSkQEV8AAAM2IBzXicY2EAghQGBiYd4jAAN4wCNQAAAAAAAAAMACgAcAC0AOYBLAFgAaIBvAH2AhoAAHicY2BkYGDgYTBgYGYAASYg5gJCBob/YD4DAA6DAVYAeJxlkbtuwkAURMc88gApQomUJoq0TdIQzEOpUDokKCNR0BuzBiO/tF6QSJcPyHflE9Klyyekz2CuG8cr7547M3d9JQO4xjccnJ57vid2cMHqxDWc40G4Tv1JuEF+Fm6ijRfhM+oz4Ra6eBVu4wZvvMFpXLIa40PYQQefwjVc4Uu4Tv1HuEH+FW7i1mkKn6Hj3Am3sHC6wm08Ou8tpSZGe1av1PKggjSxPd8zJtSGTuinyVGa6/Uu8kxZludCmzxMEzV0B6U004k25W35fj2yNlCBSWM1paujKFWZSbfat+7G2mzc7weiu34aczzFNYGBhgfLfcV6iQP3ACkSaj349AxXSN9IT0j16JepOb01doiKbNWt1ovippz6sVYYwsXgX2rGVFIkq7Pl2PNrI6qW6eOshj0xaSq9mpNEZIWs8LZUfOouNkVXxp/d5woqebeYIf4D2J1ywQB4nG2JOw6AIBQE3+IHRbyLAQJaAsJdbOxMPL7x2TrNZHZJ0IeifzQEGrTo0ENiwAiFCRoz4ZbXeRS7bK+rjZHbu8x2PrGT47+Elfe6uMqdLbuGErjNbogeH2oXtw=='
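# Note: font_face must be one unbroken string literal; rejoin it if it was wrapped onto two lines.
b = base64.b64decode(font_face)
# The payload is WOFF data; the file name is kept because later steps open 'myfont.otf'.
with open('myfont.otf', 'wb') as f:
    f.write(b)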

3. Getting the mapping between Unicode code points and glyphs in the font file

[code]from fontTools.ttLib import TTFont
ttffont=TTFont("myfont.otf")
print(ttffont.getBestCmap())

#{120: 'x', 58066: 'uniE2D2', 58121: 'uniE309', 58475: 'uniE46B', 58956: 'uniE64C', 59276: 'uniE78C',60233: 'uniEB49', 60479: 'uniEC3F', 61519: 'uniF04F', 62378: 'uniF3AA', 63463: 'uniF7E7'}

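This is the mapping stored in the font file.

The page text, on the other hand, consists of hexadecimal Unicode references such as &#xe309;&#xec3f;.&#xe309;&#xf04f;, so we only need to convert them to decimal: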

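[code]import re

s = "&#xe309;&#xec3f;.&#xe309;&#xf04f;"
# Capture the hex digits of every &#x...; reference.
n16s = re.findall("&#x(.*?);", s)
for n16 in n16s:
    n10 = int(n16, 16)  # hexadecimal -> decimal
    print(n10)
# 58121
# 60479
# 58121
# 61519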

The references are now converted into code points that the font file recognizes.
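As a quick check that the converted values really exist in the font, the two steps can be combined. This is only a sketch: the `cmap` dictionary is copied from the output printed in section 3, the sample string is rebuilt from the four decimal values printed above, and the helper name `glyph_names` is hypothetical.

```python
import re

# Best cmap printed by fontTools in section 3 (code point -> glyph name).
cmap = {120: 'x', 58066: 'uniE2D2', 58121: 'uniE309', 58475: 'uniE46B',
        58956: 'uniE64C', 59276: 'uniE78C', 60233: 'uniEB49', 60479: 'uniEC3F',
        61519: 'uniF04F', 62378: 'uniF3AA', 63463: 'uniF7E7'}

def glyph_names(text, cmap):
    """Return (code point, glyph name or None) for every &#x...; reference."""
    pairs = []
    for hex_digits in re.findall(r"&#x([0-9a-fA-F]+);", text):
        code_point = int(hex_digits, 16)
        pairs.append((code_point, cmap.get(code_point)))
    return pairs

result = glyph_names("&#xe309;&#xec3f;.&#xe309;&#xf04f;", cmap)
print(result)
```

A `None` in the second position would mean the page references a code point that this font does not cover, which usually indicates that the wrong font file was captured.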

4. Rendering code points with the font file

Here we use pygame to render each code point to an image. First, check which type of font file the page uses.

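The figure shows that Maoyan uses a WOFF font. According to the pygame documentation, `pygame.freetype` can render all font file formats supported by FreeType, namely TTF, Type1, CFF, OpenType, SFNT, PCF, FNT, BDF, PFR and Type42 fonts, and all glyphs that have UTF-32 code points are accessible.

WOFF is not in that list and pygame does not load it directly, so we use a function to convert the WOFF data to a plain SFNT font first. The code is as follows: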

[code]import pygame.freetype
from PIL import Image
from io import BytesIO
import base64
import struct
import sys
import zlib


def convert_streams(infile):
    """Convert WOFF bytes to plain SFNT (TTF/OTF) bytes."""
    infile = BytesIO(infile)
    outfile = BytesIO()
    # Read the 44-byte WOFF header (all fields are big-endian).
    WOFFHeader = {'signature': struct.unpack(">I", infile.read(4))[0],
                  'flavor': struct.unpack(">I", infile.read(4))[0],
                  'length': struct.unpack(">I", infile.read(4))[0],
                  'numTables': struct.unpack(">H", infile.read(2))[0],
                  'reserved': struct.unpack(">H", infile.read(2))[0],
                  'totalSfntSize': struct.unpack(">I", infile.read(4))[0],
                  'majorVersion': struct.unpack(">H", infile.read(2))[0],
                  'minorVersion': struct.unpack(">H", infile.read(2))[0],
                  'metaOffset': struct.unpack(">I", infile.read(4))[0],
                  'metaLength': struct.unpack(">I", infile.read(4))[0],
                  'metaOrigLength': struct.unpack(">I", infile.read(4))[0],
                  'privOffset': struct.unpack(">I", infile.read(4))[0],
                  'privLength': struct.unpack(">I", infile.read(4))[0]}

    # Write the SFNT offset table: flavor, numTables and the binary-search fields.
    outfile.write(struct.pack(">I", WOFFHeader['flavor']))
    outfile.write(struct.pack(">H", WOFFHeader['numTables']))
    # Largest power of two that does not exceed numTables, as (exponent, value).
    maximum = list(filter(lambda x: x[1] <= WOFFHeader['numTables'], [(n, 2**n) for n in range(64)]))[-1]
    searchRange = maximum[1] * 16
    outfile.write(struct.pack(">H", searchRange))
    entrySelector = maximum[0]
    outfile.write(struct.pack(">H", entrySelector))
    rangeShift = WOFFHeader['numTables'] * 16 - searchRange
    outfile.write(struct.pack(">H", rangeShift))

    offset = outfile.tell()

    # Read the WOFF table directory; each SFNT directory entry takes 16 bytes.
    TableDirectoryEntries = []
    for i in range(0, WOFFHeader['numTables']):
        TableDirectoryEntries.append({'tag': struct.unpack(">I", infile.read(4))[0],
                                      'offset': struct.unpack(">I", infile.read(4))[0],
                                      'compLength': struct.unpack(">I", infile.read(4))[0],
                                      'origLength': struct.unpack(">I", infile.read(4))[0],
                                      'origChecksum': struct.unpack(">I", infile.read(4))[0]})
        offset += 4 * 4

    # Write the SFNT table directory and assign each table a 4-byte-aligned offset.
    for TableDirectoryEntry in TableDirectoryEntries:
        outfile.write(struct.pack(">I", TableDirectoryEntry['tag']))
        outfile.write(struct.pack(">I", TableDirectoryEntry['origChecksum']))
        outfile.write(struct.pack(">I", offset))
        outfile.write(struct.pack(">I", TableDirectoryEntry['origLength']))
        TableDirectoryEntry['outOffset'] = offset
        offset += TableDirectoryEntry['origLength']
        if (offset % 4) != 0:
            offset += 4 - (offset % 4)

    # Copy the table data, decompressing any table that was stored with zlib.
    for TableDirectoryEntry in TableDirectoryEntries:
        infile.seek(TableDirectoryEntry['offset'])
        compressedData = infile.read(TableDirectoryEntry['compLength'])
        if TableDirectoryEntry['compLength'] != TableDirectoryEntry['origLength']:
            uncompressedData = zlib.decompress(compressedData)
        else:
            uncompressedData = compressedData
        outfile.seek(TableDirectoryEntry['outOffset'])
        outfile.write(uncompressedData)
        offset = TableDirectoryEntry['outOffset'] + TableDirectoryEntry['origLength']
        padding = 0
        if (offset % 4) != 0:
            padding = 4 - (offset % 4)
        outfile.write(bytearray(padding))
    return outfile.getvalue()

# Load the WOFF bytes saved as 'myfont.otf' in section 2
# (the same data as the base64 string decoded there).
with open('myfont.otf', 'rb') as f:
    b = f.read()
myfont = BytesIO(convert_streams(b))

uni = 58121  # 0xE309, one of the obfuscated code points

pygame.freetype.init()
font = pygame.freetype.Font(myfont, 64)
# render() returns (Surface, Rect); draw black text on a white background.
rtext = font.render(chr(uni), (0, 0, 0), (255, 255, 255))
pil_string_image = pygame.image.tostring(rtext[0], "RGB")
pil_image = Image.frombytes("RGB", rtext[0].get_size(), pil_string_image)
pil_image.show()

Running this code renders the Unicode code point as an image.
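Before converting, it can help to confirm what kind of font data was actually captured. The sketch below inspects the leading bytes only; the helper names `sniff_font_container` and `woff_flavor` are hypothetical, and the sample bytes are decoded from the first 16 characters of the `font_face` string used in section 2.

```python
import base64
import struct

def sniff_font_container(data):
    """Classify font bytes by their four-byte signature."""
    tag = data[:4]
    if tag == b"wOFF":
        return "woff"
    if tag == b"wOF2":
        return "woff2"
    if tag == b"OTTO":
        return "otf"   # OpenType with CFF outlines
    if tag in (b"\x00\x01\x00\x00", b"true"):
        return "ttf"   # TrueType outlines
    return "unknown"

def woff_flavor(data):
    """Return the sfnt flavor stored in bytes 4..8 of a WOFF header."""
    (flavor,) = struct.unpack(">4s", data[4:8])
    return flavor

# First 12 bytes of the font from section 2: signature, flavor, total length.
head = base64.b64decode("d09GRgABAAAAAAgg")
print(sniff_font_container(head), woff_flavor(head), struct.unpack(">I", head[8:12])[0])
```

The flavor field tells you what `convert_streams` will produce: `b"OTTO"` gives an OTF with CFF outlines, and `b"\x00\x01\x00\x00"` gives a TrueType font. The function above does not handle WOFF2 data, which uses a different compression scheme.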

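5. Training a TensorFlow CNN (convolutional neural network) to recognize the character in each image

The remaining step is to have the computer recognize the rendered image as a character. We can download one or more full-character font files from the internet, use pygame to generate training samples from them, and then train the model. The TensorFlow training code follows: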

[code]# Requires TensorFlow 1.x (the code uses tf.placeholder and tf.Session).
import numpy as np
import tensorflow as tf
import pygame
import random
from PIL import Image
import pygame.freetype
from io import BytesIO
from io import StringIO
from fontTools.ttLib import TTFont

pygame.init()
pygame.freetype.init()

# Index list left over from earlier experiments; it is not used below.
sjs = list(range(0, 46)) + list(range(315, 360))

kb = 0.75  # dropout keep probability used during training

# Fonts used to generate training samples; uncomment more for variety.
ttf_names = ["e:syc.otf",
             # "e:syz.otf","e:o1.otf","e:o2.otf","e:o3.otf",
             # "e:o4.otf","e:o5.otf","e:o6.otf","e:o7.otf",
             # "e:t1.ttf","e:t2.ttf","e:t3.ttf","e:t4.ttf",
             ]
ttf_name = "e:syc.otf"


def change_ttf(ttf_name):
    """Switch the global rendering font to the given font file."""
    global font
    print(ttf_name)
    # font=pygame.freetype.Font(ttf_name,random.randint(60,64))
    font = pygame.freetype.Font(ttf_name, 64)


# Code points to learn: printable ASCII plus CJK Unified Ideographs.
ttffont = TTFont(ttf_name)
gbcs = list(ttffont.getBestCmap())
gbc = []
for k in gbcs:
    if 33 <= k <= 126 or 19968 <= k <= 40869:
        gbc.append(k)
# gbc=gbc[500:507]
sgbc = sorted(gbc)  # fixed ordering used for the one-hot labels
print("gggggg")

IMAGE_HEIGHT = 64
IMAGE_WIDTH = 64


def k2name(k):
    """Code point -> hex string zero-padded to at least four digits."""
    return format(k, "04x")


def k2im(k):
    """Render code point k with the current font as a 64x64 binary array."""
    rtext = font.render(chr(k), (0, 0, 0), (255, 255, 255))
    pil_string_image = pygame.image.tostring(rtext[0], "RGB")
    pil_image = Image.frombytes("RGB", rtext[0].get_size(), pil_string_image).resize((IMAGE_WIDTH, IMAGE_HEIGHT))
    im = np.array(pil_image.convert("1"))
    return im


def gen_captcha_text_and_image(k):
    """Return (label text, image array) for one code point."""
    captcha_text = k2name(k)
    captcha_image = k2im(k)
    return captcha_text, captcha_image

MAX_CAPTCHA = 1  # each sample image contains exactly one character
print("Max characters per sample:", MAX_CAPTCHA)


# Convert a color image to grayscale (color carries no information for recognition).
def convert2gray(img):
    if len(img.shape) > 2:
        # gray = np.mean(img, -1)
        # The line above is faster; the standard luminance conversion is:
        r, g, b = img[:, :, 0], img[:, :, 1], img[:, :, 2]
        gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
        return gray
    else:
        return img


"""
A CNN performs best when the image size is a multiple of 2. If your image size is not, pad the borders with dummy pixels:
np.pad(image, ((2, 3), (2, 2)), 'constant', constant_values=(255,))  # pad 2 rows on top, 3 below, 2 columns left, 2 right
"""

# Text to vector: one class per code point in sgbc.
CHAR_SET_LEN = len(sgbc)
print(CHAR_SET_LEN)


def text2vec(text):
    """One-hot encode a hex label such as 'e309' by its position in sgbc."""
    vector = np.zeros(MAX_CAPTCHA * CHAR_SET_LEN)
    idx = int(text, 16)
    vector[sgbc.index(idx)] = 1
    return vector


# Build one training batch from a list of code points.
def get_next_batch(kks):
    batch_size = len(kks)
    batch_x = np.zeros([batch_size, IMAGE_HEIGHT * IMAGE_WIDTH])
    batch_y = np.zeros([batch_size, MAX_CAPTCHA * CHAR_SET_LEN])

    for i, kk in enumerate(kks):
        text, image = gen_captcha_text_and_image(kk)
        image = convert2gray(image)
        batch_x[i, :] = image.flatten() / 1  # alternative: (image.flatten()-128)/128 for zero mean
        batch_y[i, :] = text2vec(text)
    return batch_x, batch_y

####################################################################

X = tf.placeholder(tf.float32, [None, IMAGE_HEIGHT*IMAGE_WIDTH])
Y = tf.placeholder(tf.float32, [None, MAX_CAPTCHA*CHAR_SET_LEN])
keep_prob = tf.placeholder(tf.float32)  # dropout


# Define the CNN
def crack_captcha_cnn(w_alpha=0.01, b_alpha=0.1):
    x = tf.reshape(X, shape=[-1, IMAGE_HEIGHT, IMAGE_WIDTH, 1])

    #w_c1_alpha = np.sqrt(2.0/(IMAGE_HEIGHT*IMAGE_WIDTH)) #
    #w_c2_alpha = np.sqrt(2.0/(3*3*32))
    #w_c3_alpha = np.sqrt(2.0/(3*3*64))
    #w_d1_alpha = np.sqrt(2.0/(8*32*64))
    #out_alpha = np.sqrt(2.0/1024)

    # 3 conv layers; each 2x2 max-pool halves the width and height
    w_c1 = tf.Variable(w_alpha*tf.random_normal([3, 3, 1, 16]))
    b_c1 = tf.Variable(b_alpha*tf.random_normal([16]))
    conv1 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w_c1, strides=[1, 1, 1, 1], padding='SAME'), b_c1))
    conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv1 = tf.nn.dropout(conv1, keep_prob)
    print(conv1.shape)

    w_c2 = tf.Variable(w_alpha*tf.random_normal([3, 3, 16, 32]))
    b_c2 = tf.Variable(b_alpha*tf.random_normal([32]))
    conv2 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv1, w_c2, strides=[1, 1, 1, 1], padding='SAME'), b_c2))
    conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.dropout(conv2, keep_prob)
    print(conv2.shape)

    w_c3 = tf.Variable(w_alpha*tf.random_normal([3, 3, 32, 64]))
    b_c3 = tf.Variable(b_alpha*tf.random_normal([64]))
    conv3 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv2, w_c3, strides=[1, 1, 1, 1], padding='SAME'), b_c3))
    conv3 = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv3 = tf.nn.dropout(conv3, keep_prob)
    print(conv3.shape)

    # Fully connected layer: 64x64 input -> 8x8 feature maps with 64 channels
    w_d = tf.Variable(w_alpha*tf.random_normal([8*8*64, 1024]))
    b_d = tf.Variable(b_alpha*tf.random_normal([1024]))
    dense = tf.reshape(conv3, [-1, w_d.get_shape().as_list()[0]])
    dense = tf.nn.relu(tf.add(tf.matmul(dense, w_d), b_d))
    dense = tf.nn.dropout(dense, keep_prob)

    w_out = tf.Variable(w_alpha*tf.random_normal([1024, MAX_CAPTCHA*CHAR_SET_LEN]))
    b_out = tf.Variable(b_alpha*tf.random_normal([MAX_CAPTCHA*CHAR_SET_LEN]))
    out = tf.add(tf.matmul(dense, w_out), b_out)
    #out = tf.nn.softmax(out)
    return out

# Training
def train_crack_captcha_cnn():
    output = crack_captcha_cnn()
    # loss
    #loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, Y))
    print("ddddddddd", output.shape, Y.shape)
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=Y))
    # What is the difference between softmax and sigmoid in the final classification layer?
    # optimizer: to speed up training, the learning rate should start large and then decay
    optimizer = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)

    predict = tf.reshape(output, [-1, MAX_CAPTCHA, CHAR_SET_LEN])
    max_idx_p = tf.argmax(predict, 2)
    max_idx_l = tf.argmax(tf.reshape(Y, [-1, MAX_CAPTCHA, CHAR_SET_LEN]), 2)
    correct_pred = tf.equal(max_idx_p, max_idx_l)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Resume from the latest checkpoint if one exists, otherwise start fresh.
        checkpoint = tf.train.latest_checkpoint("e://model5/")
        if checkpoint:
            saver.restore(sess, checkpoint)
        else:
            sess.run(tf.global_variables_initializer())
        step = 0
        p = 1   # accuracy threshold for saving the model
        xh = 0  # index of the current font
        tc = 0  # number of completed passes over the character set
        change_ttf(ttf_names[xh % len(ttf_names)])
        while True:
            # One pass over all characters in batches of 200.
            random.shuffle(gbc)
            kks = []
            for k in gbc:
                kks.append(k)
                if len(kks) > 0 and len(kks) % 200 == 0:
                    batch_x, batch_y = get_next_batch(kks)
                    _, loss_ = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: kb})
                    print(step, loss_)
                    kks = []
                    step += 1
            # Train on the remaining characters that did not fill a whole batch.
            if kks:
                batch_x, batch_y = get_next_batch(kks)
                _, loss_ = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: kb})
                print(step, loss_)
                kks = []
                step += 1
            # if xh%20==0:
            if True:
                # Switch to the next font and evaluate on up to 200 random characters.
                xh += 1
                change_ttf(ttf_names[xh % len(ttf_names)])

                random.shuffle(gbc)
                kks = gbc[:200]
                batch_x_test, batch_y_test = get_next_batch(kks)
                acc = sess.run(accuracy, feed_dict={X: batch_x_test, Y: batch_y_test, keep_prob: 1.})
                print("Test accuracy:", xh, acc)

                if acc >= p:
                    # xh+=1
                    # change_ttf(ttf_names[xh%len(ttf_names)])
                    p = acc
                    pp = int(str(acc)[2:])  # digits after the decimal point, used as the step tag
                    saver.save(sess, "e:/model5/good.model", global_step=pp)

            tc += 1
            if tc >= 20000:
                return True


train_crack_captcha_cnn()

 

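The font-based anti-scraping I ran into covered the full character set, which is much harder than Maoyan's scheme that only obfuscates digits. If the site you are dealing with only obfuscates digits, train on digits alone; the accuracy should then reach 100%.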
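Restricting training to digits comes down to filtering the code points before the label table is built, in the same way the training script filters `gbc` and sorts it into `sgbc`. The following is a framework-free sketch of that idea together with the matching decode step; the helper names and the sample code-point list are hypothetical, and plain lists stand in for the NumPy arrays and the model's output scores.

```python
def build_label_table(code_points, low=0x30, high=0x39):
    """Sorted code points in [low, high]; the defaults keep only '0'..'9'."""
    return sorted(cp for cp in code_points if low <= cp <= high)

def one_hot(code_point, table):
    """Label vector with a single 1.0 at the code point's position."""
    vector = [0.0] * len(table)
    vector[table.index(code_point)] = 1.0
    return vector

def decode(scores, table):
    """Character of the highest-scoring class (argmax over the outputs)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return chr(table[best])

# Hypothetical cmap keys of a training font: a period, digits, capital letters.
training_code_points = [0x2E] + list(range(0x30, 0x3A)) + list(range(0x41, 0x5B))
table = build_label_table(training_code_points)

# A perfect prediction for '7' is its own one-hot vector.
print(len(table), decode(one_hot(ord("7"), table), table))
```

With only ten classes the output layer shrinks from thousands of units to ten, so training is much faster and the model has far fewer look-alike glyphs to confuse.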

 

 

 
