[Example] Simple sentence splitting in Python: replacing the full stop and adding a number to the end (not the start) of each sentence
2018-03-04 21:16
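>>> with open('E:/西方哲学史.txt') as f:  # pass encoding=... if the default codec cannot read the file
...     fn = f.read()
...
>>> fn = fn.replace('。', '。\t\n')  # text mode writes '\n' as '\r\n' on Windows, so no explicit '\r'
>>> with open('E:/西方哲学史分句.txt', 'w') as s:
...     _ = s.write(fn)  # the with block closes the file, so the text is flushed to disk
...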
Now I want to add a number to each sentence. How?
>>> import re
>>> def createid(matchobject, no=[0]):  # the mutable default list keeps the count between calls
...     no[0] += 1
...     return "[%d]" % no[0]
...
>>> text = "★A child is a human being who is not yet an adult.★A child is a human being who is not yet an adult.★A child is a human being who is not yet an adult."
>>> text=re.sub("★",createid,text)
>>> print(text)
[1]A child is a human being who is not yet an adult.[2]A child is a human being who is not yet an adult.[3]A child is a human being who is not yet an adult.
>>>
Reference: https://zhidao.baidu.com/question/1993159681293693067.html | Baidu Zhidao
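The mutable default argument in `createid` works, but its count keeps growing across separate `re.sub` calls in the same session. A minimal sketch of one alternative, assuming a fresh counter per call is wanted; `make_numberer` is a hypothetical helper name, not taken from the referenced answer:

```python
import itertools
import re


def make_numberer(start=1):
    """Return a re.sub callback that yields [1], [2], ... on successive matches."""
    counter = itertools.count(start)

    def number(match):
        # The matched marker itself is discarded and replaced by the next number.
        return "[%d]" % next(counter)

    return number


text = "★First sentence.★Second sentence.★Third sentence."
numbered = re.sub("★", make_numberer(), text)
print(numbered)  # [1]First sentence.[2]Second sentence.[3]Third sentence.
```

Each call to `make_numberer()` starts again from 1, so a second text numbered in the same session does not continue from the previous count.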
------- The catch: the example text here already has markers, but my passage has none -------
>>> pattern = re.compile(u'wechat', re.I)
>>> pattern.search(u'wechat online')
<_sre.SRE_Match object; span=(0, 6), match='wechat'>
>>>
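---------- Then I found a regular expression for matching the start of a sentence, but I do not understand it yet ----------
I also found https://zhidao.baidu.com/question/2012704092059701388.html, but it only finds the first letter of a word.
--------- The question is how to match the start of a sentence ---------
All I could find was a way to get the first word; reference: https://zhidao.baidu.com/question/814035707149647692.html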
>>> import re
>>> content = "a string which defines the name for this spider. the spider name is how the spider is located (and instantiated) by scrapy, so it must be unique. however, nothing prevents you from instantiating more than one instance of the same spider. this is the most important spider attribute and it’s required."
>>> for line in re.split(r'[.?!]', content):
...     if line != "":
...         print(line.strip().capitalize())
...         print(line.strip().split()[0])
...
A string which defines the name for this spider
a
The spider name is how the spider is located (and instantiated) by scrapy, so it must be unique
the
However, nothing prevents you from instantiating more than one instance of the same spider
however,
This is the most important spider attribute and it’s required
this
>>>
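----------- Let me try it on Chinese -------------
Result: the experiment shows this cannot be applied to Chinese text as is.
----------- Continuing ---- it is probably the Chinese full stop (。) that causes the problem ----------
Trying again. Result: embarrassingly, still no luck!!!
Forget the sentence start; I will add the number at the end instead.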
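The split pattern used earlier only lists the ASCII characters `.`, `?` and `!`, so it never matches the full-width `。`, `?` and `!` that Chinese text uses. A minimal sketch of numbering sentence ends under that assumption; `number_sentence_ends` and the sample text are hypothetical, for illustration only:

```python
import itertools
import re


def number_sentence_ends(text, terminators="。?!"):
    """Append [1], [2], ... right after each sentence-ending punctuation mark."""
    counter = itertools.count(1)
    # Build a character class from the terminators, e.g. [。?!]
    pattern = "[" + re.escape(terminators) + "]"

    def add_number(match):
        # Keep the punctuation mark and put the running number after it.
        return "%s[%d]" % (match.group(0), next(counter))

    return re.sub(pattern, add_number, text)


sample = "我思故我在。人是万物的尺度。认识你自己!"
print(number_sentence_ends(sample))
# 我思故我在。[1]人是万物的尺度。[2]认识你自己![3]
```

No marker such as ★ is needed here, because the punctuation mark itself serves as the anchor for the number.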
--------------------------------------------------
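New reference | https://zhidao.baidu.com/question/1111709810604369899.html
Result:
Also consulted: http://bbs.csdn.net/topics/390424651
The problem now becomes: how to locate the start of a sentence; once it is located, it can be marked; and then comes the question of how to add the number. That splits the problem into three parts.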
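The three parts (locate, mark, number) can be sketched without ever searching for the sentence start directly: if each sentence is extracted as "everything up to and including the next terminator", its first character is the start. This is only one possible approach, and `number_sentence_starts` and the sample text are hypothetical names and data:

```python
import re

# One sentence = a run of non-terminator characters plus an optional terminator.
SENTENCE = re.compile(r"[^。?!]+[。?!]?")


def number_sentence_starts(text):
    """Return the sentences of text, each prefixed with [1], [2], ..."""
    # Step 1: locate the sentences (and therefore their starts).
    sentences = [s.strip() for s in SENTENCE.findall(text)]
    sentences = [s for s in sentences if s]  # drop whitespace-only pieces
    # Steps 2 and 3: mark each start with its running number.
    return ["[%d]%s" % (i, s) for i, s in enumerate(sentences, 1)]


sample = "我思故我在。人是万物的尺度?认识你自己!"
print("\n".join(number_sentence_starts(sample)))
```

`findall` is used instead of splitting on a zero-width lookbehind because `re.split` only handles empty matches from Python 3.7 onward.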
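Continuing ----------------------------------------------------
With only some simple encoding handling, it works.
The trouble is that all the sentences are stuck together, which is not great. I chose to separate the sentences with a carriage return and line feed:
The encoding problem is solved, but I still cannot locate the beginning, that is, the start of a sentence. That remains an open problem for now. Still, one problem is more or less solved.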
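For the encoding and line-separation steps, a minimal sketch that reads a file with an explicit encoding and writes one sentence per line. The function name `split_file_into_lines`, the `utf-8` default and the temporary demo files are assumptions; a file saved by a Chinese-locale Windows editor may need `encoding="gbk"` instead:

```python
import os
import re
import tempfile


def split_file_into_lines(src_path, dst_path, encoding="utf-8"):
    """Write each sentence of src_path (ended by 。) on its own line in dst_path."""
    with open(src_path, encoding=encoding) as src:
        text = src.read()
    sentences = [s.strip() for s in re.findall(r"[^。]+。?", text)]
    sentences = [s for s in sentences if s]
    with open(dst_path, "w", encoding=encoding) as dst:
        for sentence in sentences:
            # Text mode converts '\n' to the platform line ending on write.
            dst.write(sentence + "\n")
    return len(sentences)


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("我思故我在。人是万物的尺度。")
    count = split_file_into_lines(src, dst)
    with open(dst, encoding="utf-8") as f:
        print(count, repr(f.read()))
```

With the paths from this post the call would be `split_file_into_lines('E:/西方哲学史.txt', 'E:/西方哲学史分句.txt')`, with the encoding adjusted to match the source file.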
The next thing to consider is how to classify each sentence.
I will stop this post here for now. 2018-03-04 21:16:07