
[Example] Simple sentence splitting in Python: replacing the full stop, and adding a number to the end (not the start) of each sentence

2018-03-04 21:16
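>>> # encoding='utf-8' is an assumption; use the file's real encoding (e.g. 'gbk')
>>> with open('E:/西方哲学史.txt', encoding='utf-8') as f:
...     fn = f.read()
...
>>> fn = fn.replace('。', '。\n')  # text mode writes the platform line ending for '\n'
>>> with open('E:/西方哲学史分句.txt', 'w', encoding='utf-8') as s:
...     n = s.write(fn)  # the with block closes the file, so the text is flushed
...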



Now I want to add a number to every sentence = =. How?
>>> import re
>>> def createid(matchobject,no=[0]):
...     no[0]+=1
...     return "[%d]"%no[0]
...
>>> text = "★A child is a human being who is not yet an adult.★A child is a human being who is not yet an adult.★A child is a human being who is not yet an adult."
>>> text=re.sub("★",createid,text)
>>> print(text)
[1]A child is a human being who is not yet an adult.[2]A child is a human being who is not yet an adult.[3]A child is a human being who is not yet an adult.

>>>
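One thing to watch in `createid`: the `no=[0]` default argument keeps its value between calls, so a second `re.sub` in the same session would carry on from [4]. Here is a minimal sketch of one way around that. The helper name `number_markers` is my own and not from the referenced answer; it builds a fresh counter on every call.

```python
import itertools
import re


def number_markers(text, marker="★"):
    """Replace each marker with [1], [2], ... restarting from 1 on every call."""
    counter = itertools.count(1)  # fresh counter, so numbering never leaks between calls
    # re.escape keeps the marker literal even if it is a regex metacharacter
    return re.sub(re.escape(marker), lambda m: "[%d]" % next(counter), text)


sample = "★First sentence.★Second sentence.★Third sentence."
print(number_markers(sample))
print(number_markers("★Again."))  # numbering starts over here
```

Passing a function as the replacement is the same technique as above; only the place where the counter lives has changed.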
Reference: https://zhidao.baidu.com/question/1993159681293693067.html  | Baidu Zhidao
------- The problem: this example already has markers (★), but my text has none -------
>>> pattern = re.compile(u'wechat', re.I)
>>> pattern.search(u'wechat online')
<_sre.SRE_Match object; span=(0, 6), match='wechat'>

>>>
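---------- Then I found a regular expression for matching the start of a sentence, but I do not understand it yet ----------
I also found https://zhidao.baidu.com/question/2012704092059701388.html, but it only finds the first letter of alphabetic text = =
--------- The question is how to match the start of a sentence ---------
But all I found was a way to get the first word of each sentence = =. Reference: https://zhidao.baidu.com/question/814035707149647692.html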
>>> import re
>>> content = "a string which defines the name for this spider. the spider name is how the spider is located (and instantiated) by scrapy, so it must be unique. however, nothing prevents you from instantiating more than one instance of the same spider. this is the most important spider attribute and it’s required."

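>>> # Split on ASCII sentence enders; the raw string avoids invalid-escape warnings
>>> for line in re.split(r'[.?!]', content):
...     if line.strip():                       # skip empty or whitespace-only pieces
...         print(line.strip().capitalize())   # sentence with its first letter upper-cased
...         print(line.strip().split()[0])     # first word of the sentence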
...
A string which defines the name for this spider
a
The spider name is how the spider is located (and instantiated) by scrapy, so it must be unique
the
However, nothing prevents you from instantiating more than one instance of the same spider
however,
This is the most important spider attribute and it’s required
this
>>>

----------- Let me try it on Chinese -------------
Result: = = the experiment shows it cannot be used on Chinese text as-is



----------- Continuing ---- it is probably the Chinese full stop = = ----------
One more try. Result = = embarrassingly, still no good!!!
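One plausible cause, assuming the attempt reused the English snippet more or less unchanged: that pattern only knows the ASCII `.`, `?` and `!`, while Chinese text ends sentences with the full-width `。`, `?` and `!`. Chinese also has no spaces between words, so `split()[0]` cannot pick out a "first word". A minimal sketch with the full-width marks (the name `split_sentences` and the sample text are made up for illustration):

```python
import re

# Full-width sentence enders used in Chinese text
SENTENCE_END = r"[。?!]"


def split_sentences(text):
    """Return the non-empty sentences of text, without their end punctuation."""
    parts = re.split(SENTENCE_END, text)
    return [p.strip() for p in parts if p.strip()]


sample = "哲学是什么?这是一个古老的问题。我们从希腊开始!"
for sentence in split_sentences(sample):
    # The first character stands in for the "first word" of the English version
    print(sentence[0], sentence)
```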



Let me give up on the sentence start and put the number at the end instead = =.
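Putting the number at the end is the easier case, because the end punctuation is something a regular expression can match directly. A rough sketch, assuming sentences end with `。`, `?` or `!` (the function name `number_sentence_ends` is hypothetical):

```python
import itertools
import re


def number_sentence_ends(text):
    """Append [1], [2], ... right after every sentence-ending mark."""
    counter = itertools.count(1)
    # m.group(0) is the matched punctuation; keep it and add the number after it
    return re.sub(r"[。?!]", lambda m: "%s[%d]" % (m.group(0), next(counter)), text)


print(number_sentence_ends("哲学是什么?这是一个古老的问题。"))
```

Note that a doubled mark such as `!!` would get two numbers with this pattern; changing it to `[。?!]+` would treat the run as one sentence end.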
--------------------------------------------------
New reference | https://zhidao.baidu.com/question/1111709810604369899.html
Result = =:



I also consulted: http://bbs.csdn.net/topics/390424651



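The problem now becomes: how to locate the start of a sentence; once it is located, it can be marked; and then how to add the number. So the problem splits neatly into three parts.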
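Those three parts can be sketched roughly as follows. This is only one possible approach, not something taken from the references: instead of hunting for the sentence start directly, match whole sentences, and the start of every match is then a sentence start. The pattern and the name `number_sentence_starts` are my own assumptions.

```python
import re

# Part 1: a sentence is a run of non-enders followed by any enders,
# so every match begins exactly where a sentence begins.
# Assumes the text does not begin with an end mark (it would be skipped).
SENTENCE = re.compile(r"[^。?!]+[。?!]*")


def number_sentence_starts(text):
    """Prefix every sentence with [n]."""
    numbered = []
    for n, match in enumerate(SENTENCE.finditer(text), start=1):
        # Part 2: match.start() is the located sentence start.
        # Part 3: the number goes in front of the text found there.
        numbered.append("[%d]%s" % (n, match.group(0)))
    return "".join(numbered)


print(number_sentence_starts("哲学是什么?这是一个古老的问题。"))
```

A closing quotation mark that follows the end mark would be counted as the start of the next sentence here, so real book text would need a more careful pattern.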
Continuing ----------------------------------------------------
Simply handling the encoding is enough to make it work
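What "handling the encoding" might look like, as a sketch under assumptions: the book file is taken to be either UTF-8 (with or without a BOM) or GBK, which are common for Chinese text files but are guesses here. `read_text` is a hypothetical helper.

```python
def read_text(path, encodings=("utf-8-sig", "gbk")):
    """Return the file's text, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError("none of %r could decode %s" % (encodings, path))


# Usage (path from this post):
# text = read_text('E:/西方哲学史.txt')
```

UTF-8 is tried first because GBK will happily decode many UTF-8 files into garbage without raising an error, whereas GBK bytes rarely form valid UTF-8.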



The problem is that everything is stuck together = =, which is not great. So I chose to separate the sentences with line breaks:
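A minimal sketch of that step, assuming the same three full-width end marks as sentence boundaries (the name `one_sentence_per_line` is made up for illustration): each sentence goes on its own line, with its number at the end.

```python
import re


def one_sentence_per_line(text):
    """Put each sentence on its own line, followed by its number."""
    pieces = re.findall(r"[^。?!]+[。?!]*", text)
    # Drop whitespace-only pieces, e.g. blank lines already present in the source
    sentences = [p.strip() for p in pieces if p.strip()]
    lines = ["%s[%d]" % (s, n) for n, s in enumerate(sentences, start=1)]
    return "\n".join(lines)


print(one_sentence_per_line("哲学是什么?这是一个古老的问题。"))
```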



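The encoding problem is solved, but I still cannot locate the beginning, that is, the start of each sentence. That is still an open problem. Still, one problem has been more or less solved.
The next thing to think about is how to classify each sentence.
I'll stop this post here for now = = 2018-03-04 21:16:07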