
Finding Protein Motifs with the EM (Expectation-Maximization) Algorithm

2020-03-06 19:30

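This was a course assignment: given 2000 protein sequences, each 50 residues long, use the EM algorithm to find the motif they share (the motif length is assumed to be 7).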

Expectation Maximization (EM) Algorithm
EM is a two-stage iterative process. An initial guess is made as to the location and size of a sequence pattern (a motif or domain) in each sequence in a set of related sequences. These regions are aligned to create a “trial” alignment for the set of sequences. Using the trial alignment, the residue composition of each column in the alignment is first calculated and used to create a PSSM.
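As a minimal sketch of how a trial alignment becomes a PSSM, the snippet below counts residues column by column. The function name `build_pssm`, the pseudocount of 1.0, and the choice to store plain probabilities rather than log-odds scores are assumptions made for illustration; they are not taken from the assignment.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(segments, pseudocount=1.0):
    """Column-wise residue probabilities of equally long aligned segments.

    Returns one dict per alignment column, mapping residue -> probability.
    The pseudocount is an assumption; it keeps unseen residues above zero.
    """
    width = len(segments[0])
    pssm = []
    for col in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seg in segments:
            counts[seg[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: c / total for aa, c in counts.items()})
    return pssm


# Hypothetical trial alignment of three length-7 segments.
trial = ["ACDEFGH", "ACDEFGH", "ACDKFGH"]
pssm = build_pssm(trial)
```

Here the matrix is stored as 7 columns of 20 entries each, which is the transpose of a 20 × 7 layout.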
Step 1. Expectation
Using the values in the PSSM, the probability of finding the pattern at every possible position in each sequence is calculated.
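The expectation step might look like the sketch below for a single sequence. It assumes exactly one motif occurrence per sequence and a uniform background, so the probability of each start position is proportional to the product of the PSSM column probabilities. `position_probabilities`, `peaked_column`, and the toy width-3 matrix are hypothetical names and data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_probabilities(seq, pssm):
    """Probability that the motif starts at each possible position of seq.

    Assumes one occurrence per sequence and a uniform background, so the
    likelihoods of the candidate windows are simply normalized to sum to 1.
    """
    width = len(pssm)
    likelihoods = []
    for start in range(len(seq) - width + 1):
        value = 1.0
        for offset, column in enumerate(pssm):
            value *= column[seq[start + offset]]
        likelihoods.append(value)
    total = sum(likelihoods)
    return [v / total for v in likelihoods]


def peaked_column(favoured, p=0.81):
    """Toy PSSM column that puts probability p on one residue (illustrative)."""
    rest = (1.0 - p) / (len(AMINO_ACIDS) - 1)
    return {aa: (p if aa == favoured else rest) for aa in AMINO_ACIDS}


toy_pssm = [peaked_column(aa) for aa in "WCH"]
probs = position_probabilities("AAWCHAA", toy_pssm)
best = max(range(len(probs)), key=probs.__getitem__)
```

For a motif of length 7 in a sequence of length 50, the same function would return 44 probabilities, one per start index from 0 to 43.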
Step 2. Maximization
The probabilities from step 1 are used to weight the values in the PSSM, essentially providing new information about the likely location of the pattern in each sequence. The values in the PSSM are updated using these weights.
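One way to read the maximization step is as a weighted recount: every candidate window contributes to the column counts in proportion to its probability from step 1. The sketch below assumes that reading; `update_pssm`, the pseudocount, and the example weights are illustrative and not part of the original assignment.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def update_pssm(sequences, weights, width, pseudocount=1.0):
    """Re-estimate the PSSM from position probabilities.

    weights[i][s] is the probability that the motif in sequences[i]
    starts at index s (the output of the expectation step).
    """
    counts = [{aa: pseudocount for aa in AMINO_ACIDS} for _ in range(width)]
    for seq, seq_weights in zip(sequences, weights):
        for start, w in enumerate(seq_weights):
            for offset in range(width):
                counts[offset][seq[start + offset]] += w
    pssm = []
    for column in counts:
        total = sum(column.values())
        pssm.append({aa: c / total for aa, c in column.items()})
    return pssm


# Hypothetical data: two short sequences, motif width 2, three starts each.
sequences = ["AWCA", "WCAA"]
weights = [[0.1, 0.9, 0.0], [1.0, 0.0, 0.0]]
new_pssm = update_pssm(sequences, weights, width=2)
```

In this toy input most of the weight sits on the window "WC", so W dominates the first column and C the second after the update.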
Steps 1 and 2 are repeated until the values in the PSSM don’t change with continued iterations.

The concrete steps are as follows:

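1. For each protein, randomly pick one segment of length 7 as its candidate motif. The 2000 proteins together give a 2000 × 7 matrix of residues.
2. Convert the 2000 × 7 residue matrix into a 20 × 7 matrix (one row for each of the 20 amino acid types) by computing the residue composition of each column, as in the PSSM construction described above.
3. Use the 20 × 7 weight matrix to score every possible motif position in every protein sequence (here each sequence has length 50, so the motif can start at any index from 0 to 43). For each sequence, take the highest-scoring 7-residue segment; these segments form a new 2000 × 7 matrix.
4. Repeat steps 2 and 3 until convergence (as I understand it, that is when the 7-residue segments chosen from the proteins are highly similar to one another). The converged segments are the likely motif.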
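The four steps above can be sketched in Python as follows. This is not the author's code, which the post does not include. Because step 3 keeps only the single best segment per sequence, this is a "hard" assignment variant rather than the probability-weighted update in the English description. The pseudocount, the log-probability scoring, the stopping rule (start positions stop changing), and the synthetic data with a planted motif are all assumptions for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, pseudocount=1.0):
    """Step 2: column-wise residue probabilities (pseudocount is assumed)."""
    pssm = []
    for col in range(len(segments[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seg in segments:
            counts[seg[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: c / total for aa, c in counts.items()})
    return pssm


def best_start(seq, pssm):
    """Step 3: start index of the highest-scoring window in one sequence."""
    width = len(pssm)

    def log_score(start):
        return sum(math.log(col[seq[start + k]]) for k, col in enumerate(pssm))

    return max(range(len(seq) - width + 1), key=log_score)


def find_motif(sequences, width=7, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: one random start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(max_iter):
        segments = [s[i:i + width] for s, i in zip(sequences, starts)]
        pssm = build_pssm(segments)
        new_starts = [best_start(s, pssm) for s in sequences]
        if new_starts == starts:  # Step 4: nothing moved, treat as converged.
            break
        starts = new_starts
    return starts, [s[i:i + width] for s, i in zip(sequences, starts)]


# Synthetic demo data: 200 random sequences of length 50, each containing
# a hypothetical planted motif at a random position.
data_rng = random.Random(1)
planted = "HWKCMPY"
sequences = []
for _ in range(200):
    background = "".join(data_rng.choice(AMINO_ACIDS) for _ in range(50))
    pos = data_rng.randrange(44)
    sequences.append(background[:pos] + planted + background[pos + 7:])

starts, segments = find_motif(sequences)
```

A run like this can settle on a local optimum that depends on the random starting positions, so trying several values of `seed` and comparing the resulting segments is a reasonable precaution.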

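The main difficulty lies in understanding how the EM algorithm applies concretely to this problem, which is why I wrote this note. The implementation itself is not hard; I will add the code later when I have time.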

Author: amazingbc