LeetCode-187.Repeated DNA Sequences

2016-06-19 22:35
https://leetcode.com/problems/repeated-dna-sequences/

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].


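Encoding approach. Reference: https://segmentfault.com/a/1190000003922976
Each substring is only 10 characters long and each character is one of 4 letters, so every distinct sequence can be represented by one of 4^10 numbers. Since 4^10 = 2^20 < 2^32, each code fits in a single int.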

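#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// Encode the 10 characters starting at s[i] as a base-4 number.
// The string is passed by const reference to avoid copying it on every call.
int encode(const string& s, int i)
{
    // A-0 C-1 G-2 T-3
    int code = 0;
    for (int j = i; j < i + 10; j++)
    {
        code *= 4;
        if (s[j] == 'C')
            code += 1;
        else if (s[j] == 'G')
            code += 2;
        else if (s[j] == 'T')
            code += 3;
        // 'A' adds 0
    }
    return code;
}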

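vector<string> findRepeatedDnaSequences(string s)
{
    // Cast before subtracting so that strings shorter than 10 give a
    // negative n instead of relying on unsigned wraparound.
    int n = static_cast<int>(s.length()) - 10, code;
    vector<string> res;
    unordered_map<int, int> counts; // code -> number of occurrences so far
    for (int i = 0; i <= n; i++)
    {
        code = encode(s, i);
        counts[code]++;
        // Report a sequence exactly once, on its second occurrence.
        if (counts[code] == 2)
            res.push_back(s.substr(i, 10));
    }
    return res;
}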
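
To make the base-4 encoding concrete, here is a minimal sketch that encodes one window and decodes it back. The names encodeWindow and decodeWindow are hypothetical helpers introduced only for this illustration (they are not part of the solution above), and the mapping A=0, C=1, G=2, T=3 is the one used in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: packs s[start..start+9] into a base-4 integer,
// using the mapping A=0, C=1, G=2, T=3.
int encodeWindow(const std::string& s, std::size_t start) {
    assert(start + 10 <= s.size());
    int code = 0;
    for (std::size_t j = start; j < start + 10; ++j) {
        code *= 4;
        switch (s[j]) {
            case 'C': code += 1; break;
            case 'G': code += 2; break;
            case 'T': code += 3; break;
            default: break; // 'A' contributes 0
        }
    }
    return code;
}

// Hypothetical inverse: recovers the 10-letter sequence from its code.
std::string decodeWindow(int code) {
    assert(code >= 0 && code < (1 << 20)); // 4^10 == 2^20 possible codes
    static const char letters[] = "ACGT";
    std::string out(10, 'A');
    for (int k = 9; k >= 0; --k) {
        out[k] = letters[code % 4]; // lowest base-4 digit is the last letter
        code /= 4;
    }
    return out;
}
```

For instance, "AAAAACCCCC" encodes to 4^4 + 4^3 + 4^2 + 4 + 1 = 341, and decoding 341 gives the same string back, which is why the code can stand in for the substring as a hash-map key.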


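In fact the encode function is not needed. Ten characters of ACGT at two bits each take 20 bits, so the next code can be derived from the current one: keep the low 18 bits of the current code, shift them left by two, and add the two bits for the new character.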

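vector<string> findRepeatedDnaSequences(string s)
{
    int n = static_cast<int>(s.length());
    vector<string> res;
    // Fewer than 10 characters: no window exists, and the warm-up loop
    // below would otherwise read past the end of the string.
    if (n < 10)
        return res;
    unordered_map<int, int> counts; // code -> number of occurrences so far
    unordered_map<char, int> dic;   // nucleotide -> 2-bit value
    dic['A'] = 0;
    dic['C'] = 1;
    dic['G'] = 2;
    dic['T'] = 3;
    int i = 0, code = 0;
    // Warm up with the first 9 characters (18 bits).
    while (i < 9)
        code = (code << 2) + dic[s[i++]];
    while (i < n)
    {
        // Keep the low 18 bits (drops the oldest character), then append the new one.
        code = ((code & 0x3ffff) << 2) + dic[s[i++]];
        counts[code]++;
        // i already points past the window, so the window starts at i - 10.
        if (counts[code] == 2)
            res.push_back(s.substr(i - 10, 10));
    }
    return res;
}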
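
As a sanity check on the rolling update, the following sketch compares the rolled code against a code computed from scratch for every window. All helper names here (nucleotideBits, rollCode, directCode, rollingMatchesDirect) are hypothetical and exist only for this illustration. The sketch assumes the input contains only A, C, G and T; any other character is treated as 'A'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: 2-bit value for one nucleotide (A=0, C=1, G=2, T=3).
// Characters other than C, G, T are treated as 'A' in this sketch.
int nucleotideBits(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;
    }
}

// One rolling step: keep the low 18 bits (the 9 newest characters),
// shift left by two, and append the new character's bits.
int rollCode(int code, char next) {
    return ((code & 0x3ffff) << 2) | nucleotideBits(next);
}

// Code for the window s[start..start+9], computed from scratch.
int directCode(const std::string& s, std::size_t start) {
    assert(start + 10 <= s.size());
    int code = 0;
    for (std::size_t j = start; j < start + 10; ++j)
        code = (code << 2) | nucleotideBits(s[j]);
    return code;
}

// Returns true if the rolled code equals the direct code at every window.
bool rollingMatchesDirect(const std::string& s) {
    if (s.size() < 10) return true; // no windows to compare
    int code = 0;
    for (std::size_t i = 0; i < 9; ++i)
        code = (code << 2) | nucleotideBits(s[i]);
    for (std::size_t i = 9; i < s.size(); ++i) {
        code = rollCode(code, s[i]);
        if (code != directCode(s, i - 9)) return false;
    }
    return true;
}
```

Calling rollingMatchesDirect on the example string from the problem statement should return true. The mask 0x3ffff is (1 << 18) - 1, which is what discards the two bits of the character leaving the window.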