[Genomic Data Processing with Perl] 1. Basic DNA file reading and writing, and base composition statistics
2017-10-05 18:55
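I am still a complete beginner in bioinformatics. I learned the basic syntax of Perl a couple of days ago and then worked through the book "Beginning Perl for Bioinformatics", which gave me a basic framework for, and a working understanding of, elementary bioinformatics data processing and Perl programming.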
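Below is a small piece of code I wrote while studying, which computes basic statistics on a DNA sequence. Later I may bundle the handy little tools I come across while learning into a .pm module for easier reuse: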
use strict;
use warnings;

# Read the DNA file whose name is typed on STDIN and return the
# sequence as one string with all whitespace removed.
sub clean_data {
    my $filename = <STDIN>;
    die "No file name was given!\n" unless defined $filename;
    chomp $filename;
    open(my $dna_fh, '<', $filename)
        or die "Can't read the file '$filename' and I'm exiting! ($!)\n";
    my @lines = <$dna_fh>;
    close $dna_fh;
    print "Have read the file successfully!\n";
    # modify the base sequence: join the lines, strip whitespace and newlines
    my $dna = join('', @lines);
    $dna =~ s/\s//g;
    return $dna;
}

# the base number counter and percentage analysis
# Returns (total, A, C, T, G, others). Matching is case-sensitive,
# so lowercase letters are reported as wrong bases.
sub base_counter {
    my ($dna) = @_;
    my @bases = split('', $dna);
    my $count_of_A = 0;
    my $count_of_T = 0;
    my $count_of_C = 0;
    my $count_of_G = 0;
    my $count_of_others = 0;
    my $total_count = 0;
    foreach my $base (@bases) {
        $total_count++;
        if ($base eq "A") {
            $count_of_A++;
        } elsif ($base eq "T") {
            $count_of_T++;
        } elsif ($base eq "C") {
            $count_of_C++;
        } elsif ($base eq "G") {
            $count_of_G++;
        } else {
            print "This is a wrong base!: $base\n";
            $count_of_others++;
        }
    }
    # Avoid dividing by zero when the file holds no sequence at all
    die "The sequence is empty!\n" if $total_count == 0;
    my $percent_of_A = $count_of_A / $total_count * 100;
    my $percent_of_T = $count_of_T / $total_count * 100;
    my $percent_of_C = $count_of_C / $total_count * 100;
    my $percent_of_G = $count_of_G / $total_count * 100;
    my $percent_of_others = $count_of_others / $total_count * 100;
    print "The count of A is: $count_of_A\n";
    print "The percent of A is: $percent_of_A\%\n";
    print "The count of T is: $count_of_T\n";
    print "The percent of T is: $percent_of_T\%\n";
    print "The count of C is: $count_of_C\n";
    print "The percent of C is: $percent_of_C\%\n";
    print "The count of G is: $count_of_G\n";
    print "The percent of G is: $percent_of_G\%\n";
    print "The count of wrong base is: $count_of_others\n";
    print "The percent of wrong base is: $percent_of_others\%\n";
    return ($total_count, $count_of_A, $count_of_C, $count_of_T, $count_of_G, $count_of_others);
}

# analysis of CG base percentage
sub CG_analyze {
    my ($dna) = @_;
    my @basic_data = base_counter($dna);
    # Index 2 is the count of C and index 4 is the count of G
    my $count_of_CG = $basic_data[2] + $basic_data[4];
    my $percent_of_CG = $count_of_CG / $basic_data[0] * 100;
    print "The total number of base 'C' and 'G' is: $count_of_CG\n";
    print "The percent of 'C' and 'G' is: $percent_of_CG\%\n";
}

# main: read the DNA file, then report base counts and CG content
my $dna = clean_data();
CG_analyze($dna);
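That is all of the code. It counts the bases in a DNA sequence, computes the percentage of each base, and analyzes the CG content. These are some of the most basic routines needed for genome research.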
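As a possible alternative to looping over every character, the same counts can be obtained with Perl's `tr///` operator, which returns the number of characters it matched. The sketch below is only an illustration and not part of the original code: the helper names `count_bases_tr` and `cg_percent` and the example sequence are made up, and, like the code above, it assumes uppercase input and treats anything other than A, C, G, T as "others".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count bases with tr///.
# With an empty replacement list and no /d flag, tr/// leaves the string
# unchanged and simply returns how many characters matched.
sub count_bases_tr {
    my ($dna) = @_;
    my %count;
    $count{A} = ($dna =~ tr/A//);
    $count{C} = ($dna =~ tr/C//);
    $count{G} = ($dna =~ tr/G//);
    $count{T} = ($dna =~ tr/T//);
    $count{total}  = length $dna;
    $count{others} = $count{total} - $count{A} - $count{C} - $count{G} - $count{T};
    return %count;
}

# Hypothetical helper: percentage of C and G among all characters.
# Returns 0 for an empty sequence to avoid dividing by zero.
sub cg_percent {
    my ($dna) = @_;
    my %count = count_bases_tr($dna);
    return 0 if $count{total} == 0;
    return ($count{C} + $count{G}) / $count{total} * 100;
}

my $example = 'ATGCGCNA';    # made-up sequence for illustration only
my %count = count_bases_tr($example);
printf "A=%d C=%d G=%d T=%d others=%d\n", @count{qw(A C G T others)};
printf "CG content: %.1f%%\n", cg_percent($example);
```

For the made-up sequence this prints `A=2 C=2 G=2 T=1 others=1` and `CG content: 50.0%`. Whether this is preferable to the explicit loop depends on the goal: the loop can report each unexpected character as it is found, while `tr///` is shorter and avoids splitting a long sequence into a per-character array.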