
[Genomic Data Processing with Perl] 1. Basic DNA File Reading and Writing, and Base Composition Statistics

2017-10-05 18:55
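I am still a complete beginner in bioinformatics. I learned the basic syntax of Perl a couple of days ago and then worked through the book "Beginning Perl for Bioinformatics", which gave me a basic framework for how Perl programming is applied to elementary bioinformatics data processing.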

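Below is a short piece of code I wrote while studying; it computes basic statistics for a DNA sequence. Later I may bundle the handy little tools I come across while learning into a .pm module for future use: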

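use warnings;
use strict;

# Shared state: the file name, its raw lines, and the cleaned sequence
my $filename;
my @DNA;
my $DNA;

# Read the DNA file whose name is given on STDIN
sub clean_data {
    $filename = <STDIN>;
    chomp $filename;
    # Three-argument open with a lexical filehandle; report the OS error on failure
    open(my $dna_fh, '<', $filename)
        or die "Can't read the file '$filename' ($!) and I'm exiting!\n";
    @DNA = <$dna_fh>;
    close $dna_fh;
    print "Have read the file successfully!\n";

    # Join the lines into one string and strip all whitespace
    $DNA = join('', @DNA);
    $DNA =~ s/\s//g;
    return $DNA;
}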
# Count each base in $DNA and report its percentage of the sequence
sub base_counter {
    my @bases = split('', $DNA);
    my $count_of_A = 0;
    my $count_of_T = 0;
    my $count_of_C = 0;
    my $count_of_G = 0;
    my $count_of_others = 0;
    my $total_count = 0;
    foreach my $base (@bases) {
        $total_count++;
        if ($base eq "A") {
            $count_of_A++;
        } elsif ($base eq "T") {
            $count_of_T++;
        } elsif ($base eq "C") {
            $count_of_C++;
        } elsif ($base eq "G") {
            $count_of_G++;
        } else {
            print "This is a wrong base!: $base\n";
            $count_of_others++;
        }
    }
    # Guard against dividing by zero on an empty sequence
    die "The sequence is empty!\n" if $total_count == 0;
    my $percent_of_A = $count_of_A / $total_count * 100;
    my $percent_of_T = $count_of_T / $total_count * 100;
    my $percent_of_C = $count_of_C / $total_count * 100;
    my $percent_of_G = $count_of_G / $total_count * 100;
    my $percent_of_others = $count_of_others / $total_count * 100;
    print "The count of A is: $count_of_A\n";
    print "The percent of A is: $percent_of_A\%\n";
    print "The count of T is: $count_of_T\n";
    print "The percent of T is: $percent_of_T\%\n";
    print "The count of C is: $count_of_C\n";
    print "The percent of C is: $percent_of_C\%\n";
    print "The count of G is: $count_of_G\n";
    print "The percent of G is: $percent_of_G\%\n";
    print "The count of wrong base is: $count_of_others\n";
    print "The percent of wrong base is: $percent_of_others\%\n";
    # Return order: total, A, C, T, G, others
    return ($total_count, $count_of_A, $count_of_C, $count_of_T, $count_of_G, $count_of_others);
}

# Analysis of the combined C and G percentage
sub CG_analyze {
    # base_counter() works on the shared $DNA and returns
    # (total, A, C, T, G, others), so C is index 2 and G is index 4
    my @basic_data = base_counter();
    my $count_of_CG = $basic_data[2] + $basic_data[4];
    my $percent_of_CG = $count_of_CG / $basic_data[0] * 100;
    print "The total number of base 'C' and 'G' is: $count_of_CG\n";
    print "The percent of 'C' and 'G' is: $percent_of_CG\%\n";
}
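
For comparison, the per-base counting can also be written as a function that takes the sequence as an argument and keeps the tallies in a hash, which avoids both the shared variables and the chain of elsif branches. This is only a minimal sketch: the name count_bases is hypothetical, and folding lower-case letters into upper case is an assumption that the listing above does not make.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original listing): tally the bases
# of a sequence passed in as an argument.
sub count_bases {
    my ($seq) = @_;
    my %count = (A => 0, C => 0, G => 0, T => 0, others => 0);
    # Assumption: lower-case letters are treated as the same base
    foreach my $base (split //, uc $seq) {
        my $key = $base =~ /^[ACGT]$/ ? $base : 'others';
        $count{$key}++;
    }
    $count{total} = length $seq;
    return %count;
}

# Usage: print each count with its percentage of the whole sequence
my %composition = count_bases('AACGTN');
foreach my $key (qw(A C G T others)) {
    printf "%-6s %d (%.1f%%)\n",
        $key, $composition{$key}, 100 * $composition{$key} / $composition{total};
}
```

Returning a hash lets the caller look counts up by name instead of remembering the position of each value in a returned list.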

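That is all of the code. It counts the bases in a DNA sequence, computes the percentage of each base, and analyzes the combined C and G content; these are some of the most basic routines needed in genome research.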
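
If only the combined C and G content is needed, Perl's tr/// operator can count the matching characters directly, without splitting the sequence into an array. The following is a rough sketch under stated assumptions: the function name gc_percent is hypothetical, lower-case g and c are counted as well, and an empty sequence is reported as 0 rather than treated as an error.

```perl
use strict;
use warnings;

# Hypothetical helper: combined G and C content as a percentage.
sub gc_percent {
    my ($seq) = @_;
    my $length = length $seq;
    return 0 if $length == 0;        # assumption: empty input yields 0
    # tr/// with an empty replacement list leaves the string unchanged
    # and returns the number of matching characters
    my $gc = ($seq =~ tr/GCgc//);
    return 100 * $gc / $length;
}

printf "GC content: %.1f%%\n", gc_percent('ATGC');   # prints "GC content: 50.0%"
```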