
Whole-Genome Sequencing Data Analysis, Study Notes 3: Preparing the Test Data and Reference Genomes

2017-02-25 17:46

test data

reference data:

hg19 <- UCSC

GRCh37 <- NCBI

Ensembl 75 <- ENSEMBL

download reference data:

Use nohup so that a command keeps running after you log out of the SSH session:

nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz &
# once the download has finished:
tar zxvf chromFa.tar.gz

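Running this prints the message: nohup: ignoring input and appending output to 'nohup.out'

This is informational rather than an error: nohup is only saying where the output goes. To choose the destination yourself, redirect the output explicitly:

nohup myprogram > myprogram.out 2>&1 &

Example (chromFa.out is an arbitrary log file name):

nohup sh -c 'wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz && tar zxvf chromFa.tar.gz' > chromFa.out 2>&1 &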

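The message does not appear with the following command:

nohup wget -c -r -nd -np -k -L -p ftp://ftp.kobic.re.kr/pub/KPGP/2015_release_candidate/WGS/KPGP-00001 1>/dev/null 2>&1 &

That is because the trailing 1>/dev/null 2>&1 & sends the output somewhere else: stdout goes to /dev/null, stderr follows stdout, and the final & puts the job in the background.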
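To see what the redirection does in isolation, here is a minimal sketch. The function noisy and the temporary log file are made up for illustration; noisy merely stands in for a chatty program such as wget.

```shell
#!/bin/sh
# Hypothetical stand-in for a chatty program such as wget:
# it writes one line to stdout and one line to stderr.
noisy() {
    echo "progress message"
    echo "warning message" >&2
}

log=$(mktemp)   # throwaway log file for the demo

# Send stdout to the log, then point stderr at the same place.
# Order matters: "2>&1" has to come after the stdout redirection.
noisy 1>"$log" 2>&1

# Discard everything instead, as the wget command does.
noisy 1>/dev/null 2>&1

# The log holds both lines; the second call left no trace anywhere.
cat "$log"
```

With the output captured or discarded like this, nohup has nothing left to append to nohup.out, which is why the message goes away.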

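Breakdown of the wget options used in that command, plus a few related ones:

-c resume a partially downloaded file

-r recursive download: fetch every file under a directory of the given site, including subdirectories

-nd do not create a directory hierarchy when downloading recursively; save all files in the current directory

-np do not ascend to the parent directory when downloading recursively. For example, with wget -c -r http://www.chenzei.com/junshi and no -np, files in the directory above the given path would be downloaded as well

-k convert absolute links to relative links; worth adding when downloading a whole site for offline browsing

-L follow relative links only when recursing (--relative). This is not what keeps wget on one host: recursion stays on the starting host unless -H is given. The original example: with wget -c -r www.chenzei.com/, if a page on the site contains an absolute link such as www.chenzei.com, then without -L the recursion follows it and spreads like a wildfire over a hillside

-p download everything a page needs in order to display, such as images

-A comma-separated list of file name suffixes or patterns to accept

-i read the URLs to download from the given file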

To list the jobs running in the background:

jobs

or

ps -ef | grep command

To terminate a job:

kill -9 PID    (PID is the process ID reported by ps)
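The check-then-kill cycle can be tried safely on a dummy job. This is only a sketch: sleep stands in for the real download, and the variable name pid is my own choice.

```shell
#!/bin/sh
# Start a long-running stand-in job in the background (sleep plays the role of wget).
sleep 300 &
pid=$!                      # $! holds the PID of the most recent background job
echo "started job with PID $pid"

# kill -0 sends no signal; it only tests whether the process exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "job is still running"
fi

# Ask politely first (SIGTERM); fall back to kill -9 only if that fails.
kill "$pid"
wait "$pid" 2>/dev/null || true   # reap the job; non-zero status is expected after a signal

if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"
else
    echo "job has stopped"
fi
```

Plain kill sends SIGTERM, which a program can catch in order to clean up. kill -9 (SIGKILL) cannot be caught, so it is better kept as a last resort.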

Never mind the output file for now.

Start the downloads first. jobs then shows:

[1]   Running    nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz &  (wd: ~/reference/genome/hg19)
[2]   Running    nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz &  (wd: ~/reference/genome/hg38)
[3]   Running    nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz &  (wd: ~/reference/genome/mm10)
[4]   Running    nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz &
[5]-  Running    nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz &
[6]+  Running    nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grcm38.tar.gz &

Checking the downloads 24 hours later:

mary@administrator-ThinkStation-P710:~/reference/genome$ ls -lh
total 3.3G
drwxrwxr-x 2 mary mary 4.0K Feb 25 20:01 hg19
drwxrwxr-x 2 mary mary 4.0K Feb 25 20:07 hg38
-rw-rw-r-- 1 mary mary 3.3G Feb 26 14:57 KPGP-00001_L1_R1.fq.gz
drwxrwxr-x 2 mary mary 4.0K Feb 25 20:09 mm10

Note that KPGP-00001 should be 13G but is only 3.3G, so the download is incomplete.
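Size is one clue, but for a .gz file there is also a direct test: gzip -t reads the file and reports whether the compressed stream is intact. A minimal sketch follows; the function name gz_is_complete and the demo file names are made up.

```shell
#!/bin/sh
# Return 0 if the gzip file is complete and uncorrupted, non-zero otherwise.
# gzip -t reads the whole file and checks it without writing any output.
gz_is_complete() {
    gzip -t "$1" 2>/dev/null
}

# Demo with throwaway files (all names below are made up).
workdir=$(mktemp -d)
seq 1 20000 | gzip > "$workdir/full.fq.gz"
# Simulate an interrupted download by keeping only the first 1000 bytes.
head -c 1000 "$workdir/full.fq.gz" > "$workdir/partial.fq.gz"

for f in "$workdir/full.fq.gz" "$workdir/partial.fq.gz"; do
    if gz_is_complete "$f"; then
        echo "$(basename "$f"): OK"
    else
        echo "$(basename "$f"): truncated or corrupt"
    fi
done
```

On a file of several gigabytes the test has to read everything, so it takes a while, but it needs no reference size or checksum from the server.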

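Create a fresh folder and download again... ( ▼-▼ )

Remember to come back tomorrow, once the download is done, and check that the md5 checksum matches.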
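A sketch of that checksum comparison, assuming GNU md5sum is available (as on most Linux systems). The function name md5_matches is hypothetical, and the demo runs on a throwaway file rather than on the real FASTQ.

```shell
#!/bin/sh
# Compare a file's MD5 digest with an expected value.
# Usage: md5_matches FILE EXPECTED_HEX
md5_matches() {
    actual=$(md5sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Demo on a throwaway file; in practice the expected value comes from the
# checksum published next to the download.
tmpfile=$(mktemp)
printf 'hello\n' > "$tmpfile"
expected="b1946ac92492d2347c6235b4d2611184"   # MD5 of the six bytes "hello\n"

if md5_matches "$tmpfile" "$expected"; then
    echo "checksum OK"
else
    echo "checksum MISMATCH"
fi
```

If the server publishes a checksum file in md5sum's own format, md5sum -c on that file does the same comparison in one step.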

Then check whether the other files are complete, judging by size.

To see the size of everything under a folder:

mary@administrator-ThinkStation-P710:~/reference$ du -h

940M ./genome/hg38

7.7G ./genome/hg19

8.0M ./genome/KPGP00001

832M ./genome/mm10

9.5G ./genome

16K ./index/bwa

532K ./index/hisat/hg38

12G ./index/hisat

36K ./index/bowtie

12G ./index

21G .

The following command works too, and shows that KPGP00001 is still downloading:

mary@administrator-ThinkStation-P710:~/reference$ du -lh

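940M ./genome/hg38    should be 3.1G; this one also stopped before finishing..
7.7G ./genome/hg19    should be 3G; probably downloaded more than once..
20M ./genome/KPGP00001
832M ./genome/mm10    should be 2.6G; probably stopped before finishing..
9.5G ./genome    should be 8.7G; what was needed is missing, and some of it is duplicated..
(expected here: 3.8G reference/index/hisat/grcm38)
(expected here: 4.2G reference/index/hisat/hg19)
(expected here: 4.4G reference/index/hisat/hg38)
16K ./index/bwa
532K ./index/hisat/hg38
12G ./index/hisat    should be 13G
36K ./index/bowtie    should be 12G
(expected here: 15G ./index/bwa)
12G ./index    should be 39G
21G .    should be 48G
(expected here: a 942M gtf)

( ▼-▼ )

Sigh, start over.

First delete everything,

then download again: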

nohup wget -c -k -p http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz 1>/dev/null 2>&1 &

nohup wget -c -k -p http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz 1>/dev/null 2>&1 &

nohup wget -c -k -p http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz 1>/dev/null 2>&1 &

Checking the state of things after downloading this way:

mary@administrator-ThinkStation-P710:~/reference/genome$ du -h

4.6M ./hg38/hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips

4.6M ./hg38/hgdownload.cse.ucsc.edu/goldenPath/hg38

4.6M ./hg38/hgdownload.cse.ucsc.edu/goldenPath

4.6M ./hg38/hgdownload.cse.ucsc.edu

4.6M ./hg38

6.5M ./hg19/hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips

6.5M ./hg19/hgdownload.cse.ucsc.edu/goldenPath/hg19

6.5M ./hg19/hgdownload.cse.ucsc.edu/goldenPath

6.5M ./hg19/hgdownload.cse.ucsc.edu

6.6M ./hg19

239M ./KPGP00001

3.7M ./mm10/hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips

3.7M ./mm10/hgdownload.cse.ucsc.edu/goldenPath/mm10

3.7M ./mm10/hgdownload.cse.ucsc.edu/goldenPath

3.7M ./mm10/hgdownload.cse.ucsc.edu

3.7M ./mm10

254M .

A bit tidier :)
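In the listing above, each download sits several directories deep under a folder named after the host. A sketch of pulling such archives back up to the top of their genome folder is shown here; the function name pull_up_archives is made up, and the demo tree only mimics the real layout with a placeholder file.

```shell
#!/bin/sh
# Move every *.gz found below a directory up into that directory itself.
# Usage: pull_up_archives DIR
pull_up_archives() {
    # -mindepth 2 skips archives that already sit directly in DIR.
    find "$1" -mindepth 2 -type f -name '*.gz' -exec mv {} "$1"/ \;
}

# Demo on a throwaway tree that mimics the layout above (contents are fake).
top=$(mktemp -d)
mkdir -p "$top/hg38/hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips"
echo "placeholder" > "$top/hg38/hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"

pull_up_archives "$top/hg38"
ls "$top/hg38"
```

The emptied host directories are left behind and can be removed afterwards. Adding -nd to the wget command should avoid the nesting in the first place.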

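JIM notes that a downloaded genome has to be indexed. Since we will compare the three mainstream aligners bowtie2, hisat2 and bwa, we will build all of the indexes. The sizes once the downloads have finished are as follows.

Right, the index has to be built after the download has finished.....

Breakdown of the index-building commands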

cd ~/reference

mkdir -p index/bowtie && cd index/bowtie

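More on mkdir

mkdir is a commonly used command for creating empty directories. Its two most useful options are:

-m, --mode=MODE set the permission mode (as in chmod), instead of rwxrwxrwx minus umask

-p, --parents create parent directories as needed; no error if the directory already exists

The original English help text: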

-m, --mode=MODE set file mode (as in chmod), not a=rwx - umask
-p, --parents no error if existing, make parent directories as needed
-v, --verbose print a message for each created directory
-Z set SELinux security context of each created directory to the default type
--context[=CTX] like -Z, or if CTX is specified then set the SELinux or SMACK security context to CTX
--help display this help and exit
--version output version information and exit
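The behaviour of -p and -m is easy to confirm in a scratch directory. In this small sketch the directory names are illustrative only.

```shell
#!/bin/sh
base=$(mktemp -d)          # throwaway location for the demo

# Without -p this would fail, because "$base/index" does not exist yet.
mkdir -p "$base/index/bowtie"

# Running it again is not an error: -p ignores directories that already exist.
mkdir -p "$base/index/bowtie"

# -m sets the mode of the final directory directly instead of a=rwx minus umask.
mkdir -p -m 700 "$base/index/private"
ls -ld "$base/index/private"
```

Because the second call succeeds silently, mkdir -p is safe to put at the top of a script that may be run more than once.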

nohup time ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/reference/genome/hg19/hg19.fa ~/reference/index/bowtie/hg19 1>hg19.bowtie_index.log 2>&1 &

nohup time ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/reference/genome/hg38/hg38.fa ~/reference/index/bowtie/hg38 1>hg38.bowtie_index.log 2>&1 &

nohup time ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/reference/genome/mm10/mm10.fa ~/reference/index/bowtie/mm10 1>mm10.bowtie_index.log 2>&1 &

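Explanation

Building a reference genome index with bowtie2: bowtie2-build

1) Usage: bowtie2-build <reference FASTA file> <prefix of the index files to generate>; for example:

nohup /home/cuckoo/software/bowtie2-2.2.3/bowtie2-build genome.fa bowtie2index/genome >> bowtie2.log &

2) Arguments: genome.fa is the FASTA file;

genome is the prefix of the index files to generate;

bowtie2index is a directory that holds the index files, which makes them easier to find and use later.

Note: once the program has finished, genome.fa should be placed in the bowtie2index directory so that tophat2 can run correctly.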
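Since a background build can die without anyone noticing, it helps to check that the index is all there before aligning. bowtie2-build writes six files per prefix (.1, .2, .3, .4, .rev.1 and .rev.2, with extension .bt2, or .bt2l for large indexes). The sketch below, with a made-up function name and fake demo files, only checks that they exist and are non-empty; it does not validate their contents.

```shell
#!/bin/sh
# Return 0 if all six Bowtie 2 index files exist (and are non-empty) for the
# given prefix. Accepts both the small (.bt2) and large (.bt2l) flavours.
bt2_index_complete() {
    prefix=$1
    for ext in bt2 bt2l; do
        missing=0
        for part in 1 2 3 4 rev.1 rev.2; do
            [ -s "$prefix.$part.$ext" ] || missing=1
        done
        [ "$missing" -eq 0 ] && return 0
    done
    return 1
}

# Demo with fake index files in a throwaway directory.
idx=$(mktemp -d)
for part in 1 2 3 4 rev.1 rev.2; do
    echo "fake" > "$idx/genome.$part.bt2"
done
rm "$idx/genome.rev.2.bt2"       # pretend the build was interrupted

if bt2_index_complete "$idx/genome"; then
    echo "index looks complete"
else
    echo "index is incomplete; rebuild or download again"
fi
```

With a real index the call would look like bt2_index_complete ~/reference/index/bowtie/hg19, using the same prefix that was passed to bowtie2-build.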

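Whether because of the network or something else, building the index myself while the download was still under way did not go well (downloading first and then building the index works fine).

So if a prebuilt index is available, it is best to use it.

In the index/bowtie directory, download the hg19 index (note that .ebwt is the Bowtie 1 index format; Bowtie 2 uses .bt2 files):

nohup wget -c ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip 1>/dev/null 2>&1 &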

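At this point I realized I had made a big mistake.

For files smaller than 1G, the order of work should be:

1. Create the corresponding folder

2. Download the target sequence inside that folder

3. Decompress

When decompressing, check the path of the gz file carefully. Do not look for it directly in the top-level folder such as hg19;

it sits several directories below hg19.

This may be related to the way I kept piling variations onto my nohup command (most likely the -p option, which makes wget recreate the host and path as directories).

I should not iterate so carelessly in future.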

hg38 ended up here:

~/reference/genome/hg38/hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips

No wonder building the index directly in the original directory could not find it.

So move it:

First cd ~/reference/genome/hg38/hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips

Then move it: mv hg38.fa.gz /home/mary/reference/genome/hg38

Finally check: ~/reference/genome/hg38$ ls

hg38.bowtie_index.log hg38.fa.gz hgdownload.cse.ucsc.edu

Move hg19 from:

~/reference/genome/hg19/hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips

Find mm10:

~/reference/genome/mm10/hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips$ ls

Move it: mv chromFa.tar.gz /home/mary/reference/genome/mm10

4. Merge with cat: cat chr*.fa > hg19.fa

5. Delete the per-chromosome files that have now been merged: rm chr*.fa
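Steps 4 and 5 can be chained so that the per-chromosome files are removed only if the merge succeeded. A minimal sketch, with a hypothetical function name and two tiny made-up chromosomes for the demo:

```shell
#!/bin/sh
# Concatenate per-chromosome FASTA files into one genome file, then remove
# the pieces only if the merge succeeded.
# Usage: merge_chromosomes DIR OUTPUT_NAME
merge_chromosomes() {
    (
        cd "$1" || exit 1
        # Write to a temporary name first so a failed cat never leaves a
        # half-written genome file under the final name.
        cat chr*.fa > "$2.tmp" && mv "$2.tmp" "$2" && rm chr*.fa
    )
}

# Demo with two tiny made-up chromosomes.
dir=$(mktemp -d)
printf '>chr1\nACGT\n' > "$dir/chr1.fa"
printf '>chr2\nTTGA\n' > "$dir/chr2.fa"

merge_chromosomes "$dir" demo_genome.fa
grep -c '^>' "$dir/demo_genome.fa"    # number of sequences in the merged file
```

The shell expands chr*.fa in lexicographic order (chr1, chr10, chr11, ...), so the merged file is not in numeric chromosome order.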

Once downloaded, KPGP should be 13G; as shown below, only 4.4G has arrived so far. So slow.. to be continued..

mary@administrator-ThinkStation-P710:~/reference/genome/KPGP00001$ ls -lh
total 4.4G
-rw-rw-r-- 1 mary mary 4.4G Feb 28 08:50 KPGP-00001_L1_R1.fq.gz

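Also, to upload files to the server, use rz.

Uploading files to a Linux server through Xshell

1. Open Xshell and log in to the Linux server

2. Check whether lrzsz is already installed (rpm -qa|grep lrzsz)

3. If lrzsz is not installed, upload the installation package with WinSCP (the package can be taken from the Linux operating system image)

4. Install lrzsz

5. Run rz to upload a file; a file selection window pops up

To download, use sz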
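The rpm query in step 2 only works on RPM-based distributions. A distribution-independent check is to ask the shell whether rz and sz are on the PATH. In this sketch the helper name have_cmd is my own.

```shell
#!/bin/sh
# Return 0 if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in rz sz; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not found, install the lrzsz package first"
    fi
done
```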
