您的位置:首页 > 其它

Extract User Defined Region From An Chromosome Fasta File

2016-12-30 09:59 260 查看

Extract User Defined Region From An Chromosome Fasta File

小麦基因组一条染色体序列至少有500M,我们想要获得染色体某一段的序列还是比较难办的。除了samtool可以做到,今天发现一个用python写的脚本pyfaidx,使用起来也比较方便。

首先安装

pip install --user pyfaidx  #安装在自己账户的 $HOME/.local/bin/faidx
sudo pip install pyfaidx #需要权限,安装在/usr/local/bin/faidx


输入命令 faidx 默认返回使用帮助

faidx
$ faidx
usage: faidx [-h] [-b BED] [-o OUT]
[-i {bed,chromsizes,nucleotide,transposed}] [-c] [-r]
[-a SIZE_RANGE] [-n | -f] [-x] [-l] [-s DEFAULT_SEQ]
[-d DELIMITER] [-g REGEX | -v] [-m | -M] [--no-rebuild]
[--version]
fasta [regions [regions ...]]

Fetch sequences from FASTA. If no regions are specified, all entries in the
input file are returned. Input FASTA file must be consistently line-wrapped,
and line wrapping of output is based on input line lengths.

positional arguments:
fasta                 FASTA file
regions               space separated regions of sequence to fetch e.g.
chr1:1-1000

optional arguments:
-h, --help            show this help message and exit
-b BED, --bed BED     bed file of regions
-o OUT, --out OUT     output file name (default: stdout)
-i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed}
transform the requested regions into another format.
default: None
-c, --complement      complement the sequence. default: False #
-r, --reverse         reverse the sequence. default: False
-a SIZE_RANGE, --size-range SIZE_RANGE
selected sequences are in the size range [low, high].
example: 1,1000 default: None
-n, --no-names        omit sequence names from output. default: False
-f, --full-names      output full names including description. default:
False
-x, --split-files     write each region to a separate file (names are
derived from regions)
-l, --lazy            fill in --default-seq for missing ranges. default:
False
-s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
default base for missing positions and masking.
default: N
-d DELIMITER, --delimiter DELIMITER
delimiter for splitting names to multiple values
(duplicate names will be discarded). default: None
-g REGEX, --regex REGEX
selected sequences are those matching regular
expression. default: .*
-v, --invert-match    selected sequences are those not matching 'regions'
argument. default: False
-m, --mask-with-default-seq
mask the FASTA file using --default-seq default: False
-M, --mask-by-case    mask the FASTA file by changing to lowercase. default:
False
--no-rebuild          do not rebuild the .fai index even if it is out of
date. default: False
--version             print pyfaidx version number

Please cite: Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. (2015) Efficient
"pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196 https://dx.doi.org/10.7287/peerj.preprints.970v1[/code] 
从上边的帮助信息了解到除了获取指定的序列外还有诸如倒序反转等等功能。

举个最简单的例子,想要获取chr5B_part2上75800000-76000000的序列,则命令如下

faidx wheat_genome.fasta chr5B_part2:75800000-76000000     # 默认返回序列屏幕

>chr5B_part2:75800000-76000000
GTCGTCTGCGAAGCCCCTACCAGAAATCCGCAACCTCCATGACATCCACCACTACACCAG
GCCTCCACGTCTCTTGTCCTCTGCCGCCAATCCACACCTCTTGTCCGCCCACCATGTGGT
ACCACATCCGCCCCTATAGAAGTCACCACGCCATGCAAGTGTTCGGTGAAATGCTCATGC
TAATTTCCTTGTCCCTTCTTTATACAGTAATGGATTCAGACATGGAGAACATATATGAGC
GCTATGTTGAGTCGTCCGAGGAGCAAGACTACATGGATGAAATGATAATGATGCAGGTGG
TCCTTGCAGACGCGGAGCAAGCAGAAGAGCATGTTCTCAATTTCAAGGGATCCATCAA.... #剩下的序列未列出
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息