
Data File Formats - UCSC - GFF, PSL

2009-10-10 14:34

GFF3&GFF&GFF2PS

GFF3: http://song.sourceforge.net/gff3.shtml
GFF: http://www.sanger.ac.uk/Software/formats/GFF/
GFF2PS: http://genome.imim.es/software/gfftools/GFF2PS.html

DATA FORMAT

PSL format

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. See the BLAT documentation for more details. All of the following fields are required on each data line within a PSL file:

matches - Number of bases that match that aren't repeats

misMatches - Number of bases that don't match

repMatches - Number of bases that match but are part of repeats

nCount - Number of 'N' bases

qNumInsert - Number of inserts in query

qBaseInsert - Number of bases inserted in query

tNumInsert - Number of inserts in target

tBaseInsert - Number of bases inserted in target

strand - '+' or '-' for query strand. For translated alignments, the second '+' or '-' is for the genomic strand

qName - Query sequence name

qSize - Query sequence size

qStart - Alignment start position in query

qEnd - Alignment end position in query

tName - Target sequence name

tSize - Target sequence size

tStart - Alignment start position in target

tEnd - Alignment end position in target

blockCount - Number of blocks in the alignment (a block contains no gaps)

blockSizes - Comma-separated list of sizes of each block

qStarts - Comma-separated list of starting positions of each block in query

tStarts - Comma-separated list of starting positions of each block in target
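The 21 fields listed above can be read into a record with a few lines of Python. This is a minimal sketch, not the parser used by BLAT or the Genome Browser: the function name parse_psl_line and the dictionary layout are assumptions made for illustration, and the code assumes that sequence names contain no whitespace, so that splitting on any whitespace works for both tab-separated and space-separated lines.

```python
# Field names in the order given in the list above.
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
STRING_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Parse one PSL data line into a dict keyed by field name (hypothetical helper)."""
    tokens = line.split()
    if len(tokens) != len(PSL_FIELDS):
        raise ValueError(
            "expected %d fields, found %d" % (len(PSL_FIELDS), len(tokens))
        )
    record = {}
    for name, token in zip(PSL_FIELDS, tokens):
        if name in STRING_FIELDS:
            record[name] = token
        elif name in LIST_FIELDS:
            # The lists carry a trailing comma, so drop empty items.
            record[name] = [int(item) for item in token.split(",") if item]
        else:
            record[name] = int(token)
    # Each per-block list should have exactly blockCount entries.
    for name in LIST_FIELDS:
        if len(record[name]) != record["blockCount"]:
            raise ValueError("%s does not match blockCount" % name)
    return record
```

Track lines (those beginning with "track") are not data lines and would need to be skipped before calling this function.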

Example:

Here is an example of an annotation track in PSL format. Note that line breaks have been inserted into the PSL lines in this example for documentation display purposes, so each record below spans two lines.

track name=fishBlats description="Fish BLAT" useScore=1

59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22

47748585 13073589 13073753 2 48,20, 171,1042, 34674832,34674976,

59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr22

47748585 13073626 13073747 2 21,45, 2456,2532, 34674838,34674914,

59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2676 chr22

47748585 13073727 13073848 2 45,21, 249,349, 13073727,13073827,

Be aware that the coordinates for a negative strand in a PSL line are handled in a special way. In the qStart and qEnd fields, the coordinates indicate the position where the query matches from the point of view of the forward strand, even when the match is on the reverse strand. However, in the qStarts list, the coordinates are reversed.

Example:

Here is a 30-mer containing 2 blocks that align on the minus strand and 2 blocks that align on the plus strand (this sometimes can happen in response to assembly errors):

0         1         2         3 tens position in query
0123456789012345678901234567890 ones position in query
            ++++          +++++ plus strand alignment on query
    --------    ----------      minus strand alignment on query

Plus strand:

qStart=12

qEnd=31

blockSizes=4,5

qStarts=12,26

Minus strand:

qStart=4

qEnd=26

blockSizes=10,8

qStarts=5,19

Essentially, the minus strand blockSizes and qStarts are what you would get if you reverse-complemented the query. However, the qStart and qEnd are not reversed. To convert one to the other:

qStart = qSize - revQEnd

qEnd = qSize - revQStart
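The same conversion can be applied block by block. The sketch below is illustrative only: the function name reverse_blocks_to_forward is hypothetical, and it assumes qSize is 31 for the example above, since the position ruler runs from 0 to 30 and the plus-strand qEnd is 31.

```python
def reverse_blocks_to_forward(q_size, block_sizes, rev_q_starts):
    """Map minus-strand blocks to forward-strand (start, end) intervals.

    rev_q_starts are the qStarts values as stored in the PSL line, that is,
    positions on the reverse-complemented query. Intervals are half-open.
    """
    intervals = []
    for size, rev_start in zip(block_sizes, rev_q_starts):
        rev_end = rev_start + size
        # qStart = qSize - revQEnd, qEnd = qSize - revQStart
        intervals.append((q_size - rev_end, q_size - rev_start))
    intervals.sort()
    return intervals


# Minus-strand values from the example above, with the assumed qSize of 31.
blocks = reverse_blocks_to_forward(31, [10, 8], [5, 19])
q_start = blocks[0][0]   # smallest forward-strand start
q_end = blocks[-1][1]    # largest forward-strand end
```

With these inputs the blocks land on forward positions 4 to 12 and 16 to 26, which agrees with the qStart=4 and qEnd=26 reported for the minus strand.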

GFF format

GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. For more information on GFF format, refer to http://www.sanger.ac.uk/Software/formats/GFF.

Here is a brief description of the GFF fields:

seqname - The name of the sequence. Must be a chromosome or scaffold.

source - The program that generated this feature.

feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".

start - The starting position of the feature in the sequence. The first base is numbered 1.

end - The ending position of the feature (inclusive).

score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".

strand - Valid entries include '+', '-', or '.' (for don't know/don't care).

frame - If the feature is a coding exon, frame should be 0, 1, or 2, representing the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.

group - All lines with the same group are linked together into a single item.
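A small validator makes the field rules above concrete. This is a sketch under stated assumptions rather than the Genome Browser's own loader: the function name parse_gff_line is hypothetical, and missing score or frame values ('.') are represented as None.

```python
GFF_FIELDS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "group",
]


def parse_gff_line(line):
    """Parse one tab-separated GFF data line into a dict (hypothetical helper)."""
    tokens = line.rstrip("\n").split("\t")
    if len(tokens) != len(GFF_FIELDS):
        # Space-separated lines end up here, because they contain no tabs.
        raise ValueError("expected 9 tab-separated fields, found %d" % len(tokens))
    record = dict(zip(GFF_FIELDS, tokens))

    # Coordinates are 1-based and the end position is inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    if record["start"] < 1 or record["end"] < record["start"]:
        raise ValueError("invalid start/end coordinates")

    record["score"] = None if record["score"] == "." else float(record["score"])

    if record["strand"] not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-', or '.'")

    if record["frame"] == ".":
        record["frame"] = None
    else:
        record["frame"] = int(record["frame"])
        if record["frame"] not in (0, 1, 2):
            raise ValueError("frame must be 0, 1, 2, or '.'")
    return record
```

Rejecting any line that does not split into exactly nine tab-separated fields catches the common problem of tabs being replaced by spaces during a paste.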

Example:

Here's an example of a GFF-based track. A copy of this example that can be pasted into the browser without editing is available at http://genome.ucsc.edu/goldenPath/help/regulatory.txt. NOTE: Paste operations on some operating systems will replace tabs with spaces, which will result in an error when the GFF track is uploaded. You can circumvent this problem by pasting the URL of the example, instead of the text itself, into the custom annotation track text box.

track name=regulatory description="TeleGene(tm) Regulatory Regions"

chr22 TeleGene enhancer 1000000 1001000 500 + . touch1

chr22 TeleGene promoter 1010000 1010100 900 + . touch1

chr22 TeleGene promoter 1020000 1020000 800 - . touch2

GTF format

GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.

The attribute list must begin with the two mandatory attributes:

gene_id value - A globally unique identifier for the genomic source of the sequence.

transcript_id value - A globally unique identifier for the predicted transcript.

Example:

Here is an example of the ninth field in a GTF data line:

gene_id "Em:U62317.C22.6.mRNA"; transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1

For more information on this format, see http://genes.cse.wustl.edu/GTF2.html.

The Genome Browser groups together GTF lines that have the same transcript_id value. It only looks at features of type exon and CDS.
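The attribute syntax and the grouping rule can be sketched in Python as follows. Both function names are hypothetical, and the attribute parser assumes that values contain no semicolons; it illustrates the grouping described above and is not the Genome Browser's implementation.

```python
from collections import defaultdict


def parse_gtf_attributes(field):
    """Split the ninth GTF field into a dict of attribute type -> value."""
    attributes = {}
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        # Text values are quoted; numeric values such as exon_number are not.
        attributes[key] = value.strip().strip('"')
    return attributes


def group_by_transcript(lines):
    """Group exon and CDS features by their transcript_id value."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in ("exon", "CDS"):
            continue  # skip malformed lines and other feature types
        attributes = parse_gtf_attributes(fields[8])
        feature = (fields[2], int(fields[3]), int(fields[4]))
        groups[attributes["transcript_id"]].append(feature)
    return dict(groups)


# The ninth field from the example above.
attrs = parse_gtf_attributes(
    'gene_id "Em:U62317.C22.6.mRNA"; transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1'
)
```

Values are kept as strings, so a numeric attribute such as exon_number needs an explicit int() conversion if it is used for sorting.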
