Difference between revisions of "GenPlay File Formats"

From GenPlay, Einstein Genome Analyzer

Revision as of 15:28, 23 November 2010

1 BED File
2 bigBed format
3 PSL format
4 GFF format
5 GTF format
6 MAF format
7 BAM format
8 WIG format
9 bigWig format
10 Microarray format

BED File

Description

BED format has three required fields and nine additional optional fields. The number of fields per line must be constant throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be filled if higher-numbered fields are used.

The first three required BED fields are:
1.	chromosome - The name of the chromosome (e.g. chr1, chrX etc.) or scaffold (e.g. scaffold10461).
2.	Start - The beginning position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
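3.	End - The ending position of the feature in the chromosome or scaffold. The End base itself is not included in the feature, so a feature covering the first 100 bases of a chromosome has Start 0 and End 100.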

The 9 additional optional BED fields are:
4.	name – This is the name of the gene.
5.	score - A score between 0 and 1000. 
6.	strand - Defines the strand as either '+' (red) or '-' (blue).
Unused fields
7.	thickStart 
8.	thickEnd
9.	itemRgb
10.	blockCount
11.	blockSizes
12.	blockStarts

Example

Here's an example of an annotation track that uses a complete BED definition:

   chr13 1450 5540 cloneX 1506 + 1450 5540 0 2 567,488, 0,3545
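To make the field layout concrete, here is a minimal Python sketch that splits a BED line into the fields described above. The names BedRecord and parse_bed_line are hypothetical and are not part of GenPlay. The sketch assumes whitespace-delimited fields and keeps fields 7 to 12 as raw strings, since they are listed as unused.

```python
from typing import NamedTuple, Optional, Tuple


class BedRecord(NamedTuple):
    """One BED line (hypothetical helper type, not part of GenPlay)."""
    chromosome: str
    start: int                     # first base of a chromosome is numbered 0
    end: int
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None
    extra: Tuple[str, ...] = ()    # fields 7-12, left unparsed


def parse_bed_line(line: str) -> BedRecord:
    """Split one whitespace-delimited BED line into its fields."""
    fields = line.split()
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3 to 12 fields, got {len(fields)}")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    if strand is not None and strand not in ("+", "-"):
        raise ValueError(f"invalid strand: {strand!r}")
    return BedRecord(fields[0], int(fields[1]), int(fields[2]),
                     name, score, strand, tuple(fields[6:]))


record = parse_bed_line(
    "chr13 1450 5540 cloneX 1506 + 1450 5540 0 2 567,488, 0,3545")
print(record.chromosome, record.start, record.end, record.name, record.strand)
```

Because the optional fields are positional, the parser only needs the field count to know which ones are present.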

Note

The search URL "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?exdb=AceView&db=33a&submit=Go&term=" is included at the beginning of each BED file. It is the base of the URL that describes each gene. To form the complete URL, append the name of the gene to the end of the search URL, after the equals sign. For example, http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?exdb=AceView&db=36a&submit=Go&term=WDR78.aApr07 will provide information on WDR78.
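The URL construction described in this note can be sketched in a few lines of Python. The function name gene_description_url is hypothetical, and the search URL is passed in as an argument because it is read from the header of each BED file. Percent-encoding the gene name is an added precaution that the note does not mention.

```python
from urllib.parse import quote


def gene_description_url(search_url: str, gene_name: str) -> str:
    """Append the gene name after the trailing '=' of the search URL."""
    # Percent-encoding is an assumption; it leaves typical names unchanged.
    return search_url + quote(gene_name)


# Search URL prefix taken from the complete example URL in the note above.
SEARCH_URL = ("http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi"
              "?exdb=AceView&db=36a&submit=Go&term=")
print(gene_description_url(SEARCH_URL, "WDR78.aApr07"))
```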

bigBed format

The bigBed format stores annotation items that are either simple or a linked collection of exons, much as BED files do. BigBed files are created initially from BED type files. The format of bigBed files is indexed binary. The primary benefit of bigBed files is speed: because the file is indexed, only the portions needed to display a particular region have to be read, which makes them considerably faster than regular BED files. The bigBed file remains on your web-accessible server (http, https, or ftp).

PSL format

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. All of the following fields are required on each data line within a PSL file:
1.	matches - Number of bases that match that are not repeats
2.	mismatches - Number of bases that don't match
3.	repeats - Number of bases that match but are part of repeats
4.	nbasecount - Number of 'N' bases
5.	numofinsertsquery - Number of inserts in query
6.	numofbaseinsertsquery - Number of bases inserted in query
7.	numofinsertstarget - Number of inserts in target
8.	numofbaseinsertstarget - Number of bases inserted in target
9.	strand - '+' or '-' for query strand. For translated alignments, the second '+' or '-' is for the genomic strand
10.	queryseqname - Query sequence name
11.	queryseqsize - Query sequence size
12.	querystart - Alignment start position in query
13.	queryend - Alignment end position in query
14.	targetseqname - Target sequence name
15.	targetseqsize - Target sequence size
16.	targetstart - Alignment start position in target
17.	targetend - Alignment end position in target
18.	blockcount - Number of blocks in the alignment (a block contains no gaps)
19.	blocksizes - Comma-separated list of sizes of each block
20.	querystarts - Comma-separated list of starting positions of each block in query
21.	targetstarts - Comma-separated list of starting positions of each block in target

Example:

   track name=fishBlats description="Fish BLAT" useScore=1
   59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22 47748585 13073589 13073753 2 48,20, 171,1042, 34674832,34674976,
   59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr22 47748585 13073626 13073747 2 21,45, 2456,2532, 34674838,34674914,
   59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2676 chr22
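As an illustration of the 21-field layout, the following Python sketch turns one PSL data line into a dictionary. The key names and the function parse_psl_line are hypothetical and chosen for readability; they are not GenPlay identifiers. The sketch assumes whitespace-delimited fields and comma-separated lists without internal spaces, and it skips nothing: track lines must be filtered out by the caller.

```python
# Illustrative key names for the 21 PSL fields, in file order.
PSL_FIELDS = (
    "matches", "mismatches", "repeat_matches", "n_count",
    "query_insert_count", "query_insert_bases",
    "target_insert_count", "target_insert_bases",
    "strand",
    "query_name", "query_size", "query_start", "query_end",
    "target_name", "target_size", "target_start", "target_end",
    "block_count", "block_sizes", "query_starts", "target_starts",
)
TEXT_FIELDS = {"strand", "query_name", "target_name"}
LIST_FIELDS = {"block_sizes", "query_starts", "target_starts"}


def parse_psl_line(line: str) -> dict:
    """Parse one PSL data line into a dict keyed by PSL_FIELDS."""
    values = line.split()
    if len(values) != len(PSL_FIELDS):
        raise ValueError(f"expected 21 fields, got {len(values)}")
    record = {}
    for key, value in zip(PSL_FIELDS, values):
        if key in LIST_FIELDS:
            # Lists end with a trailing comma, so drop empty items.
            record[key] = [int(item) for item in value.split(",") if item]
        elif key in TEXT_FIELDS:
            record[key] = value
        else:
            record[key] = int(value)
    for key in LIST_FIELDS:
        if len(record[key]) != record["block_count"]:
            raise ValueError(f"{key} does not match block_count")
    return record


alignment = parse_psl_line(
    "59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22 "
    "47748585 13073589 13073753 2 48,20, 171,1042, 34674832,34674976,")
print(alignment["target_name"], alignment["block_sizes"])
```

Checking each list against the block count catches lines where the comma-separated fields were split or truncated.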

GFF format

GFF stands for General Feature Format. GFF lines have 9 compulsory fields, which are tab-delimited. They are as follows:
1.	seqname - The name of the sequence. Must be a chromosome.
2.	source - The program that generated this feature.
3.	feature - The name of this type of feature.
4.	start - The starting position of the feature in the sequence. The first base is numbered 1.
5.	end - The ending position of the feature (inclusive).
6.	score - A score between 0 and 1000.
7.	strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
8.	frame - A number between 0 and 2 if the feature is a coding exon; otherwise '.'.
9.	group - Lines belonging to the same group are linked together into a single item.

Example:

   track name=regulatory description="TeleGene(tm) Regulatory Regions"
   chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
   chr22 TeleGene promoter 1010000 1010100 900 + . touch1
   chr22 TeleGene promoter 1020000 1020000 800 - . touch2
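GFF positions are 1-based and inclusive, whereas BED starts are 0-based, so converting between the two is a common source of off-by-one errors. The Python sketch below parses one tab-delimited GFF line and converts its coordinates to a 0-based start with an exclusive end. GffFeature, parse_gff_line and to_zero_based are hypothetical names, not GenPlay identifiers.

```python
from typing import NamedTuple, Optional, Tuple


class GffFeature(NamedTuple):
    """One GFF line (hypothetical helper type, not part of GenPlay)."""
    seqname: str
    source: str
    feature: str
    start: int               # 1-based
    end: int                 # inclusive
    score: Optional[float]   # None when the column is '.'
    strand: str
    frame: Optional[int]     # None when the column is '.'
    group: str


def parse_gff_line(line: str) -> GffFeature:
    """Parse one tab-delimited GFF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    if strand not in ("+", "-", "."):
        raise ValueError(f"invalid strand: {strand!r}")
    return GffFeature(seqname, source, feature, int(start), int(end),
                      None if score == "." else float(score), strand,
                      None if frame == "." else int(frame), group)


def to_zero_based(feature: GffFeature) -> Tuple[int, int]:
    """Return (start, end) with a 0-based start and an exclusive end."""
    return feature.start - 1, feature.end


enhancer = parse_gff_line(
    "chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1")
print(enhancer.group, to_zero_based(enhancer))
```

Only the start shifts by one: an inclusive 1-based end and an exclusive 0-based end are the same number.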

GTF format

GTF stands for Gene Transfer Format and is a stricter version of GFF. The first 8 GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute is a type/value pair. Attributes must be terminated by a semi-colon. The gap between any two attributes must be exactly one space. The attribute list must begin with the two required attributes:
•	gene_id value - A GUID for the genomic source of the sequence.
•	transcript_id value - A GUID for the predicted transcript.

Example: Here is an example of the ninth field in a GTF data line:

   gene_id "Em:U62317.C22.6.mRNA"; transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1
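The attribute list in the ninth field can be split into type/value pairs with a short Python function. This is a sketch under a simplifying assumption: quoted values contain no semicolons, so splitting on ';' is safe. The name parse_gtf_attributes is hypothetical.

```python
def parse_gtf_attributes(text: str) -> dict:
    """Split a GTF attribute list into a dict of type/value pairs."""
    attributes = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after a trailing semi-colon
        key, _, value = chunk.partition(" ")
        attributes[key] = value.strip().strip('"')
    # The list must begin with gene_id followed by transcript_id.
    if list(attributes)[:2] != ["gene_id", "transcript_id"]:
        raise ValueError("attributes must begin with gene_id, transcript_id")
    return attributes


ninth_field = ('gene_id "Em:U62317.C22.6.mRNA"; '
               'transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1')
print(parse_gtf_attributes(ninth_field))
```

All values are returned as strings; converting exon_number to an integer is left to the caller.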

MAF format

MAF stands for Multiple Alignment Format. It stores a series of multiple alignments in a format that is easy to parse and to read. Each multiple alignment ends with a blank line. Each sequence in an alignment is on a single line, with words delimited by white space. Comments begin with '#', and lines beginning with '##' contain meta-data.

The file is divided into paragraphs that terminate in a blank line. Within a paragraph, the first word of a line indicates its type. Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment. Some MAF files may contain other optional line types:
•	an "i" line containing information about what is in the aligned species DNA before and after the immediately preceding "s" line
•	an "e" line containing information about the size of the gap between the alignments that span the current block
•	a "q" line indicating the quality of each aligned base for the species

A Simple Example

Here is a simple example of three alignment blocks derived from five starting sequences. The first line, the track line, is necessary for custom tracks, but should be removed otherwise. Repeats are shown as lowercase, and each block may have a subset of the input sequences. All sequence columns and rows must contain at least one nucleotide (no columns or rows that contain only insertions).

   track name=euArc visibility=pack
   ##maf version=1 scoring=tba.v8
   # tba.v8 (((human chimp) baboon) (mouse rat))

   a score=23262.0
   s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
   s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
   s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
   s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
   s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

   a score=5062.0
   s hg18.chr7    27699739 6 + 158545518 TAAAGA
   s panTro1.chr6 28862317 6 + 161576975 TAAAGA
   s baboon         241163 6 +   4622798 TAAAGA
   s mm4.chr6     53303881 6 + 151104725 TAAAGA
   s rn3.chr4     81444246 6 + 187371129 taagga

   a score=6636.0
   s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
   s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
   s baboon         249182 13 +   4622798 gcagctgaaaaca
   s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA
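The paragraph structure described above (an "a" line followed by "s" lines, ending at a blank line) can be read with a small Python loop. This is a minimal sketch, not GenPlay's reader: read_maf_blocks is a hypothetical name, only "a" and "s" lines are interpreted, and comment lines, track lines and the optional "i", "e" and "q" lines are skipped.

```python
def read_maf_blocks(text: str) -> list:
    """Return one dict per alignment block with its score and sequences."""
    blocks = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line ends the current alignment paragraph.
            if current is not None:
                blocks.append(current)
                current = None
            continue
        if line.startswith("#") or line.startswith("track"):
            continue  # comments, '##' meta-data and custom track lines
        parts = line.split(None, 1)
        kind = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if kind == "a":
            if current is not None:
                blocks.append(current)  # tolerate a missing blank line
            current = {"score": None, "sequences": []}
            for pair in rest.split():
                key, _, value = pair.partition("=")
                if key == "score":
                    current["score"] = float(value)
        elif kind == "s" and current is not None:
            source, start, size, strand, source_size, bases = rest.split()
            current["sequences"].append({
                "source": source, "start": int(start), "size": int(size),
                "strand": strand, "source_size": int(source_size),
                "text": bases,
            })
    if current is not None:
        blocks.append(current)  # file ended without a final blank line
    return blocks


SAMPLE = """##maf version=1 scoring=tba.v8

a score=5062.0
s hg18.chr7    27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon         241163 6 +   4622798 TAAAGA
s mm4.chr6     53303881 6 + 151104725 TAAAGA
s rn3.chr4     81444246 6 + 187371129 taagga
"""

for block in read_maf_blocks(SAMPLE):
    print(block["score"], [seq["source"] for seq in block["sequences"]])
```

The sample reuses the second alignment block from the example above; the lowercase text on the last "s" line marks a repeat.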

BAM format

BAM is a binary version of the Sequence Alignment / Map (SAM) format. For custom track display, the main advantage of indexed BAM over PSL and other human-readable alignment formats is that only the portions of the files needed to display a particular region are transferred to UCSC. This makes it possible to display alignments from files so large that the connection to UCSC would time out if the whole file were uploaded. Both the BAM file and its associated index file remain on your web-accessible server (http or ftp), not on the UCSC server. UCSC temporarily caches the accessed portions of the files to speed up interactive display.

WIG format

Wiggle format (WIG) allows the display of continuous-valued data in a track format.

bigWig format

The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph. The files are in an indexed binary format.

Microarray format

The datasets for the built-in microarray tracks in the Genome Browser are stored in BED15 format, an extension of BED format that includes three additional fields: expCount, expIds, and expScores. To display correctly in the Genome Browser, microarray tracks require the setting of several attributes in the trackDb file associated with the track's genome assembly. Each microarray track set must also have an associated microarrayGroups.ra configuration file that contains additional information about the data in each of the arrays. User-created microarray custom tracks are similar in format to BED custom tracks with the addition of three required track line parameters in the header--expNames, expScale, and expStep--that mimic the trackDb and microarrayGroups.ra settings of built-in microarray tracks.
