GenPlay File Formats

From GenPlay, Einstein Genome Analyzer

Revision as of 00:41, 25 November 2010 by Bouhassi (talk | contribs) (Description)

This page presents the different file formats that can be loaded in GenPlay. Part of the information on this page is from the FAQ of the UCSC genome browser. Don't hesitate to check it out for more details.

.2bit format

Description

A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.

The file begins with a 16-byte header containing the following fields:

  1. signature - the number 0x1A412743 in the architecture of the machine that created the file
  2. version - zero for now. Readers should abort if they see a version number higher than 0.
  3. sequenceCount - the number of sequences in the file.
  4. reserved - always zero for now

All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
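The signature check and byte-swap rule above can be sketched in Python as follows. This is a minimal illustration, not GenPlay's own reader; the function name parse_2bit_header is hypothetical.

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743


def parse_2bit_header(data: bytes):
    """Return (byte_order, version, sequence_count) from a 16-byte .2bit header.

    byte_order is "<" (little-endian) or ">" (big-endian) and should be reused
    for every other multi-byte field in the file.
    """
    if len(data) < 16:
        raise ValueError("a .2bit header is 16 bytes long")
    # Try both byte orders: this is equivalent to byte-swapping the signature.
    for byte_order in ("<", ">"):
        signature, version, sequence_count, _reserved = struct.unpack(
            byte_order + "4I", data[:16]
        )
        if signature == TWOBIT_SIGNATURE:
            if version != 0:
                raise ValueError(f"unsupported .2bit version: {version}")
            return byte_order, version, sequence_count
    raise ValueError("signature mismatch: not a .2bit file")
```

The returned byte order can then be passed to later struct.unpack calls when reading the index and the sequence records.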

The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:

  1. nameSize - a byte containing the length of the name field
  2. name - the sequence name itself, of variable length depending on nameSize
  3. offset - the 32-bit offset of the sequence data relative to the start of the file

The index is followed by the sequence records, which contain nine fields:

  1. dnaSize - number of bases of DNA in the sequence
  2. nBlockCount - the number of blocks of Ns in the file (representing unknown sequence)
  3. nBlockStarts - the starting position for each block of Ns
  4. nBlockSizes - the size of each block of Ns
  5. maskBlockCount - the number of masked (lower-case) blocks
  6. maskBlockStarts - the starting position for each masked block
  7. maskBlockSizes - the size of each masked block
  8. reserved - always zero for now
  9. packedDna - the DNA packed to two bits per base, represented as so: T - 00, C - 01, A - 10, G - 11.

Within each byte, the first base is stored in the most significant 2 bits and the last base in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011. The packedDna field is padded with 0 bits as necessary to take a whole multiple of 32 bits in the file, which improves I/O performance on some machines.
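As an illustration of the 2-bit packing, the following Python sketch decodes a packedDna byte string back into bases. The function name unpack_dna is hypothetical, and the sketch ignores the N blocks and mask blocks, which a full reader would apply afterwards.

```python
BASES = "TCAG"  # index = 2-bit code: T=00, C=01, A=10, G=11


def unpack_dna(packed: bytes, dna_size: int) -> str:
    """Decode dna_size bases from packed bytes, four bases per byte."""
    bases = []
    for byte in packed:
        # The first base sits in the two most significant bits of the byte.
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    # Drop the bases produced by the zero padding at the end.
    return "".join(bases[:dna_size])
```

For example, the single byte 00011011 decodes to TCAG, as stated above.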

Usage

The .2bit files can be loaded as sequence tracks.

To be loaded, the file needs to have a '.2bit' extension.

BAM / SAM format

Description

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

BAM is a Binary version of the Sequence Alignment / Map (SAM) format.

More information about these file formats can be found here

Usage

SAM files can be loaded as fixed windows tracks. BAM files can't be used directly in GenPlay, but they can easily be converted into SAM files using tools available on the Internet.

To be loaded, a SAM file needs to either have a '.sam' extension or start with the following line:

track type=sam

BED format

Description

BED format has three necessary fields and nine additional optional fields. The number of fields per line must be constant throughout any single set of data in an annotation track. The order of the optional fields requires that lower-numbered fields be filled if higher-numbered fields are used.

  • The first three required BED fields are:
  1. chromosome - The name of the chromosome (e.g. chr1, chrX etc.) or scaffold (e.g. scaffold10461).
  2. Start - The beginning position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. End - The ending position of the feature in the chromosome or scaffold.
  • The 9 additional optional BED fields are:
  1. name – This is the name of the gene.
  2. score - A score between 0 and 1000.
  3. strand - Defines the strand as either '+' (red) or '-' (blue).
  4. thickStart
  5. thickEnd
  6. itemRgb
  7. blockCount
  8. blockSizes
  9. blockStarts
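The field order listed above can be turned into a small parser. The sketch below is only an illustration: the function name parse_bed_line is hypothetical, fields are assumed to be separated by whitespace, and only the start and end positions are converted to integers.

```python
BED_FIELDS = [
    "chromosome", "start", "end",               # required fields
    "name", "score", "strand", "thickStart",    # optional fields
    "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line: str) -> dict:
    """Map the columns of one BED data line to their field names."""
    values = line.split()
    if not 3 <= len(values) <= len(BED_FIELDS):
        raise ValueError(f"expected 3 to 12 fields, got {len(values)}")
    # zip stops at the shorter list, so absent optional fields are omitted.
    record = dict(zip(BED_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record
```

Because optional fields must be filled from left to right, the number of columns alone tells which fields are present.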

Example

Here is an example of a BED file:

track type=bed
searchURL="http://genome.ucsc.edu/cgi-bin/hgGene?org=Human&hgg_chrom=none&db=hg19&hgg_gene="
chr1	11873	14409	uc001aaa.3	0	+	11873	11873	0	3	354,109,1189,	0,739,1347,
chr1	11873	14409	uc010nxq.1	0	+	12189	13639	0	3	354,127,1007,	0,721,1529,
chr1	11873	14409	uc010nxr.1	0	+	11873	11873	0	3	354,52,1189,	0,772,1347,
chr1	14362	16765	uc009vis.2	0	-	14362	14362	0	4	467,69,147,159,	0,607,1433,2244,

Note: The search URL "http://genome.ucsc.edu/cgi-bin/hgGene?org=Human&hgg_chrom=none&db=hg19&hgg_gene=" is included at the beginning of the BED file. It is the base of the URL of the page that describes each gene. The complete URL is formed by appending the name of the gene to the search URL, right after the final equals sign. For example, "http://genome.ucsc.edu/cgi-bin/hgGene?org=Human&hgg_chrom=none&db=hg19&hgg_gene=uc001aaa.3" will provide information on WDR 78.
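The way a gene URL is assembled from the searchURL line can be sketched in Python. Both function names are hypothetical, and the sketch assumes the searchURL line has the form shown in the example, with the URL enclosed in double quotes.

```python
def read_search_url(line: str):
    """Return the URL from a searchURL="..." line, or None for other lines."""
    prefix = "searchURL="
    if not line.startswith(prefix):
        return None
    # Remove surrounding whitespace first, then the enclosing double quotes.
    return line[len(prefix):].strip().strip('"')


def gene_url(search_url: str, gene_name: str) -> str:
    """Append the gene name right after the final equals sign."""
    return search_url + gene_name
```

With the search URL from the example, gene_url(url, "uc001aaa.3") yields the complete address quoted in the note.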

Usage

A BED file with the 3 required fields can generate stripes. If the first 2 required fields are specified, the file can be loaded as a fixed windows, variable windows or repeat track. If all the fields are specified, the file can be loaded as a gene track.

To be loaded, a BED file needs to either have a '.bed' extension or start with the following line:

track type=bed

BedGraph format

Description

The BedGraph format is a simple format for visualizing windows on the genome. Each window can have a score. The fields in a BedGraph file are the following:

  1. Chromosome
  2. Window start position
  3. Window stop position
  4. Score

Example

track type=bedgraph
chr1	18598	19673	1
chr1	124987	125426	3
chr1	317653	318092	15
chr1	427014	428027	8

Usage

BedGraph files can be used to load fixed windows and variable windows tracks.

They can also be loaded as stripes.

A valid BedGraph file needs to have a '.bgr' extension or to start with the following line:

track type=bedgraph

Eland Extended format

Description

The Eland Extended files contain the result of read alignments on a reference genome.

Each line of an Eland Extended file contains the following fields:

  1. Sequence or read name
  2. Sequence
  3. Type of match:
    • Either NM (no match found), QC (no matching done because of a quality control failure: too many Ns) or RM (no matching done because the read is repeat masked; may be seen if repeatFile.txt is specified)
    • Or x:y:z where x, y, and z are the number of exact, single-error, and 2-error matches found
  4. Match positions:
    • Either blank, if no matches were found or if too many matches were found
    • Or a list such as the following:
      BAC_plus_vector.fa:163022R1,170128F2,E_coli.fa:3909847R1
      This says there are two matches to BAC_plus_vector.fa: one in the reverse direction starting at position 163022 with one error, and one in the forward direction starting at position 170128 with two errors. There is also a single-error match to E_coli.fa.
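The match-position string in the example above can be decoded with a short Python sketch. It follows the notation exactly as described on this page (position, then F or R for the direction, then the number of errors); the function name parse_eland_matches is hypothetical, and real Eland output may use richer descriptors than this sketch handles.

```python
import re


def parse_eland_matches(field: str):
    """Return a list of (reference, position, direction, errors) tuples."""
    if not field.strip():
        return []  # blank field: no match, or too many matches
    matches = []
    reference = None
    for item in field.strip().split(","):
        # A "name:" prefix switches to a new reference sequence; items
        # without a prefix belong to the most recent reference.
        if ":" in item:
            reference, item = item.rsplit(":", 1)
        parsed = re.fullmatch(r"(\d+)([FR])(\d+)", item)
        if parsed is None or reference is None:
            raise ValueError(f"unrecognized match entry: {item!r}")
        position, direction, errors = parsed.groups()
        matches.append((reference, int(position), direction, int(errors)))
    return matches
```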

Usage

The Eland Extended files can generate fixed windows tracks.

The file needs to have a '.elx' extension or to start with the following line:

track type=eland_extended

GenPlay Project format

Description

GenPlay Project files can contain an entire project from GenPlay. These files are extremely compact because they are saved as compressed binary files. This means that they can't be edited with a text editor.

Usage

An entire GenPlay project can be loaded or saved as a GenPlay Project file. The only tracks that are not included in the project files are the sequence tracks.

A project file must have a '.gen' extension.

GdpGene format

Description

The GdpGene file format can be used to store information about genes. It is different from the BED format because it can store one score value for each exon.

The 7 mandatory fields of a GdpGene file are:

  1. name
  2. chromosome
  3. strand
  4. start
  5. stop
  6. exon starts (list of the exon start positions separated by commas, with no spaces or tabs)
  7. exon stops (list of the exon stop positions separated by commas, with no spaces or tabs)

There is also an optional field:

  1. exon scores (list of the exon scores separated by commas, with no spaces or tabs)

Example

Here is an example of a GdpGene file:

track type=GdpGene name=Genes_Mouse_RefSeq_07-2007.txt
Xkr4	chr1	-	3204562	3661579	3204562,3411782,3660632	3207049,3411982,3661579
Rp1	chr1	-	4334223	4350473	4334223,4341990,4342282,4350280	4340172,4342162,4342918,4350473
Sox17	chr1	-	4481008	4486494	4481008,4483852,4485216,4486371	4482749,4483944,4486023,4486494
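A data line such as the ones above could be read with the following Python sketch. The function name parse_gdpgene_line is hypothetical, and the sketch assumes whitespace-separated fields and gene names without spaces.

```python
def parse_gdpgene_line(line: str) -> dict:
    """Split one GdpGene data line into its 7 mandatory fields (+1 optional)."""
    fields = line.split()
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 fields, got {len(fields)}")
    # Comma-separated lists; empty items from a trailing comma are skipped.
    exon_starts = [int(v) for v in fields[5].split(",") if v]
    exon_stops = [int(v) for v in fields[6].split(",") if v]
    if len(exon_starts) != len(exon_stops):
        raise ValueError("exon starts and exon stops differ in length")
    exon_scores = None
    if len(fields) == 8:
        exon_scores = [float(v) for v in fields[7].split(",") if v]
    return {
        "name": fields[0],
        "chromosome": fields[1],
        "strand": fields[2],
        "start": int(fields[3]),
        "stop": int(fields[4]),
        "exon_starts": exon_starts,
        "exon_stops": exon_stops,
        "exon_scores": exon_scores,
    }
```

The per-exon score list is what distinguishes this format from BED, so the sketch keeps it as a separate list aligned with the exon positions.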

Note: you can add a searchURL line to the GdpGene file. It works the same way as for a BED file, as described in the BED format section.

Usage

GdpGene files can be used to load gene tracks.

A valid GdpGene file needs either to have a '.gdp' extension or to have the following first line:

track type=GdpGene

GFF format

Description

GFF stands for General Feature Format. GFF files have 9 compulsory tab-delimited fields:

  1. seqname - The name of the sequence. Must be a chromosome.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature.
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000.
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - A number between 0 and 2 if the feature is a coding exon, or '.' if it is not.
  9. group - Lines belonging to the same group are linked together into a single item.

More information about this format can be found on the Sanger website.

Usage

GFF files can be used to load repeat, variable windows and fixed windows tracks as well as stripes.

A GFF file needs to have a '.gff' extension or to have the following first line:

##GFF

GTF format

Description

GTF stands for Gene Transfer Format and is a stricter version of GFF. The first 8 GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute is a type/value pair. Attributes must be terminated by a semi-colon. The gap between any two attributes must be exactly one space. The attribute list must begin with the two required attributes:

  1. gene_id value - A GUID for the genomic source of the sequence
  2. transcript_id value - A GUID for the predicted transcript
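The attribute list can be split into type/value pairs with a sketch like the one below. The function name parse_gtf_attributes is hypothetical. The sketch assumes that textual values are enclosed in double quotes and that no value contains a semicolon, and it only checks that the two required attributes are present, not that they come first.

```python
def parse_gtf_attributes(text: str) -> dict:
    """Turn 'gene_id "x"; transcript_id "y";' into a dictionary."""
    attributes = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # empty piece after the final semicolon
        # The attribute type and its value are separated by the first space.
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')
    for required in ("gene_id", "transcript_id"):
        if required not in attributes:
            raise ValueError(f"missing required attribute: {required}")
    return attributes
```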

Refer to the Sanger website for further information.

Usage

Stripes, fixed and variable windows, repeat and gene tracks can be generated from a GTF file.

A GTF file must either have a '.gtf' extension or have the following first line:

##GTF

Pair format

Description

This file format is used by NimbleGen. The fields of a NimbleGen Pair file are:

  1. Image ID
  2. Gene Expression Option
  3. Sequence ID
  4. Probe ID
  5. Position
  6. X
  7. Y
  8. Match Index
  9. PM
  10. MM

Please refer to the documentation of NimbleScan (page 114) for further information on this format.

Usage

The pair files can generate fixed windows tracks.

The file needs to have a '.pair' extension to be loaded in GenPlay.

PSL format

Description

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. All of the following fields are required on each data line within a PSL file:

  1. matches - Number of bases that match that aren’t repeats
  2. mismatches - Number of bases that don't match
  3. repeats - Number of bases that match but are part of repeats
  4. nbasecount - Number of 'N' bases
  5. numofinsertsquery - Number of inserts in query
  6. numofbaseinsertsquery - Number of bases inserted in query
  7. numofinsertstarget - Number of inserts in target
  8. numofbaseinsertstarget - Number of bases inserted in target
  9. strand - '+' or '-' for query strand. For translated alignments, second '+' or '-' is for genomic strand
  10. queryseqname - Query sequence name
  11. queryseqsize - Query sequence size
  12. querystart - Alignment start position in query
  13. queryend - Alignment end position in query
  14. targetseqname - Target sequence name
  15. targetseqsize - Target sequence size
  16. targetstart - Alignment start position in target
  17. targetend - Alignment end position in target
  18. blockcount - Number of blocks in the alignment (a block contains no gaps)
  19. blocksizes - Comma-separated list of sizes of each block
  20. querystarts - Comma-separated list of starting positions of each block in query
  21. targetstarts - Comma-separated list of starting positions of each block in target

Usage

A PSL file can generate stripes. It can also be loaded as a fixed windows, variable windows, gene or repeat track.

To be loaded, a PSL file needs either to have a '.psl' extension or to have the following first line:

track type=psl

SOAPsnp format

Description

The SOAPsnp files contain the result of a resequencing utility that can assemble consensus sequence for the genome of a newly sequenced individual based on the alignment of raw sequencing reads on a known reference. The SNPs can then be identified on the consensus sequence through the comparison with the reference. More information about this file format can be found here

The result of SOAPsnp has 17 columns:

  1. Chromosome ID
  2. Coordinate on chromosome, starting from 1
  3. Reference genotype
  4. Consensus genotype
  5. Quality score of consensus genotype
  6. Best base
  7. Average quality score of best base
  8. Count of uniquely mapped best base
  9. Count of all mapped best base
  10. Second best base
  11. Average quality score of second best base
  12. Count of uniquely mapped second best base
  13. Count of all mapped second best base
  14. Sequencing depth of the site
  15. Rank sum test p_value
  16. Average copy number of nearby region
  17. Whether the site is a dbSNP.

Usage

The SOAPsnp files can be used to generate SNPs tracks.

Be sure that the file has a '.soapsnp' extension or starts with the following line:

track type=SOAPsnp

WIG format

Description

Wiggle format (WIG) allows the display of continuous-valued data in a track format. Please refer to this page for the whole description of a wiggle file.

General Structure

Wiggle format is line-oriented. For wiggle custom tracks, the first line must be a track definition line, which designates the track as a wiggle track and adds a number of options for controlling the default display.

Wiggle format is composed of declaration lines and data lines. There are two options for formatting wiggle data: variableStep and fixedStep. These formats were developed to allow the file to be written as compactly as possible.

  • variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. It begins with a declaration line and is followed by two columns containing chromosome positions and data values:
variableStep  chrom=chrN  [span=windowSize]
chromStartA  dataValueA
chromStartB  dataValueB
... etc ...  ... etc ...

The declaration line starts with the word variableStep and is followed by a specification for a chromosome. The optional span parameter (default: span=1) allows data composed of contiguous runs of bases with the same data value to be specified more succinctly. The span begins at each chromosome position specified and indicates the number of bases that data value should cover.

For example, this variableStep specification:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

Both versions display a value of 12.5 at positions 300701-300705 on chromosome 2.
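The effect of the span parameter in variableStep data can be checked with a short Python sketch that expands each data line into the range of bases it covers. The function name expand_variable_step is hypothetical, and the sketch skips track definition lines entirely.

```python
def expand_variable_step(lines):
    """Return (chrom, first_base, last_base, value) for each data line."""
    chrom = None
    span = 1
    intervals = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "track":
            continue
        if tokens[0] == "variableStep":
            # Declaration line: key=value parameters; span defaults to 1.
            params = dict(token.split("=", 1) for token in tokens[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
        else:
            if chrom is None:
                raise ValueError("data line before any declaration line")
            start, value = int(tokens[0]), float(tokens[1])
            intervals.append((chrom, start, start + span - 1, value))
    return intervals
```

Applied to the two specifications above, the first form gives five single-base intervals and the second a single interval from 300701 to 300705, covering the same bases.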

  • fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. It begins with a declaration line and is followed by a single column of data values:
fixedStep  chrom=chrN  start=position  step=stepInterval  [span=windowSize]
dataValue1
dataValue2
... etc ...

The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format.

For example, this fixedStep specification:

fixedStep chrom=chr3 start=400601 step=100
11
22
33

displays the values 11, 22, and 33 as single-base regions on chromosome 3 at positions 400601, 400701, and 400801, respectively.

Adding span=5 to the declaration line:

fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

causes the values 11, 22, and 33 to be displayed as 5-base regions on chromosome 3 at positions 400601-400605, 400701-400705, and 400801-400805, respectively.
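The fixedStep examples can be reproduced with a similar sketch, in which the position is not read from the data lines but advanced by the step after each value. The function name expand_fixed_step is hypothetical.

```python
def expand_fixed_step(lines):
    """Return (chrom, first_base, last_base, value) for each data value."""
    chrom = None
    position = step = 0
    span = 1
    intervals = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "track":
            continue
        if tokens[0] == "fixedStep":
            # Declaration line: chrom, start and step are expected; span is optional.
            params = dict(token.split("=", 1) for token in tokens[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params["step"])
            span = int(params.get("span", 1))
        else:
            if chrom is None:
                raise ValueError("data line before any declaration line")
            intervals.append((chrom, position, position + span - 1, float(tokens[0])))
            position += step  # the next value starts one step further
    return intervals
```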

Note that for both variableStep and fixedStep formats, the same span must be used throughout the dataset. If no span is specified, the default span of one is used.

Usage

A wiggle file can generate fixed or variable windows tracks as well as stripes.

Be sure that the file extension is '.wig' or that the file starts with the following line:

track type=wiggle
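The extension and first-line rules given in the Usage sections of this page can be summarized in one Python sketch. This is an illustration of the rules as documented here, not GenPlay's actual detection code; the function name guess_format and the returned labels are hypothetical, and the extension comparison is assumed to be case-insensitive.

```python
import os

# Extensions listed in the Usage sections of this page.
EXTENSIONS = {
    ".2bit": "2bit", ".sam": "sam", ".bed": "bed", ".bgr": "bedgraph",
    ".elx": "eland_extended", ".gen": "genplay_project", ".gdp": "GdpGene",
    ".gff": "gff", ".gtf": "gtf", ".pair": "pair", ".psl": "psl",
    ".soapsnp": "SOAPsnp", ".wig": "wiggle",
}
# Values accepted after "track type=" on the first line.
TRACK_TYPES = {
    "sam", "bed", "bedgraph", "eland_extended",
    "GdpGene", "psl", "SOAPsnp", "wiggle",
}


def guess_format(filename: str, first_line: str = ""):
    """Guess a format from the file extension, else from the first line."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension]
    line = first_line.strip()
    if line.startswith("##GFF"):
        return "gff"
    if line.startswith("##GTF"):
        return "gtf"
    tokens = line.split()
    if tokens and tokens[0] == "track":
        for token in tokens[1:]:
            key, _, value = token.partition("=")
            if key == "type" and value in TRACK_TYPES:
                return value
    return None  # unknown format
```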