GenPlay File Formats

From GenPlay, Einstein Genome Analyzer

Revision as of 18:06, 23 November 2010 by Julien


.2bit format

Description

A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.

The file begins with a 16-byte header containing the following fields:

signature - the number 0x1A412743 in the architecture of the machine that created the file
version - zero for now. Readers should abort if they see a version number higher than 0.
sequenceCount - the number of sequences in the file.
reserved - always zero for now

All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
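
The signature check described above can be sketched in Python as follows. This is only an illustration, not GenPlay's own code: the function name read_twobit_header and the constant TWOBIT_SIGNATURE are hypothetical. Instead of swapping the bytes by hand, the sketch tries both byte orders with the standard struct module.

```python
import struct

# Signature value given in the format description.
TWOBIT_SIGNATURE = 0x1A412743


def read_twobit_header(data):
    """Return (byte_order, version, sequence_count) from the 16-byte header.

    byte_order is "<" (little-endian) or ">" (big-endian) and should be
    reused for every other multi-byte value in the file.
    """
    if len(data) < 16:
        raise ValueError("a .2bit header is 16 bytes long")
    # Trying both byte orders is equivalent to byte-swapping the signature.
    for byte_order in ("<", ">"):
        signature, version, sequence_count, _reserved = struct.unpack(
            byte_order + "4I", data[:16]
        )
        if signature == TWOBIT_SIGNATURE:
            break
    else:
        raise ValueError("not a .2bit file: signature does not match")
    if version > 0:
        raise ValueError("unsupported .2bit version: %d" % version)
    return byte_order, version, sequence_count
```

The returned byte order would then be passed on to whatever code reads the index and the sequence records.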

The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:

nameSize - a byte containing the length of the name field
name - the sequence name itself, of variable length depending on nameSize
offset - the 32-bit offset of the sequence data relative to the start of the file
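
As an illustration of the index layout, the sketch below walks the index entries of a file already loaded into memory. The function name read_twobit_index is hypothetical, and two assumptions are made that the description does not state: names are decoded as ASCII, and the index starts right after the 16-byte header.

```python
import struct


def read_twobit_index(data, sequence_count, byte_order="<", position=16):
    """Return a dict mapping each sequence name to its file offset.

    data is the whole file as bytes; position is where the index starts
    (assumed here to be right after the 16-byte header).
    """
    index = {}
    for _ in range(sequence_count):
        name_size = data[position]           # nameSize: a single byte
        position += 1
        name = data[position:position + name_size].decode("ascii")
        position += name_size
        # offset: 32-bit value, in the byte order found from the signature
        (offset,) = struct.unpack_from(byte_order + "I", data, position)
        position += 4
        index[name] = offset
    return index
```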

The index is followed by the sequence records, which contain nine fields:

dnaSize - number of bases of DNA in the sequence
nBlockCount - the number of blocks of Ns in the sequence (representing unknown sequence)
nBlockStarts - the starting position for each block of Ns
nBlockSizes - the size of each block of Ns
maskBlockCount - the number of masked (lower-case) blocks
maskBlockStarts - the starting position for each masked block
maskBlockSizes - the size of each masked block
reserved - always zero for now
packedDna - the DNA packed to two bits per base, represented as follows: T - 00, C - 01, A - 10, G - 11.

The first base is in the most significant 2 bits of a byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011. The packedDna field is padded with 0 bits as necessary to take a whole multiple of 32 bits in the file, which improves I/O performance on some machines.
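
The 2-bit encoding can be illustrated with a short Python sketch. The names pack_dna and unpack_dna are hypothetical. The sketch assumes the input contains only the letters T, C, A and G (N blocks and masking are stored in separate fields and are not handled here), and it packs to whole bytes only, without applying the 32-bit padding mentioned above.

```python
# Two-bit codes from the format description: T=00, C=01, A=10, G=11.
BASE_TO_BITS = {"T": 0, "C": 1, "A": 2, "G": 3}
BITS_TO_BASE = "TCAG"


def pack_dna(sequence):
    """Pack a string of T/C/A/G into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # A short final chunk is padded with 0 bits on the right, so the
        # first base always sits in the most significant 2 bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_dna(packed, dna_size):
    """Decode the first dna_size bases from packed bytes."""
    bases = []
    for i in range(dna_size):
        shift = 6 - 2 * (i % 4)  # 6, 4, 2, 0 within each byte
        bases.append(BITS_TO_BASE[(packed[i // 4] >> shift) & 0b11])
    return "".join(bases)
```

With these definitions, pack_dna("TCAG") yields the single byte 00011011, matching the example in the text.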

Usage

The .2bit files can be loaded as sequence tracks.

To be loaded, the file needs to have a '.2bit' extension.

BAM / SAM format

Description

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

BAM is a Binary version of the Sequence Alignment / Map (SAM) format.

More information about these file formats can be found in the SAM/BAM format specification.

Usage

SAM files can generate fixed window tracks. BAM files can't be used directly with GenPlay but can easily be converted to SAM files using tools available on the Internet.

To be loaded, a SAM file needs either to have a '.sam' extension or to have 'track type=sam' as its first line.

BED format

Description

BED format has three required fields and nine additional optional fields. The number of fields per line has to be constant throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must be filled if higher-numbered fields are used.

The first three required BED fields are:

chromosome - The name of the chromosome (e.g. chr1, chrX etc.) or scaffold (e.g. scaffold10461).
Start - The beginning position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
End - The ending position of the feature in the chromosome or scaffold.

The 9 additional optional BED fields are:

name - The name of the gene.
score - A score between 0 and 1000.
strand - Defines the strand as either '+' (red) or '-' (blue).
thickStart
thickEnd
itemRgb
blockCount
blockSizes
blockStarts

Example

Here is an example of a bed file.

track type=bed
searchURL="http://genome.ucsc.edu/cgi-bin/hgGene?org=Human&hgg_chrom=none&db=hg19&hgg_gene="
chr1	11873	14409	uc001aaa.3	0	+	11873	11873	0	3	354,109,1189,	0,739,1347,
chr1	11873	14409	uc010nxq.1	0	+	12189	13639	0	3	354,127,1007,	0,721,1529,
chr1	11873	14409	uc010nxr.1	0	+	11873	11873	0	3	354,52,1189,	0,772,1347,
chr1	14362	16765	uc009vis.2	0	-	14362	14362	0	4	467,69,147,159,	0,607,1433,2244,

Note: The search URL "http://genome.ucsc.edu/cgi-bin/hgGene?org=Human&hgg_chrom=none&db=hg19&hgg_gene=" is included at the beginning of each BED file. It gives the address of the pages that describe the genes. The complete URL is formed by appending the name of the gene to the search URL, after the equals sign. For example, "http://genome.ucsc.edu/cgi-bin/hgGene?org=Human&hgg_chrom=none&db=hg19&hgg_gene=uc001aaa.3" will provide information on WDR 78.
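
As a rough illustration, a BED data line and the gene URL can be handled in Python as sketched below. The function names parse_bed_line and gene_url are hypothetical and this is not GenPlay's own parser. The sketch assumes fields are separated by whitespace and that names contain no spaces.

```python
def _int_list(text):
    """Turn '354,109,1189,' into [354, 109, 1189] (trailing comma allowed)."""
    return [int(item) for item in text.split(",") if item]


# Optional fields in the order listed above, each with a converter.
OPTIONAL_FIELDS = [
    ("name", str), ("score", float), ("strand", str),
    ("thickStart", int), ("thickEnd", int), ("itemRgb", str),
    ("blockCount", int), ("blockSizes", _int_list), ("blockStarts", _int_list),
]


def parse_bed_line(line):
    """Parse one BED data line into a dict keyed by field name."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 fields")
    record = {
        "chromosome": fields[0],
        "start": int(fields[1]),  # the first base of a chromosome is 0
        "end": int(fields[2]),
    }
    # zip stops at the shorter sequence, so missing optional fields are skipped.
    for (name, convert), value in zip(OPTIONAL_FIELDS, fields[3:]):
        record[name] = convert(value)
    return record


def gene_url(search_url, gene_name):
    """Append the gene name to the search URL, after the equals sign."""
    return search_url + gene_name
```

For example, gene_url applied to the search URL above and the name field of the first data line gives the address ending in "hgg_gene=uc001aaa.3".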

Usage

A BED file with the 3 required fields can generate stripes. If the first 2 required fields are specified, the file can be loaded as a fixed window, a variable window or a repeat track. If all the fields are specified, the file can be loaded as a gene track.

To be loaded, a BED file needs either to have a '.bed' extension or to have 'track type=bed' as its first line.

BedGraph format

Description

The BedGraph format is a simple format for visualizing windows on the genome. Each window can have a score. The fields in a BedGraph file are the following:

Chromosome
Window start position
Window stop position
Score

Example

track type=bedgraph
chr1	18598	19673	1
chr1	124987	125426	3
chr1	317653	318092	15
chr1	427014	428027	8
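
A minimal Python sketch of reading such a file is given below. The function name parse_bedgraph is hypothetical. The sketch assumes whitespace-separated fields and treats any line starting with "track" as a header to skip.

```python
def parse_bedgraph(lines):
    """Return a list of (chromosome, start, stop, score) tuples."""
    windows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the 'track type=bedgraph' header line.
        if not line or line.startswith("track"):
            continue
        chromosome, start, stop, score = line.split()
        windows.append((chromosome, int(start), int(stop), float(score)))
    return windows
```

The function accepts any iterable of lines, so an open file object can be passed directly.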

Usage

BedGraph files can be used to load fixed window and variable window tracks.

They can also be loaded as stripes.

A valid BedGraph file needs either to have a '.bgr' extension or to have 'track type=bedgraph' as a first line.

GFF format

Description

GFF stands for General Feature Format. GFF files have 9 compulsory, tab-delimited fields. They are as follows:

seqname - The name of the sequence. Must be a chromosome.
source - The program that generated this feature.
feature - The name of this type of feature.
start - The starting position of the feature in the sequence. The first base is numbered 1.
end - The ending position of the feature (inclusive).
score - A score between 0 and 1000.
strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
frame - A number between 0 and 2 if the feature is a coding exon; '.' otherwise.
group - Lines belonging to the same group are linked together into a single item.

More information about this format can be found on the Sanger website.
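
The nine fields can be read with a short Python sketch like the one below. The names parse_gff_line and feature_length are hypothetical. One assumption goes beyond the description above: a score of '.' is accepted and stored as None.

```python
GFF_FIELDS = ("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group")


def parse_gff_line(line):
    """Parse one tab-delimited GFF line into a dict keyed by field name."""
    # Split at most 8 times so tabs inside the group field are preserved.
    values = line.rstrip("\n").split("\t", 8)
    if len(values) != 9:
        raise ValueError("a GFF line has 9 tab-delimited fields")
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])  # 1-based
    record["end"] = int(record["end"])      # inclusive
    # Assumption: '.' means "no score"; it is stored as None.
    record["score"] = None if record["score"] == "." else float(record["score"])
    # frame is '.' for features that are not coding exons.
    record["frame"] = None if record["frame"] == "." else int(record["frame"])
    return record


def feature_length(record):
    """Length in bases; +1 because start is 1-based and end is inclusive."""
    return record["end"] - record["start"] + 1
```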

Usage

GFF files can be used to load repeat, variable window and fixed window tracks, as well as stripes.

A GFF file needs either to have a '.gff' extension or to have '##GFF ' as its first line.

GTF format

Description

GTF stands for Gene Transfer Format and is a stricter version of GFF. The first 8 GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute is a type/value pair. Attributes must be terminated by a semi-colon. The gap between any two attributes must be exactly one space. The attribute list must begin with the two required attributes:

gene_id value - A GUID for the genomic source of the sequence
transcript_id value - A GUID for the predicted transcript

Refer to the Sanger website for further information.
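
The attribute list can be split with a sketch such as the following. The function name parse_gtf_attributes is hypothetical. The sketch assumes that values may be wrapped in double quotes and never contain a semi-colon, which is a simplification.

```python
def parse_gtf_attributes(text):
    """Parse 'gene_id "g1"; transcript_id "t1";' into a dict.

    Raises ValueError unless the list begins with gene_id and
    transcript_id, in that order.
    """
    attributes = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # empty piece after the final semi-colon
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')
    # dicts keep insertion order, so the first two keys are the first
    # two attributes of the line.
    if list(attributes)[:2] != ["gene_id", "transcript_id"]:
        raise ValueError("attributes must begin with gene_id and transcript_id")
    return attributes
```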

Usage

Stripes, fixed and variable windows, repeat and gene tracks can be generated from a GTF file.

A GTF file must either have a '.gtf' extension or have '##GTF' as its first line.
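
The loading rules given in the Usage sections so far (extension or first line) can be gathered into one Python sketch. The function name guess_format and the format labels it returns are hypothetical. How GenPlay actually compares the first line is not stated here; the sketch assumes a simple prefix match and checks the extension first.

```python
import os

# Extensions named in the Usage sections.
EXTENSIONS = {
    ".2bit": "2bit", ".sam": "sam", ".bed": "bed",
    ".bgr": "bedgraph", ".gff": "gff", ".gtf": "gtf",
}

# First-line markers named in the Usage sections. 'track type=bedgraph'
# comes before 'track type=bed' so that the shorter prefix does not win.
FIRST_LINE_PREFIXES = [
    ("track type=bedgraph", "bedgraph"),
    ("track type=sam", "sam"),
    ("track type=bed", "bed"),
    ("##GFF", "gff"),
    ("##GTF", "gtf"),
]


def guess_format(filename, first_line=""):
    """Return a format label, or None if neither rule matches."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension]
    stripped = first_line.strip()
    for prefix, label in FIRST_LINE_PREFIXES:
        if stripped.startswith(prefix):
            return label
    return None
```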

PSL format

Description

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. All of the following fields are required on each data line within a PSL file:

matches - Number of bases that match that aren’t repeats
mismatches - Number of bases that don't match
repeats - Number of bases that match but are part of repeats
nbasecount - Number of 'N' bases
numofinsertsquery - Number of inserts in query
numofbaseinsertsquery - Number of bases inserted in query
numofinsertstarget - Number of inserts in target
numofbaseinsertstarget - Number of bases inserted in target
strand - '+' or '-' for query strand. For translated alignments, the second '+' or '-' is for the genomic strand
queryseqname - Query sequence name
queryseqsize - Query sequence size
querystart - Alignment start position in query
queryend - Alignment end position in query
targetseqname - Target sequence name
targetseqsize - Target sequence size
targetstart - Alignment start position in target
targetend - Alignment end position in target
blockcount - Number of blocks in the alignment (a block contains no gaps)
blocksizes - Comma-separated list of sizes of each block
querystarts - Comma-separated list of starting positions of each block in query
targetstarts - Comma-separated list of starting positions of each block in target
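
A PSL data line can be turned into a record with a sketch like the one below. The function name parse_psl_line is hypothetical and the keys simply follow the field list above. The sketch assumes tab-separated fields and does not handle the header lines that some PSL files carry.

```python
PSL_FIELDS = (
    "matches", "mismatches", "repeats", "nbasecount",
    "numofinsertsquery", "numofbaseinsertsquery",
    "numofinsertstarget", "numofbaseinsertstarget",
    "strand", "queryseqname", "queryseqsize", "querystart", "queryend",
    "targetseqname", "targetseqsize", "targetstart", "targetend",
    "blockcount", "blocksizes", "querystarts", "targetstarts",
)
TEXT_FIELDS = {"strand", "queryseqname", "targetseqname"}
LIST_FIELDS = {"blocksizes", "querystarts", "targetstarts"}


def parse_psl_line(line):
    """Parse one PSL data line (21 tab-separated fields) into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PSL_FIELDS):
        raise ValueError("a PSL line has %d fields" % len(PSL_FIELDS))
    record = {}
    for name, value in zip(PSL_FIELDS, values):
        if name in TEXT_FIELDS:
            record[name] = value
        elif name in LIST_FIELDS:
            # Comma-separated list, usually with a trailing comma.
            record[name] = [int(item) for item in value.split(",") if item]
        else:
            record[name] = int(value)
    # Each per-block list should have blockcount entries.
    for name in LIST_FIELDS:
        if len(record[name]) != record["blockcount"]:
            raise ValueError("%s does not match blockcount" % name)
    return record
```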

Usage

WIG format

Wiggle format (WIG) allows the display of continuous-valued data in a track format.

Retrieved from "http://genplay.net/wiki/index.php?title=GenPlay_File_Formats&oldid=199"
