Difference between revisions of "Documentation"

From GenPlay, Einstein Genome Analyzer


Revision as of 11:53, 26 March 2014


Getting started

Starting GenPlay

GenPlay is freely available at http://www.genplay.net/wiki/index.php/Web_Start. To start the software, click the button corresponding to the amount of memory that you wish to allocate to the Java virtual machine.

The amount of memory determines how many layers you can load simultaneously. The programming philosophy behind GenPlay is to provide fast performance once the data is loaded. To achieve that goal, the entire genome needs to be loaded in memory for multiple layers at the same time. This makes operations fast but requires a lot of memory. The amount of memory needed per layer depends on the genome, the layer type, the window size, the data precision, etc.

You should generally allocate as much memory as your system can afford (about 70% of the total RAM installed on your system). For mammalian genomes we recommend allocating at least 4 GB of RAM, although you should be able to load a couple of genome-wide layers with 1 GB or 1.5 GB of RAM. Analyzing only one chromosome at a time drastically reduces the memory requirement and should allow you to load many layers at very high resolutions. Layers loaded in GenPlay can also be compressed, as explained later in this documentation.

The amount of RAM available to GenPlay is displayed in the lower right corner of the screen.

The Welcome screen

The welcome screen is the first screen of GenPlay-MG and allows users to create or load a project.

New Project

In order to create a new project, users must give it a name.

Text field to define the project name

The precision of the project will change the number of bits used to code numbers.

  • High-Precision: Numbers are coded using 32 bits which offers the highest precision level in GenPlay.
  • Low-Precision: Numbers are coded using 16 bits. It may be useful to lower memory usage. However, the maximum score is 65504 and decimals may be rounded in a different way (here for more information).
Project precision
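
The 65504 limit matches the largest finite value of an IEEE 754 half-precision (16-bit) float. Assuming this is the encoding behind the Low-Precision option (the text above does not state it explicitly), the rounding behaviour can be previewed with Python's standard struct module. The helper name to_low_precision is illustrative only and is not part of GenPlay.

```python
import struct


def to_low_precision(value: float) -> float:
    """Round-trip a number through a 16-bit IEEE 754 half-precision float.

    Assumption: GenPlay's Low-Precision mode behaves like binary16.
    """
    # "e" is the struct format code for half precision (binary16).
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    # Small decimals and large integers both lose precision in 16 bits.
    for score in (0.1, 4097.0, 65504.0):
        print(score, "->", to_low_precision(score))
```

Scores that matter to several significant digits are therefore better kept in a High-Precision project.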

The second step is to choose a reference genome. Users select it using the lists for the clade, the genome and the assembly.

Assembly chooser

Several chromosomes are available for each assembly but users can choose to select only some of them.

To open the chromosome chooser, users have to click on the tools button next to the assembly name.

Chromosome chooser

The third and last step is to choose between a Single Genome Project and a Multi Genome Project. If the Multi Genome Project option is selected, the welcome screen should look like the one shown in the figure below.

Empty welcome screen for multi-genome project
Single Genome Project

The Single Genome Project is the most common/regular project in GenPlay. If you do not know or understand yet what the Multi Genome Project is, please use the Single Genome Project.

Multi Genome Project
Introduction
VCF Files

VCF files describe differences between genomes, usually between one or several genomes of interest and the reference genome used for the mapping process. VCF files define multiple types of variations; GenPlay is able to read and represent the following:

  • InDels
  • SNPs
  • SV (Structural Variation)

A complete description of VCF files is given on the 1000 genomes project website:

Variant Call Format specification

Tabix
1. Introduction

VCF files contain a lot of information which makes the scanning (loading) processes longer.

In order to increase the scanning efficiency, VCF files have to be compressed and indexed. The compression is done using BGZip and the indexing with Tabix.

Tabix manual reference pages

Tabix download

2. VCF files indexing methods
2.1. Using GenPlay

GenPlay is now able to compress and index VCF files using the VCF Loader.

The way the VCF Loader works is explained below. Where you would normally select the compressed file (.vcf.gz), simply select the VCF file (.vcf) instead. You may need to change the file extension filter in the file chooser in order to see .vcf files.

GenPlay will then look for the compressed and indexed files at the same location. If nothing is found, it will offer to compress and index the selected VCF file (Figure 1).

Figure 1: VCF Loader compress/index

The process is fully automatic and platform independent (it works on Windows, Linux and Mac).

2.2. Manually

First, please note the following process must be performed in either Linux or Mac environments.

Each VCF file must first be compressed to the BGZF format (.gz file). Tabix provides a tool, bgzip, to perform the compression. After compression, the VCF file must be indexed using the tabix command. Once Tabix is installed, two commands are necessary to compress and index a file.

Available commands from the Tabix folder:

bgzip -f VCF_PATH;

tabix -p vcf VCF_PATH.gz;

For example, a VCF file named my_vcf.vcf located in the same folder as Tabix can be indexed with the following commands (Figure 2):

bgzip -f ./my_vcf.vcf;

tabix -p vcf ./my_vcf.vcf.gz;

Figure 2: VCF file indexing command

Note: the first command replaces the current VCF file with the compressed VCF file (.vcf.gz). The second command creates the index file (.vcf.gz.tbi) in the current folder.

More options are available on Tabix manual reference pages.
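
For scripted use, the two manual steps can be assembled programmatically. The following is a minimal sketch that only builds the command lines described above (bgzip -f, then tabix -p vcf on the resulting .gz file); the function name build_index_commands is hypothetical, and running the commands assumes that bgzip and tabix are installed and on the PATH.

```python
import shlex
from pathlib import Path


def build_index_commands(vcf_path: str) -> list[list[str]]:
    """Return the bgzip and tabix commands for one uncompressed VCF file."""
    vcf = Path(vcf_path)
    # bgzip replaces my_vcf.vcf with my_vcf.vcf.gz, so tabix targets the .gz file.
    compressed = vcf.with_name(vcf.name + ".gz")
    return [
        ["bgzip", "-f", str(vcf)],
        ["tabix", "-p", "vcf", str(compressed)],
    ]


if __name__ == "__main__":
    for command in build_index_commands("my_vcf.vcf"):
        # Print the commands; pass each list to subprocess.run(command, check=True)
        # to execute them for real.
        print(shlex.join(command))
```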

The VCF Loader
1. Introduction

The VCF Loader is the most important part of multi-genome project settings. It allows users to load all necessary VCF files and to define how to extract information from them. It appears when users click on the "Edit" button from the welcome screen.

Figure 3 shows an empty VCF Loader screen.

Figure 3: VCF loader

GenPlay-MG does not use the VCF file directly; it uses a compressed version of it (.gz). GenPlay-MG also needs the compressed VCF file to be indexed with Tabix. Both files must be in the same folder and must have the same name; only the file extensions differ (.gz and .tbi). To use GenPlay to generate these additional files, please refer to the section above.
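
Before setting up a project, it can help to check that both companion files are present. The sketch below assumes the naming produced by the manual commands above (my_vcf.vcf.gz and my_vcf.vcf.gz.tbi next to the original .vcf path); the function name missing_companion_files is hypothetical.

```python
from pathlib import Path


def missing_companion_files(vcf_path: str) -> list[str]:
    """List the compressed/index files that are absent for a given .vcf path."""
    vcf = Path(vcf_path)
    expected = [
        vcf.with_name(vcf.name + ".gz"),      # BGZF-compressed VCF
        vcf.with_name(vcf.name + ".gz.tbi"),  # Tabix index of the compressed file
    ]
    return [path.name for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_companion_files("my_vcf.vcf")
    print("Missing files:", missing if missing else "none")
```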

The user can add or remove rows by right clicking on the table.

2. Columns description

File

This column refers to the VCF file path. Once loaded, the raw name column is automatically filled with every raw genome name contained in the selected VCF file.

Raw name

The Raw name column list is automatically filled when a VCF file has been chosen. That list contains every genotype header found in the selected VCF file. Because genome names might be difficult to remember, GenPlay-MG offers users the option of adding another name (an alias) using the Nickname column.

Nickname

The Nickname column allows users to associate an alias with the selected genome. This alias will appear in GenPlay-MG and can be useful because genome names in VCF files are often non-descriptive numbers that can be hard to remember.

Group

Users can gather genomes by group. Group names are used to distinguish genomes and to perform some specific functionalities.

3. Columns edition

The Group, Nickname and File columns each have their own editable list. To edit a cell, click on it, hover over the item you want to edit and choose one of the following actions:

- Add (green symbol on empty item)

- Edit (pen symbol on an item)

- Delete (red symbol on an item)

That way, users can set up all columns before filling the table, or while filling it.

Note: The Raw name column is automatically filled with the genome names from the selected VCF file. That column cannot be edited manually.

Import/Export

Once a project has been set up, it can be saved using the import/export function. Pressing the export button saves an XML file to the hard drive. This XML file can then be imported to reload the project.

The XML file structure is simple. Each row of the table is stored in a row element whose attributes include group, genome, file and raw_name. The settings file is formatted as shown in Figure 4.

Figure 4: XML file settings

Note: If the user moves the VCF files or changes one of their genotype headers, the XML file will no longer work. The user has to update the file and/or raw_name attribute values accordingly.

Load Project

Load an existing project

In order to load a project, the user has to select the "Load an existing project" option.

The 5 most recent projects are listed in the lower part of the dialog. An additional option, "Other", lets the user select a GenPlay project file to load.

The upper part updates automatically when a project is selected and displays the following information:

  • Name: The name of the project.
  • Precision: The precision of the project, either high or low.
  • Genome: The genome used.
  • Project type: The type of project, either single or multi-genome.
  • Last modified: The last time the project was modified.
  • Track number: The number of tracks in the project.

GUI Overview

GUI Overview 1.Ruler 2.Track List 3.Control Panel 4.Status Bar

The GenPlay main window is divided into 4 main parts:

  1. Ruler
  2. Track List
  3. Control Panel
  4. Status Bar

Ruler

The ruler shows the coordinates of the current displayed position.

Ruler 1.Option Button 2.Absolute Positions 3.Relative Positions

General Option Button

The button on the left of the ruler opens the pop-up menu with all the general options.

Absolute Positions

The numbers written in red on top of the ruler are the absolute positions on the selected chromosome or scaffold.

The number on the left is the position of the first displayed base. This value can be negative.

The number in the middle is the position of the red line. This value can go from 0 to the length of the current chromosome or scaffold as specified in the chromosome configuration file.

The value on the right is the last displayed position. This value ranges from 1 to 2*(chromosome length).

Relative Positions

The numbers written in black on the second line represent the distance from the middle in base pairs.

Track List

The track list is the cornerstone of the GUI. From here you can load layers and execute operations.

The tracks are divided into two parts.

On the left, there is the track handler that becomes highlighted when the mouse is over it. By right clicking on the track handler, a contextual menu appears with all the operations that can be executed on the track and its layer(s).

On the right, the data can be visualized.

Control Panel

Control Panel 1.Position Bar 2.Zoom Bar 3.Chromosome Box 4.Position Text Field

The control panel is divided into 4 parts:

  1. Position Bar: the position bar allows you to change the position of the currently displayed window
  2. Zoom Bar: use the zoom bar to modify the level of zoom
  3. Chromosome Box: set the selected chromosome with the chromosome box
  4. Position Text Field: the position text field follows the format of the UCSC Genome Browser position field so it is easy to copy and paste the position from one browser to the other
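
As an illustration of the position format, here is a minimal parser for UCSC-style positions of the form chromosome:start-stop, where thousands separators are allowed in the numbers. The function name parse_position and the exact set of accepted variants are assumptions; GenPlay's own parser may accept other forms.

```python
import re

# chromosome name, a colon, then start and stop separated by a dash;
# digits may be grouped with commas, as in the UCSC Genome Browser.
POSITION_PATTERN = re.compile(
    r"^\s*(?P<chrom>[^:\s]+)\s*:\s*(?P<start>[\d,]+)\s*-\s*(?P<stop>[\d,]+)\s*$"
)


def parse_position(text: str) -> tuple[str, int, int]:
    """Parse 'chr1:1,000,000-2,000,000' into ('chr1', 1000000, 2000000)."""
    match = POSITION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Not a chromosome:start-stop position: {text!r}")
    start = int(match["start"].replace(",", ""))
    stop = int(match["stop"].replace(",", ""))
    if start > stop:
        raise ValueError("Start position is greater than stop position")
    return match["chrom"], start, stop


if __name__ == "__main__":
    print(parse_position("chr1:1,000,000-2,000,000"))
```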

Status Bar

Status Bar 1.Progress Bar 2.Stop Button 3.Operation Description 4.Memory Bar

The status bar helps monitor the progress of the current operation as well as memory usage. It is divided into 4 sub-components:

  1. Progress bar, shows the level of completion of the current operation
  2. Stop button, allows users to stop the current operation. If the button is not bright red the operation can't be stopped
  3. Operation description, displays a short text describing the current operation as well as the elapsed time from the beginning of the operation
  4. Memory bar, shows the amount of memory used and the amount of memory available. Make sure that you have enough memory before starting a new operation. You can delete or compress layers to free up memory.

Browsing the Genome

Changing the Position

You can change the position of the displayed window by:

  1. Dragging any track on the left or on the right with the left button of the mouse
  2. Clicking with the middle button of the mouse inside a track and then moving the cursor on the left or on the right of the middle red line
  3. Moving the knob of the position bar on the control panel
  4. Changing the value of the position text field on the control panel
  5. Using the keyboard left and right arrows
  6. Double-clicking on a track where you want to center the view

Changing the Chromosomes

You can switch the selected chromosome by:

  1. Changing the selection in the chromosome box on the control panel
  2. Changing the text of the position text field on the control panel

Changing the Zoom

The level of the zoom can be modified by:

  1. Wheeling up or down inside a track with the mouse wheel
  2. Using the zoom bar on the control panel
  3. Changing the text of the position text field on the control panel

Loading a Layer

Introduction

Layers are how GenPlay displays information from files. They can represent information in different ways.

A layer is created in a track; each track can contain one or several layers.

To load a layer in a track, right click on its handler (the blue part on the left of the track). This opens a contextual menu with the different actions available on the track.

The menu of a track that contains no layer looks like the one in Figure 1.

Clicking "Add Layer" opens a dialog to select one of the layer types that GenPlay offers (Figure 2).

Examples of layers that can be loaded in GenPlay are available for download from the GenPlay Library accessible from the GenPlay.net website.

Loading a Sequencing/Microarray Layer

The Sequencing/Microarray layer allows the visualization of windows of variable or fixed size, each with an associated score. Select the “Sequencing/Microarray Layer” option. This opens up a file chooser dialog box. Select the file of your choice from the list of available window files and click the Open button.

Please refer to the File formats section if you want to know what kind of file can be loaded as a sequencing/microarray layer.

This opens a new dialog to set different parameters for the new layer (as shown on the figure below). The dialog is separated in 6 sections detailed below.

New Layer Settings Dialog

Layer Name

Gives a name to the layer.

Bin

By default, the windows generated in a sequencing/microarray layer have a variable size, which represents the content of the file very precisely.

For other purposes, users may want fixed-size windows (bins). They are useful to represent the results of many types of experiments including, but not limited to, ChIP-seq, RNA-seq, and TimEX-seq. Files containing the results of alignments (SAM, bowtie, Eland) and files containing already created bin lists (bed, bgr, etc.) can be loaded using this option. In the case of alignment files, bin lists will be created on the fly as described below. Files containing the results of micro-array experiments can also be loaded as long as they are in one of the accepted formats.

Binning lowers the resolution but usually reduces memory usage.

Binning is enabled with the "Bin Data" option. The "Bin Size" field then becomes available to set the size of the windows in base pairs.

Important Note: A bin size of 1 bp will use a lot of memory. Depending on the experiment, it may be more efficient to disable the Bin Data option and stay in variable window size mode.

Score Calculation

Name and Score Calculation

It can happen that files contain overlapping windows. In this case, GenPlay splits them into smaller windows using a simple algorithm.

This algorithm can be chosen in that section offering the following possibilities:

  • Addition
  • Average
  • Maximum
  • Minimum

Some examples are shown in the sections below for both non-binned and binned layers.

Strand

If your input file contains information regarding the strands, you'll be able to choose to load the data from either both or only one strand.

You can also decide to shift the reads from both strands as shown in the figure on the left. To shift the strands just put a value in the "Shift" input box.

The value you entered is going to be added to the position of the data on the 5' strand and subtracted from the ones on the 3' strand.
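
The shift rule can be written in a few lines. In this sketch the 5' strand is assumed to correspond to the "+" strand of the input file and the 3' strand to the "-" strand; the function name shift_position is hypothetical.

```python
def shift_position(position: int, strand: str, shift: int) -> int:
    """Apply the strand shift: add on one strand, subtract on the other.

    Assumption: "+" stands for the 5' strand and "-" for the 3' strand.
    """
    if strand == "+":
        return position + shift
    if strand == "-":
        return position - shift
    raise ValueError(f"Unknown strand: {strand!r}")


if __name__ == "__main__":
    # With a shift of 50, reads on opposite strands move toward each other.
    print(shift_position(1000, "+", 50), shift_position(1100, "-", 50))
```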

Fragment Length

Selected Chromosomes

By default all the chromosomes of the project are selected. If you want to change this selection, click on the "modify selection" button and uncheck the undesired chromosomes. Working on fewer chromosomes will save memory and loading time.

Important Note: GenPlay can accelerate the loading if you know that your file is sorted by chromosome. If you press Yes when GenPlay asks whether the file is sorted and the file is actually not sorted, the file may load incompletely, leading to a loss of valuable information. The chromosomes must be ordered the same way as they are in the chromosome selection combo-box.
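
If you are unsure whether a file is sorted, the order can be checked before answering Yes. The sketch below takes the chromosome column of a file and the order shown in the chromosome selection combo-box; the function name is hypothetical, and ignoring chromosomes that are not part of the project is an assumption.

```python
def is_sorted_by_chromosome(chromosomes_in_file, project_order) -> bool:
    """Return True if the file's chromosomes follow the project's order."""
    rank = {name: index for index, name in enumerate(project_order)}
    last_rank = -1
    for name in chromosomes_in_file:
        if name not in rank:
            continue  # assumption: chromosomes outside the project are skipped
        if rank[name] < last_rank:
            return False  # a chromosome reappears after a later one
        last_rank = rank[name]
    return True


if __name__ == "__main__":
    order = ["chr1", "chr2", "chr10"]
    print(is_sorted_by_chromosome(["chr1", "chr1", "chr2", "chr10"], order))
    print(is_sorted_by_chromosome(["chr1", "chr10", "chr2"], order))
```

Note that a plain alphabetical sort places chr10 before chr2, which usually does not match the order of the combo-box.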

Examples of Score Calculations

For non-binned layers
Example 1

Input file

Chr Start Stop Score
Chr1 1125 1126 1
Chr1 1135 1136 1
Chr1 1135 1136 1
Chr1 1149 1150 1
Chr1 1175 1176 1
Chr1 1210 1211 1
Chr1 1230 1231 1
Chr1 1340 1341 1
Chr1 1345 1346 1


Result

Loading of an alignment file as a variable window layer




Example 2
Chr Start Stop Score
Chr1 1020 1120 30
Chr1 1120 1300 120
Chr1 1010 1350 100


Loading of an interval file as a variable window layer


Result

Chr   Start  Stop  Average                 Maximum               Sum
Chr1  1010   1020  100                     100                   100
Chr1  1020   1120  (100 + 30) / 2 = 65     Max(100, 30) = 100    100 + 30 = 130
Chr1  1120   1300  (100 + 120) / 2 = 110   Max(100, 120) = 120   100 + 120 = 220
Chr1  1300   1350  100                     100                   100
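
The splitting of overlapping windows in Example 2 can be reproduced with a short sketch. It cuts the intervals at every start and stop position and combines the scores of the intervals covering each piece. This is an illustration of the idea rather than GenPlay's actual implementation, and the function names are hypothetical.

```python
def split_overlaps(intervals):
    """Cut (start, stop, score) intervals at every boundary.

    Returns (start, stop, scores) pieces, where scores lists the score of
    each input interval that fully covers the piece.
    """
    boundaries = sorted({pos for start, stop, _ in intervals for pos in (start, stop)})
    pieces = []
    for left, right in zip(boundaries, boundaries[1:]):
        scores = [score for start, stop, score in intervals
                  if start <= left and right <= stop]
        if scores:  # skip gaps that no interval covers
            pieces.append((left, right, scores))
    return pieces


def summarize(intervals):
    """Return (start, stop, average, maximum, minimum, sum) for each piece."""
    return [(left, right, sum(s) / len(s), max(s), min(s), sum(s))
            for left, right, s in split_overlaps(intervals)]


EXAMPLE_2 = [(1020, 1120, 30), (1120, 1300, 120), (1010, 1350, 100)]

if __name__ == "__main__":
    for row in summarize(EXAMPLE_2):
        print(*row, sep="\t")
```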
For binned layers
Example 1

Loading of an alignment file as a fixed window layer with a window size of 100:

(each line represents one read position, score is always one)

Input file

Chr Start Stop Score
Chr1 1125 1126 1
Chr1 1135 1136 1
Chr1 1135 1136 1
Chr1 1149 1150 1
Chr1 1175 1176 1
Chr1 1210 1211 1
Chr1 1230 1231 1
Chr1 1340 1341 1
Chr1 1345 1346 1


Loading of an alignment file as a fixed window layer with a window size of 100


Result

Chr   Start  Stop  Average  Maximum  Sum
Chr1  1100   1200  1        1        5
Chr1  1200   1300  1        1        2
Chr1  1300   1400  1        1        2




Example 2

Loading of an alignment file as a fixed window layer with a window size of 100:

(each line represents one read position, score varies)

Input file

Chr Start Stop Score
Chr1 1125 1126 1
Chr1 1135 1136 3
Chr1 1145 1146 1
Chr1 1149 1150 1
Chr1 1175 1176 1
Chr1 1210 1211 1
Chr1 1230 1231 1
Chr1 1340 1341 6
Chr1 1345 1346 1


Loading of an alignment file as a fixed window layer with a window size of 100


Result

Chr   Start  Stop  Average       Maximum  Sum
Chr1  1100   1200  7 / 5 = 1.4   3        7
Chr1  1200   1300  1             1        2
Chr1  1300   1400  7 / 2 = 3.5   6        7
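
The binning of single-position reads can be sketched as follows. Each read is assigned to the bin that contains its start position (bin start = start rounded down to a multiple of the bin size), which is an assumption about how bin boundaries are aligned; the function name bin_reads is hypothetical. With the input of Example 2 and a bin size of 100, the reads fall into the bins starting at 1100, 1200 and 1300.

```python
from collections import defaultdict


def bin_reads(reads, bin_size):
    """Group (start, score) reads into fixed-size bins.

    Returns (bin_start, bin_stop, average, maximum, sum) rows sorted by
    position; bins that receive no read are omitted.
    """
    bins = defaultdict(list)
    for start, score in reads:
        bin_start = (start // bin_size) * bin_size  # round down to the bin boundary
        bins[bin_start].append(score)
    return [(b, b + bin_size, sum(s) / len(s), max(s), sum(s))
            for b, s in sorted(bins.items())]


EXAMPLE_2_READS = [(1125, 1), (1135, 3), (1145, 1), (1149, 1), (1175, 1),
                   (1210, 1), (1230, 1), (1340, 6), (1345, 1)]

if __name__ == "__main__":
    for row in bin_reads(EXAMPLE_2_READS, 100):
        print(*row, sep="\t")
```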




Example 3

Loading of an interval file as a fixed window layer with a window size of 100:

Input file

Chr Start Stop Score
Chr1 1020 1120 30
Chr1 1120 1300 120
Chr1 1010 1350 100


Loading of an interval file as a fixed window layer with a window size of 100


Result

Chr   Start  Stop  Average                         Maximum                   Sum
Chr1  1000   1100  (26.47 + 24) / 2 = 25.23        Max(26.47, 24) = 26.47    26.47 + 24 = 50.47
Chr1  1100   1200  (29.41 + 6 + 60) / 3 = 31.80    Max(29.41, 6, 60) = 60    29.41 + 6 + 60 = 95.41
Chr1  1200   1300  (29.41 + 60) / 2 = 44.70        Max(29.41, 60) = 60       29.41 + 60 = 89.41
Chr1  1300   1400  14.70                           14.70                     14.70

Loading a Gene Annotation Layer

A Gene Layer
Score Color

Select the “Gene Layer" option. This opens up a file chooser dialog box that allows you to select the file that you want to load. Please refer to the File formats section if you want to know what kind of file can be loaded as a gene layer.

Once it's done, just wait until the loading is complete and the gene layer will appear in the track you selected.

Note that the genes on the plus strand are in red and the genes on the minus strand are in blue. If the file contains expression values, the exons are color coded to represent the expression (red = high, blue = low, as shown on the right).

Loading a Repeat Family Layer

Select the "Repeat Layer" option. This opens up a file chooser dialog box that allows you to select the file that you want to load. Please refer to the File formats section if you want to know what kind of file can be loaded as a repeat layer.

This layer type displays repeats organized by family or class.

Loading a DNA Sequence Layer

Select the “DNA Sequence Layer” option. This opens up a file chooser dialog box that allows you to select the file that you want to load. Please refer to the File formats section if you want to know what kind of file can be loaded as a sequence layer.

A Sequence Layer

Sequence layers show DNA sequences from .2bit files.

The hg18, hg19, mm8 and mm9 sequence files can be downloaded from the library of GenPlay.

Loading a Mask Layer

Select the "Mask Layer" option. The stripes acting as masks can be useful to show regions of interest such as CpG Islands or repeat regions.

Please refer to the File formats section if you want to know what kind of file can be loaded as a mask layer.

Loading a Variant Layer

Add a Variant Layer


Select the "Variant Layer" option; this option is only available in multi-genome projects. A dialog opens to select which sample to load and which variation types to display. A variant layer corresponds to a single sample. The color of each variation type can be changed independently by clicking on the colored square next to its checkbox.

Multi-Genome Features

Select Coordinate System
Coordinate System chooser

The coordinate system of GenPlay can be changed by selecting one from the list located at the bottom right of the main frame. The default system is that of the Meta Genome; the Reference Genome coordinate system is also available, as is that of any loaded genome. This does not affect operations, only the red position numbers at the top of the frame and the position search bar at the bottom.

Multi-Genome Project Properties
Properties Dialog Button

In Multi-Genome Projects only, a new button appears on the bottom left of the frame. This button leads to the Multi-Genome Project Properties dialog allowing the user to visualize and handle the project settings. Right-clicking on the button opens a contextual menu offering shortcuts to the different sections of the properties dialog.

General
General Section

The General section is an overview of how the project has been loaded. Projects can be very complex, using many files and samples. This section reminds the user how the project has been set up.

Settings
Settings Section

The Settings section lets the user choose how to handle various multi-genome options.

  • Properties Dialog
    • Default section to open: the default section of the Multi-Genome Project Properties dialog to open when clicking the button.
  • VCF Loader
    • Default group text name: Default name for groups.
  • Stripes transparency: Sets the transparency of stripes representing variations.
  • Global display settings
    • Show legend: Shows the enabled variations and their colors in the track layer.
  • Variant stripes settings
    • Show filtered variation: Filtered variations can be shown but will be represented with a cross over their stripes.
    • Show border of insertion: Insertion stripes have a specific border, which may help to recognize them easily when many layers are loaded, independently of the color.
    • Show border of deletion: Deletion stripes have a specific border, which may help to recognize them easily when many layers are loaded, independently of the color.
    • Show nucleotides of insertion stripes: Added nucleotides will be retrieved from the VCF files if possible.
    • Show nucleotides of deletion stripes: Deleted nucleotides will be retrieved from the VCF files if possible.
    • Show nucleotides of SNP stripes: SNP nucleotides will be retrieved from the VCF files if possible.
  • Reference stripes settings
    • Show reference stripes: Stripes representing the reference genome can be either shown or hidden.
    • Reference stripes color: Defines a color for reference stripes.
Files

The Files section lists all the VCF files loaded into GenPlay. The information for each file is separated into two categories:

  • Information: this part shows the name and the location of the file. It also breaks down the header of the VCF file for easier reading and interpretation.
  • Statistics: this part gives various descriptive statistics for the file and for each sample. All tables can be copied and pasted as regular tab-delimited text.
Filters

The filters section is covered in the section below.

Loading Data From a DAS Server

The distributed annotation system (DAS) is a client-server system in which a client can retrieve data from one or multiple servers. GenPlay can connect to any server that follows the DAS/1 protocol as specified by BioDAS.

DAS Dialog

The “Add Layer from DAS Server” option from the track handler menu will show the DAS Dialog.

Select the server from which you want to retrieve the data in the "Server" box.

Then select the "Data Source". Most of the time, the Data Source corresponds to the reference genome that you want to work on.

Once that's done you need to select the data that you want to retrieve in the "Data Type" box.

GenPlay can either generate a gene layer or a variable window layer from the retrieved data. You can select what type of output layer you want in the "Generate" option.

Finally, you can also choose to download data on only a part of the genome. This can be useful because retrieving data from a DAS server can be time consuming.
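
For reference, a DAS/1 features request restricted to part of the genome is an HTTP query of the form SERVER/DATA_SOURCE/features?segment=CHROMOSOME:START,STOP. The sketch below only builds such a URL; the server address and data source name are placeholders, the function name is hypothetical, and this is not necessarily how GenPlay issues its requests.

```python
from urllib.parse import urlencode


def das_features_url(server_url: str, data_source: str,
                     chromosome: str, start: int, stop: int) -> str:
    """Build a DAS/1 'features' query URL for one genomic segment."""
    base = server_url.rstrip("/")
    # Keep ':' and ',' unescaped so the segment reads chromosome:start,stop.
    query = urlencode({"segment": f"{chromosome}:{start},{stop}"}, safe=":,")
    return f"{base}/{data_source}/features?{query}"


if __name__ == "__main__":
    # example.org and hg19 are placeholder values, not a real DAS server.
    print(das_features_url("http://example.org/das/", "hg19", "chr1", 1000, 2000))
```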

Note: The DAS server section shows how to add new servers to the list of available servers in the DAS dialog.

Main Menu

Main Menu

On GenPlay’s main screen, click on the top left button (shown by a little hammer and wrench) to pop up the main menu.

New Project

This will pop up the welcome screen in order to start a new project. All work not saved will be lost.

Load / Save Project

This menu allows you to load or save a whole GenPlay project in a space-efficient, compressed binary format. When you load a GenPlay project, all the tracks and layers of your current project are replaced by the ones from the loaded project, and any unsaved information is lost.

Important Note: GenPlay project files may depend on the version of GenPlay you are using. Be sure to remember which version of GenPlay you used to save a project, and use the same version the next time you load it.

Full Screen

Click on this item from the main menu to toggle the full screen mode. When the full screen mode is on, the control panel and the status bar are hidden.

You can also toggle the full screen mode by pressing the F11 key.

Warnings report

This option will pop up the Warnings report dialog in order to consult previous and current alerts.

Option

The option menu item allows you to modify the configuration of GenPlay. Please refer to the section Changing the configuration of GenPlay for further information.

RNA To DNA Reference

This option allows you to transform the coordinates of an RNA-Seq result that was aligned to a transcriptome (for instance all RefSeq genes) into a genomic coordinate system.

You need two files in order to use this functionality.

  1. The result of the RNA-Seq experiment, called "Coverage File" in GenPlay. This file must be in bedGraph file format.
  2. An annotation file in bed format.

Two output files can be generated:

  1. A bedGraph file with the position based on a reference genome
  2. An annotation GdpGene file

Here is an example: Coverage File:

NM_000016	0	413	0
NM_000016	413	456	1
NM_000016	456	471	2
NM_000016	471	488	3
NM_000016	488	494	2
NM_000016	494	504	3

Annotation File:

chr1	76190042	76229353	NM_000016	0	+	76190472	76228448	0	12	460,88,98,70,101,81,131,109,141,96,249,977,	0,4043,8286,8495,9170,10433,15622,21448,25061,26093,36764,38334,

The result as a bedGraph file is:

chr1	76190455	76190498	43.0
chr1	76190498	76190502	8.0
chr1	76194085	76194096	22.0
chr1	76194096	76194113	51.0
chr1	76194113	76194119	12.0
chr1	76194119	76194129	30.0

And the result as a GdpGene file is:

NM_000016	chr1	+	76190042	76229353	76190042,76194085,76198328,76198537,76199212,76200475,76205664,76211490,76215103,76216135,76226806,76228376	76190502,76194173,76198426,76198607,76199313,76200556,76205795,76211599,76215244,76216231,76227055,76229353	667888.95,1506024.1,0,0,0,0,0,0,0,0,0,0
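
The bedGraph result above can be reproduced with the following sketch. It converts transcript coordinates to genomic coordinates using the exon blocks of the BED annotation, splits coverage windows at exon boundaries and drops windows with a score of 0. Multiplying the score by the window length is inferred from the example output, not from a documented rule. The sketch handles plus-strand transcripts only, and all function names are hypothetical.

```python
def parse_bed12_exons(bed_line):
    """Return (chrom, strand, exons) where exons are genomic (start, stop) pairs."""
    fields = bed_line.split()
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    block_sizes = [int(v) for v in fields[10].split(",") if v]
    block_starts = [int(v) for v in fields[11].split(",") if v]
    exons = [(chrom_start + start, chrom_start + start + size)
             for start, size in zip(block_starts, block_sizes)]
    return chrom, strand, exons


def transcript_to_genome(coverage, chrom, exons):
    """Map (t_start, t_stop, score) transcript windows onto the genome."""
    result = []
    offset = 0  # transcript coordinate of the first base of the current exon
    for exon_start, exon_stop in exons:
        exon_length = exon_stop - exon_start
        for t_start, t_stop, score in coverage:
            lo = max(t_start, offset)
            hi = min(t_stop, offset + exon_length)
            if lo < hi and score != 0:
                # Assumption from the example: output score = score * window length.
                result.append((chrom, exon_start + lo - offset,
                               exon_start + hi - offset, float(score * (hi - lo))))
        offset += exon_length
    return result


ANNOTATION = ("chr1 76190042 76229353 NM_000016 0 + 76190472 76228448 0 12 "
              "460,88,98,70,101,81,131,109,141,96,249,977, "
              "0,4043,8286,8495,9170,10433,15622,21448,25061,26093,36764,38334,")
COVERAGE = [(0, 413, 0), (413, 456, 1), (456, 471, 2),
            (471, 488, 3), (488, 494, 2), (494, 504, 3)]

if __name__ == "__main__":
    chrom, strand, exons = parse_bed12_exons(ANNOTATION)
    for row in transcript_to_genome(COVERAGE, chrom, exons):
        print(*row, sep="\t")
```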

Help and About GenPlay

The Help and About GenPlay options open a browser showing, respectively, the documentation and the About page of the GenPlay website.

Exit

This option closes the application after asking for confirmation.

Changing the Configuration of GenPlay

Option Menu

Click on the option item of the main menu to open the configuration screen.

General Options

The following screen lets you set the general options.

The Default Directory lets the user choose which folder is opened by default by the file choosers within GenPlay.

From this screen, you can also modify the appearance of the software by changing the look & feel.

Track Option

The Number of Tracks text box defines the maximum number of tracks that can be loaded on GenPlay.

The Default Track Height text box defines the height of each of the tracks.

The Undo Count text box defines the number of operations that can be undone. Note that the higher the number of undos selected, the more memory will be required.

The reset option allows the user to easily reset a layer in order to come back as if it has been freshly loaded.

The legend showing layers name on the upper right of a track can also be enabled or disabled.

DAS Server

The DAS server option shows the list of existing DAS servers along with the URL where these servers are located. It also provides the options to add new servers and remove existing servers.

GenPlay can communicate with and retrieve data from servers implementing the DAS/1 protocol.

Restore Default

The Restore Default configuration restores everything back to the factory settings.

File Formats

The different file formats used in GenPlay are described on this page.

Using Tracks

Track Menu

Handling Tracks

Moving a Track

To move a track up or down in the track list, just click on the track handler (the left part of the track with the track number) and drag the track to the desired position.

Inserting a Track

To insert a track, right click on the track handler of the track right under where you want to insert and choose the "Insert" option.

Deleting a Track

To delete, select a track and click on the delete option of the contextual menu or press Delete on the keyboard.

Copying, Cutting and Pasting a Layer

Track Menu

To copy layers, select the desired track where the layers are and click on the copy option in the contextual menu or press CTRL+C.

To cut layers, select the desired track where the layers are and click on the cut option in the contextual menu or press CTRL+X.

To paste a track, select the track where you want to paste and click on the paste option in the contextual menu or press CTRL+P.

A new window will appear showing all recently copied or cut layers that can be pasted on the track. Select the layers you want to paste and then click "Ok".

Taking a Screenshot of the Track

To take a screenshot, select a track and choose the "Save as Image" option in the contextual menu.

Using the Undo / Redo / Reset Options

The undo, redo and reset options are only available for the Variable and Fixed Window layers. They are accessible from the contextual menu when you right click on the track handler.

The number of undo and redo operations available can be specified as described in the Track Option section. Note that these operations consume memory, so reducing the number of undo / redo operations available can save memory.

The reset operation restores the track to the way it was right after being loaded. A reset operation can also be undone.

Track/Layer Settings

General

Track Settings - General
Basic Options
  • Name: The name of the track.
  • Height: The height of the track.
Axis Options
  • Show horizontal lines: Split the track horizontally.
  • Horizontal line count: Number of horizontal lines, equally separated.
  • Show vertical lines: Split the track vertically.
  • Vertical line count: Number of vertical lines, equally separated.
Score Options
  • Minimum Score: The minimum score to show.
  • Maximum Score: The maximum score to show.
  • Auto-rescaled: Enable the automatic score rescaling.
  • Score Position: Choose where the score is shown (top/bottom).
  • Score Color: Set the font color of the score.

Layers

Track Settings - Layers
  • Name: Click on the name to edit it.
  • Type: The type of layer.
  • Color: Click to edit the color of the layer.
  • Graph Type: Click to change the graph type:
    • Curve
    • Points
    • Bar
    • Dense
  • Visible: Show/hide the layer.
  • Active: Set the layer as "active". The active layer has direct interaction with the mouse pointer and clicks.
  • Set For Deletion: If set, the layer(s) will be deleted when clicking "Ok".

Operations

Once a layer is loaded, a right click on the location of the track handler opens a popup menu as shown in the figure below.

Operation Menu

The Operation sub-menu of the popup menu contains all the actions that you can use on the selected layer.

Sequencing/Microarray Layer Operations

Binned and non-binned layers share most operations, but some operations are specific to one type of layer.

Common operations

Show History

Shows the history of the layer: every change that has been made since the layer was loaded.

Constant Operation
Operation With Constant

These operations use one constant in the following ways:

  • Addition: adds the constant to the score of each window (F(x) = x + constant).
  • Subtraction: subtracts the constant from the score of each window (F(x) = x - constant).
  • Multiplication: multiplies the score by the constant (F(x) = x * constant).
  • Division: divides the score by the constant (F(x) = x / constant).
  • Inversion: inverts the score of each window (F(x) = constant / x).
  • Unique Score: sets all windows to a unique score (F(x) = constant).

The function can also be applied to null windows by checking the box.
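The formulas above can be sketched in a few lines of Python. This is only an illustration, not GenPlay's implementation: the function and dictionary names are hypothetical, a layer is reduced to a plain list of scores, and a null window is assumed to be a window with a score of 0.

```python
# Hypothetical sketch of the constant operations; not GenPlay's own code.
OPERATIONS = {
    "addition": lambda x, c: x + c,
    "subtraction": lambda x, c: x - c,
    "multiplication": lambda x, c: x * c,
    "division": lambda x, c: x / c,
    "inversion": lambda x, c: c / x,
    "unique_score": lambda x, c: c,
}


def apply_constant(scores, operation, constant, include_null=False):
    """Apply F(x) to every score; null (zero) windows are skipped unless requested."""
    f = OPERATIONS[operation]
    result = []
    for x in scores:
        # Assumption: a zero score marks a null window. Inversion of zero is
        # always skipped because constant / 0 is undefined.
        skip = x == 0 and (not include_null or operation == "inversion")
        result.append(x if skip else f(x, constant))
    return result


print(apply_constant([1.0, 0.0, 4.0], "addition", 2.0))
print(apply_constant([1.0, 0.0, 4.0], "addition", 2.0, include_null=True))
```

With the box unchecked the null window keeps its score of zero; with `include_null=True` it receives the constant like any other window.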

Two Layers Operation

This allows operations between two Sequencing/Microarray layers, binned or non-binned.

In order to set up the operation, several dialog windows appear in the following order:

  1. The first window asks you to select the second layer.
  2. The second window asks in which track the resulting layer will be placed.
  3. The third and last window offers the algorithms available to complete the operation (x1: score of the first layer; x2: score of the second layer):
    • Addition: adds the scores (x = x1 + x2).
    • Subtraction: subtracts the scores (x = x1 - x2).
    • Multiplication: multiplies the scores (x = x1 * x2).
    • Division: divides the scores (x = x1 / x2).
    • Average: averages the scores (x = (x1 + x2) / 2).
    • Maximum: keeps the highest score.
    • Minimum: keeps the lowest score.

Note: The resulting layer is a binned layer only when the operation is made between two binned layers having the same bin size. Any other case results in a non-binned layer.
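As an illustration of the algorithms listed above, the following Python sketch combines two layers window by window. It assumes the simplest case, two binned layers with the same bin size represented as equal-length lists of scores; the function name is hypothetical.

```python
# Hypothetical sketch: two binned layers with identical bin size,
# each reduced to a list holding one score per window.
def combine_layers(first, second, operation):
    operations = {
        "addition": lambda x1, x2: x1 + x2,
        "subtraction": lambda x1, x2: x1 - x2,
        "multiplication": lambda x1, x2: x1 * x2,
        "division": lambda x1, x2: x1 / x2,  # raises ZeroDivisionError if x2 == 0
        "average": lambda x1, x2: (x1 + x2) / 2,
        "maximum": max,
        "minimum": min,
    }
    if len(first) != len(second):
        raise ValueError("both layers must have the same number of windows")
    f = operations[operation]
    return [f(x1, x2) for x1, x2 in zip(first, second)]


print(combine_layers([1.0, 4.0], [3.0, 2.0], "average"))
print(combine_layers([1.0, 4.0], [3.0, 2.0], "maximum"))
```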

Index

Indexation can be useful to compare multiple layers at the same scale. It "re-scales" existing scores to a new range defined by the user.

If scores go from 10 to 600 but for some reason would need to be observed between 0 and 100, this operation will do the work.

It first asks for the new minimum and the new maximum. The next dialog asks whether to perform the re-scaling by chromosome independently or genome wide.

Using the previous example with a new scale of [0; 100]: if the first chromosome has a maximum score of 600 and the second one has a maximum score of 800, then 800 becomes the reference value mapped to 100 for both chromosomes when the operation is processed genome wide. When the operation is processed by chromosome independently, 600 is mapped to 100 for the first chromosome and 800 is mapped to 100 for the second chromosome.

Since this operation uses the minimum and maximum scores, it is very important to note that indexing does not work well in the presence of outliers. Indexing works best if outliers are eliminated or removed first using a filter (see below).
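The difference between the two modes can be illustrated with a short Python sketch. It assumes a plain linear (min-max) re-scaling, which is what the description suggests; the function name and the dictionary-of-lists layout are hypothetical, and every chromosome is assumed to hold at least one score.

```python
# Hypothetical sketch of indexing as a linear min-max re-scaling.
def index_scores(scores_by_chromosome, new_min, new_max, genome_wide=True):
    def rescale(values, old_min, old_max):
        span = old_max - old_min
        if span == 0:
            # All scores identical: map everything to the new minimum (assumption).
            return [new_min for _ in values]
        return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

    if genome_wide:
        # One reference minimum and maximum shared by every chromosome.
        everything = [v for values in scores_by_chromosome.values() for v in values]
        low, high = min(everything), max(everything)
        return {c: rescale(v, low, high) for c, v in scores_by_chromosome.items()}
    # Each chromosome uses its own minimum and maximum.
    return {c: rescale(v, min(v), max(v)) for c, v in scores_by_chromosome.items()}


layer = {"chr1": [0.0, 300.0, 600.0], "chr2": [0.0, 400.0, 800.0]}
print(index_scores(layer, 0.0, 100.0, genome_wide=True))
print(index_scores(layer, 0.0, 100.0, genome_wide=False))
```

Genome wide, the chr1 maximum of 600 lands at 75 because 800 is the shared reference; by chromosome, both maxima land at 100.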

Log
Logarithm Bases

For each window, the log operation applies the function f(x) = log(x), where x is the window score. The base of the logarithm function can be selected between either 2 (binary log), e (natural log) or 10 (common log).

Normalize
Normalization Coefficient

After a normalize operation, the score of each window is divided by the result of the Score Count operation and multiplied by a specified fixed value. By default, after normalization the scores are expressed per 10 million reads.
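A minimal Python sketch of this calculation follows. It assumes that Score Count is the sum of all the scores of the layer and that the default factor is 10,000,000; the function name is hypothetical.

```python
# Hypothetical sketch: normalized score = score / score_count * factor.
def normalize(scores, factor=10_000_000):
    score_count = sum(scores)  # assumption: Score Count is the sum of all scores
    if score_count == 0:
        return list(scores)  # nothing to normalize
    return [score / score_count * factor for score in scores]


print(normalize([1.0, 3.0], factor=100))
```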

Standard Score

Calculates the standard score for the selected layer i.e. (x - avg) / stdev; where x is the score, avg is the average score of the layer and stdev is the standard deviation of the scores of the layer.
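The formula can be sketched in Python as follows. The function name is hypothetical, and the sketch uses the population standard deviation, which is an assumption: the documentation does not say whether GenPlay uses the population or the sample form.

```python
import statistics


# Hypothetical sketch of the standard score, (x - avg) / stdev.
def standard_score(scores):
    avg = statistics.fmean(scores)
    stdev = statistics.pstdev(scores)  # assumption: population standard deviation
    if stdev == 0:
        return [0.0 for _ in scores]  # all scores identical
    return [(x - avg) / stdev for x in scores]


print(standard_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```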

Filter

GenPlay provides four different filters:

Percentage Filter
Percentage Filter

This option filters the X% lowest values and the Y% greatest values where X and Y are two decimals and where X + Y <= 100. You can choose between removing the filtered values (remove) or setting the filtered values to the boundary values (saturate).

Threshold Filter
Threshold Filter

This option removes the values that are lower than X OR greater than Y, where X and Y are two specified threshold values. You can choose between removing the filtered values (remove) or setting the filtered values to the boundary values (saturate).
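The difference between the remove and saturate modes can be sketched as follows. This is an illustration with hypothetical names; it assumes that a removed value becomes a score of zero.

```python
# Hypothetical sketch of the threshold filter with its two modes.
def threshold_filter(scores, low, high, mode="remove"):
    result = []
    for x in scores:
        if x < low:
            # saturate: clamp to the lower boundary; remove: assumed to zero the score
            result.append(low if mode == "saturate" else 0.0)
        elif x > high:
            result.append(high if mode == "saturate" else 0.0)
        else:
            result.append(x)
    return result


print(threshold_filter([1.0, 5.0, 12.0], 2.0, 10.0, mode="remove"))
print(threshold_filter([1.0, 5.0, 12.0], 2.0, 10.0, mode="saturate"))
```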

Band-Stop Filter
Band-Stop Filter

This option removes the values between two specified thresholds.

Count Filter
Count Filter

This option filters the X lowest values and the Y greatest values, where X and Y are two specified integers. You can choose between removing the filtered values (remove) or setting the filtered values to the boundary values (saturate).

Transfrag

This operation aggregates the windows of the selected layer that are separated by a gap smaller than a specified size (in bp).

The score of the new window can be the sum, the average or the maximum of the scores of the aggregated windows.
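The aggregation can be sketched in Python as below. The names are hypothetical; windows are assumed to be non-overlapping `(start, stop, score)` tuples, and the gap between two windows is taken as the start of the next window minus the stop of the previous one.

```python
# Hypothetical sketch of the transfrag aggregation.
def transfrag(windows, max_gap, method="sum"):
    reducers = {
        "sum": sum,
        "average": lambda values: sum(values) / len(values),
        "maximum": max,
    }
    reduce_scores = reducers[method]
    groups = []
    for window in sorted(windows):
        start = window[0]
        # Extend the current group when the gap to its last window is small enough.
        if groups and start - groups[-1][-1][1] < max_gap:
            groups[-1].append(window)
        else:
            groups.append([window])
    return [
        (group[0][0], group[-1][1], reduce_scores([score for _, _, score in group]))
        for group in groups
    ]


windows = [(0, 10, 1.0), (12, 20, 3.0), (100, 110, 5.0)]
print(transfrag(windows, max_gap=5, method="sum"))
```

Here the first two windows are separated by 2 bp and are merged, while the third window is 80 bp away and stays on its own.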

Score Distribution Histogram

The show repartition operation generates a graph showing the distribution of the scores of the selected layers. Two plot types are available: score vs. window count and score vs. base pair count.

You need to choose a size for the bins of scores. Depending on the selection, the graph shows how many windows or how many base pairs there are for each bin of scores.

Convert Layer

This operation converts the current layer into another layer among the following:

  • Gene Annotation Layer
  • Microarray/Sequencing Layer bin/non-bin
  • Mask Layer

Non-Binned Layers Only

CG Methylation Profile

This operation computes the methylation values on CG sequences by combining the value on the C position and the value on the G position.

The operation relies on data from a sequence layer in order to find the CG sequences.

Binned Layers Only

Smooth

The smooth operation can be processed according to the 3 following algorithms:

Gauss Smoothing
Sigma Value

This operation applies a Gaussian filter to the layer, depending on the sigma value provided by the user.

$$G(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-x^{2} / (2 \sigma^{2})}$$

where x is the score and σ is the standard deviation of the layer.

You can choose the extrapolate option to "fill" the windows with a score of zero.

Loess Smoothing

This operation computes the Loess regression of degree 1 on the selected layer.

For each x value where a y value is to be calculated, the Loess technique performs a regression on points in a moving range around the x value, where the values in the moving range are weighted according to their distance from this X value.

The Loess regression is a smoothing function. You need to specify the half-size of the moving window on which the regression will be computed.

The weight function of the Loess regression is computed as follows: W(i) = (1 - X(i)^3)^3, where X(i) is the normalized distance: current distance / maximum distance among points in the moving regression.

You can choose the extrapolate option to "fill" the windows with a score of zero.

Moving Average Smoothing

For each window of the layer, this operation computes the average over a region of a specified size centered on the window and scores the window with the result. You are prompted for the half-size of the region before the calculation.

You can choose the extrapolate option to "fill" the windows with a score of zero.
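A moving average of this kind can be sketched as follows. The function name is hypothetical, the half-size is expressed here as a number of windows, and the region is simply truncated at the ends of the layer; both choices are assumptions made for the example.

```python
# Hypothetical sketch of moving average smoothing on a binned layer.
def moving_average(scores, half_size):
    smoothed = []
    for i in range(len(scores)):
        # Region centered on window i, truncated at the layer boundaries.
        low = max(0, i - half_size)
        high = min(len(scores), i + half_size + 1)
        region = scores[low:high]
        smoothed.append(sum(region) / len(region))
    return smoothed


print(moving_average([1.0, 2.0, 3.0, 4.0, 5.0], half_size=1))
```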

Find Peaks

The find peak operation offers three different algorithms that can be used to find the peaks:

Standard Deviation Peak Finder
Standard Deviation Peak Finder

The standard deviation peak finder prompts the user to enter two parameters.

The parameter ‘S’ specifies the number of windows to be considered for each window on either side in order to calculate the standard deviation.

For example, if S = 10, it means that for each window we consider 10 windows to the left and 10 windows to the right to calculate the standard deviation.

For a window to be accepted, its standard deviation needs to be at least ‘T’ times greater than the value of the standard deviation of the chromosome.
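The acceptance rule can be sketched in Python as shown below. This is one possible reading of the description, with hypothetical names: the local standard deviation is computed on the window itself plus S windows on each side (truncated at the ends), the population form is used, and a window is accepted when its local value is at least T times the chromosome-wide value.

```python
import statistics


# Hypothetical sketch of the standard deviation peak finder for one chromosome.
def stdev_peaks(scores, s, t):
    chromosome_sd = statistics.pstdev(scores)
    if chromosome_sd == 0:
        return []  # flat chromosome: no window can stand out
    accepted = []
    for i in range(len(scores)):
        low = max(0, i - s)
        high = min(len(scores), i + s + 1)
        local_sd = statistics.pstdev(scores[low:high])
        if local_sd >= t * chromosome_sd:
            accepted.append(i)
    return accepted  # indexes of the accepted windows


scores = [1.0] * 20
scores[10] = 100.0
print(stdev_peaks(scores, s=1, t=2.0))
```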

Density Peak Finder
Density Peak Finder

The Density Finder works as follows:

The parameter ‘S’ specifies the number of windows to be considered for each window on either side of the window under consideration.

For the window under consideration to be accepted, at least ‘P’ percentage of values must be above the high threshold ‘H’ or at least ‘P’ percentage of values must be below the low threshold ‘L’.

Island Finder
Island Finder

The Island Finder is based on the algorithm described in the paper Zang, C., Schones, D. E., Zeng, C., Cui, K., Zhao, K., and Peng, W. (2009). A clustering approach for identification of enriched domains from histone modification chip-seq data. Bioinformatics (Oxford, England), 25(15):1952-1958.

The parameters window value and gap of the island finder are the parameters ‘l0’ and ‘g’ respectively. The island score allows the user to select the scores greater than or equal to a particular value. The island length parameter allows the user to select islands encompassing at least a specified number of windows. There are three result types:

  • Start values: Depicts only those islands that are selected and removes the ones that are rejected.
  • Island score: Depicts the islands by considering the score.
  • Island Summit: Depicts the island with the summit of the input island as a score.

Correlation
Correlation Report

The correlation operation computes the Pearson’s correlation between the score values of two layers. The two layers need to have the same bin size. The following formula is used to calculate the correlation:

$$\rho = \frac{\sum_{i} x_i y_i - n \, \bar{x} \, \bar{y}}{(n - 1) \, s_x \, s_y}$$

Where:

  • $\rho$ is the Pearson’s correlation
  • $x_i$ and $y_i$ are the scores of the layers
  • $n$ is the number of values
  • $\bar{x}$ and $\bar{y}$ are the means of the scores of the layers
  • $s_x$ and $s_y$ are the standard deviations of the scores of the layers

The figure on the right shows a correlation report.

Note: The correlation is computed only on the windows that are different from zero on both layers. If one of the layers has a zero value window, the window of the other layer with the same coordinates is skipped as well.
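The correlation formula and the zero-window rule can be sketched together in Python. The function name is hypothetical; the two layers are assumed to be equal-length lists of scores with the same bin size, and the standard deviations use the n - 1 (sample) form so that they match the (n - 1) term of the formula.

```python
import math


# Hypothetical sketch of the Pearson correlation between two binned layers.
def pearson_correlation(first, second):
    # Keep only the windows that are non-zero on both layers.
    pairs = [(x, y) for x, y in zip(first, second) if x != 0 and y != 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("at least two non-zero window pairs are required")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    sum_xy = sum(x * y for x, y in pairs)
    return (sum_xy - n * mean_x * mean_y) / ((n - 1) * s_x * s_y)


# The fourth window is zero on the first layer, so it is skipped on both.
print(pearson_correlation([1.0, 2.0, 3.0, 0.0], [2.0, 4.0, 6.0, 5.0]))
```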

Density

This operation generates a new fixed window layer where the score of each window represents the density of non-null windows in its neighborhood. You first need to enter the size S of the neighborhood. For each window W, the algorithm counts how many of the S windows before W and the S windows after W have a score different from zero. This value is then divided by 2 * S + 1 and the result is the score of W.
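A short Python sketch of this calculation is given below. The function name is hypothetical. Because the count is divided by 2 * S + 1, the sketch assumes that the window W itself is part of the neighborhood; windows that would fall outside the layer are simply not counted.

```python
# Hypothetical sketch of the density operation on a fixed window layer.
def density(scores, s):
    result = []
    for i in range(len(scores)):
        low = max(0, i - s)
        high = min(len(scores), i + s + 1)
        # Assumption: W itself is counted along with its S neighbors on each side.
        non_null = sum(1 for x in scores[low:high] if x != 0)
        result.append(non_null / (2 * s + 1))
    return result


print(density([1.0, 0.0, 2.0, 0.0, 0.0], s=1))
```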

Intervals Scoring

This operation needs two layers:

  • The selected layer that defines the scores
  • A second layer that defines the intervals

This operation generates a new layer containing the intervals of the interval layer. For each interval, the algorithm looks at the corresponding scores in the score layer and computes either the maximum, the average or the sum of all the scores that fall in the interval. This value is the new score value in the result layer.

You can also choose to use only a certain percentage of the greatest scores that fall in the interval.
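The scoring can be sketched as follows. The names are hypothetical; score windows are `(start, stop, score)` tuples, intervals are `(start, stop)` tuples, a window is considered to fall in an interval when it overlaps it, and an interval without any window is given a score of zero. All of these are assumptions made for the example.

```python
import math


# Hypothetical sketch of intervals scoring.
def score_intervals(windows, intervals, method="average", top_percent=100.0):
    reducers = {
        "maximum": max,
        "average": lambda values: sum(values) / len(values),
        "sum": sum,
    }
    reduce_scores = reducers[method]
    scored = []
    for i_start, i_stop in intervals:
        # Scores of the windows overlapping the interval, greatest first.
        hits = sorted(
            (score for start, stop, score in windows if start < i_stop and stop > i_start),
            reverse=True,
        )
        # Keep only the requested percentage of the greatest scores.
        hits = hits[: math.ceil(len(hits) * top_percent / 100)]
        scored.append((i_start, i_stop, reduce_scores(hits) if hits else 0.0))
    return scored


windows = [(0, 10, 1.0), (10, 20, 3.0), (20, 30, 5.0), (40, 50, 9.0)]
print(score_intervals(windows, [(0, 30), (30, 40)], method="average"))
print(score_intervals(windows, [(0, 30)], method="sum", top_percent=50.0))
```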

Concatenate
Select Layers to Concatenate

The concatenate operation allows you to generate a file containing the scores of multiple fixed window layers that have the same bin size. The output file contains the following fields:

  1. chromosome
  2. start position
  3. stop position
  4. score layer 1
  5. score layer 2
  6. score layer 3
  7. ...

Gene Layer Operations

Directly on a gene layer, you can:

  1. Double click on a gene to open a web page describing the gene. Make sure that your input file contains a geneDBURL line as described in the File Formats section in order to enable this option.
  2. Put the mouse over a gene to have some information about the name and the score of the gene. If the exons of the gene have different scores you can put your mouse over an exon to have the exon score.

Score Count

This operation computes the sum of all scores.

A window asks first to select chromosomes to include in the calculation (all by default).

Average

This operation computes the average of all scores.

A window asks first to select chromosomes to include in the calculation (all by default).

Count Genes

This operation counts the total number of genes.

A window asks first to select chromosomes to include in the calculation (all by default).

Count Genes with Non-Null Score

This operation counts the total number of genes, excluding the ones with a score of 0.

A window asks first to select chromosomes to include in the calculation (all by default).

Count Exons

This operation counts the total number of exons.

A window asks first to select chromosomes to include in the calculation (all by default).

Search Gene

Search Gene

Use this option to search for a gene on the selected layer by typing the name of the gene.

Check the Match Case option if you want the search to be case sensitive. Check the whole word option if you want to find only the genes whose whole name matches the input. Press next or previous to go to the next or previous gene found, respectively. You can also open the Find Gene dialog by pressing CTRL+F after selecting a gene layer.

Extract Intervals

Extract Intervals

This option allows you to extract intervals defined relatively to the beginning, the end or the middle of a gene and to generate a new gene layer showing these intervals.

You can, for example, define promoters as regions that start 100bp before the beginning of genes and end 150bp after the beginning of genes. This option allows you to generate a new layer from these parameters.
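The promoter example can be sketched in Python as below. The function name and the `(name, start, stop)` gene tuples are hypothetical, and the sketch ignores the strand of the genes, which is a simplification.

```python
# Hypothetical sketch: intervals defined relative to a reference point of each gene.
def extract_intervals(genes, reference="start", before=100, after=150):
    intervals = []
    for name, start, stop in genes:
        anchors = {"start": start, "end": stop, "middle": (start + stop) // 2}
        anchor = anchors[reference]
        # The interval is clipped so that it never starts before position 0.
        intervals.append((name, max(0, anchor - before), anchor + after))
    return intervals


# Promoters: from 100bp before to 150bp after the beginning of each gene.
print(extract_intervals([("geneA", 1000, 5000)], reference="start", before=100, after=150))
```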

Extract Exons

Extract Exons

This option generates a new gene layer showing only the exons of the genes of the selected layer.

You can choose between the three following options:

  1. Extract the first exon of the genes
  2. Extract the last exon
  3. Extract all the exons

Unique Score

Unique Score

This operation sets the same score for all exons.

Score Exons

Score Exons

To execute this operation you need to have at least one microarray/sequencing layer loaded. For each exon of each gene of the selected gene layer, this operation computes a new score based on the window score from the selected layer that falls into the exon. There are 3 different ways to compute the new score:

  • Base Coverage Sum
  • Maximum coverage
  • RPKM

Filter

This option provides four different filters for gene layers:

Percentage Filter
Percentage Filter

This option filters the genes with the X% lowest overall score and the Y% greatest overall scores where X and Y are two decimals and where X + Y <= 100. You can choose between removing the filtered values (remove) or setting the filtered values to the boundary values (saturate).

Threshold Filter
Threshold Filter

This option filters the genes with an overall score that is lower than X OR greater than Y, where X and Y are two specified threshold values. You can choose between removing the filtered values (remove) or setting the filtered values to the boundary values (saturate).

Band-Stop Filter
Band-Stop Filter

This option removes the genes with an overall score between two specified thresholds.

Count Filter
Count Filter

This option filters the X lowest scored genes and the Y greatest scored genes, where X and Y are two specified integers. You can choose between removing the filtered values (remove) or setting the filtered values to the boundary values (saturate).

Filter Strand

You need to select a strand when prompted. At the end of the operation the layer will contain only the genes on the selected strand. All the other genes will have been removed.

Rename Genes

This operation allows you to change the names of the genes. You need to provide a text file where each line contains the current gene name and the new gene name separated by a tab. Every time a gene with a name from the first column is found, this name is replaced by the new gene name from the second column.
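The expected file layout and the renaming rule can be sketched in Python as follows. The function names and the gene names used in the example are hypothetical; genes that do not appear in the first column keep their name.

```python
# Hypothetical sketch of gene renaming from a two-column, tab-separated file.
def load_rename_table(lines):
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        current_name, new_name = line.split("\t", 1)
        table[current_name] = new_name
    return table


def rename_genes(gene_names, table):
    # Names absent from the table are left unchanged.
    return [table.get(name, name) for name in gene_names]


# In practice the lines would come from open("renames.txt"); a list stands in here.
table = load_rename_table(["oldNameA\tnewNameA\n", "\n", "oldNameB\tnewNameB"])
print(rename_genes(["oldNameB", "otherGene"], table))
```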

Distance Calculation

Development in progress, coming soon.

Score Repartition Around Start

You first need to select a Fixed window layer containing the scores. After that, you need to select the chromosomes on which you want to execute the operation. You also need to specify a bin size S, a bin count C and a method for the calculation of the scores.

The operation creates C bins on each side of the start position of each gene. The size S of each bin is in base pairs. Depending on the method of calculation chosen, the operation computes the sum, the maximum or the average of the scores for each corresponding bin from each gene and displays a bar graph of the result. The data can be exported by right-clicking on the graph and using the "save as" function.

Multi-curve graphs can be generated using the following procedure. To generate a comparison between two fixed-window layers:

  1. Perform an analysis for the first layer as described above.
  2. Save it to your hard drive.
  3. Close the graph window.
  4. Perform the same analysis on the second layer.
  5. Right click on the second graph and choose the load data option.
  6. Load the first analysis.

The colors of the curves, the type of graph (bar, points, curve) and the scale can be adjusted by right-clicking on the graph. The same procedure can be used to load more than two graphs. To produce more complex graphs, we recommend loading the saved data into your favorite spreadsheet software.

Score Repartition Around Start

Repeat Layer Operations

There is currently no operation available for the repeat layer.

DNA Sequence Layer Operations

There is currently no operation available for the sequence layers.

Mask Layer Operations

Apply Mask

Applying a mask means filtering out the data that are not inside the windows of the mask.

All information overlapping a mask window will be kept, everything else will be lost.

Invert Mask

This operation simply inverts all windows of the mask. All current windows become empty spaces, all empty spaces become windows.
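The inversion can be sketched in Python as follows. The function name is hypothetical; the mask is a list of `(start, stop)` windows on a single chromosome, and the chromosome length is passed explicitly so that the empty space after the last window can become a window too.

```python
# Hypothetical sketch of mask inversion on one chromosome.
def invert_mask(mask, chromosome_length):
    inverted = []
    position = 0
    for start, stop in sorted(mask):
        if start > position:
            # The empty space before this window becomes a window.
            inverted.append((position, start))
        position = max(position, stop)
    if position < chromosome_length:
        inverted.append((position, chromosome_length))
    return inverted


print(invert_mask([(10, 20), (30, 40)], chromosome_length=50))
```

Inverting the result a second time gives back the original mask.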

Variant Layer Operations

Edit Variant Layer

Edit Variant Layer Dialog

This feature opens the same window that is used to load the Variant Layer, offering the possibility to change the variation types to show.

Generate track statistics

This operation generates various statistics about the loaded information.

It also compares these statistics before and after applying any filters in order to see their effects.

Filters

Filters can be applied to Variant Layers. They act directly on the data found in the VCF file in order to select the data of interest. All filters are set in the Filters section of the Multi-Genome Project Properties dialog.

Simply click on "Add" in order to create a new filter. As shown below, a new window appears to define the filter.

Filter selection dialog
  • Layer(s): The layers affected by the filter.
  • File: A filter is also file specific. If the data to filter are spread over different files, several filters must be created.
  • ID: A filter can be set on any ID defined on the header of the VCF. IDs can be of different types which affects the selection of the next steps.
  • Genome(s): Any "FORMAT" ID will require to know which genome(s) is/are concerned by the filter.
  • Operator: If more than one genome has been selected in the previous step, the operator will decide how the result from each genome will be processed in order to have a result for the whole line.
    • And: The selected ID value from each genome must pass the filter.
    • Or: At least one selected ID value must pass the filter.
    • Sum: If the selected ID value is an integer, the sum value from each genome will be filtered.
    • Mean: If the selected ID value is an integer, the mean value from all genomes will be filtered.
  • Filter: This filter panel will change according to the selected ID type.
    • String: The input value will be tested and the user has to choose if the value must be present or must not be present in the ID value.
    • Number: The ID value is tested against an input value using one of the given numeric operators. The ID value can also be tested against two input values using the second part of the filter; the user then has to choose how both filters are handled.
    • Flag: When the ID value is a flag, it behaves as a boolean: the value is either present or absent.
    • Genotype: The genotype ID has a special filter editor in order to set it up more easily. The regular string editor can be found below. The genotype can be homozygote/heterozygote/phased/unphased.
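The way the operator combines the values of several genomes can be sketched in Python as follows. The function name is hypothetical; the ID values of the selected genomes are given as a list of numbers, and the filter itself is represented by a predicate function.

```python
# Hypothetical sketch of the And / Or / Sum / Mean operators for one VCF line.
def line_passes(values, predicate, operator="and"):
    if operator == "and":
        return all(predicate(v) for v in values)  # every genome must pass
    if operator == "or":
        return any(predicate(v) for v in values)  # at least one genome must pass
    if operator == "sum":
        return predicate(sum(values))  # the summed value is filtered
    if operator == "mean":
        return predicate(sum(values) / len(values))  # the mean value is filtered
    raise ValueError(f"unknown operator: {operator}")


# Example: a numeric filter "value >= 20" on two genomes with values 10 and 30.
def at_least_20(value):
    return value >= 20


for operator in ("and", "or", "sum", "mean"):
    print(operator, line_passes([10, 30], at_least_20, operator))
```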

Export as VCF

This operation exports all visible variations of the layer into a new VCF file. Filters are taken into account, meaning that the export contains exactly what can be seen on the layer.

Convert into variable window track

This operation converts the Variant Layer into a Microarray/Sequencing Layer. The new windows match the positions of the variation stripes. The score of the new windows can be set to any integer value present in the VCF lines. For haploid genomes, only one layer is generated. For diploid genomes, the maternal and paternal alleles are generated on two different layers.

Apply Genotype

Coming soon...