Difference between revisions of "Tutorials"

From GenPlay, Einstein Genome Analyzer

Revision as of 11:22, 27 January 2011

The following tutorials introduce some of the basic concepts of track manipulation.

ChIP-Seq Analysis

Goal: The objective is first to isolate the peaks from the data generated by a ChIP-Seq experiment. Then, we want to generate a list of genes that have a peak in their promoter and associate with each promoter the score of the peak summit.

Load the File

The first thing to do is to download the ChIP-Seq file and the RefSeq gene annotation file from the tutorial directory. After that, you can start GenPlay from the Web Start link located at the top of this page. The 1 GB link is enough for this tutorial, but generally you should allocate as much memory as you can afford. For this experiment we're going to work only on the first chromosome, so the loading time is shorter and less memory is needed.

To obtain the narrowest possible peaks, it is generally advisable to correct for the strand bias. This bias arises because the cross-linked DNA fragments are sequenced from their ends, while the actual binding site might be anywhere within the immuno-precipitated fragments (reviewed in Wilbanks EG et al. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010 Jul 8;5(7):e11471).

To measure the strand bias, we need to load the 3' and 5' reads separately.

To do so, right click on the track handler of the 1st row to open the menu that allows you to load tracks (figure 1). Select the Load Fixed Window Track option.

Figure 1: Load Menu

After you select the file, an option window prompts you for information on how to load the data. You can keep the default name for the track. You then need to choose a window size and a method of score calculation. The window size you should choose depends on the number of reads available: the smaller the windows, the higher the resolution. For this example, we will choose a window of 100 bp. The options for the score calculation are discussed in the Documentation. For the type of file used in this example, choose sum as the method of score calculation. You can keep the default data precision, but you need to select a strand. Let's start with the 5' strand. You also need to select the 1st chromosome: click on the "Modify Selection" button in the bottom right corner of the screen, then uncheck all the chromosomes but the first one. Figure 2 shows how the screen should look before you click on the OK button.

Figure 2: Load Fixed Window Track Menu
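To make the fixed-window idea concrete, here is a minimal Python sketch of how reads could be binned into 100 bp windows with the "sum" score method. It is only an illustration under stated assumptions, not GenPlay's actual implementation: the function name `bin_reads`, the `(position, score)` read format, and the 0-based coordinates are all hypothetical.

```python
def bin_reads(reads, window_size, chromosome_length):
    """Sum read scores into fixed-size windows.

    reads: list of (position, score) tuples, 0-based positions (assumed format).
    Returns one summed score per window along the chromosome.
    """
    # Number of windows needed to cover the chromosome (last one may be partial).
    n_windows = (chromosome_length + window_size - 1) // window_size
    scores = [0.0] * n_windows
    for position, score in reads:
        index = position // window_size
        if 0 <= index < n_windows:  # ignore reads outside the chromosome
            scores[index] += score
    return scores


# Hypothetical reads on a 400 bp chromosome, binned into 100 bp windows.
example_reads = [(10, 1.0), (50, 1.0), (120, 1.0), (399, 2.0)]
print(bin_reads(example_reads, 100, 400))  # [2.0, 1.0, 0.0, 2.0]
```

A smaller `window_size` gives more windows and therefore a finer resolution, at the cost of fewer reads per window.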

The operation needs to be repeated for the 3' strand. Once the two tracks are loaded, you can modify the Y axis by right clicking on the track handlers and selecting the "Set Y Axis" option. Set the maximum to 100. You can also change the color and the appearance of the peaks. Now that the two tracks are loaded, we can graphically determine how much the strands need to be shifted. Select a peak, zoom in on it with the mouse wheel and check how far apart the summits of the same peak are on the 5' and the 3' strands. Verify that this value is the same on other peaks. When you're sure about the value, divide it by two and note the result. This value should be very close to the average size of the insert of the sequenced library. In the example, we notice that the summits are about 300 bp apart (figure 3), so the shifting value is 150 bp (meaning that the 5' strand is shifted 150 bp forward and the 3' strand is shifted 150 bp backward).

Figure 3: Find Strand Shifting
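The arithmetic behind the shifting value can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the `(position, strand)` read format and the strand labels `"5'"` and `"3'"` are assumptions made for the example, and GenPlay applies the shift for you when you load the track.

```python
def shift_value(summit_5p, summit_3p):
    """Half the distance between the 5' and 3' summits of the same peak."""
    return abs(summit_3p - summit_5p) // 2


def shift_reads(reads, shift):
    """Move 5' reads forward and 3' reads backward by `shift` bp.

    reads: list of (position, strand) tuples, strand being "5'" or "3'"
    (labels chosen to follow the tutorial's terminology).
    """
    shifted = []
    for position, strand in reads:
        if strand == "5'":
            shifted.append((position + shift, strand))
        else:
            # Never move a read before the start of the chromosome.
            shifted.append((max(0, position - shift), strand))
    return shifted


# Summits about 300 bp apart, as in the tutorial example.
shift = shift_value(1000, 1300)
print(shift)  # 150
print(shift_reads([(1000, "5'"), (1300, "3'")], shift))
# [(1150, "5'"), (1150, "3'")]
```

After the shift, the two summits land on the same position, which is why the combined peak becomes narrower.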

We need to load the file again, but this time we're going to load both strands with the appropriate strand shifting. The loading screen should now look like figure 4.

For comparison purposes, you can also load both strands without any shifting. This should clearly show that the shifted peaks are generally narrower than the unshifted peaks.

Figure 4: Load Fixed Window Track Menu, both strands


Load Control

It's now time to load a control track. This track is going to help us remove the peaks (enrichment regions) that we believe to be artifacts. They can be caused either by preferential sequencing of specific fragments by the instruments or by differences between the sequenced genome and the genome assembly used to align the reads, since any repeat in the sequenced genome that is not present in the genome assembly will result in a peak.

Identifying these outliers can be quite difficult. One useful method is to compare the IP libraries with a control (input) library for the same sample.

To load the control file, right click on the handler of an empty track and select "Load Fixed Window Track". Then select the input file that you downloaded earlier and, in the next window, set the parameters as shown in figure 5.

Figure 5: Load Control Parameters


Add Constant to Control

Normalize Input and Control

Compare Input to Control

Remove Outliers

In most sequencing experiments, there are a few very large artifactual peaks. They are caused either by preferential sequencing of specific fragments by the instruments or by differences between the sequenced genome and the genome assembly used to align the reads, since any repeat in the sequenced genome that is not present in the genome assembly will result in a peak.

To remove the 0.05% of windows with the greatest scores, right click on the track handler of the last loaded track. A menu will pop up. Select the "Operation" sub-menu and then select the "Filter" option (figure 5).

Figure 5: Filter Menu

Set the parameters of the filter as shown in figure 6 and validate by clicking on OK.

Figure 6: Filter Dialog

Figure 7 shows the result of the operation. Track 4 is the one with the outliers removed. Instead of removing the tallest peaks, we could have chosen to saturate them, which decreases the size of a peak rather than eliminating it.

Note that the color of the tracks has been modified by right clicking on the track handler and selecting the "Appearance" option.

Figure 7: Filter Result
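The difference between removing and saturating the highest-scoring windows can be illustrated with a short Python sketch. This is a simplified model under assumptions, not GenPlay's filter code: the function name `filter_top_fraction` is hypothetical, removed windows are assumed to be set to 0, and saturated windows are assumed to be capped at the highest remaining score.

```python
def filter_top_fraction(scores, fraction, saturate=False):
    """Remove or saturate the given fraction of highest-scoring windows.

    fraction: e.g. 0.0005 for the top 0.05% of windows.
    saturate=False sets the top windows to 0 (assumed "remove" behavior);
    saturate=True caps them at the highest remaining score (assumed behavior).
    """
    n = len(scores)
    k = int(round(n * fraction))  # number of windows to treat as outliers
    if k == 0:
        return list(scores)
    # Window indices ordered from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    cap = scores[order[k]] if saturate and k < n else 0.0
    result = list(scores)
    for i in order[:k]:
        result[i] = cap
    return result


# A large fraction (25%) is used here only so that the effect is visible
# on a tiny hypothetical track of 8 windows.
track = [1.0, 5.0, 2.0, 90.0, 3.0, 4.0, 70.0, 2.0]
print(filter_top_fraction(track, 0.25))
# [1.0, 5.0, 2.0, 0.0, 3.0, 4.0, 0.0, 2.0]
print(filter_top_fraction(track, 0.25, saturate=True))
# [1.0, 5.0, 2.0, 5.0, 3.0, 4.0, 5.0, 2.0]
```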

Now we need to remove the background noise and to keep only the islands (the peaks).

Isolate Peaks

The goal of this step is to remove the background noise from the track so that only the peaks remain.

To do so, right click on the track handler, choose the "Operation" sub-menu and click on the "Find Peaks" option (figure 8).

Figure 8: Find Peaks Operation

After the Find Peaks dialog opens, choose the "Island Finder" option on the right panel and set the parameters as shown in figure 9.

Figure 9: Find Peaks Menu

The island finder is described in the documentation section of this website. You'll notice that the selected output is "Peak Summits". This means that for each island, the score of the windows on the output track will be the greatest score of the windows of the input track.

The result should be similar to what is shown in figure 10.

Figure 10: Find Peaks Result
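The "Peak Summits" output can be pictured with the following Python sketch. It deliberately uses a much simpler island definition than the real island finder (see the documentation for the actual algorithm): here an island is assumed to be any run of consecutive windows whose score reaches a fixed threshold. The function name `island_summits` and the threshold parameter are hypothetical.

```python
def island_summits(scores, threshold):
    """Replace each island by its summit score and set the background to 0.

    Simplifying assumption: an island is a run of consecutive windows
    with score >= threshold. The real island finder is more elaborate.
    """
    output = [0.0] * len(scores)
    n = len(scores)
    i = 0
    while i < n:
        if scores[i] < threshold:
            i += 1
            continue
        # Extend the island while windows stay at or above the threshold.
        j = i
        while j < n and scores[j] >= threshold:
            j += 1
        summit = max(scores[i:j])
        for k in range(i, j):
            output[k] = summit  # every window of the island gets the summit
        i = j
    return output


track = [0.0, 1.0, 5.0, 9.0, 6.0, 0.0, 1.0, 7.0, 8.0, 0.0]
print(island_summits(track, 5.0))
# [0.0, 0.0, 9.0, 9.0, 9.0, 0.0, 0.0, 8.0, 8.0, 0.0]
```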

Extract Gene Promoters

First, we need to load the gene track. Right click on an empty track handler and select "Load Gene Track". Select the RefSeq file that we've already downloaded when prompted.

When it's done, right click on the track handler of the gene track and select "Extract Intervals" in the Operation sub-menu (Figure 11).

Figure 11: Extract Intervals Menu

A dialog box will pop up. We define a promoter as a region that starts 100 bp before a gene start position and ends 50 bp after it. To do so, fill in the parameters as shown in figure 12.

Figure 12: Extract Intervals Dialog

You'll finally be asked to select the result track position in the track list. The result track represents only the promoters of the genes of the input track (figure 13).

Figure 13: Gene Promoters
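The promoter definition used above (100 bp before the gene start, 50 bp after) can be written out as a small Python sketch. The function name `promoter_interval` is hypothetical, and the strand handling is an assumption: for a gene on the minus strand, the sketch takes the gene start to be the highest coordinate, so "before" means toward higher positions. Check the documentation for how GenPlay itself treats strands.

```python
def promoter_interval(gene_start, gene_stop, strand, before=100, after=50):
    """Return the (start, stop) promoter interval of a gene.

    gene_start < gene_stop are chromosome coordinates; strand is '+' or '-'.
    Assumption: on the '-' strand, transcription starts at gene_stop.
    """
    if strand == "+":
        tss = gene_start
        return (max(0, tss - before), tss + after)
    tss = gene_stop
    return (max(0, tss - after), tss + before)


# Hypothetical gene spanning positions 1000 to 5000.
print(promoter_interval(1000, 5000, "+"))  # (900, 1050)
print(promoter_interval(1000, 5000, "-"))  # (4950, 5100)
```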

Score Promoters

Now that we have a track with the peaks and a track with the promoters, we can score the promoters using the scores of the peaks and export the result as a BED file.

To score the promoters, right click on the handler of the track with the promoters and select the "Score Exons" option of the "Operation" sub-menu (figure 14).

Figure 14: Score Exons Menu

You'll be prompted to choose the track containing the scores. Select the track with the peaks extracted.

Then select maximum as the method of calculation and select a track where the result should appear (figure 15). Note that the color of the promoters represents the scores associated with the promoters (as described in the gene track section of the documentation).

Figure 15: Score Exons Result
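Scoring a promoter with the "maximum" method amounts to taking the highest score among the peak-track windows that overlap it. The Python sketch below illustrates the idea; the function name `score_interval`, the half-open `[start, stop)` interval convention and the score of 0 for intervals outside the track are assumptions made for the example.

```python
def score_interval(scores, window_size, start, stop):
    """Maximum score of the fixed windows overlapping [start, stop).

    scores: one score per fixed window of `window_size` bp.
    Returns 0.0 if the interval lies entirely outside the track (assumption).
    """
    first = start // window_size
    last = (stop - 1) // window_size
    overlapping = scores[first:last + 1]
    return max(overlapping) if overlapping else 0.0


# Hypothetical peak track with 100 bp windows.
peaks = [0.0, 0.0, 9.0, 9.0, 9.0, 0.0, 0.0, 8.0, 8.0, 0.0]
print(score_interval(peaks, 100, 250, 400))   # 9.0
print(score_interval(peaks, 100, 900, 1050))  # 0.0
```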

The last thing we need to do is to export the result of our analysis. Right click on the newly created track handler and select "Save As". Choose where you want to save the track and make sure that the file type is set to BED file. Once it's done, you can open the file that you created with a text editor such as Notepad. You'll notice that the result file contains the position of the promoters (fields 1 to 3), the name of the genes (field 4), the scores of the promoters (field 5) and the strand of the gene (field 6). For more details about the result file, you can refer to the File Type section of the documentation.
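If you want to process the exported file with a script rather than a text editor, the six fields can be read as in the following Python sketch. The record type `PromoterRecord`, the function `parse_bed_line` and the example line are hypothetical; the sketch assumes whitespace-separated fields in the order described above and skips header and comment lines.

```python
from typing import NamedTuple, Optional


class PromoterRecord(NamedTuple):
    chromosome: str  # field 1
    start: int       # field 2
    stop: int        # field 3
    gene_name: str   # field 4
    score: float     # field 5
    strand: str      # field 6


def parse_bed_line(line: str) -> Optional[PromoterRecord]:
    """Parse one line of a 6-field BED file, or return None for headers."""
    stripped = line.strip()
    if not stripped or stripped.startswith(("#", "track", "browser")):
        return None
    fields = stripped.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 fields, got %d" % len(fields))
    return PromoterRecord(fields[0], int(fields[1]), int(fields[2]),
                          fields[3], float(fields[4]), fields[5])


# Hypothetical line; real gene names and scores will differ.
record = parse_bed_line("chr1\t900\t1050\tGENE_A\t9.0\t+")
print(record.gene_name, record.score)  # GENE_A 9.0
```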