ChIP-Seq Tutorial

From GenPlay, Einstein Genome Analyzer

Revision as of 18:28, 20 March 2014 by Julien (talk | contribs) (Compare Input to Input)
Jump to: navigation, search

Goal: The objective of this tutorial is to illustrate how GenPlay can be used to isolate peaks from the data generated from a ChIP-Seq experiment. Then, to generate a list of genes that have a peak in their promoter and finally to associate the score of the peak summit with each promoter.

Prerequisite: GenPlay need to be installed on your computer. If you haven't installed GenPlay yet, please visit the Downloads page and follow the instruction to download and install GenPlay.

Note: The final result of this tutorial is available as a project that can be loaded from the Projects page of this website.

Start a New Project

After starting GenPlay you will be prompted to select a name, a clade, a genome and an assembly for your project. You can enter "ChIP-Seq Tutorial" for the name, select the mammal clade and human genome (figure 1).

Figure 1: New Project Window

Then, click on the tool box button on the assembly line. A new window will appear allowing you to select chromosomes. For this tutorial we will work only on the first chromosome so loading times are shorter and the amount of memory needed is smaller. You can select all the other chromosome and then click on the unselect button (figure 2).

Figure 2: Project Chromosomes

Load the Files

We first need to download the ChIP-Seq IP file, the ChIP-Seq input control sample file and the RefSeq gene annotation file and to uncompress this files.

The input file should contain reads obtained by sequencing the sample before the immnuno-precipitation step. This control is necessary to normalize for sequence dependent variations in the efficiency of the library preparation and sequencing procedure. One major difficulty is that to be able to really normalize for these parameters a considerable depth of sequencing of the input is necessary since this sample is not enriched. If the depth is too low to be able to completely normalize, it is preferable to subtract the input from the IP rather than divide (which tend to magnify the differences). The IP (immuno-precipitated) sample should contain the result of the sequencing of the sample after the precipitation step.

We now need to load the IP data. To achieve that you will need to right click on the track handler of the 1st row. A popup menu should appear, select the Add Layer(s) option (Figure 3).

Figure 3: Add Layer(s) Menu

Then select the IP file that you downloaded earlier. You will be prompted to choose a layer type. Choose Sequencing/Microarray Layer.

We now need to tell GenPlay how to load the data. You can keep the default name for the track or change it if you prefer. Then, you need to check Bin Data and choose a window size and a method of score caculation. The size of the window that you should choose depends on the number of reads that are available. The smaller the windows the higher the resolution. For this example, we will choose a window of 100 bp. The option for the score calculation are discussed in the Documentation. For the type of files used in this example, you should choose Addition as the method for the score calculation.

To obtain the narrowest peak possible, it is generally advisable to correct the strand bias that is caused by the fact that the cross-linked DNA fragments are sequenced from the end while the actual binding site might be anywhere within the immuno-precipitated fragments. (reviewed in Wilbanks EG et al. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010 Jul 8;5(7):e11471.)

To correct the strand bias we need to specify what is the average fragment length of our library and what is the length of the sequenced reads. In our case we know that the average fragment length is 300bp and the reads were 50bp. Check the Define Fragment Length box and enter these informations in the Fragment Length section of the window. At the end you should have something similar to figure 4.

Figure 4: New Layer Window

Click on Okay and wait for the track to load. That can take more than a minute. Once the data is loaded, double click on the handle of the track 1 to adjust the display options. Check Show horizontal lines', uncheck Auto rescale and enter 100 for the Maximum Score (figure 5).

Figure 5: Track Options

You can now look at your data by zooming using the mouse wheel and moving within the chromosome by pressing the left button of the mouse on the track and releasing it after moving the mouse on the left or on the right.

Figure 6: IP Layer Loaded


Load Input

It's now time to load the input control sample track (the track resulting from sequencing the cross-linked chromatin before the immuno-precipitation). This track will help us remove the peaks (enrichment region) that are non specific, either because of problem with the method (cross-linking,antibody etc...) or because of sequencing artifact. These artifacts can be caused either by preferential sequencing of specific fragment by the instruments or by differences between the genome sequenced and the genome assembly used to align the reads, since any repeat in the sequenced genome that is present as a none repeated region in the genome assembly will result in a peak.

Identifying these outliers can be quite difficult. One useful method is to compare the IP libraries with a control (input) library for the same sample. It's what we will to do in this tutorial.

To load the input file, right click on the handler of an empty track, choose the Add Layer(s) option. Then select the Input file downloaded earlier and choose Sequencing/Microarray Layer. On the next window set the parameters as shown on the figure 7.

Figure 7: Load Control Parameters

Once the layer is loaded you can modify the display options of the track by double clicking on the track handler the same way you did for the IP layer.

Normalize IP and Input

In order to be comparable, the IP and the input tracks need first to be normalized.

The normalize operation of GenPlay divides each score by the sum of all the scores of the track and multiply the result by a large constant specified by the user. The only purpose of the constant is to make the score more readable.

Let's start by normalizing the IP. You need first to right click on the track handler. After that, select the Normalize option of the layer sub-menu as shown on figure 8. You can keep the default constant, just click OK (figure 9). After that operation the score are expressed as number of read per windows per 1 billion reads (if you kept the default constant).

Once it's done repeat the operation with the input track.

Figure 8: Normalize Menu
Figure 9: Normalize Dialog Box

Add Constant to Input

One of the specificities of GenPlay is that while computing an operation, the windows with a score of zero are not taken into account.

This is helpful in most cases because we don't know if a zero window is caused by the presence of a repeat (which cannot be map), a region that is not sequenced in the genome assembly or if it is a mappable window in which no read mapped.

In the case of this tutorial though, since we want to compare the IP track to the input track, we want to be able to do this comparison even when a window of the input has a null score. Since the comparison will be a subtraction we can add a really small constant (so no window has a score of zero) without impacting significantly the result of the subtraction.

We chose to add a constant of 0.01 to all the windows of the input track. In order to do that, you first need to right click on the handler of the input track (figure 10). Then, choose the "Constant Operation" option of the lalyer sub-menu. Enter 0.01 in the next windows as shown on figure 11 and check the Apply to null windows option before clicking OK.

Figure 10: Addition Constant Menu
Figure 11: Constant Dialog Box

Compare Input to Input

We can now compare the IP to the input track in order to filter out the artifactual peaks. As we already decided, we want to subtract the input scores from the IP scores.

To do so, you need to right right click on the handler of the IP track and select the IP layer sub-menu. Then click on "Two Tracks Operation" as shown on figure 12.

In the first dialog window you need to choose the second track for the operation. Select the input track (figure 13).

The operation to choose is subtraction (figure 14). You also need to choose in which empty track you want the result to be displayed. The figure 15 shows the result of the operation.

We choose to subtract rather than divide (as would be done in a PCR experiment) because in chip-Seq experiments the input sample is not enriched wile the IP might be enriched 10 to 20 time or more. As a result, to have comparable coverage the input would have to be sequenced 10 to 20 times deeper than the IP sample (which would be very expensive). This lower coverage for the input control causes a sampling error that can drastically affect the results if the data are normalised by calculating the ratio IP/IN. The sampling error problem on the input increase when the binning size decrease (when the window size used loading the track decrease). An alternative to subtracting the input from the IP track is to smooth the input track in order to attenuate the sampling error on that track. That can be done in GenPlay using either the moving average or the Gaussian smoothing operation. In practice to decrease the sampling error, we find that smoothing over 5 to 10 kb is necessary. Finally, whatever the method use, some of the peak present in the IP track might not have been completely eliminated from the IP track, or some artifact might be present only in the IP track. If that is the case, one can also eliminate the outliers by simply using the filter to eliminate the highest peak. This can be done easily in GenPlay using either either absolute or relative values in the filter operation. We find that there are no optimal method to analyse all tracks and that it is best for the users to proceed by trial and error and exercise their judgment in performing these operations.

Figure 12: Two Tracks Operation Menu
Figure 13: Dialog to Choose the Second Track
Figure 14: Dialog to Choose the Operation Type
Figure 15: Result of the Subtraction

Filter Negative Values

Since we're just interesting in the enriched regions we can try to remove all the negative values in order to make the track easier to comprehend.

To filter the track you need to select the "Filter" option of the operation sub-menu (figure 14). On the left panel of the filter window select the threshold option. Set the parameters of the right panel as shown on figure 15. The figure 16 shows the result of the filter operation.

Figure 14: Filter Operation Menu
Figure 15: Filter Operation Window
Figure 16: Filter Operation Result (the negative values of the track 5 have been removed)

We now need to remove the background noise and to keep only the islands (the peaks).



Isolate Peaks

This goal of this step is to remove the background noise from the track so just the peaks remain.

To do so, right click on the track handler, choose the "Operation" sub-menu and click on the "Find Peaks" option (figure 17).

Figure 17: Find Peaks Operation

After the find peaks dialog opens, choose the "Island Finder" option on the right panel and set the parameters as shown on the figure 18.

Figure 18: Find Peaks Menu

The island finder is described in the documentation section of this website. You'll notice that the selected output is "Peak Summits". This means that for each island, the only selected window will be the one with greatest score on the island. This generally corresponds to the window with the highest probability of containing the binding site. Other software can be used to identify the exact sequence of the binding site. In addition to choosing the summit as the output, one could add the Island score or the "start value". These two values are a much more quantitative estimate of the level of enrichment of the signal than the summit value. The result should be similar to what is shown on figure 19.

Figure 19: Find Peaks Result (the track 6 contains only the summit of the peaks)





Extract Gene Promoters

First, we need to load a gene track. Right click on an empty track handler and select "Load Gene Track". Select the RefSeq file that we've already downloaded when prompted.

When it's done, right click on the track handler of the gene track and select "Extract Intervals" in the Operation sub-menu (Figure 20).

Figure 20: Extract Intervals Menu

A dialog box will pop-up. We decide to define a promoter as a region that starts 500 bp before a gene start position and ends 500bp after. In order to do so, fill in the parameters as shown in figure 21.

Figure 21: Extract Intervals Dialog

You'll finally be asked to select the result track position in the track list. The result track represents only the promoters of the genes of the reference track (figure 22).

Figure 22: Gene Promoters (the track 8 contains only the gene promoters)

Score Promoters

Now that we have a track with the peaks and a track with the promoters we can score the promoters using the score of the peaks and export the result as a bed file.

To score the promoters, right click on the handler of the track with the promoters and select the "Score Exons" option of the "Operation" sub-menu (figure 23).

Figure 23: Score Exons Menu

You'll be prompted to choose the track containing the scores. Select the track with the peaks extracted.

Then select maximum for the method of calculation and select a track where the result should appear (figure 15). Note that the color of the promoters represents the scores associated to the promoters (as described in the gene track section of the documentation).

Figure 24: Score Exons Result

The last thing we need to do is to export the result of our analysis. Right click on the newly created track handler. Select "Save As". Choose where you want to save the track and make sure that the file type is set to Bed file. Once it's done, you can open the file that you created with a text editor such as notepad. You'll notice that the result file contains the position (field 1 to 3) of the promoters, the name of the genes (field 4), the strand of the gene (field 6) as well as the scores of the promoters (field 5). For more details about the result file you can refer to the File Type section of the documentation.