ChIP-Seq Tutorial

From GenPlay, Einstein Genome Analyzer

Revision as of 15:58, 20 March 2014 by Julien (talk | contribs) (Load the Files)
Jump to: navigation, search

Goal: The objective of this tutorial is to illustrate how GenPlay can be used to isolate peaks from the data generated from a ChIP-Seq experiment. Then, to generate a list of genes that have a peak in their promoter and finally to associate the score of the peak summit with each promoter.

Note: The final result of this tutorial is available as a project that can be loaded from the Projects page of this website.

Load the Files

The first thing to do is to download the ChIP-Seq IP file, the ChIP-Seq input control sample file and the RefSeq gene annotation file and to uncompress this files.

The input file should contain reads obtained by sequencing the sample before the immnuno-precipitation step. This control is necessary to normalize for sequence dependent variations in the efficiency of the library preparation and sequencing procedure. One major difficulty is that to be able to really normalize for these parameters a considerable depth of sequencing of the input is necessary since this sample is not enriched. If the depth is too low to be able to completely normalize, it is preferable to subtract the input from the IP rather than divide (which tend to magnify the differences). The IP (immuno-precipitated) sample should contain the result of the sequencing of the sample after the precipitation step.

After downloading the files, you can start GenPlay from the Web Start link that is located on top of this page. The 1 GB link is enough for this tutorial, but generally you should allocate as much memory as you can afford. For this experiment we're going to work only on the first chromosome so the loading time is shorter and the amount of memory needed is smaller.

To obtain the narrowest peak possible, it is generally advisable to correct the strand bias that is caused by the fact that the cross-linked DNA fragments are sequenced from the end while the actual binding site might be anywhere within the immuno-precipitated fragments. (reviewed in Wilbanks EG et al. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010 Jul 8;5(7):e11471.)

To measure the strand bias, we need to load the 3' and 5' reads separately.

To achieve that you will need to right click on the track handler of the 1st row, in order to open the menu that will allow you to load tracks (figure 1). Select the Load Fixed Window Track option.

Figure 1: Load Menu

After selecting the IP file, an option window is going to prompt you to enter information on how to load the data. You can keep the default name for the track or change it if you prefer. Then, you need to choose a window size and a method of score caculation. The size of the window that you should choose depend on the number of reads that are available. The smaller the windows the higher the resolution. For this example, we will choose a window of 100 bp. The option for the score calculation are discussed in the Documentation. For the type of files used in this example, you should choose sum as the method for the score calculation. You can keep the default data precision. But we need to select a strand. Let's start with the 5' strand. You will also need to select the 1st chromosome. To do so, click on the "Modify Selection" button on the bottom right corner of the screen and then uncheck all the chromosomes but the first one. The figure 2 shows how the screen should like before you click on the OK button.

Figure 2: Load Fixed Window Track Menu

The operation needs to be repeated for the 3' strand. Once the two tracks are loaded you can modify the Y axis by right clicking on the track handlers and selecting the "Set Y Axis" option. Set the maximum to 100. You can also change the color and the appearance of the peaks. Now that the two tracks are loaded we can graphically determine how much the strands need to be shifted. Select a peak, zoom on it with the mouse wheel and check how far the summits of the same peak on the 5' and the 3' strands are. Verify that this value is the same on other peaks. When you're sure about the value, divide it by two and note this result. This value should be very close to the average size of the insert of the sequenced library. In the example, we notice that the summits are about 300 bp apart (figure 3) so the shifting value is 150 bp (meaning that the 5' is shifted 150 bp forward and the backward strand is shifted 150 bp backward).

Figure 3: Find Strand Shifting

We need to load the file again but this time we're going to load both strands with the appropriate strand shifting. This time the loading screen should look like on the figure 4.

For comparison purpose, you can also load both strand without any shifting. This should clearly show that the shifted peaks are generally narrower than the unshifted peaks.

Figure 4: Load Fixed Window Track Menu, both strands


Load Input

It's now time to load the input control sample track (the track resulting from sequencing the cross-linked chromatin before the immuno-precipitation). This track will help us remove the peaks (enrichment region) that are non specific, either because of problem with the method (cross-linking,antibody etc...) or because of sequencing artifact. These artifacts can be caused either by preferential sequencing of specific fragment by the instruments or by differences between the genome sequenced and the genome assembly used to align the reads, since any repeat in the sequenced genome that is present as a none repeated region in the genome assembly will result in a peak.

Identifying these outliers can be quite difficult. One useful method is to compare the IP libraries with a control (input) library for the same sample. It's what we will to do in this tutorial.

To load the input file, right click on the handler of an empty track and select "Load Fixed Window Track". Then select the input file that you downloaded earlier. On the next window set the parameters as shown on the figure 5.

Figure 5: Load Control Parameters

Normalize IP and Input

In order to be comparable, the IP and the input tracks need first to be normalized.

The normalize operation of GenPlay divides each score by the sum of all the scores of the track and multiply the result by a large constant specified by the user. The only purpose of the constant is to make the score more readable.

Let's start by normalizing the IP. You need first to right click on the track handler. After that, select the normalize option of the operation sub-menu as shown on figure 6. You can keep the default constant, just click OK (figure 7). After that operation the score are expressed as number of read per windows per 10 million reads (if you kept the default constant).

Once it's done repeat the operation with the input track.

Figure 6: Normalize Menu
Figure 7: Normalize Dialog Box

Add Constant to Input

One of the specificities of GenPlay is that while computing an operation, the windows with a score of zero are not taken into account.

This is helpful in most cases because we don't know if a zero window is caused by the presence of a repeat (which cannot be map), a region that is not sequenced in the genome assembly or if it is a mappable window in which no read mapped.

In the case of this tutorial though, since we want to compare the IP track to the input track, we want to be able to do this comparison even when a window of the input has a null score. Since the comparison will be a subtraction we can add a really small constant (so no window has a score of zero) without impacting significantly the result of the subtraction.

We chose to add a constant of 0.01 to all the windows of the input track. In order to do that, you first need to right click on the handler of the input track (figure 8). Then, choose the "Addition (Constant)" option of the operation sub-menu. Enter 0.01 in the next windows as shown on figure 9 a click OK.

Figure 8: Addition Constant Menu
Figure 9: Constant Dialog Box

Compare Input to Input

We can now compare the IP to the input track in order to filter out the artifactual peaks. As we already decided, we want to subtract the input scores from the IP scores.

To do so, you need to right right click on the handler of the IP track and select the operation sub-menu. Then click on "Two Tracks Operation" as shown on figure 10.

In the first dialog window you need to choose the second track for the operation. Select the input track (figure 11).

The operation to choose is subtraction (figure 12) and you can select the default option (32-Bit) for the result data precision. You also need to choose in which empty track you want the result to be displayed. The figure 13 shows the result of the operation.

We choose to subtract rather than divide (as would be done in a PCR experiment) because in chip-Seq experiments the input sample is not enriched wile the IP might be enriched 10 to 20 time or more. As a result, to have comparable coverage the input would have to be sequenced 10 to 20 times deeper than the IP sample (which would be very expensive). This lower coverage for the input control causes a sampling error that can drastically affect the results if the data are normalised by calculating the ratio IP/IN. The sampling error problem on the input increase when the binning size decrease (when the window size used loading the track decrease). An alternative to subtracting the input from the IP track is to smooth the input track in order to attenuate the sampling error on that track. That can be done in GenPlay using either the moving average or the Gaussian smoothing operation. In practice to decrease the sampling error, we find that smoothing over 5 to 10 kb is necessary. Finally, whatever the method use, some of the peak present in the IP track might not have been completely eliminated from the IP track, or some artifact might be present only in the IP track. If that is the case, one can also eliminate the outliers by simply using the filter to eliminate the highest peak. This can be done easily in GenPlay using either either absolute or relative values in the filter operation. We find that there are no optimal method to analyse all tracks and that it is best for the users to proceed by trial and error and exercise their judgment in performing these operations.

Figure 10: Two Tracks Operation Menu
Figure 11: Dialog to Choose the Second Track
Figure 12: Dialog to Choose the Operation Type
Figure 13: Result of the Subtraction

Filter Negative Values

Since we're just interesting in the enriched regions we can try to remove all the negative values in order to make the track easier to comprehend.

To filter the track you need to select the "Filter" option of the operation sub-menu (figure 14). On the left panel of the filter window select the threshold option. Set the parameters of the right panel as shown on figure 15. The figure 16 shows the result of the filter operation.

Figure 14: Filter Operation Menu
Figure 15: Filter Operation Window
Figure 16: Filter Operation Result (the negative values of the track 5 have been removed)

We now need to remove the background noise and to keep only the islands (the peaks).



Isolate Peaks

This goal of this step is to remove the background noise from the track so just the peaks remain.

To do so, right click on the track handler, choose the "Operation" sub-menu and click on the "Find Peaks" option (figure 17).

Figure 17: Find Peaks Operation

After the find peaks dialog opens, choose the "Island Finder" option on the right panel and set the parameters as shown on the figure 18.

Figure 18: Find Peaks Menu

The island finder is described in the documentation section of this website. You'll notice that the selected output is "Peak Summits". This means that for each island, the only selected window will be the one with greatest score on the island. This generally corresponds to the window with the highest probability of containing the binding site. Other software can be used to identify the exact sequence of the binding site. In addition to choosing the summit as the output, one could add the Island score or the "start value". These two values are a much more quantitative estimate of the level of enrichment of the signal than the summit value. The result should be similar to what is shown on figure 19.

Figure 19: Find Peaks Result (the track 6 contains only the summit of the peaks)





Extract Gene Promoters

First, we need to load a gene track. Right click on an empty track handler and select "Load Gene Track". Select the RefSeq file that we've already downloaded when prompted.

When it's done, right click on the track handler of the gene track and select "Extract Intervals" in the Operation sub-menu (Figure 20).

Figure 20: Extract Intervals Menu

A dialog box will pop-up. We decide to define a promoter as a region that starts 500 bp before a gene start position and ends 500bp after. In order to do so, fill in the parameters as shown in figure 21.

Figure 21: Extract Intervals Dialog

You'll finally be asked to select the result track position in the track list. The result track represents only the promoters of the genes of the reference track (figure 22).

Figure 22: Gene Promoters (the track 8 contains only the gene promoters)

Score Promoters

Now that we have a track with the peaks and a track with the promoters we can score the promoters using the score of the peaks and export the result as a bed file.

To score the promoters, right click on the handler of the track with the promoters and select the "Score Exons" option of the "Operation" sub-menu (figure 23).

Figure 23: Score Exons Menu

You'll be prompted to choose the track containing the scores. Select the track with the peaks extracted.

Then select maximum for the method of calculation and select a track where the result should appear (figure 15). Note that the color of the promoters represents the scores associated to the promoters (as described in the gene track section of the documentation).

Figure 24: Score Exons Result

The last thing we need to do is to export the result of our analysis. Right click on the newly created track handler. Select "Save As". Choose where you want to save the track and make sure that the file type is set to Bed file. Once it's done, you can open the file that you created with a text editor such as notepad. You'll notice that the result file contains the position (field 1 to 3) of the promoters, the name of the genes (field 4), the strand of the gene (field 6) as well as the scores of the promoters (field 5). For more details about the result file you can refer to the File Type section of the documentation.