Difference between revisions of "Tutorials"

From GenPlay, Einstein Genome Analyzer

Jump to: navigation, search
(TimEX Analysis)
(How to Create a VCF File From a Chain File)
 
(28 intermediate revisions by 2 users not shown)
Line 2: Line 2:
  
 
== ChIP-Seq Analysis ==
 
== ChIP-Seq Analysis ==
'''Goal:''' The objective of this tutorial is to illustrate how GenPlay can be used to isolate peaks from the data generated from a ChIP-Seq experiment. Then, to generate a list of genes that have a peak in their promoter and finally to associate  the score of the peak summit with each promoter.
+
The objective of the [[ChIP-Seq Tutorial]] is to illustrate how GenPlay can be used to isolate peaks from the data generated from a ChIP-Seq experiment.
 
 
'''Note:''' The following tutorial is based on the hg19 genome assembly which is the default genome assembly of GenPlay. If you previously changed the genome assembly used by GenPlay in the configuration menu you would need to restore the hg19 assembly. Please refer to the [[Documentation#Genome File|Documentation]] section of this website for more information on how to change the reference assembly.
 
 
 
'''Note:''' The final result of this tutorial is available as a project that can be loaded from the [[Web Start#Demo|Web Start]] page of this website.
 
 
 
=== Load the File ===
 
The first thing to do is to download the [http://www.genplay.net/tutorials/ChIP-Seq/ChIP-Seq_tutorial_IP.bed ChIP-Seq IP file],  the [http://www.genplay.net/tutorials/ChIP-Seq/ChIP-Seq_tutorial_Input.bed ChIP-Seq input control sample file] and the  [http://www.genplay.net/tutorials/ChIP-Seq/RefSeq_hg19.bed RefSeq gene annotation file] (if the web-browser opens the files in a new tab or window, just select the Save As option of the File menu of your browser to retrieve the files.
 
 
 
The input file should contain reads obtained by sequencing the sample before the immnuno-precipitation step. This control is necessary to normalize for sequence dependent variations in the efficiency of the library preparation and sequencing procedure. One major difficulty is that to be able to really normalize for these parameters a considerable depth of sequencing of the input is necessary since this sample is not enriched.  If the depth is too low to be able to completely normalize, it is preferable to subtract the input from the IP rather than divide (which tend to magnify the differences).
 
The IP (immuno-precipitated) sample should contain the result of the sequencing of the sample after the precipitation step.
 
 
After downloading the files, you can start GenPlay from the Web Start link that is located on top of this page. The 1 GB link is enough for this tutorial, but generally you should allocate as much memory as you can afford. For this experiment we're going to work only on the first chromosome so the loading time is shorter and the amount of memory needed is smaller.
 
 
 
To obtain the narrowest peak possible, it is generally advisable to correct the strand bias that is caused by the fact that the cross-linked DNA fragments are sequenced from the end while the actual binding site might be anywhere within the immuno-precipitated fragments. (reviewed in Wilbanks EG et al.  Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010 Jul 8;5(7):e11471.)
 
 
 
To measure the strand bias, we need to load the 3' and 5' reads separately.
 
 
 
To achieve that you will need to right click on the track  handler of the 1st row, in order to open the menu that will allow you to load  tracks (figure 1). Select the Load Fixed Window Track option.
 
[[image:tutorial1_empty_track_menu.png|center|frame|Figure 1: Load Menu]]
 
 
 
After selecting the IP file, an option window is going to prompt you to enter information on how to load the data. You can keep the default name for the track or change it if you prefer. Then, you need to choose a window size and a method of score caculation. The size of the window that you should choose depend on the number of reads that are available. The smaller the windows the higher the resolution. For this example, we will choose a window of 100 bp. The option for the score calculation are discussed in the Documentation.  For the type of files used in this example, you should choose  sum as the method for the score calculation.  You can keep the default data precision. But we need to select a strand. Let's start with the 5' strand. You will also need to select the 1st chromosome. To do so, click on the "Modify Selection" button on the bottom right corner of the screen and then uncheck all the chromosomes but the first one. The figure 2 shows how the screen should like before you click on the OK button.
 
[[image:tutorial1_Load_FWT_menu.png|center|frame|Figure 2: Load Fixed Window Track Menu]]
 
 
 
The operation needs to be repeated for the 3' strand. Once the two tracks are loaded you can modify the Y axis by right clicking on the track handlers and selecting the "Set Y Axis" option. Set the maximum to 100. You can also change the color and the appearance of the peaks. Now that the two tracks are loaded we can graphically determine how much the strands need to be shifted. Select a peak, zoom on it with the mouse wheel and check how far the summits of the same peak on the 5' and the 3' strands are. Verify that this value is the same on other peaks. When you're sure about the value, divide it by two and note this result. This value should be very close to the average size of the insert of the sequenced library.
 
In the example, we notice that the summits are about 300 bp apart (figure 3) so the shifting value is 150 bp (meaning that the 5' is shifted 150 bp forward and the backward strand is shifted 150 bp backward).
 
[[image:tutorial1_strand_shifting.png|center|frame|Figure 3: Find Strand Shifting]]
 
 
 
We need to load the file again but this time we're going to load both strands with the appropriate strand shifting. This time the loading screen should look like on the figure 4.
 
 
 
For comparison purpose, you can also load both strand without any shifting. This should clearly show that the shifted peaks are generally narrower than the unshifted peaks.
 
[[image:tutorial1_Load_FWT_menu2.png|center|frame|Figure 4: Load Fixed Window Track Menu, both strands]]
 
<br style="clear: both" />
 
 
 
=== Load Input ===
 
It's now time to load the input control sample track (the track resulting from sequencing the cross-linked chromatin before the immuno-precipitation).  This track will help us remove the peaks (enrichment region) that are non specific, either because of problem with the method (cross-linking,antibody etc...) or because of sequencing artifact. These artifacts can be caused either by preferential sequencing of specific fragment by the instruments or by differences between the genome sequenced and the genome assembly used to align the reads, since any repeat in the sequenced genome that is present as a none
 
repeated region in the genome assembly will result in a peak.
 
 
 
Identifying these outliers can be quite difficult. One useful method is to compare the IP  libraries with a control (input) library for the same sample. It's what we will to do in this tutorial.
 
 
 
To load the input file, right click on the handler of an empty track and select "Load Fixed Window Track". Then select the input file that you downloaded earlier. On the next window set the parameters as shown on the figure 5.
 
[[image:tutorial1_load_control1.png|center|frame|Figure 5: Load Control Parameters]]
 
 
 
=== Normalize IP and Input ===
 
In order to be comparable, the IP and the input tracks need first to be normalized.
 
 
 
The normalize operation of GenPlay divides each score by the some of all the scores of the track and multiply the result by a large constant specified by the user. The only purpose of the constant is to make the score more readable.
 
 
 
Let's start by normalizing the IP.  You need first to right click on the track handler. After that, select the normalize option of the operation sub-menu as shown on figure 6. You can keep the default constant, just click OK (figure 7). After that operation the score are expressed as number of read per windows per 10 million reads (if you kept the default constant).
 
 
 
Once it's done repeat the operation with the input track.
 
 
 
[[image:tutorial1_normalize1.png|center|frame|Figure 6: Normalize Menu]]
 
[[image:tutorial1_normalize2.png|center|frame|Figure 7: Normalize Dialog Box]]
 
 
 
=== Add Constant to Input ===
 
One of the specificities of GenPlay is that while computing an operation, the windows with a score of zero are not taken into account. 
 
 
 
This is helpful in most cases because we don't know if a zero window is caused by the presence of a repeat (which cannot be map), a region that is not sequenced in the genome assembly or if it is a mappable  window in which no read mapped.
 
 
 
In the case of this tutorial though, since we want to compare the IP track to the input track, we want to be able to do this comparison even when a window of the input has a null score. Since the comparison will be a subtraction we can add a really small constant (so no window has a score of zero) without impacting significantly the result of the subtraction.
 
 
 
We chose to add a constant of 0.01 to all the windows of the input track. In order to do that, you first need to right click on the handler of the input track (figure 8). Then, choose the "Addition (Constant)" option of the operation sub-menu. Enter 0.01 in the next windows as shown on figure 9 a click OK.
 
 
 
[[image:tutorial1_add_constant1.png|center|frame|Figure 8: Addition Constant Menu]]
 
[[image:tutorial1_add_constant2.png|center|frame|Figure 9: Constant Dialog Box]]
 
 
 
=== Compare Input to Input ===
 
We can now compare the IP to the input track in order to filter out the artifactual peaks. As we already decided, we want to subtract the input scores from the IP scores.
 
 
 
To do so, you need to right right click on the handler of the IP track and select the operation sub-menu. Then click on "Two Tracks Operation" as shown on figure 10.
 
 
 
In the first dialog window you need to choose the second track for the operation. Select the input track (figure 11).
 
 
 
The operation to choose is subtraction (figure 12) and you can select the default option (32-Bit) for the result data precision. You also need to choose in which empty track you want the result to be displayed. The figure 13 shows the result of the operation.
 
 
 
We choose to subtract rather than divide (as would be done in a PCR experiment) because in chip-Seq experiments the input sample is not enriched wile the IP might be enriched 10 to 20 time or more. As a result, to have comparable coverage the input would have to be sequenced 10 to 20 times deeper than the IP sample (which would be very expensive). This lower coverage for the input control causes a sampling error that can drastically affect the results if the data are normalised by calculating the ratio IP/IN.  The sampling error problem on the input  increase when the binning size decrease (when the window size used loading the track decrease). An alternative to subtracting the input from the IP track is to smooth the input track in order to attenuate the sampling error on that track. That can be done in GenPlay using either the moving average or the Gaussian smoothing operation. In practice to decrease the sampling error, we find that smoothing  over 5 to 10 kb is necessary. Finally, whatever the method use, some of the peak present in the IP track might not have been completely eliminated from the IP track, or some artifact might be present only in the IP track. If that is the case, one can also eliminate the outliers by simply using the filter to eliminate the highest peak. This can be done easily in GenPlay using either either absolute or relative values in the filter operation. We find that there are no optimal method to analyse all tracks and that it is best for the users to proceed by trial and error and exercise their judgment in performing these operations.
 
 
 
[[image:tutorial1_subtract1.png|center|frame|Figure 10: Two Tracks Operation Menu]]
 
[[image:tutorial1_subtract2.png|center|frame|Figure 11: Dialog to Choose the Second Track]][[image:tutorial1_subtract3.png|center|frame|Figure 12: Dialog to Choose the Operation Type]]
 
[[image:tutorial1_subtract4.png|center|frame|Figure 13: Result of the Subtraction]]
 
 
 
=== Filter Negative Values ===
 
Since we're just interesting in the enriched regions we can try to remove all the negative values in order to make the track easier to comprehend.
 
 
 
To filter the track you need to select the "Filter" option of the operation sub-menu (figure 14). On the left panel of the filter window select the threshold option. Set the parameters of the right panel as shown on figure 15. The figure 16 shows the result of the filter operation.
 
 
 
[[image:tutorial1_filter1.png|center|frame|Figure 14: Filter Operation Menu]]
 
[[image:tutorial1_filter2.png|center|frame|Figure 15: Filter Operation Window]]
 
[[image:tutorial1_filter3.png|center|frame|Figure 16: Filter Operation Result (the negative values of the track 5 have been removed)]]
 
 
 
We now need to remove the background noise and to keep only the islands (the peaks).
 
 
 
<br/><br/>
 
 
 
=== Isolate Peaks ===
 
This goal of this step is to remove the background noise from the track so just the peaks remain.
 
 
 
To do so, right click on the track handler, choose the "Operation" sub-menu and click on the "Find Peaks" option (figure 17).
 
[[image:Tutorial1 find peaks1.png|center|frame|Figure 17: Find Peaks Operation]]
 
 
 
After the find peaks dialog opens, choose the "Island Finder" option on the right panel and set the parameters as shown on the figure 18.
 
[[image:Tutorial1 find peaks2.png|center|frame|Figure 18: Find Peaks Menu]]
 
 
 
The island finder is described in the documentation section of this website.
 
You'll notice that the selected output is "Peak Summits". This means that for each island, the only selected window will be the one with greatest score on the island. This generally corresponds to the window with the highest probability of containing the binding site. Other software can be used to identify the exact sequence of the binding site. In addition to choosing the summit as the output, one could add the Island score or the  "start value". These two values are a much more quantitative estimate of the level of enrichment of the signal than the summit value. 
 
The result should be similar to what is shown on figure 19.
 
[[image:Tutorial1 find peaks3.png|center|frame|Figure 19: Find Peaks Result (the track 6 contains only the summit of the peaks)]]
 
 
 
 
 
<br/><br/><br/>
 
=== Extract Gene Promoters ===
 
First, we need to load a gene track. Right click on an empty track handler and select "Load Gene Track". Select the RefSeq file that we've already downloaded when prompted.
 
 
 
When it's done, right click on the track handler of the gene track and select "Extract Intervals" in the Operation sub-menu (Figure 20).
 
[[image:Tutorial1 extract promoters1.png|center|frame|Figure 20: Extract Intervals Menu]]
 
 
 
A dialog box will pop-up. We decide to define a promoter as a region that starts 500 bp before a gene start position and ends 500bp after. In order to do so, fill in the parameters as shown in figure 21.
 
[[image:Tutorial1 extract promoters2.png|center|frame|Figure 21: Extract Intervals Dialog]]
 
 
 
You'll finally be asked to select the result track position in the track list. The result track represents only the promoters of the genes of the reference track (figure 22).
 
[[image:Tutorial1 extract promoters3.png|center|frame|Figure 22: Gene Promoters (the track 8 contains only the gene promoters)]]
 
 
 
=== Score Promoters ===
 
Now that we have a track with the peaks and a track with the promoters we can score the promoters using the score of the peaks and export the result as a bed file.
 
 
 
To score the promoters, right click on the handler of the track with the promoters and select the "Score Exons" option of the "Operation" sub-menu (figure 23).
 
[[image:Tutorial1 score exons1.png|center|frame|Figure 23: Score Exons Menu]]
 
 
 
You'll be prompted to choose the track containing the scores. Select the track with the peaks extracted.
 
 
 
Then select maximum for the method of calculation and select a track where the result should appear (figure 15). Note that the color of the promoters represents the scores associated to the promoters (as described in the gene track section of the documentation).
 
[[image:Tutorial1 score exons2.png|center|frame|Figure 24: Score Exons Result]]
 
 
 
The last thing we need to do is to export the result of our analysis. Right click on the newly created track handler. Select "Save As". Choose where you want to save the track and make sure that the file type is set to Bed file. Once it's done, you can open the file that you created with a text editor such as notepad. You'll notice that the result file contains the position  (field 1 to 3) of the promoters, the name of the genes (field 4), the strand of the gene (field 6) as well as the scores of the promoters (field 5). For more details about the result file you can refer to the File Type section of the documentation.
 
  
 
== TimEX Analysis ==
 
== TimEX Analysis ==
'''Goal:''' This tutorial illustrates how GenPlay can be used to show timing of replication profiles.  The goal of the tutorial is to compute the correlation coefficient between the replication timing in human embryonic stem (ES) cells and in primary basophilic erythroblasts derived in culture from primary CD34 positive cells. The TimEX procedure is described in  Desprat et al. (Genome Res. 2009 Dec;19(12):2288-99). Briefly, the timing of DNA replication can be estimated by measuring the number of copy of each DNA segment in cells that are undergoing replication ( S phase cells) as compared to the number of copies of the same DNA segment for cells that have not yet started to replicate (cells in the G1 phase of the cel cycle). When both allele of any DNA segment replicate early in S phase, there are four copies during most of S phase and the average numbers of copies in a population of cells in S phase is close to 4. Hence the S/G1 ratio is close to 2 (since G1 cells have 2 copies of each DNA segments). By contrast when a DNA segment replicates late in S phase, the average numbers of copies in a population of cells in S phase is close to 2 and the S/G1 ratio close to 1. In this tutorial we will calculate the S/G1 ratio genome wide in 5000 bases windows for both cell types and then compare the results.
+
The [[TimEX Tutorial]] illustrates how GenPlay can be used to show timing of replication profiles.
 
 
 
 
 
 
'''Note:''' The following tutorial is based on the hg19 genome assembly which is the default genome assembly of GenPlay. If you previously changed the genome assembly used by GenPlay in the configuration menu you would need to restore the hg19 assembly. Please refer to the [[Documentation#Genome File|Documentation]] section of this website for more information on how to change the reference assembly.
 
 
 
'''Note:''' The final result of this tutorial is available as a project that can be loaded from the [[Web Start#Demo|Web Start]] page of this website.
 
 
 
=== Load the File ===
 
The first thing to do is to download the four files used during this tutorial.  The files are available [http://www.genplay.net/tutorials/TimEX/ here] (if your web-browser opens the files in a new tab or window, please select the Save As option of the File menu of your browser to retrieve the files.
 
 
 
After downloading the files, you can start GenPlay from the Web Start link that is located on top of this page. The 1 GB link is enough for this tutorial, but generally you should allocate as much memory as you can afford.
 
 
 
Once GenPlay is started, right click on the track  handler of the 1st row, in order to open the menu that will allow you to load  tracks (figure 1). Select the Load Fixed Window Track option.
 
[[image:TimEX_Fig1.png|center|frame|Figure 1: Load Menu]]
 
 
 
After selecting the ES G1 file, an option window is going to prompt you to enter information on how to load the data. You can keep the default name for the track or change it if you prefer. Then, you need to choose a window size and a method of score caculation. The size of the window that you should choose depend on the number of reads that are available. For timing of replication studies, we can choose a window size of 5,000bp.  The option for the score calculation are discussed in the Documentation.  For the type of files used in this example, you should choose sum as the method for the score calculation.  You can keep the default data precision. 
 
[[image:TimEX_Fig2.png|center|frame|Figure 2: Load Fixed Window Track Menu]]
 
 
 
Once your track is loaded you need to repeat the same operation for the 3 other files.  Beside the name of the track, all the other options are the same.
 
 
 
=== Filtering the G1 tracks ===
 
Now that our tracks are loaded, we need to filter the windows with less than 8 reads for not being statistically significant.
 
 
 
This step is necessary becasue we found that windows with low number of reads were increasing the signal to noise ratio because of excessive sampling errors.
 
 
 
To filter the windows with less than 8 reads, right-click on the ES G1 track handler, select the operation menu and then select the filter sub-menu (figure 3).
 
 
 
Then select the Threshold filter option and set the values as shown on figure 4. Click on the OK button in order to remove all the windows with a score smaller than 8.
 
 
 
[[image:TimEX_Fig3.png|center|frame|Figure 3: Filter Menu]]
 
[[image:TimEX_Fig4.png|center|frame|Figure 4: Threshold Filter]]
 
 
 
=== Normalizing the tracks ===
 
In order to be comparable, all the tracks need to be normalized.
 
 
 
The normalize operation of GenPlay divides each score by the some of all the scores of the track and multiply the result by a large constant specified by the user. The only purpose of the constant is to make the score more readable.
 
Let's start by normalizing the ES G1 track. You need first to right click on the track handler. After that, select the normalize option of the operation sub-menu as shown on figure 5. You can keep the default constant, just click OK. After that operation the score are expressed as number of read per window per 10 million reads (if you kept the default constant) because the window score represents the sum of the reads that mapped to that window, the total of all the score is therefore the total number of reads.
 
After normalisation each score  is equal to  (# of read per window *10,000,000)/total # of reads.
 
 
 
 
 
Once normalisation is done for  the first track. The three other tracks are processed in the same manner.
 
 
 
[[image:TimEX_Fig5.png|center|frame|Figure 5: Normalize]]
 
 
 
=== Computing the ratio S / G1 ===
 
The next step is to generate new tracks representing the ratio S / G1 for each cell type.  To do that you need to right click on the ES S track handler and then to select the "Two Tracks Operation" option of the Operation menu.  After that, you need to select the ES G1 track when prompted.  Select a track where you want to generate the result and select the division operation.
 
 
 
After repeating the same operation with the erythroid cells the result should be similar to the one on figure 6.
 
 
 
[[image:TimEX_Fig6.png|center|frame|Figure 6: S / G1]]
 
 
 
=== Gaussing the tracks ===
 
 
 
If you zoom out, you can already see that the S/G1 ratio varies and that there seem to be replication timing domains that can be several megabases in size.
 
We now want to smooth and remove the noise from the curves to obtained a more usable curve.
 
 
 
So far, two smoothing algorithms had been incorporated to GenPlay: the Gaussian smoothing and the Loess smoothing.  In practice these two operations produce similar results.
 
 
 
To gauss the tracks, first right click on one of the two S / G1 track handler and select the Gauss option of the operation menu.  We need to set the gaussian smoothing parameter Sigma.  We set Sigma to 200,000 kb for this experiment.
 
 
 
Repeat the operation for the second track.  The figure 7 show the result of the smoothing.
 
[[image:TimEX_Fig7.png|center|frame|Figure 7: Gaussian Smoothing]]
 
 
 
=== Saturating and Indexing the tracks ===
 
To make the tracks easier to compare we need to index (rescale) them between 0 and 100.
 
 
 
Before indexing the data we want to saturate the 1% greatest and smallest value in order to reduce the effect of eventual outliers.  To saturate the track you need to right click on the track handler and select the Operation > Filter menu.  On the filter option select the percentage filter and set the parameters as shown on figure 8.  You need to do this operation for the ES and ERY tracks.
 
 
 
Now we need to index the two tracks.  Select one of the track, right click on the track handler to show the contextual menu and click on the Operation > Index option.  Set the new minimum to 0 and the new maximum to 100.  Repeat the operation for the second track.  The result should be similar to what is shown on figure 9.
 
  
 +
== Multi-Genome Analysis ==
 +
The [[GRCh37/hg19 GRCh38/hg38 Multi-Genome Tutorial]] explains how to use the multi-genome functionality of GenPlay.
  
[[image:TimEX_Fig8.png|center|frame|Figure 8: Saturate Menu]]
+
It shows how data aligned on hg38 and hg19 can be displayed simultaneously and compared using GenPlay Multi-Genome
[[image:TimEX_Fig9.png|center|frame|Figure 9: Index Result]]
 
  
=== Computing the Correlation Coefficient ===
+
'''Note:''' Here is a version for the comparison of hg18 and hg19: [[Multi-Genome Tutorial]].
The last step is to compute the correlation coefficient between these two tracks. To do so you need to select the Operation > Correlation option of the contextual menu.  When asked choose the second track for the correlation.  A window showing the correlation coefficient for each chromosome as well as the genome wide correlation should pop-up (figure 10).
 
  
[[image:TimEX_Fig10.png|center|frame|Figure 10: Correlation Coefficient]]
+
== How to Create a VCF File From a Chain File ==
 +
The goal of [[How to Create a VCF File From a Chain File|this tutorial]] is to show how to generate a VCF file such as the one used in the [[GRCh37/hg19 GRCh38/hg38 Multi-Genome Tutorial]] from a Chain file that can be downloaded from the UCSC genome browser website.

Latest revision as of 11:18, 22 August 2014

The following tutorials aim to give you some of the basic concept on the track manipulation techniques.

ChIP-Seq Analysis

The objective of the ChIP-Seq Tutorial is to illustrate how GenPlay can be used to isolate peaks from the data generated from a ChIP-Seq experiment.

TimEX Analysis

The TimEX Tutorial illustrates how GenPlay can be used to show timing of replication profiles.

Multi-Genome Analysis

The GRCh37/hg19 GRCh38/hg38 Multi-Genome Tutorial explains how to use the multi-genome functionality of GenPlay.

It shows how data aligned on hg38 and hg19 can be displayed simultaneously and compared using GenPlay Multi-Genome

Note: Here is a version for the comparison of hg18 and hg19: Multi-Genome Tutorial.

How to Create a VCF File From a Chain File

The goal of this tutorial is to show how to generate a VCF file such as the one used in the GRCh37/hg19 GRCh38/hg38 Multi-Genome Tutorial from a Chain file that can be downloaded from the UCSC genome browser website.