How to Create a VCF File From a Chain File
From GenPlay, Einstein Genome Analyzer
Goal: This tutorial illustrates how to generate a VCF file describing the differences between two reference genomes from a Chain file. In this tutorial we will create an hg19 to hg38 VCF file, that is, a VCF file that describes hg19 using hg38 as its reference genome.
The tutorial is divided into two steps. The first step consists of generating a VCF file containing the insertions and deletions, using a Chain file and a Scala program called ChainToVCF.
In the second step we will use GenPlay and ChainToVCF to add SNPs to our VCF file.
Prerequisite: You will need to have a Linux or Mac computer.
Scala needs to be installed on your computer. Scala is available to download from http://www.scala-lang.org/download/
GenPlay needs to be installed on your computer. If you haven't installed GenPlay yet, please visit the Downloads page and follow the instructions to download and install GenPlay.
First, let's download the files needed from the UCSC genome browser. All the files are available from the download page of the UCSC genome browser. We will need the following:
1. hg19 to hg38 chain file (this file can be found in the LiftOver section): http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
2. hg19 reference file in 2bit format (from the full dataset section): http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
3. hg38 reference file in 2bit format: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
Note that the chain file is downloaded compressed (.gz) and needs to be uncompressed before it is used, for example with gunzip.
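As a minimal sketch, the chain file could also be uncompressed from Python using only the standard library. The function name decompress_gz is illustrative, and the output name is simply the input name with the .gz suffix dropped.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src: str) -> Path:
    """Decompress a .gz file next to the original and return the new path."""
    src_path = Path(src)
    if src_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src_path.name}")
    # hg19ToHg38.over.chain.gz -> hg19ToHg38.over.chain
    dst_path = src_path.with_suffix("")
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        # Stream the data so that a large file is never held fully in memory.
        shutil.copyfileobj(fin, fout)
    return dst_path
```

Calling decompress_gz("hg19ToHg38.over.chain.gz") writes hg19ToHg38.over.chain in the same directory as the compressed file.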
Generate a VCF with the Insertions and the Deletions
In order to generate our VCF file we will need the Scala program ChainToVCF available at https://github.com/JulienLajugie/ChainToVCF/releases/download/V1.0/ChainToVCF.jar
First, make sure that Scala is properly installed on your system.
Then, from a terminal, run the following command:
scala -classpath ./ChainToVCF.jar edu.yu.einstein.chainToVCF.ChainToVCF --chain ./hg19ToHg38.chain --source hg19 --target hg38 > hg19ToHg38.vcf
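Modify the paths to ChainToVCF.jar and hg19ToHg38.chain if needed. In particular, the chain file downloaded above is named hg19ToHg38.over.chain once uncompressed, so either rename it or adjust the --chain argument accordingly.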
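For scripted runs, the command above can be wrapped in a short Python helper. This is only a sketch: the helper names are hypothetical, and the class name and the --chain, --source and --target options are copied from the command above, not from any further ChainToVCF documentation.

```python
import shutil
import subprocess
from typing import List


def build_chain_to_vcf_command(jar: str, chain: str, source: str, target: str) -> List[str]:
    """Return the argument list for the ChainToVCF command shown in this tutorial."""
    return [
        "scala", "-classpath", jar,
        "edu.yu.einstein.chainToVCF.ChainToVCF",
        "--chain", chain,
        "--source", source,
        "--target", target,
    ]


def run_chain_to_vcf(jar: str, chain: str, source: str, target: str, out_vcf: str) -> None:
    """Run ChainToVCF and redirect its standard output to out_vcf."""
    if shutil.which("scala") is None:
        raise RuntimeError("scala was not found on the PATH; install Scala first")
    command = build_chain_to_vcf_command(jar, chain, source, target)
    with open(out_vcf, "w") as out:
        # check=True raises CalledProcessError if the program exits with an error.
        subprocess.run(command, stdout=out, check=True)
```

With the files of this tutorial, the call would be run_chain_to_vcf("./ChainToVCF.jar", "./hg19ToHg38.chain", "hg19", "hg38", "hg19ToHg38.vcf").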
The resulting VCF will contain all the insertions and deletions of hg19, using hg38 as the reference genome. In the next step we will add the SNPs to our VCF file.
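To see why a Chain file is enough to recover insertions and deletions, it helps to look at its structure. In the UCSC chain format, each chain starts with a header line, followed by lines of the form "size dt dq": an aligned block of size bases, then a gap of dt bases in the target assembly and dq bases in the query assembly. In hg19ToHg38.over.chain the target is hg19 and the query is hg38. The Python sketch below walks through the gaps and labels them from the point of view of hg38. It is an illustration of the idea, not the actual ChainToVCF implementation: the names Gap, iter_gaps and classify are hypothetical, and chains that are not on the + strand on both sides are skipped to keep the coordinates simple.

```python
from typing import Iterable, Iterator, NamedTuple


class Gap(NamedTuple):
    t_name: str  # chromosome in the target assembly (hg19 in hg19ToHg38)
    t_pos: int   # 0-based target position where the gap starts
    q_name: str  # chromosome in the query assembly (hg38 in hg19ToHg38)
    q_pos: int   # 0-based query position where the gap starts
    dt: int      # number of bases present only in the target
    dq: int      # number of bases present only in the query


def iter_gaps(lines: Iterable[str]) -> Iterator[Gap]:
    """Yield the gaps between aligned blocks of every +/+ chain."""
    t_name = q_name = ""
    t_pos = q_pos = 0
    usable = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "chain":
            # chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
            t_name, t_pos = fields[2], int(fields[5])
            q_name, q_pos = fields[7], int(fields[10])
            # Assumption of this sketch: reverse-strand chains are ignored.
            usable = fields[4] == "+" and fields[9] == "+"
        elif len(fields) == 3 and usable:
            size, dt, dq = (int(f) for f in fields)
            t_pos += size  # skip over the aligned block
            q_pos += size
            yield Gap(t_name, t_pos, q_name, q_pos, dt, dq)
            t_pos += dt    # skip over the gap on each side
            q_pos += dq
        # A line with a single number is the last block of a chain: no gap follows.


def classify(gap: Gap) -> str:
    """Label a gap as seen from hg38, the reference genome of the VCF."""
    if gap.dt > 0 and gap.dq == 0:
        return "insertion"  # hg19 carries bases that hg38 lacks
    if gap.dq > 0 and gap.dt == 0:
        return "deletion"   # hg38 carries bases that hg19 lacks
    return "complex"        # anything else, for example a gap on both sides
```

A real converter also has to fetch the inserted or deleted bases from the 2bit sequences and handle reverse-strand chains, which this sketch leaves out.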
Add SNPs to the VCF File
The process of adding SNPs to the VCF file can be divided into two parts.
We will first extract the differences between the hg19 and hg38 sequences. GenPlay can generate a BGR file containing these differences.
After that, we will use the ChainToVCF program previously downloaded to add SNPs to the VCF file.
Generate a BGR File Containing the Remaining Differences Between hg19 and hg38 Using GenPlay
To add SNPs to the VCF file we need to extract the remaining differences from the sequences of hg19 and hg38. This can be done using GenPlay.
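Conceptually, once the insertions and deletions are accounted for, the remaining differences are single-base mismatches between aligned stretches of the two assemblies. The Python sketch below illustrates that idea on two short, already aligned strings. It is not how GenPlay implements the comparison; the function name find_mismatches and the choice to ignore letter case and N bases are assumptions made for the example.

```python
from typing import List, Tuple


def find_mismatches(ref_seq: str, other_seq: str, ref_start: int = 0) -> List[Tuple[int, str, str]]:
    """Return (position, reference base, other base) for every mismatch.

    Both sequences must already be aligned without gaps, so they have the
    same length. Positions are offsets from ref_start on the reference.
    """
    if len(ref_seq) != len(other_seq):
        raise ValueError("sequences must have the same length")
    mismatches = []
    for offset, (ref_base, other_base) in enumerate(zip(ref_seq.upper(), other_seq.upper())):
        # Assumption: N (unknown base) never counts as a difference.
        if ref_base != other_base and "N" not in (ref_base, other_base):
            mismatches.append((ref_start + offset, ref_base, other_base))
    return mismatches
```

Here ref_seq would be a stretch of hg38 and other_seq the matching stretch of hg19, since hg38 is the reference genome of the VCF.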
You first need to go through the following tutorial hg19_GRCh38/hg38_Multi-Genome_Tutorial. Use the VCF generated during the previous phase instead of the one provided for the tutorial.
At the end of the tutorial you should end up with a sequence layer for hg19 and a sequence layer for hg38. Right-click on the track handler of the hg38 track, select the hg38 layer at the bottom of the contextual menu, and then select the Compare Sequences option (figure 1). Select hg19 when prompted to select a second layer, and select an empty track for the result.
Generating the differences between two sequence layers is slow and can take a couple of hours.
Once this is done, right-click on the track handler of the newly created layer, select the layer sub-menu at the bottom of the contextual menu and select "Save As" (figure 2). Select a BGR extension in the file dialog and select hg38 as the reference genome when prompted. Let's call the file hg19ToHg38-SNP.bgr.