The GenPlay Multi-Genome Project
GenPlay Multi-Genome (GenPlay-MG) is a project under development that will provide a novel platform for analyzing personal genome and epigenome information. At present, very few applications take advantage of the growing body of data on how the structure of the genome varies between humans. A few tools exist to compare the structure of genomes across species, but none can be used to study multiple ChIP-Seq or RNA-Seq tracks aligned on different genomes.
To allow users to perform such analyses, we are writing GenPlay-MG, a new version of GenPlay that can load VCF files summarizing the differences between human genomes. After loading the VCF files, GenPlay creates a meta-genome that is the sum of all the genomes to be analyzed in the multi-genome session. Once the meta-genome is generated, users can load tracks aligned on any of these genomes, perform all the operations available in the software, and export the data. SNPs, deletions, insertions and structural variants appear as stripes. GenPlay-MG can be used, for example, to visualize results of the 1000 Genomes Project or to convert between assemblies.
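As an illustration of how the variant types listed above can be told apart when a VCF file is read, here is a minimal Python sketch. It is not GenPlay's loader: the function name `classify_vcf_line` and the returned tuple layout are hypothetical, and the rules are a simplification based only on the REF and ALT columns of a VCF data line.

```python
from typing import List, Tuple


def classify_vcf_line(line: str) -> List[Tuple[str, int, str, str, str]]:
    """Classify each ALT allele of one tab-separated VCF data line.

    Returns (chrom, pos, ref, alt, kind) tuples. The classification is a
    simplification made for this example, not GenPlay's own logic.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    results = []
    for allele in alt.split(","):
        if allele in (".", "*"):
            # No alternate allele, or allele removed by an upstream deletion.
            continue
        if allele.startswith("<") or "[" in allele or "]" in allele:
            kind = "SV"  # symbolic allele or breakend notation
        elif len(allele) == len(ref):
            kind = "SNP" if len(ref) == 1 else "MNP"
        elif len(allele) > len(ref):
            kind = "insertion"
        else:
            kind = "deletion"
        results.append((chrom, int(pos), ref, allele, kind))
    return results
```

For example, `classify_vcf_line("chr1\t200\t.\tA\tATT,<DEL>\t.\tPASS\t.")` yields one insertion and one structural variant at position 200.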
The cost of sequencing personal genomes is rapidly decreasing and will likely fall below $1,000 in the next few years. Genomes can be sequenced to various degrees of accuracy. Currently, most genomes are sequenced at a low level of precision. A genome is considered sequenced when all the SNPs and, for the best published genomes, the indels and structural variants have been identified by comparison to a reference genome. However, as the technology improves, we are moving toward a much more stringent definition of a completely sequenced genome: several projects are under way to provide the complete sequence of an individual, defined as the sequence of the two haploid genomes of maternal and paternal origin. This involves identifying and phasing all SNPs, indels (deletions, insertions, CNVs) and structural variations (inversions, transpositions, translocations), as well as identifying by de novo assembly all novel sequences not present in any existing reference genome.
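In VCF files, phasing is recorded in the genotype (GT) field: alleles separated by `|` are phased, alleles separated by `/` are not. The sketch below, with hypothetical helper names `parse_genotype` and `haplotype_alleles`, shows how the allele carried by each haplotype can be recovered from a phased genotype. Note that the VCF format itself does not say which haplotype is maternal and which is paternal; that assignment needs additional information.

```python
from typing import List, Optional, Tuple


def parse_genotype(gt: str) -> Tuple[bool, List[Optional[int]]]:
    """Split a VCF GT string such as "0|1" or "0/1" into allele indices.

    Returns (phased, indices); a missing allele (".") becomes None.
    """
    phased = "|" in gt
    separator = "|" if phased else "/"
    indices = [None if a == "." else int(a) for a in gt.split(separator)]
    return phased, indices


def haplotype_alleles(ref: str, alts: List[str], gt: str) -> Optional[List[Optional[str]]]:
    """Return the allele on each haplotype, or None if the genotype is unphased."""
    phased, indices = parse_genotype(gt)
    if not phased:
        return None
    choices = [ref] + alts  # index 0 is REF, 1..n are the ALT alleles
    return [None if i is None else choices[i] for i in indices]
```

With REF `A` and ALT `G`, the genotype `0|1` gives `["A", "G"]` (first haplotype carries the reference allele), while `0/1` gives `None` because the haplotype of origin is unknown.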
At present, very few tools are available to visualize all this information, particularly phased structural variations, insertions and novel sequences, because most existing browsers can only display one reference genome at a time. We are not aware of any tool that allows researchers to graphically compare multiple genomes, and particularly multiple epigenomes, at the allelic level. To fill this gap, we are in the process of adding this function to GenPlay.
The main function of GenPlay Multi-Genome (GenPlay-MG) is to allow users to load experimental tracks (RNA-seq, ChIP-seq, DNA methylation, TimEX, etc.) aligned on different genomes (either haploid genomes, in which all SNPs and variants are phased, or unphased diploid genomes). Because GenPlay-MG, like all former versions of GenPlay, loads most tracks entirely in RAM, it can compare and transform all the tracks loaded in a multi-genome session using the operations currently available in GenPlay. GenPlay-MG will also be able to export the results of such analyses in the coordinates of any of the loaded genomes, in several formats (bed, bgr, gff, sam, fasta, etc.).
These new capabilities are important because when new human genomes are sequenced and properly analyzed, new sequences not present in the reference genome are often discovered by de novo assembly, and because the junction fragments for all indels and SVs do not exist in the reference genome. Therefore, when the results of expression or epigenomic experiments are aligned only to a reference genome, much of the variation is lost simply because these new sequences and junction fragments are absent from the reference. In addition, many reads are misaligned because the reference differs from the tested sample. This can mask important biological differences. Alignments of experimental results should ideally be done against the genome of the cells being studied or, if that genome is not available, against multiple genomes. A major difficulty with this "ideal" approach is that the results of such alignments cannot easily be compared with standard annotation tracks (RefSeq, UCSC Genes, AceView, etc.) or with the large amounts of data generated by large consortia (ENCODE, etc.), which are all based on NCBI36/hg18 or GRCh37/hg19.
GenPlay-MG allows users to perform this type of analysis by seamlessly performing, in real time, all the necessary conversions between the different coordinate systems and by graphically displaying all the tracks at the same time.
GenPlay-MG will also include novel filters and operations specific to multi-genome sessions. For instance, we will program functions to display, project, and perform correlations and other operations only on the variations that differ between two or more genomes. GenPlay will be able to display only variants of a particular type (SNP, indel, SV, etc.), variants that fall within particular features (promoters, gene bodies, CpGs, CpG islands, ChIP-seq peaks, etc.), variants that lie no more than a specified distance from a feature, and so on.
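The kind of filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not GenPlay code: the `Interval` type, the `gap` and `filter_variants` functions, and the 0-based half-open coordinates are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive
    label: str   # variant type ("SNP", "DEL", ...) or feature name


def gap(a: Interval, b: Interval) -> Optional[int]:
    """Bases separating two intervals; 0 if they overlap or touch,
    None if they are on different chromosomes."""
    if a.chrom != b.chrom:
        return None
    return max(b.start - a.end, a.start - b.end, 0)


def filter_variants(
    variants: Iterable[Interval],
    kinds: Optional[Set[str]] = None,
    features: Optional[List[Interval]] = None,
    max_distance: int = 0,
) -> List[Interval]:
    """Keep variants of the requested kinds that lie within max_distance
    bases of at least one feature (max_distance=0 means overlap or adjacency)."""
    kept = []
    for v in variants:
        if kinds is not None and v.label not in kinds:
            continue
        if features is not None:
            gaps = [gap(v, f) for f in features]
            if not any(g is not None and g <= max_distance for g in gaps):
                continue
        kept.append(v)
    return kept


# Hypothetical example data.
VARIANTS = [
    Interval("chr1", 150, 151, "SNP"),
    Interval("chr1", 250, 260, "DEL"),
    Interval("chr2", 150, 151, "SNP"),
]
PROMOTERS = [Interval("chr1", 100, 200, "promoter")]
```

Here `filter_variants(VARIANTS, features=PROMOTERS)` keeps only the SNP inside the promoter, while `max_distance=50` also keeps the deletion that starts 50 bases downstream of it. A linear scan over features is used for clarity; an interval index would be preferable for genome-wide data.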
UNDER THE HOOD
The general strategy for performing the conversions in real time is based on the creation of a meta-genome and of difference files (.diff files) when the VCF files are loaded. The meta-genome is the sum of all the loaded genomes: it is larger than any single loaded genome because it contains the sequences of all of them. A .diff file contains all the differences between the meta-genome and one loaded genome. In a .diff file, the positions and sequences of all the variants in a genome are represented in two sets of coordinates. For instance, loading two VCF files representing two genomes (G1 and G2) mapped relative to GRCh37/hg19 leads GenPlay to create three .diff files: hg19/meta, G1/meta and G2/meta. Using the .diff files, GenPlay can convert the coordinates of any feature of any loaded genome into the coordinates of any other loaded genome. This procedure is similar to translating multiple languages into each other by first translating them all into the same meta-language.
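The two-step conversion through the meta-genome can be sketched as follows. This is a toy model, not the actual .diff file format or GenPlay's implementation: the `Block` and `Diff` classes are hypothetical, coordinates are 0-based, and the model assumes that each genome is collinear with the meta-genome (insertions and deletions only, no inversions or translocations).

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """A run of bases present both in a genome and in the meta-genome."""
    genome_start: int
    meta_start: int
    length: int


class Diff:
    """Toy stand-in for one genome/meta .diff file (hypothetical structure)."""

    def __init__(self, blocks: List[Block]):
        self.blocks = sorted(blocks)
        self._genome_starts = [b.genome_start for b in self.blocks]
        self._meta_starts = [b.meta_start for b in self.blocks]

    def to_meta(self, pos: int) -> Optional[int]:
        i = bisect_right(self._genome_starts, pos) - 1
        if i < 0:
            return None
        b = self.blocks[i]
        offset = pos - b.genome_start
        return b.meta_start + offset if offset < b.length else None

    def from_meta(self, meta_pos: int) -> Optional[int]:
        i = bisect_right(self._meta_starts, meta_pos) - 1
        if i < 0:
            return None
        b = self.blocks[i]
        offset = meta_pos - b.meta_start
        # None means this meta position has no counterpart in this genome.
        return b.genome_start + offset if offset < b.length else None


def convert(pos: int, source: Diff, target: Diff) -> Optional[int]:
    """Convert a position between two genomes by going through the meta-genome."""
    meta_pos = source.to_meta(pos)
    return None if meta_pos is None else target.from_meta(meta_pos)


# Made-up example: a 100-base reference, G1 carries a 5-base insertion before
# reference position 10, and G2 lacks reference positions 20-29.
HG19 = Diff([Block(0, 0, 10), Block(10, 15, 90)])
G1 = Diff([Block(0, 0, 105)])
G2 = Diff([Block(0, 0, 10), Block(10, 15, 10), Block(20, 35, 70)])
```

In this example `convert(50, HG19, G2)` returns 40, because the 10 deleted bases shift everything downstream, and `convert(12, G1, HG19)` returns `None`, because position 12 of G1 lies inside the insertion and has no equivalent in the reference. Only one mapping per genome is needed, which is the advantage of a shared meta-coordinate system over pairwise conversion tables.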
In the current beta version, the concept of the meta-genome has been implemented. It is currently possible to load several VCF files as well as tracks aligned on any of the loaded genomes. None of the multi-genome-specific operations have been programmed yet, but most of the existing operations seem to work fine.
As it stands, GenPlay-MG (beta) can be used, for instance, to visualize some of the results of the 1000 Genomes Project. It can also be used to load, at the same time, data tracks aligned on both NCBI36 (hg18) and GRCh37 (hg19).
Several VCF files from the multi-genome project can be found in the GenPlay library. All of these files are in reference to GRCh37/hg19.
We have also created a VCF file that contains all the differences between NCBI36/hg18 and GRCh37/hg19. Once this file is loaded, tracks aligned on both assemblies can be loaded in GenPlay at the same time. It is pretty cool! Try it. Creating an accurate genome-wide VCF file between hg18 and hg19 is quite cumbersome with the data available, and there might be a few bugs in the current version of our hg18 to hg19 VCF. We are working on producing a better file.
Comments, suggestions and contributions are welcome!
We are working on preliminary documentation for GenPlay-MG (beta).