GenPlay Multi-Genome
From GenPlay, Einstein Genome Analyzer
The GenPlay Multi-Genome Project
Introduction
GenPlay Multi-Genome is a project under development that provides a unique platform to analyze personal genome and epigenome information. Because the number of individual genomes sequenced is expected to increase exponentially in the next few years, we expect this feature of GenPlay to become extremely useful for analyzing multiple epigenomes in parallel.
To our knowledge, there are presently no applications to compare human epigenomes in real time. Tools such as Vista and D-code are very useful for comparing the structure of genomes across species, but they cannot be used to study multiple ChIP-Seq or RNA-Seq tracks aligned on different genomes.
To allow users to perform such analyses, we are writing a new version of GenPlay in which several VCF files representing several genomes can be loaded at the same time. After loading the VCF files, GenPlay creates a meta-genome that is the sum of all the genomes to be studied together. Once the meta-genome is generated, GenPlay users can load tracks aligned on any genome as well as perform all the operations available in the software. SNPs, deletions, insertions and structural variants appear as stripes.
Description
The price of sequencing personal genomes is rapidly decreasing and will likely fall below $1,000 in the next few years. Genomes can be sequenced to various degrees of accuracy. At a low level of precision, sequencing only identifies SNPs relative to a reference genome. As the technology improves, however, the complete sequence of an individual will in fact be the sequence of two haploid genomes, including complete phasing of all SNPs, identification of indels (defined as any deletion or insertion relative to the reference sequence, therefore including CNVs) and structural variations, as well as novel sequences not present in any existing reference genome.
At the current time, the tools available to visualize all this information are very primitive, particularly in the case of structural variations, insertions and novel sequences, which cannot be compared to existing genomes. We are not aware of any tool that allows researchers to graphically compare multiple genomes, and particularly multiple epigenomes, at the allelic level. Because of this lack of existing tools, we are currently adding this function to GenPlay.
The main function of GenPlay Multi-Genome will be to allow users to load experimental tracks (RNA-seq, ChIP-seq, DNA methylation, TimEX, etc.) aligned on different genomes (either haploid genomes, in which all SNPs and variants are phased, or unphased diploid genomes). GenPlay Multi-Genome will be capable of comparing and transforming all tracks loaded in a multi-genome session using the operations currently available in GenPlay. It will also be able to export the results of such analyses in several formats (bed, bgr, gff, sam, fasta, etc.) in the coordinates of any of the loaded genomes.
These new capabilities are important because when new human genomes are sequenced and properly analyzed, new sequences not present in the reference genomes are often discovered by de novo assembly. These new sequences cannot be studied if the results of expression or epigenomic experiments are aligned to the reference genome, since the data pertaining to them will appear as unmatched. Alignments of experimental results should ideally be done against the genome of the cells being studied or, if that genome is not available, against multiple genomes. A major difficulty with this approach is that the results of such alignments cannot be easily compared with the large amounts of existing data (aligned on NCBI36/hg18 or GRCh37/hg19) or with standard annotation tracks (RefSeq, UCSC Genes, AceView, etc.) without converting all the results into the same genomic coordinate system. GenPlay Multi-Genome will perform all of these conversions seamlessly and in real time.
GenPlay Multi-Genome will also include novel filters and operations specific to multi-genome sessions. For instance, we will program functions to display, project and perform correlation and other operations only on the variations that differ between two or more genomes. GenPlay will be able to display only variants of a particular type (SNP, indel, SV, etc.), variants that lie within particular features (promoters, gene bodies, CpGs, CpG islands, ChIP-seq peaks, etc.), variants that lie no further than a specified distance from a feature, and so on.
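The filters described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not GenPlay code: the `Variant` and `Feature` records, the `distance` helper and the `select` function are all hypothetical names introduced for this example, and coordinates are assumed to be 0-based with half-open feature intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int    # 0-based position (assumption made for this sketch)
    kind: str   # "SNP", "indel" or "SV"


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def distance(variant, feature):
    """Bases between a variant and a feature; 0 if inside, None if on another chromosome."""
    if variant.chrom != feature.chrom:
        return None
    if variant.pos < feature.start:
        return feature.start - variant.pos
    if variant.pos >= feature.end:
        return variant.pos - feature.end + 1
    return 0


def select(variants, kinds=None, features=None, max_distance=0):
    """Keep variants of the given kinds that lie within max_distance of any feature."""
    kept = []
    for v in variants:
        if kinds is not None and v.kind not in kinds:
            continue
        if features is not None:
            dists = [distance(v, f) for f in features]
            if not any(d is not None and d <= max_distance for d in dists):
                continue
        kept.append(v)
    return kept


# Made-up data for illustration only.
variants = [
    Variant("chr1", 100, "SNP"),
    Variant("chr1", 250, "indel"),
    Variant("chr1", 1200, "SNP"),
    Variant("chr2", 100, "SV"),
]
promoter = Feature("chr1", 200, 300, "promoter")

snps_only = select(variants, kinds={"SNP"})
in_promoter = select(variants, features=[promoter])
near_promoter = select(variants, features=[promoter], max_distance=100)
```

With this data, `snps_only` keeps the two SNPs, `in_promoter` keeps only the indel at position 250, and `near_promoter` additionally keeps the SNP at position 100, which lies exactly 100 bases upstream of the feature.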
Under the hood of GenPlay Multi-Genome
The general strategy to perform all the necessary conversions is based on the creation of a meta-genome and of difference files (.diff files) when the VCF files are loaded. The meta-genome is the sum of all the loaded genomes: it is larger than any of them because it contains the sequences of all of them. A .diff file contains all the differences between the meta-genome and one loaded genome. In a .diff file, the positions and sequences of all the variants in a genome are represented in two sets of coordinates. For instance, loading two VCF files representing two genomes (G1 and G2) mapped relative to GRCh37/hg19 leads GenPlay to create three .diff files: (hg19/meta), (G1/meta) and (G2/meta). Using the .diff files, GenPlay can convert the coordinates of any feature of any loaded genome into those of any other loaded genome. This procedure is similar to translating multiple languages into each other by first translating them all into the same meta-language.
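The conversion through the meta-genome can be sketched as follows. This is an illustration of the idea only, written in Python, and not GenPlay's implementation or its .diff file format: the `Diff` class, its `gaps` representation (meta-genome intervals that are absent from one genome) and the `convert` function are assumptions made for the example, and all coordinates are 0-based.

```python
from bisect import bisect_right
from itertools import accumulate


class Diff:
    """Hypothetical stand-in for a .diff file: meta-genome intervals absent from one genome."""

    def __init__(self, gaps):
        # gaps: non-overlapping (meta_start, length) intervals missing from this genome
        self.gaps = sorted(gaps)
        self.starts = [start for start, _ in self.gaps]
        # cum[i] = number of absent bases contained in the first i gaps
        self.cum = [0] + list(accumulate(length for _, length in self.gaps))
        # genome coordinate at which each gap falls
        self.genome_starts = [s - c for s, c in zip(self.starts, self.cum)]

    def to_meta(self, pos):
        """Genome coordinate -> meta-genome coordinate."""
        i = bisect_right(self.genome_starts, pos)
        return pos + self.cum[i]

    def from_meta(self, meta_pos):
        """Meta-genome coordinate -> genome coordinate, or None if absent from this genome."""
        i = bisect_right(self.starts, meta_pos)
        if i > 0:
            start, length = self.gaps[i - 1]
            if meta_pos < start + length:
                return None
        return meta_pos - self.cum[i]


def convert(pos, source, target):
    """Translate a position between two genomes by going through the meta-genome."""
    return target.from_meta(source.to_meta(pos))


# Toy example: G1 carries a 3-base insertion before reference position 5,
# and G2 carries a deletion of reference positions 10 and 11.
# The meta-genome is therefore the reference plus the 3 inserted bases.
hg19 = Diff([(5, 3)])           # the reference lacks the G1 insertion
g1 = Diff([])                   # G1 contains every meta-genome base
g2 = Diff([(5, 3), (13, 2)])    # G2 lacks the insertion and the deleted bases

print(convert(12, hg19, g2))    # 10
print(convert(12, hg19, g1))    # 15
print(convert(6, g1, hg19))     # None: position lies inside the insertion
```

Because every genome is described only relative to the meta-genome, adding a genome adds one difference table rather than one table per pair of genomes, which is the point of the meta-language analogy above.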
Beta-Version
Files
Coming Soon
Web Start
Coming Soon