GenPlay Multi-Genome

From GenPlay, Einstein Genome Analyzer

Revision as of 16:38, 9 June 2011 by Nicolas

The GenPlay Multi-Genome Project

Introduction


GenPlay Multi-Genome is a project under development that provides a platform for analyzing personal genome and epigenome information. Because the number of individual genomes sequenced is expected to increase exponentially in the next few years, we expect this feature of GenPlay to become extremely useful for analyzing multiple epigenomes in parallel.

Presently, there are no applications for comparing human epigenomes in real time. Tools such as Vista and D-code are very useful for comparing the structure of genomes across species, but they cannot be used to study multiple ChIP-Seq or RNA-Seq tracks aligned on different genomes.

To allow users to perform such analyses, we are writing a new version of GenPlay in which several VCF files representing several genomes can be loaded at the same time. After loading the VCF files, GenPlay creates a meta-genome that is the sum of all the genomes to be studied at the same time. Once the meta-genome is generated, GenPlay users can load tracks aligned on any genome, perform all the operations available in the software, and export the data. SNPs, deletions, insertions and structural variants appear as stripes.
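As an illustration of the kind of information a VCF file contributes, the sketch below reads the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and labels each record by comparing the lengths of the REF and ALT alleles. This is not GenPlay's loader: the names `Variant`, `classify` and `parse_vcf_line` are hypothetical, only the first ALT allele is considered, and symbolic alleles such as `<DEL>` are simply labeled "other".

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position on the reference, as in VCF
    ref: str   # reference allele
    alt: str   # first alternative allele
    kind: str  # "SNP", "insertion", "deletion" or "other"


def classify(ref: str, alt: str) -> str:
    """Label a variant from the lengths of its REF and ALT alleles."""
    if alt.startswith("<"):            # symbolic allele, e.g. <DEL>
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"                     # e.g. multi-base substitution


def parse_vcf_line(line: str) -> Variant:
    """Parse one VCF data line (not a '#' header line)."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    first_alt = alt.split(",")[0]      # assumption: ignore other ALT alleles
    return Variant(chrom, int(pos), ref, first_alt, classify(ref, first_alt))


records = [
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t200\t.\tA\tACGT\t.\tPASS\t.",
    "chr1\t300\t.\tACGT\tA\t.\tPASS\t.",
]
variants = [parse_vcf_line(r) for r in records]
```

A viewer could then draw one stripe per record, colored by `kind`.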

Description


The cost of sequencing personal genomes is rapidly decreasing and will likely fall below $1,000 in the next few years. Genomes can be sequenced to various degrees of accuracy. Currently, most genomes are sequenced at a low level of precision. A genome is considered sequenced when all the SNPs and, for the best published genomes, the indels and structural variants have been identified in comparison to a reference genome. However, as the technology improves, we are moving toward a much more stringent definition of a completely sequenced genome: several projects are under way to provide the complete sequence of an individual, defined as the sequence of the two haploid genomes of maternal and paternal origin. This involves identifying and phasing all SNPs, indels (deletions, insertions, CNVs) and structural variations (inversions, transpositions, translocations), as well as identifying by de novo assembly all novel sequences not present in any existing reference genome.

At the current time, there are very few tools available to visualize all this information, particularly phased structural variations, insertions and novel sequences, because most existing browsers can only display one reference genome at a time. We are not aware of any tool that allows researchers to graphically compare multiple genomes, and particularly multiple epigenomes, at the allelic level. To fill this gap, we are in the process of adding this function to GenPlay.

The main function of GenPlay multi-genome (GenPlay-MG) is to allow users to load experimental tracks (RNA-seq, ChIP-seq, DNA methylation, TimEX, etc.) aligned on different genomes (either haploid genomes, in which all SNPs and variants are phased, or unphased diploid genomes). Because GenPlay-MG, like all former versions of GenPlay, loads most tracks entirely in RAM, it is capable of comparing and transforming all tracks loaded in a multi-genome session using the operations currently available in GenPlay. GenPlay-MG will also be able to export the results of such analyses in the coordinates of any of the loaded genomes, in several formats (bed, bgr, gff, sam, fasta, etc.).

These new capabilities are important because when new human genomes are sequenced and properly analyzed, new sequences not present in the reference genome are often discovered by de novo assembly, and because the junction fragments for all indels and SVs do not exist in the reference genome. Therefore, when the results of expression or epigenomic experiments are only aligned to a reference genome, much of the variation is lost simply because new sequences and junction fragments do not exist in the reference genome. In addition, many reads are misaligned because the reference differs from the tested sample. This can mask important biological differences.

Alignments of experimental results should ideally be done against the genome of the cells being studied, or, if that genome is not available, against multiple genomes. A major difficulty with this "ideal" approach is that the results of such alignments cannot be easily compared with standard annotation tracks (RefSeq, UCSC genes, AceView, etc.) or with the large amounts of data generated by large consortia (ENCODE, etc.), which are all based on NCBI36/hg18 or GRCh37/hg19.

GenPlay multi-genome allows users to perform this type of analysis by seamlessly performing, in real time, all the necessary conversions between the different coordinate systems and by graphically displaying all the tracks at the same time.

GenPlay multi-genome will also include novel filters and operations specific to multi-genome sessions. For instance, we will program functions to display, project, and perform correlation and other operations only on the variations that differ between two or more genomes. GenPlay will be able to display only variants of a particular type (SNP, indel, SV, etc.), variants that lie within particular features (promoters, gene bodies, CpGs, CpG islands, ChIP-seq peaks, etc.), variants that lie no more than a specific distance from a feature, and so on.
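The filters described above have not been programmed yet, so the following is only a minimal sketch of what a type-and-distance filter might look like. The function names, the `(position, kind)` variant tuples and the half-open `(start, end)` feature intervals are all assumptions made for illustration, not GenPlay data structures.

```python
from typing import Iterable, List, Optional, Set, Tuple

Variant = Tuple[int, str]   # (position, kind), e.g. (100, "SNP")
Feature = Tuple[int, int]   # half-open interval [start, end)


def distance(pos: int, feature: Feature) -> int:
    """Distance in bases from pos to the feature (0 if pos is inside it)."""
    start, end = feature
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - end + 1   # end is exclusive; last base is end - 1
    return 0


def filter_variants(
    variants: Iterable[Variant],
    features: List[Feature],
    kinds: Optional[Set[str]] = None,
    max_distance: int = 0,
) -> List[Variant]:
    """Keep variants of the requested kinds lying within max_distance
    bases of at least one feature (max_distance=0 means inside)."""
    kept = []
    for pos, kind in variants:
        if kinds is not None and kind not in kinds:
            continue
        if any(distance(pos, f) <= max_distance for f in features):
            kept.append((pos, kind))
    return kept


variants = [(100, "SNP"), (250, "indel"), (900, "SNP")]
promoters = [(200, 300)]
inside = filter_variants(variants, promoters)
nearby_snps = filter_variants(variants, promoters, kinds={"SNP"}, max_distance=100)
```

The linear scan over features keeps the sketch short; a real implementation working on whole chromosomes would more likely use sorted intervals and a binary search.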


Under the Hood


The general strategy for performing the conversions in real time is based on the creation of a meta-genome and of difference files (.diff files) when the VCF files are loaded. The meta-genome is the sum of all the loaded genomes: it is at least as large as any one of them because it contains the sequences of all of them. A .diff file contains all the differences between the meta-genome and one loaded genome. In a .diff file, the positions and sequences of all the variants in a genome are represented in two sets of coordinates. For instance, loading two VCF files representing two genomes (G1 and G2) mapped relative to GRCh37/hg19 leads GenPlay to create three .diff files: (hg19/meta), (G1/meta) and (G2/meta). Using the .diff files, GenPlay can convert the coordinates of any feature of any loaded genome into the coordinates of any other loaded genome. This procedure is similar to translating multiple languages into each other by first translating them all into the same meta-language.
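The two-step conversion through the meta-genome can be sketched as follows. This is a simplified model under stated assumptions, not GenPlay's actual .diff format: each genome is described by a sorted list of hypothetical `(meta_start, length)` gaps, meaning stretches of the meta-genome that are absent from that genome, and all coordinates are 0-based. SNPs do not shift coordinates and are ignored here.

```python
from typing import List, Optional, Tuple

Gap = Tuple[int, int]  # (meta_start, length): meta-genome bases absent from a genome


def genome_to_meta(pos: int, gaps: List[Gap]) -> int:
    """Convert a genome position to its meta-genome position."""
    meta = pos
    for start, length in gaps:          # gaps sorted by meta_start
        if start <= meta:
            meta += length              # skip over sequence this genome lacks
        else:
            break
    return meta


def meta_to_genome(pos: int, gaps: List[Gap]) -> Optional[int]:
    """Convert a meta-genome position to a genome position.
    Returns None if the position falls in sequence absent from the genome."""
    shift = 0
    for start, length in gaps:
        if pos < start:
            break
        if pos < start + length:
            return None
        shift += length
    return pos - shift


def convert(pos: int, src_gaps: List[Gap], dst_gaps: List[Gap]) -> Optional[int]:
    """Convert between two genomes by way of the meta-genome."""
    return meta_to_genome(genome_to_meta(pos, src_gaps), dst_gaps)


# Toy example: G1 carries a 3-base insertion before reference base 5,
# G2 carries a 2-base insertion before reference base 7.
# Meta-genome layout: ref 0-4 | G1 insertion (5-7) | ref 5-6 (8-9) |
#                     G2 insertion (10-11) | ref 7... (12...)
REF = [(5, 3), (10, 2)]   # the reference lacks both insertions
G1 = [(10, 2)]            # G1 lacks only the G2 insertion
G2 = [(5, 3)]             # G2 lacks only the G1 insertion

ref_7_in_g2 = convert(7, REF, G2)
g1_9_in_ref = convert(9, G1, REF)
g1_6_in_ref = convert(6, G1, REF)   # inside the G1 insertion
```

In this toy example, reference base 7 lands at position 9 in G2 (shifted by the 2-base insertion), position 9 of G1 maps back to reference base 6, and a position inside the G1 insertion has no counterpart in the reference. Because every genome is only ever compared with the meta-genome, adding a new genome requires one new gap list rather than one mapping per pair of genomes.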


Beta-Version


In the current beta version, the concept of the meta-genome has been implemented. It is currently possible to load several VCF files as well as tracks aligned on any of the loaded genomes. None of the multi-genome-specific operations have been programmed yet, but most of the existing operations seem to work fine.

As it stands, GenPlay multi-genome (beta) can be used, for instance, to visualize some of the results of the 1000 Genomes Project. It can also be used to load at the same time data tracks aligned on NCBI36 (hg18) and on GRCh37 (hg19).

Several VCF files from the multi-genome project can be found in the GenPlay library. All of these files are in reference to GRCh37/hg19.
We have also created a VCF file that contains all the differences between NCBI36/hg18 and GRCh37/hg19 for chromosome 1. Once this file is loaded, tracks aligned on both assemblies can be loaded in GenPlay at the same time. It is pretty cool! Try it. Unfortunately, creating an accurate genome-wide VCF file between hg18 and hg19 is quite cumbersome with the data available. We are working on producing such a file.

Comments and suggestions are welcome, as are contributions. If anyone knows a simple method to create an accurate VCF file comparing hg18 and hg19, please let us know.

We are working on a preliminary doc for GenPlay-MG (beta).

Files

Coming Soon

Web Start

Coming Soon