Difference between revisions of "GenPlay Multi-Genome"

From GenPlay, Einstein Genome Analyzer

Revision as of 16:01, 8 June 2011

1 The GenPlay Multi-Genome Project
- 1.1 Introduction
- 1.2 Description
2 Beta-Version
- 2.1 Files
- 2.2 Web Start

The GenPlay Multi-Genome Project

Introduction

GenPlay Multi-Genome is a project under development that provides a unique platform to analyze personal genome and epigenome information. Because the number of individual genomes sequenced is expected to increase exponentially in the next few years, we expect this feature of GenPlay to become extremely useful for analyzing multiple epigenomes in parallel. Presently, there are no applications that biologists, post-docs and students can use to compare human epigenomes in real time. Software such as Vista and D-code are very useful tools that offer the capability to compare the structure of genomes across species, but they do not allow loading of multiple ChIP-Seq or RNA-Seq tracks aligned on different genomes. The meta-genome concept that we have developed is new and powerful. Although all genomes are represented by difference files that are relative to a reference genome, the creation of a meta-genome that includes insertions allows for comparison of any two (or more) genomes without any need to use a reference genome.

Description

The price of sequencing personal genomes is rapidly decreasing and will likely fall below $1,000 in the next few years. Genomes can be sequenced to varying degrees of accuracy. At a low level of precision, sequencing only identifies SNPs relative to a reference genome. As the technology improves, however, the complete sequence of an individual will in fact be the sequence of two haploid genomes, including complete phasing of all SNPs, identification of indels (defined as any deletion or insertion relative to the reference sequence, therefore including CNVs) and structural variations, as well as novel sequences not present in any existing reference genome.

At the current time, the tools available to visualize all this information are very primitive and totally inadequate, particularly in the case of structural variations, insertions and novel sequences, which cannot be compared to existing genomes. We are not aware of any tool that allows researchers to graphically compare multiple genomes, and particularly multiple epigenomes, at the allelic level. Because of this lack of existing tools, we are currently adding this function to GenPlay, since we need it to complete one of the major projects in the lab.

This project is briefly described below to illustrate our approach and the need for such functionality. It involves sequencing a family of four (two parents and their two children) in order to produce a complete genome sequence as defined above. It then exploits the information from the new genome sequence to study expression, reprogramming and imprinting in iPS cells at the allelic level, genome-wide.

The genomes of the four family members have been sequenced on the HiSeq 2000 at a depth of about 20x each and assembled using a variety of tools. The assembly strategy is a combination of family sequencing for phasing and error correction (6), of the general approach reviewed in (7) to assemble the genome and find the indels, CNVs and structural mapping, and of de novo assembly to find novel sequences as described in (8). The details of this strategy are irrelevant to this application since we focus here on the tools that allow the comparison of the four assembled genomes. In Aim 2 we propose to incorporate these tools within GenPlay.

Sequencing a family of four yields eight haploid genomes (four independent genomes from the two parents, and four genomes that are recombinant versions of the parental ones), since all variants are phased. The eight haploid genomes are assembled using hg19 as a frame of reference. We have designed a simple file format, termed the variant file (.var file), that contains all the differences between a haploid genome and hg19 (Figure 3). Var files contain information about SNPs, deletions, and insertions. At the current time, inversions and replacements are treated as a combination of insertions and deletions. The format can also be used for novel sequences: if the novel sequences can be attached to existing sequences in hg19, they simply become insertions; otherwise they are placed in orphan sequence scaffolds.
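As an illustration of the kind of information a var file carries, the sketch below models one variant as a small record and shows how a replacement can be expressed as a deletion plus an insertion. The field names (chrom, ref_pos, kind, length, sequence) and the function replacement_as_indels are hypothetical and chosen only for this example; the actual .var layout is the one shown in Figure 3 and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    """One difference between a haploid genome and the reference (hypothetical layout)."""
    chrom: str          # reference chromosome, e.g. "chr1"
    ref_pos: int        # position on the reference genome
    kind: str           # "SNP", "DEL" or "INS"
    length: int         # number of bases affected
    sequence: str = ""  # new bases for SNP/INS, empty for DEL


def replacement_as_indels(chrom: str, ref_pos: int, ref_length: int,
                          new_sequence: str) -> List[Variant]:
    """Express a replacement as a deletion of the reference bases plus an insertion.

    Anchoring both records at the same reference position is an assumption
    made for this sketch.
    """
    return [
        Variant(chrom, ref_pos, "DEL", ref_length),
        Variant(chrom, ref_pos, "INS", len(new_sequence), new_sequence),
    ]


if __name__ == "__main__":
    # Three reference bases replaced by five new bases.
    for variant in replacement_as_indels("chr1", 100, 3, "ACGTA"):
        print(variant)
```

An inversion could be handled the same way, with the reverse-complemented bases supplied as the inserted sequence.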

Each family member’s genome is composed of two haploid genomes termed “a” and “b”. For instance, individual M has two haploid genomes, Ma and Mb, each composed of 23 chromosomes (Ma = chr1a, chr2a etc.; Mb = chr1b, chr2b etc.). After phasing of all the variants, the two haploid genomes are created by randomly assigning each chromosome to one of the two genomes of each individual. In the case of the children we use the same notation but additionally keep track of the parent of origin of each chromosome.

The strategy to compare and visualize multiple genomes is straightforward and greatly simplified by the use of the var files and of the reference genome (currently hg19). The main idea is to use the var files to construct a meta-genome that is the sum of all the haploid genomes to be visualized at the same time, reference genome included. The meta-genome contains all of the insertions in all the genomes and is, therefore, longer than any of the existing genomes. This meta-genome is the cornerstone of our approach to multi-genome visualization. It is used as the overall coordinate system in GenPlay, in place of the reference genome that we use in single-genome analysis.
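The following minimal sketch shows why a meta-genome built this way is longer than any individual genome. It rests on assumptions that are not stated in the text: an insertion is recorded as a pair (reference position, inserted length) meaning "inserted after that reference base", and insertions from different genomes at the same reference position share the same meta-genome space, so only the widest one reserves room. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Insertion = Tuple[int, int]  # (reference position, inserted length)


def meta_insertions(genomes: Dict[str, List[Insertion]]) -> Dict[int, int]:
    """For each reference position, keep the widest insertion seen in any genome."""
    widest: Dict[int, int] = {}
    for insertions in genomes.values():
        for ref_pos, length in insertions:
            widest[ref_pos] = max(widest.get(ref_pos, 0), length)
    return widest


def meta_length(reference_length: int, genomes: Dict[str, List[Insertion]]) -> int:
    """Reference length plus the space reserved for every insertion."""
    return reference_length + sum(meta_insertions(genomes).values())


def ref_to_meta(ref_pos: int, widest: Dict[int, int]) -> int:
    """Shift a reference position by the insertion space located before it."""
    return ref_pos + sum(length for pos, length in widest.items() if pos < ref_pos)


if __name__ == "__main__":
    genomes = {
        "G1": [(100, 5), (500, 2)],
        "G2": [(100, 3), (700, 10)],
    }
    widest = meta_insertions(genomes)
    print(meta_length(1000, genomes))  # 1000 + 5 + 2 + 10
    print(ref_to_meta(101, widest))    # shifted by the 5 bases reserved after position 100
```

Deletions do not lengthen the meta-genome in this model; they only leave a gap in the genome that carries them.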

During the construction of the meta-genome, difference files (.dif files) are created for each haploid genome. Dif files contain the positions and sequences of all the variants in a haploid genome in two sets of coordinates: those of the genome itself and those of the meta-genome. For instance, loading two var files representing two genomes (G1 and G2) mapped relative to hg19 would lead GenPlay to create three dif files: (hg19/meta), (G1/meta) and (G2/meta) (see Figure 3). Once the dif files are created, they can be used to compare and visualize multiple genomes. This procedure is similar to translating multiple languages into each other by first translating them all into a common meta-language.
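The "common meta-language" idea can be sketched as a two-step coordinate translation: genome to meta-genome, then meta-genome to the other genome. The sketch below is not the dif file format. It assumes, for illustration only, that each genome/meta pairing is summarized as a sorted list of gap-free blocks (genome_start, meta_start, length), and all names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Block = Tuple[int, int, int]  # (genome_start, meta_start, length) of a gap-free stretch


def to_meta(blocks: List[Block], genome_pos: int) -> Optional[int]:
    """Map a genome position to meta-genome coordinates, or None if unmapped."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, genome_pos) - 1
    if i < 0:
        return None
    g, m, n = blocks[i]
    return m + (genome_pos - g) if genome_pos < g + n else None


def from_meta(blocks: List[Block], meta_pos: int) -> Optional[int]:
    """Map a meta-genome position back to a genome, or None if it falls in a gap."""
    starts = [b[1] for b in blocks]
    i = bisect_right(starts, meta_pos) - 1
    if i < 0:
        return None
    g, m, n = blocks[i]
    return g + (meta_pos - m) if meta_pos < m + n else None


def translate(src: List[Block], dst: List[Block], pos: int) -> Optional[int]:
    """Translate a position between two genomes by going through the meta-genome."""
    meta = to_meta(src, pos)
    return None if meta is None else from_meta(dst, meta)


if __name__ == "__main__":
    # G1 carries a 5-base insertion after position 10; G2 does not.
    g1 = [(1, 1, 25)]
    g2 = [(1, 1, 10), (11, 16, 10)]
    print(translate(g1, g2, 5))   # shared sequence before the insertion
    print(translate(g1, g2, 12))  # inside the G1 insertion: no counterpart in G2
    print(translate(g2, g1, 11))  # first base after the insertion site
```

Because every genome is mapped only to the meta-genome, comparing N genomes needs N mappings rather than one mapping per pair of genomes, and the reference genome is treated like any other genome.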

Beta-Version

Files

Web Start

