Difference between revisions of "GenPlay Multi-Genome"

From GenPlay, Einstein Genome Analyzer

Latest revision as of 15:46, 13 September 2011

1 The GenPlay Multi-Genome Project
- 1.1 Introduction
- 1.2 Description
2 Beta-Version

The GenPlay Multi-Genome Project

Introduction

GenPlay Multi-Genome (GenPlay-MG) is a project under development that will provide a novel platform to analyze personal genome and epigenome information. Presently, very few applications take advantage of the increased availability of information on how genome structure varies between humans. A few tools exist to compare the structure of genomes across species, but none can be used to study multiple ChIP-Seq tracks or RNA-Seq tracks aligned on different genomes.

To allow users to perform such analyses, we are writing GenPlay-MG, a new version of GenPlay in which VCF files summarizing the differences between human genomes can be loaded. After loading the VCF files, GenPlay creates a meta-genome that is the sum of all the genomes to be analyzed in the multi-genome session. Once the meta-genome is generated, GenPlay users can load tracks aligned on any genome, perform all the operations available in the software, and export the data. SNPs, deletions, insertions and structural variants appear as stripes. GenPlay-MG can be used to visualize results of the 1000 Genomes Project, convert between assemblies, etc.

Description

The cost of sequencing personal genomes is rapidly decreasing and will likely fall below $1,000 in the next few years. Genomes can be sequenced to various degrees of accuracy. Currently, most genomes are sequenced at a low level of precision. A genome is considered sequenced when all the SNPs and (for the best published genomes) the indels and structural variants have been identified in comparison to a reference genome. However, as the technology improves, we are moving toward a much more stringent definition of a completely sequenced genome. Several projects are under way to provide the complete sequence of an individual, defined as the sequence of the two haploid genomes of maternal and paternal origin. This involves identifying and phasing all SNPs, indels (deletions, insertions, CNVs) and structural variations (inversions, transpositions, translocations), as well as all novel sequences not present in any existing reference genome, by de novo assembly.

At the current time, very few tools are available to visualize all this information, particularly phased structural variations, insertions and novel sequences, because most existing browsers can only display one reference genome at a time. We are not aware of any tool that allows researchers to graphically compare multiple genomes, and particularly multiple epigenomes, at the allelic level. To fill this gap, we are in the process of adding this function to GenPlay.

The main function of GenPlay multi-genome (GenPlay-MG) is to allow users to load experimental tracks (RNA-seq, ChIP-seq, DNA methylation, TimEX, etc.) aligned on different genomes (either haploid genomes, in which all SNPs and variants are phased, or unphased diploid genomes). Because GenPlay-MG, like all former versions of GenPlay, loads most tracks entirely in RAM, it is capable of comparing and transforming all tracks loaded in a multi-genome session using the operations currently available in GenPlay. GenPlay-MG will also be able to export the results of such analyses in the coordinates of any of the loaded genomes, in several formats (bed, bgr, gff, sam, fasta, etc.).

These new capabilities are important because, when new human genomes are sequenced and properly analyzed, new sequences not present in the reference genome are often discovered by de novo assembly, and because the junction fragments for all indels and SVs do not exist in the reference genome. Therefore, when the results of expression or epigenomic experiments are only aligned to a reference genome, much of the variation is lost simply because new sequences and junction fragments do not exist in the reference genome. In addition, many reads are misaligned because the reference differs from the tested sample. This can mask important biological differences.

Alignments of experimental results should ideally be done against the genome of the cells being studied or, if that genome is not available, against multiple genomes. A major difficulty with this "ideal" approach is that the results of such alignments cannot be easily compared with standard annotation tracks (RefSeq, UCSC genes, AceView, etc.) or with the large amounts of data generated by large consortia (ENCODE, etc.), which are all based on NCBI36/hg18 or GRCh37/hg19. GenPlay multi-genome allows users to perform this type of analysis by seamlessly performing, in real time, all the necessary conversions between the different coordinate systems and graphically displaying all the tracks at the same time.

GenPlay multi-genome will also include novel filters and operations specific to multi-genome sessions. For instance, we will program functions to display, project and perform correlation and other operations only on the variations that differ between two or more genomes. GenPlay will be able to display only variants of a particular type (SNP, indel, SV, etc.), variants that are within a particular feature (promoters, gene bodies, CpG, CpG islands, ChIP-seq peaks, etc.), variants that are no more than a specific distance from a feature, etc.
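
As an illustration of the last kind of filter, the following Python sketch keeps only the variants that lie within a given distance of a feature. It is not GenPlay-MG code (these filters are not yet programmed); the function name, the half-open (start, end) feature intervals and the sample positions are assumptions made for the example.

```python
def variants_near_features(variant_positions, features, max_distance):
    """Keep variant positions within max_distance bases of any feature.

    features is a list of (start, end) intervals, end excluded
    (a convention assumed for this sketch).
    """
    kept = []
    for pos in variant_positions:
        for start, end in features:
            if pos < start:
                distance = start - pos        # variant lies before the feature
            elif pos >= end:
                distance = pos - (end - 1)    # variant lies after the last base
            else:
                distance = 0                  # variant lies inside the feature
            if distance <= max_distance:
                kept.append(pos)
                break                         # one nearby feature is enough
    return kept


# Hypothetical data: one feature covering positions 100..199.
peaks = [(100, 200)]
variants = [50, 95, 150, 204, 300]
near = variants_near_features(variants, peaks, max_distance=5)
```

Setting max_distance to 0 reduces the filter to "variants that are within a feature".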

UNDER THE HOOD

The general strategy to perform the conversions in real time is based on the creation of a meta-genome and of difference files (.diff files) when the VCF files are loaded. The meta-genome is the sum of all the loaded genomes: it is bigger than any of them because it contains the sequences of all of them. A .diff file contains all the differences between the meta-genome and a loaded genome. In a .diff file, the positions and sequences of all the variants in a genome are represented in two sets of coordinates. For instance, loading two VCF files representing two genomes (G1 and G2) mapped relative to GRCh37/hg19 leads to the creation by GenPlay of three .diff files: (hg19/meta), (G1/meta) and (G2/meta). Using the .diff files, GenPlay can convert the coordinates of any feature of any loaded genome into those of any other genome. This procedure is similar to translating multiple languages into each other by first translating them all into the same meta-language.
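
The pivot idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual .diff file format or GenPlay code: each genome is described by hypothetical blocks of consecutive positions that pair a genome coordinate with a meta-genome coordinate, positions are 0-based, and the toy genomes (a 10-base reference, G1 with a 3-base insertion, G2 with a 2-base deletion) are invented for the example.

```python
from bisect import bisect_right
from typing import NamedTuple


class Block(NamedTuple):
    genome_start: int  # first position of the block in the genome
    meta_start: int    # same position expressed in meta-genome coordinates
    length: int        # number of consecutive bases shared by both


# Toy "diff" tables (assumed layout). The meta-genome has 13 positions:
# meta 0-4 = reference 0-4, meta 5-7 = insertion present only in G1,
# meta 8-12 = reference 5-9. G2 lacks reference bases 7-8 (meta 10-11).
DIFFS = {
    "hg19": [Block(0, 0, 5), Block(5, 8, 5)],
    "G1": [Block(0, 0, 13)],
    "G2": [Block(0, 0, 5), Block(5, 8, 2), Block(7, 12, 1)],
}


def to_meta(blocks, pos):
    """Genome coordinate -> meta-genome coordinate (None if out of range)."""
    starts = [b.genome_start for b in blocks]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    offset = pos - blocks[i].genome_start
    return blocks[i].meta_start + offset if offset < blocks[i].length else None


def from_meta(blocks, meta_pos):
    """Meta-genome coordinate -> genome coordinate (None if absent there)."""
    starts = [b.meta_start for b in blocks]
    i = bisect_right(starts, meta_pos) - 1
    if i < 0:
        return None
    offset = meta_pos - blocks[i].meta_start
    return blocks[i].genome_start + offset if offset < blocks[i].length else None


def convert(source, pos, target):
    """Convert a position between two genomes through the meta-genome."""
    meta_pos = to_meta(DIFFS[source], pos)
    return None if meta_pos is None else from_meta(DIFFS[target], meta_pos)
```

With these tables, convert("G1", 9, "hg19") goes through meta position 9 and lands on reference position 6, while convert("G1", 6, "hg19") returns None because position 6 of G1 lies inside the insertion, which has no counterpart in the reference. Adding a genome only requires one more table against the meta-genome, not one table per pair of genomes.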

Beta-Version

In the current beta version the concept of the meta-genome has been implemented. It is currently possible to load several VCF files as well as tracks aligned in any of the loaded genomes. None of the multi-genome specific operations have been programmed but most of the existing operations seem to work fine.

As it stands, GenPlay Multi-Genome (beta) can be used, for instance, to visualize some of the results of the 1000 Genomes Project. It can also be used to load, at the same time, data tracks aligned on either NCBI36 (hg18) or GRCh37 (hg19).

Several VCF files from the multi-genome project can be found in the GenPlay library. All of these files are relative to the GRCh37/hg19 reference genome.
We have also created a VCF file that contains all the differences between NCBI36/hg18 and GRCh37/hg19. Once this file is loaded, tracks aligned on both assemblies can be loaded in GenPlay at the same time. It is pretty cool! Try it. Creating an accurate genome-wide VCF file between hg18 and hg19 is quite cumbersome with the data available, and there might be a few bugs in the current version of our hg18-to-hg19 VCF. We are working on producing a better file.

Comments, suggestions and contributions are welcome!

We are working on a preliminary doc for GenPlay-MG (beta).


@@ Line 40: / Line 40: @@
 Comments and suggestions are welcome! as are contributions.
 We are working on a preliminary doc for GenPlay-MG (beta).
-=== Files ===
-Coming Soon
-=== Web Start ===
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_256.jnlp GenPlay (256 MB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_512.jnlp GenPlay (512 MB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_768.jnlp GenPlay (768 MB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_1024.jnlp GenPlay (1 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_1536.jnlp GenPlay (1.5 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_2048.jnlp GenPlay (2 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_4096.jnlp GenPlay (4 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_8192.jnlp GenPlay (8 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_12288.jnlp GenPlay (12 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_16384.jnlp GenPlay (16 GB)]
-*[http://www.genplay.net/GenPlayMG/GenPlayWS/GenPlay_24576.jnlp GenPlay (24 GB)]
-<br/><br/>
-== Tutorials ==
-<br/>
-=== Getting started ===
-==== Introduction ====
-To create a multi-genome session, users must first load VCF files. VCF files are files that describe all the differences between a reference genome and a particular genome.
-The first part of this tutorial presents how to set up a multi-genome project, especially how to load VCF files.
-The second part concerns loading data tracks. To be loaded, a data track must be mapped to one of the loaded genomes.
-Finally, we will see how to highlight information from VCF files.
-==== The Welcome screen ====
-The welcome screen is the first screen of GenPlay-MG and allows users to create or load a project.
-===== New Project =====
-In order to create a new project, users must give it a name as shown in Figure 1.
-[[image:mg_basics_project name.png|center|frame|Figure 1: Text field to define the project name]]
-<br/>
-The second step is to choose a reference genome. Users can select it with the clade, genome and assembly lists (Figure 2).
-[[image:mg_basics_assembly_chooser.png|center|frame|Figure 2: Assembly chooser]]
-<br/>
-Several chromosomes are available for each assembly but users can choose to select only some of them.<br/>
-To open the chromosome chooser (Figure 3), users have to click on the button next to the assembly name.
-[[image:mg_basics_chromosome_chooser.png|center|frame|Figure 3: Chromosome chooser]]
-<br/>
-The third and last step is to choose between a ''Simple Genome Project'' and a ''Multi Genome Project''. This tutorial covers multi-genome projects. After this option is checked, the welcome screen should look like the one shown in Figure 4.
-[[image:mg_basics_empty_welcome_screen.png|center|frame|Figure 4: Empty welcome screen for multi-genome project]]
-<br/>
-===== Load Project =====
-Coming soon<br/>
-(Option unavailable in the beta version)
-==== VCF Files ====
-===== Description =====
-VCF files describe differences between two genomes, usually between a genome of interest and the reference genome used for the mapping process. Three kinds of VCF file can be distinguished according to the type of difference they describe:
-* Indels
-* SNPs
-* SV (Structural Variation)
-<br/>
-A complete description of VCF files is given on the 1000 genomes project website:<br/>
-[http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF (Variant Call Format)]<br/>
-[http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20%28Variant%20Call%20Format%29%20version%204.0/encoding-structural-variants Encoding Structural Variants in VCF]
-===== Tabix =====
-====== Introduction ======
-VCF files can contain a lot of information, which makes scanning them slow.<br/>
-To speed up scanning, VCF files are indexed with Tabix.<br/>
-[http://samtools.sourceforge.net/tabix.shtml Tabix manual reference pages]<br/>
-[http://sourceforge.net/projects/samtools/files/tabix/ Tabix download]
-====== VCF files indexation method ======
-Each VCF file must first be compressed to the BGZF format (.gz file). Tabix provides a tool (bgzip) to perform that compression.
-After compression, the VCF file must be indexed using the tabix command.
-Once Tabix is installed, these two commands are all that is needed to index a file.
-<br/><br/>
-Available commands from the Tabix folder:<br/>
-''bgzip -f VCF_PATH;''<br/>
-''tabix -p vcf VCF_PATH.gz;''
-<br/><br/>
-For example, a VCF file named my_vcf.vcf located in the same folder as Tabix can be indexed with the following commands (Figure 5):<br/>
-''bgzip -f ./my_vcf.vcf;''<br/>
-''tabix -p vcf ./my_vcf.vcf.gz;''
-[[image:mg_basics_indexation_commands.png|center|frame|Figure 5: VCF file indexation command]]
-<br/><br/>
-'''Note:''' the first command '''replaces''' the current VCF file with the compressed VCF file (.vcf.gz). The second command '''creates''' the index file (.vcf.gz.tbi) alongside the compressed file.<br/>
-More options are available on [http://samtools.sourceforge.net/tabix.shtml Tabix manual reference pages].
-==== VCF files loading ====
-===== The VCF Loader =====
-====== Introduction ======
-The VCF Loader is the most important part of the multi-genome project settings. It allows users to load all the VCF files and to define how information is extracted from them. It appears when users click on the "Edit" button on the welcome screen.<br/>
-Figure 6 shows an empty VCF Loader screen.
-[[image:mg_basics_empty_vcf_loader.png|center|frame|Figure 6: VCF loader]]
-<br/>
-To understand this process, users should know that a VCF file can contain information pertaining to one type of variant (SNP, indel or SV) for multiple individuals.<br/>
-To load the complete genome of one or several individuals, GenPlay-MG must therefore open several VCF files and organize them by genome and by group of genomes. In this example, a group of genomes represents a family, and a genome represents a member of the family. Each row in the table represents a genome and a VCF file (and therefore a type of variant).<br/>
-If a project (a session) involves 2 families with 3 individuals each, and if the user has the 3 different types of VCF file (indels, SNPs, SV) for each individual, 18 rows have to be filled (2 families x 3 individuals x 3 VCF types). Users who want to compare individuals from the same family still have to fill in the group cells (with the same name).<br/>
-'''Note:''' GenPlay-MG does not use the VCF file directly; it uses a compressed version of it (.gz). GenPlay-MG also needs the compressed VCF file to be indexed with Tabix, which speeds up scans. The compressed file and its index must be in the '''same folder''' and must have the '''same name'''; only the file extensions differ (.gz and .gz.tbi).
-The user can add or remove rows.<br/>
-When the user is done, the "ok" button validates the settings, whereas the "cancel" button or closing the window discards every modification.
-====== Columns description ======
-'''''Group'''''<br/>
-Users can gather genomes into groups. Groups are used to distinguish where genomes come from and to perform some group-specific functions.<br/><br/>
-'''''Genome'''''<br/>
-The ''Genome'' column allows users to associate another name with the selected genome. This new name appears in GenPlay-MG to help users recognize genomes easily.<br/><br/>
-'''''Type'''''<br/>
-The ''Type'' column contains the different types of VCF files. At the current time GenPlay-MG recognizes three types of VCF files:<br/>
-* Indels
-* SNPs
-* SV (Structural Variations)<br/><br/>
-'''''File'''''<br/>
-This column refers to the VCF file path. Once a file is selected, the ''Raw name(s)'' column is automatically filled with every raw genome name contained in that VCF file.<br/><br/>
-'''''Raw name(s)'''''<br/>
-The ''Raw name(s)'' column is automatically filled when a VCF file has been chosen. The list contains every genotype header found in the selected VCF file. Because these genome names might be difficult to remember, GenPlay-MG offers users the option of adding another name using the ''Genome'' column.<br/>
-====== Columns edition ======
-The ''Group'', ''Genome'' and ''File'' columns each have their own editable list. To edit a column, users go to ''Column list edition'' at the bottom left of the VCF Loader (Figure 7).
-[[image:mg_basics_edition_section.png|center|frame|Figure 7: Edition section]]
-After a click on the ''Edit'' button, a new window appears (as shown in Figure 8 for the group column) with a plus and a minus button to add or remove elements from the list. The plus button opens a new dialog in which users can enter a short text.<br/>
-Regarding the ''File'' column, the plus button opens a file chooser.
-[[image:mg_basics_empty_list_manager.png|center|frame|Figure 8: Group list manager]]
-<br/>
-That way, users can set up all columns before (or while) filling in the table.<br/>
-'''Note: ''' The ''Type'' column contains static values and is not editable. The ''Raw name(s)'' column is automatically filled with the genome names from the selected VCF file and cannot be edited manually.
-===== Import/Export =====
-Once a project has been set up, its settings can be saved using the import/export function. Pressing the export button saves an XML file to the hard drive. This XML file can then be imported to reload the settings in another project.
-The XML file structure is simple. Each row is stored in a ''row'' tag whose attributes include ''group'', ''genome'', ''type'', ''file'' and ''raw_name''. The settings file is formatted as shown in Figure 9.
-[[image:mg_basics_xml_settings.png|center|frame|Figure 9: XML file settings]]
-<br/>
-'''Note:''' If the user moves the VCF files or changes one of their genotype headers, the XML file will no longer work. The user then has to update the ''file'' and/or ''raw_name'' attribute values.<br/>
-==== Tracks loading ====
-Once a multi-genome project has been created, GenPlay creates a meta-genome that is the sum of all the loaded genomes and is capable of converting the coordinates of any data file into meta-genome coordinates. GenPlay can therefore load data files (tracks) mapped in the coordinates of any of the loaded genomes. Of course, when loading a file, the user must specify which genome was used for the mapping.
-When loading a track, GenPlay displays a list of every loaded genome below the file selection window. Once the information is entered, GenPlay transforms the coordinates in the file into meta-genome coordinates using the difference information of the specified genome.
-==== Displaying information ====
-Each track has its own settings window for displaying variant information related to genome differences. When the user right-clicks on the track handler (on the left of the track) of an empty track and clicks on ''Multi Genome Stripes'', the window below appears.
-[[image:mg_basics_unset_mg_selector.png|center|frame|Figure 10: Multi-genome stripes selection on an empty track]]
-<br/>
-In this example, there are three families with one member each. Genomes have been mapped on the reference genome GRCh37/hg19. The selected track can show information such as insertions, deletions, SNPs and structural variants. The user can define the colors (by clicking on the colored squares) and set the stripe transparency using the slider.
-If a data file concerning the member of the first family has been loaded, the window looks like the one in Figure 11.
-[[image:mg_basics_set_mg_selector.png|center|frame|Figure 11: Multi-genome stripes selection]]
-<br/>
-First, all stripes related to variants of the selected genome appear (in this case the insertion and the deletion for person 1, family 1). Second, all insertions in the genome of family 1 are shown as black stripes on all other genomes. Black stripes are synchronization markers that have been introduced in the meta-genome so that multiple genomes can be displayed at the same time; they represent insertions in genomes other than the current genome. The user can modify the stripe visualization using this panel.
-=== Conversion between NCBI36/hg18 and GRCh37/hg19 ===
-==== Description ====
-This tutorial explains how to display, at the same time, tracks mapped on genome assembly NCBI36/hg18 and tracks mapped on GRCh37/hg19. In the example, the user will be able to see all the modifications to the NCBI36/hg18 genome that lead to the GRCh37/hg19 reference genome.
-==== Files ====
-*[http://www.genplay.net/GenPlayMG/library/hg18tohg19_tutorial_settings.xml XML settings file]
-*[http://www.genplay.net/GenPlayMG/library/hg18tohg19_tutorial_sv.vcf.gz VCF file]
-*[http://www.genplay.net/GenPlayMG/library/hg18tohg19_tutorial_sv.vcf.gz.tbi Indexed VCF file (Tabix)]
-*[http://www.genplay.net/GenPlayMG/library/RefSeq_From_UCSC_04-23-10(hg19).bed Genes BED file for GRCh37/hg19]
-*[http://www.genplay.net/GenPlayMG/library/RefSeq_From_UCSC_04-23-10(hg18).bed Genes BED file for NCBI36/hg18]
-==== Steps ====
-===== Project settings =====
-====== Project name ======
-The user must choose a name for the new project; here the name is ''GenPlay-MG – Reference genome tutorial'' (Figure 1).
-[[image:mg_hg18tohg19_project_name.png|center|frame|Figure 1: Project name]]
-====== Project assembly ======
-In line with the BED files provided in this tutorial, the reference genome is GRCh37/hg19. The user has to select the ''mammal'' clade, the ''human'' genome and the ''Feb 2009 (GRCh37/hg19)'' assembly, as in Figure 2.
-[[image:mg_hg18tohg19_project_assembly.png|center|frame|Figure 2: Project assembly]]
-====== Chromosome selection ======
-The VCF file describes structural variants and contains information for chromosomes 1 to 22 and chromosomes X and Y. The user can select only the chromosomes of interest by clicking on the settings button next to the assembly name (Figure 3).
-[[image:mg_hg18tohg19_chromosome_chooser.png|center|frame|Figure 3: Chromosome chooser]]
-====== VCF Loading ======
-'''''Manually'''''<br/>
-To configure this tutorial manually, the user has to fill in the column lists by hand. The VCF Loader appears after a click on the ''Edit'' button on the welcome screen. The bottom left part of the VCF Loader contains the ''Column list edition'' section. The user has to select a column and click on the ''Edit'' button to show the associated list.
-Only one VCF file is going to be loaded for this tutorial. The VCF file contains differences between the reference genome NCBI36/hg18 and the reference genome GRCh37/hg19.
-<br/><br/>
-''Group'' column<br/>
-This tutorial compares reference genomes, so a generic group name such as ''Reference genome'' can be used.
-In the ''Group name list editor'', the user clicks on the plus button to show the text input box and fills it in (Figure 4).<br/>
-''value:'' ''' Reference genome'''
-[[image:mg_hg18tohg19_group_input.png|center|frame|Figure 4: Group name input dialog]]
-The ''Group name list editor'' should look like Figure 5 below:
-[[image:mg_hg18tohg19_group_editor.png|center|frame|Figure 5: Group name editor]]
-<br/><br/>
-''Genome'' column<br/>
-The genome name is a second name for the selected raw name. In this tutorial, the genome name is going to be '''Hg18'''.
-In the ''Genome name list editor'', the user clicks on the plus button to show the text input box and fills it in (Figure 6).<br/>
-''value:'' '''Hg18'''
-[[image:mg_hg18tohg19_genome_input.png|center|frame|Figure 6: Genome name input dialog]]
-The ''Genome name list editor'' should look like Figure 7 below:
-[[image:mg_hg18tohg19_genome_editor.png|center|frame|Figure 7: Genome name editor]]
-<br/><br/>
-''Type'' column<br/>
-The list of values in this field cannot be edited by users. The provided VCF file is of the Structural Variant type, so the user has to choose '''SV''' (Figure 8).<br/>
-''value:'' '''SV'''
-[[image:mg_hg18tohg19_type.png|center|frame|Figure 8: VCF type list]]
-<br/><br/>
-''File'' column<br/>
-Once the VCF file is downloaded, the user opens the ''File list editor'', clicks on the plus button to show the file chooser dialog, and chooses the VCF file at its location.<br/>
-''value:'' '''VCF path'''
-[[image:mg_hg18tohg19_file_editor.png|center|frame|Figure 9: VCF File editor]]
-<br/><br/>
-''Raw name(s)'' column<br/>
-The raw name list is filled automatically; there is only one genome: '''NCBI36''' (Figure 10).<br/>
-''value:'' '''NCBI36'''
-[[image:mg_hg18tohg19_raw_name.png|center|frame|Figure 10: Raw name list]]
-<br/><br/>
-'''''Import XML settings'''''<br/>
-To set up the project with ease, the user can import the settings using the XML file above. Be careful about the VCF path: it must be changed directly in the XML file before the import function is used.
-<br/>
-'''''Conclusion'''''<br/>
-Finally, the screen should look like the one in Figure 11.
-[[image:mg_hg18tohg19_vcf_loader.png|center|frame|Figure 11: VCF loader]]
-====== Conclusion ======
-The welcome screen should finally be similar to Figure 12.
-[[image:mg_hg18tohg19_welcome_screen.png|center|frame|Figure 12: Welcome screen]]
-The "Create" button creates the project and runs the synchronization.
-===== GRCh37/hg19 genes loading =====
-To load a file, the user right-clicks on the left part of the track and chooses "Load Gene Track". A file chooser appears to select the file given in this tutorial. After the BED file is chosen, a new selection box appears (Figure 13).
-[[image:mg_hg18tohg19_genome_selector_01.png|center|frame|Figure 13: Genome selection dialog for GRCh37/hg19 genes file]]
-This box asks which genome the BED file relates to. Here, the user has to choose the "Feb 2009 (GRCh37/hg19)" option because the BED file contains information about that genome.
-The gene file for the GRCh37/hg19 reference has now been loaded.
-===== NCBI36/hg18 genes loading =====
-The operation is the same as loading the gene file for the GRCh37/hg19 reference genome. The only difference is to choose the "Reference genome - hg18 (NCBI36)" option after the BED file selection (Figure 14).
-[[image:mg_hg18tohg19_genome_selector_02.png|center|frame|Figure 14: Genome selection dialog for NCBI36/hg18 genes file]]
-===== Conclusion =====
-The user can navigate through the different chromosomes and visualize the differences between both genomes using the stripes. All genes are perfectly synchronized and are displayed according to the meta-genome coordinates.
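
The indexing step from the Tabix section of the removed tutorial can also be scripted. The Python sketch below only wraps the two commands given there (bgzip -f and tabix -p vcf); the function names are hypothetical, and bgzip and tabix are assumed to be installed and available on the PATH.

```python
import subprocess
from pathlib import Path


def indexing_commands(vcf_path):
    """Return the two commands that compress and then index a VCF file."""
    vcf = Path(vcf_path)
    compressed = vcf.with_name(vcf.name + ".gz")
    return [
        ["bgzip", "-f", str(vcf)],                # replaces the .vcf with a .vcf.gz
        ["tabix", "-p", "vcf", str(compressed)],  # writes the .vcf.gz.tbi index
    ]


def expected_outputs(vcf_path):
    """Return the paths of the compressed file and of its index."""
    vcf = Path(vcf_path)
    return (
        vcf.with_name(vcf.name + ".gz"),
        vcf.with_name(vcf.name + ".gz.tbi"),
    )


def index_vcf(vcf_path):
    """Run both commands in order, stopping if either one fails."""
    for command in indexing_commands(vcf_path):
        subprocess.run(command, check=True)
    return expected_outputs(vcf_path)
```

Calling index_vcf("my_vcf.vcf") should leave my_vcf.vcf.gz and my_vcf.vcf.gz.tbi in the same folder, which is the pair of files the VCF Loader expects.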