In this view, we are highlighting genes that are part of the core genome of known B. The highlighted ribbon in Figure 1 leads the user to a group of genes that are dispensable in B. This example shows the use of the Parallel Sets view to show nine discrete categories simultaneously, many more than is possible with the familiar Venn Diagram. With the summary view that Parallel Sets provides, users are able to see not only the gene count but also the relationships between categories.
Case study 2 further demonstrates the use of Parallel Sets and the other connecting visualizations to identify features of interest based on the visual cues the displays provide. A Parallel Sets view is used to subdivide genes that occur in genomes of the Brucella genus. The highlighted ribbon contains the genes that are part of the dispensable gene vocabulary of known B. When selected, the list of genes attached to the ribbon propagates to other coordinated views. The core analytical visualization implemented in the system is the Parallel Sets view.
However, additional visualizations are implemented to allow the user to explore and sort data in a collection where hundreds of thousands of genome features must be manipulated simultaneously. Figure 2 illustrates the complete interface. In a CMV system, all visualizations have the ability to communicate with one another, meaning that selections in one view are propagated to all others.
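The propagation of selections between coordinated views can be sketched with a simple publish/subscribe arrangement. This is a minimal illustration only, not the GenoSets implementation; the class names `SelectionBus` and `View` and the view labels are hypothetical.

```python
class SelectionBus:
    """Hypothetical hub that relays a selection to every registered view."""

    def __init__(self):
        self.views = []

    def register(self, view):
        self.views.append(view)

    def publish(self, gene_ids, source):
        # Forward the selection to every view except the one that made it.
        for view in self.views:
            if view is not source:
                view.on_selection(frozenset(gene_ids))


class View:
    """Hypothetical stand-in for one visualization in the interface."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.selection = frozenset()
        bus.register(self)

    def on_selection(self, gene_ids):
        # A real view would redraw here; the sketch only records the set.
        self.selection = gene_ids

    def user_selects(self, gene_ids):
        self.selection = frozenset(gene_ids)
        self.bus.publish(gene_ids, source=self)


bus = SelectionBus()
parallel_sets = View("parallel_sets", bus)
go_treemap = View("go_treemap", bus)
ortholog_clusters = View("ortholog_clusters", bus)

# A selection made in one view reaches all the others.
parallel_sets.user_selects({"gene1", "gene2"})
```

Routing every selection through one hub keeps the views independent of each other, so a new view only needs to register with the bus.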
Within the interface, the user can navigate the dimensions of the dataset and get instant feedback on the selections. The user can select a subset of data in one display and then create another view from that selected subset.
For example, the user may select the genes in one genome that do not have orthologs in another genome, then propagate that set to a view that displays the GO term enrichment associated with those genes. The multiple views allow the user to progressively build queries at multiple levels of detail. The Parallel Sets view (A) allows the user to create study sets by partitioning genes based on orthologous relationships across multiple taxonomic levels.
Each of these views displays the same information in an alternative form for a better overall analysis. The details for each gene associated with the selected GO term are displayed in the Ortholog Cluster view (E). Study sets are built by making selections in the Parallel Sets view. The study sets created can be viewed and managed from the Study Set Navigator.
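The partitioning described here amounts to grouping ortholog clusters by the genomes in which they occur. The sketch below shows one way to do that, assuming a hypothetical mapping from cluster identifier to the set of genomes containing a member gene; the identifiers and genome names are placeholders, not data from the study.

```python
from collections import defaultdict

# Hypothetical input: ortholog cluster id -> genomes that contain a member gene.
clusters = {
    "OC1": {"genomeA", "genomeB", "genomeC"},
    "OC2": {"genomeA", "genomeB"},
    "OC3": {"genomeA"},
    "OC4": {"genomeB", "genomeC"},
}
genomes = ["genomeA", "genomeB", "genomeC"]


def partition_by_presence(clusters, genomes):
    """Group clusters by their presence/absence pattern across the genomes."""
    partitions = defaultdict(set)
    for cluster_id, present_in in clusters.items():
        pattern = tuple(genome in present_in for genome in genomes)
        partitions[pattern].add(cluster_id)
    return dict(partitions)


def present_without_ortholog(clusters, in_genome, not_in_genome):
    """Clusters with a gene in one genome and no ortholog in another."""
    return {
        cluster_id
        for cluster_id, present_in in clusters.items()
        if in_genome in present_in and not_in_genome not in present_in
    }


partitions = partition_by_presence(clusters, genomes)
study_set = present_without_ortholog(clusters, "genomeA", "genomeC")
```

Each key of `partitions` corresponds to one combination of categories, which is roughly what a single ribbon represents in a Parallel Sets display.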
The study set navigator allows the user to rename study sets and view the categories that were used to create them.
All of the sets in this list can be analyzed for GO term enrichment and can be added or removed at any time. This view is also connected to the CMV interface, and all the views showing details about a specific study set are updated when the user selects a set in this navigator.
All selections in these views are coordinated together such that a selection in one updates what is viewed in the other. It represents the hierarchical structure of the Gene Ontology together with the categorical information about the number of genes associated with each term and the statistical significance of the term.
Treemaps have been developed to simultaneously visualize both hierarchical and quantitative data and have previously been used to summarize Gene Ontology terms, although not in the context of genome comparison.
The GO Treemap displays the GO term hierarchy as nested rectangles, with each GO term drawn as a single rectangle and all child GO terms drawn inside the parent rectangle. The size of each rectangle represents the number of genes classified with that GO annotation in the entire population set. In GenoSets, the treemap view is coordinated with previous set selections made by the user. When the user selects a study set, the GO Treemap is updated to represent the enrichment for that study set only. GO terms that have a significant p-value are highlighted.
The contrasting highlight colors represent the ratio of the study set to population set for that specific term. If the term ratio is higher for the study set than the population set, the term is colored one color; otherwise, the term is colored the other contrasting color. The user is able to select the colors used in the display and also set the p-value cutoff threshold, which determines the range of values that are colored in the treemap.
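The coloring rule can be written as a small function. The sketch below rests on several assumptions that the text does not confirm: the "term ratio" is taken to be the fraction of genes in a set annotated with the term, a term counts as significant when its p-value is at or below the cutoff, and the default cutoff of 0.05 and the color names are placeholders for user-chosen settings.

```python
def term_color(study_count, study_size, pop_count, pop_size, p_value,
               cutoff=0.05, higher_color="rose", lower_color="blue",
               neutral_color="grey"):
    """Pick a highlight color for one GO term (illustrative rule only)."""
    # Terms that do not pass the p-value cutoff are left unhighlighted.
    if p_value > cutoff:
        return neutral_color
    study_ratio = study_count / study_size  # assumed definition of the ratio
    pop_ratio = pop_count / pop_size
    # Higher ratio in the study set gets one color; anything else the other.
    return higher_color if study_ratio > pop_ratio else lower_color


# Toy numbers: 10 of 100 study genes vs 50 of 1000 population genes.
example = term_color(10, 100, 50, 1000, p_value=0.01)
```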
In the GO Treemap view, as the user points to a GO term represented as a rectangle, the display shows the name of the term, the number of genes annotated with that term, and the level in the GO hierarchy where the term is found. A search box allows the user to search for GO terms by name, and the display will highlight any matches containing the search word. The user can also interact with the visualization by selecting different study sets, and the treemap will update to show only terms enriched in the selected set.
This allows the user to make direct comparisons between subsets, highlighting functional categories that may be enriched in one subset relative to another. However, the GO hierarchy is not a true tree structure; it is a directed acyclic graph in which a parent node may have multiple children and a child node may have multiple parents, but which contains no cycles: if the graph is traversed down the hierarchy, the starting node is visited once and only once.
Representing this as a tree structure creates some redundancy in the graph i. The hierarchy visualization methods used to create the GO Treemap and Tree Explorer views are generalizable to other hierarchical data types; for instance, a hierarchical view could be connected to a taxonomic dimension in a larger dataset to allow the user to navigate through that hierarchy.
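The redundancy that arises when a directed acyclic graph is drawn as a tree can be seen in a small example. The sketch below uses a hypothetical four-term graph, not real GO terms: a term with two parents is copied once under each parent when the graph is expanded.

```python
# Hypothetical DAG: term -> list of child terms. "C" has two parents.
dag = {
    "root": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
    "C": [],
}


def expand_to_tree(dag, term):
    """Expand the DAG below `term` into a tree, copying shared descendants."""
    return {
        "term": term,
        "children": [expand_to_tree(dag, child) for child in dag.get(term, [])],
    }


def count_nodes(tree):
    """Count every node in the expanded tree, duplicates included."""
    return 1 + sum(count_nodes(child) for child in tree["children"])


tree = expand_to_tree(dag, "root")
# The DAG holds 4 distinct terms, but the tree holds 5 nodes: "C" appears twice.
tree_size = count_nodes(tree)
```

Because the graph has no cycles, the recursion always terminates; the cost is that a term with many ancestors may be drawn many times.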
Detail lists of gene information are available to the user from multiple points in the interface (Figures 2D and 2E). Feature details are available from all of the views, and contain all of the known information about genes or features. The user can enhance feature detail views by uploading files created in other analyses; for example, a tab-delimited file containing pathway information can be uploaded, and this information will then show in all the feature detail views.
Right-clicking on any item in the interface will show which details are available for that view. It is in table format and includes the GO term identifier and name along with the p-value associated with that term for the selected set. The table also includes the study term total and the population total, which are the numbers of genes in the study set and in the population set, respectively, that are annotated with each GO term. Finally, the table includes a ratio, which is the study term total divided by the population total for each GO term.
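The columns of this table, together with sorting and p-value filtering, can be sketched as follows. The row values and term identifiers are invented placeholders for illustration, and the names `TermRow`, `filter_by_p_value` and `sort_rows` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermRow:
    go_id: str
    name: str
    p_value: float
    study_total: int       # genes in the study set annotated with the term
    population_total: int  # genes in the population set annotated with the term

    @property
    def ratio(self):
        # Study term total divided by population total, as described above.
        return self.study_total / self.population_total


def filter_by_p_value(rows, low, high):
    """Keep rows whose p-value lies in the closed range [low, high]."""
    return [row for row in rows if low <= row.p_value <= high]


def sort_rows(rows, column, descending=False):
    """Sort rows by any displayed column, including the derived ratio."""
    return sorted(rows, key=lambda row: getattr(row, column), reverse=descending)


# Placeholder rows; none of these values come from the study.
rows = [
    TermRow("GO:term1", "term one", 0.001, 8, 40),
    TermRow("GO:term2", "term two", 0.200, 3, 60),
    TermRow("GO:term3", "term three", 0.030, 5, 10),
]
significant = filter_by_p_value(rows, 0.0, 0.05)
by_ratio = sort_rows(significant, "ratio", descending=True)
```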
The table may be sorted by any of the displayed columns and also filtered by p-value ranges. The Ortholog Cluster view (Figure 2E) lists genes that are members of any selected set, grouping the genes by ortholog cluster into a list structure. The gene identifier, name, product description, and organism to which a gene belongs are all shown. Like all other views, this view is connected to the selections in Parallel Sets and in the hierarchical views.
The list is filtered to show the genes in the selected study set and the selected GO term. GenoSets is a flexible system that supports the set-based comparative analysis of an arbitrary collection of genomes chosen by the user, based on features defined by both annotative operations and comparative operations.
One of the key components of the system is the ability to load data and perform calculations through wizards. The creation of a new database is also performed using a wizard.
When the user initiates a new database, the system creates it and all necessary tables. There is no need to run any database scripts or manual configurations; users need only to provide a user name and password with sufficient database privileges for the creation process. The user may also connect to an existing database through this wizard.
Because the database can be housed either locally or on a remote server, multiple people can access the system simultaneously. The database that supports GenoSets is a multi-dimensional data warehouse. The multi-dimensional design is a widely accepted approach for real-time data mining and knowledge discovery, allowing for rapid, ad hoc querying of large, dimensional datasets.
This is typically the support database used in business intelligence software.
The aim of business intelligence software is knowledge discovery with the ability to support aggregates and hierarchical relationships within data. Aggregation and drill-down functions summarize data within a dimension at varying levels of granularity within a hierarchy. GenoSets uses a star schema model presented by Kimball. In the star schema, source data is partitioned into facts, which represent the numerical measurements, and dimensions, which give context to the facts.
The associated textual information describing the fact is separated into dimensions. Dimensions can have a hierarchical structure which allows for the facts to be rolled up into aggregates, i. In the GenoSets database, the central fact is the existence of a feature. Because the fact is a measure of existence and not a numerical measure, aggregation by count is often the most logical summarization of the data.
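A star schema in which the fact is the existence of a feature, summarized by count, can be sketched with an in-memory SQLite database. The table and column names below are assumptions made for illustration and are not the GenoSets schema; the organism names are placeholders.

```python
import sqlite3

ROLLUP_LEVELS = ("genus", "species", "strain")


def build_demo_warehouse():
    """Create a tiny star schema: one fact table and two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE organism_dim (
            organism_id INTEGER PRIMARY KEY,
            genus TEXT, species TEXT, strain TEXT
        );
        CREATE TABLE ortholog_dim (
            ortholog_id INTEGER PRIMARY KEY,
            cluster_name TEXT
        );
        -- Each row records that a feature exists; there is no numeric measure.
        CREATE TABLE feature_fact (
            feature_id INTEGER PRIMARY KEY,
            organism_id INTEGER REFERENCES organism_dim(organism_id),
            ortholog_id INTEGER REFERENCES ortholog_dim(ortholog_id)
        );
    """)
    conn.executemany(
        "INSERT INTO organism_dim VALUES (?, ?, ?, ?)",
        [(1, "GenusX", "sp1", "strain1"),
         (2, "GenusX", "sp1", "strain2"),
         (3, "GenusX", "sp2", "strain3")],
    )
    conn.executemany(
        "INSERT INTO ortholog_dim VALUES (?, ?)",
        [(1, "OC1"), (2, "OC2")],
    )
    conn.executemany(
        "INSERT INTO feature_fact VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 3, 1)],
    )
    return conn


def count_features(conn, level):
    """Roll the fact table up to one level of the organism hierarchy."""
    if level not in ROLLUP_LEVELS:
        raise ValueError(f"unknown level: {level}")
    query = (
        f"SELECT o.{level}, COUNT(*) FROM feature_fact AS f "
        "JOIN organism_dim AS o ON f.organism_id = o.organism_id "
        f"GROUP BY o.{level} ORDER BY o.{level}"
    )
    return conn.execute(query).fetchall()


conn = build_demo_warehouse()
per_species = count_features(conn, "species")
per_genus = count_features(conn, "genus")
```

Changing the `level` argument moves the same count up or down the hierarchy, which is the roll-up and drill-down behavior the dimensional design is meant to support.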
Using this database design in the analysis of comparative genomic data allows for a comprehensive study of the relationships among multiple dimensions describing the data, and eliminates the need to examine each individual feature at its finest level of detail. It enables the identification of annotated features that meet a set of criteria that spans multiple dimensions.
Combinations of dimensions that occur frequently together and rare combinations or outliers can easily be identified. (A) The Parallel Sets view allows the user to create sets of genes that are only in the high-pathogenic strains. (C) View the genes associated with an enriched term in the Ortholog Cluster view. Genes are grouped together by ortholog clusters in this view. To demonstrate the applicability of the GenoSets system for query and analysis of multi-genome datasets, we chose several sequenced genomes of species belonging to the genus Brucella.
We have previously carried out a comparative analysis of the Brucella species using a predecessor to the GenoSets system, and identified regions useful for PCR-based species identification in a multi-step assay. The more recently sequenced genomes of the Brucella genus have since been analyzed and compared, primarily with a focus on identifying pathogenicity islands.
These previous studies provide us with a basis to evaluate the observations we can generate using GenoSets, and we demonstrate that the GenoSets process can be used to efficiently access insights that have generally been arrived at through a much more laborious manual analysis process. (A) Parallel Sets highlights sets of interest that can be further analyzed using multiple alternative views. If the term ratio is higher for the study set than the population set, the term is colored rose; otherwise, the term is colored blue.
All of the views are coordinated with one another such that selection in one view is propagated to all others.
The Brucellae are gram-negative, intracellular pathogens with the ability to infect multiple hosts. Individual Brucella species tend to have a host preference, but can be infectious to other species. Brucella infection, or brucellosis, causes undulant fever in humans and can be fatal if untreated.
Brucellosis can also have severe economic effects on agriculture when livestock infections result in infertility, fetal loss, and reduced milk production. The Brucella genus has classically been described as containing six species, identified through their distinct host preferences and biotyping. New species that infect marine mammals and have also been associated with human infection have recently been discovered.
Currently, there are 12 completely sequenced Brucella strains available in public repositories with three strains representing B. To identify genes or functions that could potentially be involved in either virulence or host preferences, the analyst must have the ability to group genomes together based on knowledge of the properties of each strain or species.
A question-focused grouping of species, along with a query that supports the rational approach of comparing gene content in order to identify potential functional differences, automatically prompts exploration of potentially significant gene differentials when applied in GenoSets. In the current case studies we report that the visualizations provided pointers to gene families with known significance to function, as proof of concept. The same principle can be used in exploratory mode to identify gene targets from fresh data.