IMPC data portal documentation
More information about the way IMPC uses statistics
High-throughput phenotyping generates large volumes of varied data, both categorical and continuous. Operational and cost constraints can lead to a workflow that precludes traditional analysis methods. Furthermore, a high-throughput environment requires a robust, automated statistical pipeline that minimises manual intervention.
The IMPC has produced a short guide to help with understanding the statistical analysis pipeline:
The IMPC uses a variety of statistical methods for making phenotype calls, including:
 Fisher's Exact test - used for categorical data parameters
 Mixed model - used for continuous data parameters which include random effects
 Linear model - used for continuous data parameters when random effects are not significant
 Mann-Whitney U rank sum test - used for continuous data parameters when conditions for Mixed model are not appropriate
 Reference Range Plus - used for some unidimensional data parameters
All analysis frameworks output a statistical significance measure, an effect size measure, model diagnostics (when appropriate), and graphical visualisation.
PhenStat
The statistical methods used by the IMPC have been formalized into an R package called PhenStat.
The PhenStat package provides statistical methods for the identification of abnormal phenotypes, with an emphasis on high-throughput data flows. The package contains:
 dataset checks and cleaning in preparation for the analysis
 2 statistical frameworks for genotype to phenotype identification
 Fisher's Exact test for Categorical data
 Linear Mixed model for continuous data
 Reference range plus model for low N continuous data
 and additional functions that help to decide the correct method for analysis.
 PhenStat User Guide
 How-to guide - Installing PhenStat
 PhenStat is available as a Bioconductor package
 See the complete PhenStat user's guide
Statistical details
The Mixed model framework assumes that baseline values of the dependent variable are normally distributed and that batch (assay date) adds noise. It models the variables accordingly so that the batch effect can be separated from the genotype effect. Model optimisation starts with:
Y = Genotype + Sex + Genotype*Sex + (1|Batch)
Genotype*Sex is sometimes called the "interaction term" in PhenStat.
Batch is assumed to be normally distributed with defined variance.
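To make the structure of the starting model concrete, the following minimal Python sketch simulates data in which fixed Genotype, Sex and Genotype*Sex effects are combined with a normally distributed random offset per batch. It is an illustration only and is not part of PhenStat: the function names, the 0/1 coding of genotype and sex, and all coefficient values are assumptions made for this example.

```python
import random


def simulate_measurement(genotype, sex, batch_effect, rng,
                         intercept=20.0, genotype_effect=2.0, sex_effect=3.0,
                         interaction_effect=0.5, residual_sd=1.0):
    """Draw one value of Y under Y = Genotype + Sex + Genotype*Sex + (1|Batch).

    genotype: 0 for control, 1 for mutant (assumed coding).
    sex: 0 for female, 1 for male (assumed coding).
    batch_effect: random offset shared by all specimens measured in one batch.
    All coefficient defaults are arbitrary illustrative values.
    """
    fixed = (intercept
             + genotype_effect * genotype
             + sex_effect * sex
             + interaction_effect * genotype * sex)
    # Residual noise is drawn independently for every specimen.
    return fixed + batch_effect + rng.gauss(0.0, residual_sd)


def simulate_dataset(n_batches=5, per_cell=4, batch_sd=1.5, seed=0):
    """Simulate per_cell specimens for each genotype/sex cell in each batch."""
    rng = random.Random(seed)
    rows = []
    for batch in range(n_batches):
        # The (1|Batch) term: one normally distributed offset per batch.
        batch_effect = rng.gauss(0.0, batch_sd)
        for genotype in (0, 1):
            for sex in (0, 1):
                for _ in range(per_cell):
                    y = simulate_measurement(genotype, sex, batch_effect, rng)
                    rows.append({"batch": batch, "genotype": genotype,
                                 "sex": sex, "y": y})
    return rows
```

Because every specimen in a batch shares the same offset, a model that ignores batch would attribute that shared noise to the residual, which is the situation the random effect is intended to handle.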
NOTE: The MM encoded in PhenStat supports an optional "weight" term.
The Mixed model framework is an iterative process to select the best model for the data which considers both the best modelling approach (Mixed model or general linear regression) and which factors to include in the model.
If the PhenStat assumptions about the input data are not met, the data are analysed a second time using a Mann-Whitney U rank sum test.
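The choice between methods described so far can be summarised as a small decision function. This is a sketch of the logic as described in this documentation, not PhenStat's actual implementation: the function name and its boolean inputs are hypothetical. Reference Range Plus is left out because the text gives no explicit rule for when it applies.

```python
def choose_method(data_type, assumptions_met=True, batch_significant=True):
    """Return the name of the statistical method for a data set.

    data_type: "categorical" or "continuous".
    assumptions_met: whether the Mixed model assumptions hold (hypothetical flag).
    batch_significant: whether the batch random effect is significant
        (hypothetical flag).
    """
    if data_type == "categorical":
        return "Fisher's Exact test"
    if data_type != "continuous":
        raise ValueError(f"unsupported data type: {data_type!r}")
    if not assumptions_met:
        # Fallback when the input data do not satisfy the model assumptions.
        return "Mann-Whitney U rank sum test"
    if batch_significant:
        return "Mixed model"
    # Without a significant random effect, a linear model is used.
    return "Linear model"
```

For example, `choose_method("continuous", assumptions_met=False)` selects the rank sum fallback regardless of the batch flag.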
Control selection strategy
One side effect of producing data in a high-throughput pipeline is that the input data for a statistical calculation might be produced over multiple days. Environmental fluctuations have been identified as a confounding factor when comparing data gathered on different days. The IMPC describes this as a "batch effect" and it is treated as a random effect in the Mixed model framework.
The data sets to be analysed are identified using unique combinations of these fields:
Field  Description

Background strain  The original strain from which the mutant specimen was derived.
Allele / Colony  The genomic variation in the mutant. The allele describes the character of the mutation and the Genotype effect term of the Mixed model.
Zygosity  The severity of the mutation.
Pipeline  The standardized phenotyping pipeline as described in IMPReSS Pipelines.
Procedure  The standardised set of procedures (experiments) as described in IMPReSS procedures.
Parameter  The standardised set of measurements as described in IMPReSS parameters.
Metadata group  Some parameters are indicated as "procedureMetadata" type. Some of these metadata are used to group comparable data together as described on the IMPReSS parameters page under the "Required For Data Analysis" section. The parameters that are marked as "Required For Data Analysis" are collectively identified by an identifier called the metadata group.
Organisation  The phenotyping organisation that performed the experiment and collected the data.
Sex [1]  The sex of the specimens. When analyzed using the Mixed model males and females are analysed together to determine the Sex and Sex*Genotype interaction effect terms.
[1] optional
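Identifying data sets by unique combinations of fields amounts to grouping records by a composite key. The Python sketch below illustrates the idea under stated assumptions: the dictionary keys such as `background_strain` and `metadata_group` are hypothetical names chosen for this example and are not the portal's actual field names.

```python
from collections import defaultdict

# Hypothetical record keys, one per field in the table above.
DATASET_FIELDS = (
    "background_strain", "colony", "zygosity", "pipeline",
    "procedure", "parameter", "metadata_group", "organisation",
)


def group_into_datasets(records, split_by_sex=False):
    """Group observation records into data sets keyed by the identifying fields.

    Sex is optional: it is only part of the key when split_by_sex is True.
    """
    fields = DATASET_FIELDS + (("sex",) if split_by_sex else ())
    datasets = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in fields)
        datasets[key].append(record)
    return dict(datasets)
```

Records that differ in any one of the key fields, for example the metadata group, end up in separate data sets and are therefore analysed separately.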
IMPC phenotyping centers operate using different workflows, which contribute to the batch effect.
Workflow  Description  Statistical implications  Control selection strategy

One batch  All mutant and control data are measured on one day.  No batch effect. The controls and mutants are analysed using Y = Genotype + Sex + Genotype*Sex.  Concurrent control strategy: use control data that are collected on the same day as the mutant data.
Multi-batch (2+)  Mutant and control data are gathered over a few days.  Possible batch effect. The controls and mutants are analysed using Y = Genotype + Sex + Genotype*Sex + (1|Batch); the batch effect might be removed.  Baseline control strategy: use all control data within the same metadata group.
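The two control selection strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline's actual code: records are plain dictionaries with hypothetical `date` and `metadata_group` keys, and the workflow is inferred solely from the number of distinct dates on which mutants were measured.

```python
def select_controls(mutants, candidate_controls):
    """Return (strategy, controls) for a set of mutant records.

    Assumption for this sketch: one distinct mutant date means the
    "one batch" workflow; more than one means the multi-batch workflow.
    """
    mutant_dates = {m["date"] for m in mutants}
    metadata_groups = {m["metadata_group"] for m in mutants}
    # Controls are only comparable within the same metadata group.
    comparable = [c for c in candidate_controls
                  if c["metadata_group"] in metadata_groups]
    if len(mutant_dates) == 1:
        # Concurrent strategy: controls collected on the same day as the mutants.
        controls = [c for c in comparable if c["date"] in mutant_dates]
        return "concurrent", controls
    # Baseline strategy: all controls within the same metadata group.
    return "baseline", comparable
```

A real pipeline would also need to decide what to do when no concurrent controls exist; this sketch simply returns an empty list in that case.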
For each data set, the appropriate workflow is determined and the statistical calculation is performed. For continuous data, the Mixed model is the IMPC's preferred method of analysis. However, this method requires that the following assumptions are met:
 1. The data is normally distributed
 2. The data has some variation
 3. There must be more than four data points per sex per genotype
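The three assumptions listed above could be checked along the following lines. This is a sketch, not the checks PhenStat performs: the record keys `sex`, `genotype` and `value` are hypothetical, and the normality check is passed in as a caller-supplied function because this documentation does not say which normality test is used.

```python
from collections import Counter


def has_variation(values):
    """Assumption 2: the data are not all identical."""
    return len(set(values)) > 1


def enough_points(records, minimum=4):
    """Assumption 3: more than `minimum` points per sex per genotype.

    Only sex/genotype combinations that occur in the records are checked.
    """
    counts = Counter((r["sex"], r["genotype"]) for r in records)
    return bool(counts) and all(n > minimum for n in counts.values())


def mixed_model_applicable(records, is_normal):
    """Combine the three checks; is_normal is a caller-supplied predicate."""
    values = [r["value"] for r in records]
    return (is_normal(values)
            and has_variation(values)
            and enough_points(records))
```

If `mixed_model_applicable` returns False, the data set would be a candidate for an alternative method rather than the Mixed model.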
The graph pages display plots according to the data type of the parameter. Categorical data parameters display a stacked bar chart, whereas continuous data parameters display a box plot and a scatter plot of the data point values. See the graph documentation for more details.
Fisher's exact output
A table displaying more information about the data used to determine the P value and effect size is displayed below the graph.
Mixed model (PhenStat) output
The "more statistics" link at the bottom of the table will list the statistical method as "MM framework, generalized least squares, equation withoutWeight" when the batch term is not significant, otherwise "MM framework, linear mixed-effects model, equation withoutWeight".
Rank sum output
The more statistics link at the bottom of the table will list the statistical method as "Wilcoxon rank sum test with continuity correction" when a rank sum calculation has generated the statistics.
Reference Range Plus output
The more statistics link at the bottom of the table will list the statistical method as "Reference Range Plus" when a reference range calculation has generated the statistics.
Statistics to Phenotype
If the mutant genotype effect represents a significant change from the control group, then the IMPC pipeline will attempt to associate a Mammalian Phenotype (MP) term to the data.
The particular MP term(s) defined for a parameter are maintained in IMPReSS. Frequently, the term indicates an increase or decrease of the parameter measured.
When a statistical result is determined as significant, the following diagram is used for associating MP terms:
Significance
When a mutant genotype effect P value is less than 1.0E-4 (i.e. 0.0001), it is considered significant.
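As a worked illustration of the threshold, the short Python sketch below applies the rule to a list of results. The function names and the `p_value` key are hypothetical. The comparison is strict, so a P value of exactly 0.0001 is not treated as significant.

```python
SIGNIFICANCE_THRESHOLD = 1.0e-4  # 0.0001, as stated above


def is_significant(p_value):
    """True when the genotype effect P value is strictly below the threshold."""
    return p_value < SIGNIFICANCE_THRESHOLD


def significant_results(results):
    """Keep only the results whose (hypothetical) p_value key is significant."""
    return [r for r in results if is_significant(r["p_value"])]
```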
Tools
 GO lookup tool: GO annotations to phenotyped IMPC genes
 Paper lookup tool: References using IKMC and IMPC resources
 Parallel coordinates
Parallel coordinates
The parallel coordinates tool allows users to compare strains across different parameters. Hover over a row in the table to highlight the corresponding line on the chart.
To start using the tool, select one or more procedures from the dropdown select box. Once this is done, you can filter the data by phenotyping center.
The values displayed are the genotype effect, which accounts for different variation sources. Information about this and the statistical methods used is available in the statistics documentation.
To aid visualisation, we have added two special lines: "mean", which displays the average genotype effect for all genes displayed, and "no effect", which runs through the zero values to show what a gene with no genotype effect for the measured parameters would look like. For large datasets, the mean and no effect lines usually converge.
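The two special lines can be computed directly from the plotted values. The following Python sketch shows one way to do this. It is not the tool's actual code, and the nested dictionary layout (gene, then parameter, then genotype effect) is an assumption made for the example.

```python
def reference_lines(genotype_effects):
    """Compute the "mean" and "no effect" lines for a parallel coordinates chart.

    genotype_effects: dict mapping gene -> dict mapping parameter -> effect.
    Genes that lack a value for a parameter are skipped for that parameter.
    """
    parameters = sorted({p for effects in genotype_effects.values()
                         for p in effects})
    mean_line = {}
    for parameter in parameters:
        values = [effects[parameter]
                  for effects in genotype_effects.values()
                  if parameter in effects]
        mean_line[parameter] = sum(values) / len(values)
    # The "no effect" line passes through zero on every axis.
    no_effect_line = {parameter: 0.0 for parameter in parameters}
    return mean_line, no_effect_line
```

When positive and negative genotype effects roughly cancel across many genes, the per-parameter averages approach zero, which is why the two lines tend to converge for large datasets.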
The tool allows filtering on each axis (parameter) by selecting the region of interest with the mouse.
The clear button removes existing filters.
The export button generates an export of the values in the table. If any filter is set, only the data displayed in the table will be exported.
The generation of this chart is computationally intensive, and the number of parameters that can be plotted may vary from one machine to another. If you notice the tool becoming too slow, please consider selecting fewer procedures.