Statistical Functions
This section contains functions that are often used in statistical analysis for performance improvement.
Bin Data
Bin column numbers obtain frequency of numbers in a certain range. If a column of data contains numbers that range from 1 to 1000 for instance, create bins of a certain width (eg.100) and count how many column items fall into each range (i.e. 1-100, 101-200 and so on). The bin width may be set automatically or be specified by the user. The user may provide the actual width of each bin or, alternatively, specify the total number of bins.
Parameter Set
Parameters | Description |
Column number | Column on which to perform the bin operation. Column numbers begin with zero |
Bin type | This parameter determines how the bins will be set up. The options are Automatic, By Number, By Width |
Bin size | If Bin type is “Automatic”, a bin size will be calculated automatically and need not be specified by the user. If Bin type is “By Number”, bin size is the number of bins into which to the total data range is divided. If Bin type is “By Width”, bin size refers to the width of the data range into which each column item will be binned. |
Min range | Beginning point of first data range. Will be calculated if not provided by user. |
Max range | End point of last data range. Will be calculated if not provided by user. |
Include Original Data | If “Yes” include the original dataset in the result table |
Applications
Can be used to determine problems with accounts receivables over different periods (e.g. < 30 days, 30-60, > 60), the volume of product sales using different price ranges, number of radio signals in different frequency bands, etc. Many more uses.
Data Sample
Input | Parameter | Result | |||||||||||||||||||||||||||||||||||||||||||||
|
Column = 0 Bin type = By Width Bin size = 30 Min range = 0 Max range = 120 Include Original Data = No |
|
Chart Sample
Univariate Metrics
Calculate basic metrics such as count, sum, average, minimum, maximum, variance and standard deviation for a column of numbers. The above metrics are calculated when the option BASIC is selected. The user may also select the option ALL to calculate kurtosis, skewness and standard error of the mean in addition to the above metrics. The formulas used for the calculations are provided in the appendix.
Parameter Set
Parameters | Description |
Column | Column for which to calculate basic metrics. Column numbers begin with zero |
Option | The option ALL or BASIC determines exactly what metrics will be calculated |
Include Original Data | If “Yes” include the original dataset in the result table |
Applications
This function provides basic calculations that can be displayed directly on a data presentation (via a table or chart text annotations).
Data Sample
Input | Parameter | Result | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Column = 0 Option = ALL Include Original Data = Yes |
|
Histogram Distribution
Similar to the Bin Data module, Histogram Distribution bins column numbers to obtain frequency of numbers in a certain range. In addition, it provides the option to bin data by probability (total probability summed over whole range of data is 1) or by percentage (total percentage summed over whole range of data is 100). The bin width may be set automatically or be specified by the user. The user may provide the actual width of each bin or, alternatively, specify the total number of bins.
Parameter Set
Parameters | Description |
Column number | Column on which to perform the histogram operation. Column numbers begin with zero |
Bin type | This parameter determines how the bins will be set up. The options are Automatic, By Number, By Width |
Histogram size | If Bin type is “Automatic”, a histogram size will be calculated automatically and need not be specified by the user, If Bin type is “By Number”, histogram size is the number of bins into which to the total data range is divided, If Bin type is “By Width”, histogram size refers to the width of the data range into which each column item will be binned. |
Min range | Beginning point of first data range. Will be calculated if not provided by user. |
Max range | End point of last data range. Will be calculated if not provided by user. |
Histogram type | Histogram result is returned By Number, Percentage, or Probability |
Include Original Data | If “Yes” include the original dataset in the result table |
Applications
Very similar to binning. Can show, on a percentage basis, where products are selling, where test score distributions occur, etc.
Data Sample
Input | Parameter | Result | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Column number = 0 Bin type = By Width Histogram Size = 1 Min Range = 0 Max Range = 8 Histogram type = By Number, Percentage,Probability (displayed in successive columns) Include Original Data = No |
|
Pareto Analysis
Given a set of causes that lead to problems in a certain area, the Pareto Principle states that in many cases only a “vital few” factors (about 20%) are responsible for producing most (about 80%) of the problems. To help identify the major components in a set of data, Pareto analysis sorts a column of data and accompanying labels in descending order of magnitude and provides the cumulative percentage contribution that each additional item makes to the total. The (x,y) values for an 80% y line and a 20% x line are also provided. Plotting the points at which the cumulative percentage line crosses the 80% and 20% lines allows the user to note what percentage of items contribute to 80% of the total and what percentage of the total is made up of 20% of the items respectively.
Parameter Set
Parameters | Description |
Category Names Column | Column that holds item labels. Column numbers begin with zero |
Occurrences Column | Column that holds item values. Column numbers begin with zero |
Top Extent | Number of items to be returned in the result. Setting this parameter to a small number results in data being provided for only the first few major items |
Sort Columns | By default, sorts all columns based on the contents of the Occurances column. |
Applications
Useful for determining which products/people are the biggest contributors, which factors cause the bulk of the problems, etc.
Data Sample
Input | Parameter | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Category Names Column = 0 Occurrences Column = 1 Top Extent = Sort Columns = All |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Result | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Chart Sample
Normal Distribution
Given a column of data, the Normal Distribution module calculates the normal probability distribution or “bell curve” obtained from the mean and standard deviation of the data set for the default analysis type “Probability”. If analysis type is “Cumulative”, the cumulative probability over the range of data is returned in the result columns. Analysis type “Inverse” is identical to cumulative except for the fact that the X and Y columns are interchanged for ease of plotting a horizontally oriented cumulative probability.
Parameter Set
Parameters | Description |
Column | Column for which to calculate the normal distribution. Column numbers begin with zero |
Number of Points | The number of points within the data range for which to evaluate the normal distribution |
Analysis Type | Cumulative, Inverse or Probability |
Include Original Data | If “Yes” include the original dataset in the result table |
Applications
Shows the distribution of data over a given dataset. Can be used to determine when the number of errors/failures in a given set or subset is within acceptable expectations or outside of “normal” bounds.
Data Sample
Input | Parameter | Result | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Column = 0 Number of points = 16 Analysis Type = Probability Original dataset = No |
|
Chart Sample
This chart sample is based on a bar chart (with bars mapped to the top axis to provide histogram appearance) and the normal distribution line applied. It shows that, on average, invoices take about 68 days to get paid. Note that plus or minus 1 standard deviation represents 68% of the data. In this case, there is considerable variation in the data which might highlight the need for tighter controls.
Quantile
Quantiles are essentially points taken at regular vertical intervals from the cumulative distribution function of a random variable. Dividing ordered data into equal sized data subsets is the motivation for quantiles. The quantiles are the data values marking the boundaries between consecutive subsets. This module provides for determining the percentiles, deciles, quintiles, quartiles and the median.
Parameter Set
Parameters | Description |
Column Index Of Item Values | Column on which to perform the Quantile analysis. Column numbers are zero indexed. |
Quantile Type | Percentiles/Deciles/Quintiles/Quartiles/Median |
Include Original Data | If “Yes” include the original dataset in the result table |
Applications
Show distribution of test scores, compare income distribution of different groups, many more uses.
Data Sample
Input | Parameter | Result | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Column Index Of Item Values =0 Quantile Type=Deciles Include Original Data = Yes |
|
Chart Sample
The following chart shows the distribution of test scores among the Decile (10%) marks.
Mode
In a set of data, the mode is the most frequently observed data value. There may be no mode if no value appears more than any other. There may also be two modes (bimodal), three modes (trimodal) or more modes (multimodal). In case of grouped frequency distributions, the modal class is the class with the largest frequency.
Parameter Set
Parameters | Description |
Column Index Of Item Values | Column on which to perform the Mode analysis. Column numbers are zero indexed. |
Mode Type | Ungrouped/GroupedIf Mode Type is Ungrouped the result will be the most frequently observed data value/s.
If Mode Type is Grouped, based on bin size, the result will be the class/s that has the most number of observed data value. |
Bin Size | This is the width of each bin. Number of bins will be determined by dividing the difference between highest observed value and lowest observed value by bin size. |
Include Original Data | If “Yes” include the original dataset in the result table |
Applications
Track highest incidence of test score, age, product code, user ids, weight, error codes, zip codes, much more.
Data Sample
Input | Parameter | Result | ||||||||||||||||||||||||||||||
|
Column Index Of Item Values = 0 Mode Type = Ungrouped Bin Size = Include Original Data = No |
|