Statistical Functions

This section contains functions that are often used in statistical analysis for performance improvement.

Bin Data

Bin column numbers obtain frequency of numbers in a certain range. If a column of data contains numbers that range from 1 to 1000 for instance, create bins of a certain width (eg.100) and count how many column items fall into each range (i.e. 1-100, 101-200 and so on). The bin width may be set automatically or be specified by the user. The user may provide the actual width of each bin or, alternatively, specify the total number of bins.

Parameter Set

Parameters Description
Column number Column on which to perform the bin operation. Column numbers begin with zero
Bin type This parameter determines how the bins will be set up. The options are Automatic, By Number, By Width
Bin size If Bin type is “Automatic”, a bin size will be calculated automatically and need not be specified by the user.
 
If Bin type is “By Number”, bin size is the number of bins into which to the total data range is divided.
 
If Bin type is “By Width”, bin size refers to the width of the data range into which each column item will be binned.
Min range Beginning point of first data range. Will be calculated if not provided by user.
Max range End point of last data range. Will be calculated if not provided by user.
Include Original Data If “Yes” include the original dataset in the result table

 

Applications

Can be used to determine problems with accounts receivables over different periods (e.g. < 30 days, 30-60, > 60), the volume of product sales using different price ranges, number of radio signals in different frequency bands, etc. Many more uses.

Data Sample

Input Parameter Result

Days Outstanding

55

32

45

23

24

72

75

110

221

29

44

78

38

143

87

23

Column = 0
Bin type = By Width
Bin size = 30
Min range = 0
Max range = 120
Include Original Data = No

Bin Name

Bin Min

Bin Max

Bin
Frequency

Less

-1.0E30

0.0

0

Bin1

0.0

30.0

4

Bin2

30.0

60.0

5

Bin3

60.0

90.0

4

Bin4

90.0

120.0

1

More

120.0

1.0E30

2

 
Chart Sample
 
chart
 

Univariate Metrics

Calculate basic metrics such as count, sum, average, minimum, maximum, variance and standard deviation for a column of numbers. The above metrics are calculated when the option BASIC is selected. The user may also select the option ALL to calculate kurtosis, skewness and standard error of the mean in addition to the above metrics. The formulas used for the calculations are provided in the appendix.

Parameter Set

Parameters Description
Column Column for which to calculate basic metrics. Column numbers begin with zero
Option The option ALL or BASIC determines exactly what metrics will be calculated
Include Original Data If “Yes” include the original dataset in the result table

 

Applications

This function provides basic calculations that can be displayed directly on a data presentation (via a table or chart text annotations).

Data Sample

Input Parameter Result

X

1

6

3

0

5

9

5

2

3

1

7

Column = 0
Option = ALL
Include Original Data = Yes

X

Metric

Value

1

Count

11.0

6

Sum

42.0

3

Average

3.8181818181818183

0

Min

0.0

5

Max

9.0

9

Median

3.0

5

Variance

7.239669421487604

2

SDev

2.690663379445226

3

Kurtosis

-0.7094170263338975

1

Skewness

0.39787509406925864

7

Standard Error

0.8112655333426796

 

Histogram Distribution

Similar to the Bin Data module, Histogram Distribution bins column numbers to obtain frequency of numbers in a certain range. In addition, it provides the option to bin data by probability (total probability summed over whole range of data is 1) or by percentage (total percentage summed over whole range of data is 100). The bin width may be set automatically or be specified by the user. The user may provide the actual width of each bin or, alternatively, specify the total number of bins.

Parameter Set

Parameters Description
Column number Column on which to perform the histogram operation. Column numbers begin with zero
Bin type This parameter determines how the bins will be set up. The options are Automatic, By Number, By Width
Histogram size If Bin type is “Automatic”, a histogram size will be calculated automatically and need not be specified by the user,
 
If Bin type is “By Number”, histogram size is the number of bins into which to the total data range is divided,
 
If Bin type is “By Width”, histogram size refers to the width of the data range into which each column item will be binned.
Min range Beginning point of first data range. Will be calculated if not provided by user.
Max range End point of last data range. Will be calculated if not provided by user.
Histogram type Histogram result is returned By Number, Percentage, or Probability
Include Original Data If “Yes” include the original dataset in the result table

 

Applications

Very similar to binning. Can show, on a percentage basis, where products are selling, where test score distributions occur, etc.

Data Sample

Input Parameter Result

val

1

2

3

3

3

4

4

4

4

5

5

5

6

6

7

Column number = 0
Bin type = By Width
Histogram Size = 1
Min Range = 0
Max Range = 8
Histogram type = By Number, Percentage,Probability (displayed in successive columns)
Include Original Data = No

Bin No

Bin Min

Bin Max

Histogram

1.0

0.0

1.0

0.06666666666666667

2.0

1.0

2.0

0.06666666666666667

3.0

2.0

3.0

0.2

4.0

3.0

4.0

0.26666666666666666

5.0

4.0

5.0

0.2

6.0

5.0

6.0

0.13333333333333333

7.0

6.0

7.0

0.06666666666666667

8.0

7.0

8.0

0.0

 

 

Pareto Analysis

Given a set of causes that lead to problems in a certain area, the Pareto Principle states that in many cases only a “vital few” factors (about 20%) are responsible for producing most (about 80%) of the problems. To help identify the major components in a set of data, Pareto analysis sorts a column of data and accompanying labels in descending order of magnitude and provides the cumulative percentage contribution that each additional item makes to the total. The (x,y) values for an 80% y line and a 20% x line are also provided. Plotting the points at which the cumulative percentage line crosses the 80% and 20% lines allows the user to note what percentage of items contribute to 80% of the total and what percentage of the total is made up of 20% of the items respectively.

Parameter Set

Parameters Description
Category Names Column Column that holds item labels. Column numbers begin with zero
Occurrences Column Column that holds item values. Column numbers begin with zero
Top Extent Number of items to be returned in the result. Setting this parameter to a small number results in data being provided for only the first few major items
Sort Columns By default, sorts all columns based on the contents of the Occurances column.

 
Applications

Useful for determining which products/people are the biggest contributors, which factors cause the bulk of the problems, etc.

Data Sample

Input Parameter

Cause

Issues

Processor

35

Display Card

30

Keyboard

10

Hard Disk

165

Mouse

5

USB

5

Category Names Column = 0
Occurrences Column = 1
Top Extent =
Sort Columns = All
Result

Cause

Issues

Cumulative
Percentage

Line 80 X

Line 80 Y

Line 20 X

Line 20 Y

Axis

Min

Max

Hard Disk

165

66.0

0.0

80.0

1.7

0.0

LHS

0.0

250.0

Processor

35

80.0

1.0

80.0

1.7

16.666666666666668

RHS

0.0

100.0

Display Card

30

92.0

2.0

80.0

1.7

33.333333333333336

null

null

null

Keyboard

10

96.0

3.0

80.0

1.7

50.0

null

null

null

USB

5

98.0

4.0

80.0

1.7

66.66666666666667

null

null

null

Mouse

5

100.0

5.0

80.0

1.7

100.0

null

null

null

 
Chart Sample
 
chart
 

Normal Distribution

Given a column of data, the Normal Distribution module calculates the normal probability distribution or “bell curve” obtained from the mean and standard deviation of the data set for the default analysis type “Probability”. If analysis type is “Cumulative”, the cumulative probability over the range of data is returned in the result columns. Analysis type “Inverse” is identical to cumulative except for the fact that the X and Y columns are interchanged for ease of plotting a horizontally oriented cumulative probability.

Parameter Set

Parameters Description
Column Column for which to calculate the normal distribution. Column numbers begin with zero
Number of Points The number of points within the data range for which to evaluate the normal distribution
Analysis Type Cumulative, Inverse or Probability
Include Original Data If “Yes” include the original dataset in the result table

 
Applications

Shows the distribution of data over a given dataset. Can be used to determine when the number of errors/failures in a given set or subset is within acceptable expectations or outside of “normal” bounds.

Data Sample

Input Parameter Result

Days Outstanding

55

32

45

23

24

72

75

110

221

29

44

78

38

143

87

23

Column = 0
Number of points = 16
Analysis Type = Probability
Original dataset = No 

Days
Outstanding

Nx

Ny

55

23.0

0.00522829909383379

32

36.2

0.006355220302356322

45

49.4

0.0072322524261434025

23

62.6

0.007705295915223013

24

75.8

0.0076856012545516985

72

89.0

0.007176937353672592

75

102.2

0.00627441503114732

110

115.4

0.005135468973532381

221

128.6

0.003935136099286401

29

141.8

0.0028230084959605046

44

155.0

0.001895995897570596

78

168.2

0.0011921622939091978

38

181.4

7.017883434060527E-4

143

194.6

3.8676725603556064E-4

87

207.8

1.9955656903427628E-4

23

221.0

9.639513990883174E-5

 
Chart Sample
This chart sample is based on a bar chart (with bars mapped to the top axis to provide histogram appearance) and the normal distribution line applied. It shows that, on average, invoices take about 68 days to get paid. Note that plus or minus 1 standard deviation represents 68% of the data. In this case, there is considerable variation in the data which might highlight the need for tighter controls.
 
chart
 

Quantile

Quantiles are essentially points taken at regular vertical intervals from the cumulative distribution function of a random variable. Dividing ordered data into equal sized data subsets is the motivation for quantiles. The quantiles are the data values marking the boundaries between consecutive subsets. This module provides for determining the percentiles, deciles, quintiles, quartiles and the median.

Parameter Set

Parameters Description
Column Index Of Item Values Column on which to perform the Quantile analysis. Column numbers are zero indexed.
Quantile Type Percentiles/Deciles/Quintiles/Quartiles/Median
Include Original Data If “Yes” include the original dataset in the result table

 

Applications

Show distribution of test scores, compare income distribution of different groups, many more uses.

Data Sample

Input Parameter Result

Test Scores

52

67

68

77

78

81

83

84

84

85

87

88

88

92

94

97

99

Column Index Of Item Values =0
Quantile Type=Deciles
Include Original Data = Yes

Test Scores

DECILES

DECILES_POINT

52

1

67.0

67

2

77.0

68

3

81.0

77

4

83.0

78

5

84.0

81

6

87.0

83

7

88.0

84

8

92.0

84

9

97.0

85

null

null

87

null

null

88

null

null

88

null

null

92

null

null

94

null

null

97

null

null

99

null

null

 
Chart Sample
The following chart shows the distribution of test scores among the Decile (10%) marks.
 
chart
 

Mode

In a set of data, the mode is the most frequently observed data value. There may be no mode if no value appears more than any other. There may also be two modes (bimodal), three modes (trimodal) or more modes (multimodal). In case of grouped frequency distributions, the modal class is the class with the largest frequency.

Parameter Set

Parameters Description
Column Index Of Item Values Column on which to perform the Mode analysis. Column numbers are zero indexed.
Mode Type Ungrouped/GroupedIf Mode Type is Ungrouped the result will be the most frequently observed data value/s.

If Mode Type is Grouped, based on bin size, the result will be the class/s that has the most number of observed data value.

Bin Size This is the width of each bin. Number of bins will be determined by dividing the difference between highest observed value and lowest observed value by bin size.
Include Original Data If “Yes” include the original dataset in the result table

 

Applications

Track highest incidence of test score, age, product code, user ids, weight, error codes, zip codes, much more.

Data Sample

Input Parameter Result

UserId

4388

4003

2003

4002

7163

3872

4002

2187

4883

4892

7633

4002

7633

9290

3882

4004

2391

4859

3729

4992

5837

2099

3820

3872

4002

Column Index Of Item Values = 0
Mode Type = Ungrouped
Bin Size =
Include Original Data = No

ITEM

COUNT

4002.0

4