Statistical Functions

This section contains functions that are often used in statistical analysis for performance improvement.

Bin Data

Bin column numbers obtain frequency of numbers in a certain range. If a column of data contains numbers that range from 1 to 1000 for instance, create bins of a certain width (eg.100) and count how many column items fall into each range (i.e. 1-100, 101-200 and so on). The bin width may be set automatically or be specified by the user. The user may provide the actual width of each bin or, alternatively, specify the total number of bins.

Parameter Set

Parameters	Description
Column number	Column on which to perform the bin operation. Column numbers begin with zero
Bin type	This parameter determines how the bins will be set up. The options are Automatic, By Number, By Width
Bin size	If Bin type is “Automatic”, a bin size will be calculated automatically and need not be specified by the user. If Bin type is “By Number”, bin size is the number of bins into which to the total data range is divided. If Bin type is “By Width”, bin size refers to the width of the data range into which each column item will be binned.
Min range	Beginning point of first data range. Will be calculated if not provided by user.
Max range	End point of last data range. Will be calculated if not provided by user.
Include Original Data	If “Yes” include the original dataset in the result table

Applications

Can be used to determine problems with accounts receivables over different periods (e.g. < 30 days, 30-60, > 60), the volume of product sales using different price ranges, number of radio signals in different frequency bands, etc. Many more uses.

Data Sample

Input

Parameter

Result

Days Outstanding
55
32
45
23
24
72
75
110
221
29
44
78
38
143
87
23

Column = 0
Bin type = By Width
Bin size = 30
Min range = 0
Max range = 120
Include Original Data = No

Bin Name	Bin Min	Bin Max	Bin Frequency
Less	-1.0E30	0.0	0
Bin1	0.0	30.0	4
Bin2	30.0	60.0	5
Bin3	60.0	90.0	4
Bin4	90.0	120.0	1
More	120.0	1.0E30	2

Chart Sample

chart

Univariate Metrics

Calculate basic metrics such as count, sum, average, minimum, maximum, variance and standard deviation for a column of numbers. The above metrics are calculated when the option BASIC is selected. The user may also select the option ALL to calculate kurtosis, skewness and standard error of the mean in addition to the above metrics. The formulas used for the calculations are provided in the appendix.

Parameter Set

Parameters	Description
Column	Column for which to calculate basic metrics. Column numbers begin with zero
Option	The option ALL or BASIC determines exactly what metrics will be calculated
Include Original Data	If “Yes” include the original dataset in the result table

Applications

This function provides basic calculations that can be displayed directly on a data presentation (via a table or chart text annotations).

Data Sample

Input

Parameter

Result

X
1
6
3
0
5
9
5
2
3
1
7

Column = 0
Option = ALL
Include Original Data = Yes

X	Metric	Value
1	Count	11.0
6	Sum	42.0
3	Average	3.8181818181818183
0	Min	0.0
5	Max	9.0
9	Median	3.0
5	Variance	7.239669421487604
2	SDev	2.690663379445226
3	Kurtosis	-0.7094170263338975
1	Skewness	0.39787509406925864
7	Standard Error	0.8112655333426796

Histogram Distribution

Similar to the Bin Data module, Histogram Distribution bins column numbers to obtain frequency of numbers in a certain range. In addition, it provides the option to bin data by probability (total probability summed over whole range of data is 1) or by percentage (total percentage summed over whole range of data is 100). The bin width may be set automatically or be specified by the user. The user may provide the actual width of each bin or, alternatively, specify the total number of bins.

Parameter Set

Parameters	Description
Column number	Column on which to perform the histogram operation. Column numbers begin with zero
Bin type	This parameter determines how the bins will be set up. The options are Automatic, By Number, By Width
Histogram size	If Bin type is “Automatic”, a histogram size will be calculated automatically and need not be specified by the user, If Bin type is “By Number”, histogram size is the number of bins into which to the total data range is divided, If Bin type is “By Width”, histogram size refers to the width of the data range into which each column item will be binned.
Min range	Beginning point of first data range. Will be calculated if not provided by user.
Max range	End point of last data range. Will be calculated if not provided by user.
Histogram type	Histogram result is returned By Number, Percentage, or Probability
Include Original Data	If “Yes” include the original dataset in the result table

Applications

Very similar to binning. Can show, on a percentage basis, where products are selling, where test score distributions occur, etc.

Data Sample

Input

Parameter

Result

val
1
2
3
3
3
4
4
4
4
5
5
5
6
6
7

Column number = 0
Bin type = By Width
Histogram Size = 1
Min Range = 0
Max Range = 8
Histogram type = By Number, Percentage,Probability (displayed in successive columns)
Include Original Data = No

Bin No	Bin Min	Bin Max	Histogram
1.0	0.0	1.0	0.06666666666666667
2.0	1.0	2.0	0.06666666666666667
3.0	2.0	3.0	0.2
4.0	3.0	4.0	0.26666666666666666
5.0	4.0	5.0	0.2
6.0	5.0	6.0	0.13333333333333333
7.0	6.0	7.0	0.06666666666666667
8.0	7.0	8.0	0.0

Pareto Analysis

Given a set of causes that lead to problems in a certain area, the Pareto Principle states that in many cases only a “vital few” factors (about 20%) are responsible for producing most (about 80%) of the problems. To help identify the major components in a set of data, Pareto analysis sorts a column of data and accompanying labels in descending order of magnitude and provides the cumulative percentage contribution that each additional item makes to the total. The (x,y) values for an 80% y line and a 20% x line are also provided. Plotting the points at which the cumulative percentage line crosses the 80% and 20% lines allows the user to note what percentage of items contribute to 80% of the total and what percentage of the total is made up of 20% of the items respectively.

Parameter Set

Parameters	Description
Category Names Column	Column that holds item labels. Column numbers begin with zero
Occurrences Column	Column that holds item values. Column numbers begin with zero
Top Extent	Number of items to be returned in the result. Setting this parameter to a small number results in data being provided for only the first few major items
Sort Columns	By default, sorts all columns based on the contents of the Occurances column.

Applications

Useful for determining which products/people are the biggest contributors, which factors cause the bulk of the problems, etc.

Data Sample

Input

Parameter

Cause	Issues
Processor	35
Display Card	30
Keyboard	10
Hard Disk	165
Mouse	5
USB	5

Category Names Column = 0
Occurrences Column = 1
Top Extent =
Sort Columns = All

Result

Cause	Issues	Cumulative Percentage	Line 80 X	Line 80 Y	Line 20 X	Line 20 Y	Axis	Min	Max
Hard Disk	165	66.0	0.0	80.0	1.7	0.0	LHS	0.0	250.0
Processor	35	80.0	1.0	80.0	1.7	16.666666666666668	RHS	0.0	100.0
Display Card	30	92.0	2.0	80.0	1.7	33.333333333333336	null	null	null
Keyboard	10	96.0	3.0	80.0	1.7	50.0	null	null	null
USB	5	98.0	4.0	80.0	1.7	66.66666666666667	null	null	null
Mouse	5	100.0	5.0	80.0	1.7	100.0	null	null	null

Chart Sample

chart

Normal Distribution

Given a column of data, the Normal Distribution module calculates the normal probability distribution or “bell curve” obtained from the mean and standard deviation of the data set for the default analysis type “Probability”. If analysis type is “Cumulative”, the cumulative probability over the range of data is returned in the result columns. Analysis type “Inverse” is identical to cumulative except for the fact that the X and Y columns are interchanged for ease of plotting a horizontally oriented cumulative probability.

Parameter Set

Parameters	Description
Column	Column for which to calculate the normal distribution. Column numbers begin with zero
Number of Points	The number of points within the data range for which to evaluate the normal distribution
Analysis Type	Cumulative, Inverse or Probability
Include Original Data	If “Yes” include the original dataset in the result table

Applications

Shows the distribution of data over a given dataset. Can be used to determine when the number of errors/failures in a given set or subset is within acceptable expectations or outside of “normal” bounds.

Data Sample

Input

Parameter

Result

Days Outstanding
55
32
45
23
24
72
75
110
221
29
44
78
38
143
87
23

Column = 0
Number of points = 16
Analysis Type = Probability
Original dataset = No

Days Outstanding	Nx	Ny
55	23.0	0.00522829909383379
32	36.2	0.006355220302356322
45	49.4	0.0072322524261434025
23	62.6	0.007705295915223013
24	75.8	0.0076856012545516985
72	89.0	0.007176937353672592
75	102.2	0.00627441503114732
110	115.4	0.005135468973532381
221	128.6	0.003935136099286401
29	141.8	0.0028230084959605046
44	155.0	0.001895995897570596
78	168.2	0.0011921622939091978
38	181.4	7.017883434060527E-4
143	194.6	3.8676725603556064E-4
87	207.8	1.9955656903427628E-4
23	221.0	9.639513990883174E-5

Chart Sample
This chart sample is based on a bar chart (with bars mapped to the top axis to provide histogram appearance) and the normal distribution line applied. It shows that, on average, invoices take about 68 days to get paid. Note that plus or minus 1 standard deviation represents 68% of the data. In this case, there is considerable variation in the data which might highlight the need for tighter controls.

chart

Quantile

Quantiles are essentially points taken at regular vertical intervals from the cumulative distribution function of a random variable. Dividing ordered data into equal sized data subsets is the motivation for quantiles. The quantiles are the data values marking the boundaries between consecutive subsets. This module provides for determining the percentiles, deciles, quintiles, quartiles and the median.

Parameter Set

Parameters	Description
Column Index Of Item Values	Column on which to perform the Quantile analysis. Column numbers are zero indexed.
Quantile Type	Percentiles/Deciles/Quintiles/Quartiles/Median
Include Original Data	If “Yes” include the original dataset in the result table

Applications

Show distribution of test scores, compare income distribution of different groups, many more uses.

Data Sample

Input

Parameter

Result

Test Scores
52
67
68
77
78
81
83
84
84
85
87
88
88
92
94
97
99

Column Index Of Item Values =0
Quantile Type=Deciles
Include Original Data = Yes

Test Scores	DECILES	DECILES_POINT
52	1	67.0
67	2	77.0
68	3	81.0
77	4	83.0
78	5	84.0
81	6	87.0
83	7	88.0
84	8	92.0
84	9	97.0
85	null	null
87	null	null
88	null	null
88	null	null
92	null	null
94	null	null
97	null	null
99	null	null

Chart Sample
The following chart shows the distribution of test scores among the Decile (10%) marks.

chart

Mode

In a set of data, the mode is the most frequently observed data value. There may be no mode if no value appears more than any other. There may also be two modes (bimodal), three modes (trimodal) or more modes (multimodal). In case of grouped frequency distributions, the modal class is the class with the largest frequency.

Parameter Set

Parameters	Description
Column Index Of Item Values	Column on which to perform the Mode analysis. Column numbers are zero indexed.
Mode Type	Ungrouped/GroupedIf Mode Type is Ungrouped the result will be the most frequently observed data value/s. If Mode Type is Grouped, based on bin size, the result will be the class/s that has the most number of observed data value.
Bin Size	This is the width of each bin. Number of bins will be determined by dividing the difference between highest observed value and lowest observed value by bin size.
Include Original Data	If “Yes” include the original dataset in the result table

Applications

Track highest incidence of test score, age, product code, user ids, weight, error codes, zip codes, much more.

Data Sample

Input

Parameter

Result

UserId
4388
4003
2003
4002
7163
3872
4002
2187
4883
4892
7633
4002
7633
9290
3882
4004
2391
4859
3729
4992
5837
2099
3820
3872
4002

Column Index Of Item Values = 0
Mode Type = Ungrouped
Bin Size =
Include Original Data = No

ITEM	COUNT
4002.0	4

< previous page – next page >

– NetCharts by Visual Mining

Statistical Functions

Bin Data

Univariate Metrics

Histogram Distribution

Pareto Analysis

Normal Distribution

Quantile

Mode