API for stats - Incanter 2.0 (in development)

by David Edgar Liebke and Bradford Cross

Full namespace name: incanter.stats

Overview

This is the core statistical library for Incanter.
It provides probability functions (cdf, pdf, quantile),
random number generation, statistical tests, basic
modeling functions, similarity/association measures,
and more.

This library is built on Parallel Colt
(http://sites.google.com/site/piotrwendykier/software/parallelcolt),
an extension of the Colt numerics library
(http://acs.lbl.gov/~hoschek/colt/).

Public Variables and Functions



auto-correlation

function
Usage: (auto-correlation x lag)
       (auto-correlation x lag mean variance)
Returns the auto-correlation of x with given lag, mean, and variance.
If no mean or variance is provided, they are calculated from x.

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
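
Sketch:
  The following plain-Clojure sketch illustrates one common definition of the
  lag-k auto-correlation: the average product of deviations that are k steps
  apart, divided by the variance. It is an assumption about the definition, not
  Incanter's implementation (which delegates to Parallel Colt), and the name
  auto-correlation-sketch is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn auto-correlation-sketch
  "Average product of deviations `lag` steps apart, divided by `variance`.
  Assumes lag is smaller than the number of observations."
  [xs lag mean variance]
  (let [devs     (map #(- % mean) xs)
        ;; pair each deviation with the one `lag` positions later
        products (map * devs (drop lag devs))]
    (/ (/ (reduce + products) (count products)) variance)))

;; an alternating series is perfectly negatively correlated at lag 1
(auto-correlation-sketch [1 -1 1 -1 1 -1] 1 0 1)
```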

    
    
  


benford-test

function
Usage: (benford-test coll)
Performs Benford's Law test using chisq-test.

Argument:
  coll -- a sequence of numbers

Returns:
  :X-sq -- the Pearson X-squared test statistic
  :p-value -- the p-value for the test statistic
  :df -- the degrees of freedom

Reference:
http://data-sorcery.org/2009/06/21/chi-square-goodness-of-fit/
http://en.wikipedia.org/wiki/Benford%27s_Law
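
Sketch:
  Benford's Law gives the expected proportion of leading digit d as
  log10(1 + 1/d). The sketch below computes those proportions and tallies the
  observed leading digits; the two sequences are the kind of input a
  goodness-of-fit test compares. The helper names are hypothetical, and
  first-digit assumes positive integers only.

```clojure
;; Hypothetical helpers, not part of incanter.stats.
(defn benford-probs
  "Expected proportions of leading digits 1..9 under Benford's Law."
  []
  (map #(Math/log10 (+ 1.0 (/ 1.0 %))) (range 1 10)))

(defn first-digit
  "Leading decimal digit of a positive integer."
  [n]
  (- (int (first (str n))) (int \0)))

(defn first-digit-counts
  "Observed counts of leading digits 1..9 in coll."
  [coll]
  (let [freqs (frequencies (map first-digit coll))]
    (map #(get freqs % 0) (range 1 10))))

(first-digit-counts [10 19 25 300 1])
```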

    
    
  


bootstrap

function
Usage: (bootstrap data statistic & {:keys [size replacement smooth? smooth-sd], :or {replacement true, smooth? false, smooth-sd (/ (sqrt (count data)))}})
Returns a bootstrap sample of the given statistic on the given data.

Arguments:
  data -- vector of data to resample from
  statistic -- a function that returns a value given a vector of data

Options:
  :size -- the number of bootstrap samples to return
  :smooth? -- (default false) smoothing option
  :smooth-sd -- (default (/ (sqrt (count data)))) determines the standard
                deviation of the noise to use for smoothing
  :replacement -- (default true) determines if sampling of the data
                  should be done with replacement


References:
  1. Clifford E. Lunneborg, Data Analysis by Resampling Concepts and Applications, 2000, pages 105-117
  2. http://en.wikipedia.org/wiki/Bootstrapping_(statistics)


Examples:

  ;; example from Data Analysis by Resampling Concepts and Applications
  ;; Clifford E. Lunneborg (pages 119-122)

  (use '(incanter core stats charts))

  ;; weights (in grams) of 50 randomly sampled bags of pretzels
  (def weights [464 447 446 454 450 457 450 442
                433 452 449 454 450 438 448 449
                457 451 456 452 450 463 464 453
                452 447 439 449 468 443 433 460
                452 447 447 446 450 459 466 433
                445 453 454 446 464 450 456 456
                447 469])

  ;; calculate the sample median, 450
  (median weights)

  ;; generate bootstrap sample
  (def t* (bootstrap weights median :size 2000))

  ;; view histogram of the bootstrap sample
  (view (histogram t*))

  ;; calculate the mean of the bootstrap median ~ 450.644
  (mean t*)

  ;; calculate the standard error ~ 1.083
  (def se (sd t*))

  ;; 90% standard normal CI ~ (448.219 451.781)
  (plus (median weights) (mult (quantile-normal [0.05 0.95]) se))

  ;; 90% symmetric percentile CI ~ (449.0 452.5)
  (quantile t* :probs [0.05 0.95])


  ;; 90% non-symmetric percentile CI ~ (447.5 451.0)
  (minus (* 2 (median weights)) (quantile t* :probs [0.95 0.05]))

  ;; calculate bias
  (- (mean t*) (median weights)) ;; ~ 0.644

  ;; example with smoothing
  ;; Newcomb's speed of light data

  (use '(incanter core stats charts))

  ;; A numeric vector giving the Third Series of measurements of the
  ;; passage time of light recorded by Newcomb in 1882. The given
  ;; values divided by 1000 plus 24 give the time in millionths of a
  ;; second for light to traverse a known distance. The 'true' value is
  ;; now considered to be 33.02.

  (def speed-of-light [28 -44  29  30  24  28  37  32  36  27  26  28  29
                       26  27  22  23  20  25 25  36  23  31  32  24  27
                       33  16  24  29  36  21  28  26  27  27  32  25 28
                       24  40  21  31  32  28  26  30  27  26  24  32  29
                       34  -2  25  19  36 29  30  22  28  33  39  25  16  23])

  ;; view histogram of data to see outlier observations
  (view (histogram speed-of-light :nbins 30))

  (def samp (bootstrap speed-of-light median :size 10000))
  (view (histogram samp :density true :nbins 30))
  (mean samp)
  (quantile samp :probs [0.025 0.975])

  (def smooth-samp (bootstrap speed-of-light median :size 10000 :smooth? true))
  (view (histogram smooth-samp :density true :nbins 30))
  (mean smooth-samp)
  (quantile smooth-samp :probs [0.025 0.975])

    
    
  


category-col-summarizer

function
Usage: (category-col-summarizer col ds)
Returns a summarizer function which takes a category column and returns a list of the top 5 columns by volume, and a
count of remaining rows

    
    
  


cdf-beta

function
Usage: (cdf-beta x & {:keys [alpha beta lower-tail?], :or {alpha 1, beta 1, lower-tail? false}})
Returns the Beta cdf of the given value of x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's pbeta function.

Options:
  :alpha (default 1)
  :beta (default 1)
  :lower-tail? (default true)

See also:
    pdf-beta and sample-beta

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Beta.html
    http://en.wikipedia.org/wiki/Beta_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-beta 0.5 :alpha 1 :beta 2)
    (cdf-beta 0.5 :alpha 1 :beta 2 :lower-tail? false)

    
    
  


cdf-binomial

function
Usage: (cdf-binomial x & {:keys [size prob lower-tail?], :or {size 1, prob 1/2, lower-tail? true}})
Returns the Binomial cdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's pbinom.

Options:
  :size (default 1)
  :prob (default 1/2)
  :lower-tail? (default true)

See also:
    pdf-binomial and sample-binomial

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Binomial.html
    http://en.wikipedia.org/wiki/Binomial_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-binomial 10 :prob 1/4 :size 20)

    
    
  


cdf-chisq

function
Usage: (cdf-chisq x & {:keys [df lower-tail?], :or {df 1, lower-tail? true}})
Returns the Chi Square cdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's pchisq function.

Options:
  :df (default 1)
  :lower-tail? (default true)

See also:
    pdf-chisq and sample-chisq

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/ChiSquare.html
    http://en.wikipedia.org/wiki/Chi_square_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-chisq 5.0 :df 2)
    (cdf-chisq 5.0 :df 2 :lower-tail? false)

    
    
  


cdf-empirical

function
Usage: (cdf-empirical x)
Returns a step-function representing the empirical cdf of the given data.
Equivalent to R's ecdf function.

The following description is from the ecdf help in R: The e.c.d.f.
(empirical cumulative distribution function) Fn is a step function
with jumps i/n at observation values, where i is the number of tied
observations at that value.  Missing values are ignored.

For observations 'x'= (x1,x2, ... xn), Fn is the fraction of
observations less or equal to t, i.e.,

Fn(t) = #{x_i <= t} / n  =  1/n sum(i=1,n) Indicator(xi <= t).
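
Sketch:
  The formula above translates almost directly into plain Clojure. This is a
  minimal sketch of the definition, not Incanter's implementation; the name
  ecdf-sketch is hypothetical and missing values are not handled.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn ecdf-sketch
  "Returns a function Fn where (Fn t) is the fraction of xs that are <= t."
  [xs]
  (let [n (count xs)]
    (fn [t]
      (/ (count (filter #(<= % t) xs)) n))))

(def f (ecdf-sketch [1 2 2 3]))
(f 2) ;; 3/4, since three of the four observations are <= 2
```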


Examples:
  (use '(incanter core stats charts))

  (def exam1 [192 160 183 136 162 165 181 188 150 163 192 164 184
              189 183 181 188 191 190 184 171 177 125 192 149 188
              154 151 159 141 171 153 169 168 168 157 160 190 166 150])

  ;; the ecdf function returns an empirical cdf function for the given data
  (def ecdf (cdf-empirical exam1))

  ;; plot the data's empirical cdf
  (view (scatter-plot exam1 (map ecdf exam1)))

    
    
  


cdf-exp

function
Usage: (cdf-exp x & {:keys [rate], :or {rate 1}})
Returns the Exponential cdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's pexp.

Options:
  :rate (default 1)

See also:
    pdf-exp and sample-exp

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Exponential.html
    http://en.wikipedia.org/wiki/Exponential_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-exp 2.0 :rate 1/2)

    
    
  


cdf-f

function
Usage: (cdf-f x & {:keys [df1 df2 lower-tail?], :or {df1 1, df2 1, lower-tail? true}})
Returns the F-distribution cdf of the given value, x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's pf function.

Options:
  :df1 (default 1)
  :df2 (default 1)
  :lower-tail? (default true)

See also:
    pdf-f and quantile-f

References:
    http://en.wikipedia.org/wiki/F_distribution
    http://mathworld.wolfram.com/F-Distribution.html
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-f 1.0 :df1 5 :df2 2)

    
    
  


cdf-gamma

function
Usage: (cdf-gamma x & {:keys [shape scale rate lower-tail?], :or {shape 1, lower-tail? true}})
Returns the Gamma cdf for the given value of x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's pgamma function.

Options:
  :shape (k) (default 1)
  :scale (θ) (default 1 or 1/rate, if :rate is specified)
  :rate  (β) (default 1/scale, if :scale is specified)
  :lower-tail? (default true)

See also:
    pdf-gamma and sample-gamma

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Gamma.html
    http://en.wikipedia.org/wiki/Gamma_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-gamma 10 :shape 1 :scale 2)
    (cdf-gamma 3 :shape 1 :lower-tail? false)

    
    
  


cdf-neg-binomial

function
Usage: (cdf-neg-binomial x & {:keys [size prob lower-tail?], :or {size 10, prob 1/2, lower-tail? true}})
Returns the Negative Binomial cdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's pnbinom.

Options:
  :size (default 10)
  :prob (default 1/2)
  :lower-tail? (default true)

See also:
    pdf-neg-binomial and sample-neg-binomial

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/NegativeBinomial.html
    http://en.wikipedia.org/wiki/Negative_binomial_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-neg-binomial 10 :prob 1/2 :size 20)

    
    
  


cdf-normal

function
Usage: (cdf-normal x & {:keys [mean sd], :or {mean 0, sd 1}})
Returns the Normal cdf of the given value, x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's pnorm function.

Options:
  :mean (default 0)
  :sd (default 1)

See also:
    pdf-normal, quantile-normal, sample-normal

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Normal.html
    http://en.wikipedia.org/wiki/Normal_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-normal 1.96 :mean -2 :sd (sqrt 0.5))

    
    
  


cdf-poisson

function
Usage: (cdf-poisson x & {:keys [lambda lower-tail?], :or {lambda 1, lower-tail? true}})
Returns the Poisson cdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's ppois.

Options:
  :lambda (default 1)
  :lower-tail? (default true)

See also:
    pdf-poisson and sample-poisson

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Poisson.html
    http://en.wikipedia.org/wiki/Poisson_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-poisson 5 :lambda 10)

    
    
  


cdf-t

function
Usage: (cdf-t x & {:keys [df lower-tail?], :or {df 1, lower-tail? true}})
Returns the Student's t cdf for the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's pt function.

Options:
  :df (default 1)
  :lower-tail? (default true)

See also:
    pdf-t, quantile-t, and sample-t

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/StudentT.html
    http://en.wikipedia.org/wiki/Student-t_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-t 1.2 :df 10)

    
    
  


cdf-uniform

function
Usage: (cdf-uniform x & {:keys [min max], :or {min 0.0, max 1.0}})
Returns the Uniform cdf of the given value of x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's punif function.

Options:
  :min (default 0)
  :max (default 1)

See also:
    pdf-uniform and sample-uniform

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/DoubleUniform.html
    http://en.wikipedia.org/wiki/Uniform_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-uniform 5)
    (cdf-uniform 5 :min 1 :max 10)

    
    
  


cdf-weibull

function
Usage: (cdf-weibull x & options)
Returns the Weibull cdf for the given value of x. It will return a sequence
of values, if x is a sequence.

Options:
  :shape (default 1)
  :scale (default 1)

See also:
    pdf-weibull and sample-weibull

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Distributions.html
    http://en.wikipedia.org/wiki/Weibull_distribution
    http://en.wikipedia.org/wiki/Cumulative_distribution_function

Example:
    (cdf-weibull 10 :shape 1 :scale 0.2)

    
    
  


chebyshev-distance

function
Usage: (chebyshev-distance a b)
In the limiting case of Lp reaching infinity we obtain the Chebyshev distance.
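
Sketch:
  In that limit the distance reduces to the largest absolute coordinate
  difference. A minimal plain-Clojure sketch of that definition follows; the
  name chebyshev-sketch is hypothetical and this is not Incanter's code.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn chebyshev-sketch
  "Largest absolute difference between corresponding elements of a and b."
  [a b]
  (apply max
         (map (fn [x y]
                (let [d (- x y)]
                  (if (neg? d) (- d) d)))
              a b)))

(chebyshev-sketch [1 2 3] [4 0 3]) ;; 3
```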

    
    
  


chisq-test

function
Usage: (chisq-test & {:keys [x y correct table probs freq], :or {correct true}})
Performs chi-squared contingency table tests and goodness-of-fit tests.

If the optional argument :y is not provided then a goodness-of-fit test
is performed. In this case, the hypothesis tested is whether the
population probabilities equal those in :probs, or are all equal if
:probs is not given.

If :y is provided, it must be a sequence of integers that is the
same length as :x. A contingency table is computed from :x and :y.
Then, Pearson's chi-squared test of the null hypothesis that the joint
distribution of the cell counts in a 2-dimensional contingency
table is the product of the row and column marginals is performed.
By default, Yates' continuity correction for 2x2 contingency
tables is applied; this can be disabled by setting the :correct
option to false.


Options:
  :x -- a sequence of numbers.
  :y -- a sequence of numbers
  :table -- a contingency table. If one dimensional, the test is a goodness-of-fit
  :probs (default (repeat n-levels (/ n-levels)) when :y is nil) -- expected probabilities for the goodness-of-fit test
  :freq (default nil) -- if given, these are rescaled to probabilities
  :correct (default true) -- use Yates' correction for continuity for 2x2 contingency tables


Returns:
  :X-sq -- the Pearson X-squared test statistic
  :p-value -- the p-value for the test statistic
  :df -- the degrees of freedom


Examples:
  (use '(incanter core stats))
  (chisq-test :x [1 2 3 2 3 2 4 3 5]) ;; X-sq 2.6667
  ;; create a one-dimensional table of this data
  (def table (matrix [1 3 3 1 1]))
  (chisq-test :table table) ;; X-sq 2.6667
  (chisq-test :table (trans table)) ;; throws exception

  (chisq-test :x [1 0 0 0  1 1 1 0 0 1 0 0 1 1 1 1]) ;; 0.25

  (use '(incanter core stats datasets))
  (def math-prog (to-matrix (get-dataset :math-prog)))
  (def x (sel math-prog :cols 1))
  (def y (sel math-prog :cols 2))
  (chisq-test :x x :y y) ;; X-sq = 1.24145, df=1, p-value = 0.26519
  (chisq-test :x x :y y :correct false) ;; X-sq = 2.01094, df=1, p-value = 0.15617

  (def table (matrix [[31 12] [9 8]]))
  (chisq-test :table table) ;; X-sq = 1.24145, df=1, p-value = 0.26519
  (chisq-test :table table :correct false) ;; X-sq = 2.01094, df=1, p-value = 0.15617
  ;; use the detabulate function to create data rows corresponding to the table
  (def detab (detabulate :table table))
  (chisq-test :x (sel detab :cols 0) :y (sel detab :cols 1))

  ;; look at the hair-eye-color data
  ;; turn the count data for males into a contingency table
  (def male (matrix (sel (get-dataset :hair-eye-color) :cols 3 :rows (range 16)) 4))
  (chisq-test :table male) ;; X-sq = 41.280, df = 9, p-value = 4.44E-6
  ;; turn the count data for females into a contingency table
  (def female (matrix (sel (get-dataset :hair-eye-color) :cols 3 :rows (range 16 32)) 4))
  (chisq-test :table female) ;; X-sq = 106.664, df = 9, p-value = 7.014E-19


  ;; supply probabilities to goodness-of-fit test
  (def table [89 37 30 28 2])
  (def probs [0.40 0.20 0.20 0.19 0.01])
  (chisq-test :table table :probs probs) ;; X-sq = 5.7947, df = 4, p-value = 0.215

  ;; use frequencies instead of probabilities
  (def freq [40 20 20 15 5])
  (chisq-test :table table :freq freq) ;; X-sq = 9.9901, df = 4, p-value = 0.04059



References:
  http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
  http://en.wikipedia.org/wiki/Pearson's_chi-square_test
  http://en.wikipedia.org/wiki/Yates'_chi-square_test
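
Sketch:
  For the goodness-of-fit case, the Pearson statistic is the sum over cells of
  (observed - expected)^2 / expected. The plain-Clojure sketch below computes
  only that statistic (no p-value, no continuity correction) and is not
  Incanter's implementation; the name x-sq-sketch is hypothetical. It can be
  checked against the X-sq values quoted in the goodness-of-fit examples above.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn x-sq-sketch
  "Pearson X-squared statistic for a one-dimensional table of counts.
  With no probs, all cells are assumed equally likely."
  ([observed]
   (let [k (count observed)]
     (x-sq-sketch observed (repeat k (/ 1.0 k)))))
  ([observed probs]
   (let [n (reduce + observed)]
     (reduce +
             (map (fn [o p]
                    (let [e (* n p)]          ; expected count for this cell
                      (/ (* (- o e) (- o e)) e)))
                  observed probs)))))

(x-sq-sketch [1 3 3 1 1])                               ;; ~ 2.6667
(x-sq-sketch [89 37 30 28 2] [0.40 0.20 0.20 0.19 0.01]) ;; ~ 5.7947
```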

    
    
  


choose-singletype-col-summarizer

function
Usage: (choose-singletype-col-summarizer col-type)
Takes in a type, and returns a suitable column summarizer

    
    
  


concordant-pairs

function
Usage: (concordant-pairs a b)
http://en.wikipedia.org/wiki/Concordant_pairs

    
    
  


concordant?

function
Usage: (concordant? [[a1 b1] [a2 b2] & more])
Given two pairs of numbers, checks if they are concordant.
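
Sketch:
  Two pairs are usually called concordant when both coordinates move in the
  same direction. The sketch below encodes that definition and treats ties as
  not concordant, which is an assumption. Unlike the function documented here,
  it takes the two pairs as separate arguments; the name is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn concordant-sketch?
  "True when a and b differ in the same direction between the two pairs."
  [[a1 b1] [a2 b2]]
  (pos? (* (compare a1 a2) (compare b1 b2))))

(concordant-sketch? [1 2] [3 4]) ;; true
(concordant-sketch? [1 4] [3 2]) ;; false
```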

    
    
  


correlation

function
Usage: (correlation x y)
       (correlation mat)
Returns the sample correlation of x and y, or the correlation
matrix of the given matrix.

Examples:

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Correlation
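
Sketch:
  For two sequences, the sample (Pearson) correlation is the sum of products of
  deviations divided by the square root of the product of the sums of squared
  deviations. The plain-Clojure sketch below follows that textbook formula; it
  is not Incanter's implementation and the name is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn correlation-sketch
  "Pearson correlation of two equal-length sequences."
  [xs ys]
  (let [mean (fn [v] (/ (reduce + v) (double (count v))))
        mx   (mean xs)
        my   (mean ys)
        dx   (map #(- % mx) xs)
        dy   (map #(- % my) ys)
        sxy  (reduce + (map * dx dy))
        sxx  (reduce + (map * dx dx))
        syy  (reduce + (map * dy dy))]
    (/ sxy (Math/sqrt (double (* sxx syy))))))

(correlation-sketch [1 2 3 4] [2 4 6 8]) ;; perfectly linear, so 1.0
```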

    
    
  


correlation-linearity-test

function
Usage: (correlation-linearity-test a b)
http://en.wikipedia.org/wiki/Correlation_ratio
It is worth noting that if the relationship between values of x and values of
ybar_x (the mean of y for each value of x) is linear (which is certainly true
when there are only two possibilities for x), this will give the same result as
the square of the correlation coefficient; otherwise the correlation ratio will
be larger in magnitude. It can therefore be used for judging non-linear relationships.

    
    
  


correlation-ratio

function
Usage: (correlation-ratio & xs)
http://en.wikipedia.org/wiki/Correlation_ratio

In statistics, the correlation ratio is a measure of the relationship between
the statistical dispersion within individual categories and the
dispersion across the whole population or sample. i.e. the weighted variance
of the category means divided by the variance of all samples.

Example

Suppose there is a distribution of test scores in three topics (categories):

  * Algebra: 45, 70, 29, 15 and 21 (5 scores)
  * Geometry: 40, 20, 30 and 42 (4 scores)
  * Statistics: 65, 95, 80, 70, 85 and 73 (6 scores).

Then the subject averages are 36, 33 and 78, with an overall average of 52.

The sums of squares of the differences from the subject averages are 1952
for Algebra, 308 for Geometry and 600 for Statistics, adding to 2860,
while the overall sum of squares of the differences from the overall average
is 9640. The difference between these of 6780 is also the weighted sum of the
square of the differences between the subject averages and the overall average:

  5(36 − 52)^2 + 4(33 − 52)^2 + 6(78 − 52)^2 = 6780

This gives

  eta^2 = 6780/9640 = 0.7033

suggesting that most of the overall dispersion is a result of differences
between topics, rather than within topics. Taking the square root gives

  eta = sqrt(6780/9640) = 0.8386

Observe that for η = 1 the overall sample dispersion is purely due to dispersion
among the categories and not at all due to dispersion within the individual
categories. For a quick comprehension simply imagine all Algebra, Geometry,
and Statistics scores being the same respectively, e.g. 5 times 36, 4 times 33, 6 times 78.
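
Sketch:
  The worked example can be reproduced in plain Clojure. The sketch below
  computes eta^2 as the weighted sum of squared differences between category
  means and the overall mean, divided by the total sum of squares. It is an
  illustration of the arithmetic, not Incanter's implementation; the name is
  hypothetical, and whether correlation-ratio itself returns eta or eta^2 is
  not stated here.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn correlation-ratio-sq-sketch
  "eta^2: between-category sum of squares over total sum of squares."
  [& groups]
  (let [mean    (fn [xs] (/ (reduce + xs) (count xs)))
        sq      (fn [x] (* x x))
        all     (apply concat groups)
        grand   (mean all)
        between (reduce + (map #(* (count %) (sq (- (mean %) grand))) groups))
        total   (reduce + (map #(sq (- % grand)) all))]
    (/ between total)))

(def eta-sq
  (correlation-ratio-sq-sketch [45 70 29 15 21]       ; Algebra
                               [40 20 30 42]          ; Geometry
                               [65 95 80 70 85 73]))  ; Statistics

(double eta-sq)             ;; ~ 0.7033
(Math/sqrt (double eta-sq)) ;; ~ 0.8386
```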

    
    
  


cosine-similarity

function
Usage: (cosine-similarity a b)
http://en.wikipedia.org/wiki/Cosine_similarity
http://www.appliedsoftwaredesign.com/cosineSimilarityCalculator.php

The Cosine Similarity of two vectors a and b is the ratio: (a dot b) / (||a|| ||b||)

Let d1 = {2 4 3 1 6}
Let d2 = {3 5 1 2 5}

Cosine Similarity (d1, d2) = dot(d1, d2) / (||d1|| ||d2||)

dot(d1, d2) = (2)*(3) + (4)*(5) + (3)*(1) + (1)*(2) + (6)*(5) = 61

||d1|| = sqrt((2)^2 + (4)^2 + (3)^2 + (1)^2 + (6)^2) = 8.12403840464

||d2|| = sqrt((3)^2 + (5)^2 + (1)^2 + (2)^2 + (5)^2) = 8

Cosine Similarity (d1, d2) = 61 / ((8.12403840464) * (8))
                           = 61 / 64.9923072371
                           = 0.938572618717
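
Sketch:
  The same calculation in plain Clojure, as a minimal sketch of the ratio
  described above. The helper names are hypothetical and this is not
  Incanter's implementation.

```clojure
;; Hypothetical helpers, not part of incanter.stats.
(defn dot-sketch
  "Dot product of two equal-length sequences."
  [a b]
  (reduce + (map * a b)))

(defn norm-sketch
  "Euclidean norm of a sequence."
  [a]
  (Math/sqrt (double (dot-sketch a a))))

(defn cosine-similarity-sketch
  [a b]
  (/ (dot-sketch a b) (* (norm-sketch a) (norm-sketch b))))

(cosine-similarity-sketch [2 4 3 1 6] [3 5 1 2 5]) ;; ~ 0.9386
```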

    
    
  


covariance

function
Usage: (covariance x y)
       (covariance mat)
Returns the sample covariance of x and y, or the covariance
matrix of the given matrix.

Examples:
  ;; create some data that covaries
  (def x (sample-normal 100))
  (def err (sample-normal 100))
  (def y (plus (mult 5 x) err))
  ;; estimate the covariance of x and y
  (covariance x y)

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Covariance

    
    
  


cumulative-mean

function
Usage: (cumulative-mean coll)
Returns a sequence of cumulative means for the given collection. For instance,
the first value equals the first value of the argument, the second value is
the mean of the first two values, the third is the mean of the first three
values, etc.

Examples:
  (cumulative-mean (sample-normal 100))
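
  ;; A deterministic illustration of the definition in plain Clojure.
  ;; This is a sketch, not Incanter's implementation, and the name
  ;; cumulative-mean-sketch is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn cumulative-mean-sketch
  "Running sums divided by the number of values seen so far."
  [coll]
  (map / (reductions + coll) (iterate inc 1)))

(cumulative-mean-sketch [1 2 3 4]) ;; (1 3/2 2 5/2)
```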

    
    
  


detabulate

function
Usage: (detabulate & {:keys [table row-labels col-labels]})
Takes a contingency table of counts and returns a matrix of observations.

Examples:

  (use '(incanter core stats datasets))

  (def by-gender (group-on (get-dataset :hair-eye-color) 2))
  (def table (matrix (sel (first by-gender) :cols 3) 4))

  (detabulate :table table)
  (tabulate (detabulate :table table))

  ;; example 2
  (def data (matrix [[1 0]
                     [1 1]
                     [1 1]
                     [1 0]
                     [0 0]
                     [1 1]
                     [1 1]
                     [1 0]
                     [1 1]]))
  (tabulate data)

  (tabulate (detabulate :table (:table (tabulate data))))

    
    
  


dice-coefficient

function
Usage: (dice-coefficient a b)
http://en.wikipedia.org/wiki/Dice%27s_coefficient
Dice's coefficient (also known as the Dice coefficient)
is a similarity measure related to the Jaccard index.

    
    
  


dice-coefficient-str

function
Usage: (dice-coefficient-str a b)
http://en.wikipedia.org/wiki/Dice%27s_coefficient

When taken as a string similarity measure, the coefficient
may be calculated for two strings, x and y using bigrams.
Here nt is the number of character bigrams found in both strings,
nx is the number of bigrams in string x and
ny is the number of bigrams in string y.
For example, to calculate the similarity between:

  night
  nacht

We would find the set of bigrams in each word:

  {ni,ig,gh,ht}
  {na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Plugging this into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.
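
Sketch:
  The night/nacht calculation can be reproduced with sets of character
  bigrams. This plain-Clojure sketch follows the formula above, s = 2 nt / (nx + ny),
  using sets so repeated bigrams count once; that choice is an assumption.
  The helper names are hypothetical and this is not Incanter's implementation.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helpers, not part of incanter.stats.
(defn bigrams-sketch
  "Set of adjacent character pairs in s, each as a two-character string."
  [s]
  (set (map #(apply str %) (partition 2 1 s))))

(defn dice-coefficient-str-sketch
  [x y]
  (let [bx (bigrams-sketch x)
        by (bigrams-sketch y)]
    (/ (* 2 (count (set/intersection bx by)))
       (+ (count bx) (count by)))))

(dice-coefficient-str-sketch "night" "nacht") ;; 1/4
```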

    
    
  


discordant-pairs

function
Usage: (discordant-pairs a b)
http://en.wikipedia.org/wiki/Discordant_pairs

    
    
  


euclidean-distance

function
Usage: (euclidean-distance a b)
http://en.wikipedia.org/wiki/Euclidean_distance

The Euclidean distance or Euclidean metric is the ordinary distance
between two points that one would measure with a ruler, and is
given by the Pythagorean formula. By using this formula as distance,
Euclidean space (or even any inner product space) becomes a metric space.
The associated norm is called the Euclidean norm.
Older literature refers to the metric as Pythagorean metric.
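
Sketch:
  The Pythagorean formula in plain Clojure: the square root of the sum of
  squared coordinate differences. A minimal sketch, not Incanter's
  implementation; the name euclidean-sketch is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn euclidean-sketch
  "Square root of the sum of squared differences between a and b."
  [a b]
  (Math/sqrt
   (double
    (reduce + (map (fn [x y]
                     (let [d (- x y)]
                       (* d d)))
                   a b)))))

(euclidean-sketch [0 0] [3 4]) ;; 5.0
```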

    
    
  


f-test

function
Usage: (f-test x y)
Tests for different variances between two samples.

Arguments:
  x : 1st sample to test
  y : 2nd sample to test

Options:

References:
  http://en.wikipedia.org/wiki/F-test
  http://people.richland.edu/james/lecture/m170/ch13-f.html
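
Sketch:
  A two-sample variance test is usually based on the ratio of the two sample
  variances. The sketch below computes only that ratio in plain Clojure; it
  assumes this textbook statistic and does not reproduce what f-test returns.
  The helper names are hypothetical.

```clojure
;; Hypothetical helpers, not part of incanter.stats.
(defn variance-sketch
  "Sample variance (n - 1 denominator)."
  [xs]
  (let [n (count xs)
        m (/ (reduce + xs) n)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec n))))

(defn f-statistic-sketch
  "Ratio of the sample variance of x to the sample variance of y."
  [x y]
  (/ (variance-sketch x) (variance-sketch y)))

(f-statistic-sketch [1 2 3 4 5] [2 4 6 8 10]) ;; 1/4
```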

    
    
  


gamma-coefficient

function
Usage: (gamma-coefficient a b)
http://www.statsdirect.com/help/nonparametric_methods/kend.htm
The gamma coefficient is given as a measure of association that
is highly resistant to tied data (Goodman and Kruskal, 1963)
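
Sketch:
  Goodman and Kruskal's gamma is (C - D) / (C + D), where C and D are the
  numbers of concordant and discordant pairs; tied pairs are ignored. The
  plain-Clojure sketch below follows that definition and is not Incanter's
  implementation. The helper names are hypothetical, and the result is
  undefined (division by zero) when every pair is tied.

```clojure
;; Hypothetical helpers, not part of incanter.stats.
(defn pair-signs-sketch
  "For every pair of observations i < j: positive if concordant,
  negative if discordant, zero if tied in a or b."
  [a b]
  (let [points (vec (map vector a b))
        n      (count points)]
    (for [i (range n)
          j (range (inc i) n)
          :let [[a1 b1] (nth points i)
                [a2 b2] (nth points j)]]
      (* (compare a1 a2) (compare b1 b2)))))

(defn gamma-coefficient-sketch
  [a b]
  (let [signs (pair-signs-sketch a b)
        c     (count (filter pos? signs))
        d     (count (filter neg? signs))]
    (/ (- c d) (+ c d))))

(gamma-coefficient-sketch [1 2 3] [1 3 2]) ;; 1/3
```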

    
    
  


hamming-distance

function
Usage: (hamming-distance a b)
http://en.wikipedia.org/wiki/Hamming_distance

In information theory, the Hamming distance between two strings of equal
length is the number of positions at which the corresponding symbols are different.
Put another way, it measures the minimum number of
substitutions required to change one string into the other,
or the number of errors that transformed one string into the other.
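
Sketch:
  Counting mismatched positions is a one-liner in plain Clojure. This sketch
  assumes both inputs have the same length (extra elements of the longer input
  would be ignored); the name hamming-sketch is hypothetical and this is not
  Incanter's implementation.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn hamming-sketch
  "Number of positions at which a and b differ."
  [a b]
  (count (filter true? (map not= a b))))

(hamming-sketch "karolin" "kathrin") ;; 3
```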

    
    
  


indicator

function
Usage: (indicator pred coll)
Returns a sequence of ones and zeros, where a one
is returned when the given predicate is true for the
corresponding element in the given collection, and
a zero otherwise.

Examples:
  (use 'incanter.stats)

  (indicator #(neg? %) (sample-normal 10))

  ;; return the sum of the positive values in a normal sample
  (def x (sample-normal 100))
  (sum (mult x (indicator #(pos? %) x)))

    
    
  


jaccard-distance

function
Usage: (jaccard-distance a b)
http://en.wikipedia.org/wiki/Jaccard_index
The Jaccard distance, which measures dissimilarity between sample sets,
is complementary to the Jaccard coefficient and is obtained by subtracting
the Jaccard coefficient from 1, or, equivalently, by dividing the difference
of the sizes of the union and the intersection of two sets by the size of the union.

    
    
  


jaccard-index

function
Usage: (jaccard-index a b)
http://en.wikipedia.org/wiki/Jaccard_index

The Jaccard index, also known as the Jaccard similarity coefficient
(originally coined coefficient de communauté by Paul Jaccard), is a
statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient measures similarity between sample sets,
and is defined as the size of the intersection divided by the
size of the union of the sample sets.
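
Sketch:
  Both the Jaccard index and the Jaccard distance follow directly from set
  operations. A minimal plain-Clojure sketch of the definitions, not Incanter's
  implementation; the names are hypothetical and both inputs are assumed to be
  non-empty.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helpers, not part of incanter.stats.
(defn jaccard-index-sketch
  "Size of the intersection divided by the size of the union."
  [a b]
  (let [a (set a)
        b (set b)]
    (/ (count (set/intersection a b))
       (count (set/union a b)))))

(defn jaccard-distance-sketch
  "One minus the Jaccard index."
  [a b]
  (- 1 (jaccard-index-sketch a b)))

(jaccard-index-sketch [1 2 3] [2 3 4])    ;; 1/2
(jaccard-distance-sketch [1 2 3] [2 3 4]) ;; 1/2
```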

    
    
  


kendalls-tau

function
Usage: (kendalls-tau a b)
http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
http://www.statsdirect.com/help/nonparametric_methods/kend.htm
http://mail.scipy.org/pipermail/scipy-dev/2009-March/011589.html
The best explanation and example is in "Cluster Analysis for Researchers", page 165.
http://www.amazon.com/Cluster-Analysis-Researchers-Charles-Romesburg/dp/1411606175

    
    
  


kendalls-w

function
Usage: (kendalls-w)
http://en.wikipedia.org/wiki/Kendall%27s_W
http://faculty.chass.ncsu.edu/garson/PA765/friedman.htm

Suppose that object i is given the rank r_i,j by judge number j, where there
are in total n objects and m judges. Then the total rank given to object i is

  R_i = sum(j=1,m) r_i,j

and the mean value of these total ranks is

  Rbar = 1/2 m (n + 1)

The sum of squared deviations, S, is defined as

  S = sum(i=1,n) (R_i - Rbar)^2

and then Kendall's W is defined as[1]

  W = 12 S / (m^2 (n^3 - n))

If the test statistic W is 1, then all the survey respondents have been
unanimous, and each respondent has assigned the same order to the list
of concerns. If W is 0, then there is no overall trend of agreement among
the respondents, and their responses may be regarded as essentially random.
Intermediate values of W indicate a greater or lesser degree of unanimity
among the various responses.

Legendre[2] discusses a variant of the W statistic which accommodates ties
in the rankings and also describes methods of making significance tests based on W.

[{:observation [1 2 3]} {} ... {}] -> W
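
Sketch:
  The definition of W can be checked with a few lines of plain Clojure. The
  sketch below takes one vector of ranks per judge (an assumed input format
  that differs from the one hinted at above), assumes no ties, and is not
  Incanter's implementation; the name kendalls-w-sketch is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn kendalls-w-sketch
  "rankings is a sequence of m vectors, each holding one judge's ranks
  of the same n objects."
  [rankings]
  (let [m      (count rankings)
        n      (count (first rankings))
        totals (apply map + rankings)          ; R_i, total rank per object
        r-bar  (/ (* m (inc n)) 2)             ; mean of the total ranks
        s      (reduce + (map #(let [d (- % r-bar)] (* d d)) totals))]
    (/ (* 12 s)
       (* m m (- (* n n n) n)))))

(kendalls-w-sketch [[1 2 3] [1 2 3] [1 2 3]]) ;; 1, complete agreement
(kendalls-w-sketch [[1 2 3] [3 2 1]])         ;; 0, no agreement
```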

    
    
  


kurtosis

function
Usage: (kurtosis x)
Returns the kurtosis of the data, x. "Kurtosis is a measure of the 'peakedness'
of the probability distribution of a real-valued random variable. Higher kurtosis
means more of the variance is due to infrequent extreme deviations, as opposed to
frequent modestly-sized deviations." (Wikipedia)

Examples:

  (kurtosis (sample-normal 100000)) ;; approximately 0
  (kurtosis (sample-gamma 100000)) ;; approximately 6

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Kurtosis

    
    
  


lee-distance

function
Usage: (lee-distance a b q)
http://en.wikipedia.org/wiki/Lee_distance

In coding theory, the Lee distance is a distance between
two strings x1x2...xn and y1y2...yn of equal length n
over the q-ary alphabet {0,1,…,q-1} of size q >= 2. It is a metric.

If q = 2 or q = 3 the Lee distance coincides with the Hamming distance.

The metric space induced by the Lee distance is a discrete analog of the elliptic space.
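
Sketch:
  The Lee distance sums, position by position, the shorter way around a
  circle of q symbols: min(|x - y|, q - |x - y|). The plain-Clojure sketch
  below follows that standard definition and takes sequences of integers in
  0..q-1; it is not Incanter's implementation and the name is hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn lee-distance-sketch
  "Sum over positions of min(|x - y|, q - |x - y|)."
  [a b q]
  (reduce +
          (map (fn [x y]
                 (let [d (if (> x y) (- x y) (- y x))]
                   (min d (- q d))))
               a b)))

(lee-distance-sketch [3 1 4 0] [2 5 4 3] 6) ;; 1 + 2 + 0 + 3 = 6
```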

    
    
  


levenshtein-distance

function
Usage: (levenshtein-distance a b)
http://en.wikipedia.org/wiki/Levenshtein_distance

The internal representation is a table d with m+1 rows and n+1 columns,
where m is the length of a and n is the length of b.

In information theory and computer science, the Levenshtein distance
is a metric for measuring the amount of difference between two sequences
(i.e., the so called edit distance).
The Levenshtein distance between two strings is given by the minimum number
of operations needed to transform one string into the other,
where an operation is an insertion, deletion, or substitution of a single character.

For example, the Levenshtein distance between "kitten" and "sitting" is 3,
since the following three edits change one into the other,
and there is no way to do it with fewer than three edits:

 1. kitten → sitten (substitution of 's' for 'k')
 2. sitten → sittin (substitution of 'i' for 'e')
 3. sittin → sitting (insert 'g' at the end).

The Levenshtein distance has several simple upper and lower bounds that are useful
in applications which compute many of them and compare them. These include:

  * It is always at least the difference of the sizes of the two strings.
  * It is at most the length of the longer string.
  * It is zero if and only if the strings are identical.
  * If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
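
Sketch:
  The table-based computation can be sketched in plain Clojure by building one
  row of the table at a time and keeping only the previous row. This is the
  standard dynamic-programming recurrence, written as an illustration rather
  than as Incanter's implementation; the name levenshtein-sketch is
  hypothetical.

```clojure
;; Hypothetical helper, not part of incanter.stats.
(defn levenshtein-sketch
  "Minimum number of insertions, deletions and substitutions turning a into b."
  [a b]
  (let [b         (vec b)
        first-row (vec (range (inc (count b))))]
    (peek
     (reduce
      (fn [prev-row [i ca]]
        ;; build the row for the first (i + 1) characters of a
        (reduce
         (fn [row [j cb]]
           (conj row
                 (min (inc (peek row))                           ; insertion
                      (inc (nth prev-row (inc j)))               ; deletion
                      (+ (nth prev-row j) (if (= ca cb) 0 1))))) ; substitution
         [(inc i)]
         (map-indexed vector b)))
      first-row
      (map-indexed vector a)))))

(levenshtein-sketch "kitten" "sitting") ;; 3
```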

    
    
  


linear-model

function
Usage: (linear-model y x & {:keys [intercept], :or {intercept true}})
Returns the results of performing an OLS linear regression of y on x.

Arguments:
  y is a vector (or sequence) of values for the dependent variable
  x is a vector or matrix of values for the independent variables

Options:
  :intercept (default true) indicates whether an intercept term should be included

Returns:
  a map, of type ::linear-model, containing:
    :design-matrix -- a matrix containing the independent variables, and an intercept column
    :coefs -- the regression coefficients
    :t-tests -- t-test values of coefficients
    :t-probs -- p-values for t-test values of coefficients
    :coefs-ci -- 95% percentile confidence interval
    :fitted -- the predicted values of y
    :residuals -- the residuals of each observation
    :std-errors -- the standard errors of the coefficients
    :sse -- the sum of squared errors, also called the residual sum of squares
    :ssr -- the regression sum of squares, also called the explained sum of squares
    :sst -- the total sum of squares (proportional to the sample variance)
    :r-square -- coefficient of determination

Examples:
  (use '(incanter core stats datasets charts))
  (def iris (to-matrix (get-dataset :iris) :dummies true))
  (def y (sel iris :cols 0))
  (def x (sel iris :cols (range 1 6)))
  (def iris-lm (linear-model y x)) ; with intercept term

  (keys iris-lm) ; see what fields are included
  (:coefs iris-lm)
  (:sse iris-lm)
  (quantile (:residuals iris-lm))
  (:r-square iris-lm)
  (:adj-r-square iris-lm)
  (:f-stat iris-lm)
  (:f-prob iris-lm)
  (:df iris-lm)

  (def x1 (range 0.0 3 0.1))
  (view (xy-plot x1 (cdf-f x1 :df1 4 :df2 144)))


References:
  http://en.wikipedia.org/wiki/OLS_Regression
  http://en.wikipedia.org/wiki/Coefficient_of_determination
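
To make the relationship between :sse, :ssr, :sst and :r-square concrete, here is a minimal sketch of a one-predictor least-squares fit in plain Clojure. It does not use Incanter and is not the library's implementation; simple-ols-sketch is a hypothetical name, and the fit is restricted to a single predictor with an intercept.

```clojure
(defn simple-ols-sketch
  "Least-squares fit of y = b0 + b1*x for a single predictor.
  Hypothetical helper for illustration only."
  [xs ys]
  (let [n      (count xs)
        mean   (fn [v] (/ (reduce + v) n))
        mx     (mean xs)
        my     (mean ys)
        sxy    (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx    (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        b1     (/ sxy sxx)                       ; slope
        b0     (- my (* b1 mx))                  ; intercept
        fitted (map (fn [x] (+ b0 (* b1 x))) xs)
        resid  (map - ys fitted)
        sse    (reduce + (map (fn [r] (* r r)) resid))          ; residual sum of squares
        sst    (reduce + (map (fn [y] (* (- y my) (- y my))) ys)) ; total sum of squares
        ssr    (- sst sse)]                      ; explained sum of squares
    {:coefs [b0 b1] :fitted fitted :residuals resid
     :sse sse :ssr ssr :sst sst :r-square (/ ssr sst)}))

(:r-square (simple-ols-sketch [0 1 2] [0 1 1])) ; exact ratio arithmetic gives 3/4
```

With an intercept in the model, sst = ssr + sse, so r-square is the share of the total variation that the fitted line explains.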

    
    
    Source
  


mahalanobis-distance

function
Usage: (mahalanobis-distance x & {:keys [y W centroid]})
Returns the Mahalanobis distance between x, which is
 either a vector or matrix of row vectors, and the
 centroid of the observations in the matrix :y.

Arguments:
  x -- either a vector or a matrix of row vectors

Options:
  :y -- Defaults to x, must be a matrix of row vectors which will be used to calculate a centroid
  :W -- Defaults to (solve (covariance y)). If an identity matrix is provided, the
        Mahalanobis distance reduces to the Euclidean distance.
  :centroid -- Defaults to (map mean (trans y))


References:
  http://en.wikipedia.org/wiki/Mahalanobis_distance


Examples:

  (use '(incanter core stats charts))

  ;; generate some multivariate normal data with a single outlier.
  (def data (bind-rows
              (bind-columns
                (sample-mvn 100
                            :sigma (matrix [[1 0.9]
                                            [0.9 1]])))
              [-1.75 1.75]))

  ;; view a scatter plot of the data
  (let [[x y] (trans data)]
    (doto (scatter-plot x y)
      (add-points [(mean x)] [(mean y)])
      (add-pointer -1.75 1.75 :text "Outlier")
      (add-pointer (mean x) (mean y) :text "Centroid")
      view))

  ;; calculate the distances of each point from the centroid.
  (def dists (map first (mahalanobis-distance data)))
  ;; view a bar-chart of the distances
  (view (bar-chart (range 101) dists))

  ;; Now contrast with the Euclidean distance.
  (def dists (map first (mahalanobis-distance data :W (matrix [[1 0] [0 1]]))))
  ;; view a bar-chart of the distances
  (view (bar-chart (range 101) dists))


  ;; another example
  (mahalanobis-distance [-1.75 1.75] :y data)
  (mahalanobis-distance [-1.75 1.75]
                    :y data
                    :W (matrix [[1 0]
                                [0 1]]))
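
The role of the :W option can be illustrated with a small sketch in plain Clojure. It assumes the weight matrix is passed as a vector of row vectors and the centroid is supplied explicitly; mahalanobis-sketch is a hypothetical name and this is not the library's implementation.

```clojure
(defn mahalanobis-sketch
  "Distance of point x from centroid c under weight matrix w, where w
  plays the role of the inverse covariance matrix.
  Hypothetical helper for illustration only."
  [x c w]
  (let [d  (mapv - x c)                                   ; x - centroid
        wd (mapv (fn [row] (reduce + (map * row d))) w)]  ; W (x - c)
    (Math/sqrt (double (reduce + (map * d wd))))))        ; sqrt(d' W d)

;; with the identity matrix this is the Euclidean distance
(mahalanobis-sketch [3 4] [0 0] [[1 0] [0 1]])
```

A non-identity W stretches or shrinks each direction, so points that lie along a direction of high variance count as closer to the centroid than their Euclidean distance suggests.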

    
    
    Source
  


manhattan-distance

function
Usage: (manhattan-distance a b)
http://en.wikipedia.org/wiki/Manhattan_distance

In taxicab geometry, the usual metric of Euclidean geometry is replaced by a metric in which
the distance between two points is the sum of the (absolute) differences
of their coordinates. The taxicab metric is also known as rectilinear distance,
L1 distance or l1 norm (see Lp space), city block distance,
Manhattan distance, or Manhattan length.
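
As a minimal sketch of this definition in plain Clojure (manhattan-sketch is a hypothetical name, not the library function), assuming a and b are equal-length sequences of numbers:

```clojure
(defn manhattan-sketch
  "Sum of the absolute coordinate differences of a and b.
  Hypothetical helper for illustration only."
  [a b]
  (let [abs-diff (fn [x y] (let [d (- x y)] (if (neg? d) (- d) d)))]
    (reduce + (map abs-diff a b))))

(manhattan-sketch [1 2 3] [4 0 3]) ; |1-4| + |2-0| + |3-3| = 5
```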

    
    
    Source
  


mean

function
Usage: (mean x)
Returns the mean of the data, x.

Examples:
  (mean (sample-normal 100))

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Mean

    
    
    Source
  


median

function
Usage: (median x)
Returns the median of the data, x.

Examples:
  (median (sample-normal 100))

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Median

    
    
    Source
  


minkowski-distance

function
Usage: (minkowski-distance a b p)
http://en.wikipedia.org/wiki/Minkowski_distance
http://en.wikipedia.org/wiki/Lp_space

The Minkowski distance is a metric on Euclidean space which can be considered
as a generalization of both the Euclidean distance and the Manhattan distance.

Minkowski distance is typically used with p being 1 or 2. The latter is the
Euclidean distance, while the former is sometimes known as the Manhattan distance.

In the limiting case of p approaching infinity, we obtain the Chebyshev distance.
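
The three cases above (p = 1, p = 2, and large p) can be tried with a short sketch in plain Clojure. The names minkowski-sketch and chebyshev-sketch are hypothetical, and the code only illustrates the formula rather than reproducing the library's implementation.

```clojure
(defn minkowski-sketch
  "(sum |a_i - b_i|^p)^(1/p) for equal-length numeric sequences a and b.
  Hypothetical helper for illustration only."
  [a b p]
  (let [p     (double p)
        total (reduce + (map (fn [x y] (Math/pow (Math/abs (double (- x y))) p))
                             a b))]
    (Math/pow (double total) (/ 1.0 p))))

(defn chebyshev-sketch
  "Largest absolute coordinate difference: the limit of the Minkowski
  distance as p grows."
  [a b]
  (apply max (map (fn [x y] (Math/abs (double (- x y)))) a b)))

(minkowski-sketch [0 0] [3 4] 1)   ; Manhattan distance
(minkowski-sketch [0 0] [3 4] 2)   ; Euclidean distance
(minkowski-sketch [0 0] [3 4] 50)  ; already close to (chebyshev-sketch [0 0] [3 4])
```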

    
    
    Source
  


n-grams

function
Usage: (n-grams n s)
Returns a set of the unique n-grams in a string.
Because the result is a set, duplicate n-grams are discarded.
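
A minimal sketch of the idea in plain Clojure, assuming character-level n-grams built with a sliding window (char-n-grams-sketch is a hypothetical name, not the library function):

```clojure
(defn char-n-grams-sketch
  "Set of the distinct length-n substrings of s.
  Hypothetical helper for illustration only."
  [n s]
  (set (map #(apply str %) (partition n 1 s))))

(char-n-grams-sketch 2 "banana") ; "an" and "na" each occur twice but appear once
```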

    
    
    Source
  


normalized-kendall-tau-distance

function
Usage: (normalized-kendall-tau-distance a b)
http://en.wikipedia.org/wiki/Kendall_tau_distance
Kendall tau distance is the total number of discordant pairs.
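
The following sketch counts discordant pairs in plain Clojure and then normalizes the count. Dividing by the number of pairs, n(n-1)/2, is the usual convention and is an assumption here, as are the names kendall-tau-sketch and normalized-kendall-tau-sketch; a and b are assumed to be equal-length rankings.

```clojure
(defn kendall-tau-sketch
  "Number of index pairs (i, j) that a and b order differently.
  Hypothetical helper for illustration only."
  [a b]
  (let [a (vec a)
        b (vec b)
        n (count a)]
    (count
      (for [i (range n)
            j (range (inc i) n)
            ;; a pair is discordant when the two orderings disagree in sign
            :when (neg? (* (compare (a i) (a j))
                           (compare (b i) (b j))))]
        [i j]))))

(defn normalized-kendall-tau-sketch
  "Discordant pairs divided by the total number of pairs, n(n-1)/2."
  [a b]
  (let [n (count a)]
    (/ (kendall-tau-sketch a b) (/ (* n (dec n)) 2))))

(normalized-kendall-tau-sketch [1 2 3 4 5] [3 4 1 2 5]) ; 4 of 10 pairs disagree
```

Under this normalization the result is 0 for identical rankings and 1 when one ranking is the reverse of the other.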

    
    
    Source
  


numeric-col-summarizer

function
Usage: (numeric-col-summarizer col ds)
Returns a summarizer function which takes a purely numeric column with no non-numeric values.

    
    
    Source
  


odds-ratio

function
Usage: (odds-ratio p1 p2)
http://en.wikipedia.org/wiki/Odds_ratio

Definition in terms of group-wise odds

The odds ratio is the ratio of the odds of an event occurring in one group
to the odds of it occurring in another group, or to a sample-based estimate of that ratio.


Suppose that in a sample of 100 men, 90 have drunk wine in the previous week,
while in a sample of 100 women only 20 have drunk wine in the same period.
The odds of a man drinking wine are 90 to 10, or 9:1,
while the odds of a woman drinking wine are only 20 to 80, or 1:4 = 0.25:1.
The odds ratio is thus 9/0.25, or 36, showing that men are much more likely
to drink wine than women.

Relation to statistical independence

If X and Y are independent, their joint probabilities can be expressed in
terms of their marginal probabilities. In this case, the odds ratio equals one,
and conversely the odds ratio can only equal one if the joint probabilities
can be factored in this way. Thus the odds ratio equals one if and only if
X and Y are independent.
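
The wine example can be reproduced with a few lines of plain Clojure. This sketch assumes p1 and p2 are the probabilities of the event in each group; the names odds-sketch and odds-ratio-sketch are hypothetical.

```clojure
(defn odds-sketch
  "Odds corresponding to probability p."
  [p]
  (/ p (- 1 p)))

(defn odds-ratio-sketch
  "Ratio of the odds in group 1 to the odds in group 2.
  Hypothetical helper for illustration only."
  [p1 p2]
  (/ (odds-sketch p1) (odds-sketch p2)))

;; 90 of 100 men and 20 of 100 women drank wine
(odds-ratio-sketch 9/10 2/10) ; 9 divided by 1/4, i.e. 36
```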

    
    
    Source
  


pairings

function
Usage: (pairings a b)
Creates pairs by matching a1 with b1, a2 with b2, etc. and returns
all pairs of those pairs without matching a pair with itself.

    
    
    Source
  


pairs

function
Usage: (pairs a b)
Returns unique pairs of a and b where members of a and b cannot
be paired with the corresponding slot in the other list.

    
    
    Source
  


pdf-beta

function
Usage: (pdf-beta x & {:keys [alpha beta], :or {alpha 1, beta 1}})
Returns the Beta pdf of the given value of x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's dbeta function.

Options:
  :alpha (default 1)
  :beta (default 1)

See also:
    cdf-beta and sample-beta

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Beta.html
    http://en.wikipedia.org/wiki/Beta_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-beta 0.5 :alpha 1 :beta 2)

    
    
    Source
  


pdf-binomial

function
Usage: (pdf-binomial x & {:keys [size prob], :or {size 1, prob 1/2}})
Returns the Binomial pdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's dbinom.

Options:
  :size (default 1)
  :prob (default 1/2)

See also:
    cdf-binomial and sample-binomial

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Binomial.html
    http://en.wikipedia.org/wiki/Binomial_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-binomial 10 :prob 1/4 :size 20)
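
For reference, the binomial probability mass function can be written out directly. The sketch below uses plain Clojure with exact ratio arithmetic and assumes, as in R's dbinom, that :size is the number of trials and :prob the success probability; choose-sketch and binomial-pmf-sketch are hypothetical names.

```clojure
(defn choose-sketch
  "Binomial coefficient C(n, k), built up one factor at a time."
  [n k]
  (reduce (fn [acc i] (/ (* acc (- n i)) (inc i))) 1 (range k)))

(defn binomial-pmf-sketch
  "P(X = k) for n trials with success probability p:
  C(n, k) * p^k * (1 - p)^(n - k).
  Hypothetical helper for illustration only."
  [k n p]
  (* (choose-sketch n k)
     (reduce * (repeat k p))
     (reduce * (repeat (- n k) (- 1 p)))))

(binomial-pmf-sketch 1 2 1/2) ; exactly one success in two fair trials: 1/2
```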

    
    
    Source
  


pdf-chisq

function
Usage: (pdf-chisq x & {:keys [df], :or {df 1}})
Returns the Chi Square pdf of the given value of x.  It will return a sequence
of values, if x is a sequence. Equivalent to R's dchisq function.

Options:
  :df (default 1)

See also:
    cdf-chisq and sample-chisq

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/ChiSquare.html
    http://en.wikipedia.org/wiki/Chi_square_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-chisq 5.0 :df 2)

    
    
    Source
  


pdf-exp

function
Usage: (pdf-exp x & {:keys [rate], :or {rate 1}})
Returns the Exponential pdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's dexp.

Options:
  :rate (default 1)

See also:
    cdf-exp and sample-exp

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Exponential.html
    http://en.wikipedia.org/wiki/Exponential_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-exp 2.0 :rate 1/2)
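
The exponential density is rate * e^(-rate * x) for x >= 0 and zero otherwise. A minimal sketch in plain Clojure (exp-pdf-sketch is a hypothetical name, not the library function):

```clojure
(defn exp-pdf-sketch
  "Exponential density with the given rate, evaluated at x.
  Hypothetical helper for illustration only."
  [x rate]
  (if (neg? x)
    0.0
    (* rate (Math/exp (double (- (* rate x)))))))

(exp-pdf-sketch 2.0 1/2) ; 0.5 * e^-1
```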

    
    
    Source
  


pdf-f

function
Usage: (pdf-f x & {:keys [df1 df2], :or {df1 1, df2 1}})
Returns the F pdf of the given value, x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's df function.

Options:
  :df1 (default 1)
  :df2 (default 1)

See also:
    cdf-f and quantile-f

References:
    http://en.wikipedia.org/wiki/F_distribution
    http://mathworld.wolfram.com/F-Distribution.html
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-f 1.0 :df1 5 :df2 2)

    
    
    Source
  


pdf-gamma

function
Usage: (pdf-gamma x & {:keys [shape scale rate], :or {shape 1}})
Returns the Gamma pdf for the given value of x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's dgamma function.

Options:
  :shape (k) (default 1)
  :scale (θ) (default 1 or 1/rate, if :rate is specified)
  :rate  (β) (default 1/scale, if :scale is specified)

See also:
    cdf-gamma and sample-gamma

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Gamma.html
    http://en.wikipedia.org/wiki/Gamma_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-gamma 10 :shape 1 :scale 2)

    
    
    Source
  


pdf-neg-binomial

function
Usage: (pdf-neg-binomial x & {:keys [size prob], :or {size 10, prob 1/2}})
Returns the Negative Binomial pdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's dnbinom.

Options:
  :size (default 10)
  :prob (default 1/2)

See also:
    cdf-neg-binomial and sample-neg-binomial

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/NegativeBinomial.html
    http://en.wikipedia.org/wiki/Negative_binomial_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-neg-binomial 10 :prob 1/2 :size 20)

    
    
    Source
  


pdf-normal

function
Usage: (pdf-normal x & {:keys [mean sd], :or {mean 0, sd 1}})
Returns the Normal pdf of the given value, x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's dnorm function.

Options:
  :mean (default 0)
  :sd (default 1)

See also:
    cdf-normal, quantile-normal, sample-normal

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Normal.html
    http://en.wikipedia.org/wiki/Normal_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-normal 1.96 :mean -2 :sd (sqrt 0.5))
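
The value returned corresponds to the Normal density formula, which can be written out in plain Clojure as a sketch (normal-pdf-sketch is a hypothetical name; the library delegates to Parallel Colt rather than using this code):

```clojure
(defn normal-pdf-sketch
  "Normal density exp(-z^2 / 2) / (sd * sqrt(2 * pi)), with z = (x - mean) / sd.
  Hypothetical helper for illustration only."
  [x mean sd]
  (let [z (/ (- x mean) sd)]
    (/ (Math/exp (double (* -0.5 z z)))
       (* sd (Math/sqrt (* 2 Math/PI))))))

(normal-pdf-sketch 0 0 1) ; peak of the standard normal, 1 / sqrt(2 * pi)
```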

    
    
    Source
  


pdf-poisson

function
Usage: (pdf-poisson x & {:keys [lambda], :or {lambda 1}})
Returns the Poisson pdf of the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's dpois.

Options:
  :lambda (default 1)

See also:
    cdf-poisson and sample-poisson

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Poisson.html
    http://en.wikipedia.org/wiki/Poisson_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-poisson 5 :lambda 10)
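
The Poisson probability mass function is e^(-lambda) * lambda^k / k!. A minimal sketch in plain Clojure, using doubles throughout (poisson-pmf-sketch is a hypothetical name, not the library function):

```clojure
(defn poisson-pmf-sketch
  "P(X = k) for a Poisson distribution with mean lambda.
  Hypothetical helper for illustration only."
  [k lambda]
  (/ (* (Math/exp (double (- lambda)))
        (Math/pow (double lambda) (double k)))
     (reduce * 1.0 (range 1 (inc k)))))   ; k! as a double

(poisson-pmf-sketch 5 10)
```

Summing the sketch over k = 0, 1, 2, ... approaches 1, which is a quick sanity check on the formula.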

    
    
    Source
  


pdf-t

function
Usage: (pdf-t x & {:keys [df], :or {df 1}})
Returns the Student's t pdf for the given value of x. It will return a sequence
of values, if x is a sequence. Equivalent to R's dt function.

Options:
  :df (default 1)

See also:
    cdf-t, quantile-t, and sample-t

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/StudentT.html
    http://en.wikipedia.org/wiki/Student-t_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-t 1.2 :df 10)

    
    
    Source
  


pdf-uniform

function
Usage: (pdf-uniform x & {:keys [min max], :or {min 0.0, max 1.0}})
Returns the Uniform pdf of the given value of x. It will return a sequence
of values, if x is a sequence. This is equivalent to R's dunif function.

Options:
  :min (default 0)
  :max (default 1)

See also:
    cdf-uniform and sample-uniform

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/DoubleUniform.html
    http://en.wikipedia.org/wiki/Uniform_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-uniform 5)
    (pdf-uniform 5 :min 1 :max 10)

    
    
    Source
  


pdf-weibull

function
Usage: (pdf-weibull x & options)
Returns the Weibull pdf for the given value of x. It will return a sequence
of values, if x is a sequence.

Options:
    :scale (default 1)
    :shape (default 1)

See also:
    cdf-weibull and sample-weibull

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Distributions.html
    http://en.wikipedia.org/wiki/Weibull_distribution
    http://en.wikipedia.org/wiki/Probability_density_function

Example:
    (pdf-weibull 2 :shape 1 :scale 0.5)

    
    
    Source
  


permute

function
Usage: (permute x)
       (permute x y)
If provided a single argument, returns a permuted version of the
given collection. (permute x) is the same as (sample x).

If provided two arguments, returns two lists that are permutations
across the given collections. In other words, each of the new collections
will contain elements from both of the given collections. Useful for
permutation tests or randomization tests.

Examples:

  (permute (range 10))
  (permute (range 10) (range 10 20))
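
The two-argument behaviour can be sketched in plain Clojure as a pooled shuffle that is split back into two groups. Keeping the original group sizes is an assumption of this sketch, and permute-pooled-sketch is a hypothetical name.

```clojure
(defn permute-pooled-sketch
  "Pools x and y, shuffles the pool, and splits it back into two groups
  with the same sizes as x and y.
  Hypothetical helper for illustration only."
  [x y]
  (let [pooled (shuffle (concat x y))
        n1     (count x)]
    [(take n1 pooled) (drop n1 pooled)]))

;; each result mixes elements from both inputs
(permute-pooled-sketch (range 10) (range 10 20))
```

In a permutation test, a statistic computed on many such reshuffled pairs approximates its distribution under the hypothesis that the group labels do not matter.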

    
    
    Source
  


predict

function
Usage: (predict model x)
Takes a linear-model and an x value (either a scalar or vector)
and returns the predicted value based on the linear-model.
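
Conceptually, the prediction is the dot product of the coefficients with the x values. The sketch below assumes the model was fit with an intercept, stored as the first coefficient; predict-sketch is a hypothetical name that takes a coefficient vector rather than a linear-model map.

```clojure
(defn predict-sketch
  "Predicted value b0 + b1*x1 + ... for a scalar or vector x, given
  coefficients [b0 b1 ...] with the intercept first.
  Hypothetical helper for illustration only."
  [coefs x]
  (let [xs (if (sequential? x) x [x])]
    (reduce + (map * coefs (cons 1 xs)))))

(predict-sketch [1 2] 3)       ; 1 + 2*3
(predict-sketch [1 2 3] [1 1]) ; 1 + 2*1 + 3*1
```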

    
    
    Source
  


principal-components

function
Usage: (principal-components x & options)
Performs a principal components analysis on the given data matrix.
Equivalent to R's prcomp function.

Returns:
  A map with the following fields:
  :std-dev -- the standard deviations of the principal components
      (i.e. the square roots of the eigenvalues of the correlation
      matrix, though the calculation is actually done with the
      singular values of the data matrix).
  :rotation -- the matrix of variable loadings (i.e. a matrix
      whose columns contain the eigenvectors).


Examples:

  (use '(incanter core stats charts datasets))
  ;; load the iris dataset
  (def iris (to-matrix (get-dataset :iris)))
  ;; run the pca
  (def pca (principal-components (sel iris :cols (range 4))))
  ;; extract the first two principal components
  (def pc1 (sel (:rotation pca) :cols 0))
  (def pc2 (sel (:rotation pca) :cols 1))

  ;; project the first four dimensions of the iris data onto the first
  ;; two principal components
  (def x1 (mmult (sel iris :cols (range 4)) pc1))
  (def x2 (mmult (sel iris :cols (range 4)) pc2))

  ;; now plot the transformed data, coloring each species a different color
  (doto (scatter-plot (sel x1 :rows (range 50)) (sel x2 :rows (range 50))
                      :x-label "PC1" :y-label "PC2" :title "Iris PCA")
        (add-points (sel x1 :rows (range 50 100)) (sel x2 :rows (range 50 100)))
        (add-points (sel x1 :rows (range 100 150)) (sel x2 :rows (range 100 150)))
        view)


  ;; alternatively, the :group-by option can be used in scatter-plot
  (view (scatter-plot x1 x2
                      :group-by (sel iris :cols 4)
                      :x-label "PC1" :y-label "PC2" :title "Iris PCA"))


References:
  http://en.wikipedia.org/wiki/Principal_component_analysis

    
    
    Source
  


product-marginal-test

function
Usage: (product-marginal-test j)
The joint PMF of independent variables is equal to the product of their marginal PMFs.
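
The property can be checked on a small joint table. The sketch below assumes the joint PMF is given as a vector of row vectors of exact ratios, which may differ from the representation the library function expects; product-of-marginals? is a hypothetical name.

```clojure
(defn product-of-marginals?
  "True when every cell of the joint table equals the product of its row
  and column marginals. Uses exact equality, so it is meant for ratios
  rather than floating-point values.
  Hypothetical helper for illustration only."
  [joint]
  (let [row-marginals (mapv #(reduce + %) joint)
        col-marginals (apply mapv + joint)]
    (every? true?
            (for [i (range (count joint))
                  j (range (count (first joint)))]
              (= (get-in joint [i j])
                 (* (row-marginals i) (col-marginals j)))))))

(product-of-marginals? [[1/8 3/8] [1/8 3/8]]) ; rows 1/2, 1/2; columns 1/4, 3/4
(product-of-marginals? [[1/2 0] [0 1/2]])     ; perfectly dependent variables
```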

    
    
    Source
  


quantile

function
Usage: (quantile x & {:keys [probs], :or {probs (DoubleArrayList. (double-array [0.0 0.25 0.5 0.75 1.0]))}})
Returns the quantiles of the data, x. By default it returns the min,
25th-percentile, 50th-percentile, 75th-percentile, and max value.

Options:
  :probs (default [0.0 0.25 0.5 0.75 1.0])

Examples:
  (quantile (sample-normal 100))
  (quantile (sample-normal 100) :probs [0.025 0.975])
  (quantile (sample-normal 100) :probs 0.975)

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Quantile

    
    
    Source
  


quantile-normal

function
Usage: (quantile-normal probability & {:keys [mean sd], :or {mean 0, sd 1}})
Returns the inverse of the Normal CDF for the given probability.
It will return a sequence of values, if given a sequence of
probabilities. This is equivalent to R's qnorm function.

Options:
  :mean (default 0)
  :sd (default 1)

Returns:
  a value x, where (cdf-normal x) = probability

See also:
    pdf-normal, cdf-normal, and sample-normal

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/Probability.html
    http://en.wikipedia.org/wiki/Normal_distribution
    http://en.wikipedia.org/wiki/Quantile

Example:
    (quantile-normal 0.975)
    (quantile-normal [0.025 0.975] :mean -2 :sd (sqrt 0.5))

    
    
    Source
  


quantile-t

function
Usage: (quantile-t probability & {:keys [df], :or {df 1}})
Returns the inverse of the Student's t CDF for the given probability
(i.e. the quantile).  It will return a sequence of values, if given
a sequence of probabilities. This is equivalent to R's qt function.

Options:
  :df (default 1)

Returns:
  a value x, where (cdf-t x) = probability

See also:
   pdf-t, cdf-t, and sample-t

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/Probability.html
    http://en.wikipedia.org/wiki/Student-t_distribution
    http://en.wikipedia.org/wiki/Quantile

Example:
    (quantile-t 0.975)
    (quantile-t [0.025 0.975] :df 25)
    (def df [1 2 3 4 5 6 7 8 9 10 20 50 100 1000])
    (map #(quantile-t 0.025 :df %) df)

    
    
    Source
  


rank-index

function
Usage: (rank-index x)
Given a seq, returns a map where the keys are the values of the seq
and the values are the positional rank of each member of the seq.
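
One way to read this description is sketched below in plain Clojure. Ranking in ascending order and starting ranks at 1 are assumptions of the sketch, and rank-index-sketch is a hypothetical name.

```clojure
(defn rank-index-sketch
  "Map from each value of x to its position in the sorted seq,
  counting from 1 (assumed convention).
  Hypothetical helper for illustration only."
  [x]
  (zipmap (sort x) (range 1 (inc (count x)))))

(rank-index-sketch [30 10 20]) ; 10 ranks first, 30 ranks last
```

Because the result is a map keyed by value, tied values share a single entry in this sketch.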

    
    
    Source
  


sample

multimethod
No usage documentation available
Returns a sample of the given size from the given collection. If replacement
is set to false it returns a set, otherwise it returns a list.

Arguments:
  coll -- collection or dataset to be sampled from

Options:
  :size -- (default (count coll)) sample size
  :replacement -- (default true) sample with replacement


Examples:
  (sample (range 10)) ; permutation of numbers zero through nine
  (sample [:red :green :blue] :size 10) ; choose 10 items that are either :red, :green, or :blue.
  (sample (seq "abcdefghijklmnopqrstuvwxyz")  :size 4 :replacement false) ; choose 4 random letters.
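
The difference between the two modes can be sketched in plain Clojure. This is an illustration of the documented behaviour (a list with replacement, a set without), not the library's multimethod; sample-sketch is a hypothetical name.

```clojure
(defn sample-sketch
  "Draws :size items from coll. With replacement, items may repeat and a
  lazy seq is returned; without replacement, a set of distinct positions
  is returned.
  Hypothetical helper for illustration only."
  [coll & {:keys [size replacement] :or {replacement true}}]
  (let [v (vec coll)
        n (or size (count v))]
    (if replacement
      (repeatedly n #(rand-nth v))
      (set (take n (shuffle v))))))

(sample-sketch [:red :green :blue] :size 10)
(sample-sketch (seq "abcdefghijklmnopqrstuvwxyz") :size 4 :replacement false)
```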

    
    
    Source
  


sample-beta

function
Usage: (sample-beta size & {:keys [alpha beta], :or {alpha 1, beta 1}})
Returns a sample of the given size from a Beta distribution.
This is equivalent to R's rbeta function.

Options:
  :alpha (default 1)
  :beta (default 1)
  These default values produce a Uniform distribution.

See also:
    pdf-beta and cdf-beta

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Beta.html
    http://en.wikipedia.org/wiki/Beta_distribution

Example:
    (sample-beta 1000 :alpha 1 :beta 2)

    
    
    Source
  


sample-binomial

function
Usage: (sample-binomial samplesize & {:keys [size prob], :or {size 1, prob 1/2}})
Returns a sample of the given size from a Binomial distribution.
Equivalent to R's rbinom.

Options:
  :size (default 1)
  :prob (default 1/2)

See also:
    pdf-binomial and cdf-binomial

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Binomial.html
    http://en.wikipedia.org/wiki/Binomial_distribution

Example:
    (sample-binomial 1000 :prob 1/4 :size 20)

    
    
    Source
  


sample-chisq

function
Usage: (sample-chisq size & {:keys [df], :or {df 1}})
Returns a sample of the given size from a Chi Square distribution.
Equivalent to R's rchisq function.

Options:
  :df (default 1)

See also:
    pdf-chisq and cdf-chisq

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/ChiSquare.html
    http://en.wikipedia.org/wiki/Chi_square_distribution

Example:
    (sample-chisq 1000 :df 2)

    
    
    Source
  


sample-dirichlet

function
Usage: (sample-dirichlet size alpha)
Returns a sample of the given size from a Dirichlet distribution
with parameter vector alpha.

Examples:
  (use '(incanter core stats charts))

  ;; a total of 1447 adults were polled to indicate their preferences for
  ;; candidate 1 (y1=727), candidate 2 (y2=583), or some other candidate or no
  ;; preference (y3=137).

  ;; the counts y1, y2, and y3 are assumed to have a multinomial distribution
  ;; If a uniform prior distribution is assigned to the multinomial vector
  ;; theta = (th1, th2, th3), then the posterior distribution of theta is
  ;; proportional to g(theta) = th1^y1 * th2^y2 * th3^y3, which is a
  ;; dirichlet distribution with parameters (y1+1, y2+1, y3+1)
  (def  theta (sample-dirichlet 1000 [(inc 727) (inc 583) (inc 137)]))
  ;; view means, 95% CI, and histograms of the proportion parameters
  (mean (sel theta :cols 0))
  (quantile (sel theta :cols 0) :probs [0.025 0.975])
  (view (histogram (sel theta :cols 0)))
  (mean (sel theta :cols 1))
  (quantile (sel theta :cols 1) :probs [0.025 0.975])
  (view (histogram (sel theta :cols 1)))
  (mean (sel theta :cols 2))
  (quantile (sel theta :cols 2) :probs [0.025 0.975])
  (view (histogram (sel theta :cols 2)))

  ;; view  a histogram of the difference in proportions between the first
  ;; two candidates
  (view (histogram (minus (sel theta :cols 0) (sel theta :cols 1))))

    
    
    Source
  


sample-exp

function
Usage: (sample-exp size & {:keys [rate], :or {rate 1}})
Returns a sample of the given size from an Exponential distribution.
Equivalent to R's rexp.

Options:
  :rate (default 1)

See also:
    pdf-exp and cdf-exp

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Exponential.html
    http://en.wikipedia.org/wiki/Exponential_distribution

Example:
    (sample-exp 1000 :rate 1/2)

    
    
    Source
  


sample-gamma

function
Usage: (sample-gamma size & {:keys [shape scale rate], :or {shape 1}})
Returns a sample of the given size from a Gamma distribution.
This is equivalent to R's rgamma function.

Options:
  :shape (k) (default 1)
  :scale (θ) (default 1 or 1/rate, if :rate is specified)
  :rate  (β) (default 1/scale, if :scale is specified)

See also:
    pdf-gamma, cdf-gamma, and quantile-gamma

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Gamma.html
    http://en.wikipedia.org/wiki/Gamma_distribution

Example:
    (sample-gamma 1000 :shape 1 :scale 2)

    
    
    Source
  


sample-inv-wishart

function
Usage: (sample-inv-wishart & {:keys [scale p df], :or {p 2}})
Returns a p-by-p symmetric matrix drawn from an inverse-Wishart distribution.

Options:
  :p (default 2) -- number of dimensions of resulting matrix
  :df (default p) -- degrees of freedom (aka n), df >= p
  :scale (default (identity-matrix p)) -- positive definite matrix (aka V)

Examples:
  (use '(incanter core stats))
  (sample-inv-wishart :df 10  :p 4)

  ;; calculate the mean of 1000 wishart matrices, should equal (mult df scale)
  (div (reduce plus (for [_ (range 1000)] (sample-wishart :p 4))) 1000)


References:
  http://en.wikipedia.org/wiki/Inverse-Wishart_distribution

    
    
    Source
  


sample-multinomial

function
Usage: (sample-multinomial size & {:keys [probs categories], :or {probs [0.5 0.5]}})
Returns a sequence representing a sample from a multinomial distribution.

Arguments:
  size -- number of values to return

Options:
  :categories (default [0 1]) -- the values returned
  :probs (default [0.5 0.5]) -- the probabilities associated with each category


References:
  http://en.wikipedia.org/wiki/Multinomial_distribution#Sampling_from_a_multinomial_distribution


Examples:
  (use '(incanter core stats charts))

  (sample-multinomial 10)
  (sample-multinomial 10 :probs [0.25 0.5 0.25])

  ;; estimate sample proportions
  (def sample-size 1000.0)
  (def categories [:red :yellow :blue :green])
  (def data (to-dataset (sample-multinomial sample-size
                                            :categories categories
                                            :probs [0.5 0.25 0.2 0.05])))

  ;; check the sample proportions
  (view (pie-chart categories
                   (map #(div (count ($ :col-0 ($where {:col-0 %} data)))
                              sample-size)
                        categories)))
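
The sampling method described in the reference (draw a uniform number and find the category whose cumulative probability first exceeds it) can be sketched in plain Clojure. The names multinomial-draw-sketch and sample-multinomial-sketch are hypothetical, and this is not necessarily how the library implements it.

```clojure
(defn multinomial-draw-sketch
  "Returns the category whose cumulative probability first exceeds u,
  where u is a number in [0, 1).
  Hypothetical helper for illustration only."
  [categories probs u]
  (let [cumulative (reductions + probs)]
    (or (some (fn [[category limit]] (when (< u limit) category))
              (map vector categories cumulative))
        ;; guard against rounding when the probabilities sum to just under 1
        (last categories))))

(defn sample-multinomial-sketch
  "Lazy seq of `size` independent draws."
  [size categories probs]
  (repeatedly size #(multinomial-draw-sketch categories probs (rand))))

(sample-multinomial-sketch 10 [:red :yellow :blue :green] [0.5 0.25 0.2 0.05])
```

Passing u explicitly to the draw function keeps the category lookup deterministic, which makes it easy to check by hand.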

    
    
    Source
  


sample-mvn

function
Usage: (sample-mvn size & {:keys [mean sigma]})
Returns a sample of the given size from a Multivariate Normal
distribution. This is equivalent to R's mvtnorm::rmvnorm function.

Arguments:
  size -- the size of the sample to return

Options:
  :mean (default (repeat (ncol sigma) 0))
  :sigma (default (identity-matrix (count mean)))


Examples:

  (use '(incanter core stats charts))
  (def mvn-samp (sample-mvn 1000 :mean [7 5] :sigma (matrix [[2 1.5] [1.5 3]])))
  (covariance mvn-samp)
  (def means (map mean (trans mvn-samp)))

  ;; plot scatter-plot of points
  (def mvn-plot (scatter-plot (sel mvn-samp :cols 0) (sel mvn-samp :cols 1)))
  (view mvn-plot)
  ;; add centroid to plot
  (add-points mvn-plot [(first means)] [(second means)])

  ;; add regression line to scatter plot
  (def x (sel mvn-samp :cols 0))
  (def y (sel mvn-samp :cols 1))
  (def lm (linear-model y x))
  (add-lines mvn-plot x (:fitted lm))


References:
  http://en.wikipedia.org/wiki/Multivariate_normal

    
    
    Source
  


sample-neg-binomial

function
Usage: (sample-neg-binomial samplesize & {:keys [size prob], :or {size 10, prob 1/2}})
Returns a sample of the given size from a Negative Binomial distribution.
Equivalent to R's rnbinom.

Options:
  :size (default 10)
  :prob (default 1/2)

See also:
    pdf-neg-binomial and cdf-neg-binomial

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/NegativeBinomial.html
    http://en.wikipedia.org/wiki/Negative_binomial_distribution

Example:
    (sample-neg-binomial 1000 :prob 1/2 :size 20)

    
    
    Source
  


sample-normal

function
Usage: (sample-normal size & {:keys [mean sd], :or {mean 0, sd 1}})
Returns a sample of the given size from a Normal distribution.
This is equivalent to R's rnorm function.

Options:
  :mean (default 0)
  :sd (default 1)

See also:
    pdf-normal, cdf-normal, quantile-normal

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Normal.html
    http://en.wikipedia.org/wiki/Normal_distribution

Example:
    (sample-normal 1000 :mean -2 :sd (sqrt 0.5))

    
    
    Source
  


sample-permutations

function
Usage: (sample-permutations n x)
       (sample-permutations n x y)
If provided two arguments (n x), returns a list of n permutations
of x. If provided three arguments (n x y), returns a list of two lists, each containing
n permutations, where each permutation is drawn from the pooled arguments.

Arguments:
  n -- number of randomized versions of the original two groups to return
  x -- group 1
  y -- (default nil) group 2


Examples:

  (use '(incanter core stats))
  (sample-permutations 10 (range 10))
  (sample-permutations 10 (range 10) (range 10 20))

  ;; extended example with plant-growth data
  (use '(incanter core stats datasets charts))

  ;; load the plant-growth dataset
  (def data (to-matrix (get-dataset :plant-growth)))

  ;; break the first column of the data into groups based on treatment (second column).
  (def groups (group-on data 1 :cols 0))

  ;; define a function for the statistic of interest
  (defn means-diff [x y] (minus (mean x) (mean y)))

  ;; calculate the difference in sample means between the two groups
  (def samp-mean-diff (means-diff (first groups) (second groups))) ;; 0.371

  ;; create 1000 permuted versions of the original two groups
  (def permuted-groups (sample-permutations 1000 (first groups) (second groups)))

  ;; calculate the difference of means of the 1000 samples
  (def permuted-means-diffs1 (map means-diff (first permuted-groups) (second permuted-groups)))

  ;; use an indicator function that returns 1 when the randomized means diff is greater
  ;; than the original sample means diff, and zero otherwise. Then take the mean of this sequence
  ;; of ones and zeros. That is the proportion of times you would see a value more extreme
  ;; than the sample means diff (i.e. the p-value).
  (mean (indicator #(> % samp-mean-diff) permuted-means-diffs1)) ;; 0.088

  ;; calculate the 95% confidence interval of the null hypothesis. If the
  ;; sample difference in means is outside of this range, that is evidence
  ;; that the two means are statistically significantly different.
  (quantile permuted-means-diffs1 :probs [0.025 0.975]) ;; (-0.606 0.595)

  ;; Plot a histogram of the permuted-means-diffs using the density option,
  ;; instead of the default frequency, and then add a normal pdf curve with
  ;; the mean and sd of permuted-means-diffs data for a visual comparison.
  (doto (histogram permuted-means-diffs1 :density true)
        (add-lines (range -1 1 0.01) (pdf-normal (range -1 1 0.01)
                                                 :mean (mean permuted-means-diffs1)
                                                 :sd (sd permuted-means-diffs1)))
        view)

  ;; compare the means of treatment 2 and control
  (def permuted-groups (sample-permutations 1000 (first groups) (last groups)))
  (def permuted-means-diffs2 (map means-diff (first permuted-groups) (second permuted-groups)))
  (def samp-mean-diff (means-diff (first groups) (last groups))) ;; -0.4939
  (mean (indicator #(< % samp-mean-diff) permuted-means-diffs2)) ;; 0.022
  (quantile permuted-means-diffs2 :probs [0.025 0.975]) ;; (-0.478 0.466)

  ;; compare the means of treatment 1 and treatment 2
  (def permuted-groups (sample-permutations 1000 (second groups) (last groups)))
  (def permuted-means-diffs3 (map means-diff (first permuted-groups) (second permuted-groups)))
  (def samp-mean-diff (means-diff (second groups) (last groups))) ;; -0.865
  (mean (indicator #(< % samp-mean-diff) permuted-means-diffs3)) ;;  0.002
  (quantile permuted-means-diffs3 :probs [0.025 0.975]) ;; (-0.676 0.646)

  (doto (box-plot permuted-means-diffs1)
        (add-box-plot permuted-means-diffs2)
        (add-box-plot permuted-means-diffs3)
        view)


  Further Reading:
    http://en.wikipedia.org/wiki/Resampling_(statistics)

    
    
    Source
  


sample-poisson

function
Usage: (sample-poisson size & {:keys [lambda], :or {lambda 1}})
Returns a sample of the given size from a Poisson distribution.
Equivalent to R's rpois.

Options:
  :lambda (default 1)

See also:
    pdf-poisson and cdf-poisson

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Poisson.html
    http://en.wikipedia.org/wiki/Poisson_distribution

Example:
    (sample-poisson 1000 :lambda 10)

    
    
    Source
  


sample-t

function
Usage: (sample-t size & {:keys [df], :or {df 1}})
Returns a sample of the given size from a Student's t distribution.
Equivalent to R's rt function.

Options:
  :df (default 1)

See also:
    pdf-t, cdf-t, and quantile-t

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/StudentT.html
    http://en.wikipedia.org/wiki/Student-t_distribution

Example:
    (sample-t 1000 :df 10)

    
    
    Source
  


sample-uniform

function
Usage: (sample-uniform size & {:keys [min max integers], :or {min 0.0, max 1.0, integers false}})
Returns a sample of the given size from a Uniform distribution.
This is equivalent to R's runif function.

Options:
  :min (default 0)
  :max (default 1)
  :integers (default false)

See also:
    pdf-uniform and cdf-uniform

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/DoubleUniform.html
    http://en.wikipedia.org/wiki/Uniform_distribution

Example:
    (sample-uniform 1000)
    (sample-uniform 1000 :min 1 :max 10)

    
    
    Source
  


sample-weibull

function
Usage: (sample-weibull size & options)
Returns a sample of the given size from a Weibull distribution.

Options:
  :shape (default 1)
  :scale (default 1)

See also:
    pdf-weibull, cdf-weibull

References:
    http://incanter.org/docs/parallelcolt/api/cern/jet/random/tdouble/Distributions.html
    http://en.wikipedia.org/wiki/Weibull_distribution

Example:
    (sample-weibull 1000 :shape 1 :scale 0.2)

    
    
    Source
  


sample-wishart

function
Usage: (sample-wishart & {:keys [scale p df], :or {p 2}})
Returns a p-by-p symmetric matrix drawn from a Wishart distribution.

Options:
  :p (default 2) -- number of dimensions of resulting matrix
  :df (default p) -- degrees of freedom (aka n), df >= p
  :scale (default (identity-matrix p)) -- positive definite matrix (aka V)

Examples:
  (use 'incanter.stats)
  (sample-wishart :df 10  :p 4)

  ;; calculate the mean of 1000 wishart matrices, should equal (mult df scale)
  (div (reduce plus (for [_ (range 1000)] (sample-wishart :p 4))) 1000)


References:
  http://en.wikipedia.org/wiki/Wishart_distribution

    
    
    Source
  


scalar-abs

function
Usage: (scalar-abs x)
Fast absolute value function.

    
    
    Source
  


sd

function
Usage: (sd x)
Returns the sample standard deviation of the data, x. Equivalent to
R's sd function.

Examples:
  (sd (sample-normal 100))

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Standard_deviation

    
    
    Source
  


simple-ci

function
Usage: (simple-ci coll)
Get the confidence interval for the data.

    
    
    Source
  


simple-p-value

function
Usage: (simple-p-value coll mu)
Returns the p-value for the data contained in coll.

    
    
    Source
  


simple-regression

function
Usage: (simple-regression y x & {:keys [intercept], :or {intercept true}})
A stripped version of linear-model that returns a map containing only
the coefficients.

    
    
    Source
  


simple-t-test

function
Usage: (simple-t-test coll mu)
Perform a simple t-test on the data contained in coll.
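
For orientation, a one-sample t-test of this kind is based on the
statistic t = (mean - mu) / (sd / sqrt(n)). The sketch below computes only
that statistic in plain Clojure. t-statistic-sketch is a hypothetical name,
not part of incanter.stats, and the value returned by simple-t-test itself
is not reproduced here.

```clojure
;; Hypothetical helper for illustration, not part of incanter.stats.
(defn t-statistic-sketch
  "One-sample t statistic for coll against the hypothesized mean mu.
  Assumes coll holds at least two numbers that are not all equal."
  [coll mu]
  (let [n    (count coll)
        mean (/ (reduce + coll) n)
        ;; sum of squared deviations from the sample mean
        ss   (reduce + (map #(let [d (- % mean)] (* d d)) coll))
        ;; sample standard deviation (n - 1 in the denominator)
        sd   (Math/sqrt (double (/ ss (dec n))))]
    (/ (- mean mu) (/ sd (Math/sqrt (double n))))))

(t-statistic-sketch [2.0 4.0 6.0 8.0] 3.0) ;; roughly 1.549
```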

    
    
    Source
  


skewness

function
Usage: (skewness x)
Returns the skewness of the data, x. "Skewness is a measure of the asymmetry
of the probability distribution of a real-valued random variable." (Wikipedia)

Examples:

  (skewness (sample-normal 100000)) ;; approximately 0
  (skewness (sample-gamma 100000)) ;; approximately 2

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Skewness

    
    
    Source
  


sorensen-index

function
Usage: (sorensen-index a b)
http://en.wikipedia.org/wiki/S%C3%B8rensen_similarity_index#cite_note-4

The Sørensen index, also known as Sørensen’s similarity coefficient,
is a statistic used for comparing the similarity of two samples.
It is defined as QS = 2C / (A + B),
where A and B are the species numbers in samples A and B, respectively,
and C is the number of species shared by the two samples.

The Sørensen index is identical to Dice's coefficient, which is always in the [0, 1] range.
The Sørensen index used as a distance measure, 1 − QS, is identical
to Hellinger distance and Bray–Curtis dissimilarity.

The Sørensen coefficient is mainly useful for ecological community data
(e.g. Looman & Campbell, 1960). Justification for its use is primarily
empirical rather than theoretical
(although it can be justified theoretically as the intersection of two fuzzy sets).
As compared to Euclidean distance, Sørensen distance retains sensitivity
in more heterogeneous data sets and gives less weight to outliers.

This function assumes you pass in a and b as sets.

The Sørensen index extended to abundance instead of incidence of species is called the Czekanowski index.
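
As a rough illustration of the set-based calculation, assuming the usual
definition QS = 2C / (A + B), the index can be sketched in plain Clojure.
sorensen-sketch is a hypothetical name, not the library function, and the
library's handling of edge cases may differ.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helper for illustration, not part of incanter.stats.
(defn sorensen-sketch
  "QS = 2C / (A + B) for two sets a and b, where C is the size of
  their intersection. Assumes at least one of the sets is non-empty."
  [a b]
  (/ (* 2 (count (set/intersection a b)))
     (+ (count a) (count b))))

;; Two species shared out of three in each sample.
(sorensen-sketch #{:oak :elm :ash} #{:elm :ash :fir}) ;; => 2/3
```

Identical sets give 1 and disjoint sets give 0, matching the [0, 1] range.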

    
    
    Source
  


spearmans-rho

function
Usage: (spearmans-rho a b)
http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

In statistics, Spearman's rank correlation coefficient, or Spearman's rho,
is a non-parametric measure of correlation – that is, it assesses how well
an arbitrary monotonic function could describe the relationship between two
variables, without making any other assumptions about the particular nature
of the relationship between the variables. Certain other measures of correlation
are parametric in the sense of being based on possible relationships of a
parameterised form, such as a linear relationship.
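
To make the rank-based idea concrete, here is a minimal sketch in plain
Clojure using the textbook formula for data without ties,
rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between
the ranks of paired observations. The names ranks and spearman-sketch are
hypothetical, and this is not necessarily how the library computes its result.

```clojure
;; Hypothetical helpers for illustration, not part of incanter.stats.
(defn ranks
  "Maps each value in xs to its 1-based rank. Assumes no tied values."
  [xs]
  (let [rank-of (zipmap (sort xs) (iterate inc 1))]
    (map rank-of xs)))

(defn spearman-sketch
  "Spearman's rho for two equal-length sequences with no ties and n > 1."
  [a b]
  (let [n  (count a)
        ;; sum of squared rank differences
        d2 (reduce + (map (fn [ra rb] (let [d (- ra rb)] (* d d)))
                          (ranks a) (ranks b)))]
    (- 1 (/ (* 6 d2) (* n (dec (* n n)))))))

;; A monotonic but non-linear relationship still gives a perfect score.
(spearman-sketch [10 20 30 40] [1 4 9 16]) ;; => 1
(spearman-sketch [10 20 30 40] [16 9 4 1]) ;; => -1
```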

    
    
    Source
  


square-devs-from-mean

function
Usage: (square-devs-from-mean x)
       (square-devs-from-mean x m)
Takes either a sample, or a sample and a precalculated mean.
Returns the squares of the differences between each observation and the sample mean.

    
    
    Source
  


sum-of-square-devs-from-mean

function
Usage: (sum-of-square-devs-from-mean x)
       (sum-of-square-devs-from-mean x m)
Takes either a sample, or a sample and a precalculated mean.

Returns the sum of the squares of the differences between each observation and the sample mean.
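
The two arities described for square-devs-from-mean and
sum-of-square-devs-from-mean can be pictured with a small plain-Clojure
sketch. The names below are hypothetical stand-ins, not the library
functions, and the library's return types may differ.

```clojure
;; Hypothetical helpers for illustration, not part of incanter.stats.
(defn square-devs-sketch
  "Squared deviation of each observation in x from the mean m.
  With one argument, the mean is computed from x."
  ([x] (square-devs-sketch x (/ (reduce + x) (count x))))
  ([x m] (map #(let [d (- % m)] (* d d)) x)))

(defn sum-square-devs-sketch
  "Sum of the squared deviations of x from the mean m."
  ([x] (reduce + (square-devs-sketch x)))
  ([x m] (reduce + (square-devs-sketch x m))))

(square-devs-sketch [1.0 2.0 3.0 4.0])         ;; => (2.25 0.25 0.25 2.25)
(sum-square-devs-sketch [1.0 2.0 3.0 4.0])     ;; => 5.0
(sum-square-devs-sketch [1.0 2.0 3.0 4.0] 2.5) ;; => 5.0
```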

    
    
    Source
  


sum-variance-test

function
Usage: (sum-variance-test vs)
The variance of the sum of n independent variables is equal
to the sum of their variances.

 (sum-variance-test [[1 2 3 4] [1 2 3 4]]) -> 5/2

    
    
    Source
  


summarizable?

function
Usage: (summarizable? col ds)
Takes in a column name (or number) and a dataset. Returns true if the column can be summarized, and false otherwise.

    
    
    Source
  


summarizer-fn

function
Usage: (summarizer-fn col ds)
Takes in a column (number or name) and a dataset. Returns a function
to summarize the column if it is summarizable; otherwise returns a string
describing why the column can't be summarized.

    
    
    Source
  


summary

function
Usage: (summary ds)
Takes in a dataset. Returns a summary of that dataset (as a map of maps),
having automatically figured out the relevant datatypes of columns.
Will be slightly forgiving of mangled data in columns.

    
    
    Source
  


sweep

function
Usage: (sweep x & {:keys [stat fun], :or {stat mean, fun minus}})
Return an array obtained from an input array by sweeping out a
summary statistic. Based on R's sweep function.

Arguments:
  x is a sequence

Options:
      :stat (default mean) the statistic to sweep out
      :fun (default minus) the function used to sweep the stat out

Example:

  (use '(incanter core stats))

  (def x (sample-normal 30 :mean 10 :sd 5))
  (sweep x) ;; center the data around mean
  (sweep x :stat sd :fun div) ;; divide data by its sd

    
    
    Source
  


t-test

function
Usage: (t-test x & {:keys [y mu paired conf-level alternative var-equal], :or {paired false, alternative :two-sided, conf-level 0.95, var-equal false}})
Argument:
  x : sample to test

Options:
  :y (default nil)
  :mu (default (mean y) or 0) population mean
  :alternative (default :two-sided) other choices :less :greater
  :var-equal TODO (default false) variance equal
  :paired TODO (default false) paired test
  :conf-level (default 0.95) for returned confidence interval

Examples:

  (t-test (range 1 11) :mu 0)
  (t-test (range 1 11) :mu 0 :alternative :less)
  (t-test (range 1 11) :mu 0 :alternative :greater)

  (t-test (range 1 11) :y (range 7 21))
  (t-test (range 1 11) :y (range 7 21) :alternative :less)
  (t-test (range 1 11) :y (range 7 21) :alternative :greater)
  (t-test (range 1 11) :y (conj (range 7 21) 200))

References:
  http://en.wikipedia.org/wiki/T_test
  http://www.socialresearchmethods.net/kb/stat_t.php

    
    
    Source
  


tabulate

function
Usage: (tabulate x & options)
Cross-tabulates the values of the given numeric matrix.

Returns a hash-map with the following fields:
  :table -- the table of counts for each combination of values,
            this table is only returned if x has two columns
  :levels -- a sequence of sequences, where each sequence lists
             the levels (possible values) of the corresponding
             column of x.
  :margins -- a sequence of sequences, where each sequence
              represents the marginal total for each level
              of the corresponding column of x.
  :counts -- a hash-map, where vectors of unique combinations
             of the cross-tabulated levels are the keys and the
             values are the total count of each combination.
  :N  -- the grand-total for the contingency table


Examples:

  (use '(incanter core stats))
  (tabulate [1 2 3 2 3 2 4 3 5])
  (tabulate (sample-poisson 100 :lambda 5))

  (use '(incanter core stats datasets))
  (def math-prog (to-matrix (get-dataset :math-prog)))
  (tabulate (sel math-prog :cols [1 2]))


  (def data (matrix [[1 0 1]
                     [1 1 1]
                     [1 1 1]
                     [1 0 1]
                     [0 0 0]
                     [1 1 1]
                     [1 1 1]
                     [1 0 1]
                     [1 1 0]]))
  (tabulate data)


  (def data (matrix [[1 0]
                     [1 1]
                     [1 1]
                     [1 0]
                     [0 0]
                     [1 1]
                     [1 1]
                     [1 0]
                     [1 1]]))
  (tabulate data)

    
    
    Source
  


tanimoto-coefficient

function
Usage: (tanimoto-coefficient a b)
http://en.wikipedia.org/wiki/Jaccard_index

The cosine similarity metric may be extended such that it yields the
Jaccard coefficient in the case of binary attributes.
This is the Tanimoto coefficient. 
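
For binary attributes represented as sets, the Jaccard form of the
coefficient is |A ∩ B| / (|A| + |B| - |A ∩ B|). The sketch below assumes
set inputs and uses a hypothetical name, tanimoto-sketch. The input
representation expected by tanimoto-coefficient itself is not stated here
and may differ.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helper for illustration, not part of incanter.stats.
(defn tanimoto-sketch
  "Jaccard/Tanimoto coefficient for two sets a and b.
  Assumes at least one of the sets is non-empty."
  [a b]
  (let [c (count (set/intersection a b))]
    (/ c (- (+ (count a) (count b)) c))))

;; Two shared elements out of four distinct elements overall.
(tanimoto-sketch #{:oak :elm :ash} #{:elm :ash :fir}) ;; => 1/2
```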

    
    
    Source
  


variance

function
Usage: (variance x)
Returns the sample variance of the data, x. Equivalent to R's var function.

Examples:
  (variance (sample-normal 100))

References:
  http://incanter.org/docs/parallelcolt/api/cern/jet/stat/tdouble/DoubleDescriptive.html
  http://en.wikipedia.org/wiki/Sample_variance#Population_variance_and_sample_variance

    
    
    Source
  


within

function
Usage: (within z x y)
y is within z of x in metric space.
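
One way to read this for plain numbers is a tolerance check on the absolute
difference. The sketch below assumes real-valued x and y and a strict
inequality. Both are assumptions, and within-sketch is a hypothetical name,
not the library function.

```clojure
;; Hypothetical helper for illustration, not part of incanter.stats.
(defn within-sketch
  "True when the absolute difference between x and y is less than z."
  [z x y]
  (< (Math/abs (double (- x y))) z))

(within-sketch 0.5 10 10.3) ;; => true
(within-sketch 0.5 10 11)   ;; => false
```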

    
    
    Source
  
Logo & site design by Tom Hickey.
Clojure auto-documentation system by Tom Faulhaber.