Limit theorems in probability are useful for solving complex probability and statistics problems. Here we will discuss some important limit theorems used in probability and statistics.

For instance, consider an experiment with a six-sided die. On throwing it, one of the numbers 1, 2, 3, 4, 5, or 6 is the outcome, each with equal probability. So the expected value of the experiment is the average (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.

According to the weak law of large numbers, if a fair die is rolled a large number of times, the average of the observed values is likely to be close to the expected value 3.5, and the approximation tends to improve as the number of rolls grows. This assumes that all possible outcomes of a die roll are known, and that events such as the die landing on an edge or being struck mid-roll either cannot occur or are ignored if they do; such events are known as black swan events.
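The convergence described by the weak law is easy to see numerically. Below is a minimal Python sketch (the function name `running_mean_of_die_rolls` is just an illustrative choice) that rolls a simulated fair die many times and tracks the running average, which settles near the expected value 3.5:

```python
import random

def running_mean_of_die_rolls(num_rolls, seed=0):
    """Roll a fair six-sided die num_rolls times and return the running mean
    after each roll."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, num_rolls + 1):
        total += rng.randint(1, 6)  # one fair die roll: 1..6, equally likely
        means.append(total / i)
    return means

means = running_mean_of_die_rolls(100_000)
# The running mean drifts toward the expected value 3.5 as the number of rolls grows.
print(means[9], means[99], means[-1])
```

Early running means fluctuate widely, but by the final roll the average is typically within a few hundredths of 3.5.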

The weak law of large numbers is a result in probability theory that is also known as Bernoulli's theorem. Let Y1, Y2, . . ., Yn be a sequence of independent, identically distributed random variables, each with mean E(Yi) = U and standard deviation σ.

The sample mean is defined as:

Ȳ = (Y1 + Y2 + Y3 + . . . + Yn) / n

As n tends to infinity, the sample mean Ȳ converges to the population mean U of each variable. Its expected value is:

E(Ȳ) = E[(Y1 + Y2 + Y3 + . . . + Yn) / n]

= (1/n) [E(Y1) + E(Y2) + . . . + E(Yn)]

= (n * U) / n

= U

In addition, the variance is:

Var(Ȳ) = Var((Y1 + Y2 + Y3 + . . . + Yn) / n)

= Var(Y1/n) + Var(Y2/n) + . . . + Var(Yn/n)

= σ^2/n^2 + . . . + σ^2/n^2

= σ^2 / n

By Chebyshev's inequality, for every ε > 0,

P(|Ȳ - U| ≥ ε) ≤ Var(Ȳ) / ε^2 = σ^2 / (n ε^2)

As n tends to infinity, it follows that P(|Ȳ - U| > ε) → 0.

So this is the weak law of large numbers.

Probability theory contains various theorems known as the laws of large numbers, which fall into two categories.

1. Weak law of large numbers

2. Strong law of large numbers

We now clarify the relationship between the weak and the strong laws of large numbers. Both laws make statements about the deviation of the sample mean Xm from the limit value ∆, where

Xm = (1/m) * ∑(i = 1 to m) Xi

Both the strong law and the weak law relate sample size, approximation accuracy, and degree of confidence. The weak law deals with the limit of probabilities involving Xm, while the strong law deals with the probability of the limit of Xm.

The strong law of large numbers consists of different theorems.

Theorem:

We have

P (lim(m → ∞) Xm = ∆) = 1

This can also be written as

Xm → ∆ as m → ∞

Here Xm converges to delta (∆) as m → ∞.

The subtlety in this law is that it involves probabilities of events defined in terms of infinitely many random variables. To handle this properly we require a suitable sample space, a well-defined probability measure, and properly defined random variables.

Specifically, the strong law of large numbers states that, with probability 1, for any positive value ε the quantity

| (∑(i = 1 to m) Xi / m) - ∆ |

is greater than or equal to ε for only finitely many m. We will consider only the case of finite random variables, as it is easier to prove than the general law. The proof of the strong law of large numbers is based on Kolmogorov's inequality, which is a generalization of Chebyshev's inequality.

A Bernoulli process is a finite or infinite sequence of binary random variables (binary meaning each can take only 0 or 1 as its value). It is a discrete-time stochastic process that takes only the two values 0 and 1. The Bernoulli variables, denoted Yi, are identically distributed and independent of one another. Informally, a Bernoulli process is like repeatedly flipping a coin that may be unfair, but whose unfairness is consistent over time.

Each variable Yi in the binary sequence may be termed a Bernoulli trial, or a Bernoulli experiment. These variables all have the same Bernoulli distribution. In problems we are usually given only a limited sample of the trials or experiments.

Definition of a Bernoulli process:

A Bernoulli process is a finite or infinite sequence of independent binary random variables Y1, Y2, Y3, . . . such that

a. For each i, the value of Yi is either 1 or 0.

b. For all i, the probability that Yi = 1 is the same number P.

In general we can say that a Bernoulli process is a sequence of identical independent Bernoulli experiments.

Because the trials are independent, the process is memoryless.

For instance, with success probability P:

1. If P is known, the past outcomes provide no information about future outcomes.

2. If P is unknown, the past outcomes inform us about future outcomes indirectly, through inferences about the probability P.

3. If the process is infinite, then from any point onward the future trials form a Bernoulli process identical to the whole process.
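As a quick illustration, here is a small Python sketch (the helper name `bernoulli_process` is our own, not a library function) that generates a Bernoulli process as a list of 0/1 trials and checks that the empirical success frequency is close to the parameter P:

```python
import random

def bernoulli_process(p, n, seed=42):
    """Generate n independent Bernoulli(p) trials as a list of 0s and 1s."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

trials = bernoulli_process(p=0.3, n=10_000)
# The empirical success frequency should land close to p = 0.3.
print(sum(trials) / len(trials))
```

Because the trials are independent and identically distributed, any contiguous stretch of the output is itself a Bernoulli process with the same parameter.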

The two possible values of each variable Yi are known as "success" and "failure". When the two values are represented as the numbers 0 and 1, the outcome may be read as the number of successes on the ith trial or experiment.

We can also interpret these values as true/false or yes/no.

Whatever the interpretation, the individual variables Yi share the Bernoulli process parameter P, the success probability.

Bernoulli Sequences:

This term refers to a realization of the Bernoulli process, an important process in probability and statistics. Recall that a Bernoulli process can be described by one random variable per Bernoulli trial. Then for every ω belonging to the sample space Ω there is a sequence of integers

Z^ω = {n ∈ N : Yn(ω) = 1},

that is, the set of indices n at which the random binary variable equals 1. These are called the Bernoulli sequences associated with the Bernoulli process.

Now let's take an example: if ω represents a sequence of coin flips, then the Bernoulli sequence Z^ω is the list of all natural numbers for which the coin toss outcome is heads.

So a Bernoulli sequence Z^ω is a random subset of the index set N (the natural numbers).

Randomness Extraction in a Bernoulli Process:

From any Bernoulli sequence we can derive a Bernoulli process with probability P = 1/2 using an extractor called the von Neumann extractor. A randomness extractor is a procedure that extracts uniform randomness from a biased source.

Represent the observed process as a sequence of zeros and ones (bits), and group the input stream into non-overlapping pairs of successive bits: (11)(00)(10) . . . and so on. Then for each pair:

a. If the two bits are equal, discard the pair.

b. Otherwise, output the first bit.

The computation is summarized in the following table:

Input | Output
00 | discard
01 | 0
10 | 1
11 | discard

In the output stream, 0 and 1 are equally likely: the inputs 10 and 01 are equally likely, since both have probability PQ = QP (where Q = 1 - P).

This extraction does not actually require the input trials to be independent; it works for exchangeable sequences of bits, that is, sequences in which all finite rearrangements are equally likely.

At least half of the input pairs are discarded on average; the minimum discard rate occurs when the original process has P = 1/2. In that case the output stream is, on average, 1/4 the length of the input.
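The pairing rule above can be sketched in a few lines of Python (the function name `von_neumann_extract` is illustrative, not from any standard library):

```python
import random

def von_neumann_extract(bits):
    """Von Neumann extractor: group bits into non-overlapping pairs,
    discard equal pairs (00, 11), and output the first bit of each
    unequal pair (01 -> 0, 10 -> 1)."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)
    return out

# A heavily biased input stream (about 80% ones) still yields ~50% ones out.
rng = random.Random(1)
biased = [1 if rng.random() < 0.8 else 0 for _ in range(100_000)]
unbiased = von_neumann_extract(biased)
print(len(unbiased), sum(unbiased) / len(unbiased))
```

Note how much shorter the output is than the input: with P = 0.8, only the fraction 2PQ = 0.32 of the pairs survives, consistent with the discard analysis above.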

The concept of convergence in probability is based on a way of measuring the distance between two random variables, that is, how close the random variables are to each other.

The concept of convergence in probability can be defined as- “Two random variables are ‘close to each other’ if there is a high probability that the difference between them is very small.”

In other words:

"The sequence of random variables (Xn) converges in probability to the random variable X (notation Xn → X in probability) if for all ε > 0, lim(n → ∞) P(|Xn - X| > ε) = 0."

Here Xn is considered far from X when |Xn - X| > ε, so P(|Xn - X| > ε) is the probability that Xn is far from X. If Xn converges to X, then P(|Xn - X| > ε) should become smaller as n increases. Note that for each fixed ε, P(|Xn - X| > ε) is a sequence of real numbers.
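As an illustration of this definition, consider the hypothetical sequence Xn = X + Zn, where Zn is normal with mean 0 and variance 1/n (so Xn converges in probability to X). The Python sketch below estimates P(|Xn - X| > ε) by simulation and shows it shrinking as n grows:

```python
import random

def prob_far(n, eps, num_samples=10_000, seed=0):
    """Estimate P(|X_n - X| > eps) when X_n - X is Normal(0, 1/n):
    as n grows this probability should tend to 0."""
    rng = random.Random(seed)
    far = sum(1 for _ in range(num_samples)
              if abs(rng.gauss(0, 1 / n ** 0.5)) > eps)
    return far / num_samples

eps = 0.1
for n in (1, 10, 100, 1000):
    print(n, prob_far(n, eps))
# The estimated probability shrinks toward 0 as n grows, which is exactly
# the defining property of convergence in probability.
```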

For any ε > 0, X is known as the probability limit of the sequence, and convergence in probability is indicated by

Xn →(P) X,

or

plim(n → ∞) Xn = X.

Here are some results about convergence in probability for a random vector X and a sequence of random vectors X1, . . ., Xn defined on a probability space (Ω, F, P).

1. Convergence in probability is a kind of convergence in measure from measure theory: Xn →(P) X means P(|Xn - X| > ε) → 0 for any fixed value of ε.

2. Convergence with probability 1 (almost sure convergence) is the analogue of pointwise convergence in real analysis, and corresponds to almost-everywhere convergence in measure theory:

Xn →(a.s.) X if P(Xn → X) = 1.

Convergence with probability 1 is stronger than convergence in probability, as the following theorem states.

Theorem: Xn →(a.s.) X => Xn →(P) X.

Proof: The convergence event can be written as

{Xn → X} = {ω : for each ε > 0 there exists an N(ω) > 0 such that ||Xk(ω) - X(ω)|| ≤ ε for all k > N(ω)}

= ∩(ε > 0) {ω : there exists an N(ω) > 0 such that ||Xk(ω) - X(ω)|| ≤ ε for all k > N(ω)}

= ∩(ε > 0) ∪(n = 1 to ∞) {ω : ||Xk(ω) - X(ω)|| ≤ ε for all k > n}

= ∩(ε > 0) ∪(n = 1 to ∞) ∩(k > n) {ω : ||Xk(ω) - X(ω)|| ≤ ε}.

Thus

Xn →(a.s.) X <=> P(Xn → X) = 1

<=> P(∩(ε > 0) ∪(n = 1 to ∞) ∩(k > n) {ω : ||Xk(ω) - X(ω)|| ≤ ε}) = 1

<=> P(∪(n = 1 to ∞) ∩(k > n) {ω : ||Xk(ω) - X(ω)|| ≤ ε}) = 1 for all ε > 0

<=> lim(n → ∞) P(∩(k > n) {ω : ||Xk(ω) - X(ω)|| ≤ ε}) = 1 for all ε > 0 (by continuity of the probability measure)

<=> lim(n → ∞) P(∪(k > n) {ω : ||Xk(ω) - X(ω)|| > ε}) = 0 for all ε > 0.

Since {ω : ||Xn(ω) - X(ω)|| > ε} is contained in the union ∪(k > n - 1) {ω : ||Xk(ω) - X(ω)|| > ε}, this implies

lim(n → ∞) P(||Xn(ω) - X(ω)|| > ε) = 0 for all ε > 0,

that is, P(||Xn(ω) - X(ω)|| > ε) → 0 for any fixed ε > 0, which means Xn →(P) X.

If Xn →(a.s.) X, let the set of non-convergence points be

U(a.s.) = {ω : Xn(ω) does not approach X(ω)};

then P(U(a.s.)) = 0.

If Xn →(P) X and we define the set of non-convergence points U(P) = {ω : Xn(ω) does not approach X(ω)}, then it need not be possible to get P(U(P)) = 0. Convergence in probability does not require P(U(P)) = 0, only that lim(n → ∞) P(U(P)(n)) = 0, where U(P)(n) = {ω : ||Xn(ω) - X(ω)|| > ε} is a set that becomes smaller as n approaches infinity.

There are three propositions related to convergence in probability. The first, already given above, is that lim(n → ∞) P(|Xn - X| > ε) = 0; the fact that almost sure convergence implies convergence in probability is the one used most of the time.

The second proposition is that convergence in probability implies the existence of a subsequence that converges almost surely to the same limit.

The third proposition: let Xn(ω) be any sequence of functions Xn : Ω → R. Then

P({ω : Xn(ω) converges}) = lim(m → ∞) lim(n → ∞) lim(N → ∞) P({ω : max(n < j < k ≤ N) |Xj(ω) - Xk(ω)| ≤ 1/m}).

Probability theory is a branch of mathematics, and the Central Limit Theorem (CLT) comes from it. The first version of this theorem was given by Abraham de Moivre.

This theorem tells us the conditions under which the mean of a large number of random variables with finite variance will be approximately normally distributed. The random variables must be independent of each other and their mean must be finite. In the classical statement the random variables must also be identically distributed; non-identical distributions still work under certain additional conditions.

Here is an example of the central limit theorem. Assume we have a die with six faces. We roll this die repeatedly; with the help of the normal distribution we can then approximate the distribution of the sum or average of the rolled numbers.

Definition of the central limit theorem (CLT): Let Y1, Y2, . . . be independent and identically distributed random variables with mean u and finite non-zero variance σ^2.

Now let Sn = Y1 + Y2 + . . . + Yn. Then

lim(n → ∞) P((Sn - n*u) / (σ*sqrt(n)) ≤ y) = Φ(y),

where Φ(y) is the probability that a standard normal random variable is less than y. Looking again at dice rolling: if Y is the number of spots showing when one die is rolled, then the mean is u = 3.5 and the variance is σ^2 = 35/12.
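A minimal simulation makes the theorem concrete. The sketch below (the function name is our own) rolls 30 dice many times, standardizes each sum using u = 3.5 and σ^2 = 35/12, and checks that the results behave like a standard normal variable:

```python
import random

def standardized_dice_sums(n_dice, n_trials, seed=0):
    """Roll n_dice fair dice n_trials times and return the standardized sums
    (S_n - n*u) / (sigma * sqrt(n)), with u = 3.5 and sigma^2 = 35/12."""
    rng = random.Random(seed)
    u, var = 3.5, 35 / 12
    scale = (var * n_dice) ** 0.5
    return [(sum(rng.randint(1, 6) for _ in range(n_dice)) - n_dice * u) / scale
            for _ in range(n_trials)]

z = standardized_dice_sums(n_dice=30, n_trials=20_000)
# By the CLT the standardized sums are approximately standard normal:
# roughly 68% should fall in [-1, 1] and roughly 95% in [-2, 2].
inside_1 = sum(1 for v in z if abs(v) <= 1) / len(z)
inside_2 = sum(1 for v in z if abs(v) <= 2) / len(z)
print(inside_1, inside_2)
```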

Now here are some very interesting and helpful applications of the central limit theorem.

In random walks, biased or unbiased, the probability distribution of the total distance covered tends toward the normal distribution.

Similarly, flipping a large number of coins results in an approximately normal distribution for the total number of heads (or, equivalently, the total number of tails).

The central limit theorem also explains the appearance of the bell curve in density estimates applied to real-world data. Generalizations of the theorem show that the resulting distribution is still approximately normal.

One more application of this theorem is in signal processing. Signals can be smoothed by applying a Gaussian filter, which is the convolution of the signal with a scaled Gaussian function. By the central limit theorem, this smoothing can be approximated by many simpler filter steps, which can then be computed much faster.

So these were some applications of CLT.

Markov and Chebyshev inequalities are standard topics in probability; let us look at them in detail. Markov's inequality applies to a non-negative random variable with a given expected value, whereas Chebyshev's inequality applies to a random variable with a given expected value and variance, and also provides a bound on the tail probabilities. The best part about the Markov and Chebyshev inequalities is that we do not have to know the probability mass function.

Another way to look at them is this: if we have no idea about the probability mass function of a random variable, the Markov and Chebyshev inequalities are very useful tools for bounding how far the actual value can be from our prediction. Markov's inequality bounds the probability that the actual value exceeds some multiple of the expected value, while Chebyshev's inequality bounds the probability that the value differs from the mean by some multiple of the standard deviation.

So let's start with Markov's inequality. The statement looks simple, but it is fundamental: it underlies other inequalities such as Chebyshev's and Chernoff's. Markov's inequality comes in two forms: a simple one and a general one. The basic Markov inequality states that, for a variable K that can take only non-negative values,

P(K ≥ m·E[K]) ≤ 1/m,

where P(K ≥ m·E[K]) is the probability that the random variable is at least m times its expected value.

We only assume that the random variable is non-negative and that its expected value E[K] is finite.

1. The probability that K is at least twice its expected value is at most one half.

2. The probability that K is at least three times its expected value is at most one third.

Generalized Markov inequality: the probability that the random variable is at least m·E[K] is at most 1/m. This is derived by taking an arbitrary constant p > 0 and bounding the probability that the random variable is at least p, that is, P(K ≥ p).

The general form of Markov's inequality is

P(K ≥ p) ≤ E[K] / p,

where p > 0 is the threshold that varies.

To prove it, consider a constant p > 0 and define the indicator variable

I = 1, if K ≥ p,
I = 0, otherwise.

Then p·I ≤ K, so p·E[I] ≤ E[K]; since E[I] = P(K ≥ p), this gives P(K ≥ p) ≤ E[K]/p. The inequality requires only a non-negative random variable with a finite expected value.
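The bound is easy to check empirically. The sketch below (assuming, purely for illustration, that K follows an exponential distribution with E[K] = 1) compares the observed tail probability P(K ≥ p) with the Markov bound E[K]/p:

```python
import random

def markov_check(p_threshold, num_samples=100_000, seed=0):
    """Empirically compare P(K >= p) with the Markov bound E[K]/p
    for K ~ Exponential(1), a non-negative random variable with E[K] = 1."""
    rng = random.Random(seed)
    samples = [rng.expovariate(1.0) for _ in range(num_samples)]
    empirical = sum(1 for k in samples if k >= p_threshold) / num_samples
    bound = 1.0 / p_threshold  # E[K] / p with E[K] = 1
    return empirical, bound

for p in (2, 3, 5):
    empirical, bound = markov_check(p)
    print(p, empirical, bound)
# The empirical tail probability never exceeds the Markov bound E[K]/p,
# though the bound can be quite loose for this distribution.
```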

One thing to notice about Chebyshev's inequality is that it bounds the error on both sides of the mean. A common mistake is dividing the bound by 2 to obtain a one-sided bound; this is valid only when the distribution is symmetric, and otherwise leads to an incorrect result.

The Chebyshev inequality is the remaining part of this discussion of Markov and Chebyshev inequalities. It is another powerful tool for the same kind of problem, and it gives different information about a random variable. As with Markov's inequality, we do not need to know the distribution of the random variable. For Chebyshev's inequality we just require the expected value of the random variable and its variance, from which we can compute the standard deviation.

The Markov and Chebyshev inequalities are similar, and again there is a simple form and a general form. In the simple form, for a random variable K with finite mean and variance, the deviation is bounded as

P(|K - E[K]| ≥ m·σ) ≤ 1/m^2,

where σ is the standard deviation of K.

A few things to note:

1. In contrast to Markov's inequality, Chebyshev's inequality bounds the deviation on both sides of the mean.

2. Here m is the size of the deviation, measured in standard deviations.

3. The smaller the variance of K, the more tightly K concentrates around its expected value.

4. For m = 2, at most one fourth of the probability mass can lie more than two standard deviations from the mean.

In the general form, the deviation from the mean is fixed at an arbitrary positive constant s:

P(|K - E[K]| ≥ s) ≤ Var[K] / s^2.
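Here is a matching empirical check for Chebyshev's inequality (using K uniform on (0, 1), chosen only for illustration, with E[K] = 0.5 and Var[K] = 1/12):

```python
import random

def chebyshev_check(s, num_samples=100_000, seed=0):
    """Compare P(|K - E[K]| >= s) with the Chebyshev bound Var[K]/s^2
    for K ~ Uniform(0, 1): E[K] = 0.5, Var[K] = 1/12."""
    rng = random.Random(seed)
    mean, var = 0.5, 1 / 12
    samples = [rng.random() for _ in range(num_samples)]
    empirical = sum(1 for k in samples if abs(k - mean) >= s) / num_samples
    bound = var / s ** 2
    return empirical, bound

for s in (0.3, 0.4, 0.45):
    empirical, bound = chebyshev_check(s)
    print(s, empirical, bound)
# The empirical deviation probability always stays below Var[K]/s^2.
```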

The comparison between the Markov and Chebyshev inequalities can be made clear as follows. Just like Markov's inequality, Chebyshev's inequality does not require us to know the distribution of the values at all; only the expected value and the variance (or the standard deviation) need be known. Neither inequality gives an exact answer; instead, each gives a bound on the probability.

We have seen two applications: finding usable bounds without invoking more complex inequalities, and estimating confidence intervals. A few other applications: Chebyshev's inequality shows that the median is within one standard deviation of the mean, and it also gives the easiest proof of the weak law of large numbers.

The Markov and Chebyshev inequalities are two of the simplest, yet still very powerful, inequalities. Their best feature is that they provide useful bounds without requiring much knowledge of the distribution of the random variable. Markov's inequality bounds the probability that a non-negative random variable exceeds a multiple of its expected value, while Chebyshev's inequality bounds the probability that a random variable (which need not be non-negative) deviates from its mean by a multiple of the standard deviation. Both inequalities are very specific about the information they require and provide.