Important for probability and statistics.
Hypothesis:
A hypothesis is a statement based on assumption, knowledge and the state of previous results. It is precise and concise, that is, accurate and short. It shows a relationship between two or more variables. It is stated for the purpose of testing on the basis of systematically collected data. This means the relationship should be testable.
A hypothesis gives direction to the research work. It points out what exactly should be done in the study. It limits unnecessary data collection, so the efforts of the researcher are saved. It avoids deviation in the research work and makes it purposeful.
Characteristics of the hypothesis: It should be accurate, simple and specific. It should be testable within the limits of time, cost and manpower. It must be based on some study, knowledge or purpose. If it does not fulfil any purpose or add knowledge then it is not useful.
Primarily there are two types of hypotheses (plural form of hypothesis): null and alternative. The null hypothesis states that there is no difference between sample and population characteristics. The word null indicates nil or zero or not significant. If there is a lot of copper wire in the yard, then the null hypothesis can be stated as below:
Ho: Breaking strength of copper wire is 500 kg
Ho: $\bar{x} = \mu = 500$ kg
Here $\bar{x}$ (x bar) indicates the mean of the sample tested and $\mu$ is the mean of the population, which cannot be measured many times. Thus on the basis of 10 or 20 or 50 samples of copper wire we can predict about the whole lot.
Some examples of null hypotheses are given below:
a) The drug / treatment is not effective in reducing blood pressure.
b) There is no difference in the number of male and female IT professionals across the companies.
c) The special coaching class is not effective in improving marks.
d) The effect of fertilizer-A on the yield of wheat is insignificant.
e) The difference in Hb level of two groups is insignificant.
Thus the null hypothesis uses some specific words such as “not effective, ineffective, insignificant, independent, no difference or equal, same, stable, uniform, random” etc. depending upon the case. The inner meaning of all these words is insignificant or too small to consider.
In all the above cases there are two groups of data. One can be the standard, or what is found in general in the absence of the treatment, and the other is the observation after treatment. Thus if we think on a case to case basis the above examples can be explained as below.
a) In this example the doctor has the previous levels of blood pressure of the patients. After treatment she / he measures the blood pressure of some patients. The test is whether the post-treatment blood pressure has reduced significantly or not. If it has reduced significantly then the treatment is effective. In other words, if the difference between pre- and post-treatment blood pressure is large then the drug / treatment is said to be effective. This is the alternative hypothesis and not the null hypothesis.
If there is no difference then the drug is not effective in lowering blood pressure. Hence it becomes the null hypothesis, as the inner meaning of Ho is that the difference is insignificant.
b) There are two groups in IT companies, male and female. Whether their numbers are equal or not is the question to be tested using employee numbers.
c) There is a previous set of marks before the special coaching class which is to be compared with the marks after the coaching class. The class is not effective if the difference in the marks before and after is insignificant.
d) The farmer wants to check the yield of wheat after using fertilizer-A as compared to his/her previous experience. If the difference in both yields is insignificant then the effect of the fertilizer is insignificant.
e) The researcher has two groups (of women) and the Hb level of each selected sample. The Hb levels of the two groups are compared.
Thus in each of the above examples there is a claim made by the researcher. On the basis of data collection she/he wants to check the claim. Read the following examples of null hypotheses:
a) Bad oranges in the lot are 4%.
b) The variance of the ‘burning capacity of coals from two mines’ is the same.
c) Defective jobs are independent of shift. Understand this case with the following tables. The purpose is to understand the words dependent and independent.
Case-I
| | Shift 1 | Shift 2 | Shift 3 |
| O1 | 50 | 49 | 50 |
| O2 | 48 | 50 | 52 |
| O3 | 52 | 52 | 51 |
| O4 | 51 | 50 | 53 |
Case-II
| Shift 1 | Shift 2 | Shift 3 |
| 50 | 68 | 95 |
| 48 | 75 | 105 |
| 52 | 82 | 110 |
| 51 | 78 | 100 |
Now you are given the data in Case-I and asked: if 50 defectives are produced in a shift, what is the shift number? You are unable to answer the question because 50 defectives can be produced in shift 1, 2 or 3. Thus, as the difference in the defectives produced in each shift is insignificant, we can say ‘defectives produced are independent of shift’.
Now refer to the second case. Here the answer to the same question will be ‘shift-1’. If the number of defectives is about 70 to 85 then you may answer ‘shift-2’. If it is 95 to 110 then your answer will be ‘shift-3’. Thus here ‘defectives produced are dependent on shift’.
d) The average weight of the ‘banana from Kerala’ is 280 gram.
e) The mileage of the ‘Jupiter two wheeler’ is 60 km/liter.
f) The share price of a particular company is stable.
g) Numbers drawn from a set are random. In this case, if numbers are drawn randomly then the distribution of frequency will be uniform. For example, if there are 10 digits 0 to 9 from which a digit is selected 100 times, then it is expected that each digit will be selected 10 times.
| Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
| Observed frequency | 10 | 11 | 9 | 8 | 11 | 12 | 9 | 10 | 9 | 11 | 100 |
In the ‘observed frequency’ row we see that the frequency is not exactly ‘10’ but it is near to ‘10’. In other words we are unable to differentiate the digits on the basis of frequency. This is just like the example of shifts and defectives.
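One common way to judge whether the observed frequencies are "near enough" to 10 is a chi-square goodness-of-fit calculation. The sketch below assumes that test is the intended one; the critical value is copied from a standard printed chi-square table (9 degrees of freedom, 5% level) and the variable names are illustrative only.

```python
# Observed frequencies of the digits 0-9 from the table above.
observed = [10, 11, 9, 8, 11, 12, 9, 10, 9, 11]

# Under the null hypothesis (digits are random) every digit is expected equally often.
expected = sum(observed) / len(observed)  # 100 / 10 = 10.0

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed)

# Degrees of freedom = number of categories - 1.
df = len(observed) - 1

# Assumption: critical value taken from a printed chi-square table (df = 9, 5% level).
CHI_SQUARE_CRITICAL_5PCT_DF9 = 16.919

reject_null = chi_square > CHI_SQUARE_CRITICAL_5PCT_DF9
print(f"chi-square = {chi_square:.2f}, df = {df}, reject Ho: {reject_null}")
```

The calculated value is far below the table value, so there is no evidence to reject the null hypothesis that the digits are random.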
h) Variables in the population are not correlated. It means the correlation coefficient is insignificant or close to zero. Here the coefficient of correlation of the sample and of the population is taken into consideration.
i) The proportion of vegetarians in two villages is the same.
j) The standard deviation of two series is the same (equal).
In the above examples you may have observed that many parameters are used in the comparison. It should be noted that the null hypothesis does not depend on the unit of the given values. It can be kg, km/hour, hour or any other. Secondly, the following comparisons can be made in a hypothesis.
a) Comparison between means of two groups.
b) Comparison between a mean and an expected or hypothesized mean.
c) Comparison between variances of two groups.
d) Comparison between standard deviations of two groups.
e) Comparison between observed frequency and expected frequency.
f) Correlation coefficient.
g) Values of the additive and multiplicative constants of the regression equation.
h) Proportions of two groups.
However it is not limited to two groups only. Multiple (more than 2) groups can also be tested using an appropriate test.
Now the question is: what is significant and what is insignificant? We will take an example to understand this.
A newspaper vendor claims that the sale of a particular newspaper is 100 per day. If on a given day it was 98 or 105, the difference is small and it can be treated as the usual sale. But suppose it is 50 or 150 on a day. Then there must be some reason which has reduced or increased the sale significantly. This difference is noticeable and cannot be treated as the usual sale, as it is significantly lower or higher than 100. Thus if there is an assignable cause for the difference then it is significant, which forms the alternative hypothesis.
Note that in mathematics $100 < 105$, but in statistics $100 \approx 105$. In mathematics we take 100 and 105 as final results which are not repeated. No experiment is carried out for seeking any change. However, if we take 100 and 105 as the sale of a newspaper, which is an uncertain event, then both can be said to be approximately equal, or the difference is insignificant. This forms our null hypothesis.
ALTERNATIVE HYPOTHESIS
The alternative hypothesis is opposite to the null hypothesis. Both cannot be true at the same time. Similarly both cannot be false at the same time. Hence rejection of the null hypothesis means acceptance of the alternative hypothesis, or vice versa. However this rejection or acceptance is made with some chance of error. This error is known as the level of significance ($\alpha$-level). It is expressed in terms of % or as a value. Generally accepted $\alpha$-values are 1%, 5% or 10% (0.01, 0.05, 0.10). This means there is a chance that the rejection or acceptance is wrong. This point is explained in detail [ ]. Here the alternative hypothesis means the difference is significant.
Ha: Difference is significant.
Now read the examples of null hypotheses again, making each exactly opposite. You will notice that the inner meaning of these statements indicates a higher, noticeable, significant difference. Both values (before and after, arithmetic means of two samples, variances, standard deviations, proportions) are not equal.
Thus the words (adjectives) used for null and alternative hypotheses are listed below.
| Null hypothesis | Alternative hypothesis |
| Equal | Not equal |
| Not effective | Effective |
| No difference | Difference |
| Insignificant | Significant |
| Same | Not same |
| Independent | Dependent |
| Random | Not random |
| Uncorrelated | Correlated |
| Stable/Uniform | Not stable/uniform |
| Zero | Not zero |
In all the above cases the alternative hypothesis is made exactly opposite to the null hypothesis. However this is not always the case. In the example that the breaking strength of copper wire is 500 kg, it is stated as below.
Ho: $\bar{x} = \mu = 500$ kg
Ha: $\bar{x} \neq \mu$
The alternative hypothesis does not by itself indicate whether to accept the lot of copper wire. Here $\bar{x}$ may lie above 500 kg or below 500 kg. Hence the values expected are non-directional (upper side or lower side). A significantly higher value, say 560 kg, or a significantly lower value, say 440 kg, are both not accepted in this case. This is known as a two-tailed hypothesis.
In the 1st example of drug/treatment it is expected that the blood pressure after treatment should be reduced. The drug/treatment is not expected to increase the blood pressure, as that is against our purpose. Thus it is a one-tailed hypothesis. Here the alternative hypothesis will be that the drug is effective in reducing blood pressure.
In the 3rd example of the special coaching class it was expected that marks will be increased. Thus here ‘effective’ means marks are increased. This is also a one-tailed hypothesis.
(please refer two-tailed and one-tailed test of hypothesis and its region of acceptance in--------)
In short it can be stated as below:
a) Ho: $\bar{x} = \mu = 500$ kg, Ha: $\bar{x} > \mu$ (right-side region of rejection)
b) Ho: $\bar{x} = \mu = 500$ kg, Ha: $\bar{x} < \mu$ (left-side region of rejection)
c) Ho: $\bar{x} = \mu = 500$ kg, Ha: $\bar{x} \neq \mu$ (both-side region of rejection)
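The three regions of rejection can be sketched as a small decision function. This is only an illustration: the function name `reject_null` and its `tail` labels are hypothetical, and `t_crit` is assumed to be the positive table value already looked up for the chosen level of significance and tail type (the one-tailed and two-tailed table values differ).

```python
def reject_null(t_cal: float, t_crit: float, tail: str) -> bool:
    """Return True when the calculated statistic falls in the region of rejection.

    t_crit is the positive critical value read from the table for the chosen
    level of significance and for the same tail type.
    """
    if tail == "right":   # Ha: mean > hypothesised value
        return t_cal > t_crit
    if tail == "left":    # Ha: mean < hypothesised value
        return t_cal < -t_crit
    if tail == "two":     # Ha: mean != hypothesised value
        return abs(t_cal) > t_crit
    raise ValueError("tail must be 'right', 'left' or 'two'")


# A large negative statistic rejects Ho for a left-tailed or two-tailed test,
# but not for a right-tailed test.
print(reject_null(-2.5, 2.262, "two"), reject_null(-2.5, 1.833, "right"))
```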
HYPOTHESIS TESTING METHODOLOGY:
On the basis of data collection we carry out some mathematical calculations to check which hypothesis is to be accepted. This is known as hypothesis testing. The acceptance/rejection of the hypothesis should not be done on personal feeling or subjectively, because with differences in opinion the decision can be different. Hence there is a unique method based on logic to test the hypothesis. In this method the given data or parameters are treated to find out whether the difference is significant or not. In a nutshell the test will determine which hypothesis should be accepted. Thus the decision system will be uniform and there is no chance of personal bias (opinion) in the decision. Irrespective of the type of test, the following general methodology should be adopted for hypothesis testing.
a) Statement of the hypotheses Ho and Ha to be tested
b) Selection of the type of distribution for testing
c) Using the appropriate formula, find out the test statistic
d) Assumption of level of significance: 1%, 5%, 10%
e) Finding out the degrees of freedom if necessary
f) Referring to the table of the distribution, find out the critical value
g) Compare the calculated and critical values/statistics
h) Make the decision of acceptance or rejection of the hypothesis
i) Write the conclusion about the action/decision
Brief description of the methodology:
a) Statement of the hypothesis in words or numerical form, based on the theme of the question or research intention.
Ho: $\bar{x} = \mu = 80$, Ha: $\bar{x} \neq \mu$ or Ha: $\bar{x} > \mu$ or Ha: $\bar{x} < \mu$
b) t-distribution, Z-distribution (normal distribution / large sample test), $\chi^2$ (chi-square distribution), F-distribution (for analysis of variance), uniform distribution, etc. This selection shall be on the basis of the type of data and what is to be tested.
c) There are different formulae for different tests. It depends upon the type of data also. This involves calculation of arithmetic mean, standard deviation, variance, standard error etc. depending upon requirement.
d) If not given, assume the middle figure i.e. 5% level of significance. However this assumption is rough; it may be revisited on the basis of the difference between the calculated and critical (or table) values. Best advice is to consider all three levels of significance for making the decision.
e) Degrees of freedom is the number of categories less by one (n-1). However, when comparing with a Poisson distribution it is (n-2), and for comparison with a normal distribution it is (n-3). In case of a two-way table (contingency table) it is (r-1)(c-1), where r and c denote the number of rows and columns.
f) Now a printed table/book of the ‘distribution’ is referred to find out the critical value. It is supplied in the examination or allowed to be carried. Sometimes it is part of the question paper. We can also find this value using an MS Excel spreadsheet.
g) Suppose we take the t-test, then compare:
tcal < tcrit (or t table): the decision is to accept Ho. In other words there is no evidence to reject Ho.
tcal > tcrit (or t table): the decision is to reject Ho. In other words there is no evidence to accept Ho.
h) Write down the accepted hypothesis statement.
i) Support/substantiate any action/decision as expected in the study.
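The steps above can be walked through for the copper wire example with a one-sample t-test. This is a minimal sketch under stated assumptions: the ten breaking strengths are hypothetical values invented for illustration (the source gives no sample data), and the critical value 2.262 is the two-tailed 5% entry for 9 degrees of freedom from a standard printed t-table.

```python
import math
import statistics

# Hypothetical breaking strengths (kg) of 10 sampled copper wires; not from the source.
sample = [498, 505, 502, 495, 510, 499, 503, 497, 506, 500]

# Step a) Ho: mu = 500 kg, Ha: mu != 500 kg (two-tailed).
hypothesised_mean = 500

# Step c) Test statistic: t = (x_bar - mu) / (s / sqrt(n)).
n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample SD, uses (n - 1) in the denominator
standard_error = s / math.sqrt(n)
t_cal = (x_bar - hypothesised_mean) / standard_error

# Step e) Degrees of freedom.
df = n - 1

# Steps d) and f) Assumption: 5% level, two-tailed table value for df = 9.
T_CRITICAL_5PCT_DF9 = 2.262

# Steps g) and h) Compare and decide.
if abs(t_cal) > T_CRITICAL_5PCT_DF9:
    decision = "reject Ho"
else:
    decision = "no evidence to reject Ho"

print(f"x_bar = {x_bar}, t_cal = {t_cal:.3f}, df = {df}: {decision}")
```

With these made-up values the sample mean is 501.5 kg and the calculated t is well below the table value, so the lot would be treated as having a breaking strength of 500 kg.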
CENTRAL TENDENCY
Central tendency indicates the single representative value for the distribution under consideration. Thus it is the most suitable value which can be used for the whole data. For this purpose we generally use the average or arithmetic mean of the given data. However it is not the only measure, and it does not suit all types of data.
The purpose of finding out a single representative number is to facilitate comparison of two or more groups. Many times it becomes difficult or impossible to memorise all the data, hence a ‘single number’ serves the purpose. Certain characteristics of a measure of central tendency are as below.
a) It should be easy to calculate from the given data.
b) It should be easy to understand.
c) It should be a single representative number.
d) It should be able to absorb corrections/additions.
e) It should be based on all observations. However at the same time it should not be affected by extremely small or large values.
The following list shows different measures of central tendency and their use.
a) Arithmetic mean: For data having a single unit such as kg, meter, cm3, cm2, etc. It is based on all numbers. Easy to understand and manipulate. Mostly used.
b) Mode: It shows the number most repeated in the series. It is useful when the calculated value of the average has no meaning, for example the most sold size of shoes/undergarments.
c) Median: If the given numbers are arranged in ascending or descending order then the middle number is the median. It is used to locate the point below and above which 50%-50% of the observations exist. It is not affected by extreme observations. It is fairly constant.
d) Geometric mean: It is used for calculating index numbers and to find out the average % increase or decrease over a period. It is also used for finding an average when lower weightage is given to higher values and higher weightage to lower values.
e) Harmonic mean: It is used to find out the average when the unit of data is compounded, such as km/hour, jobs/hour etc., for example speed related problems.
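The five measures listed above can be tried out with Python's standard `statistics` module. The data sets below are hypothetical and chosen only so that each measure has a natural use.

```python
import statistics

# Hypothetical shoe sizes sold in a day: the mode is the meaningful "average" here.
shoe_sizes = [7, 8, 8, 9, 8, 10, 7, 8]
most_sold_size = statistics.mode(shoe_sizes)

# Hypothetical marks: the extreme value 100 pulls the mean up but not the median.
marks = [40, 45, 50, 55, 100]
mean_marks = statistics.mean(marks)
median_marks = statistics.median(marks)

# Growth factors for two periods (+25% then +80%): geometric mean gives the
# average growth factor per period.
growth_factors = [1.25, 1.80]
average_growth = statistics.geometric_mean(growth_factors)

# Equal distances covered at 40 km/hour and 60 km/hour: harmonic mean gives
# the average speed for the whole journey.
speeds = [40, 60]
average_speed = statistics.harmonic_mean(speeds)

print(most_sold_size, mean_marks, median_marks)
print(round(average_growth, 4), round(average_speed, 4))
```

Note how the mean of the marks (58) is higher than four of the five values, while the median (50) is not affected by the extreme observation.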
Median: 0% | 50% | Median | 50% | 100% (50% of the observations lie below the median and 50% above it)
Q1 = (1/4): 0% | 25% | Q1 | 75% | 100%
Q3 = (3/4): 0% | 75% | Q3 | 25% | 100%
D6 = (6/10): 0% | 60% | D6 | 40% | 100%
P35 = (35/100): 0% | 35% | P35 | 65% | 100%
Q1 = Quartile 1, Q3 = Quartile 3, D6 = 6th decile and P35 = 35th percentile
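Quartiles, deciles and percentiles are all cut points of the ordered data, and can be sketched with `statistics.quantiles` from the Python standard library. The data (the numbers 1 to 99) are hypothetical, picked so that the cut points are easy to check by eye; the function's default "exclusive" method is used.

```python
import statistics

# Hypothetical ordered data: 1, 2, ..., 99.
data = list(range(1, 100))

quartiles = statistics.quantiles(data, n=4)      # three cut points: Q1, median, Q3
deciles = statistics.quantiles(data, n=10)       # nine cut points: D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # 99 cut points: P1 ... P99

q1, median, q3 = quartiles
d6 = deciles[5]        # 6th decile (list index starts at 0)
p35 = percentiles[34]  # 35th percentile

print(q1, median, q3, d6, p35)
```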
Relationship among mean, median and mode: for a perfectly normal / bell-shaped / symmetric distribution mean = median = mode. However for other (not normal / bell-shaped / symmetric) distributions Mode = 3 Median - 2 Mean. This relationship gives an approximate value of the unknown. If any two are known the third can be calculated using this formula.
In case of arithmetic mean (AM), geometric mean (GM) and harmonic mean (HM) the relationship is as below:
1) $AM \geq GM \geq HM$
2) $GM^2 = AM \times HM$ (exact for two positive values)
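For two positive values the second relation can be checked directly. The symbols $a$ and $b$ are introduced here only for illustration:

```latex
\[
AM \times HM
  = \frac{a+b}{2} \times \frac{2ab}{a+b}
  = ab
  = \left(\sqrt{ab}\right)^{2}
  = GM^{2}
\]
```

For example, with $a = 4$ and $b = 9$: $AM = 6.5$, $GM = 6$ and $HM = 72/13 \approx 5.54$, so $AM \geq GM \geq HM$ and $AM \times HM = 36 = GM^2$.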
SAMPLE SPACE
It is the totality of all possible events from an experiment. It is the collection of all outcomes of an experiment. Suppose we toss a coin; then the possible outcome will be Head or Tail. In case the coin remains in standing position (vertical) then discard that experiment and toss the coin again. However it is a very rare event and possible in films only. Now if we denote H for Head and T for Tail then our sample space will be S = {H, T}. The sample space is generally denoted by S. The size of the sample space is the number of items in it; here it is 2.
Now if we toss 2 coins at a time or one coin two times then the possible set of outcomes will be as below:
S = {TT, TH, HT, HH}
Thus the sample space contains 4 items representing the possible events. If we throw a die having six faces the outcomes will be S = {1, 2, 3, 4, 5, 6}.
If two dice are thrown at the same time the sample space will contain 36 items as below.
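Pairs of outcomes of the two dice:
| | 1 | 2 | 3 | 4 | 5 | 6 |
| 1 | (1,1) | (1,2) | (1,3) | (1,4) | (1,5) | (1,6) |
| 2 | (2,1) | (2,2) | (2,3) | (2,4) | (2,5) | (2,6) |
| 3 | (3,1) | (3,2) | (3,3) | (3,4) | (3,5) | (3,6) |
| 4 | (4,1) | (4,2) | (4,3) | (4,4) | (4,5) | (4,6) |
| 5 | (5,1) | (5,2) | (5,3) | (5,4) | (5,5) | (5,6) |
| 6 | (6,1) | (6,2) | (6,3) | (6,4) | (6,5) | (6,6) |
Totals of the two dice:
| | 1 | 2 | 3 | 4 | 5 | 6 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |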
Application:
Now let us understand the use of this sample space. It is used to know the possible number of outcomes of an experiment. It also defines the probability of happening of an event. In case of 2 coins tossed at a time the probability of getting both heads is 1/4. Four indicates the number of items in the sample space and 1 is the ‘event of arrival of HH’.
In the case of two dice thrown at a time, what is the probability of getting a total of 3? Here we observe the ‘total 3’ 2 times, which means favourable events are 2 out of 36. Thus the probability will be 2/36. This is the probability of happening of the event ‘3’, or simply the probability of getting 3. Similarly the total 4 appears 3 times, so the probability of getting 4 is 3/36.
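The counting argument above can be sketched in code by listing the whole sample space and counting the favourable sample points. The helper name `probability_of_total` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: 36 ordered pairs (first die, second die).
faces = range(1, 7)
sample_space = list(product(faces, repeat=2))


def probability_of_total(total: int) -> Fraction:
    """Favourable sample points divided by all sample points."""
    favourable = [pair for pair in sample_space if sum(pair) == total]
    return Fraction(len(favourable), len(sample_space))


p_total_3 = probability_of_total(3)  # favourable points: (1,2) and (2,1)
p_total_4 = probability_of_total(4)  # favourable points: (1,3), (2,2) and (3,1)

print(len(sample_space), p_total_3, p_total_4)
```

`Fraction` prints the results in lowest terms, so 2/36 appears as 1/18 and 3/36 as 1/12.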
Each individual item included in the sample space is known as a sample point. According to the size or number of sample points there are two classes of sample space: 1) finite sample space and 2) infinite sample space.
Sample space: finite or infinite
Infinite sample space: countably infinite or uncountably infinite
Finite sample space:
If the number of sample points or items in the sample space is definite then it is called a finite sample space. In this case the number of outcomes is known. In the example of throwing a die the sample space contains six numbers. In case of throwing two dice the sample space contains 36 points. Thus the number of items included in the sample space is well known.
We need not perform any kind of experiment for this purpose. Another example: patients coming to the hospital are categorised as male (m) and female (f). Also according to their condition they are categorised as serious (s), normal (n) and can be delayed (d). The sample space will be
S = {ms, mn, md, fs, fn, fd}
Infinite sample space:
As the name implies, infinite means countless or immeasurable. Thus we cannot list the whole sample space at the start. The sample space is unlimited. However in some cases it is countably infinite, which means it is possible to count the events occurring before the expected event. Here let us take an example of tossing a coin. If we decide to toss the coin until a head occurs then the sample space will be as below:
S = {H, TH, TTH, TTTH, TTTTH, TTTTTH, …}
Thus we can take the support of natural numbers 0, 1, 2, 3, 4, … for the number of tails before the occurrence of a head. In the above case, if a head occurs the experiment is closed. Thus TTTTHT cannot be a point in the sample space; that experiment ends at TTTTH. Once the head is up the experiment is finished. Thus the question is how many times we need to toss the coin to get a head. It is indefinite, but before the occurrence of the head there should be only tails. Thus tails have to occur 0, 1, 2, 3, … times before getting a head. This situation is shown in the sample space.
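A countably infinite sample space cannot be written out in full, but its points can be generated one after another, each matched with a natural number. The sketch below uses a hypothetical generator name, `toss_until_head`, and prints only the first six points.

```python
from itertools import count, islice


def toss_until_head():
    """Yield the sample points H, TH, TTH, ... one per possible number of tails."""
    for tails in count(0):  # 0, 1, 2, 3, ... tails before the first head
        yield "T" * tails + "H"


first_six = list(islice(toss_until_head(), 6))
print(first_six)
```

Every point ends with exactly one H, so a sequence such as TTTTHT is never produced.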
A second example of a countably infinite sample space is:
Patients coming to the hospital are denoted as M for male and F for female. If we continue the experiment until a female comes then the sample space will be as below.
S = {F, MF, MMF, MMMF, MMMMF, …}
If male and female are again categorised as child (C) and adult (A), and we continue the experiment until an adult female comes, then the sample space will be as below:
S = {AF, AMAF, CMAF, CFAF, AMAMAF, CMCMAF, …}
Uncountably infinite sample space:
In this case we cannot count the sample points. Instead the sample space is defined by a lower and upper limit, such as 10 to 100. An example is the marks of students, which range from 0 to 100. It is written as below:
S = {X: 0 < X < 100}. This is read as ‘X is such that it varies from 0 to 100’.
If there is no upper limit, the sample space may range from 0 to $\infty$.
Also the life of a bulb may range from 0 to 1000 hours. Thus we cannot count it exactly, and if we tried it would become an unnecessary, unfruitful exercise. Instead it can be best presented using an interval, or lower and upper limit. Here if X denotes the number of burning hours of the bulb then S = {X: 0 < X < 1000 hours}
Discrete and continuous sample space: Discrete means definite values and continuous means any value within a given range. The number of rooms / children / peas in a pod are discrete numbers. These are definite numbers such as 0, 1, 2, 3, 4, … etc. A continuous sample space consists of all values within the given range, i.e. 10 to 20, 0 to 100 etc. The continuous sample space is an uncountably infinite sample space. We cannot list its exact values or its sample points. Examples are the income of a person / family, the life of a mobile phone battery, and rainfall in a city or area. We can express these as a range only; no definite number can be decided on the basis of historical data. As there is no limit of precision we cannot exactly define its sample points.
Discrete means finitely many or countably infinite elements. Thus it includes finite as well as countably infinite sample spaces.
EVENTS:
An event is a part of the sample space containing one or more outcomes. We know that the sample space is the collection of all possible outcomes. Now an event can be a single outcome or a group of many outcomes, as we expect or want. In a random experiment, if the sample space and possible outcomes are known in advance (before the experiment) then the outcome is known as an event. In an experiment of tossing a coin, ‘H’ and ‘T’ are events. The following examples will again explain the term.
1) If a die is thrown then the sample space S = {1, 2, 3, 4, 5, 6}
The event of getting an even number is {2, 4, 6}
The event of getting a number less than 3 is {1, 2}
The event of getting a number more than 4 is {5, 6}
2) In a set of alphabets the sample space contains 26 points, S = {A, B, C, …, Z}. The event of getting a vowel is {A, E, I, O, U}
Now let us know the types of events. There are 9 types of events as mentioned below:
1) Certain event
2) Void / impossible event
3) Mutually exclusive or disjoint events
4) Independent events
5) Dependent events
6) Equally likely events
7) Exhaustive events
8) Complementary event
9) Simple and compound events
1) As the name implies, when the happening of an event is certain or sure then it is known as a ‘certain event’. In this experiment no other event is available. Occurrence of head or tail after tossing a coin is a certain event. The child may be a boy or a girl. The outcome of the die will be in S = {1, 2, 3, 4, 5, 6}. Thus if an event contains all sample points in the sample space then it is a ‘certain’ event.
2) It is exactly opposite to the above event. When there is no such sample point in the sample space then it is called an impossible or void event. For example, the occurrence of 7 or 8 in the die example is not possible. A number above 6 is not in the set S = {1, 2, 3, 4, 5, 6}.
3) If two given events cannot happen at the same time then they are called mutually exclusive events. The probability of both events happening together is zero in this case. Either the 1st or the 2nd event will occur at a time. A person cannot belong to the male and female category at the same time. Head and tail cannot both be the outcome of tossing a coin. We should use ‘or’ instead of ‘and’ to make it meaningful. A person belongs to the male or female category. The outcome of tossing a coin is either head or tail. As the joint possibility is nil in this case these are known as disjoint events. The following figure shows disjoint events:
[Figure: sample space S containing two non-overlapping events A and B]
Event A is the occurrence of a number less than 4. Thus EA = {1, 2, 3}. Event B is the occurrence of a number equal to or more than 4. Thus EB = {4, 5, 6}. P(AB) = 0: the probability of happening of both event A and event B is zero.
4) An event is said to be independent when its occurrence is not influenced by any other event. The experiment itself possesses the same condition every time. In tossing a die there are six faces every time; we cannot delete or add any face. Hence the occurrence of any number is independent of the previous outcome. Similarly, if balls (numbered 1 through 10) having similar physical characteristics are placed in a pot (urn), the blind selection at each time is independent of the previous outcome, provided that we have replaced the selected ball in the pot. In another case, the rainfall on a day is not certainly independent of the previous day's rainfall. It may or may not rain following a rainy day. Thus these events are not independent. We cannot say with certainty that they are dependent or independent. The opposite of independent is ‘not independent’ and not ‘dependent’.
5) A dependent event is associated with a previously happened event. In the above case of balls numbered 1 to 10, if we do not replace the ball then the probability of getting selected changes every time. The first time, the probability of getting 5 is 1/10, as there is only one such ball in 10. If we don't get ‘5’ and continue with a second experiment without replacement, then the probability of getting ‘5’ will be one in nine or 1/9. In subsequent experiments continued under similar conditions the probability will be 1/8, 1/7, … till getting ‘5’. Thus the probability of ‘5’ changes every time as we have not replaced the ball. This is not possible in case of a die as we cannot remove a face.
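The changing probability without replacement, against the constant probability with replacement, can be listed in a few lines. The function name `chance_of_five_on_each_draw` is hypothetical, and each value is the probability of drawing ball ‘5’ on that draw given that it has not appeared on any earlier draw.

```python
from fractions import Fraction


def chance_of_five_on_each_draw(total_balls: int = 10) -> list:
    """Probability of drawing ball '5' on each successive draw when drawn balls
    are not replaced and '5' has not yet appeared."""
    return [Fraction(1, remaining) for remaining in range(total_balls, 0, -1)]


without_replacement = chance_of_five_on_each_draw()  # 1/10, 1/9, 1/8, ...
with_replacement = [Fraction(1, 10)] * 10            # always 1/10: independent draws

print([str(p) for p in without_replacement[:4]])
print([str(p) for p in with_replacement[:4]])
```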
6) In an experiment of tossing a coin we may expect the occurrence of a head, but we cannot justify our expectation. There is no reason why a head should appear after a toss. This indicates that the coin is fair or unbiased and it may give head or tail. The events of occurrence of head and tail are equally likely. In case the coin is tampered to fall head, and we have historical data that the head has appeared 60% of the time, then the events are not equally likely. Throwing a die can also be explained in the same way. Getting a defective product and getting a good product are not equally likely events, as the company does much more to produce good products.
7) When all possible sample points in a sample space are considered then they are ‘exhaustive’ events. The number of exhaustive events in tossing a coin is 2. In case of throwing a die it is 6; for two dice it is 36.
8) Let us take the example of balls numbered 1 through 10. The sample space will be S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Let A be the subset having even numbers, A = {2, 4, 6, 8, 10}. B is the subset having odd numbers, B = {1, 3, 5, 7, 9}. The points in subsets A and B together form the sample space S. However there is no common point in A and B. Therefore B is said to be the complementary event of A, or vice versa. It is shown in the following figure.
[Figure: sample space S divided into event A and its complement B]
$A^c$ or $A'$ means the complement of event A. The complement of the complement is the event itself. Event B consists of those sample points which do not belong to event A.
9) A simple event is an individual event not linked with the happening of another event. For example, the probability of getting ‘6’ in throwing a die is 1/6. If a class has 20 students, 12 boys and 8 girls, the probability that a randomly selected student will be a girl is 8/20 or 0.4. On the basis of historical data also we can predict the probability of happening of an event, e.g. the probability that a sales manager is in Mumbai today is 0.8.
A compound event is a combination of two or more events in sequence. The probability of getting ‘6’ in two successive throws of a die is 1/6 * 1/6 = 1/36. The probability of selecting two girl students in sequence is 8/20 * 7/19 = 56/380, about 0.147. The probability that the sales manager is in Mumbai on two successive days is 0.8 * 0.8 = 0.64, assuming the two days are independent. Thus there are two or more events occurring in sequence.
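The compound probabilities above are products of the probabilities of the individual steps, which can be checked with exact fractions. The variable names are illustrative, and the sales manager figure assumes that the two days are independent.

```python
from fractions import Fraction

# Two successive throws of a die: the throws are independent.
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)

# Two girls selected in sequence from 12 boys and 8 girls, without replacement:
# after one girl is chosen, 7 girls remain among 19 students.
p_two_girls = Fraction(8, 20) * Fraction(7, 19)

# Sales manager in Mumbai on two successive days (assumed independent).
p_manager_two_days = 0.8 * 0.8

print(p_two_sixes, p_two_girls, round(float(p_two_girls), 3), round(p_manager_two_days, 2))
```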
Coefficient of variation or variability:
Standard deviation is an absolute measure of variability. It takes no account of the size of the mean. The (arithmetic) mean can be the same with different standard deviations, or vice versa. It is an internal measure of variability. If we want to compare the variability of two or more groups then we should use COV and not SD. COV = SD/mean*100. Thus the mean is taken into account in COV. Hence it is a better measure for comparison purposes. It is expressed in terms of percentage.
The higher the COV, the higher is the variability and the lower the homogeneity, uniformity and stability. In contrast, the lower the COV, the lower is the variability. Example: the boys group has average marks 60 with SD 20. The girls group has average marks 50 with SD 12.5. Then the COV of the boys group = 33.33% and that of the girls group is 25%. This indicates the girls group is more consistent as regards marks. Here we should not interpret variability on the basis of SD only. Refer to the following case and interpret.
Boys group: $\bar{x}_1 = 80$, $\sigma_1 = 20$
Girls group: $\bar{x}_2 = 50$, $\sigma_2 = 16$
Mean and SD should be calculated as usual using the appropriate formula for individual, discrete or continuous series as the case may be.
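Both cases can be worked through with the formula COV = SD/mean*100. The function name `coefficient_of_variation` is illustrative only.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """COV = SD / mean * 100, expressed as a percentage."""
    return sd / mean * 100


# First case from the text: boys (mean 60, SD 20), girls (mean 50, SD 12.5).
cov_boys_1 = coefficient_of_variation(60, 20)
cov_girls_1 = coefficient_of_variation(50, 12.5)

# Second case from the text: boys (mean 80, SD 20), girls (mean 50, SD 16).
cov_boys_2 = coefficient_of_variation(80, 20)
cov_girls_2 = coefficient_of_variation(50, 16)

print(round(cov_boys_1, 2), round(cov_girls_1, 2))
print(round(cov_boys_2, 2), round(cov_girls_2, 2))
```

In the second case the boys group has the higher SD (20 against 16) but the lower COV (25% against 32%), so judging by SD alone would point to the wrong group as the less consistent one.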
ANOVA
ANOVA is the short form of ‘Analysis of Variance’. It is also known as the F-test. This test was devised by Prof. R. A. Fisher in 1920. Variance is the square of the standard deviation. However, the sample size being less than 30, we use (n-1) in the denominator instead of ‘n’. Thus variance $S^2 = \sum (X - \bar{X})^2 / (n-1)$. This test is useful to find out the overall effect of the treatment. It determines whether the ‘treatment’ effect on the change in data is significant or not.
Let us assume that we have sown 4 kinds of wheat seeds (A, B, C, D) on 5 acres of land each. The production of wheat measured on each acre of land is as below.
| | A | B | C | D |
| 1 | 10 | 11 | 12 | 10 |
| 2 | 8 | 10 | 13 | 12 |
| 3 | 9 | 12 | 15 | 14 |
| 4 | 10 | 13 | 9 | 15 |
| 5 | 11 | 12 | 10 | 14 |
Now we can see that there is a difference within each column representing the same type of seed. This variation is not explainable and is supposed to be a random effect. There is also a difference in yield from ‘column to column’. This can be the effect of the change in seed type. Thus the difference from column to column is due to the treatment effect of ‘sowing different seed’. Now the question is whether this variation is significant or not. Does the type of seed contribute significantly to the higher yield? This is tested using ANOVA.
Defectives produced in shifts, effect of fertilizer on yield of crop, productivity of different machines, scores obtained by different groups of students, performance (sales) of various salesmen etc. can be tested using ANOVA.
In the two-sample t-test we test the significance of the difference in the means of two groups. If the number of groups is more than 2 then the t-test has to be carried out many times, in pairs. The overall conclusive result is given by ANOVA, which avoids many calculations of the t-test. ANOVA is used when the number of groups is more than 2. In case of 2 groups the F-test is also used to find out the significance of the difference in variance and not mean.
Here
we note down the basic concept that, ‘means can be same with different
variance’ and ‘variance can be same with different means’. Thus variance has no
relevance with mean. Variance of ’11, 12, 13, 14 and 15 is same as of ’51, 52,
53, 54 and 55’. There is much difference in the mean we know (13 and 53). In
two-sample t-test the means of two groups are tested for equality or
non-equality. In ANOVA ‘variance’ of two group is tested for equality or
non-equality.
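The point that two series can share a variance while having very different means can be verified directly. A small sketch using the two series quoted above:

```python
import statistics

group_low = [11, 12, 13, 14, 15]
group_high = [51, 52, 53, 54, 55]

mean_low = statistics.mean(group_low)
mean_high = statistics.mean(group_high)

# Sample variance, i.e. divisor (n-1)
var_low = statistics.variance(group_low)
var_high = statistics.variance(group_high)

print("means:", mean_low, mean_high)
print("variances:", var_low, var_high)
```

Adding a constant (here 40) to every observation shifts the mean but leaves every deviation from the mean, and hence the variance, unchanged.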
One way ANOVA:
The given case of yield of wheat is one way ANOVA, because it has only one treatment. ‘Changing the type of seed’ is our treatment, which can manipulate (increase/decrease) the yield of wheat. The other direction, i.e. the row ‘number of land’, is taken to have no effect on yield, as we have assumed that the land characteristics are the same. Thus the effect of different land on the yield is random and not a cause of variance (it is ignored).
It is suggested that in one way ANOVA the source of variation should be placed in the columns. If in an example it is given in rows, we should transpose the data (turn rows into columns) for convenience.
In one way ANOVA the total variance is due to ‘random effect’ and ‘treatment effect’. Ss is the term used for ‘sum of squares’. It is actually the sum of squares of the differences of the data from their mean, in other words $\sum (X - \bar{X})^2$.
Total Ss = Ss due to treatment + Ss due to random effect.
Degree of freedom in one way ANOVA: for column effect it is (C-1). The total degree of freedom is (n-1), i.e. sample size - 1. The df of error (or random effect) is [(n-1) - (C-1)]. In our case df for columns = (4-1) = 3. The total df = (20-1) = 19. The df of error = 19 - 3 = 16.
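The split of the total Ss into a treatment part and a random part can be checked on the wheat data. The following is a sketch only; the variable names are chosen for illustration.

```python
# Wheat yields per acre, one list per seed type (columns of the data table)
yields = {
    "A": [10, 8, 9, 10, 11],
    "B": [11, 10, 12, 13, 12],
    "C": [12, 13, 15, 9, 10],
    "D": [10, 12, 14, 15, 14],
}

all_values = [x for column in yields.values() for x in column]
n = len(all_values)          # 20 observations
c = len(yields)              # 4 columns (seed types)
grand_mean = sum(all_values) / n

# Total Ss: squared deviations of every observation from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Ss due to treatment: column means against the grand mean
ss_treatment = sum(
    len(col) * (sum(col) / len(col) - grand_mean) ** 2 for col in yields.values()
)

# Ss due to random effect: observations against their own column mean
ss_random = sum(
    (x - sum(col) / len(col)) ** 2 for col in yields.values() for x in col
)

df_total = n - 1
df_treatment = c - 1
df_random = df_total - df_treatment

print(ss_total, ss_treatment, ss_random)
print(df_total, df_treatment, df_random)
```

Up to floating-point rounding, the treatment and random parts add up to the total, and the degrees of freedom come out as 19, 3 and 16.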
ANOVA TABLE
Source of variation | Ss values | Df | Ms (mean square) | F calculated | F critical
Ss due to column effect | X | n1 | X/n1 = E1 | E1/E2 | (n1, n2) at α level 0.05 or 0.01
Ss within | Y | n2 | Y/n2 = E2 | |
Total Ss | X + Y | n1 + n2 | | |
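To see the table filled with numbers, the mean squares E1 and E2 and the calculated F can be worked out for the wheat data. This is a sketch; `one_way_anova` is a helper name invented here, not a library function.

```python
def one_way_anova(groups):
    """Return (E1, E2, F) for a list of samples, one sample per column."""
    values = [x for g in groups for x in g]
    n, c = len(values), len(groups)
    grand_mean = sum(values) / n

    # X: Ss due to column effect, Y: Ss within columns
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (c - 1)   # E1 = X / n1
    ms_within = ss_within / (n - c)     # E2 = Y / n2
    return ms_between, ms_within, ms_between / ms_within


seeds = [
    [10, 8, 9, 10, 11],     # A
    [11, 10, 12, 13, 12],   # B
    [12, 13, 15, 9, 10],    # C
    [10, 12, 14, 15, 14],   # D
]

e1, e2, f_cal = one_way_anova(seeds)
print(round(e1, 3), round(e2, 3), round(f_cal, 3))
```

The calculated F is then compared with the table value of F for (3, 16) degrees of freedom at the chosen level of significance.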
Hypothesis in one way ANOVA:
Ho: Arithmetic means of samples are equal.
Ha: Arithmetic means of samples are not equal.
Ho: Variation in the yield due to change in seed type is insignificant.
Ha: Variation in the yield due to change in seed type is significant.
Ho: Population means are equal.
Ha: At least one population mean is different from the overall population mean.
Ho: $\mu_1 = \mu_2 = \mu_3$
Ha: $\mu_1 \neq \mu_2 \neq \mu_3$ (the means are not all equal)
Ho: $\sigma_1^2 = \sigma_2^2$
Ha: $\sigma_1^2 \neq \sigma_2^2$
Assumptions in ANOVA:
1) Assumption of normality:
It is assumed in this test that the data are normal, i.e. they give a bell shape if we plot their graph. In other words, if we arrange the data in ascending order and plot them on normal probability paper, they will lie on a straight line. There are many methods to test normality, such as the $\chi^2$ goodness-of-fit test. However, we cannot guarantee that the data are perfectly normal. It is also not a necessary condition. Researchers have observed that a small deviation from normality does not make the F-test invalid. However, for very skewed distributions (skewed to the right or to the left) or multimodal series it is not applicable.
2) Assumption of randomization:
There should be no bias in carrying out the experiment. Each sample should be selected in a random fashion and independently. This means the selection of one sample should have no effect on the selection of the next. Observations are called independent when they are not correlated with time, space or sampling unit. If we have four types of seed, then the selection of seed for each plot should be done independently and randomly, as in the following layout:
A | C | D | B
D | B | A | C
B | A | D | D
C | B | C | A
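A random layout of this kind can be produced by shuffling the seed labels. The sketch below assumes 16 plots with 4 plots per seed type; the fixed random seed is there only to make the run repeatable.

```python
import random

seed_types = ["A", "B", "C", "D"]
plots_per_seed = 4            # assumption: 16 plots, 4 for each seed type

rng = random.Random(42)       # fixed seed only so the sketch is reproducible

# One label per plot, then shuffle so no plot's seed depends on its position
assignment = seed_types * plots_per_seed
rng.shuffle(assignment)

# Print the assignment as a 4 x 4 field layout
for start in range(0, len(assignment), 4):
    print(" | ".join(assignment[start:start + 4]))
```

Every seed type still appears on the same number of plots; only the positions are left to chance.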
3) Equality of variances:
The group means may be different but the variances should be equal. In other words it is assumed that $\sigma_1^2 = \sigma_2^2 = \sigma_3^2$, if these represent the variances of the three groups to be tested. Bartlett's test for homogeneity (equality) of variances can be applied to test equality.
For equal variances the tail-end probabilities of the samples are equal, hence comparison can be made. In general language we can say that there should be an equal distribution of data on both sides of the mean (centre line).
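Before a formal test, a first look at the group variances is easy to obtain. The sketch below only lists the sample variance of each seed type from the wheat data and the ratio of the largest to the smallest; it is not Bartlett's test and does not replace it.

```python
import statistics

yields = {
    "A": [10, 8, 9, 10, 11],
    "B": [11, 10, 12, 13, 12],
    "C": [12, 13, 15, 9, 10],
    "D": [10, 12, 14, 15, 14],
}

# Sample variance (divisor n-1) of each group
variances = {seed: statistics.variance(values) for seed, values in yields.items()}

# Crude indicator of how unequal the variances are
ratio = max(variances.values()) / min(variances.values())

for seed, var in variances.items():
    print(seed, round(var, 2))
print("largest / smallest:", round(ratio, 2))
```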
TWO-WAY ANOVA:
In the given case of yield of wheat, if we have changed the fertilizer level as well as the seed type then there are two treatments, hence the analysis is known as two-way ANOVA. Both factors (seed and fertilizer) affect the yield of wheat.
Here there will be hypotheses for the significance of both types of treatment on the production (yield) of wheat. For seed type:
Ho: The effect of seed type on yield of wheat is insignificant.
Ha: The effect of seed type on yield of wheat is significant.
Similarly, for fertilizer level:
Ho: The effect of fertilizer level on yield of wheat is insignificant.
Ha: The effect of fertilizer level on yield of wheat is significant.
Seed type
Fertilizer level | A | B | C | D
f1 | 10 | 11 | 12 | 10
f2 | 8 | 10 | 13 | 12
f3 | 9 | 12 | 15 | 14
f4 | 10 | 13 | 9 | 15
f5 | 11 | 12 | 10 | 14
Thus in this case there can be the following results.
1) Effect of both seed type and fertilizer level is significant.
2) Effect of both seed type and fertilizer level is insignificant.
3) Effect of seed type is significant whereas that of fertilizer level is insignificant.
4) Effect of seed type is insignificant whereas that of fertilizer level is significant.
Variance
of Two-way ANOVA
[Figure: total variability split into variability due to seed type change, variability due to fertilizer change, and random effect]
Thus total Ss = Ss due to column effect + Ss due to row effect + Ss due to random effect.
Degree of freedom in two way ANOVA:
1) For total Ss it is (N-1) or (20-1) = 19.
2) For seed/column effect it is (c-1) or (4-1) = 3, where c = No. of columns.
3) For fertilizer level/row effect it is (r-1) or (5-1) = 4, where r = No. of rows.
4) For random effect (known as residual/balance error) it is 19 - 3 - 4 = 12.
Source of variation | Ss values | Df | Ms | Fcal | F critical (α = 0.05/0.01)
Ss due to column effect | X | (c-1) | Msc = X/(c-1) | Msc/MsBE | V1 = (c-1), V2 = (N-c-r+1)
Ss due to row effect | Y | (r-1) | Msr = Y/(r-1) | Msr/MsBE | V1 = (r-1), V2 = (N-c-r+1)
Balance Error | Z | (N-c-r+1) | MsBE = Z/(N-c-r+1) | |
Total Ss | X+Y+Z | (N-1) | | |
The df of the balance error follows from
$(N-1) - [(c-1) + (r-1)] = N - 1 - c - r + 2 = N - c - r + 1$
Correction factor (C.F.) = $(TG)^2 / N$
Total Ss = $\sum X_{ij}^2 - C.F.$
Ss due to column = $\sum_j T_j^2 / n_j - C.F.$, j = 1, 2, 3, 4
Ss due to row = $\sum_i T_i^2 / n_i - C.F.$, i = 1, 2, 3, 4, 5
B. error = Total Ss - Ss due to column - Ss due to row
where
TG = Grand total
N = No. of all observations
$X_{ij}$ = Individual element
$T_i$ = Sum of ith row, $n_i$ = No. of elements in ith row
$T_j$ = Sum of jth column, $n_j$ = No. of elements in jth column
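The correction-factor formulas can be applied step by step to the seed and fertilizer table. The following is a sketch under the assumption of one observation per cell, as in that table; the variable names are illustrative.

```python
# Rows are fertilizer levels f1..f5, columns are seed types A..D
table = [
    [10, 11, 12, 10],
    [8, 10, 13, 12],
    [9, 12, 15, 14],
    [10, 13, 9, 15],
    [11, 12, 10, 14],
]

r = len(table)        # number of rows
c = len(table[0])     # number of columns
n = r * c             # N, number of all observations

grand_total = sum(sum(row) for row in table)          # TG
cf = grand_total ** 2 / n                             # correction factor

ss_total = sum(x * x for row in table for x in row) - cf

col_totals = [sum(row[j] for row in table) for j in range(c)]   # Tj
row_totals = [sum(row) for row in table]                        # Ti

ss_col = sum(t * t for t in col_totals) / r - cf   # each column has r elements
ss_row = sum(t * t for t in row_totals) / c - cf   # each row has c elements
ss_error = ss_total - ss_col - ss_row              # balance error

df_col, df_row = c - 1, r - 1
df_error = (n - 1) - df_col - df_row               # N - c - r + 1

ms_error = ss_error / df_error
f_col = (ss_col / df_col) / ms_error
f_row = (ss_row / df_row) / ms_error

print(cf, ss_total, ss_col, ss_row, ss_error)
print(df_col, df_row, df_error)
print(round(f_col, 3), round(f_row, 3))
```

Each calculated F is compared with its own table value: (3, 12) degrees of freedom for the seed effect and (4, 12) for the fertilizer effect.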
Sampling distribution of the mean when SD is unknown
If the population's standard deviation is unknown then we can replace it by the sample standard deviation. This is because the sample standard deviation (S) is the best estimator of, and thus a substitute for, the population standard deviation ($\sigma$). Here the population consists of a large number of observations and thus its distribution is taken as normal. In the case of a small sample we assume that the observations are part of the population and thus follow the shape of the normal distribution approximately. This assumption indicates that if the sample size is increased to infinity then it will follow the normal distribution.
This assumption is fairly correct if the sample size is larger than 30. This is the reason why, when the sample size is less than 30, we use the t-distribution. The standard deviation is calculated by dividing the sum of squared deviations from the mean by (n-1) and taking the square root. Thus
$S = \sqrt{\sum (x - \bar{x})^2 / (n-1)}$
$t_{cal} = (\bar{x} - \mu_{H_0}) / (S / \sqrt{n})$
Thus instead of finding Z we calculate $t_{cal}$ using the same formula (replacing $\sigma$ by S). The t-distribution is also like the normal distribution, but the height of its apex (crown) lowers with smaller sample size. It is a symmetric distribution having mean 0 at the centre.
Here (n-1) is known as the degree of freedom. The table value of t is read using df and the level of significance ($\alpha$).
As usual the t-distribution has tail-end probabilities according to the hypothesis, i.e. one-tailed test or two-tailed test.
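A one-sample calculation can be sketched as follows. The hypothesized mean of 500 kg comes from the copper wire example earlier in this post; the ten sample readings are made up purely for illustration.

```python
from math import sqrt

mu_h0 = 500   # hypothesized breaking strength in kg (copper wire example)

# Made-up breaking strengths of 10 wires, for illustration only
sample = [498, 502, 495, 505, 499, 501, 497, 503, 500, 496]

n = len(sample)
x_bar = sum(sample) / n

# S = root( sum((x - x bar)^2) / (n-1) )
s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

# t = (x bar - mu) / (S / root n)
t_cal = (x_bar - mu_h0) / (s / sqrt(n))
df = n - 1

print(round(x_bar, 2), round(s, 4), round(t_cal, 4), df)
```

The calculated t is then compared with the table value for df = 9 at the chosen level of significance, one-tailed or two-tailed as the hypothesis requires.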
However, this is regarding the one-sample test. If we have to compare the means of two samples then we need to find the combined standard deviation. Let us assume group A has 15 samples and group B has 10 samples. The degree of freedom will be df = n1 + n2 - 2, or 15 + 10 - 2 = 23. Let $S_1$ and $S_2$ be the two sample standard deviations calculated by the above formula (of S), then
$S_{combined} = \sqrt{[S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)] / (n_1 + n_2 - 2)}$ ... (1)
OR
$S_{combined} = \sqrt{[\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2] / df}$ ... (2)
Formula 1 is used when S1 and S2 are given or have to be found anyway. Formula 2 is used otherwise. Here
$t_{cal} = [(\bar{X}_1 - \bar{X}_2) / S_{combined}] \sqrt{n_1 n_2 / (n_1 + n_2)}$
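Both formulas for the combined standard deviation give the same value, which can be confirmed numerically. The sketch below borrows the yields of seed A and seed D from the wheat table as the two groups; that choice of groups is an assumption made here for illustration.

```python
from math import sqrt
import statistics

group_1 = [10, 8, 9, 10, 11]     # seed A yields
group_2 = [10, 12, 14, 15, 14]   # seed D yields

n1, n2 = len(group_1), len(group_2)
mean_1, mean_2 = sum(group_1) / n1, sum(group_2) / n2
df = n1 + n2 - 2

# Formula 1: from the two sample standard deviations S1 and S2
s1, s2 = statistics.stdev(group_1), statistics.stdev(group_2)
s_combined_1 = sqrt((s1 ** 2 * (n1 - 1) + s2 ** 2 * (n2 - 1)) / df)

# Formula 2: directly from the squared deviations
ss_1 = sum((x - mean_1) ** 2 for x in group_1)
ss_2 = sum((x - mean_2) ** 2 for x in group_2)
s_combined_2 = sqrt((ss_1 + ss_2) / df)

t_cal = (mean_1 - mean_2) / s_combined_1 * sqrt(n1 * n2 / (n1 + n2))

print(round(s_combined_1, 4), round(s_combined_2, 4), round(t_cal, 4), df)
```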
Binomial distribution
When an experiment has only two outcomes it is known as binomial. An experiment carried out randomly and having two outcomes is known as a Bernoulli trial. In many cases we use only two outcomes, such as failure or success, boy or girl, bad apple or good apple, defective or non-defective job, head or tail, leaky or non-leaky joint, effective or ineffective meeting, rice or wheat eaten, etc. The binomial distribution is a systematic arrangement of these events for finding the probability of various events.
Binomial expansion: Let us take the example of tossing a coin. If a coin is tossed the outcomes will be S = {H, T}. If two coins are tossed S = {TT, HT, TH, HH}. Let us take ‘p’ as the probability of getting a head and ‘q’ as that of getting a tail. Here p + q = 1 as per the rule of probability.
For n = 2, S = {qq, pq, qp, pp}. Disregarding the sequence of the outcomes, this gives {$q^2$, $2qp$, $p^2$}. This is exactly $(q+p)^2$. If three coins are tossed then the terms will be {$q^3$, $3q^2p$, $3qp^2$, $p^3$}.
Application of Binomial expansion
Let us assume q = p = 0.5.
Term | $q^3$ | $3q^2p$ | $3qp^2$ | $p^3$
No. of heads | 0 | 1 | 2 | 3
Probability | 0.125 | 0.375 | 0.375 | 0.125
This indicates that the probability of getting 0 heads or 3 heads is 0.125 each. The probability of getting 1 head or 2 heads is 0.375 each. Thus the total probability equals 1 (i.e. 0.125 + 0.375 + 0.375 + 0.125).
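The four probabilities for three coins can be reproduced term by term. A minimal sketch using `math.comb` for the coefficients:

```python
from math import comb

n = 3        # three coins
p = 0.5      # probability of a head
q = 1 - p    # probability of a tail

# Term for r heads: nCr * q^(n-r) * p^r
probabilities = [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

for heads, prob in enumerate(probabilities):
    print(heads, prob)
print("total:", sum(probabilities))
```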
Let us assume ‘n’ coins are tossed at a time, or a coin is tossed ‘n’ times. The results will be
$(q+p)^n = {}^nC_0 q^n p^0 + {}^nC_1 q^{n-1} p^1 + {}^nC_2 q^{n-2} p^2 + {}^nC_3 q^{n-3} p^3 + \dots + {}^nC_n q^0 p^n$
Here we observe the following points in the expansion.
1) The number of terms equals n+1. With three coins there were (3+1) = 4 terms.
2) The exponent of q starts from ‘n’ and reduces by one in each subsequent term, till 0 in the last term.
3) The exponent of p starts from 0 and increases by 1 in each subsequent term, till ‘n’ in the last term.
4) The sum of the exponents of q and p in every term remains constant, i.e. ‘n’.
5) The terms are numbered by r, from r = 0 for the first term to r = n for the last term.
6) The exponent of p and ‘r’ are the same.
7) The sum of all binomial coefficients, i.e. ${}^nC_r$, is equal to the number of outcomes in the sample space ($2^n$; 8 outcomes for three coins).
NOTE: ${}^5C_2$ = (5*4)/(2*1) = 10, ${}^8C_3$ = (8*7*6)/(3*2*1) = 56. We can also use a calculator to find the value of ${}^nC_r$ directly.
8) The sum of all probabilities is equal to 1.
If the coin is fair (balanced, unbiased) then p = q = 0.5. In other cases p and q will be different.
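Several of these points can be checked for a coin that is not fair. In the sketch below, n = 5 and p = 0.3 are assumed values chosen only for illustration, and `binomial_terms` is a helper name invented here.

```python
from math import comb


def binomial_terms(n, p):
    """Terms of (q + p)^n, indexed by r = number of heads."""
    q = 1 - p
    return [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]


n, p = 5, 0.3     # assumed values: 5 tosses of a biased coin
terms = binomial_terms(n, p)
coefficients = [comb(n, r) for r in range(n + 1)]

print(len(terms))            # number of terms, n + 1
print(sum(coefficients))     # sum of coefficients, 2 ** n
print(sum(terms))            # total probability, 1 up to rounding
print(comb(5, 2), comb(8, 3))
```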
Practical use of binomial distribution: It works well when sampling is done with replacement. However, the probability p or q should not be less than 0.10.
It is used in acceptance sampling, for accepting a lot on the basis of samples. It is used for drawing generalizations for decision making. In the case of leaky and non-leaky joints of a pipeline: what is the probability of getting more than 2 leaky joints? If one team is required to repair each leaky joint, what is the probability that two teams would be required? How many times in a year would two or more teams be required for repair work?
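The first of these questions can be answered once n and p are known. The sketch below assumes a pipeline with 10 joints and a probability of 0.1 that any one joint leaks; both figures are hypothetical.

```python
from math import comb

n_joints = 10    # assumed number of joints
p_leaky = 0.1    # assumed probability that a joint is leaky


def binomial_pmf(r, n, p):
    """Probability of exactly r 'successes' in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


# P(more than 2 leaky) = 1 - P(0) - P(1) - P(2)
p_at_most_2 = sum(binomial_pmf(r, n_joints, p_leaky) for r in range(3))
p_more_than_2 = 1 - p_at_most_2

# P(exactly 2 leaky), i.e. two repair teams needed if one team repairs one joint
p_exactly_2 = binomial_pmf(2, n_joints, p_leaky)

print(round(p_more_than_2, 4), round(p_exactly_2, 4))
```

Working with the complement keeps the sum short: three terms instead of eight.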
Assumptions in Binomial Experiment
1) The experiment is performed under the same conditions for a fixed number of trials ‘n’.
2) There are only two possible outcomes.
3) The probabilities p and q (head and tail respectively) stay the same in every trial. If sampling is done without replacement then the binomial distribution is not applicable. Replacement means there are always two faces of a coin in every experiment; we cannot take a face away from a coin.
4) Trials are statistically independent. There is no connection between the result of the first trial and that of the second trial.
Mean = $np$, standard deviation = $\sqrt{npq}$
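The formulas for the mean and standard deviation can be compared with values computed directly from the probabilities. A sketch for three fair coins (n = 3, p = 0.5):

```python
from math import comb, sqrt

n, p = 3, 0.5
q = 1 - p

pmf = [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

# Mean and variance computed from the distribution itself
mean_from_pmf = sum(r * prob for r, prob in enumerate(pmf))
var_from_pmf = sum((r - mean_from_pmf) ** 2 * prob for r, prob in enumerate(pmf))
sd_from_pmf = sqrt(var_from_pmf)

# Shortcut formulas
mean_formula = n * p
sd_formula = sqrt(n * p * q)

print(mean_from_pmf, mean_formula)
print(sd_from_pmf, sd_formula)
```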
COEFFICIENT OF CORRELATION
Correlation is a measure of the relationship between two variables. Its value varies from -1 to 1.
It is a unitless quantity. There are a number of situations in our experience where there is correlation. Example: there tends to be positive correlation between advertisement cost and sales revenue, age and weight, noise level and blood pressure, marks obtained and study hours, heights of father and child, rainfall and flood, etc. Here we notice that as one variable increases the other tends to increase, and vice versa.
Similarly, there can be negative correlation, i.e. as one variable increases the other tends to decrease.
Example: expenditure on safety and number of accidents, quality improvement efforts and number of defectives produced, surface roughness and speed, marks obtained and hours spent on TV, consumption of water and calcification, etc.
Also
there can be no correlation between two variables such as richness and weight,
study hours and marks, death rate and birth rate, etc.
There should be a cause and effect relationship between the two variables. Logically one variable should influence the other to increase or decrease; otherwise the correlation is false. Example: prices of gold and tomatoes, reading speed and driving speed, noise level and weight of a person, etc. We cannot establish a link between the two variables in these pairs.
Perfect positive correlation: Here the value of correlation is equal to ‘1’. The relationship between the two variables is such that we get the second variable by multiplying (or dividing) the first variable by a factor, and the factor is always the same. If the cost of 1 pencil is Rs. 3, then 2 pencils cost Rs. 6, 8 pencils Rs. 24, 10 pencils Rs. 30, 50 pencils Rs. 150, etc. The relationship between number of pencils and cost is perfect positive. Here the multiplying factor is 3. The coefficient of correlation between any two multiplication tables is 1 (example: tables of 3 and 17).
Perfect negative correlation: Here the value of correlation is equal to ‘-1’. Here the smaller the 1st variable, the higher the 2nd variable. If we write the table of 3 from 3 to 30 and the table of 17 in reverse direction, i.e. 170 to 17, the relationship between the two variables is perfectly negative, i.e. equal to ‘-1’. The pairs of variables will be (3, 170), (6, 153), (9, 136), etc.
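The two multiplication-table examples can be checked numerically. The text does not spell out a formula for the coefficient, so the sketch below assumes the usual Pearson product-moment formula; `pearson_r` is a helper written here for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


table_of_3 = [3 * k for k in range(1, 11)]      # 3, 6, ..., 30
table_of_17 = [17 * k for k in range(1, 11)]    # 17, 34, ..., 170

r_positive = pearson_r(table_of_3, table_of_17)
r_negative = pearson_r(table_of_3, table_of_17[::-1])   # 170 down to 17

print(round(r_positive, 6), round(r_negative, 6))
```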
Interpretation
of correlation coefficient
Value | Interpretation
0 | Event is not very likely to occur
0.25 | Event is more likely not to occur than to occur
0.5 | Event is likely to occur or not to occur
0.75 | Event is more likely to occur than not to occur
1 | Event is very likely to occur