Important for probability and statistics.

           


                       Hypothesis:
A hypothesis is a statement based on assumption, knowledge and the present state of results. It should be precise and concise, that is, accurate and short. It shows the relationship between two or more variables. It is stated for the purpose of testing on the basis of systematically collected data. This means the relationship should be testable.
            A hypothesis gives direction to the research work. It points out what exactly should be done in the study. It limits unnecessary data collection, so the efforts of the researcher are saved. It prevents the research work from deviating and makes it purposeful.
            Characteristics of a hypothesis: It should be accurate, simple and specific. It should be testable within the limits of time, cost and manpower. It must be based on some study, knowledge or purpose. If it does not fulfil any purpose or add knowledge then it is not useful.
            Primarily there are two types of hypotheses (the plural form of hypothesis): null and alternative. The null hypothesis states that there is no difference between sample and population characteristics. The word null indicates nil, zero or not significant. For example, if there is a lot of copper wire in the yard, the null hypothesis can be stated as below:
            H0: Breaking strength of copper wire is 500 kg
            H0: x bar = μ = 500 kg. Here x bar indicates the mean of the sample tested, and μ is the mean of the population, which cannot be measured many times. Thus on the basis of 10, 20 or 50 samples of copper wire we can predict about the whole lot.
            Some examples of null hypothesis are given below:
a)      The drug / treatment is not effective in reducing blood pressure.
b)      There is no difference in the number of male and female IT professionals across the companies.
c)      The special coaching class is not effective in improving marks.
d)     The effect of fertilizer- A on the yield of wheat is insignificant
e)      The difference in Hb level of two groups is insignificant.
Thus the null hypothesis uses specific words such as “not effective, ineffective, insignificant, independent, no difference or equal, same, stable, uniform, random” etc., depending upon the case. The inner meaning of all these words is ‘insignificant’ or ‘too small to consider’.
      In all the above cases there are two groups of data. One is the standard, or what is found in general in the absence of the treatment, and the other is the observation after treatment. Thus, taken case by case, the above examples can be explained as below.
a)      In this example the doctor has the previous blood pressure levels of the patients. After treatment she/he measures the blood pressure of some patients. The test is whether the post-treatment blood pressure has reduced significantly or not. If it has reduced significantly then the treatment is effective. In other words, if the difference between pre- and post-treatment blood pressure is large then the drug / treatment is said to be effective. This is the alternative hypothesis and not the null hypothesis.
If there is no difference then the drug is not effective in lowering blood pressure. Hence it becomes the null hypothesis, as the inner meaning of H0 is that the difference is insignificant.
b)      There are two groups in IT companies, male and female. Whether their numbers are equal or not is the question to be tested using employee numbers.
c)      There is a previous set of marks before the special coaching class, which is to be compared with the marks after the coaching class. The class is not effective if the difference in the marks before and after is insignificant.
d)     The farmer wants to check the yield of wheat after using fertilizer-A as compared to his/her previous experience. If the difference in both yields is insignificant then the fertilizer is not effective.
e)      The researcher has two groups (of women) and the Hb level of each selected sample.
            Thus in each of the above examples there is a claim made by the researcher. On the basis of data collection she/he wants to check the claim. Read the following examples of null hypothesis:
a)      Bad oranges in the lot are 4%
b)      The variance of ‘burning capacity of coals from two mines’ is same.
c)      Defective jobs are independent of shift. Understand this case with the following tables. The purpose is to understand the words dependent and independent.
Case-I

          Shift 1   Shift 2   Shift 3
O1          50        49        50
O2          48        50        52
O3          52        52        51
O4          51        50        53

Case-II

          Shift 1   Shift 2   Shift 3
O1          50        68        95
O2          48        75       105
O3          52        82       110
O4          51        78       100
            Now you are given the data in Case-I and asked: if 50 defectives are produced in a shift, which shift was it? You are unable to answer the question, because 50 defectives can be produced in shift 1, 2 or 3. Thus, as the difference in the defectives produced in each shift is insignificant, we can say ‘defectives produced are independent of shift’.
            Now refer to the second case. Here the answer to the same question will be ‘shift 1’. If the number of defectives is 70 to 85 then you may answer ‘shift 2’. If it is 95 to 110 then your answer will be ‘shift 3’. Thus here ‘defectives produced are dependent on shift’.
d)     The average weight of the ‘banana from Kerala’ is 280 gram
e)      The mileage of the ‘Jupiter two wheeler’ is 60 km/liter
f)       Share price of a particular company is stable.
g)      Numbers drawn from a set are random. In this case, if numbers are drawn randomly then the distribution of frequency will be uniform. For example, if there are 10 digits, 0 to 9, and a digit is selected 100 times, then it is expected that each digit will be selected 10 times.
Digits                0    1    2    3    4    5    6    7    8    9   Total
Observed frequency   10   11    9    8   11   12    9   10    9   11    100
In the ‘observed frequency’ we see that frequency is not exactly ‘10’ but it is near to ‘10’. In other words we are unable to differentiate digits on the basis of frequency. This is just like the example of shifts and defectives.
h)      Variables in the population are correlated. It means the difference is insignificant or close to zero. Here coefficient of correlation of sample and population is taken into consideration.
i)        The proportion of vegetarians in two villages is same.
j)        The standard deviation of two series is same (equal).
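
The random-digits example in (g) can also be checked with a small calculation. The sketch below is only one possible way to do it: it assumes a chi-square goodness-of-fit statistic (comparing observed and expected frequency), and it assumes the critical value 16.919 for 9 degrees of freedom at the 5% level, as read from a standard chi-square table.

```python
# Observed frequencies of the digits 0..9, taken from the table in example (g).
observed = [10, 11, 9, 8, 11, 12, 9, 10, 9, 11]

# Under H0 (the digits are random) every digit is expected equally often.
expected = sum(observed) / len(observed)

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed)

degrees_of_freedom = len(observed) - 1  # number of categories less one
critical_value = 16.919                 # assumed: standard table value, 5% level, 9 d.f.

decision = "accept H0" if chi_square < critical_value else "reject H0"
print(f"chi-square = {chi_square:.2f}, d.f. = {degrees_of_freedom}, decision: {decision}")
```

The calculated value is far below the table value, which agrees with the remark that the digits cannot be differentiated on the basis of frequency.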

In the above examples you may have observed that many parameters are used in the comparison. It should be noted that the null hypothesis does not depend on the unit of the given values. It can be kg, km/hour, hours or any other unit. Secondly, the following comparisons can be made in a hypothesis.
a)      Comparison between means of two groups.
b)      Comparison between a sample mean and an expected or hypothesized mean.
c)      Comparison between variances of two groups.
d)     Comparison between standard deviations of two groups.
e)      Observed frequency and expected frequency.
f)       Correlation coefficient
g)      Values of additive and multiplicative constants of the regression equation.
h)      Proportion of two groups.
However it is not limited to two groups only. Multiple (more than 2) groups can also be tested using an appropriate test.
Now the question is: what is significant and what is insignificant? We will take an example to understand this.
A newspaper vendor claims that the sale of a particular newspaper is 100 per day. If on a given day the sale was 98 or 105, the difference from 100 is small and can be treated as the usual sale. But if the sale is 50 or 150 on a day, then there must be some reason which has reduced or increased the sale significantly. This difference is noticeable and cannot be treated as the usual sale, as it is significantly lower or higher than 100. Thus if there is an assignable cause for the difference then it is significant, which forms the alternative hypothesis.
Note that in mathematics 100 < 105, but in statistics 100 ≈ 105. In mathematics we take 100 and 105 as final results which do not change on repetition; no experiment is carried out seeking any change. However, if we take 100 and 105 as the sale of a newspaper, which is an uncertain event, then both can be said to be approximately equal, or the difference is insignificant. This forms our null hypothesis.

ALTERNATIVE HYPOTHESIS
Alternative hypothesis is opposite to the null hypothesis. Either the null hypothesis or the alternative hypothesis will be true; both cannot be true at the same time. Similarly both cannot be false at the same time. Hence rejection of the null hypothesis means acceptance of the alternative hypothesis or vice versa. However, this rejection or acceptance is made with some chance of error. This error is known as the level of significance (α-level). It is expressed as a percentage or a value. Generally accepted α-values are 1%, 5% or 10% (0.01, 0.05, 0.10). This means there is a chance that the rejection or acceptance is wrong. This point is explained in detail [ ]. Here the alternative hypothesis means the difference is significant.
Ha:  Difference is significant.
            Now read the examples of null hypothesis again, making each exactly opposite. You will notice that the inner meaning of these statements indicates a higher, noticeable, significant difference. Both values (before and after, arithmetic means of two samples, variances, standard deviations, proportions) are not equal.
            Thus the words (adjectives) used for null and alternative hypothesis are listed below.

Null hypothesis       Alternative hypothesis
Equal                 Not equal
Not effective         Effective
No difference         Difference
Insignificant         Significant
Same                  Not Same
Independent           Dependent
Random                Not Random
Correlated            Un-related
Stable/Uniform        Not stable/Uniform
Zero                  Not zero

            In all the above cases the alternative hypothesis is made exactly opposite to the null hypothesis. However, it is not always the case. In the example that the breaking strength of copper wire is 500 Kg, it is stated as below.
Ho: X bar = μ = 500 Kg
Ha: X bar ≠ μ
            The alternative hypothesis doesn’t indicate whether to accept the lot of copper wire. Here X bar may lie above 500 Kg or below 500 Kg. Hence the expected values are non-directional (upper side or lower side). A significantly higher value, say 560 Kg, or a significantly lower value, say 440 Kg, are both not accepted in this case. This is known as a two-tailed hypothesis.
                                   In the 1st example of Drug/treatment it is expected that the values of blood pressure after treatment should be reduced. The drug/treatment will not increase the blood pressure as it is against our purpose. Thus it is one tailed hypothesis. Here alternative hypothesis will be the drug is effective in reducing blood pressure.
                                   In the 3rd example of special coaching class it was expected that marks will be increased. Thus here ‘effective’ means marks are increased. This is also one tailed hypothesis.
(please refer two-tailed and one-tailed test of hypothesis and its region of acceptance in--------)
In short it can be stated as below:
a)      Ho: X bar = μ = 500 Kg               Right side region of rejection
      Ha: X bar > μ
b)      Ho: X bar = μ = 500 Kg               Left side region of rejection
      Ha: X bar < μ
c)      Ho: X bar = μ = 500 Kg               Both side region of rejection
      Ha: X bar ≠ μ
                                               HYPOTHESIS TESTING METHODOLOGY:
            On the basis of data collection we carry out some mathematical calculations to check which hypothesis is to be accepted. This is known as hypothesis testing. The acceptance or rejection of the hypothesis should not be based on personal feeling, that is, subjectively, because opinions differ and so the decision could be different. Hence there is a unique method, based on logic, to test the hypothesis. In this method the given data or parameters are treated to find out whether the difference is significant or not. In a nutshell, the test will determine which hypothesis should be accepted. Thus the decision system will be uniform and there is no chance for personal bias (opinion) in the decision. Irrespective of the type of test, the following general methodology should be adopted for hypothesis testing.
a)      Statement of hypotheses Ho and Ha to be tested
b)      Selection of the type of distribution for testing
c)      Using appropriate formula find out test statistics
d)     Assumption of level of significance:1%, 5%, 10%
e)      Finding out degree of freedom if necessary
f)       Referring table of distribution find out critical value
g)      Compare calculated and critical values/statistics
h)      Make decision of acceptance or rejection of hypothesis
i)        Write conclusion about action/decision
Brief description on methodology:
a)      Statement of hypothesis in words or numerical form, based on the theme of the question or the research intention.
Ho: X bar = μ = 80,     Ha: X bar ≠ μ or Ha: X bar > μ or Ha: X bar < μ
b)      t-distribution, Z-distribution (normal distribution / large sample test), χ² (chi-square distribution), F-distribution (for analysis of variance), uniform distribution, etc. This selection shall be made on the basis of the type of data and what is to be tested.
c)      There are different formulae for different tests. It depends upon the type of data also. This involves calculation of arithmetic mean, standard deviation, variance, standard error etc., depending upon requirement.
d)     If not given, assume the middle figure, i.e. 5% level of significance. However this assumption is rough; it shall be changed on the basis of the difference in the calculated and critical (or table) value. The best advice is to consider all three levels of significance for making the decision.
e)      Degree of freedom is the number of categories less one (n-1). However, in case of comparing with the Poisson distribution it is (n-2), and for normal distribution comparison it is (n-3). In case of a two-way table (contingency table) it is (r-1)(c-1), where r and c denote the number of rows and columns.
f)       Now a printed table/book of the ‘distribution’ is referred to find out the critical value. It is supplied in the examination or allowed to be carried. Sometimes it is part of the question paper. We can also find out this value using an MS Excel spreadsheet.
g)      Suppose we take the t-test, then compare:
            tcal < tcrit (or t table): the decision is to accept Ho. In other words, there is no evidence to reject Ho.
            tcal > tcrit (or t table): the decision is to reject Ho. In other words, there is no evidence to accept Ho.
h)      Write down accepted hypothesis statement
i)        Support/substantiate any action/ decision as expected in the study.
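
The steps a) to i) can be followed through on the copper wire example. The sketch below is only an illustration: the ten breaking strengths are hypothetical numbers invented for this example, the test chosen is a one-sample two-tailed t-test, and the critical value 2.262 (5% level, 9 degrees of freedom) is assumed from a standard t table.

```python
import math

# Hypothetical breaking strengths (kg) of 10 sampled copper wires.
sample = [498, 503, 507, 495, 501, 499, 504, 502, 496, 505]

# a) Ho: mean = 500 kg, Ha: mean != 500 kg (two-tailed).
mu_0 = 500

# b), c) t-distribution; compute the test statistic.
n = len(sample)
x_bar = sum(sample) / n
# Sample standard deviation with (n - 1) in the denominator.
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
t_cal = (x_bar - mu_0) / (s / math.sqrt(n))

# d), e), f) Level of significance, degrees of freedom, table value.
alpha = 0.05
df = n - 1
t_crit = 2.262  # assumed: standard t table, two-tailed, 5% level, 9 d.f.

# g), h) Compare calculated and critical values and decide.
decision = "accept Ho" if abs(t_cal) < t_crit else "reject Ho"

# i) Conclusion.
print(f"x bar = {x_bar}, t_cal = {t_cal:.3f}, t_crit = {t_crit}, decision: {decision}")
```

With these made-up numbers the calculated t is smaller than the table value, so there is no evidence to reject Ho.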


                                                CENTRAL TENDENCY
            Central tendency indicates the single representative value for the distribution under consideration. Thus it is the most suitable value which can be used for the whole data. For this purpose we generally use the average or arithmetic mean of the given data. However, it is not the only measure, and no single measure suits all types of data.
            The purpose of finding out a single representative number is to facilitate comparison of two or more groups. Many times it becomes difficult or impossible to memorise all the data, hence a ‘single number’ serves the purpose. Certain characteristics of a measure of central tendency are as below.
a)      It should be easy to calculate from given data.
b)      It should be easy to understand
c)      It should be single representative number
d)     It should be able to absorb corrections/ addition
e)      It should be based on all observations. However, at the same time it should not be affected by extremely small or large values.
The following list shows different measures of central tendency and their use.

a)      Arithmetic mean:-
It is used for data having a single unit such as kg, meter, cm3, cm2, etc. It is based on all numbers. It is easy to understand and manipulate. It is the most used measure.

b)      Mode:-
It shows the number most often repeated in the series. It is useful when the calculated value of the average has no meaning, for example the most sold size of shoes or undergarments.

c)      Median:-
If the given numbers are arranged in ascending or descending order then the middle number is the median. It is used to locate the point below and above which 50%-50% of the observations exist. It is not affected by extreme observations. It is fairly constant.

d)     Geometric mean:-
It is used for calculating index numbers and for finding the average percentage increase or decrease over a period. It is also used for finding an average when lower weightage is to be given to higher values and higher weightage to lower values.

e)      Harmonic mean:-
It is used to find out the average when the unit of data is compounded, such as km/hour, jobs/hour etc., as in speed related problems.
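
The five measures listed above can be tried out with Python's standard `statistics` module. This is a minimal sketch; all the data values (shoe sizes, growth factors, speeds) are hypothetical and chosen only to match the uses described in the list.

```python
import statistics

# Hypothetical shoe sizes sold in a day: the mode is the useful measure here.
shoe_sizes = [6, 7, 7, 8, 8, 8, 9, 10]
mean_size = statistics.mean(shoe_sizes)      # may be a size that does not exist
median_size = statistics.median(shoe_sizes)  # middle of the ordered data
mode_size = statistics.mode(shoe_sizes)      # most frequently sold size

# Hypothetical yearly growth factors: +25% then -20%.
# The geometric mean gives the average factor per year.
growth_factors = [1.25, 0.80]
average_growth = statistics.geometric_mean(growth_factors)

# Hypothetical speeds (km/hour) over two equal distances:
# the harmonic mean gives the average speed for the whole trip.
speeds = [40, 60]
average_speed = statistics.harmonic_mean(speeds)

print("mean:", mean_size, "median:", median_size, "mode:", mode_size)
print("geometric mean of growth factors:", round(average_growth, 4))
print("harmonic mean of speeds:", round(average_speed, 4))
```

Note that the arithmetic mean of the two speeds would be 50 km/hour, which overstates the real average speed over equal distances; the harmonic mean is lower.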

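Median:- divides the ordered data (0% to 100%) so that 50% of the observations lie below it and 50% above it.

Q1 = (1/4):- 25% of the observations lie below it and 75% above it.

Q3 = (3/4):- 75% of the observations lie below it and 25% above it.

D6 = (6/10):- 60% of the observations lie below it and 40% above it.

P35 = (35/100):- 35% of the observations lie below it and 65% above it.

Q1 = Quartile 1, Q3 = Quartile 3, D6 = 6th decile and P35 = 35th percentile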
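
Quartiles, deciles and percentiles can be located with `statistics.quantiles` from the Python standard library. The sketch below uses hypothetical data (the numbers 1 to 11) and the function's default interpolation rule; other textbooks and software use slightly different rules, so small differences in the values are to be expected.

```python
import statistics

# Hypothetical ordered observations 1, 2, ..., 11.
data = list(range(1, 12))

# Cut points that divide the data into 4, 10 and 100 equal parts.
quartiles = statistics.quantiles(data, n=4)      # [Q1, median, Q3]
deciles = statistics.quantiles(data, n=10)       # [D1, ..., D9]
percentiles = statistics.quantiles(data, n=100)  # [P1, ..., P99]

q1, median, q3 = quartiles
d6 = deciles[5]         # 6th decile (index starts at 0)
p35 = percentiles[34]   # 35th percentile

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("D6:", d6, "P35:", p35)
```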

Relationship among mean, median and mode: for a perfectly normal (bell-shaped, symmetric) distribution, mean = median = mode. However, for other (not normal, not symmetric) distributions, Mode = 3 Median - 2 Mean. This relationship gives an approximate value of the unknown. If any two are known, the third can be calculated using this formula.
In case of Arithmetic mean, Geometric mean and Harmonic mean the relationship is as below:
1)      AM ≥ GM ≥ HM
2)      GM² = AM * HM (for two positive values)
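
Both relationships can be verified on a pair of numbers. The values 4 and 9 below are hypothetical, picked only because their geometric mean is a whole number.

```python
import math

# Two hypothetical positive values.
a, b = 4, 9

am = (a + b) / 2             # arithmetic mean
gm = math.sqrt(a * b)        # geometric mean
hm = 2 / (1 / a + 1 / b)     # harmonic mean

print("AM:", am, "GM:", gm, "HM:", round(hm, 4))
print("AM >= GM >= HM:", am >= gm >= hm)
print("GM squared:", gm ** 2, "AM * HM:", round(am * hm, 4))
```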

SAMPLE SPACE
It is the totality of all possible outcomes of an experiment. It is the collection of all outputs of an experiment. Suppose we toss a coin; then the possible outcome will be Head or Tail. In case the coin remains in a standing (vertical) position, discard that experiment and toss the coin again. However, it is a very rare event and possible in films only. Now if we denote H for Head and T for Tail then our sample space will be S = {H, T}. The sample space is generally denoted by S. The sample size is the number of items in the sample space; here it is 2.
Now if we toss 2 coins at a time, or one coin two times, then the possible set of outcomes will be as below:
     S = {TT, TH, HT, HH}
Thus the sample space contains 4 items representing possible events. If we toss a die having six faces the sample space will be S = {1, 2, 3, 4, 5, 6}.
    If two dice are tossed at the same time the sample space will contain 36 items as below.
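Dice I     1       2       3       4       5       6
1        (1,1)   (1,2)   (1,3)   (1,4)   (1,5)   (1,6)
2        (2,1)   (2,2)   (2,3)   (2,4)   (2,5)   (2,6)
3        (3,1)   (3,2)   (3,3)   (3,4)   (3,5)   (3,6)
4        (4,1)   (4,2)   (4,3)   (4,4)   (4,5)   (4,6)
5        (5,1)   (5,2)   (5,3)   (5,4)   (5,5)   (5,6)
6        (6,1)   (6,2)   (6,3)   (6,4)   (6,5)   (6,6)

The totals of the two dice are as below.
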
          1     2     3     4     5     6
1         2     3     4     5     6     7
2         3     4     5     6     7     8
3         4     5     6     7     8     9
4         5     6     7     8     9    10
5         6     7     8     9    10    11
6         7     8     9    10    11    12


Application:
            Now let us see what the use of this sample space is. It is used to know the possible number of outcomes of an experiment. It also defines the probability of happening of an event. In case of 2 coins tossed at a time the probability of getting both heads is 1/4. Four indicates the number of items in the sample space and 1 is the ‘event of arrival of HH’.
            In the case of two dice tossed at a time, what is the probability of getting a total of 3? Here we observe the ‘total 3’ 2 times, which means the favourable events are 2 out of 36. Thus the probability will be 2/36. Similarly, the total 4 appears 3 times, so the probability of happening of the ‘4’ event, or simply the probability of getting 4, is 3/36.
            Each individual item included in the sample space is known as a sample point. According to the sample size, or number of sample points, there are two classes of sample space: 1) finite sample space and 2) infinite sample space.
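
The 36-point sample space and the probabilities quoted above can be reproduced by listing every ordered pair. This is a small sketch using only the Python standard library; the helper name `probability_of_total` is made up for this illustration, and it assumes fair dice so that all 36 points are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: all ordered pairs (first die, second die).
faces = range(1, 7)
sample_space = list(product(faces, repeat=2))

def probability_of_total(total):
    """Favourable sample points divided by all sample points (fair dice assumed)."""
    favourable = [pair for pair in sample_space if sum(pair) == total]
    return Fraction(len(favourable), len(sample_space))

print("Number of sample points:", len(sample_space))
print("P(total = 3):", probability_of_total(3))   # Fraction reduces 2/36
print("P(total = 4):", probability_of_total(4))   # Fraction reduces 3/36
```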

Sample Space
    Finite
    Infinite
        Countably Infinite
        Uncountably Infinite


Finite sample space:
            If the number of sample points or items in the sample space is definite then it is called a finite sample space. In this case the number of outcomes is known. In the example of tossing a die the sample space contains six numbers. In case of tossing two dice the sample space contains 36 points. Thus the number of items included in the sample space is well known.
We need not perform any kind of experiment for this purpose. Another example is patients coming to a hospital, who are categorised as male (m) and female (f). According to their condition they are also categorised as serious (s), normal (n) and can be delayed (d). The sample space will be
            S = {ms, mn, md, fs, fn, fd}
Infinite sample space:
            As the name implies, infinite means countless or immeasurable. Thus we cannot define the whole sample space at the start. The sample space is unlimited. However, in some cases it is countably infinite, which means any number of other events may occur before the expected event, and these outcomes can be listed one by one. Here let us take the example of tossing a coin. If we decide to toss a coin until head occurs then the sample space will be as below:
            S = {H, TH, TTH, TTTH, TTTTH, TTTTTH, ...}
Thus we can use the natural numbers 0, 1, 2, 3, 4, ... to count the tails before the occurrence of head. In the above case, if head occurs the experiment is closed. Thus TTTTHT cannot be a point in the sample space; the sequence ends at TTTTH. Once head is up the experiment is finished. Thus the question is how many times we need to toss the coin to get head. It is indefinite, but before the occurrence of head there should be only tails. Thus tail has to occur 0, 1, 2, 3, ... times before getting head. This situation is shown in the sample space.
A second example of a countably infinite sample space is:
Patients coming to the hospital are denoted as M for male and F for female. If we continue the experiment until a female comes then the sample space will be as below.
            S = {F, MF, MMF, MMMF, MMMMF, ...}

If male and female are further categorised as Child (C) and Adult (A), and we continue the experiment until an adult female comes, then the sample space will be as below:
            S = {AF, AMAF, CMAF, CFAF, AMAMAF, CMCMAF, ...}
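
A countably infinite sample space cannot be written out in full, but its points can be produced one by one. The sketch below uses a Python generator for the 'toss until head' experiment; the function name `toss_until_head` is made up for this illustration.

```python
from itertools import count, islice

def toss_until_head():
    """Yield the sample points H, TH, TTH, ... one by one, without end."""
    for tails in count(0):        # 0, 1, 2, 3, ... tails before the head
        yield "T" * tails + "H"   # the experiment closes at the first head

# The generator never finishes, so take only the first six sample points.
first_six = list(islice(toss_until_head(), 6))
print(first_six)
```

Each sample point is matched with one natural number (the count of tails), which is exactly what 'countably infinite' means here.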

Uncountably Infinite sample space:
            In this case we cannot count the points in the sample space. Instead the sample space is defined by a lower and an upper limit, such as 10 to 100, or the marks of students, which range from 0 to 100. A quantity may also have no upper limit, so that the sample space ranges from 0 to ∞. It is written as below:
            S = {X: 0 < X < ∞}. This is read as ‘X is such that it varies from 0 to infinity’. Also the life of a bulb may range from 0 to 1000 hours. We cannot count it exactly, and if we tried it would become an unnecessary, unfruitful exercise. Instead it can be best presented using an interval, or lower and upper limit. Here if X denotes the number of burning hours of a bulb then S = {X: 0 < X < 1000 hours}
Discrete and continuous sample space: Discrete means a definite value and continuous means a value within a given range. The number of rooms, children or peas in a pod are discrete numbers. These are definite numbers such as 0, 1, 2, 3, 4, etc. A continuous sample space consists of many values within the given range, e.g. 10 to 20, 0 to 100, etc. The continuous sample space is an uncountably infinite sample space. We cannot define its exact value or its sample points. Examples are the income of a person or family, the life of a mobile phone battery, and the rainfall in a city or area. We can express these as a range only; no definite number can be decided on the basis of historical data. As there is no limit to precision we cannot exactly define its sample points.
Discrete means finitely many or countably infinitely many elements. Thus it includes finite as well as countably infinite sample spaces.
EVENTS:
An event is a part of the sample space containing one or more outcomes. We know that the sample space is the collection of all possible outcomes. An event can be a single outcome or a group of many outcomes, as we expect or want. In a random experiment, if the sample space and possible outcomes are known in advance (before the experiment) then the outcome is known as an event. In an experiment of tossing a coin, ‘H’ or ‘T’ are events. The following examples will again explain the term.
1)      If a die is tossed then the sample space is S = {1, 2, 3, 4, 5, 6}
Now the event of getting an even number is {2, 4, 6}
The event of getting a number less than 3 is {1, 2}
The event of getting a number more than 4 is {5, 6}
2)       In a set of alphabets the sample space contains 26 points, S = {A, B, C, ..., Z}. Now the event of getting a vowel = {A, E, I, O, U}
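
Since an event is a part (subset) of the sample space, Python sets are a natural way to write the examples above. The sketch assumes a fair die, so that the probability of an event is simply the number of its sample points divided by the size of the sample space; the helper `probability` is made up for this illustration.

```python
from fractions import Fraction

# Sample space for one toss of a die.
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space.
even = {x for x in sample_space if x % 2 == 0}
less_than_3 = {x for x in sample_space if x < 3}
more_than_4 = {x for x in sample_space if x > 4}

def probability(event):
    """Probability of an event, assuming all sample points are equally likely."""
    assert event <= sample_space, "an event must be a subset of the sample space"
    return Fraction(len(event), len(sample_space))

print("even:", sorted(even), probability(even))
print("less than 3:", sorted(less_than_3), probability(less_than_3))
print("more than 4:", sorted(more_than_4), probability(more_than_4))
```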

Now let us know the types of events. There are 9 types of events as mentioned below:
1)      Certain event
2)      Void/ impossible event
3)      Mutually exclusive events or disjoint events
4)      Independent events
5)      Dependent events
6)      Equally likely events
7)      Exhaustive events
8)      Complementary event
9)      Simple and compound events

1)      As the name implies, when the happening of an event is certain or sure then it is known as a ‘certain event’. In this experiment no other event is available. Occurrence of head or tail after tossing a coin is a certain event. The child may be a boy or a girl. The outcome of the die will be in S = {1, 2, 3, 4, 5, 6}. Thus if an event contains all sample points in the sample space then it is a ‘certain’ event.
2)      It is exactly opposite to the above event. When there is no such sample point in the sample space then it is called an impossible or void event. For example, the occurrence of 7 or 8 in the die example is not possible. A number above 6 is not in the set S = {1, 2, 3, 4, 5, 6}.
3)      If two given events cannot happen at the same time then they are called mutually exclusive events. The probability of both events happening together is zero in this case. Either the 1st or the 2nd event will occur at a time. A person cannot belong to the male and female category at the same time. Head and tail cannot both be the outcome of tossing a coin. We should use ‘or’ instead of ‘and’ to make it meaningful: a person belongs to the male or female category; the outcome of tossing a coin is either head or tail. As the joint possibility is nil in this case, these are known as disjoint events. The following figure shows disjoint events:

                                  
[Figure: sample space S containing two separate, non-overlapping ovals, A and B]
Event A is the occurrence of a number less than 4. Thus EA = {1, 2, 3}. Event B is the occurrence of a number equal to or more than 4. Thus EB = {4, 5, 6}. P(A ∩ B) = 0. The probability of happening of both event A and event B is zero.
4)      An event is said to be independent when its occurrence is not influenced by any other event. Thus the experiment itself possesses the same condition every time. In tossing a die there are six faces every time; we cannot delete or add any face. Hence the occurrence of any number is independent of the previous throws. Similarly, if balls (numbered 1 through 10) having similar physical characteristics are placed in a pot (urn), the blind selection at each time is independent of the previous outcome, provided that we have replaced the selected ball in the pot. In another case, the rainfall on a day is not certainly independent of the previous day's rainfall. It may or may not rain following a rainy day. Thus these events are not independent. We cannot simply call them dependent or independent: the opposite of ‘independent’ is ‘not independent’, and not ‘dependent’.
5)      A dependent event is associated with a previously happened event. In the above case of balls numbered 1 to 10, if we do not replace the ball then the probability of getting selected changes every time. The first time, the probability of getting 5 is 1/10, as there is only one such ball in 10. If we don’t get ‘5’ and continue with a second experiment without replacement, then the probability of getting ‘5’ will be one in nine, or 1/9. In the subsequent experiments, if continued under similar conditions, the probability will be 1/8, 1/7, ... till getting ‘5’. Thus the probability of ‘5’ changes every time as we have not replaced the ball. This is not possible in the case of a die as we cannot remove a face.
6)      In an experiment of tossing a coin we may expect the occurrence of head but we cannot justify our expectation. There is no reason why head should appear after a toss. This indicates that the coin is fair or unbiased and it may give head or tail. The events of occurrence of head or tail are equally likely. In case the coin is biased to fall on head, and we have historical data that head has appeared 60% of the time, then the events are not equally likely. Throwing a die can be explained in the same way. Getting a defective product is not an equally likely event, as the company is doing much more for producing a good product.
7)      When all possible sample points in a sample space are considered then these are ‘exhaustive’ events. The number of exhaustive events in tossing a coin is 2. In case of throwing a die it is 6; for two dice it is 36.
8)      Let us take the example of balls numbered 1 through 10. The sample space will be S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Let A be a subset having the even numbers, A = {2, 4, 6, 8, 10}. B is a subset having the odd numbers, B = {1, 3, 5, 7, 9}. The points in subsets A and B together form the sample space S. However, there is no common point in A and B. Therefore B is said to be the complementary event of A, or vice versa. It is shown in the following figure.




[Figure: sample space S containing oval A; the remaining region of S is B]
            A^c or A' means the complement of event A. The complement of the complement is the event itself. Event B consists of those sample points which do not belong to event A.
9)      A simple event is an individual event not linked with the happening of another event. For example, the probability of getting ‘6’ in throwing a die is 1/6. If a class has 20 students, 12 boys and 8 girls, the probability that a randomly selected student will be a girl is 8/20 or 0.4. On the basis of historical data also we can predict the probability of an event happening. For example, the probability that a sales manager is in Mumbai today is 0.8.
            A compound event is a combination of two or more events in sequence. The probability of getting ‘6’ in two successive throws of a die is 1/6 * 1/6 = 1/36. The probability of selection of two girl students in sequence is 8/20 * 7/19 = _______. The probability that the sales manager is in Mumbai on two successive days is 0.8 * 0.8 = 0.64. Thus there are two or more events occurring in sequence.
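
The dependent-event and compound-event calculations in items 5) and 9) can be worked out with exact fractions. This is only a sketch of the arithmetic described there; the variable names are chosen for this illustration.

```python
from fractions import Fraction

# Item 5: drawing balls numbered 1..10 WITHOUT replacement.
# Chance that the next draw is ball '5', given it has not appeared yet:
# 1/10 on the first draw, 1/9 on the second, 1/8 on the third, and so on.
balls = 10
chance_on_draw = [Fraction(1, balls - drawn) for drawn in range(balls)]

# Item 9: compound events, multiplying the probabilities in sequence.
two_sixes = Fraction(1, 6) * Fraction(1, 6)    # two successive throws of a die
two_girls = Fraction(8, 20) * Fraction(7, 19)  # two girls in sequence, no replacement
two_days_in_mumbai = 0.8 * 0.8                 # sales manager, two successive days

print("chance of '5' on draws 1 to 4:", [str(p) for p in chance_on_draw[:4]])
print("two sixes:", two_sixes)
print("two girls:", two_girls, "=", round(float(two_girls), 4))
print("two days in Mumbai:", round(two_days_in_mumbai, 2))
```

In the two-girls case the second factor is 7/19 and not 8/20, because the first selected girl is not returned to the class; this is the same 'without replacement' idea as in the ball example.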

Coefficient of variation or variability:
            Standard deviation is an absolute measure of variability. It has no relation with the mean. The (arithmetic) mean can be the same with different standard deviations, or vice versa. It is an internal measure of variability. If we want to compare the variability of two or more groups then we should use COV and not SD. COV = SD/mean * 100. Thus the mean is taken into account in the COV. Hence it is a better measure for comparison purposes. It is expressed as a percentage.
            The higher the COV, the higher is the variability, and the lower will be the homogeneity, uniformity and stability. In contrast, the lower the COV, the lower is the variability. Example: the boys group has average marks 60 with SD 20. The girls group has average marks 50 with SD 12.5. Then the COV of the boys group is 33.33% and that of the girls group is 25%. This indicates the girls group is more consistent as regards marks. Here we should not interpret variability on the basis of SD only. Refer to the following case and interpret.
Boys group X bar 1 = 80, σ1 = 20
Girls group X bar 2 = 50, σ2 = 16
Mean and SD should be calculated as usual, using the appropriate formula for individual, discrete or continuous series, as the case may be.
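
The formula COV = SD/mean * 100 can be applied to both cases above with a few lines of Python. The function name `cov` is made up for this sketch.

```python
def cov(sd, mean):
    """Coefficient of variation in percent: SD / mean * 100."""
    return sd / mean * 100

# First example: boys mean 60, SD 20; girls mean 50, SD 12.5.
boys_1 = cov(20, 60)
girls_1 = cov(12.5, 50)

# Second case: boys mean 80, SD 20; girls mean 50, SD 16.
boys_2 = cov(20, 80)
girls_2 = cov(16, 50)

print(f"Example: boys {boys_1:.2f}%, girls {girls_1:.2f}%")
print(f"Case:    boys {boys_2:.2f}%, girls {girls_2:.2f}%")
```

In the second case the girls group has the smaller SD but the larger COV, which shows why variability should not be interpreted on the basis of SD only.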


ANOVA
ANOVA is the short form of ‘Analysis of Variance’. It is also known as the F-test. This test was devised by Prof. R. A. Fisher in 1920. Variance is the square of the standard deviation. However, the sample size being less than 30, we use (n-1) in the denominator instead of ‘n’. Thus variance S² = Σ(X - X bar)² / (n-1). This test is useful to find out the overall effect of the treatment. It determines whether the ‘treatment’ effect on the change in data is significant or not.
            Let us assume that we have sown 4 kinds of wheat seeds (A, B, C, D) on 5 acres of land each. The production of wheat measured in each acre of land is as below.
                                  

Acre     A      B      C      D
1       10     11     12     10
2        8     10     13     12
3        9     12     15     14
4       10     13      9     15
5       11     12     10     14

Now we can see that there is difference within each column, which represents the same type of seed. This variation is not explainable and is supposed to be a random effect. There is also difference in yield from ‘column to column’. This can be the effect of the change in seed type. Thus the difference from column to column is due to the treatment effect of ‘sowing different seed’. Now the question is whether this variation is significant or not.
Does the type of seed contribute significantly to the higher yield? This is tested using ANOVA.
Defectives produced in shifts, effect of fertilizer on yield of crop, productivity of different machines, scores obtained by different groups of students, performance (sales) of various salesmen etc. can be tested using ANOVA.
In the two-sample t-test we test the significance of the difference in the means of two groups. If the number of groups is more than 2 then the t-test has to be carried out many times, in pairs. ANOVA gives the overall conclusive result and avoids many t-test calculations. ANOVA is used when the number of groups is more than 2. In the case of 2 groups it can also be used; the decision is then based on the significance of a ratio of variances and not on a direct difference of means.
Here we note the basic concept that ‘means can be same with different variances’ and ‘variances can be same with different means’. Thus variance is not related to the mean. The variance of ‘11, 12, 13, 14 and 15’ is the same as that of ‘51, 52, 53, 54 and 55’, although there is much difference in the means (13 and 53). In the two-sample t-test the means of two groups are tested for equality or non-equality. In ANOVA the variance between groups is compared with the variance within groups.
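The point that variance does not depend on the mean can be verified on the two series quoted above. A minimal sketch using Python's standard statistics module, whose variance function divides by (n-1):

```python
from statistics import mean, variance

low = [11, 12, 13, 14, 15]
high = [51, 52, 53, 54, 55]

# variance() is the sample variance, i.e. it uses (n-1) in the denominator.
mean_low, var_low = mean(low), variance(low)
mean_high, var_high = mean(high), variance(high)

print("Series 11..15: mean =", mean_low, "variance =", var_low)
print("Series 51..55: mean =", mean_high, "variance =", var_high)
```

Adding the same constant (here 40) to every value shifts the mean but leaves every deviation from the mean unchanged, which is why the two variances agree.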

One way ANOVA:
The given case of yield of wheat is one way ANOVA, because it has only one treatment. ‘Changing type of seed’ is our treatment, which can manipulate (increase/decrease) the yield of wheat. The other side, i.e. the row ‘number of land’, indicates no effect on yield, as we have assumed that the land characteristics are the same. Thus the effect of different land on the yield is random and not a cause for variance. (It is ignored.)
It is suggested that in one way ANOVA the source of variation should be taken into columns. If in an example it is given in rows then we should transpose the same (make it into columns) for convenience.
In one way ANOVA the total variance is due to ‘random effect’ and ‘treatment effect’. Ss is a term used for ‘Sum of squares’. It is actually the sum of squares of the differences of the data from their mean; in other words, $\sum (X - \bar{X})^2$.
Total Ss = Ss due to treatment + Ss due to random effect. Degrees of freedom in one way ANOVA: for column effect it is (C-1). The total degree of freedom is (n-1), i.e. sample size - 1. The df of error (or random effect) is [(n-1)-(C-1)]. In our case df for column = (4-1) = 3. The total df = (20-1) = 19. The df of error = 19-3 = 16.

ANOVA TABLE
Source of variation        Ss values    Df         Ms (mean square)    F calculated    F critical
Ss due to column effect    X            n1         X/n1 = E1           E1/E2           F(n1, n2) at α level 0.05 or 0.01
Ss within                  Y            n2         Y/n2 = E2
Total Ss                   X+Y          n1+n2
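The one way ANOVA table can be filled in for the wheat data with a few lines of code. The sketch below is one possible way to do it in plain Python: it computes the Ss values, the degrees of freedom, the mean squares and the calculated F. The function and key names are chosen only for this illustration, and the critical F still has to be read from an F table for the degrees of freedom (3, 16).

```python
# Wheat yield per acre for the four seed types (data from the table in the text).
yields = {
    "A": [10, 8, 9, 10, 11],
    "B": [11, 10, 12, 13, 12],
    "C": [12, 13, 15, 9, 10],
    "D": [10, 12, 14, 15, 14],
}

def one_way_anova(groups):
    """Return the entries of a one way ANOVA table for a dict of columns."""
    values = [x for col in groups.values() for x in col]
    n = len(values)          # total sample size
    c = len(groups)          # number of columns (treatments)
    grand_mean = sum(values) / n

    # Total Ss: squared deviations of every observation from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    # Ss due to column effect: column means compared with the grand mean.
    ss_column = sum(
        len(col) * (sum(col) / len(col) - grand_mean) ** 2
        for col in groups.values()
    )
    # Ss within (random effect) is the balance.
    ss_within = ss_total - ss_column

    df_column = c - 1
    df_within = (n - 1) - (c - 1)
    ms_column = ss_column / df_column
    ms_within = ss_within / df_within
    return {
        "ss_total": ss_total, "ss_column": ss_column, "ss_within": ss_within,
        "df_column": df_column, "df_within": df_within,
        "ms_column": ms_column, "ms_within": ms_within,
        "f_cal": ms_column / ms_within,
    }

result = one_way_anova(yields)
for name, value in result.items():
    print(f"{name:10s} = {value:.4f}")
```

If the calculated F is larger than the table value at the chosen level of significance, the effect of seed type is taken as significant; otherwise it is not.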




Hypothesis in one way ANOVA:
Ho: Arithmetic means of samples are equal.
Ha: Arithmetic means of samples are not equal.
Ho: Variation in the yield due to change in seed type is insignificant.
Ha: Variation in the yield due to change in seed type is significant.
Ho: Population means are equal.
Ha: At least one population mean is different from the overall population mean.
Ho: $\mu_1 = \mu_2 = \mu_3$
Ha: $\mu_1 \neq \mu_2 \neq \mu_3$ (i.e. the means are not all equal)
Ho: $\sigma_1^2 = \sigma_2^2$
Ha: $\sigma_1^2 \neq \sigma_2^2$
Assumptions in ANOVA:
1)      Assumption of  normality:
It is assumed in this test that the data are normal, i.e. the graph is bell-shaped if we plot it. In other words, if we arrange the data in ascending order and plot them on normal probability paper then they will lie on a straight line. There are many methods to test normality, such as the $\chi^2$ goodness-of-fit test. However, we cannot guarantee that the data are perfectly normal. It is also not a necessary condition: researchers have observed that a small deviation from normality doesn’t make the F-test invalid. However, for very skewed distributions (right side or left side) or multimodal series it is not applicable.

2)      Assumption of randomization:
There is no bias in carrying out the experiment. Each sample should be selected in random fashion and independently. This means the selection of one sample should have no effect on the selection of the next sample. When observations are not correlated with time, space and sampling unit, they are called independent. If we have four types of seed, then the selection of seed for a plot should be done independently and randomly; see the following layout.
A      C      D      B
D      B      A      C
B      A      D      D
C      B      C      A

3)      Equality of variances:
The group means may be different but the variances should be equal. In other words it is assumed that $\sigma_1^2 = \sigma_2^2 = \sigma_3^2$, if these represent the variances of the three groups to be tested. Bartlett’s test for homogeneity (equality) of variances can be applied to test equality.
For equal variances the tail end probabilities of the samples are equal, hence comparison can be made. In general language we can say that there should be equal distribution of data on both sides of the mean (centre line).

TWO-WAY ANOVA:
In the given case of yield of wheat, if we have changed the fertilizer level as well as the seed type then there are two treatments, hence the analysis is known as Two-way ANOVA. Here both treatments (seed type in the columns and fertilizer level in the rows) are affecting the yield of wheat.
Here the hypotheses will be for the significance of both types of treatment on the production (yield) of wheat. For seed type:
Ho: The effect of seed type on yield of wheat is insignificant.
Ha: The effect of seed type on yield of wheat is significant.
Similarly, for fertilizer level:
Ho: The effect of fertilizer level on yield of wheat is insignificant.
Ha: The effect of fertilizer level on yield of wheat is significant.

Fertilizer level      Seed type
                      A      B      C      D
f1                   10     11     12     10
f2                    8     10     13     12
f3                    9     12     15     14
f4                   10     13      9     15
f5                   11     12     10     14

Thus in this case there can be the following results.
1)      Effects of both seed type and fertilizer level are significant.
2)      Effects of both seed type and fertilizer level are insignificant.
3)      Effect of seed type is significant whereas that of fertilizer level is insignificant.
4)      Effect of seed type is insignificant whereas that of fertilizer level is significant.
Variance of Two-way ANOVA
Total variability = Variability due to seed type change + Variability due to fertilizer change + Random effect

Thus total Ss = Ss due to column effect + Ss due to row effect + Random effect.
Degrees of freedom in two way ANOVA:
1) For total Ss it is (N-1) or (20-1) = 19.
2) For seed/column effect it is (c-1) or (4-1) = 3, where c = No. of columns.
3) For fertilizer level/row effect it is (r-1) or (5-1) = 4, where r = No. of rows.
4) For random effect (known as residual/balance error) it is 19-3-4 = 12.
Source of variation        Ss values    Df           Ms                      Fcal         F critical (α = 0.05/0.01)
Ss due to column effect    X            (c-1)        Msc = X/(c-1)           Msc/MsBE     V1 = (c-1), V2 = (N-c-r+1)
Ss due to row effect       Y            (r-1)        Msr = Y/(r-1)           Msr/MsBE     V1 = (r-1), V2 = (N-c-r+1)
Balance error              Z            (N-c-r+1)    MsBE = Z/(N-c-r+1)
Total Ss                   X+Y+Z        (N-1)

Df of balance error = (N-1)-[(c-1)+(r-1)] = N-1-c-r+2
                    = N-c-r+1
Correction factor (C.F.) = $(TG)^2 / N$
Total Ss = $\sum X_{ij}^2 - C.F.$
Ss due to column = $\sum T_j^2 / n_j - C.F.$,   j = 1, 2, 3, 4
Ss due to row = $\sum T_i^2 / n_i - C.F.$,   i = 1, 2, 3, 4, 5
B. error = Total Ss - Ss due to column - Ss due to row
TG = Grand total, N = No. of all observations
$T_j$ = Sum of jth column, $n_j$ = No. of elements in jth column
$T_i$ = Sum of ith row, $n_i$ = No. of elements in ith row
$X_{ij}$ = Individual element
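The correction factor method can be followed step by step for the seed type and fertilizer level table. The sketch below is only an illustration in plain Python; the variable names are chosen for this example, and the critical F values still have to be read from an F table.

```python
# Rows are fertilizer levels f1..f5, columns are seed types A..D
# (data from the two-way table in the text).
table = [
    [10, 11, 12, 10],
    [8, 10, 13, 12],
    [9, 12, 15, 14],
    [10, 13, 9, 15],
    [11, 12, 10, 14],
]

r = len(table)        # number of rows (fertilizer levels)
c = len(table[0])     # number of columns (seed types)
N = r * c             # number of all observations

grand_total = sum(sum(row) for row in table)           # TG
cf = grand_total ** 2 / N                              # correction factor

ss_total = sum(x * x for row in table for x in row) - cf

col_totals = [sum(row[j] for row in table) for j in range(c)]   # Tj
row_totals = [sum(row) for row in table]                        # Ti

# Every column holds r elements and every row holds c elements.
ss_column = sum(t * t for t in col_totals) / r - cf
ss_row = sum(t * t for t in row_totals) / c - cf
ss_error = ss_total - ss_column - ss_row               # balance error

df_column, df_row = c - 1, r - 1
df_error = (N - 1) - df_column - df_row                # N - c - r + 1

ms_column = ss_column / df_column
ms_row = ss_row / df_row
ms_error = ss_error / df_error

f_column = ms_column / ms_error
f_row = ms_row / ms_error

print(f"C.F. = {cf:.1f}, Total Ss = {ss_total:.1f}")
print(f"Column: Ss = {ss_column:.1f}, df = {df_column}, F = {f_column:.3f}")
print(f"Row:    Ss = {ss_row:.1f}, df = {df_row}, F = {f_row:.3f}")
print(f"Error:  Ss = {ss_error:.1f}, df = {df_error}")
```

Each calculated F is compared with its own table value: (3, 12) degrees of freedom for seed type and (4, 12) for fertilizer level.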


                              Sampling distribution of the mean when SD is unknown
If the population’s standard deviation is unknown then we can replace it by the sample standard deviation. This is because the sample standard deviation (S) is the best estimator of, and thus a substitute for, the population standard deviation ($\sigma$). Here the population consists of a large number of observations and thus its distribution is supposed to be normal. In the case of a small sample we assume that the observations are part of this population and thus follow the shape of the normal distribution approximately. This assumption indicates that if the sample size is increased to infinity then the sample will follow the normal distribution.
This assumption is fairly correct if the sample size is larger than 30. This is the reason why, when the sample size is less than 30, we treat the statistic as following the t-distribution. The standard deviation is calculated by dividing the sum of squared deviations from the mean by (n-1) and taking the square root. Thus
$S = \sqrt{\sum (x - \bar{x})^2 / (n-1)}$
$t_{cal} = (\bar{X} - \mu_{H_0}) / (S / \sqrt{n})$

Thus instead of finding Z we calculate tcal using the same formula (replacing $\sigma$ by S). The t-distribution is also like the normal distribution, but the height of its apex (crown) lowers with lower sample size. It is a symmetric distribution having mean 0 at the centre.
Here (n-1) is known as the degree of freedom. The table value of t is looked up using df and the level of significance ($\alpha$). As usual, the t-distribution has tail end probabilities according to the hypothesis, i.e. one-tailed test or two-tailed test.
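The one-sample calculation can be sketched as below. The ten breaking strengths are hypothetical values made up for this illustration (they are not data from this post), tested against the hypothesised mean of 500 kg from the copper wire example; the function name is also only illustrative.

```python
from math import sqrt

def one_sample_t(sample, mu_h0):
    """Return (t_cal, df) for a one-sample t-test against the mean mu_h0."""
    n = len(sample)
    x_bar = sum(sample) / n
    # Sample standard deviation with (n-1) in the denominator.
    s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    t_cal = (x_bar - mu_h0) / (s / sqrt(n))
    return t_cal, n - 1

# Hypothetical breaking strengths (kg) of 10 copper wires.
strengths = [498, 502, 505, 495, 500, 503, 497, 504, 499, 507]
t_cal, df = one_sample_t(strengths, 500)

print(f"t_cal = {t_cal:.4f} with df = {df}")
```

The value of t_cal is then compared with the t table value for the same df and the chosen level of significance.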
However, this is regarding the one-sample test. If we have to compare the means of two samples then we need to find the combined standard deviation. Let us assume Group ‘A’ has 15 samples and ‘B’ has 10 samples. The degree of freedom will be df = n1+n2-2 or 15+10-2 = 23. Let $S_1$ and $S_2$ be the two sample standard deviations calculated as per the above formula (of S), then
$S_{combined} = \sqrt{[S_1^2 (n_1-1) + S_2^2 (n_2-1)] / (n_1 + n_2 - 2)}$ ___________1
OR
$S_{combined} = \sqrt{[\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2] / df}$ _________2
Formula 1 is used when S1 and S2 are given or are required to be found. Formula 2 is used otherwise. Here
$t_{cal} = [(\bar{X}_1 - \bar{X}_2) / S_{combined}] \cdot \sqrt{n_1 n_2 / (n_1 + n_2)}$
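The two-sample calculation can be sketched in the same way. The code below uses formula 2 for the combined standard deviation; the two small groups are hypothetical numbers chosen only to show the steps.

```python
from math import sqrt

def two_sample_t(a, b):
    """Return (t_cal, s_combined, df) using the combined standard deviation."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    # Sum of squared deviations of each group from its own mean.
    ss1 = sum((x - mean1) ** 2 for x in a)
    ss2 = sum((x - mean2) ** 2 for x in b)
    df = n1 + n2 - 2
    s_combined = sqrt((ss1 + ss2) / df)                    # formula 2
    t_cal = (mean1 - mean2) / s_combined * sqrt(n1 * n2 / (n1 + n2))
    return t_cal, s_combined, df

# Hypothetical observations for two groups.
group_a = [12, 15, 11, 14, 13]
group_b = [10, 9, 12, 11]

t_cal, s_combined, df = two_sample_t(group_a, group_b)
print(f"S combined = {s_combined:.4f}, t_cal = {t_cal:.4f}, df = {df}")
```

Formula 1 gives the same combined standard deviation, because $S_1^2 (n_1-1)$ is just the sum of squared deviations of the first group, and similarly for the second.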

                                               Binomial distribution
When the outcomes of an experiment are only two then it is known as binomial. An experiment carried out randomly having two outcomes is known as a Bernoulli trial. In many cases we use only two outcomes such as failure or success, boy or girl, bad apple or good apple, defective or non-defective job, Head or Tail, leaky or not leaky joints, effective or not effective meeting, rice or wheat eaten etc. Binomial distribution is a systematic arrangement of the events to find out the probability of various events.
Binomial expansion: Let us take the example of tossing a coin. If a coin is tossed the sample space will be S = {H, T}. If two coins are tossed, S = {TT, HT, TH, HH}. Let us assume ‘p’ is the probability of getting Head and ‘q’ is that of getting Tail. Here p+q = 1 as per the rule of probability.
For n = 2, S = {qq, pq, qp, pp}. Ignoring the sequence, the terms are {$q^2$, $2qp$, $p^2$}. This is exactly $(q+p)^2$. If three coins are tossed then the terms will be {$q^3$, $3q^2p$, $3qp^2$, $p^3$}.
Application of Binomial expansion
Let us assume q = p = 0.5.
Term                q^3       3q^2p     3qp^2     p^3
Number of heads     0         1         2         3
Probability         0.125     0.375     0.375     0.125
This indicates that the probability of getting 0 or 3 heads is 0.125 each, and the probability of getting 1 head or 2 heads is 0.375 each. Thus the total probability equals 1 (i.e. 0.125+0.375+0.375+0.125).
Let us assume ‘n’ coins are tossed at a time, or a coin is tossed ‘n’ times; the result will be
$(q+p)^n = {}^{n}C_{0} q^{n} p^{0} + {}^{n}C_{1} q^{n-1} p^{1} + {}^{n}C_{2} q^{n-2} p^{2} + {}^{n}C_{3} q^{n-3} p^{3} + \dots + {}^{n}C_{n} q^{0} p^{n}$
Here we observe the following points in the expansion.
1)      The number of terms equals n+1. For three coins there were (3+1) = 4 terms.
2)      The exponent of q starts from ‘n’ and reduces by one in each subsequent term, till 0 in the end term.
3)      The exponent of p starts from 0 and increases by 1 in each subsequent term, till ‘n’ in the end term.
4)      The sum of the exponents of q and p in all terms remains constant, i.e. ‘n’.
5)      The index ‘r’ starts from r = 0 in the first term and goes to r = n in the end term.
6)      The exponent of p and ‘r’ are the same.
7)      The sum of all binomial coefficients, i.e. nCr, is equal to the number of outcomes in the sample space, $2^n$ (8 items for three coins).
NOTE.
5C2 = 5*4/(2*1) = 10, 8C3 = 8*7*6/(3*2*1) = 56
We can also use a calculator to find the value of nCr directly.
8)      The sum of all probabilities equals 1.
If the coin is fair/balanced/unbiased then p = q = 0.5. In other cases p and q will be different.
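The terms of the expansion can be generated for any ‘n’ with the nCr function of Python's math module (math.comb, available from Python 3.8). A small sketch for three fair coins; the helper name binomial_terms is only illustrative.

```python
from math import comb

def binomial_terms(n, p):
    """Terms nCr * q^(n-r) * p^r of (q+p)^n for r = 0..n, with q = 1 - p."""
    q = 1 - p
    return [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

# Three fair coins: probabilities of 0, 1, 2 and 3 heads.
terms = binomial_terms(3, 0.5)
print("Probabilities:", terms)
print("Number of terms:", len(terms))
print("Sum of probabilities:", sum(terms))

# Sum of the binomial coefficients equals the size of the sample space.
coefficient_sum = sum(comb(3, r) for r in range(4))
print("Sum of coefficients:", coefficient_sum)

# The two values from the note.
print("5C2 =", comb(5, 2), " 8C3 =", comb(8, 3))
```

Changing the second argument of binomial_terms gives the probabilities for a biased coin, where p and q are not equal.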
Practical use of binomial distribution: It is appropriate when sampling is done with replacement. However, the probability p or q should not be less than 0.10.
It is used in acceptance sampling, for accepting a lot based on samples. It is used for drawing generalizations for decision making. In the case of leaky and not leaky joints of a pipeline: what is the probability of getting more than 2 leaky joints? If one team is required for repairing each leaky joint, what is the probability that two teams would be required? In a year, how many times would two or more teams be required for repair work?

                                   Assumptions in Binomial Experiment
1)      Experiment is performed under same condition for a fixed number of trials ‘n’.
2)      Only two possible outcomes
3)      Probabilities p and q (of head and tail respectively) remain constant in every trial. If sampling is done without replacement then the Binomial Distribution is not applicable. Replacement means there are always two faces of a coin in every experiment; we cannot remove a face of a coin.
4)      Trials are statistically independent. There is no connection between the result of the first trial and that of the second trial.
Mean = $np$, standard deviation = $\sqrt{npq}$
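The leaky joint question can be worked out once ‘n’ and ‘p’ are known. The figures below (10 joints, probability 0.1 that a joint is leaky) are hypothetical values assumed only for this sketch; the mean and standard deviation use the formulas np and root of npq.

```python
from math import comb, sqrt

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * (1 - p) ** (n - r) * p ** r

n_joints = 10    # hypothetical number of joints inspected
p_leaky = 0.1    # hypothetical probability that a joint is leaky

# P(more than 2 leaky joints) = 1 - P(0) - P(1) - P(2)
p_more_than_2 = 1 - sum(binomial_pmf(r, n_joints, p_leaky) for r in range(3))

mean_leaky = n_joints * p_leaky
sd_leaky = sqrt(n_joints * p_leaky * (1 - p_leaky))

print(f"P(more than 2 leaky joints) = {p_more_than_2:.4f}")
print(f"Mean = {mean_leaky:.2f}, SD = {sd_leaky:.4f}")
```

Working through the complement (1 minus the probabilities of 0, 1 and 2) is shorter than adding the probabilities of 3, 4, ... up to n.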

COEFFICIENT OF CORRELATION
Correlation is a measure of relationship between two variables. Its value varies from -1 to 1.
It is a unitless quantity. There are a number of situations in our experience where there is correlation. Example: There tends to be positive correlation between advertisement cost and sales revenue, age and weight, noise level and blood pressure, marks obtained and study hours, heights of father and child, rainfall and flood etc. Here we notice that as one variable increases the other tends to increase, and vice versa.
Similarly, there can be negative correlation, i.e. as one variable increases the other will tend to decrease.
Example: Expenditure on safety and number of accidents, quality improvement efforts and number of defectives produced, surface roughness and speed, marks obtained and hours on TV, consumption of water and calcification etc.
Also there can be no correlation between two variables, such as richness and weight, death rate and birth rate, etc.
There should be a cause and effect relationship between the two variables. Logically one variable should influence the other to increase or decrease; otherwise the correlation is false. Example: Prices of gold and tomatoes, reading speed and driving speed, noise level and weight of a person etc. We cannot establish a link between these two variables.
Perfect positive correlation: Here the value of correlation is equal to ‘1’. The relationship between the two variables is such that we get the other variable by multiplying (or dividing) the first variable by a factor, and the factor of multiplication is always the same. If the cost of 1 pencil is Rs. 3, then 2 pencils cost Rs. 6, 8 pencils Rs. 24, 10 pencils Rs. 30, 50 pencils Rs. 150 etc. The relationship between number of pencils and cost is perfect positive. Here the multiplying factor is 3. The coefficient of correlation between any two multiplication tables is 1 (example: table of 3 and table of 17).
Perfect Negative Correlation: Here the value of correlation is equal to ‘-1’. Here smaller the 1st variable, higher will be the 2nd variable. If we write the table of 3, i.e. from 3 to 30, and the table of 17 in the reverse direction, i.e. 170 to 17, the relationship between the two variables is perfectly negative, i.e. equal to ‘-1’. The pairs of variables will be (3,170), (6,153), (9,136) etc.
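The two perfect cases can be checked numerically. The formula for the coefficient is not written out in this post, so the sketch below assumes the usual Pearson product-moment formula, r = Σ(x - x bar)(y - y bar) / root of [Σ(x - x bar)² * Σ(y - y bar)²]; the function name is only illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (assumed formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

table_of_3 = [3 * k for k in range(1, 11)]       # 3, 6, ..., 30
table_of_17 = [17 * k for k in range(1, 11)]     # 17, 34, ..., 170
table_of_17_reversed = table_of_17[::-1]         # 170, 153, ..., 17

r_positive = pearson_r(table_of_3, table_of_17)
r_negative = pearson_r(table_of_3, table_of_17_reversed)

print(f"Table of 3 vs table of 17:          r = {r_positive:.4f}")
print(f"Table of 3 vs table of 17 reversed: r = {r_negative:.4f}")
```

Because of floating point rounding the computed values may differ from 1 and -1 in the last decimal places, so they are printed to four decimals.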
Interpretation of correlation coefficient





Value     Interpretation
0         Event is not very likely to occur
0.25      Event is more likely not to occur than to occur
0.5       Event is likely to occur or not to occur
0.75      Event is more likely to occur than not to occur
1         Event is very likely to occur
