# Marketing Research Session 10: Hypothesis Testing with Simple Random Samples


Marketing Research Session 10: Hypothesis Testing with Simple Random Samples (Chapter 12)

Remember: Z.05 = 1.645, Z.01 = 2.33

We will only cover one-sided hypothesis testing (cases 12.3, 12.4.2, 12.5.2, and 12.6.1).

12.6.1 One Sample, One-sided Test: At a (1 − α) level of confidence, test H0 : π ≤ K against Ha : π > K.

Example 12.1 (adapted) Suppose we wish to organize a rock concert at a university campus. The campus has 20,000 students. Our costs will be exactly recovered if 4000 students attend the concert; we will make a profit if more than 4000 students attend. In a simple random sample of 100 students, 30 students will attend the concert. At a 95% level of confidence, test the null hypothesis that not more than 4000 students will attend the concert.

Process: Null hypothesis: 20,000π ≤ 4000, that is, $\pi \le \frac{4000}{20{,}000} = 0.2$.

At a 95% level of confidence, test H0 : π ≤ 0.2 against Ha : π > 0.2.

[Figure: sampling distribution of p centered at π = .2, with an upper-tail area of .05 to the right of the cutoff ".2 + ?"]

Data: n = 100, p = ____

Decision Rule: At a 95% level of confidence, reject H0 if p > ____

Conclusion:

Example 12.1 (continued)

H0 : π ≤ .2, Ha : π > .2

n = 100, p = .3

[Figure: sampling distribution of p centered at π = .2; the P value is the upper-tail area to the right of p = .3]

$$\text{P-value} = P(p \ge .3 \mid \pi = .2) = P\left(\frac{p - \pi}{\sigma_p} \ge \underline{\qquad}\right) = \underline{\qquad}$$

Note:

• The P value is the probability of getting a result as extreme as, or more extreme than, the sample result if H0 is true.
• For any hypothesis test, if P value < α, we can reject H0 at confidence level (1 − α).
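The P value for Example 12.1 can be checked numerically. The following is a minimal sketch, assuming the normal approximation used throughout this session and evaluating the standard deviation of p at the boundary value π = .2; the function name `one_sided_p_value` is a hypothetical helper, not something from the course text.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(p_hat: float, k: float, n: int) -> float:
    """Upper-tail P value for H0: pi <= k, evaluated at the boundary pi = k."""
    sigma_p = sqrt(k * (1 - k) / n)      # standard deviation of p when pi = k
    z = (p_hat - k) / sigma_p            # standardized sample proportion
    return 1 - NormalDist().cdf(z)       # area to the right of z


# Example 12.1: n = 100, p = .3, H0: pi <= .2
p_value = one_sided_p_value(p_hat=0.30, k=0.20, n=100)
print(f"P value = {p_value:.4f}")
print("Reject H0 at 95% confidence" if p_value < 0.05 else "Cannot reject H0")
```

Because the P value is well below α = .05, this sketch leads to rejecting H0 at a 95% level of confidence.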

12.6.1 One Sample, One-sided Test: At a (1 − α) level of confidence, test H0 : π ≤ K against Ha : π > K.

Decision Rule: At a (1 − α) level of confidence, reject H0 if

$$p > K + Z_\alpha \sqrt{\frac{K(1 - K)}{n}}$$

Example 12.1 (adapted) Suppose we wish to organize a rock concert at a university campus. The campus has 20,000 students. Our costs will be exactly recovered if 4000 students attend the concert; we will make a profit if more than 4000 students attend. In a simple random sample of 100 students, 30 students will attend the concert. At a 95% level of confidence, test the null hypothesis that not more than 4000 students will attend the concert.

H0 :

Ha :

n = ____, p = ____

Decision Rule: At a 95% level of confidence, reject H0 if:

Conclusion:

12.6.1 One Sample, One-sided Test: At a (1 − α) level of confidence, test H0 : π ≤ K against Ha : π > K.

Decision Rule: At a (1 − α) level of confidence, reject H0 if

$$p > K + Z_\alpha \sqrt{\frac{K(1 - K)}{n}}$$

Example 12.6: There are 50,000 households in a city, and we have drawn a simple random sample of size 500 from this population. In this sample of 500, 220 households own pets. At a 99% level of confidence, test the null hypothesis that not more than 20,000 households in this city own pets.

Answer: First write H0 and Ha in terms of π.

H0 :

Ha :

n = ____, p = ____

Decision Rule: At a 99% level of confidence, reject H0 if:

Conclusion:
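The decision rule of 12.6.1 can be sketched in a few lines. This is an illustrative check only: the helper name `proportion_cutoff` is hypothetical, and the Z values are the two table values quoted in these notes.

```python
from math import sqrt

# Table values quoted in the notes: Z.05 = 1.645, Z.01 = 2.33
Z = {0.05: 1.645, 0.01: 2.33}


def proportion_cutoff(k: float, n: int, alpha: float) -> float:
    """Cutoff K + Z_alpha * sqrt(K(1 - K)/n); reject H0: pi <= K if p exceeds it."""
    return k + Z[alpha] * sqrt(k * (1 - k) / n)


# Example 12.1: K = 4000/20,000, n = 100, 95% confidence
cut_121 = proportion_cutoff(4000 / 20000, 100, 0.05)
p_121 = 30 / 100

# Example 12.6: K = 20,000/50,000, n = 500, 99% confidence
cut_126 = proportion_cutoff(20000 / 50000, 500, 0.01)
p_126 = 220 / 500

for name, p, cut in [("12.1", p_121, cut_121), ("12.6", p_126, cut_126)]:
    decision = "reject H0" if p > cut else "cannot reject H0"
    print(f"Example {name}: p = {p:.3f}, cutoff = {cut:.4f} -> {decision}")
```

Under these assumptions the sample proportion exceeds the cutoff in Example 12.1 but not in Example 12.6.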

12.3: At a (1 − α) level of confidence, test H0 : µ ≤ K against Ha : µ > K.

Decision Rule:

If n ≥ 30: At a (1 − α) level of confidence, reject H0 if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$.

If n < 30: At a (1 − α) level of confidence, reject H0 if $\bar{X} > K + t_\alpha \frac{s}{\sqrt{n}}$, where t has (n − 1) degrees of freedom.

Example 12.2. A university has 15,000 students. We have drawn a simple random sample of size 400 from the population, and recorded how much money each student spent on cellular telephone service during November, 2003. For this sample, the sample mean is \$36, and the sample standard deviation is \$20. At a 99% level of confidence, test the null hypothesis that these 15,000 students, combined, did not spend more than \$500,000 on cellular telephone service during November, 2003.

Answer:

H0 :

Ha :

$\bar{X}$ = ____, s = ____, n = ____

Decision Rule: At a 99% level of confidence, reject H0 if

Conclusion:

H0 : µ ≤ K against Ha : µ > K

Decision Rule:

If n ≥ 30: At a (1 − α) level of confidence, reject H0 if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$.

If n < 30: At a (1 − α) level of confidence, reject H0 if $\bar{X} > K + t_\alpha \frac{s}{\sqrt{n}}$, where t has (n − 1) degrees of freedom.

Example 12.3. Suppose once again we want to test the hypothesis described in Example 12.2, and a simple random sample again gives $\bar{X}$ = 36 and s = 20. Perform the test if the sample size is 25 instead of 400.

Answer:

H0 :

Ha :

$\bar{X}$ = ____, s = ____, n = ____

Decision Rule: At a 99% level of confidence, reject H0 if

Conclusion:
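Examples 12.2 and 12.3 differ only in which critical value applies. The sketch below assumes the null hypothesis is restated per student (K = 500,000/15,000) and that the t value for 24 degrees of freedom, 2.492, is read from a standard t table; that table value and the helper name `mean_cutoff` are not given in these notes.

```python
from math import sqrt


def mean_cutoff(k: float, s: float, n: int, critical: float) -> float:
    """Cutoff K + critical * s / sqrt(n); reject H0: mu <= K if X-bar exceeds it."""
    return k + critical * s / sqrt(n)


K = 500_000 / 15_000          # null value per student, about $33.33
x_bar, s = 36, 20

Z_01 = 2.33                   # quoted in the notes, used when n >= 30
T_01_DF24 = 2.492             # assumption: one-sided t table value, 24 df

cut_large = mean_cutoff(K, s, 400, Z_01)      # Example 12.2, n = 400
cut_small = mean_cutoff(K, s, 25, T_01_DF24)  # Example 12.3, n = 25

print(f"n = 400: cutoff = {cut_large:.2f}, reject H0: {x_bar > cut_large}")
print(f"n = 25:  cutoff = {cut_small:.2f}, reject H0: {x_bar > cut_small}")
```

The same sample mean leads to opposite conclusions because the smaller sample has a much larger standard error and a larger critical value.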

12.4.2. Comparing Means of Two Independent Samples: At a (1 − α) level of confidence, test H0 : µ1 − µ2 ≤ K against Ha : µ1 − µ2 > K. Consider only the case where n1 ≥ 30 and n2 ≥ 30.

Decision Rule: At a confidence level of (1 − α), reject H0 if

$$(\bar{X}_1 - \bar{X}_2) > K + Z_\alpha \, \sigma_{\bar{X}_1 - \bar{X}_2} \approx K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Note:

• If the hypothesis is stated in terms of (µ2 − µ1), state the decision rule in terms of $(\bar{X}_2 - \bar{X}_1)$.
• $(\bar{X}_1 - \bar{X}_2)$ and $(\bar{X}_2 - \bar{X}_1)$ have the same standard deviation, $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.

Example 12.4: We have drawn two independent simple random samples, one from the male student population, and the other from the female student population, at a college campus. For each member of the samples, we recorded how much money that student spent purchasing clothes during the Spring 2003 semester. The results are summarized below:

1. Sample of Males: n1 = 50, $\bar{X}_1$ = \$420, s1 = \$150.
2. Sample of Females: n2 = 150, $\bar{X}_2$ = \$450, s2 = \$250.

At a 99% level of confidence, test the null hypothesis that on the average, a female student did not spend more money purchasing clothes than a male student during the Spring 2003 semester.

12.4.2. Comparing Means of Two Independent Samples: At a (1 − α) level of confidence, test H0 : µ1 − µ2 ≤ K against Ha : µ1 − µ2 > K.

Decision Rule: At a confidence level of (1 − α), reject H0 if

$$(\bar{X}_1 - \bar{X}_2) > K + Z_\alpha \, \sigma_{\bar{X}_1 - \bar{X}_2} \approx K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Note: If the hypothesis is stated in terms of (µ2 − µ1), state the decision rule in terms of $(\bar{X}_2 - \bar{X}_1)$.

H0 :

Ha :

n1 = 50, $\bar{X}_1$ = \$420, s1 = \$150

n2 = 150, $\bar{X}_2$ = \$450, s2 = \$250

Decision Rule: At a 99% level of confidence, reject H0 if

Conclusion:
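For Example 12.4 the hypothesis concerns (female − male), so the rule is stated in terms of that difference with K = 0. The following is a minimal sketch under that reading; `difference_cutoff` is a hypothetical helper name.

```python
from math import sqrt


def difference_cutoff(k: float, s_a: float, n_a: int,
                      s_b: float, n_b: int, z: float) -> float:
    """Cutoff K + Z * sqrt(s_a^2/n_a + s_b^2/n_b) for a difference of sample means."""
    return k + z * sqrt(s_a ** 2 / n_a + s_b ** 2 / n_b)


# Example 12.4: H0: mu_female - mu_male <= 0, 99% confidence (Z.01 = 2.33)
n_m, x_m, s_m = 50, 420, 150     # male sample
n_f, x_f, s_f = 150, 450, 250    # female sample

observed = x_f - x_m
cutoff = difference_cutoff(0, s_f, n_f, s_m, n_m, z=2.33)

print(f"observed difference = {observed}, cutoff = {cutoff:.2f}")
print("Reject H0" if observed > cutoff else "Cannot reject H0")
```

Under these assumptions the observed difference of \$30 falls below the cutoff, so H0 is not rejected.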

Exercise Problem 3 from 12.8. Suppose you want to evaluate a program which helps a student prepare to take the GMAT. You have randomly selected 100 SU undergraduates, and randomly divided them into two groups of 50 each. Students in group 1 attend the training program. Students in group 2 do not attend the training program. Students in both groups subsequently take the GMAT.

Results.

Group 1: 50 students, average score of the sample is 610, sample standard deviation is 20.

Group 2: 50 students, average score of the sample is 580, sample standard deviation is 30.

3.(a) At a 95% level of confidence, test the null hypothesis that the training program does not increase a student's GMAT score by more than 20.

3.(b) At a 95% level of confidence, test the null hypothesis that the population average GMAT score of students who take the training would not exceed 600.

12.3. H0 : µ ≤ K, Ha : µ > K. At confidence (1 − α):

• Reject H0 if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$ if n ≥ 30.
• Reject H0 if $\bar{X} > K + t_\alpha \frac{s}{\sqrt{n}}$ if n < 30 (t has n − 1 degrees of freedom).

12.4.2. Two independent samples, H0 : µ1 − µ2 ≤ K, Ha : µ1 − µ2 > K. At confidence (1 − α), reject H0 if

$$\bar{X}_1 - \bar{X}_2 > K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Given:

n1 = 50, $\bar{X}_1$ = 610, s1 = 20

n2 = 50, $\bar{X}_2$ = 580, s2 = 30

3.(a) At a 95% level of confidence, test the null hypothesis that the training program does not increase a student's GMAT score by more than 20.

Process: Treat the two samples as independent samples and use the decision rule from 12.4.2.

H0 :

Ha :

Decision Rule: At a 95% level of confidence, reject H0 if

Conclusion:

Given:

n1 = 50, $\bar{X}_1$ = 610, s1 = 20

n2 = 50, $\bar{X}_2$ = 580, s2 = 30

3.(b) At a 95% level of confidence, test the null hypothesis that the population average GMAT score of students who take the training would not exceed 600.

Process: Focus only on the sub-population of students who took the training and use the decision rule from 12.3.

H0 :

Ha :

$\bar{X}_1$ = ____, s1 = ____, n1 = ____

Decision Rule: At a 95% level of confidence, reject H0 if $\bar{X}_1$ > ____

Conclusion:
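Both parts of Exercise Problem 3 reduce to comparing a sample quantity with a cutoff. The sketch below is one way to check the arithmetic, assuming K = 20 for part (a) and K = 600 for part (b) as the process notes suggest; variable names are illustrative.

```python
from math import sqrt

Z_05 = 1.645                      # quoted in the notes

n1, x1, s1 = 50, 610, 20          # group 1: attended the training
n2, x2, s2 = 50, 580, 30          # group 2: did not attend

# 3(a): H0: mu1 - mu2 <= 20, two independent samples (rule 12.4.2)
cut_a = 20 + Z_05 * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
reject_a = (x1 - x2) > cut_a

# 3(b): H0: mu1 <= 600, one sample with n >= 30 (rule 12.3)
cut_b = 600 + Z_05 * s1 / sqrt(n1)
reject_b = x1 > cut_b

print(f"3(a): difference = {x1 - x2}, cutoff = {cut_a:.2f}, reject H0: {reject_a}")
print(f"3(b): mean = {x1}, cutoff = {cut_b:.2f}, reject H0: {reject_b}")
```

With these inputs both sample quantities exceed their cutoffs.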

12.5.2 Comparing Means of Two Related Samples: At a (1 − α) level of confidence, test H0 : (µ1 − µ2) ≤ K against Ha : (µ1 − µ2) > K, where K is a specified number.

Restate Problem: Let d = X1 − X2. Then µd = µ1 − µ2. Hence, we are doing the following test: At a (1 − α) level of confidence, test H0 : µd ≤ K against Ha : µd > K.

Decision Rule: The test is identical to the one-sided hypothesis test using one sample, discussed in Section 12.3, with x replaced by d. Proceed as follows:

(1) For each pair i, compute di = x1i − x2i.

(2) Compute $\bar{d} = \frac{\sum d_i}{n}$ and $s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$.

(3) Depending on sample size, the decision rule is given as follows:

Case 1. n ≥ 30: At a confidence level of (1 − α), reject H0 if $\bar{d} > K + Z_\alpha \frac{s_d}{\sqrt{n}}$.

Case 2. n < 30: At a confidence level of (1 − α), reject H0 if $\bar{d} > K + t_\alpha \frac{s_d}{\sqrt{n}}$, where t has (n − 1) degrees of freedom.

Case 1. n ≥ 30: At a confidence level of (1 − α), reject H0 if $\bar{d} > K + Z_\alpha \frac{s_d}{\sqrt{n}}$.

Case 2. n < 30: At a confidence level of (1 − α), reject H0 if $\bar{d} > K + t_\alpha \frac{s_d}{\sqrt{n}}$, where t has (n − 1) degrees of freedom.

Example 12.5. Suppose you have selected a simple random sample of 9 Syracuse University undergraduate students and, for each student, recorded how much money (s)he spends in an average week on (1) snacks and (2) alcoholic beverages. Results:

| Student # | \$ spent on snacks/week | \$ spent on alcoholic beverages/week |
|---|---|---|
| 1 | 10 | 25 |
| 2 | 10 | 10 |
| 3 | 20 | 0 |
| 4 | 40 | 40 |
| 5 | 5 | 25 |
| 6 | 25 | 35 |
| 7 | 30 | 40 |
| 8 | 20 | 30 |
| 9 | 15 | 15 |

At a 99% level of confidence, test the null hypothesis that on the average, an SU undergraduate student does not spend more money on alcoholic beverages than on snacks.

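| Student # | \$ spent on snacks per week (x1i) | \$ spent on alcoholic beverages per week (x2i) | di = x2i − x1i | (di − d̄) | (di − d̄)² |
|---|---|---|---|---|---|
| 1 | 10 | 25 | | | |
| 2 | 10 | 10 | | | |
| 3 | 20 | 0 | | | |
| 4 | 40 | 40 | | | |
| 5 | 5 | 25 | | | |
| 6 | 25 | 35 | | | |
| 7 | 30 | 40 | | | |
| 8 | 20 | 30 | | | |
| 9 | 15 | 15 | | | |

$\sum d_i$ = ____

$\sum (d_i - \bar{d})^2$ = ____

$\bar{d}$ = ____

$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$ = ____

H0 :

Ha :

Decision Rule: At a 99% level of confidence, reject H0 if $\bar{d}$ > ____

Conclusion: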
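The paired-sample steps (1) to (3) can be sketched for the Example 12.5 data as follows. Two assumptions are made here: differences are taken as alcohol minus snacks (di = x2i − x1i, matching the worksheet header), and the one-sided t value for 8 degrees of freedom at α = .01, 2.896, is read from a standard t table rather than from these notes.

```python
from math import sqrt
from statistics import mean, stdev

snacks = [10, 10, 20, 40, 5, 25, 30, 20, 15]     # x1i
alcohol = [25, 10, 0, 40, 25, 35, 40, 30, 15]    # x2i

# Step 1: paired differences d_i = x2i - x1i
d = [a - s for s, a in zip(snacks, alcohol)]

# Step 2: mean and sample standard deviation (divisor n - 1)
d_bar = mean(d)
s_d = stdev(d)

# Step 3: n = 9 < 30, so use t with n - 1 = 8 degrees of freedom
T_01_DF8 = 2.896                                 # assumption: t table value
K = 0                                            # H0: mu_d <= 0
cutoff = K + T_01_DF8 * s_d / sqrt(len(d))

print(f"d = {d}")
print(f"d-bar = {d_bar}, s_d = {s_d:.3f}, cutoff = {cutoff:.2f}")
print("Reject H0" if d_bar > cutoff else "Cannot reject H0")
```

Under these assumptions the mean difference is below the cutoff, so H0 is not rejected.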

# Marketing Research Session 11

Coverage of Session 11: Chi-Square Analysis with Cross-Tabulations (Chapter 13)

Chi-Square Test with Cross-Tabulation (Chapter 13)

Learning Objectives:

• Meaning of "no relationship."
• The chi-square test:
  – Compute expected frequencies.
  – Compute chi-square (χ²).
  – Compute degrees of freedom.
  – Do the test.
• Check if the test is valid, and combine rows and/or columns as necessary to have a valid test.
• Important application: Test if the population proportion is the same in two or more sub-populations.

Example 13.1 We selected a simple random sample of 150 students from a college campus, and recorded (i) the gender of the student, and (ii) whether the student has attended a basketball game played by the college team during the past year. The results are expressed as the following 2 × 2 cross tabulation:

| | Didn't Attend Game | Attended Game |
|---|---|---|
| Male | 30 | 60 |
| Female | 42 | 18 |

H0 : There is no relationship between gender and attendance.

Intuition of H0 :

• There are two sub-populations: men and women.
• A member of either sub-population belongs to one of two categories: did not attend, and attended.
• If H0 is true, then each sub-population (men or women) should have the same percentage break-down between the two categories of attendance.

Formally, define:

• π11 = Proportion of men who did not attend
• π12 = Proportion of men who attended
• π21 = Proportion of women who did not attend
• π22 = Proportion of women who attended

H0 means: π11 = π21, π12 = π22.

| | Didn't Attend Game | Attended Game |
|---|---|---|
| Male | 30 | 60 |
| Female | 42 | 18 |

H0 : There is no relationship between gender and attendance.

Expected Frequencies: In the whole sample of 150 students:

Proportion that did not attend = Total of Column 1 / n = 72/150

Proportion that attended = Total of Column 2 / n = 78/150

Number of men in sample = Total of Row 1 = 90

Expected number of men that did not attend: $E_{11} = 90 \times \frac{72}{150} = \frac{90 \times 72}{150} = 43.2$

Expected number of men that attended: $E_{12} = 90 \times \frac{78}{150} = \frac{90 \times 78}{150} = 46.8$

Number of women in sample = Total of Row 2 = 60

Expected number of women that did not attend: $E_{21} = 60 \times \frac{72}{150} = \frac{60 \times 72}{150} = 28.8$

Expected number of women that attended: $E_{22} = 60 \times \frac{78}{150} = \frac{60 \times 78}{150} = 31.2$

More Generally:

$$E_{ij} = \frac{\text{Total of Row } i \times \text{Total of Column } j}{\text{Sample Size } (n)}$$
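The general formula for Eij translates directly into code. This is a small sketch; the function name `expected_frequencies` is illustrative, and the table is the Example 13.1 cross-tabulation (rows: male, female; columns: did not attend, attended).

```python
def expected_frequencies(observed):
    """E_ij = (total of row i) * (total of column j) / n for a cross-tabulation."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[30, 60],    # male:   didn't attend, attended
            [42, 18]]    # female: didn't attend, attended

expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, [round(e, 1) for e in row])
```

The rounded values should agree with E11 through E22 computed by hand above.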

$$E_{ij} = \frac{\text{Total of Row } i \times \text{Total of Column } j}{\text{Sample Size } (n)}$$

Chi-Square:

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

That is, compute (Observed − Expected)² / Expected in each cell, and then sum over all cells.

R = number of rows

C = number of columns

Degrees of freedom = (R − 1) × (C − 1)

Decision Rule: At a confidence level (1 − α), reject H0 if $\chi^2 > \chi^2_\alpha$ at (R − 1) × (C − 1) degrees of freedom.

Degrees of freedom = (R − 1) × (C − 1)

Decision Rule: At a confidence level (1 − α), reject H0 if $\chi^2 > \chi^2_\alpha$ at (R − 1) × (C − 1) degrees of freedom.

Return to Example 13.1 We selected a simple random sample of 150 students from a college campus, and recorded (i) the gender of the student, and (ii) whether the student has attended a basketball game played by the college team during the past year. The results are expressed as the following 2 × 2 cross tabulation:

| | Didn't Attend Game | Attended Game |
|---|---|---|
| Male | 30 | 60 |
| Female | 42 | 18 |

At a 95% level of confidence, test the null hypothesis that there is no relationship between gender and attendance (against the alternate hypothesis that there is some kind of relationship between the two).

Here:

O11 = ____, E11 = (____ × ____)/____ = ____

O12 = ____, E12 = (____ × ____)/____ = ____

O21 = ____, E21 = (____ × ____)/____ = ____

O22 = ____, E22 = (____ × ____)/____ = ____

$$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} = \underline{\qquad}$$

Degrees of freedom = (2 − 1) × (2 − 1) = ____

Decision Rule: At a 95% level of confidence, reject H0 if χ² > ____

Conclusion:
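The full chi-square computation for Example 13.1 can be sketched as below. The critical value 3.841 (α = .05, 1 degree of freedom) is an assumption taken from a standard chi-square table, and `chi_square` is a hypothetical helper name.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a cross-tabulation."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n    # expected frequency E_ij
            stat += (o - e) ** 2 / e                 # cell contribution
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


observed = [[30, 60],    # male
            [42, 18]]    # female

stat, df = chi_square(observed)
CHI2_05_DF1 = 3.841      # assumption: chi-square table value, alpha = .05, df = 1

print(f"chi-square = {stat:.3f}, df = {df}")
print("Reject H0" if stat > CHI2_05_DF1 else "Cannot reject H0")
```

Under these assumptions the statistic is far above the critical value, so the hypothesis of no relationship is rejected at a 95% level of confidence.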

Validity of Chi-Square Test

Need:

• Eij > 1 in all cells.
• Eij ≥ 5 in 80% or more of the cells.

Note:

• We may combine any rows or columns. If we have scale variables (e.g., 1-7 scales), combine adjacent rows or columns for ease of interpretation.
• If we combine any two rows or any two columns, the observed numbers add up, and the expected numbers also add up.
• If you need to modify a table, any valid modification is acceptable. In Minitab you do that by recoding variables.
• The final modified table must have at least two rows and at least two columns.
• After you modify the table, the degrees of freedom come from the number of rows and columns of the modified table.

Example 13.3 We collected a simple random sample of 60 students from a college campus and asked them to rate how much they like to watch professional sports on TV on a 1-7 scale (strongly dislike to strongly like). We also noted the gender of each respondent. Based on the results, we have constructed the following cross-tabulation:

Like to watch professional sports on TV

| Gender | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Male | 2 | 0 | 4 | 12 | 6 | 8 | 8 |
| Female | 4 | 3 | 2 | 6 | 3 | 1 | 1 |

At a 99% level of confidence, test the null hypothesis that gender is not related to how much one likes to watch professional sports on TV.

Process: First augment the cross-tabulation by row totals and column totals:

| Gender | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Row Totals |
|---|---|---|---|---|---|---|---|---|
| Male | 2 | 0 | 4 | 12 | 6 | 8 | 8 | 40 |
| Female | 4 | 3 | 2 | 6 | 3 | 1 | 1 | 20 |
| Column Totals | 6 | 3 | 6 | 18 | 9 | 9 | 9 | n = 60 |

Then compute the expected frequencies in the original table.

E11 = (40 × 6)/60 = ____, E12 = (40 × 3)/60 = ____, E13 = (40 × 6)/60 = ____, E14 = (40 × 18)/60 = ____

E15 = (40 × 9)/60 = ____, E16 = (40 × 9)/60 = ____, E17 = (40 × 9)/60 = ____

E21 = (20 × 6)/60 = ____, E22 = (20 × 3)/60 = ____, E23 = (20 × 6)/60 = ____, E24 = (20 × 18)/60 = ____

E25 = (20 × 9)/60 = ____, E26 = (20 × 9)/60 = ____, E27 = (20 × 9)/60 = ____

Is it valid to use the chi-square test with the original table?

E11 = (40 × 6)/60 = 4, E12 = (40 × 3)/60 = 2, E13 = (40 × 6)/60 = 4, E14 = (40 × 18)/60 = 12

E15 = (40 × 9)/60 = 6, E16 = (40 × 9)/60 = 6, E17 = (40 × 9)/60 = 6

E21 = (20 × 6)/60 = 2, E22 = (20 × 3)/60 = 1, E23 = (20 × 6)/60 = 2, E24 = (20 × 18)/60 = 6

E25 = (20 × 9)/60 = 3, E26 = (20 × 9)/60 = 3, E27 = (20 × 9)/60 = 3

Writing compactly, the expected frequencies (Eij's) are:

Like to watch professional sports on TV

| Gender | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Male | 4 | 2 | 4 | 12 | 6 | 6 | 6 |
| Female | 2 | 1 | 2 | 6 | 3 | 3 | 3 |

Items to check:

• Is Eij > 1 in all cells?
• If yes, then is Eij ≥ 5 in 80% or more cells?

Note:

• If either condition fails, you have to combine columns to get a valid test.
• Since there are only two rows, you cannot combine rows here.
• If you had more than two rows, you could have combined rows. (Note that the final table must have at least two rows and at least two columns.)
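The two validity conditions can be checked mechanically. The following is a minimal sketch applied to the expected frequencies of Example 13.3; the function name `validity_report` is hypothetical.

```python
def validity_report(expected):
    """Check the two validity conditions on a table of expected frequencies.

    Returns (all cells > 1, share of cells >= 5, overall validity).
    """
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)
    share_at_least_five = sum(e >= 5 for e in cells) / len(cells)
    is_valid = all_above_one and share_at_least_five >= 0.80
    return all_above_one, share_at_least_five, is_valid


# Expected frequencies for the original 2 x 7 table of Example 13.3
expected = [[4, 2, 4, 12, 6, 6, 6],    # male
            [2, 1, 2, 6, 3, 3, 3]]     # female

all_above_one, share, is_valid = validity_report(expected)
print(f"E > 1 in all cells: {all_above_one}")
print(f"Share of cells with E >= 5: {share:.0%}")
print(f"Chi-square test valid on this table: {is_valid}")
```

Here both conditions fail (one cell has E = 1, and only 5 of the 14 cells reach 5), which is why columns need to be combined before testing.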

Old Table of Observed Frequencies (Oij's):

Like to watch professional sports on TV

| Gender | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Male | 2 | 0 | 4 | 12 | 6 | 8 | 8 |
| Female | 4 | 3 | 2 | 6 | 3 | 1 | 1 |

Old Table of Expected Frequencies (Eij's):

Like to watch professional sports on TV

| Gender | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Male | 4 | 2 | 4 | 12 | 6 | 6 | 6 |
| Female | 2 | 1 | 2 | 6 | 3 | 3 | 3 |

New Table: Observed and Expected Frequencies:

Degrees of freedom = (____ − 1) × (____ − 1) = ____

χ² = ____

Decision Rule: At a 99% level of confidence, reject H0 if χ² > ____

Conclusion:
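One possible modification of the Example 13.3 table is sketched below. The grouping chosen here (ratings 1-3, rating 4, ratings 5-7) is only an assumption for illustration, since any valid combination of adjacent columns is acceptable; the critical value 9.210 (α = .01, 2 degrees of freedom) is taken from a standard chi-square table, and the helper names are hypothetical.

```python
def combine_columns(table, groups):
    """Add up the columns listed in each group (0-based column indices)."""
    return [[sum(row[j] for j in group) for group in groups] for row in table]


def expected_frequencies(observed):
    """E_ij = row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[2, 0, 4, 12, 6, 8, 8],    # male
            [4, 3, 2, 6, 3, 1, 1]]     # female

# Assumed grouping: ratings 1-3, rating 4, ratings 5-7
groups = [(0, 1, 2), (3,), (4, 5, 6)]
new_observed = combine_columns(observed, groups)
new_expected = expected_frequencies(new_observed)

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(new_observed, new_expected)
           for o, e in zip(o_row, e_row))
df = (len(new_observed) - 1) * (len(groups) - 1)
CHI2_01_DF2 = 9.210                    # assumption: table value, alpha = .01, df = 2

print("Observed:", new_observed)
print("Expected:", new_expected)
print(f"chi-square = {chi2:.3f}, df = {df}")
print("Reject H0" if chi2 > CHI2_01_DF2 else "Cannot reject H0")
```

With this particular grouping every expected frequency is at least 5, so the test is valid, and the statistic stays below the critical value. A different valid grouping would give a different statistic and possibly different degrees of freedom.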

Problem 6, sample test: Suppose a researcher has selected a simple random sample of 50 Syracuse University students, and asked them to rate how satisfied they are with the Syracuse University Parking Services on a 1-5 scale (1 → very dissatisfied, 3 → neutral, 5 → very satisfied). The researcher also noted whether the respondent has a vehicle. The results are summarized as the following cross-tab:

Satisfaction

| Vehicle Ownership | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Has vehicle | 9 | 9 | 6 | 4 | 2 |
| Doesn't have vehicle | 1 | 6 | 9 | 2 | 2 |

Suppose you want to test, at a 95% level of confidence, the null hypothesis that there is no association between vehicle ownership and satisfaction level with the Syracuse University parking services. In the present case, is it valid to use the chi-square test with the original cross-tab? Clearly state yes or no. If no, modify the original cross-tab so that a chi-square test becomes valid. Using the original or modified cross-tab as appropriate, perform a chi-square test of the null hypothesis at a 95% level of confidence.

Satisfaction

| Vehicle Ownership | 1 | 2 | 3 | 4 | 5 | Row Totals |
|---|---|---|---|---|---|---|
| Has vehicle | 9 | 9 | 6 | 4 | 2 | 30 |
| Doesn't have vehicle | 1 | 6 | 9 | 2 | 2 | 20 |
| Column Totals | 10 | 15 | 15 | 6 | 4 | n = 50 |

E11 = (____ × ____)/____ = ____, E12 = (____ × ____)/____ = ____, E13 = (____ × ____)/____ = ____

E14 = (____ × ____)/____ = ____, E15 = (____ × ____)/____ = ____

E21 = (____ × ____)/____ = ____, E22 = (____ × ____)/____ = ____, E23 = (____ × ____)/____ = ____

E24 = (____ × ____)/____ = ____, E25 = (____ × ____)/____ = ____

Items to check:

• Is Eij > 1 in all cells?
• If yes, then is Eij ≥ 5 in 80% or more cells, that is, in at least 8 out of the 10 cells?

Note:

• If either condition fails, you have to combine columns to get a valid test.
• Since there are only two rows, you cannot combine rows here.
• If you had more than two rows, you could have combined rows. (Note that the final table must have at least two rows and at least two columns.)

Table of observed frequencies (Oij's):

Satisfaction

| Vehicle Ownership | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Has vehicle | 9 | 9 | 6 | 4 | 2 |
| Doesn't have vehicle | 1 | 6 | 9 | 2 | 2 |

Table of expected frequencies (Eij's):

Satisfaction

| Vehicle Ownership | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Has vehicle | | | | | |
| Doesn't have vehicle | | | | | |

New Table: Observed and Expected Frequencies:

Degrees of freedom = (____ − 1) × (____ − 1) = ____

χ² = ____

Decision Rule: At a 95% level of confidence, reject H0 if χ² > ____

Conclusion:

An Important Application of Chi-Square Test

Example 1: Suppose you have collected two separate simple random samples from the male and female students of a college campus. Results:

(1) Male sample: n1 = 100, 70 watch sports on TV every week.
(2) Female sample: n2 = 200, 80 watch sports on TV every week.

At a 99% level of confidence, test the null hypothesis that an equal proportion of male and female students watch sports on TV.

Approach: Express as a cross-tabulation:

| | Do not Watch | Watch |
|---|---|---|
| Male | 30 | 70 |
| Female | 120 | 80 |

Use the chi-square test to test the null hypothesis that there is no relationship between gender and watching sports on TV.

Logic:

(1) Male sample: n1 = 100, 70 watch sports on TV every week.
(2) Female sample: n2 = 200, 80 watch sports on TV every week.

Expressed as a cross-tabulation:

| | Do not Watch | Watch |
|---|---|---|
| Male | 30 | 70 |
| Female | 120 | 80 |

Let:

• π11 = fraction of men who do not watch
• π12 = fraction of men who watch
• π21 = fraction of women who do not watch
• π22 = fraction of women who watch

Clearly, π11 + π12 = 1 and π21 + π22 = 1, that is, π11 = 1 − π12 and π21 = 1 − π22. Therefore, if π12 = π22, we also have π11 = π21.

Hence, the chi-square test here is equivalent to testing H0 : π1 ≡ π12 = π2 ≡ π22 (the same proportion of men and women watch) against Ha : π1 ≠ π2.

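Example 1 (continued):

| | Do not Watch | Watch | Row Totals |
|---|---|---|---|
| Male | 30 | 70 | |
| Female | 120 | 80 | |
| Column Totals | | | n = 300 |

E11 = (____ × ____)/____ = ____, E12 = (____ × ____)/____ = ____

E21 = (____ × ____)/____ = ____, E22 = (____ × ____)/____ = ____

$$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} = \underline{\qquad}$$

Decision Rule: At a 99% level of confidence, reject H0 if $\chi^2 > \chi^2_{.01}$ at df = (2 − 1) × (2 − 1) = 1.

Conclusion: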

Example 2: Suppose you have collected simple random samples from three sub-populations: Business majors, Engineering majors, and "other" majors. For each respondent, you recorded if (s)he reads the Wall Street Journal every week. Results:

(1) Business sample: n1 = 100, 50 read WSJ every week.
(2) Engineering sample: n2 = 50, 15 read WSJ every week.
(3) "Other" sample: n3 = 150, 25 read WSJ every week.

At a 99% level of confidence, test the null hypothesis that an equal proportion of business, engineering, and other students read WSJ every week.

Approach: Express as a cross-tabulation:

| | Do not Read | Read |
|---|---|---|
| Business | 50 | 50 |
| Engineering | 35 | 15 |
| Other | 125 | 25 |

Use the chi-square test to test the null hypothesis that there is no relationship between major and reading WSJ every week.

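| | Do not Read | Read | Row Totals |
|---|---|---|---|
| Business | 50 | 50 | |
| Engineering | 35 | 15 | |
| Other | 125 | 25 | |
| Column Totals | | | Sample Size = 300 |

E11 = (____ × ____)/____ = ____, E12 = (____ × ____)/____ = ____

E21 = (____ × ____)/____ = ____, E22 = (____ × ____)/____ = ____

E31 = (____ × ____)/____ = ____, E32 = (____ × ____)/____ = ____

$$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} + \frac{(O_{31} - E_{31})^2}{E_{31}} + \frac{(O_{32} - E_{32})^2}{E_{32}} = \underline{\qquad}$$

Decision Rule: At a 99% level of confidence, reject H0 if $\chi^2 > \chi^2_{.01}$ at df = (3 − 1) × (2 − 1) = 2.

Conclusion: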

More Generally:

• Suppose you are testing if an equal proportion of k sub-populations have a property of interest (e.g., read the Wall Street Journal every week), that is, π1 = π2 = . . . = πk.
• This is equivalent to a chi-square test with a k × 2 cross-tabulation where each row comes from one sub-population, and the two columns are "do not have property" and "have property."
• Express the data as a k × 2 cross-tabulation. For any sub-population: Number who do not have property = Size of the sample from the sub-population − Number from the sub-population who have property.
• Assuming the test is valid, reject H0 if χ² exceeds $\chi^2_\alpha$ at (k − 1) × (2 − 1) = k − 1 degrees of freedom.
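The k × 2 recipe above can be sketched as follows, applied to Example 1 (k = 2) and Example 2 (k = 3). The helper names are hypothetical, and the critical values 6.635 (df = 1) and 9.210 (df = 2) at α = .01 are assumptions taken from a standard chi-square table.

```python
def proportions_table(samples):
    """Build a k x 2 table [do not have property, have property].

    samples: list of (sample size, number with the property) per sub-population.
    """
    return [[n - x, x] for n, x in samples]


def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = sum((o - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(observed, row_totals)
               for o, c in zip(row, col_totals))
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)


# Example 1: male (100, 70 watch), female (200, 80 watch)
stat1, df1 = chi_square(proportions_table([(100, 70), (200, 80)]))

# Example 2: business (100, 50 read), engineering (50, 15), other (150, 25)
stat2, df2 = chi_square(proportions_table([(100, 50), (50, 15), (150, 25)]))

CRITICAL_01 = {1: 6.635, 2: 9.210}    # assumption: chi-square table values

for name, stat, df in [("Example 1", stat1, df1), ("Example 2", stat2, df2)]:
    decision = "reject H0" if stat > CRITICAL_01[df] else "cannot reject H0"
    print(f"{name}: chi-square = {stat:.3f}, df = {df} -> {decision}")
```

In both examples all expected frequencies are well above 5, so the validity conditions hold, and under these assumptions the hypothesis of equal proportions is rejected in each case.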

# Marketing Research Session 12

Coverage of Session 12: Regression Analysis (Chapter 14)

• Meaning of model
• R² and F-test

Regression Analysis

Basic Model Form: Y = β0 + β1 X1 + β2 X2 + . . . + βm Xm + ϵ

The β's are the same for all cases. These are the regression parameters. We are estimating (m + 1) parameters here.

Y is the dependent variable. It is a quantitative variable. Strictly speaking, Y should have at least interval scale properties.

The X's are the independent variables. We consider three types of independent variables:

• Quantitative variable with at least an interval scale property.
• Dummy variable.
• Product of a dummy variable and an interval-scaled variable.

Meaning of Parameters in a Special Case: If all the independent variables are interval-scaled variables, then βi is how much Y changes on the average as Xi increases by one unit, keeping all the other X's fixed.

Example of Standard Regression Model

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + ϵ

• Y = per capita sales of brand X in a sales territory (in dollars)
• X1 = per capita advertising expenditure by brand X in the territory (in dollars)
• X2 = per capita personal selling expenditure by brand X in the territory (in dollars)
• X3 = per capita sales promotion expenditure by brand X in the territory (in dollars)
• X4 = price/unit of brand X (in dollars)

Then: E(Y | X1, X2, X3, X4) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4, with m = 4.

• β1 = on the average, how much Y changes if X1 changes by one unit, but X2, X3, and X4 are held constant.
• β2 = on the average, how much Y changes if X2 changes by one unit, but X1, X3, and X4 are held constant.
• β3 = on the average, how much Y changes if X3 changes by one unit, but X1, X2, and X4 are held constant.
• β4 = on the average, how much Y changes if X4 changes by one unit, but X1, X2, and X3 are held constant.

Examples from Section 14.2

Context:

• Y = starting annual salary of a student who got an undergraduate business degree in 2003.
• D = 1 if the student graduated from a top 20 school, 0 if not.
• X = cumulative grade point average.

Model 14.2.1. Naive Model: Y = β0 + ϵ

• In this model, E(Y) is the same for all students regardless of D or X.
• The estimate of the coefficient β0 is $\bar{Y}$, the sample mean.

Model 14.2.2: Y = β0 + β1 D + ϵ

Writing separately for the two categories of D:

(1) D = 0 (not top 20): Y = β0 + ϵ.
(2) D = 1 (top 20): Y = (β0 + β1) + ϵ.

Note:

• For the sub-population of graduates from top 20 schools: Average Y = β0 + β1.
• For other graduates: Average Y = β0.
• β1 = Average salary of top 20 graduates − Average salary of other graduates.
• For either category of students, a change in X has no marginal effect on salary.

Model 14.2.3: Y = β0 + β2 X + ϵ

Hence: E(Y | X) = β0 + β2 X

[Figure: E(Y | X) plotted against X; a single line with intercept β0 and slope β2]

Model 14.2.3 (continued)

E(Y | X) = β0 + β2 X

Note:

• Intuitively, we are dividing the population into sub-populations, where all graduates in a given sub-population have the same X.
• For a given sub-population, Average Y = β0 + β2 X.
• β0 is the intercept, and β2 is the slope of the regression line. β2 is also called the marginal effect of X on Y.

Model 14.2.4: Y = β0 + β1 D + β2 X + ϵ   (14.6)

Writing separately for the two categories:

• Top 20 graduate (D = 1): Y = (β0 + β1) + β2 X + ϵ
• Other graduate (D = 0): Y = β0 + β2 X + ϵ

[Figure: E(Y | X) plotted against X; two lines with the same slope β2, with intercept β0 + β1 for D = 1 and intercept β0 for D = 0]

14.2.4 (continued):

• Top 20 graduate (D = 1): Y = (β0 + β1) + β2 X + ϵ
• Other graduate (D = 0): Y = β0 + β2 X + ϵ

Note:

• We are allowing the regression line to be different for the two categories of graduates.
• Both regression lines have the same slope β2.
• The intercepts may be different for the two lines.

Model 14.2.5: Y = β0 + β1 D + β2 X + β3 D X + ϵ

Writing separately for the two categories:

• Top 20 graduate (D = 1, D X = X): Y = (β0 + β1) + (β2 + β3) X + ϵ
• Other graduate (D = 0, D X = 0): Y = β0 + β2 X + ϵ

[Figure: E(Y | X) plotted against X; the D = 1 line has intercept β0 + β1 and slope β2 + β3, the D = 0 line has intercept β0 and slope β2]

14.2.5 (continued):

• Top 20 graduate (D = 1, D X = X): Y = (β0 + β1) + (β2 + β3) X + ϵ
• Other graduate (D = 0, D X = 0): Y = β0 + β2 X + ϵ

Note:

• We are allowing the regression line to be different for the two categories of graduates.
• The intercepts may be different for the two lines. If β1 = 0, then the intercepts are equal.
• The slopes may be different for the two lines. If β3 = 0, the slopes are equal.

Meaning of Parameters in More General Cases

Context:

• Y = starting annual salary of a student who got an undergraduate business degree in 2006.
• Job Types: Accounting, Finance, Marketing, Other
• D1 = 1 if Accounting, 0 if Finance, Marketing, or Other
• D2 = 1 if Finance, 0 if Accounting, Marketing, or Other
• D3 = 1 if Marketing, 0 if Accounting, Finance, or Other
• X = GPA on a 1-4 scale

Model 1: Y = β0 + β1 X + ϵ

E(Y | X) = β0 + β1 X

[Figure: E(Y | X) plotted against X; a single line with intercept β0]

Model 2: Y = β0 + β1 D1 + β2 D2 + β3 D3 + ϵ

[Figure: E(Y | X) plotted against X, with intercepts labeled β0, β0 + β1, β0 + β2, and β0 + β3]

Key: Write the model down separately for the four job types.

Accounting:

Finance:

Marketing:

Other:

What is the meaning of:

(i) β3 = 0

(ii) β1 = β2

(iii) β1 = β2 = β3

(iv) β1 = β2 = β3 = 0

Model 3: Y = β0 + β1 D1 + β2 D2 + β3 D3 + β4 X + ϵ

[Figure: E(Y | X) plotted against X; four lines with intercepts labeled β0, β0 + β1, β0 + β2, and β0 + β3]

Note: The slope is the same (β4) for all four lines.

Key: Write the model down separately for the four job types.

Accounting:

Finance:

Marketing:

Other:

Questions:

(1) What do the lines become if β4 = 0?

(2) What do the lines become if β1 = β2 = β3?

(3) What do the lines become if β1 = β2 = β3 = 0?

Model 4: Y = β0 + β1 D1 + β2 D2 + β3 D3 + β4 X + β5 D1 X + β6 D2 X + β7 D3 X + ϵ

[Figure: E(Y | X) plotted against X; four lines with intercepts labeled β0, β0 + β1, β0 + β2, and β0 + β3]

Note: The slope may be different for the four lines.

Accounting:

Finance:

Marketing:

Other:

Questions:

(1) What do the lines become if β5 = β6 = β7 = 0?

(2) What do the lines become if β1 = β2 = β3 = β5 = β6 = β7 = 0?

(3) What do the lines become if β4 = β5 = β6 = β7 = 0?

Accounting: Y = (β0 + β1) + (β4 + β5) X + ϵ

Finance: Y = (β0 + β2) + (β4 + β6) X + ϵ

Marketing: Y = (β0 + β3) + (β4 + β7) X + ϵ

Other: Y = β0 + β4 X + ϵ

(1) β5 = β6 = β7 = 0:

(2) β1 = β2 = β3 = β5 = β6 = β7 = 0:

(3) β4 = β5 = β6 = β7 = 0:
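A small sketch can make the four Model 4 lines concrete by computing each job type's intercept and slope from the eight coefficients. All numeric coefficient values below are hypothetical, chosen only for illustration, and `lines_by_job_type` is a made-up helper name.

```python
def lines_by_job_type(beta):
    """Map each job type to (intercept, slope) under Model 4.

    beta: sequence of eight coefficients (beta0, ..., beta7).
    """
    b0, b1, b2, b3, b4, b5, b6, b7 = beta
    return {
        "Accounting": (b0 + b1, b4 + b5),
        "Finance": (b0 + b2, b4 + b6),
        "Marketing": (b0 + b3, b4 + b7),
        "Other": (b0, b4),
    }


# Hypothetical coefficients (not estimates from any data set)
general = lines_by_job_type([30000, 5000, 4000, 2000, 3000, 500, -200, 0])

# Same hypothetical intercept terms with beta5 = beta6 = beta7 = 0
no_interactions = lines_by_job_type([30000, 5000, 4000, 2000, 3000, 0, 0, 0])

for job, (intercept, slope) in general.items():
    print(f"{job}: intercept = {intercept}, slope = {slope}")

# With the interaction terms set to 0, every line has slope beta4
print({job: slope for job, (_, slope) in no_interactions.items()})
```

Setting the three interaction coefficients to zero leaves four lines that share the slope β4 and differ only in their intercepts, which is the form of Model 3.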

Y = β0 + β1 D1 + β2 D2 + β3 D3 + β4 X + β5 D1 X + β6 D2 X + β7 D3 X + ϵ

[Figure: E(Y|X) plotted against X. Four lines with intercepts β0, β0 + β1, β0 + β2, and β0 + β3, drawn with different slopes.]

Accounting: Y = (β0 + β1) + (β4 + β5)X + ϵ
Finance: Y = (β0 + β2) + (β4 + β6)X + ϵ
Marketing: Y = (β0 + β3) + (β4 + β7)X + ϵ
Other: Y = β0 + β4 X + ϵ

State the following in terms of regression parameters:
(1) The marginal effect of GPA on salary is the same for Accounting and Finance jobs.

(2) GPA has no marginal effect on salary for Marketing jobs.
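One way to approach both statements is to read the marginal effect of X off each line as its slope. The sketch below assumes X is GPA and Y is salary, as the wording of the two statements suggests.

```latex
\begin{align*}
\text{(1)}\quad & \beta_4 + \beta_5 = \beta_4 + \beta_6 \iff \beta_5 = \beta_6 \\
\text{(2)}\quad & \beta_4 + \beta_7 = 0
\end{align*}
```

Note that (2) is a restriction on the sum β4 + β7, not on β7 alone, because β4 is the slope shared with the baseline group.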


R² and F Test

n = sample size
m + 1 = number of regression parameters (β0, β1, ..., βm)

$$R^2 = 1 - \frac{\sum_j (Y_j - \hat{Y}_j)^2}{\sum_j (Y_j - \bar{Y})^2}$$

Important:
• For any regression model that includes β0, 0 ≤ R² ≤ 1.
• For the naive model Y = β0 + ϵ, the estimate of β0 is Ȳ (the sample average of Y), and R² = 0.
• If we add another independent variable, R² cannot decrease.

F Test:
H0: β1 = ... = βk = 0 (can be any k of β1, ..., βm).
Ha: At least one of the β's listed in H0 is not 0.

Full regression: Y against X1, ..., Xm.
Restricted regression: Y against the variables that remain after dropping those whose coefficients are 0 under H0.

$$F = \left(\frac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}}\right) \times \left(\frac{n - m - 1}{k}\right)$$

At a (1 − α) level of confidence, reject H0 if F > Fα at degrees of freedom (k, n − m − 1).

Note: If H0: β1 = ... = βm = 0, then k = m and R² of the restricted regression is 0.
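The F statistic above is simple enough to compute directly. The following is a minimal sketch, not part of the course material: the function name `f_statistic` and the numbers in the usage line are hypothetical. The critical value Fα still has to be looked up in an F table.

```python
def f_statistic(r2_full, r2_restricted, n, m, k):
    """F statistic for H0 that k of the m slope coefficients are 0.

    r2_full, r2_restricted: R-squared of the full and restricted regressions
    n: sample size; m: number of slope parameters in the full regression
    k: number of coefficients set to 0 under H0
    """
    if not 0 <= r2_restricted <= r2_full < 1:
        raise ValueError("need 0 <= R2_restricted <= R2_full < 1")
    df_denominator = n - m - 1
    if k < 1 or k > m or df_denominator < 1:
        raise ValueError("need 1 <= k <= m and n - m - 1 >= 1")
    return (r2_full - r2_restricted) / (1 - r2_full) * df_denominator / k


# Hypothetical numbers: overall test with k = m = 2, so R2_restricted = 0.
f_all = f_statistic(r2_full=0.5, r2_restricted=0.0, n=23, m=2, k=2)
print(round(f_all, 2))  # 10.0
```

The resulting value is compared with Fα at (k, n − m − 1) degrees of freedom; H0 is rejected when the statistic is larger.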


Example of Standard Regression Model

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + ϵ

Y = per capita sales of brand X in a sales territory (in dollars)
X1 = per capita advertising expenditure by brand X in the territory (in dollars)
X2 = per capita personal selling expenditure by brand X in the territory (in dollars)
X3 = per capita sales promotion expenditure by brand X in the territory (in dollars)
X4 = price/unit of brand X (in dollars)

Then: E(Y | X1, X2, X3, X4) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4, with m = 4.

β1 = on average, how much Y changes if X1 changes by one unit while X2, X3, and X4 are held constant.
β2 = on average, how much Y changes if X2 changes by one unit while X1, X3, and X4 are held constant.
β3 = on average, how much Y changes if X3 changes by one unit while X1, X2, and X4 are held constant.
β4 = on average, how much Y changes if X4 changes by one unit while X1, X2, and X3 are held constant.


Example of Standard Regression Model

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + ϵ, n = 45

Model   Dependent Variable   Independent Variable(s)   R²
1       Y                    X1, X2, X3, X4            0.60
2       Y                    X1, X2, X3                0.55
3       Y                    X2, X3, X4                0.50
4       Y                    X1, X3, X4                0.54
5       Y                    X1, X2, X4                0.51
6       Y                    X1, X2                    0.48
7       Y                    X3, X4                    0.45

At a 99% level of confidence, test H0: β3 = β4 = 0.

k =        , m =        , n =        , n − m − 1 =
Full model: Model
Restricted model: Model
R² (full) =        , R² (restricted) =

$$F = \left(\frac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}}\right) \times \left(\frac{n - m - 1}{k}\right) = $$

Decision Rule: At a 99% level of confidence, reject H0 if F > Fα(k, n − m − 1) =

Conclusion:
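One way to fill in this worksheet is sketched below. Under H0: β3 = β4 = 0 the restricted regression keeps only X1 and X2, which is Model 6 in the table, while the full regression is Model 1. The critical value is quoted from a standard F table and should be checked against the table used in the course.

```latex
\begin{align*}
k &= 2, \quad m = 4, \quad n = 45, \quad n - m - 1 = 40 \\
R^2_{\text{full}} &= 0.60 \ (\text{Model 1}), \qquad R^2_{\text{restricted}} = 0.48 \ (\text{Model 6}) \\
F &= \frac{0.60 - 0.48}{1 - 0.60} \times \frac{40}{2} = 0.3 \times 20 = 6.0 \\
F_{.01}(2, 40) &\approx 5.18
\end{align*}
```

Since 6.0 > 5.18, this reading leads to rejecting H0 at the 99% level of confidence: at least one of β3 and β4 is not 0.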


Example of Standard Regression Model (continued)

Same model, sample (n = 45), and seven regressions as in the preceding table.

At a 99% level of confidence, test H0: β1 = β2 = β3 = β4 = 0.

k =        , m =        , n =        , n − m − 1 =
Full model: Model
Restricted model: Model
R² (full) =        , R² (restricted) =

$$F = \left(\frac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}}\right) \times \left(\frac{n - m - 1}{k}\right) = $$

Decision Rule: At a 99% level of confidence, reject H0 if F > Fα(k, n − m − 1) =

Conclusion:
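A sketch of this overall test: because H0 sets every slope to 0, the restricted regression is the naive model Y = β0 + ϵ, whose R² is 0, so none of the seven tabulated models is the restricted one. The critical value is quoted from a standard F table and should be checked against the table used in the course.

```latex
\begin{align*}
k &= m = 4, \quad n = 45, \quad n - m - 1 = 40 \\
R^2_{\text{full}} &= 0.60 \ (\text{Model 1}), \qquad R^2_{\text{restricted}} = 0 \\
F &= \frac{0.60 - 0}{1 - 0.60} \times \frac{40}{4} = 1.5 \times 10 = 15 \\
F_{.01}(4, 40) &\approx 3.83
\end{align*}
```

Since 15 > 3.83, this reading leads to rejecting H0 at the 99% level of confidence.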


Sample Test, Problem 7

Problem Scenario: A Likert summated scale designed to measure how much an adult US citizen likes President Bush was administered to a random sample of US voters. The following regression model was used to analyze the data:

Y = β0 + β1 D1 + β2 D2 + β3 X + β4 D1 X + β5 D2 X + ϵ,

where:
Y = attitude score on the Likert scale;
D1 = 1 if the citizen is a registered Democrat, and 0 if not;
D2 = 1 if the citizen is a registered Republican, and 0 if not;
X = annual family income of the citizen;
ϵ is the random error defined as usual.

7.(b) (2+2=4 pt) Write each of the following two hypotheses in terms of model parameters.

7.(b)(i) If a citizen is a registered Democrat, then annual family income has no marginal effect on attitude score.

7.(b)(ii) The marginal effect of annual family income on attitude score is the same for registered Republicans and "other" citizens.


Y = β0 + β1 D1 + β2 D2 + β3 X + β4 D1 X + β5 D2 X + ϵ, where Y = attitude score on the Likert scale.

Registered Democrat:   D1 = 1, D2 = 0, D1 X = X, D2 X = 0;   Y = (β0 + β1) + (β3 + β4)X + ϵ
Registered Republican: D1 = 0, D2 = 1, D1 X = 0, D2 X = X;   Y = (β0 + β2) + (β3 + β5)X + ϵ
Other:                 D1 = 0, D2 = 0, D1 X = 0, D2 X = 0;   Y = β0 + β3 X + ϵ

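A sketch of how the two hypotheses in 7.(b) can be read off the group-specific lines: the marginal effect of income X is the slope of each line, namely β3 + β4 for registered Democrats, β3 + β5 for registered Republicans, and β3 for "other" citizens.

```latex
\begin{align*}
\text{7.(b)(i)}\quad  & H_0: \beta_3 + \beta_4 = 0 \\
\text{7.(b)(ii)}\quad & H_0: \beta_3 + \beta_5 = \beta_3 \iff \beta_5 = 0
\end{align*}
```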


From Section 14.7: Consider the regression model:

Y = β0 + β1 D1 + β2 D2 + β3 X1 + β4 X2 + β5 D1 X1 + β6 D2 X1 + β7 D1 X2 + β8 D2 X2 + ϵ,

where:
Y = sales of a brand in a sales territory (unit = \$100,000) during Fall 2003;
X1 = the number of salespeople in the territory;
X2 = the retail price (in dollars) in the territory;
D1 and D2 are dummy variables for the level of advertising in the territory. The advertising level can be low, medium, or high.
D1 = 1 if the advertising level is medium, and D1 = 0 otherwise;
D2 = 1 if the advertising level is high, and D2 = 0 otherwise;
ϵ is defined as usual.

Regression model for the three advertising levels:
Low Advertising:
Medium Advertising:
High Advertising:


2. State each of the following null hypotheses in terms of the parameters of the regression model (e.g., H0: β1 = 0):

2.(a) The marginal effect of price on sales is the same for medium and high advertising levels.

2.(b) Changes in price do not affect sales if the level of advertising is low.
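Substituting the dummy values for each advertising level gives an intercept, a salespeople slope, and a price slope per level. The sketch below does that substitution; the function name `lines_by_advertising` and the coefficient values are hypothetical and serve only to illustrate the bookkeeping.

```python
def lines_by_advertising(beta):
    """Return {level: (intercept, salespeople slope, price slope)}.

    beta = [b0, b1, ..., b8] in the order used by the Section 14.7 model.
    """
    b0, b1, b2, b3, b4, b5, b6, b7, b8 = beta
    return {
        "low": (b0, b3, b4),                    # D1 = 0, D2 = 0
        "medium": (b0 + b1, b3 + b5, b4 + b7),  # D1 = 1, D2 = 0
        "high": (b0 + b2, b3 + b6, b4 + b8),    # D1 = 0, D2 = 1
    }


# Hypothetical coefficients, for illustration only (note b7 == b8 here).
beta = [10.0, 2.0, 5.0, 0.5, -1.0, 0.1, 0.3, -0.2, -0.2]
lines = lines_by_advertising(beta)

# The price slopes for medium and high advertising agree exactly when b7 == b8.
same_price_effect = lines["medium"][2] == lines["high"][2]
print(lines, same_price_effect)
```

Read this way, 2.(a) amounts to β4 + β7 = β4 + β8, that is H0: β7 = β8, and 2.(b) amounts to H0: β4 = 0, since β4 is the price slope at the low advertising level.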


3. Suppose we have estimated regression models using data from 49 territories, and got the following results:

Regression   Dependent Variable   Independent Variables                          R²
1            Y                    D1, D2, X1, X2, D1 X1, D2 X1, D1 X2, D2 X2     .75
2            Y                    D1, D2                                         .2
3            Y                    X1, X2                                         .2
4            Y                    D1, D2, X1, X2, D1 X1, D2 X1                   .6
5            Y                    D1, D2, X1, D1 X1, D2 X1                       .55
6            Y                    D1, D2, X2, D1 X2, D2 X2                       .5
7            Y                    D1, D2, D1 X1, D2 X1, D1 X2, D2 X2             .3
8            Y                    D1, D2, X1, X2, D1 X2, D2 X2                   .7

At a 99% level of confidence, test each of the following two null hypotheses:

3(a) The marginal effect of price on sales is the same for all levels of advertising.

H0:
Ha:
k =        , n =        , m =        , n − m − 1 =
R² (full) =        , R² (restricted) =

$$F = \left(\frac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}}\right) \times \left(\frac{n - m - 1}{k}\right) = $$

Decision Rule: At a 99% level of confidence, reject H0 if F >

Conclusion:
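One way to work 3(a) is sketched below. The price slope is β4 at the low level, β4 + β7 at the medium level, and β4 + β8 at the high level, so equal price effects at all levels means β7 = β8 = 0. The restricted regression then drops D1 X2 and D2 X2, which is assumed here to be Regression 4 with R² = .6 as read from the results table. The critical value is quoted from a standard F table and should be checked against the table used in the course.

```latex
\begin{align*}
H_0 &: \beta_7 = \beta_8 = 0, \qquad H_a: \text{at least one of } \beta_7, \beta_8 \text{ is not } 0 \\
k &= 2, \quad n = 49, \quad m = 8, \quad n - m - 1 = 40 \\
R^2_{\text{full}} &= 0.75 \ (\text{Regression 1}), \qquad R^2_{\text{restricted}} = 0.60 \ (\text{Regression 4}) \\
F &= \frac{0.75 - 0.60}{1 - 0.75} \times \frac{40}{2} = 0.6 \times 20 = 12 \\
F_{.01}(2, 40) &\approx 5.18
\end{align*}
```

Since 12 > 5.18, this reading leads to rejecting H0 at the 99% level of confidence.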


3(b) Having an additional salesperson does not have any effect on sales at any level of advertising. (Use the eight regression results tabulated for question 3.)

H0:
Ha:
k =        , n =        , m =        , n − m − 1 =
R² (full) =        , R² (restricted) =

$$F = \left(\frac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}}\right) \times \left(\frac{n - m - 1}{k}\right) = $$

Decision Rule: At a 99% level of confidence, reject H0 if F >

Conclusion:
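One way to work 3(b) is sketched below. The salespeople slope is β3 at the low level, β3 + β5 at the medium level, and β3 + β6 at the high level, so "no effect at any level" means β3 = β5 = β6 = 0. The restricted regression then drops X1, D1 X1, and D2 X1, which is assumed here to be Regression 6 with R² = .5 as read from the results table. The critical value is quoted from a standard F table and should be checked against the table used in the course.

```latex
\begin{align*}
H_0 &: \beta_3 = \beta_5 = \beta_6 = 0, \qquad H_a: \text{at least one of } \beta_3, \beta_5, \beta_6 \text{ is not } 0 \\
k &= 3, \quad n = 49, \quad m = 8, \quad n - m - 1 = 40 \\
R^2_{\text{full}} &= 0.75 \ (\text{Regression 1}), \qquad R^2_{\text{restricted}} = 0.50 \ (\text{Regression 6}) \\
F &= \frac{0.75 - 0.50}{1 - 0.75} \times \frac{40}{3} = 1 \times 13.33 \approx 13.33 \\
F_{.01}(3, 40) &\approx 4.31
\end{align*}
```

Since 13.33 > 4.31, this reading leads to rejecting H0 at the 99% level of confidence.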