Thursday, January 3, 2013

Numbers as Lies


    This is an attempt to look at some of the assessment issues that develop, especially in the academic world.

    The essay talks about bias, a term in statistics for systematic over- or under-prediction.  So, the claim that there is a negative bias against females in multiple choice tests means that they are better than the score indicates.  We talk about this with minorities as well.

    Anyone familiar with the statistical literature, however, knows better than the sort of nonsense found in Education and even Counseling journals, where, yes, Political Correctness intrudes.

    Let me explain with some (oh my God!) facts:  it is true that there is a slight negative bias against these groups.  Also, if measurement devices are to be used to determine any sort of reward, such as college admission (if that really is a reward), the best one to use would be the parents' income.  College success and family income are very highly correlated, certainly more highly than ACT or SAT scores.  Actually, I think that everybody should be allowed admission and that it should be free, but that is a point of view, not related to the thesis here.

    If we do want to consider bias, we certainly need to account for the fact that there is a 10% positive bias in favor of left-handers.  What sort of action should we take on that?  I can account for some of it as left-handers constantly need to adjust to a right-handed world and hence are more adaptable.  Still, should we not require a higher score before we admit left-handers?

    It is well known that a left-handed batter has an advantage over a right-handed hitter in baseball.  A popular reason for this is that the curve ball of a right-hander is easier to follow for a left-handed batter.  And that is true.  An additional, and I find more important, reason is that the bottom hand on a bat guides it and the top hand supplies the power.  With baseballs taking less than 2/5ths of a second to get to home plate, the batter has to have good hand-eye coordination.  In addition, the baseball will go far anyway if hit in the right spot.  Finally, a right-eye-dominant person has an advantage batting left-handed, as it is the eye facing the pitcher that focuses on the ball.  Additionally, someone with 20/20 eyesight (and this applies to all hitters) actually cannot focus on the baseball during the last 2 or 3 feet, as a rapid refocusing of the eyes is necessary.  To some degree, a certain amount of myopia is an advantage.

    So, anyway, a great deal of nonsense is told us about statistics by people who don't know a damn thing about the subject, but who managed to pass nine credit hours in the field.  I found it the easiest nine hours of A I ever got, but that was because the professors knew their subject.

    In Counseling or Therapy, let me assure you that ability is more a matter of instinct and honesty, mixed with empathy.  If you do not have that, all the numbers in the world will not do you or your patients a damn bit of good.   


Review of

    Bridgeman, Brent & Lewis, Charles.  1994.  The relationship of essay and multiple-choice scores with grades in college courses.  Journal of Educational Measurement, 31(1), 37-50.


Description
    Bridgeman and Lewis used data from 32 colleges in an attempt to show the relationship between essay-type examinations and multiple-choice examinations as predictors of freshman grade point averages.  On the whole, they come to the conclusion that multiple choice examinations are more reliable and accurate than essay questions, except for certain situations in which essay questions are just as reliable.
    The introduction compares the two types of examinations and some of the prior research, after giving some faint praise to essay examinations and pointing out that multiple-choice questions were gender biased against females (37-38).  The sample of 32 colleges and the various grade point averages, SAT scores, and Advanced Placement (AP) scores, along with self-reported high school grade point averages (HSGPA), are described (39).  The AP tests form the basis for the essay exam data, as well as some of the multiple choice data used in the analysis, and how they were graded is also described (39-41).  Next, there is a "Results and Discussion" section correlating the AP scores with college GPAs in the respective categories of History, Biology, and the English Language.  The SAT scores and HSGPA are also correlated with first-year scores in general and in the particular subjects.
    The article concludes with some suggestions for further study of essay-type examinations and what might be done if more weight were to be placed on them (48-49).  For example, reliability is a significant problem in essay-type examinations, and more such exams and more graders would help in this respect.  Also suggested is some sort of statistical adjustment for "systematic differences in the scoring standards" (49) of each grader.  Some suggestions for further study are also made, which will be discussed below.
   
Analysis
    For the purposes of this analysis, I would like to take the unpopular stance of being pro-essay exam.   From my own experience as a teacher with essay examinations, and from observations of other graders at a variety of types of colleges, I can say that subjecting them to the same sort of statistical analysis as one does with multiple choice examinations is a formidable task indeed.  Of most interest is the fact that the evaluation of these essays is remarkably consistent, and this suggests that essay-type responses on an intake questionnaire might be just as reliable and, depending on the inclinations and attitudes of the therapist, potentially more rewarding.
    Bridgeman and Lewis correctly point out that essay examinations assess skills that cannot be measured in the multiple choice format (37) and that they are primarily used to assess "... acquisition of skills and knowledge taught in specific college-level courses" (38) rather than to predict future success.  I feel that they also allow the student to demonstrate knowledge and skills acquired in the course with fewer constraints -- in other words, a multiple-choice format may not allow the student the opportunity to demonstrate knowledge of specific areas even if they are covered in that format.  The multiple choice format also precludes the client's ability to qualify answers.
    An interesting phenomenon that the authors point to is the gender bias in multiple choice formats.  Even in the sample used, the difference between males and females on multiple choice questions is about .3 standard deviation units, and this is a well-known phenomenon.  In this particular study, essay examinations, on the other hand, yielded only about a .02 difference.
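That .3-standard-deviation gap is a standardized mean difference (what statisticians call Cohen's d).  A minimal sketch of how such a figure is computed, using invented scores rather than the article's data:

```python
import statistics

# Hypothetical multiple-choice scores for two groups (invented for
# illustration; these are NOT the article's data).
male_scores = [60, 70, 75, 80, 85, 90, 100]
female_scores = [56, 66, 71, 76, 81, 86, 96]   # shifted 4 points lower

def cohens_d(a, b):
    """Standardized mean difference: (mean_a - mean_b) / pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

print(round(cohens_d(male_scores, female_scores), 2))   # prints 0.3
```

A 4-point raw gap against a pooled standard deviation of about 13 points yields the .3 figure; the same raw gap would look larger or smaller depending on how spread out the scores are.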
    The sample seems adequate, and the approach to standardizing the letter grades (using a 13-point scale) seems reasonable.  The one possible weakness in the basic data described here is that the HSGPA is self-reported.  There could well be a tendency for a certain number of students to under- or over-estimate this variable.  If we can assume a uniformity of bias on the part of all students and if this bias is uniformly distributed (it is possible that males and females would exhibit different patterns of self-reporting), this is not a significant problem.
    The authors realize, as noted above, that using essay examinations as a predictor is problematic, as those exams were used primarily as an assessment, but they do point to what they consider relatively high correlations as predictors in cases where the college grading is based on essay examinations.  Two examples are particularly interesting, the English and the history grades (44-45).  In both cases, the correlations between the essay examinations and the grade point average in those subjects are slightly higher than those pertaining to multiple choice examinations.  I believe that some study is warranted in determining whether similar correlations exist between open-ended questions and multiple-choice personality determiners.
    The so-called "Holland Game," where people simply report what sort of people they would like to be with, seems about as accurate as the self-directed test.  I would like to see a question added asking why the client would prefer to be with a certain group.  For example, my own "Investigative" and "Social" scores seem to be quite close together, and I generally would prefer to be with the "Social" types.  However, the game specifically mentions that this takes place at a party, and in that case I would prefer the "Investigative" types because a wider variety of topics for discussion would be available.  While it is only my own subjective opinion, I also think that "Enterprising" types consider themselves "Social."
    The fact that these subjects rely on essay type examinations with their own particular biases merely reinforces another possible suspicion.  The physical sciences, for example, which require a great deal of specific knowledge of the sort that is easily tested in the multiple choice format, eventually require basic communication skills on a higher level.  The authors point out that "essay examinations usually [assess] an in-depth understanding of a few content areas..." and that as a result of "...measurement error created by subjective scoring ... essay tests may be substantially less reliable than multiple-choice tests in the same general subject area" (37).  Any individual who wishes to advance in the physical and social sciences, whatever these areas may contend about their objectivity, will eventually need to present their investigations in writing, in essay form, to a highly subjective audience (peer review), and eventually not only present their findings in an understandable written form, but also do it in a convincing manner, either to publish or to obtain grants.  It may well be that the discussion focuses on undergraduates, but it has been my experience that even those who graduate with a B.A. or a B.S. have found their communication skills sorely untested.  I think this is a result of a political and social situation that makes computer-scored examinations more important than education (see Moore, 364-365).  The authors also point out in a different context, one in which they are explaining the high correlation between undergraduate grades in history and English, that "...this analysis suggests that the construct assessed by the course grades is closer to the construct assessed by the essay scores than it is to the construct assessed by the multiple-choice scores" (43).  In this case, the "construct" may well include communicating to someone who does not necessarily share the student's point of view.  For the counselor, it may well include providing additional insight and context, as well as a perception of the client's interpretation of the questions and attitude towards them.
    There are several interesting insights or facts that come to light during the article but which are not highlighted as central facets.  Most interesting is that reader reliability in essay scores was "only" .79.  Strictly speaking, .79 is a correlation between readings rather than a rate of identical grades, but it does mean that the same reader will assign very nearly the same grade to the same essay.  This is a remarkably high relationship based on my past experience, especially since the data are based on a 13 point scale (39).  To come that close to the same grade on the same essay on a 13 point scale seems to me to be almost superhuman.  I can see it on a 5 point scale, however.
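It is worth seeing concretely what a reliability coefficient does and does not say.  A hedged sketch (the ratings below are invented, not the article's data) shows that even two readers who correlate far above .79 rarely hand out the exact same score on a 13-point scale:

```python
import statistics

# Invented scores two readers gave the same ten essays on a 13-point
# scale (illustrative only; not the article's data).
reader_a = [3, 5, 6, 7, 8, 9, 10, 11, 12, 13]
reader_b = [4, 5, 7, 6, 9, 8, 10, 12, 11, 13]

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of the SDs."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

r = pearson_r(reader_a, reader_b)
agreement = sum(a == b for a, b in zip(reader_a, reader_b)) / len(reader_a)
print(round(r, 2), agreement)   # prints 0.96 0.3
```

Here the two readers correlate at about .96, yet give the identical score on only 3 of 10 essays; a correlation of .79 implies even looser agreement, which is why a .79 coefficient need not mean near-identical grading.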
    The conclusion section provides the most interesting topics for further discussion.  For example, the authors point to the assumption that "...college grades are themselves gender fair" when, in fact, they are biased towards females (see appendix).  Another potentially explosive statement is that "...the generally neater handwriting of women could be discounted as an explanation of their relatively strong performance on essay tests if the same size gender difference were observed with typed essays" (49).  I have heard of a survey which indicated that a typed paper earned an average of one letter grade higher than the same handwritten essay.  Rather than lamenting this phenomenon, it behooves us to understand it.  Would this particular review be more highly regarded if right-justified?  In one font rather than another?  In 10, 12, or 14 point?  I have heard that people are generally more impressed by right-justified papers but that they retain more from unjustified ones.  In other words, to what extent does presentation overwhelm content?
    The possibility of "...making statistical adjustments for systematic differences in the scoring standards of different raters (Braun, 1988)" is intriguing with respect to essay examinations.  Obviously, there are differences in standards between graders, and these differences may well extend to gender, style, attitude, focus, etc.  Within a given class, or even on a college-wide exam, when the grader knows the author of the exam, personal or even gender bias may intrude.  If a grader thinks that women should score higher on essay exams, I think it is reasonable to assume that they will.  The idea of having the exams typed is essential, but even more, I would suggest that the typing be done by a pool of typists with no relation to, or knowledge of, the grader or the author.  The exams should also be typed in the same format, including font and style.
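One crude way to picture such an adjustment (Braun's actual 1988 method is more sophisticated; the raters and scores below are invented for illustration) is to rescale each rater's grades onto a common mean and spread:

```python
import statistics

# Two invented raters who rank essays identically but apply different
# standards: one grades high across the board, the other low.
ratings = {
    "lenient rater": [9, 10, 11, 12, 13],
    "severe rater":  [3, 4, 5, 6, 7],
}

TARGET_MEAN, TARGET_SD = 7.0, 2.0   # common scale for all raters

def adjust(scores, t_mean=TARGET_MEAN, t_sd=TARGET_SD):
    """Standardize a rater's scores, then rescale to the common scale."""
    m, s = statistics.mean(scores), statistics.stdev(scores)
    return [t_mean + t_sd * (x - m) / s for x in scores]

for rater, scores in ratings.items():
    print(rater, [round(x, 2) for x in adjust(scores)])
```

After the adjustment, both raters' essays receive identical scores, because only their standards differed, not their rankings.  The trade-off is that any genuine difference in the quality of the essays each rater happened to grade is erased along with the difference in standards.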
    To conclude, I would say that this is an excellent article with some intriguing methodological issues, one which provides support for the validity of essay-type questionnaires.  The authors did their best to standardize the scores so that we were discussing similar problems.  The concern seems to be with the place of essay-type examinations in a world that is progressively multiple choice oriented, mainly for the convenience of scoring.  Could it be that counselors rely on multiple choice instruments for the same reasons?  The fact that we are talking about undergraduates, or candidates for undergraduate status, should in no way preclude us from realizing that whatever the discipline, communication skills are an integral factor in success, perhaps both for the client or student and for the admissions boards.
