A High School Principal Tells How One Great Teacher Was Wronged by Flawed Evaluation System

Guest Post by Dr. Carol Burris

Carol Corbett Burris has served as principal of South Side High School in the Rockville Centre School District in New York since 2000. Prior to becoming a principal, she was a teacher at both the middle and high school levels. She received her doctorate from Teachers College, Columbia University, and her dissertation, which studied her district’s detracking reform in math, received the 2003 National Association of Secondary School Principals Middle Level Dissertation of the Year Award. Dr. Burris has for some time been chronicling the consequences of standardized test-driven reform in her state, which you can read about (here, here, and here, for example).

Burris was named New York’s 2013 High School Principal of the Year by the School Administrators Association of New York State and the National Association of Secondary School Principals, and in 2010 she was named New York State Outstanding Educator by the School Administrators Association of New York State. She is the co-author of the New York Principals’ letter of concern about the evaluation of teachers by student test scores. It has been signed by more than 1,535 New York principals and more than 6,500 teachers, parents, professors, administrators and citizens. You can read the letter by clicking here.

In this new post, Burris tells the story of a New York state teacher who was just unfairly smacked by the state’s flawed new teacher and principal evaluation system, known as APPR, which in part uses student standardized test scores to evaluate educators. The method isn’t reliable or valid, as Burris shows here.

This article first appeared on the Answer Sheet.  The article is published with permission of Valerie Strauss and Carol C. Burris.

If you are a teacher or principal in Georgia, then you will find this article quite disturbing given that the Georgia legislature passed HB244, which will use a similar system to evaluate educators in Georgia.

How One Great Teacher Was Wronged by Flawed Evaluation System

By Carol Burris

Jenn is a teacher of middle-school students. Her school is in a small city district that has limited resources. The majority of kids in the school receive free or reduced-price lunch, and about 40% are black or Latino. Many are English language learners. Lots of them are homeless.

After learning that she was rated less than effective because of her students’ standardized test scores, she wrote to Diane Ravitch, who posted her letter on her blog. She wrote:

I’m actually questioning whether I can teach for the next 20 years. It’s all I’ve ever wanted to do, but this APPR garbage is effectively forcing out some of the best teachers I’ve worked with. I may be next.

I contacted Jenn to better understand her story. I encountered the kind of teacher I love to hire. She has never imagined herself as anything but a teacher—teaching is not a stepping stone to a career in law or business. She does all the extras. She comes in early and leaves late. She coaches. She understands that she must be a counselor, a nurse and a surrogate parent to her students—the most at-risk students in the seventh grade. Jenn is their support teacher for English Language Arts.

Growth Score diagram showing the three components on which teachers are judged. Source: NYSUT

She is valued by her principal, who gave her 58 out of 60 points on the measure of teaching behaviors—instruction, lesson plans, professional obligations, understanding of child development, communication with parents—all of the things that matter and that Jenn can truly control.

Student Test Scores

And then came the test score measures. The grade-level teachers and the principal had to create a local measure of student performance. They chose a group measure based on reading growth on a standardized test. They were required to set targets from a pre-test given in the winter to a post-test given in the spring. The targets were a guess on the part of the teachers and principal. How could they not be? The team was shooting in the dark—making predictions without any long-term data. Such measures can never be reliable or valid.

The state of Massachusetts requires that measures of student learning be piloted and that teachers be evaluated not by one set of scores, but rather by trends over time. That state’s evaluation model will not be fully implemented for several years because they are building it using phase-in and revision. But New York does not believe in research or caution. New York is the state where the powerful insist that teachers “perform,” as though they were trained circus seals. There is no time for a pilot in the Empire State. Our students and we must jump, as our chancellor advises, “into the deep end of the pool.” In New York, our commissioner warns that we can never let the perfect be the enemy of the good. We don’t even let nonsense be the enemy of the good. And so Jenn hoped that she and her colleagues made a reasonable gamble when they set those targets.

The Students’ Tests Don’t Count or Are Too Hard

Many of the seventh-grade students did not take the standardized reading test seriously. Middle schoolers are savvy—they knew the test didn’t count. So they quickly filled in the bubbles as teachers watched in horror. Luckily, enough students took their time so that their teachers were able to get 10/20 points on that local measure of learning, which put Jenn in the Effective range.

The final piece in her evaluation was her score from new Common Core-aligned tests that the state gave to students this past spring. The tests were far too difficult for Jenn’s Academic Intervention Services (AIS) students. They were too long. The reading passages were dense and many of the questions were confusing. We know that only about 1 in 5 students across the state, who are like the majority of Jenn’s students, scored proficient on the Common Core tests. Even more importantly, we know that about half of all students like Jenn’s scored in level 1—below Basic. These are the students who, overwhelmed by frustration, give up or guess. The test did not measure their learning—it measured noise.

It Doesn’t Add Up

So Jenn’s students’ scores, along with all the other seventh-grade scores on the Common Core tests, were put in a regression model, and the statisticians cranked the model, and they entered their covariates and set confidence levels and scratched their heads over Beta reports and did all kinds of things that make most folks’ eyes glaze over. And all of that cranking and computing finally spit out all of the teachers’ and principals’ places on the modified bell curve. Jenn got 5 points out of 20 on her state growth score, along with the label Developing.

When all of the points were added up, it did not matter that she received 58/60 points in the most important category of all, which is based on New York’s teaching standards. And it did not matter that she was Effective on the local measure of student learning. 5 + 10 + 58 = 73, which meant that Jenn was two points short of being an Effective teacher. Jenn was labeled Developing and put on a mandated improvement plan.
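The arithmetic of Jenn’s composite score can be sketched in a few lines. A minimal sketch, assuming the rating bands: only the 75-point floor for Effective is implied by the article (“two points short”); the other cutoffs (91 for Highly Effective, 65 for Developing) are illustrative assumptions, not confirmed by the source.

```python
# Sketch of the APPR composite described above.
# Components: state growth score (out of 20), local measure of
# student learning (out of 20), observation of teaching behaviors
# (out of 60), for a 100-point composite.

def appr_composite(state_growth, local_measure, observation):
    """Sum the three APPR components into a 100-point composite."""
    return state_growth + local_measure + observation

def appr_rating(total):
    # Only the 75-point Effective floor comes from the article;
    # the 91 and 65 cutoffs are assumed for illustration.
    if total >= 91:
        return "Highly Effective"
    if total >= 75:
        return "Effective"
    if total >= 65:
        return "Developing"
    return "Ineffective"

total = appr_composite(state_growth=5, local_measure=10, observation=58)
print(total, appr_rating(total))  # prints "73 Developing"
```

Note the lever this creates: a near-perfect 58/60 observation score cannot rescue a composite dragged down by two low-weight but volatile test-based components.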

This dedicated teacher of seven years feels humiliated. She knows that parents will find out and may lose confidence in her. She is angry because the label is unfair. She will be under scrutiny for a year. Time she would have spent on her students and her lessons will be wasted on meetings and improvement plan measures. The joy of teaching is gone. It has been replaced by discouragement and fear.

Her principal also knows it is not fair—she gave Jenn 58/60 points. Over time, however, she may begin to doubt her own judgment—the scores may influence how she rates teachers. After all, principals get a growth score too, and the teachers with low scores will become a threat to principals’ own job security over time. Those who created this system put Machiavelli to shame.

Flawed Model

Jenn is not alone. There are hundreds, if not thousands, of good teachers and principals across the state who are receiving poor ratings they do not deserve based on a flawed model and flawed tests. Slowly, stories will come out as they gain the courage to speak out. There will be others who suffer in silence, and still others who leave the profession in disgust. None of this is good for children.

During the July 2013 hearing of the Governor’s New Education Reform Commission, David Steiner, the previous New York State Commissioner of Education, said,

There is a risk, and I want to be honest about this, that very, very, mature, effective teachers are saying you are treating me like a kid. In the name of getting rid of the bottom 5 percent, we risk losing the top 5 percent….We do not want to infantilize the profession in order to save it.

Steiner directed those remarks to his former deputy, now state commissioner, John B. King. Did King understand what his former mentor was trying to tell him? Because he did not respond to Steiner’s observation, we do not know.

John King told districts to use caution when using this year’s scores to evaluate teachers and principals. He claimed that the tests did not negatively impact teachers’ accountability ratings. Perhaps he should ask Jenn if she agrees. We already know that 1 in 4 teachers and principals moved down at least one growth score category from last year — hardly the hallmark of a reliable system.

What Should Be Done

There is much that King and the Board of Regents can do. They can ask the governor to pass legislation so that the evaluations remain private. They can request that teachers like Jenn, who are more than effective in the eyes of their principals, be spared an improvement plan this year. I hold no hope, however, that John King will do that. He lives in “the fierce urgency of now.” But for Jenn and her students, now quickly becomes tomorrow. The risk that David Steiner explained is real. We need to make sure that we have our best teachers tomorrow and not lose them in the deep end of the pool.

What do you think? Could this be the fate of many of Georgia’s teachers? And if you are a principal, how does this affect you?

Georgia Department of Education Says Evaluation Plan Won’t Work But Will Implement it Anyway?

The Georgia Department of Education claims that the evaluation system it developed along federal guidelines needs to be modified. The department thinks one part will not work because it will put the state at risk of lawsuits by teachers.

When I first saw the headline in the Atlanta Journal-Constitution, I thought that maybe state officials had gotten the message written by more than 30 professors. The message, in the form of a letter sent to Georgia officials including the governor and state school superintendent, challenges the teacher and leader evaluation system, identifies its unintended negative consequences, and recommends that the state opt out of this unvalidated and unreliable system.

My excitement quickly faded when it turned out that the state wants to cut only one insignificant part of the teacher assessment system. It thinks children in grades K-2 are too young to evaluate their teachers. So it has requested that this part of the assessment be removed, but kept the rest of the system intact.

Here is what happened.

On July 2nd, the U.S. Department of Education (US ED) wrote to the Governor of Georgia indicating concern about the overall educator evaluation system.

US ED is concerned that the Georgia Department of Education has made amendments to the Race to the Top (RTT) grant it received two years ago. Because the State made changes through amendment requests, on-site reviews, monthly calls and other conversations, part of Georgia’s Race to the Top grant has been put on “high-risk status,” which could cost the state about $33 million of its $500 million RTT grant.

The Atlanta Journal-Constitution reported that the State Department of Education says the evaluation plan it has field-tested (for about four months) won’t work. It won’t work, according to State Department lawyers, because it might lead teachers to bring lawsuits against the state.

A letter of response from Georgia Governor Deal and State School Superintendent Barge suggests that the previous administration had written the RTT grant, and that things had changed since then (2009). The State asked for changes in its RTT grant only because its legal advisors said it might put the state in a highly “litigious situation.”

But Here’s the Deal: The Flawed Teacher Evaluation System Will Still Be Used

The only part of the multi-pronged teacher (and leader, meaning principal) evaluation system that will change is insignificant. The rest of the system for grading teachers and principals will stay. The system has three components that combine to generate a Teacher Effectiveness Measure score. The three components, shown in Figure 1, include:

  • Teacher Assessment on Performance Standards:  Data sources include observations of classroom teachers, and documentation.
  • Surveys of Instructional Practice: Student opinions of their teachers. All grades, except K-2.
  • Student Growth and Academic Achievement: If you teach a tested subject (math, reading/language arts, science, social studies), the state will use student growth percentiles, a value-added measure, and achievement gap reduction. If you teach a non-tested subject, it’s even worse: your effectiveness will be judged on some district-wide average, a measure that has not yet been developed.
Figure 1: Georgia Teacher Evaluation System, consisting of three components that contribute to an overall Teacher Effectiveness Measure (TEM)

If you are a teacher in Georgia, your job performance will be determined by the combination of the three components described above and shown in Figure 1. We still do not know how much each part will contribute to your final score. We are waiting to find out.

The first component, Teacher Assessment on Performance Standards, means administrators will visit your classroom several times during the year to check your teaching performance. They will use a scoring rubric (observation form) to judge how well you are doing on five standards and ten performances (see Figure 2).

Figure 2. Performance appraisal rubric example. The state will use these qualitative rubrics to measure teacher performance.

For example, your principal comes in and announces she is going to use the State’s appraisal rubric to check your performance. She tells you she can only stay for 30 minutes (yet you have a 90-minute period). Using the Class Keys Teacher Evaluation System rubric, which she has hopefully been trained to use, she sits at the side of your classroom and watches you teach. Either during the visit or after, she checks off how you did based on a Performance Appraisal Rubric. You can be graded Exemplary, Proficient, Developing/Needs Improvement, or Ineffective (Figure 2). The 10 standards are assessed using a list of 26 Class Keys, each of which has a specific rubric. There is an inverse relationship between the nature of these behaviors and the test-based mania that controls the way teachers teach and the curriculum that is implemented. Observation protocols such as these depend on the observer being trained and having a high inter-rater reliability score.

The system of observations also puts enormous time constraints on principals, especially those with large faculties. How would it be possible for a principal to visit every teacher in her school one or more times per year, conference with each teacher, and complete the paperwork, without burdening even the most effective administrator? And bringing in outside observers raises further questions about reliability. One also has to wonder how a person can simply drop in in the middle of a course, let alone a lesson, and understand the context of that lesson. The system is fraught with problems.

Figure 3: Survey of student opinions. The State of Georgia does not want to use this system with students in grades K–2, citing legal concerns.

In the second component, surveys ask students to report on items that they have directly experienced. This is the part, for grades K–2, for which the state asked US ED for a modification. Essentially, this component surveys students to evaluate their teachers. One example of a survey question is: “My teacher knows a lot about what she is teaching” (yes, sometimes, no). Or: “My teacher has deep knowledge about the subject he/she teaches” (strongly agree, agree, disagree, strongly disagree, not applicable). Figure 3 shows survey question examples from K–2 and from high school.

The third component will use student achievement test scores (high-stakes, end-of-year) to produce a student growth percentile/value-added measure (VAM). The state website says: “The model will be developed soon.” The problem is that the state will never find or develop a model that produces stable ratings of teachers. In studies where VAMs have been used, the results were unreliable in establishing how much a teacher contributed to student learning.

But more than that, the entire system is tired and old, and pushes us further backward instead of embracing an entirely different generation of students whose world outside of school is native to them.  When they come to school, they are more like immigrants entering a world foreign to them.

So, even though the state says part of the plan won’t work, it is moving ahead: the Georgia State Department intends to implement the Georgia Teacher Evaluation System in at least 29 school districts starting in August. According to a group of 30 Georgia professors of research, there will not be enough time for the state to effectively evaluate the four-month field test of the system, which it implemented during the spring of 2012.

The education establishment is out of touch with youth in trying to impose a rigid, linear model of teaching and learning, and holding teachers, who are caught in the middle, hostage to an unproven and inept system of standards and evaluation.

In the introduction to their book Rewired: Understanding the iGeneration and the Way They Learn, the authors point out the disconnect between the lives of students in and out of school.

Despite the revolutions wrought by technology in medicine, engineering, communication, and many other fields, the classrooms, textbooks, and lectures of today are little different than those of our parents. Yet today’s students use computers, mobile telephones, and other portable technical devices regularly for almost every form of communication except learning. —National Science Foundation Task Force on Cyberlearning (quoted in Larry D. Rosen, Rewired: Understanding the iGeneration and the Way They Learn, Macmillan, 2010, p. 1, Kindle edition)

The teacher and leader evaluation system, as initially proposed in Georgia’s Race to the Top grant, has only been tweaked, not substantially modified. The system as it stands will nevertheless be put into practice; it will provide inaccurate assessments of teachers, and it will degrade the morale of teachers and administrators, ethical professionals trying to carry out standards they did not choose yet know they are responsible for implementing in their classrooms.

That said, we should heed the insight of the Georgia research professors who have challenged the teacher and leader evaluation system, identified its unintended negative consequences, and recommended that the state opt out of this unvalidated and unreliable system. In their words,

We all cannot afford to lose sight of what matters the most—the academic, social, and emotional growth and well-being of Georgia’s children. Our students, teachers, and communities deserve better. They deserve thoughtful, reliable, valid reforms that will improve teaching and learning for all students. It is in this spirit that we write this letter.

What are your opinions of teacher evaluation systems that rely on student achievement scores, observations by administrators and student surveys?