21 June 2011

Bad Graphs part II: don't force the best-fit through the origin

In today's episode of Bad Graphs, we begin with another poor scale.

BAD GRAPH #4:  Scaled to less than one-quarter page
As you can see -- or maybe you can't, it's so small -- these data points are plotted correctly, but in a teeny weeny portion of the page provided.  A proper graph takes up well over half the available room on the page.  The standard for credit on this particular AP problem (2010 B2) was for the graph to take up more than 1/4 page.  Scaling across a whole page is a skill that must be taught -- it does not come naturally out of math classes.
BAD GRAPH #5:  Can't see the data points without a magnifying glass
If you want to get extra-technical, the size of the data points on the graph should reflect the experimental uncertainty in each quantity measured.  That is, if you could measure to the nearest milliliter, then the data points should be as big as half of a box in the vertical direction.  (And if large uncertainty would make the points ridiculously big, then you're supposed to use error bars.)

For the purposes of AP exam questions or labs within my course, all I ask is that the data points be clearly visible, as in all of the other BAD GRAPHS shown in this post.  The graph above, though, shows itty bitty dots and a nice best-fit.  Sure, the best-fit will yield a reasonable slope, but without easily seen evidence of where that slope came from.

BAD GRAPH #6:  best-fit line forced through the origin
A best-fit line should reasonably indicate the trend of the data.  There is no one "best" best-fit, but rather a range of allowable best-fits.  I've occasionally had my class draw the steepest possible best-fit, then the shallowest, and note that the value of the slope is somewhere between these two extremes. 

The problem with the graph above is much more subtle than with some of the other BAD GRAPHs.  This student has drawn the best-fit line by starting at the origin of coordinates, and only then trying to approximate the trend of the graph.  Problem is, for one thing, the origin is not a special spot on the graph.  The point (0 kg, 0 m^3) is no more important than the point (0.04 kg, 0.000054 m^3).  Even in the case where (0,0) is a data point, it's a data point like any other.  Would you insist that the best-fit line always go through the third data point?

In this particular experiment from the 2010 AP exam, the y-intercept of the graph was explicitly non-zero.  (In fact, the last part of the question required students to figure out that the y-intercept represented the volume of fluid displaced by the floating cup alone, without any additional mass.)  Forcing the best-fit through the origin not only artificially steepens the graph's slope, but also obscures the physically meaningful y-intercept.
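
To see the steepening in numbers, here is a minimal Python sketch. The data values are made up for illustration (they are not the exam's data), and the use of NumPy least-squares fits in place of a hand-drawn line is my assumption. It compares an ordinary two-parameter fit with a fit forced through the origin.

```python
import numpy as np

# Illustrative numbers only (not the actual exam data): mass added to a
# floating cup in kg, and volume of fluid displaced in m^3.
mass = np.array([0.02, 0.04, 0.06, 0.08, 0.10])
volume = np.array([3.5e-5, 5.4e-5, 7.3e-5, 9.5e-5, 11.4e-5])

# Two-parameter least-squares fit: volume = slope * mass + intercept
slope_free, intercept_free = np.polyfit(mass, volume, 1)

# One-parameter least-squares fit forced through the origin:
# volume = slope * mass, which gives slope = sum(x*y) / sum(x^2)
slope_forced = np.sum(mass * volume) / np.sum(mass**2)

print(f"free fit:   slope = {slope_free:.3e} m^3/kg, "
      f"intercept = {intercept_free:.3e} m^3")
print(f"forced fit: slope = {slope_forced:.3e} m^3/kg, intercept = 0")
```

With a positive intercept in the data, the forced line has to tilt upward to reach the points, so its slope comes out larger than the free fit's, and the intercept information is thrown away entirely.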

Of course, forcing best-fits through the origin isn't always as subtle.  Trust me.  When we graded this problem, we saw the not-totally-unreasonable version above, but also we saw plenty of these:

BAD GRAPH #7:  Curved to get to the origin

 Yuk.  But this one takes the cake...

BAD GRAPH #8:  Forced through the origin that isn't even the origin
It's perfectly acceptable, and sometimes desirable, not to begin an axis at zero.  However, you gotta recognize that what looks like the origin isn't necessarily the actual origin, in that case.  This grapher would have been fine, except for forcing that line through the origin that, after all, isn't the origin.  Boux.

One more set of BAD GRAPHs tomorrow.  But I promise, I'll include a couple of GOOD GRAPHs as well.






20 comments:

  1. You are wrong here---there is a strong theoretical justification for 0 mass at 0 volume that does not include experimental error. It is much easier to justify fitting a line through the origin than it is to justify an arbitrary line. You need some phenomenon to invoke to justify an offset other than 0 (like bias in your measuring equipment).

    I would have given more points to a student who recognized that the model they were fitting had one free parameter (density) rather than 2.

    If you are trying to justify a model in which density is a constant, then even the straight line is almost assuming your conclusion, so I see no advantage to the 2-parameter model.

  2. Gas Station, please do see the original AP exam question, 2010 #2. The y-intercept has nothing to do with experimental error, but represents the volume of fluid displaced by the empty cup.

    GCJ

  3. I apologize---I had not seen the question itself, and so did not realize that the volume and mass measurements both had offsets in the experimental setup.

    Indeed for the experiment described there it is necessary to include the second parameter, and forcing the line through zero is (as you said) a serious mistake.

    Sigh, I should learn to follow all the links before I open my big mouth.

  4. Not a problem! I really wish I could post the problem itself here to prevent confusion. Silly copyright. :-)

  5. I need to get into that Greg Jacobs's science class!! My future depends on it.

  6. Anonymous Physics Student, September 23, 2015 at 11:56 AM

    Sorry for reviving a dead post here, but would it be acceptable to force a best fit line through the origin if measuring distance over time of a runner running a race? If the problem clearly indicates that the distance was 0 when time was 0, which is the only sensible case because the distance traveled was measured from the distance when time was 0, then should the trendline be forced through (0,0)? If it didn't go through the origin, it would follow that when time was 0, the runner wasn't at the starting line.

    1. Or, perhaps, that the runner's speed wasn't so constant as it seemed.

      Treat (0,0) like any other data point. If you measured that the runner was 15 m away from the start after 3 s, then that's a measured data point, too; you wouldn't fit your best-fit line through there, would you?

      I often advocate drawing a steepest and a shallowest reasonable best-fit line. Then the slope must fall in the range defined by these two lines. Similarly, the lines can define a range of experimentally supported vertical intercepts. If (0,0) isn't in that range, then you've discovered some unexpected physics.
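
Here is a rough Python sketch of that steepest/shallowest idea. The runner data and the ±0.5 m uncertainty are invented for illustration. The rule used for the extreme lines (join the bottom of the first error bar to the top of the last, and vice versa) is just one common classroom convention, assumed here as a stand-in for judging the lines by eye.

```python
def extreme_lines(x, y, dy):
    """Return ((slope, intercept), (slope, intercept)) for the steepest and
    shallowest lines through the error bars of the first and last points.

    Assumes x is sorted in increasing order, the trend is upward, and every
    point has the same vertical uncertainty dy.
    """
    run = x[-1] - x[0]
    # Steepest: bottom of the first error bar to the top of the last.
    steep_slope = ((y[-1] + dy) - (y[0] - dy)) / run
    steep_intercept = (y[0] - dy) - steep_slope * x[0]
    # Shallowest: top of the first error bar to the bottom of the last.
    shallow_slope = ((y[-1] - dy) - (y[0] + dy)) / run
    shallow_intercept = (y[0] + dy) - shallow_slope * x[0]
    return (steep_slope, steep_intercept), (shallow_slope, shallow_intercept)


# Hypothetical runner data: time in s, distance in m, each distance +/- 0.5 m.
time = [1.0, 2.0, 3.0, 4.0, 5.0]
distance = [5.2, 9.8, 15.1, 19.9, 25.3]

steep, shallow = extreme_lines(time, distance, 0.5)
low, high = sorted([steep[1], shallow[1]])

print(f"slope between {shallow[0]:.3f} and {steep[0]:.3f} m/s")
print(f"intercept between {low:.3f} and {high:.3f} m")
print("origin within intercept range:", low <= 0.0 <= high)
```

For these made-up numbers the intercept range straddles zero, so the data are consistent with the runner starting at the line; if the range excluded zero, that would be the "unexpected physics" worth chasing.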

  7. Can you please repost the bad examples?

  8. Aarrgh! The pictures aren’t showing up on my iPad! Okay, I’ll try to figure out what’s wrong with Google - this post has been stable, and well-read, for years. Weird.

  9. I think it's working now. Let me know if not via the comments or email. Thanks!

  10. Hi!
    I believe the issue of whether a best-fit line should be "forced" to pass through the origin is complicated in special cases and is not addressed properly on certain AP questions. I would appreciate your thoughts on this.

    I would argue it matters whether or not the experimental data provided is accompanied by a theoretical relationship which the student is expected to assume is true in order to answer some part of the question. If the student is asked to model the data based only on the fact that the data appears linear, for example, and no a priori relationship is assumed, then a best-fit line should not necessarily pass through the origin. But if a student is implicitly expected to know and assume a theoretical relationship among the data to be valid (and binding), then the student has no choice but to assume that that relationship defines the form of the model he or she must use when modeling the data. If the assumed theoretical relationship is a direct variation of the form y=kx (zero y-intercept), and the student is asked to plot some data and to determine the slope of the line that most accurately regresses to the particular form of the relationship being assumed in order to estimate the value of parameter k, then in that case the relationship being assumed mandates that the student's best-fit line pass through the origin so that it is in accordance with the mathematical form of the theoretical relationship being assumed.

    In short, if a student is asked to assume a priori that the data obeys a theoretical relationship in order to justify estimating the value of a parameter from the graph, and that relationship is a direct variation, then the student is in essence being asked to regress the data to y~ax, not to y~ax+b.

    What's motivating my comment is the Chief Reader Report for the 2018 AP Physics 1 test, question 2, which says, "Responses that demonstrate understanding: the data does not include a zero y-intercept." The relationship in question is R=pl/A. The test asks the student to graph the data in such a way that the resistivity can be determined. In other words, the student is expected to know the relationship R=pl/A and, for example, plot R on the y-axis and (l/A) on the x-axis so that the slope of the best-fit line estimates the resistivity. But the whole reason the student is expected to graph a best-fit _line_ in the first place isn't that the data, when plotted, suggests a linear relationship. It's that the student is assumed to know the relationship R=pl/A, because that's the only way he or she would know that the slope of the R vs. (l/A) graph should give the resistivity. But if we're assuming that we know that relationship, and that we are to fit the data to that model, then why would we not also follow the requirement of this relationship that it is a direct variation with a zero y-intercept?

    I know there is widespread acceptance that when the mathematical form of a physical relationship is being determined experimentally, and no assumptions are made as to its form, it is poor practice to require (0,0) to be on a "best-fit," even though it may make intuitive sense. However, this particular type of question is different, and I agree with the following summary provided by a dynamic ecologist: "Knowing that the true relationship between your predictors and the expected value of your dependent variable has to pass through the origin would be a good reason for forcing the estimated relationship through the origin _if you knew for certain what the true relationship was_. That is, you not only know that y=f(x) passes through the origin, you know the functional form of f(x) and merely have to estimate its true parameter values." AP students are expected to assume for the purpose of this question that they know for certain what the true relationship is (a relationship that is linear and that passes through the origin), and that they merely have to estimate the slope in order to estimate the value of the variable in question.

  11. BD, an experimental question on the AP exam - like experimental physics - doesn’t allow assumptions about what are “true relationships.” A relationship is true, to one extent or another, if it is supported by experimental evidence. The essence of experimental physics is acquiring, communicating, and analyzing such evidence.

    The slope of an R vs. (L/A) graph is indeed resistivity... assuming that the relationship represented in the equation is valid, and assuming that resistivity is in fact a constant value. If the data displayed on such a graph are not linear, then resistivity is NOT constant. If the data is linear but doesn’t go through the origin, then the equation is missing an additive constant - something representing the resistance at (L/A) = 0, which might involve an artifact of the external circuit used to measure resistance.

    Point is, this sort of deviation from an expectation of the “truth” of an equation happens all the time, even in a high school physics lab. See the post titled “What if my force vs displacement graph for a spring is weird for small displacements?” See the original AP question, from 2010, which these bad graphs are based on. Try experiments like “does a rubber band obey F=kx?” The AP exam purposefully asks us to teach experimental physics as going beyond verifying relationships we already know to be true. It asks us to engage in the unexpected, to test hypotheses beyond the equation sheet, to let the data, not authority, be our guide.

  12. Hi Greg, and thank you for your thoughtful response! I believe I understand now. I don't want to take too much more of your time, but I'd like to clarify one more aspect of this type of question. The issue is not that the line which best fits the experimental data deviates from a theoretical direct variation model. The problem is in assuming that the slope of such a "non-ideal" line remains a valid estimate for some value, such as resistivity. When we analyze R vs. L/A data, and the y-intercept of a best-fit line deviates from 0 in a statistically significant way, are we supposed to assume both that the resistivity remained constant and that some nonzero contribution to the resistance from the mystery "artifact" also remained constant throughout the experiment?

    Here's what I mean. You mentioned, "If the data displayed on such a graph are not linear, then resistivity is NOT constant"...But at least mathematically, that's not necessarily true. The resistivity of the dough could be constant and it could be a nonconstant contribution to the resistance by some external "artifact" causing the graph not to be linear. Similarly, you mentioned, "If the data is linear but doesn’t go through the origin, then the equation is missing an additive constant"...But again, that's not necessarily true, because what's "missing" might be an additive nonconstant, in which case the resistivity may not have been constant either, despite the fact that the data is linear.

    We can easily think of an example, at least in theory, to illustrate such a case, where we would not be justified in assuming that the slope of an R vs. L/A best-fit line is equal to the resistivity, even when r=1. Imagine a student measures the resistance of dough cylinders with progressively larger L/A values, plots the data, and that the graph appears linear with a positive slope of 1 ohm.m, but with a y-intercept of 5 ohms. Are we theoretically assured that the resistivity in this case is constant, and that the slope of this line is equal to the resistivity? It's certainly possible. But at least mathematically, it's also possible that what we thought was some "additive constant," some artifact we assumed to be contributing a constant nonzero resistance to the situation, was actually adding a non-constant contribution to the resistance, a contribution that was decreasing over time as the student performed the experiment. And imagine also that over the course of the data collection, the room was getting warmer, so that the resistivity of the dough increased just enough between each measurement to compensate for the decreased contribution by the aforementioned "artifact." If the student completed his data collection before the mystery artifact’s contribution reached 0, he could theoretically have a linear graph, when in fact the resistivity was not constant. In this theoretical case, the student would not be correct in stating that the resistivity of the dough was equal to 1 ohm.m.

    This would clarify things...For the purposes of the AP test, in an example such as this, as long as the R and L/A data fits well to a linear model, are we to assume that the resistivity is constant, that the slope of the R vs. L/A best-fit line is equal to the resistivity no matter what, that the only explanation for the discrepancy between the R=pL/A (theoretical) model and the R=pL/A+b (experimental) model is that there is some artifact adding a constant non-zero resistance to the measurements independent of the constant resistivity?

    If somewhere in the question, the student were given the proviso, "Assume that the linear rate of change of your R vs. L/A data represents a constant resistivity and that any nonzero value for R when L/A=0 is the result of a separate constant external contribution which does not affect the resistivity," then I could feel justified in finding a best-fit line, measuring its slope, and equating that slope with the resistivity.
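
The compensating-drift scenario described in this comment can be checked numerically. In the sketch below, the slope of 1 ohm.m and intercept of 5 ohms come from the comment; the specific L/A values and the artifact's drift are hypothetical numbers chosen only so that the measurements land exactly on a line.

```python
import numpy as np

# Hypothetical L/A values (1/m) for five dough cylinders, measured in order.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Hypothetical artifact resistance (ohm) that drifts downward over the
# experiment and has not yet reached zero by the last measurement.
artifact = np.array([5.0, 4.0, 3.0, 2.0, 1.0])

# Resistivity (ohm.m) chosen so each measurement lands on R = 1*x + 5:
# the dough's resistivity rises just enough to offset the artifact's drift.
rho = (1.0 * x + 5.0 - artifact) / x

# What the student would actually measure and plot.
R = rho * x + artifact

slope, intercept = np.polyfit(x, R, 1)
print("resistivity at each measurement:", np.round(rho, 3))
print(f"fitted slope = {slope:.3f} ohm.m, intercept = {intercept:.3f} ohm")
```

The fit returns a perfectly linear graph with slope 1 ohm.m and intercept 5 ohms even though the resistivity in this contrived case is never constant, which is the mathematical possibility the comment is pointing at. Whether such a finely tuned coincidence is physically plausible is a separate question.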

  13. How do you handle y-intercepts that would be interpreted as being not physically possible? E.g., in the F vs. dx experiment for a spring, where we take the y-axis as change in spring length ("stretch", dx) and x-axis as net force applied (F). A negative intercept indicates the spring is “overly compressed”: the spring has some resistance to it that requires a small amount of force to be applied to it (e.g., a small hanging weight) before it will show any stretch. Now suppose for a different spring you take an initial data point where, with 0 force applied, there is no stretch; therefore, the pair (0,0) is a data point in your data set. You then collect all your other measurements and the data shows the best fit to have a positive y-intercept. What then? The interpretation of this positive y-intercept is that with zero force applied to the spring the change in spring length is some finite positive value. This violates Newton's 1st and 2nd laws.

  14. Armadillo, interesting question. My answer is, if it's not physically possible, it won't show up in real data - so that's not something I worry about. And that might be a good time to talk about experimental uncertainty: is that unphysical intercept meaningful, or is it an artifact of the best-fit line you chose to draw? Try drawing the steepest and shallowest possible best-fits. If these overlap the origin, then your y-intercept essentially is zero!

  15. RE experimental uncertainty -- yes, this result could be caused by experimental uncertainty (in both the x and y measurements; and propagated errors if x and/or y are calculated values). So, yes, this non-physical result could be an artifact of the best fit line determined by linear regression (minimizing the sum of squared errors) to real data that is subject to significant error. Now suppose you determine the error/uncertainty in the y-intercept and the max and min values do not overlap the origin: they fall on the positive y-axis, say by even relatively small values above 0. What then? Barring conducting the experiment again, and barring the use of another physical model of different functional form (e.g., y=kx versus y=kx+b versus y=ax^2+bx+c), what would you recommend the experimenter do to determine the coefficient k in Hooke's law? Should the experimenter apply "best judgment" knowing that the positive y-intercept result (w/ known error) is non-physical, and thus force the y-intercept to zero and find the slope of the best-fit line subject to that constraint?

  16. New question. Your students conduct a simple experiment to determine the relationship between the force applied to a spring and the displacement of the spring from its rest length. To do so, they hang various masses from the spring and measure the vertical displacement. There are errors/uncertainties in both the values of the masses used and in the vertical displacement measured. However, they note that when they do not place a mass on the spring there is no vertical displacement. Can they record a measurement of zero mass resulting in a zero vertical displacement? If 'yes', what is the uncertainty in that measurement? If 'no', why not?

  17. I found this post via Google. Fascinating read. I've been having this discussion on whether the BFL should go through the origin for the last week and I still feel so unsettled (there was a lively debate and no consensus amongst a group of high school teachers). Is there a "go to" authority on the topic (or a reference manual)?

    I am of the mind that forcing a BFL through the origin (0,0) isn't a requirement. However, knowing that with, for example, Ohm's law, 0 V means 0 A, and yet somehow ending up with a best-fit line that doesn't go through the origin is unsettling (y = mx + b instead of y = mx), especially if you end up with a negative y-intercept...

    I think that last comment is echoing previous commentators.

    Since it wasn't even a measurement/recording, would it be general practice then to encourage students to not include it in their data tables?

  18. Anonymous,

    The difference is whether you can make the measurement, or whether you're just relying on the theoretical equation.

    For example: In an Ohm's law experiment plotting V vs. I, I can in fact turn down the voltage to zero in my circuit, and measure the current. If I get 0 amps, then (0,0) becomes a data point that I can use.

    In the experiment that underpins this post - measuring volume of liquid displaced as a function of mass in a floating cup - I can also make a measurement for the volume displaced with zero mass in the cup. That will not be zero, even though just a pure calculation from a theoretical equation will suggest that zero mass should not displace any volume of fluid!

    To answer your question - we should never encourage students to include (0,0) in data collection. Sure, if they decide on their own to make a measurement that corresponds to (0,0), great. But there's no need to treat the origin any differently from any other point on the graph. A point doesn't go on a graph unless data for that point is acquired experimentally.
