Wikipedia:Reference desk/Archives/Mathematics/2024 November 8

From Wikipedia, the free encyclopedia
Mathematics desk
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


November 8

finding an equation to match data

An experiment with accurate instruments resulted in the following data points:-

x,     y
0.080, 0.323;
0.075, 0.332;
0.070, 0.347;
0.065, 0.368;
0.060, 0.395;
0.055, 0.430;
0.050, 0.472;
0.045, 0.523;
0.040, 0.587;
0.035, 0.665;
0.030, 0.758;
0.025, 0.885;
0.020, 1.047;
0.015, 1.277;
0.010, 1.760.

How can I obtain a formula that reasonably matches this data, say within 1 or 2 percent?
At first glance, it looks like a k1 + k2*x^-k3 relationship, or a k1*x^k2 + k3*x^-k4 relationship, but they fail at x above 0.070. Trying a series such as e^(k1 + k2*x + k3*x^2) is also no good. -- Dionne Court (talk) 03:14, 8 November 2024 (UTC)

Thank you CiaPan for fixing the formatting. Dionne Court (talk) 15:12, 8 November 2024 (UTC)
Plotting 1/y against x, it looks like a straight line, except that there is a rather dramatic hook to the side starting around x=.075. This leads me to suspect that the last two entries are off for some reason; either those measurements are off or there's some systematic change in the process going on for large x. Part of the problem is that you're not giving us any information about where this data is coming from. I've heard it said, "Never trust data without error bars." In other words, how accurate is accurate, and might the accuracy change depending on the input? Is there a reason that the values at x≥.075 might be larger than expected? If the answer to the second question is "Yes" then perhaps a term of the form (a-x)^k should be added. If the answer is "No" then perhaps that kind of term should not be added, since it adds more parameters to the formula. You can reproduce any set of data given enough parameters in your model, but too many parameters leads to overfitting, which leads to inaccurate results when the input is not one of the values in the data. So as a mathematician I could produce a formula that reproduces the data, but as a data analyst I'd say you need to get more data points, especially in the x≥.075 region, to see if there's something real going on there or if it's just a random fluke affecting a couple of data points. --RDBury (talk) 15:58, 8 November 2024 (UTC)
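The "straight line with a hook" observation can be checked numerically. The following is a minimal sketch, not RDBury's actual procedure: it fits an ordinary least-squares line to 1/y against x using only the points with x ≤ 0.070 (that cutoff is an assumption made here for illustration) and prints the residual of every point from that line. The names fit_line and CUTOFF are hypothetical.

```python
# Data points as posted in the question.
XS = [0.080, 0.075, 0.070, 0.065, 0.060, 0.055, 0.050, 0.045,
      0.040, 0.035, 0.030, 0.025, 0.020, 0.015, 0.010]
YS = [0.323, 0.332, 0.347, 0.368, 0.395, 0.430, 0.472, 0.523,
      0.587, 0.665, 0.758, 0.885, 1.047, 1.277, 1.760]


def fit_line(us, vs):
    """Ordinary least-squares line v = intercept + slope*u."""
    n = len(us)
    u_mean = sum(us) / n
    v_mean = sum(vs) / n
    suu = sum((u - u_mean) ** 2 for u in us)
    suv = sum((u - u_mean) * (v - v_mean) for u, v in zip(us, vs))
    slope = suv / suu
    return v_mean - slope * u_mean, slope


inv_ys = [1.0 / y for y in YS]

# Assumption: treat x <= 0.070 as the "straight" part and fit only there.
CUTOFF = 0.070
straight = [(x, v) for x, v in zip(XS, inv_ys) if x <= CUTOFF]
intercept, slope = fit_line([x for x, _ in straight],
                            [v for _, v in straight])

# Residual (observed 1/y minus the line) for every point, including
# the two points that were left out of the fit.
residuals = {x: v - (intercept + slope * x) for x, v in zip(XS, inv_ys)}
for x in sorted(residuals):
    note = "" if x <= CUTOFF else "  (not used in the fit)"
    print(f"x={x:.3f}  residual in 1/y = {residuals[x]:+.4f}{note}")
```

With the data as posted, the residuals at x = 0.075 and x = 0.080 come out negative and more than twice as large in magnitude as any residual inside the fitted range, which is the hook described above. Moving the cutoff changes the numbers but is unlikely to change that picture.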
PS. I tried fitting 1/y to a polynomial of degree four, so a model with 5 parameters. Given there are only 15 data points, I think 5 parameters is stretching it in terms of overfitting, but when I compared the data with a linear approximation there was a definite W-shaped wobble, which to me says degree 4. (U -- Degree 2, S -- Degree 3, W -- Degree 4.) As a rough first pass I got
1/y ≃ 0.1052890625 + 54.941265625x - 965.046875x^2 + 20247.5x^3 - 136500x^4
with an absolute error of less than .01. The methods I'm using aren't too efficient, and there should be canned curve fitting programs out there which will give a better result, but I think this is enough to justify saying that I could produce a formula that reproduces the data. I didn't want to go too much farther without knowing what you want to optimize: relative vs. absolute error, or least squares vs. min-max, for example. There are different methods depending on the goal, and there is a whole science (or perhaps it's an art) of curve fitting which would be impractical to go into here. --RDBury (talk) 18:26, 8 November 2024 (UTC)
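The quoted error bound can be checked directly against the posted data. This is a small sketch that evaluates the degree-four polynomial for 1/y (coefficients copied from the post, with the powers of x read as x^2, x^3, x^4) and reports both the absolute error in 1/y and the relative error in y. The function name inv_y_model is made up for this example.

```python
# Data points as posted in the question.
XS = [0.080, 0.075, 0.070, 0.065, 0.060, 0.055, 0.050, 0.045,
      0.040, 0.035, 0.030, 0.025, 0.020, 0.015, 0.010]
YS = [0.323, 0.332, 0.347, 0.368, 0.395, 0.430, 0.472, 0.523,
      0.587, 0.665, 0.758, 0.885, 1.047, 1.277, 1.760]

# Coefficients of 1/y as a polynomial in x, lowest power first.
COEFFS = [0.1052890625, 54.941265625, -965.046875, 20247.5, -136500.0]


def inv_y_model(x):
    """Evaluate the quartic for 1/y using Horner's rule."""
    result = 0.0
    for c in reversed(COEFFS):
        result = result * x + c
    return result


# Absolute error in 1/y and relative error in y at each data point.
abs_errors = [abs(inv_y_model(x) - 1.0 / y) for x, y in zip(XS, YS)]
rel_errors_y = [abs(1.0 / inv_y_model(x) - y) / y for x, y in zip(XS, YS)]

max_abs_err = max(abs_errors)
max_rel_err = max(rel_errors_y)

for x, ea, er in zip(XS, abs_errors, rel_errors_y):
    print(f"x={x:.3f}  |error in 1/y|={ea:.4f}  relative error in y={100 * er:.2f}%")
print(f"max |error in 1/y| = {max_abs_err:.4f}")
print(f"max relative error in y = {100 * max_rel_err:.2f}%")
```

On the 15 original points the largest absolute error in 1/y stays just under 0.01, and the largest relative error in y is roughly 1.5%, so the polynomial does meet the "within 1 or 2 percent" target on the data it was fitted to. That says nothing about how it behaves between or beyond those points.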
Thank you for your lengthy reply.
I consider it unlikely that the data inflexion for x>0.07 is an experimental error. Additional data points are :-
x, y: 0.0775, 0.326; 0.0725, 0.339.
The measurement was done with digital multimeters and transducer error should not exceed 1% of value. Unfortunately the equipment available cannot go above x=0.080. I only wish it could. Choosing a mathematical model that stays within 1 or 2 percent of each value is appropriate.
As you say, one can always fit a curve of the form A + Bx + Cx^2 + Dx^3 + ... to any given data. But to me this is a cop-out, and tells me nothing about what the internal process might be, and so extrapolation is exceedingly risky. Usually, a more specific solution when discovered requires fewer terms. Dionne Court (talk) 01:49, 9 November 2024 (UTC)
When I included the additional data points, the value at .0725 was a bit of an outlier, exceeding the .01 absolute error compared to the estimate, but not by much. --RDBury (talk) 18:55, 9 November 2024 (UTC)
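The same check can be repeated for the two additional points. A brief sketch, reusing the quartic coefficients posted earlier in the thread (with powers read as x^2, x^3, x^4); the names here are illustrative only.

```python
# Additional data points posted later in the thread.
EXTRA = [(0.0775, 0.326), (0.0725, 0.339)]

# Quartic for 1/y, lowest power first, as posted earlier in the thread.
COEFFS = [0.1052890625, 54.941265625, -965.046875, 20247.5, -136500.0]


def inv_y_model(x):
    """Evaluate the quartic for 1/y using Horner's rule."""
    result = 0.0
    for c in reversed(COEFFS):
        result = result * x + c
    return result


# Absolute error in 1/y at each additional point.
extra_errors = {x: abs(inv_y_model(x) - 1.0 / y) for x, y in EXTRA}
for x, err in extra_errors.items():
    print(f"x={x:.4f}  |error in 1/y| = {err:.5f}")
```

The point at x = 0.0725 comes out slightly above the 0.01 threshold, while the point at x = 0.0775 lands almost exactly on the curve, which agrees with the description above.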
FWIW, quite a few more data points would almost certainly yield a better approximation. This cubic equation seems pretty well-behaved:
Earl of Arundel (talk) 02:28, 10 November 2024 (UTC)
Some questions about the nature of the data. Some physical quantities are necessarily nonnegative, such as the mass of an object. Others can also be negative, for example a voltage difference. Is something known about the theoretically possible value ranges of these two variables? Assuming that x is a controlled value and y is an experimentally observed result, can something be said about the theoretically expected effect on y as x approaches the limits of its theoretical range? --Lambiam 15:59, 9 November 2024 (UTC)
As x approaches zero, y must approach infinity.
x must lie between zero and some value less than unity.
If you plot the curve with a log y scale, by inspection it seems likely that y cannot go below about 0.3 but I have no theoretical basis for proving that.
However I can say that y cannot ever be negative.
The idea here is to find/work out/discover a mathematically simple formula for y as a function of x to use as a clue as to what the process is. That's why a series expansion that does fit the data if enough terms are used doesn't help. Dionne Court (talk) 01:33, 10 November 2024 (UTC)
So as x approaches zero, 1/y must also approach zero. This is, so to speak, another data point. Apart from the fact that the power series approximations given above provide no theoretical suggestions, they also have a constant term quite distinct from 0, meaning they do not offer a good approximation for small values of x.
If you plot a graph of x versus 1/y, a smooth curve through the points has two points of inflection. This suggests (to me) that there are several competing processes at play. --Lambiam 08:08, 10 November 2024 (UTC)
The x=0, 1/y=0 point is an important data point that should have been included from the start. I'd say it's the most important data point since a) it's at the endpoint of the domain, and b) it's the only data point where the values are exact. Further theoretical information near x=0 would be helpful as well. For example, do we know whether y is proportional to x^-a near x=0 for a specific a, or perhaps to -log x? If there is no theoretical basis for determining this then I think more data points near x=0, a lot more, would be very helpful. The two points of inflection match the W (or M) shape I mentioned above. And I agree that it indicates there are several interacting processes at work here. I'm reminded of solubility curves for salts in water. There is an interplay between energy and ionic and Van der Waals forces going on, and a simple power law isn't going to describe these curves. You can't even assume that they are smooth curves, since sodium sulfate is an exception; its curve has an abrupt change of slope at 32.384 °C. In general, nature is complex, simple formulas are not always forthcoming, and even when they are they often only apply to a limited range of values. --RDBury (talk) 15:46, 10 November 2024 (UTC)
I have no theoretical basis for expecting that y takes on a particular slope or power law as x approaches zero.
More data points near x = 0 are not a good idea, because transducer error will dominate. Bear in mind that transducer error (about 1%) applies to both x and y. Near x = 0.010 a 1% error in x will lead to a change in y of something like 100% [(1.760 - 1.277)/(0.015 - 0.010)]. The value of y given for x = 0.010 should be given little weight when fitting a curve. Dionne Court (talk) 02:03, 11 November 2024 (UTC)
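The bracketed expression is a finite-difference slope, in units of y per unit of x, rather than a percentage. As a rough sketch of how a 1% error in x would carry through to y near x = 0.010, the slope can be multiplied by the size of the x error and compared with y itself. The calculation assumes the two lowest data points give a usable estimate of the local slope; since the curve steepens toward x = 0, the true slope at x = 0.010 is probably somewhat larger in magnitude.

```python
# The two lowest-x data points from the question.
x0, y0 = 0.010, 1.760
x1, y1 = 0.015, 1.277

# Finite-difference estimate of dy/dx between the two points.
slope = (y1 - y0) / (x1 - x0)

# A 1% error in x at x0, and the change in y it implies to first order.
dx = 0.01 * x0
dy = slope * dx

# Express that change as a fraction of y at x0.
rel_change_y = abs(dy) / y0

print(f"slope dy/dx  = {slope:.1f} (y units per x unit)")
print(f"dx (1% of x) = {dx:.5f}")
print(f"dy           = {dy:.5f}")
print(f"relative change in y = {100 * rel_change_y:.2f}%")
```

Under this first-order estimate the slope is about -97, and a 1% error in x shifts y by well under 1% of its value, so the x error on its own is comparable to the stated 1% transducer error in y. Other reasons for down-weighting the x = 0.010 point may of course still apply.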
It seems to me that one should assume there is a simple relationship at play, with at most three competing processes, as otherwise there is no basis for further work. If it is a case of looking for the lost wallet under the lamp post because the light is better there, so be it, but there is no point in looking where it is dark.
Cognizant of transducer error, a k1 + k2*x^-k3 relationship fits pretty well, except for a divergence at x equal to and above 0.075, so surely there are only 2 competing processes? Dionne Court (talk) 02:03, 11 November 2024 (UTC)
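One way to try the k1 + k2*x^-k3 model without a curve-fitting package is to note that, for any fixed k3, the model is linear in k1 and k2. The sketch below scans k3 over a grid and, for each value, solves the weighted straight-line fit in closed form, keeping the k3 with the smallest sum of squared relative errors. Several choices here are assumptions made for illustration and do not come from the thread: the grid from 0.10 to 2.00 in steps of 0.01, the use of relative rather than absolute error, and fitting only the points with x ≤ 0.070. The function name fit_offset_power_law is hypothetical.

```python
# Data points as posted in the question.
XS = [0.080, 0.075, 0.070, 0.065, 0.060, 0.055, 0.050, 0.045,
      0.040, 0.035, 0.030, 0.025, 0.020, 0.015, 0.010]
YS = [0.323, 0.332, 0.347, 0.368, 0.395, 0.430, 0.472, 0.523,
      0.587, 0.665, 0.758, 0.885, 1.047, 1.277, 1.760]


def fit_offset_power_law(xs, ys, k3_grid):
    """Fit y = k1 + k2 * x**(-k3) by scanning k3 over a grid.

    For each k3 the model is linear in (k1, k2), so a weighted
    least-squares line in u = x**(-k3) is solved in closed form.
    Weights 1/y**2 make the objective a sum of squared relative errors.
    Returns (k1, k2, k3) for the best grid value of k3.
    """
    weights = [1.0 / (y * y) for y in ys]
    sw = sum(weights)
    y_mean = sum(w * y for w, y in zip(weights, ys)) / sw
    best = None
    for k3 in k3_grid:
        us = [x ** (-k3) for x in xs]
        u_mean = sum(w * u for w, u in zip(weights, us)) / sw
        suu = sum(w * (u - u_mean) ** 2 for w, u in zip(weights, us))
        suy = sum(w * (u - u_mean) * (y - y_mean)
                  for w, u, y in zip(weights, us, ys))
        k2 = suy / suu
        k1 = y_mean - k2 * u_mean
        sse = sum(w * (y - k1 - k2 * u) ** 2
                  for w, u, y in zip(weights, us, ys))
        if best is None or sse < best[0]:
            best = (sse, k1, k2, k3)
    return best[1], best[2], best[3]


# Assumed grid for k3: 0.10, 0.11, ..., 2.00.
K3_GRID = [i / 100 for i in range(10, 201)]

# Assumption: fit only x <= 0.070, where the model is said to work.
CUTOFF = 0.070
fit_xs = [x for x in XS if x <= CUTOFF]
fit_ys = [y for x, y in zip(XS, YS) if x <= CUTOFF]
k1, k2, k3 = fit_offset_power_law(fit_xs, fit_ys, K3_GRID)
print(f"k1={k1:.4f}  k2={k2:.5f}  k3={k3:.2f}")

# Percent error at every data point, including those left out of the fit.
for x, y in zip(XS, YS):
    y_fit = k1 + k2 * x ** (-k3)
    print(f"x={x:.3f}  y={y:.3f}  fit={y_fit:.3f}  error={100 * (y_fit - y) / y:+.2f}%")
```

The printed percent errors show directly whether the 1 to 2 percent target is met inside the fitted range and how far the points at x = 0.075 and x = 0.080 depart from the extrapolated curve. If the best k3 lands on either end of the grid, the grid should be widened before reading anything into the result.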