Talk:Bayesian information criterion

Statistics Mid‑importance

	This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mid	This article has been rated as Mid-importance on the importance scale.

Mathematics Mid‑priority

	This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mid	This article has been rated as Mid-priority on the project's priority scale.

Local approximation

I would like to add the following comment for the limitations:

The BIC relies on a local approximation of the posterior moments, based on the assumption that the mode of the posterior probability is close to the expected value of the parameters of the model.^[1] For skewed posteriors, this approximation might not hold.

Would anyone argue with that?

MickeyMoens 05:56, 12 November 2016 (UTC)

References

^ Raftery AE. "Bayesian Model Selection in Social Research." Sociol Methodol 1995;25:111. doi:10.2307/271063.
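To make the mode-versus-mean point concrete, here is a minimal sketch under assumptions that are not part of the comment above: a binomial likelihood with a uniform prior, so that the posterior is Beta(s + 1, n - s + 1). Its mode equals the maximum likelihood estimate s/n, while its mean is (s + 1)/(n + 2). The counts are made up for illustration.

```python
def posterior_mode(successes, trials):
    """Mode of Beta(successes + 1, trials - successes + 1); equals the MLE."""
    return successes / trials


def posterior_mean(successes, trials):
    """Mean of the same Beta posterior."""
    return (successes + 1) / (trials + 2)


# Hypothetical counts, all with the same observed proportion of 10%.
for successes, trials in [(1, 10), (10, 100), (100, 1000)]:
    mode = posterior_mode(successes, trials)
    mean = posterior_mean(successes, trials)
    # The gap between mean and mode is largest for the small, skewed case.
    print(trials, round(mode, 4), round(mean, 4), round(mean - mode, 4))
```

With few observations the posterior is skewed and the mean sits well away from the mode; as the sample grows the two converge, which is the regime in which a local approximation around the mode is usually considered safe.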

SIC vs BIC

As far as I know, SC and BIC are different. What is described is BIC.

The formula given matches that derived by Schwarz (1978)

There does seem to be some confusion here though, as e.g. Bengtsson and Cavanaugh define SIC as given here, and BIC differently.

Can anyone provide some authoritative references to the modern use of the terms BIC and SIC?

--Ged.R 15:18, 22 January 2007 (UTC)[reply]

Should "SIC" in the first formula be "BIC"? The acronym "SIC" is never defined in the article.

The first formula is bizarre. geez. "The formula for the BIC is exp(-SIC/2) ???" What's up with that?

The entire article sounds like it is written by wannabe experts who aren't exactly certain of what they're talking about. Read this from the perspective of someone who is trying to ascertain the basic formula for BIC. (That is my situation. I have a software package that purports to calculate AIC and BIC scores, but the package doesn't say exactly what it is calculating. Neither does this article.) It introduces all these terms, xbar, etc., and never says what they are. A "constant becoming trivial" is a pretty weird notion if you ask me. This could potentially be a very important article as statistical model selection invades more and more fields, but as it currently stands it is useless as a first approximation. You should keep the constants that become trivial. An article written by people who know this stuff, sort of, for people who also sort of know this stuff is useless. —Preceding unsigned comment added by 67.0.90.202 (talk) 23:42, 3 January 2011 (UTC)[reply]

81.231.127.12 (talk) 22:09, 18 December 2010 (UTC)[reply]

I've just reverted a change to that first formula that was made on 15 November, so it now agrees with the reference given and the standard definition of BIC. I agree that this article is in need of attention but this is not my area of expertise so I'll flag it as in need of expert attention from a suitable statistician. Thanks for pointing this out. To be honest I'm surprised the article was left like this for over a month. Qwfp (talk) 09:58, 4 January 2011 (UTC)[reply]

A problem is that it is unclear what source is actually being used for any of these formulae. For example, the Priestley reference has a different formula for BIC than that for which it is supposedly being used as a source. But it does give formulae relating AIC, SIC and BIC all within the same context. It seems that Priestley uses S (here SIC) for what is here called BIC, and has a rather more complicated formula for his BIC. Thus Priestley uses BIC for "Akaike's BIC", and S for Schwarz's criterion ("Schwarz's BIC" although he doesn't call it that), where Schwarz's criterion is what is here called BIC. So a first question is what are good sources for current terminology, and is there a consistent usage? The above discussion mentions "the standard definition of BIC" ... are there good sources for such a thing? Melcombe (talk) 16:08, 5 January 2011 (UTC)[reply]

Schwartz criterion

I would like to redirect Schwartz criterion to Schwartz set rather than here, since this is a term used in voting theory. Is there any objection to this? It seems to me that it only redirects here in case of a spelling mistake. CRGreathouse 02:21, 20 July 2006 (UTC)[reply]

Well, unfortunately this is a very frequent spelling mistake. There are loads of books that use the wrong spelling. And the Schwarz Bayesian IC is rather important afaik. I'd prefer to leave the redirection like this or to create a disambiguation page... Gtx, Frank1101 11:00, 20 July 2006 (UTC)[reply]

From what I can tell (and what I was taught in my MS in stats program) BIC is the more common moniker for this. I think the article should reflect this. --Chrispounds 00:51, 29 October 2006 (UTC)[reply]

I agree with Chrispounds, and so does google:

"Bayesian information criterion": 110,000
"Schwarz criterion": 59,600
"Schwarz information criterion": 15,400
"Schwartz criterion": 11,900
"Schwartz information criterion": 531

Any objection to the article being renamed to "Bayesian information criterion", replacing the current redirect with no history? John Vandenberg 07:24, 31 October 2006 (UTC)[reply]

I have made a proposal to reduce the confusion between Schwartz set and Schwarz criterion on Talk:Schwartz set#Schwarz criterion. John Vandenberg 01:14, 9 November 2006 (UTC)[reply]

The BIC is the Schwarz criterion. His name was Gideon Ernst Schwarz, so there is no reason why it should be under the mistaken "Schwartz". — Preceding unsigned comment added by 79.177.15.127 (talk) 08:29, 18 May 2013 (UTC)[reply]

Linear Model expression

The second formula:

Under the assumption that the model errors or disturbances are normally distributed, this becomes:
 $\mathrm {SIC} =n\ln \left({\mathrm {RSS} \over n}\right)+k\ln(n).$

seems wrong to me, $-2\cdot \ln {L}=\left({\mathrm {RSS} \over \sigma ^{2}}\right)$ , right? And not $n\ln \left({\mathrm {RSS} \over n}\right)$ as stated here. --Ged.R 15:18, 22 January 2007 (UTC)[reply]

--

$n\ln \left({\mathrm {RSS} \over n}\right)$ is correct because we are dealing with the maximized likelihood. For a linear model, we have ${\hat {\sigma }}^{2}={\mathrm {RSS} \over n}$. The log-likelihood is of the form:

$l(\beta ,\sigma ^{2};Y)=-{\frac {n}{2}}\log(2\pi \sigma ^{2})-{\frac {1}{2\sigma ^{2}}}\sum _{i=1}^{n}\varepsilon _{i}^{2}$

Evaluating this at the maximum likelihood estimates for $\sigma ^{2}$ and $\beta$, we obtain:

$l({\hat {\beta }},{\hat {\sigma }}^{2};Y)=-{\frac {n}{2}}\log(2\pi {\hat {\sigma }}^{2})-{\frac {1}{2{\hat {\sigma }}^{2}}}\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}$

$=-{\frac {n}{2}}\log \left(2\pi {\mathrm {RSS} \over n}\right)-{\frac {1}{2}}{\frac {n}{\mathrm {RSS} }}\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}$

$=-{\frac {n}{2}}\log \left(2\pi {\mathrm {RSS} \over n}\right)-{\frac {1}{2}}{\frac {n}{\mathrm {RSS} }}\mathrm {RSS}$

$=-{\frac {n}{2}}\log \left(2\pi {\mathrm {RSS} \over n}\right)-{\frac {n}{2}}$

This gives the above expression for $-2\cdot \ln {L}$, up to an additive constant that depends only on $n$.

Wolf87 (talk) 01:40, 5 October 2008 (UTC)[reply]
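The derivation above can be checked numerically. The following is a minimal sketch, with made-up residuals standing in for those of some fitted linear model; the function name `gaussian_loglik` is only illustrative. It evaluates the Gaussian log-likelihood at the maximum likelihood estimate of the variance and compares $-2\ln L$ with $n\ln(\mathrm{RSS}/n)$ plus the constant $n(1+\ln 2\pi)$.

```python
import math


def gaussian_loglik(residuals, sigma2):
    """Log-likelihood of i.i.d. N(0, sigma2) errors, as in the formula above."""
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


# Hypothetical residuals from a fitted linear model (illustrative values only).
residuals = [0.5, -1.2, 0.3, 0.9, -0.4, -0.1]
n = len(residuals)
rss = sum(e * e for e in residuals)
sigma2_hat = rss / n  # maximum likelihood estimate of the error variance

minus_two_loglik = -2 * gaussian_loglik(residuals, sigma2_hat)
constant = n * (1 + math.log(2 * math.pi))  # depends only on n
shortcut = n * math.log(rss / n) + constant

print(minus_two_loglik, shortcut)
```

The two printed numbers agree to rounding error, so dropping the constant leaves exactly the $n\ln(\mathrm{RSS}/n)$ term used in the article's formula.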

Does anybody else find it fishy that the BIC here depends on the scaling of the data? Actually, wouldn't this be a problem using the likelihood function of any continuous domain probability distribution? —Preceding unsigned comment added by 216.15.124.160 (talk) 02:09, 6 December 2008 (UTC)[reply]

BIC does not depend upon the scaling of the data. BIC is defined only up to an additive constant that will be the same across all models being compared; that constant incorporates the scaling (at least in the linear model case given above) because any scaling factors come out as additive constants from the $\log(2\pi {\mathrm {RSS} \over n})$ term. --Wolf87 (talk) 21:02, 14 March 2009 (UTC)[reply]
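A small numerical sketch of this argument, using made-up data and two illustrative models (a constant mean and a straight line); the helper names and the parameter counts k = 1 and k = 2 are assumptions for the example only. Rescaling the response shifts every BIC by the same amount, $2n\ln c$ for a scale factor $c$, so the difference between models is unchanged.

```python
import math


def bic_from_rss(rss, n, k):
    """Gaussian-error form quoted above: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


def rss_mean_only(y):
    """Residual sum of squares for a constant-mean model."""
    ybar = sum(y) / len(y)
    return sum((v - ybar) ** 2 for v in y)


def rss_straight_line(x, y):
    """Residual sum of squares for an ordinary least-squares line."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((u - xbar) ** 2 for u in x)
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return sum((v - intercept - slope * u) ** 2 for u, v in zip(x, y))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1]


def bics_at_scale(scale):
    """Return (BIC of mean model, BIC of line model, their difference)."""
    ys = [scale * v for v in y]
    n = len(ys)
    bic_mean = bic_from_rss(rss_mean_only(ys), n, k=1)
    bic_line = bic_from_rss(rss_straight_line(x, ys), n, k=2)
    return bic_mean, bic_line, bic_mean - bic_line


original = bics_at_scale(1.0)
rescaled = bics_at_scale(1000.0)  # e.g. a change of measurement units
shift = 2 * len(y) * math.log(1000.0)

print(original[2], rescaled[2], rescaled[0] - original[0], shift)
```

Each individual BIC value does change with the units, which is why only differences between models fitted to the same response on the same scale are meaningful.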

Bayesian?

This seems to be rather unbayesian, notably in the use of maximum likelihood, no prior distribution, the absence of any integration, and more. Compare this with Bayes factor. --Henrygb 17:08, 15 March 2007 (UTC)[reply]

It's Bayesian to the extent that it represents an approximation to integrating over the detailed parameters of the model (which are assumed to have a flat prior), to give the marginal likelihood for the model as a whole. The argument is that in the limit of infinite data, the BIC would approach the Bayesian marginal likelihood. That contrasts with the Akaike criterion, which attempts to find the most probable model parametrisation, rather than the most probable model. It also contrasts with frequentists, who cannot integrate over nuisance parameters to compute marginal likelihoods.

But I'd agree, the article should spell out much more clearly how, exactly, BIC is an approximation to the Bayesian marginal likelihood. Jheald 18:00, 15 March 2007 (UTC).[reply]
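As a rough illustration of the approximation being discussed, here is a sketch for one case where the marginal likelihood has a closed form. The setup is an assumption made for the example, not something taken from the article: observations $x_i \sim N(\theta, 1)$ with a proper prior $\theta \sim N(0, \tau^2)$ (rather than a flat prior), and deterministic made-up data. The exact log marginal likelihood is compared with $-\mathrm{BIC}/2$.

```python
import math


def exact_log_marginal(x, tau2):
    """Exact log p(x) for x_i ~ N(theta, 1) with prior theta ~ N(0, tau2)."""
    n = len(x)
    s1 = sum(x)
    s2 = sum(v * v for v in x)
    # Quadratic form x' (I + tau2 * 1 1')^{-1} x, via the Sherman-Morrison formula.
    quad = s2 - tau2 * s1 * s1 / (1.0 + n * tau2)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1.0 + n * tau2)
            - 0.5 * quad)


def bic(x):
    """BIC = -2 ln L_hat + k ln n, with k = 1 free parameter (the mean)."""
    n = len(x)
    xbar = sum(x) / n
    max_log_lik = (-0.5 * n * math.log(2 * math.pi)
                   - 0.5 * sum((v - xbar) ** 2 for v in x))
    return -2.0 * max_log_lik + 1 * math.log(n)


def make_data(n):
    """Deterministic illustrative data alternating between 1.5 and -0.5."""
    return [0.5 + (1.0 if i % 2 == 0 else -1.0) for i in range(n)]


for n in (10, 100, 1000):
    x = make_data(n)
    exact = exact_log_marginal(x, tau2=1.0)
    approx = -0.5 * bic(x)
    print(n, round(exact, 3), round(approx, 3), round(exact - approx, 3))
```

In this example the absolute gap between the two quantities stays bounded while both grow in magnitude with n, so the relative error shrinks as the sample size increases. The size of the bounded gap depends on the prior chosen, which is one reason the BIC is usually described as only a large-sample approximation.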

What is the "dependent variable"?

I am confused by this sentence:

"It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all estimates being compared."

What is the dependent variable? And it is unclear why "variable" is singular and everything else is plural.

Imran 09:02, 12 April 2007 (UTC)[reply]

I agree this sentence makes no sense to me. My best guess is that they meant the "independent variable values", i.e. we must estimate the same data points in each model. Because I would interpret the "dependent variable values" as meaning the "model estimates", which then obviously doesn't make sense. Moo (talk) 22:09, 14 May 2012 (UTC)[reply]

Unlikely to be the only desideratum, as there need not be any independent variables at all. I think it means:

using exactly the same data-set, in terms of number of values and treatment of outliers/missing values and, if there are independent variables, then missing values among these cannot prevent any of the dependent variables being included in the likelihood for some of the models being compared ... you can't leave out some of the data for some models and not for others
using the same measurement scale (units) for the data representing the dependent variable in all models, as changing units affects the likelihood to the extent of an additive constant ... which can only be ignored if the units are the same for all models
using the same transformation of underlying data for the dependent variable in calculating the likelihood function for each model being compared ... thus there may be a choice between models that are most conveniently represented in terms of either y or log y, but the likelihood function must be evaluated in a consistent way.

It would be good to find a good reference/source that properly covers the basics such as this. Melcombe (talk) 00:45, 15 May 2012 (UTC)[reply]
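The third point, about y versus log y, can be sketched as follows. The data are made up, both models are intercept-only (a normal model for y and a lognormal model for y), and k = 2 is assumed for each (mean and variance). Fitting a normal distribution to log y gives the likelihood of log y; to compare with a model for y on the same footing, the Jacobian term $-\sum \ln y_i$ has to be added.

```python
import math


def normal_max_loglik(values):
    """Maximised normal log-likelihood with mean and variance both estimated."""
    n = len(values)
    mean = sum(values) / n
    sigma2 = sum((v - mean) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


# Hypothetical positive-valued data, for illustration only.
y = [1.2, 0.8, 2.5, 4.1, 0.9, 1.7, 3.3, 1.1]
log_y = [math.log(v) for v in y]
n, k = len(y), 2

loglik_normal = normal_max_loglik(y)

# Likelihood of log(y): NOT comparable with loglik_normal as it stands.
loglik_on_log_scale = normal_max_loglik(log_y)

# Lognormal likelihood of y itself: subtract sum(log y) for the change of variable.
loglik_lognormal = loglik_on_log_scale - sum(log_y)

bic_normal = -2 * loglik_normal + k * math.log(n)
bic_lognormal = -2 * loglik_lognormal + k * math.log(n)

print(bic_normal, bic_lognormal)
```

Using `loglik_on_log_scale` directly in the BIC would compare likelihoods of two different quantities, which is the inconsistency the list above warns against.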

Neil Frazer, 10 March 2013 I think that by "dependent variable" they mean the data, y_i. In other words, you can only compare models using the same data. — Preceding unsigned comment added by Neil Frazer (talk • contribs) 00:43, 11 March 2013 (UTC)[reply]

This formula for BIC may potentially confuse people who read the AIC entry.

The version of BIC as described here is not compatible with the definition of AIC in wikipedia. There is a divisor n stated with BIC, but not AIC in the Wikipedia entries. It would save confusion if they were consistently defined!

I would favour not dividing by n: i.e.

BIC = -2log L + k ln(n)

AIC = -2log L + 2k

One can then clearly compare the two, and see they are similar for small n, but BIC favours more parsimonious models for large n. —The preceding unsigned comment was added by 128.243.220.42 (talk) 13:47, 10 May 2007 (UTC).[reply]

In fact I have noticed that the formula was only changed recently on 21st April, 2007. It really needs changing back I think to what it was before!
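For comparison, the two definitions without the divisor n can be written side by side. This is only a sketch of the formulas quoted above; the function names are illustrative. The per-parameter penalties are 2 for AIC and ln(n) for BIC, so they coincide at n = e² (about 7.4), and BIC penalises each parameter more heavily for any larger sample.

```python
import math


def aic(log_lik, k):
    """AIC = -2 ln L + 2k."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik, k, n):
    """BIC = -2 ln L + k ln(n)."""
    return -2.0 * log_lik + k * math.log(n)


# Sample size at which the two penalties per parameter are equal: ln(n) = 2.
crossover_n = math.exp(2)

for n in (5, 8, 100, 10_000):
    # Columns: n, AIC penalty per parameter, BIC penalty per parameter.
    print(n, 2.0, round(math.log(n), 3))
```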

--

I also believe that the definition without n is more common. See for example http://xxx.adelaide.edu.au/pdf/astro-ph/0701113, which gives a lucid, accessible review and comparison of the AIC, AICc, BIC and Deviance Information Criterion (DIC).

Every paper I have seen has it without the n.

The standard simplification for using $\chi ^{2}$ for model selection has been pointed out above, namely that $-2\log L=\chi ^{2}$ . I think this is worth including on the page, as I had to go look in several journal articles to satisfy myself that this is the proper definition of log-likelihood.

Velocidex 12:54, 25 June 2007 (UTC)[reply]
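A brief sketch of when the $\chi^{2}$ shortcut applies, under an assumption that should be stated explicitly: independent Gaussian errors with known standard deviations. In that case $-2\log L$ equals $\chi^{2}$ plus a term that depends only on the known standard deviations, so the term cancels when models are compared on the same data. The data and model predictions below are made up.

```python
import math


def minus_two_log_lik(y, mu, sigma):
    """-2 ln L for independent Gaussian errors with known standard deviations."""
    total = 0.0
    for yi, mi, si in zip(y, mu, sigma):
        total += (-0.5 * math.log(2 * math.pi * si * si)
                  - (yi - mi) ** 2 / (2 * si * si))
    return -2.0 * total


def chi_squared(y, mu, sigma):
    """Sum of squared standardised residuals."""
    return sum(((yi - mi) / si) ** 2 for yi, mi, si in zip(y, mu, sigma))


# Hypothetical measurements, uncertainties and two candidate model predictions.
y = [1.1, 1.9, 3.2, 3.8]
sigma = [0.2, 0.2, 0.3, 0.3]
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [2.5, 2.5, 2.5, 2.5]

# Additive term that is the same for every model fitted to these data.
norm_const = sum(math.log(2 * math.pi * s * s) for s in sigma)

diff_loglik = minus_two_log_lik(y, model_a, sigma) - minus_two_log_lik(y, model_b, sigma)
diff_chi2 = chi_squared(y, model_a, sigma) - chi_squared(y, model_b, sigma)
print(diff_loglik, diff_chi2)
```

If the standard deviations are themselves estimated, the extra term no longer cancels and the simple identity does not hold as written.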

Definition of L

Hi,

Is L in the formula for the BIC really the log-likelihood? It seems to me that L is the likelihood, s.t. ln L would be the log-likelihood and the -2 ln L term is the same term as in the AIC. Am I missing something?

Mpas76 01:05, 17 October 2007 (UTC)[reply]

I think you are right. L is the likelihood function and -2*ln(L) is the same as that in AIC formula. —Preceding unsigned comment added by Shaohuawu (talk • contribs) 16:37, 27 October 2007 (UTC)[reply]

Error variance

Possibly I haven't understood this properly, but surely the formula for so-called 'error variance' in the article is wrong:

${\hat {\sigma _{e}^{2}}}={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}$

If the x's are datapoints this appears to be the variance of the datapoints, whereas what we want is something like the RSS of the AIC article, presumably the mean squared error: ${\hat {\sigma _{e}^{2}}}={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\hat {x_{i}}})^{2}$

93.96.236.8 (talk) 16:42, 11 September 2010 (UTC)[reply]
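The distinction drawn above can be seen in a small sketch with made-up numbers; the function names and the fitted values are hypothetical. The first quantity is the variance of the data about their mean, the second is the mean squared residual about the model's fitted values. They coincide only when the model is a constant equal to the sample mean.

```python
def variance_about_mean(x):
    """(1/n) * sum (x_i - xbar)^2: spread of the data themselves."""
    n = len(x)
    xbar = sum(x) / n
    return sum((v - xbar) ** 2 for v in x) / n


def mean_squared_residual(x, fitted):
    """(1/n) * sum (x_i - xhat_i)^2: spread of the data about the fitted model."""
    return sum((v - f) ** 2 for v, f in zip(x, fitted)) / len(x)


# Hypothetical data and hypothetical fitted values from some model.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.9, 4.9]

data_variance = variance_about_mean(x)
error_variance = mean_squared_residual(x, fitted)

# Special case: a constant-mean model reproduces the data variance.
constant_fit = [sum(x) / len(x)] * len(x)
error_variance_constant = mean_squared_residual(x, constant_fit)

print(data_variance, error_variance, error_variance_constant)
```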

Exponential family

The article explicitly mentions that the investigated model should be a member of the exponential family. This was the original assumption in the demonstration of the asymptotic property of the BIC. However, the derivation of this property was extended to less restrictive conditions; see, for example, Cavanaugh and colleague in “Generalizing the Derivation of the Schwarz Information Criterion”. —Preceding unsigned comment added by 195.220.100.11 (talk) 18:11, 4 March 2011 (UTC)[reply]

In the section "Characteristics of the Bayesian information criterion" of this Wikipedia article, a citation is needed for point 5 ("[BIC] can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset."). Checking the Cavanaugh et al. paper, it seems to me that the loosened conditions are sufficient to guarantee the validity of BIC at least for mixtures of exponential family distributions, which in turn would cover a remarkable variety of clustering methods as special cases. But I could not find articles where the validity of BIC is explicitly shown for mixture (or clustering) models. — Preceding unsigned comment added by Lmlahti (talk • contribs) 11:17, 2 April 2011 (UTC)[reply]

BIC penalizes larger data set?

Has anybody noticed that minimization of the objective function

$-2\cdot \ln {p(x|k)}\approx \mathrm {BIC} =-2\cdot \ln {L}+k\ln(n)$

seems to lead to a penalty on larger data sets?

For example, given a data set D1, and another nested data set D2 (same features, but a subset of samples of D1). BIC seems to suggest that D2 should have more parameters than D1. This is ridiculous. 147.8.182.107 (talk) 05:02, 8 January 2012 (UTC)[reply]

The penalty gets bigger, but, if the parameters are helping, the likelihood improvement on the larger datasets due to the extra parameters should more than compensate. 41.151.113.108 (talk) 08:45, 9 February 2012 (UTC)[reply]

Yes, the penalty with sample size is larger with SIC than with AIC. With AIC, the selected model tends to have more parameters as the sample size increases. The justification for the SIC was to find a penalty to neutralize that effect.

Also, both AIC and SIC are more general than described in this article, applying to many different statistical models. 98.95.133.18 (talk) 13:30, 31 March 2013 (UTC)[reply]
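The reply about the likelihood compensating for the penalty can be quantified in the Gaussian case, using the form $n\ln(\mathrm{RSS}/n)+k\ln(n)$ quoted elsewhere on this page. A sketch, with an illustrative function name: BIC prefers a model with m extra parameters when $n\ln(\mathrm{RSS}_{\text{small}}/\mathrm{RSS}_{\text{big}}) > m\ln(n)$, i.e. when the RSS ratio exceeds $n^{m/n}$.

```python
import math


def required_rss_ratio(n, extra_params=1):
    """Smallest RSS_small / RSS_big at which BIC prefers the bigger model.

    Follows from n*ln(RSS_small/RSS_big) > extra_params*ln(n).
    """
    return math.exp(extra_params * math.log(n) / n)


for n in (10, 100, 10_000):
    print(n, round(required_rss_ratio(n), 5))
```

Although the penalty k ln(n) grows with n, the relative reduction in RSS that an extra parameter must deliver shrinks towards zero, so a larger data set makes it easier, not harder, for a genuinely useful parameter to be retained.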

K

There doesn't appear to be a definition of what k is.

And it seems odd that in p(x|k) ... the probability of the observed data is calculated only on the number of the "free parameters" without any consideration for their value ... — Preceding unsigned comment added by Eep1mp (talk • contribs) 16:39, 3 February 2012 (UTC)[reply]

The section "Mathematically" has:

k = the number of free parameters to be estimated. If the estimated model is a linear regression, k is the number of regressors, including the intercept;

...and (although there are no details), p(x|k) and the reason for using it are based on Bayesian arguments wherein the distribution of the k unknown parameters is effectively integrated out. This leads to "their value" being summarised in the maximised likelihood. Melcombe (talk) 17:21, 3 February 2012 (UTC)[reply]

This is not the usual notation for a likelihood. The typical notation would be p(θ|y) where θ is a k×1 vector of parameters. In the BIC, this vector would usually be the maximum likelihood estimate (MLE) for the parameters. The MLE is equivalent to the mode of the posterior distribution with a non-informative prior. For some, ignoring the prior is a feature, for others it is a bug.

It should be obvious that not all models with k parameters are created equal, so p(x|k) is misleading. Suppose we are interested in blood pressure as the dependent variable and have a choice of two different datasets with k=3: The first has {gender, age, body mass index} and the second has {SAT score, typing speed, favorite color}. Suppose we pick the first dataset, we still could have different likelihoods depending on whether we want a continuous measure of diastolic pressure, or if we want a probit model that categorizes people as "high blood pressure" or not.

I'll try to edit the main article, but I'm not very familiar with the math code so it might take a bit. Frank MacCrory (talk) 21:06, 5 March 2012 (UTC)[reply]

Akaike impressed?

The source for the sentence "Akaike was so impressed with Schwarz's Bayesian formalism that he developed his own Bayesian formalism, ..." does not actually support that statement. The source is from 1977, Schwarz's work is from 1978.

Are the facts maybe twisted here? What is clear is that Akaike developed the AIC before Schwarz developed the BIC (1974 vs. 1978). Schwarz also cites Akaike's work in his 1978 paper. Is ABIC different from AIC and if yes, how? Can someone please add a correct source?

Georg Stillfried (talk) 14:07, 2 August 2013 (UTC)[reply]

Notation Consistency

It would be good to keep notation consistent between this page and the AIC page.

For instance, consider:

$\mathrm {AIC} =2k-2\ln(L)$

versus

$\mathrm {BIC} =-2\cdot \ln {\hat {L}}+k\cdot \ln(n)$

There is a hat in the likelihood, \cdot is being used, and the order of the factors is inconsistent too Ric8cruz (talk) 17:47, 1 November 2015 (UTC).[reply]

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Bayesian information criterion. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20120328065032/http://nscs00.ucmerced.edu/~nkumar4/BhatKumarBIC.pdf to http://nscs00.ucmerced.edu/~nkumar4/BhatKumarBIC.pdf

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 04:08, 29 October 2016 (UTC)[reply]

Article title

The other day I was bold changing the name of this page to Schwarz criterion; BetterMath (talk · contribs) reverted my changes, we discussed the issue but couldn't agree so far. So I'm asking for some third-party opinions.

While Bayesian information criterion is certainly more popular than Schwarz criterion, it is a misleading name because unlike the Akaike information criterion, BIC was not derived from information theory. The original article, Schwarz (1978), does not mention "information" at all.

By our naming conventions we use the most common term in general, but we do make exceptions: for instance, Kuhn–Tucker theorem is far more common than Karush–Kuhn–Tucker theorem, yet the latter is the name we picked because it is (or should be) the correct one. --bender235 (talk) 07:29, 6 August 2017 (UTC)[reply]

I've also seen it called the Schwarz Bayes criterion, or Schwarz Bayesian criterion. As all presumably agree that it was derived by Schwarz using a Bayesian argument, either of those seems an uncontentious and descriptive name to me. The latter seems more common. Of course the lead should mention all common variations. Qwfp (talk) 18:14, 7 August 2017 (UTC)[reply]

_____________________

The correct title for the article is “Bayesian information criterion”: because that is the name used in over 90% of the peer-reviewed literature and statistical textbooks (in my experience).

The standard reference for statistical model selection is Model Selection [Burnham & Anderson, 2002]. That book currently has over 74000 citations on Google Scholar. That book uses “BIC”. It also tells that BIC is “occasionally” termed “SIC for Schwarz’s information criterion”; note the use of “information criterion”.

Similarly, the book Information Criteria and Statistical Modeling (Konishi & Kitagawa, 2008) uses “BIC” and also mentions the alternative name “Schwarz information criterion”; the authors emphasize that BIC does rely upon the information in the data (§9.1.1). The book Information and Complexity in Statistical Modeling (Rissanen, 2007) uses “BIC” and does not even mention “Schwarz”. The same is true for the book Model Selection and Model Averaging (Claeskens & Hjort, 2008). The same is true for the book Statistical Modeling and Computation (Kroese & Chan, 2014).

In the peer-reviewed literature, there are very many papers that use “BIC” without mentioning “Schwarz criterion”. Of the few papers that use “Schwarz criterion”, I suspect that virtually all mention “BIC”.

The R programming language seems to be becoming the de facto standard for communicating statistical programs. (There are now hundreds of textbooks that rely on R.) R uses “BIC”.

The parent comment cites a google search for “Schwarz criterion”, but that is grossly misleading. A large majority of the web pages in the result mention “Schwarz criterion” only in passing, as an alternative for the name “BIC”, which they then use.

The claim about the title for the Wikipedia article on Karush–Kuhn–Tucker conditions is also misleading. The most common name for those conditions is “KKT conditions”; see also the Encyclopedia of Optimization (2009). The Wikipedia article title spells out “KKT”, as is obviously desirable.

The name “BIC” was used by Akaike in 1978 [Ann. Inst. Statist. Math.]. It became the standard name afterwards.

The relevant Wikipedia policy, WP:NAMINGCRITERIA, states this: “The title is one that readers are likely to look or search for…”. Given that BIC is vastly more common in the peer-reviewed literature and in statistical books, and that BIC is used in R, it is BIC that is vastly more likely to be searched for. The situation is made extreme by the literature that uses BIC without mentioning “Schwarz criterion”.

Moreover, there are many standard names in statistics and mathematics that are poorly chosen. As an example, an orthogonal matrix should properly be called an “orthonormal matrix”. A Wikipedia editor does not have authority to change standard terminology.

The parent comment also asserts that BIC was not derived from information theory. Although the original derivation of BIC, by Schwarz, did not use information theory, it is apparently possible to derive BIC using information theory: see Burnham & Anderson [2002: §6.4.2].

Reference Table is Incorrect

The ΔBIC table seems to mix up two parts of [6]. The evidence table in [6] part 3.2 is discussing "Bayes Factors", which are different from the BIC. However, the table in the Wikipedia article claims that table is about the BIC. But the BIC is not mentioned until part 8.3 of [6].

On another note there is no definition in the Wikipedia article what ΔBIC means.

On another another note it is unclear if the table is supposed to refer to the Gaussian special case or hold in general.

Could someone with more statistical know-how have a look at this? This is important as the common users of the BIC are unlikely to worry too much about the rigour behind it. — Preceding unsigned comment added by 80.111.225.127 (talk) 13:10, 28 July 2018 (UTC)[reply]

The table in [6] refers indeed to Bayes Factors instead of BIC, but it looks like there is a connection. In Sect. 4.1.3 in [6]:

"As $n\to \infty$, this quantity, often called the Schwarz Criterion, ... may be viewed as a rough approximation of the logarithm of the Bayes Factor. Minus twice the Schwarz Criterion is often called the Bayesian Information Criterion... ".

Sounds to me like the numbers in the table should be doubled and are valid only for large numbers of observations. However, I'm not sure about this, so I agree that someone who's well versed in the topic should take a look. 141.35.40.65 (talk) 16:15, 14 August 2018 (UTC)[reply]

Just realized that the table in [6] is for $2\cdot \log(B)$ (B = Bayes Factor). According to the quote above, this is precisely the negative of the quantity that BIC approximates for large n. That means the table should be correct. Perhaps it should be added that this is only valid for $n\to \infty$. 141.35.40.65 (talk) 13:24, 27 August 2018 (UTC)[reply]
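The relation described in the quote can be sketched as follows, with illustrative function names and made-up BIC values. If the Schwarz criterion approximates the log Bayes factor $\log B_{12}$ and the BIC difference is minus twice that quantity, then $2\log B_{12}\approx \mathrm{BIC}_2-\mathrm{BIC}_1$. This is a large-sample approximation only and should not be read as an exact conversion.

```python
import math


def approx_two_log_bayes_factor(bic_1, bic_2):
    """Large-n approximation: 2 ln B12 ~ BIC_2 - BIC_1."""
    return bic_2 - bic_1


def approx_bayes_factor(bic_1, bic_2):
    """Corresponding approximate Bayes factor in favour of model 1."""
    return math.exp(0.5 * (bic_2 - bic_1))


# Hypothetical BIC values for two models fitted to the same data.
bic_model_1, bic_model_2 = 100.0, 106.0

print(approx_two_log_bayes_factor(bic_model_1, bic_model_2))
print(approx_bayes_factor(bic_model_1, bic_model_2))
```

For finite samples the approximation error may be of the same order as the small differences that evidence tables try to grade, so values near zero carry little weight.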

The table is indeed "correct," as the sibling commenter found out. However, the small differences given in the table could very easily arise from approximation error. Differences in BIC should never be treated like Bayes factors, and small differences should never be considered evidence of anything. Therefore I have removed the table. FRuDIxAFLG (talk) 07:19, 3 October 2021 (UTC)[reply]
