Maximum Likelihood Estimation

Overview
Maximum likelihood estimation (MLE) is a method for estimating the parameters of a statistical model by maximizing the likelihood function, i.e., the probability (or probability density) of the observed data under the model. MLE is widely used in statistics, machine learning, and econometrics because it often yields estimators with desirable asymptotic properties. The approach is closely connected to concepts such as likelihood, probability distribution, and statistical inference.
In MLE, a parametric family of models is specified, typically described by a probability distribution with an unknown parameter vector, often denoted $\theta$. Given observed data $x = (x_1, \dots, x_n)$, the likelihood function $L(\theta \mid x)$ is the joint probability (or density) of the data, viewed as a function of $\theta$. The maximum likelihood estimator $\hat{\theta}$ is any value of $\theta$ that maximizes $L(\theta \mid x)$, or equivalently the log-likelihood, since the logarithm is monotone.
MLE is presented as an optimization problem and can be implemented using methods such as gradient descent or Newton-type algorithms, depending on the structure of the likelihood. In many common settings, maximizing the likelihood is equivalent to minimizing a corresponding loss function, linking MLE to widely used learning objectives in supervised learning.
For independent and identically distributed observations, the likelihood takes the form
$$ L(\theta \mid x)=\prod_{i=1}^n f(x_i \mid \theta), $$
where $f(\cdot \mid \theta)$ is the probability mass function or probability density function. Because long products can be numerically unstable, MLE is typically performed by maximizing the log-likelihood $\ell(\theta) = \log L(\theta \mid x)$, which turns the product into a sum.
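As a concrete sketch of maximizing a log-likelihood numerically, consider an exponential model, whose log-likelihood is $\sum_i (\log\theta - \theta x_i)$ and whose MLE has the closed form $\hat{\theta} = n/\sum_i x_i$. The toy data below are assumed for illustration; a generic bounded optimizer from SciPy recovers the closed-form answer.

```python
import math
from scipy.optimize import minimize_scalar

data = [0.5, 1.2, 0.3, 2.1, 0.8, 1.5]  # toy sample, assumed Exponential(rate)

def neg_log_likelihood(rate):
    # log f(x | rate) = log(rate) - rate * x, summed over iid observations;
    # we minimize the negative log-likelihood
    if rate <= 0:
        return float("inf")
    return -sum(math.log(rate) - rate * x for x in data)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
analytic = len(data) / sum(data)  # closed-form MLE: 1 / sample mean
print(res.x, analytic)
```

The numerical maximizer and the closed form agree to several decimal places; in models without a closed form, only the numerical route is available.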
The definition of MLE relies on the specified statistical model. If the model is misspecified, that is, if the true data-generating process is not in the assumed family, then MLE typically converges to the pseudo-true parameter value that minimizes the Kullback–Leibler divergence from the true distribution to the model family.
Under regularity conditions, maximum likelihood estimators are consistent, asymptotically normal, and asymptotically efficient, meaning they attain the Cramér–Rao lower bound as the sample size grows. These properties are formalized in asymptotic statistics and depend on assumptions such as identifiability and differentiability of the likelihood.
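Consistency can be illustrated by simulation: for exponential data with a known true rate (the rate and sample sizes below are assumed for illustration), the MLE $\hat{\theta} = n/\sum_i x_i$ concentrates around the truth as $n$ grows.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def mle_rate(sample):
    # exponential-rate MLE: n / sum(x)
    return len(sample) / sum(sample)

true_rate = 2.0
errors = []
for n in (100, 10_000):
    sample = [random.expovariate(true_rate) for _ in range(n)]
    errors.append(abs(mle_rate(sample) - true_rate))
print(errors)  # absolute estimation error at each sample size
```

The asymptotic standard error here is roughly $\theta/\sqrt{n}$, so the error at $n = 10{,}000$ is typically an order of magnitude smaller than at $n = 100$.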
A common characterization uses the Fisher information, which quantifies the curvature of the log-likelihood and determines the asymptotic variance of the MLE. In practice, estimated standard errors are frequently obtained via the observed information, approximations based on Taylor expansions, or bootstrap methods, such as those described in bootstrapping (statistics). Likelihood-based inference also supports constructing confidence intervals and performing hypothesis tests, including likelihood ratio tests.
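A minimal sketch of an observed-information standard error, again using an exponential model with assumed toy data: the observed information is the negative second derivative of the log-likelihood at the MLE (approximated here by a central finite difference), and its inverse square root gives the standard error. For this model the analytic answer is $\hat{\theta}/\sqrt{n}$, which serves as a check.

```python
import math

data = [0.5, 1.2, 0.3, 2.1, 0.8, 1.5]  # toy sample, assumed Exponential(rate)
n = len(data)
theta_hat = n / sum(data)  # closed-form exponential-rate MLE

def log_lik(theta):
    return sum(math.log(theta) - theta * x for x in data)

# observed information: minus the second derivative of the log-likelihood
# at the MLE, via a central finite difference
h = 1e-4
obs_info = -(log_lik(theta_hat + h) - 2 * log_lik(theta_hat)
             + log_lik(theta_hat - h)) / h**2
se = 1 / math.sqrt(obs_info)
print(se, theta_hat / math.sqrt(n))  # numeric vs. analytic standard error
```

In richer models the same recipe applies with a Hessian in place of the scalar second derivative.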
MLE often has no closed-form solution. Instead, numerical optimization is used to find the maximizer of the log-likelihood. Issues can arise when the likelihood surface is multimodal, when the parameter space includes boundaries, or when the model is non-identifiable. In these cases, algorithms such as the Expectation–Maximization algorithm may be applied, particularly for latent-variable models like mixture models and hidden Markov models.
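As one illustration of the latent-variable case, EM for a two-component one-dimensional Gaussian mixture alternates an E-step (posterior responsibilities) with an M-step (weighted re-estimation). Everything below, including the synthetic data, starting values, and iteration count, is assumed for the sketch.

```python
import math
import random

random.seed(1)
# synthetic data: two well-separated Gaussian clusters
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(6.0, 1.0) for _ in range(200)])

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# deliberately rough initial guesses
pi, mu1, mu2, s1, s2 = 0.5, -1.0, 8.0, 1.0, 1.0
for _ in range(50):
    # E-step: posterior responsibility of component 1 for each point
    r = [pi * normal_pdf(x, mu1, s1)
         / (pi * normal_pdf(x, mu1, s1) + (1 - pi) * normal_pdf(x, mu2, s2))
         for x in data]
    # M-step: re-estimate parameters from responsibility-weighted data
    w1 = sum(r)
    w2 = len(data) - w1
    mu1 = sum(ri * x for ri, x in zip(r, data)) / w1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / w2
    s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / w1)
    s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / w2)
    pi = w1 / len(data)

print(mu1, mu2, pi)  # estimates should approach 0, 6, and 0.5
```

Each EM iteration provably does not decrease the likelihood, but convergence may be to a local maximum, which is why multiple restarts are common in practice.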
Modelers must also consider constraints and regularization. While classical MLE is unpenalized, penalized likelihood methods add a term such as $-\lambda\,\|\theta\|$ to the log-likelihood to control overfitting. This connects MLE to broader regularization approaches used in modern statistical learning, including regularization and penalized likelihood.
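A small sketch of a penalized likelihood, using a squared (L2, ridge-type) penalty $\lambda\mu^2$ on the mean of a Gaussian with known unit variance; the data and penalty strength are assumed. For this particular objective the penalized maximizer has the closed form $\sum_i x_i/(n + 2\lambda)$, which shrinks the sample mean toward zero.

```python
from scipy.optimize import minimize_scalar

data = [2.1, 1.8, 2.4, 2.0, 1.9]  # toy sample, assumed N(mu, 1)
lam = 2.0                         # penalty strength, chosen arbitrarily

def penalized_neg_log_lik(mu):
    # Gaussian negative log-likelihood (known unit variance, constants dropped)
    # plus an L2 penalty on mu
    return 0.5 * sum((x - mu) ** 2 for x in data) + lam * mu ** 2

res = minimize_scalar(penalized_neg_log_lik)
unpenalized = sum(data) / len(data)            # ordinary MLE: sample mean
shrunk = sum(data) / (len(data) + 2 * lam)     # closed form for this penalty
print(res.x, unpenalized, shrunk)
```

Larger $\lambda$ shrinks the estimate further toward zero; $\lambda = 0$ recovers the ordinary MLE.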
MLE is a frequentist approach that does not require specifying a prior distribution over parameters. In contrast, maximum a posteriori estimation (MAP) is a Bayesian method that maximizes the posterior distribution, combining the likelihood with a prior. Under certain conditions, MAP can be viewed as a form of penalized MLE where the prior acts as regularization.
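For instance, with a zero-mean Gaussian prior $p(\theta) \propto \exp\!\bigl(-\theta^2/(2\tau^2)\bigr)$, the MAP objective is the log-likelihood plus a quadratic penalty:

$$
\hat{\theta}_{\text{MAP}}
= \arg\max_{\theta}\,\bigl[\log L(\theta \mid x) + \log p(\theta)\bigr]
= \arg\max_{\theta}\,\Bigl[\log L(\theta \mid x) - \frac{\theta^2}{2\tau^2}\Bigr],
$$

so MAP under this prior coincides with L2-penalized MLE with penalty strength $\lambda = 1/(2\tau^2)$: a tighter prior (smaller $\tau$) means stronger regularization.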
Alternative estimation strategies include method of moments, least squares, and Bayesian inference. The choice between methods typically depends on goals such as interpretability, robustness to model misspecification, computational feasibility, and the availability of uncertainty quantification.
Categories: Maximum likelihood estimation, Statistical estimation, Statistical inference, Machine learning theory, Mathematical optimization
This article was generated by AI using GPT Wiki. Content may contain inaccuracies. Generated on March 27, 2026. Made by Lattice Partners.