123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560 |
- """Trust Region Reflective algorithm for least-squares optimization.
- The algorithm is based on ideas from paper [STIR]_. The main idea is to
- account for the presence of the bounds by appropriate scaling of the variables (or,
- equivalently, changing a trust-region shape). Let's introduce a vector v:
- | ub[i] - x[i], if g[i] < 0 and ub[i] < np.inf
- v[i] = | x[i] - lb[i], if g[i] > 0 and lb[i] > -np.inf
- | 1, otherwise
- where g is the gradient of a cost function and lb, ub are the bounds. Its
- components are distances to the bounds at which the anti-gradient points (if
- this distance is finite). Define a scaling matrix D = diag(v**0.5).
- First-order optimality conditions can be stated as
- D^2 g(x) = 0.
- Meaning that components of the gradient should be zero for strictly interior
- variables, and components must point inside the feasible region for variables
- on the bound.
- Now consider this system of equations as a new optimization problem. If the
- point x is strictly interior (not on the bound), then the left-hand side is
- differentiable and the Newton step for it satisfies
- (D^2 H + diag(g) Jv) p = -D^2 g
- where H is the Hessian matrix (or its J^T J approximation in least squares),
- Jv is the Jacobian matrix of v with components -1, 1 or 0, such that all
- elements of matrix C = diag(g) Jv are non-negative. Introduce the change
- of the variables x = D x_h (_h would be "hat" in LaTeX). In the new variables,
- we have a Newton step satisfying
- B_h p_h = -g_h,
- where B_h = D H D + C, g_h = D g. In least squares B_h = J_h^T J_h, where
- J_h = J D. Note that J_h and g_h are proper Jacobian and gradient with respect
- to "hat" variables. To guarantee global convergence we formulate a
- trust-region problem based on the Newton step in the new variables:
- 0.5 * p_h^T B_h p + g_h^T p_h -> min, ||p_h|| <= Delta
- In the original space B = H + D^{-1} C D^{-1}, and the equivalent trust-region
- problem is
- 0.5 * p^T B p + g^T p -> min, ||D^{-1} p|| <= Delta
- Here, the meaning of the matrix D becomes more clear: it alters the shape
- of a trust-region, such that large steps towards the bounds are not allowed.
- In the implementation, the trust-region problem is solved in "hat" space,
- but handling of the bounds is done in the original space (see below and read
- the code).
- The introduction of the matrix D doesn't allow to ignore bounds, the algorithm
- must keep iterates strictly feasible (to satisfy aforementioned
- differentiability), the parameter theta controls step back from the boundary
- (see the code for details).
- The algorithm does another important trick. If the trust-region solution
- doesn't fit into the bounds, then a reflected (from a firstly encountered
- bound) search direction is considered. For motivation and analysis refer to
- [STIR]_ paper (and other papers of the authors). In practice, it doesn't need
- a lot of justifications, the algorithm simply chooses the best step among
- three: a constrained trust-region step, a reflected step and a constrained
- Cauchy step (a minimizer along -g_h in "hat" space, or -D^2 g in the original
- space).
- Another feature is that a trust-region radius control strategy is modified to
- account for appearance of the diagonal C matrix (called diag_h in the code).
- Note that all described peculiarities are completely gone as we consider
- problems without bounds (the algorithm becomes a standard trust-region type
- algorithm very similar to ones implemented in MINPACK).
- The implementation supports two methods of solving the trust-region problem.
- The first, called 'exact', applies SVD on Jacobian and then solves the problem
- very accurately using the algorithm described in [JJMore]_. It is not
- applicable to large problem. The second, called 'lsmr', uses the 2-D subspace
- approach (sometimes called "indefinite dogleg"), where the problem is solved
- in a subspace spanned by the gradient and the approximate Gauss-Newton step
- found by ``scipy.sparse.linalg.lsmr``. A 2-D trust-region problem is
- reformulated as a 4th order algebraic equation and solved very accurately by
- ``numpy.roots``. The subspace approach allows to solve very large problems
- (up to couple of millions of residuals on a regular PC), provided the Jacobian
- matrix is sufficiently sparse.
- References
- ----------
- .. [STIR] Branch, M.A., T.F. Coleman, and Y. Li, "A Subspace, Interior,
- and Conjugate Gradient Method for Large-Scale Bound-Constrained
- Minimization Problems," SIAM Journal on Scientific Computing,
- Vol. 21, Number 1, pp 1-23, 1999.
- .. [JJMore] More, J. J., "The Levenberg-Marquardt Algorithm: Implementation
- and Theory," Numerical Analysis, ed. G. A. Watson, Lecture
- """
- import numpy as np
- from numpy.linalg import norm
- from scipy.linalg import svd, qr
- from scipy.sparse.linalg import lsmr
- from scipy.optimize import OptimizeResult
- from .common import (
- step_size_to_bound, find_active_constraints, in_bounds,
- make_strictly_feasible, intersect_trust_region, solve_lsq_trust_region,
- solve_trust_region_2d, minimize_quadratic_1d, build_quadratic_1d,
- evaluate_quadratic, right_multiplied_operator, regularized_lsq_operator,
- CL_scaling_vector, compute_grad, compute_jac_scale, check_termination,
- update_tr_radius, scale_for_robust_loss_function, print_header_nonlinear,
- print_iteration_nonlinear)
- def trf(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale,
- loss_function, tr_solver, tr_options, verbose):
- # For efficiency, it makes sense to run the simplified version of the
- # algorithm when no bounds are imposed. We decided to write the two
- # separate functions. It violates the DRY principle, but the individual
- # functions are kept the most readable.
- if np.all(lb == -np.inf) and np.all(ub == np.inf):
- return trf_no_bounds(
- fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev, x_scale,
- loss_function, tr_solver, tr_options, verbose)
- else:
- return trf_bounds(
- fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale,
- loss_function, tr_solver, tr_options, verbose)
- def select_step(x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta):
- """Select the best step according to Trust Region Reflective algorithm."""
- if in_bounds(x + p, lb, ub):
- p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h)
- return p, p_h, -p_value
- p_stride, hits = step_size_to_bound(x, p, lb, ub)
- # Compute the reflected direction.
- r_h = np.copy(p_h)
- r_h[hits.astype(bool)] *= -1
- r = d * r_h
- # Restrict trust-region step, such that it hits the bound.
- p *= p_stride
- p_h *= p_stride
- x_on_bound = x + p
- # Reflected direction will cross first either feasible region or trust
- # region boundary.
- _, to_tr = intersect_trust_region(p_h, r_h, Delta)
- to_bound, _ = step_size_to_bound(x_on_bound, r, lb, ub)
- # Find lower and upper bounds on a step size along the reflected
- # direction, considering the strict feasibility requirement. There is no
- # single correct way to do that, the chosen approach seems to work best
- # on test problems.
- r_stride = min(to_bound, to_tr)
- if r_stride > 0:
- r_stride_l = (1 - theta) * p_stride / r_stride
- if r_stride == to_bound:
- r_stride_u = theta * to_bound
- else:
- r_stride_u = to_tr
- else:
- r_stride_l = 0
- r_stride_u = -1
- # Check if reflection step is available.
- if r_stride_l <= r_stride_u:
- a, b, c = build_quadratic_1d(J_h, g_h, r_h, s0=p_h, diag=diag_h)
- r_stride, r_value = minimize_quadratic_1d(
- a, b, r_stride_l, r_stride_u, c=c)
- r_h *= r_stride
- r_h += p_h
- r = r_h * d
- else:
- r_value = np.inf
- # Now correct p_h to make it strictly interior.
- p *= theta
- p_h *= theta
- p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h)
- ag_h = -g_h
- ag = d * ag_h
- to_tr = Delta / norm(ag_h)
- to_bound, _ = step_size_to_bound(x, ag, lb, ub)
- if to_bound < to_tr:
- ag_stride = theta * to_bound
- else:
- ag_stride = to_tr
- a, b = build_quadratic_1d(J_h, g_h, ag_h, diag=diag_h)
- ag_stride, ag_value = minimize_quadratic_1d(a, b, 0, ag_stride)
- ag_h *= ag_stride
- ag *= ag_stride
- if p_value < r_value and p_value < ag_value:
- return p, p_h, -p_value
- elif r_value < p_value and r_value < ag_value:
- return r, r_h, -r_value
- else:
- return ag, ag_h, -ag_value
- def trf_bounds(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev,
- x_scale, loss_function, tr_solver, tr_options, verbose):
- x = x0.copy()
- f = f0
- f_true = f.copy()
- nfev = 1
- J = J0
- njev = 1
- m, n = J.shape
- if loss_function is not None:
- rho = loss_function(f)
- cost = 0.5 * np.sum(rho[0])
- J, f = scale_for_robust_loss_function(J, f, rho)
- else:
- cost = 0.5 * np.dot(f, f)
- g = compute_grad(J, f)
- jac_scale = isinstance(x_scale, str) and x_scale == 'jac'
- if jac_scale:
- scale, scale_inv = compute_jac_scale(J)
- else:
- scale, scale_inv = x_scale, 1 / x_scale
- v, dv = CL_scaling_vector(x, g, lb, ub)
- v[dv != 0] *= scale_inv[dv != 0]
- Delta = norm(x0 * scale_inv / v**0.5)
- if Delta == 0:
- Delta = 1.0
- g_norm = norm(g * v, ord=np.inf)
- f_augmented = np.zeros((m + n))
- if tr_solver == 'exact':
- J_augmented = np.empty((m + n, n))
- elif tr_solver == 'lsmr':
- reg_term = 0.0
- regularize = tr_options.pop('regularize', True)
- if max_nfev is None:
- max_nfev = x0.size * 100
- alpha = 0.0 # "Levenberg-Marquardt" parameter
- termination_status = None
- iteration = 0
- step_norm = None
- actual_reduction = None
- if verbose == 2:
- print_header_nonlinear()
- while True:
- v, dv = CL_scaling_vector(x, g, lb, ub)
- g_norm = norm(g * v, ord=np.inf)
- if g_norm < gtol:
- termination_status = 1
- if verbose == 2:
- print_iteration_nonlinear(iteration, nfev, cost, actual_reduction,
- step_norm, g_norm)
- if termination_status is not None or nfev == max_nfev:
- break
- # Now compute variables in "hat" space. Here, we also account for
- # scaling introduced by `x_scale` parameter. This part is a bit tricky,
- # you have to write down the formulas and see how the trust-region
- # problem is formulated when the two types of scaling are applied.
- # The idea is that first we apply `x_scale` and then apply Coleman-Li
- # approach in the new variables.
- # v is recomputed in the variables after applying `x_scale`, note that
- # components which were identically 1 not affected.
- v[dv != 0] *= scale_inv[dv != 0]
- # Here, we apply two types of scaling.
- d = v**0.5 * scale
- # C = diag(g * scale) Jv
- diag_h = g * dv * scale
- # After all this has been done, we continue normally.
- # "hat" gradient.
- g_h = d * g
- f_augmented[:m] = f
- if tr_solver == 'exact':
- J_augmented[:m] = J * d
- J_h = J_augmented[:m] # Memory view.
- J_augmented[m:] = np.diag(diag_h**0.5)
- U, s, V = svd(J_augmented, full_matrices=False)
- V = V.T
- uf = U.T.dot(f_augmented)
- elif tr_solver == 'lsmr':
- J_h = right_multiplied_operator(J, d)
- if regularize:
- a, b = build_quadratic_1d(J_h, g_h, -g_h, diag=diag_h)
- to_tr = Delta / norm(g_h)
- ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1]
- reg_term = -ag_value / Delta**2
- lsmr_op = regularized_lsq_operator(J_h, (diag_h + reg_term)**0.5)
- gn_h = lsmr(lsmr_op, f_augmented, **tr_options)[0]
- S = np.vstack((g_h, gn_h)).T
- S, _ = qr(S, mode='economic')
- JS = J_h.dot(S) # LinearOperator does dot too.
- B_S = np.dot(JS.T, JS) + np.dot(S.T * diag_h, S)
- g_S = S.T.dot(g_h)
- # theta controls step back step ratio from the bounds.
- theta = max(0.995, 1 - g_norm)
- actual_reduction = -1
- while actual_reduction <= 0 and nfev < max_nfev:
- if tr_solver == 'exact':
- p_h, alpha, n_iter = solve_lsq_trust_region(
- n, m, uf, s, V, Delta, initial_alpha=alpha)
- elif tr_solver == 'lsmr':
- p_S, _ = solve_trust_region_2d(B_S, g_S, Delta)
- p_h = S.dot(p_S)
- p = d * p_h # Trust-region solution in the original space.
- step, step_h, predicted_reduction = select_step(
- x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta)
- x_new = make_strictly_feasible(x + step, lb, ub, rstep=0)
- f_new = fun(x_new)
- nfev += 1
- step_h_norm = norm(step_h)
- if not np.all(np.isfinite(f_new)):
- Delta = 0.25 * step_h_norm
- continue
- # Usual trust-region step quality estimation.
- if loss_function is not None:
- cost_new = loss_function(f_new, cost_only=True)
- else:
- cost_new = 0.5 * np.dot(f_new, f_new)
- actual_reduction = cost - cost_new
- Delta_new, ratio = update_tr_radius(
- Delta, actual_reduction, predicted_reduction,
- step_h_norm, step_h_norm > 0.95 * Delta)
- step_norm = norm(step)
- termination_status = check_termination(
- actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol)
- if termination_status is not None:
- break
- alpha *= Delta / Delta_new
- Delta = Delta_new
- if actual_reduction > 0:
- x = x_new
- f = f_new
- f_true = f.copy()
- cost = cost_new
- J = jac(x, f)
- njev += 1
- if loss_function is not None:
- rho = loss_function(f)
- J, f = scale_for_robust_loss_function(J, f, rho)
- g = compute_grad(J, f)
- if jac_scale:
- scale, scale_inv = compute_jac_scale(J, scale_inv)
- else:
- step_norm = 0
- actual_reduction = 0
- iteration += 1
- if termination_status is None:
- termination_status = 0
- active_mask = find_active_constraints(x, lb, ub, rtol=xtol)
- return OptimizeResult(
- x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm,
- active_mask=active_mask, nfev=nfev, njev=njev,
- status=termination_status)
- def trf_no_bounds(fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev,
- x_scale, loss_function, tr_solver, tr_options, verbose):
- x = x0.copy()
- f = f0
- f_true = f.copy()
- nfev = 1
- J = J0
- njev = 1
- m, n = J.shape
- if loss_function is not None:
- rho = loss_function(f)
- cost = 0.5 * np.sum(rho[0])
- J, f = scale_for_robust_loss_function(J, f, rho)
- else:
- cost = 0.5 * np.dot(f, f)
- g = compute_grad(J, f)
- jac_scale = isinstance(x_scale, str) and x_scale == 'jac'
- if jac_scale:
- scale, scale_inv = compute_jac_scale(J)
- else:
- scale, scale_inv = x_scale, 1 / x_scale
- Delta = norm(x0 * scale_inv)
- if Delta == 0:
- Delta = 1.0
- if tr_solver == 'lsmr':
- reg_term = 0
- damp = tr_options.pop('damp', 0.0)
- regularize = tr_options.pop('regularize', True)
- if max_nfev is None:
- max_nfev = x0.size * 100
- alpha = 0.0 # "Levenberg-Marquardt" parameter
- termination_status = None
- iteration = 0
- step_norm = None
- actual_reduction = None
- if verbose == 2:
- print_header_nonlinear()
- while True:
- g_norm = norm(g, ord=np.inf)
- if g_norm < gtol:
- termination_status = 1
- if verbose == 2:
- print_iteration_nonlinear(iteration, nfev, cost, actual_reduction,
- step_norm, g_norm)
- if termination_status is not None or nfev == max_nfev:
- break
- d = scale
- g_h = d * g
- if tr_solver == 'exact':
- J_h = J * d
- U, s, V = svd(J_h, full_matrices=False)
- V = V.T
- uf = U.T.dot(f)
- elif tr_solver == 'lsmr':
- J_h = right_multiplied_operator(J, d)
- if regularize:
- a, b = build_quadratic_1d(J_h, g_h, -g_h)
- to_tr = Delta / norm(g_h)
- ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1]
- reg_term = -ag_value / Delta**2
- damp_full = (damp**2 + reg_term)**0.5
- gn_h = lsmr(J_h, f, damp=damp_full, **tr_options)[0]
- S = np.vstack((g_h, gn_h)).T
- S, _ = qr(S, mode='economic')
- JS = J_h.dot(S)
- B_S = np.dot(JS.T, JS)
- g_S = S.T.dot(g_h)
- actual_reduction = -1
- while actual_reduction <= 0 and nfev < max_nfev:
- if tr_solver == 'exact':
- step_h, alpha, n_iter = solve_lsq_trust_region(
- n, m, uf, s, V, Delta, initial_alpha=alpha)
- elif tr_solver == 'lsmr':
- p_S, _ = solve_trust_region_2d(B_S, g_S, Delta)
- step_h = S.dot(p_S)
- predicted_reduction = -evaluate_quadratic(J_h, g_h, step_h)
- step = d * step_h
- x_new = x + step
- f_new = fun(x_new)
- nfev += 1
- step_h_norm = norm(step_h)
- if not np.all(np.isfinite(f_new)):
- Delta = 0.25 * step_h_norm
- continue
- # Usual trust-region step quality estimation.
- if loss_function is not None:
- cost_new = loss_function(f_new, cost_only=True)
- else:
- cost_new = 0.5 * np.dot(f_new, f_new)
- actual_reduction = cost - cost_new
- Delta_new, ratio = update_tr_radius(
- Delta, actual_reduction, predicted_reduction,
- step_h_norm, step_h_norm > 0.95 * Delta)
- step_norm = norm(step)
- termination_status = check_termination(
- actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol)
- if termination_status is not None:
- break
- alpha *= Delta / Delta_new
- Delta = Delta_new
- if actual_reduction > 0:
- x = x_new
- f = f_new
- f_true = f.copy()
- cost = cost_new
- J = jac(x, f)
- njev += 1
- if loss_function is not None:
- rho = loss_function(f)
- J, f = scale_for_robust_loss_function(J, f, rho)
- g = compute_grad(J, f)
- if jac_scale:
- scale, scale_inv = compute_jac_scale(J, scale_inv)
- else:
- step_norm = 0
- actual_reduction = 0
- iteration += 1
- if termination_status is None:
- termination_status = 0
- active_mask = np.zeros_like(x)
- return OptimizeResult(
- x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm,
- active_mask=active_mask, nfev=nfev, njev=njev,
- status=termination_status)
|