Proximal linearized method for sparse equity portfolio optimization with minimum transaction cost

In this paper, we propose a sparse equity portfolio optimization model that aims to minimize transaction cost by avoiding small investments while promoting diversification to help mitigate volatility in the portfolio. The former is achieved by including an ℓ0-norm regularization of the asset weights to promote sparsity. Subject to a minimum expected return, the proposed model turns out to have an objective function consisting of discontinuous and nonconvex terms. The complexity of the model calls for a proximal method, which allows us to handle the objective terms separately via the corresponding proximal operators. We develop an efficient algorithm to find the optimal portfolio and prove its global convergence. The efficiency of the algorithm is demonstrated using real stock data, and the model is promising in portfolio selection in terms of generating higher expected return while maintaining a good level of sparsity, thus minimizing transaction cost.


Introduction
Introduced by Markowitz [20] in 1952, mean-variance optimization (MVO) has been widely used in the selection of optimal investment portfolios. The success of MVO is attributed to the simplicity of its quadratic objective function, which can be solved by the quadratic programming (QP) solvers that are widely available. However, MVO has flaws of its own, and its implementation in portfolio optimization has been heavily criticized by academics and professionals [22]. One of its flaws, as pointed out by Michaud [21], is its sensitivity to input parameters, which maximizes the errors associated with these inputs. This was proven theoretically and computationally by Best and Grauer [3], who showed that a slight change in the assets' expected returns or correlations results in large changes in portfolio weights. Despite this, MVO remains one of the most successful frameworks due to the absence of alternative models that are simple enough to be cast as QP problems.
Over the past decade or so, the success of robust optimization techniques has allowed researchers to consider non-quadratic objective functions and regularizations for portfolio optimization. Notably, the work by Daubechies et al. [12] showed that the usual quadratic regularizing penalties can be replaced by weighted ℓp-norm penalties with p ∈ [1, 2]. Two specific cases in portfolio optimization, namely the lasso when p = 1 and ridge regression when p = 2, were considered by Brodie et al. [9] and [14], respectively. While ridge regression minimizes the sample variance subject to a constraint that leads to diversification, lasso regularization encourages sparse portfolios, which in turn leads to the minimization of transaction cost. Such regularizations have been studied notably by Chen et al. [10], De Mol [13] and Fastrich et al. [15].
In reality, financial institutions charge their customers transaction fees for trading on the stock market. The two most common charging schemes are a fixed transaction fee and/or a proportion of the investment amount, whichever is higher. In general, a large number of transactions results in higher transaction cost, often caused by small investments that each incur the fixed transaction fee. Transaction cost, in this sense, affects both the portfolio optimization and the frequency of rebalancing the portfolio. On the other hand, diversification is the practice of spreading investments around so that the exposure to any one type of asset is limited. This practice can help mitigate the risk and volatility in the portfolio, but it potentially increases the number of investment components and thus the number of transactions. Therefore, a more realistic model is needed to strike a balance between diversification and minimizing transaction cost for optimal portfolio selection.
Due to the complexity of the objective function and the regularization involved, much of the existing literature employs the alternating direction method of multipliers (ADMM), which was first introduced by Gabay and Mercier [17] in 1976. It was not until the recent decade that ADMM received much attention in machine learning problems. The essence of ADMM is that it allows one to handle the objective terms separately via their proximal operators. Its appealing features in large-scale convex optimization problems include ease of implementation and relatively good performance (see, for instance, Boyd et al. [8], Fazel et al. [16] and Perrin and Roncalli [22]). Examples of ADMM-like algorithms in portfolio optimization can be found in Chen et al. [10], Dai and Wen [11] and Lai et al. [18], where they are used to solve ℓp-regularized problems with p ∈ [1, 2]. Though the ℓ0-norm is ideal for sparsity problems, its regularization results in a discontinuous and nonconvex problem, whose computation turns out to be complicated.
In this paper, we propose a new algorithmic framework to maximize the sparsity of the portfolio while promoting diversification, i.e. to minimize the ℓ0-norm and ℓ2-norm of the asset weights, respectively, subject to a minimum expected return via MVO. We first transform the constrained problem into an unconstrained one, which yields a non-smooth and non-convex objective. The technique of ADMM allows us to handle these terms separately, and the resulting algorithm nevertheless converges to an optimal solution. Numerical results using real data are provided to illustrate the reliability of the proposed model and its efficiency in generating higher expected return while minimizing transaction cost when compared to the standard MVO.
This paper is organized as follows: In Section 2, we present a model for sparse equity portfolio optimization with minimum transaction cost and establish the proximal linearized method for ℓ0-norm minimization. Subsequently, in Section 3, we present an ADMM algorithm to find the optimal portfolio of the proposed model, together with its convergence analysis. To illustrate the reliability and efficiency of our method, we present numerical results using real stock data in Section 4. Finally, the conclusion of the paper is presented in Section 5.
Proximal linearized method for ℓ0-norm minimization

We begin with a universe of n assets under consideration, with mean return vector µ ∈ R^n and covariance matrix V ∈ R^{n×n}. Let x ∈ R^n be the vector of asset weights in the portfolio. Our objective is to maximize the portfolio return µ^T x and minimize the variance of portfolio return x^T V x, while maintaining a certain level of diversification ‖x‖₂² and minimizing transaction cost ‖x‖₀. The variance of the portfolio return is the measure of risk inherent in investing in a portfolio, and we shall refer to it as variance risk throughout this paper. The portfolio is said to be purely concentrated if there exists i such that x_i = 1, and equally weighted if x_i = 1/n for all i. Assume that the capital is fully invested, so that e^T x = 1, where e ∈ R^n is the all-one vector. The sparse equity portfolio optimization model with minimum transaction cost (SEPO-ℓ0) is then formulated as problem (2.1), where β₁ > 0 is a parameter for leveraging the portfolio variance risk, β₂ > 0 is a parameter for leveraging portfolio diversification, and r ≥ 0 is the minimum guaranteed return ratio with r ≤ max{µ_i}.
In a standard MVO, diversification is of general importance to reduce portfolio risk without necessarily reducing portfolio return. While diversification does not mean that we add more money into our investment, it certainly does reduce our investment value, as the investment in each equity incurs transaction cost. Our proposed method takes into consideration having diversified investments, while at the same time avoiding small investments that might incur unnecessary transaction costs due to this diversification. Note that the sparsity measure of the vector x ∈ R^n is given by ‖x‖₀ := number of nonzero components of x.
Minimizing the ℓ0-norm in (2.1) promotes sparsity within the portfolio, since the values of x_i are forced to be zero except for the large ones, thus minimizing the transaction cost.
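Since the displayed formulation of (2.1) is not reproduced here, the following sketch only illustrates the structure described in the text: a return term µ^T x, a variance-risk term weighted by β₁, an ℓ2 diversification term weighted by β₂, and an ℓ0 transaction-cost penalty. The function name and the ℓ0 weight `gamma` are hypothetical, not the paper's notation.

```python
import numpy as np

def sepo_l0_objective(x, mu, V, beta1, beta2, gamma):
    """Illustrative SEPO-l0-style objective (the exact form and gamma are assumed)."""
    ret = -(mu @ x)                        # maximize expected return mu'x
    risk = beta1 * (x @ V @ x)             # variance risk x'Vx
    divers = beta2 * float(np.dot(x, x))   # squared l2-norm diversification term
    cost = gamma * np.count_nonzero(x)     # l0-norm as a transaction-cost proxy
    return ret + risk + divers + cost
```

In this form the ℓ0 term adds a fixed penalty per active position, mirroring the fixed per-trade fee described above.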
Our model (2.1) poses computational difficulties due to the non-convexity and discontinuity of the ℓ0-norm and the inequality constraint µ^T x ≥ r. Instead of dealing with the problem in its entirety, we employ the alternating direction method of multipliers (ADMM) so that the smooth and nonsmooth terms can be handled separately. This calls for a brief introduction of proximal operators and the Moreau envelope [23]:

Definition 2.1. Let ψ : R^n → R ∪ {+∞} be a proper and lower semicontinuous function and σ > 0 be a parameter. The proximal operator of ψ is defined as prox_{σψ}(x) := argmin_u { ψ(u) + (1/(2σ))‖u − x‖² }. Its Moreau envelope (or Moreau-Yosida regularization) is defined by env_{σψ}(x) := min_u { ψ(u) + (1/(2σ))‖u − x‖² }. (2.3)

The parameter σ can be interpreted as a trade-off between minimizing ψ and being close to x. The Moreau envelope, specifically, is a way to smooth a non-smooth function, and the minimum in env_{σψ}(x) is attained at prox_{σψ}(x). Suppose now we are given a problem in which ψ, φ : R^n → R ∪ {+∞} are closed proper functions, both of which can be nonsmooth. Under the ADMM algorithm, each iteration k takes on an alternating nature, with the proximal operators of ψ and φ being evaluated separately. Viewing the scheme as a fixed-point iteration, ADMM results in x = z such that both proximal conditions hold simultaneously.

Turning our attention back to problem (2.1), we first denote by R the set associated with the inequality constraint in (2.1), and by I_R the indicator function of R. We now define the augmented Lagrangian corresponding to problem (2.1), where λ is the usual Lagrange multiplier and ρ > 0 is the penalty parameter for the equality constraint e^T x = 1. We may set ρ to be a constant with value greater than 4 [2], leading to problem (2.6) being rewritten as (2.7), where x and λ are updated alternately. Problem (2.7) can now be viewed as the minimization of P(x, λ) + Q(x), where P(x, λ) consists of the smooth terms given by (2.9) and Q(x) of the non-smooth terms given by (2.10). For the purpose of our discussion on the proximal method, we let λ be fixed at some value λ̄ and deal with the minimization problem (2.11).

Our proximal method for minimizing the objective function in (2.11), inspired by Beck and Teboulle [1], can be viewed as the proximal regularization of P linearized at a given point x^k, as in (2.12), where t > 0 and ∇ denotes the derivative operator. Invoking simple algebra and ignoring the constant terms, (2.12) can be written as (2.13). Using Definition 2.1, the iterative scheme consists of a proximal step at the resulting gradient point, which gives us the proximal gradient method (2.14), where α_k > 0 is a suitable step size. Note that if ∇P is Lipschitz continuous with constant L_c, then the proximal gradient method is known to converge at a rate of O(1/k) with fixed step size α ∈ (0, 1/L_c] (Boyd et al. [8]). When L_c is not known, the step sizes can be chosen via line search methods (see, for example, Beck and Teboulle [1]). In the context of line search methods, the largest possible step size α = 1 is more desirable; therefore, proximal gradient methods usually fix the step size α = min{1, 1/L_c}. In our case, the Lipschitz continuity of ∇P gives, for all x, y ∈ R^n, the bound (2.15), where I denotes the identity matrix and ‖·‖_F the Frobenius norm. Since the Lipschitz constant in (2.15) is not easily accessible, we estimate it by an upper bound L̄_c as in (2.16), where tr denotes the matrix trace. Since L̄_c > 1, it is clear that min{1, 1/L̄_c} will always return the value 1/L̄_c. We shall henceforth fix our step size α = 1/L̄_c. Our choice of step size follows from the well-known descent property below:

Lemma 2.1 (Descent property [1]). Let ψ : R^n → R be a continuously differentiable function with gradient ∇ψ assumed to be L_c-Lipschitz continuous. Then, for any L̄_c ≥ L_c, inequality (2.17) holds.

Using the proximal operator defined in Definition 2.1, the minimization of (2.12) is equivalent to a proximal step with α = 1/L̄_c. The choice of L̄_c also guarantees the sufficient decrease of our objective function under the proximal method:

Lemma 2.2 (Sufficient decrease property [7]). Let ψ : R^n → R be a C¹ function whose gradient ∇ψ is Lipschitz continuous with modulus L_c. Let φ : R^n → (−∞, +∞] be a proper and lower semicontinuous function with inf_{R^n} φ > −∞. Suppose L̄_c is chosen such that L̄_c > L_c. Then, for any x ∈ dom φ and any x⁺ ∈ R^n defined by the proximal step, the sufficient decrease inequality holds. Note that dom φ in Lemma 2.2 denotes the set of points at which the proper and lower semicontinuous function φ : R^n → R ∪ {+∞} takes a finite value.

In view of Lemma 2.2, we turn to our non-smooth term Q(x), which can be written as the unconstrained problem (2.23). In particular, the proximal operator of the indicator function I_R reduces to the Euclidean projection onto R, as in (2.24). Meanwhile, the proximal operator of the ℓ0-norm can be expressed in its component-wise form (2.25). Note that prox_{σ‖·‖₀}(x) is known as a hard-thresholding operator, since it forces the components x_i to be zero except for the large ones [23]. In other words, a larger σ results in higher sparsity and less penalization for moving away from x. Doing so ensures that our portfolio selection avoids small investments.
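The two proximal operators just described can be sketched numerically. The hard-thresholding rule below uses the standard closed form of the ℓ0 proximal operator (keep x_i when x_i² > 2σ, zero it otherwise); the projection shown assumes the return constraint induces the half-space {x : µ^T x ≥ r}, which is an assumption about the exact shape of the set R in (2.4).

```python
import numpy as np

def prox_l0(x, sigma):
    """Hard-thresholding operator: component-wise prox of sigma * ||.||_0."""
    out = x.copy()
    out[x ** 2 <= 2.0 * sigma] = 0.0   # small components are forced to zero
    return out

def project_return(x, mu, r):
    """Euclidean projection onto the half-space {x : mu'x >= r}."""
    gap = r - mu @ x
    if gap <= 0:                       # constraint already satisfied
        return x
    return x + (gap / (mu @ mu)) * mu  # shift along mu to the boundary
```

Note how a larger σ widens the zeroed band, matching the remark that larger σ yields higher sparsity.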
In the next section, we will see how the proximal operators are evaluated alternately to give us the optimal solution for problem (2.1).
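To make the proximal gradient scheme concrete, here is a minimal loop on a toy smooth-plus-ℓ0 objective ½x^T A x − b^T x + σ‖x‖₀, with the step size taken as 1/L̄_c for a crude trace-based Lipschitz estimate in the spirit of (2.16). The problem data and the estimate are illustrative, not the paper's actual P and Q.

```python
import numpy as np

def prox_grad_l0(A, b, sigma, iters=500):
    """Proximal gradient on 0.5*x'Ax - b'x + sigma*||x||_0 (toy example)."""
    L_bar = np.trace(A)        # crude Lipschitz upper bound for the smooth gradient
    alpha = 1.0 / L_bar        # fixed step size alpha = 1 / L_bar
    x = np.zeros(len(b))
    for _ in range(iters):
        y = x - alpha * (A @ x - b)                        # gradient step on smooth part
        x = np.where(y ** 2 > 2.0 * sigma * alpha, y, 0.0) # hard-thresholding prox step
    return x
```

The loop alternates a gradient step on the smooth part with the ℓ0 prox, exactly the splitting the section describes; components whose unconstrained optimum is small are driven to zero.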

Alternating proximal algorithm and its convergence
In this section, we present an ADMM algorithm to find the optimal portfolio of the proposed SEPO-ℓ0 model (2.1) and establish its global convergence.
We have seen in Section 2 how the proposed proximal method guarantees the descent of the solution. To proceed with the convergence analysis of the SEPO-ℓ0 algorithm, we begin with Assumption A for an objective function L : R^n → R ∪ {+∞} with L = ψ + φ.

Assumption A. (i) ψ : R^n → R is a continuously differentiable function whose gradient ∇ψ is Lipschitz continuous with modulus L_c.
The SEPO-ℓ0 algorithm also enjoys the following convergence properties for (2.7):

Lemma 3.1 (Convergence properties [7]). Suppose that Assumption A holds, and let {x^k}_{k∈N} be a sequence generated by the SEPO-ℓ0 algorithm. Then the sequence {L(x^k, λ^k) : k ∈ N} is nonincreasing; in particular, (3.1) holds, and hence (3.2) and (3.3) follow.

Proof. Without loss of generality, we let λ be a fixed constant and work with L(x) = P(x) + Q(x) in place of L(x, λ), where P(x) is given by (2.9) and Q(x) by (2.10). Note that P(x) is differentiable and its gradient is Lipschitz continuous with modulus L_c. Invoking the SEPO-ℓ0 algorithm and Lemma 2.2, we obtain (3.4), where L̄_c is given by (2.16). Writing L(x^k) = P(x^k) + Q(x^k) in (3.4) and rearranging leads to (3.1), which asserts that the sequence {L(x^k, λ^k) : k ∈ N} is nonincreasing. Note that P and Q are bounded below (see Assumption A), and hence L converges to some limit L̄. Let N ∈ N⁺ and sum up (3.1) over k = 0, 1, . . ., N − 1. It follows that (3.2) and (3.3) hold when we take the limit as N → ∞.
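The summation step at the end of the proof can be made explicit. Assuming (3.1) has the standard sufficient-decrease form from Lemma 2.2 (the constants below are reconstructed from that lemma, not quoted from the paper), telescoping gives:

```latex
% Assumed form of (3.1): sufficient decrease with margin (\bar{L}_c - L_c)/2
L(x^{k+1}, \lambda^{k+1}) + \frac{\bar{L}_c - L_c}{2}\,\|x^{k+1} - x^k\|^2
  \le L(x^k, \lambda^k).
% Summing over k = 0, 1, \dots, N-1 telescopes the objective values:
\frac{\bar{L}_c - L_c}{2} \sum_{k=0}^{N-1} \|x^{k+1} - x^k\|^2
  \le L(x^0, \lambda^0) - L(x^N, \lambda^N)
  \le L(x^0, \lambda^0) - \bar{L}.
% Letting N \to \infty shows \sum_k \|x^{k+1} - x^k\|^2 < \infty,
% hence \|x^{k+1} - x^k\| \to 0, which is the content of (3.2)-(3.3).
```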
Before we present the result that sums up the properties of the sequence {x^k}_{k∈N} generated by the SEPO-ℓ0 algorithm starting from an initial point x⁰, we first fix some notation. We denote by crit L the set of critical points of L, and by ω(x⁰) the set of all limit points, where ω(x⁰) = {x̄ ∈ R^n : there exists an increasing sequence of integers {k_j}_{j∈N} such that x^{k_j} → x̄ as j → ∞}.
Given any set Ω ⊂ R^n and any point x ∈ R^n, the distance from x to Ω is defined by dist(x, Ω) := inf{‖y − x‖ : y ∈ Ω}.
When Ω = ∅, we adopt the usual convention that inf ∅ = +∞, and hence dist(x, Ω) = +∞ for all x.
Lemma 3.2 (Properties of limit points [7]). Suppose that Assumption A holds, and let {x^k}_{k∈N} be a bounded sequence generated by the SEPO-ℓ0 algorithm. Then the following hold: (a) ω(x⁰) is a nonempty, compact and connected set.
(b) The objective function L is finite and constant on ω(x⁰).
What remains is the global convergence, which we shall establish by means of the Kurdyka-Łojasiewicz (KL) property [7], an extension of the Łojasiewicz gradient inequality [19] to non-smooth functions. We first show that the objective function (2.7) is semi-algebraic and therefore a KL function. This, in turn, is crucial in giving us the convergence property of the sequences generated by the SEPO-ℓ0 algorithm. We begin by recalling notation and definitions concerning the subdifferential (see, for instance, [7, 23]) and the KL property.

Definition 3.1. Let φ : R^n → R ∪ {+∞} be a proper and lower semicontinuous function. The (limiting) subdifferential of φ at x ∈ dom φ is denoted by ∂φ(x) and defined by (3.5). It follows that 0 ∈ ∂φ(x) if x ∈ R^n is a local minimizer of φ. If φ is continuously differentiable, then ∂φ(x) = {∇φ(x)}, and we recover the usual gradient mapping x ↦ ∇φ(x). If φ is convex, the subdifferential (3.5) coincides with the classical Fréchet subdifferential (see [23]).
(i) A subset Ω ⊂ R^n is called a semi-algebraic set if there exist finitely many real polynomial functions p_ij and q_ij such that Ω can be written as a finite union of sets of points satisfying p_ij(x) = 0 and q_ij(x) < 0. It follows that semi-algebraic functions are indeed KL functions; the result below is a non-smooth version of the Łojasiewicz gradient inequality.

Theorem 3.4 ([5, 6]). Let φ : R^n → R ∪ {+∞} be a proper and lower semicontinuous function. If φ is semi-algebraic, then it is a KL function.
Theorem 3.4 allows us to avoid the technicalities of proving the KL property directly, thanks to the broad range of functions and sets that are indeed semi-algebraic (see, for instance, [4, 7]). Examples of semi-algebraic functions include real polynomial functions and indicator functions of semi-algebraic sets. Moreover, finite sums and products of semi-algebraic functions, as well as scalar products, are all semi-algebraic.
We are now ready to give the global convergence result of the proposed model (2.1).
Theorem 3.5 (Global convergence). Suppose the objective function L : R^n → R ∪ {+∞} is a KL function such that Assumption A holds. Then the sequence {x^k}_{k∈N} generated by the SEPO-ℓ0 algorithm converges to a critical point x*.
To apply Theorem 3.5, we now show that each term in (2.7) is semi-algebraic, since the finite sum of semi-algebraic functions is also semi-algebraic. Clearly, (2.7) is a sum of the smooth function P(x), the ℓ0-norm and an indicator function. The function P(x) given by (2.9) is a linear combination of linear and quadratic functions, hence a real polynomial function, which in turn is semi-algebraic.
As a specific example given by Bolte et al. [7], the ℓ0-norm is nothing but the sparsity measure of the vector x ∈ R^n, which is indeed semi-algebraic. In particular, the graph of ‖·‖₀ is given by a finite union of product sets (3.9), where for any given I ⊂ {1, . . ., n}, |I| denotes the cardinality of I. It is obvious that (3.9) is a piecewise linear set, hence the claim. Lastly, the indicator function I_R(x) defined by (2.5) is also semi-algebraic, since the feasible set (2.4) is defined by a linear inequality and is therefore semi-algebraic.

Numerical experiments and results
In this section, we study the efficiency of the proposed portfolio optimization model, SEPO-ℓ0, in maximizing portfolio return and minimizing transaction cost. We test our algorithm on real data of stock prices and returns of 100 companies across 10 different sectors in China, collected from January 2019 to June 2019. These data are in turn used to generate the covariance matrix, which gives us the portfolio variance in problem (2.1). We start with the equally weighted portfolio, i.e. x⁰_i = 1/n for all i. We set ε = 10⁻⁷ and stop the algorithm when ‖∇P(x^{k+1}, λ^{k+1})‖ < ε or k > 10000. All computational results are obtained by running Matlab R2021a on Windows 10 (Intel Core i7-1065G7 CPU @ 1.30 GHz, 16 GB RAM).

Figure 4.2 plots β₁ on the x-axis, with expected return and sparsity (in decimal) on the left y-axis and risk on the right y-axis. We can observe a similar trend for the three lines, which clearly reflects the consistency of our model in obtaining an optimal portfolio selection. The relationship between the independent variable β₁ and the response variables is further examined using a multivariate linear regression model, as presented in Table 4.2. As we can see from the table, the estimates for the response variables y are all negative, which means their values decrease as β₁ increases. Since the p-values of all response variables are approximately zero, all three variables are significant. In particular, β₁ has a significant negative relationship with expected return, risk and sparsity. The significance of β₁ on these three dependent variables is supported by the R-squared values of the univariate regressions, standing at 90.76%, 87.48% and 58.59% for expected return, risk and sparsity, respectively. Since R-squared is the percentage of total variation explained by the predictor variable, the high R-squared values of greater than 80% for expected return and risk mean that β₁ explains a high percentage of the variance in these two response variables. It is slightly lower for sparsity; however, any R-squared value greater than 50% can be considered moderately high.
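The reported R-squared values can in principle be reproduced with an ordinary least-squares fit. The helper below is an illustrative sketch (the paper's actual regressions were run in Matlab); it computes the coefficient of determination for a univariate regression of a response variable on β₁.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for the univariate OLS fit y ~ a + b*x."""
    X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares coefficients
    resid = y - X @ coef
    ss_res = resid @ resid                          # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()            # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Applied to each response series (expected return, risk, sparsity) against the grid of β₁ values, this yields the univariate R-squared figures quoted above.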

Conclusion
The classical Markowitz portfolio scheme, or mean-variance optimization (MVO), is one of the most successful frameworks due to its simplicity of implementation; in particular, it can be solved by quadratic programming, which is widely available. However, it is very sensitive to input parameters, and obtaining acceptable solutions requires the right weight constraints. Over the past decade, there has been renewed attention to non-quadratic portfolio selection models, owing to advances in optimization algorithms for solving more general classes of functions. Here we proposed a new algorithmic framework that allows portfolio managers to strike a balance between diversifying investments and minimizing transaction cost, the latter achieved by minimizing the ℓ0-norm. This simply means that the model maximizes sparsity within the portfolio, since the weights x_i are forced to be zero except for the large ones. In practice, ℓ0 regularization results in a discontinuous and nonconvex problem, and hence it is often approximated via the ℓ1-norm. In this study, we instead employed proximal methods so that the function can be 'smoothed', by linearizing part of the objective function at a given point and regularizing it by a quadratic proximal term that acts as a measure of the 'local error' in the approximation. Writing our problem in the form of an augmented Lagrangian, the unconstrained problem can be divided into two parts, namely the smooth and non-smooth terms. These terms are then handled separately through their proximal operators via the ADMM method. The global convergence of the proposed SEPO-ℓ0 algorithm for sparse equity portfolios has been established. The efficiency of our model in maximizing portfolio expected return while striking a balance between minimizing transaction cost and diversification has been analyzed using actual data of 100 companies. Empirically, the implementation of our model leads to higher expected return and lower transaction cost. This shows that, despite its higher risk compared to the standard MVO, the SEPO-ℓ0 model is promising in generating a good combination for an optimal investment portfolio.

Moreover, φ is called a KL function if it satisfies the KL property at each point of dom φ. The definition above uses sublevel sets: given a, b ∈ R, the sublevel sets of a function φ are defined by [a ≤ φ ≤ b] := {x ∈ R^n : a ≤ φ(x) ≤ b}, with a similar definition for [a < φ < b]. The level sets of φ are defined by [φ = a] := {x ∈ R^n : φ(x) = a}.

Figure 4.2: Portfolio expected return, risk and sparsity subject to minimum guaranteed return ratio r = 0.1 under SEPO-ℓ0, for different values of β₁

Table 4.2: Relationship between the independent variable β₁ and the response variables using a multivariate linear regression model for SEPO-ℓ0 with r = 0.1