{"title": "Efficient Meta Learning via Minibatch Proximal Update", "book": "Advances in Neural Information Processing Systems", "page_first": 1534, "page_last": 1544, "abstract": "We address the problem of meta-learning which learns a prior over hypothesis from a sample of meta-training tasks for fast adaptation on meta-testing tasks. A particularly simple yet successful paradigm for this research is model-agnostic meta-learning (MAML). Implementation and analysis of MAML, however, can be tricky; first-order approximation is usually adopted to avoid directly computing Hessian matrix but as a result the convergence and generalization guarantees remain largely mysterious for MAML. To remedy this deficiency, in this paper we propose a minibatch proximal update based meta-learning approach for learning to efficient hypothesis transfer. The principle is to learn a prior hypothesis shared across tasks such that the minibatch risk minimization biased regularized by this prior can quickly converge to the optimal hypothesis in each training task. The prior hypothesis training model can be efficiently optimized via SGD with provable convergence guarantees for both convex and non-convex problems. Moreover, we theoretically justify the benefit of the learnt prior hypothesis for fast adaptation to new few-shot learning tasks via minibatch proximal update. 
Experimental results on several few-shot regression and classification tasks demonstrate the advantages of our method over the state of the art.", "full_text": "Efficient Meta Learning via Minibatch Proximal Update

Pan Zhou* Xiao-Tong Yuan† Huan Xu‡ Shuicheng Yan◇ Jiashi Feng*
† B-DAT Lab, Nanjing University of Information Science & Technology, Nanjing, China
* Learning & Vision Lab, National University of Singapore, Singapore
‡ Alibaba and Georgia Institute of Technology, USA
◇ YITU Technology, Shanghai, China
pzhou@u.nus.edu xtyuan@nuist.edu.cn Huan.xu@alibaba-inc.com {eleyans, elefjia}@nus.edu.sg

Abstract

We address the problem of meta-learning, which learns a prior over hypotheses from a sample of meta-training tasks for fast adaptation on meta-testing tasks. A particularly simple yet successful paradigm for this research is model-agnostic meta-learning (MAML). Implementation and analysis of MAML, however, can be tricky: a first-order approximation is usually adopted to avoid directly computing the Hessian matrix, but as a result the convergence and generalization guarantees of MAML remain largely mysterious. To remedy this deficiency, in this paper we propose a minibatch proximal update based meta-learning approach for learning efficient hypothesis transfer. The principle is to learn a prior hypothesis shared across tasks such that minibatch risk minimization with biased regularization toward this prior can quickly converge to the optimal hypothesis in each training task. The prior hypothesis training model can be efficiently optimized via SGD, with provable convergence guarantees for both convex and non-convex problems. Moreover, we theoretically justify the benefit of the learnt prior hypothesis for fast adaptation to new few-shot learning tasks via minibatch proximal update.
Experimental results on several few-shot regression and classification tasks demonstrate the advantages of our method over the state of the art.

1 Introduction

Meta-learning [1, 2, 3], a.k.a. learning-to-learn [4], is an effective approach for learning fast from a small amount of data, with many successful applications in regression/classification [5, 6, 7, 8, 9, 10] and reinforcement learning [6, 11, 12, 13]. It assumes access to a distribution of tasks, each of which could be a learning problem (e.g. classification), and then learns from a finite set of sample meta-tasks. Specifically, meta-learning contains a meta-learner, a trainable learning hypothesis or algorithm that extracts knowledge from all observed meta-tasks and facilitates the learning of a learner for a potentially unseen meta-task with only a few samples. Current meta-learners can be grouped into metric-based approaches [14, 8, 15, 16], which learn similarity metrics among samples; memory-based methods [17, 7, 18], which use memory models, e.g. neural Turing machines [19] and long short-term memory [20], to store important training samples or learn a fast adaptation algorithm; and optimization-based approaches [6, 5, 9, 10, 21], which learn a good parameter initialization or regularization for fast adaptation to new tasks. Compared with metric-based approaches, which are more suitable for non-parametric learners, and memory-based methods, which are designed case by case, optimization-based approaches are simpler but also more general, and thus have been applied to various applications without much tailoring [6, 9].
MAML [6] is a recent leading approach of optimization-based meta-learning.
In principle, MAML aims to estimate a good parameter initialization w of a network such that, for a randomly sampled task T with corresponding loss $L_{D_T}(w)$, the meta-loss $L_{D_T}(w - \eta\nabla L_{D_T}(w))$ is small.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This method is compatible with any model trained with gradient descent, i.e. it is model-agnostic, and has been shown to be effective in many classification and reinforcement learning applications. However, for gradient-based meta-optimization, MAML requires computing the second-order derivatives introduced by the intra-task gradient descent step $w - \eta\nabla L_{D_T}(w)$, which is computationally prohibitive for large networks. To resolve this issue, first-order approximations of MAML [6, 10] have been developed that avoid estimating the second-order derivatives. For example, first-order MAML (FOMAML) [6] directly ignores the second-order derivative terms in MAML, and Reptile [10] approximates the MAML gradient by the sum of the gradients over several gradient descent steps on a joint training model. Though exhibiting impressive scalability and accuracy in some applications, the convergence and generalization behavior of these variants is underexplored and remains largely mysterious, especially for non-convex problems. Indeed, for few-shot classification, as shown in Sec. 4, the first-order approximate approaches suffer from generalization performance degradation, e.g. on the Omniglot [10] and tieredImageNet datasets, due to the gradient approximation steps.
To remedy the deficiency of MAML, we consider the minibatch proximal update regularized by prior hypothesis parameters w, i.e., $\min_{w_T} L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2$, which aims to learn the optimal hypothesis parameter $w^*_T$ of task T around a prior hypothesis w.
Such a mechanism can leverage a good prior w to facilitate the learning of task T with only a few samples, as it tells the learner where the optimal hypothesis parameter $w^*_T$ roughly lies in the solution space. Then, how to efficiently learn the prior hypothesis parameters w becomes a crucial problem. Through the lens of online convex optimization, a follow-the-meta-regularized-leader method has been proposed for estimating such a prior hypothesis via online meta-learning [21]. In a concurrent work [22], for linear prediction models, a similar idea of minibatch proximal update has been explored inside a framework of online convex meta-learning. Differently and independently, we develop an SGD-based meta-optimization algorithm for efficient meta-learning via minibatch proximal update, with provable guarantees established for a broader range of convex and non-convex learning problems than those considered in [21, 22].
Our contributions. In this paper, we present Meta-MinibatchProx as a generic stochastic meta-learning approach for learning a good prior hypothesis for minibatch proximal update. The idea is to view the prior hypothesis as an unknown meta-parameter and to learn it by minimizing the empirical risk over a set of minibatch proximal update based meta-training tasks. More specifically, in the off-line setting, for both convex and non-convex loss functions we seek to minimize the meta-empirical-risk $L_n(w) = \frac{1}{n}\sum_{i=1}^n \phi_{D_{T_i}}(w)$ over a set of meta-training tasks $\{T_i\}$ sampled from a task distribution $\mathcal{T}$, where $\phi_{D_{T_i}}(w) := \min_{w_{T_i}}\{L_{D_{T_i}}(w_{T_i}) + \frac{\lambda}{2}\|w_{T_i} - w\|_2^2\}$ is the meta-loss of task $T_i$ determined by the minibatch proximal update. In the online setting, we alternatively seek to minimize the expected risk function $L(w) = \mathbb{E}_{T_i\sim\mathcal{T}}[\phi_{D_{T_i}}(w)]$.
A key observation of our approach is that the gradient of the meta-loss evaluated at each meta-training task $T_i$ can be expressed in closed form as $\nabla\phi_{D_{T_i}}(w) = \lambda(w - w^*_{T_i})$, where $w^*_{T_i}$ is the optimal hypothesis output by the minibatch proximal update.
This reveals that the gradient of the meta-loss can be evaluated by solving the intra-task minibatch proximal update problem, without accessing the Hessian of the empirical risk. This in turn paves the way for employing any off-the-shelf first-order method for meta-optimization. In our implementation, we simply use stochastic gradient descent (SGD), with provable convergence guarantees established simultaneously for convex and non-convex problems.
Moreover, we theoretically show that the quality of the prior hypothesis regularizer plays an important role in controlling the excess risk (a.k.a. population sub-optimality) of the minibatch proximal update in the testing phase. More specifically, given a learned hypothesis w, the expected excess risk of the convex minibatch proximal update on a training sample set of size K is upper bounded at the order of $O\big(\frac{1}{\sqrt{K}}\sqrt{\mathbb{E}_{T\sim\mathcal{T}}[\|w - w^*_{T,E}\|_2^2]}\big)$, where $w^*_{T,E}$ represents the population optimal hypothesis of any given meta-task $T\sim\mathcal{T}$. This guarantees that if the hypothesis w is close on average to each task-specific optimal hypothesis, then adopting w as a prior hypothesis regularizer in the minibatch proximal update on a new task with only a few samples yields better generalization than adapting from a random initialization. This further justifies the motivation of learning prior hypothesis transfer for efficient minibatch proximal learning. Extensive experimental results demonstrate the advantages of our approach on few-shot deep learning problems.

2 Related Work

Optimization based meta-learning.
The family of optimization based meta-learning approaches aims to directly learn a good parameter initialization or regularization for future optimization and has gained a lot of attention recently thanks to its simplicity, versatility and effectiveness [6, 5, 9, 10, 21]. As a representative method in this line, MAML [6] tries to estimate an initial network parameter such that, for a randomly sampled new task, the network can be fine-tuned in one or a few steps of minibatch gradient descent. To avoid the computation of second-order derivatives, first-order approximations of MAML have been developed, including FOMAML [6] and Reptile [10]. For lifelong learning, a follow-the-meta-leader extension of MAML has been studied in the setting of online learning [23]. Alternative to meta-initialization, meta-regularization approaches have gained recent interest, mostly due to their provably strong guarantees on statistical learning and computational efficiency [24, 25, 26, 27, 21]. The works most closely related to ours are [22, 27] and [21], which also consider prior-hypothesis biased regularized empirical risk minimization for intra-task learning. Different from [22, 27], which focus on linear regression/classification with convex loss functions, our approach is developed inside a broader context of convex and non-convex statistical learning and thus is of more practical interest, especially in deep learning. In contrast to the online convex meta-learning framework developed in [21], we use a simple yet scalable paradigm of minibatch-prox within SGD for stochastic meta-optimization, which is particularly friendly to computational and statistical complexity analysis in both convex and non-convex settings.
Minibatch proximal, hypothesis transfer, and multi-task learning.
As a building block of our approach, the minibatch proximal update method has been studied in different contexts, including online passive-aggressive learning [28], asynchronous stochastic gradient optimization [29], and communication-efficient distributed learning [30], to name a few. Minibatch proximal learning is also identical in principle to biased regularized hypothesis transfer learning, which has been explored experimentally with success in many applications [31, 32, 33] and theoretically with rigorous guarantees [34, 35]. In the context of multi-task learning, a biased regularized approach was considered in [36] to learn many related tasks simultaneously such that the learnt hypotheses are close to their mean. In contrast, inspired by the power of meta-learning for learning how to learn, we seek to learn a good prior hypothesis as a proximity regularizer for future task learning.

3 Meta-Learning via Minibatch Proximal Update

In this section, we introduce the Meta-MinibatchProx method along with its optimization algorithm. We also provide an analysis to justify the benefit of the learned prior hypothesis regularizer.

3.1 Meta-Problem Formulation

Given a sample space $\mathcal{X}$ and a target space $\mathcal{Y}$, our primary goal is to learn good prior hypothesis parameters w for a class of parameterized hypotheses $f: \mathcal{X}\mapsto\mathcal{Y}$ such that, when facing a new task T, the task-specific hypothesis parameters $w_T$ can be quickly learned from a minibatch $D_T = \{(x_1, y_1), \cdots, (x_K, y_K)\}$ of K random samples via the following minibatch proximal update:

$\min_{w_T}\ L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2,$   (1)

where $L_{D_T}(w_T) = \frac{1}{K}\sum_{(x,y)\in D_T} \ell(f(w_T, x), y)$ is the empirical risk for task T and $\lambda$ is a regularization constant. The loss function $\ell(f(w_T, x), y)$ measures the discrepancy between the prediction $f(w_T, x)$ and the ground truth y, e.g.
the mean squared error in regression and the cross-entropy loss in classification. To learn a prior hypothesis for minibatch proximal update, given a meta-task distribution $\mathcal{T}$, it is natural to consider the online (stochastic) meta-learning problem:

$\min_w\ \mathbb{E}_{T\sim\mathcal{T}}\Big[\min_{w_T}\Big\{L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2\Big\}\Big].$   (2)

Problem (2) contains two levels of learning: the inner level of intra-task learning finds the task-specific optimal hypothesis parameters $w_T$ of task T around the prior hypothesis w, while the outer level of inter-task learning leverages the biased optimal hypotheses $w_T$ to tune w so that w has a small average distance to all $w_T$. By optimizing the inner and outer problems sufficiently, the estimated regularizer w can be expected to be close to the optimal hypothesis of a task T sampled from $\mathcal{T}$ and thus can serve as a good prior hypothesis for a new minibatch proximal update task.

Algorithm 1 SGD for Meta-MinibatchProx
Input: Initial point $w^0$, learning rates $\{\eta_s\}$.
for s = 0 to S − 1 do
  Uniformly randomly select a minibatch of tasks $\{T_i\}$ of size $b_s$ from the observed n tasks.
  for $T_i \in \{T_i\}$ do
    Compute an $\epsilon_s$-approximate stable minimizer $w^s_{T_i}$ of the within-meta-task problem $\min_{w_{T_i}} g(w_{T_i}) := L_{D_{T_i}}(w_{T_i}) + \frac{\lambda}{2}\|w_{T_i} - w^s\|_2^2$ such that $\|\nabla g(w^s_{T_i})\|_2^2 \le \epsilon_s$.
  end for
  Update the meta-parameter $w^{s+1} = w^s - \eta_s\lambda\big(w^s - \frac{1}{b_s}\sum_{i=1}^{b_s} w^s_{T_i}\big)$.
end for
Output: the parameter initialization $w^S$ of model f.
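To make Algorithm 1 concrete, below is a minimal NumPy sketch on a toy family of quadratic task losses; the losses, constants, and step sizes are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

# Toy instantiation of Algorithm 1 (illustrative, not the paper's code):
# each task loss is L_i(w) = 0.5 * ||w - c_i||^2 with task optimum c_i.
rng = np.random.default_rng(0)
dim, lam, n_tasks = 5, 1.0, 20
centers = rng.normal(size=(n_tasks, dim))

def inner_prox(w_s, c, steps=50, lr=0.1):
    """Approximately solve min_wT L_i(wT) + (lam/2)*||wT - w_s||^2 by GD."""
    wT = w_s.copy()                      # warm start at the current prior
    for _ in range(steps):
        wT -= lr * ((wT - c) + lam * (wT - w_s))
    return wT

w = np.zeros(dim)                        # meta-parameter w^0
for s in range(200):
    batch = rng.choice(n_tasks, size=5, replace=False)
    wTs = np.stack([inner_prox(w, centers[i]) for i in batch])
    # Lemma 1: the meta-gradient is lam * (w - average of task solutions)
    w -= 0.05 * lam * (w - wTs.mean(axis=0))

# the learned prior should end up near the mean of the task optima
dist = np.linalg.norm(w - centers.mean(axis=0))
```

On this toy family the fixed point of the meta-update is the mean of the task optima, matching the intuition that the learned prior should lie close on average to the task-specific optima.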
Usually we are only provided with n observed tasks $\{T_i\}_{i=1}^n$ drawn from $\mathcal{T}$, and thus we seek to minimize the following off-line (empirical) version of problem (2), which we call Meta-MinibatchProx:

$\min_w\ F(w) := \frac{1}{n}\sum_{i=1}^n \min_{w_{T_i}}\Big\{L_{D_{T_i}}(w_{T_i}) + \frac{\lambda}{2}\|w_{T_i} - w\|_2^2\Big\}.$   (3)

In Sec. 3.2 and Sec. 3.3 we focus on the above off-line setting for algorithm design and analysis, since in most applications (e.g. image classification) the number of training tasks is finite, though n may be large. We emphasize that all the convergence and generalization guarantees established in the off-line setting can be easily extended to the online stochastic setting, as outlined in detail in Appendix A.3.
Compared with MAML, Meta-MinibatchProx is also model-agnostic, because it is compatible with a broad range of statistical learning models. On top of that, our model is algorithm-agnostic in the sense that the intra-task subproblem can be optimized using virtually any off-the-shelf machine learning optimization algorithm.
This makes Meta-MinibatchProx more flexible than MAML, which by design relies on minibatch stochastic gradient descent to fine-tune the meta-initialization w.
Note that MAML and its variants essentially measure the closeness of the initial prior hypothesis to the target optimal hypothesis by the number of minibatch gradient steps needed to move from the former to the latter. More specifically, MAML seeks a good initialization w such that $w^*_T = w - \eta\nabla L_{D_T}(w) = \mathrm{argmin}_{w_T}\,\langle\nabla L_{D_T}(w), w_T - w\rangle + \frac{1}{2\eta}\|w_T - w\|_2^2$ is close to the optimal hypothesis of task T. In contrast, Meta-MinibatchProx finds the task-specific optimal hypothesis through the minibatch proximal update (1), namely $w^*_T = \mathrm{argmin}_{w_T}\, L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2$. In comparison, one can observe that MAML in effect approximates the loss $L_{D_T}(w_T)$ by its first-order Taylor expansion within a proximal update, while Meta-MinibatchProx directly optimizes $L_{D_T}(w_T) = L_{D_T}(w) + \langle\nabla L_{D_T}(w), w_T - w\rangle + \frac{1}{2}\langle\nabla^2 L_{D_T}(w)(w_T - w), w_T - w\rangle + \frac{1}{6}\langle\nabla^3 L_{D_T}(w), (w_T - w)^{\otimes 3}\rangle + \cdots$, and thus can exploit higher-order information of $L_{D_T}$ beyond the gradient to search for the optimal hypothesis around w. In this way, Meta-MinibatchProx is able to find better task-specific hypotheses, which in turn leads to a more accurate estimate of the prior hypothesis. As we will show shortly, such a minibatch proximal update scheme turns out to be more amenable to algorithm implementation and generalization analysis, and it works reasonably well on few-shot learning tasks in the experiments.
As another advantage of Meta-MinibatchProx over MAML, it can be readily modified to handle outlier meta-tasks by using certain robust variants of the $\ell_{p,q}$-norm regularizer $\|w_{T_i} - w\|_p^q$ for minibatch proximal update.
For instance, suppose that there are a few outlier tasks $O = \{T_o\}$ whose optima $\{w_o\}$ are quite far from the optima $\{w_s\}$ of the inlier (normal) tasks $S = \{T_s\}$. To handle this case, Meta-MinibatchProx may adopt the robust $\ell_{2,1}$-norm regularizer $\frac{1}{n}\sum_{i=1}^n \|w_{T_i} - w\|_2$, which can tolerate relatively large distances between w and $\{w_o\}$ [37, 38], so that the learned prior $w^*$ stays close to the optima $\{w_s\}$ and only requires a few training data for adaptation to new inlier tasks. In contrast, it is hard to tailor MAML into a robust version due to its fixed update rule, which is less flexible for handling outlier tasks. As a result, the meta-initialization $w^*$ returned by MAML is expected to depart far from $\{w_s\}$ and thus needs more data for adaptation to new inlier tasks. Experimental results in Sec. 4.3 demonstrate the advantages of Meta-MinibatchProx over MAML in such a regime of robust meta-learning.

3.2 Stochastic Gradient Meta-Optimization

Here we propose an SGD-based meta-optimization algorithm to solve the min-min problem (3). To develop the algorithm, we first establish the following simple lemma, which shows that the gradient of the meta-loss $\phi_{D_T}(w) = \min_{w_T} L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2$ can be expressed in closed form based on the minimizer of the associated minibatch proximal update.
Lemma 1. Assume that $L_{D_T}$ is differentiable and $w^*_T$ is the unique minimizer of $L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2$. Then the gradient of the meta-loss $\phi_{D_T}(w)$ is given by $\nabla\phi_{D_T}(w) = \lambda(w - w^*_T)$.
See its proof in Appendix B.2. This lemma shows that the gradient of the meta-loss can be evaluated by solving the intra-task minibatch proximal update problem.
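Lemma 1 is easy to sanity-check numerically. The sketch below is our illustration under an assumed quadratic task loss, chosen so that the proximal step has a closed form; it compares the closed-form meta-gradient $\lambda(w - w^*_T)$ against central finite differences of $\phi_{D_T}$.

```python
import numpy as np

lam = 2.0
c = np.array([1.0, -2.0, 0.5])          # optimum of the toy task loss (assumed)

def task_loss(wT):
    # L_DT(wT) = 0.5 * ||wT - c||^2, an illustrative stand-in for the empirical risk
    return 0.5 * np.sum((wT - c) ** 2)

def prox_solution(w):
    # argmin_wT L_DT(wT) + (lam/2)*||wT - w||^2 has a closed form for this quadratic
    return (c + lam * w) / (1.0 + lam)

def phi(w):
    # meta-loss phi_DT(w) = min_wT L_DT(wT) + (lam/2)*||wT - w||^2
    wT = prox_solution(w)
    return task_loss(wT) + 0.5 * lam * np.sum((wT - w) ** 2)

w = np.array([0.3, 0.1, -1.0])
lemma_grad = lam * (w - prox_solution(w))   # Lemma 1's closed-form gradient

eps = 1e-5
fd_grad = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    fd_grad[j] = (phi(w + e) - phi(w - e)) / (2.0 * eps)
```

The two gradients agree to numerical precision, so the meta-gradient can indeed be formed from the proximal solution alone, without any Hessian.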
This differs from MAML, in which the gradient evaluation of a task-specific meta-loss relies on second-order information.
Lemma 1 lays the foundation of our SGD-based meta-optimization algorithm, outlined in Algorithm 1. At the s-th iteration, we first sample a minibatch of tasks $\{T_i\}$ of size $b_s$ and then perform the intra-task minibatch proximal update to find an approximate optimal hypothesis $w^s_{T_i}$ for each task $T_i$. To implement this step, we conventionally use the warm-start approach, namely taking the current prior hypothesis parameters $w^s$ as the initialization of SGD for solving the subproblem. According to Lemma 1, the average meta-gradient over the minibatch of tasks is $\lambda\big(w^s - \frac{1}{b_s}\sum_{i=1}^{b_s} w^{*s}_{T_i}\big)$, where $w^{*s}_{T_i} = \mathrm{argmin}_{w_{T_i}} L_{D_{T_i}}(w_{T_i}) + \frac{\lambda}{2}\|w_{T_i} - w^s\|_2^2$. Since the exact minimizer $w^{*s}_{T_i}$ is generally hard to estimate, we instead use the sub-optimal solution $w^s_{T_i}$ to approximate $w^{*s}_{T_i}$ and update the meta-parameter via minibatch SGD with learning rate $\eta_s$: $w^{s+1} = w^s - \eta_s\lambda\big(w^s - \frac{1}{b_s}\sum_{i=1}^{b_s} w^s_{T_i}\big)$. As a first-order meta-optimization method that never accesses the Hessian of the intra-task empirical risk, Algorithm 1 is expected to be more efficient in computation and memory for large-scale problems. Owing to the independence of the intra-task minibatch proximal updates, the meta-gradient evaluation step can be easily parallelized and distributed to accelerate training.
Before analyzing Algorithm 1, we first give some definitions conventionally used in machine learning.
Definition 1 (Convexity, Lipschitz continuity, and smoothness). We say a function g(w) is $\mu$-strongly convex if $\forall w_1, w_2$, $g(w_1) \ge g(w_2) + \langle\nabla g(w_2), w_1 - w_2\rangle + \frac{\mu}{2}\|w_1 - w_2\|_2^2$. If $\mu = 0$, we say g(w) is convex. Moreover, we say g(w) is G-Lipschitz continuous if $\|g(w_1) - g(w_2)\|_2 \le G\|w_1 - w_2\|_2$ with a universal constant G.
g(w) is said to be L-smooth if its gradient obeys $\|\nabla g(w_1) - \nabla g(w_2)\|_2 \le L\|w_1 - w_2\|_2$ with a universal constant L.
The following theorem summarizes our main results on the convergence of Algorithm 1.
Theorem 1. Suppose each loss $L_{D_T}(w_T)$ is differentiable and, for each task, its optimum $w^*_T = \mathrm{argmin}_{w_T} L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w\|_2^2$ satisfies $\mathbb{E}[\|w^*_T - w\|_2^2] \le \sigma^2$. Let $w^* = \mathrm{argmin}_w F(w)$ and let $w^{*s}_{T_i} = \mathrm{argmin}_{w_{T_i}} L_{D_{T_i}}(w_{T_i}) + \frac{\lambda}{2}\|w_{T_i} - w^s\|_2^2$ denote the optimal parameters for task $T_i$.
(1) Convex setting. Assume $L_{D_T}(w_T)$ is convex. Then by setting $\eta_s = \frac{2}{s\lambda}$ and $\epsilon_s = \frac{c}{S}$ with a constant c, and letting $\alpha = \frac{8S\lambda^2\sigma^2}{S-1} + c\big(1 + \frac{8}{S-1}\big)$, we have
$\mathbb{E}\big[\|w^S - w^*\|_2^2\big] \le \frac{\alpha}{\lambda^2 S}$ and $\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^n w^{*S}_{T_i} - w^S\Big\|_2^2\Big] \le \frac{L^2\alpha}{(\lambda+L)^2 S}.$
(2) Non-convex setting. Assume $L_{D_T}(w_T)$ is L-smooth. Then by setting $\lambda > L$, $\eta_s = \sqrt{\frac{\Delta}{\gamma S}}$ and $\epsilon_s = \frac{c}{\sqrt{S}}$, with $\gamma = \frac{\lambda^3 L}{\lambda+L}\big(\sigma^2 + \frac{2c\lambda^2}{(\lambda-L)^2}\big)$ and $\Delta = F(w^0) - F(w^*)$, we have
$\min_s\ \mathbb{E}\big[\|\nabla F(w^s)\|_2^2\big] = \lambda^2\min_s\ \mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^n w^{*s}_{T_i} - w^s\Big\|_2^2\Big] \le \frac{4\sqrt{\Delta\gamma}}{\sqrt{S}}.$
See Appendix B.3 for its proof. The assumptions in Theorem 1 are standard in stochastic optimization [39, 40, 41, 42].
The theorem guarantees that Algorithm 1 converges for both convex and non-convex loss functions $L_{D_T}(w_T)$. Specifically, for convex loss $L_{D_T}(w_T)$ the convergence rate of Algorithm 1 is of the order $O(\frac{1}{S})$, while for the non-convex case the rate is of the order $O(\frac{1}{\sqrt{S}})$. Besides, we further prove that the distance $\big\|\frac{1}{n}\sum_{i=1}^n w^{*S}_{T_i} - w^S\big\|_2^2$ becomes very small in expectation after sufficiently many training iterations. This means that the computed initialization $w^S$ will be very close on average to the optimal hypothesis $w^{*S}_{T_i}$ of each task $T_i$ drawn from the observed n tasks. As the n tasks are sampled from the task distribution $\mathcal{T}$, the prior hypothesis meta-regularizer $w^S$ is also expected to have a small distance to the optimal hypothesis $w^{*S}_T$ of a task T drawn from $\mathcal{T}$. Intuitively speaking, this result justifies that the meta-regularizer $w^S$ is close to the desired hypothesis of each task and thus serves as a good regularizer in the task-specific minibatch proximal update.
More generally, we can show the asymptotic convergence of Algorithm 1 whenever the learning rate obeys $\eta_s < \frac{2}{\lambda}$ and F(w) is lower bounded, namely $\inf_w F(w) > -\infty$. Specifically, Theorem 4 in Appendix A.1 guarantees that 1) the sequence $\{w^s\}$ produced by Algorithm 1 decreases the loss F(w) monotonically, and 2) every accumulation point $w^*$ of the sequence $\{w^s\}$ is a Karush–Kuhn–Tucker point of problem (3), which guarantees the convergence of the proposed algorithm. Such results still hold when the loss $L_{D_T}(w_T)$ is non-smooth, e.g. involving the hinge loss and/or $\ell_1$-norm regularization.
Prior optimization based meta-learning approaches, such as MAML [6], FOMAML [6] and Reptile [10], only provide empirical convergence results and lack provable convergence guarantees of the kind established in this work.

3.3 Statistical Justification: Benefit of Hypothesis Transfer in Meta-Learning

We further show how prior hypothesis transfer can benefit the minibatch proximal update on future tasks, which theoretically justifies the advantage of Meta-MinibatchProx for few-shot learning. Assume that we have learned an optimal prior hypothesis $w^* = \mathrm{argmin}_w F(w)$. For the discussion here, we view $w^*$ as a deterministic hypothesis, because the uncertainty associated with $w^*$ plays no role in the following analysis. Let $T\sim\mathcal{T}$ be any future task from which K samples $D_T = \{(x_i, y_i)\}_{i=1}^K$ are randomly sampled. The minibatch proximal update on T is then given by $w^*_T = \mathrm{argmin}_{w_T}\{L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w^*\|_2^2\}$. Theorem 2 below shows the impact of the prior hypothesis $w^*$ on reducing the excess risk of $w^*_T$ when the former is sufficiently close in expectation to the optimal population solution. See its proof in Appendix C.2.
Theorem 2. Suppose $\ell(f(w, x), y)$ is G-Lipschitz continuous, L-smooth and convex w.r.t. w. For any $T\sim\mathcal{T}$ and $D_T = \{(x_i, y_i)\}_{i=1}^K \sim T$, we respectively let $w^*_{T,E} \in \mathrm{argmin}_{w_T}\{L(w_T) := \mathbb{E}_{(x,y)\sim T}[\ell(f(w_T, x), y)]\}$ and $w^*_T = \mathrm{argmin}_{w_T} L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w^*\|_2^2$. Then we have
$\mathbb{E}_{T\sim\mathcal{T}}\mathbb{E}_{D_T\sim T}\big[L(w^*_T) - L(w^*_{T,E})\big] \le \frac{4G^2}{\lambda K} + \frac{\lambda}{2}\,\mathbb{E}_{T\sim\mathcal{T}}\big[\|w^* - w^*_{T,E}\|_2^2\big].$
Theorem 2 shows that for a convex loss $\ell(f(w, x), y)$, the excess risk of the hypothesis $w^*_T$ output by the minibatch proximal update on task T is determined by two factors: the training sample number K of each task $T\sim\mathcal{T}$, and the expected distance $\mathbb{E}_{T\sim\mathcal{T}}[\|w^* - w^*_{T,E}\|_2^2]$ between the meta-regularizer $w^*$ provided by Meta-MinibatchProx and the optimal population hypothesis $w^*_{T,E}$ of task T. Specifically, as K increases, the first term in the upper bound becomes smaller. Moreover, the closer $w^*$ is to $w^*_{T,E}$, the closer the updated hypothesis $w^*_T$ approaches $w^*_{T,E}$, and thus the better its generalization performance on a new task drawn from the task distribution $\mathcal{T}$ in expectation. Indeed, by choosing a proper value of $\lambda$, we can balance the two terms in the above excess risk bound. For instance, by letting $\lambda = \sqrt{8G^2/\big(K\,\mathbb{E}_{T\sim\mathcal{T}}[\|w^* - w^*_{T,E}\|_2^2]\big)}$, the expected excess risk $\mathbb{E}_{T\sim\mathcal{T}}\mathbb{E}_{D_T}[L(w^*_T) - L(w^*_{T,E})]$ is of the order $O\big(\frac{1}{\sqrt{K}}\sqrt{\mathbb{E}_{T\sim\mathcal{T}}[\|w^* - w^*_{T,E}\|_2^2]}\big)$. These results justify the benefit of hypothesis transfer in Meta-MinibatchProx. For a non-convex loss $\ell(f(w, x), y)$, Theorem 5 in Appendix A.2 also provides an excess risk analysis, in which the training sample number K and the distance $\mathbb{E}_{T\sim\mathcal{T}}[\|w^* - w^*_{T,E}\|_2^2]$ play roles very similar to those in Theorem 2.
For non-convex loss, we have an additional result on first-order optimality, formally stated in Theorem 3. See its proof in Appendix C.3.
Theorem 3.
Suppose $\ell(f(w, x), y)$ is G-Lipschitz continuous and L-smooth w.r.t. w. For any $T\sim\mathcal{T}$ and $D_T = \{(x_i, y_i)\}_{i=1}^K \sim T$, we let $w^*_T = \mathrm{argmin}_{w_T} L_{D_T}(w_T) + \frac{\lambda}{2}\|w_T - w^*\|_2^2$ and $w^*_{T,E} \in \mathrm{argmin}_{w_T}\{L(w_T) := \mathbb{E}_{(x,y)\sim T}[\ell(f(w_T, x), y)]\}$, respectively. Then for $\lambda > L$, it holds that
$\mathbb{E}_{T\sim\mathcal{T}}\Big[\big\|\mathbb{E}_{D_T\sim T}[\nabla L(w^*_T)]\big\|_2^2\Big] \le \frac{32G^2L^2}{(\lambda-L)^2K^2} + \frac{8G^2}{(\lambda-L)\beta K} + \frac{2}{\beta}\,\mathbb{E}_{T\sim\mathcal{T}}\big[L(w^*) - L(w^*_T)\big],$
where $\beta = \frac{1}{\lambda}\big[1 - \frac{L}{2\lambda}\big]$.

Figure 1: Comparison on the few-shot regression problem. (a) Visual illustration: predictions of the sine wave function when fine-tuning a meta model on five samples. (b) MSE: mean squared errors between the predicted sine function and the ground truth on 200 testing tasks.

Theorem 3 reveals that the training sample number K and the distance between the expected loss L(w) at the prior hypothesis meta-regularizer $w^*$ (e.g., learnt by Meta-MinibatchProx) and at the optimal hypothesis parameter $w^*_T$ of task T are both critical for obtaining a first-order optimal hypothesis parameter $w^*_T$ on a task $T\sim\mathcal{T}$. The roles of these two factors in Theorem 3 are consistent with those in Theorem 2. Specifically, the more training samples we have, the smaller the gradient at the biased hypothesis $w^*_T$, which means $w^*_T$ is closer to a stationary hypothesis $w^*_{T,E}$ of the expected risk $L(w_T) = \mathbb{E}_{(x,y)\sim T}[\ell(f(w_T, x), y)]$.
Moreover, if the provided initialization w\u2217\nT,E, then their corresponding losses on the task T \u223c T\nis close enough to the optimal hypothesis w\u2217\nshould also be close, which in turn implies good \ufb01rst-order optimum hypothesis parameter w\u2217\nT .\n\nT is close to a stationary hypothesis w\u2217\n\nT , which means w\u2217\n\n4 Experiments\n\nWe present in this section the performance evaluation of our Meta-MinibatchProx method on bench-\nmark few-shot regression and classi\ufb01cation tasks with comparison against several representative\nstate-of-the-art meta-learning approaches. The code is available at https://panzhous.github.io.\n\n4.1 Results on Regression Tasks\n\nExperimental setting. Following [10], here we consider a synthetic one-dimensional sine wave\nregression problem. The target function is y(x) = a sin(x + b) where the amplitude a and the\nphase b are respectively uniformly sampled from the intervals [0.1, 5.0] and [0, 2\u03c0]. Then for each\ntraining task, with \ufb01xed a and b the learner samples p points x1,\u00b7\u00b7\u00b7 , xp uniformly drawn from the\nintervals [\u22125.0, 5.0] to \ufb01t the whole function y(x). As shown in [10], this problem is instructive,\nsince joint training cannot learn a useful initialization as the average function E[y(x)] = 0 due to the\nrandom phase, while meta learning approaches can work well. After learning an initialization, in\nthe testing phase, we randomly sample an amplitude a and a phase b as aforementioned to produce a\nnew task. Then we randomly sample K data points from [\u22125.0, 5.0] for training and use 200 testing\nsamples evenly distributed on [\u22125.0, 5.0] to compute the mean squared errors (MSE) between the\nprediction and the ground truth. We repeat this procedure 200 times and report the average of MSE.\nFollowing [10], in the experiments, we set p = 50 for training, and respectively set K = 5 and\nK = 10 for testing. 
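As a concrete reference for this sampling protocol, here is a minimal sketch of the task and episode sampler (our own illustrative code with our own function names, not the released implementation):

```python
import numpy as np

def sample_sine_task(rng):
    """Sample a task y(x) = a*sin(x + b), with a ~ U[0.1, 5.0] and b ~ U[0, 2*pi]."""
    a = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, 2.0 * np.pi)
    return lambda x: a * np.sin(x + b)

def make_episode(task, K, rng):
    """K training inputs drawn uniformly from [-5, 5]; 200 evenly spaced test inputs."""
    x_train = rng.uniform(-5.0, 5.0, size=K)
    x_test = np.linspace(-5.0, 5.0, 200)
    return (x_train, task(x_train)), (x_test, task(x_test))

rng = np.random.default_rng(0)
task = sample_sine_task(rng)
(x_tr, y_tr), (x_te, y_te) = make_episode(task, K=5, rng=rng)
# for a fitted predictor f, the reported MSE would be np.mean((f(x_te) - y_te) ** 2)
```

Averaging this MSE over 200 such test episodes reproduces the evaluation statistic described above.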
For the regression network, we adopt a small network with two hidden layers of size 40 and Tanh nonlinearities. Here we use the Tanh function instead of ReLU because Tanh gives slightly better performance for all the considered approaches. For our Meta-MinibatchProx, we set λ = 0.5 and use SGD to solve the inner subproblem with 15 iterations and learning rate 0.02. For the learning rate η_s in Meta-MinibatchProx, we decrease it at each iteration as η_s = α(1 − s/S), where the total iteration number S in Algorithm 1 and α are set to 30,000 and 0.8, respectively.

Results. From the curves in Fig. 1 (a) we can observe that after training, all the compared meta-learning approaches can infer the amplitude and phase well, and thus can predict the entire sine function, although they only see five data points which all lie in half of the input range. See Fig. 3 in Appendix D for more visualization results. These results demonstrate that the considered approaches can learn a good prior hypothesis and thus can quickly adapt to a new task with only a few training samples. Compared with the others, Meta-MinibatchProx better fits the underlying function. Starting from the prior hypothesis, MAML and its variants run several gradient descent steps to find task-specific optimal hypotheses and then use them to update the prior hypothesis. In contrast, through the minibatch proximal update, Meta-MinibatchProx is able to make use of higher-order information of the empirical risk, instead of only the first-order information as in MAML, to guide the search for task-specific optimal hypotheses around the learned prior, which may lead to better task-specific hypotheses and thus a better prior hypothesis. We can also see performance degradation for the first-order variants of MAML (FOMAML and Reptile), which could be attributed to the information loss caused by the gradient approximation in these variants. In Fig. 1 (b), we report the average MSE over 200 independent experiments to measure the overall prediction performance with K = 5 and K = 10 training points. [Fig. 1 (b) values, MSE with K = 5 / K = 10: MAML 0.208/0.033, FOMAML 0.322/0.055, Reptile 0.247/0.045, Meta-MinibatchProx 0.149/0.027.] These numerical results confirm that Meta-MinibatchProx achieves the best prediction performance, which is consistent with the visualization results in Fig. 1 (a).

Table 1: Few-shot classification accuracy (%) of the compared approaches on the miniImageNet dataset. The reported accuracies are with 95% confidence intervals.

method | 1-shot 5-way | 5-shot 5-way | 1-shot 20-way | 5-shot 20-way
Matching Net [8] | 43.56 ± 0.84 | 55.31 ± 0.73 | 17.31 ± 0.22 | 22.69 ± 0.20
Meta-LSTM [5] | 43.33 ± 0.77 | 60.60 ± 0.71 | 16.70 ± 0.23 | 26.06 ± 0.25
MAML [6] | 46.21 ± 1.76 | 61.12 ± 1.01 | 16.01 ± 0.52 | 18.34 ± 0.33
FOMAML [6] | 45.53 ± 1.58 | 61.02 ± 1.12 | 15.21 ± 0.54 | 17.67 ± 0.47
Reptile [10] | 47.07 ± 0.26 | 62.74 ± 0.37 | 18.27 ± 0.16 | 28.71 ± 0.19
Meta-MinibatchProx | 48.51 ± 0.92 | 64.15 ± 0.92 | 20.50 ± 0.35 | 33.61 ± 0.41
MAML + Transduction [6] | 48.70 ± 1.84 | 63.11 ± 0.92 | 16.49 ± 0.58 | 19.29 ± 0.29
FOMAML + Transduction [6] | 48.07 ± 1.75 | 63.15 ± 0.91 | 15.80 ± 0.61 | 18.15 ± 0.43
Reptile + Transduction [10] | 49.97 ± 0.32 | 65.99 ± 0.58 | 18.76 ± 0.17 | 29.15 ± 0.22
Meta-MinibatchProx + Transduction | 50.77 ± 0.90 | 67.43 ± 0.89 | 21.17 ± 0.38 | 34.30 ± 0.41

4.2 Results on Classification Tasks

Datasets.
In this experiment we compare our method with several state-of-the-art approaches for few-shot classification on two benchmark datasets, miniImageNet [5] and tieredImageNet [43]. miniImageNet consists of 100 classes from ImageNet [44], and each class contains 600 images of size 84 × 84 × 3. Following [6, 10], we use the split proposed in [5], which consists of 64 classes for training, 16 classes for validation and the remaining 20 classes for testing. The tieredImageNet dataset contains 608 classes from the ILSVRC-12 dataset [45], and each image is scaled to 84 × 84 × 3. Moreover, tieredImageNet groups classes into broader categories corresponding to higher-level nodes in the ImageNet hierarchy [46]. Specifically, its top hierarchy has 34 categories, which are further split into 20 training categories (351 classes), 6 validation categories (97 classes) and 8 test categories (160 classes). Such a hierarchical structure ensures that all of the training classes are sufficiently distinct from the testing classes, providing a more realistic few-shot learning scenario.

Experimental setting. Following [6, 10, 16], for the K-shot N-way few-shot learning task we adopt the episodic training procedure. More concretely, we randomly sample N classes from the training classes of a dataset, and then for each class we randomly draw K + 1 instances: the first K instances are for training and the remaining one is for testing. For fairness, like [6, 10], we use a convolutional network with 4 modules, each consisting of 3 × 3 convolutions, followed by batch normalization, 2 × 2 max-pooling and a ReLU activation layer. Each convolution module has 32 filters. We use the same network architecture for both datasets. In Meta-MinibatchProx, the regularization constant λ is set to 0.1 for the 5-way problems on miniImageNet, and to 10 for all the remaining experiments.
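The episode construction just described can be sketched as follows (a minimal illustration with our own function and variable names; string ids stand in for actual images):

```python
import random

def sample_episode(class_to_images, N, K, rng):
    """Build an N-way K-shot episode: for each of N sampled classes, draw
    K + 1 instances; K go to the support (training) set and the remaining
    one to the query (test) set."""
    classes = rng.sample(sorted(class_to_images), N)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = rng.sample(class_to_images[cls], K + 1)
        support += [(img, episode_label) for img in picks[:K]]
        query.append((picks[K], episode_label))
    return support, query

# toy data shaped like miniImageNet: 100 classes with 600 image ids each
data = {c: [f"img_{c}_{i}" for i in range(600)] for c in range(100)}
rng = random.Random(0)
support, query = sample_episode(data, N=5, K=1, rng=rng)
```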
The robustness of Meta-MinibatchProx to the choice of λ is shown in Fig. 2 in Appendix D. We use Adam [47] to solve the inner subproblem, with learning rate 1e−3 for both datasets. The number of Adam steps in the inner loop is set to 8 for the 5-way problems on miniImageNet and 16 for all the remaining settings, which is sufficient to compute a good approximate solution for each task given the small number of training samples per task. For the learning rate η_s in Meta-MinibatchProx, as in the regression task, we decrease it as η_s = α(1 − s/S) with S = 100,000, where α is set to 0.1 for the 20-way problems on miniImageNet and to 1 for the remaining settings. We test Meta-MinibatchProx on 2,000 episodes and report the average result with 95% confidence intervals. Like [6, 10], we evaluate the compared methods under both transduction and non-transduction settings. For transduction, information is shared among the test samples via batch normalization, while in the non-transduction setting, batch normalization statistics are collected from all training samples and a single test sample.

Table 2: Few-shot classification accuracy (%) of the compared approaches on the tieredImageNet dataset. The reported accuracies are with 95% confidence intervals.

method | 1-shot 5-way | 5-shot 5-way | 1-shot 10-way | 5-shot 10-way
Matching Net [8] | 34.95 ± 0.89 | 43.95 ± 0.85 | 22.46 ± 0.34 | 31.19 ± 0.30
Meta-LSTM [5] | 33.71 ± 0.76 | 46.56 ± 0.79 | 22.09 ± 0.43 | 35.65 ± 0.39
MAML [6] | 49.60 ± 1.83 | 66.58 ± 1.78 | 33.18 ± 1.23 | 49.05 ± 1.32
FOMAML [6] | 48.01 ± 1.74 | 64.07 ± 1.72 | 30.31 ± 1.12 | 46.54 ± 1.24
Reptile [10] | 49.12 ± 0.43 | 65.99 ± 0.42 | 31.79 ± 0.28 | 47.82 ± 0.30
Meta-MinibatchProx | 50.14 ± 0.92 | 68.30 ± 0.91 | 33.68 ± 0.64 | 51.84 ± 0.65
MAML + Transduction [6] | 51.67 ± 1.81 | 70.30 ± 1.75 | 34.44 ± 1.19 | 53.32 ± 1.33
FOMAML + Transduction [6] | 50.12 ± 1.82 | 67.43 ± 1.80 | 31.53 ± 1.08 | 49.99 ± 1.36
Reptile + Transduction [10] | 51.06 ± 0.45 | 69.94 ± 0.42 | 33.79 ± 0.29 | 51.27 ± 0.31
Meta-MinibatchProx + Transduction | 54.37 ± 0.93 | 71.45 ± 0.94 | 35.56 ± 0.60 | 54.50 ± 0.71

Results. We report the classification accuracy on miniImageNet and tieredImageNet in Tables 1 and 2, respectively. From these results, one can observe that Meta-MinibatchProx consistently outperforms the existing optimization-based methods, including MAML, FOMAML, Reptile and Meta-LSTM, as well as the metric-based approach Matching Net. Specifically, on miniImageNet, Meta-MinibatchProx brings about 1.44%, 1.41%, 2.23% and 4.90% improvements on the four testing cases (from left to right) under the non-transduction setting, and about 0.80%, 1.44%, 2.41% and 5.25% improvements under the transduction setting. Similarly, on tieredImageNet, Meta-MinibatchProx improves by about 1.39% on average over the four testing cases in the non-transduction setting, and by about 1.54% on average when using the transduction technique. These results demonstrate the advantages of Meta-MinibatchProx; the potential reasons behind them have been discussed in Sec. 4.1.
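To make the contrast with MAML-style updates concrete, here is a toy sketch of the minibatch proximal update on a least-squares model. The closed-form meta-gradient λ(w − w*_T) for the squared-norm regularizer follows the Moreau-envelope view used in the paper (cf. Lemma 1); the toy loss, step sizes, and task construction are our illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def inner_proximal_update(w_prior, X, y, lam, lr=0.02, steps=50):
    """Approximately solve w_T = argmin_w L_T(w) + (lam/2)*||w - w_prior||^2,
    with L_T(w) = ||Xw - y||^2 / (2K), via plain gradient steps."""
    K = len(y)
    w = w_prior.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / K + lam * (w - w_prior)
        w -= lr * grad
    return w

def meta_step(w_prior, tasks, lam, eta=0.5):
    """One outer step on the prior: average the meta-gradients lam*(w_prior - w_T)."""
    g = np.zeros_like(w_prior)
    for X, y in tasks:
        w_T = inner_proximal_update(w_prior, X, y, lam)
        g += lam * (w_prior - w_T)
    return w_prior - eta * g / len(tasks)

rng = np.random.default_rng(0)
w_shared = np.array([1.0, -2.0])  # tasks cluster around a shared solution
tasks = []
for _ in range(8):
    X = rng.normal(size=(5, 2))
    y = X @ (w_shared + 0.1 * rng.normal(size=2))
    tasks.append((X, y))

w = np.zeros(2)
for _ in range(300):
    w = meta_step(w, tasks, lam=0.5)
# w drifts toward the shared solution the tasks were generated around
```

Note that only gradients of the inner loss are ever evaluated; no second-order derivatives are needed, which is the point of contrast with MAML made above.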
Besides, by comparing the results of MAML with its first-order variants (FOMAML and Reptile) on tieredImageNet, we can also observe the generalization performance degeneration of the first-order variants. FOMAML directly ignores the second-order derivative, which leads to about 2% degeneration in most cases. Reptile approximates the gradient estimation in MAML, which also causes information loss and hence suffers from performance degeneration. In contrast, our model can be efficiently optimized by accessing only first-order information of the loss functions, without any model approximation. The observed outstanding generalization performance of Meta-MinibatchProx also confirms our theory in Sec. 3.3.

4.3 Results on Outlier-Corrupted Tasks

We further test a noisy case with the presence of outlier-tasks as described in Sec. 3.1. To do so, we add 5% outlier images with zero pixels into each training class in miniImageNet. If a sampled task T contains these outlier images, then it forms an outlier-task. For training, similarly to Lemma 1, we can compute the gradient of the meta-loss φ_{D_T}(w) as ∇φ_{D_T}(w) = λ(w − w*_T) / (2‖w − w*_T‖₂), where w*_T = argmin_{w_T} L_{D_T}(w_T) + (λ/2)‖w_T − w‖₂. However, since ‖w_T − w‖₂ is usually very small in practice, which makes the algorithm numerically unstable, we choose to approximate this quantity by log(1 + ‖w_T − w‖²), with meta-gradient ∇φ_{D_T}(w) = λ(w − w*_T) / (1 + ‖w − w*_T‖²). The same experimental protocol as in Sec. 4.2 is used for evaluation and the results are shown in Fig. 2.

Figure 2: Evaluation with outlier-tasks. Classification accuracy (%) of MAML vs. Meta-MinibatchProx + ℓ21: 46.21 vs. 47.03 (miniImageNet + non-transduction), 48.70 vs. 49.51 (miniImageNet + transduction), 36.84 vs. 47.32 (miniImageNet + outlier + non-transduction), 37.93 vs. 49.98 (miniImageNet + outlier + transduction).

From this group of results we can observe that Meta-MinibatchProx with the ℓ21-norm regularization achieves substantially better performance than MAML in the considered outlier-corrupted setting, which confirms the flexibility of Meta-MinibatchProx in handling noisy meta-learning problems.

5 Conclusion

In this work, we propose Meta-MinibatchProx, a minibatch proximal update based method for learning efficient hypothesis transfer. The proposed approach seeks to learn from a set of training tasks a prior hypothesis such that minibatch risk minimization biased-regularized by this prior quickly converges to the optimal hypothesis of each training task. For meta-optimization, we develop a scalable stochastic gradient descent algorithm with provable convergence guarantees for a wide range of convex and non-convex learning problems. Theoretically, we justify the benefit of hypothesis transfer to future learning with a few training samples. Extensive experimental results on benchmark datasets demonstrate the superiority of Meta-MinibatchProx over the state-of-the-art meta-learning methods.

Acknowledgements
Xiao-Tong Yuan was supported by National Major Project of China for New Generation of AI (No. 2018AAA0100400) and Natural Science Foundation of China (NSFC) under Grant 61876090 and Grant 61936005. Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133, MOE Tier-II R-263-000-D17-112 and AI.SG R-263-000-D97-490.

References
[1] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook.
PhD thesis, Technische Universität München, 1987.
[2] D. Naik and R. Mammone. Meta-neural networks that learn by learning. In IJCNN, pages 437–442, 1992.
[3] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN, 1990.
[4] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
[5] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Int'l Conf. Learning Representations, 2017.
[6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. Int'l Conf. Machine Learning, pages 1126–1135, 2017.
[7] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In Proc. Int'l Conf. Machine Learning, pages 1842–1850, 2016.
[8] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In Proc. Conf. Neural Information Processing Systems, pages 3630–3638, 2016.
[9] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. In Proc. Conf. Neural Information Processing Systems, 2017.
[10] A. Nichol and J. Schulman. Reptile: A scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
[11] Y. Duan, J. Schulman, X. Chen, P. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[12] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.
[13] F. Sung, L. Zhang, T. Xiang, T. Hospedales, and Y. Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.
[14] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition.
In ICML Deep Learning Workshop, volume 2, 2015.
[15] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. Torr, and T. Hospedales. Learning to compare: Relation network for few-shot learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
[16] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Proc. Conf. Neural Information Processing Systems, pages 4077–4087, 2017.
[17] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[18] T. Munkhdalai and H. Yu. Meta networks. In Proc. Int'l Conf. Machine Learning, pages 2554–2563, 2017.
[19] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] M. Khodak, M. Balcan, and A. Talwalkar. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019.
[22] G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Proc. Int'l Conf. Machine Learning, pages 1566–1575, 2019.
[23] C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online meta-learning. In Proc. Int'l Conf. Machine Learning, 2019.
[24] A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In Proc. Int'l Conf. Machine Learning, pages 991–999, 2014.
[25] P. Alquier and M. Pontil. Regret bounds for lifelong learning. In Artificial Intelligence and Statistics, pages 261–269, 2017.
[26] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. In Conf. Uncertainty in Artificial Intelligence, volume 34, pages 457–466, 2018.
[27] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean.
In Proc. Conf. Neural Information Processing Systems, pages 10169–10179, 2018.
[28] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. J. of Machine Learning Research, 7(Mar):551–585, 2006.
[29] M. Li, T. Zhang, Y. Chen, and A. Smola. Efficient mini-batch training for stochastic optimization. In Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining, pages 661–670, 2014.
[30] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conf. on Learning Theory, pages 1882–1919, 2017.
[31] F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
[32] F. Orabona, C. Castellini, B. Caputo, A. Fiorilla, and G. Sandini. Model adaptation with least-squares SVM for adaptive hand prosthetics. In Int'l Conf. Robotics and Automation, pages 2897–2903. IEEE, 2009.
[33] X. Wang and J. Schneider. Flexible transfer learning under support and model shift. In Proc. Conf. Neural Information Processing Systems, pages 1898–1906, 2014.
[34] I. Kuzborskij and F. Orabona. Stability and hypothesis transfer learning. In Proc. Int'l Conf. Machine Learning, pages 942–950, 2013.
[35] I. Kuzborskij and F. Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.
[36] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
[37] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection, volume 589. John Wiley & Sons, 2005.
[38] P. Zhou and J. Feng. Outlier-robust tensor PCA. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–9, 2017.
[39] S. Ghadimi and G. Lan.
Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[40] P. Zhou, X. Yuan, and J. Feng. Efficient stochastic gradient hard thresholding. In Proc. Conf. Neural Information Processing Systems, 2018.
[41] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proc. Int'l Conf. Machine Learning, pages 314–323, 2016.
[42] P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity. In Proc. Conf. Neural Information Processing Systems, 2018.
[43] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. Tenenbaum, H. Larochelle, and R. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
[44] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Conf. Neural Information Processing Systems, pages 1097–1105, 2012.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein. ImageNet large scale visual recognition challenge. Int'l J. Computer Vision, 115(3):211–252, 2015.
[46] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 248–255, 2009.
[47] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int'l Conf.
Learning Representations, 2014.